64 Commits

Author SHA1 Message Date
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `spans_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
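The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of the `spans_{spans_key}` prefix and of how a `[training.score_weights]` block might look once translated to a dict; the helper names `span_score_keys` and `build_score_weights` and the weight values are assumptions, not part of spaCy.

```python
def span_score_keys(spans_key="sc"):
    """Build the precision/recall/F-score names for one spans key."""
    prefix = f"spans_{spans_key}"
    return [f"{prefix}_p", f"{prefix}_r", f"{prefix}_f"]


def build_score_weights(custom_key):
    """Hypothetical helper mirroring a [training.score_weights] block."""
    # None stands in for `null`: the default `sc` scores are suppressed.
    weights = {name: None for name in span_score_keys()}
    p, r, f = span_score_keys(custom_key)
    # Illustrative weights only: rank checkpoints by the custom key's F-score.
    weights.update({p: 0.0, r: 0.0, f: 1.0})
    return weights


print(span_score_keys())
print(build_score_weights("my_spans"))
```

A key containing an underscore, such as `my_spans`, yields `spans_my_spans_f`, which is why the prefix cannot be recovered by naively splitting the score name on `_`.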

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
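The three edge cases listed above can be illustrated with a minimal pure-Python sketch. It is not the spaCy implementation (which operates on arrays); the function name `ngram_spans` and its list-of-tuples output are assumptions chosen for readability.

```python
def ngram_spans(doc_lengths, sizes):
    """Return, per doc, every (start, end) token offset pair for the given n-gram sizes."""
    batch = []
    for length in doc_lengths:
        spans = []
        for size in sizes:
            # `length - size + 1` start positions: this includes the final
            # n-gram (ending at `length`) and gives exactly one span when the
            # doc is as long as the n-gram. Shorter docs give no spans.
            for start in range(max(length - size + 1, 0)):
                spans.append((start, start + size))
        batch.append(spans)
    return batch


print(ngram_spans([3], [1, 2, 3]))
print(ngram_spans([1, 0], [2]))
```

The second call shows a batch in which no doc produces any n-gram; the result still has one (empty) entry per doc, so downstream code can rely on the batch shape.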

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
Adriane Boyd
994bed2fe2
Update dependencies (#8409)
* Require `thinc>=8.0.5`
* Use `spacy-lookups-data>=1.0.2`
2021-06-16 19:50:28 +02:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoiding shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Adriane Boyd
6d2789452e
Restrict cython to <3.0 (#8337) 2021-06-10 11:03:30 +02:00
Sofie Van Landeghem
cfad7e21d5
fix config parsing of ints/strings (#7755)
* add few failing tests for parsing integers and strings

* bump thinc to 8.0.3
2021-04-22 18:09:13 +10:00
Adriane Boyd
53a3b967ac
Update thinc pin and set version to v3.0.5 (#7389) 2021-03-10 11:10:53 +01:00
Matthew Honnibal
c54c300680 Use thinc v8.0.0 2021-01-21 23:51:35 +11:00
Adriane Boyd
2df1ab8a1f Remove detailed numpy constraints from pyproject.toml 2020-12-17 08:54:20 +01:00
Ines Montani
e99cd82367 Update version pins 2020-12-17 10:21:08 +11:00
Ines Montani
05a2812ae0 Merge branch 'develop' into pr/6444 2020-12-09 11:04:03 +11:00
Sofie Van Landeghem
2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Adriane Boyd
724831b066 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Adriane Boyd
cdca44ac11
Dynamically include numpy headers (#6418)
* Dynamically include numpy headers

* Add `build-constraints.txt` with numpy version pins for building wheels with `pip` and `wheelwright`
* Update `setup.py` to add current numpy include directory
* Assume `cython` and `numpy` are installed for `setup.py`
* Remove included numpy headers

* Fix typo in requirements.txt

* Use script in CI
2020-11-23 11:15:11 +01:00
Adriane Boyd
13f0676f04
Updates for python 3.9 (#6338)
* Update blis and thinc version ranges

* Update thinc version range

* Update setup.cfg for python 3.9

* Adjust blis and thinc ranges
* Add python 3.9 classifier

* Update CI for python 3.9

* Add --prefer-binary to CI sdist install

* Update CI python 3.7 mac image

* Add --prefer-binary to Travis CI

* Update install instructions in README

* Specify blis versions separately for < / >= 3.6

* Update --prefer-binary in README

* Test cleaner sdist install

* Also upgrade pip

(This is kind of unnecessary given --prefer-binary but may avoid other
issues related to sdist installs in the future.)

* Compile with -j 2

* Remove wheel from setup_requires

* Update to have separate CI uninstall step

* Remove wheel from pyproject.toml

* Recommend upgrading setuptools in addition to pip
2020-11-23 09:45:18 +01:00
Adriane Boyd
1c4df8fd09
Replace pytokenizations with internal alignment (#6293)
* Replace pytokenizations with internal alignment

Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.

* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`
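To make the restriction concrete, here is a rough sketch of an alignment that only tolerates whitespace and capitalization differences. It is an assumption-laden illustration, not the `get_alignments` code from this commit: the name `align_tokens`, the character-mapping approach, and the return shape are all hypothetical.

```python
def _norm(token):
    """Drop whitespace and lowercase, so only other differences remain."""
    return "".join(token.split()).lower()


def _char_to_token(tokens):
    """Map every normalized character position to the index of its token."""
    mapping = []
    for i, token in enumerate(tokens):
        mapping.extend([i] * len(_norm(token)))
    return mapping


def align_tokens(tokens_a, tokens_b):
    """For each token in A, list the indices of the B tokens it overlaps."""
    if "".join(map(_norm, tokens_a)) != "".join(map(_norm, tokens_b)):
        raise ValueError("texts differ by more than whitespace and capitalization")
    a2b = [[] for _ in tokens_a]
    for a_idx, b_idx in zip(_char_to_token(tokens_a), _char_to_token(tokens_b)):
        if not a2b[a_idx] or a2b[a_idx][-1] != b_idx:
            a2b[a_idx].append(b_idx)
    return a2b


print(align_tokens(["New", "York"], ["new", "yo", "rk"]))
```

Whitespace-only tokens contribute no characters and therefore map to an empty list in this sketch.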

* Refactor trailing whitespace handling

* Remove unnecessary exception for empty docs

Allow a non-empty whitespace-only doc to be aligned with an empty doc

* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
Ines Montani
2e8dcba379 Update version pins 2020-10-14 14:59:09 +02:00
Ines Montani
74972744e5 Update Thinc 2020-10-10 19:08:57 +02:00
Sofie Van Landeghem
f4f49f5877
update blis (#6198)
* allow higher blis version

* fix typo

* bump to 3.0.0a34

* fix pins in other files
2020-10-05 14:58:56 +02:00
Ines Montani
01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Ines Montani
7d04ba20c0 Update Thinc 2020-09-30 00:05:17 +02:00
Ines Montani
c0c842ae5b Update Thinc version 2020-09-27 23:24:40 +02:00
Ines Montani
7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Ines Montani
ca3c997062 Improve CLI config validation with latest Thinc 2020-09-26 13:13:57 +02:00
Sofie Van Landeghem
009ba14aaf
Fix pretraining in train script (#6143)
* update pretraining API in train CLI

* bump thinc to 8.0.0a35

* bump to 3.0.0a26

* doc fixes

* small doc fix
2020-09-25 15:47:10 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
svlandeg
0dc914b667 bump thinc to 8.0.0a33 2020-09-16 16:42:58 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Ines Montani
e12b03358b
Support removing extra values in fill-config (#5966)
* Support removing extra values in fill-config

* Fix test
2020-08-24 22:53:47 +02:00
Matthew Honnibal
463f1c8623 Avoid requiring smart-open directly 2020-08-24 14:49:17 +02:00
Matthew Honnibal
e559867605
Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Update util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
Ines Montani
6ad59d59fe Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip] 2020-08-20 11:20:58 +02:00
Ines Montani
daba316930 Update Thinc version 2020-08-14 18:39:51 +02:00
Ines Montani
67cc39af7f Update Thinc and include section order 2020-08-14 14:06:22 +02:00
Ines Montani
88b0a96801 Update for new Thinc and adjust config 2020-08-13 17:38:30 +02:00
Ines Montani
955d7b1b6b Update to latest Thinc 2020-08-07 14:41:35 +02:00
Ines Montani
ab5ef37abb Update to latest Thinc 2020-08-05 15:00:49 +02:00
svlandeg
5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
Matthew Honnibal
520d25cb50
Add smart_open dependency to fetch project assets (#5812)
* Use smart_open for project assets

* Fix assets.py

* Update pyproject.toml
2020-07-26 12:15:00 +02:00
Ines Montani
43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
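The separation of component blocks from component order can be illustrated as follows. This is a simplified sketch that uses the standard-library `configparser` and `json` modules rather than Thinc's config system; the section is named `[components]` (the name it receives later in this same commit), and the config contents are made up for the example.

```python
import configparser
import json

CONFIG = """
[nlp]
lang = "en"
pipeline = ["tagger", "parser"]

[components]

[components.ner]
factory = "ner"

[components.parser]
factory = "parser"

[components.tagger]
factory = "tagger"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# The order comes only from the [nlp] pipeline list, never from block order.
order = json.loads(parser["nlp"]["pipeline"])

# Collect every defined component block, keyed by component name.
blocks = {
    section.split(".", 1)[1]: dict(parser[section])
    for section in parser.sections()
    if section.startswith("components.")
}

# Only the listed components are used; "ner" is defined but left out.
active = [(name, json.loads(blocks[name]["factory"])) for name in order]
print(active)
```

Sorting the blocks alphabetically, as done here, leaves the pipeline order untouched, and the config may define more components than the pipeline list actually uses.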

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
Sofie Van Landeghem
c9da9605f7
Test suite clean up (#5781)
* step_through tests: skip instead of xfail

* test_empty_doc should be fixed with new Thinc version

* remove outdated test (there are other misaligned tests now)

* xfail reason

* fix test according to French exceptions

* clarified some skipped tests

* skip Ukrainian test instead of xfail

* skip instead of xfail

* skip + reason instead of xfail

* removed obsolete tests referring to removed "set_frozen" functionality

* fix test 999

* remove unused AlignmentError

* remove xfail where possible, skip otherwise

* increment thinc release for empty_doc test
2020-07-20 14:49:54 +02:00
Ines Montani
70abcca60e Update Thinc pin 2020-07-11 17:02:54 +02:00
Ines Montani
9e48ea48a1 Update Thinc pin 2020-07-10 23:34:57 +02:00
Ines Montani
a3667394b4 Integrate with latest Thinc and config overrides 2020-07-10 19:47:05 +02:00
Matthew Honnibal
cc477be952
Improve gold-standard alignment (#5711)
* Remove previous alignment

* Implement better alignment, using ragged data structure

* Use pytokenizations for alignment

* Fixes

* Fixes

* Fix overlapping entities in alignment

* Fix align split_sents

* Update test

* Commit align.py

* Try to appease setuptools

* Fix flake8

* use realistic entities for testing

* Update tests for better alignment

* Improve alignment heuristic

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-07-06 17:39:31 +02:00
Matthew Honnibal
2bd1bf81f1
Refactor pretrain and support character-based objective for v3 (#5706)
* Start adding character-based stuff

* Start adding character-based objective

* Start adding character-based stuff

* Start adding character-based objective

* Remove outdated comment

* Update pretraining models

* Add/fix character-based multi-task models

* Refactor pretrain and support character-based objective

* Update pretrain config

* Remove unused

* Fix flake8 errors

* Clean up imports

* Format

* Format

* Update Thinc version

* Raise error if vectors objective but no vectors
2020-07-03 17:57:28 +02:00
Matthew Honnibal
acbf6345c9 Fix thinc dependency 2020-06-29 13:56:07 +02:00
Adriane Boyd
736f3cb5af Bump version and deps for v2.3.0
* spacy to v2.3.0
* thinc to v7.4.1
* spacy-lookups-data to v0.3.2
2020-05-25 12:03:49 +02:00
Sofie Van Landeghem
7f5715a081
Various fixes to NEL functionality, Example class etc (#5460)
* setting KB in the EL constructor, similar to how the model is passed on

* removing wikipedia example files - moved to projects

* throw an error when nlp.update is called with 2 positional arguments

* rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config

* update config files with new parameters

* avoid training pipeline components that don't have a model (like sentencizer)

* various small fixes + UX improvements

* small fixes

* set thinc to 8.0.0a9 everywhere

* remove outdated comment
2020-05-20 11:41:12 +02:00
svlandeg
6fb6a8518c bump to 3.0.0.dev7 and thinc to 8.0.0a8 2020-05-15 13:25:54 +02:00
svlandeg
9fe1e23512 update to thinc 8.0.0a6 2020-05-12 13:51:25 +02:00