Commit Graph

706 Commits

Author SHA1 Message Date
Adriane Boyd
0c936004d1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
Matthew Honnibal
f277bfdf0f
Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)
* Draft out initial Spans data structure

* Initial span group commit

* Basic span group support on Doc

* Basic test for span group

* Compile span_group.pyx

* Draft addition of SpanGroup to DocBin

* Add deserialization for SpanGroup

* Add tests for serializing SpanGroup

* Fix serialization of SpanGroup

* Add EdgeC and GraphC structs

* Add draft Graph data structure

* Compile graph

* More work on Graph

* Update GraphC

* Upd graph

* Fix walk functions

* Let Graph take nodes and edges on construction

* Fix walking and getting

* Add graph tests

* Fix import

* Add module with the SpanGroups dict thingy

* Update test

* Rename 'span_groups' attribute

* Try to fix c++11 compilation

* Fix test

* Update DocBin

* Try to fix compilation

* Try to fix graph

* Improve SpanGroup docstrings

* Add doc.spans to documentation

* Fix serialization

* Tidy up and add docs

* Update docs [ci skip]

* Add SpanGroup.has_overlap

* WIP updated Graph API

* Start testing new Graph API

* Update Graph tests

* Update Graph

* Add docstring

Co-authored-by: Ines Montani <ines@ines.io>
2021-01-14 17:30:41 +11:00
Sofie Van Landeghem
75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem
82ae95267a
Docs for pretrain architectures (#6605)
* document pretraining architectures

* formatting

* bit more info

* small fixes
2021-01-06 16:12:30 +11:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Ines Montani
6f83abb971
Merge pull request #6647 from svlandeg/feature/init_config_overwrite 2021-01-05 14:59:04 +11:00
Ines Montani
3614472e29
Merge pull request #6646 from svlandeg/feature/cli-docs [ci skip] 2021-01-05 13:52:49 +11:00
Ines Montani
9c078a5885
Update formatting for consistency [ci skip] 2021-01-05 13:52:28 +11:00
Ines Montani
a9e845426f Use --force for consistency and add docs 2021-01-05 13:49:59 +11:00
svlandeg
d5ff0fecf8 add docs 2020-12-30 14:01:13 +01:00
svlandeg
2fa23b0304 fix capitalization for link 2020-12-29 15:01:22 +01:00
svlandeg
43cc6aea93 remove non-existing link 2020-12-29 14:59:39 +01:00
svlandeg
543073bf9d add pretrain example 2020-12-29 14:51:23 +01:00
svlandeg
1d0ef98873 move example 2020-12-29 14:46:03 +01:00
svlandeg
20113b8063 add train CLI example 2020-12-29 14:44:56 +01:00
Sofie Van Landeghem
87562e470d
fix backticks in docs (#6635) 2020-12-27 22:12:37 +01:00
Sofie Van Landeghem
8df5b7f513
fix documentation of 'path' in tokenizer.to_disk (#6634) 2020-12-27 22:01:06 +01:00
Sofie Van Landeghem
282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Gareth Sparks
efc229c3f4
Doc.char_span arg: alignment_mode (#6591)
Currently labeled "mode", actually "alignment_mode"
2020-12-18 09:54:56 +01:00
Ines Montani
513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Ines Montani
2a6043fabb
Merge pull request #6530 from explosion/feature/init-config-cpu-gpu 2020-12-10 09:38:46 +11:00
Ines Montani
9d32e839d3 Merge branch 'develop' into feature/init-config-cpu-gpu 2020-12-10 08:50:53 +11:00
Adriane Boyd
972820e2b3 Add batch_size to data formats docs 2020-12-09 12:44:04 +01:00
Adriane Boyd
80ac8af1bf Format 2020-12-09 12:44:01 +01:00
Adriane Boyd
795b5bd049
Update website/docs/api/language.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-12-09 12:23:32 +01:00
Adriane Boyd
fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
Ines Montani
34449b66fd Update matcher.md 2020-12-09 11:09:45 +11:00
Ines Montani
758ad6c3cd Make CPU the default for init config 2020-12-09 11:00:51 +11:00
Ines Montani
94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Adriane Boyd
5ceac425ee Remove non-working --use-chars from train CLI
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
2020-12-08 08:30:00 +01:00
Sofie Van Landeghem
2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Ines Montani
ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Ines Montani
82e88f0e3b
Merge pull request #6379 from svlandeg/fix/labels-constructor 2020-12-08 06:29:56 +01:00
svlandeg
636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
Sofie Van Landeghem
165993d8e5
fix typo in transformer docs (#6404) 2020-11-19 14:11:38 +01:00
svlandeg
73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg
fcd79e0655 remove set_morphology from docs 2020-11-11 21:32:34 +01:00
svlandeg
789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani
363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Adriane Boyd
8644ee3e3f
Update TIGER link and tag description (#6344) 2020-11-05 09:33:00 +01:00
Sofie Van Landeghem
8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Adriane Boyd
a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Sofie Van Landeghem
75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani
4d99d2b94a Update docs [ci skip] 2020-10-13 11:38:52 +02:00
svlandeg
40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
Ines Montani
e50dc2c1c9 Update docs [ci skip] 2020-10-09 12:04:52 +02:00
Ines Montani
329b61ee7b Update docs [ci skip] 2020-10-09 10:36:06 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani
064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
Ines Montani
43e59bb22a Update docs and install extras [ci skip] 2020-10-08 10:58:50 +02:00