875 Commits

Author SHA1 Message Date
Adriane Boyd
bdb485cc80
Add callback to copy vocab/tokenizer from model (#7750)
* Add callback to copy vocab/tokenizer from model

Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer
settings and/or vocab (including vectors) from a base model.

* Move spacy.copy_from_base_model.v1 to spacy.training.callbacks

* Add documentation

* Modify to specify model as tokenizer and vocab params
2021-04-22 12:36:50 +02:00
Adriane Boyd
f68fc29130
Update sent_starts in Example.from_dict (#7847)
* Update sent_starts in Example.from_dict

Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.

Use `Optional[bool]` as the type for sent start values in the docs.

* Use helper function for conversion to ternary ints
2021-04-22 11:32:45 +02:00
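The `sent_starts` change in f68fc29130 mentions a helper that converts `Optional[bool]` values to ternary ints. A minimal sketch of such a conversion follows, assuming the three-valued encoding `1` (sentence start), `0` (unknown) and `-1` (not a sentence start). The function name and error handling here are illustrative and are not taken from the commit itself.

```python
from typing import List, Optional


def to_ternary_int(value: Optional[bool]) -> int:
    """Map True/None/False to 1/0/-1 (hypothetical helper for illustration)."""
    if value is True:
        return 1
    if value is None:
        return 0
    if value is False:
        return -1
    raise ValueError(f"expected True, False or None, got {value!r}")


# One value per token: start, unknown, not a start, start.
sent_starts: List[Optional[bool]] = [True, None, False, True]
ternary = [to_ternary_int(v) for v in sent_starts]
print(ternary)
```

Checking identity with `is` rather than truthiness keeps `None` (unknown) distinct from `False` (explicitly not a sentence start), which is the distinction the commit message is about.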
Adriane Boyd
d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
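The idea behind d2bdaa7823 (negative row indices yield 0-vectors) can be illustrated without any spaCy or Thinc code. The sketch below uses plain Python lists and a hypothetical `lookup_rows` function; it is not the `StaticVectors` implementation, only a demonstration of why negative indices need special handling in a Python-style lookup.

```python
from typing import List, Sequence


def lookup_rows(table: Sequence[Sequence[float]], rows: Sequence[int]) -> List[List[float]]:
    """Return one vector per row index, using a zero vector for negative indices."""
    width = len(table[0])
    output = []
    for row in rows:
        if row < 0:
            # Without this branch, table[-1] would silently return the LAST row.
            output.append([0.0] * width)
        else:
            output.append(list(table[row]))
    return output


table = [[1.0, 2.0], [3.0, 4.0]]
vectors = lookup_rows(table, [1, -1, 0])
print(vectors)
```

A naive `table[row]` lookup with `row == -1` would return `[3.0, 4.0]` here, which is a valid but wrong vector, so the explicit branch is the whole point of the sketch.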
Sofie Van Landeghem
6f565cf39d
fix typo in entity_linker docs 2021-04-22 09:59:24 +02:00
Sofie Van Landeghem
2e746dbf32
update EL training data format in docs (#7839)
* update EL training data format

* fix typo

* all -1 because reasons
2021-04-22 08:50:09 +02:00
Sofie Van Landeghem
c786e98e56
assemble CLI command (#7783)
* assemble CLI command

* ensure assemble runs even without training section

* cleanup
2021-04-19 18:39:11 +10:00
Adriane Boyd
673e2bc4c0
Add usage docs for streamed train corpora (#7693) 2021-04-09 16:15:38 +02:00
Sofie Van Landeghem
204c2f116b
Extend score_spans for overlapping & non-labeled spans (#7209)
* extend span scorer with consider_label and allow_overlap

* unit test for spans y2x overlap

* add score_spans unit test

* docs for new fields in scorer.score_spans

* rename to include_label

* spell out if-else for clarity

* rename to 'labeled'

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-04-08 12:19:17 +02:00
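To illustrate the `labeled` option described in 204c2f116b, here is a rough sketch of span-level precision/recall/F-score in plain Python. Spans are modelled as `(start, end, label)` tuples and `span_prf` is a hypothetical function; the overlap handling from the commit is not modelled, and this is not the `Scorer.score_spans` implementation.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def span_prf(gold: Set[Span], pred: Set[Span], labeled: bool = True) -> Dict[str, float]:
    """Compute precision, recall and F-score over exact span matches."""
    if not labeled:
        # Ignore labels: two spans match if their boundaries match.
        gold = {(start, end, "") for start, end, _ in gold}
        pred = {(start, end, "") for start, end, _ in pred}
    true_pos = len(gold & pred)
    p = true_pos / len(pred) if pred else 0.0
    r = true_pos / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


gold = {(0, 2, "PER"), (5, 7, "ORG")}
pred = {(0, 2, "PER"), (5, 7, "LOC")}
labeled_scores = span_prf(gold, pred, labeled=True)
unlabeled_scores = span_prf(gold, pred, labeled=False)
print(labeled_scores, unlabeled_scores)
```

In this example the second predicted span has the right boundaries but the wrong label, so it counts as a miss in the labeled setting and as a hit in the unlabeled one.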
broaddeep
ee159b8543
Support match alignments (#7321)
* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-04-08 18:10:14 +10:00
Ines Montani
1d1cfadbca Fix formatting [ci skip] 2021-04-06 14:13:13 +10:00
Ayush Chaurasia
3c2ce41dd8
W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429)
* Add optional artifacts logging

* Update docs

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/training/loggers.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Bump WandbLogger Version

* Add documentation of v1 to legacy docs

* bump spacy-legacy to 3.0.2 (to be released)

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-01 19:36:23 +02:00
Sofie Van Landeghem
59c2069eb1
Legacy docs (#7601)
* document legacy Tok2Vec architectures

* add TextCatEnsemble.v1 legacy documentation

* Separate legacy section in side bar
2021-03-30 12:43:14 +02:00
Ines Montani
3ee2fcfba0
Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip] 2021-03-22 12:37:06 +01:00
Ines Montani
88e5a0dc16
Merge pull request #7504 from polm/fix/lexeme-docs [ci skip]
Fix mismatched backtick in Lexeme docs
2021-03-22 12:36:44 +01:00
Paul O'Leary McCann
e39c0dcf33 Fix mismatched backtick in Lexeme docs 2021-03-20 18:40:00 +09:00
Adriane Boyd
c771ec22f0 Update matcher errors and docs
* Mention `tagger+attribute_ruler` in `POS`/`MORPH` error messages for
`Matcher` and `PhraseMatcher`
* Document `Matcher.__call__(allow_missing=)`
2021-03-19 10:11:18 +01:00
Adriane Boyd
83c1b919a7 Fix positional/option in CLI types 2021-03-18 13:31:42 +01:00
Adriane Boyd
9fd41d6742 Remove Language.pipe cleanup arg 2021-03-18 13:31:42 +01:00
Ines Montani
c67d5a6eb0
Merge pull request #7394 from adrianeboyd/docs/ner-example-data-readme 2021-03-13 04:26:18 +01:00
Ines Montani
068b97a617
Merge pull request #7408 from adrianeboyd/bugfix/load-keyword-only 2021-03-13 04:25:50 +01:00
Adriane Boyd
3168103605 Fix type of spacy train --output in docs 2021-03-12 10:04:57 +01:00
Adriane Boyd
03e9e7b567 Add --code option to init fill-config 2021-03-12 10:03:57 +01:00
Adriane Boyd
124304b146 Add vocab kwarg back to spacy.load
* Additional minor formatting and docs cleanup
2021-03-11 10:58:59 +01:00
Adriane Boyd
84470d9b9e Incorporate BILUO note from #7407 2021-03-11 10:11:21 +01:00
Adriane Boyd
4294bcf4ab Align keyword-only in docs for init/util 2021-03-11 09:52:40 +01:00
Adriane Boyd
28726c25a1 Update docs for convert CLI and NER examples 2021-03-10 11:42:02 +01:00
Adriane Boyd
d746ea6278
Add warning about GPU selection in Jupyter notebooks (#7075)
* Initial warning

* Update check

* Redo edit

* Move jupyter warning to helper method

* Add link with details to warnings
2021-03-09 15:35:21 +01:00
Sofie Van Landeghem
932887b950
textcat scoring fix and multi_label docs (#6974)
* add multi-label textcat to menu

* add infobox on textcat API

* add info to v3 migration guide

* small edits

* further fixes in doc strings

* add infobox to textcat architectures

* add textcat_multilabel to overview of built-in components

* spelling

* fix unrelated warn msg

* Add textcat_multilabel to quickstart [ci skip]

* remove separate documentation page for multilabel_textcategorizer

* small edits

* positive label clarification

* avoid duplicating information in self.cfg and fix textcat.score

* fix multilabel textcat too

* revert threshold to storage in cfg

* revert threshold stuff for multi-textcat

Co-authored-by: Ines Montani <ines@ines.io>
2021-03-09 23:04:22 +11:00
Sofie Van Landeghem
cd70c3cb79
Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
svlandeg
682a6232e3 fix typo 2021-03-02 17:59:13 +01:00
graue70
0fddc0447c
Fix copy & paste error in API docs 2021-03-02 14:00:14 +01:00
Ines Montani
8f7c7b2658
Merge pull request #7211 from svlandeg/docs/el_update [ci skip]
kb.get_candidates renamed to get_alias_candidates
2021-02-27 11:51:22 +11:00
Ines Montani
408b94887a
Merge pull request #7207 from adrianeboyd/docs/get-noun-chunks [ci skip]
Extend docs related to Vocab.get_noun_chunks
2021-02-27 11:51:08 +11:00
svlandeg
248339039e fix type in docs 2021-02-26 14:27:10 +01:00
svlandeg
08fd901a1b kb.get_candidates renamed to get_alias_candidates 2021-02-25 20:09:36 +01:00
Adriane Boyd
6a37f343d5 Extend docs related to Vocab.get_noun_chunks 2021-02-25 16:38:21 +01:00
Ken
fa7ddc7f88
Update sentencizer documentation example with sentencizer pipe name (#7185) 2021-02-24 08:06:54 +01:00
Sofie Van Landeghem
b92f81d5da
fix NEL config and IO, and n_sents functionality (#7100)
* fix NEL config and IO, and n_sents functionality

* add docs

* fix test
2021-02-22 14:49:52 +11:00
Sofie Van Landeghem
ba5a50f62b
NEL docs & UX (#7129)
* EL set_kb docs fix

* custom warning for set_kb mistake
2021-02-22 11:04:22 +11:00
Sofie Van Landeghem
709c9e75af
span.ent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00
Peter Baumann
61b04a70d5
Run PhraseMatcher on Spans (#6918)
* Add regression test

* Run PhraseMatcher on Spans

* Add test for PhraseMatcher on Spans and Docs

* Add SCA

* Add test with 3 matches in Doc, 1 match in Span

* Update docs

* Use doc.length for find_matches in tokenizer

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-10 23:43:32 +11:00
tarskiandhutch
e897e7aaad
Line 70: syntax error
Original config definition treated dictionary key as a function argument.
2021-02-08 15:24:57 -05:00
Sofie Van Landeghem
6ed423c16c
reduce memory load when reading all vectors from file (#6945)
* reduce memory load when reading all vectors from file

* one more small typo fix
2021-02-07 08:05:43 +08:00
svlandeg
7cda5605a0 add type 2021-02-03 13:13:58 +01:00
svlandeg
94929c2b98 small doc fixes 2021-02-03 13:10:22 +01:00
Ines Montani
a59f3fcf5d Make wheel the default format and update docs [ci skip] 2021-02-01 23:18:43 +11:00
Ines Montani
7752f80f39 Update docs [ci skip] 2021-01-31 16:11:24 +11:00
Ines Montani
45c551037d Update CLI docs [ci skip] 2021-01-30 21:50:23 +11:00
Ines Montani
2332c4280b Update and use unified --build option 2021-01-30 13:11:36 +11:00
Ines Montani
2609ba4e89 Support building wheel in spacy package 2021-01-30 11:54:02 +11:00
Ines Montani
95e958a229
Merge pull request #6852 from explosion/feature/replace-listeners 2021-01-30 00:58:08 +11:00
Ines Montani
e766e8c56d
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-29 21:41:17 +11:00
svlandeg
d7d838281c adding new="3" mentions in the doc 2021-01-29 11:26:37 +01:00
Ines Montani
99af9e7125 Update documentation 2021-01-29 18:45:48 +11:00
Sofie Van Landeghem
24a697abb8
avoid empty aliases and improve UX and docs (#6840) 2021-01-29 08:51:40 +08:00
Sofie Van Landeghem
837a4f53c2
Error handling in nlp.pipe (#6817)
* add error handler for pipe methods

* add unit tests

* remove pipe methods that are the same as in their base class

* have Language keep track of a default error handler

* cleanup

* formatting

* small refactor

* add documentation
2021-01-29 08:51:21 +08:00
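The error-handler pattern from 837a4f53c2 can be sketched generically: a streaming `pipe` function delegates failures to a callback instead of raising directly. Everything below (the `pipe` function, the handler signature, the `shout` component) is a hypothetical stand-in operating on strings and is not spaCy's API.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical handler signature: (component name, component, failed items, exception).
ErrorHandler = Callable[[str, Callable[[str], str], List[str], Exception], None]


def raise_error(name: str, proc: Callable[[str], str], items: List[str], exc: Exception) -> None:
    """Default handler: re-raise the original exception."""
    raise exc


def pipe(
    items: Iterable[str],
    name: str,
    proc: Callable[[str], str],
    error_handler: ErrorHandler = raise_error,
) -> Iterator[str]:
    """Apply proc to each item, delegating any failure to error_handler."""
    for item in items:
        try:
            yield proc(item)
        except Exception as exc:
            error_handler(name, proc, [item], exc)


skipped: List[str] = []


def log_and_skip(name: str, proc: Callable[[str], str], items: List[str], exc: Exception) -> None:
    """Custom handler: remember the failed items and carry on."""
    skipped.extend(items)


def shout(text: str) -> str:
    if not text:
        raise ValueError("empty text")
    return text.upper()


results = list(pipe(["a", "", "b"], "shout", shout, error_handler=log_and_skip))
print(results, skipped)
```

Keeping a raising handler as the default preserves the previous fail-fast behaviour, while a custom handler lets a long-running stream skip bad inputs.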
Ines Montani
230e651ad6 Merge branch 'develop' into master-tmp 2021-01-27 13:26:29 +11:00
Adriane Boyd
c447aa2b98 Update --code arg in evaluate CLI docs 2021-01-26 15:30:46 +01:00
jganseman
907bce7a78
Merge pull request #1 from jganseman/patch-1
Patch 1
2021-01-26 11:12:30 +01:00
jganseman
8bc57ec372
also update is_oov in lexeme docs 2021-01-26 11:09:16 +01:00
jganseman
1f2b0ec168
proposing a more concise explanation for is_oov
2021-01-26 10:53:39 +01:00
Matthew Honnibal
f049df1715
Revert "Set annotations in update" (#6810)
* Revert "Set annotations in update (#6767)"

This reverts commit e680efc7cc.

* Fix version

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/multilabel_textcategorizer.md

* Update website/docs/api/tok2vec.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/usage/layers-architectures.md

* Update website/docs/api/transformer.md

* Update website/docs/api/textcategorizer.md

* Update website/docs/api/tagger.md

* Update spacy/pipeline/entity_linker.py

* Update website/docs/api/sentencerecognizer.md

* Update website/docs/api/pipe.md

* Update website/docs/api/morphologizer.md

* Update website/docs/api/entityrecognizer.md

* Update spacy/pipeline/entity_linker.py

* Update spacy/pipeline/multitask.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/tagger.pyx

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/textcat.py

* Update spacy/pipeline/tok2vec.py

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/trainable_pipe.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update spacy/pipeline/transition_parser.pyx

* Update website/docs/api/entitylinker.md

* Update website/docs/api/dependencyparser.md

* Update spacy/pipeline/trainable_pipe.pyx
2021-01-25 22:18:45 +08:00
Adriane Boyd
61c9f8bf24
Remove transformers model max length section (#6807) 2021-01-25 19:59:34 +08:00
Adriane Boyd
d0236136a2
Fix default config init in Transformer API docs (#6781) 2021-01-21 23:18:03 +08:00
Sofie Van Landeghem
e680efc7cc
Set annotations in update (#6767)
* bump to 3.0.0rc4

* do set_annotations in component update calls

* update docs and remove set_annotations flag

* fix EL test
2021-01-20 11:49:25 +11:00
Ines Montani
f50502dad7 Update docs [ci skip] 2021-01-19 00:22:47 +11:00
Sofie Van Landeghem
fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Adriane Boyd
bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` characters long into smaller tokens of `split_length`
size.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00
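The splitting rule described in bf0cdae8d4 can be sketched on plain strings. The parameter names `long_token_length` and `split_length` come from the commit message; the function itself, its default values, and the assumption that lengths are measured in characters are illustrative only and do not reproduce the component's code.

```python
from typing import List


def split_long_token(text: str, long_token_length: int = 25, split_length: int = 10) -> List[str]:
    """Split text into chunks of split_length if it has at least long_token_length characters."""
    if len(text) < long_token_length:
        return [text]
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


url = "https://example.com/a/very/long/path"
pieces = split_long_token(url)
short = split_long_token("spaCy")
print(pieces, short)
```

As the commit notes, a component like this would run first in the pipeline, so there are no existing token annotations to preserve when a token is cut into pieces.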
Adriane Boyd
9328dd5625
Handle unset token.morph in Morphologizer (#6704)
* Handle unset token.morph in Morphologizer

Handle unset `token.morph` in `Morphologizer.initialize` and
`Morphologizer.get_loss`. If both `token.morph` and `token.pos` are
unset, treat the annotation as missing rather than empty.

* Add token.has_morph()
2021-01-15 17:20:10 +01:00
Adriane Boyd
0c936004d1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3 2021-01-14 11:49:58 +01:00
Matthew Honnibal
f277bfdf0f
Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)
* Draft out initial Spans data structure

* Initial span group commit

* Basic span group support on Doc

* Basic test for span group

* Compile span_group.pyx

* Draft addition of SpanGroup to DocBin

* Add deserialization for SpanGroup

* Add tests for serializing SpanGroup

* Fix serialization of SpanGroup

* Add EdgeC and GraphC structs

* Add draft Graph data structure

* Compile graph

* More work on Graph

* Update GraphC

* Upd graph

* Fix walk functions

* Let Graph take nodes and edges on construction

* Fix walking and getting

* Add graph tests

* Fix import

* Add module with the SpanGroups dict thingy

* Update test

* Rename 'span_groups' attribute

* Try to fix c++11 compilation

* Fix test

* Update DocBin

* Try to fix compilation

* Try to fix graph

* Improve SpanGroup docstrings

* Add doc.spans to documentation

* Fix serialization

* Tidy up and add docs

* Update docs [ci skip]

* Add SpanGroup.has_overlap

* WIP updated Graph API

* Start testing new Graph API

* Update Graph tests

* Update Graph

* Add docstring

Co-authored-by: Ines Montani <ines@ines.io>
2021-01-14 17:30:41 +11:00
Sofie Van Landeghem
75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem
82ae95267a
Docs for pretrain architectures (#6605)
* document pretraining architectures

* formatting

* bit more info

* small fixes
2021-01-06 16:12:30 +11:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Ines Montani
6f83abb971
Merge pull request #6647 from svlandeg/feature/init_config_overwrite 2021-01-05 14:59:04 +11:00
Ines Montani
3614472e29
Merge pull request #6646 from svlandeg/feature/cli-docs [ci skip] 2021-01-05 13:52:49 +11:00
Ines Montani
9c078a5885
Update formatting for consistency [ci skip] 2021-01-05 13:52:28 +11:00
Ines Montani
a9e845426f Use --force for consistency and add docs 2021-01-05 13:49:59 +11:00
svlandeg
d5ff0fecf8 add docs 2020-12-30 14:01:13 +01:00
svlandeg
2fa23b0304 fix capitalization for link 2020-12-29 15:01:22 +01:00
svlandeg
43cc6aea93 remove non-existing link 2020-12-29 14:59:39 +01:00
svlandeg
543073bf9d add pretrain example 2020-12-29 14:51:23 +01:00
svlandeg
1d0ef98873 move example 2020-12-29 14:46:03 +01:00
svlandeg
20113b8063 add train CLI example 2020-12-29 14:44:56 +01:00
Sofie Van Landeghem
87562e470d
fix backticks in docs (#6635) 2020-12-27 22:12:37 +01:00
Sofie Van Landeghem
8df5b7f513
fix documentation of 'path' in tokenizer.to_disk (#6634) 2020-12-27 22:01:06 +01:00
Sofie Van Landeghem
282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Gareth Sparks
efc229c3f4
Doc.char_span arg: alignment_mode (#6591)
Currently labeled "mode", actually "alignment_mode"
2020-12-18 09:54:56 +01:00
Ines Montani
513c4e332a
Include custom code via spacy package command (#6531) 2020-12-10 20:36:46 +08:00
Ines Montani
2a6043fabb
Merge pull request #6530 from explosion/feature/init-config-cpu-gpu 2020-12-10 09:38:46 +11:00
Ines Montani
9d32e839d3 Merge branch 'develop' into feature/init-config-cpu-gpu 2020-12-10 08:50:53 +11:00
Adriane Boyd
972820e2b3 Add batch_size to data formats docs 2020-12-09 12:44:04 +01:00
Adriane Boyd
80ac8af1bf Format 2020-12-09 12:44:01 +01:00
Adriane Boyd
795b5bd049
Update website/docs/api/language.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-12-09 12:23:32 +01:00
Adriane Boyd
fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
Ines Montani
34449b66fd Update matcher.md 2020-12-09 11:09:45 +11:00
Ines Montani
758ad6c3cd Make CPU the default for init config 2020-12-09 11:00:51 +11:00
Ines Montani
94a5a9814f Update argument handling and documentation 2020-12-08 20:41:18 +11:00
Adriane Boyd
5ceac425ee Remove non-working --use-chars from train CLI
Remove the non-working `--use-chars` option from the train CLI. The
implementation of the option across component types and the CLI settings
could be fixed, but the `CharacterEmbed` model does not work on GPU in
v2 so it's better to remove it.
2020-12-08 08:30:00 +01:00
Sofie Van Landeghem
2c27093c5f
require_cpu functionality (#6336)
* add require_cpu from Thinc 8.0.0rc2

* add docs

* fix test if cupy is not installed
2020-12-08 14:42:40 +08:00
Ines Montani
ee2ec52f48
Merge pull request #6409 from svlandeg/feature/trf-docs 2020-12-08 06:32:10 +01:00
Ines Montani
82e88f0e3b
Merge pull request #6379 from svlandeg/fix/labels-constructor 2020-12-08 06:29:56 +01:00
svlandeg
636be3c791 Merge remote-tracking branch 'upstream/develop' into feature/trf-docs 2020-11-19 14:15:35 +01:00
Sofie Van Landeghem
165993d8e5
fix typo in transformer docs (#6404) 2020-11-19 14:11:38 +01:00
svlandeg
73fc1ed963 remove labels from morphologizer constructor 2020-11-11 21:48:50 +01:00
svlandeg
fcd79e0655 remove set_morphology from docs 2020-11-11 21:32:34 +01:00
svlandeg
789fb3d124 add docs for upstream argument of TransformerListener 2020-11-09 21:42:58 +01:00
Ines Montani
363ac73c72 Update docs [ci skip] 2020-11-09 12:43:26 +08:00
Adriane Boyd
8644ee3e3f
Update TIGER link and tag description (#6344) 2020-11-05 09:33:00 +01:00
Sofie Van Landeghem
8ef056cf98
fix embed_size in Entity Linker architecture (#6343) 2020-11-04 22:20:13 +01:00
Adriane Boyd
a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_spans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
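The missing-value behaviour described in a4b32b9552 (return `None` when no reference annotation exists, display `-`, save `null`) can be sketched as follows. The `accuracy` and `format_score` functions and the `tag_acc` key are hypothetical names used for illustration, not the scorer's actual code.

```python
import json
from typing import Optional, Sequence


def accuracy(gold: Sequence[Optional[str]], pred: Sequence[str]) -> Optional[float]:
    """Score only positions with a reference value; return None if there are none."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g is not None]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


def format_score(score: Optional[float]) -> str:
    """Render a score for display, using '-' for a missing score."""
    return "-" if score is None else f"{score:.2f}"


partial = accuracy(["NOUN", None, "VERB"], ["NOUN", "ADJ", "NOUN"])
missing = accuracy([None, None], ["NOUN", "VERB"])
print(format_score(partial), format_score(missing))
print(json.dumps({"tag_acc": missing}))
```

Returning `None` rather than `0.0` keeps "no annotation available" distinguishable from "everything was wrong", and `json.dumps` serializes it as `null` without extra work.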
Sofie Van Landeghem
75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani
4d99d2b94a Update docs [ci skip] 2020-10-13 11:38:52 +02:00
svlandeg
40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
Ines Montani
e50dc2c1c9 Update docs [ci skip] 2020-10-09 12:04:52 +02:00
Ines Montani
329b61ee7b Update docs [ci skip] 2020-10-09 10:36:06 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani
064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
Ines Montani
43e59bb22a Update docs and install extras [ci skip] 2020-10-08 10:58:50 +02:00
svlandeg
eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
Ines Montani
2fd7122074 Update docs [ci skip] 2020-10-06 10:31:48 +02:00
Ines Montani
568e12215d
Merge pull request #6206 from svlandeg/fix/patterns-init 2020-10-06 10:27:23 +02:00
svlandeg
9b4cf7b0b6 update output of debug config command 2020-10-06 09:47:23 +02:00
svlandeg
fd0f60e2bc updates to data format for training and pretraining 2020-10-06 09:28:53 +02:00
svlandeg
ff9ac39c88 read entity_ruler patterns with srsly.read_jsonl.v1 2020-10-05 22:50:14 +02:00
Ines Montani
1a554bdcb1 Update docs and docstring [ci skip] 2020-10-05 21:55:27 +02:00
Matthew Honnibal
919790cb47 Upd MultiHashEmbed docs 2020-10-05 20:28:21 +02:00
svlandeg
193e0d5a98 add docs for entity_ruler.initialize 2020-10-05 18:04:08 +02:00
svlandeg
65abd77779 add finish_update to Pipe 2020-10-05 16:23:33 +02:00
Ines Montani
0f64556c04
Merge pull request #6197 from svlandeg/feature/pipe-docs [ci skip] 2020-10-05 11:55:40 +02:00
svlandeg
52b660e9dc initialize and update explanation 2020-10-05 00:39:36 +02:00
Ines Montani
3c36a57e84
Update data augmenters (#6196)
* Draft lower-case augmenter

* Make warning a debug log

* Update lowercase augmenter, docs and tests

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-10-04 17:46:29 +02:00
Ines Montani
11347f34da Tidy up, tests and docs 2020-10-04 13:54:05 +02:00
Ines Montani
989c59918c Update docs [ci skip] 2020-10-03 18:53:39 +02:00
Ines Montani
7c4ab7e82c Fix Lemmatizer.get_lookups_config 2020-10-03 17:16:10 +02:00
Ines Montani
dd542ec6a4
Fix label initialization of textcat component (#6190) 2020-10-03 17:07:38 +02:00
Ines Montani
35d695a031 Update docs 2020-10-03 16:08:24 +02:00
svlandeg
02247cccaf Merge remote-tracking branch 'upstream/develop' into feature/small-fixes 2020-10-02 20:48:11 +02:00
Sofie Van Landeghem
09dcb75076
small UX fix for DocBin (#6167)
* add informative warning when messing up store_user_data DocBin flags

* cleanup test

* rename to patterns_path
2020-10-02 15:43:32 +02:00
Ines Montani
f0b30aedad
Make lemmatizers use initialize logic (#6182)
* Make lemmatizer use initialize logic and tidy up

* Fix typo

* Raise for uninitialized tables
2020-10-02 15:42:36 +02:00
Ines Montani
df06f7a792 Update docs [ci skip] 2020-10-02 13:24:33 +02:00
Ines Montani
d2aa662ab2
Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip] 2020-10-02 12:10:27 +02:00
Ines Montani
32cdc1c4f4 Update docs [ci skip] 2020-10-02 11:38:03 +02:00
Adriane Boyd
fd09e6b140 Update docs for Token.morph / Token.set_morph 2020-10-02 09:05:15 +02:00
Ines Montani
01c1538c72 Integrate file readers 2020-10-02 01:36:06 +02:00
Ines Montani
6b94cee468 Fix docs [ci skip] 2020-10-02 01:11:19 +02:00
Ines Montani
f2627157c8 Update docs [ci skip] 2020-10-01 17:38:17 +02:00
svlandeg
1328c9fd14 consistently use --code instead of --code-path 2020-10-01 16:59:22 +02:00
Sofie Van Landeghem
a22215f427
Add FeatureExtractor from Thinc (#6170)
* move featureextractor from Thinc

* Update website/docs/api/architectures.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/api/architectures.md

Co-authored-by: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-10-01 16:22:48 +02:00
Ines Montani
0a8a124a6e Update docs [ci skip] 2020-10-01 12:15:53 +02:00