Commit Graph

7807 Commits

Author SHA1 Message Date
svlandeg
35a3931064 fix typo 2020-09-17 16:36:27 +02:00
svlandeg
ddfc1fc146 add pretraining option to init config 2020-09-17 16:05:40 +02:00
svlandeg
427dbecdd6 cleanup and formatting 2020-09-17 11:48:04 +02:00
svlandeg
0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg
781fae678b Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
Matthew Honnibal
8303d101a5 Set version to v3.0.0a19 2020-09-17 00:18:49 +02:00
Adriane Boyd
7e4cd7575c
Refactor Docs.is_ flags (#6044)
* Refactor Docs.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_from_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
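
The partial versus complete distinction described for `Doc.has_annotation` in this commit can be illustrated with a toy model. This is a minimal sketch in plain Python, not spaCy's implementation: the standalone `has_annotation` helper and the use of `None` to mark an unset value are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def has_annotation(values: Sequence[Optional[str]], require_complete: bool = False) -> bool:
    """Toy stand-in for the check described above (hypothetical helper).

    `values` holds one entry per token; None marks an unset value.
    """
    if require_complete:
        # Complete annotation: every token carries a value.
        return all(v is not None for v in values)
    # Partial annotation: at least one token carries a value.
    return any(v is not None for v in values)


tags = ["DT", None, "NN"]  # the middle token is untagged
partial = has_annotation(tags)
complete = has_annotation(tags, require_complete=True)
```

With the sample `tags` list, the partial check succeeds while the complete check fails, because one token has no value.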
Adriane Boyd
a119667a36
Clean up spacy.tokens (#6046)
* Clean up spacy.tokens

* Update `set_children_from_heads`:
  * Don't check `dep` when setting lr_* or sentence starts
  * Set all non-sentence starts to `False`

* Use `set_children_from_heads` in `Token.head` setter
  * Reduce similar/duplicate code (admittedly adds a bit of overhead)
  * Update sentence starts consistently

* Remove unused `Doc.set_parse`

* Minor changes:
  * Declare cython variables (to avoid cython warnings)
  * Clean up imports

* Modify set_children_from_heads to set token range

Modify `set_children_from_heads` so that it adjusts tokens within a
specified range rather than the whole document.

Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00
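
The idea of deriving sentence starts from heads for only a token range, without checking dependency labels, might look roughly like the following. This is a hedged sketch in plain Python, not the Cython code in spaCy: the function name `sent_starts_from_heads`, the use of absolute head indices, and the convention that a root is its own head are all assumptions for illustration. It also assumes well-formed (acyclic) heads and that each tree covers a contiguous span of tokens.

```python
from typing import List


def sent_starts_from_heads(heads: List[int], start: int, end: int) -> List[bool]:
    """Return sentence-start flags for the tokens in [start, end).

    heads[i] is the absolute index of token i's head; a root is its own head.
    """

    def root_of(i: int) -> int:
        # Follow head pointers until reaching a token that is its own head.
        while heads[i] != i:
            i = heads[i]
        return i

    # Compare against the token just before the range so that a range
    # beginning mid-sentence is not wrongly marked as a sentence start.
    prev_root = root_of(start - 1) if start > 0 else None
    flags = []
    for i in range(start, end):
        root = root_of(i)
        # A token starts a sentence when its root differs from the previous one.
        flags.append(root != prev_root)
        prev_root = root
    return flags


heads = [1, 1, 1, 4, 4]  # two trees: tokens 0-2 (root 1) and tokens 3-4 (root 4)
whole_doc = sent_starts_from_heads(heads, 0, 5)
sub_range = sent_starts_from_heads(heads, 1, 3)
```

Restricting the work to a range means a single head reassignment only needs the affected tokens to be revisited, which is the motivation given in the commit message.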
Matthew Honnibal
c776594ab1 Fix 2020-09-16 18:15:14 +02:00
Matthew Honnibal
4a573d18b3 Add comment 2020-09-16 17:51:29 +02:00
Matthew Honnibal
d31afc8334 Fix Language.link_components when model is None 2020-09-16 17:49:48 +02:00
Adriane Boyd
f3db3f6fe0
Add vectors option to CharacterEmbed (#6069)
* Add vectors option to CharacterEmbed

* Update spacy/pipeline/morphologizer.pyx

* Adjust default morphologizer config

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Adriane Boyd
d722a439aa
Remove unneeded methods in senter and morphologizer (#6074)
Now that the tagger doesn't manage the tag map, the child classes senter
and morphologizer don't need to override the serialization methods.
2020-09-16 17:39:41 +02:00
Adriane Boyd
87c329c711
Set rule-based lemmatizers as default (#6076)
For languages without provided models and with lemmatizer rules in
`spacy-lookups-data`, make the rule-based lemmatizer the default:
Bengali, Persian, Norwegian, Swedish
2020-09-16 17:37:29 +02:00
svlandeg
1040e250d8 actual commit with test for custom readers with ml_datasets >= 0.2 2020-09-16 16:41:28 +02:00
svlandeg
714a5a05c6 test for custom readers with ml_datasets >= 0.2 2020-09-16 16:39:55 +02:00
svlandeg
0d1392340f Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-15 23:17:08 +02:00
svlandeg
f420aa1138 use e.value to get to the ExceptionInfo value 2020-09-15 22:30:09 +02:00
svlandeg
7336657662 corpus is a Dict 2020-09-15 22:07:16 +02:00
svlandeg
51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00
svlandeg
bd87e8686e move tests to correct subdir 2020-09-15 21:40:38 +02:00
Ines Montani
aaf01689a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-15 14:24:42 +02:00
Ines Montani
91a6637f74 Remove extra pipe config values before merging 2020-09-15 14:24:17 +02:00
Ines Montani
d3d7f92f05 Fix lang check and error handling in Language.from_config 2020-09-15 14:24:06 +02:00
Ines Montani
2ed6e2a218 Auto-format 2020-09-15 14:20:04 +02:00
Ines Montani
2214d1bb7b
Merge pull request #6067 from explosion/feature/spacy-blank-from-config 2020-09-15 14:18:33 +02:00
Ines Montani
253ba5ef14 Raise for bad Vocab values 2020-09-15 13:25:34 +02:00
svlandeg
7677e5c0e2 fix wandb logger when calling multiple times from same script 2020-09-15 12:56:33 +02:00
Ines Montani
eff9406718 Support vocab arg in spacy.blank 2020-09-15 11:39:36 +02:00
Ines Montani
99549a5ace Fix consistency and update docs 2020-09-15 11:37:37 +02:00
Ines Montani
7dfc4bc062 Allow overriding meta from spacy.blank 2020-09-15 11:12:12 +02:00
Ines Montani
0f943157af Delegate to Language.from_config in spacy.blank 2020-09-15 11:07:55 +02:00
Ines Montani
e977086a9a Update default pretraining config [ci skip] 2020-09-15 01:12:02 +02:00
Ines Montani
154752f9c2 Update docs and consistency [ci skip] 2020-09-15 00:32:49 +02:00
Ines Montani
9cc304c194
Merge pull request #6064 from explosion/fix/sparse-checkout-ux
Fix sparse checkout and error handling
2020-09-15 00:32:20 +02:00
Matthew Honnibal
475323cd36 Set version to v3.0.0a18 2020-09-14 22:05:43 +02:00
Matthew Honnibal
e8378b57bc Fix test 2020-09-14 21:21:13 +02:00
Matthew Honnibal
adf0bab23a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-14 21:04:49 +02:00
Matthew Honnibal
ae15fa9688 Fix iob converter 2020-09-14 21:02:18 +02:00
Sofie Van Landeghem
3216a33149
positive_label config for textcat (#6062)
* hook up positive_label in textcat

* unit tests

* documentation

* formatting

* tests

* fix typo

* move verify_config to after begin_training

* revert accidental commit
2020-09-14 17:08:00 +02:00
Ines Montani
c052017025 Fix sparse checkout and error handling 2020-09-14 14:12:58 +02:00
Matthew Honnibal
fdd2340f6c Set version to v3.0.0a17 2020-09-13 23:52:03 +02:00
Ines Montani
416deb412f Prevent duplicate traceback on CalledProcessError [ci skip] 2020-09-13 19:28:54 +02:00
Ines Montani
61a4ef0b46 Fix syntax error 2020-09-13 19:23:09 +02:00
Matthew Honnibal
b693d2d224 Fix speed report in table 2020-09-13 17:39:31 +02:00
Sofie Van Landeghem
744df9814a
define threshold for scoring textcat in TextCat config (#6055)
* define threshold for scoring textcat in TextCat config

* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Adriane Boyd
ab270364f1
Modify Token.morph to enable unsetting (#6043)
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
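
The setter behaviour described in this commit (reset to `0`, accept a known key, raise for an unknown one) can be sketched with toy classes. Everything here is hypothetical: `ToyStringStore`, `ToyToken` and `set_morph` are illustrative names rather than spaCy's API, integer keys are assigned sequentially instead of by hashing, and the step of registering the value with the morphology is left out.

```python
from typing import Dict


class ToyStringStore:
    """Hypothetical stand-in for a string store: maps strings to integer keys."""

    def __init__(self) -> None:
        self._keys: Dict[str, int] = {}
        self._strings: Dict[int, str] = {}

    def add(self, text: str) -> int:
        if text not in self._keys:
            key = len(self._keys) + 1  # 0 is reserved for "unset"
            self._keys[text] = key
            self._strings[key] = text
        return self._keys[text]

    def __contains__(self, key: int) -> bool:
        return key in self._strings


class ToyToken:
    """Hypothetical token holding a single integer morph value."""

    def __init__(self, strings: ToyStringStore) -> None:
        self.strings = strings
        self.morph = 0  # internal value 0 means "unset"

    def set_morph(self, key: int) -> None:
        if key == 0:
            self.morph = 0  # reset back to the unset state
        elif key in self.strings:
            self.morph = key  # the key is known to the store, so accept it
        else:
            raise ValueError(f"morph key {key} is not in the string store")


store = ToyStringStore()
token = ToyToken(store)
token.set_morph(store.add("Number=Sing"))
```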
Adriane Boyd
c7bd631b5f
Fix token.idx for special cases with affixes (#6035) 2020-09-13 14:05:36 +02:00
Matthew Honnibal
54c40223a1
Improve v3 pretrain command (#6040)
* Starts to run

* Update pretrain script

* Update corpus

* Update pretrain schema

* Remove outdated test

* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00
Ines Montani
febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Ines Montani
a5633b205f Fix handling of errors around git [ci skip] 2020-09-13 10:52:28 +02:00
Ines Montani
f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00
Sofie Van Landeghem
e92e850c72
Raise if empty examples (#6052)
* raise error if no valid Example objects were found during initialization

* fix max_length parameter

* remove commit from other branch

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-12 21:01:53 +02:00
Matthew Honnibal
37347830d4 Fix reading in GloVe vectors 2020-09-12 17:31:18 +02:00
Ines Montani
b41be87213
Merge pull request #6051 from svlandeg/feature/cli-config 2020-09-12 17:12:35 +02:00
Ines Montani
eedaaaec75 Fix handling of existing asset without checksum [ci skip] 2020-09-12 17:02:53 +02:00
svlandeg
a75cfe0da6 Merge remote-tracking branch 'upstream/develop' into feature/cli-config 2020-09-12 14:44:40 +02:00
svlandeg
115147804a string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
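
A helper of the kind this commit message names could be sketched as follows. The function name comes from the message, but the body is an assumption written for illustration, not the code from the commit; in particular, stripping whitespace and dropping empty items are guesses at reasonable behaviour.

```python
from typing import List


def string_to_list(value: str) -> List[str]:
    """Split a comma-separated string into a list of stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


components = string_to_list("tagger, parser,ner")
```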
Ines Montani
f886f5bbc8
Merge pull request #6048 from explosion/fix/clone-compat 2020-09-12 10:30:49 +02:00
svlandeg
711166a75a prevent overwriting score_weights 2020-09-11 15:12:05 +02:00
Ines Montani
62eec33bc4 Fix meta.json validation 2020-09-11 11:38:33 +02:00
Ines Montani
0b2e07215d Support overwriting name on spacy package 2020-09-11 11:38:28 +02:00
svlandeg
5b94aeece9 support pipeline as "list in string" 2020-09-11 11:08:46 +02:00
Ines Montani
1bce432b4a Adjust message [ci skip] 2020-09-11 10:00:49 +02:00
Ines Montani
5acd4fbcd8 Merge branch 'develop' into fix/clone-compat 2020-09-11 09:58:30 +02:00
Ines Montani
761bd60d43 Adjust info message 2020-09-11 09:57:00 +02:00
Ines Montani
6831161bfa Resolve path to be extra sure 2020-09-11 09:56:49 +02:00
svlandeg
1723fb73c4 remove junk 2020-09-10 17:44:59 +02:00
svlandeg
08a831ce83 process trailing slash if any 2020-09-10 17:39:52 +02:00
Ines Montani
3e83a509bb WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
svlandeg
f1bc09c1e9 restore partly 2020-09-10 14:53:02 +02:00
svlandeg
3889747119 asset fix & UX 2020-09-10 14:36:53 +02:00
svlandeg
a36766d153 hookup branch 2020-09-10 12:00:34 +02:00
svlandeg
97d99f7efa Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes 2020-09-10 11:51:34 +02:00
Ines Montani
908f3a4494 Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
svlandeg
92f9d2f406 small UX fixes 2020-09-10 11:35:50 +02:00
svlandeg
1fc5486792 more fine-grained errors for git_sparse_checkout 2020-09-10 11:31:32 +02:00
Ines Montani
15bc3a37b4 Add --branch to project clone 2020-09-10 11:08:15 +02:00
Ines Montani
1955aaaa20
Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip] 2020-09-09 21:46:40 +02:00
Sofie Van Landeghem
cb66ea7400
Remove simple_ner code (#6041)
* remove simple_ner code

* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
svlandeg
39aa740777 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-09 11:59:34 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
svlandeg
d0a8849e4d fix typo 2020-09-08 18:32:12 +02:00
svlandeg
bd8f9b188b small fixes 2020-09-08 17:24:36 +02:00
Matthew Honnibal
4b82882767 Fix defaults 2020-09-08 15:31:21 +02:00
Matthew Honnibal
5d09e3e154 Set version to v3.0.0a15 2020-09-08 15:25:10 +02:00
Matthew Honnibal
ba5f4c9b32 Add words and seconds to train info 2020-09-08 15:24:47 +02:00
Matthew Honnibal
b470062153
Add CLI registry (#6037) 2020-09-08 15:23:34 +02:00
svlandeg
06ef66fd73 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-08 10:28:42 +02:00
Matthew Honnibal
dae22f3dfa Fix ignoring of punct labels 2020-09-05 14:11:59 +02:00
Matthew Honnibal
12e1279f6b Set version to v3.0.0a14 2020-09-05 04:13:53 +02:00
Matthew Honnibal
4b7abaafdb Fix learn rate for non-transformer 2020-09-04 21:22:50 +02:00
Matthew Honnibal
465785a672 Fix project pull and push 2020-09-04 21:15:55 +02:00
Ines Montani
f174c7b1f3 Merge branch 'develop' into pr/6018 2020-09-04 15:54:49 +02:00
Ines Montani
f06eed800e
Merge pull request #6029 from explosion/master-tmp 2020-09-04 15:11:55 +02:00
Ines Montani
f9550b4493 Fix components in meta.json and website [ci skip] 2020-09-04 14:42:12 +02:00
Ines Montani
d7cc2ee72d Fix tests 2020-09-04 14:05:55 +02:00
Ines Montani
90043a6f9b Tidy up and auto-format 2020-09-04 13:42:33 +02:00
Ines Montani
df0b68f60e Remove unicode declarations and update language data 2020-09-04 13:19:16 +02:00