Commit Graph

13508 Commits

Author SHA1 Message Date
svlandeg
e4fc7e0222 fixing output sample to proper 2D array 2020-09-17 22:34:36 +02:00
Adriane Boyd
8b650f3a78 Modify setting missing and blocked entity tokens
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.

* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token

* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked

* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
2020-09-17 21:27:42 +02:00
Ines Montani
9062585a13
Merge pull request #6087 from explosion/docs/pretrain-usage [ci skip] 2020-09-17 19:25:24 +02:00
Ines Montani
a0b4389a38 Update docs [ci skip] 2020-09-17 19:24:48 +02:00
Matthew Honnibal
6efb7688a6 Draft pretrain usage 2020-09-17 18:17:03 +02:00
Sofie Van Landeghem
ed0fb034cb
ml_datasets v0.2.0a0 2020-09-17 18:11:10 +02:00
Ines Montani
1bb8b4f824 Merge branch 'master' into develop 2020-09-17 17:46:20 +02:00
Ines Montani
6bd0d25fb9
Merge pull request #6085 from explosion/docs/static-vectors-intro [ci skip] 2020-09-17 17:14:45 +02:00
Ines Montani
a2c8cda26f Update docs [ci skip] 2020-09-17 17:12:51 +02:00
Ines Montani
2c80f41852
Merge pull request #6084 from svlandeg/feature/init-config-pretrain [ci skip] 2020-09-17 16:59:14 +02:00
Ines Montani
2e3ce9f42f Merge branch 'feature/init-config-pretrain' of https://github.com/svlandeg/spaCy into pr/6084 2020-09-17 16:58:49 +02:00
Ines Montani
3d8e010655 Change order 2020-09-17 16:58:46 +02:00
Ines Montani
c4b414b282
Update website/docs/api/cli.md 2020-09-17 16:58:09 +02:00
Ines Montani
3865214343 Use consistent shortcut 2020-09-17 16:57:02 +02:00
Sofie Van Landeghem
e5ceec5df0
Update website/docs/api/cli.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-09-17 16:56:20 +02:00
Sofie Van Landeghem
127ce0c574
Update website/docs/api/cli.md
Co-authored-by: Ines Montani <ines@ines.io>
2020-09-17 16:55:53 +02:00
Matthew Honnibal
ec751068f3 Draft text for static vectors intro 2020-09-17 16:42:53 +02:00
svlandeg
35a3931064 fix typo 2020-09-17 16:36:27 +02:00
svlandeg
5fade4feb7 fix cli abbrev 2020-09-17 16:15:20 +02:00
svlandeg
ddfc1fc146 add pretraining option to init config 2020-09-17 16:05:40 +02:00
svlandeg
3a3110ef60 remove empty files 2020-09-17 15:44:11 +02:00
svlandeg
c8c84f1ccd Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 15:43:04 +02:00
svlandeg
130ffa5fbf fix typos in docs 2020-09-17 14:59:41 +02:00
Matthew Honnibal
b57ce9a875 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-17 13:59:25 +02:00
Matthew Honnibal
30e85b2a42 Remove outdated configs 2020-09-17 13:59:12 +02:00
Ines Montani
c8fa2247e3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-17 12:34:15 +02:00
Ines Montani
6761028c6f Update docs [ci skip] 2020-09-17 12:34:11 +02:00
svlandeg
427dbecdd6 cleanup and formatting 2020-09-17 11:48:04 +02:00
svlandeg
0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg
8cedb2f380 Merge branch 'fix/corpus' of https://github.com/svlandeg/spaCy into fix/corpus 2020-09-17 09:27:55 +02:00
svlandeg
781fae678b Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
Sofie Van Landeghem
21dcf92964
Update website/docs/api/data-formats.md
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 09:21:36 +02:00
Matthew Honnibal
8303d101a5 Set version to v3.0.0a19 2020-09-17 00:18:49 +02:00
Adriane Boyd
7e4cd7575c
Refactor Docs.is_ flags (#6044)
* Refactor Docs.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_form_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
Adriane Boyd
a119667a36
Clean up spacy.tokens (#6046)
* Clean up spacy.tokens

* Update `set_children_from_heads`:
  * Don't check `dep` when setting lr_* or sentence starts
  * Set all non-sentence starts to `False`

* Use `set_children_from_heads` in `Token.head` setter
  * Reduce similar/duplicate code (admittedly adds a bit of overhead)
  * Update sentence starts consistently

* Remove unused `Doc.set_parse`

* Minor changes:
  * Declare cython variables (to avoid cython warnings)
  * Clean up imports

* Modify set_children_from_heads to set token range

Modify `set_children_from_heads` so that it adjust tokens within a
specified range rather then the whole document.

Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00
Matthew Honnibal
c776594ab1 Fix 2020-09-16 18:15:14 +02:00
Matthew Honnibal
4a573d18b3 Add comment 2020-09-16 17:51:29 +02:00
Matthew Honnibal
d31afc8334 Fix Language.link_components when model is None 2020-09-16 17:49:48 +02:00
Adriane Boyd
f3db3f6fe0
Add vectors option to CharacterEmbed (#6069)
* Add vectors option to CharacterEmbed

* Update spacy/pipeline/morphologizer.pyx

* Adjust default morphologizer config

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Adriane Boyd
d722a439aa
Remove unneeded methods in senter and morphologizer (#6074)
Now that the tagger doesn't manage the tag map, the child classes senter
and morphologizer don't need to override the serialization methods.
2020-09-16 17:39:41 +02:00
Adriane Boyd
87c329c711
Set rule-based lemmatizers as default (#6076)
For languages without provided models and with lemmatizer rules in
`spacy-lookups-data`, make the rule-based lemmatizer the default:
Bengali, Persian, Norwegian, Swedish
2020-09-16 17:37:29 +02:00
svlandeg
0dc914b667 bump thinc to 8.0.0a33 2020-09-16 16:42:58 +02:00
svlandeg
1040e250d8 actual commit with test for custom readers with ml_datasets >= 0.2 2020-09-16 16:41:28 +02:00
svlandeg
714a5a05c6 test for custom readers with ml_datasets >= 0.2 2020-09-16 16:39:55 +02:00
svlandeg
0d1392340f Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-15 23:17:08 +02:00
Ines Montani
4d75040546
Merge pull request #6072 from svlandeg/bugfix/ExceptionInfo
Fix unit test with ExceptionInfo
2020-09-15 22:52:48 +02:00
svlandeg
f420aa1138 use e.value to get to the ExceptionInfo value 2020-09-15 22:30:09 +02:00
svlandeg
55f8d5478e fix example output 2020-09-15 22:09:30 +02:00
svlandeg
7336657662 corpus is a Dict 2020-09-15 22:07:16 +02:00
svlandeg
51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00