Commit Graph

54 Commits

Author SHA1 Message Date
svlandeg
e30ec9b2a8 fix test checking for variants 2020-06-19 14:05:35 +02:00
svlandeg
25b0674320 clean up 2020-06-19 11:31:01 +02:00
svlandeg
c705a28438 add links to to_dict 2020-06-19 11:22:24 +02:00
svlandeg
6ca6d7d6b4 test for split sentences with various alignment issues, works 2020-06-18 20:01:02 +02:00
svlandeg
6d73e139b0 fix entity linker 2020-06-17 21:12:25 +02:00
svlandeg
4ed399c848 minibatch utility can deal with strings, docs or examples 2020-06-16 21:35:55 +02:00
svlandeg
8b66c11ff2 add spaces to json output format 2020-06-16 19:30:03 +02:00
svlandeg
ba80ad7efd fixed some tests + WIP roundtrip unit test 2020-06-16 18:26:50 +02:00
svlandeg
43d41d6bb6 allow None as BILUO annotation 2020-06-16 15:30:05 +02:00
svlandeg
44a0f9c2c8 test_gold_biluo_different_tokenization works 2020-06-16 15:21:20 +02:00
svlandeg
1c35b8efcd fix spaces 2020-06-16 12:08:25 +02:00
svlandeg
0702a1d3fb fix test for misaligned 2020-06-15 23:10:47 +02:00
svlandeg
a28f8f369e Fix many-to-one IOB codes 2020-06-15 23:06:22 +02:00
svlandeg
12886b787b fixing NER one-to-many alignment 2020-06-15 22:44:17 +02:00
Matthew Honnibal
98ca14f577 Remove GoldParse
WIP on removing goldparse

Get ArcEager compiling after GoldParse excise

Update setup.py

Get spacy.syntax compiling after removing GoldParse

Rename NewExample -> Example and clean up

Clean html files

Start updating tests

Update Morphologizer
2020-06-14 19:53:30 +02:00
Matthew Honnibal
7de997c0a5 Update test 2020-06-13 23:11:45 +02:00
Matthew Honnibal
3eb8f3867e Update test 2020-06-13 23:05:16 +02:00
svlandeg
6a67a11682 adding tests for new example class (some still failing - WIP) 2020-06-11 17:43:40 +02:00
Matthew Honnibal
488727aee0 Start updating test 2020-06-09 23:58:28 +02:00
Matthew Honnibal
d9289712ba * Make GoldCorpus return dict, not Example
* Make Example require a Doc object (previously optional)

Clarify methods in GoldCorpus

WIP refactor Example

Refactor Example.split_sents

Fix test

Fix augment

Update test

Update test

Fix import

Update test_scorer

Update Example
2020-06-09 01:01:59 +02:00
Matthew Honnibal
084271c9e9 Remove GoldParse from public API
* Move get_parses_from_example to spacy.syntax

* Get GoldParse out of Example

* Avoid expecting GoldParse input in parser

* Add Alignment to spacy.gold.align

* Update Example object

* Add comment

* Update pipeline

* Fix imports

* Simplify gold_io

* WIP on GoldCorpus

* Update test

* Xfail some gold tests

* Remove ignore_misaligned option from GoldCorpus

* Fix Example constructor

* Update test

* Fix usage of Example

* Add deprecated_get_gold method on Example

* Patch scorer

* Fix test

* Fix test

* Update tests

* Xfail a test

* Fix passing of make_projective

* Pass make_projective by default

* Hack data format in Example.from_dict

* Update tests

* Fix example.from_dict

* Update morphologizer

* Fix entity linker

* Add get_field to TokenAnnotation

* Fix Example.get_aligned

* Update test

* Fix alignment

* Fix corpus

* Fix GoldCorpus

* Handle misaligned

* Format

* Fix missing import
2020-06-08 22:09:57 +02:00
Ines Montani
24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
Ines Montani
d8f3190c0a Tidy up and auto-format 2020-05-21 14:14:01 +02:00
adrianeboyd
70da1fd2d6 Add warning for misaligned character offset spans (#5007)
* Add warning for misaligned character offset spans

* Resolve conflict

* Filter warnings in example scripts

Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.

Co-authored-by: Ines Montani <ines@ines.io>
2020-05-19 16:01:18 +02:00
adrianeboyd
84e06f9fb7 Improve GoldParse NER alignment (#5335)
Improve GoldParse NER alignment by including all cases where the start
and end of the NER span can be aligned, regardless of internal
tokenization differences.

To do this, convert BILUO tags to character offsets, check start/end
alignment with `doc.char_span()`, and assign the BILUO tags for the
aligned spans. Alignment for `O/-` tags is handled through the
one-to-one and multi alignments.
2020-04-23 16:58:23 +02:00
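Sketch for 84e06f9fb7 (#5335): the alignment idea in this commit message can be pictured in plain Python, without spaCy. The helper names `biluo_to_offsets` and `span_is_alignable` and the `(text, start_char)` input format are hypothetical, and the boundary check only approximates `doc.char_span()`, which returns `None` when the offsets do not fall on token boundaries. This is an illustration under those assumptions, not the code from the commit.

```python
def biluo_to_offsets(tokens, tags):
    """Convert BILUO tags to (start_char, end_char, label) spans.

    tokens: list of (text, start_char) pairs (hypothetical input format).
    tags:   one BILUO tag string per token, e.g. "B-GPE", "L-GPE", "U-ORG", "O".
    """
    spans = []
    start, label = None, None
    for (text, idx), tag in zip(tokens, tags):
        end = idx + len(text)
        if tag.startswith("U-"):
            # Single-token entity.
            spans.append((idx, end, tag[2:]))
        elif tag.startswith("B-"):
            # Remember where a multi-token entity begins.
            start, label = idx, tag[2:]
        elif tag.startswith("L-") and start is not None:
            # Close the entity opened by the matching "B-" tag.
            spans.append((start, end, label))
            start, label = None, None
    return spans


def span_is_alignable(token_bounds, start, end):
    """True if start and end both coincide with token boundaries.

    token_bounds: list of (start_char, end_char) pairs of another tokenization.
    Internal token boundaries inside the span are deliberately ignored.
    """
    starts = {s for s, _ in token_bounds}
    ends = {e for _, e in token_bounds}
    return start in starts and end in ends


# Text: "I like New York"
gold_tokens = [("I", 0), ("like", 2), ("New", 7), ("York", 11)]
gold_tags = ["O", "O", "B-GPE", "L-GPE"]
spans = biluo_to_offsets(gold_tokens, gold_tags)

merged = [(0, 1), (2, 6), (7, 15)]   # tokenizer kept "New York" as one token
shifted = [(0, 1), (2, 9), (9, 15)]  # token boundary falls inside "New"
for start, end, label in spans:
    print(label, span_is_alignable(merged, start, end),
          span_is_alignable(shifted, start, end))
```

The entity stays usable with the `merged` tokenization even though its internal tokens differ, and is rejected only when its start or end falls inside a token.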
adrianeboyd
521f361052 Switch to new gold.align method (#5334)
* Switch from original `_align` to new simpler alignment algorithm from
  #4526

* Remove alignment normalizations beyond whitespace and lowercasing
2020-04-21 19:31:03 +02:00
adrianeboyd
ce0e538068 Check whether doc is instantiated in Example.get_gold_parses() (#5167)
* Check whether doc is instantiated

When creating docs to pair with gold parses, modify test to check
whether a doc is unset rather than whether it contains tokens.

* Restore test of evaluate on an empty doc

* Set a minimal gold.orig for the scorer

Without a minimal gold.orig the scorer can't evaluate empty docs. This
is the v3 equivalent of #4925.
2020-03-29 13:57:00 +02:00
adrianeboyd
06b251dd1e Add support for pos/morphs/lemmas in training data (#4941)
Add support for pos/morphs/lemmas throughout `GoldParse`, `Example`, and
`docs_to_json()`.
2020-01-28 11:36:29 +01:00
Ines Montani
db55577c45 Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
adrianeboyd
79ba1a3b92 Add lemmas to GoldParse / Example / docs_to_json (#4726) 2019-11-28 14:53:44 +01:00
adrianeboyd
0c9640ced3 Replace old gold alignment with new gold alignment (#4710)
Replace old gold alignment that allowed for some noise in the alignment between raw and orth with the new simpler alignment that requires that the raw and orth strings are identical except for whitespace and capitalization.

* Replace old alignment with new alignment, removing `_align.pyx` and
its tests
* Remove all quote normalizations
* Enable test for new align
  * Modify test case for quote normalization
2019-11-25 23:13:26 +01:00
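Sketch for 0c9640ced3 (#4710): the precondition stated in this commit message (the two token sequences must spell the same string once whitespace and capitalization are ignored) can be checked in a few lines. The function names `normalize` and `can_align` are hypothetical, and this shows only the precondition, not the alignment algorithm itself.

```python
def normalize(tokens):
    """Join tokens, drop all whitespace, and lowercase the result."""
    joined = "".join(tokens)
    return "".join(joined.split()).lower()


def can_align(tokens_a, tokens_b):
    """True if both tokenizations cover the same normalized string."""
    return normalize(tokens_a) == normalize(tokens_b)


# Differences in whitespace and case are tolerated.
print(can_align(["New York", "City"], ["new", "york", "city"]))
# Differences in quote style are not, since quote normalization was removed.
print(can_align(['"Hi"'], ["``Hi''"]))
```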
adrianeboyd
392c4880d9 Restructure Example with merged sents as default (#4632)
* Switch to train_dataset() function in train CLI

* Fixes for pipe() methods in pipeline components

* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`

* Update Parser.pipe() for Examples

* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s

* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)

* Fixes to Example implementation in spacy.gold

* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`

* Heads of 0 are not treated as unset

* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)

* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example

* Don't clobber `examples` variable in `iter_gold_docs()`

* Add/modify gold tests for handling projectivity

* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)

* Add test for projective train vs. nonprojective dev versions of the
same `Doc`

* Handle ignore_misaligned as arg rather than attr

Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.

Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).

* Remove unused attrs from gold.pxd

Remove `ignore_misaligned` and `make_projective` from `gold.pxd`

* Restructure Example with merged sents as default

An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.

* Input/output a single `Example.token_annotation`

* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries

* Replace `Example.merge_sents()` with `Example.split_sents()`

* Modify components to use a single `Example.token_annotation`

  * Pipeline components
  * conllu2json converter

* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather than appending/extending.

* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`

* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available

* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation

* Update tests for API changes and `merge_sents()` vs. `split_sents()`

* Refer to Example.goldparse in iter_gold_docs()

Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError

* Fix make_orth_variants()

Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.

* Add basic test for make_orth_variants()

* Replace try/except with conditionals

* Replace default morph value with set
2019-11-25 16:03:28 +01:00
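Sketch for 392c4880d9 (#4632): with one annotation per `Doc`, sentence boundaries travel as a `sent_starts` list and sentences are recovered by splitting on it. The plain-Python function below illustrates that idea. The name `split_by_sent_starts` and the list-based inputs are assumptions; the real `Example.split_sents()` works on `TokenAnnotation` objects.

```python
def split_by_sent_starts(words, sent_starts):
    """Group words into sentences using per-token sentence-start flags.

    words:       list of token strings for one document.
    sent_starts: one truthy/falsy flag per token; truthy marks the first
                 token of a sentence.
    """
    sentences = []
    for word, is_start in zip(words, sent_starts):
        # Open a new sentence on a start flag, or for the very first token.
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


words = ["Hi", "there", ".", "Bye", "."]
sent_starts = [1, 0, 0, 1, 0]
print(split_by_sent_starts(words, sent_starts))
```

Only token-level information is split here, which matches the note in the commit message that the split examples have no raw text available.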
adrianeboyd
44829950ba Fix Example details for train CLI / pipeline components (#4624)
* Switch to train_dataset() function in train CLI

* Fixes for pipe() methods in pipeline components

* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`

* Update Parser.pipe() for Examples

* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s

* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)

* Fixes to Example implementation in spacy.gold

* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`

* Heads of 0 are not treated as unset

* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)

* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example

* Don't clobber `examples` variable in `iter_gold_docs()`

* Add/modify gold tests for handling projectivity

* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)

* Add test for projective train vs. nonprojective dev versions of the
same `Doc`

* Handle ignore_misaligned as arg rather than attr

Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.

Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).

* Remove unused attrs from gold.pxd

Remove `ignore_misaligned` and `make_projective` from `gold.pxd`

* Refer to Example.goldparse in iter_gold_docs()

Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError

* Update test for ignore_misaligned
2019-11-23 14:32:15 +01:00
Ines Montani
74b951fe61 Fix xpassing tests (#4657)
* Ignore internal warnings

* Un-xfail passing tests

* Skip instead of xfail
2019-11-16 20:20:53 +01:00
Sofie Van Landeghem
e48a09df4e Example class for training data (#4543)
* OrigAnnot class instead of gold.orig_annot list of zipped tuples

* from_orig to replace from_annot_tuples

* rename to RawAnnot

* some unit tests for GoldParse creation and internal format

* removing orig_annot and switching to lists instead of tuple

* rewriting tuples to use RawAnnot (+ debug statements, WIP)

* fix pop() changing the data

* small fixes

* pop-append fixes

* return RawAnnot for existing GoldParse to have uniform interface

* clean up imports

* fix merge_sents

* add unit test for 4402 with new structure (not working yet)

* introduce DocAnnot

* typo fixes

* add unit test for merge_sents

* rename from_orig to from_raw

* fixing unit tests

* fix nn parser

* read_annots to produce text, doc_annot pairs

* _make_golds fix

* rename golds_to_gold_annots

* small fixes

* fix encoding

* have golds_to_gold_annots use DocAnnot

* missed a spot

* merge_sents as function in DocAnnot

* allow specifying only part of the token-level annotations

* refactor with Example class + underlying dicts

* pipeline components to work with Example objects (wip)

* input checking

* fix yielding

* fix calls to update

* small fixes

* fix scorer unit test with new format

* fix kwargs order

* fixes for ud and conllu scripts

* fix reading data for conllu script

* add in proper errors (not fixed numbering yet to avoid merge conflicts)

* fixing few more small bugs

* fix EL script
2019-11-11 17:35:27 +01:00
Matthew Honnibal
a927b3a21e Put new alignment behind flag for v2.2.2 release (#4541)
* Xfail new tokenization test

* Put new alignment behind feature flag

* Move USE_ALIGN to top of the file [ci skip]


Co-authored-by: Ines Montani <ines@ines.io>
2019-10-28 16:12:32 +01:00
tamuhey
df293f3894 modified gold.align to handle space tokens (#4537)
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-10-28 15:44:28 +01:00
Ines Montani
c5e41247e8 Tidy up and auto-format 2019-10-28 12:43:55 +01:00
tamuhey
554850206c [#4525] fix gold.align (#4526)
* fix: gold.align

* fix align

* remove old align
2019-10-27 13:38:04 +01:00
Ines Montani
cc05d9dad6 Auto-format [ci skip] 2019-10-24 16:21:08 +02:00
adrianeboyd
8516e9d53b Support train dict format as JSONL (#4471)
* Support train dict format as JSONL

* Add (overly simple) check for dict vs. tuple to read JSONL lines as
either train dicts or train tuples

* Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()`
and `GoldCorpus.train_tuples`

* Revert docs to default JSON output with convert
2019-10-23 16:01:44 +02:00
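Sketch for 8516e9d53b (#4471): a simple dict-vs-tuple check on JSONL lines could look like the following. The function name `classify_jsonl_lines` and the sample records are hypothetical. JSON has no tuple type, so a train tuple is assumed to arrive as a JSON array, which Python loads as a list.

```python
import json


def classify_jsonl_lines(lines):
    """Yield ("dict", record) or ("tuple", record) for each non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            yield "dict", record
        elif isinstance(record, list):
            # JSON arrays stand in for train tuples.
            yield "tuple", record
        else:
            raise ValueError(f"Unexpected JSONL record type: {type(record).__name__}")


sample = [
    '{"text": "Hello world"}',
    "",
    '["Hello world", {"entities": []}]',
]
kinds = [kind for kind, _ in classify_jsonl_lines(sample)]
print(kinds)
```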
adrianeboyd
f5c551a43a Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487)
* Error for ill-formed input to iob_to_biluo()

Check for empty label in iob_to_biluo(), which can result from
ill-formed input.

* Check for empty NER label in debug-data
2019-10-21 12:20:28 +02:00
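Sketch for f5c551a43a (#4487): the kind of check this commit message describes can be shown with a simplified IOB-to-BILUO converter that rejects tags with an empty label. The name `iob_to_biluo_sketch` is hypothetical and the conversion logic is a minimal reimplementation for illustration, not spaCy's `iob_to_biluo()`.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO, raising ValueError on ill-formed tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        if prefix not in ("I", "B") or not label:
            # Covers inputs such as "B-", "I" or "X-PER".
            raise ValueError(f"Ill-formed IOB tag at position {i}: {tag!r}")
        prev_tag = tags[i - 1] if i > 0 else "O"
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # An entity starts on "B-", or on an "I-" that continues nothing.
        starts = prefix == "B" or prev_tag not in ("B-" + label, "I-" + label)
        # It continues if the next tag is "I-" with the same label.
        continues = next_tag == "I-" + label
        if starts and continues:
            biluo.append("B-" + label)
        elif starts:
            biluo.append("U-" + label)
        elif continues:
            biluo.append("I-" + label)
        else:
            biluo.append("L-" + label)
    return biluo


print(iob_to_biluo_sketch(["B-PER", "I-PER", "O", "B-LOC"]))
```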
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model config parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
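Sketch for b5d999e510 (#4226): the paragraph-level `"cats"` entry is described in this commit message as a list of `{"label": string, "value": number}` objects. Assuming the in-memory categories are a label-to-score dict, a round trip between the two shapes could look like this. The helper names `cats_to_json` and `cats_from_json` and the sample paragraph are hypothetical.

```python
import json


def cats_to_json(cats):
    """Turn {"LABEL": score} into [{"label": ..., "value": ...}, ...]."""
    return [{"label": label, "value": value} for label, value in cats.items()]


def cats_from_json(entries):
    """Turn [{"label": ..., "value": ...}, ...] back into {"LABEL": score}."""
    return {entry["label"]: entry["value"] for entry in entries}


cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}
paragraph = {"raw": "Great film.", "cats": cats_to_json(cats)}
print(json.dumps(paragraph, indent=2))

# The list form survives a JSON round trip and converts back to the dict.
restored = cats_from_json(json.loads(json.dumps(paragraph))["cats"])
```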
Ines Montani
009280fbc5 Tidy up and auto-format 2019-08-18 15:09:16 +02:00
Ziming He
eea7d4f4a8 biluo_tags_from_offsets throw exception for overlapping entities (#4021)
* Check whether two entities overlap

- biluo_gold_biluo_overlap now throws an exception when the entities passed in overlap
- added unit test

* SCA agreement
2019-08-15 18:13:32 +02:00
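Sketch for eea7d4f4a8 (#4021): an overlap check on character-offset entities can be done by sorting the spans and comparing neighbours. The function name `check_no_overlap` and the `(start, end, label)` tuple format with exclusive end offsets are assumptions for illustration, not the code added in the commit.

```python
def check_no_overlap(entities):
    """Raise ValueError if any two (start, end, label) spans overlap.

    End offsets are treated as exclusive, so spans that merely touch
    (one ends where the next begins) are allowed.
    """
    ordered = sorted(entities)
    # After sorting by start, any overlap shows up between neighbours.
    for current, following in zip(ordered, ordered[1:]):
        if following[0] < current[1]:
            raise ValueError(f"Overlapping entities: {current} and {following}")


check_no_overlap([(0, 5, "ORG"), (5, 9, "GPE")])  # adjacent spans pass

try:
    check_no_overlap([(0, 5, "ORG"), (3, 9, "GPE")])
except ValueError as err:
    print(err)
```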
Ines Montani
9b62639d19 Auto-format [ci skip] 2019-02-27 14:24:55 +01:00
Matthew Honnibal
f1d77eb140 💫 Improve handling of missing NER tags (closes #2603) (#3341)
* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.
2019-02-27 12:06:32 +01:00
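Sketch for f1d77eb140 (#3341): a `None` entry in a BILUO tag list means "no annotation for this token", which is different from `"O"` ("annotated as outside any entity"). The hypothetical function below shows one consequence of that distinction, scoring only the tokens that carry a gold tag. It illustrates the semantics and does not reproduce how `GoldParse` or the parser handle missing values.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Accuracy over tokens whose gold tag is known (not None).

    Returns None when no token has a gold annotation.
    """
    known = [(g, p) for g, p in zip(gold_tags, predicted_tags) if g is not None]
    if not known:
        return None
    return sum(g == p for g, p in known) / len(known)


gold = [None, "U-PER", "O", None]   # first tag missing, as in the fixed bug
predicted = ["O", "U-PER", "U-ORG", "O"]
print(tag_accuracy(gold, predicted))
```

Predictions on the first and last tokens are neither rewarded nor penalised, because the gold data says nothing about them.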
Ines Montani
f25bd9f5e4 Add gold.spans_from_biluo_tags helper (#3227) 2019-02-06 21:50:26 +11:00