Commit Graph

604 Commits

Author SHA1 Message Date
Matthew Honnibal
6f5e308d17
Support negative examples in partial NER annotations (#8106)
* Support a cfg field in transition system

* Make NER 'has gold' check use right alignment for span

* Pass 'negative_samples_key' property into NER transition system

* Add field for negative samples to NER transition system

* Check neg_key in NER has_gold

* Support negative examples in NER oracle

* Test for negative examples in NER

* Fix name of config variable in NER

* Remove vestiges of old-style partial annotation

* Remove obsolete tests

* Add comment noting lack of support for negative samples in parser

* Additions to "neg examples" PR (#8201)

* add custom error and test for deprecated format

* add test for unlearning an entity

* add break also for Begin's cost

* add negative_samples_key property on Parser

* rename

* extend docs & fix some older docs issues

* add subclass constructors, clean up tests, fix docs

* add flaky test with ValueError if gold parse was not found

* remove ValueError if n_gold == 0

* fix docstring

* Hack in environment variables to try out training

* Remove hack

* Remove NER hack, and support 'negative O' samples

* Fix O oracle

* Fix transition parser

* Remove 'not O' from oracle

* Fix NER oracle

* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation

* use set instead of list in consistency check

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-17 17:33:00 +10:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Paul O'Leary McCann
d959603d51
Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-06-03 09:05:26 +02:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Narayan Acharya
6b79714080
Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Adriane Boyd
36ecba224e
Set up GPU CI testing (#7293)
* Set up CI for tests with GPU agent

* Update tests for enabled GPU

* Fix steps filename

* Add parallel build jobs as a setting

* Fix test requirements

* Fix install test requirements condition

* Fix pipeline models test

* Reset current ops in prefer/require testing

* Fix more tests

* Remove separate test_models test

* Fix regression 5551

* fix StaticVectors for GPU use

* fix vocab tests

* Fix regression test 5082

* Move azure steps to .github and reenable default pool jobs

* Consolidate/rename azure steps

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-04-22 14:58:29 +02:00
Sofie Van Landeghem
27dbbb9903
Bugfix/nel crossing sentence (#7630)
* ensure each entity gets a KB ID, even when it's not within a sentence

* cleanup
2021-04-12 18:08:01 +10:00
Sofie Van Landeghem
113e8d082b
only evaluate named entities for NEL if there is a corresponding gold span (#7074) 2021-02-22 11:06:50 +11:00
Sofie Van Landeghem
709c9e75af
span.ent only returns first sentence (#7084)
* return first sentence when span contains sentence boundary

* docs fix

* small fixes

* cleanup
2021-02-19 23:02:38 +11:00
Ines Montani
f4f46b617f
Preserve sourced components in fill-config (fixes #7055) (#7058) 2021-02-14 14:02:14 +11:00
Matthew Honnibal
0fb8d437c0
Fix sentence fragments bug (#7056, #7035) (#7057)
* Add test for #7035

* Update test for issue 7056

* Fix test

* Fix transitions method used in testing

* Fix state eol detection when rebuffer

* Clean up redundant fix
2021-02-14 13:38:13 +11:00
Ines Montani
9ba715ed16 Tidy up and auto-format 2021-02-13 12:55:56 +11:00
svlandeg
03b4ec7d7f fix typo 2021-02-12 14:30:16 +01:00
svlandeg
278e9eaa14 remove ner 2021-02-11 21:08:04 +01:00
svlandeg
ebeedfc70b regression test for 7029 2021-02-11 20:56:48 +01:00
Ines Montani
26bf642afd
Fix issue #7019: Handle None scores in evaluate printer (#7026) 2021-02-11 16:45:23 +11:00
Ines Montani
ad9ce3c8f6 Fix issue #6950: allow pickling Tok2Vec with listeners 2021-02-11 11:37:39 +11:00
Peter Baumann
61b04a70d5
Run PhraseMatcher on Spans (#6918)
* Add regression test

* Run PhraseMatcher on Spans

* Add test for PhraseMatcher on Spans and Docs

* Add SCA

* Add test with 3 matches in Doc, 1 match in Span

* Update docs

* Use doc.length for find_matches in tokenizer

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-10 23:43:32 +11:00
René Octavio Queiroz Dias
999ff03b19
fix: Fix textcat labels to expect a Optional[Iterable[str]] instead of Optional[Dict] (#6911)
* docs: Add agreement

* bug: Regression test

Issue #6908

* fix: Changed from Dict to Iterable[str]

Fix #6908

* Update test to use make_tempdir

* fix: Fix WindowsPath error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-02-04 23:37:13 +01:00
Ines Montani
e6accb3a9e Tidy up and auto-format 2021-01-30 12:52:33 +11:00
Sofie Van Landeghem
24a697abb8
avoid empty aliases and improve UX and docs (#6840) 2021-01-29 08:51:40 +08:00
Sofie Van Landeghem
6b68ad027b
Fix beam NER resizing (#6834)
* move label check to sub methods

* add tests
2021-01-27 23:39:14 +11:00
Dhruv Naik
e7db07a0b9
Fix Span.char_span bug (#6816)
* Create dhruvrnaik.md

* add test for issue #6815

* bugfix for issue #6815

* update dhruvrnaik.md

* add span.vector test for #6815
2021-01-26 15:50:37 +08:00
Adriane Boyd
2263bc7b28
Update develop from master for v3.0.0rc5 (#6811)
* Fix `spacy.util.minibatch` when the size iterator is finished (#6745)

* Skip 0-length matches (#6759)

Add hack to prevent matcher from returning 0-length matches.

* support IS_SENT_START in PhraseMatcher (#6771)

* support IS_SENT_START in PhraseMatcher

* add unit test and friendlier error

* use IDS.get instead

* ensure span.text works for an empty span (#6772)

* Remove unicode_literals

Co-authored-by: Santiago Castro <bryant@montevideo.com.uy>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-01-26 14:52:45 +11:00
Sofie Van Landeghem
fed8f48965
raise NotImplementedError when noun_chunks iterator is not implemented (#6711)
* raise NotImplementedError when noun_chunks iterator is not implemented

* bring back, fix and document span.noun_chunks

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2021-01-17 19:56:05 +08:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
Sofie Van Landeghem
402dbc5bae
Getting scores out of beam_ner (#6575)
* small fixes and formatting

* bring test_issue4313 up-to-date, currently fails

* formatting

* add get_beam_parses method back

* add scored_ents function

* delete tag map
2021-01-06 12:02:32 +01:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Ines Montani
991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Matthew Honnibal
8656a08777
Add beam_parser and beam_ner components for v3 (#6369)
* Get basic beam tests working

* Get basic beam tests working

* Compile _beam_utils

* Remove prints

* Test beam density

* Beam parser seems to train

* Draft beam NER

* Upd beam

* Add hypothesis as dev dependency

* Implement missing is-gold-parse method

* Implement early update

* Fix state hashing

* Fix test

* Fix test

* Default to non-beam in parser constructor

* Improve oracle for beam

* Start refactoring beam

* Update test

* Refactor beam

* Update nn

* Refactor beam and weight by cost

* Update ner beam settings

* Update test

* Add __init__.pxd

* Upd test

* Fix test

* Upd test

* Fix test

* Remove ring buffer history from StateC

* WIP change arc-eager transitions

* Add state tests

* Support ternary sent start values

* Fix arc eager

* Fix NER

* Pass oracle cut size for beam

* Fix ner test

* Fix beam

* Improve StateC.clone

* Improve StateClass.borrow

* Work directly with StateC, not StateClass

* Remove print statements

* Fix state copy

* Improve state class

* Refactor parser oracles

* Fix arc eager oracle

* Fix arc eager oracle

* Use a vector to implement the stack

* Refactor state data structure

* Fix alignment of sent start

* Add get_aligned_sent_starts method

* Add test for ae oracle when bad sentence starts

* Fix sentence segment handling

* Avoid Reduce that inserts illegal sentence

* Update preset SBD test

* Fix test

* Remove prints

* Fix sent starts in Example

* Improve python API of StateClass

* Tweak comments and debug output of arc eager

* Upd test

* Fix state test

* Fix state test
2020-12-13 09:08:32 +08:00
Sofie Van Landeghem
d6c616a125
Fixes in test suite (#6457)
* fix slow test for textcat readers

* cleanup test_issue5551

* add explicit score weight

* cleanup
2020-12-02 12:57:08 +01:00
Jan Margeta
1ad2213349 Fix TokenPatternSchema pattern field validation
Empty pattern field should be considered invalid

This is fixed by replacing minItems with min_items
as described in Pydantic docs:
https://pydantic-docs.helpmanual.io/usage/schema/
2020-10-16 00:41:21 +02:00
Ines Montani
b1d568a4df Tidy up tests 2020-10-15 10:20:21 +02:00
Ines Montani
539b0c10da Tidy up and auto-format 2020-10-10 19:14:48 +02:00
Ines Montani
bfa3931c9d
Revert added_strings change (#6236) 2020-10-10 18:55:07 +02:00
Florijan Stamenković
18f5c309dc Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:14:40 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
svlandeg
eaf5c265cb set_kb method for entity_linker 2020-10-08 10:34:01 +02:00
svlandeg
efedccea8d fix tests 2020-10-07 15:29:52 +02:00
Ines Montani
1c641e41c3 Remove unused import [ci skip] 2020-10-05 11:50:11 +02:00
Ines Montani
549758f67d Adjust test for now 2020-10-04 23:16:09 +02:00
Adriane Boyd
73538782a0
Switch Doc.__init__(ents=) to IOB tags (#6173)
* Switch Doc.__init__(ents=) to IOB tags

* Fix check for "-"

* Allow "" or None as missing IOB tag
2020-10-01 16:22:18 +02:00
Ines Montani
fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani
ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani
7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
svlandeg
b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Ines Montani
67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Ines Montani
1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Adriane Boyd
7e4cd7575c
Refactor Docs.is_ flags (#6044)
* Refactor Docs.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_form_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
Adriane Boyd
a119667a36
Clean up spacy.tokens (#6046)
* Clean up spacy.tokens

* Update `set_children_from_heads`:
  * Don't check `dep` when setting lr_* or sentence starts
  * Set all non-sentence starts to `False`

* Use `set_children_from_heads` in `Token.head` setter
  * Reduce similar/duplicate code (admittedly adds a bit of overhead)
  * Update sentence starts consistently

* Remove unused `Doc.set_parse`

* Minor changes:
  * Declare cython variables (to avoid cython warnings)
  * Clean up imports

* Modify set_children_from_heads to set token range

Modify `set_children_from_heads` so that it adjust tokens within a
specified range rather then the whole document.

Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00