Commit Graph

2290 Commits

Author SHA1 Message Date
Ines Montani
c48564688f
Merge pull request #9423 from explosion/tests/issue-marker 2021-10-13 16:53:40 +02:00
Jette16
78365452d3
Moved test for universe into .github folder (#9447)
* Moved universe-test into .github folder

* Cleaned code

* CHanged a file name
2021-10-13 14:13:06 +02:00
Sofie Van Landeghem
2e3d6b8b5a
Fix test for spancat (#9446)
* fix test for spancat

* increase tolerance for almost equal checks

* Update spacy/tests/test_models.py

* Update spacy/tests/test_models.py
2021-10-13 10:47:56 +02:00
Paul O'Leary McCann
efe5beefe0
Add test for case where parser overwrite annotations (#9406)
* Add test for case where parser overwrite annotations

* Move test to its own file

Also add note about how other tokens modify results.

* Fix xfail decorator
2021-10-11 14:57:45 +02:00
Ines Montani
1fa7c4e73b Support issue marker via pytest 2021-10-11 13:56:24 +02:00
Jette16
3b144a3a51 Add universe test (#9278)
* Added test for universe.json

* Added contributor agreement

* Ran black on test_universe_json.py
2021-10-11 11:08:46 +02:00
Paul O'Leary McCann
2a7e327310
Fix Dependency Matcher Ordering Issue (#9337)
* Fix inconsistency

This makes the failing test pass, so that behavior is consistent whether
patterns are added in one call or two.

The issue is that the hash for patterns depended on the index of the
pattern in the list of current patterns, not the list of total patterns,
so a second call would get identical match ids.

* Add illustrative test case

* Add failing test for remove case

Patterns are not removed from the internal matcher on calls to remove,
which causes spurious weird matches (or misses).

* Fix removal issue

Remove patterns from the internal matcher.

* Check that the single add call also gets no matches
2021-10-11 10:26:13 +02:00
Adriane Boyd
b3192ddea3
Sync thinc install dep in setup, fix test packaging (#9336)
* Sync thinc install dep in setup

* Add __init__.py to include package tests in package

* Include *.toml in package
2021-09-30 19:02:10 +02:00
Adriane Boyd
effae12cbd
Update slow readers test to use textcat_multilabel (#9300) 2021-09-27 20:04:02 +02:00
github-actions[bot]
4da2af4e0e
Auto-format code with black (#9284)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-24 10:46:43 +02:00
Adriane Boyd
00bdb31150
Fix vector for 0-length span (#9244) 2021-09-20 20:22:49 +02:00
github-actions[bot]
015d439eb6
Auto-format code with black (#9234)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-20 08:49:19 +02:00
Paul O'Leary McCann
c4f0800fb8
Validate pos values when creating Doc (#9148)
* Validate pos values when creating Doc

* Add clear error when setting invalid pos

This also changes the error language slightly.

* Fix variable name

* Update spacy/tokens/doc.pyx

* Test that setting invalid pos raises an error

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-16 13:28:05 +02:00
Adriane Boyd
aba6ce3a43
Handle spacy-legacy in package CLI for dependencies (#9163)
* Handle spacy-legacy in package CLI for dependencies

* Implement legacy backoff in spacy registry.find

* Remove unused import

* Update and format test
2021-09-08 11:46:40 +02:00
github-actions[bot]
584fae5807
Auto-format code with black (#9130)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-09-03 10:47:03 +02:00
Kevin Humphreys
ca93504660
Pass alignments to Matcher callbacks (#9001)
* pass alignments to callbacks

* refactor for single callback loop

* Update spacy/matcher/matcher.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-02 12:58:05 +02:00
Robyn Speer
d60b748e3c
Fix surprises when asking for the root of a git repo (#9074)
* Fix surprises when asking for the root of a git repo

In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".

I've made two changes to make this smoother for others:

- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use a descriptive error instead of a default

plus some minor fixes from PR review

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* check for None values in assets

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-09-01 22:52:08 +02:00
github-actions[bot]
fb9c31fbda
Auto-format code with black (#9065)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-27 11:42:27 +02:00
Ines Montani
4cd052e81d
Include component factories in third-party dependencies resolver (#9009)
* Include component factories in third-party dependencies resolver

* Increment catalogue and update test
2021-08-25 14:58:01 +02:00
Sofie Van Landeghem
4d52d7051c
Fix spancat training on nested entities (#9007)
* overfitting test on non-overlapping entities

* add failing overfitting test for overlapping entities

* failing test for list comprehension

* remove test that was put in separate PR

* bugfix

* cleanup
2021-08-20 12:37:50 +02:00
Sofie Van Landeghem
de025beb5f
Warn and document spangroup.doc weakref (#8980)
* test for error after Doc has been garbage collected

* warn about using a SpanGroup when the Doc has been garbage collected

* add warning to the docs

* rephrase slightly

* raise error instead of warning

* update

* move warning to doc property
2021-08-20 11:06:19 +02:00
Ines Montani
d94ddd5686
Auto-detect package dependencies in spacy package (#8948)
* Auto-detect package dependencies in spacy package

* Add simple get_third_party_dependencies test

* Import packages_distributions explicitly

* Inline packages_distributions

* Fix docstring [ci skip]

* Relax catalogue requirement

* Move importlib_metadata to spacy.compat with note

* Include license information [ci skip]
2021-08-17 14:05:13 +02:00
Sofie Van Landeghem
0a6b68848f
Fix making span_group (#8975)
* fix _make_span_group

* fix imports
2021-08-17 10:36:34 +02:00
github-actions[bot]
92071326d8
Auto-format code with black (#8950)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-13 11:48:38 +02:00
Paul O'Leary McCann
6029cfc391
Add scores to output in spancat (#8855)
* Add scores to output in spancat

This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.

* Add basic doc note

* Vectorize score calcs

* Add "annotation format" section

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clean up doc section

* Ran prettier on docs

* Get arrays off the gpu before iterating over them

* Remove int() calls

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-08-10 13:47:49 +02:00
github-actions[bot]
56d4d87aeb
Auto-format code with black (#8895)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-06 13:38:06 +02:00
Adriane Boyd
fa2e7a4bbf
Fix spancat tests on GPU (#8872)
* Fix spancat tests on GPU

* Fix more spancat tests
2021-08-04 14:29:43 +02:00
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Adriane Boyd
175847f92c
Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:39:26 +02:00
Adriane Boyd
fbbbda1954
Fix start/end chars for empty and out-of-bounds spans (#8816) 2021-08-02 19:07:19 +02:00
Sofie Van Landeghem
83e27d262e
negative tag annotation (#8731)
* unit test to unlearn tag via negative annotation

* bump thinc to 8.0.8
2021-07-19 14:39:11 +02:00
Ines Montani
f90482d077 Tidy up and auto-format 2021-07-18 15:44:56 +10:00
Ines Montani
15e6578f7d
Adjust formatting 2021-07-17 10:49:13 +10:00
explosion-bot
eff3d1088b Auto-format code with black 2021-07-16 08:03:36 +00:00
Adriane Boyd
ac45c7c045
Add pre-commit to ignored requirements (#8728) 2021-07-15 16:41:15 +02:00
jmyerston
993b0fab0e
Added ancient Greek language support (#8606)
* Add ancient Greek language support

Initial commit

* Contributor Agreement

* grc tokenizer test added  and files formatted with black, unnecessary import removed

Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Commas in lists fixed. __init__py added to test

* Update lex_attrs.py

* Update stop_words.py

* Update stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-15 10:27:17 +02:00
Sofie Van Landeghem
77859beb99
spacy.ngram_range_suggester.v1 (#8699) 2021-07-15 10:01:22 +02:00
Julien Rossi
e117573822
Adding noun_chunks to the DUTCH language model (nl) (#8529)
*  implement noun_chunks for dutch language

* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement

* 🐛 fix noun chunks generator

* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)

*  fix failing test

* CI pipeline did not like the added sample file
* add the sample as a pytest fixture

* Update spacy/lang/nl/syntax_iterators.py

* Update spacy/lang/nl/syntax_iterators.py

Code readability

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/tests/lang/nl/test_noun_chunks.py

correct comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* finalize code

* change "if next_word" into "if next_word is not None"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-07-14 14:01:02 +02:00
Adriane Boyd
b8e720fdb9
Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-09 15:36:35 +02:00
Sofie Van Landeghem
64fac754fe
add spacy prefix to ngram_suggester.v1 (#8623) 2021-07-07 08:09:30 +02:00
Sofie Van Landeghem
733e8ceea9
fix spancat initialize with labels (#8620) 2021-07-06 19:08:25 +02:00
Sofie Van Landeghem
3daf57d70c
Small spancat fixes (#8614)
* two small fixes + additional tests

* rename
2021-07-06 14:15:41 +02:00
Adriane Boyd
29906884c5
Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-06 12:35:22 +02:00
explosion-bot
ee37288a1f Auto-format code with black 2021-07-02 07:48:26 +00:00
Ines Montani
af9d984407
Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip] 2021-06-30 20:52:59 +10:00
Ines Montani
7f65902702
Merge pull request #8522 from adrianeboyd/chore/update-flake8
Update flake8 version in reqs and CI
2021-06-28 21:46:06 +10:00
Adriane Boyd
86d01e9229 Tidy up with flake8: imports, comparisons, etc. 2021-06-28 12:08:15 +02:00
Adriane Boyd
5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Adriane Boyd
4b0ed73ed4 Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2021-06-28 11:29:36 +02:00
Matthew Honnibal
f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00