15504 Commits

Author SHA1 Message Date
github-actions[bot]
d2536cfa70 Auto-format code with black (#10857)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-07 19:04:21 +02:00
kadarakos
fa1a065b48 Better errors for has_annotation and Matcher (#10830)
* Show input argument instead of None

* catch invalid attr early

* moved error message from code to errors.py

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/errors.py

* update E153 and E154

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-07 19:04:21 +02:00
Paul O'Leary McCann
cef69b360a Fix Entity Linker with tokenization mismatches (fix #9575) (#10457)
* Add failing test

* Partial fix for issue

This kind of works. The issue with token length mismatches is gone. The
problem is that when you get empty lists of encodings to compare, it
fails because the sizes are not the same, even though they're both zero:
(0, 3) vs (0,). Not sure why that happens...

* Short circuit on empties

* Remove spurious check

The check here isn't needed now that the short circuit is fixed.

* Update spacy/tests/pipeline/test_entity_linker.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Use "eg", not "example"

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-07 19:04:21 +02:00
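A minimal sketch of the "short circuit on empties" idea from the commit above, assuming a similarity routine that compares one context vector against a list of candidate vectors. The function name `cosine_scores` and its signature are hypothetical and are not taken from spaCy's entity linker; the point is only that returning early for an empty candidate list avoids comparing mismatched shapes such as (0, 3) and (0,).

```python
import math
from typing import List, Sequence


def cosine_scores(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[float]:
    """Cosine similarity of `query` against each candidate (hypothetical helper)."""
    # Short circuit: an empty candidate list has no well-defined width,
    # so return early instead of attempting a shape-sensitive comparison.
    if len(candidates) == 0:
        return []
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for cand in candidates:
        if len(cand) != len(query):
            raise ValueError("vector length mismatch")
        c_norm = math.sqrt(sum(x * x for x in cand))
        dot = sum(a * b for a, b in zip(query, cand))
        # Zero vectors are treated as having no similarity.
        scores.append(dot / (q_norm * c_norm) if q_norm and c_norm else 0.0)
    return scores
```

With this guard, `cosine_scores([1.0, 0.0, 0.0], [])` returns an empty list rather than raising.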
Lj Miranda
e9eb601005 Add spacy-span-analyzer to debug data (#10668)
* Rename to spans_key for consistency

* Implement spans length in debug data

* Implement how span bounds and spans are obtained

In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.

* Compute for p_spans and p_bounds

* Add computation for SD and BD

* Fix mypy issues

* Add weighted average computation

* Fix compile_gold conditional logic

* Add test for frequency distribution computation

* Add tests for kl-divergence computation

* Fix weighted average computation

* Make tables more compact by rounding them

* Add more descriptive checks for spans

* Modularize span computation methods

In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.

* Remove unnecessary arguments and make fxs more compact

* Update a few parameter arguments

* Add tests for print_span and get_span methods

* Update API to talk about span characteristics in brief

* Add better reporting of spans_length

* Add test for span length reporting

* Update formatting of span length report

Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Show all frequency distribution when -V

In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.

Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).

* Update logic on how total is computed

The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.

* Fix display when showing the threshold percentage

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add better phrasing for span information

* Update spacy/cli/debug_data.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-07 19:04:21 +02:00
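The 90% thresholding described in the commit above (keep adding percentages until the running total reaches >= 90%) might look roughly like the sketch below. The function name `lengths_covering`, its return shape, and the choice to accumulate the most frequent lengths first are assumptions made for illustration; they are not taken from `spacy/cli/debug_data.py`.

```python
from collections import Counter
from typing import Dict, List, Tuple


def lengths_covering(
    span_lengths: List[int], threshold: float = 90.0
) -> Tuple[Dict[int, float], float]:
    """Return the span lengths whose cumulative share first reaches `threshold`.

    Hypothetical helper: percentages are floats rounded to two decimals.
    """
    counts = Counter(span_lengths)
    total = sum(counts.values())
    covered: Dict[int, float] = {}
    running = 0.0
    # Assumption: most frequent lengths first, ties broken by shorter length.
    for length, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        percent = 100 * count / total
        covered[length] = round(percent, 2)
        running += percent
        if running >= threshold:
            break
    return covered, running
```

For example, with six spans of length 1, three of length 2 and one of length 3, the loop stops after lengths 1 and 2, which together account for at least 90% of the spans.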
Madeesh Kannan
fc09badd6b Disable weekly GPU/slow tests on forks (#10831) 2022-06-07 19:04:21 +02:00
Paul O'Leary McCann
eb3a7a6653 Add glossary entry for root (#10821)
* Add glossary entry for root

There was already one but it was lower case, maybe that should be
removed?

* remove lowercase root

On reflection, that was probably just a mistake.

* Add lowercase root back

It's harmless to leave it there.
2022-06-07 19:04:21 +02:00
Raphael Mitsch
2d12342abd Fuzz tokenizer.explain: draft for fuzzy tests. (#10771)
* Fuzz tokenizer.explain: draft for fuzzy tests.

* Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS.

* Fuzz tokenizer.explain: changed tokenizer initialization to avoid failures in Azure runs.

* Fuzz tokenizer.explain: type hint for tokenizer in test.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-07 19:04:21 +02:00
github-actions[bot]
d47f428102 Auto-format code with black (#10795)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-07 19:04:21 +02:00
kadarakos
6b36809055 bugfix parser labels (#10797) 2022-06-07 19:04:21 +02:00
Patrick Düggelin
9a10264dac Fix PhraseMatcher remove overlapping terms (#10734)
* Add regression test for issue 10643

* Improve overlapping terms testcase

* Fix removing overlapping terms in phrase matcher (#10643)
2022-06-07 19:04:21 +02:00
Raphael Mitsch
df9304afac Ignore overrides for pipe names in config argument (#10779)
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.

* Pipe name override in config: added pytest UserWarning.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-07 19:04:21 +02:00
Adriane Boyd
c73fcc2908 Override SpanGroups.setdefault to provide default SpanGroup (#10772)
* Fix mistake in SpanGroup API docs

* Restrict SpanGroups.setdefault to SpanGroup only

* Refactor to support default span iterable
2022-06-07 19:04:21 +02:00
Raphael Mitsch
307e66215e Allow assets to be optional in spacy project (#10714)
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.

* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.

* Allow assets to be optional in spacy project: renamed --all to --extra.

* Allow assets to be optional in spacy project: included optional flag in project config test.

* Allow assets to be optional in spacy project: added documentation.

* Allow assets to be optional in spacy project: fixing deprecated --all reference.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Allow assets to be optional in spacy project: fixed project_assets() docstring.

* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: switched to  as keyword in project.yml. Updated docs.

* Allow assets to be optional in spacy project: updated comment.

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-07 19:04:21 +02:00
Sofie Van Landeghem
50a1745235 Add test for old architectures (#10751)
* add v1 and v2 tests for tok2vec architectures

* textcat architectures are not "layers"

* test older textcat architectures

* test older parser architecture
2022-06-07 19:04:21 +02:00
Luca Dorigo
b0becebf48 Fix StringStore.__getitem__ return type depending on parameter types (#10741)
* Fix StringStore.__getitem__ return type depending on parameter types

Small fix using `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`.

* Update spacy/strings.pyi

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-07 19:04:21 +02:00
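As an illustration of the `@overload` pattern mentioned in the commit above, here is a small sketch on a toy class. `TinyStringStore` is a hypothetical stand-in that hands out sequential integer IDs; it is not spaCy's `StringStore`, and only the shape of the overloads on `__getitem__` mirrors what the commit describes.

```python
from typing import Dict, List, Union, overload


class TinyStringStore:
    """Toy two-way string/int table (hypothetical, not spaCy's StringStore)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def add(self, string: str) -> int:
        # Assign the next sequential ID to unseen strings.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    # The overloads tell a type checker which return type goes with which key type.
    @overload
    def __getitem__(self, key: Union[str, bytes]) -> int: ...

    @overload
    def __getitem__(self, key: int) -> str: ...

    def __getitem__(self, key: Union[str, bytes, int]) -> Union[int, str]:
        if isinstance(key, int):
            return self._strings[key]
        if isinstance(key, bytes):
            key = key.decode("utf8")
        return self._ids[key]
```

A type checker can then infer `int` for `store["apple"]` and `str` for `store[0]`, instead of the union of both.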
Raphael Mitsch
ece27a0424 Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-06-07 19:04:21 +02:00
Madeesh Kannan
1a4a8f14f7 Remove vestigial debug print statement in walk_head_nodes (#10718)
* `graph`: Remove vestigial debug print statement in `walk_head_nodes`

* Revert whitespace changes

* Remove more debug print statements
2022-06-07 19:04:21 +02:00
Ilya Nikitin
1ff0c695d4 token.md: Fix documentation of Token.ancestors (#10917) 2022-06-06 14:37:15 +09:00
vincent d warmerdam
21b2c3ebdb Add spacy-report to universe (#10910)
* Add spacy-report to universe

* Remove extra comma

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-06-05 18:58:46 +09:00
richardpaulhudson
c7f798ac23 Update Holmes entry in universe.json 2022-05-31 09:09:35 +02:00
Max Tarlov
837c522258 Update documentation for displacy style kwargs (#10841)
* Update docs for displacy style kwargs

Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. This style is new as of spaCy 3.3, so I added the "new" tag for that option only.

* restored alpha ordering
2022-05-30 09:12:23 +02:00
Peter Baumgartner
fcdf2bd08b add doc cleaner to menu (#10862) 2022-05-30 08:52:31 +02:00
Freddy Heppell
9dffd93a0e Fix misspelt keyword in StringStore example 2022-05-30 14:23:38 +09:00
Sofie Van Landeghem
83dd42163a Remove NBSP's across tables in the docs (#10842) 2022-05-25 09:49:11 +02:00
Peter Baumgartner
494df895d3 add floret to static vectors docs (#10833) 2022-05-23 09:16:51 +02:00
kadarakos
e9a4011f77 oov confusion fix (#10828) 2022-05-23 09:16:08 +02:00
Adriane Boyd
3171f2d80c Remove cuda extras for non-linux arm in install widget (#10796)
* Remove cuda extras for non-linux arm platforms in install widget
* Extend cuda versions install widget
* Update GPU install docs to clarify cuda
2022-05-20 09:58:11 +02:00
schaeran
439bfbbfb5 update spaCy Universe: spacytextblob (code example) 2022-05-13 12:10:37 +09:00
Richard Hudson
8b56383bee Add documentation tip about overriding variables (#10780) 2022-05-11 10:17:34 +02:00
Madeesh Kannan
f45ae27d9d training.md: Fix typos (#10775) 2022-05-09 19:45:16 +02:00
Raphael Mitsch
d8a5444e96 Document different ways to create a pipeline (#10762)
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.

* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.

* Document different ways to create a pipeline: added explanation of blank pipeline.

* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
2022-05-06 15:42:15 +02:00
Richard Hudson
b2d3ac8222 Updated Coreferee Universe entry 2022-05-06 14:27:24 +02:00
Sofie Van Landeghem
6ce13d0299 Small doc typos (#10750)
* fix typos

* formatting
2022-05-03 14:02:17 +02:00
vincent d warmerdam
be9ac68d33 Update universe.json to Include spaCy video #6 (#10723)
* Update universe.json

I noticed that episode 6 was missing, so I added it.

* Update universe.json

* Update universe.json
2022-05-02 13:35:45 +02:00
Adriane Boyd
ae2155455d Merge branch 'master' into spacy.io 2022-04-29 09:33:58 +02:00
Adriane Boyd
497a708c71 Docs for v3.3 (#10628)
* Temporarily disable CI tests

* Start v3.3 website updates

* Add trainable lemmatizer to pipeline design

* Fix Vectors.most_similar

* Add floret vector info to pipeline design

* Add Lower and Upper Sorbian

* Add span to sidebar

* Work on release notes

* Copy from release notes

* Update pipeline design graphic

* Upgrading note about Doc.from_docs

* Add tables and details

* Update website/docs/models/index.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix da lemma acc

* Add minimal intro, various updates

* Round lemma acc

* Add section on floret / word lists

* Add new pipelines table, minor edits

* Fix displacy spans example title

* Clarify adding non-trainable lemmatizer

* Update adding-languages URLs

* Revert "Temporarily disable CI tests"

This reverts commit 1dee505920.

* Spell out words/sec

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-28 14:09:35 +02:00
Adriane Boyd
10377fb945 Set version to v3.3.0 (#10614)
* Set version to v3.3.0

* Revert "Temporarily skip tests that require models/compat"

This reverts commit e422101e00.
2022-04-28 13:07:49 +02:00
Raphael Mitsch
3579507ba1 Bumped black to 22.3.0 due to a fix for https://github.com/psf/black/issues/2964. (#10715) 2022-04-27 14:49:24 +02:00
harmbuisman
c066fb8a4e #10672: fixes displacy output for manual unsorted entities (#10673)
* #10672: fixes displacy output for manual unsorted entities

* #10672: removed unused import

* fix prettier formatting

Co-authored-by: Harm Buisman <h.buisman@iknl.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-27 09:51:58 +02:00
Sofie Van Landeghem
b3717ba53a removing print statements from the test suite (#10712) 2022-04-27 09:14:25 +02:00
Adriane Boyd
455f089c9b Support exclude in Doc.from_docs (#10689)
* Support exclude in Doc.from_docs

* Update API docs

* Add new tag to docs
2022-04-25 18:19:03 +02:00
Mike
5533f889b7 Fixed example for spacy_syllables (#10705)
There was a typo in the example for the spacy_syllables project.
2022-04-25 16:43:20 +02:00
Mike
3b208197c3 Fixed example for spacy_syllables (#10705)
There was a typo in the example for the spacy_syllables project.
2022-04-25 16:40:54 +02:00
github-actions[bot]
e07500369c Auto-format code with black (#10687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-04-22 11:24:53 +02:00
Sofie Van Landeghem
2c2dbb844c Syntax for a branch from a PR 2022-04-22 09:45:49 +02:00
Ryn Daniels
29afbdb91e add readme for explosion-bot (#10677) 2022-04-20 09:52:34 +02:00
Richard Hudson
4b227f4861 Merge pull request #10669 from mgrojo/develop
Fix some issues in Spanish stop-word list and examples
2022-04-19 09:37:34 +02:00
mgr
3d50b1a989 Fix some issues in Spanish examples
- Spelling: nationalities in lowercase, accent.
- Incorrect verb composition
- Untranslated word
2022-04-18 22:12:57 +02:00
mgr
2a2654c756 Remove significant or not very frequent words from stop word list [es]
The list of stop words for Spanish contained many inadequate words, see:

https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100

Removed words:
- verb forms of 'trabajar' (work) and intentar (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words due to being too significant or too infrequent:
  actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
  pais, principalmente, raras

Added other stop words for completion:
- Spanish one-letter words
- numbers up to twelve

Some reformatting to 79 columns.

When in doubt, the English and German lists have been consulted as good
examples.
2022-04-18 22:04:02 +02:00
Madeesh Kannan
aa6780eb27 Matcher: Remove superfluous GIL-acquiring check in get_is_final (#10659)
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`

This check incurred a significant performance penalty due to implicit interactions between the GIL and Cython ref-counting code.

* `Matcher`: Inline `PatternStateC` accessors
2022-04-18 12:59:34 +02:00