Commit Graph

15585 Commits

Author SHA1 Message Date
Lj Miranda
1d34aa2b3d
Add spacy-span-analyzer to debug data (#10668)
* Rename to spans_key for consistency

* Implement spans length in debug data

* Implement how span bounds and spans are obtained

In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.

* Compute for p_spans and p_bounds

* Add computation for SD and BD

* Fix mypy issues

* Add weighted average computation

* Fix compile_gold conditional logic

* Add test for frequency distribution computation

* Add tests for kl-divergence computation

* Fix weighted average computation

* Make tables more compact by rounding them

* Add more descriptive checks for spans

* Modularize span computation methods

In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.

* Remove unnecessary arguments and make fxs more compact

* Update a few parameter arguments

* Add tests for print_span and get_span methods

* Update API to talk about span characteristics in brief

* Add better reporting of spans_length

* Add test for span length reporting

* Update formatting of span length report

Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Show all frequency distribution when -V

In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.

Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).

* Update logic on how total is computed

The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.

* Fix display when showing the threshold percentage

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add better phrasing for span information

* Update spacy/cli/debug_data.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-23 19:06:38 +02:00
Peter Baumgartner
7ce3460b23
add floret to static vectors docs (#10833) 2022-05-23 09:16:31 +02:00
kadarakos
a3814ee739
oov confusion fix (#10828) 2022-05-23 09:15:51 +02:00
Madeesh Kannan
4fb1809c72
Disable weekly GPU/slow tests on forks (#10831) 2022-05-20 15:46:30 +02:00
Adriane Boyd
a82ec56aae
Remove cuda extras for non-linux arm in install widget (#10796)
* Remove cuda extras for non-linux arm platforms in install widget
* Extend cuda versions install widget
* Update GPU install docs to clarify cuda
2022-05-20 09:57:41 +02:00
Paul O'Leary McCann
46982cf694
Add glossary entry for root (#10821)
* Add glossary entry for root

There was already one but it was lower case, maybe that should be
removed?

* remove lowercase root

On reflection, that was probably just a mistake.

* Add lowercase root back

It's harmless to leave it there.
2022-05-20 09:56:32 +02:00
Raphael Mitsch
357be2614e
Fuzz tokenizer.explain: draft for fuzzy tests. (#10771)
* Fuzz tokenizer.explain: draft for fuzzy tests.

* Fuzz tokenizer.explain: xignoring tokenizer.explain() tests. Removed deadline modification. Removed LANGUAGES_WITHOUT_TOKENIZERS.

* Fuzz tokenizer.explain: changed tokenizer initialization to avoid failus in Azure runs.

* Fuzz tokenizer.explain: type hint for tokenizer in test.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-17 10:23:16 +02:00
github-actions[bot]
99aeaf9bd3
Auto-format code with black (#10795)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-05-13 19:02:08 +02:00
kadarakos
fd36469900
bugfix parser labels (#10797) 2022-05-13 11:41:32 +02:00
Paul O'Leary McCann
7634a488fe
Merge pull request #10793 from Schero1994/feature/update
Update spaCy Universe: spacytextblob (code example)
2022-05-13 12:07:37 +09:00
schaeran
f5952c0851 update spaCy Universe: spacytextblob (code example) 2022-05-12 18:23:00 +02:00
Patrick Düggelin
cb06309ed8
Fix PhraseMatcher remove overlapping terms (#10734)
* Add regression test for issue 10643

* Improve overlapping terms testcase

* Fix removing overlapping terms in phrase matcher (#10643)
2022-05-12 12:23:52 +02:00
Raphael Mitsch
6f9e2ca81f
Ignore overrides for pipe names in config argument (#10779)
* Pipe name override in config: added check with warning, added removal of name override from config, extended tests.

* Pipoe name override in config: added pytest UserWarning.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-12 11:46:08 +02:00
Adriane Boyd
b65d652881
Override SpanGroups.setdefault to provide default SpanGroup (#10772)
* Fix mistake in SpanGroup API docs

* Restrict SpanGroups.setdefault to SpanGroup only

* Refactor to support default span iterable
2022-05-12 10:06:25 +02:00
Richard Hudson
d524f6415f
Add documentation tip about overriding variables (#10780) 2022-05-11 10:15:32 +02:00
Raphael Mitsch
2904359685
Allow assets to be optional in spacy project (#10714)
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.

* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.

* Allow assets to be optional in spacy project: renamed --all to --extra.

* Allow assets to be optional in spacy project: included optional flag in project config test.

* Allow assets to be optional in spacy project: added documentation.

* Allow assets to be optional in spacy project: fixing deprecated --all reference.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Allow assets to be optional in spacy project: fixed project_assets() docstring.

* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: switched to  as keyword in project.yml. Updated docs.

* Allow assets to be optional in spacy project: updated comment.

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-10 10:40:11 +02:00
Sofie Van Landeghem
1543558d08
Add test for old architectures (#10751)
* add v1 and v2 tests for tok2vec architectures

* textcat architectures are not "layers"

* test older textcat architectures

* test older parser architecture
2022-05-10 08:24:42 +02:00
Madeesh Kannan
733114bdd9
training.md: Fix typos (#10775) 2022-05-09 19:44:14 +02:00
Raphael Mitsch
e626df959f
Document different ways to create a pipeline (#10762)
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.

* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.

* Document different ways to create a pipeline: added explanation of blank pipeline.

* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
2022-05-06 15:40:59 +02:00
Richard Hudson
c32e1a0079
Updated Coreferee Universe entry (#10763) 2022-05-06 13:21:39 +02:00
Luca Dorigo
0a92d5644e
Fix StringStore.__getitem__ return type depending on parameter types (#10741)
* Fix StringStore.__getitem__ return type depending on parameter types

Small fix using  `@overload` so that `StringStore.__getitem__` returns an `int` when given a `str` or `bytes` and a `str` when given an `int`.

* Update spacy/strings.pyi

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-03 17:57:07 +02:00
Sofie Van Landeghem
e03b9f8095
Small doc typos (#10750)
* fix typos

* formatting
2022-05-03 13:55:27 +02:00
Raphael Mitsch
f5390e278a
Refactor error messages to remove hardcoded strings (#10729)
* Use custom error msg instead of hardcoded string: replaced remaining hardcoded error message strings.

* Use custom error msg instead of hardcoded string: fixing faulty Errors import.
2022-05-02 13:38:46 +02:00
Madeesh Kannan
0a503ce5e0
Remove vestigial debug print statement in walk_head_nodes (#10718)
* `graph`: Remove vestigial debug print statement in `walk_head_nodes`

* Revert whitespace changes

* Remove more debug print statements
2022-05-02 13:36:35 +02:00
vincent d warmerdam
f3de976513
Update universe.json to Include spaCy video #6 (#10723)
* Update universe.json

I noticed that episode 6 was missing, so I added it.

* Update universe.json

* Update universe.json
2022-05-02 13:35:14 +02:00
Adriane Boyd
497a708c71
Docs for v3.3 (#10628)
* Temporarily disable CI tests

* Start v3.3 website updates

* Add trainable lemmatizer to pipeline design

* Fix Vectors.most_similar

* Add floret vector info to pipeline design

* Add Lower and Upper Sorbian

* Add span to sidebar

* Work on release notes

* Copy from release notes

* Update pipeline design graphic

* Upgrading note about Doc.from_docs

* Add tables and details

* Update website/docs/models/index.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix da lemma acc

* Add minimal intro, various updates

* Round lemma acc

* Add section on floret / word lists

* Add new pipelines table, minor edits

* Fix displacy spans example title

* Clarify adding non-trainable lemmatizer

* Update adding-languages URLs

* Revert "Temporarily disable CI tests"

This reverts commit 1dee505920.

* Spell out words/sec

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-28 14:09:35 +02:00
Adriane Boyd
10377fb945
Set version to v3.3.0 (#10614)
* Set version to v3.3.0

* Revert "Temporarily skip tests that require models/compat"

This reverts commit e422101e00.
2022-04-28 13:07:49 +02:00
Raphael Mitsch
3579507ba1
Bumped black to 22.3.0 due to a fix for https://github.com/psf/black/issues/2964. (#10715) 2022-04-27 14:49:24 +02:00
harmbuisman
c066fb8a4e
#10672: fixes displacy output for manual unsorted entities (#10673)
* #10672: fixes displacy output for manual unsorted entities

* #10672: removed unused import

* fix prettier formatting

Co-authored-by: Harm Buisman <h.buisman@iknl.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-27 09:51:58 +02:00
Sofie Van Landeghem
b3717ba53a
removing print statements from the test suite (#10712) 2022-04-27 09:14:25 +02:00
Adriane Boyd
455f089c9b
Support exclude in Doc.from_docs (#10689)
* Support exclude in Doc.from_docs

* Update API docs

* Add new tag to docs
2022-04-25 18:19:03 +02:00
Mike
3b208197c3
Fixed example for spacy_syllables (#10705)
There was a typo in the example for the spacy_syllables project.
2022-04-25 16:40:54 +02:00
github-actions[bot]
e07500369c
Auto-format code with black (#10687)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-04-22 11:24:53 +02:00
Sofie Van Landeghem
2c2dbb844c Syntax for a branch from a PR 2022-04-22 09:45:49 +02:00
Ryn Daniels
29afbdb91e
add readme for explosion-bot (#10677) 2022-04-20 09:52:34 +02:00
Richard Hudson
4b227f4861
Merge pull request #10669 from mgrojo/develop
Fix some issues in Spanish stop-word list and examples
2022-04-19 09:37:34 +02:00
mgr
3d50b1a989 Fix some issues in Spanish examples
- Spelling: nationalities in lowercase, accent.
- Incorrect verb composition
- Untranslated word
2022-04-18 22:12:57 +02:00
mgr
2a2654c756 Remove significant or not very frequent words from stop word list [es]
The list of stop words for Spanish contained many inadequate words, see:

https://github.com/explosion/spaCy/issues/3052#issuecomment-1100760100

Removed words:
- verb forms of 'trabajar' (work) and intentar (try)
- words related to 'empleo' (employment)
- incorrect words: ampleamos, arribaabajo, soyos, paìs
- miscellaneous words due to being too significant of too infrequent:
  actualmente, aproximadamente, antaño, cosas, ejemplo, horas, general,
  pais, principalmente, raras

Added other stop words for completion:
- Spanish one-letter words
- numbers up to twelve

Some reformatting to 79 columns.

When in doubt, the English and German lists have been consulted as good
examples.
2022-04-18 22:04:02 +02:00
Madeesh Kannan
aa6780eb27
Matcher: Remove superfluous GIL-acquiring check in get_is_final (#10659)
* `Matcher`: Remove superfluous GIL-acquiring check in `get_is_final`

This check incurred a significant performance penalty due to  implict interactions between the GIL and Cython ref-counting code.

* `Matcher`: Inline `PatternStateC` accessors
2022-04-18 12:59:34 +02:00
Duy Ngo
229ecaf0ea
Add numbers and definitions (#10665) 2022-04-18 12:58:32 +02:00
Schero1994
d622883a42
Adding and updating content in the spacy universe (#10493)
* signing contributor agreement

* adding new content to the spaCy universe

* updating outdated example codes

* resolving issues for the PR

* resolve review for klayers

* remove contributor-agreement file from the PR

* Update code example of spaCySentiWS

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy-sentiws code example

Co-authored-by: schaeran <schaeran1994@gmail.com>
Co-authored-by: schaeran <schaeran@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-04-15 15:36:54 +02:00
Joachim Fainberg
4e1716223c
displaCy: Avoid increasing levels for identical arcs (#10639)
* Test for arc levels for identical arcs

Also moves the test in order with the other numbered tests.

* displaCy: filter identical arcs

Avoid increased levels due to identical arcs by first
filtering any identical arcs.

* Sort keys before filtering

Manual entry with keys out of order would previously become
different tuples and therefore not filtered correctly.

Co-authored-by: Joachim Fainberg <joachimfainberg@Joachims-MBP.lan>
2022-04-14 16:48:00 +02:00
Philip Vollet
e63a5d4888
Update newsletter id (#10655) 2022-04-14 13:34:01 +02:00
fonfonx
028cbad05e
Add feminine form of word "one" in French (#10653)
* Add French number

* Add fonfonx.md

* Add feminine ordinal words for French
2022-04-14 10:21:27 +02:00
Schero1994
caf8528af7
Batch #1 | spaCy universe cleanup (#10642)
* delete universe object: wmd-relax

* delete universe object: spaCy.jl

* delete universe object: saber

* delete universe object: languagecrunch

* delete universe object: gracyql

* delete universe object: ExcelCy

* delete universe object: EpiTator

Co-authored-by: schaeran <schaeran1994@gmail.com>
2022-04-14 10:08:19 +02:00
single-fingal
4228f3c757
Fix a few minor bugs in the SpanGroup API web docs (#10650)
* Fix a few minor bugs in the SpanGroup API web docs

* Update SpanGroup docs examples to have Spans reflect intended "errors"
2022-04-14 09:59:48 +02:00
Adriane Boyd
64602d997d
Require srsly v2.4.3+ due to buffer overflow vulnerability (#10651) 2022-04-13 11:41:40 +02:00
Richard Hudson
75fbbcdc18
Display warning when spacy.explain() finds no term (#10645)
* Display warning when spacy.explain() finds no term

* Updated warning message text
2022-04-12 10:48:28 +02:00
David Berenstein
d4196a62f1
added crosslingual coreference to spacy universe without additional commits (#10580)
* added crosslingual coreference to spacy universe

* Updated example to introduce batching example.

Co-authored-by: David Berenstein <david.berenstein@pandoraintelligence.com>
2022-04-08 08:23:58 +02:00
Madeesh Kannan
9ba3e1cb2f
Basic tests for the Tamil language (#10629)
* Add basic tests for Tamil (ta)

* Add comment
Remove superfluous condition

* Remove superfluous call to `pipe`
Instantiate new tokenizer for special case
2022-04-07 14:47:37 +02:00