Commit Graph

15936 Commits

Author SHA1 Message Date
Ryn Daniels
d30ee14ab3
Pass the matrix branch to the checkout action (#10304) 2022-02-16 15:39:42 +01:00
Sofie Van Landeghem
a16b14e591
Merge branch 'master' into copy/develop 2022-02-16 14:04:59 +01:00
Adriane Boyd
22066f4e0f
Also exclude workflows from non-PR CI runs (#10305) 2022-02-16 13:45:30 +01:00
Ryn Daniels
f6250015ab
Fix the datemath for reals (#10294)
* add debugging branch and quotes to daily slowtest action

* Apparently the quotes fixed it
2022-02-15 14:18:36 +01:00
Paul O'Leary McCann
23bd103d89 Add tmtoolkit setup steps 2022-02-14 15:17:25 +09:00
Markus Konrad
8818a44a39
add tmtoolkit package to spaCy universe (#10245) 2022-02-14 15:16:43 +09:00
github-actions[bot]
5adedb8587
Auto-format code with black (#10260)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-11 14:23:01 +01:00
Adriane Boyd
9a06a210ec
Exclude github workflow edits from CI (#10261) 2022-02-11 14:22:43 +01:00
Adriane Boyd
bbaf41fb3b
Set version to v3.2.2 (#10262) 2022-02-11 11:45:26 +01:00
Edward
7961a0a959
Fix typo in errors (#10256) 2022-02-10 13:45:46 +01:00
Ryn Daniels
2d6cabb23c
Fix the date command and the matrix failure mode (#10254) 2022-02-10 12:06:30 +01:00
Peter Baumgartner
ee662ec381
Raise error in spacy package when model name is not a valid python identifier (#10192)
* MultiHashEmbed vector docs correction

* raise error for invalid identifier as model name

* more succinct error message

* update success message

* permitted package name + double underscore

* clarify package name error

* clarify underscore run message

* tweak language + simplify underscore run

* cleanup underscore run warning

* spacing correction

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-10 08:15:23 +01:00
Ryn Daniels
3877f78ff9
fix the syntax for the slow/gpu test crons (#10244) 2022-02-09 11:21:20 +01:00
John Boy
10c77af83d
add textnets to spaCy universe (#10216)
https://github.com/jboynyc/textnets/issues/38
2022-02-09 15:04:26 +09:00
Ines Montani
7b883da9fd
Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip] 2022-02-08 18:04:01 +01:00
Ramon Ziai
6477dafac2
fix(phrasematcher.pyi): change type annotation of docs in add() to List[Doc] (#10235)
https://github.com/explosion/spaCy/issues/10234
2022-02-08 13:37:27 +01:00
Ines Montani
f2c2b97e56 Add spaCy Tailored Pipelines 2022-02-08 11:46:42 +01:00
Adriane Boyd
a9ee5bff98
Support mixed case model package names (#10223) 2022-02-08 10:52:46 +01:00
Ryn Daniels
f939da0bfa
Add github actions for slow and gpu tests (#10225)
* Add github actions for slow and gpu tests

* change weekly GPU tests to also run slow tests, and change the time

* only run the tests if there were commits in the past day
2022-02-08 10:05:35 +01:00
Antti Ajanki
e9c26f2ee9
Add a noun chunker for Finnish (#10214)
with test cases
2022-02-08 08:44:11 +01:00
Sofie Van Landeghem
deb143fa70
Token sent attributes more consistent (#10164)
* remove duplicate line

* add sent start/end token attributes to the docs

* let has_annotation work with IS_SENT_END

* elif instead of if

* add has_annotation test for sent attributes

* fix typo

* remove duplicate is_sent_start entry in docs
2022-02-08 08:35:37 +01:00
Peter Baumgartner
836f689cc7
YAML multiline tip for project.yml files (#10187)
* MultiHashEmbed vector docs correction

* add in multi-line tip

* convert to sidebar tip
2022-02-08 08:35:09 +01:00
Kenneth Enevoldsen
e4625d2fc3
Added Augmenty to universe (#10229)
* Added Augmenty to universe

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-08 08:32:11 +01:00
Lj Miranda
42072f4468
Add spancat pipeline in spacy debug data (#10070)
* Setup debug data for spancat

* Add check for missing labels

* Add low-level data warning error

* Improve logic when compiling the gold train data

* Implement check for negative examples

* Remove breakpoint

* Remove ws_ents and missing entity checks

* Fix mypy errors

* Make variable name spans_key consistent

* Rename pipeline -> component for consistency

* Account for missing labels per spans_key

* Cleanup variable names for consistency

* Improve brevity of conditional statements

* Remove unused variables

* Include spans_key as an argument for _get_examples

* Add a conditional check for spans_key

* Update spancat debug data based on new API

- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull10079)
- The way information is displayed was also changed (text -> table)

* Rename model_labels to ensure mypy works

* Update wording on warning messages

Use "span type" instead of "entity type" in wording the warning messages.
This is because Spans aren't necessarily entities.

* Update component type into a Literal

This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.

* Update checks to include actual model span_keys

Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of doing the filter
inside the for-loop, I just made another dictionary,
data_labels_in_component to hold this value.

* Update spacy/cli/debug_data.py

* Show label counts only when verbose is True

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 15:03:36 +01:00
Lj Miranda
72fece712f
Add shuffle parameter to Corpus API docs (#10220)
* Add shuffle parameter to Corpus API docs

* Update website/docs/api/corpus.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 14:55:53 +01:00
Adriane Boyd
63e1e4e8f6
Fix debug data check for ents that cross sents (#10188)
* Fix debug data check for ents that cross sents

* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.

* Improve test example

* Use Example.get_aligned_sent_starts

* Add test for crossing entity
2022-02-07 08:53:30 +01:00
github-actions[bot]
91ccacea12
Auto-format code with black (#10209)
* Auto-format code with black

* add black requirement to dev dependencies and pin to 22.x

* ignore black dependency for comparison with setup.cfg

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-02-06 16:30:30 +01:00
Sofie Van Landeghem
bc12ecb870
Merge pull request #10185 from martinjack/master
Update Ukrainian tokenizer_exceptions
2022-02-06 16:30:03 +01:00
Sofie Van Landeghem
14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Adriane Boyd
0668a449ba
Add Pipe.hide_labels to omit labels from pipeline meta (#10175) 2022-02-05 17:59:24 +01:00
Adriane Boyd
6f551043e4
Use paths.vectors for vectors in init config (#10146)
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
2022-02-04 21:09:48 +01:00
Daniël de Kok
7b02f0fe5d
Write moves costs directly into numpy array (#10163)
This avoids elementwise indexing and the allocation of an additional
array.

Gives a ~15% speed improvement when using batch_by_sequence with size
32.
2022-02-04 21:07:14 +01:00
Adriane Boyd
fef896ce49
Allow Example to align whitespace annotation (#10189)
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
2022-02-03 17:01:53 +01:00
Kenneth Enevoldsen
a2f27ff83a
Added spacy-wrap to universe (#10168)
* Added spacy-wrap to universe 

Added spacy-wrap to universe a small package for wrapping fine-tuned huggingface transformers to a spacy pipeline following the same API as spacy-transformers. (Currently limited to classification models)

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-03 12:30:09 +01:00
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Lj Miranda
345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
Marek Šuppa
f09c799a96
fix: Add missing comma to _eleven_to_beyond (#10166)
* This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
2022-01-30 16:45:06 +09:00
Marek Šuppa
67ecac633f
fix: Add missing comma to examples.py (#10167)
* This comma has been most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.
2022-01-30 16:43:29 +09:00
Adriane Boyd
4f441dfa24
Fix infix as prefix in Tokenizer.explain (#10140)
* Fix infix as prefix in Tokenizer.explain

Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:

* skip infix matches that are prefixes in the current substring

* Update tokenizer pseudocode in docs
2022-01-28 17:00:54 +01:00
Eduard Zorita
30cf9d6a05
Update typing hints (#10109)
* Improve typing hints for Matcher.__call__

* Add typing hints for DependencyMatcher

* Add typing hints to underscore extensions

* Update Doc.tensor type (requires numpy 1.21)

* Fix typing hints for Language.component decorator

* Use generic np.ndarray type in Doc to avoid numpy version update

* Fix mypy errors

* Fix cyclic import caused by Underscore typing hints

* Use Literal type from spacy.compat

* Update matcher.pyi import format

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-28 16:59:54 +01:00
Adriane Boyd
09734c56fc
Use simple suggester for spancat initialization (#10143)
Instead of the running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
2022-01-28 09:34:23 +01:00
Daniël de Kok
b68d7a1ebf
Merge pull request #10141 from danieldk/bool-mask
Improve unseen label masking
2022-01-27 08:32:15 +01:00
Daniël de Kok
c2475a25de Improve unseen label masking
Two changes to speed up masking by ~10%:

- Use a bool array rather than an array of float32.

- Let the mask indicate whether a label was seen, rather than
  unseen. The mask is most frequently used to index scores for
  seen labels. However, since the mask marked unseen labels,
  this required computing an intermittent flipped mask.
2022-01-26 12:01:37 +01:00
Daniël de Kok
ec0cae9db8
Merge pull request #10103 from svlandeg/refactor/parser-gpu
Consolidate parser refactor branches
2022-01-21 18:15:18 +01:00
github-actions[bot]
6d4db5c3c7
Auto-format code with black (#10106)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-01-21 10:01:10 +01:00
Ines Montani
34ed93ef68
Support version tags in universe and add note about reporting (#10093)
* Support version tags in universe and add note about reporting

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:21:26 +01:00
Peter Baumgartner
a69005037a
Docker Image for Website Dev (#10098)
* add docker instructions

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* clarifying language on docker image

* fix markdown formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:02:13 +01:00
svlandeg
6243ac35eb cleanup remnants from merge conflicts 2022-01-20 18:14:26 +01:00
svlandeg
c4c41b14cf ensure labels are added upon predict 2022-01-20 18:00:31 +01:00
svlandeg
6d32ae01da mypy fixes 2022-01-20 17:50:37 +01:00