Commit Graph

15396 Commits

Author SHA1 Message Date
Edward
7961a0a959
Fix typo in errors (#10256) 2022-02-10 13:45:46 +01:00
Ryn Daniels
2d6cabb23c
Fix the date command and the matrix failure mode (#10254) 2022-02-10 12:06:30 +01:00
Peter Baumgartner
ee662ec381
Raise error in spacy package when model name is not a valid python identifier (#10192)
* MultiHashEmbed vector docs correction

* raise error for invalid identifier as model name

* more succinct error message

* update success message

* permitted package name + double underscore

* clarify package name error

* clarify underscore run message

* tweak language + simplify underscore run

* cleanup underscore run warning

* spacing correction

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-10 08:15:23 +01:00
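As a rough illustration of the check this commit describes, the sketch below validates a model name with Python's built-in `str.isidentifier()`. The function name `check_model_name` and the error text are hypothetical and are not taken from spaCy's source; the keyword check is an extra assumption added for the example.

```python
import keyword


def check_model_name(name: str) -> str:
    """Return the name if it can be used as a Python package name.

    Hypothetical helper for illustration; not spaCy's actual implementation.
    """
    # str.isidentifier() rejects dashes, spaces, and a leading digit.
    # Assumption: reserved words such as "class" are rejected as well.
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"Model name '{name}' is not a valid Python identifier")
    return name
```

With this sketch, `check_model_name("my_model")` returns the name, while `check_model_name("my-model")` raises a `ValueError`.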
Ryn Daniels
3877f78ff9
fix the syntax for the slow/gpu test crons (#10244) 2022-02-09 11:21:20 +01:00
John Boy
10c77af83d
add textnets to spaCy universe (#10216)
https://github.com/jboynyc/textnets/issues/38
2022-02-09 15:04:26 +09:00
Ines Montani
7b883da9fd
Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip] 2022-02-08 18:04:01 +01:00
Ramon Ziai
6477dafac2
fix(phrasematcher.pyi): change type annotation of docs in add() to List[Doc] (#10235)
https://github.com/explosion/spaCy/issues/10234
2022-02-08 13:37:27 +01:00
Ines Montani
f2c2b97e56 Add spaCy Tailored Pipelines 2022-02-08 11:46:42 +01:00
Adriane Boyd
a9ee5bff98
Support mixed case model package names (#10223) 2022-02-08 10:52:46 +01:00
Ryn Daniels
f939da0bfa
Add github actions for slow and gpu tests (#10225)
* Add github actions for slow and gpu tests

* change weekly GPU tests to also run slow tests, and change the time

* only run the tests if there were commits in the past day
2022-02-08 10:05:35 +01:00
Antti Ajanki
e9c26f2ee9
Add a noun chunker for Finnish (#10214)
with test cases
2022-02-08 08:44:11 +01:00
Sofie Van Landeghem
deb143fa70
Token sent attributes more consistent (#10164)
* remove duplicate line

* add sent start/end token attributes to the docs

* let has_annotation work with IS_SENT_END

* elif instead of if

* add has_annotation test for sent attributes

* fix typo

* remove duplicate is_sent_start entry in docs
2022-02-08 08:35:37 +01:00
Peter Baumgartner
836f689cc7
YAML multiline tip for project.yml files (#10187)
* MultiHashEmbed vector docs correction

* add in multi-line tip

* convert to sidebar tip
2022-02-08 08:35:09 +01:00
Kenneth Enevoldsen
e4625d2fc3
Added Augmenty to universe (#10229)
* Added Augmenty to universe

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-08 08:32:11 +01:00
Lj Miranda
42072f4468
Add spancat pipeline in spacy debug data (#10070)
* Setup debug data for spancat

* Add check for missing labels

* Add low-level data warning error

* Improve logic when compiling the gold train data

* Implement check for negative examples

* Remove breakpoint

* Remove ws_ents and missing entity checks

* Fix mypy errors

* Make variable name spans_key consistent

* Rename pipeline -> component for consistency

* Account for missing labels per spans_key

* Cleanup variable names for consistency

* Improve brevity of conditional statements

* Remove unused variables

* Include spans_key as an argument for _get_examples

* Add a conditional check for spans_key

* Update spancat debug data based on new API

- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull/10079)
- The way information is displayed was also changed (text -> table)

* Rename model_labels to ensure mypy works

* Update wording on warning messages

Use "span type" instead of "entity type" in wording the warning messages.
This is because Spans aren't necessarily entities.

* Update component type into a Literal

This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.

* Update checks to include actual model span_keys

Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of doing the filter
inside the for-loop, I just made another dictionary,
data_labels_in_component to hold this value.

* Update spacy/cli/debug_data.py

* Show label counts only when verbose is True

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 15:03:36 +01:00
Lj Miranda
72fece712f
Add shuffle parameter to Corpus API docs (#10220)
* Add shuffle parameter to Corpus API docs

* Update website/docs/api/corpus.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 14:55:53 +01:00
Adriane Boyd
63e1e4e8f6
Fix debug data check for ents that cross sents (#10188)
* Fix debug data check for ents that cross sents

* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.

* Improve test example

* Use Example.get_aligned_sent_starts

* Add test for crossing entity
2022-02-07 08:53:30 +01:00
github-actions[bot]
91ccacea12
Auto-format code with black (#10209)
* Auto-format code with black

* add black requirement to dev dependencies and pin to 22.x

* ignore black dependency for comparison with setup.cfg

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-02-06 16:30:30 +01:00
Sofie Van Landeghem
bc12ecb870
Merge pull request #10185 from martinjack/master
Update Ukrainian tokenizer_exceptions
2022-02-06 16:30:03 +01:00
Sofie Van Landeghem
14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Adriane Boyd
0668a449ba
Add Pipe.hide_labels to omit labels from pipeline meta (#10175) 2022-02-05 17:59:24 +01:00
Adriane Boyd
6f551043e4
Use paths.vectors for vectors in init config (#10146)
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
2022-02-04 21:09:48 +01:00
Adriane Boyd
fef896ce49
Allow Example to align whitespace annotation (#10189)
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
2022-02-03 17:01:53 +01:00
Kenneth Enevoldsen
a2f27ff83a
Added spacy-wrap to universe (#10168)
* Added spacy-wrap to universe 

Added spacy-wrap to universe: a small package for wrapping fine-tuned Hugging Face transformers in a spaCy pipeline, following the same API as spacy-transformers. (Currently limited to classification models.)

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-03 12:30:09 +01:00
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Lj Miranda
345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
Marek Šuppa
f09c799a96
fix: Add missing comma to _eleven_to_beyond (#10166)
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. The issue was found automatically using a regular expression.
2022-01-30 16:45:06 +09:00
Marek Šuppa
67ecac633f
fix: Add missing comma to examples.py (#10167)
* This comma was most probably left out unintentionally, leading to string concatenation between the two consecutive lines. The issue was found automatically using a regular expression.
2022-01-30 16:43:29 +09:00
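The failure mode behind the two missing-comma fixes above can be reproduced in a few lines. This is a minimal sketch with made-up list contents, not the actual data from the files touched by these commits: Python joins adjacent string literals, so a forgotten comma silently merges two entries instead of raising an error.

```python
# Hypothetical data for illustration only.
# The comma after "twelve" is missing, so Python concatenates
# the two adjacent string literals into a single element.
words_buggy = [
    "eleven",
    "twelve"
    "thirteen",
    "fourteen",
]

# Same list with the comma restored.
words_fixed = [
    "eleven",
    "twelve",
    "thirteen",
    "fourteen",
]

print(len(words_buggy), words_buggy[1])  # 3 twelvethirteen
print(len(words_fixed), words_fixed[1])  # 4 twelve
```

Because the buggy version is still valid syntax, this kind of mistake usually shows up only as a missing lookup at runtime.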
Adriane Boyd
4f441dfa24
Fix infix as prefix in Tokenizer.explain (#10140)
* Fix infix as prefix in Tokenizer.explain

Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:

* skip infix matches that are prefixes in the current substring

* Update tokenizer pseudocode in docs
2022-01-28 17:00:54 +01:00
Eduard Zorita
30cf9d6a05
Update typing hints (#10109)
* Improve typing hints for Matcher.__call__

* Add typing hints for DependencyMatcher

* Add typing hints to underscore extensions

* Update Doc.tensor type (requires numpy 1.21)

* Fix typing hints for Language.component decorator

* Use generic np.ndarray type in Doc to avoid numpy version update

* Fix mypy errors

* Fix cyclic import caused by Underscore typing hints

* Use Literal type from spacy.compat

* Update matcher.pyi import format

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-28 16:59:54 +01:00
Adriane Boyd
09734c56fc
Use simple suggester for spancat initialization (#10143)
Instead of running the actual suggester, which may require
annotation from annotating components that is not necessarily present in
the reference docs, use the built-in 1-gram suggester.
2022-01-28 09:34:23 +01:00
github-actions[bot]
6d4db5c3c7
Auto-format code with black (#10106)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-01-21 10:01:10 +01:00
Ines Montani
34ed93ef68
Support version tags in universe and add note about reporting (#10093)
* Support version tags in universe and add note about reporting

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:21:26 +01:00
Peter Baumgartner
a69005037a
Docker Image for Website Dev (#10098)
* add docker instructions

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/README.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* clarifying language on docker image

* fix markdown formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 23:02:13 +01:00
pepemedigu
2abd380f2d
Update lex_attrs.py for Spanish with ordinals (#10038)
* Update lex_attrs.py

Add ordinal words

* black formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-20 15:44:13 +01:00
Sofie Van Landeghem
d2afdfefc2
Merge pull request #10100 from svlandeg/feature/master_copy
Update develop with latest from master (2)
2022-01-20 14:29:50 +01:00
Sofie Van Landeghem
4465fe0306
Merge branch 'develop' into feature/master_copy 2022-01-20 13:36:17 +01:00
Duygu Altinok
47a2916801
Intify IOB (#9738)
* added iob to int

* added tests

* added iob strings

* added error

* blacked attrs

* Update spacy/tests/lang/test_attrs.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/attrs.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* added iob strings as global

* minor refinement with iob

* removed iob strings from token

* changed to uppercase

* cleaned and went back to master version

* imported iob from attrs

* Update and format errors

* Support and test both str and int ENT_IOB key

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:19:38 +01:00
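The idea of accepting an IOB tag as either a string or an integer can be sketched as follows. This is not the code from the commit: the helper name `intify_iob` and the error messages are hypothetical, and the tuple ordering (0 = no tag set, 1 = inside, 2 = outside, 3 = begin) is an assumption based on the documented meaning of `Token.ent_iob`.

```python
from typing import Union

# Assumed ordering: index 0 = no tag set, 1 = inside, 2 = outside, 3 = begin.
IOB_STRINGS = ("", "I", "O", "B")


def intify_iob(value: Union[str, int]) -> int:
    """Convert an IOB tag given as a string or an int to its integer code.

    Hypothetical helper for illustration; not spaCy's actual implementation.
    """
    if isinstance(value, int):
        # Integer input is only range-checked and passed through.
        if value not in range(len(IOB_STRINGS)):
            raise ValueError(f"Invalid IOB code: {value}")
        return value
    if value not in IOB_STRINGS:
        raise ValueError(f"Invalid IOB string: {value!r}")
    # The position in the tuple is the integer code.
    return IOB_STRINGS.index(value)
```

Under these assumptions, `intify_iob("B")` and `intify_iob(3)` give the same code, which is what lets a pattern key accept both forms.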
Duygu Altinok
268ddf8a06
Add ENT_IOB key to Matcher (#9649)
* added new field

* added exception for IOB strings

* minor refinement to schema

* removed field

* fixed typo

* imported numerical val

* changed the code bit

* cosmetics

* added test for matcher

* set ents of mock docs

* added invalid pattern

* minor update to documentation

* blacked matcher

* added pattern validation

* add IOB vals to schema

* changed into test

* mypy compat

* cleaned left over

* added compat import

* changed type

* added compat import

* changed literal a bit

* went back to old

* made explicit type

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-20 13:18:39 +01:00
Daniël de Kok
6984f55277
Merge pull request #10048 from danieldk/index-arcs-by-head
Use constant-time head lookups in StateC::{L,R}
2022-01-20 13:06:14 +01:00
Paul O'Leary McCann
32bd3856b3
Rename FACILITY to FAC in color list (#10067)
This matches the English models
2022-01-20 12:00:28 +01:00
Adriane Boyd
a55212fca0
Determine labels by factory name in debug data (#10079)
* Determine labels by factory name in debug data

For all components, return labels for all components with the
corresponding factory name rather than for only the default name.

For `spancat`, return labels as a dict keyed by `spans_key`.

* Refactor for typing

* Add test

* Use assert instead of cast, removed unneeded arg

* Mark test as slow
2022-01-20 11:42:52 +01:00
Richard Hudson
e9c6314539
Bugfix for similarity return types (#10051) 2022-01-20 11:40:46 +01:00
Adriane Boyd
7d528e607c
Update quickstart install steps (#10092)
* For conda:
  * Use conda environment rather than venv
  * Install `spacy-transformers` as a conda package
* For pip:
  * Add quotes if extras are included
2022-01-20 10:53:40 +01:00
Paul O'Leary McCann
2ff53834bb
Add link to pattern file info in EntityRuler.initialize docs (#10091)
* Add link to pattern file info in EntityRuler.initialize docs

* Update website/docs/api/entityruler.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-19 10:45:11 +01:00
Daniël de Kok
50d2a2c930
Use fewer Vectors internals (#9879)
* Use Vectors.shape rather than Vectors.data.shape

* Use Vectors.size rather than Vectors.data.size

* Add Vectors.to_ops to move data between different ops

* Add documentation for Vectors.to_ops
2022-01-18 17:14:35 +01:00
Adriane Boyd
4dfd559e55
Fix spaces in Doc.from_docs for empty docs (#10052)
Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where a
doc ending in whitespace is followed by an empty doc.
2022-01-18 17:12:42 +01:00
Paul O'Leary McCann
c28e33637b
Mark flaky spancat test so it doesn't fail the build (#10075)
* Mark flaky spancat test so it doesn't fail the build

* Skip, don't run and ignore
2022-01-18 09:36:28 +01:00
Adriane Boyd
39f1b13e77
Update sudachipy extras (#10072)
By @polm, redone from #9917 after incorrect (reverted) rebase.

`sudachipy>=0.5.2` is needed for newer dictionaries. `sudachipy<0.6.0`
is kept for users who might still prefer the older version, in
particular to be able to compile it without rust.
2022-01-17 11:48:39 +01:00
Natalia Rodnova
47ea6704f1
Span richcmp fix (#9956)
* Corrected Span's __richcmp__ implementation to take end, label and kb_id into consideration

* Updated test

* Updated test

* Removed formatting from a test for readability's sake

* Use same tuples for all comparisons

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-17 11:17:49 +01:00
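The "same tuples for all comparisons" idea from the Span fix above can be sketched in plain Python. `SimpleSpan` is a hypothetical stand-in and not spaCy's `Span`; the fields in the key tuple are limited to those named in the commit message plus `start`, which is an assumption about what the real comparison includes.

```python
from functools import total_ordering


@total_ordering
class SimpleSpan:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, start, end, label="", kb_id=""):
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # One tuple shared by every comparison operator and by hashing.
        return (self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())
```

Because every operator goes through `_key()`, two spans with the same boundaries but different labels compare as unequal, and `<`, `<=`, `>` and `>=` stay consistent with `==`.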