Commit Graph

15320 Commits

Author SHA1 Message Date
Daniël de Kok
78a8bec4d0
Make core projectivization functions cdef nogil (#10241)
* Make core projectivization methods cdef nogil

While profiling the parser, I noticed that a relatively large share of time is
spent in projectivization. This change rewrites the functions in the
core loops as cdef nogil for efficiency.

In C++-land, we use vector in place of Python lists and absent heads
are represented as -1 in place of None.

* _heads_to_c: add assertion

Validation should be performed by the caller, but this assertion ensures that
we are not reading/writing out of bounds with incorrect input.
2022-02-21 15:02:21 +01:00
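The sentinel convention described in the commit above can be illustrated in plain Python. This is a minimal sketch, not spaCy's Cython code: the function names `heads_to_sentinel` and `heads_from_sentinel` are hypothetical, and a Python list stands in for the C++ vector.

```python
from typing import List, Optional


def heads_to_sentinel(heads: List[Optional[int]]) -> List[int]:
    """Replace absent heads (None) with -1 so the list holds only ints."""
    n = len(heads)
    converted = []
    for head in heads:
        if head is None:
            converted.append(-1)
        else:
            # Guard against indices that would read or write out of bounds.
            assert 0 <= head < n, f"head index {head} out of range"
            converted.append(head)
    return converted


def heads_from_sentinel(heads: List[int]) -> List[Optional[int]]:
    """Map the -1 sentinel back to None for Python callers."""
    return [None if head == -1 else head for head in heads]
```

The assertion plays the role the commit message gives it: validation is still the caller's job, and the check only catches input that would index outside the list.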
Adriane Boyd
30030176ee
Update Korean defaults for Tokenizer (#10322)
Update Korean defaults for `Tokenizer` for tokenization following UD
Korean Kaist.
2022-02-21 10:26:19 +01:00
Adriane Boyd
f32ee2e533
Fix NER check in CoNLL-U converter (#10302)
* Fix NER check in CoNLL-U converter

Leave ents unset if no NER annotation is found in the MISC column.

* Revert to global rather than per-sentence NER check

* Update spacy/training/converters/conllu_to_docs.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:52 +01:00
Peter Baumgartner
3358fb9bdd
Miscellaneous Minor SpanGroups/DocBin Improvements (#10250)
* MultiHashEmbed vector docs correction

* doc copy span test

* ignore empty lists in DocBin.span_groups

* serialized empty list const + SpanGroups.is_empty

* add conditional deserial on from_bytes

* clean up + reorganize

* rm test

* add constant as class attribute

* rename to _EMPTY_BYTES

* Update spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-21 10:24:15 +01:00
Adriane Boyd
f4c74764b8
Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 10:22:36 +01:00
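The idea of a shortcut for empty batches, as named in the commit above, can be sketched without spaCy. This is an illustration under assumptions, not the `Tok2Vec.predict` implementation: `predict` and `model` here are hypothetical stand-ins, docs are plain lists of strings, and vectors are lists of floats.

```python
from typing import Callable, List, Sequence

Vectors = List[List[float]]


def predict(
    docs: Sequence[Sequence[str]],
    model: Callable[[Sequence[Sequence[str]]], List[Vectors]],
) -> List[Vectors]:
    """Return one list of token vectors per doc, skipping the model if all docs are empty."""
    if not any(len(doc) for doc in docs):
        # Shortcut: there is nothing to embed, so return an empty result per doc.
        return [[] for _ in docs]
    return model(docs)
```

The early return means the model is never asked to process a batch with zero tokens.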
github-actions[bot]
6de84c8757
Auto-format code with black (#10333)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-21 09:15:42 +01:00
Adriane Boyd
28ba31e793
Add whitespace and combined augmenters (#10170)
Add whitespace augmenter that inserts a single whitespace token into a
doc containing annotation used in core trained pipelines.

Add a combined augmenter that handles lowercasing, orth variants and
whitespace augmentation.
2022-02-17 15:54:09 +01:00
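A whitespace augmenter of the kind described in the commit above might look roughly like this. The sketch works on parallel lists rather than spaCy objects; the function name `insert_whitespace` and the `_SP` tag for whitespace tokens are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def insert_whitespace(
    words: List[str],
    tags: List[str],
    rng: random.Random,
    whitespace: str = " ",
    whitespace_tag: str = "_SP",  # assumed tag for whitespace tokens
) -> Tuple[List[str], List[str]]:
    """Insert one whitespace token (and a matching tag) at a random position."""
    position = rng.randrange(len(words) + 1)
    new_words = words[:position] + [whitespace] + words[position:]
    new_tags = tags[:position] + [whitespace_tag] + tags[position:]
    return new_words, new_tags
```

Passing a seeded `random.Random` keeps the augmentation reproducible, and the input lists are left unmodified.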
Grey Murav
aa93b471a1
Extend list of stopwords for ru language (#10313) 2022-02-17 15:51:15 +01:00
Grey Murav
23f06dc37f
Extend list of numbers for ru language (#10280)
* Extended list of numbers for ru language

Extended list of numbers with all forms and cases, including short forms, slang variants and Roman numerals.

* Update lex_attrs.py

* Update 'like_num' function with percentages

Added support for numbers with percentages, such as 12% and 1.2%, to the 'like_num' function.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:50:08 +01:00
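The percentage handling mentioned in the commit above can be sketched as follows. This is a simplified illustration, not the Russian `lex_attrs.py` code: `like_num_sketch` is a hypothetical name, and the lookup against a list of number words is deliberately left out.

```python
def like_num_sketch(text: str) -> bool:
    """Return True for digit strings, optionally signed, with separators or a trailing %."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    if text.endswith("%"):
        # Treat "12%" and "1.2%" like their numeric part.
        text = text[:-1]
    # Drop thousands and decimal separators before checking for digits.
    text = text.replace(",", "").replace(".", "")
    return text.isdigit()
```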
Grey Murav
a9756963e6
Extend list of abbreviations for ru language (#10282)
* Extend list of abbreviations for ru language

Extended list of abbreviations for ru language that may influence tokenization.

* black formatting

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2022-02-17 15:48:50 +01:00
Adriane Boyd
da7520a83c
Delay loading of mecab in Korean tokenizer (#10295)
* Delay loading of mecab in Korean tokenizer

Delay loading of mecab until the tokenizer is called the first time so
that it's possible to initialize a blank `ko` pipeline without having
mecab installed, e.g. for use with `spacy init vectors`.

* Move mecab import back to __init__

Move mecab import back to __init__ to warn users at the same point as
before for missing python dependencies.
2022-02-17 11:35:34 +01:00
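Delaying an expensive or optional dependency until first use, as the commit above does for mecab, is commonly done with a lazily initialized attribute. The sketch below is generic and does not reproduce the Korean tokenizer: `LazyTokenizer` and `tagger_factory` are hypothetical names, and any callable can stand in for the real tagger.

```python
from typing import Callable, List, Optional

Tagger = Callable[[str], List[str]]


class LazyTokenizer:
    """Create the underlying tagger only when the tokenizer is first called."""

    def __init__(self, tagger_factory: Callable[[], Tagger]) -> None:
        self._tagger_factory = tagger_factory
        self._tagger: Optional[Tagger] = None

    @property
    def tagger(self) -> Tagger:
        if self._tagger is None:
            # First use: build the tagger now and cache it.
            self._tagger = self._tagger_factory()
        return self._tagger

    def __call__(self, text: str) -> List[str]:
        return self.tagger(text)
```

Constructing `LazyTokenizer` never invokes the factory, so an object can be created even when the factory would fail; the cost or error only appears on the first call.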
Sofie Van Landeghem
3854ab901f
Merge pull request #10312 from svlandeg/fix/drop-develop
Remove daily/weekly tests for develop branch
2022-02-16 16:28:22 +01:00
Sofie Van Landeghem
26eac22d3b
remove develop also from GPU tests 2022-02-16 15:44:05 +01:00
Sofie Van Landeghem
fef768ef74
remove develop (not an active branch anymore) 2022-02-16 15:43:36 +01:00
Sofie Van Landeghem
228aaa16b7
Merge pull request #10309 from svlandeg/copy/develop
Update master with latest from develop
2022-02-16 15:40:58 +01:00
Ryn Daniels
d30ee14ab3
Pass the matrix branch to the checkout action (#10304) 2022-02-16 15:39:42 +01:00
Sofie Van Landeghem
a16b14e591
Merge branch 'master' into copy/develop 2022-02-16 14:04:59 +01:00
Adriane Boyd
22066f4e0f
Also exclude workflows from non-PR CI runs (#10305) 2022-02-16 13:45:30 +01:00
Ryn Daniels
f6250015ab
Fix the datemath for reals (#10294)
* add debugging branch and quotes to daily slowtest action

* Apparently the quotes fixed it
2022-02-15 14:18:36 +01:00
Paul O'Leary McCann
23bd103d89 Add tmtoolkit setup steps 2022-02-14 15:17:25 +09:00
Markus Konrad
8818a44a39
add tmtoolkit package to spaCy universe (#10245) 2022-02-14 15:16:43 +09:00
github-actions[bot]
5adedb8587
Auto-format code with black (#10260)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-02-11 14:23:01 +01:00
Adriane Boyd
9a06a210ec
Exclude github workflow edits from CI (#10261) 2022-02-11 14:22:43 +01:00
Adriane Boyd
bbaf41fb3b
Set version to v3.2.2 (#10262) 2022-02-11 11:45:26 +01:00
Edward
7961a0a959
Fix typo in errors (#10256) 2022-02-10 13:45:46 +01:00
Ryn Daniels
2d6cabb23c
Fix the date command and the matrix failure mode (#10254) 2022-02-10 12:06:30 +01:00
Peter Baumgartner
ee662ec381
Raise error in spacy package when model name is not a valid python identifier (#10192)
* MultiHashEmbed vector docs correction

* raise error for invalid identifier as model name

* more succinct error message

* update success message

* permitted package name + double underscore

* clarify package name error

* clarify underscore run message

* tweak language + simplify underscore run

* cleanup underscore run warning

* spacing correction

* Update spacy/tests/test_cli.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-10 08:15:23 +01:00
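A check of the kind described in the commit above can be built on Python's `str.isidentifier()`. This is a minimal sketch: the function name `validate_model_name` and the error wording are hypothetical and are not taken from the `spacy package` command.

```python
def validate_model_name(name: str) -> str:
    """Return the name unchanged, or raise if it is not a valid Python identifier."""
    if not name.isidentifier():
        raise ValueError(
            f"Model name '{name}' is not a valid Python identifier. "
            "Use only letters, digits and underscores, and do not start with a digit."
        )
    return name
```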
Ryn Daniels
3877f78ff9
fix the syntax for the slow/gpu test crons (#10244) 2022-02-09 11:21:20 +01:00
John Boy
10c77af83d
add textnets to spaCy universe (#10216)
https://github.com/jboynyc/textnets/issues/38
2022-02-09 15:04:26 +09:00
Ines Montani
7b883da9fd
Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip] 2022-02-08 18:04:01 +01:00
Ramon Ziai
6477dafac2
fix(phrasematcher.pyi): change type annotation of docs in add() to List[Doc] (#10235)
https://github.com/explosion/spaCy/issues/10234
2022-02-08 13:37:27 +01:00
Ines Montani
f2c2b97e56 Add spaCy Tailored Pipelines 2022-02-08 11:46:42 +01:00
Adriane Boyd
a9ee5bff98
Support mixed case model package names (#10223) 2022-02-08 10:52:46 +01:00
Ryn Daniels
f939da0bfa
Add github actions for slow and gpu tests (#10225)
* Add github actions for slow and gpu tests

* change weekly GPU tests to also run slow tests, and change the time

* only run the tests if there were commits in the past day
2022-02-08 10:05:35 +01:00
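The "only run if there were commits in the past day" condition from the commit above amounts to a date comparison. The real check lives in a GitHub Actions workflow; the Python sketch below only illustrates the logic, and `has_recent_commits` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable


def has_recent_commits(
    commit_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(days=1),
) -> bool:
    """Return True if any commit timestamp falls within the window before `now`."""
    cutoff = now - window
    return any(commit_time >= cutoff for commit_time in commit_times)
```

All timestamps should share the same timezone awareness, since Python refuses to compare naive and aware datetimes.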
Antti Ajanki
e9c26f2ee9
Add a noun chunker for Finnish (#10214)
with test cases
2022-02-08 08:44:11 +01:00
Sofie Van Landeghem
deb143fa70
Token sent attributes more consistent (#10164)
* remove duplicate line

* add sent start/end token attributes to the docs

* let has_annotation work with IS_SENT_END

* elif instead of if

* add has_annotation test for sent attributes

* fix typo

* remove duplicate is_sent_start entry in docs
2022-02-08 08:35:37 +01:00
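The relationship between sentence-start and sentence-end flags touched on in the commit above can be sketched on plain lists. The sketch assumes that a token ends a sentence when it is the last token or when the next token starts a sentence, and that an unknown start (`None`) makes the preceding end unknown as well; `sent_ends_from_starts` is a hypothetical helper, not a spaCy function.

```python
from typing import List, Optional


def sent_ends_from_starts(sent_starts: List[Optional[bool]]) -> List[Optional[bool]]:
    """Derive per-token sentence-end flags from sentence-start flags."""
    ends: List[Optional[bool]] = []
    for i in range(len(sent_starts)):
        if i + 1 == len(sent_starts):
            # The final token always closes the last sentence.
            ends.append(True)
        else:
            # True/False if the next token's start is known, otherwise None.
            ends.append(sent_starts[i + 1])
    return ends
```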
Peter Baumgartner
836f689cc7
YAML multiline tip for project.yml files (#10187)
* MultiHashEmbed vector docs correction

* add in multi-line tip

* convert to sidebar tip
2022-02-08 08:35:09 +01:00
Kenneth Enevoldsen
e4625d2fc3
Added Augmenty to universe (#10229)
* Added Augmenty to universe

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-02-08 08:32:11 +01:00
Lj Miranda
42072f4468
Add spancat pipeline in spacy debug data (#10070)
* Setup debug data for spancat

* Add check for missing labels

* Add low-level data warning error

* Improve logic when compiling the gold train data

* Implement check for negative examples

* Remove breakpoint

* Remove ws_ents and missing entity checks

* Fix mypy errors

* Make variable name spans_key consistent

* Rename pipeline -> component for consistency

* Account for missing labels per spans_key

* Cleanup variable names for consistency

* Improve brevity of conditional statements

* Remove unused variables

* Include spans_key as an argument for _get_examples

* Add a conditional check for spans_key

* Update spancat debug data based on new API

- Instead of using _get_labels_from_model(), I'm now using
_get_labels_from_spancat() (cf. https://github.com/explosion/spaCy/pull/10079)
- The way information is displayed was also changed (text -> table)

* Rename model_labels to ensure mypy works

* Update wording on warning messages

Use "span type" instead of "entity type" in the wording of the warning messages.
This is because Spans aren't necessarily entities.

* Update component type into a Literal

This is to make it clear that the component parameter should only accept
either 'spancat' or 'ner'.

* Update checks to include actual model span_keys

Instead of looking at everything in the data, we only check those
span_keys from the actual spancat component. Instead of doing the filter
inside the for-loop, I just made another dictionary,
data_labels_in_component, to hold this value.

* Update spacy/cli/debug_data.py

* Show label counts only when verbose is True

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 15:03:36 +01:00
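The commit above describes building a separate dictionary that keeps only the span keys used by the spancat component, instead of filtering inside the loop. A minimal sketch of that step follows; the function name and the shape of the data (span key mapped to label counts) are assumptions for illustration, not the `debug_data.py` code.

```python
from typing import Dict, Set


def filter_labels_by_component(
    data_labels: Dict[str, Dict[str, int]],
    component_keys: Set[str],
) -> Dict[str, Dict[str, int]]:
    """Keep only the span keys that the component is configured to use."""
    return {
        key: counts
        for key, counts in data_labels.items()
        if key in component_keys
    }
```

Later checks can then iterate over the filtered dictionary directly, without repeating the membership test.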
Lj Miranda
72fece712f
Add shuffle parameter to Corpus API docs (#10220)
* Add shuffle parameter to Corpus API docs

* Update website/docs/api/corpus.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 14:55:53 +01:00
Adriane Boyd
63e1e4e8f6
Fix debug data check for ents that cross sents (#10188)
* Fix debug data check for ents that cross sents

* Use aligned sent starts to have the same indices for the NER and sent
start annotation
* Add a temporary, insufficient hack for the case where a
sentence-initial reference token is split into multiple tokens in the
predicted doc, since `Example.get_aligned("SENT_START")` currently
aligns `True` to all the split tokens.

* Improve test example

* Use Example.get_aligned_sent_starts

* Add test for crossing entity
2022-02-07 08:53:30 +01:00
github-actions[bot]
91ccacea12
Auto-format code with black (#10209)
* Auto-format code with black

* add black requirement to dev dependencies and pin to 22.x

* ignore black dependency for comparison with setup.cfg

Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2022-02-06 16:30:30 +01:00
Sofie Van Landeghem
bc12ecb870
Merge pull request #10185 from martinjack/master
Update Ukrainian tokenizer_exceptions
2022-02-06 16:30:03 +01:00
Sofie Van Landeghem
14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Adriane Boyd
0668a449ba
Add Pipe.hide_labels to omit labels from pipeline meta (#10175) 2022-02-05 17:59:24 +01:00
Adriane Boyd
6f551043e4
Use paths.vectors for vectors in init config (#10146)
So that overriding `paths.vectors` works consistently in generated
configs, set vectors model in `paths.vectors` and always refer to this
path in `initialize.vectors`.
2022-02-04 21:09:48 +01:00
Adriane Boyd
fef896ce49
Allow Example to align whitespace annotation (#10189)
Remove exception for whitespace tokens in `Example.get_aligned` so that
annotation on whitespace tokens is aligned in the same way as for
non-whitespace tokens.
2022-02-03 17:01:53 +01:00
Kenneth Enevoldsen
a2f27ff83a
Added spacy-wrap to universe (#10168)
* Added spacy-wrap to universe

Added spacy-wrap to universe, a small package for wrapping fine-tuned Hugging Face transformers in a spaCy pipeline following the same API as spacy-transformers. (Currently limited to classification models.)

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

* Update website/meta/universe.json

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-03 12:30:09 +01:00
Evgen Kytonin
fc3d446c71 Update Ukrainian tokenizer_exceptions 2022-02-01 13:24:00 +02:00
Lj Miranda
345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
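The behaviour documented in the commit above, that a span's entities are only those lying within the span, can be pictured with token offsets. This is a sketch on plain tuples rather than spaCy objects: `ents_in_span` is a hypothetical helper, entities are `(start, end, label)` triples with an exclusive end, and "within" is assumed to mean fully contained.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start token, end token exclusive, label)


def ents_in_span(ents: List[Entity], span_start: int, span_end: int) -> List[Entity]:
    """Return the entities that lie entirely inside [span_start, span_end)."""
    return [
        ent
        for ent in ents
        if ent[0] >= span_start and ent[1] <= span_end
    ]
```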