Commit Graph

15950 Commits

Author SHA1 Message Date
Adriane Boyd
bffe54d02b Set version to v3.4.0 2022-06-24 08:48:58 +02:00
Peter Baumgartner
9738b69c0e
Update Code Conventions.md (#11018) 2022-06-24 15:11:29 +09:00
Dmytro Sadovnychyi
4cd8b4cc22
Fix some of the broken links on universe pages (#11011)
Currently some of the "AUTHOR INFO" links (e.g. here[0]) are broken:

```
https://github.com/https://github.com/explosion
```

[0] https://spacy.io/universe/project/spacy-experimental


Also one remains broken with `https://szegedai.github.io/`.
2022-06-23 17:53:00 +02:00
Sofie Van Landeghem
f8116078ce
disable failing test because Stanford servers are down (#11015) 2022-06-23 10:57:46 +02:00
Adriane Boyd
d4e3f43639
Update thinc version to switch back to blis v0.7 (#11014) 2022-06-23 09:50:25 +02:00
Adriane Boyd
f1197d9175
Add API docs for token attribute symbols (#10836)
* Add API docs for token attribute symbols

* Remove NBSP's

* Fix typo

* Rephrase

Co-authored-by: svlandeg <svlandeg@github.com>
2022-06-23 08:16:38 +02:00
Peter Baumgartner
3335bb9d0c
remove cuda116 extra from install widget (#11012) 2022-06-23 08:15:28 +02:00
jademlc
bed23ff291
Update serialization methods code block (#11004)
* Update serialization methods code block

* Update website/docs/usage/saving-loading.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-22 20:45:26 +02:00
Sofie Van Landeghem
0fa004c4cd the 'new' indicator wants a 'number' (#10997) 2022-06-21 22:01:16 +02:00
Philip Vollet
1ae13b2a70
Merge pull request #10991 from Lucaterre/master
updated spacy universe for spacyfishing
2022-06-21 10:33:26 +02:00
Daniël de Kok
0271306f16
Use thinc-apple-ops>=0.1.0.dev0 with apple extras (#10904)
* Use thinc-apple-ops>=0.1.0.dev0 with `apple` extras

Also test with thinc-apple-ops that is at least 0.1.0.dev0.

* Check thinc-apple-ops on macOS with Python 3.10

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Use `pip install --pre` for installing thinc-apple-ops in CI

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-21 08:26:59 +02:00
Victoria
a08ca064e5
Update linguistic-features.md (#10993)
Change link for downloading fasttext word vectors
2022-06-21 15:03:41 +09:00
Lucaterre
2820d7dd8d correct typo in universe.json for 'code_example' key : pipe name 'entityfishing' 2022-06-20 15:26:23 +02:00
Lucaterre
cdad815c68 updated spacy universe for spacyfishing 2022-06-20 14:28:49 +02:00
Sofie Van Landeghem
f00254ae27
add counts to verbose list of NER labels (#10957) 2022-06-20 09:48:40 +02:00
Raphael Mitsch
4c058eb40a
enable argument for spacy.load() (#10784)
* Enable flag on spacy.load: foundation for include, enable arguments.

* Enable flag on spacy.load: fixed tests.

* Enable flag on spacy.load: switched from pretrained model to empty model with added pipes for tests.

* Enable flag on spacy.load: switched to more consistent error on misspecification of component activity. Test refactoring. Added  to default config.

* Enable flag on spacy.load: added support for fields not in pipeline.

* Enable flag on spacy.load: removed serialization fields from supported fields.

* Enable flag on spacy.load: removed 'enable' from config again.

* Enable flag on spacy.load: relaxed checks in _resolve_component_activation_status() to allow non-standard pipes.

* Enable flag on spacy.load: fixed relaxed checks for _resolve_component_activation_status() to allow non-standard pipes. Extended tests.

* Enable flag on spacy.load: comments w.r.t. resolution workarounds.

* Enable flag on spacy.load: remove include fields. Update website docs.

* Enable flag on spacy.load: updates w.r.t. changes in master.

* Implement Doc.from_json(): update docstrings.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): remove newline.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): change error message for E1038.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Enable flag on spacy.load: wrapped docstring for _resolve_component_status() at 80 chars.

* Enable flag on spacy.load: changed exmples for enable flag.

* Remove newline.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix docstring for Language._resolve_component_status().

* Rename E1038 to E1042.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 20:24:13 +01:00
Sofie Van Landeghem
eaeca5eb6a
account for NER labels with a hyphen in the name (#10960)
* account for NER labels with a hyphen in the name

* cleanup

* fix docstring

* add return type to helper method

* shorter method and few more occurrences

* user helper method across repo

* fix circular import

* partial revert to avoid circular import
2022-06-17 20:02:37 +01:00
github-actions[bot]
6313787fb6
Auto-format code with black (#10977)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-17 19:41:55 +01:00
Raphael Mitsch
d50668dbf0
Made _initialize_X() methods private. (#10978) 2022-06-17 15:55:34 +02:00
Raphael Mitsch
a7f6bc5dfb
Workaround for Typer optional default values with Python calls (#10788)
* Workaround for Typer optional default values with Python calls: added test and workaround.

* @rmitsch Workaround for Typer optional default values with Python calls: reverting some black formatting changes.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* @rmitsch Workaround for Typer optional default values with Python calls: removing return type hint.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Workaround for Typer optional default values with Python calls: fixed imports, added GitHub issue marker.

* Workaround for Typer optional default values with Python calls: removed forcing of default values for optional arguments in init_config_cli(). Added default values for init_config(). Synchronized default values for init_config_cli() and init_config().

* Workaround for Typer optional default values with Python calls: removed unused import.

* Workaround for Typer optional default values with Python calls: fixed usage of optimize in init_config_cli().

* Workaround for Typer optional default values with Pythhon calls: remove output_file from InitDefaultValues.

* Workaround for Typer optional default values with Python calls: rename class for default init values.

* Workaround for Typer optional default values with Python calls: remove newline.

* remove introduced newlines

* Remove test_init_config_from_python_without_optional_args().

* remove leftover import

* reformat import

* remove duplicate

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-17 12:15:36 +02:00
Daniël de Kok
3d3fbeda9f
Update for CBlas changes in Thinc 8.1.0.dev2 (#10970) 2022-06-16 11:42:34 +02:00
Daniël de Kok
0d352c46ed
vectors: remove use of float as row number (#10955)
The float -1 was returned rather than the integer -1 as the row for
unknown keys. This doesn't introduce a realy bug, since such floats
cast (without issues) to int in the conversion to NumPy arrays. Still,
it's nice to to do the correct thing :).
2022-06-15 15:32:02 +02:00
Madeesh Kannan
126d1db123
Add failing test: test_matcher_extension_in_set_predicate (#10948) 2022-06-13 10:56:45 +02:00
Daniël de Kok
a83a501195
precomputable_biaffine: avoid concatenation (#10911)
The `forward` of `precomputable_biaffine` performs matrix multiplication
and then `vstack`s the result with padding. This creates a temporary
array used for the output of matrix concatenation.

This change avoids the temporary by pre-allocating an array that is
large enough for the output of matrix multiplication plus padding and
fills the array in-place.

This gave me a small speedup (a bit over 100 WPS) on de_core_news_lg on
M1 Max (after changing thinc-apple-ops to support in-place gemm as BLIS
does).
2022-06-10 18:12:28 +02:00
github-actions[bot]
97e8a5041b
Auto-format code with black (#10945)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-10 13:21:33 +02:00
Gor Arakelyan
605f84938b
Add "Aim-spaCy" to spaCy Universe (#10943)
* Add Aim-spaCy to spaCy universe

* Update Aim thumbnail

* Fix author links

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-06-10 18:33:17 +09:00
Daniël de Kok
9fccaab333 Re-enable some more beam parser tests 2022-06-09 16:01:24 +02:00
Daniël de Kok
c8e4627e39 Remove commented line 2022-06-09 15:59:33 +02:00
Daniël de Kok
224ebcc4b3 test_ner: make loss test more strict again 2022-06-09 15:59:01 +02:00
Daniël de Kok
0b92666d6b Re-enable test that works now. 2022-06-09 15:56:49 +02:00
Daniël de Kok
5fd369b9e9 ner: re-enable sentence boundary checks 2022-06-09 14:50:47 +02:00
Daniël de Kok
11ae4cc3d3 Remove last vestiges of PrecomputableAffine
This does not exist anymore as a separate layer.
2022-06-09 14:46:13 +02:00
Daniël de Kok
fba264382c Parser slow test fixup
We don't have TransitionBasedParser.{v1,v2} until we bring it back as a
legacy option.
2022-06-09 14:23:03 +02:00
Daniël de Kok
e851f4a560 Merge remote-tracking branch 'upstream/v4' into feature/refactor-parser 2022-06-09 14:10:58 +02:00
Daniël de Kok
bc36c71982
Rename some identifiers in the parser refactor (#10935)
* Rename _parseC to _parse_batch

* tb_framework: prefix many auxiliary functions with underscore

To clearly state the intent that they are private.

* Rename `lower` to `hidden`, `upper` to `output`
2022-06-09 13:45:30 +02:00
Daniël de Kok
7f3842f54d
Merge pull request #10936 from danieldk/merge-master-v4-20220609
Merge `master` into `v4`
2022-06-09 12:47:13 +02:00
Daniël de Kok
2f05c6824c Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220609 2022-06-09 10:18:25 +02:00
Daniël de Kok
63e90dd6a1
Reimplement parser rehearsal function (#10878)
* Reimplement parser rehearsal function

Before the parser refactor, rehearsal was driven by a loop in the
`rehearse` method itself. For each parsing step, the loops would:

1. Get the predictions of the teacher.
2. Get the predictions and backprop function of the student.
3. Compute the loss and backprop into the student.
4. Move the teacher and student forward with the predictions of
   the student.

In the refactored parser, we cannot perform search stepwise rehearsal
anymore, since the model now predicts all parsing steps at once.
Therefore, rehearsal is performed in the following steps:

1. Get the predictions of all parsing steps from the student, along
   with its backprop function.
2. Get the predictions from the teacher, but use the predictions of
   the student to advance the parser while doing so.
3. Compute the loss and backprop into the student.

To support the second step a new method, `advance_with_actions` is
added to `GreedyBatch`, which performs the provided parsing steps.

* tb_framework: wrap upper_W and upper_b in Linear

Thinc's Optimizer cannot handle resizing of existing parameters. Until
it does, we work around this by wrapping the weights/biases of the upper
layer of the parser model in Linear. When the upper layer is resized, we
copy over the existing parameters into a new Linear instance. This does
not trigger an error in Optimizer, because it sees the resized layer as
a new set of parameters.

* Add test for TransitionSystem.apply_actions

* Better FIXME marker

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Fixes from Madeesh

* Apply suggestions from Sofie

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove useless assignment

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-08 20:02:04 +02:00
kadarakos
1bb87f35bc
Detect cycle during projectivize (#10877)
* detect cycle during projectivize

* not complete test to detect cycle in projectivize

* boolean to int type to propagate error

* use unordered_set instead of set

* moved error message to errors

* removed cycle from test case

* use find instead of count

* cycle check: only perform one lookup

* Return bool again from _has_head_as_ancestor

Communicate presence of cycles through an output argument.

* Switch to returning std::pair to encode presence of a cycle

The has_cycle pointer is too easy to misuse. Ideally, we would have a
sum type like Rust's `Result` here, but C++ is not there yet.

* _is_non_proj_arc: clarify what we are returning

* _has_head_as_ancestor: remove count

We are now explicitly checking for cycles, so the algorithm must always
terminate. Either we encounter the head, we find a root, or a cycle.

* _is_nonproj_arc: simplify condition

* Another refactor using C++ exceptions

* Remove unused error code

* Print graph with cycle on exception

* Include .hh files in source package

* Add FIXME comment

* cycle detection test

* find cycle when starting from problematic vertex

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2022-06-08 19:34:11 +02:00
Paul O'Leary McCann
d176afd32f
Add note about multiple patterns (#10826)
* Add note about multiple patterns

* Move note to the top of method docs

* Remove EntityRuler note
2022-06-08 16:24:14 +02:00
Sofie Van Landeghem
763dcbf885
Fix version in SpanRuler docs (#10925)
* SpanRuler is new since 3.3.1

* update SpanRuler version since 3.3.1
2022-06-08 14:45:04 +02:00
Daniël de Kok
a4003532a3
Update README: spaCy 3.3.1 is out now (#10927) 2022-06-08 15:16:22 +09:00
Ilya Nikitin
c323789721
token.md: Fix documentation of Token.ancestors (#10917) 2022-06-06 14:32:36 +09:00
vincent d warmerdam
e7d2b26966
Add spacy-report to universe (#10910)
* Add spacy-report to universe

* Remove extra comma

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-06-05 18:57:58 +09:00
github-actions[bot]
24aafdffad
Auto-format code with black (#10908)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2022-06-03 11:01:55 +02:00
Adriane Boyd
727ce6d1f5
Remove English exceptions with mismatched features (#10873)
Remove English contraction exceptions with mismatched features that lead
to exceptions like "theses" and "thisre".
2022-06-03 09:44:04 +02:00
Madeesh Kannan
41389ffe1e
Avoid pickling Doc inputs passed to Language.pipe() (#10864)
* `Language.pipe()`: Serialize `Doc` objects to bytes when using multiprocessing to avoid pickling overhead

* `Doc.to_dict()`: Serialize `_context` attribute (keeping in line with `(un)pickle_doc()`

* Correct type annotations

* Fix typo

* `Doc`: Do not serialize `_context`

* `Language.pipe`: Send context objects to child processes, Simplify `as_tuples` handling

* Fix type annotation

* `Language.pipe`: Simplify `as_tuple` multiprocessor handling

* Cleanup code, fix typos

* MyPy fixes

* Move doc preparation function into `_multiprocessing_pipe`
Whitespace changes

* Remove superfluous comma

* Rename `prepare_doc` to `prepare_input`

* Update spacy/errors.py

* Undo renaming for error

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 20:06:49 +02:00
Adriane Boyd
430592b3ce
Extend typing_extensions to <4.2.0 (#10899) 2022-06-02 17:22:34 +02:00
single-fingal
6c6b8da7cc
Fix: De/Serialize SpanGroups including the SpanGroup keys (#10707)
* fix: De/Serialize `SpanGroups` including the SpanGroup keys

This prevents the loss of `SpanGroup`s that have the same .name as other `SpanGroup`s within the same `SpanGroups` object (upon de/serialization of the `SpanGroups`).

Fixes #10685

* Maintain backwards compatibility for serialized `SpanGroups`

(serialized as: a list of `SpanGroup`s, or b'')

* Add tests for `SpanGroups` deserialization backwards-compatibility

* Move a `SpanGroups` de/serialization test (test_issue10685)
  to tests/serialize/test_serialize_spangroups.py

* Output a warning if deserializing a `SpanGroups` with duplicate .name-d `SpanGroup`s

* Minor refactor

* `SpanGroups.from_bytes` handles only `list` and `dict` types with
`dict` as the expected default
* For lists, keep first rather than last value encountered
* Update error message
* Rename and update tests

* Update to preserve list serialization of SpanGroups

To avoid breaking compatibility of serialized `Doc` and `DocBin` with
earlier versions of spacy v3, revert back to a list-only serialization,
but update the names just for serialization so that the SpanGroups keys
override the SpanGroup names.

* Preserve object identity and current key overwrite

* Preserve SpanGroup object identity
* Preserve last rather than first span group from SpanGroup list
  format without SpanGroups keys

* Update inline comments

* Fix types

* Add type info for SpanGroup.copy

* Deserialize `SpanGroup`s as copies

when a single SpanGroup is the value for more than 1 `SpanGroups` key.
This is because we serialize `SpanGroups` as dicts (to maintain backward-
and forward-compatibility) and we can't assume `SpanGroup`s with the same
bytes/serialization were the same (identical) object, pre-serialization.

* Update spacy/tokens/_dict_proxies.py

* Add more SpanGroups serialization tests

Test that serialized SpanGroups maintain their Span order

* small clarification on older spaCy version

* Update spacy/tests/serialize/test_serialize_span_groups.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 15:56:27 +02:00
Adriane Boyd
7e13652d36
Fix schemas import in Doc (#10898) 2022-06-02 15:53:03 +02:00