Commit Graph

15645 Commits

Author SHA1 Message Date
Daniël de Kok
2593dadbe4 Remove unused W400 warning.
This warning was used when we still allowed the user to specify
which activations to save.
2022-08-31 11:23:13 +02:00
Daniël de Kok
cd6e4fa8f4 Rename activations
- "probs" -> "probabilities"
- "guesses" -> "label_ids", except in the edit tree lemmatizer, where
  "guesses" -> "tree_ids".
2022-08-31 11:18:40 +02:00
Daniël de Kok
6f80e80305 Don't train in any test_save_activations test 2022-08-30 11:19:57 +02:00
Daniël de Kok
699a1877de Fix copied line in morphologizer activations test 2022-08-30 10:42:40 +02:00
Daniël de Kok
d245da08e1 docs: tag (save_)activations as new in spaCy 4.0 2022-08-30 10:32:18 +02:00
Daniël de Kok
51c87c5f98 Change wording in API docs after store -> save change 2022-08-30 10:24:11 +02:00
Daniël de Kok
cdbda0bbe1 Error E1400 is not used anymore
This error was used when activations were still `Union[bool, List[str]]`.
2022-08-30 10:23:25 +02:00
Daniël de Kok
2290a04d55 Rename TrainablePipe.store_activations to save_activations 2022-08-30 10:20:59 +02:00
Daniël de Kok
3937abd2e7 Merge remote-tracking branch 'upstream/v4' into store-activations 2022-08-30 10:11:04 +02:00
Daniël de Kok
8f84e6ea8a Add type annotations for activations in predict/set_annotations 2022-08-30 10:07:33 +02:00
Daniël de Kok
8c2652d788 Remove TrainablePipe.activations
We do not need to enumerate the activations anymore since `store_activations` is
`bool`.
2022-08-29 16:47:13 +02:00
Daniël de Kok
aea53378dc Make the TrainablePipe.store_activations property a bool
This means that we can also bring back `store_activations` setter.
2022-08-29 16:41:11 +02:00
Adriane Boyd
4bce8fa755
Remove setup_requires from setup.cfg (#11384)
* Remove setup_requires from setup.cfg

* Update requirements test to ignore cython in setup.cfg
2022-08-29 13:23:24 +02:00
Adriane Boyd
2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-08-26 10:11:18 +02:00
Adriane Boyd
1eb7ce5ef7
Merge pull request #11377 from adrianeboyd/chore/update-v4-from-develop-2
Update v4 from develop
2022-08-25 08:26:55 +02:00
Adriane Boyd
740c33fe58 Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop 2022-08-24 20:43:07 +02:00
Adriane Boyd
6fd3b4d9d6
Merge pull request #11375 from adrianeboyd/chore/update-develop-from-master-v3.5-1
Update develop from master for v3.5
2022-08-24 20:41:25 +02:00
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Sofie Van Landeghem
8dd1fa9896
Merge pull request #11366 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-08-24 09:45:55 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Tobius Saul
c09d2fa25b
luganda language extension (#10847)
* luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuaction and sentence additions

* Remove comment header

* Fix typos, reformat

* reformated version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Tal Zussman
7e75327893
Fix menu order in linguistic-features.md (#11364)
Swap 'Vectors & Similarity' and 'Mappings & Exceptions' in menu to match order in body
2022-08-23 14:40:38 +09:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
Sofie Van Landeghem
6e20842370
dev docs: numeric comparators (#11334)
* add section on numeric comparators

* edit

* prettier

* Update extra/DEVELOPER_DOCS/Code Conventions.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* note on typing imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-22 15:52:53 +02:00
Sofie Van Landeghem
1a5be63715
Cleanup Cython structs (#11337)
* cleanup Tokenizer fields

* remove unused object from vocab

* remove IS_OOV_DEPRECATED

* add back in as FLAG13

* FLAG 18 instead

* import fix

* fix clumpsy fingers

* revert symbol changes in favor of #11352

* bint instead of bool
2022-08-22 15:52:24 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
04c6e5cb95
Improve floret vectors display in pipeline docs (#11343) 2022-08-22 11:28:13 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336) 2022-08-19 09:52:12 +02:00
Adriane Boyd
09b3118b26
Add uk pipelines to website (#11332) 2022-08-18 14:04:57 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333) 2022-08-17 19:55:54 +02:00
Adriane Boyd
d757dec5c4
Remove intify_attrs(_do_deprecated) (#11319) 2022-08-17 12:13:54 +02:00
Peter Baumgartner
db7b9938a4
Docs: displaCy documentation - data types, parse_{deps,ents,spans}, spans example (#10950)
* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-16 11:23:34 -04:00
antonpibm
551e73ccfc
Match private networks as URLs (#11121) 2022-08-11 11:26:26 +02:00
Sofie Van Landeghem
5d54c0e32a
Rename modules for consistency (#11286)
* rename Python module to entity_ruler

* rename Python module to attribute_ruler
2022-08-10 11:44:05 +02:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
Paul O'Leary McCann
231a17817d
Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handline

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name
2022-08-09 14:50:50 +02:00
Adriane Boyd
e700358ba0
Add W605 to the errors raised by flake8 in the CI (#11283) 2022-08-09 12:15:13 +02:00
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
Daniël de Kok
1cfbb934ed TextCat.predict: return dict 2022-08-05 14:40:42 +02:00
Daniël de Kok
c792019350
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-05 14:17:07 +02:00
Daniël de Kok
75d76cb2a3 set_annotations: add type annotations 2022-08-05 14:09:30 +02:00
stefawolf
23749cfc91
adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 12:26:38 +02:00
Daniël de Kok
ce36f345db _store_activations: make kwarg-only, remove doc_scores_lens arg 2022-08-05 12:00:01 +02:00
Daniël de Kok
57caae8393 EntityLinker: add type annotations to _add_activations 2022-08-05 11:48:04 +02:00
Daniël de Kok
230264daa0 Revert "Use dict comprehension suggested by @svlandeg"
This reverts commit 6e7b958f70.
2022-08-05 10:33:20 +02:00
Luka Dragar
b64243ed55
Updates to Slovenian language (#11162)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

* Updated support for Slovenian

* Some minor changes to corrections

* Added forint currency

* Corrected HYPHENS_PERMITTED regex and some formatting

* Minor changes

* Un-xfail tokenizer test

* Format

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 10:10:18 +02:00
Adriane Boyd
b5d9d0897e
Merge pull request #11270 from adrianeboyd/chore/update-develop-v3.5
Prepare develop for v3.5
2022-08-04 21:17:26 +02:00