Commit Graph

15609 Commits

Author SHA1 Message Date
Adriane Boyd
2a558a7cdc
Switch to mecab-ko as default Korean tokenizer (#11294)
* Switch to mecab-ko as default Korean tokenizer

Switch to the (confusingly-named) mecab-ko python module for default Korean
tokenization.

Maintain the previous `natto-py` tokenizer as
`spacy.KoreanNattoTokenizer.v1`.

* Temporarily run tests with mecab-ko tokenizer

* Fix types

* Fix duplicate test names

* Update requirements test

* Revert "Temporarily run tests with mecab-ko tokenizer"

This reverts commit d2083e7044.

* Add mecab_args setting, fix pickle for KoreanNattoTokenizer

* Fix length check

* Update docs

* Formatting

* Update natto-py error message

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2022-08-26 10:11:18 +02:00
Adriane Boyd
1eb7ce5ef7
Merge pull request #11377 from adrianeboyd/chore/update-v4-from-develop-2
Update v4 from develop
2022-08-25 08:26:55 +02:00
Adriane Boyd
740c33fe58 Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop 2022-08-24 20:43:07 +02:00
Adriane Boyd
6fd3b4d9d6
Merge pull request #11375 from adrianeboyd/chore/update-develop-from-master-v3.5-1
Update develop from master for v3.5
2022-08-24 20:41:25 +02:00
Adriane Boyd
81874265e9 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1 2022-08-24 12:47:42 +02:00
Sofie Van Landeghem
8dd1fa9896
Merge pull request #11366 from adrianeboyd/chore/update-v4-from-master
Update v4 from master
2022-08-24 09:45:55 +02:00
Adriane Boyd
c44d243f25 Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master 2022-08-24 07:15:41 +02:00
Tobius Saul
c09d2fa25b
Luganda language extension (#10847)
* Luganda language extension

* __init__.py changes

* New enhancements

* Lexical attribute changed

* punctuation and sentence additions

* Remove comment header

* Fix typos, reformat

* reformatted version

* Add tokenizer test

* Remove contractions from stop words

* Format

* Add Luganda to website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 13:09:36 +02:00
Edward
5afa98aabf
Support custom attributes for tokens and spans in json conversion (#11125)
* Add token and span custom attributes to to_json()

* Change logic for to_json

* Add functionality to from_json

* Small adjustments

* Move token/span attributes to new dict key

* Fix test

* Fix the same test but much better

* Add backwards compatibility tests and adjust logic

* Add test to check if attributes not set in underscore are not saved in the json

* Add tests for json compatibility

* Adjust test names

* Fix tests and clean up code

* Fix assert json tests

* small adjustment

* adjust naming and code readability

* Adjust naming, added more tests and changed logic

* Fix typo

* Adjust errors, naming, and small test optimization

* Fix byte tests

* Fix bytes tests

* Change naming and json structure

* update schema

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update spacy/schemas.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update schema for underscore attributes

* Adjust underscore schema

* adjust schema tests

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-23 10:05:02 +02:00
Tal Zussman
7e75327893
Fix menu order in linguistic-features.md (#11364)
Swap 'Vectors & Similarity' and 'Mappings & Exceptions' in menu to match order in body
2022-08-23 14:40:38 +09:00
Adriane Boyd
bb0e178878
Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328)
* Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents`
* Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents`
* Make `Span.ent_id` an alias of `Span.id` rather than a read-only view
of the root token's `ent_id` annotation
2022-08-22 20:28:57 +02:00
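
The three rules in the commit above can be pictured with a small toy model. This is only an illustrative sketch, not spaCy's implementation: `ToyToken`, `ToySpan`, and `set_ents` are hypothetical stand-ins for `Token`, `Span`, and the `Doc.ents` setter, and the real classes carry far more state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token's entity annotation."""
    text: str
    ent_id: str = ""
    ent_kb_id: str = ""


@dataclass
class ToySpan:
    """Hypothetical stand-in for an entity span over token indices."""
    start: int
    end: int
    label: str
    id: str = ""
    kb_id: str = ""

    @property
    def ent_id(self) -> str:
        # Rule 3: `ent_id` is an alias of `id`, not a view of a token's value.
        return self.id


def set_ents(tokens: List[ToyToken], ents: List[ToySpan]) -> None:
    """Toy version of assigning entities to a document."""
    # Rule 2: clear stale ent_id / ent_kb_id on every token first.
    for token in tokens:
        token.ent_id = ""
        token.ent_kb_id = ""
    # Rule 1: copy the span's id (and kb_id) onto each covered token.
    for span in ents:
        for token in tokens[span.start:span.end]:
            token.ent_id = span.id
            token.ent_kb_id = span.kb_id


tokens = [ToyToken(t) for t in ["Apple", "opened", "in", "San", "Francisco"]]
tokens[1].ent_id = "stale"  # leftover annotation that should be reset
set_ents(tokens, [ToySpan(3, 5, "GPE", id="sf", kb_id="Q62")])
```

After the call, the stale value on `tokens[1]` is cleared and both tokens of the span carry the span's `id` and `kb_id`.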
Sofie Van Landeghem
6e20842370
dev docs: numeric comparators (#11334)
* add section on numeric comparators

* edit

* prettier

* Update extra/DEVELOPER_DOCS/Code Conventions.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* note on typing imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-22 15:52:53 +02:00
Sofie Van Landeghem
1a5be63715
Cleanup Cython structs (#11337)
* cleanup Tokenizer fields

* remove unused object from vocab

* remove IS_OOV_DEPRECATED

* add back in as FLAG13

* FLAG 18 instead

* import fix

* fix clumsy fingers

* revert symbol changes in favor of #11352

* bint instead of bool
2022-08-22 15:52:24 +02:00
Adriane Boyd
f55bb7470d
Clean up warnings in the test suite (#11331) 2022-08-22 12:04:30 +02:00
Paul O'Leary McCann
0f07defe2c
Remove reference to voting on issue (#11335)
Not clear which issue this refers to, we don't suggest this for any
other issues, and we don't use votes in general.
2022-08-22 11:29:05 +02:00
Adriane Boyd
04c6e5cb95
Improve floret vectors display in pipeline docs (#11343) 2022-08-22 11:28:13 +02:00
Adriane Boyd
5fa8f4faca
Switch ru and uk lemmatizers to pymorphy3 (#11345)
* Switch ru and uk lemmatizers to pymorphy3

* Switch to pymorphy3 in tests
2022-08-22 11:27:14 +02:00
Adriane Boyd
3e4cf1bbe1
Check for . in factory names (#11336) 2022-08-19 09:52:12 +02:00
Adriane Boyd
09b3118b26
Add uk pipelines to website (#11332) 2022-08-18 14:04:57 +02:00
Sofie Van Landeghem
cab263791f
include span_ruler for default warning filter (#11333) 2022-08-17 19:55:54 +02:00
Adriane Boyd
d757dec5c4
Remove intify_attrs(_do_deprecated) (#11319) 2022-08-17 12:13:54 +02:00
Peter Baumgartner
db7b9938a4
Docs: displaCy documentation - data types, parse_{deps,ents,spans}, spans example (#10950)
* add in spans example and parse references

* rm autoformatter

* rm extra ents copy

* TypedDict draft

* type fixes

* restore non-documentation files

* docs update

* fix spans example

* fix hyperlinks

* add parse example

* example fix + argument fix

* fix api arg in docs

* fix bad variable replacement

* fix spacing in style

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* fix spacing on table

* fix spacing on table

* rm temp files

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-16 11:23:34 -04:00
antonpibm
551e73ccfc
Match private networks as URLs (#11121) 2022-08-11 11:26:26 +02:00
Sofie Van Landeghem
5d54c0e32a
Rename modules for consistency (#11286)
* rename Python module to entity_ruler

* rename Python module to attribute_ruler
2022-08-10 11:44:05 +02:00
Adriane Boyd
ed4ad309e6
Fix Dutch noun chunks to skip overlapping spans (#11275)
* Add test for overlapping noun chunks

* Skip overlapping noun chunks

* Update spacy/tests/lang/nl/test_noun_chunks.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-08-10 09:49:08 +02:00
Paul O'Leary McCann
231a17817d
Clean up automated label-based issue handling (#11284)
* Clean up automated label-based issue handling

1. upgrade tiangolo/issue-manager to latest
2. move needs-more-info to tiangolo
3. change needs-more-info close time to 7 days
4. delete old needs-more-info config

* Use old, longer message

* Fix label name
2022-08-09 14:50:50 +02:00
Adriane Boyd
e700358ba0
Add W605 to the errors raised by flake8 in the CI (#11283) 2022-08-09 12:15:13 +02:00
Adriane Boyd
fc4246558b
Fix regex invalid escape sequences (#11276) 2022-08-09 10:59:36 +02:00
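
For context, flake8's W605 flags invalid escape sequences in ordinary string literals, which most often show up in regex patterns. A minimal sketch of the usual fix follows; the pattern and the `find_decimals` helper are hypothetical examples, not code taken from the commit.

```python
import re

# Writing the pattern as a plain (non-raw) literal with single backslashes
# would contain the invalid escapes \d and \. and trigger W605.
# Two equivalent ways to spell the same pattern without the warning:
RAW_PATTERN = r"\d+\.\d+"        # raw string: backslashes are kept literally
ESCAPED_PATTERN = "\\d+\\.\\d+"  # doubled backslashes: identical string value


def find_decimals(text: str) -> list:
    """Return all decimal-looking substrings such as '3.4'."""
    return re.findall(RAW_PATTERN, text)
```

The raw-string form is generally the more readable of the two for regular expressions.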
stefawolf
23749cfc91
adding spans to doc_annotation in Example.to_dict (#11261)
* adding spans to doc_annotation in Example.to_dict

* to_dict compatible with from_dict: tuples instead of spans

* use strings for label and kb_id

* Simplify test

* Update data formats docs

Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 12:26:38 +02:00
Luka Dragar
b64243ed55
Updates to Slovenian language (#11162)
* Added examples for Slovene

* Update spacy/lang/sl/examples.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Corrected a typo in one of the sentences

* Updated support for Slovenian

* Some minor changes to corrections

* Added forint currency

* Corrected HYPHENS_PERMITTED regex and some formatting

* Minor changes

* Un-xfail tokenizer test

* Format

Co-authored-by: Luka Dragar <D20124481@mytudublin.ie>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-08-05 10:10:18 +02:00
Adriane Boyd
b5d9d0897e
Merge pull request #11270 from adrianeboyd/chore/update-develop-v3.5
Prepare develop for v3.5
2022-08-04 21:17:26 +02:00
Adriane Boyd
a3f6d6bce1 Merge remote-tracking branch 'upstream/master' into develop 2022-08-04 18:19:28 +02:00
Adriane Boyd
b07708d5d0
Support full prerelease versions in the compat table (#11228)
* Support full prerelease versions in the compat table

* Fix types
2022-08-04 15:14:19 +02:00
Jules Belveze
cd09614ab2
chore: add 'concepCy' to spacy universe (#11255)
* chore: add 'concepCy' to spacy universe

* docs: add 'slogan' to concepCy
2022-08-04 15:42:38 +09:00
Lj Miranda
d993df41e5
Update docs for pipeline initialize() methods (#11221)
* Update documentation for dependency parser

* Update documentation for trainable_lemmatizer

* Update documentation for entity_linker

* Update documentation for ner

* Update documentation for morphologizer

* Update documentation for senter

* Update documentation for spancat

* Update documentation for tagger

* Update documentation for textcat

* Update documentation for tok2vec

* Run prettier on edited files

* Apply similar changes in transformer docs

* Remove need to say annotated example explicitly

I removed the need to say "Must contain at least one annotated Example"
because it's often a given that Examples will contain some gold-standard
annotation.

* Run prettier on transformer docs
2022-08-03 16:53:02 +02:00
Adriane Boyd
d0578c2ede
Add scorer to textcat API docs config settings (#11263) 2022-08-03 16:41:20 +02:00
Daniël de Kok
e581eeac34
precompute_hiddens/Parser: look up CPU ops once (v4) (#11068)
* precompute_hiddens/Parser: look up CPU ops once

* precompute_hiddens: make cpu_ops private
2022-07-29 15:12:19 +02:00
Daniël de Kok
b2d05f9f66
Merge pull request #11242 from danieldk/merge-master-v4-20220728
Merge `master` into `v4`
2022-07-29 09:17:02 +02:00
Daniël de Kok
1ff683a50b Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728 2022-07-28 13:53:59 +02:00
Paul O'Leary McCann
2d89dd9db8
Update natto-py version spec (#11222)
* Update natto-py version spec

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-07-28 07:45:02 +02:00
ninjalu
95a1b8aca6
add additional REL_OP (#10371)
* add additional REL_OP

* change to condition and new rel_op symbols

* add operators to docs

* add the anchor while we're in here

* add tests

Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>
2022-07-27 13:16:44 +02:00
Madeesh Kannan
1829d7120a
ExplosionBot: Add note about case-sensitivity (#11211) 2022-07-27 14:24:22 +09:00
Edward
360a702ecd
Add parent argument (#11210) 2022-07-26 14:35:18 +02:00
Adriane Boyd
5c2a00cef0
Set version to v3.4.1 (#11209) 2022-07-26 12:52:38 +02:00
Adriane Boyd
c8f5b752bb
Add link to developer docs code conventions (#11171) 2022-07-26 10:56:53 +02:00
Daniël de Kok
4ee8a06149
Fix compatibility with CuPy 9.x (#11194)
After the precomputable affine table of shape [nB, nF, nO, nP] is
computed, padding with shape [1, nF, nO, nP] is assigned to the first
row of the precomputed affine table. However, when we are indexing the
precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions
before 10.0 cannot paper over this shape difference.

This change fixes compatibility with CuPy < 10.0 by squeezing the first
dimension of the padding before assignment.
2022-07-26 10:52:01 +02:00
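
The shape mismatch described in the commit above can be followed with plain shape bookkeeping. This is a sketch in pure Python that tracks only shapes as tuples; the helper names and the concrete sizes are hypothetical, and no array library is involved.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def row_shape(table_shape: Shape) -> Shape:
    """Shape of the slot selected by indexing the first axis with one integer."""
    return table_shape[1:]


def squeeze_first(shape: Shape) -> Shape:
    """Drop a leading axis of length 1, as a squeeze on axis 0 would."""
    if shape[0] != 1:
        raise ValueError("first dimension must have length 1")
    return shape[1:]


# Hypothetical sizes, chosen only for illustration.
nB, nF, nO, nP = 8, 3, 4, 2

table_shape = (nB, nF, nO, nP)   # precomputed affine table
padding_shape = (1, nF, nO, nP)  # padding as described in the commit message

target = row_shape(table_shape)  # shape of the first row: (nF, nO, nP)
needs_squeeze = padding_shape != target
fixed = squeeze_first(padding_shape)
```

Here `padding_shape` differs from `target` only by the leading axis of length 1, so squeezing that axis makes the two shapes match exactly and the assignment no longer relies on the library to reconcile them.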
Adriane Boyd
36ff2a5441
Merge pull request #11200 from adrianeboyd/chore/reenable-model-tests
Revert "Temporarily skip tests that require models/compat"
2022-07-25 20:13:44 +02:00
Adriane Boyd
e5990db713 Revert "Temporarily skip tests that require models/compat"
This reverts commit d9320db7db.
2022-07-25 18:12:18 +02:00
Paul O'Leary McCann
1c12812d1a
Replace link to old label (#11188) 2022-07-25 16:39:34 +09:00
Adriane Boyd
7a99fe3c65
Move sent-patterns to correct section of universe.json (#11192) 2022-07-25 09:14:50 +02:00