spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-16 17:12:38 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	698b8b495f	Update/remove old Matcher syntax (#11370 ) * Clean up old Matcher call style related stuff In v2 Matcher.add was called with (key, on_match, patterns). In v3 this was changed to `(key, patterns, *, on_match=None)`, but there were various points where the old call syntax was documented or handled specially. This removes all those. The Matcher itself didn't need any code changes, as it just gives a generic type error. However the PhraseMatcher required some changes because it would automatically "fix" the old call style. Surprisingly, the tokenizer was still using the old call style in one place. After these changes tests failed in two places: 1. one test for the "new" call style, including the "old" call style. I removed this test. 2. deserializing the PhraseMatcher fails because the input docs are a set. I am not sure why 2 is happening - I guess it's a quirk of the serialization format? - so for now I just convert the set to a list when deserializing. The check that the input Docs are a List in the PhraseMatcher is a new check, but makes it parallel with the other Matchers, which seemed like the right thing to do. * Add notes related to input docs / deserialization type * Remove Typing import * Remove old note about call style change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Use separate method for setting internal doc representations In addition to the title change, this changes the internal dict to be a defaultdict, instead of a dict with frequent use of setdefault. * Add _add_from_arrays for unpickling * Cleanup around adding from arrays This moves adding to internal structures into the private batch method, and removes the single-add method. This has one behavioral change for `add`, in that if something is wrong with the list of input Docs (such as one of the items not being a Doc), valid items before the invalid one will not be added. Also the callback will not be updated if anything is invalid. This change should not be significant. This also adds a test to check failure when given a non-Doc. * Update spacy/matcher/phrasematcher.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-30 15:40:31 +02:00
Adriane Boyd	98a916e01a	Make stable private modules public and adjust names (#11353 ) * Make stable private modules public and adjust names * `spacy.ml._character_embed` -> `spacy.ml.character_embed` * `spacy.ml._precomputable_affine` -> `spacy.ml.precomputable_affine` * `spacy.tokens._serialize` -> `spacy.tokens.doc_bin` * `spacy.tokens._retokenize` -> `spacy.tokens.retokenize` * `spacy.tokens._dict_proxies` -> `spacy.tokens.span_groups` * Skip _precomputable_affine * retokenize -> retokenizer * Fix imports	2022-08-30 13:56:35 +02:00
Adriane Boyd	4bce8fa755	Remove setup_requires from setup.cfg (#11384 ) * Remove setup_requires from setup.cfg * Update requirements test to ignore cython in setup.cfg	2022-08-29 13:23:24 +02:00
Adriane Boyd	2a558a7cdc	Switch to mecab-ko as default Korean tokenizer (#11294 ) * Switch to mecab-ko as default Korean tokenizer Switch to the (confusingly-named) mecab-ko python module for default Korean tokenization. Maintain the previous `natto-py` tokenizer as `spacy.KoreanNattoTokenizer.v1`. * Temporarily run tests with mecab-ko tokenizer * Fix types * Fix duplicate test names * Update requirements test * Revert "Temporarily run tests with mecab-ko tokenizer" This reverts commit `d2083e7044`. * Add mecab_args setting, fix pickle for KoreanNattoTokenizer * Fix length check * Update docs * Formatting * Update natto-py error message Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com> Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2022-08-26 10:11:18 +02:00
Adriane Boyd	1eb7ce5ef7	Merge pull request #11377 from adrianeboyd/chore/update-v4-from-develop-2 Update v4 from develop	2022-08-25 08:26:55 +02:00
Adriane Boyd	740c33fe58	Merge remote-tracking branch 'upstream/develop' into chore/update-v4-from-develop	2022-08-24 20:43:07 +02:00
Adriane Boyd	6fd3b4d9d6	Merge pull request #11375 from adrianeboyd/chore/update-develop-from-master-v3.5-1 Update develop from master for v3.5	2022-08-24 20:41:25 +02:00
Adriane Boyd	81874265e9	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.5-1	2022-08-24 12:47:42 +02:00
Sofie Van Landeghem	8dd1fa9896	Merge pull request #11366 from adrianeboyd/chore/update-v4-from-master Update v4 from master	2022-08-24 09:45:55 +02:00
Adriane Boyd	c44d243f25	Merge remote-tracking branch 'upstream/master' into chore/update-v4-from-master	2022-08-24 07:15:41 +02:00
Tobius Saul	c09d2fa25b	luganda language extension (#10847 ) * luganda language extension * __init__.py changes * New enhancements * Lexical attribute changed * punctuaction and sentence additions * Remove comment header * Fix typos, reformat * reformated version * Add tokenizer test * Remove contractions from stop words * Format * Add Luganda to website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-23 13:09:36 +02:00
Edward	5afa98aabf	Support custom attributes for tokens and spans in json conversion (#11125 ) * Add token and span custom attributes to to_json() * Change logic for to_json * Add functionality to from_json * Small adjustments * Move token/span attributes to new dict key * Fix test * Fix the same test but much better * Add backwards compatibility tests and adjust logic * Add test to check if attributes not set in underscore are not saved in the json * Add tests for json compatibility * Adjust test names * Fix tests and clean up code * Fix assert json tests * small adjustment * adjust naming and code readability * Adjust naming, added more tests and changed logic * Fix typo * Adjust errors, naming, and small test optimization * Fix byte tests * Fix bytes tests * Change naming and json structure * update schema * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update schema for underscore attributes * Adjust underscore schema * adjust schema tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-23 10:05:02 +02:00
Tal Zussman	7e75327893	Fix menu order in linguistic-features.md (#11364 ) Swap 'Vectors & Similarity' and 'Mappings & Exceptions' in menu to match order in body	2022-08-23 14:40:38 +09:00
Adriane Boyd	bb0e178878	Make Span/Doc.ents more consistent for ent_kb_id and ent_id (#11328 ) * Map `Span.id` to `Token.ent_id` in all cases when setting `Doc.ents` * Reset `Token.ent_id` and `Token.ent_kb_id` when setting `Doc.ents` * Make `Span.ent_id` an alias of `Span.id` rather than a read-only view of the root token's `ent_id` annotation	2022-08-22 20:28:57 +02:00
Sofie Van Landeghem	6e20842370	dev docs: numeric comparators (#11334 ) * add section on numeric comparators * edit * prettier * Update extra/DEVELOPER_DOCS/Code Conventions.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * note on typing imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-22 15:52:53 +02:00
Sofie Van Landeghem	1a5be63715	Cleanup Cython structs (#11337 ) * cleanup Tokenizer fields * remove unused object from vocab * remove IS_OOV_DEPRECATED * add back in as FLAG13 * FLAG 18 instead * import fix * fix clumpsy fingers * revert symbol changes in favor of #11352 * bint instead of bool	2022-08-22 15:52:24 +02:00
Adriane Boyd	f55bb7470d	Clean up warnings in the test suite (#11331 )	2022-08-22 12:04:30 +02:00
Paul O'Leary McCann	0f07defe2c	Remove reference to voting on issue (#11335 ) Not clear which issue this refers to, we don't suggest this for any other issues, and we don't use votes in general.	2022-08-22 11:29:05 +02:00
Adriane Boyd	04c6e5cb95	Improve floret vectors display in pipeline docs (#11343 )	2022-08-22 11:28:13 +02:00
Adriane Boyd	5fa8f4faca	Switch ru and uk lemmatizers to pymorphy3 (#11345 ) * Switch ru and uk lemmatizers to pymorphy3 * Switch to pymorphy3 in tests	2022-08-22 11:27:14 +02:00
Adriane Boyd	3e4cf1bbe1	Check for . in factory names (#11336 )	2022-08-19 09:52:12 +02:00
Adriane Boyd	09b3118b26	Add uk pipelines to website (#11332 )	2022-08-18 14:04:57 +02:00
Sofie Van Landeghem	cab263791f	include span_ruler for default warning filter (#11333 )	2022-08-17 19:55:54 +02:00
Adriane Boyd	d757dec5c4	Remove intify_attrs(_do_deprecated) (#11319 )	2022-08-17 12:13:54 +02:00
Peter Baumgartner	db7b9938a4	Docs: displaCy documentation - data types, `parse_{deps,ents,spans}`, spans example (#10950 ) * add in spans example and parse references * rm autoformatter * rm extra ents copy * TypedDict draft * type fixes * restore non-documentation files * docs update * fix spans example * fix hyperlinks * add parse example * example fix + argument fix * fix api arg in docs * fix bad variable replacement * fix spacing in style Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * fix spacing on table * fix spacing on table * rm temp files Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-08-16 11:23:34 -04:00
antonpibm	551e73ccfc	Match private networks as URLs (#11121 )	2022-08-11 11:26:26 +02:00
Sofie Van Landeghem	5d54c0e32a	Rename modules for consistency (#11286 ) * rename Python module to entity_ruler * rename Python module to attribute_ruler	2022-08-10 11:44:05 +02:00
Adriane Boyd	ed4ad309e6	Fix Dutch noun chunks to skip overlapping spans (#11275 ) * Add test for overlapping noun chunks * Skip overlapping noun chunks * Update spacy/tests/lang/nl/test_noun_chunks.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-08-10 09:49:08 +02:00
Paul O'Leary McCann	231a17817d	Clean up automated label-based issue handling (#11284 ) * Clean up automated label-based issue handline 1. upgrade tiangolo/issue-manager to latest 2. move needs-more-info to tiangolo 3. change needs-more-info close time to 7 days 4. delete old needs-more-info config * Use old, longer message * Fix label name	2022-08-09 14:50:50 +02:00
Adriane Boyd	e700358ba0	Add W605 to the errors raised by flake8 in the CI (#11283 )	2022-08-09 12:15:13 +02:00
Adriane Boyd	fc4246558b	Fix regex invalid escape sequences (#11276 )	2022-08-09 10:59:36 +02:00
stefawolf	23749cfc91	adding spans to doc_annotation in Example.to_dict (#11261 ) * adding spans to doc_annotation in Example.to_dict * to_dict compatible with from_dict: tuples instead of spans * use strings for label and kb_id * Simplify test * Update data formats docs Co-authored-by: Stefanie Wolf <stefanie.wolf@vitecsoftware.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-05 12:26:38 +02:00
Luka Dragar	b64243ed55	Updates to Slovenian language (#11162 ) * Added examples for Slovene * Update spacy/lang/sl/examples.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Corrected a typo in one of the sentences * Updated support for Slovenian * Some minor changes to corrections * Added forint currency * Corrected HYPHENS_PERMITTED regex and some formatting * Minor changes * Un-xfail tokenizer test * Format Co-authored-by: Luka Dragar <D20124481@mytudublin.ie> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-08-05 10:10:18 +02:00
Adriane Boyd	b5d9d0897e	Merge pull request #11270 from adrianeboyd/chore/update-develop-v3.5 Prepare develop for v3.5	2022-08-04 21:17:26 +02:00
Adriane Boyd	a3f6d6bce1	Merge remote-tracking branch 'upstream/master' into develop	2022-08-04 18:19:28 +02:00
Adriane Boyd	b07708d5d0	Support full prerelease versions in the compat table (#11228 ) * Support full prerelease versions in the compat table * Fix types	2022-08-04 15:14:19 +02:00
Jules Belveze	cd09614ab2	chore: add 'concepCy' to spacy universe (#11255 ) * chore: add 'concepCy' to spacy universe * docs: add 'slogan' to concepCy	2022-08-04 15:42:38 +09:00
Lj Miranda	d993df41e5	Update docs for pipeline initialize() methods (#11221 ) * Update documentation for dependency parser * Update documentation for trainable_lemmatizer * Update documentation for entity_linker * Update documentation for ner * Update documentation for morphologizer * Update documentation for senter * Update documentation for spancat * Update documentation for tagger * Update documentation for textcat * Update documentation for tok2vec * Run prettier on edited files * Apply similar changes in transformer docs * Remove need to say annotated example explicitly I removed the need to say "Must contain at least one annotated Example" because it's often a given that Examples will contain some gold-standard annotation. * Run prettier on transformer docs	2022-08-03 16:53:02 +02:00
Adriane Boyd	d0578c2ede	Add scorer to textcat API docs config settings (#11263 )	2022-08-03 16:41:20 +02:00
Daniël de Kok	e581eeac34	precompute_hiddens/Parser: look up CPU ops once (v4) (#11068 ) * precompute_hiddens/Parser: look up CPU ops once * precompute_hiddens: make cpu_ops private	2022-07-29 15:12:19 +02:00
Daniël de Kok	b2d05f9f66	Merge pull request #11242 from danieldk/merge-master-v4-20220728 Merge `master` into `v4`	2022-07-29 09:17:02 +02:00
Daniël de Kok	1ff683a50b	Merge remote-tracking branch 'upstream/master' into merge-master-v4-20220728	2022-07-28 13:53:59 +02:00
Paul O'Leary McCann	2d89dd9db8	Update natto-py version spec (#11222 ) * Update natto-py version spec * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-07-28 07:45:02 +02:00
ninjalu	95a1b8aca6	add additional REL_OP (#10371 ) * add additional REL_OP * change to condition and new rel_op symbols * add operators to docs * add the anchor while we're in here * add tests Co-authored-by: Peter Baumgartner <5107405+pmbaumgartner@users.noreply.github.com>	2022-07-27 13:16:44 +02:00
Madeesh Kannan	1829d7120a	`ExplosionBot`: Add note about case-sensitivity (#11211 )	2022-07-27 14:24:22 +09:00
Edward	360a702ecd	Add parent argument (#11210 )	2022-07-26 14:35:18 +02:00
Adriane Boyd	5c2a00cef0	Set version to v3.4.1 (#11209 )	2022-07-26 12:52:38 +02:00
Adriane Boyd	c8f5b752bb	Add link to developer docs code conventions (#11171 )	2022-07-26 10:56:53 +02:00
Daniël de Kok	4ee8a06149	Fix compatibility with CuPy 9.x (#11194 ) After the precomputable affine table of shape [nB, nF, nO, nP] is computed, padding with shape [1, nF, nO, nP] is assigned to the first row of the precomputed affine table. However, when we are indexing the precomputed table, we get a row of shape [nF, nO, nP]. CuPy versions before 10.0 cannot paper over this shape difference. This change fixes compatibility with CuPy < 10.0 by squeezing the first dimension of the padding before assignment.	2022-07-26 10:52:01 +02:00
Adriane Boyd	36ff2a5441	Merge pull request #11200 from adrianeboyd/chore/reenable-model-tests Revert "Temporarily skip tests that require models/compat"	2022-07-25 20:13:44 +02:00
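
Note on commit 698b8b495f (Matcher call style): the message says the v3 signature is (key, patterns, *, on_match=None) and that the old v2 call style now just produces a generic type error. The sketch below illustrates that behaviour with a plain Python class. `ToyMatcher` is a hypothetical stand-in written for this note, not spaCy's Matcher, and it only stores patterns rather than matching anything. It also uses a defaultdict for the internal mapping, as the commit message describes for the PhraseMatcher.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Pattern = List[Dict[str, str]]


class ToyMatcher:
    """Hypothetical stand-in that mimics only the v3 `add` signature."""

    def __init__(self) -> None:
        # defaultdict avoids repeated setdefault calls when adding patterns.
        self.patterns: DefaultDict[str, List[Pattern]] = defaultdict(list)
        self.callbacks: Dict[str, Optional[Callable]] = {}

    def add(
        self,
        key: str,
        patterns: List[Pattern],
        *,  # everything after this marker must be passed by keyword
        on_match: Optional[Callable] = None,
    ) -> None:
        self.patterns[key].extend(patterns)
        self.callbacks[key] = on_match


matcher = ToyMatcher()
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]

# v3 call style: a list of patterns, callback passed by keyword.
matcher.add("HELLO", [pattern])
matcher.add("HELLO_CB", [pattern], on_match=print)

# v2 call style (key, on_match, pattern): a third positional argument
# is rejected by Python itself, so no special handling is needed.
try:
    matcher.add("OLD", None, pattern)
    old_style_error = None
except TypeError as err:
    old_style_error = err
```

Because the rejection comes from the keyword-only marker, the error is raised before the method body runs, so nothing is stored under the "OLD" key.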


15612 Commits
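
Note on commit 4ee8a06149 (CuPy 9.x compatibility): the message describes assigning padding of shape [1, nF, nO, nP] to a row of shape [nF, nO, nP], and fixing the mismatch by squeezing the first dimension of the padding before assignment. The sketch below shows that shape handling only. It uses NumPy as a stand-in for CuPy, the sizes are made up, and the variable names `table` and `padding` are assumptions chosen for this note rather than names from the spaCy source.

```python
import numpy as np

# Illustrative sizes only; the real values depend on the model.
nB, nF, nO, nP = 4, 3, 2, 2

# Stand-ins for the precomputed affine table and the padding.
table = np.zeros((nB, nF, nO, nP), dtype="float32")
padding = np.ones((1, nF, nO, nP), dtype="float32")

# Indexing the table yields a row without the leading axis.
row_shape = table[0].shape  # (nF, nO, nP)

# Drop the leading length-1 axis so both sides have the same shape,
# instead of relying on the array library to reconcile the difference.
squeezed = np.squeeze(padding, axis=0)
table[0] = squeezed
```

Squeezing explicitly makes the assignment independent of how lenient a given array library version is about a leading length-1 axis.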