spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-26 17:24:41 +03:00

Author	SHA1	Message	Date
Adriane Boyd	30d4eb506a	Fix setting empty entities in Example.from_dict (#8426 )	2021-06-18 10:41:50 +02:00
Matthew Honnibal	6f5e308d17	Support negative examples in partial NER annotations (#8106 ) * Support a cfg field in transition system * Make NER 'has gold' check use right alignment for span * Pass 'negative_samples_key' property into NER transition system * Add field for negative samples to NER transition system * Check neg_key in NER has_gold * Support negative examples in NER oracle * Test for negative examples in NER * Fix name of config variable in NER * Remove vestiges of old-style partial annotation * Remove obsolete tests * Add comment noting lack of support for negative samples in parser * Additions to "neg examples" PR (#8201) * add custom error and test for deprecated format * add test for unlearning an entity * add break also for Begin's cost * add negative_samples_key property on Parser * rename * extend docs & fix some older docs issues * add subclass constructors, clean up tests, fix docs * add flaky test with ValueError if gold parse was not found * remove ValueError if n_gold == 0 * fix docstring * Hack in environment variables to try out training * Remove hack * Remove NER hack, and support 'negative O' samples * Fix O oracle * Fix transition parser * Remove 'not O' from oracle * Fix NER oracle * check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation * use set instead of list in consistency check Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-17 17:33:00 +10:00
Adriane Boyd	02bac8f269	Fix non-deterministic deduplication in Greek lemmatizer (#8421 )	2021-06-17 09:11:01 +02:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862 ) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
Giovanni Toffoli	19521d525b	Added Italian POS-aware lemmatizer. (#8079 ) * Added Italian POS-aware lemmatizer. Also added the code used to build the lookup tables by POS. * Create gtoffoli.md * Add imports and format * Remove helper script * Use lemma_lookup instead of lemma_lookup_legacy Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-16 11:14:45 +02:00
Antti Ajanki	5a6125c227	[Finnish tokenizer] Handle conjunction contractions (#8105 )	2021-06-16 10:56:47 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Adriane Boyd	480a3bf3be	Make JsonlReader path optional (#8396 ) To avoid config errors during training when `[corpora.pretrain.path]` is `None` with the default `spacy.JsonlCorpus.v1` reader, make the reader path optional, similar to `spacy.Corpus.v1`.	2021-06-15 14:55:15 +02:00
Paul O'Leary McCann	94e1346f44	Change span lemmas to use original whitespace (fix #8368 ) (#8391 ) * Change span lemmas to use original whitespace (fix #8368) This is a redo of #8371 based off master. The test for this required some changes to existing tests. I don't think the changes were significant but I'd like someone to check them. * Remove mystery docstring This sentence was uncompleted for years, and now we will never know how it ends.	2021-06-15 13:24:54 +02:00
Paul O'Leary McCann	2c105cdbce	Raise error if deps not provided with heads (#8335 ) * Fill in deps if not provided with heads Before this change, if heads were passed without deps they would be silently ignored, which could be confusing. See #8334. * Use "dep" instead of a blank string This is the customary placeholder dep. It might be better to show an error here instead though. * Throw error on heads without deps * Add a test * Fix tests * Formatting * Fix all tests * Fix a test I missed * Revise error message * Clean up whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-15 13:23:32 +02:00
Sofie Van Landeghem	0fd0d949c4	fix 's typo's across code base (#8384 )	2021-06-15 10:57:08 +02:00
Adriane Boyd	6b69b8934b	Set version to v3.1.0.dev0 (#8379 )	2021-06-14 11:17:35 +02:00
Sofie Van Landeghem	8729307e67	register extract_ngrams layer (#8358 ) * register extract_ngrams layer * fix import * bump spacy-legacy to 3.0.6 * revert bump (wrong PR)	2021-06-14 10:30:30 +02:00
Adriane Boyd	b98d216205	Update Catalan language data (#8308 ) * Update Catalan language data Update Catalan language data based on contributions from the Text Mining Unit at the Barcelona Supercomputing Center: https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data * Update tokenizer settings for UD Catalan AnCora Update for UD Catalan AnCora v2.7 with merged multi-word tokens. * Update test * Move prefix patternt to more generic infix pattern * Clean up	2021-06-11 10:21:22 +02:00
Adriane Boyd	d9be9e6cf9	Move README.md and LICENSES_SOURCES in package (#8297 ) In addition to `LICENSE`, move the files `README.md` and `LICENSES_SOURCES` to the top directory in `spacy package` if present in the model directory.	2021-06-11 10:20:24 +02:00
Adriane Boyd	f4008bdb13	Restrict pymorphy2 requirement to pymorphy2 mode (#8299 ) For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2` requirement to the mode `pymorphy2` so that lookup or other lemmatizer modes can be loaded without installing `pymorphy2`.	2021-06-11 10:19:22 +02:00
graue70	f34dd0b98f	Fix typos in comments (#8279 )	2021-06-07 10:43:54 +02:00
Adriane Boyd	9dfd3c9484	Use warnings.warn instead of logger.warning	2021-06-04 17:44:08 +02:00
Sofie Van Landeghem	f0277bdeab	Show warning if entity_ruler runs without patterns (#7807 ) * Show warning if entity_ruler runs without patterns * Show warning if matcher runs without patterns * fix wording * unit test for warning once (WIP) * warn W036 only once * cleanup * create filter_warning helper	2021-06-04 17:37:38 +02:00
Paul O'Leary McCann	d959603d51	Don't add duplicate patterns all the time in EntityRuler (fix #8216 ) (#8246 ) * Don't add duplicate patterns (fix #8216) * Refactor EntityRuler init This simplifies the EntityRuler init code. This is helpful as prep for allowing the EntityRuler to reset itself. * Make EntityRuler.clear reset matchers Includes a new test for this. * Tidy PhraseMatcher instantiation Since the attr can be None safely now, the guard if is no longer required here. Also renamed the `_validate` attr. Maybe it's not needed? * Fix NER test * Add test to make sure patterns aren't increasing * Move test to regression tests	2021-06-03 09:05:26 +02:00
Jean-Hugues Roy	ff5cf3606c	Improvements to French stopwords list (#7941 ) * "y" etc. Many changes described in pull request * Update spacy/lang/fr/stop_words.py * Update spacy/lang/fr/stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-02 11:50:49 +02:00
Vito De Tullio	3672464e25	applying suggestion to avoid mypy errors (#8265 ) * applying suggestion to avoid mypy errors * sign contributor agreement	2021-06-02 19:25:30 +10:00
Adriane Boyd	4aa1a7d5a3	Remove unsupported attrs from attrs.IDS (#8132 ) The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by `Lexeme.get_struct_attr` so should not be included through `attrs.IDS` as supported attributes in `Doc.to_array` and other methods.	2021-06-02 19:16:57 +10:00
Paul O'Leary McCann	d54631f68b	Fix other open calls without context managers (#8245 )	2021-05-31 19:04:29 +10:00
Dhruv Naik	283f64a98d	Fix bug from Entityruler: ent_ids returns None for phrases (#8169 ) * bugfix for explosion/spaCy#8168 * add test for explosion/spaCy#8168	2021-05-31 18:38:53 +10:00
Narayan Acharya	6b79714080	Address missing config overrides post load of models (#8208 )	2021-05-31 18:36:52 +10:00
Sofie Van Landeghem	fff662e41f	Ensemble textcat with listener (#8012 ) * add unit test for two listeners, with a textcat ensemble in the middle * return zero gradients instead of None in accumulate_gradient	2021-05-31 18:21:06 +10:00
Sofie Van Landeghem	ff91e6dac7	Show warning if entity_ruler runs without patterns (#7807 ) * Show warning if entity_ruler runs without patterns * Show warning if matcher runs without patterns * fix wording * unit test for warning once (WIP) * warn W036 only once * cleanup * create filter_warning helper	2021-05-31 18:20:27 +10:00
Paul O'Leary McCann	d1a221a374	Add all symbols in Unicode Currency Symbols block (#8212 ) * Add all symbols in Unicode Currency Symbols block In #8102 it came up that the rupee symbol was treated different from dollar / euro / yen symbols. This adds many symbols not already included. * Fix test * Fix training test	2021-05-31 18:03:40 +10:00
Paul O'Leary McCann	04239e94c7	Use a context manager when reading model (fix #7036 ) (#8244 )	2021-05-31 17:36:17 +10:00
Ines Montani	5957ab74f7	Merge pull request #8112 from svlandeg/bugfix/replace-trf	2021-05-28 11:35:17 +10:00
Sofie Van Landeghem	3c58c0323f	fix docs (#8200 )	2021-05-27 10:48:59 +02:00
Sofie Van Landeghem	290bd6ed39	ensure tolerance is properly passed on (#8158 )	2021-05-27 18:10:28 +10:00
Adriane Boyd	cd6bd91c3a	Switch default train corpus max_length to 0 in quickstart (#8142 ) The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length != 0` that `0` is a better default for users creating a new config with the quickstart. If not, documents are skipped, sometimes the entire corpus is skipped, and sometimes documents are (quite unexpectedly for your average user) split into sentences.	2021-05-20 14:48:09 +02:00
Sofie Van Landeghem	202943bc8c	KB & NEL to/from bytes (#8113 ) * unit test for pickling KB * add pickling test for NEL * KB to_bytes and from_bytes * NEL to_bytes and from_bytes * xfail pickle tests for now * fix docs * cleanup	2021-05-20 18:11:30 +10:00
Adriane Boyd	2c545c4c5b	Fix offsets in Span.get_lca_matrix (#8116 ) * Fix range in Span.get_lca_matrix Fix the adjusted token index / lca matrix index ranges for `_get_lca_matrix` for spans. * The range for `k` should correspond to the adjusted indices in `lca_matrix` with the `start` indexed at `0` * Update test for v3.x	2021-05-17 16:54:23 +02:00
Sofie Van Landeghem	0dffc5d9e2	Custom warning if the doc_bin is too large (#8069 ) * custom warning if the doc_bin is too large * cleanup * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * fix numbering * fixing numbering once more * fixing this seems to be pretty hard Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-05-17 15:48:40 +02:00
Adriane Boyd	b120fb3511	Handle errors while multiprocessing (#8004 ) * Handle errors while multiprocessing Handle errors while multiprocessing without hanging. * Return the traceback for errors raised while processing a batch, which can be handled by the top-level error handler * Allow for shortened batches due to custom error handlers that ignore errors and skip documents * Define custom components at a higher level * Also move up custom error handler * Use simpler component for test * Switch error type * Adjust test * Only call top-level error handler for exceptions * Register custom test components within tests Use global functions (so they can be pickled) but register the components only within the individual tests.	2021-05-17 13:28:39 +02:00
Adriane Boyd	8a2602051c	Update debug data for textcat (#8066 ) * Check for unsupported cats values * Only show labels if train/dev mismatched * Don't show label counts (only counting positive labels seems odd) * Use warnings for mismatched train/dev labels	2021-05-17 13:27:04 +02:00
Adriane Boyd	1d59fdbd39	Update Vietnamese tokenizer (#8099 ) * Adapt tokenization methods from `pyvi` to preserve text encoding and whitespace * Add serialization support similar to Chinese and Japanese Note: as for Chinese and Japanese, some settings are duplicated in `config.cfg` and `tokenizer/cfg`.	2021-05-17 18:16:20 +10:00
Adriane Boyd	fe3a4aa846	Add ENT_ID and NORM to DocBin strings (#8054 ) Save strings for token attributes `ENT_ID` and `NORM` in `DocBin` strings.	2021-05-17 18:06:11 +10:00
Adriane Boyd	82fa81d095	Make all Span attrs writable (#8062 ) Also allow `Span` string properties `label_` and `kb_id_` to be writable following #6696.	2021-05-17 18:05:45 +10:00
svlandeg	235e9f5488	call replace_listener_cfg attr if it's available	2021-05-12 17:19:38 +02:00
svlandeg	44a3a58599	call replace_listener attr if it's available	2021-05-12 16:01:02 +02:00
svlandeg	ece8be4fec	extend test to training with replaced tok2vec layer	2021-05-12 11:32:22 +02:00
Adriane Boyd	d5bbd1f94f	Handle partial entities in Span.as_doc (#8055 ) * Handle partial entities in Span.as_doc In `Span.as_doc` replace partial entities at the beginning or end of the span with missing entity annotation. Fixes a bug where invalid entity annotation (no initial `B`) was returned for an initial partial entity. * Check for empty span in ents conversion Note: `Span.as_doc()` will still fail on an empty span due to failures in `Span.vector`.	2021-05-11 17:10:16 +02:00
Paul O'Leary McCann	bdeaf3a18b	Fix/fix en ordinals (#8028 ) * Fix #8019 "th" is not the only ordinal ending. * Add some more ordinal tests	2021-05-07 10:26:42 +02:00
Adriane Boyd	6788d90f61	Preserve existing ENT_KB_ID annotation in NER (#7988 ) * Preserve existing ENT_KB_ID annotation in NER Preserve `ent_kb_id` annotation on existing entity spans, which is not preserved by the transition system. * Simplify kb_id assignment * Simplify further	2021-05-06 18:49:55 +10:00
Sofie Van Landeghem	02a6a5fea0	Fix 'debug model' for transformers + generalize (#7973 ) * add overrides to docs * fix debug model with transformer * assume training data is set in config	2021-05-06 18:43:32 +10:00
Adriane Boyd	cc5aeaed29	Add Chinese PTB tags to glossary (#7993 )	2021-05-06 18:43:03 +10:00

8684 Commits