spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-07 21:05:05 +03:00

Author	SHA1	Message	Date
Adriane Boyd	31528f62ed	Add / to nb infixes (#7991 )	2021-05-04 11:00:10 +02:00
Santiago Castro	e99ff6f255	Fix typo in Language docstrings (#7958 )	2021-05-03 14:44:09 +02:00
Ines Montani	12d3d0fedd	Fix quickstart default checked of conditional fields [ci skip]	2021-05-03 11:48:12 +10:00
Adriane Boyd	2320791f6d	Fix Transformer.initialize example (#7963 )	2021-04-30 12:21:31 +02:00
Adriane Boyd	cf032ec31e	Update to catalogue>=2.0.4 (#7951 )	2021-04-29 19:11:28 +02:00
Adriane Boyd	7cf5bd072f	Refactor util.to_ternary_int (#7944 ) * Refactor to avoid literal comparison with `is` * Extend tests	2021-04-29 16:58:54 +02:00
Sevdimali	49aed683cc	Azerbaijani language added (#7911 )	2021-04-28 14:42:02 +02:00
Adriane Boyd	f4080983ea	Extend to cupy 9.0.0 (#7914 )	2021-04-28 10:18:24 +02:00
Paul O'Leary McCann	8007d5c814	Check if the resume path points to a directory (#7919 ) This came up in #7878, but if --resume-path is a directory then loading the weights will fail. On Linux this will give a straightforward error message, but on Windows it gives "Permission Denied", which is confusing.	2021-04-28 09:17:15 +02:00
Paul O'Leary McCann	de6b5ed14d	Fix percent unk display in debug data (#7886 ) * Fix percent unk display This was showing (ratio %), so 10% would show as 0.10%. Fix by multiplying ratio by 100. Might want to add a warning if this is over a threshold. * Only show whole-integer percents	2021-04-27 09:16:35 +02:00
Janis Klaise	1690595e4d	Update load_lookups return type and docstring (#7907 ) * Update load_lookups return type and docstring * Add contributor agreement	2021-04-27 09:13:39 +02:00
Adriane Boyd	874cd02539	Set spacy-legacy to >=3.0.5 (#7897 ) Set `spacy-legacy` to `>=3.0.5` due to `spacy.StaticVectors.v1` init bug.	2021-04-26 17:06:32 +02:00
Adriane Boyd	df3444421a	Update spacy-legacy to >=3.0.4 (#7865 )	2021-04-23 12:16:12 +02:00
Adriane Boyd	8a95475b3d	Set version to v3.0.6 (#7854 )	2021-04-22 16:33:26 +02:00
Adriane Boyd	36ecba224e	Set up GPU CI testing (#7293 ) * Set up CI for tests with GPU agent * Update tests for enabled GPU * Fix steps filename * Add parallel build jobs as a setting * Fix test requirements * Fix install test requirements condition * Fix pipeline models test * Reset current ops in prefer/require testing * Fix more tests * Remove separate test_models test * Fix regression 5551 * fix StaticVectors for GPU use * fix vocab tests * Fix regression test 5082 * Move azure steps to .github and reenable default pool jobs * Consolidate/rename azure steps Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-22 14:58:29 +02:00
Adriane Boyd	bdb485cc80	Add callback to copy vocab/tokenizer from model (#7750 ) * Add callback to copy vocab/tokenizer from model Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer settings and/or vocab (including vectors) from a base model. * Move spacy.copy_from_base_model.v1 to spacy.training.callbacks * Add documentation * Modify to specify model as tokenizer and vocab params	2021-04-22 12:36:50 +02:00
Adriane Boyd	f68fc29130	Update sent_starts in Example.from_dict (#7847 ) * Update sent_starts in Example.from_dict Update `sent_starts` for `Example.from_dict` so that `Optional[bool]` values have the same meaning as for `Token.is_sent_start`. Use `Optional[bool]` as the type for sent start values in the docs. * Use helper function for conversion to ternary ints	2021-04-22 11:32:45 +02:00
Adriane Boyd	f4339f9bff	Fix tokenizer cache flushing (#7836 ) * Fix tokenizer cache flushing Fix/simplify tokenizer init detection in order to fix cache flushing when properties are modified. * Remove init reloading logic * Remove logic disabling `_reload_special_cases` on init * Setting `rules` last in `__init__` (as before) means that setting other properties doesn't reload any special cases * Reset `rules` first in `from_bytes` so that setting other properties during deserialization doesn't reload any special cases unnecessarily * Reset all properties in `Tokenizer.from_bytes` to allow any settings to be `None` * Also reset special matcher when special cache is flushed * Remove duplicate special case validation * Add test for special cases flushing * Extend test for tokenizer deserialization of None values	2021-04-22 18:14:57 +10:00
Sofie Van Landeghem	cfad7e21d5	fix config parsing of ints/strings (#7755 ) * add few failing tests for parsing integers and strings * bump thinc to 8.0.3	2021-04-22 18:09:13 +10:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Sofie Van Landeghem	6f565cf39d	fix typo in entity_linker docs	2021-04-22 09:59:24 +02:00
Sofie Van Landeghem	2e746dbf32	update EL training data format in docs (#7839 ) * update EL training data format * fix typo * all -1 because reasons	2021-04-22 08:50:09 +02:00
meghanabhange	49ff1126bf	Project Idea : denomme | Multilingual Name Detection (#7845 ) * Add denomme * spaCy contributor agreement Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-22 08:48:17 +02:00
Sam Edwardes	b8c6c10c6f	Added a logo to spaCyTextBlob (#7818 ) * Added a logo to spaCyTextBlob * Updated to better thumb	2021-04-22 08:41:55 +02:00
Diego Palma	bbade153ed	Add TRUNAJOD to spaCy universe. (#7754 ) * Add TRUNAJOD to spaCy universe. * Add trunajod logo and thumb. Co-authored-by: Diego <dpalma@evernote.com>	2021-04-22 08:40:28 +02:00
Ines Montani	a9e5ae9b5c	Auto-format [ci skip]	2021-04-22 10:58:05 +10:00
Ines Montani	5cbe414ce6	Merge pull request #7851 from plison/master [ci skip]	2021-04-22 10:56:35 +10:00
Pierre Lison	2f0ef2c9cc	adding skweak to the SpaCy universe	2021-04-22 01:16:34 +02:00
Pierre Lison	debfb46088	adding skweak to the SpaCy universe	2021-04-22 00:58:09 +02:00
Shantam Raj	6017fcf693	Default code for Setting Entity annotations on the website errors (#7738 ) * the default example for "Setting entity annotations" errors on Binder * updating contributor info * using a new variable to store original entities	2021-04-21 09:16:32 +02:00
Ines Montani	aad5ba13af	Merge pull request #7826 from richardpaulhudson/master Add entry for Coreferee project to universe.json	2021-04-21 16:22:43 +10:00
hudsonr	2722424ec5	Added universe entry for Coreferee	2021-04-19 14:28:06 +02:00
langdonholmes	df541c6b5e	Update processing-pipelines.md to mention method for doc metadata (#7480 ) * Update processing-pipelines.md Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True) Link to a new example on the attributes page detailing the following: > ``` > data = [ > ("Some text to process", {"meta": "foo"}), > ("And more text...", {"meta": "bar"}) > ] > > for doc, context in nlp.pipe(data, as_tuples=True): > # Let's assume you have a "meta" extension registered on the Doc > doc._.meta = context["meta"] > ``` from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as * Updating the attributes section Update the attributes section with example of how extensions can be used to store metadata. * Update processing-pipelines.md * Update processing-pipelines.md Made as_tuples example executable and relocated to the end of the "Processing Text" section. * Update processing-pipelines.md * Update processing-pipelines.md Removed extra line * Reformat and rephrase Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-19 11:58:12 +02:00
Adriane Boyd	0e7f94b247	Update Tokenizer.explain with special matches (#7749 ) * Update Tokenizer.explain with special matches Update `Tokenizer.explain` and the pseudo-code in the docs to include the processing of special cases that contain affixes or whitespace. * Handle optional settings in explain * Add test for special matches in explain Add test for `Tokenizer.explain` for special cases containing affixes.	2021-04-19 19:08:20 +10:00
Adriane Boyd	07b41c38ae	Register CharEmbed layer (#7805 )	2021-04-19 18:39:34 +10:00
Sofie Van Landeghem	c786e98e56	assemble CLI command (#7783 ) * assemble CLI command * ensure assemble runs even without training section * cleanup	2021-04-19 18:39:11 +10:00
Adriane Boyd	15bd230413	Set catalogue lower pin to v2.0.3 (#7762 ) * Set catalogue lower pin to v2.0.2 * Update importlib-metadata pins to match * Require catalogue v2.0.3 Switch to vendored `importlib-metadata` v3.2.0 provided by `catalogue`.	2021-04-19 18:37:17 +10:00
Adriane Boyd	1ad646cbcf	Improve checks for sourced components (#7490 ) * Improve checks for sourced components * Remove language class checks * Convert python warning to logger warning * Remove unused warning * Fix formatting	2021-04-19 18:36:32 +10:00
Sofie Van Landeghem	05bdbe28bb	Fix vectors data on GPU (#7626 ) * ensure vectors data is stored on right device * ensure the added vector is on the right device * move vector to numpy before iterating * move best_rows to numpy before iterating	2021-04-19 18:30:03 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621 ) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Sofie Van Landeghem	8d7af5b2b1	Ensure hyphen in config file works as string value (#7642 ) * add test for serializing '-' in a config file * bump srsly to 2.4.1	2021-04-12 14:35:57 +02:00
Sofie Van Landeghem	27dbbb9903	Bugfix/nel crossing sentence (#7630 ) * ensure each entity gets a KB ID, even when it's not within a sentence * cleanup	2021-04-12 18:08:01 +10:00
Adriane Boyd	673e2bc4c0	Add usage docs for streamed train corpora (#7693 )	2021-04-09 16:15:38 +02:00
Adriane Boyd	73a8c0f992	Update debug data further for v3 (#7602 ) * Update debug data further for v3 * Remove new/existing label distinction (new labels are not immediately distinguishable because the pipeline is already initialized) * Warn on missing labels in training data for all components except parser * Separate textcat and textcat_multilabel sections * Add section for morphologizer * Reword missing label warnings	2021-04-09 11:53:42 +02:00
Stanislav Schmidt	2516896849	Make vocab update in get_docs deterministic (#7603 ) * Make vocab update in get_docs deterministic The attribute `DocBin.strings` is a set. In `DocBin.get_docs` a given vocab is updated by iterating over this set. Iteration over a python set produces an arbitrary ordering, therefore vocab is updated non-deterministically. When training (fine-tuning) a spacy model, the base model's vocabulary will be updated with the new vocabulary in the training data in exactly the way described above. After serialization, the file `model/vocab/strings.json` will be sorted in an arbitrary way. This prevents reproducible model training. * Revert "Make vocab update in get_docs deterministic" This reverts commit `d6b87a2f55`. * Sort strings in StringStore serialization Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-09 11:53:13 +02:00
Adriane Boyd	8008e2f75b	Use morph hash in lemmatizer cache key (#7690 ) Use the morph hash rather than the `MorphAnalysis` object in the cache key so that the `Lemmatizer` can be pickled.	2021-04-08 13:22:38 +02:00
Sofie Van Landeghem	3e5bd5055e	expand quickstart widget with cuda 11.1 and 11.2 (#7615 )	2021-04-08 12:25:42 +02:00
Adriane Boyd	e6b7600adf	Fix parser sourcing in NER converter (#7631 )	2021-04-08 12:25:03 +02:00
Sofie Van Landeghem	204c2f116b	Extend score_spans for overlapping & non-labeled spans (#7209 ) * extend span scorer with consider_label and allow_overlap * unit test for spans y2x overlap * add score_spans unit test * docs for new fields in scorer.score_spans * rename to include_label * spell out if-else for clarity * rename to 'labeled' Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-08 12:19:17 +02:00
Paul O'Leary McCann	c362006cb9	Fix is_sent_start when converting from JSON (fix #7635 ) (#7655 ) Data in the JSON format is split into sentences, and each sentence is saved with is_sent_start flags. Currently the flags are 1 for the first token and 0 for the others. When deserialized this results in a pattern of True, None, None, None... which makes single-sentence documents look as though they haven't had sentence boundaries set. Since items saved in JSON format have been split into sentences already, the is_sent_start values should all be True or False.	2021-04-08 18:24:52 +10:00

14525 Commits