spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-27 06:01:28 +03:00

Author	SHA1	Message	Date
harmbuisman	c066fb8a4e	#10672 : fixes displacy output for manual unsorted entities (#10673 ) * #10672: fixes displacy output for manual unsorted entities * #10672: removed unused import * fix prettier formatting Co-authored-by: Harm Buisman <h.buisman@iknl.nl> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-04-27 09:51:58 +02:00
Adriane Boyd	85778dfcf4	Add edit tree lemmatizer (#10231 ) * Add edit tree lemmatizer Co-authored-by: Daniël de Kok <me@danieldk.eu> * Hide edit tree lemmatizer labels * Use relative imports * Switch to single quotes in error message * Type annotation fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat edit_tree_lemmatizer with black * EditTreeLemmatizer.predict: take Iterable Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Validate edit trees during deserialization This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types. * Move edit_trees to _edit_tree_internals * Fix invalid edit tree format error message * edit_tree_lemmatizer: remove outdated TODO comment * Rename factory name to trainable_lemmatizer * Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14 * Switch to Tagger.v2 * Add documentation for EditTreeLemmatizer * docs: Fix 3.2 -> 3.3 somewhere * trainable_lemmatizer documentation fixes * docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-28 11:13:50 +02:00
Adriane Boyd	d5666fd12d	Add NORM to Matcher feature in docs (#10560 )	2022-03-28 10:35:47 +02:00
Lj Miranda	a79cd3542b	Add displacy support for overlapping Spans (#10332 ) * Fix docstring for EntityRenderer * Add warning in displacy if doc.spans are empty * Implement parse_spans converter One notable change here is that the default spans_key is sc, and it's set by the user through the options. * Implement SpanRenderer Here, I implemented a SpanRenderer that looks similar to the EntityRenderer except for some templates. The spans_key, by default, is set to sc, but can be configured in the options (see parse_spans). The way I rendered these spans is per-token, i.e., I first check if each token (1) belongs to a given span type and (2) a starting token of a given span type. Once I have this information, I render them into the markup. * Fix mypy issues on typing * Add tests for displacy spans support * Update colors from RGB to hex Co-authored-by: Ines Montani <ines@ines.io> * Remove unnecessary CSS properties * Add documentation for website * Remove unnecessary scripts * Update wording on the documentation Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Put typing dependency on top of file * Put back z-index so that spans overlap properly * Make warning more explicit for spans_key Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-03-16 18:14:34 +01:00
Adriane Boyd	e8357923ec	Various install docs updates (#10487 ) * Simplify quickstart source install to use only editable pip install * Update pytorch install instructions to more recent versions	2022-03-15 11:12:50 +01:00
Adriane Boyd	297dd82c86	Fix initial special cases for Tokenizer.explain (#10460 ) Add the missing initial check for special cases to `Tokenizer.explain` to align with `Tokenizer._tokenize_affixes`.	2022-03-11 10:50:47 +01:00
Peter Baumgartner	01ec6349ea	Add `path.mkdir` to custom component examples of `to_disk` (#10348 ) * add `path.mkdir` to examples * add ensure_path + mkdir * update highlights	2022-03-08 16:04:10 +01:00
Adriane Boyd	b2bbefd0b5	Add Finnish, Korean, and Swedish models and Korean support notes (#10355 ) * Add Finnish, Korean, and Swedish models to website * Add Korean language support notes	2022-03-07 17:03:45 +01:00
Peter Baumgartner	836f689cc7	YAML multiline tip for project.yml files (#10187 ) * MultiHashEmbed vector docs correction * add in multi-line tip * convert to sidebar tip	2022-02-08 08:35:09 +01:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
ColleterVi	a784b12eff	fix: new restcountries url (#10043 ) Url extension "eu" and path "rest" are no longer available. Replacing them for a working url.	2022-01-13 20:25:06 +09:00
Adriane Boyd	6763cbfdc0	Update Catalan acknowledgements for v3.2 (#9763 )	2021-11-29 14:14:21 +01:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Adriane Boyd	216ed231a9	What's new in v3.2 (#9633 ) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani <ines@ines.io>	2021-11-05 16:31:14 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Vasundhara	5279c7c4ba	Fix broken link to mappings-exceptions (#9573 )	2021-10-31 13:44:29 +09:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)	fa70837f28	clarify how to connect pretraining to training (#9450 ) * clarify how to connect pretraining to training Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-22 13:15:47 +02:00
Paul O'Leary McCann	222cf9b6d2	Clarify how to change base Transformer model (#9498 ) * Add note about how the model name is used * Add link to TransformersModel docs, separate paragraph * Local link * Revise docs * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-19 23:28:20 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example: ```python def process(items: List[str]) -> None: # 'items' has type List[str] items = [item.split() for item in items] # 'items' now has type List[List[str]] ... ``` This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since SpaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously SpaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import means. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using SpaCy, it has been determined that branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confusion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Ines Montani	6bb0324b81	Adjust kb_id visualizer templating and docs	2021-09-23 11:59:02 +02:00
Ines Montani	beb4a8c524	Merge pull request #9199 from shigapov/master (resolves #9129 )	2021-09-23 19:41:53 +10:00
Paul O'Leary McCann	1d57d78758	Make docs consistent (fix #9126 )	2021-09-16 15:54:12 +09:00
Renat Shigapov	d5cc009faf	Merge branch 'explosion:master' into master	2021-09-13 08:43:48 +02:00
Renat Shigapov	e61d93f8c3	add NEL-visualisation to manual-usage	2021-09-13 08:38:58 +02:00
Paul O'Leary McCann	f89e1c34c9	Minor typo fix in docs	2021-09-11 14:22:05 +09:00
Sofie Van Landeghem	8895e3c9ad	matcher doc corrections (#9115 ) * update error message to current UX * clarify uppercase effect * fix docstring	2021-09-02 09:26:33 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
Sofie Van Landeghem	4d39430b82	Document use-case of freezing tok2vec (#8992 ) * update error msg * add sentence to docs * expand note on frozen components	2021-08-26 09:50:35 +02:00
Paul O'Leary McCann	9391998c77	Add notes on preparing training data to docs (#8964 ) * Add training data section Not entirely sure this is in the right location on the page - maybe it should be after quickstart? * Add pointer from binary format to training data section * Minor cleanup * Add to ToC, fix filename * Update website/docs/usage/training.md Co-authored-by: Ines Montani <ines@ines.io> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move the training data section further down the page * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Run prettier Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-16 17:37:21 +02:00
Ines Montani	4f769ff913	Update Prodigy project template for v1.11 [ci skip]	2021-08-12 13:46:20 +10:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Ines Montani	30f20496d5	Merge pull request #8840 from polm/docs/evaluate-speed [ci skip]	2021-07-30 09:10:15 +10:00
Ines Montani	65d163fab5	Adjust formatting [ci skip]	2021-07-30 09:10:04 +10:00
Paul O'Leary McCann	a60cb13910	Update speed entry in metrics table	2021-07-29 16:35:19 +09:00
Paul O'Leary McCann	8867e60fbb	Update website/docs/usage/v3.md Co-authored-by: Ines Montani <ines@ines.io>	2021-07-29 14:56:56 +09:00
Paul O'Leary McCann	76ac95923a	Add note to migration guide about lexeme tables (fix #7290 ) This just adds the resolution from #6388 to the docs.	2021-07-27 19:19:25 +09:00
Paul O'Leary McCann	67ecdcc3ac	Update subset/superset docs (#8795 ) * Update subset/superset docs * Update website/docs/usage/rule-based-matching.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-27 12:08:46 +02:00
Ines Montani	50000d37e4	Avoid double parentheses [ci skip]	2021-07-10 10:52:01 +10:00
Calum Sieppert	e2d53aa1a6	Typo fixes	2021-07-09 10:25:56 -06:00
Calum Sieppert	889c187bc2	Typo fixes	2021-07-07 16:53:04 -06:00
Adriane Boyd	6db647dfe0	Update v3.1 usage docs	2021-07-07 08:43:33 +02:00
Ines Montani	04a9ade40f	Merge pull request #8466 from explosion/docs/new-in-v3-1 [ci skip]	2021-07-06 22:20:24 +10:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581 ) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Ines Montani	5bb7fe4b41	Update with HF hub integration [ci skip]	2021-07-06 19:30:59 +10:00
Cass	7d13fc799b	Fix a command typo in models.md "dowmload" -> "download"	2021-07-05 18:44:18 -07:00
Ines Montani	8423864b50	Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597 )	2021-07-05 13:49:20 +02:00
Ines Montani	af9d984407	Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip]	2021-06-30 20:52:59 +10:00
Adriane Boyd	41292a1b84	Add note about updating with fill-config	2021-06-29 10:45:36 +02:00
Ines Montani	4544412442	Update wording [ci skip] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-25 13:52:48 +10:00
Ines Montani	0d2e2b59bc	Update intro [ci skip]	2021-06-24 22:53:20 +10:00
Ines Montani	68721af628	Formatting and preliminary intro [ci skip]	2021-06-24 20:32:23 +10:00
Adriane Boyd	92dc6b409e	Notes on source with vectors	2021-06-24 10:34:07 +02:00
Adriane Boyd	35425d7e26	Add details for Catalan and Danish	2021-06-24 10:10:33 +02:00
Ines Montani	5daf450f51	Update upgrading notes [ci skip]	2021-06-24 18:06:28 +10:00
Ines Montani	528746129d	Merge branch 'master' into docs/new-in-v3-1	2021-06-24 13:11:37 +10:00
Ines Montani	3e058dee62	Update features [ci skip]	2021-06-24 12:36:04 +10:00
Ines Montani	a1e4aca267	Fix sentence [ci skip]	2021-06-24 11:40:36 +10:00
Ines Montani	ca0d904faa	Update details [ci skip]	2021-06-23 13:05:56 +10:00
themrmax	d96c422cfc	Fix broken link change /api/registry to /api/top-level#registry	2021-06-22 15:34:06 -07:00
Ines Montani	e9b68d4f4c	Update details and add example [ci skip]	2021-06-22 17:51:03 +10:00
Nick Sorros	31504f5982	Switch model and data path in prodigy project.yml recipe (#8467 )	2021-06-22 09:41:45 +02:00
Ines Montani	bc93c34f54	Add "New in v3.1" guide	2021-06-22 15:23:18 +10:00
Ines Montani	02d2fdb123	Add link anchor [ci skip]	2021-06-20 11:29:19 +10:00
svlandeg	bb9d2f1546	extend example to ensure the text is preserved	2021-06-16 23:56:35 +02:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862 ) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
svlandeg	29d83dec0c	adjust whitespace tokenizer to avoid sep in split()	2021-06-16 10:58:45 +02:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Sofie Van Landeghem	0fd0d949c4	fix 's typo's across code base (#8384 )	2021-06-15 10:57:08 +02:00
Adriane Boyd	6baab565eb	Minor updates to quickstart settings/instructions (#7965 ) * Minor updates to quickstart settings/instructions * set default value of textcat exclusive to `false` until the default checkbox behavior is updated * add the `morphologizer` to the list of components * add a note that v3.0.6+ is required * Switch to warning above quickstart * Undo changes to textcat default in quickstart Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-05-17 16:55:22 +02:00
Paul O'Leary McCann	66bfabd839	Fix pretraining objectives fragment (#8005 ) * Fix pretraining objectives fragment The fragment here is reused from a heading higher up, so you couldn't link to this section. * Fix section link to new fragment	2021-05-06 08:27:36 +02:00
Adriane Boyd	95c0833656	Add training option to set annotations on update (#7767 ) * Add training option to set annotations on update Add a `[training]` option called `set_annotations_on_update` to specify a list of components for which the predicted annotations should be set on `example.predicted` immediately after that component has been updated. The predicted annotations can be accessed by later components in the pipeline during the processing of the batch in the same `update` call. * Rename to annotates / annotating_components * Add test for `annotating_components` when training from config * Add documentation	2021-04-26 16:53:53 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architectures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Shantam Raj	6017fcf693	Default code for Setting Entity annotations on the website errors (#7738 ) * the default example for "Setting entity annotations" errors on Binder * updating contributor info * using a new variable to store original entities	2021-04-21 09:16:32 +02:00
langdonholmes	df541c6b5e	Update processing-pipelines.md to mention method for doc metadata (#7480 ) * Update processing-pipelines.md Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True) Link to a new example on the attributes page detailing the following: > ``` > data = [ > ("Some text to process", {"meta": "foo"}), > ("And more text...", {"meta": "bar"}) > ] > > for doc, context in nlp.pipe(data, as_tuples=True): > # Let's assume you have a "meta" extension registered on the Doc > doc._.meta = context["meta"] > ``` from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as * Updating the attributes section Update the attributes section with example of how extensions can be used to store metadata. * Update processing-pipelines.md * Update processing-pipelines.md Made as_tuples example executable and relocated to the end of the "Processing Text" section. * Update processing-pipelines.md * Update processing-pipelines.md Removed extra line * Reformat and rephrase Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-19 11:58:12 +02:00
Adriane Boyd	0e7f94b247	Update Tokenizer.explain with special matches (#7749 ) * Update Tokenizer.explain with special matches Update `Tokenizer.explain` and the pseudo-code in the docs to include the processing of special cases that contain affixes or whitespace. * Handle optional settings in explain * Add test for special matches in explain Add test for `Tokenizer.explain` for special cases containing affixes.	2021-04-19 19:08:20 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621 ) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Adriane Boyd	673e2bc4c0	Add usage docs for streamed train corpora (#7693 )	2021-04-09 16:15:38 +02:00
Ayush Chaurasia	3c2ce41dd8	W&B integration: Optional support for dataset and model checkpoint logging and versioning (#7429 ) * Add optional artifacts logging * Update docs * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/training/loggers.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Bump WandbLogger Version * Add documentation of v1 to legacy docs * bump spacy-legacy to 3.0.2 (to be released) Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-01 19:36:23 +02:00
Santiago Castro	af07fc3bc1	Add support for CUDA 11.2 (#7583 ) * Add support for CUDA 11.2 * Update the docs * Format Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-03-30 09:47:33 +02:00
Álvaro Abella Bascarán	5b4dde38a3	fix fn name: tokenizer.infixes_finditer -> tokenizer.infix_finditer (#7606 )	2021-03-30 09:45:49 +02:00
Adriane Boyd	0d2b723e8d	Update entity setting section	2021-03-20 11:38:55 +01:00
Adriane Boyd	6a9a467766	Update website/docs/usage/processing-pipelines.md Co-authored-by: Ines Montani <ines@ines.io>	2021-03-19 08:12:49 +01:00
Adriane Boyd	40e5d3a980	Update saving/loading example	2021-03-18 16:56:10 +01:00
Adriane Boyd	0fb1881f36	Reformat processing pipelines	2021-03-18 13:31:42 +01:00
Adriane Boyd	acc58719da	Update custom similarity hooks example	2021-03-18 13:31:42 +01:00
Adriane Boyd	c9e1a9ac17	Add multiprocessing section	2021-03-18 13:31:42 +01:00
Adriane Boyd	9a254d3995	Include all en_core_web_sm components in examples	2021-03-18 13:31:42 +01:00
bsweileh	61472e7cb3	Update _training.md - Fix broken link on backpropagation (#7431 ) * Update _training.md Fix broken link on backpropagation * Add agreement add spacy contributor agreement	2021-03-15 09:21:35 +01:00
Adriane Boyd	d746ea6278	Add warning about GPU selection in Jupyter notebooks (#7075 ) * Initial warning * Update check * Redo edit * Move jupyter warning to helper method * Add link with details to warnings	2021-03-09 15:35:21 +01:00
Sofie Van Landeghem	932887b950	textcat scoring fix and multi_label docs (#6974 ) * add multi-label textcat to menu * add infobox on textcat API * add info to v3 migration guide * small edits * further fixes in doc strings * add infobox to textcat architectures * add textcat_multilabel to overview of built-in components * spelling * fix unrelated warn msg * Add textcat_multilabel to quickstart [ci skip] * remove separate documentation page for multilabel_textcategorizer * small edits * positive label clarification * avoid duplicating information in self.cfg and fix textcat.score * fix multilabel textcat too * revert threshold to storage in cfg * revert threshold stuff for multi-textcat Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:04:22 +11:00
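
Note on commit 657af5f91f: its message says that branching on a comparison against `sys.version_info` is something Mypy can analyze statically, so `Literal` can be imported conditionally. The following is a minimal sketch of that import pattern, not the code from the commit itself; the `describe_mode` function and its mode names are hypothetical and exist only to show the imported type in use.

```python
import sys

# Branch on sys.version_info rather than try/except ImportError,
# so that a static type checker can follow which import applies.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    # Only reached on Python < 3.8, where typing_extensions must be installed.
    from typing_extensions import Literal


def describe_mode(mode: Literal["train", "eval"]) -> str:
    """Hypothetical helper (not part of spaCy) that accepts a Literal-typed argument."""
    return f"running in {mode} mode"


message = describe_mode("train")
```

With this layout the fallback package is only needed on interpreters older than 3.8, which matches the dependency change described in the commit message.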

1072 Commits
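
Note on commit d60b748e3c: its message describes failing with an error when an asset `path` points outside the temporary directory that the git clone goes into, and notes that `""` selects the whole repository while `"/"` does not. The following is a rough sketch of such a containment check under those stated assumptions; `resolve_inside` is a hypothetical helper written for illustration and is not spaCy's implementation.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, relative: str) -> Path:
    """Hypothetical helper: resolve `relative` under `root` and refuse escapes."""
    root = root.resolve()
    target = (root / relative).resolve()
    # Accept the root itself (relative == "") or any path below it.
    if target != root and root not in target.parents:
        raise ValueError(f"path {relative!r} points outside of {root}")
    return target


with tempfile.TemporaryDirectory() as tmp:
    clone_dir = Path(tmp)
    root_resolved = clone_dir.resolve()
    # An empty path selects the whole clone.
    whole_repo = resolve_inside(clone_dir, "")
    try:
        # An absolute path replaces the root when joined, so it escapes the clone.
        resolve_inside(clone_dir, "/")
        escaped = True
    except ValueError:
        escaped = False
```

Resolving both paths before comparing them means `..` segments and symlinks are normalized first, so a relative path such as `../other` is rejected in the same way as an absolute one.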