spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-14 05:37:03 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Adriane Boyd	79cea03983	Update website model display (#9589 ) * Remove vectors from core trf model descriptions * Update accuracy labels and exclude morph_acc for ja	2021-11-03 09:56:00 +01:00
Paul O'Leary McCann	e43639b27a	Add note about round-trip serializing pipeline to API docs (#9583 )	2021-11-03 09:55:30 +01:00
xxyzz	90ec820f05	Add WordDumb to spaCy Universe (#9572 ) * Add WordDumb to spaCy Universe * Add standalone category Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-01 18:38:41 +09:00
Bruce W. Lee (이웅성)	a4dcb68cf6	Adding LingFeat Software to spaCy Universe. (#9574 ) * add lingfeat in universe * add lingfeat in universe * Fix JSON * Minor cleanup Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-01 18:38:14 +09:00
Vasundhara	5279c7c4ba	Fix broken link to mappings-exceptions (#9573 )	2021-10-31 13:44:29 +09:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Philip Vollet	76173b0866	fixed typo and URL (#9560 )	2021-10-29 13:57:44 +09:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)	fa70837f28	clarify how to connect pretraining to training (#9450 ) * clarify how to connect pretraining to training Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-22 13:15:47 +02:00
Duygu Altinok	7b98aa4c16	Corrected broken (#9505 )	2021-10-20 17:31:59 +02:00
Daniël de Kok	1f05f56433	Add the spacy.models_with_nvtx_range.v1 callback (#9124 ) * Add the spacy.models_with_nvtx_range.v1 callback This callback recursively adds NVTX ranges to the Models in each pipe in a pipeline. * Fix create_models_with_nvtx_range type signature * NVTX range: wrap models of all trainable pipes jointly This avoids that (sub-)models that are shared between pipes get wrapped twice. * NVTX range callback: make color configurable Add forward_color and backprop_color options to set the color for the NVTX range. * Move create_models_with_nvtx_range to spacy.ml * Update create_models_with_nvtx_range for thinc changes with_nvtx_range now updates an existing node, rather than returning a wrapper node. So, we can simply walk over the nodes and update them. * NVTX: use after_pipeline_creation in example	2021-10-20 11:59:48 +02:00
Adriane Boyd	3f181b73d0	Add ja_core_news_trf to website (#9515 )	2021-10-20 10:18:02 +02:00
Paul O'Leary McCann	222cf9b6d2	Clarify how to change base Transformer model (#9498 ) * Add note about how the model name is used * Add link to TransformersModel docs, separate paragraph * Local link * Revise docs * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-19 23:28:20 +02:00
Adriane Boyd	a6424bcea9	Minor updates to spacy-transformers docs for v1.1.0 (#9496 )	2021-10-18 14:55:02 +02:00
Adriane Boyd	9b86209a4a	Update docs for spacy-transformers v1.1 data classes (#9361 )	2021-10-18 14:16:58 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example:
```python
def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```
This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since spaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously spaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import fails. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using spaCy, it has been determined that branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confusion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Edward	72711dc2c9	Update universe example codes (#9422 ) * Update universe plugins * Adjust azure trigger * Add init to tests/universe * deliberately trying to break the universe to see if the CI catches it * revert Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-13 16:29:19 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Adriane Boyd	fd7edbc645	Fix types descriptions of sm and sent models (#9401 )	2021-10-11 11:17:18 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Adriane Boyd	ae1b3e960b	Update overwrite and scorer in API docs (#9384 ) * Update overwrite and scorer in API docs * Rephrase morphologizer extend + example	2021-10-11 10:35:07 +02:00
Sofie Van Landeghem	f87ae3cb7d	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00
Paul O'Leary McCann	78a88f7de7	Fix invalid json	2021-09-30 15:23:55 +09:00
Martin Vallone	a14ab7e882	Adding PhruzzMatcher to spaCy universe (#9321 ) * Adding PhruzzMatcher to spaCy universe * Fixes to make the package work properly	2021-09-30 13:46:53 +09:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
Ines Montani	6bb0324b81	Adjust kb_id visualizer templating and docs	2021-09-23 11:59:02 +02:00
Ines Montani	beb4a8c524	Merge pull request #9199 from shigapov/master (resolves #9129 )	2021-09-23 19:41:53 +10:00
Philip Vollet	d2adfe1efa	Add projects to spaCy Universe (#9269 ) * Added spaCy Universe projects * Added user license agreement Philip Vollet * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/meta/universe.json Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-23 10:56:45 +02:00
Edward	8bda39f088	Update Hammurabi example code to v3 (#9218 ) * Update Hammurabi example code * Fix typo	2021-09-16 13:32:44 +02:00
Jozef Harag	865cfbc903	feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters (#9202 ) * feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters * update versioning in docs Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-09-16 12:26:41 +02:00
Paul O'Leary McCann	1d57d78758	Make docs consistent (fix #9126 )	2021-09-16 15:54:12 +09:00
Renat Shigapov	d5cc009faf	Merge branch 'explosion:master' into master	2021-09-13 08:43:48 +02:00
Renat Shigapov	e61d93f8c3	add NEL-visualisation to manual-usage	2021-09-13 08:38:58 +02:00
Paul O'Leary McCann	f89e1c34c9	Minor typo fix in docs	2021-09-11 14:22:05 +09:00
Renat Shigapov	646f3a54db	added spaCyOpenTapioca (#9181 ) * add spaCyOpenTapioca to universe * add agreement * fix misprint in tags	2021-09-11 13:16:51 +09:00
mylibrar	ee28aac68e	Update example code of forte (#9175 ) Co-authored-by: Suqi Sun <suqi.sun@petuum.com>	2021-09-11 13:13:13 +09:00
Renat Shigapov	c1927fe994	fix misprint in tags	2021-09-09 15:37:34 +02:00
Renat Shigapov	ea58294076	add spaCyOpenTapioca to universe	2021-09-09 15:13:18 +02:00
Sofie Van Landeghem	8895e3c9ad	matcher doc corrections (#9115 ) * update error message to current UX * clarify uppercase effect * fix docstring	2021-09-02 09:26:33 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
Paul O'Leary McCann	ba6a37d358	Document Assigned Attributes of Pipeline Components (#9041 ) * Add textcat docs * Add NER docs * Add Entity Linker docs * Add assigned fields docs for the tagger This also adds a preamble, since there wasn't one. * Add morphologizer docs * Add dependency parser docs * Update entityrecognizer docs This is a little weird because `Doc.ents` is the only thing assigned to, but it's actually a bidirectional property. * Add token fields for entityrecognizer * Fix section name * Add entity ruler docs * Add lemmatizer docs * Add sentencizer/recognizer docs * Update website/docs/api/entityrecognizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/entityruler.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/tagger.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/entityruler.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update type for Doc.ents This was `Tuple[Span, ...]` everywhere but `Tuple[Span]` seems to be correct. * Run prettier * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Run prettier * Add transformers section This basically just moves and renames the "custom attributes" section from the bottom of the page to be consistent with "assigned attributes" on other pages. I looked at moving the paragraph just above the section into the section, but it includes the unrelated registry additions, so it seemed better to leave it unchanged. * Make table header consistent Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-01 12:09:39 +02:00
Davide Fiocco	1dd69be1f1	Fix point typo on docbin docs (#9097 )	2021-08-31 10:55:44 +02:00
Meenal Jhajharia	2613f0e98f	benepar usage example has deprecated imports	2021-08-28 16:35:58 +05:30
Sofie Van Landeghem	1e974de837	config is not Optional (#9024 )	2021-08-27 11:44:31 +02:00
Sofie Van Landeghem	4d39430b82	Document use-case of freezing tok2vec (#8992 ) * update error msg * add sentence to docs * expand note on frozen components	2021-08-26 09:50:35 +02:00
Sofie Van Landeghem	94fb840443	fix docs for Span constructor arguments (#9023 )	2021-08-25 16:06:22 +02:00
Sofie Van Landeghem	de025beb5f	Warn and document spangroup.doc weakref (#8980 ) * test for error after Doc has been garbage collected * warn about using a SpanGroup when the Doc has been garbage collected * add warning to the docs * rephrase slightly * raise error instead of warning * update * move warning to doc property	2021-08-20 11:06:19 +02:00
Paul O'Leary McCann	37fe847af4	Fix type annotation in docs	2021-08-20 15:34:22 +09:00
Ines Montani	f2b61b77a5	Fix universe.json [ci skip]	2021-08-20 11:26:29 +10:00
Baltazar	71e65fe943	added spacy api v3 docker	2021-08-19 21:29:25 +02:00
Paul O'Leary McCann	9391998c77	Add notes on preparing training data to docs (#8964 ) * Add training data section Not entirely sure this is in the right location on the page - maybe it should be after quickstart? * Add pointer from binary format to training data section * Minor cleanup * Add to ToC, fix filename * Update website/docs/usage/training.md Co-authored-by: Ines Montani <ines@ines.io> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Move the training data section further down the page * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/usage/training.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Run prettier Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-16 17:37:21 +02:00
Lasse	839ea0f987	change tags formatting to match	2021-08-13 14:40:08 +02:00
Lasse	70ab596f61	Merge branch 'master' of https://github.com/HLasse/spaCy	2021-08-13 14:35:21 +02:00
Lasse	195e4e48c3	add textdescriptives to universe	2021-08-13 14:35:18 +02:00
Adriane Boyd	b278f31ee6	Document scorers in registry and components from #8766 (#8929 ) * Document scorers in registry and components from #8766 * Update spacy/pipeline/lemmatizer.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/dependencyparser.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Reformat Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-12 12:50:03 +02:00
Ines Montani	4f769ff913	Update Prodigy project template for v1.11 [ci skip]	2021-08-12 13:46:20 +10:00
Paul O'Leary McCann	e227d24d43	Allow passing in array vars for speedup (#8882 ) * Allow passing in array vars for speedup This fixes #8845. Not sure about the docstring changes here... * Update docs Types maybe need more detail? Maybe not? * Run prettier on docs * Update spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:53 +02:00
Paul O'Leary McCann	6029cfc391	Add scores to output in spancat (#8855 ) * Add scores to output in spancat This exposes the scores as an attribute on the SpanGroup. Includes a basic test. * Add basic doc note * Vectorize score calcs * Add "annotation format" section * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Clean up doc section * Ran prettier on docs * Get arrays off the gpu before iterating over them * Remove int() calls Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-08-10 13:47:49 +02:00
Duygu Altinok	380b2817cf	updated unv json for new book	2021-08-09 12:39:22 +02:00
Paul O'Leary McCann	cac298471f	Fix #8902 (bad link in docs) typo fix	2021-08-08 22:04:00 +09:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Ines Montani	30f20496d5	Merge pull request #8840 from polm/docs/evaluate-speed [ci skip]	2021-07-30 09:10:15 +10:00
Ines Montani	65d163fab5	Adjust formatting [ci skip]	2021-07-30 09:10:04 +10:00
Ines Montani	3a701d3645	Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip] Fix formatting of ent_id_sep in EntityRuler API docs	2021-07-30 09:09:25 +10:00
thomashacker	02258916c8	Fix example config typo for transformer architecture	2021-07-29 11:19:40 +02:00
Adriane Boyd	15b12f3e35	Fix formatting of ent_id_sep in EntityRuler API docs	2021-07-29 10:10:12 +02:00
Paul O'Leary McCann	a60cb13910	Update speed entry in metrics table	2021-07-29 16:35:19 +09:00
Paul O'Leary McCann	e125313a50	Revert "Add note about SPEED in output" This reverts commit `c92d268176`.	2021-07-29 16:34:08 +09:00
Ines Montani	0a1e299d30	Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip]	2021-07-29 17:18:02 +10:00
Paul O'Leary McCann	c92d268176	Add note about SPEED in output In #8823 it was pointed out that the `SPEED` value wasn't documented anywhere.	2021-07-29 15:03:07 +09:00
Paul O'Leary McCann	8867e60fbb	Update website/docs/usage/v3.md Co-authored-by: Ines Montani <ines@ines.io>	2021-07-29 14:56:56 +09:00
Adriane Boyd	8547514aa4	Remove labels from textcat component config example (#8815 )	2021-07-27 13:14:38 +02:00
Paul O'Leary McCann	76ac95923a	Add note to migration guide about lexeme tables (fix #7290 ) This just adds the resolution from #6388 to the docs.	2021-07-27 19:19:25 +09:00
Paul O'Leary McCann	67ecdcc3ac	Update subset/superset docs (#8795 ) * Update subset/superset docs * Update website/docs/usage/rule-based-matching.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-27 12:08:46 +02:00
Ines Montani	134cb06af3	Merge pull request #8808 from kevinlu1248/master [ci skip] Changed a CLI command in data-formats.md due to erroneous information	2021-07-27 12:15:16 +10:00
Kevin Lu	4a8e9e4e4e	Update data-formats.md	2021-07-25 22:58:53 -07:00
Ledenel	413f745c68	fix broken example in spaCy universe Chatterbot	2021-07-25 15:53:32 +00:00
Paul O'Leary McCann	d717593eb7	Merge pull request #8754 from KennethEnevoldsen/patch-1 [minor] removed outdated spacy version for spacymoji	2021-07-18 19:17:33 +09:00
Kenneth Enevoldsen	5d6aed0773	fixed GitHub link and thumbnail Sorry, I seem to have misunderstood that the GitHub reference shouldn't be a link.	2021-07-18 10:22:00 +02:00
Ines Montani	313f55e560	Fix JSON [ci skip]	2021-07-18 13:21:33 +10:00
Ines Montani	51e5903d6f	Merge pull request #8702 from KennethEnevoldsen/master [ci skip]	2021-07-18 13:18:42 +10:00
Kenneth Enevoldsen	8546948fba	removed outdated spacy version for spacymoji From the documentation of spacymoji (and the requirements.txt) it seems like it is not only for version 2.	2021-07-17 15:19:43 +02:00
Kenneth Enevoldsen	a0e0ccdb46	Update website/meta/universe.json Co-authored-by: Ines Montani <ines@ines.io>	2021-07-17 07:14:46 +02:00
Mario Šaško	1ba2e8a646	Add TakeLab/spacy-udpipe to Universe (#8698 ) * Add TakeLab/spacy-udpipe to universe * Add SCA * Sign SCA	2021-07-16 11:15:52 +02:00
Adriane Boyd	f5acc48111	Remove TrainablePipe as base class for Lemmatizer in API docs (#8725 )	2021-07-15 16:41:36 +02:00
Sofie Van Landeghem	77859beb99	spacy.ngram_range_suggester.v1 (#8699 )	2021-07-15 10:01:22 +02:00
Ines Montani	2a8eeed5da	Merge pull request #8703 from thomashacker/update/spacy-stanza [ci skip] Update spacy-stanza universe.json	2021-07-13 19:03:42 +10:00
thomashacker	aafb89df78	Update universe.json code_example	2021-07-13 10:22:49 +02:00
Kenneth Enevoldsen	94ce904e10	added missing comma	2021-07-13 09:59:34 +02:00
Kenneth Enevoldsen	a81fcc81b0	added dacy to universe	2021-07-13 09:54:08 +02:00
Ines Montani	50000d37e4	Avoid double parentheses [ci skip]	2021-07-10 10:52:01 +10:00
Calum Sieppert	e2d53aa1a6	Typo fixes	2021-07-09 10:25:56 -06:00
Adriane Boyd	1ee5bee29d	Add Macedonian models to website (#8637 )	2021-07-08 09:32:14 +02:00
Paul O'Leary McCann	1d9209d43a	Merge pull request #8547 from mylibrar/update-universe Add forte to universe.json	2021-07-08 14:59:49 +09:00
Ines Montani	39c8f7949e	Add code preview for textcat_multilabel [ci skip]	2021-07-08 13:33:25 +10:00
Calum Sieppert	889c187bc2	Typo fixes	2021-07-07 16:53:04 -06:00
Adriane Boyd	6db647dfe0	Update v3.1 usage docs	2021-07-07 08:43:33 +02:00
Sofie Van Landeghem	64fac754fe	add spacy prefix to ngram_suggester.v1 (#8623 )	2021-07-07 08:09:30 +02:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586 ) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Ines Montani	04a9ade40f	Merge pull request #8466 from explosion/docs/new-in-v3-1 [ci skip]	2021-07-06 22:20:24 +10:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581 ) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Adriane Boyd	29906884c5	Raise an error for textcat with <2 labels (#8584 ) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs	2021-07-06 12:35:22 +02:00
Ines Montani	5bb7fe4b41	Update with HF hub integration [ci skip]	2021-07-06 19:30:59 +10:00
Cass	7d13fc799b	Fix a command typo in models.md "dowmload" -> "download"	2021-07-05 18:44:18 -07:00
Ines Montani	8423864b50	Add docs notes on installing models from Python and in Jupyter [ci skip] (#8597 )	2021-07-05 13:49:20 +02:00
Yoichiro Hasebe	596e04cbb4	Github repo info fixed for ruby-spacy	2021-07-04 18:55:17 +09:00
Yoichiro Hasebe	2bdfa42107	Update universe.json	2021-07-04 08:44:39 +09:00
Suqi Sun	3901507df8	Update pip	2021-06-30 16:44:43 -04:00
Suqi Sun	61c868ed75	Update pip and code example	2021-06-30 14:49:51 -04:00
Ines Montani	af9d984407	Merge pull request #8405 from svlandeg/fix/whitespace_tokenizer [ci skip]	2021-06-30 20:52:59 +10:00
Suqi Sun	4331c40b78	Add forte to universe.json	2021-06-29 16:17:22 -04:00
Adriane Boyd	41292a1b84	Add note about updating with fill-config	2021-06-29 10:45:36 +02:00
Nick Sorros	bb781ae7f7	Remove extra parenthesis from the example for spacy-streamlit (#8527 )	2021-06-28 14:03:31 +02:00
Adriane Boyd	4d1ef8f695	Tidy up docs	2021-06-28 12:08:15 +02:00
Ines Montani	93572dc12a	Merge pull request #8505 from bryant1410/patch-2 [ci skip] Fix double slash in model release web page	2021-06-28 12:51:06 +10:00
Kevin	1a3e7cc5ef	Updated PyATE syntax to fit spaCy V3	2021-06-26 17:52:41 -07:00
Santiago Castro	2e71944e1e	Fix double slash in model release web page	2021-06-25 19:19:10 -07:00
Ines Montani	4544412442	Update wording [ci skip] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-25 13:52:48 +10:00
Ines Montani	0d2e2b59bc	Update intro [ci skip]	2021-06-24 22:53:20 +10:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747 ) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane 
Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Ines Montani	68721af628	Formatting and preliminary intro [ci skip]	2021-06-24 20:32:23 +10:00
Adriane Boyd	92dc6b409e	Notes on source with vectors	2021-06-24 10:34:07 +02:00
Adriane Boyd	35425d7e26	Add details for Catalan and Danish	2021-06-24 10:10:33 +02:00
Ines Montani	5daf450f51	Update upgrading notes [ci skip]	2021-06-24 18:06:28 +10:00
Ines Montani	528746129d	Merge branch 'master' into docs/new-in-v3-1	2021-06-24 13:11:37 +10:00
Ines Montani	a8e8d02ba7	Merge pull request #8465 from explosion/feature/spacy-package-readme	2021-06-24 13:11:08 +10:00
Ines Montani	3e058dee62	Update features [ci skip]	2021-06-24 12:36:04 +10:00
Ines Montani	40f13c3f0c	Add docs [ci skip]	2021-06-24 11:57:15 +10:00
Ines Montani	a1e4aca267	Fix sentence [ci skip]	2021-06-24 11:40:36 +10:00
Ines Montani	ca0d904faa	Update details [ci skip]	2021-06-23 13:05:56 +10:00
themrmax	d96c422cfc	Fix broken link change /api/registry to /api/top-level#registry	2021-06-22 15:34:06 -07:00
Ines Montani	e9b68d4f4c	Update details and add example [ci skip]	2021-06-22 17:51:03 +10:00
Nick Sorros	31504f5982	Switch model and data path in prodigy project.yml recipe (#8467 )	2021-06-22 09:41:45 +02:00
Ines Montani	bc93c34f54	Add "New in v3.1" guide	2021-06-22 15:23:18 +10:00
Adriane Boyd	e39d1bd4ab	Various docs updates for v3.1 (#8406 ) * Update for Catalan/Italian lemmatizer changes * Add warning about relevance of section	2021-06-21 09:33:50 +02:00
Ines Montani	02d2fdb123	Add link anchor [ci skip]	2021-06-20 11:29:19 +10:00
Matthew Honnibal	6f5e308d17	Support negative examples in partial NER annotations (#8106 ) * Support a cfg field in transition system * Make NER 'has gold' check use right alignment for span * Pass 'negative_samples_key' property into NER transition system * Add field for negative samples to NER transition system * Check neg_key in NER has_gold * Support negative examples in NER oracle * Test for negative examples in NER * Fix name of config variable in NER * Remove vestiges of old-style partial annotation * Remove obsolete tests * Add comment noting lack of support for negative samples in parser * Additions to "neg examples" PR (#8201) * add custom error and test for deprecated format * add test for unlearning an entity * add break also for Begin's cost * add negative_samples_key property on Parser * rename * extend docs & fix some older docs issues * add subclass constructors, clean up tests, fix docs * add flaky test with ValueError if gold parse was not found * remove ValueError if n_gold == 0 * fix docstring * Hack in environment variables to try out training * Remove hack * Remove NER hack, and support 'negative O' samples * Fix O oracle * Fix transition parser * Remove 'not O' from oracle * Fix NER oracle * check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation * use set instead of list in consistency check Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-17 17:33:00 +10:00
svlandeg	bb9d2f1546	extend example to ensure the text is preserved	2021-06-16 23:56:35 +02:00

2861 Commits