spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-01 18:31:27 +03:00

Author	SHA1	Message	Date
Lj Miranda	345e7f6bc4	Clarify Span.ents documentation (#10154 ) * Clarify Span.ents documentation Ref: #10135 Retain current behaviour. Span.ents will only include entities within said span. You can't get tokens outside of the original span. * Reword docstrings Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update API docs in the website Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-31 08:41:42 +01:00
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Duygu Altinok	268ddf8a06	Add ENT_IOB key to Matcher (#9649 ) * added new field * added exception for IOb strings * minor refinement to schema * removed field * fixed typo * imported numeriacla val * changed the code bit * cosmetics * added test for matcher * set ents of moc docs * added invalid pattern * minor update to documentation * blacked matcher * added pattern validation * add IOB vals to schema * changed into test * mypy compat * cleaned left over * added compat import * changed type * added compat import * changed literal a bit * went back to old * made explicit type * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:18:39 +01:00
Paul O'Leary McCann	2ff53834bb	Add link to pattern file info in EntityRuler.initialize docs (#10091 ) * Add link to pattern file info in EntityRuler.initialize docs * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-19 10:45:11 +01:00
Daniël de Kok	50d2a2c930	User fewer Vector internals (#9879 ) * Use Vectors.shape rather than Vectors.data.shape * Use Vectors.size rather than Vectors.data.size * Add Vectors.to_ops to move data between different ops * Add documentation for Vector.to_ops	2022-01-18 17:14:35 +01:00
ColleterVi	a784b12eff	fix: new restcountries url (#10043 ) Url extension "eu" and path "rest" are no longer available. Replacing them for a working url.	2022-01-13 20:25:06 +09:00
Sofie Van Landeghem	56dcb39fb7	Fix references to config file in the docs & UX (#9961 ) * doc fixes around config file * fix typo * clarify default	2022-01-04 14:31:26 +01:00
Peter Baumgartner	72abf9e102	MultiHashEmbed vector docs correction (#9918 )	2021-12-27 11:18:08 +01:00
Adriane Boyd	51a3b60027	Document Tagger neg_prefix, fix typo (#9821 )	2021-12-07 09:42:40 +01:00
Duygu Altinok	b56b9e7f31	Entity ruler remove pattern (#9685 ) * added ruler coe * added error for none existing pattern * changed error to warning * changed error to warning * added basic tests * fixed place * added test files * went back to error * went back to pattern error * minor change to docs * changed style * changed doc * changed error slightly * added remove to phrasem api * error key already existed * phrase matcher match code to api * blacked tests * moved comments before expr * corrected error no * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 15:32:49 +01:00
Natalia Rodnova	472740d613	Added sents property to Span for Spans spanning over several sentences (#9699 ) * Added sents property to Span class that returns a generator of sentences the Span belongs to * Added description to Span.sents property * Update test_span to clarify the difference between span.sent and span.sents Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix documentation typos in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update Span.sents doc string in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parametrized test_span_spans * Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided * Corrected Span ocumentation copy/paste issue * Put back accidentally deleted lines * Fixed formatting in span.pyx * Moved check for SENT_START annotation after user hooks in Span.sents * add version where the property was introduced Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 09:58:01 +01:00
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Adriane Boyd	6763cbfdc0	Update Catalan acknowledgements for v3.2 (#9763 )	2021-11-29 14:14:21 +01:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Paul O'Leary McCann	52b8c2d2e0	Add note on batch contract for listeners (#9691 ) * Add note on batch contract Using listeners requires batches to be consistent. This is obvious if you understand how the listener works, but it wasn't clearly stated in the Docs, and was subtle enough that the EntityLinker missed it. There is probably a clearer way to explain what the actual requirement is, but I figure this is a good start. * Rewrite to clarify role of caching	2021-11-22 11:06:07 +01:00
Sofie Van Landeghem	13645dcbf5	add note that annotating components is new since 3.1 (#9678 )	2021-11-22 14:43:11 +09:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Adriane Boyd	216ed231a9	What's new in v3.2 (#9633 ) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani <ines@ines.io>	2021-11-05 16:31:14 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Paul O'Leary McCann	e43639b27a	Add note about round-trip serializing pipeline to API docs (#9583 )	2021-11-03 09:55:30 +01:00
Vasundhara	5279c7c4ba	Fix broken link to mappings-exceptions (#9573 )	2021-10-31 13:44:29 +09:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. * `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)	fa70837f28	clarify how to connect pretraining to training (#9450 ) * clarify how to connect pretraining to training Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-22 13:15:47 +02:00
Daniël de Kok	1f05f56433	Add the spacy.models_with_nvtx_range.v1 callback (#9124 ) * Add the spacy.models_with_nvtx_range.v1 callback This callback recursively adds NVTX ranges to the Models in each pipe in a pipeline. * Fix create_models_with_nvtx_range type signature * NVTX range: wrap models of all trainable pipes jointly This avoids that (sub-)models that are shared between pipes get wrapped twice. * NVTX range callback: make color configurable Add forward_color and backprop_color options to set the color for the NVTX range. * Move create_models_with_nvtx_range to spacy.ml * Update create_models_with_nvtx_range for thinc changes with_nvtx_range now updates an existing node, rather than returning a wrapper node. So, we can simply walk over the nodes and update them. * NVTX: use after_pipeline_creation in example	2021-10-20 11:59:48 +02:00
Paul O'Leary McCann	222cf9b6d2	Clarify how to change base Transformer model (#9498 ) * Add note about how the model name is used * Add link to TransformersModel docs, separate paragraph * Local link * Revise docs * Update website/docs/usage/embeddings-transformers.md * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-19 23:28:20 +02:00
Adriane Boyd	a6424bcea9	Minor updates to spacy-transformers docs for v1.1.0 (#9496 )	2021-10-18 14:55:02 +02:00
Adriane Boyd	9b86209a4a	Update docs for spacy-transformers v1.1 data classes (#9361 )	2021-10-18 14:16:58 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example: ```python def process(items: List[str]) -> None: # 'items' has type List[str] items = [item.split() for item in items] # 'items' now has type List[List[str]] ... ``` This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. * 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since SpaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously SpaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import means. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using SpaCy, it has been determined that by branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confustion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. * 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	b53e39455e	Fix UD POS docs links (fix #9013 ) (#9407 ) * Fix UD POS docs links (fix #9013) The previous link seems to have been for UD v1. * Fix link	2021-10-11 11:51:19 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Adriane Boyd	ae1b3e960b	Update overwrite and scorer in API docs (#9384 ) * Update overwrite and scorer in API docs * Rephrase morphologizer extend + example	2021-10-11 10:35:07 +02:00
Sofie Van Landeghem	f87ae3cb7d	Doc fixes in convert API (#9350 ) * add more info on the spacy debug command * formatting	2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granualarity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	6e833b617a	Updating Troubleshooting Docs (#9329 ) * Add link to Discussions FAQ * Remove old FAQ entries I think these are no longer relevant. - no-cache-dir: affected pip versions are very old now - narrow unicode: not an issue from py3.3+ - utf-8 osx: upstream bug closed in 2019 Some of the other issues are also maybe not frequent.	2021-10-01 12:28:22 +02:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
Ines Montani	6bb0324b81	Adjust kb_id visualizer templating and docs	2021-09-23 11:59:02 +02:00
Ines Montani	beb4a8c524	Merge pull request #9199 from shigapov/master (resolves #9129 )	2021-09-23 19:41:53 +10:00
Jozef Harag	865cfbc903	feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters (#9202 ) * feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters * update versioning in docs Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-09-16 12:26:41 +02:00

1 2 3 4 5 ...

1664 Commits