2762 Commits

Author SHA1 Message Date
Adriane Boyd
07dea324f6 Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0 2021-11-03 15:32:18 +01:00
Paul O'Leary McCann
c1cc94a33a
Fix typo about receptive field size (#9564) 2021-11-03 15:16:55 +01:00
Adriane Boyd
79cea03983
Update website model display (#9589)
* Remove vectors from core trf model descriptions

* Update accuracy labels and exclude morph_acc for ja
2021-11-03 09:56:00 +01:00
Paul O'Leary McCann
e43639b27a
Add note about round-trip serializing pipeline to API docs (#9583) 2021-11-03 09:55:30 +01:00
xxyzz
90ec820f05
Add WordDumb to spaCy Universe (#9572)
* Add WordDumb to spaCy Universe

* Add standalone category

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:41 +09:00
Bruce W. Lee (이웅성)
a4dcb68cf6
Adding LingFeat Software to spaCy Universe. (#9574)
* add lingfeat in universe

* add lingfeat in universe

* Fix JSON

* Minor cleanup

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-11-01 18:38:14 +09:00
Vasundhara
5279c7c4ba
Fix broken link to mappings-exceptions (#9573) 2021-10-31 13:44:29 +09:00
Adriane Boyd
2d430958e1 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3 2021-10-29 12:18:15 +02:00
Paul O'Leary McCann
006df1ae1f
Clarify error when words are of wrong type (#9541)
* Clarify error when words are of wrong type

See #9437

* Update docs

* Use try/except

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-29 12:08:40 +02:00
Paul O'Leary McCann
2fd8d616e7
Add docs section for spacy.cli.train.train (#9545)
* Add section for spacy.cli.train.train

* Add link from training page to train function

* Ensure path in train helper

* Update docs

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:36:34 +02:00
Adriane Boyd
5477453ea3
Docs for thinc-apple-ops (#9549)
* Docs for thinc-apple-ops

* Ignore thinc-apple-ops in reqs tests

* Fix install quickstart

* Add cupy cuda 113, 114 extras

* Remove draft section

Co-authored-by: Ines Montani <ines@ines.io>
2021-10-29 10:35:31 +02:00
Adriane Boyd
12974bf4d9
Add micro PRF for morph scoring (#9546)
* Add micro PRF for morph scoring

For pipelines where morph features are added by more than one component
and a reference training corpus may not contain all features, a micro
PRF score is more flexible than a simple accuracy score. An example is
the reading and inflection features added by the Japanese tokenizer.
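
To illustrate why a micro PRF score is more forgiving than exact-match accuracy here, the following is a minimal sketch, not spaCy's `Scorer` implementation. The function name `micro_prf`, the representation of features as sets of `Feature=Value` strings, and the sample data are assumptions made for the example.

```python
from typing import Dict, List, Set


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro-averaged precision/recall/F over per-token feature sets.

    Hypothetical helper for illustration only.
    """
    tp = fp = fn = 0
    for gold_feats, pred_feats in zip(gold, pred):
        tp += len(gold_feats & pred_feats)  # features present in both
        fp += len(pred_feats - gold_feats)  # predicted but not in the reference
        fn += len(gold_feats - pred_feats)  # in the reference but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"p": precision, "r": recall, "f": f_score}


# Token 1 carries an extra feature (as a tokenizer might add) that the
# reference corpus does not annotate; token 2 has a wrong value.
gold = [{"Case=Nom", "Number=Sing"}, {"Tense=Past"}]
pred = [{"Case=Nom", "Number=Sing", "Reading=ネコ"}, {"Tense=Pres"}]
scores = micro_prf(gold, pred)
```

With exact-match accuracy neither token would count as correct, whereas the micro scores still credit the two features that agree.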

* Use `morph_micro_f` as the default morph score for Japanese
morphologizers.

* Update docstring

* Fix typo in docstring

* Update Scorer API docs

* Fix results type

* Organize score list by attribute prefix
2021-10-29 10:29:29 +02:00
Philip Vollet
76173b0866
fixed typo and URL (#9560) 2021-10-29 13:57:44 +09:00
Adriane Boyd
c053f158c5
Add support for floret vectors (#8909)
* Add support for fasttext-bloom hash-only vectors

Overview:

* Extend `Vectors` to have two modes: `default` and `ngram`
  * `default` is the default mode and equivalent to the current
    `Vectors`
  * `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
  for `default` vectors
* Extend `spacy init vectors` to support ngram tables

The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.

https://github.com/adrianeboyd/fastText/tree/feature/bloom
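
The hash-only lookup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the spaCy or floret implementation: it uses BLAKE2b from the standard library as a stand-in for `MurmurHash3_x64_128`, a single hash per ngram, and plain Python lists instead of an array table. The helper names `char_ngrams`, `bucket` and `ngram_vector` are hypothetical.

```python
import hashlib
from typing import List


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Character ngrams of the word wrapped in fastText-style '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bucket(ngram: str, n_buckets: int) -> int:
    # Stand-in hash (BLAKE2b), NOT the MurmurHash3_x64_128 named in the commit.
    digest = hashlib.blake2b(ngram.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets


def ngram_vector(
    word: str, table: List[List[float]], minn: int = 3, maxn: int = 4
) -> List[float]:
    """Average the bucket rows selected by the word's ngrams."""
    width = len(table[0])
    rows = [table[bucket(ng, len(table))] for ng in char_ngrams(word, minn, maxn)]
    if not rows:  # word too short to yield any ngram
        return [0.0] * width
    return [sum(row[d] for row in rows) / len(rows) for d in range(width)]


# Toy table: 8 buckets, 2 dimensions.
table = [[float(i), float(i) / 2] for i in range(8)]
vector = ngram_vector("cat", table)
```

Because every vector is computed from the ngram buckets, the lookup needs the string itself rather than an integer ID, which matches the note below about `ngram` mode requiring `str`.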

Implementation details:

* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
  the API can stay consistent for both `default` (which can look up from
  `str` or `int`) and `ngram` (which requires `str` to calculate the
  ngrams).

* In ngram mode `Vectors` uses a default `Vectors` object as a cache
  since the ngram vectors lookups are relatively expensive.

  * The default cache size is the same size as the provided ngram vector
    table.

  * Once the cache is full, no more entries are added. The user is
    responsible for managing the cache in cases where the initial
    documents are not representative of the texts.

  * The cache can be resized by setting `Vectors.ngram_cache_size` or
    cleared with `vectors._ngram_cache.clear()`.

* The API ends up a bit split between methods for `default` and for
  `ngram`, so functions that only make sense for `default` or `ngram`
  include warnings with custom messages suggesting alternatives where
  possible.

* `Vocab.vectors` becomes a property so that the string stores can be
  synced when assigning vectors to a vocab.

* `Vectors` serializes its own config settings as `vectors.cfg`.

* The `Vectors` serialization methods have added support for `exclude`
  so that the `Vocab` can exclude the `Vectors` strings while serializing.

Removed:

* The `minn` and `maxn` options and related code from
  `Vocab.get_vector`, which does not work in a meaningful way for default
  vector tables.

* The unused `GlobalRegistry` in `Vectors`.

* Refactor to use reduce_mean

Refactor to use reduce_mean and remove the ngram vectors cache.

* Rename to floret

* Rename to floret in error messages

* Use --vectors-mode in CLI, vector init

* Fix vectors mode in init

* Remove unused var

* Minor API and docstrings adjustments

* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
  both modes.
* Minor updates to Vectors docstrings.

* Update API docs for Vectors and init vectors CLI

* Update types for StaticVectors
2021-10-27 14:08:31 +02:00
Adriane Boyd
a803af9dfa Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1 2021-10-26 11:53:50 +02:00
Elia Robyn Lake (Robyn Speer)
fa70837f28
clarify how to connect pretraining to training (#9450)
* clarify how to connect pretraining to training

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-22 13:15:47 +02:00
Duygu Altinok
7b98aa4c16
Corrected broken (#9505) 2021-10-20 17:31:59 +02:00
Daniël de Kok
1f05f56433
Add the spacy.models_with_nvtx_range.v1 callback (#9124)
* Add the spacy.models_with_nvtx_range.v1 callback

This callback recursively adds NVTX ranges to the Models in each pipe in
a pipeline.

* Fix create_models_with_nvtx_range type signature

* NVTX range: wrap models of all trainable pipes jointly

This prevents (sub-)models that are shared between pipes from being wrapped
twice.

* NVTX range callback: make color configurable

Add forward_color and backprop_color options to set the color for the
NVTX range.

* Move create_models_with_nvtx_range to spacy.ml

* Update create_models_with_nvtx_range for thinc changes

with_nvtx_range now updates an existing node, rather than returning a
wrapper node. So, we can simply walk over the nodes and update them.
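
The idea of walking all nodes jointly so that shared sub-models are handled only once can be sketched in plain Python. Everything here is hypothetical: `Node`, `walk` and `mark_all` are stand-ins for illustration and do not use thinc or NVTX, and the `wrap_count` counter merely represents "this node was updated".

```python
from typing import Iterable, Iterator, List, Optional


class Node:
    """Hypothetical stand-in for a model node with child layers."""

    def __init__(self, name: str, layers: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.layers = layers or []
        self.wrap_count = 0  # how many times this node was "wrapped"


def walk(root: Node) -> Iterator[Node]:
    """Yield every reachable node exactly once, even if it is shared."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        yield node
        stack.extend(node.layers)


def mark_all(pipe_models: Iterable[Node]) -> List[str]:
    """Update the models of all pipes jointly via one temporary root."""
    root = Node("all_pipes", list(pipe_models))
    marked = []
    for node in walk(root):
        if node is root:
            continue
        node.wrap_count += 1  # stand-in for updating the node in place
        marked.append(node.name)
    return marked


# A tok2vec layer shared by two pipes is only updated once.
tok2vec = Node("tok2vec")
tagger = Node("tagger", [tok2vec])
parser = Node("parser", [tok2vec])
marked = mark_all([tagger, parser])
```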

* NVTX: use after_pipeline_creation in example
2021-10-20 11:59:48 +02:00
Adriane Boyd
3f181b73d0
Add ja_core_news_trf to website (#9515) 2021-10-20 10:18:02 +02:00
Paul O'Leary McCann
222cf9b6d2
Clarify how to change base Transformer model (#9498)
* Add note about how the model name is used

* Add link to TransformersModel docs, separate paragraph

* Local link

* Revise docs

* Update website/docs/usage/embeddings-transformers.md

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-10-19 23:28:20 +02:00
Adriane Boyd
a6424bcea9
Minor updates to spacy-transformers docs for v1.1.0 (#9496) 2021-10-18 14:55:02 +02:00
Adriane Boyd
9b86209a4a
Update docs for spacy-transformers v1.1 data classes (#9361) 2021-10-18 14:16:58 +02:00
Sofie Van Landeghem
3fd3531e12
Docs for new spacy-trf architectures (#8954)
* use TransformerModel.v2 in quickstart

* update docs for new transformer architectures

* bump spacy_transformers to 1.1.0

* Add new arguments spacy-transformers.TransformerModel.v3

* Mention that mixed-precision support is experimental

* Describe delta transformers.Tok2VecTransformer versions

* add dot

* add dot, again

* Update some more TransformerModel references v2 -> v3

* Add mixed-precision options to the training quickstart

Disable mixed-precision training/prediction by default.

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-18 14:15:06 +02:00
Connor Brinton
657af5f91f
🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167)
* 🚨 Ignore all existing Mypy errors

* 🏗 Add Mypy check to CI

* Add types-mock and types-requests as dev requirements

* Add additional type ignore directives

* Add types packages to dev-only list in reqs test

* Add types-dataclasses for python 3.6

* Add ignore to pretrain

* 🏷 Improve type annotation on `run_command` helper

The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.

* 🔧 Allow variable type redefinition in limited contexts

These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:

```python
def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```

This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`

These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
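
The filepath pattern mentioned above looks roughly like the sketch below. The `ensure_path` defined here is a local stand-in written for the example, and `file_stem` is a hypothetical caller. The sketch assumes Mypy's `allow_redefinition` option is enabled; without it, Mypy would report the reassignment as an incompatible type.

```python
from pathlib import Path
from typing import Union


def ensure_path(path: Union[str, Path]) -> Path:
    """Illustrative stand-in: coerce a string to a Path, pass Paths through."""
    return Path(path) if isinstance(path, str) else path


def file_stem(filepath: str) -> str:
    # 'filepath' has type str here ...
    filepath = ensure_path(filepath)
    # ... and type Path from here on, which Mypy accepts only when
    # redefinition is allowed (assumption: allow_redefinition = True).
    return filepath.stem


stem = file_stem("corpus/train.spacy")
```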

* 🏷 Add type annotation to converters mapping

* 🚨 Fix Mypy error in convert CLI argument verification

* 🏷 Improve type annotation on `resolve_dot_names` helper

* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`

* 🏷 Add type annotations for more `Vocab` attributes

* 🏷 Add loose type annotation for gold data compilation

* 🏷 Improve `_format_labels` type annotation

* 🏷 Fix `get_lang_class` type annotation

* 🏷 Loosen return type of `Language.evaluate`

* 🏷 Don't accept `Scorer` in `handle_scores_per_type`

* 🏷 Add `string_to_list` overloads

* 🏷 Fix non-Optional command-line options

* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`

* Install `typing_extensions` in Python 3.8+

The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since spaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.

Previously spaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
fails. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.

These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.

* 🏷 Improve type annotation for `Language.pipe`

These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuples` parameter.

Fixes #8772
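
A pair of overloads keyed on a boolean flag can be sketched as below. This is a toy function, not the actual `Language.pipe` signature: the "processing" is just upper-casing, the flag is spelled `as_tuples` as in spaCy's API, and the sketch assumes Python 3.8+ so that `Literal` can be imported from `typing`.

```python
from typing import Iterable, Iterator, Literal, Tuple, TypeVar, overload

Ctx = TypeVar("Ctx")


@overload
def pipe(texts: Iterable[str], *, as_tuples: Literal[False] = ...) -> Iterator[str]:
    ...


@overload
def pipe(
    texts: Iterable[Tuple[str, Ctx]], *, as_tuples: Literal[True]
) -> Iterator[Tuple[str, Ctx]]:
    ...


def pipe(texts, *, as_tuples=False):
    """Toy implementation: upper-case each text, passing context through."""
    if as_tuples:
        for text, context in texts:
            yield text.upper(), context
    else:
        for text in texts:
            yield text.upper()


plain = list(pipe(["hello"]))
with_context = list(pipe([("hello", 1)], as_tuples=True))
```

A type checker can then infer `Iterator[str]` for the first call and `Iterator[Tuple[str, int]]` for the second, based only on the literal value passed for the flag.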

* Don't install `typing-extensions` in Python 3.8+

After more detailed analysis of how to implement Python version-specific
type annotations in spaCy, it has been determined that branching on a
comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉

These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
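
The version-branching import described above generally takes the following shape. This is a minimal sketch of the pattern, not a copy of spaCy's `compat` module; the `Mode` alias and the `describe` function are hypothetical and exist only to show `Literal` in use.

```python
import sys

# Mypy can evaluate comparisons against sys.version_info statically, so it
# knows which branch applies for the Python version being checked.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal  # backport, needed only before 3.8

# Hypothetical alias for illustration.
Mode = Literal["default", "floret"]


def describe(mode: Mode) -> str:
    return f"vectors mode: {mode}"


message = describe("default")
```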

* resolve mypy errors for Strict pydantic types

* refactor code to avoid missing return statement

* fix types of convert CLI command

* avoid list-set confusion in debug_data

* fix typo and formatting

* small fixes to avoid type ignores

* fix types in profile CLI command and make it more efficient

* type fixes in projects CLI

* put one ignore back

* type fixes for render

* fix render types - the sequel

* fix BaseDefault in language definitions

* fix type of noun_chunks iterator - yields tuple instead of span

* fix types in language-specific modules

* 🏷 Expand accepted inputs of `get_string_id`

`get_string_id` accepts either a string (in which case it returns its 
ID) or an ID (in which case it immediately returns the ID). These 
changes extend the type annotation of `get_string_id` to indicate that 
it can accept either strings or IDs.
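
The widened annotation amounts to accepting a `Union` of string and integer. The sketch below shows the shape of such a function under stated assumptions: `to_string_id` is a hypothetical name, and the BLAKE2b-based hash is a stand-in chosen for the example rather than the hash spaCy uses.

```python
import hashlib
from typing import Union


def to_string_id(key: Union[str, int]) -> int:
    """Return the ID for a string, or pass an existing ID through unchanged."""
    if isinstance(key, int):
        return key
    # Stand-in 64-bit hash for illustration; NOT spaCy's string hash.
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


from_id = to_string_id(42)
from_text = to_string_id("cat")
```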

* 🏷 Handle override types in `combine_score_weights`

The `combine_score_weights` function allows users to pass an `overrides` 
mapping to override data extracted from the `weights` argument. Since it 
allows `Optional` dictionary values, the return value may also include 
`Optional` dictionary values.

These changes update the type annotations for `combine_score_weights` to 
reflect this fact.

* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`

* 🏷 Fix redefinition of `wandb_logger`

These changes fix the redefinition of `wandb_logger` by giving a 
separate name to each `WandbLogger` version. For 
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` 
as `wandb_logger` for now.

* more fixes for typing in language

* type fixes in model definitions

* 🏷 Annotate `_RandomWords.probs` as `NDArray`

* 🏷 Annotate `tok2vec` layers to help Mypy

* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6

Also remove an import that I forgot to move to the top of the module 😅

* more fixes for matchers and other pipeline components

* quick fix for entity linker

* fixing types for spancat, textcat, etc

* bugfix for tok2vec

* type annotations for scorer

* add runtime_checkable for Protocol

* type and import fixes in tests

* mypy fixes for training utilities

* few fixes in util

* fix import

* 🐵 Remove unused `# type: ignore` directives

* 🏷 Annotate `Language._components`

* 🏷 Annotate `spacy.pipeline.Pipe`

* add doc as property to span.pyi

* small fixes and cleanup

* explicit type annotations instead of via comment

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-14 15:21:40 +02:00
Adriane Boyd
d98d525bc8 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3 2021-10-14 09:41:46 +02:00
Edward
72711dc2c9
Update universe example codes (#9422)
* Update universe plugins

* Adjust azure trigger

* Add init to tests/universe

* deliberately trying to break the universe to see if the CI catches it

* revert

Co-authored-by: svlandeg <svlandeg@github.com>
2021-10-13 16:29:19 +02:00
Paul O'Leary McCann
b53e39455e
Fix UD POS docs links (fix #9013) (#9407)
* Fix UD POS docs links (fix #9013)

The previous link seems to have been for UD v1.

* Fix link
2021-10-11 11:51:19 +02:00
Adriane Boyd
fd7edbc645
Fix types descriptions of sm and sent models (#9401) 2021-10-11 11:17:18 +02:00
Adriane Boyd
a5231cb044
Remove traces of lexemes from vocab serialization (#9400) 2021-10-11 11:13:35 +02:00
Adriane Boyd
ae1b3e960b
Update overwrite and scorer in API docs (#9384)
* Update overwrite and scorer in API docs

* Rephrase morphologizer extend + example
2021-10-11 10:35:07 +02:00
Sofie Van Landeghem
f87ae3cb7d
Doc fixes in convert API (#9350)
* add more info on the spacy debug command

* formatting
2021-10-06 13:13:18 +09:00
Elia Robyn Lake (Robyn Speer)
53b5f245ed
Allow IETF language codes, aliases, and close matches (#9342)
* use language-matching to allow language code aliases

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* link to "IETF language tags" in docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Make requirements consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* change "two-letter language ID" to "IETF language tag" in language docs

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* use langcodes 3.2 and handle language-tag errors better

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* all unknown language codes are ImportErrors

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
2021-10-05 09:52:22 +02:00
Paul O'Leary McCann
1ee6541ab0
Moving Japanese tokenizer extra info to Token.morph (#8977)
* Use morph for extra Japanese tokenizer info

Previously Japanese tokenizer info that didn't correspond to Token
fields was put in user data. Since spaCy core should avoid touching user
data, this moves most information to the Token.morph attribute. It also
adds the normalized form, which wasn't exposed before.

The subtokens, which are a list of full tokens, are still added to user
data, except with the default tokenizer granularity. With the default
tokenizer settings the subtokens are all None, so in this case the user
data is simply not set.

* Update tests

Also adds a new test for norm data.

* Update docs

* Add Japanese morphologizer factory

Set the default to `extend=True` so that the morphologizer does not
clobber the values set by the tokenizer.

* Use the norm_ field for normalized forms

Before this commit, normalized forms were put in the "norm" field in the
morph attributes. I am not sure why I did that instead of using the
token norm; I think I just forgot about it.

* Skip test if sudachipy is not installed

* Fix import

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-01 19:19:26 +02:00
Paul O'Leary McCann
6e833b617a
Updating Troubleshooting Docs (#9329)
* Add link to Discussions FAQ

* Remove old FAQ entries

I think these are no longer relevant.

- no-cache-dir: affected pip versions are *very* old now
- narrow unicode: not an issue from py3.3+
- utf-8 osx: upstream bug closed in 2019

Some of the other issues are also maybe not frequent.
2021-10-01 12:28:22 +02:00
Paul O'Leary McCann
78a88f7de7 Fix invalid json 2021-09-30 15:23:55 +09:00
Martin Vallone
a14ab7e882
Adding PhruzzMatcher to spaCy universe (#9321)
* Adding PhruzzMatcher to spaCy universe

* Fixes to make the package work properly
2021-09-30 13:46:53 +09:00
Elia Robyn Lake (Robyn Speer)
5b0b0ca809
Move WandB loggers into spacy-loggers (#9223)
* factor out the WandB logger into spacy-loggers

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* depend on spacy-loggers so they are available

Signed-off-by: Elia Robyn Speer <gh@arborelia.net>

* remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers)

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* Version number suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update references to WandbLogger

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

* make order of deps more consistent

Signed-off-by: Elia Robyn Speer <elia@explosion.ai>

Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-09-29 11:12:50 +02:00
Adriane Boyd
03f234b739 Merge remote-tracking branch 'upstream/master' into develop 2021-09-27 09:10:45 +02:00
Ines Montani
6bb0324b81 Adjust kb_id visualizer templating and docs 2021-09-23 11:59:02 +02:00
Ines Montani
beb4a8c524
Merge pull request #9199 from shigapov/master (resolves #9129) 2021-09-23 19:41:53 +10:00
Philip Vollet
d2adfe1efa
Add projects to spaCy Universe (#9269)
* Added spaCy Universe projects

* Added user license agreement Philip Vollet

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-09-23 10:56:45 +02:00
Edward
8bda39f088
Update Hammurabi example code to v3 (#9218)
* Update Hammurabi example code

* Fix typo
2021-09-16 13:32:44 +02:00
Jozef Harag
865cfbc903
feat: add spacy.WandbLogger.v3 with optional run_name and entity parameters (#9202)
* feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters

* update versioning in docs

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2021-09-16 12:26:41 +02:00
Paul O'Leary McCann
1d57d78758 Make docs consistent (fix #9126) 2021-09-16 15:54:12 +09:00
Renat Shigapov
d5cc009faf
Merge branch 'explosion:master' into master 2021-09-13 08:43:48 +02:00
Renat Shigapov
e61d93f8c3
add NEL-visualisation to manual-usage 2021-09-13 08:38:58 +02:00
Paul O'Leary McCann
f89e1c34c9
Minor typo fix in docs 2021-09-11 14:22:05 +09:00
Renat Shigapov
646f3a54db
added spaCyOpenTapioca (#9181)
* add spaCyOpenTapioca to universe

* add agreement

* fix misprint in tags
2021-09-11 13:16:51 +09:00
mylibrar
ee28aac68e
Update example code of forte (#9175)
Co-authored-by: Suqi Sun <suqi.sun@petuum.com>
2021-09-11 13:13:13 +09:00
Renat Shigapov
c1927fe994
fix misprint in tags 2021-09-09 15:37:34 +02:00