Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs

# Conflicts:
#	website/src/widgets/quickstart-training-generator.js
This commit is contained in:
svlandeg 2020-08-21 15:16:30 +02:00
commit 3060e4ae65
27 changed files with 357 additions and 134 deletions

2
.gitignore vendored
View File

@ -18,8 +18,6 @@ website/.npm
website/logs website/logs
*.log *.log
npm-debug.log* npm-debug.log*
website/www/
website/_deploy.sh
quickstart-training-generator.js quickstart-training-generator.js
# Cython / C extensions # Cython / C extensions

View File

@ -5,7 +5,7 @@
Thanks for your interest in contributing to spaCy 🎉 The project is maintained Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines), by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved. overview of how things are organized and most importantly, how to get involved.
## Table of contents ## Table of contents
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting ### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code [`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're `black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format following to your `settings.json` to use `black` for formatting and auto-format
@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
If the function is user-facing and takes a path as an argument, it should check If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always `pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO. 3's asynchronous IO.
@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
many "traps for new players". Working in Cython is very rewarding once you're many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast, something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`? have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so these questions, and you'll usually be dissatisfied with your results — so
@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCys parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) - [Multi-threading spaCys parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests ## Adding tests
@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
all test files and test functions need to be prefixed with `test_`. all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports. avoid unnecessary imports.

View File

@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
## 💬 Where to ask questions ## 💬 Where to ask questions
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
[@ines](https://github.com/ines), along with core contributors [@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
[@svlandeg](https://github.com/svlandeg) and
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
be able to provide individual support via email. We also believe that help is be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from much more valuable if it's shared publicly, so that more people can benefit from

View File

@ -24,7 +24,7 @@ class Optimizations(str, Enum):
@init_cli.command("config") @init_cli.command("config")
def init_config_cli( def init_config_cli(
# fmt: off # fmt: off
output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True), output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"), lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"), pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."), optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@ -110,6 +110,13 @@ def init_config(
"word_vectors": reco["word_vectors"], "word_vectors": reco["word_vectors"],
"has_letters": reco["has_letters"], "has_letters": reco["has_letters"],
} }
if variables["transformer_data"] and not has_spacy_transformers():
msg.warn(
"To generate a more effective transformer-based config (GPU-only), "
"install the spacy-transformers package and re-run this command. "
"The config generated now does not use transformers."
)
variables["transformer_data"] = None
base_template = template.render(variables).strip() base_template = template.render(variables).strip()
# Giving up on getting the newlines right in jinja for now # Giving up on getting the newlines right in jinja for now
base_template = re.sub(r"\n\n\n+", "\n\n", base_template) base_template = re.sub(r"\n\n\n+", "\n\n", base_template)
@ -126,8 +133,6 @@ def init_config(
for label, value in use_case.items(): for label, value in use_case.items():
msg.text(f"- {label}: {value}") msg.text(f"- {label}: {value}")
use_transformer = bool(template_vars.use_transformer) use_transformer = bool(template_vars.use_transformer)
if use_transformer:
require_spacy_transformers(msg)
with show_validation_error(hint_fill=False): with show_validation_error(hint_fill=False):
config = util.load_config_from_str(base_template) config = util.load_config_from_str(base_template)
nlp, _ = util.load_model_from_config(config, auto_fill=True) nlp, _ = util.load_model_from_config(config, auto_fill=True)
@ -149,12 +154,10 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N
print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}") print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}")
def require_spacy_transformers(msg: Printer) -> None: def has_spacy_transformers() -> bool:
try: try:
import spacy_transformers # noqa: F401 import spacy_transformers # noqa: F401
return True
except ImportError: except ImportError:
msg.fail( return False
"Using a transformer-based pipeline requires spacy-transformers "
"to be installed.",
exits=1,
)

View File

@ -107,8 +107,8 @@ factory = "tok2vec"
@architectures = "spacy.MultiHashEmbed.v1" @architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width} width = ${components.tok2vec.model.encode.width}
rows = {{ 2000 if optimize == "efficiency" else 7000 }} rows = {{ 2000 if optimize == "efficiency" else 7000 }}
also_embed_subwords = {{ true if has_letters else false }} also_embed_subwords = {{ "true" if has_letters else "false" }}
also_use_static_vectors = {{ true if optimize == "accuracy" else false }} also_use_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}
[components.tok2vec.model.encode] [components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1" @architectures = "spacy.MaxoutWindowEncoder.v1"
@ -195,7 +195,7 @@ initial_rate = 5e-5
[training.train_corpus] [training.train_corpus]
@readers = "spacy.Corpus.v1" @readers = "spacy.Corpus.v1"
path = ${paths.train} path = ${paths.train}
max_length = {{ 500 if hardware == "gpu" else 0 }} max_length = {{ 500 if hardware == "gpu" else 2000 }}
[training.dev_corpus] [training.dev_corpus]
@readers = "spacy.Corpus.v1" @readers = "spacy.Corpus.v1"

View File

@ -252,8 +252,10 @@ class EntityRenderer:
colors.update(user_color) colors.update(user_color)
colors.update(options.get("colors", {})) colors.update(options.get("colors", {}))
self.default_color = DEFAULT_ENTITY_COLOR self.default_color = DEFAULT_ENTITY_COLOR
self.colors = colors self.colors = {label.upper(): color for label, color in colors.items()}
self.ents = options.get("ents", None) self.ents = options.get("ents", None)
if self.ents is not None:
self.ents = [ent.upper() for ent in self.ents]
self.direction = DEFAULT_DIR self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG self.lang = DEFAULT_LANG
template = options.get("template") template = options.get("template")

View File

@ -51,14 +51,14 @@ TPL_ENTS = """
TPL_ENT = """ TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;"> <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text} {text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span> <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark> </mark>
""" """
TPL_ENT_RTL = """ TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em"> <mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text} {text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span> <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark> </mark>
""" """

View File

@ -1,6 +1,6 @@
import pytest import pytest
from spacy import displacy from spacy import displacy
from spacy.displacy.render import DependencyRenderer from spacy.displacy.render import DependencyRenderer, EntityRenderer
from spacy.tokens import Span from spacy.tokens import Span
from spacy.lang.fa import Persian from spacy.lang.fa import Persian
@ -97,3 +97,17 @@ def test_displacy_render_wrapper(en_vocab):
assert html.endswith("/div>TEST") assert html.endswith("/div>TEST")
# Restore # Restore
displacy.set_render_wrapper(lambda html: html) displacy.set_render_wrapper(lambda html: html)
def test_displacy_options_case():
ents = ["foo", "BAR"]
colors = {"FOO": "red", "bar": "green"}
renderer = EntityRenderer({"ents": ents, "colors": colors})
text = "abcd"
labels = ["foo", "bar", "FOO", "BAR"]
spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))]
result = renderer.render_ents("abcde", spans, None).split("\n\n")
assert "red" in result[0] and "foo" in result[0]
assert "green" in result[1] and "bar" in result[1]
assert "red" in result[2] and "FOO" in result[2]
assert "green" in result[3] and "BAR" in result[3]

View File

@ -47,9 +47,9 @@ cdef class Tokenizer:
`infix_finditer` (callable): A function matching the signature of `infix_finditer` (callable): A function matching the signature of
`re.compile(string).finditer` to find infixes. `re.compile(string).finditer` to find infixes.
token_match (callable): A boolean function matching strings to be token_match (callable): A boolean function matching strings to be
recognised as tokens. recognized as tokens.
url_match (callable): A boolean function matching strings to be url_match (callable): A boolean function matching strings to be
recognised as tokens after considering prefixes and suffixes. recognized as tokens after considering prefixes and suffixes.
EXAMPLE: EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab) >>> tokenizer = Tokenizer(nlp.vocab)

View File

@ -399,7 +399,7 @@ one component.
> subword_features = true > subword_features = true
> ``` > ```
Build a transition-based parser model. Can apply to NER or dependency-parsing. Build a transition-based parser model. Can apply to NER or dependency parsing.
Transition-based parsing is an approach to structured prediction where the task Transition-based parsing is an approach to structured prediction where the task
of predicting the structure is mapped to a series of state transitions. You of predicting the structure is mapped to a series of state transitions. You
might find [this tutorial](https://explosion.ai/blog/parsing-english-in-python) might find [this tutorial](https://explosion.ai/blog/parsing-english-in-python)
@ -416,8 +416,6 @@ consists of either two or three subnetworks:
state representation. If not present, the output from the lower model is used state representation. If not present, the output from the lower model is used
as action scores directly. as action scores directly.
<!-- TODO: model return type -->
| Name | Description | | Name | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ | | `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
@ -426,7 +424,7 @@ consists of either two or three subnetworks:
| `maxout_pieces` | How many pieces to use in the state prediction layer. Recommended values are `1`, `2` or `3`. If `1`, the maxout non-linearity is replaced with a [`Relu`](https://thinc.ai/docs/api-layers#relu) non-linearity if `use_upper` is `True`, and no non-linearity if `False`. ~~int~~ | | `maxout_pieces` | How many pieces to use in the state prediction layer. Recommended values are `1`, `2` or `3`. If `1`, the maxout non-linearity is replaced with a [`Relu`](https://thinc.ai/docs/api-layers#relu) non-linearity if `use_upper` is `True`, and no non-linearity if `False`. ~~int~~ |
| `use_upper` | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to `False` for large pretrained models such as transformers, and `True` for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary. ~~bool~~ | | `use_upper` | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to `False` for large pretrained models such as transformers, and `True` for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary. ~~bool~~ |
| `nO` | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. ~~int~~ | | `nO` | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model~~ | | **CREATES** | The model using the architecture. ~~Model[List[Docs], List[List[Floats2d]]]~~ |
### spacy.BILUOTagger.v1 {#BILUOTagger source="spacy/ml/models/simple_ner.py"} ### spacy.BILUOTagger.v1 {#BILUOTagger source="spacy/ml/models/simple_ner.py"}

View File

@ -7,7 +7,7 @@ source: spacy/morphology.pyx
Store the possible morphological analyses for a language, and index them by Store the possible morphological analyses for a language, and index them by
hash. To save space on each token, tokens only know the hash of their hash. To save space on each token, tokens only know the hash of their
morphological analysis, so queries of morphological attributes are delegated to morphological analysis, so queries of morphological attributes are delegated to
this class. See [`MorphAnalysis`](/api/morphology#morphansalysis) for the this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the
container storing a single morphological analysis. container storing a single morphological analysis.
## Morphology.\_\_init\_\_ {#init tag="method"} ## Morphology.\_\_init\_\_ {#init tag="method"}

View File

@ -450,8 +450,8 @@ The L2 norm of the token's vector representation.
| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | | `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ |
| `tag` | Fine-grained part-of-speech. ~~int~~ | | `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ | | `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` | Morphological analysis. ~~MorphAnalysis~~ | | `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ | | `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ | | `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ | | `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ | | `lang` | Language of the parent document's vocabulary. ~~int~~ |

View File

@ -257,7 +257,7 @@ If a setting is not present in the options, the default value will be used.
| Name | Description | | Name | Description |
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ | | `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. ~~Dict[str, str]~~ | | `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ | | `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
By default, displaCy comes with colors for all entity types used by By default, displaCy comes with colors for all entity types used by
@ -632,6 +632,23 @@ validate its contents.
| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ | | `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ | | **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ |
### util.get_installed_models {#util.get_installed_models tag="function" new="3"}
List all model packages installed in the current environment. This will include
any spaCy model that was packaged with [`spacy package`](/api/cli#package).
Under the hood, model packages expose a Python entry point that spaCy can check,
without having to load the model.
> #### Example
>
> ```python
> model_names = util.get_installed_models()
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------------------------------- |
| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ |
### util.is_package {#util.is_package tag="function"} ### util.is_package {#util.is_package tag="function"}
Check if string maps to a package installed via pip. Mainly used to validate Check if string maps to a package installed via pip. Mainly used to validate

Binary file not shown.

After

Width:  |  Height:  |  Size: 224 KiB

View File

@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one.
Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and
[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity) [`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity)
method that lets you compare it with another object, and determine the method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective whether "dog" and "cat" similarity. Of course similarity is always subjective whether two words, spans
are similar really depends on how you're looking at it. spaCy's similarity model or documents are similar really depends on how you're looking at it. spaCy's
usually assumes a pretty general-purpose definition of similarity. similarity model usually assumes a pretty general-purpose definition of
similarity.
<!-- TODO: use better example here --> > #### 📝 Things to try
>
> 1. Compare two different tokens and try to find the two most _dissimilar_
> tokens in the texts with the lowest similarity score (according to the
> vectors).
> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in
> the vocabulary. You can get a lexeme via the `.lex` attribute of a token.
> You should see that the similarity results are identical to the token
> similarity.
```python ```python
### {executable="true"} ### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_md") # make sure to use larger model! nlp = spacy.load("en_core_web_md") # make sure to use larger model!
tokens = nlp("dog cat banana") doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
for token1 in tokens: # Similarity of two documents
for token2 in tokens: print(doc1, "<->", doc2, doc1.similarity(doc2))
print(token1.text, token2.text, token1.similarity(token2)) # Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
``` ```
In this case, the model's predictions are pretty on point. A dog is very similar ### What to expect from similarity results {#similarity-expectations}
to a cat, whereas a banana is not very similar to either of them. Identical
tokens are obviously 100% similar to each other (just not always exactly `1.0`, Computing similarity scores can be helpful in many situations, but it's also
because of vector math and floating point imprecisions). important to maintain **realistic expectations** about what information it can
provide. Words can be related to each over in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose. Here are some important considerations to keep in mind:
- There's no objective definition of similarity. Whether "I like burgers" and "I
like pasta" is similar **depends on your application**. Both talk about food
preferences, which makes them very similar but if you're analyzing mentions
of food, those sentences are pretty dissimilar, because they talk about very
different foods.
- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults
to the **average** of the token vectors. This means that the vector for "fast
food" is the average of the vectors for "fast" and "food", which isn't
necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is **insensitive to
the order** of the words. Two documents expressing the same meaning with
dissimilar wording will return a lower similarity score than two documents
that happen to contain the same words while expressing different meanings.
<Infobox title="Tip: Check out sense2vec" emoji="💡">
[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec)
[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by
us that builds on top of spaCy and lets you train and query more interesting and
detailed word vectors. It combines noun phrases like "fast food" or "fair game"
and includes the part-of-speech tags and entity labels. The library also
includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy)
that let you evaluate vector models and create terminology lists. For more
details, check out
[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the
semantic similarities across all Reddit comments of 2015 and 2019, see the
[interactive demo](https://explosion.ai/demos/sense2vec).
</Infobox>

View File

@ -11,6 +11,10 @@ next: /usage/training
<!-- TODO: intro, short explanation of embeddings/transformers, Tok2Vec and Transformer components, point user to processing pipelines docs for more general info that user should know first --> <!-- TODO: intro, short explanation of embeddings/transformers, Tok2Vec and Transformer components, point user to processing pipelines docs for more general info that user should know first -->
If you're looking for details on using word vectors and semantic similarity,
check out the
[linguistic features docs](/usage/linguistic-features#vectors-similarity).
<Accordion title="Whats the difference between word vectors and language models?" id="vectors-vs-language-models"> <Accordion title="Whats the difference between word vectors and language models?" id="vectors-vs-language-models">
The key difference between [word vectors](#word-vectors) and contextual language The key difference between [word vectors](#word-vectors) and contextual language
@ -180,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart). out the [training quickstart](/usage/training#quickstart).
<!-- TODO: <!-- TODO:
<Project id="en_core_bert"> <Project id="en_core_trf_lg">
The easiest way to get started is to clone a transformers-based project The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train, template. Swap in your data, edit the settings and hyperparameters and train,

View File

@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
</Accordion> </Accordion>
<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting"> <Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
If your training data only contained new entities and you didn't mix in any If your training data only contained new entities and you didn't mix in any
examples the model previously recognized, it can cause the model to "forget" examples the model previously recognized, it can cause the model to "forget"

View File

@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy") doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents) print('Before', ents)
# the model didn't recognise "fb" as an entity :( # The model didn't recognize "fb" as an entity :(
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent] doc.ents = list(doc.ents) + [fb_ent]
@ -558,11 +558,11 @@ import spacy
nlp = spacy.load("my_custom_el_model") nlp = spacy.load("my_custom_el_model")
doc = nlp("Ada Lovelace was born in London") doc = nlp("Ada Lovelace was born in London")
# document level # Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents] ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')] print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# token level # Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_] ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_] ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_] ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex from spacy.util import compile_infix_regex
# default tokenizer # Default tokenizer
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law") doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law'] print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
# modify tokenizer infix patterns # Modify tokenizer infix patterns
infixes = ( infixes = (
LIST_ELLIPSES LIST_ELLIPSES
+ LIST_ICONS + LIST_ICONS
@ -929,7 +929,7 @@ infixes = (
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
), ),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
# EDIT: commented out regex that splits on hyphens between letters: # ✅ Commented out regex that splits on hyphens between letters:
# r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
] ]
@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
<Vectors101 /> <Vectors101 />
<Infobox title="What to expect from similarity results" variant="warning">
Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each over in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose.
Also note that the similarity of `Doc` or `Span` objects defaults to the
**average** of the token vectors. This means it's insensitive to the order of
the words. Two documents expressing the same meaning with dissimilar wording
will return a lower similarity score than two documents that happen to contain
the same words while expressing different meanings.
</Infobox>
### Adding word vectors {#adding-vectors} ### Adding word vectors {#adding-vectors}
Custom word vectors can be trained using a number of open-source libraries, such Custom word vectors can be trained using a number of open-source libraries, such

View File

@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
> >
> [components.tagger] > [components.tagger]
> factory = "tagger" > factory = "tagger"
> # settings for the tagger component > # Settings for the tagger component
> >
> [components.parser] > [components.parser]
> factory = "parser" > factory = "parser"
> # settings for the parser component > # Settings for the parser component
> ``` > ```
When you load a model, spaCy first consults the model's When you load a model, spaCy first consults the model's
@ -171,7 +171,7 @@ lang = "en"
pipeline = ["tagger", "parser", "ner"] pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0" data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English() cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
nlp = cls() # 2. Initialize it nlp = cls() # 2. Initialize it
for name in pipeline: for name in pipeline:
nlp.add_pipe(name) # 3. Add the component to the pipeline nlp.add_pipe(name) # 3. Add the component to the pipeline
@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
```python ```python
### The pipeline under the hood ### The pipeline under the hood
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
for name, proc in nlp.pipeline: # iterate over components in order for name, proc in nlp.pipeline: # Iterate over components in order
doc = proc(doc) # apply each component doc = proc(doc) # Apply each component
``` ```
The current processing pipeline is available as `nlp.pipeline`, which returns a The current processing pipeline is available as `nlp.pipeline`, which returns a
@ -473,7 +473,7 @@ only being able to modify it afterwards.
> >
> @Language.component("my_component") > @Language.component("my_component")
> def my_component(doc): > def my_component(doc):
> # do something to the doc here > # Do something to the doc here
> return doc > return doc
> ``` > ```

View File

@ -511,21 +511,21 @@ from spacy.language import Language
from spacy.matcher import Matcher from spacy.matcher import Matcher
from spacy.tokens import Token from spacy.tokens import Token
# We're using a component factory because the component needs to be initialized # We're using a component factory because the component needs to be
# with the shared vocab via the nlp object # initialized with the shared vocab via the nlp object
@Language.factory("html_merger") @Language.factory("html_merger")
def create_bad_html_merger(nlp, name): def create_bad_html_merger(nlp, name):
return BadHTMLMerger(nlp) return BadHTMLMerger(nlp.vocab)
class BadHTMLMerger: class BadHTMLMerger:
def __init__(self, nlp): def __init__(self, vocab):
patterns = [ patterns = [
[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
[{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}], [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
] ]
# Register a new token extension to flag bad HTML # Register a new token extension to flag bad HTML
Token.set_extension("bad_html", default=False) Token.set_extension("bad_html", default=False)
self.matcher = Matcher(nlp.vocab) self.matcher = Matcher(vocab)
self.matcher.add("BAD_HTML", patterns) self.matcher.add("BAD_HTML", patterns)
def __call__(self, doc): def __call__(self, doc):

View File

@ -104,10 +104,10 @@ workflows, from data preprocessing to training and packaging your model.
## Training config {#config} ## Training config {#config}
> #### Migration from spaCy v2.x <!-- > #### Migration from spaCy v2.x
> >
> TODO: once we have an answer for how to update the training command > TODO: once we have an answer for how to update the training command
> (`spacy migrate`?), add details here > (`spacy migrate`?), add details here -->
Training config files include all **settings and hyperparameters** for training Training config files include all **settings and hyperparameters** for training
your model. Instead of providing lots of arguments on the command line, you only your model. Instead of providing lots of arguments on the command line, you only
@ -404,11 +404,15 @@ recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example: different tasks. For example:
<!-- TODO: select example architectures to showcase --> <!-- TODO: model return types -->
| Architecture | Description | | Architecture | Description |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCys “standard” embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ | | [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCys "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Docs], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model~~ |
<!-- TODO: link to not yet existing usage page on custom architectures etc. -->
### Metrics, training output and weighted scores {#metrics} ### Metrics, training output and weighted scores {#metrics}
@ -788,7 +792,7 @@ you save the transformer outputs for later use.
<!-- TODO: <!-- TODO:
<Project id="en_core_bert"> <Project id="en_core_trf_lg">
Try out a BERT-based model pipeline using this project template: swap in your Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and data, edit the settings and hyperparameters and train, evaluate, package and

View File

@ -10,6 +10,32 @@ menu:
## Summary {#summary} ## Summary {#summary}
<Grid cols={2}>
<div>
</div>
<Infobox title="Table of Contents" id="toc">
- [Summary](#summary)
- [New features](#features)
- [Training & config system](#features-training)
- [Transformer-based pipelines](#features-transformers)
- [Custom models](#features-custom-models)
- [End-to-end project workflows](#features-projects)
- [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components)
- [Python type hints](#features-types)
- [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v2.x](#migrating)
</Infobox>
</Grid>
## New Features {#features} ## New Features {#features}
### New training workflow and config system {#features-training} ### New training workflow and config system {#features-training}
@ -28,6 +54,8 @@ menu:
### Transformer-based pipelines {#features-transformers} ### Transformer-based pipelines {#features-transformers}
![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
<Infobox title="Details & Documentation" emoji="📖" list> <Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers), - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
@ -38,7 +66,7 @@ menu:
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener), [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer) [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en) - **Models:** [`en_core_trf_lg_sm`](/models/en)
- **Implementation:** - **Implementation:**
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
@ -46,8 +74,53 @@ menu:
### Custom models using any framework {#features-custom-models} ### Custom models using any framework {#features-custom-models}
<Infobox title="Details & Documentation" emoji="📖" list>
<!-- TODO: link to new custom models page -->
- **Thinc: **
[Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks)
- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)
</Infobox>
### Manage end-to-end workflows with projects {#features-projects} ### Manage end-to-end workflows with projects {#features-projects}
<!-- TODO: update example -->
> #### Example
>
> ```cli
> # Clone a project template
> $ python -m spacy project clone example
> $ cd example
> # Download data assets
> $ python -m spacy project assets
> # Run a workflow
> $ python -m spacy project run train
> ```
spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom models. You can start off by cloning a pre-defined project
template, adjust it to fit your needs, load in your data, train a model, export
it as a Python package and share the project templates with your team. spaCy
projects also make it easy to **integrate with other tools** in the data science
and machine learning ecosystem, including [DVC](/usage/projects#dvc) for data
version control, [Prodigy](/usage/projects#prodigy) for creating labelled data,
[Streamlit](/usage/projects#streamlit) for building interactive apps,
[FastAPI](/usage/projects#fastapi) for serving models in production,
[Ray](/usage/projects#ray) for parallel training,
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!
<!-- <Project id="some_example_project">
The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.
</Project>-->
<Infobox title="Details & Documentation" emoji="📖" list> <Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [spaCy projects](/usage/projects), - **Usage:** [spaCy projects](/usage/projects),
@ -59,6 +132,16 @@ menu:
### New built-in pipeline components {#features-pipeline-components} ### New built-in pipeline components {#features-pipeline-components}
spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case:
> #### Example
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```
| Name | Description | | Name | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. | | [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
@ -78,15 +161,37 @@ menu:
### New and improved pipeline component APIs {#features-components} ### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component` > #### Example
- `Language.analyze_pipes` >
- Adding components from other models > ```python
> @Language.component("my_component")
> def my_component(doc):
> return doc
>
> nlp.add_pipe("my_component")
> nlp.add_pipe("ner", source=other_nlp)
> nlp.analyze_pipes(pretty=True)
> ```
Defining, configuring, reusing, training and analyzing pipeline components is
now easier and more convenient. The `@Language.component` and
`@Language.factory` decorators let you register your component, define its
default configuration and meta data, like the attribute values it assigns and
requires. Any custom component can be included during training, and sourcing
components from existing pretrained models lets you **mix and match custom
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
the current pipeline and its components, including the attributes they assign,
the scores they compute during training and whether any required attributes
aren't set.
<Infobox title="Details & Documentation" emoji="📖" list> <Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components), - **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
[Defining components during training](/usage/training#config-components) [Defining components for training](/usage/training#config-components)
- **API:** [`Language`](/api/language) - **API:** [`@Language.component`](/api/language#component),
[`@Language.factory`](/api/language#factory),
[`Language.add_pipe`](/api/language#add_pipe),
[`Language.analyze_pipes`](/api/language#analyze_pipes)
- **Implementation:** - **Implementation:**
[`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py) [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
@ -136,13 +241,14 @@ in your config and see validation errors if the argument values don't match.
</Infobox> </Infobox>
### New methods, attributes and commands ### New methods, attributes and commands {#new-methods}
The following methods, attributes and commands are new in spaCy v3.0. The following methods, attributes and commands are new in spaCy v3.0.
| Name | Description | | Name | Description |
| ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). | | [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. | | [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. | | [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. | | [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
@ -153,9 +259,53 @@ The following methods, attributes and commands are new in spaCy v3.0.
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. | | [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). | | [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). | | [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). | | [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). | | [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
### New and updated documentation {#new-docs}
<Grid cols={2} gutterBottom={false}>
<div>
To help you get started with spaCy v3.0 and the new features, we've added
several new or rewritten documentation pages, including a new usage guide on
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
a guide on [training models](/usage/training) rewritten from scratch, a page
explaining the new [spaCy projects](/usage/projects) and updated usage
documentation on
[custom pipeline components](/usage/processing-pipelines#custom-components).
We've also added a bunch of new illustrations and new API reference pages
documenting spaCy's machine learning [model architectures](/api/architectures)
and the expected [data formats](/api/data-formats). API pages about
[pipeline components](/api/#architecture-pipeline) now include more information,
like the default config and implementation, and we've adopted a more detailed
format for documenting argument and return types.
</div>
[![Library architecture](../images/architecture.svg)](/api)
</Grid>
<Infobox title="New or reworked documentation" emoji="📖" list>
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
[Training models](/usage/training), [Projects](/usage/projects),
[Custom pipeline components](/usage/processing-pipelines#custom-components),
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
- **API Reference: ** [Library architecture](/api),
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
[`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
[`Morphologizer`](/api/morphologizer),
[`AttributeRuler`](/api/attributeruler),
[`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe),
[`Corpus`](/api/corpus)
</Infobox>
## Backwards Incompatibilities {#incompat} ## Backwards Incompatibilities {#incompat}
As always, we've tried to keep the breaking changes to a minimum and focus on As always, we've tried to keep the breaking changes to a minimum and focus on
@ -213,14 +363,15 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
### Removed or renamed API {#incompat-removed} ### Removed or renamed API {#incompat-removed}
| Removed | Replacement | | Removed | Replacement |
| ------------------------------------------------------ | ----------------------------------------------------------------------------------------- | | -------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) | | `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) | | `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) | | `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) | | `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy init-model` | [`spacy init model`](/api/cli#init-model) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | | `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | | `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link` `util.set_data_path` `util.get_data_path` | not needed, model symlinks are deprecated | | `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
The following deprecated methods, attributes and arguments were removed in v3.0. The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously Most of them have been **deprecated for a while** and many would previously
@ -236,7 +387,7 @@ on them.
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) | | `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` | | keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` | | `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `verbose` argument on [`Language.evaluate`] | logging | | `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) | | `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) |
## Migrating from v2.x {#migrating} ## Migrating from v2.x {#migrating}

View File

@ -122,9 +122,9 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'
The entity visualizer lets you customize the following `options`: The entity visualizer lets you customize the following `options`:
| Argument | Description | | Argument | Description |
| -------- | -------------------------------------------------------------------------------------------------------------------------- | | -------- | ------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ | `None` | | `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ | `None` |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ | | `colors` | Color overrides. Entity types should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ |
If you specify a list of `ents`, only those entity types will be rendered for If you specify a list of `ents`, only those entity types will be rendered for
example, you can choose to display `PERSON` entities. Internally, the visualizer example, you can choose to display `PERSON` entities. Internally, the visualizer

View File

@ -6,7 +6,7 @@ import classNames from 'classnames'
import Icon from './icon' import Icon from './icon'
import classes from '../styles/link.module.sass' import classes from '../styles/link.module.sass'
import { isString } from './util' import { isString, isImage } from './util'
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
@ -39,7 +39,7 @@ export default function Link({
const dest = to || href const dest = to || href
const external = forceExternal || /(http(s?)):\/\//gi.test(dest) const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
const icon = getIcon(dest) const icon = getIcon(dest)
const withIcon = !hidden && !hideIcon && !!icon const withIcon = !hidden && !hideIcon && !!icon && !isImage(children)
const sourceWithText = withIcon && isString(children) const sourceWithText = withIcon && isString(children)
const linkClassNames = classNames(classes.root, className, { const linkClassNames = classNames(classes.root, className, {
[classes.hidden]: hidden, [classes.hidden]: hidden,

View File

@ -46,6 +46,17 @@ export function isString(obj) {
return typeof obj === 'string' || obj instanceof String return typeof obj === 'string' || obj instanceof String
} }
/**
* @param obj - The object to check.
* @returns {boolean} Whether the object is an image
*/
export function isImage(obj) {
if (!obj || !React.isValidElement(obj)) {
return false
}
return obj.props.name == 'img' || obj.props.className == 'gatsby-resp-image-wrapper'
}
/** /**
* @param obj - The object to check. * @param obj - The object to check.
* @returns {boolean} - Whether the object is empty. * @returns {boolean} - Whether the object is empty.

View File

@ -363,7 +363,7 @@ body [id]:target
color: var(--color-red-medium) color: var(--color-red-medium)
background: var(--color-red-transparent) background: var(--color-red-transparent)
&.italic &.italic, &.comment
font-style: italic font-style: italic
@ -384,9 +384,11 @@ body [id]:target
// Settings for ini syntax (config files) // Settings for ini syntax (config files)
[class*="language-ini"] [class*="language-ini"]
color: var(--syntax-comment) color: var(--syntax-comment)
font-style: italic !important
.token .token
color: var(--color-subtle) color: var(--color-subtle)
font-style: normal !important
.gatsby-highlight-code-line .gatsby-highlight-code-line
@ -424,6 +426,7 @@ body [id]:target
.cm-comment .cm-comment
color: var(--syntax-comment) color: var(--syntax-comment)
font-style: italic
.cm-keyword .cm-keyword
color: var(--syntax-keyword) color: var(--syntax-keyword)

File diff suppressed because one or more lines are too long