Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs

# Conflicts:
#	website/src/widgets/quickstart-training-generator.js
This commit is contained in:
svlandeg 2020-08-21 15:16:30 +02:00
commit 3060e4ae65
27 changed files with 357 additions and 134 deletions

2
.gitignore vendored
View File

@ -18,8 +18,6 @@ website/.npm
website/logs
*.log
npm-debug.log*
website/www/
website/_deploy.sh
quickstart-training-generator.js
# Cython / C extensions

View File

@ -5,7 +5,7 @@
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
overview of how things are organized and most importantly, how to get involved.
## Table of contents
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

View File

@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
## 💬 Where to ask questions
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
[@ines](https://github.com/ines), along with core contributors
[@svlandeg](https://github.com/svlandeg) and
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from

View File

@ -24,7 +24,7 @@ class Optimizations(str, Enum):
@init_cli.command("config")
def init_config_cli(
# fmt: off
output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True),
output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True),
lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"),
pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."),
@ -110,6 +110,13 @@ def init_config(
"word_vectors": reco["word_vectors"],
"has_letters": reco["has_letters"],
}
if variables["transformer_data"] and not has_spacy_transformers():
msg.warn(
"To generate a more effective transformer-based config (GPU-only), "
"install the spacy-transformers package and re-run this command. "
"The config generated now does not use transformers."
)
variables["transformer_data"] = None
base_template = template.render(variables).strip()
# Giving up on getting the newlines right in jinja for now
base_template = re.sub(r"\n\n\n+", "\n\n", base_template)
@ -126,8 +133,6 @@ def init_config(
for label, value in use_case.items():
msg.text(f"- {label}: {value}")
use_transformer = bool(template_vars.use_transformer)
if use_transformer:
require_spacy_transformers(msg)
with show_validation_error(hint_fill=False):
config = util.load_config_from_str(base_template)
nlp, _ = util.load_model_from_config(config, auto_fill=True)
@ -149,12 +154,10 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N
print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}")
def require_spacy_transformers(msg: Printer) -> None:
def has_spacy_transformers() -> bool:
try:
import spacy_transformers # noqa: F401
return True
except ImportError:
msg.fail(
"Using a transformer-based pipeline requires spacy-transformers "
"to be installed.",
exits=1,
)
return False
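
The hunks above turn a hard failure into a soft check: when the optional package is missing, the command warns and falls back to a config without transformers. A minimal sketch of that pattern follows. The helper names `has_module` and `pick_transformer_data` are hypothetical and not part of spaCy; the sketch only mirrors the control flow shown in the diff.

```python
import importlib


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


def pick_transformer_data(variables: dict, module_name: str) -> dict:
    """Hypothetical helper: drop the transformer settings if the optional
    dependency is missing, instead of exiting with an error."""
    result = dict(variables)
    if result.get("transformer_data") and not has_module(module_name):
        print(f"Warning: {module_name} is not installed, falling back.")
        result["transformer_data"] = None
    return result


config_vars = {"transformer_data": {"name": "some-model"}}
# "json" is in the standard library, so the settings are kept
kept = pick_transformer_data(config_vars, "json")
# This module name is made up, so the settings are dropped
dropped = pick_transformer_data(config_vars, "not_a_real_module_xyz")
print(kept["transformer_data"], dropped["transformer_data"])
```

Returning a boolean keeps the decision with the caller, which can then choose between warning and exiting.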

View File

@ -107,8 +107,8 @@ factory = "tok2vec"
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode.width}
rows = {{ 2000 if optimize == "efficiency" else 7000 }}
also_embed_subwords = {{ true if has_letters else false }}
also_use_static_vectors = {{ true if optimize == "accuracy" else false }}
also_embed_subwords = {{ "true" if has_letters else "false" }}
also_use_static_vectors = {{ "true" if optimize == "accuracy" else "false" }}
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
@ -195,7 +195,7 @@ initial_rate = 5e-5
[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = {{ 500 if hardware == "gpu" else 0 }}
max_length = {{ 500 if hardware == "gpu" else 2000 }}
[training.dev_corpus]
@readers = "spacy.Corpus.v1"

View File

@ -252,8 +252,10 @@ class EntityRenderer:
colors.update(user_color)
colors.update(options.get("colors", {}))
self.default_color = DEFAULT_ENTITY_COLOR
self.colors = colors
self.colors = {label.upper(): color for label, color in colors.items()}
self.ents = options.get("ents", None)
if self.ents is not None:
self.ents = [ent.upper() for ent in self.ents]
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
template = options.get("template")
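
The hunk above normalizes both the `colors` keys and the `ents` filter to uppercase, so user options match regardless of case. A rough, self-contained sketch of the idea follows. The class name `CaseInsensitiveOptions`, the `color_for` method and the placeholder default color are hypothetical, and the lookup side (uppercasing the label when it is looked up) is an assumption, since the diff only shows the constructor.

```python
DEFAULT_COLOR = "#ddd"  # placeholder default, not necessarily spaCy's value


class CaseInsensitiveOptions:
    def __init__(self, options: dict):
        colors = options.get("colors", {})
        # Store every color under an uppercase key
        self.colors = {label.upper(): color for label, color in colors.items()}
        ents = options.get("ents")
        # None means "no filter"; otherwise compare in uppercase
        self.ents = [ent.upper() for ent in ents] if ents is not None else None

    def color_for(self, label: str):
        """Return the color for a label, or None if the label is filtered out."""
        key = label.upper()
        if self.ents is not None and key not in self.ents:
            return None
        return self.colors.get(key, DEFAULT_COLOR)


opts = CaseInsensitiveOptions(
    {"ents": ["foo", "BAR"], "colors": {"FOO": "red", "bar": "green"}}
)
for label in ["foo", "bar", "FOO", "BAR", "baz"]:
    print(label, opts.color_for(label))
```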

View File

@ -51,14 +51,14 @@ TPL_ENTS = """
TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark>
"""
TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>
"""

View File

@ -1,6 +1,6 @@
import pytest
from spacy import displacy
from spacy.displacy.render import DependencyRenderer
from spacy.displacy.render import DependencyRenderer, EntityRenderer
from spacy.tokens import Span
from spacy.lang.fa import Persian
@ -97,3 +97,17 @@ def test_displacy_render_wrapper(en_vocab):
assert html.endswith("/div>TEST")
# Restore
displacy.set_render_wrapper(lambda html: html)
def test_displacy_options_case():
ents = ["foo", "BAR"]
colors = {"FOO": "red", "bar": "green"}
renderer = EntityRenderer({"ents": ents, "colors": colors})
text = "abcd"
labels = ["foo", "bar", "FOO", "BAR"]
spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))]
result = renderer.render_ents("abcde", spans, None).split("\n\n")
assert "red" in result[0] and "foo" in result[0]
assert "green" in result[1] and "bar" in result[1]
assert "red" in result[2] and "FOO" in result[2]
assert "green" in result[3] and "BAR" in result[3]

View File

@ -47,9 +47,9 @@ cdef class Tokenizer:
`infix_finditer` (callable): A function matching the signature of
`re.compile(string).finditer` to find infixes.
token_match (callable): A boolean function matching strings to be
recognised as tokens.
recognized as tokens.
url_match (callable): A boolean function matching strings to be
recognised as tokens after considering prefixes and suffixes.
recognized as tokens after considering prefixes and suffixes.
EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab)

View File

@ -399,7 +399,7 @@ one component.
> subword_features = true
> ```
Build a transition-based parser model. Can apply to NER or dependency-parsing.
Build a transition-based parser model. Can apply to NER or dependency parsing.
Transition-based parsing is an approach to structured prediction where the task
of predicting the structure is mapped to a series of state transitions. You
might find [this tutorial](https://explosion.ai/blog/parsing-english-in-python)
@ -416,8 +416,6 @@ consists of either two or three subnetworks:
state representation. If not present, the output from the lower model is used
as action scores directly.
<!-- TODO: model return type -->
| Name | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
@ -426,7 +424,7 @@ consists of either two or three subnetworks:
| `maxout_pieces` | How many pieces to use in the state prediction layer. Recommended values are `1`, `2` or `3`. If `1`, the maxout non-linearity is replaced with a [`Relu`](https://thinc.ai/docs/api-layers#relu) non-linearity if `use_upper` is `True`, and no non-linearity if `False`. ~~int~~ |
| `use_upper` | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to `False` for large pretrained models such as transformers, and `True` for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary. ~~bool~~ |
| `nO` | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. ~~int~~ |
| **CREATES** | The model using the architecture. ~~Model~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[List[Floats2d]]]~~ |
### spacy.BILUOTagger.v1 {#BILUOTagger source="spacy/ml/models/simple_ner.py"}

View File

@ -7,7 +7,7 @@ source: spacy/morphology.pyx
Store the possible morphological analyses for a language, and index them by
hash. To save space on each token, tokens only know the hash of their
morphological analysis, so queries of morphological attributes are delegated to
this class. See [`MorphAnalysis`](/api/morphology#morphansalysis) for the
this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the
container storing a single morphological analysis.
## Morphology.\_\_init\_\_ {#init tag="method"}

View File

@ -450,8 +450,8 @@ The L2 norm of the token's vector representation.
| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ |
| `tag` | Fine-grained part-of-speech. ~~int~~ |
| `tag_` | Fine-grained part-of-speech. ~~str~~ |
| `morph` | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| `morph` <Tag variant="new">3</Tag> | Morphological analysis. ~~MorphAnalysis~~ |
| `morph_` <Tag variant="new">3</Tag> | Morphological analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ |
| `dep` | Syntactic dependency relation. ~~int~~ |
| `dep_` | Syntactic dependency relation. ~~str~~ |
| `lang` | Language of the parent document's vocabulary. ~~int~~ |

View File

@ -257,7 +257,7 @@ If a setting is not present in the options, the default value will be used.
| Name | Description |
| --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. ~~Dict[str, str]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
By default, displaCy comes with colors for all entity types used by
@ -632,6 +632,23 @@ validate its contents.
| `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ |
| **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ |
### util.get_installed_models {#util.get_installed_models tag="function" new="3"}
List all model packages installed in the current environment. This will include
any spaCy model that was packaged with [`spacy package`](/api/cli#package).
Under the hood, model packages expose a Python entry point that spaCy can check,
without having to load the model.
> #### Example
>
> ```python
> model_names = util.get_installed_models()
> ```
| Name | Description |
| ----------- | ---------------------------------------------------------------------------------- |
| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ |
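
The entry point mechanism described above can be illustrated with the standard library alone. The sketch below is not spaCy's implementation: the group name `spacy_models` is an assumption about how model packages register themselves, and `list_entry_point_names` is a hypothetical helper. It handles both the older dict-style and the newer `select()`-style return value of `importlib.metadata.entry_points()`.

```python
from importlib.metadata import entry_points


def list_entry_point_names(group: str) -> list:
    """Return the sorted names of all entry points registered under `group`,
    without importing the packages that registered them."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions: selectable collection of entry points
        selected = eps.select(group=group)
    else:
        # Older Python versions: plain dict keyed by group name
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Assumed group name; the result is empty if no model packages are installed
print(list_entry_point_names("spacy_models"))
```

Because only package metadata is read, a check like this stays cheap even when many large packages are installed.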
### util.is_package {#util.is_package tag="function"}
Check if string maps to a package installed via pip. Mainly used to validate

Binary file not shown.


View File

@ -80,25 +80,73 @@ duplicate if it's very similar to an already existing one.
Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and
[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity)
method that lets you compare it with another object, and determine the
similarity. Of course similarity is always subjective: whether "dog" and "cat"
are similar really depends on how you're looking at it. spaCy's similarity model
usually assumes a pretty general-purpose definition of similarity.
similarity. Of course similarity is always subjective: whether two words, spans
or documents are similar really depends on how you're looking at it. spaCy's
similarity model usually assumes a pretty general-purpose definition of
similarity.
<!-- TODO: use better example here -->
> #### 📝 Things to try
>
> 1. Compare two different tokens and try to find the two most _dissimilar_
> tokens in the texts with the lowest similarity score (according to the
> vectors).
> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in
> the vocabulary. You can get a lexeme via the `.lex` attribute of a token.
> You should see that the similarity results are identical to the token
> similarity.
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_md") # make sure to use larger model!
tokens = nlp("dog cat banana")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
for token1 in tokens:
for token2 in tokens:
print(token1.text, token2.text, token1.similarity(token2))
# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```
In this case, the model's predictions are pretty on point. A dog is very similar
to a cat, whereas a banana is not very similar to either of them. Identical
tokens are obviously 100% similar to each other (just not always exactly `1.0`,
because of vector math and floating point imprecisions).
### What to expect from similarity results {#similarity-expectations}
Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose. Here are some important considerations to keep in mind:
- There's no objective definition of similarity. Whether "I like burgers" and "I
like pasta" are similar **depends on your application**. Both talk about food
preferences, which makes them very similar, but if you're analyzing mentions
of food, those sentences are pretty dissimilar, because they talk about very
different foods.
- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults
to the **average** of the token vectors. This means that the vector for "fast
food" is the average of the vectors for "fast" and "food", which isn't
necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is **insensitive to
the order** of the words. Two documents expressing the same meaning with
dissimilar wording will return a lower similarity score than two documents
that happen to contain the same words while expressing different meanings.
<Infobox title="Tip: Check out sense2vec" emoji="💡">
[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec)
[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by
us that builds on top of spaCy and lets you train and query more interesting and
detailed word vectors. It combines noun phrases like "fast food" or "fair game"
and includes the part-of-speech tags and entity labels. The library also
includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy)
that let you evaluate vector models and create terminology lists. For more
details, check out
[our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the
semantic similarities across all Reddit comments of 2015 and 2019, see the
[interactive demo](https://explosion.ai/demos/sense2vec).
</Infobox>
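
The point about vector averaging can be checked with a few lines of plain Python. This is a toy sketch, not spaCy code: the three-dimensional vectors are made up for illustration, and `average`, `cosine` and `doc_vector` are hypothetical helpers. Two texts containing the same words in a different order get the same averaged vector, so their cosine similarity is essentially 1.0 even though their meanings differ.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional word vectors, for illustration only
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man": [0.1, 0.2, 0.9],
}


def doc_vector(text):
    """Average the toy vectors of all whitespace-separated words."""
    return average([toy_vectors[word] for word in text.split()])


same_words = cosine(doc_vector("dog bites man"), doc_vector("man bites dog"))
print(round(same_words, 6))
```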

View File

@ -11,6 +11,10 @@ next: /usage/training
<!-- TODO: intro, short explanation of embeddings/transformers, Tok2Vec and Transformer components, point user to processing pipelines docs for more general info that user should know first -->
If you're looking for details on using word vectors and semantic similarity,
check out the
[linguistic features docs](/usage/linguistic-features#vectors-similarity).
<Accordion title="What's the difference between word vectors and language models?" id="vectors-vs-language-models">
The key difference between [word vectors](#word-vectors) and contextual language
@ -180,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).
<!-- TODO:
<Project id="en_core_bert">
<Project id="en_core_trf_lg">
The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,

View File

@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
</Accordion>
<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
If your training data only contained new entities and you didn't mix in any
examples the model previously recognized, it can cause the model to "forget"

View File

@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(
# The model didn't recognize "fb" as an entity :(
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]
@ -558,11 +558,11 @@ import spacy
nlp = spacy.load("my_custom_el_model")
doc = nlp("Ada Lovelace was born in London")
# document level
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# token level
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
# modify tokenizer infix patterns
# Modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
@ -929,8 +929,8 @@ infixes = (
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
# EDIT: commented out regex that splits on hyphens between letters:
#r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
# ✅ Commented out regex that splits on hyphens between letters:
# r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
@ -1547,23 +1547,6 @@ import Vectors101 from 'usage/101/\_vectors-similarity.md'
<Vectors101 />
<Infobox title="What to expect from similarity results" variant="warning">
Computing similarity scores can be helpful in many situations, but it's also
important to maintain **realistic expectations** about what information it can
provide. Words can be related to each other in many ways, so a single
"similarity" score will always be a **mix of different signals**, and vectors
trained on different data can produce very different results that may not be
useful for your purpose.
Also note that the similarity of `Doc` or `Span` objects defaults to the
**average** of the token vectors. This means it's insensitive to the order of
the words. Two documents expressing the same meaning with dissimilar wording
will return a lower similarity score than two documents that happen to contain
the same words while expressing different meanings.
</Infobox>
### Adding word vectors {#adding-vectors}
Custom word vectors can be trained using a number of open-source libraries, such

View File

@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
>
> [components.tagger]
> factory = "tagger"
> # settings for the tagger component
> # Settings for the tagger component
>
> [components.parser]
> factory = "parser"
> # settings for the parser component
> # Settings for the parser component
> ```
When you load a model, spaCy first consults the model's
@ -171,11 +171,11 @@ lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English()
nlp = cls() # 2. Initialize it
cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
nlp = cls() # 2. Initialize it
for name in pipeline:
nlp.add_pipe(name) # 3. Add the component to the pipeline
nlp.from_disk(data_path) # 4. Load in the binary data
nlp.add_pipe(name) # 3. Add the component to the pipeline
nlp.from_disk(data_path) # 4. Load in the binary data
```
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text
for name, proc in nlp.pipeline: # iterate over components in order
doc = proc(doc) # apply each component
doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
for name, proc in nlp.pipeline: # Iterate over components in order
doc = proc(doc) # Apply each component
```
The current processing pipeline is available as `nlp.pipeline`, which returns a
@ -473,7 +473,7 @@ only being able to modify it afterwards.
>
> @Language.component("my_component")
> def my_component(doc):
> # do something to the doc here
> # Do something to the doc here
> return doc
> ```

View File

@ -511,21 +511,21 @@ from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token
# We're using a component factory because the component needs to be initialized
# with the shared vocab via the nlp object
# We're using a component factory because the component needs to be
# initialized with the shared vocab via the nlp object
@Language.factory("html_merger")
def create_bad_html_merger(nlp, name):
return BadHTMLMerger(nlp)
return BadHTMLMerger(nlp.vocab)
class BadHTMLMerger:
def __init__(self, nlp):
def __init__(self, vocab):
patterns = [
[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
[{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
# Register a new token extension to flag bad HTML
Token.set_extension("bad_html", default=False)
self.matcher = Matcher(nlp.vocab)
self.matcher = Matcher(vocab)
self.matcher.add("BAD_HTML", patterns)
def __call__(self, doc):

View File

@ -104,10 +104,10 @@ workflows, from data preprocessing to training and packaging your model.
## Training config {#config}
> #### Migration from spaCy v2.x
<!-- > #### Migration from spaCy v2.x
>
> TODO: once we have an answer for how to update the training command
> (`spacy migrate`?), add details here
> (`spacy migrate`?), add details here -->
Training config files include all **settings and hyperparameters** for training
your model. Instead of providing lots of arguments on the command line, you only
@ -404,11 +404,15 @@ recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:
<!-- TODO: select example architectures to showcase -->
<!-- TODO: model return types -->
| Architecture | Description |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy's “standard” embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| Architecture | Description |
| ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy's "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ |
| [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Doc], List[List[Floats2d]]]~~ |
| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model~~ |
<!-- TODO: link to not yet existing usage page on custom architectures etc. -->
### Metrics, training output and weighted scores {#metrics}
@ -788,7 +792,7 @@ you save the transformer outputs for later use.
<!-- TODO:
<Project id="en_core_bert">
<Project id="en_core_trf_lg">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and

View File

@ -10,6 +10,32 @@ menu:
## Summary {#summary}
<Grid cols={2}>
<div>
</div>
<Infobox title="Table of Contents" id="toc">
- [Summary](#summary)
- [New features](#features)
- [Training & config system](#features-training)
- [Transformer-based pipelines](#features-transformers)
- [Custom models](#features-custom-models)
- [End-to-end project workflows](#features-projects)
- [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components)
- [Python type hints](#features-types)
- [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs)
- [Backwards incompatibilities](#incompat)
- [Migrating from spaCy v2.x](#migrating)
</Infobox>
</Grid>
## New Features {#features}
### New training workflow and config system {#features-training}
@ -28,6 +54,8 @@ menu:
### Transformer-based pipelines {#features-transformers}
![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg)
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
@ -38,7 +66,7 @@ menu:
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en)
- **Models:** [`en_core_trf_lg_sm`](/models/en)
- **Implementation:**
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
@ -46,8 +74,53 @@ menu:
### Custom models using any framework {#features-custom-models}
<Infobox title="Details & Documentation" emoji="📖" list>
<!-- TODO: link to new custom models page -->
- **Thinc:**
[Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks)
- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe)
</Infobox>
### Manage end-to-end workflows with projects {#features-projects}
<!-- TODO: update example -->
> #### Example
>
> ```cli
> # Clone a project template
> $ python -m spacy project clone example
> $ cd example
> # Download data assets
> $ python -m spacy project assets
> # Run a workflow
> $ python -m spacy project run train
> ```
spaCy projects let you manage and share **end-to-end spaCy workflows** for
different **use cases and domains**, and orchestrate training, packaging and
serving your custom models. You can start off by cloning a pre-defined project
template, adjust it to fit your needs, load in your data, train a model, export
it as a Python package and share the project templates with your team. spaCy
projects also make it easy to **integrate with other tools** in the data science
and machine learning ecosystem, including [DVC](/usage/projects#dvc) for data
version control, [Prodigy](/usage/projects#prodigy) for creating labelled data,
[Streamlit](/usage/projects#streamlit) for building interactive apps,
[FastAPI](/usage/projects#fastapi) for serving models in production,
[Ray](/usage/projects#ray) for parallel training,
[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more!
<!-- <Project id="some_example_project">
The easiest way to get started with an end-to-end training process is to clone a
[project](/usage/projects) template. Projects let you manage multi-step
workflows, from data preprocessing to training and packaging your model.
</Project>-->
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [spaCy projects](/usage/projects),
@ -59,6 +132,16 @@ menu:
### New built-in pipeline components {#features-pipeline-components}
spaCy v3.0 includes several new trainable and rule-based components that you can
add to your pipeline and customize for your use case:
> #### Example
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("lemmatizer")
> ```
| Name | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
@ -78,15 +161,37 @@ menu:
### New and improved pipeline component APIs {#features-components}
- `Language.factory`, `Language.component`
- `Language.analyze_pipes`
- Adding components from other models
> #### Example
>
> ```python
> @Language.component("my_component")
> def my_component(doc):
> return doc
>
> nlp.add_pipe("my_component")
> nlp.add_pipe("ner", source=other_nlp)
> nlp.analyze_pipes(pretty=True)
> ```
Defining, configuring, reusing, training and analyzing pipeline components is
now easier and more convenient. The `@Language.component` and
`@Language.factory` decorators let you register your component, define its
default configuration and metadata, like the attribute values it assigns and
requires. Any custom component can be included during training, and sourcing
components from existing pretrained models lets you **mix and match custom
pipelines**. The `nlp.analyze_pipes` method outputs structured information about
the current pipeline and its components, including the attributes they assign,
the scores they compute during training and whether any required attributes
aren't set.
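
As a minimal sketch of the factory workflow described above, the following registers a configurable component and inspects the pipeline afterwards. It assumes the decorator signature of the released v3.0 API. The component name `length_flagger` and its `min_length` setting are hypothetical and only serve as an illustration.

```python
import spacy
from spacy.language import Language


# "length_flagger" and "min_length" are made-up names for this sketch.
@Language.factory("length_flagger", default_config={"min_length": 5})
def create_length_flagger(nlp, name, min_length):
    # The factory receives the nlp object, the component name and its
    # settings, and returns the callable that processes each Doc.
    def length_flagger(doc):
        doc.user_data["long_tokens"] = [
            token.text for token in doc if len(token.text) >= min_length
        ]
        return doc

    return length_flagger


nlp = spacy.blank("en")
# Override the default setting when adding the component to the pipeline.
nlp.add_pipe("length_flagger", config={"min_length": 8})
doc = nlp("A short pipeline with components")
long_tokens = doc.user_data["long_tokens"]

# Structured information about the pipeline and its components.
analysis = nlp.analyze_pipes()
```

Because the settings are passed in via `config`, the same values can presumably also be set in the training config rather than in code.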
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:** [Custom components](/usage/processing-pipelines#custom_components),
[Defining components during training](/usage/training#config-components)
- **API:** [`Language`](/api/language)
[Defining components for training](/usage/training#config-components)
- **API:** [`@Language.component`](/api/language#component),
[`@Language.factory`](/api/language#factory),
[`Language.add_pipe`](/api/language#add_pipe),
[`Language.analyze_pipes`](/api/language#analyze_pipes)
- **Implementation:**
[`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py)
@ -136,13 +241,14 @@ in your config and see validation errors if the argument values don't match.
</Infobox>
### New methods, attributes and commands
### New methods, attributes and commands {#new-methods}
The following methods, attributes and commands are new in spaCy v3.0.
| Name | Description |
| ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). |
| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. |
| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. |
| [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. |
| [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. |
@ -153,9 +259,53 @@ The following methods, attributes and commands are new in spaCy v3.0.
| [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. |
| [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). |
| [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). |
| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. |
| [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). |
| [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). |
### New and updated documentation {#new-docs}
<Grid cols={2} gutterBottom={false}>
<div>
To help you get started with spaCy v3.0 and the new features, we've added
several new or rewritten documentation pages, including a new usage guide on
[embeddings, transformers and transfer learning](/usage/embeddings-transformers),
a guide on [training models](/usage/training) rewritten from scratch, a page
explaining the new [spaCy projects](/usage/projects) and updated usage
documentation on
[custom pipeline components](/usage/processing-pipelines#custom-components).
We've also added a bunch of new illustrations and new API reference pages
documenting spaCy's machine learning [model architectures](/api/architectures)
and the expected [data formats](/api/data-formats). API pages about
[pipeline components](/api/#architecture-pipeline) now include more information,
like the default config and implementation, and we've adopted a more detailed
format for documenting argument and return types.
</div>
[![Library architecture](../images/architecture.svg)](/api)
</Grid>
<Infobox title="New or reworked documentation" emoji="📖" list>
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
[Training models](/usage/training), [Projects](/usage/projects),
[Custom pipeline components](/usage/processing-pipelines#custom-components),
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
- **API Reference:** [Library architecture](/api),
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes:** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
[`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer),
[`Morphologizer`](/api/morphologizer),
[`AttributeRuler`](/api/attributeruler),
[`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe),
[`Corpus`](/api/corpus)
</Infobox>
## Backwards Incompatibilities {#incompat}
As always, we've tried to keep the breaking changes to a minimum and focus on
@ -212,15 +362,16 @@ Note that spaCy v3.0 now requires **Python 3.6+**.
### Removed or renamed API {#incompat-removed}
| Removed | Replacement |
| ------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link` `util.set_data_path` `util.get_data_path` | not needed, model symlinks are deprecated |
| Removed | Replacement |
| -------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
| `GoldParse` | [`Example`](/api/example) |
| `GoldCorpus` | [`Corpus`](/api/corpus) |
| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) |
| `spacy init-model` | [`spacy init model`](/api/cli#init-model) |
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
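
To illustrate the first row of the table, here is a small sketch of `Language.select_pipes` used as a context manager in place of `Language.disable_pipes`. It assumes the keyword arguments of the released v3.0 API and uses a blank pipeline with a `sentencizer` purely as an example.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Temporarily disable the component for the duration of the block.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)

# The component is restored once the block exits.
names_after = list(nlp.pipe_names)
```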
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been **deprecated for a while** and many would previously
@ -236,7 +387,7 @@ on them.
| `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` |
| `verbose` argument on [`Language.evaluate`] | logging |
| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentencerecognizer) |
## Migrating from v2.x {#migrating}


@ -121,10 +121,10 @@ import DisplacyEntHtml from 'images/displacy-ent2.html'
The entity visualizer lets you customize the following `options`:
| Argument | Description |
| -------- | -------------------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ | `None` |
| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ |
| Argument | Description |
| -------- | ------------------------------------------------------------------------------------------------------------- |
| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ |
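
A short sketch of how the two options in the table might be combined. It uses displaCy's manual mode so that no trained pipeline is needed. The example text, character offsets and color value are made up for illustration.

```python
from spacy import displacy

# Manually specified entities, so no trained pipeline is required.
example = {
    "text": "Ada Lovelace moved to London.",
    "ents": [
        {"start": 0, "end": 12, "label": "PERSON"},
        {"start": 22, "end": 28, "label": "GPE"},
    ],
    "title": None,
}

# Only highlight PERSON entities and override their background color.
options = {"ents": ["PERSON"], "colors": {"PERSON": "#ffd54f"}}
html = displacy.render(
    example, style="ent", manual=True, options=options, jupyter=False
)
```

With these settings, the `GPE` span should be left as plain text in the returned markup.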
If you specify a list of `ents`, only those entity types will be rendered. For
example, you can choose to display `PERSON` entities. Internally, the visualizer


@ -6,7 +6,7 @@ import classNames from 'classnames'
import Icon from './icon'
import classes from '../styles/link.module.sass'
import { isString } from './util'
import { isString, isImage } from './util'
const internalRegex = /(http(s?)):\/\/(prodi.gy|spacy.io|irl.spacy.io|explosion.ai|course.spacy.io)/gi
@ -39,7 +39,7 @@ export default function Link({
const dest = to || href
const external = forceExternal || /(http(s?)):\/\//gi.test(dest)
const icon = getIcon(dest)
const withIcon = !hidden && !hideIcon && !!icon
const withIcon = !hidden && !hideIcon && !!icon && !isImage(children)
const sourceWithText = withIcon && isString(children)
const linkClassNames = classNames(classes.root, className, {
[classes.hidden]: hidden,


@ -46,6 +46,17 @@ export function isString(obj) {
return typeof obj === 'string' || obj instanceof String
}
/**
* @param obj - The object to check.
* @returns {boolean} Whether the object is an image
*/
export function isImage(obj) {
if (!obj || !React.isValidElement(obj)) {
return false
}
return obj.props.name == 'img' || obj.props.className == 'gatsby-resp-image-wrapper'
}
/**
* @param obj - The object to check.
* @returns {boolean} - Whether the object is empty.

View File

@ -363,7 +363,7 @@ body [id]:target
color: var(--color-red-medium)
background: var(--color-red-transparent)
&.italic
&.italic, &.comment
font-style: italic
@ -384,9 +384,11 @@ body [id]:target
// Settings for ini syntax (config files)
[class*="language-ini"]
color: var(--syntax-comment)
font-style: italic !important
.token
color: var(--color-subtle)
font-style: normal !important
.gatsby-highlight-code-line
@ -424,6 +426,7 @@ body [id]:target
.cm-comment
color: var(--syntax-comment)
font-style: italic
.cm-keyword
color: var(--syntax-keyword)
