mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Update docs and add keyword-only tag
This commit is contained in:
parent
a35236e5f0
commit
44790c1c32
|
@ -2,7 +2,8 @@
|
|||
title: Data formats
|
||||
teaser: Details on spaCy's input and output data formats
|
||||
menu:
|
||||
- ['Training data', 'training']
|
||||
- ['Training Data', 'training']
|
||||
- ['Training Config', 'config']
|
||||
- ['Vocabulary', 'vocab']
|
||||
---
|
||||
|
||||
|
@ -74,6 +75,28 @@ from the English Wall Street Journal portion of the Penn Treebank:
|
|||
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
|
||||
```
|
||||
|
||||
## Training config {#config new="3"}
|
||||
|
||||
Config files define the training process and model pipeline and can be passed to
|
||||
[`spacy train`](/api/cli#train). They use
|
||||
[Thinc's configuration system](https://thinc.ai/docs/usage-config) under the
|
||||
hood. For details on how to use training configs, see the
|
||||
[usage documentation](/usage/training#config).
|
||||
|
||||
<Infobox variant="warning">
|
||||
|
||||
The `@` notation lets you refer to function names registered in the
|
||||
[function registry](/api/top-level#registry). For example,
|
||||
`@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of
|
||||
the name `"spacy.HashEmbedCNN.v1"` and all other values defined in its block
|
||||
will be passed into that function as arguments. Those arguments depend on the
|
||||
registered function. See the [model architectures](/api/architectures) docs for
|
||||
API details.
|
||||
|
||||
</Infobox>
|
||||
|
||||
<!-- TODO: we need to come up with a good way to present the sections and their expected values visually? -->
|
||||
|
||||
## Lexical data for vocabulary {#vocab-jsonl new="2"}
|
||||
|
||||
To populate a model's vocabulary, you can use the
|
||||
|
|
|
@ -30,12 +30,13 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
|
|||
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `words` | iterable | A list of strings to add to the container. |
|
||||
| `spaces` | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. |
|
||||
| **RETURNS** | `Doc` | The newly constructed object. |
|
||||
| Name | Type | Description |
|
||||
| -------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| _keyword-only_ | | |
|
||||
| `words` | iterable | A list of strings to add to the container. |
|
||||
| `spaces` | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. |
|
||||
| **RETURNS** | `Doc` | The newly constructed object. |
|
||||
|
||||
## Doc.\_\_getitem\_\_ {#getitem tag="method"}
|
||||
|
||||
|
|
|
@ -3,6 +3,7 @@ title: Top-level Functions
|
|||
menu:
|
||||
- ['spacy', 'spacy']
|
||||
- ['displacy', 'displacy']
|
||||
- ['registry', 'registry']
|
||||
- ['Data & Alignment', 'gold']
|
||||
- ['Utility Functions', 'util']
|
||||
---
|
||||
|
@ -259,6 +260,48 @@ package can also expose a
|
|||
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
|
||||
to add custom labels and their colors automatically.
|
||||
|
||||
## registry {#registry source="spacy/util.py" new="3"}
|
||||
|
||||
spaCy's function registry extends
|
||||
[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you
|
||||
to map strings to functions. You can register functions to create architectures,
|
||||
optimizers, schedules and more, and then refer to them and set their arguments
|
||||
in your [config file](/usage/training#config). Python type hints are used to
|
||||
validate the inputs. See the
|
||||
[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the
|
||||
`registry` methods and our helper library
|
||||
[`catalogue`](https://github.com/explosion/catalogue) for some background on the
|
||||
concept of function registries. spaCy also uses the function registry for
|
||||
language subclasses, model architecture, lookups and pipeline component
|
||||
factories.
|
||||
|
||||
<!-- TODO: improve example? -->
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> import spacy
|
||||
> from thinc.api import Model
|
||||
>
|
||||
> @spacy.registry.architectures("CustomNER.v1")
|
||||
> def custom_ner(n0: int) -> Model:
|
||||
> return Model("custom", forward, dims={"nO": nO})
|
||||
> ```
|
||||
|
||||
| Registry name | Description |
|
||||
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. |
|
||||
| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points) |
|
||||
| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `lookups` | Registry for large lookup tables available via `vocab.lookups`. |
|
||||
| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). |
|
||||
| `assets` | <!-- TODO: what is this used for again?--> |
|
||||
| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). |
|
||||
| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). |
|
||||
| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). |
|
||||
| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). |
|
||||
| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). |
|
||||
|
||||
## Training data and alignment {#gold source="spacy/gold"}
|
||||
|
||||
### gold.docs_to_json {#docs_to_json tag="function"}
|
||||
|
@ -421,6 +464,8 @@ page should be safe to use and we'll try to ensure backwards compatibility.
|
|||
However, we recommend having additional tests in place if your application
|
||||
depends on any of spaCy's utilities.
|
||||
|
||||
<!-- TODO: document new config-related util functions? -->
|
||||
|
||||
### util.get_lang_class {#util.get_lang_class tag="function"}
|
||||
|
||||
Import and load a `Language` class. Allows lazy-loading
|
||||
|
@ -705,7 +750,7 @@ of one entity) or when merging spans with
|
|||
| `spans` | iterable | The spans to filter. |
|
||||
| **RETURNS** | list | The filtered spans. |
|
||||
|
||||
## util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}
|
||||
### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"}
|
||||
|
||||
<!-- TODO: document -->
|
||||
|
||||
|
|
|
@ -103,26 +103,38 @@ still look good.
|
|||
|
||||
> #### Migration from spaCy v2.x
|
||||
>
|
||||
> TODO: ...
|
||||
> TODO: once we have an answer for how to update the training command
|
||||
> (`spacy migrate`?), add details here
|
||||
|
||||
Training config files include all **settings and hyperparameters** for training
|
||||
your model. Instead of providing lots of arguments on the command line, you only
|
||||
need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
|
||||
need to pass your `config.cfg` file to [`spacy train`](/api/cli#train). Under
|
||||
the hood, the training config uses the
|
||||
[configuration system](https://thinc.ai/docs/usage-config) provided by our
|
||||
machine learning library [Thinc](https://thinc.ai). This also makes it easy to
|
||||
integrate custom models and architectures, written in your framework of choice.
|
||||
Some of the main advantages and features of spaCy's training config are:
|
||||
|
||||
To read more about how the config system works under the hood, check out the
|
||||
[Thinc documentation](https://thinc.ai/docs/usage-config).
|
||||
|
||||
- **Structured sections.**
|
||||
- **Structured sections.** The config is grouped into sections, and nested
|
||||
sections are defined using the `.` notation. For example, `[nlp.pipeline.ner]`
|
||||
defines the settings for the pipeline's named entity recognizer. The config
|
||||
can be loaded as a Python dict.
|
||||
- **References to registered functions.** Sections can refer to registered
|
||||
functions like [model architectures](/api/architectures),
|
||||
[optimizers](https://thinc.ai/docs/api-optimizers) or
|
||||
[schedules](https://thinc.ai/docs/api-schedules) and define arguments that are
|
||||
passed into them. You can also register your own functions to define
|
||||
[custom architectures](#custom-models), reference them in your config,
|
||||
[custom architectures](#custom-models), reference them in your config and
|
||||
tweak their parameters.
|
||||
- **Interpolation.** If you have hyperparameters used by multiple components,
|
||||
define them once and reference them as variables.
|
||||
|
||||
<!-- TODO: we need to come up with a good way to present the sections and their expected values visually? -->
|
||||
- **Reproducibility with no hidden defaults.** The config file is the "single
|
||||
source of truth" and includes all settings. <!-- TODO: explain this better -->
|
||||
- **Automated checks and validation.** When you load a config, spaCy checks if
|
||||
the settings are complete and if all values have the correct types. This lets
|
||||
you catch potential mistakes early. In your custom architectures, you can use
|
||||
Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
|
||||
config which types of data to expect.
|
||||
|
||||
<!-- TODO: instead of hard-coding a full config here, we probably want to embed it from GitHub, e.g. from one of the project templates. This also makes it easier to keep it up to date, and the embed widgets take up less space-->
|
||||
|
||||
|
@ -181,6 +193,19 @@ pretrained_vectors = null
|
|||
dropout = null
|
||||
```
|
||||
|
||||
<!-- TODO: explain settings and @ notation, refer to function registry docs -->
|
||||
|
||||
<Infobox title="📖 Config format and settings">
|
||||
|
||||
For a full overview of spaCy's config format and settings, see the
|
||||
[training format documentation](/api/data-formats#config). The settings
|
||||
available for the different architectures are documented with the
|
||||
[model architectures API](/api/architectures). See the Thinc documentation for
|
||||
[optimizers](https://thinc.ai/docs/api-optimizers) and
|
||||
[schedules](https://thinc.ai/docs/api-schedules).
|
||||
|
||||
</Infobox>
|
||||
|
||||
### Model architectures {#model-architectures}
|
||||
|
||||
<!-- TODO: refer to architectures API: /api/architectures. This should document the architectures in spacy/ml/models -->
|
||||
|
|
|
@ -26,6 +26,16 @@ function getCellContent(children) {
|
|||
return children
|
||||
}
|
||||
|
||||
function isDividerRow(children) {
|
||||
if (children.length && children[0].props.name == 'td') {
|
||||
const tdChildren = children[0].props.children
|
||||
if (!Array.isArray(tdChildren)) {
|
||||
return tdChildren.props.name === 'em'
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
function isFootRow(children) {
|
||||
const rowRegex = /^(RETURNS|YIELDS|CREATES|PRINTS)/
|
||||
if (children.length && children[0].props.name === 'td') {
|
||||
|
@ -53,9 +63,11 @@ export const Th = props => <th className={classes.th} {...props} />
|
|||
|
||||
export const Tr = ({ evenodd = true, children, ...props }) => {
|
||||
const foot = isFootRow(children)
|
||||
const isDivider = isDividerRow(children)
|
||||
const trClasssNames = classNames({
|
||||
[classes.tr]: evenodd,
|
||||
[classes.footer]: foot,
|
||||
[classes.divider]: isDivider,
|
||||
'table-footer': foot,
|
||||
})
|
||||
|
||||
|
|
|
@ -49,6 +49,36 @@
|
|||
border-bottom: 2px solid var(--color-theme)
|
||||
vertical-align: bottom
|
||||
|
||||
.divider
|
||||
height: 0
|
||||
border-bottom: 1px solid var(--color-subtle)
|
||||
|
||||
td
|
||||
top: -1px
|
||||
height: 0
|
||||
position: relative
|
||||
padding: 0 !important
|
||||
|
||||
& + tr td
|
||||
padding-top: 12px
|
||||
|
||||
td em
|
||||
position: absolute
|
||||
top: -5px
|
||||
left: 10px
|
||||
display: inline-block
|
||||
background: var(--color-theme)
|
||||
color: var(--color-back)
|
||||
padding: 0 5px 1px
|
||||
font-size: 0.85rem
|
||||
text-transform: uppercase
|
||||
font-weight: bold
|
||||
border: 0
|
||||
border-radius: 1em
|
||||
font-style: normal
|
||||
white-space: nowrap
|
||||
z-index: 5
|
||||
|
||||
// Responsive table
|
||||
// Shadows adapted from "CSS only Responsive Tables" by David Bushell
|
||||
// http://codepen.io/dbushell/pen/wGaamR
|
||||
|
|
Loading…
Reference in New Issue
Block a user