Merge pull request #6253 from walterhenry/develop [ci skip]
Commit 82be1478cb

@@ -1,7 +1,7 @@
 A named entity is a "real-world object" that's assigned a name – for example, a
 person, a country, a product or a book title. spaCy can **recognize various
 types of named entities in a document, by asking the model for a
-**prediction\*\*. Because models are statistical and strongly depend on the
+prediction**. Because models are statistical and strongly depend on the
 examples they were trained on, this doesn't always work _perfectly_ and might
 need some tuning later, depending on your use case.

@@ -45,6 +45,6 @@ marks.
 While punctuation rules are usually pretty general, tokenizer exceptions
 strongly depend on the specifics of the individual language. This is why each
-[available language](/usage/models#languages) has its own subclass like
+[available language](/usage/models#languages) has its own subclass, like
 `English` or `German`, that loads in lists of hard-coded data and exception
 rules.

@@ -14,7 +14,7 @@ menu:
 >
 > To help you make the transition from v2.x to v3.0, we've uploaded the old
 > website to [**v2.spacy.io**](https://v2.spacy.io/docs). To see what's changed
-> and how to migrate, see the guide on [v3.0 guide](/usage/v3).
+> and how to migrate, see the [v3.0 guide](/usage/v3).

 import QuickstartInstall from 'widgets/quickstart-install.js'

@@ -187,7 +187,7 @@ to get the right commands for your platform and Python version.
   `sudo apt-get install build-essential python-dev git`
 - **macOS / OS X:** Install a recent version of
   [XCode](https://developer.apple.com/xcode/), including the so-called "Command
-  Line Tools". macOS and OS X ship with Python and git preinstalled.
+  Line Tools". macOS and OS X ship with Python and Git preinstalled.
 - **Windows:** Install a version of the
   [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
   or
@@ -380,7 +380,7 @@ This error may occur when running the `spacy` command from the command line.
 spaCy does not currently add an entry to your `PATH` environment variable, as
 this can lead to unexpected results, especially when using a virtual
 environment. Instead, spaCy adds an auto-alias that maps `spacy` to
-`python -m spacy]`. If this is not working as expected, run the command with
+`python -m spacy`. If this is not working as expected, run the command with
 `python -m`, yourself – for example `python -m spacy download en_core_web_sm`.
 For more info on this, see the [`download`](/api/cli#download) command.

@@ -427,8 +427,8 @@ disk has some binary files that should not go through this conversion. When they
 do, you get the error above. You can fix it by either changing your
 [`core.autocrlf`](https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration)
 setting to `"false"`, or by committing a
-[`.gitattributes`](https://git-scm.com/docs/gitattributes) file] to your
-repository to tell git on which files or folders it shouldn't do LF-to-CRLF
+[`.gitattributes`](https://git-scm.com/docs/gitattributes) file to your
+repository to tell Git on which files or folders it shouldn't do LF-to-CRLF
 conversion, with an entry like `path/to/spacy/model/** -text`. After you've done
 either of these, clone your repository again.

@@ -352,7 +352,7 @@ dropout = 0.2

 <Infobox variant="warning">

-Remember that it is best not to rely on any (hidden) default values, to ensure
+Remember that it is best not to rely on any (hidden) default values to ensure
 that training configs are complete and experiments fully reproducible.

 </Infobox>
@@ -44,7 +44,7 @@ in the [models directory](/models).

 Inflectional morphology is the process by which a root form of a word is
 modified by adding prefixes or suffixes that specify its grammatical function
-but do not changes its part-of-speech. We say that a **lemma** (root form) is
+but do not change its part-of-speech. We say that a **lemma** (root form) is
 **inflected** (modified/combined) with one or more **morphological features** to
 create a surface form. Here are some examples:

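As a quick illustration of lemma plus morphological features in code, here is a minimal sketch, assuming spaCy v3's `Token.lemma_` and `Token.morph` attributes and an installed `en_core_web_sm` model (the sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper.")
for token in doc:
    # e.g. the surface form "reading" has lemma "read", inflected with
    # features like Aspect=Prog and Tense=Pres
    print(token.text, token.lemma_, token.morph)
```
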
@@ -288,7 +288,7 @@ import DisplaCyLong2Html from 'images/displacy-long2.html'
 Because the syntactic relations form a tree, every word has **exactly one
 head**. You can therefore iterate over the arcs in the tree by iterating over
 the words in the sentence. This is usually the best way to match an arc of
-interest — from below:
+interest – from below:

 ```python
 ### {executable="true"}
@@ -397,7 +397,7 @@ for descendant in subject.subtree:
 Finally, the `.left_edge` and `.right_edge` attributes can be especially useful,
 because they give you the first and last token of the subtree. This is the
 easiest way to create a `Span` object for a syntactic phrase. Note that
-`.right_edge` gives a token **within** the subtree — so if you use it as the
+`.right_edge` gives a token **within** the subtree – so if you use it as the
 end-point of a range, don't forget to `+1`!

 ```python
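# The example itself is elided from this hunk. As a sketch of the idea above
# (assuming a parsed `doc`), the span for a token's whole subtree would be:
#
#     phrase = doc[token.left_edge.i : token.right_edge.i + 1]
#
# `.right_edge` is the last token *within* the subtree, hence the `+ 1`.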
@@ -639,7 +639,7 @@ print("After", doc.ents) # [London]

 #### Setting entity annotations in Cython {#setting-cython}

-Finally, you can always write to the underlying struct, if you compile a
+Finally, you can always write to the underlying struct if you compile a
 [Cython](http://cython.org/) function. This is easy to do, and allows you to
 write efficient native code.

@@ -763,15 +763,15 @@ import Tokenization101 from 'usage/101/\_tokenization.md'

 <Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

-spaCy introduces a novel tokenization algorithm, that gives a better balance
-between performance, ease of definition, and ease of alignment into the original
+spaCy introduces a novel tokenization algorithm that gives a better balance
+between performance, ease of definition and ease of alignment into the original
 string.

 After consuming a prefix or suffix, we consult the special cases again. We want
 the special cases to handle things like "don't" in English, and we want the same
 rule to work for "(don't)!". We do this by splitting off the open bracket, then
-the exclamation, then the close bracket, and finally matching the special case.
-Here's an implementation of the algorithm in Python, optimized for readability
+the exclamation, then the closed bracket, and finally matching the special case.
+Here's an implementation of the algorithm in Python optimized for readability
 rather than performance:

 ```python
@@ -845,7 +845,7 @@ The algorithm can be summarized as follows:
    #2.
 6. If we can't consume a prefix or a suffix, look for a URL match.
 7. If there's no URL match, then look for a special case.
-8. Look for "infixes" — stuff like hyphens etc. and split the substring into
+8. Look for "infixes" – stuff like hyphens etc. and split the substring into
    tokens on all infixes.
 9. Once we can't consume any more of the string, handle it as a single token.

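The summarized loop maps fairly directly onto code. Below is a rough, simplified sketch of those steps – not spaCy's actual implementation, with the URL-match step omitted and purely illustrative regexes:

```python
import re

def tokenize_sketch(text, special_cases, prefix_re, suffix_re, infix_re):
    """Simplified sketch of the algorithm above (URL matching omitted)."""
    tokens = []
    for substring in text.split():          # iterate over space-separated chunks
        suffixes = []
        while substring:
            if substring in special_cases:  # a special case consumes it whole
                tokens.extend(special_cases[substring])
                substring = ""
            elif prefix_re.match(substring):             # split off a prefix,
                split = prefix_re.match(substring).end() # then loop back and
                tokens.append(substring[:split])         # check again
                substring = substring[split:]
            elif suffix_re.search(substring):            # split off a suffix
                split = suffix_re.search(substring).start()
                suffixes.insert(0, substring[split:])
                substring = substring[:split]
            elif infix_re.search(substring):             # split on infixes
                start = 0
                for m in infix_re.finditer(substring):
                    tokens.append(substring[start:m.start()])
                    tokens.append(substring[m.start():m.end()])
                    start = m.end()
                tokens.append(substring[start:])
                substring = ""
            else:                           # nothing left to split off
                tokens.append(substring)
                substring = ""
        tokens.extend(suffixes)             # emit stripped suffixes in order
    return [t for t in tokens if t]

special = {"don't": ["do", "n't"]}
prefix_re = re.compile(r"""^[\("']""")
suffix_re = re.compile(r"""[\)"'!\.,]$""")
infix_re = re.compile(r"(?<=\w)-(?=\w)")
print(tokenize_sketch("(don't)!", special, prefix_re, suffix_re, infix_re))
# -> ['(', 'do', "n't", ')', '!']
```

Tracing "(don't)!" reproduces the prose above: the open bracket is split off as a prefix, the exclamation and close bracket as suffixes, and the remaining "don't" finally matches the special case.
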
@@ -862,10 +862,10 @@ intact (abbreviations like "U.S.").
 <Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

 Tokenization rules that are specific to one language, but can be **generalized
-across that language** should ideally live in the language data in
+across that language**, should ideally live in the language data in
 [`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
 Anything that's specific to a domain or text type – like financial trading
-abbreviations, or Bavarian youth slang – should be added as a special case rule
+abbreviations or Bavarian youth slang – should be added as a special case rule
 to your tokenizer instance. If you're dealing with a lot of customizations, it
 might make sense to create an entirely custom subclass.

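For the special-case-rule route, the tokenizer API looks like this – a minimal sketch using `add_special_case`, with a made-up split for illustration:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
print([t.text for t in nlp("gimme that")])  # ['gimme', 'that']

# domain-specific rule: always split "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```
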
@@ -1108,7 +1108,7 @@ tokenized `Doc`.
 ![The processing pipeline](../images/tokenizer.svg)

 To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
-custom function that takes a text, and returns a [`Doc`](/api/doc).
+custom function that takes a text and returns a [`Doc`](/api/doc).

 > #### Creating a Doc
 >
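A minimal sketch of such a replacement: a plain function that builds a `Doc` from the shared vocab. The whitespace-only rule is purely illustrative:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

def whitespace_tokenizer(text):
    # toy rule: one token per whitespace-separated chunk
    return Doc(nlp.vocab, words=text.split(" "))

nlp.tokenizer = whitespace_tokenizer
print([t.text for t in nlp("What's happened to me?")])
# ["What's", 'happened', 'to', 'me?']
```
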
@@ -1227,7 +1227,7 @@ tokenizer** it will be using at runtime. See the docs on

 #### Training with custom tokenization {#custom-tokenizer-training new="3"}

-spaCy's [training config](/usage/training#config) describe the settings,
+spaCy's [training config](/usage/training#config) describes the settings,
 hyperparameters, pipeline and tokenizer used for constructing and training the
 pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
 takes the `nlp` object and returns a tokenizer. Here, we're registering a
@@ -1463,7 +1463,7 @@ filtered_spans = filter_spans(spans)
 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
-"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You
+"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You
 can write rule-based logic that can find only the correct "its" to split, but by
 that time, the `Doc` will already be tokenized.

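For reference, a sketch of the split API on a simpler case (the token texts and head attachments here are illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")
print([t.text for t in doc])  # ['I', 'live', 'in', 'NewYork']

with doc.retokenize() as retokenizer:
    # split doc[3] into two subtokens: "New" attaches to "York"
    # (subtoken index 1), "York" attaches to "in" (doc[2])
    retokenizer.split(doc[3], ["New", "York"], heads=[(doc[3], 1), doc[2]])
print([t.text for t in doc])  # ['I', 'live', 'in', 'New', 'York']
```
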
@@ -1511,7 +1511,7 @@ the token indices after splitting.
 | `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". |

 If you don't care about the heads (for example, if you're only running the
-tokenizer and not the parser), you can each subtoken to itself:
+tokenizer and not the parser), you can attach each subtoken to itself:

 ```python
 ### {highlight="3"}
@@ -1880,7 +1880,7 @@ assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
 [`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
 table to a given number of unique entries, and returns a dictionary containing
 the removed words, mapped to `(string, score)` tuples, where `string` is the
-entry the removed word was mapped to, and `score` the similarity score between
+entry the removed word was mapped to and `score` the similarity score between
 the two words.

 ```python
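# The original example is elided from this hunk. As a sketch of the return
# value described above (values illustrative):
#
#     removed_words = nlp.vocab.prune_vectors(10000)
#     # {'Shore': ('coast', 0.732), 'Precautionary': ('caution', 0.490), ...}
#
# i.e. "Shore" now reuses the vector of "coast", with similarity score 0.732.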
@@ -132,8 +132,8 @@ should be created. spaCy will then do the following:
 2. Iterate over the **pipeline names** and look up each component name in the
    `[components]` block. The `factory` tells spaCy which
    [component factory](#custom-components-factories) to use for adding the
-   component with with [`add_pipe`](/api/language#add_pipe). The settings are
-   passed into the factory.
+   component with [`add_pipe`](/api/language#add_pipe). The settings are passed
+   into the factory.
 3. Make the **model data** available to the `Language` class by calling
    [`from_disk`](/api/language#from_disk) with the path to the data directory.

@@ -332,7 +332,7 @@ to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
-[custom components](#custom-components).
+[custom components](#custom-components)).

 ```python
 nlp.remove_pipe("parser")
@@ -391,7 +391,7 @@ vectors available – otherwise, it won't be able to make the same predictions.
 >
 > Instead of providing a `factory`, component blocks in the training
 > [config](/usage/training#config) can also define a `source`. The string needs
-> to be a loadable spaCy pipeline package or path. The
+> to be a loadable spaCy pipeline package or path.
 >
 > ```ini
 > [components.ner]
@@ -424,7 +424,7 @@ print(nlp.pipe_names)
 ### Analyzing pipeline components {#analysis new="3"}

 The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
-components in the current pipeline and outputs information about them, like the
+components in the current pipeline and outputs information about them like the
 attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
 they retokenize the `Doc` and which scores they produce during training. It will
 also show warnings if components require values that aren't set by previous
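A quick sketch of the call, assuming spaCy v3 (`pretty=True` prints a formatted overview in addition to returning the data):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# entity_linker needs doc.ents, which no earlier component assigns here,
# so the analysis will warn about the missing values
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)
```
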
@@ -518,8 +518,8 @@ doesn't, the pipeline analysis won't catch that.
 ## Creating custom pipeline components {#custom-components}

 A pipeline component is a function that receives a `Doc` object, modifies it and
-returns it – – for example, by using the current weights to make a prediction
-and set some annotation on the document. By adding a component to the pipeline,
+returns it – for example, by using the current weights to make a prediction and
+set some annotation on the document. By adding a component to the pipeline,
 you'll get access to the `Doc` at any point **during processing** – instead of
 only being able to modify it afterwards.

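A minimal sketch of such a component, registered with spaCy v3's decorator API (the component name is made up):

```python
import spacy
from spacy.language import Language

@Language.component("info_component")
def info_component(doc):
    # receives the Doc, may modify it, and must return it
    print(f"Pipeline saw {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("info_component", last=True)
doc = nlp("This is a sentence.")  # prints: Pipeline saw 5 tokens
```
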
@@ -709,9 +709,9 @@ nlp.add_pipe("my_component", config={"some_setting": False})
 <Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

 The [`@Language.component`](/api/language#component) decorator is essentially a
-**shortcut** for stateless pipeline component that don't need any settings. This
-means you don't have to always write a function that returns your function if
-there's no state to be passed through – spaCy can just take care of this for
+**shortcut** for stateless pipeline components that don't need any settings.
+This means you don't have to always write a function that returns your function
+if there's no state to be passed through – spaCy can just take care of this for
 you. The following two code examples are equivalent:

 ```python
@@ -745,7 +745,7 @@ make your factory a separate function. That's also how spaCy does it internally.

 ### Language-specific factories {#factories-language new="3"}

-There are many use case where you might want your pipeline components to be
+There are many use cases where you might want your pipeline components to be
 language-specific. Sometimes this requires entirely different implementation per
 language, sometimes the only difference is in the settings or data. spaCy allows
 you to register factories of the **same name** on both the `Language` base
@@ -966,7 +966,7 @@ components in pipelines that you [train](/usage/training). To make sure spaCy
 knows where to find your custom `@misc` function, you can pass in a Python file
 via the argument `--code`. If someone else is using your component, all they
 have to do to customize the data is to register their own function and swap out
-the name. Registered functions can also take **arguments** by the way that can
+the name. Registered functions can also take **arguments**, by the way, that can
 be defined in the config as well – you can read more about this in the docs on
 [training with custom code](/usage/training#custom-code).

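The `@misc` pattern referenced above, as a sketch – the registry decorator is spaCy v3's, while the function name and data are illustrative:

```python
import spacy

@spacy.registry.misc("my_patterns.v1")
def create_patterns():
    # a config can then request this data by name, via @misc = "my_patterns.v1"
    return [{"label": "ORG", "pattern": "spaCy"}]
```
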
@@ -1497,7 +1497,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 >
 > The hooks live on the `Doc` object because the `Span` and `Token` objects are
 > created lazily, and don't own any data. They just proxy to their parent `Doc`.
-> This turns out to be convenient here — we only have to worry about installing
+> This turns out to be convenient here – we only have to worry about installing
 > hooks in one place.

 | Name | Customizes |
@@ -69,7 +69,7 @@ python -m spacy project clone pipelines/tagger_parser_ud

 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from, if you don't want
+`--repo` option lets you define a custom repo to clone from if you don't want
 to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
 can also use any private repo you have access to with Git.

@@ -105,7 +105,7 @@ $ python -m spacy project assets
 Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
 even cloud storage such as GCS and S3. You can also fetch assets using git, by
 replacing the `url` string with a `git` block. spaCy will use Git's "sparse
-checkout" feature, to avoid download the whole repository.
+checkout" feature to avoid downloading the whole repository.

 ### 3. Run a command {#run}

@@ -310,7 +310,7 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert about missing files and mismatched checksums, so you can ensure that
+will alert you about missing files and mismatched checksums, so you can ensure that
 others are running your project with the same data.

 ### Dependencies and outputs {#deps-outputs}
@@ -358,8 +358,7 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error – it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc)
-integration. If you're planning on integrating your spaCy project with DVC, you
+data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
 can also use `outputs_no_cache` instead of `outputs` to define outputs that
 won't be cached or tracked.

@@ -55,7 +55,7 @@ abstract representations of the tokens you're looking for, using lexical
 attributes, linguistic features predicted by the model, operators, set
 membership and rich comparison. For example, you can find a noun, followed by a
 verb with the lemma "love" or "like", followed by an optional determiner and
-another token that's at least ten characters long.
+another token that's at least 10 characters long.

 </Accordion>

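That description translates into a `Matcher` pattern roughly like the sketch below – the example text is made up, and whether it matches depends on the model's predictions:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
matcher = Matcher(nlp.vocab)
pattern = [
    {"POS": "NOUN"},                                     # a noun
    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},  # loves/liked/...
    {"POS": "DET", "OP": "?"},                           # optional determiner
    {"LENGTH": {">=": 10}},                              # token >= 10 chars
]
matcher.add("NOUN_LOVE_LONGWORD", [pattern])
doc = nlp("Dogs love the countryside.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "Dogs love the countryside"
```
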
@@ -494,7 +494,7 @@ you prefer.
 | `matcher` | The matcher instance. ~~Matcher~~ |
 | `doc` | The document the matcher was used on. ~~Doc~~ |
 | `i` | Index of the current match (`matches[i`]). ~~int~~ |
-| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ |
+| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ |

 ### Creating spans from matches {#matcher-spans}

@@ -631,7 +631,7 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start : end`.sent],
+lets you determine the sentence containing the match, `doc[start:end].sent`,
 and calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.
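A sketch of such a callback, close to what the surrounding docs page builds (sentence boundaries require a pipeline component that sets them, e.g. the parser):

```python
matched_sents = []  # sentences to render with displaCy in manual mode

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # the matched span
    sent = span.sent       # the sentence containing it
    # offsets of the match *within* the sentence, for manual-mode rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

# registered when the pattern is added, e.g.:
# matcher.add("SomePattern", [pattern], on_match=collect_sents)
```
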
@@ -1454,7 +1454,7 @@ When using a trained
 extract information from your texts, you may find that the predicted span only
 includes parts of the entity you're looking for. Sometimes, this happens if
 statistical model predicts entities incorrectly. Other times, it happens if the
-way the entity type way defined in the original training corpus doesn't match
+way the entity type was defined in the original training corpus doesn't match
 what you need for your application.

 > #### Where corpora come from
@@ -1645,7 +1645,7 @@ affiliation is current, we can check the head's part-of-speech tag.
 ```python
 person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 for ent in person_entities:
-    # Because the entity is a spans, we need to use its root token. The head
+    # Because the entity is a span, we need to use its root token. The head
     # is the syntactic governor of the person, e.g. the verb
     head = ent.root.head
     if head.lemma_ == "work":
@@ -463,7 +463,7 @@ entry_points={
 }
 ```

-The factory can also implement other pipeline component like `to_disk` and
+The factory can also implement other pipeline component methods like `to_disk` and
 `from_disk` for serialization, or even `update` to make the component trainable.
 If a component exposes a `from_disk` method and is included in a pipeline, spaCy
 will call it on load. This lets you ship custom data with your pipeline package.
@@ -690,7 +690,7 @@ care of putting all this together and returning a `Language` object with the
 loaded pipeline and data. If your pipeline requires
 [custom components](/usage/processing-pipelines#custom-components) or a custom
 language class, you can also **ship the code with your package** and include it
-in the `__init__.py` – for example, to register component before the `nlp`
+in the `__init__.py` – for example, to register a component before the `nlp`
 object is created.

 <Infobox variant="warning" title="Important note on making manual edits">

@@ -551,7 +551,7 @@ or TensorFlow, make **custom modifications** to the `nlp` object, create custom
 optimizers or schedules, or **stream in data** and preprocesses it on the fly
 while training.

-Each custom function can have any numbers of arguments that are passed in via
+Each custom function can have any number of arguments that are passed in via
 the [config](#config), just the built-in functions. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a