mirror of https://github.com/explosion/spaCy.git

New batch of proofs

Just tiny fixes to the docs as a proofreader

parent 1c65b3b2c0
commit 6af585dba5
@@ -1,7 +1,7 @@
 A named entity is a "real-world object" that's assigned a name – for example, a
 person, a country, a product or a book title. spaCy can **recognize various
 types of named entities in a document, by asking the model for a
-**prediction\*\*. Because models are statistical and strongly depend on the
+prediction**. Because models are statistical and strongly depend on the
 examples they were trained on, this doesn't always work _perfectly_ and might
 need some tuning later, depending on your use case.

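The behaviour described in this hunk is easiest to see in code. A minimal sketch, assuming the `en_core_web_sm` package is installed and using an illustrative sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each predicted entity is a Span carrying the text and the label the model chose.
for ent in doc.ents:
    print(ent.text, ent.label_)
```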
@@ -45,6 +45,6 @@ marks.

 While punctuation rules are usually pretty general, tokenizer exceptions
 strongly depend on the specifics of the individual language. This is why each
-[available language](/usage/models#languages) has its own subclass like
+[available language](/usage/models#languages) has its own subclass, like
 `English` or `German`, that loads in lists of hard-coded data and exception
 rules.
@@ -641,7 +641,7 @@ print("After", doc.ents) # [London]

 #### Setting entity annotations in Cython {#setting-cython}

-Finally, you can always write to the underlying struct, if you compile a
+Finally, you can always write to the underlying struct if you compile a
 [Cython](http://cython.org/) function. This is easy to do, and allows you to
 write efficient native code.

@@ -765,15 +765,15 @@ import Tokenization101 from 'usage/101/\_tokenization.md'

 <Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

-spaCy introduces a novel tokenization algorithm, that gives a better balance
-between performance, ease of definition, and ease of alignment into the original
+spaCy introduces a novel tokenization algorithm that gives a better balance
+between performance, ease of definition and ease of alignment into the original
 string.

 After consuming a prefix or suffix, we consult the special cases again. We want
 the special cases to handle things like "don't" in English, and we want the same
 rule to work for "(don't)!". We do this by splitting off the open bracket, then
-the exclamation, then the close bracket, and finally matching the special case.
-Here's an implementation of the algorithm in Python, optimized for readability
+the exclamation, then the closed bracket, and finally matching the special case.
+Here's an implementation of the algorithm in Python optimized for readability
 rather than performance:

 ```python
@@ -847,7 +847,7 @@ The algorithm can be summarized as follows:
 #2.
 6. If we can't consume a prefix or a suffix, look for a URL match.
 7. If there's no URL match, then look for a special case.
-8. Look for "infixes" — stuff like hyphens etc. and split the substring into
+8. Look for "infixes" – stuff like hyphens etc. and split the substring into
 tokens on all infixes.
 9. Once we can't consume any more of the string, handle it as a single token.

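The summarized algorithm can be replayed on a real tokenizer with `nlp.tokenizer.explain`, which reports the rule responsible for each token. A minimal sketch (the input string is illustrative):

```python
import spacy

nlp = spacy.blank("en")
# explain() re-runs the tokenization algorithm and labels every produced token
# with the rule that created it (prefix, suffix, infix, URL, special case, token).
for rule, token_text in nlp.tokenizer.explain("(don't)!"):
    print(rule, repr(token_text))
```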
@@ -864,10 +864,10 @@ intact (abbreviations like "U.S.").
 <Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

 Tokenization rules that are specific to one language, but can be **generalized
-across that language** should ideally live in the language data in
+across that language**, should ideally live in the language data in
 [`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
 Anything that's specific to a domain or text type – like financial trading
-abbreviations, or Bavarian youth slang – should be added as a special case rule
+abbreviations or Bavarian youth slang – should be added as a special case rule
 to your tokenizer instance. If you're dealing with a lot of customizations, it
 might make sense to create an entirely custom subclass.

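A special case rule of the kind described here can be added to an existing tokenizer instance at runtime. A minimal sketch (the token texts are illustrative):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Always split "gimme" into two tokens, regardless of the general rules.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```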
@@ -1110,7 +1110,7 @@ tokenized `Doc`.
 

 To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
-custom function that takes a text, and returns a [`Doc`](/api/doc).
+custom function that takes a text and returns a [`Doc`](/api/doc).

 > #### Creating a Doc
 >
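A minimal sketch of such a replacement – a whitespace-only tokenizer is a common illustration (the class name is ours, not part of the API):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # A custom tokenizer only has to turn the text into a Doc.
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])
```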
@@ -1229,7 +1229,7 @@ tokenizer** it will be using at runtime. See the docs on

 #### Training with custom tokenization {#custom-tokenizer-training new="3"}

-spaCy's [training config](/usage/training#config) describe the settings,
+spaCy's [training config](/usage/training#config) describes the settings,
 hyperparameters, pipeline and tokenizer used for constructing and training the
 pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
 takes the `nlp` object and returns a tokenizer. Here, we're registering a
@@ -1465,7 +1465,7 @@ filtered_spans = filter_spans(spans)
 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
-"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You
+"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You
 can write rule-based logic that can find only the correct "its" to split, but by
 that time, the `Doc` will already be tokenized.

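A minimal sketch of such a split, using the "New York" style example the surrounding docs work with (the sentence is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")

# Split doc[3] ("NewYork") into two subtokens: "New" attaches to the second
# subtoken ("York"), and "York" attaches to "in" (doc[2]) in the original Doc.
with doc.retokenize() as retokenizer:
    retokenizer.split(doc[3], ["New", "York"], heads=[(doc[3], 1), doc[2]])

print([token.text for token in doc])  # ['I', 'live', 'in', 'New', 'York']
```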
@@ -1513,7 +1513,7 @@ the token indices after splitting.
 | `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". |

 If you don't care about the heads (for example, if you're only running the
-tokenizer and not the parser), you can each subtoken to itself:
+tokenizer and not the parser), you can attach each subtoken to itself:

 ```python
 ### {highlight="3"}
@@ -1879,7 +1879,7 @@ assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
 [`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
 table to a given number of unique entries, and returns a dictionary containing
 the removed words, mapped to `(string, score)` tuples, where `string` is the
-entry the removed word was mapped to, and `score` the similarity score between
+entry the removed word was mapped to and `score` the similarity score between
 the two words.

 ```python
@@ -128,7 +128,7 @@ should be created. spaCy will then do the following:
 2. Iterate over the **pipeline names** and look up each component name in the
 `[components]` block. The `factory` tells spaCy which
 [component factory](#custom-components-factories) to use for adding the
-component with with [`add_pipe`](/api/language#add_pipe). The settings are
+component with [`add_pipe`](/api/language#add_pipe). The settings are
 passed into the factory.
 3. Make the **model data** available to the `Language` class by calling
 [`from_disk`](/api/language#from_disk) with the path to the data directory.
@@ -325,7 +325,7 @@ to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
-[custom components](#custom-components).
+[custom components](#custom-components)).

 ```python
 nlp.remove_pipe("parser")
@@ -384,7 +384,7 @@ vectors available – otherwise, it won't be able to make the same predictions.
 >
 > Instead of providing a `factory`, component blocks in the training
 > [config](/usage/training#config) can also define a `source`. The string needs
-> to be a loadable spaCy pipeline package or path. The
+> to be a loadable spaCy pipeline package or path.
 >
 > ```ini
 > [components.ner]
@@ -417,7 +417,7 @@ print(nlp.pipe_names)
 ### Analyzing pipeline components {#analysis new="3"}

 The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
-components in the current pipeline and outputs information about them, like the
+components in the current pipeline and outputs information about them like the
 attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
 they retokenize the `Doc` and which scores they produce during training. It will
 also show warnings if components require values that aren't set by previous
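A minimal sketch of calling the method; the component choice mirrors the warning scenario the docs describe, since `entity_linker` needs entities that nothing earlier in this pipeline sets:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# pretty=True prints a table of assigned and required attributes and warns that
# no previous component sets doc.ents, which entity_linker requires.
analysis = nlp.analyze_pipes(pretty=True)
```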
@@ -511,7 +511,7 @@ doesn't, the pipeline analysis won't catch that.
 ## Creating custom pipeline components {#custom-components}

 A pipeline component is a function that receives a `Doc` object, modifies it and
-returns it – – for example, by using the current weights to make a prediction
+returns it – for example, by using the current weights to make a prediction
 and set some annotation on the document. By adding a component to the pipeline,
 you'll get access to the `Doc` at any point **during processing** – instead of
 only being able to modify it afterwards.
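A minimal sketch of such a component function (the component name is illustrative):

```python
import spacy
from spacy.language import Language

@Language.component("length_logger")
def length_logger(doc):
    # Receive the Doc, inspect or modify it, and always return it.
    print(f"Doc has {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("length_logger", first=True)
doc = nlp("This is a sentence.")
```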
@@ -702,7 +702,7 @@ nlp.add_pipe("my_component", config={"some_setting": False})
 <Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

 The [`@Language.component`](/api/language#component) decorator is essentially a
-**shortcut** for stateless pipeline component that don't need any settings. This
+**shortcut** for stateless pipeline components that don't need any settings. This
 means you don't have to always write a function that returns your function if
 there's no state to be passed through – spaCy can just take care of this for
 you. The following two code examples are equivalent:
@@ -888,7 +888,7 @@ components in pipelines that you [train](/usage/training). To make sure spaCy
 knows where to find your custom `@misc` function, you can pass in a Python file
 via the argument `--code`. If someone else is using your component, all they
 have to do to customize the data is to register their own function and swap out
-the name. Registered functions can also take **arguments** by the way that can
+the name. Registered functions can also take **arguments**, by the way, that can
 be defined in the config as well – you can read more about this in the docs on
 [training with custom code](/usage/training#custom-code).

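A minimal sketch of what registering such a function could look like, assuming a file passed in via `--code` (the registry name, file name and returned data are all illustrative):

```python
# functions.py – made available to spaCy via: --code functions.py
import spacy

@spacy.registry.misc("animal_patterns.v1")
def create_animal_patterns():
    # Whatever this returns can be referenced from the config via
    # {"@misc": "animal_patterns.v1"}; swapping the data only means
    # registering a different function under the name used in the config.
    return [{"label": "ANIMAL", "pattern": "golden retriever"}]
```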
@@ -963,7 +963,7 @@ doc = nlp("This is a text...")

 ### Language-specific factories {#factories-language new="3"}

-There are many use case where you might want your pipeline components to be
+There are many use cases where you might want your pipeline components to be
 language-specific. Sometimes this requires entirely different implementation per
 language, sometimes the only difference is in the settings or data. spaCy allows
 you to register factories of the **same name** on both the `Language` base
@@ -1028,8 +1028,8 @@ plug fully custom machine learning components into your pipeline. You'll need
 the following:

 1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
-can be a model using implemented in
-[Thinc](/usage/layers-architectures#thinc), or a
+can be a model implemented in
+[Thinc](/usage/layers-architectures#thinc) or a
 [wrapped model](/usage/layers-architectures#frameworks) implemented in
 PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
 list of [`Doc`](/api/doc) objects as input and can have any type of output.
@@ -1354,7 +1354,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 >
 > The hooks live on the `Doc` object because the `Span` and `Token` objects are
 > created lazily, and don't own any data. They just proxy to their parent `Doc`.
-> This turns out to be convenient here — we only have to worry about installing
+> This turns out to be convenient here – we only have to worry about installing
 > hooks in one place.

 | Name | Customizes |
@@ -73,7 +73,7 @@ python -m spacy project clone some_example_project

 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from, if you don't want
+`--repo` option lets you define a custom repo to clone from if you don't want
 to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
 can also use any private repo you have access to with Git.

@@ -109,7 +109,7 @@ $ python -m spacy project assets
 Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
 even cloud storage such as GCS and S3. You can also fetch assets using git, by
 replacing the `url` string with a `git` block. spaCy will use Git's "sparse
-checkout" feature, to avoid download the whole repository.
+checkout" feature to avoid downloading the whole repository.

 ### 3. Run a command {#run}

@@ -201,7 +201,7 @@ $ python -m spacy project push
 ```

 The `remotes` section in your `project.yml` lets you assign names to the
-different storages. To download state from a remote storage, you can use the
+different storages. To download a state from a remote storage, you can use the
 [`spacy project pull`](/api/cli#project-pull) command. For more details, see the
 docs on [remote storage](#remote).

@@ -315,7 +315,7 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert about missing files and mismatched checksums, so you can ensure that
+will alert you about missing files and mismatched checksums, so you can ensure that
 others are running your project with the same data.

 ### Dependencies and outputs {#deps-outputs}
@@ -363,8 +363,7 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error – it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc)
-integration. If you're planning on integrating your spaCy project with DVC, you
+data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
 can also use `outputs_no_cache` instead of `outputs` to define outputs that
 won't be cached or tracked.

@@ -508,7 +507,7 @@ commands:

 When your custom project is ready and you want to share it with others, you can
 use the [`spacy project document`](/api/cli#project-document) command to
-**auto-generate** a pretty, Markdown-formatted `README` file based on your
+**auto-generate** a pretty, markdown-formatted `README` file based on your
 project's `project.yml`. It will list all commands, workflows and assets defined
 in the project and include details on how to run the project, as well as links
 to the relevant spaCy documentation to make it easy for others to get started
@@ -55,7 +55,7 @@ abstract representations of the tokens you're looking for, using lexical
 attributes, linguistic features predicted by the model, operators, set
 membership and rich comparison. For example, you can find a noun, followed by a
 verb with the lemma "love" or "like", followed by an optional determiner and
-another token that's at least ten characters long.
+another token that's at least 10 characters long.

 </Accordion>

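The pattern described in that sentence can be written out as token dictionaries for the `Matcher`. A minimal sketch – the pipeline, pattern name and example text are illustrative, and whether it matches depends on the model's POS and lemma predictions:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # needs POS tags and lemmas from a trained pipeline
matcher = Matcher(nlp.vocab)

pattern = [
    {"POS": "NOUN"},                                     # a noun
    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},  # "love" or "like" as a verb
    {"POS": "DET", "OP": "?"},                           # an optional determiner
    {"LENGTH": {">=": 10}},                              # a token of at least 10 characters
]
matcher.add("LOVE_PATTERN", [pattern])

doc = nlp("People love outstanding documentation")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```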
@@ -491,7 +491,7 @@ you prefer.
 | `matcher` | The matcher instance. ~~Matcher~~ |
 | `doc` | The document the matcher was used on. ~~Doc~~ |
 | `i` | Index of the current match (`matches[i`]). ~~int~~ |
-| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ |
+| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ |

 ### Creating spans from matches {#matcher-spans}

@@ -628,7 +628,7 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start : end`.sent],
+lets you determine the sentence containing the match, `doc[start:end].sent`,
 and calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.
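A minimal sketch of such a callback, following the pattern the passage describes – the variable and label names are ours, and `span.sent` assumes sentence boundaries are set, for example by the parser:

```python
from spacy import displacy

matched_sents = []  # dictionaries collected for displaCy's "manual" mode

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    sent = span.sent  # the sentence containing the match
    # Offsets of the match relative to its sentence, for manual rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

# Later: displacy.serve(matched_sents, style="ent", manual=True)
```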
@@ -1451,7 +1451,7 @@ When using a trained
 extract information from your texts, you may find that the predicted span only
 includes parts of the entity you're looking for. Sometimes, this happens if
 statistical model predicts entities incorrectly. Other times, it happens if the
-way the entity type way defined in the original training corpus doesn't match
+way the entity type was defined in the original training corpus doesn't match
 what you need for your application.

 > #### Where corpora come from
@@ -1642,7 +1642,7 @@ affiliation is current, we can check the head's part-of-speech tag.
 ```python
 person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 for ent in person_entities:
-    # Because the entity is a spans, we need to use its root token. The head
+    # Because the entity is a span, we need to use its root token. The head
     # is the syntactic governor of the person, e.g. the verb
     head = ent.root.head
     if head.lemma_ == "work":
@@ -448,7 +448,7 @@ entry_points={
 }
 ```

-The factory can also implement other pipeline component like `to_disk` and
+The factory can also implement other pipeline components like `to_disk` and
 `from_disk` for serialization, or even `update` to make the component trainable.
 If a component exposes a `from_disk` method and is included in a pipeline, spaCy
 will call it on load. This lets you ship custom data with your pipeline package.
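A minimal sketch of a factory-created component exposing those methods, assuming its custom data is stored as JSON via `srsly` (the factory name and file name are illustrative):

```python
import srsly
from spacy.language import Language
from spacy.util import ensure_path

@Language.factory("my_component")
class CustomComponent:
    def __init__(self, nlp, name):
        self.data = {}

    def __call__(self, doc):
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Called when the pipeline is saved; ships custom data with the package.
        path = ensure_path(path)
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        # Called automatically when the pipeline is loaded from disk.
        self.data = srsly.read_json(ensure_path(path) / "data.json")
        return self
```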
@@ -666,7 +666,7 @@ care of putting all this together and returning a `Language` object with the
 loaded pipeline and data. If your pipeline requires
 [custom components](/usage/processing-pipelines#custom-components) or a custom
 language class, you can also **ship the code with your package** and include it
-in the `__init__.py` – for example, to register component before the `nlp`
+in the `__init__.py` – for example, to register a component before the `nlp`
 object is created.

 <Infobox variant="warning" title="Important note on making manual edits">
@@ -489,7 +489,7 @@ or TensorFlow, make **custom modifications** to the `nlp` object, create custom
 optimizers or schedules, or **stream in data** and preprocesses it on the fly
 while training.

-Each custom function can have any numbers of arguments that are passed in via
+Each custom function can have any number of arguments that are passed in via
 the [config](#config), just the built-in functions. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
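A minimal sketch of a registered function with default argument values that `init fill-config` could pick up (the registry name and arguments are illustrative):

```python
# functions.py – passed to spaCy via --code functions.py
from typing import List
import spacy

@spacy.registry.misc("my_terms.v1")
def create_terms(terms: List[str] = ["spaCy", "Thinc"]):
    # Because the argument has a default, spaCy can auto-fill the corresponding
    # config block when running `spacy init fill-config`.
    return list(terms)
```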