New batch of proofs
Just tiny fixes to the docs as a proofreader
This commit is contained in:
parent 1c65b3b2c0
commit 6af585dba5
@@ -1,7 +1,7 @@
 A named entity is a "real-world object" that's assigned a name – for example, a
 person, a country, a product or a book title. spaCy can **recognize various
 types of named entities in a document, by asking the model for a
-**prediction\*\*. Because models are statistical and strongly depend on the
+prediction**. Because models are statistical and strongly depend on the
 examples they were trained on, this doesn't always work _perfectly_ and might
 need some tuning later, depending on your use case.

@@ -45,6 +45,6 @@ marks.

 While punctuation rules are usually pretty general, tokenizer exceptions
 strongly depend on the specifics of the individual language. This is why each
-[available language](/usage/models#languages) has its own subclass like
+[available language](/usage/models#languages) has its own subclass, like
 `English` or `German`, that loads in lists of hard-coded data and exception
 rules.

@@ -641,7 +641,7 @@ print("After", doc.ents) # [London]

 #### Setting entity annotations in Cython {#setting-cython}

-Finally, you can always write to the underlying struct, if you compile a
+Finally, you can always write to the underlying struct if you compile a
 [Cython](http://cython.org/) function. This is easy to do, and allows you to
 write efficient native code.

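For reference, the standard Python-level way to set entity annotations (a minimal sketch of the `doc.ents` API, not the Cython struct approach this section describes):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("fb is hiring a new vice president of global policy")
# Create a Span for "fb" (tokens 0 to 1) with the label ORG and overwrite doc.ents
fb_ent = Span(doc, 0, 1, label="ORG")
doc.ents = [fb_ent]
print([(ent.text, ent.label_) for ent in doc.ents])  # [('fb', 'ORG')]
```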
@@ -765,15 +765,15 @@ import Tokenization101 from 'usage/101/_tokenization.md'

 <Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

-spaCy introduces a novel tokenization algorithm, that gives a better balance
-between performance, ease of definition, and ease of alignment into the original
+spaCy introduces a novel tokenization algorithm that gives a better balance
+between performance, ease of definition and ease of alignment into the original
 string.

 After consuming a prefix or suffix, we consult the special cases again. We want
 the special cases to handle things like "don't" in English, and we want the same
 rule to work for "(don't)!". We do this by splitting off the open bracket, then
-the exclamation, then the close bracket, and finally matching the special case.
-Here's an implementation of the algorithm in Python, optimized for readability
+the exclamation, then the closed bracket, and finally matching the special case.
+Here's an implementation of the algorithm in Python optimized for readability
 rather than performance:

 ```python

@@ -847,7 +847,7 @@ The algorithm can be summarized as follows:
    #2.
 6. If we can't consume a prefix or a suffix, look for a URL match.
 7. If there's no URL match, then look for a special case.
-8. Look for "infixes" — stuff like hyphens etc. and split the substring into
+8. Look for "infixes" – stuff like hyphens etc. and split the substring into
    tokens on all infixes.
 9. Once we can't consume any more of the string, handle it as a single token.

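The diff cuts off before the implementation the page refers to. As a rough orientation, here is a much-simplified sketch of the loop described above (the rule sets are toy stand-ins, not spaCy's real prefix, suffix and infix patterns):

```python
import re

SPECIAL_CASES = {"don't": ["do", "n't"]}    # toy tokenizer exception
PREFIX_RE = re.compile(r"""^[\("']""")      # toy prefix rules
SUFFIX_RE = re.compile(r"""[\)"'!.,]$""")   # toy suffix rules
INFIX_RE = re.compile(r"[-~]")              # toy infix rules

def tokenize(text):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases win, and are re-checked after every split
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                substring = ""
            # Peel off one prefix character, then loop again
            elif PREFIX_RE.search(substring):
                tokens.append(substring[0])
                substring = substring[1:]
            # Peel off one suffix character and remember it for later
            elif SUFFIX_RE.search(substring):
                suffixes.append(substring[-1])
                substring = substring[:-1]
            # Split what's left on infixes such as hyphens
            elif INFIX_RE.search(substring):
                tokens.extend(t for t in INFIX_RE.split(substring) if t)
                substring = ""
            # Nothing left to split off, keep the rest as a single token
            else:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens

print(tokenize('"(don\'t)!"'))  # ['"', '(', 'do', "n't", ')', '!', '"']
```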
@@ -864,10 +864,10 @@ intact (abbreviations like "U.S.").

 <Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

 Tokenization rules that are specific to one language, but can be **generalized
-across that language** should ideally live in the language data in
+across that language**, should ideally live in the language data in
 [`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
 Anything that's specific to a domain or text type – like financial trading
-abbreviations, or Bavarian youth slang – should be added as a special case rule
+abbreviations or Bavarian youth slang – should be added as a special case rule
 to your tokenizer instance. If you're dealing with a lot of customizations, it
 might make sense to create an entirely custom subclass.

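For reference, adding such a domain-specific special case to an existing tokenizer instance looks roughly like this (the particular split chosen here is illustrative):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Treat "gimme" as two tokens; the subtoken texts must join up to the original string
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```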
@@ -1110,7 +1110,7 @@ tokenized `Doc`.
 ![The processing pipeline](../images/pipeline.svg)

 To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
-custom function that takes a text, and returns a [`Doc`](/api/doc).
+custom function that takes a text and returns a [`Doc`](/api/doc).

 > #### Creating a Doc
 >

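A minimal sketch of such a replacement, a whitespace-only tokenizer that builds the `Doc` directly from a list of words:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on single spaces and construct the Doc from the resulting words
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])
```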
@@ -1229,7 +1229,7 @@ tokenizer** it will be using at runtime. See the docs on

 #### Training with custom tokenization {#custom-tokenizer-training new="3"}

-spaCy's [training config](/usage/training#config) describe the settings,
+spaCy's [training config](/usage/training#config) describes the settings,
 hyperparameters, pipeline and tokenizer used for constructing and training the
 pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
 takes the `nlp` object and returns a tokenizer. Here, we're registering a

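The registration side of that setup might look like the following sketch, reusing the `WhitespaceTokenizer` class from above (the registry name is illustrative and would be referenced from the `[nlp.tokenizer]` block):

```python
import spacy

@spacy.registry.tokenizers("whitespace_tokenizer.v1")  # illustrative name
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        # Receives the nlp object and returns the tokenizer used at runtime
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer
```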
@@ -1465,7 +1465,7 @@ filtered_spans = filter_spans(spans)

 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
-"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You
+"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You
 can write rule-based logic that can find only the correct "its" to split, but by
 that time, the `Doc` will already be tokenized.

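A minimal sketch of such a split. Note that the new token texts have to join up to the original token text, so this example splits "its" into "it" and "s" and attaches each subtoken to itself:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Its official: its over")
with doc.retokenize() as retokenizer:
    # Split token 3 ("its") into two subtokens, each attached to itself
    retokenizer.split(doc[3], ["it", "s"], heads=[(doc[3], 0), (doc[3], 1)])
print([t.text for t in doc])  # ['Its', 'official', ':', 'it', 's', 'over']
```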
@@ -1513,7 +1513,7 @@ the token indices after splitting.
 | `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". |

 If you don't care about the heads (for example, if you're only running the
-tokenizer and not the parser), you can each subtoken to itself:
+tokenizer and not the parser), you can attach each subtoken to itself:

 ```python
 ### {highlight="3"}

@@ -1879,7 +1879,7 @@ assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
 [`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
 table to a given number of unique entries, and returns a dictionary containing
 the removed words, mapped to `(string, score)` tuples, where `string` is the
-entry the removed word was mapped to, and `score` the similarity score between
+entry the removed word was mapped to and `score` the similarity score between
 the two words.

 ```python

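In practice the call looks roughly like this (the vector count is arbitrary and the exact scores will vary by pipeline):

```python
import spacy

nlp = spacy.load("en_core_web_lg")        # assumes a pipeline with word vectors
removed = nlp.vocab.prune_vectors(10000)  # keep 10,000 unique vector entries
# Each removed word maps to (entry_it_was_mapped_to, similarity_score)
print(removed.get("Shore"))               # e.g. ('coast', 0.73)
```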
@@ -128,7 +128,7 @@ should be created. spaCy will then do the following:
 2. Iterate over the **pipeline names** and look up each component name in the
    `[components]` block. The `factory` tells spaCy which
    [component factory](#custom-components-factories) to use for adding the
-   component with with [`add_pipe`](/api/language#add_pipe). The settings are
+   component with [`add_pipe`](/api/language#add_pipe). The settings are
    passed into the factory.
 3. Make the **model data** available to the `Language` class by calling
    [`from_disk`](/api/language#from_disk) with the path to the data directory.

@@ -325,7 +325,7 @@ to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
-[custom components](#custom-components).
+[custom components](#custom-components)).

 ```python
 nlp.remove_pipe("parser")

@@ -384,7 +384,7 @@ vectors available – otherwise, it won't be able to make the same predictions.
 >
 > Instead of providing a `factory`, component blocks in the training
 > [config](/usage/training#config) can also define a `source`. The string needs
-> to be a loadable spaCy pipeline package or path. The
+> to be a loadable spaCy pipeline package or path.
 >
 > ```ini
 > [components.ner]

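For reference, the Python equivalent of sourcing a component (a sketch that assumes `en_core_web_sm` is installed):

```python
import spacy

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
# Copy the trained "ner" component from the source pipeline into this one
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)  # ['ner']
```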
@@ -417,7 +417,7 @@ print(nlp.pipe_names)
 ### Analyzing pipeline components {#analysis new="3"}

 The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
-components in the current pipeline and outputs information about them, like the
+components in the current pipeline and outputs information about them like the
 attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
 they retokenize the `Doc` and which scores they produce during training. It will
 also show warnings if components require values that aren't set by previous

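A quick sketch of what that analysis looks like in code, using a deliberately problematic pipeline (the entity linker needs entities and sentence boundaries, but nothing before it sets them):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")              # needs entities, nothing sets them
analysis = nlp.analyze_pipes(pretty=True)  # prints a table plus warnings
```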
@@ -511,7 +511,7 @@ doesn't, the pipeline analysis won't catch that.
 ## Creating custom pipeline components {#custom-components}

 A pipeline component is a function that receives a `Doc` object, modifies it and
-returns it – – for example, by using the current weights to make a prediction
+returns it – for example, by using the current weights to make a prediction
 and set some annotation on the document. By adding a component to the pipeline,
 you'll get access to the `Doc` at any point **during processing** – instead of
 only being able to modify it afterwards.

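A minimal sketch of such a component (the component name is made up):

```python
import spacy
from spacy.language import Language

@Language.component("info_component")  # made-up name
def info_component(doc):
    # Receive the Doc, inspect or modify it, and return it
    print(f"This doc has {len(doc)} tokens.")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("info_component", first=True)
doc = nlp("This is a sentence.")
```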
@@ -702,7 +702,7 @@ nlp.add_pipe("my_component", config={"some_setting": False})
 <Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

 The [`@Language.component`](/api/language#component) decorator is essentially a
-**shortcut** for stateless pipeline component that don't need any settings. This
+**shortcut** for stateless pipeline components that don't need any settings. This
 means you don't have to always write a function that returns your function if
 there's no state to be passed through – spaCy can just take care of this for
 you. The following two code examples are equivalent:

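For context, the stateful form registered with `@Language.factory` looks roughly like this sketch (component and setting names are illustrative):

```python
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": True})
def create_my_component(nlp, name, some_setting: bool):
    # The factory receives the nlp object, the component name and the settings,
    # and returns the component callable itself
    def my_component(doc):
        return doc
    return my_component
```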
@@ -888,7 +888,7 @@ components in pipelines that you [train](/usage/training). To make sure spaCy
 knows where to find your custom `@misc` function, you can pass in a Python file
 via the argument `--code`. If someone else is using your component, all they
 have to do to customize the data is to register their own function and swap out
-the name. Registered functions can also take **arguments** by the way that can
+the name. Registered functions can also take **arguments**, by the way, that can
 be defined in the config as well – you can read more about this in the docs on
 [training with custom code](/usage/training#custom-code).

@@ -963,7 +963,7 @@ doc = nlp("This is a text...")

 ### Language-specific factories {#factories-language new="3"}

-There are many use case where you might want your pipeline components to be
+There are many use cases where you might want your pipeline components to be
 language-specific. Sometimes this requires entirely different implementation per
 language, sometimes the only difference is in the settings or data. spaCy allows
 you to register factories of the **same name** on both the `Language` base

@@ -1028,8 +1028,8 @@ plug fully custom machine learning components into your pipeline. You'll need
 the following:

 1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
-   can be a model using implemented in
-   [Thinc](/usage/layers-architectures#thinc), or a
+   can be a model implemented in
+   [Thinc](/usage/layers-architectures#thinc) or a
    [wrapped model](/usage/layers-architectures#frameworks) implemented in
    PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
    list of [`Doc`](/api/doc) objects as input and can have any type of output.

@@ -1354,7 +1354,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 >
 > The hooks live on the `Doc` object because the `Span` and `Token` objects are
 > created lazily, and don't own any data. They just proxy to their parent `Doc`.
-> This turns out to be convenient here — we only have to worry about installing
+> This turns out to be convenient here – we only have to worry about installing
 > hooks in one place.

 | Name | Customizes |

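A minimal sketch of installing such hooks from a pipeline component (the similarity logic is a placeholder):

```python
from spacy.language import Language

@Language.component("custom_similarity_hooks")  # made-up name
def add_similarity_hooks(doc):
    def similarity(obj1, obj2):
        return 0.0  # placeholder logic

    # Install once on the Doc; Span and Token lookups proxy to these hooks
    doc.user_hooks["similarity"] = similarity
    doc.user_span_hooks["similarity"] = similarity
    doc.user_token_hooks["similarity"] = similarity
    return doc
```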
@@ -73,7 +73,7 @@ python -m spacy project clone some_example_project

 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from, if you don't want
+`--repo` option lets you define a custom repo to clone from if you don't want
 to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
 can also use any private repo you have access to with Git.

@@ -109,7 +109,7 @@ $ python -m spacy project assets

 Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
 even cloud storage such as GCS and S3. You can also fetch assets using git, by
 replacing the `url` string with a `git` block. spaCy will use Git's "sparse
-checkout" feature, to avoid download the whole repository.
+checkout" feature to avoid downloading the whole repository.

 ### 3. Run a command {#run}

@@ -201,7 +201,7 @@ $ python -m spacy project push
 ```

 The `remotes` section in your `project.yml` lets you assign names to the
-different storages. To download state from a remote storage, you can use the
+different storages. To download a state from a remote storage, you can use the
 [`spacy project pull`](/api/cli#project-pull) command. For more details, see the
 docs on [remote storage](#remote).

@@ -315,7 +315,7 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert about missing files and mismatched checksums, so you can ensure that
+will alert you about missing files and mismatched checksums, so you can ensure that
 others are running your project with the same data.

 ### Dependencies and outputs {#deps-outputs}

@@ -363,8 +363,7 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error – it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc)
-integration. If you're planning on integrating your spaCy project with DVC, you
+data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
 can also use `outputs_no_cache` instead of `outputs` to define outputs that
 won't be cached or tracked.

@@ -508,7 +507,7 @@ commands:

 When your custom project is ready and you want to share it with others, you can
 use the [`spacy project document`](/api/cli#project-document) command to
-**auto-generate** a pretty, Markdown-formatted `README` file based on your
+**auto-generate** a pretty, markdown-formatted `README` file based on your
 project's `project.yml`. It will list all commands, workflows and assets defined
 in the project and include details on how to run the project, as well as links
 to the relevant spaCy documentation to make it easy for others to get started

@@ -55,7 +55,7 @@ abstract representations of the tokens you're looking for, using lexical
 attributes, linguistic features predicted by the model, operators, set
 membership and rich comparison. For example, you can find a noun, followed by a
 verb with the lemma "love" or "like", followed by an optional determiner and
-another token that's at least ten characters long.
+another token that's at least 10 characters long.

 </Accordion>

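Written out as a token-based `Matcher` pattern, that example might look like this sketch (one reasonable encoding; whether a given sentence matches depends on the model's tags):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # needs part-of-speech tags and lemmas
matcher = Matcher(nlp.vocab)
pattern = [
    {"POS": "NOUN"},                                     # a noun
    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},  # verb with lemma love/like
    {"POS": "DET", "OP": "?"},                           # optional determiner
    {"LENGTH": {">=": 10}},                              # token of at least 10 characters
]
matcher.add("LOVE_PATTERN", [pattern])
doc = nlp("People love wonderfully long compound nouns")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```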
@@ -491,7 +491,7 @@ you prefer.
 | `matcher` | The matcher instance. ~~Matcher~~ |
 | `doc` | The document the matcher was used on. ~~Doc~~ |
 | `i` | Index of the current match (`matches[i`]). ~~int~~ |
-| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ |
+| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ |

 ### Creating spans from matches {#matcher-spans}

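For reference, a callback with that signature might look like this sketch:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple for the current match
    match_id, start, end = matches[i]
    print("Matched:", doc[start:end].text)

matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=on_match)
matcher(nlp("Hello world says hello World"))
```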
@@ -628,7 +628,7 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start : end`.sent],
+lets you determine the sentence containing the match, `doc[start:end].sent`,
 and calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.

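A sketch of such a callback, assuming a pipeline with a parser or sentencizer so that `.sent` is available (the entity label used for rendering is made up):

```python
matched_sents = []  # collect data of matched sentences, to render later

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # the matched span
    sent = span.sent       # the sentence containing the match
    # Character offsets are relative to the sentence for "manual" rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",  # made-up label, used only for display
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

# Later: displacy.render(matched_sents, style="ent", manual=True)
```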
@@ -1451,7 +1451,7 @@ When using a trained
 extract information from your texts, you may find that the predicted span only
 includes parts of the entity you're looking for. Sometimes, this happens if
 statistical model predicts entities incorrectly. Other times, it happens if the
-way the entity type way defined in the original training corpus doesn't match
+way the entity type was defined in the original training corpus doesn't match
 what you need for your application.

 > #### Where corpora come from

@@ -1642,7 +1642,7 @@ affiliation is current, we can check the head's part-of-speech tag.
 ```python
 person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 for ent in person_entities:
-    # Because the entity is a spans, we need to use its root token. The head
+    # Because the entity is a span, we need to use its root token. The head
     # is the syntactic governor of the person, e.g. the verb
     head = ent.root.head
     if head.lemma_ == "work":

@@ -448,7 +448,7 @@ entry_points={
 }
 ```

-The factory can also implement other pipeline component like `to_disk` and
+The factory can also implement other pipeline components like `to_disk` and
 `from_disk` for serialization, or even `update` to make the component trainable.
 If a component exposes a `from_disk` method and is included in a pipeline, spaCy
 will call it on load. This lets you ship custom data with your pipeline package.

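The shape of such a component might look like this sketch (class, factory and file names are illustrative):

```python
from pathlib import Path

import srsly
from spacy.language import Language

@Language.factory("acronyms_component")  # illustrative name
def create_acronyms_component(nlp, name):
    return AcronymsComponent()

class AcronymsComponent:
    def __init__(self):
        self.data = {}

    def __call__(self, doc):
        # Use self.data to annotate the doc here
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Called when the pipeline is saved to disk
        path = Path(path)
        if not path.exists():
            path.mkdir()
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        # Called automatically when the pipeline is loaded
        self.data = srsly.read_json(Path(path) / "data.json")
        return self
```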
@@ -666,7 +666,7 @@ care of putting all this together and returning a `Language` object with the
 loaded pipeline and data. If your pipeline requires
 [custom components](/usage/processing-pipelines#custom-components) or a custom
 language class, you can also **ship the code with your package** and include it
-in the `__init__.py` – for example, to register component before the `nlp`
+in the `__init__.py` – for example, to register a component before the `nlp`
 object is created.

 <Infobox variant="warning" title="Important note on making manual edits">

@@ -489,7 +489,7 @@ or TensorFlow, make **custom modifications** to the `nlp` object, create custom
 optimizers or schedules, or **stream in data** and preprocesses it on the fly
 while training.

-Each custom function can have any numbers of arguments that are passed in via
+Each custom function can have any number of arguments that are passed in via
 the [config](#config), just the built-in functions. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a

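A sketch of such a registered function with default argument values (the registry name and arguments are made up):

```python
import spacy
from spacy.language import Language

@spacy.registry.callbacks("print_pipeline_callback.v1")  # made-up name
def create_callback(verbose: bool = False):
    # Because the argument has a default, `init fill-config` can auto-fill
    # the corresponding block in the training config
    def callback(nlp: Language) -> Language:
        if verbose:
            print("Pipeline:", nlp.pipe_names)
        return nlp
    return callback
```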