mirror of https://github.com/explosion/spaCy.git

New batch of proofs

Just tiny fixes to the docs as a proofreader

parent 1c65b3b2c0
commit 6af585dba5
@@ -1,7 +1,7 @@
 A named entity is a "real-world object" that's assigned a name – for example, a
 person, a country, a product or a book title. spaCy can **recognize various
 types of named entities in a document, by asking the model for a
-**prediction\*\*. Because models are statistical and strongly depend on the
+prediction**. Because models are statistical and strongly depend on the
 examples they were trained on, this doesn't always work _perfectly_ and might
 need some tuning later, depending on your use case.

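The behaviour described in this hunk is easiest to see in code. A minimal sketch, assuming the `en_core_web_sm` package is installed and using an illustrative sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each predicted entity is a Span carrying the text and the label the model chose.
for ent in doc.ents:
    print(ent.text, ent.label_)
```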
@@ -45,6 +45,6 @@ marks.

 While punctuation rules are usually pretty general, tokenizer exceptions
 strongly depend on the specifics of the individual language. This is why each
-[available language](/usage/models#languages) has its own subclass like
+[available language](/usage/models#languages) has its own subclass, like
 `English` or `German`, that loads in lists of hard-coded data and exception
 rules.
@@ -641,7 +641,7 @@ print("After", doc.ents) # [London]

 #### Setting entity annotations in Cython {#setting-cython}

-Finally, you can always write to the underlying struct, if you compile a
+Finally, you can always write to the underlying struct if you compile a
 [Cython](http://cython.org/) function. This is easy to do, and allows you to
 write efficient native code.

@@ -765,15 +765,15 @@ import Tokenization101 from 'usage/101/\_tokenization.md'

 <Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

-spaCy introduces a novel tokenization algorithm, that gives a better balance
-between performance, ease of definition, and ease of alignment into the original
+spaCy introduces a novel tokenization algorithm that gives a better balance
+between performance, ease of definition and ease of alignment into the original
 string.

 After consuming a prefix or suffix, we consult the special cases again. We want
 the special cases to handle things like "don't" in English, and we want the same
 rule to work for "(don't)!". We do this by splitting off the open bracket, then
-the exclamation, then the close bracket, and finally matching the special case.
-Here's an implementation of the algorithm in Python, optimized for readability
+the exclamation, then the closed bracket, and finally matching the special case.
+Here's an implementation of the algorithm in Python optimized for readability
 rather than performance:

 ```python
@@ -847,7 +847,7 @@ The algorithm can be summarized as follows:
 #2.
 6. If we can't consume a prefix or a suffix, look for a URL match.
 7. If there's no URL match, then look for a special case.
-8. Look for "infixes" — stuff like hyphens etc. and split the substring into
+8. Look for "infixes" – stuff like hyphens etc. and split the substring into
 tokens on all infixes.
 9. Once we can't consume any more of the string, handle it as a single token.

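The summarized algorithm can be replayed on a real tokenizer with `nlp.tokenizer.explain`, which reports the rule responsible for each token. A minimal sketch (the input string is illustrative):

```python
import spacy

nlp = spacy.blank("en")
# explain() re-runs the tokenization algorithm and labels every produced token
# with the rule that created it (prefix, suffix, infix, URL, special case, token).
for rule, token_text in nlp.tokenizer.explain("(don't)!"):
    print(rule, repr(token_text))
```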
@@ -864,10 +864,10 @@ intact (abbreviations like "U.S.").
 <Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

 Tokenization rules that are specific to one language, but can be **generalized
-across that language** should ideally live in the language data in
+across that language**, should ideally live in the language data in
 [`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
 Anything that's specific to a domain or text type – like financial trading
-abbreviations, or Bavarian youth slang – should be added as a special case rule
+abbreviations or Bavarian youth slang – should be added as a special case rule
 to your tokenizer instance. If you're dealing with a lot of customizations, it
 might make sense to create an entirely custom subclass.

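A special case rule of the kind described here can be added to an existing tokenizer instance at runtime. A minimal sketch (the token texts are illustrative):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Always split "gimme" into two tokens, regardless of the general rules.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```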
@@ -1110,7 +1110,7 @@ tokenized `Doc`.
 

 To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
-custom function that takes a text, and returns a [`Doc`](/api/doc).
+custom function that takes a text and returns a [`Doc`](/api/doc).

 > #### Creating a Doc
 >
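A minimal sketch of such a replacement – a whitespace-only tokenizer is a common illustration (the class name is ours, not part of the API):

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # A custom tokenizer only has to turn the text into a Doc.
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])
```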
@@ -1229,7 +1229,7 @@ tokenizer** it will be using at runtime. See the docs on

 #### Training with custom tokenization {#custom-tokenizer-training new="3"}

-spaCy's [training config](/usage/training#config) describe the settings,
+spaCy's [training config](/usage/training#config) describes the settings,
 hyperparameters, pipeline and tokenizer used for constructing and training the
 pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
 takes the `nlp` object and returns a tokenizer. Here, we're registering a
@@ -1465,7 +1465,7 @@ filtered_spans = filter_spans(spans)
 The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
 one token into two or more tokens. This can be useful for cases where
 tokenization rules alone aren't sufficient. For example, you might want to split
-"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You
+"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You
 can write rule-based logic that can find only the correct "its" to split, but by
 that time, the `Doc` will already be tokenized.

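A minimal sketch of such a split, using the "New York" style example the surrounding docs work with (the sentence is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")

# Split doc[3] ("NewYork") into two subtokens: "New" attaches to the second
# subtoken ("York"), and "York" attaches to "in" (doc[2]) in the original Doc.
with doc.retokenize() as retokenizer:
    retokenizer.split(doc[3], ["New", "York"], heads=[(doc[3], 1), doc[2]])

print([token.text for token in doc])  # ['I', 'live', 'in', 'New', 'York']
```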
@@ -1513,7 +1513,7 @@ the token indices after splitting.
 | `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". |

 If you don't care about the heads (for example, if you're only running the
-tokenizer and not the parser), you can each subtoken to itself:
+tokenizer and not the parser), you can attach each subtoken to itself:

 ```python
 ### {highlight="3"}
@@ -1879,7 +1879,7 @@ assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries
 [`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
 table to a given number of unique entries, and returns a dictionary containing
 the removed words, mapped to `(string, score)` tuples, where `string` is the
-entry the removed word was mapped to, and `score` the similarity score between
+entry the removed word was mapped to and `score` the similarity score between
 the two words.

 ```python
@@ -128,7 +128,7 @@ should be created. spaCy will then do the following:
 2. Iterate over the **pipeline names** and look up each component name in the
 `[components]` block. The `factory` tells spaCy which
 [component factory](#custom-components-factories) to use for adding the
-component with with [`add_pipe`](/api/language#add_pipe). The settings are
+component with [`add_pipe`](/api/language#add_pipe). The settings are
 passed into the factory.
 3. Make the **model data** available to the `Language` class by calling
 [`from_disk`](/api/language#from_disk) with the path to the data directory.
@@ -325,7 +325,7 @@ to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
-[custom components](#custom-components).
+[custom components](#custom-components)).

 ```python
 nlp.remove_pipe("parser")
@@ -384,7 +384,7 @@ vectors available – otherwise, it won't be able to make the same predictions.
 >
 > Instead of providing a `factory`, component blocks in the training
 > [config](/usage/training#config) can also define a `source`. The string needs
-> to be a loadable spaCy pipeline package or path. The
+> to be a loadable spaCy pipeline package or path.
 >
 > ```ini
 > [components.ner]
@@ -417,7 +417,7 @@ print(nlp.pipe_names)
 ### Analyzing pipeline components {#analysis new="3"}

 The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
-components in the current pipeline and outputs information about them, like the
+components in the current pipeline and outputs information about them like the
 attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
 they retokenize the `Doc` and which scores they produce during training. It will
 also show warnings if components require values that aren't set by previous
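A minimal sketch of calling the method; the component choice mirrors the warning scenario the docs describe, since `entity_linker` needs entities that nothing earlier in this pipeline sets:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# pretty=True prints a table of assigned and required attributes and warns that
# no previous component sets doc.ents, which entity_linker requires.
analysis = nlp.analyze_pipes(pretty=True)
```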
@@ -511,7 +511,7 @@ doesn't, the pipeline analysis won't catch that.
 ## Creating custom pipeline components {#custom-components}

 A pipeline component is a function that receives a `Doc` object, modifies it and
-returns it – – for example, by using the current weights to make a prediction
+returns it – for example, by using the current weights to make a prediction
 and set some annotation on the document. By adding a component to the pipeline,
 you'll get access to the `Doc` at any point **during processing** – instead of
 only being able to modify it afterwards.
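A minimal sketch of such a component function (the component name is illustrative):

```python
import spacy
from spacy.language import Language

@Language.component("length_logger")
def length_logger(doc):
    # Receive the Doc, inspect or modify it, and always return it.
    print(f"Doc has {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("length_logger", first=True)
doc = nlp("This is a sentence.")
```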
@@ -702,7 +702,7 @@ nlp.add_pipe("my_component", config={"some_setting": False})
 <Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

 The [`@Language.component`](/api/language#component) decorator is essentially a
-**shortcut** for stateless pipeline component that don't need any settings. This
+**shortcut** for stateless pipeline components that don't need any settings. This
 means you don't have to always write a function that returns your function if
 there's no state to be passed through – spaCy can just take care of this for
 you. The following two code examples are equivalent:
@@ -888,7 +888,7 @@ components in pipelines that you [train](/usage/training). To make sure spaCy
 knows where to find your custom `@misc` function, you can pass in a Python file
 via the argument `--code`. If someone else is using your component, all they
 have to do to customize the data is to register their own function and swap out
-the name. Registered functions can also take **arguments** by the way that can
+the name. Registered functions can also take **arguments**, by the way, that can
 be defined in the config as well – you can read more about this in the docs on
 [training with custom code](/usage/training#custom-code).

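A minimal sketch of what registering such a function could look like, assuming a file passed in via `--code` (the registry name, file name and returned data are all illustrative):

```python
# functions.py – made available to spaCy via: --code functions.py
import spacy

@spacy.registry.misc("animal_patterns.v1")
def create_animal_patterns():
    # Whatever this returns can be referenced from the config via
    # {"@misc": "animal_patterns.v1"}; swapping the data only means
    # registering a different function under the name used in the config.
    return [{"label": "ANIMAL", "pattern": "golden retriever"}]
```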
@@ -963,7 +963,7 @@ doc = nlp("This is a text...")

 ### Language-specific factories {#factories-language new="3"}

-There are many use case where you might want your pipeline components to be
+There are many use cases where you might want your pipeline components to be
 language-specific. Sometimes this requires entirely different implementation per
 language, sometimes the only difference is in the settings or data. spaCy allows
 you to register factories of the **same name** on both the `Language` base
@@ -1028,8 +1028,8 @@ plug fully custom machine learning components into your pipeline. You'll need
 the following:

 1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
-can be a model using implemented in
-[Thinc](/usage/layers-architectures#thinc), or a
+can be a model implemented in
+[Thinc](/usage/layers-architectures#thinc) or a
 [wrapped model](/usage/layers-architectures#frameworks) implemented in
 PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
 list of [`Doc`](/api/doc) objects as input and can have any type of output.
@@ -1354,7 +1354,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 >
 > The hooks live on the `Doc` object because the `Span` and `Token` objects are
 > created lazily, and don't own any data. They just proxy to their parent `Doc`.
-> This turns out to be convenient here — we only have to worry about installing
+> This turns out to be convenient here – we only have to worry about installing
 > hooks in one place.

 | Name | Customizes |
@@ -73,7 +73,7 @@ python -m spacy project clone some_example_project

 By default, the project will be cloned into the current working directory. You
 can specify an optional second argument to define the output directory. The
-`--repo` option lets you define a custom repo to clone from, if you don't want
+`--repo` option lets you define a custom repo to clone from if you don't want
 to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You
 can also use any private repo you have access to with Git.

@@ -109,7 +109,7 @@ $ python -m spacy project assets
 Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and
 even cloud storage such as GCS and S3. You can also fetch assets using git, by
 replacing the `url` string with a `git` block. spaCy will use Git's "sparse
-checkout" feature, to avoid download the whole repository.
+checkout" feature to avoid downloading the whole repository.

 ### 3. Run a command {#run}

@@ -201,7 +201,7 @@ $ python -m spacy project push
 ```

 The `remotes` section in your `project.yml` lets you assign names to the
-different storages. To download state from a remote storage, you can use the
+different storages. To download a state from a remote storage, you can use the
 [`spacy project pull`](/api/cli#project-pull) command. For more details, see the
 docs on [remote storage](#remote).

@@ -315,7 +315,7 @@ company-internal and not available over the internet. In that case, you can
 specify the destination paths and a checksum, and leave out the URL. When your
 teammates clone and run your project, they can place the files in the respective
 directory themselves. The [`project assets`](/api/cli#project-assets) command
-will alert about missing files and mismatched checksums, so you can ensure that
+will alert you about missing files and mismatched checksums, so you can ensure that
 others are running your project with the same data.

 ### Dependencies and outputs {#deps-outputs}
@@ -363,8 +363,7 @@ graphs based on the dependencies and outputs, and won't re-run previous steps
 automatically. For instance, if you only run the command `train` that depends on
 data created by `preprocess` and those files are missing, spaCy will show an
 error – it won't just re-run `preprocess`. If you're looking for more advanced
-data management, check out the [Data Version Control (DVC) integration](#dvc)
-integration. If you're planning on integrating your spaCy project with DVC, you
+data management, check out the [Data Version Control (DVC) integration](#dvc). If you're planning on integrating your spaCy project with DVC, you
 can also use `outputs_no_cache` instead of `outputs` to define outputs that
 won't be cached or tracked.

@@ -508,7 +507,7 @@ commands:

 When your custom project is ready and you want to share it with others, you can
 use the [`spacy project document`](/api/cli#project-document) command to
-**auto-generate** a pretty, Markdown-formatted `README` file based on your
+**auto-generate** a pretty, markdown-formatted `README` file based on your
 project's `project.yml`. It will list all commands, workflows and assets defined
 in the project and include details on how to run the project, as well as links
 to the relevant spaCy documentation to make it easy for others to get started
@@ -55,7 +55,7 @@ abstract representations of the tokens you're looking for, using lexical
 attributes, linguistic features predicted by the model, operators, set
 membership and rich comparison. For example, you can find a noun, followed by a
 verb with the lemma "love" or "like", followed by an optional determiner and
-another token that's at least ten characters long.
+another token that's at least 10 characters long.

 </Accordion>

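The pattern described in that sentence can be written out as token dictionaries for the `Matcher`. A minimal sketch – the pipeline, pattern name and example text are illustrative, and whether it matches depends on the model's POS and lemma predictions:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # needs POS tags and lemmas from a trained pipeline
matcher = Matcher(nlp.vocab)

pattern = [
    {"POS": "NOUN"},                                     # a noun
    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},  # "love" or "like" as a verb
    {"POS": "DET", "OP": "?"},                           # an optional determiner
    {"LENGTH": {">=": 10}},                              # a token of at least 10 characters
]
matcher.add("LOVE_PATTERN", [pattern])

doc = nlp("People love outstanding documentation")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```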
@@ -491,7 +491,7 @@ you prefer.
 | `matcher` | The matcher instance. ~~Matcher~~ |
 | `doc` | The document the matcher was used on. ~~Doc~~ |
 | `i` | Index of the current match (`matches[i`]). ~~int~~ |
-| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~ List[Tuple[int, int int]]~~ |
+| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ |

 ### Creating spans from matches {#matcher-spans}

@@ -628,7 +628,7 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start : end`.sent],
+lets you determine the sentence containing the match, `doc[start:end].sent`,
 and calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.
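A minimal sketch of such a callback, following the pattern the passage describes – the variable and label names are ours, and `span.sent` assumes sentence boundaries are set, for example by the parser:

```python
from spacy import displacy

matched_sents = []  # dictionaries collected for displaCy's "manual" mode

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]
    sent = span.sent  # the sentence containing the match
    # Offsets of the match relative to its sentence, for manual rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

# Later: displacy.serve(matched_sents, style="ent", manual=True)
```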
@@ -1451,7 +1451,7 @@ When using a trained
 extract information from your texts, you may find that the predicted span only
 includes parts of the entity you're looking for. Sometimes, this happens if
 statistical model predicts entities incorrectly. Other times, it happens if the
-way the entity type way defined in the original training corpus doesn't match
+way the entity type was defined in the original training corpus doesn't match
 what you need for your application.

 > #### Where corpora come from
@@ -1642,7 +1642,7 @@ affiliation is current, we can check the head's part-of-speech tag.
 ```python
 person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 for ent in person_entities:
-    # Because the entity is a spans, we need to use its root token. The head
+    # Because the entity is a span, we need to use its root token. The head
     # is the syntactic governor of the person, e.g. the verb
     head = ent.root.head
     if head.lemma_ == "work":
@@ -448,7 +448,7 @@ entry_points={
 }
 ```

-The factory can also implement other pipeline component like `to_disk` and
+The factory can also implement other pipeline components like `to_disk` and
 `from_disk` for serialization, or even `update` to make the component trainable.
 If a component exposes a `from_disk` method and is included in a pipeline, spaCy
 will call it on load. This lets you ship custom data with your pipeline package.
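A minimal sketch of a factory-created component exposing those methods, assuming its custom data is stored as JSON via `srsly` (the factory name and file name are illustrative):

```python
import srsly
from spacy.language import Language
from spacy.util import ensure_path

@Language.factory("my_component")
class CustomComponent:
    def __init__(self, nlp, name):
        self.data = {}

    def __call__(self, doc):
        return doc

    def to_disk(self, path, exclude=tuple()):
        # Called when the pipeline is saved; ships custom data with the package.
        path = ensure_path(path)
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        # Called automatically when the pipeline is loaded from disk.
        self.data = srsly.read_json(ensure_path(path) / "data.json")
        return self
```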
@@ -666,7 +666,7 @@ care of putting all this together and returning a `Language` object with the
 loaded pipeline and data. If your pipeline requires
 [custom components](/usage/processing-pipelines#custom-components) or a custom
 language class, you can also **ship the code with your package** and include it
-in the `__init__.py` – for example, to register component before the `nlp`
+in the `__init__.py` – for example, to register a component before the `nlp`
 object is created.

 <Infobox variant="warning" title="Important note on making manual edits">
@@ -489,7 +489,7 @@ or TensorFlow, make **custom modifications** to the `nlp` object, create custom
 optimizers or schedules, or **stream in data** and preprocesses it on the fly
 while training.

-Each custom function can have any numbers of arguments that are passed in via
+Each custom function can have any number of arguments that are passed in via
 the [config](#config), just the built-in functions. If your function defines
 **default argument values**, spaCy is able to auto-fill your config when you run
 [`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
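A minimal sketch of a registered function with default argument values that `init fill-config` could pick up (the registry name and arguments are illustrative):

```python
# functions.py – passed to spaCy via --code functions.py
from typing import List
import spacy

@spacy.registry.misc("my_terms.v1")
def create_terms(terms: List[str] = ["spaCy", "Thinc"]):
    # Because the argument has a default, spaCy can auto-fill the corresponding
    # config block when running `spacy init fill-config`.
    return list(terms)
```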