Merge branch 'master' into spacy.io

Ines Montani 2019-11-18 12:42:04 +01:00
commit 534c4aa55b


@@ -435,22 +435,22 @@ import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG")  # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
```

Keep in mind that you need to create a `Span` with the start and end index of
the **token**, not the start and end index of the entity in the document. In
this case, "fb" is token `(0, 1)` but at the document level, the entity will
have the start and end indices `(0, 2)`.
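
If it's easier to think in character offsets, [`Doc.char_span`](/api/doc#char_span)
builds the same entity span from character indices and returns `None` if the
offsets don't fall on token boundaries. A small sketch, reusing the sentence
from the example above:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")

# token-based: token 0 up to (but not including) token 1
fb_ent = Span(doc, 0, 1, label="ORG")
print(fb_ent.start, fb_ent.end)            # token indices: 0 1
print(fb_ent.start_char, fb_ent.end_char)  # character offsets: 0 2

# character-based: returns None if the offsets don't map onto token boundaries
fb_ent_from_chars = doc.char_span(0, 2, label="ORG")
print(fb_ent_from_chars.start, fb_ent_from_chars.end)  # also 0 1
```
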
#### Setting entity annotations from array {#setting-from-array}
@@ -782,8 +782,8 @@ The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings.
2. Check whether we have an explicitly defined rule for this substring. If we
   do, use it.
3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that special cases always get priority.
4. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
5. If we can't consume a prefix or a suffix, look for a special case.
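
Sketched in plain Python, steps 1–5 above might look roughly like the function
below. This is a simplified illustration, not spaCy's actual implementation:
`special_cases` stands in for the exception table, `prefix_search` and
`suffix_search` for the compiled prefix and suffix regexes, and infix and
`token_match` handling is left out entirely.

```python
import re

def tokenize_sketch(text, special_cases, prefix_search, suffix_search):
    tokens = []
    for substring in text.split():           # 1. whitespace-separated substrings
        suffixes = []
        while substring:
            if substring in special_cases:   # 2. an explicit rule always wins
                tokens.extend(special_cases[substring])
                substring = ""
                continue
            prefix = prefix_search(substring)
            if prefix:                       # 3. consume one prefix, go back to 2.
                tokens.append(substring[:prefix.end()])
                substring = substring[prefix.end():]
                continue
            suffix = suffix_search(substring)
            if suffix:                       # 4. consume one suffix, go back to 2.
                suffixes.insert(0, substring[suffix.start():])
                substring = substring[:suffix.start()]
                continue
            # 5. nothing left to consume and no special case matched above:
            # keep the remainder as a single token
            tokens.append(substring)
            substring = ""
        tokens.extend(suffixes)
    return tokens

special_cases = {"don't": ["do", "n't"]}
prefix_search = re.compile(r"""^[\("']""").search
suffix_search = re.compile(r"""[\)"'!.,?]$""").search
print(tokenize_sketch("(don't panic!)", special_cases, prefix_search, suffix_search))
# ['(', 'do', "n't", 'panic', '!', ')']
```
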
@@ -805,10 +805,10 @@ domain. There are five things you would need to define:
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should never
   be split, overriding the infix rules. Useful for things like URLs or numbers.
   Note that prefixes and suffixes will be split off before `token_match` is
   applied (see the sketch below).
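
Wired together, these pieces might look roughly like the following sketch. The
rules and regular expressions here are deliberately minimal, made-up examples;
the real defaults shipped with spaCy are far more extensive.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

special_cases = {":)": [{"ORTH": ":)"}]}       # 1. exceptions, e.g. emoticons
prefix_re = re.compile(r"""^[\[\("']""")       # 2. open brackets and quotes
suffix_re = re.compile(r"""[\]\)"'.,!?]$""")   # 3. close brackets, quotes, punctuation
infix_re = re.compile(r"""[-~]""")             # 4. hyphen-like separators
url_re = re.compile(r"""^https?://\S+$""")     # 5. strings that should never be split

nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer,
                          token_match=url_re.match)

print([t.text for t in nlp("Visit https://spacy.io for more-info :)")])
```
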
You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its
@@ -858,8 +858,8 @@ only be applied at the **end of a token**, so your expression should end with a
#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults` and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
@@ -893,20 +893,19 @@ If you're using a statistical model, writing to the `nlp.Defaults` or
`English.Defaults` directly won't work, since the regular expressions are read
from the model and will be compiled when you load it. If you modify
`nlp.Defaults`, you'll only see the effect if you call
[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`. If
you want to modify the tokenizer loaded from a statistical model, you should
modify `nlp.tokenizer` directly.

</Infobox>
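
For example, adding one extra suffix rule to the defaults and writing it back
to the loaded pipeline's tokenizer directly might look like this (the trailing
`-+$` pattern is just an illustrative addition):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# take the default suffix rules and add one more pattern (here: trailing hyphens)
suffixes = nlp.Defaults.suffixes + (r"-+$",)
suffix_regex = compile_suffix_regex(suffixes)

# Tokenizer.suffix_search is writable, so overwrite it on the loaded tokenizer
nlp.tokenizer.suffix_search = suffix_regex.search
```
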
The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen between
letters as an infix. If you do not want the tokenizer to split on hyphens
between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):

```python
### {executable="true"}
import spacy
@@ -1074,10 +1073,10 @@ can sometimes tokenize things differently – for example, `"I'm"` →
In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper
returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
of misaligned tokens, the one-to-one mappings of token indices in both
directions and the indices where multiple tokens align to one single token.
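
A rough sketch of how the helper can be called (the two token lists are just an
illustration of an `"obama" + "'" + "s"` vs. `"obama" + "'s"` mismatch):

```python
from spacy.gold import align

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print("Misaligned tokens:", cost)
print("One-to-one mappings a -> b:", a2b)         # one entry per token in other_tokens
print("One-to-one mappings b -> a:", b2a)         # one entry per token in spacy_tokens
print("Many-to-one mappings a -> b:", a2b_multi)  # "'" and "s" both map to "'s"
print("Many-to-one mappings b -> a:", b2a_multi)
```
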
> #### ✏️ Things to try