From e01a1a237f7760eb30445b15479ec15c3a3bb440 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 18 Nov 2019 12:41:31 +0100
Subject: [PATCH 1/2] Auto-format [ci skip]

---
 website/docs/usage/linguistic-features.md | 35 +++++++++++------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 250f02602..0b274b2b6 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -782,8 +782,8 @@ The algorithm can be summarized as follows:
 1. Iterate over whitespace-separated substrings.
 2. Check whether we have an explicitly defined rule for this substring. If we
    do, use it.
-3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to
-   #2, so that special cases always get priority.
+3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
+   so that special cases always get priority.
 4. If we didn't consume a prefix, try to consume a suffix and then go back to
    #2.
 5. If we can't consume a prefix or a suffix, look for a special case.
@@ -805,10 +805,10 @@ domain. There are five things you would need to define:
    commas, periods, close quotes, etc.
 4. A function `infixes_finditer`, to handle non-whitespace separators, such as
    hyphens etc.
-5. An optional boolean function `token_match` matching strings that should
-   never be split, overriding the infix rules. Useful for things like URLs or
-   numbers. Note that prefixes and suffixes will be split off before
-   `token_match` is applied.
+5. An optional boolean function `token_match` matching strings that should never
+   be split, overriding the infix rules. Useful for things like URLs or numbers.
+   Note that prefixes and suffixes will be split off before `token_match` is
+   applied.
 
 You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
 to use `re.compile()` to build a regular expression object, and pass its
@@ -858,8 +858,8 @@ only be applied at the **end of a token**, so your expression should end with a
 #### Modifying existing rule sets {#native-tokenizer-additions}
 
 In many situations, you don't necessarily need entirely custom rules. Sometimes
-you just want to add another character to the prefixes, suffixes or infixes.
-The default prefix, suffix and infix rules are available via the `nlp` object's
+you just want to add another character to the prefixes, suffixes or infixes. The
+default prefix, suffix and infix rules are available via the `nlp` object's
 `Defaults` and the `Tokenizer` attributes such as
 [`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
 overwrite them with compiled regular expression objects using modified default
@@ -893,20 +893,19 @@ If you're using a statistical model, writing to the `nlp.Defaults` or
 `English.Defaults` directly won't work, since the regular expressions are read
 from the model and will be compiled when you load it. If you modify
 `nlp.Defaults`, you'll only see the effect if you call
-[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`.
-If you want to modify the tokenizer loaded from a statistical model, you should
+[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`. If
+you want to modify the tokenizer loaded from a statistical model, you should
 modify `nlp.tokenizer` directly.
 
 The prefix, infix and suffix rule sets include not only individual characters
 but also detailed regular expressions that take the surrounding context into
-account. For example, there is a regular expression that treats a hyphen
-between letters as an infix. If you do not want the tokenizer to split on
-hyphens between letters, you can modify the existing infix definition from
+account. For example, there is a regular expression that treats a hyphen between
+letters as an infix. If you do not want the tokenizer to split on hyphens
+between letters, you can modify the existing infix definition from
 [`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):
 
-
 ```python
 ### {executable="true"}
 import spacy
@@ -1074,10 +1073,10 @@ can sometimes tokenize things differently – for example, `"I'm"` →
 In situations like that, you often want to align the tokenization so that you
 can merge annotations from different sources together, or take vectors predicted
 by a
-[pretrained BERT model](https://github.com/huggingface/pytorch-transformers)
-and apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align)
-helper returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the
-number of misaligned tokens, the one-to-one mappings of token indices in both
+[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
+apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper
+returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
+of misaligned tokens, the one-to-one mappings of token indices in both
 directions and the indices where multiple tokens align to one single token.
 
 > #### ✏️ Things to try

From e8b9cee6fd942298145d13f0264ce430e50947d2 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 18 Nov 2019 12:41:48 +0100
Subject: [PATCH 2/2] Make example consistent with model (closes #4587) [ci skip]

---
 website/docs/usage/linguistic-features.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 0b274b2b6..db3aac686 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -435,22 +435,22 @@ import spacy
 from spacy.tokens import Span
 
 nlp = spacy.load("en_core_web_sm")
-doc = nlp("FB is hiring a new Vice President of global policy")
+doc = nlp("fb is hiring a new vice president of global policy")
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
-# the model didn't recognise "FB" as an entity :(
+# the model didn't recognise "fb" as an entity :(
 
 fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
 doc.ents = list(doc.ents) + [fb_ent]
 
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('After', ents)
-# [('FB', 0, 2, 'ORG')] 🎉
+# [('fb', 0, 2, 'ORG')] 🎉
 ```
 
 Keep in mind that you need to create a `Span` with the start and end index of
 the **token**, not the start and end index of the entity in the document. In
-this case, "FB" is token `(0, 1)` – but at the document level, the entity will
+this case, "fb" is token `(0, 1)` – but at the document level, the entity will
 have the start and end indices `(0, 2)`.
 
 #### Setting entity annotations from array {#setting-from-array}
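
Note on PATCH 2/2: the paragraph changed in this patch distinguishes token indices `(0, 1)` from character indices `(0, 2)`. The sketch below illustrates that distinction without spaCy. It assumes plain whitespace splitting as a stand-in for spaCy's tokenizer, and `token_char_offsets` is a hypothetical helper written for this note, not a spaCy function.

```python
# Illustration only: whitespace splitting stands in for a real tokenizer
# (an assumption of this sketch), and token_char_offsets is a hypothetical
# helper, not part of spaCy.
text = "fb is hiring a new vice president of global policy"


def token_char_offsets(text):
    """Return (token, start_char, end_char) for each whitespace-separated token."""
    offsets = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # search from the end of the previous token
        end = start + len(token)
        offsets.append((token, start, end))
        pos = end
    return offsets


offsets = token_char_offsets(text)

# Token-level span (0, 1): start is inclusive, end is exclusive,
# so this selects only the first token.
token_start, token_end = 0, 1
span_tokens = offsets[token_start:token_end]

# The same span expressed as character offsets into the text.
char_start = span_tokens[0][1]
char_end = span_tokens[-1][2]

print(span_tokens)           # [('fb', 0, 2)]
print(char_start, char_end)  # 0 2
```

Here the token span `(0, 1)` and the character span `(0, 2)` both select the same text, `"fb"`, which matches the indices quoted in the patched paragraph.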