Merge branch 'master' into spacy.io

Ines Montani 2019-11-18 12:42:04 +01:00
commit 534c4aa55b


@@ -435,22 +435,22 @@ import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG")  # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
```

Keep in mind that you need to create a `Span` with the start and end index of
the **token**, not the start and end index of the entity in the document. In
this case, "fb" is token `(0, 1)` but at the document level, the entity will
have the start and end indices `(0, 2)`.
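
If it's easier to think in character offsets, [`Doc.char_span`](/api/doc#char_span)
builds the same entity span from character indices and returns `None` if the
offsets don't fall on token boundaries. A small sketch, reusing the sentence
from the example above:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")

# token-based: token 0 up to (but not including) token 1
fb_ent = Span(doc, 0, 1, label="ORG")
print(fb_ent.start, fb_ent.end)            # token indices: 0 1
print(fb_ent.start_char, fb_ent.end_char)  # character offsets: 0 2

# character-based: returns None if the offsets don't map onto token boundaries
fb_ent_from_chars = doc.char_span(0, 2, label="ORG")
print(fb_ent_from_chars.start, fb_ent_from_chars.end)  # also 0 1
```
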
#### Setting entity annotations from array {#setting-from-array}
@@ -782,8 +782,8 @@ The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings.
2. Check whether we have an explicitly defined rule for this substring. If we
   do, use it.
3. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that special cases always get priority.
4. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
5. If we can't consume a prefix or a suffix, look for a special case.
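
Sketched in plain Python, steps 1–5 above might look roughly like the function
below. This is a simplified illustration, not spaCy's actual implementation:
`special_cases` stands in for the exception table, `prefix_search` and
`suffix_search` for the compiled prefix and suffix regexes, and infix and
`token_match` handling is left out entirely.

```python
import re

def tokenize_sketch(text, special_cases, prefix_search, suffix_search):
    tokens = []
    for substring in text.split():           # 1. whitespace-separated substrings
        suffixes = []
        while substring:
            if substring in special_cases:   # 2. an explicit rule always wins
                tokens.extend(special_cases[substring])
                substring = ""
                continue
            prefix = prefix_search(substring)
            if prefix:                       # 3. consume one prefix, go back to 2.
                tokens.append(substring[:prefix.end()])
                substring = substring[prefix.end():]
                continue
            suffix = suffix_search(substring)
            if suffix:                       # 4. consume one suffix, go back to 2.
                suffixes.insert(0, substring[suffix.start():])
                substring = substring[:suffix.start()]
                continue
            # 5. nothing left to consume and no special case matched above:
            # keep the remainder as a single token
            tokens.append(substring)
            substring = ""
        tokens.extend(suffixes)
    return tokens

special_cases = {"don't": ["do", "n't"]}
prefix_search = re.compile(r"""^[\("']""").search
suffix_search = re.compile(r"""[\)"'!.,?]$""").search
print(tokenize_sketch("(don't panic!)", special_cases, prefix_search, suffix_search))
# ['(', 'do', "n't", 'panic', '!', ')']
```
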
@@ -805,10 +805,10 @@ domain. There are five things you would need to define:
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should never
   be split, overriding the infix rules. Useful for things like URLs or numbers.
   Note that prefixes and suffixes will be split off before `token_match` is
   applied (see the sketch below).
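
Wired together, these pieces might look roughly like the following sketch. The
rules and regular expressions here are deliberately minimal, made-up examples;
the real defaults shipped with spaCy are far more extensive.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")

special_cases = {":)": [{"ORTH": ":)"}]}       # 1. exceptions, e.g. emoticons
prefix_re = re.compile(r"""^[\[\("']""")       # 2. open brackets and quotes
suffix_re = re.compile(r"""[\]\)"'.,!?]$""")   # 3. close brackets, quotes, punctuation
infix_re = re.compile(r"""[-~]""")             # 4. hyphen-like separators
url_re = re.compile(r"""^https?://\S+$""")     # 5. strings that should never be split

nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer,
                          token_match=url_re.match)

print([t.text for t in nlp("Visit https://spacy.io for more-info :)")])
```
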
You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its
@@ -858,8 +858,8 @@ only be applied at the **end of a token**, so your expression should end with a
#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults` and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
@@ -893,20 +893,19 @@ If you're using a statistical model, writing to the `nlp.Defaults` or
`English.Defaults` directly won't work, since the regular expressions are read
from the model and will be compiled when you load it. If you modify
`nlp.Defaults`, you'll only see the effect if you call
[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`. If
you want to modify the tokenizer loaded from a statistical model, you should
modify `nlp.tokenizer` directly.

</Infobox>
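
For example, adding one extra suffix rule to the defaults and writing it back
to the loaded pipeline's tokenizer directly might look like this (the trailing
`-+$` pattern is just an illustrative addition):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# take the default suffix rules and add one more pattern (here: trailing hyphens)
suffixes = nlp.Defaults.suffixes + (r"-+$",)
suffix_regex = compile_suffix_regex(suffixes)

# Tokenizer.suffix_search is writable, so overwrite it on the loaded tokenizer
nlp.tokenizer.suffix_search = suffix_regex.search
```
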
The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen between
letters as an infix. If you do not want the tokenizer to split on hyphens
between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py):

```python
### {executable="true"}
import spacy
@@ -1074,10 +1073,10 @@ can sometimes tokenize things differently – for example, `"I'm"` →
In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper
returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
of misaligned tokens, the one-to-one mappings of token indices in both
directions and the indices where multiple tokens align to one single token.
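
A rough sketch of how the helper can be called (the two token lists are just an
illustration of an `"obama" + "'" + "s"` vs. `"obama" + "'s"` mismatch):

```python
from spacy.gold import align

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print("Misaligned tokens:", cost)
print("One-to-one mappings a -> b:", a2b)         # one entry per token in other_tokens
print("One-to-one mappings b -> a:", b2a)         # one entry per token in spacy_tokens
print("Many-to-one mappings a -> b:", a2b_multi)  # "'" and "s" both map to "'s"
print("Many-to-one mappings b -> a:", b2a_multi)
```
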
> #### ✏️ Things to try