mirror of https://github.com/explosion/spaCy.git
synced 2025-04-27 04:13:41 +03:00

Update docs [ci skip]

parent dd84577a98
commit f31c4462ca
@@ -82,6 +82,14 @@ check whether a [`Doc`](/api/doc) object has been parsed with the
 `doc.is_parsed` attribute, which returns a boolean value. If this attribute is
 `False`, the default sentence iterator will raise an exception.
 
+<Infobox title="Dependency label scheme" emoji="📖">
+
+For a list of the syntactic dependency labels assigned by spaCy's models across
+different languages, see the label schemes documented in the
+[models directory](/models).
+
+</Infobox>
+
 ### Noun chunks {#noun-chunks}
 
 Noun chunks are "base noun phrases" – flat phrases that have a noun as their
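As an editorial aside (not part of the commit): if no dependency parser has run, sentence boundaries can instead come from the rule-based `sentencizer` component, so the exception mentioned above never triggers. A minimal sketch, assuming spaCy v3's string-based `nlp.add_pipe`; the example text is made up:

```python
import spacy

# A blank pipeline has no parser, so iterating `doc.sents` would normally
# raise an error. The rule-based "sentencizer" sets sentence boundaries
# without any statistical model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")
print([sent.text for sent in doc.sents])
```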

@@ -288,11 +296,45 @@ for token in doc:
 | their | `ADJ` | `poss` | requests |
 | requests | `NOUN` | `dobj` | submit |
 
-<Infobox title="Dependency label scheme" emoji="📖">
-
-For a list of the syntactic dependency labels assigned by spaCy's models across
-different languages, see the label schemes documented in the
-[models directory](/models).
-
+The dependency parse can be a useful tool for **information extraction**,
+especially when combined with other predictions like
+[named entities](#named-entities). The following example extracts money and
+currency values, i.e. entities labeled as `MONEY`, and then uses the dependency
+parse to find the noun phrase they are referring to – for example `"Net income"`
+→ `"$9.4 million"`.
+
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+# Merge noun phrases and entities for easier analysis
+nlp.add_pipe("merge_entities")
+nlp.add_pipe("merge_noun_chunks")
+
+TEXTS = [
+    "Net income was $9.4 million compared to the prior year of $2.7 million.",
+    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
+]
+for doc in nlp.pipe(TEXTS):
+    for token in doc:
+        if token.ent_type_ == "MONEY":
+            # We have an attribute and direct object, so check for subject
+            if token.dep_ in ("attr", "dobj"):
+                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
+                if subj:
+                    print(subj[0], "-->", token)
+            # We have a prepositional object with a preposition
+            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
+                print(token.head.head, "-->", token)
+```
+
+<Infobox title="Combining models and rules" emoji="📖">
+
+For more examples of how to write rule-based information extraction logic that
+takes advantage of the model's predictions produced by the different components,
+see the usage guide on
+[combining models and rules](/usage/rule-based-matching#models-rules).
+
 </Infobox>
 
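For a taste of the rule-based side referenced in that infobox, here is a minimal `Matcher` sketch (an editorial illustration, not part of the commit). It uses a blank pipeline, so no trained model is needed; the pattern name and example text are made up:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Token pattern: the words "net" + "income" in any casing
matcher.add("NET_INCOME", [[{"LOWER": "net"}, {"LOWER": "income"}]])

doc = nlp("Net income was $9.4 million compared to the prior year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Net income
```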

@@ -545,7 +587,7 @@ identifier from a knowledge base (KB). You can create your own
 [train a new Entity Linking model](/usage/training#entity-linker) using that
 custom-made KB.
 
-### Accessing entity identifiers {#entity-linking-accessing}
+### Accessing entity identifiers {#entity-linking-accessing model="entity linking"}
 
 The annotated KB identifier is accessible as either a hash value or as a string,
 using the attributes `ent.kb_id` and `ent.kb_id_` of a [`Span`](/api/span)
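A minimal sketch of that hash/string duality (an editorial aside, not part of the commit): here the KB identifier is set by hand on a blank pipeline rather than predicted by a trained entity linker, reusing the Q-IDs from the surrounding example:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Ada Lovelace was born in London")
# Attach the KB identifier manually – normally the entity linker predicts it
doc.ents = [Span(doc, 0, 2, label="PERSON", kb_id="Q7259")]

ent = doc.ents[0]
# kb_id_ is the string, kb_id the corresponding hash value
print(ent.text, ent.kb_id_, ent.kb_id)
```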

@@ -571,15 +613,6 @@ print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
 print(ent_london_5)  # ['London', 'GPE', 'Q84']
 ```
-
-| Text     | ent_type\_ | ent_kb_id\_ |
-| -------- | ---------- | ----------- |
-| Ada      | `"PERSON"` | `"Q7259"`   |
-| Lovelace | `"PERSON"` | `"Q7259"`   |
-| was      | -          | -           |
-| born     | -          | -           |
-| in       | -          | -           |
-| London   | `"GPE"`    | `"Q84"`     |
 
 ## Tokenization {#tokenization}
 
 Tokenization is the task of splitting a text into meaningful segments, called
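Tokenization can be tried without a trained pipeline, since spaCy's tokenizer is rule-based. A small sketch (an editorial aside, not part of the commit), using the sentence from spaCy's own introductory docs:

```python
import spacy

# A blank English pipeline still ships the full rule-based tokenizer,
# including exceptions like "Let's" -> "Let", "'s" and "N.Y." kept whole.
nlp = spacy.blank("en")
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])  # ['Let', "'s", 'go', 'to', 'N.Y.', '!']
```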