diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index f2ec48d63..5c5198308 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -82,6 +82,14 @@ check whether a [`Doc`](/api/doc) object has been parsed with the
 `doc.is_parsed` attribute, which returns a boolean value. If this attribute is
 `False`, the default sentence iterator will raise an exception.
 
+
+For a list of the syntactic dependency labels assigned by spaCy's models across
+different languages, see the label schemes documented in the
+[models directory](/models).
+
+
+
 ### Noun chunks {#noun-chunks}
 
 Noun chunks are "base noun phrases" – flat phrases that have a noun as their
@@ -288,11 +296,45 @@ for token in doc:
 | their    | `ADJ`  | `poss` | requests |
 | requests | `NOUN` | `dobj` | submit   |
 
-
+The dependency parse can be a useful tool for **information extraction**,
+especially when combined with other predictions like
+[named entities](#named-entities). The following example extracts money and
+currency values, i.e. entities labeled as `MONEY`, and then uses the dependency
+parse to find the noun phrase they are referring to – for example `"Net income"`
+→ `"$9.4 million"`.
-For a list of the syntactic dependency labels assigned by spaCy's models across
-different languages, see the label schemes documented in the
-[models directory](/models).
+```python
+### {executable="true"}
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+# Merge noun phrases and entities for easier analysis
+nlp.add_pipe("merge_entities")
+nlp.add_pipe("merge_noun_chunks")
+
+TEXTS = [
+    "Net income was $9.4 million compared to the prior year of $2.7 million.",
+    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
+]
+for doc in nlp.pipe(TEXTS):
+    for token in doc:
+        if token.ent_type_ == "MONEY":
+            # We have an attribute and direct object, so check for subject
+            if token.dep_ in ("attr", "dobj"):
+                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
+                if subj:
+                    print(subj[0], "-->", token)
+            # We have a prepositional object with a preposition
+            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
+                print(token.head.head, "-->", token)
+```
+
+
+
+For more examples of how to write rule-based information extraction logic that
+takes advantage of the predictions produced by the different pipeline
+components, see the usage guide on
+[combining models and rules](/usage/rule-based-matching#models-rules).
@@ -545,7 +587,7 @@ identifier from a knowledge base (KB). You can create your own
 [train a new Entity Linking model](/usage/training#entity-linker) using that
 custom-made KB.
 
-### Accessing entity identifiers {#entity-linking-accessing}
+### Accessing entity identifiers {#entity-linking-accessing model="entity linking"}
 
 The annotated KB identifier is accessible as either a hash value or as a string,
 using the attributes `ent.kb_id` and `ent.kb_id_` of a [`Span`](/api/span)
@@ -571,15 +613,6 @@ print(ent_ada_1) # ['Lovelace', 'PERSON', 'Q7259']
 print(ent_london_5) # ['London', 'GPE', 'Q84']
 ```
 
-| Text     | ent_type\_ | ent_kb_id\_ |
-| -------- | ---------- | ----------- |
-| Ada      | `"PERSON"` | `"Q7259"`   |
-| Lovelace | `"PERSON"` | `"Q7259"`   |
-| was      | -          | -           |
-| born     | -          | -           |
-| in       | -          | -           |
-| London   | `"GPE"`    | `"Q84"`     |
-
 ## Tokenization {#tokenization}
 
 Tokenization is the task of splitting a text into meaningful segments, called
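
The money extraction example added above relies on `merge_entities` and
`merge_noun_chunks` to collapse multi-word entities and noun phrases into
single tokens, which is what lets the `token.head.lefts` logic operate on whole
phrases like `"Net income"`. As a quick sanity check, a minimal sketch of what
merging does to the tokenization, assuming the same `en_core_web_sm` model,
could look like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# The same two merge components used in the extraction example
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

doc = nlp("Net income was $9.4 million compared to the prior year of $2.7 million.")
# Entities and noun chunks are now single tokens, so phrases like "Net income"
# show up as one token; the exact segmentation depends on the model's entity
# and parse predictions
print([token.text for token in doc])
```

With phrases collapsed this way, `token.head` and `token.head.lefts` in the
example point at whole noun phrases rather than at their individual words.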
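
The entity linking hunk only shows the tail of the docs' executable example.
For context, a minimal sketch of the code those `print` calls belong to might
look like the following; the pipeline name `my_custom_el_pipeline` is a
placeholder for a pipeline trained with a custom knowledge base, and the
expected values are taken from the table removed above.

```python
import spacy

# Placeholder name – assumes a pipeline trained with a custom KnowledgeBase,
# as described in the surrounding section
nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")

# Span level: each entity exposes its knowledge base identifier
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)  # e.g. [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

# Token level: the same identifier is available as ent_kb_id_
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5)  # ['London', 'GPE', 'Q84']
```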