//- 💫 DOCS > USAGE > SPACY 101 > PIPELINES

p
    | When you call #[code nlp] on a text, spaCy first tokenizes the text to
    | produce a #[code Doc] object. The #[code Doc] is then processed in several
    | different steps, also referred to as the
    | #[strong processing pipeline]. The pipeline used by the
    | #[+a("/models") default models] consists of a tagger, a parser and an
    | entity recognizer. Each pipeline component returns the processed
    | #[code Doc], which is then passed on to the next component.
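
p
    | The flow described above can be sketched in plain Python. This is a toy
    | illustration only: #[code ToyDoc], #[code ToyLanguage] and the
    | #[code toy_*] functions are hypothetical stand-ins with placeholder
    | rules, not spaCy's classes or statistical models. The sketch only shows
    | that the tokenizer creates the document and that each component receives
    | it, annotates it and returns it for the next component.

```python
# Toy illustration only: every name here is a hypothetical stand-in,
# not part of spaCy's API.

class ToyDoc:
    """A minimal container for tokens and their annotations."""

    def __init__(self, words):
        self.words = list(words)
        self.tags = [None] * len(self.words)
        self.ents = []


def toy_tokenizer(text):
    # The tokenizer is the only step that takes raw text and creates the doc.
    return ToyDoc(text.split())


def toy_tagger(doc):
    # Placeholder rule: capitalized words are "PROPN", everything else "X".
    doc.tags = ["PROPN" if w[:1].isupper() else "X" for w in doc.words]
    return doc


def toy_ner(doc):
    # Placeholder rule: treat every "PROPN" token as a one-word entity.
    doc.ents = [w for w, t in zip(doc.words, doc.tags) if t == "PROPN"]
    return doc


class ToyLanguage:
    def __init__(self, tokenizer, pipeline):
        self.tokenizer = tokenizer
        self.pipeline = pipeline  # list of (name, component) pairs

    def __call__(self, text):
        doc = self.tokenizer(text)
        # Each component returns the doc, which is handed to the next one.
        for name, component in self.pipeline:
            doc = component(doc)
        return doc


nlp = ToyLanguage(toy_tokenizer, [("tagger", toy_tagger), ("ner", toy_ner)])
doc = nlp("Berlin is a city in Germany")
print(doc.words, doc.tags, doc.ents)
```

p
    | Because every component has the same "doc in, doc out" shape, components
    | can be added, removed or reordered without changing the calling code.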

+graphic("/assets/img/pipeline.svg")
    include ../../assets/img/pipeline.svg

+aside
    | #[strong Name:] ID of the pipeline component.#[br]
    | #[strong Component:] spaCy's implementation of the component.#[br]
    | #[strong Creates:] Objects, attributes and properties modified and set by
    | the component.

+table(["Name", "Component", "Creates", "Description"])
    +row
        +cell #[strong tokenizer]
        +cell #[+api("tokenizer") #[code Tokenizer]]
        +cell #[code Doc]
        +cell Segment text into tokens.

    +row("divider")
        +cell #[strong tagger]
        +cell #[+api("tagger") #[code Tagger]]
        +cell #[code Doc[i].tag]
        +cell Assign part-of-speech tags.

    +row
        +cell #[strong parser]
        +cell #[+api("dependencyparser") #[code DependencyParser]]
        +cell
            | #[code Doc[i].head],
            | #[code Doc[i].dep],
            | #[code Doc.sents],
            | #[code Doc.noun_chunks]
        +cell Assign dependency labels.

    +row
        +cell #[strong ner]
        +cell #[+api("entityrecognizer") #[code EntityRecognizer]]
        +cell #[code Doc.ents], #[code Doc[i].ent_iob], #[code Doc[i].ent_type]
        +cell Detect and label named entities.

    +row
        +cell #[strong textcat]
        +cell #[+api("textcategorizer") #[code TextCategorizer]]
        +cell #[code Doc.cats]
        +cell Assign document labels.

    +row("divider")
        +cell #[strong ...]
        +cell #[+a("/usage/processing-pipelines#custom-components") custom components]
        +cell #[code Doc._.xxx], #[code Token._.xxx], #[code Span._.xxx]
        +cell Assign custom attributes, methods or properties.

p
    | The processing pipeline always #[strong depends on the statistical model]
    | and its capabilities. For example, a pipeline can only include an entity
    | recognizer component if the model includes data to make predictions of
    | entity labels. This is why each model specifies the pipeline to use
    | in its metadata, as a simple list containing the component names:

+code(false, "json").
    "pipeline": ["tagger", "parser", "ner"]
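
p
    | As a rough sketch of how such a list could be turned into a pipeline,
    | the example below looks up each name in a registry of factory functions.
    | The registry #[code FACTORIES], the helpers #[code make_component] and
    | #[code build_pipeline], and the #[code "lang"] key in the sample
    | metadata are assumptions made for illustration. They are not spaCy's
    | loading code.

```python
import json

# Hypothetical factory: returns a component that records its own name on
# the doc (a plain dict here) so the order of execution is visible.
def make_component(name):
    def component(doc):
        doc["applied"].append(name)
        return doc
    return component


# Hypothetical registry of the components this "model" can provide.
FACTORIES = {
    "tagger": lambda: make_component("tagger"),
    "parser": lambda: make_component("parser"),
    "ner": lambda: make_component("ner"),
}


def build_pipeline(meta_json, factories):
    """Create (name, component) pairs in the order listed in the metadata."""
    meta = json.loads(meta_json)
    pipeline = []
    for name in meta["pipeline"]:
        if name not in factories:
            # The model has no data or implementation for this component.
            raise KeyError("No factory available for component: " + name)
        pipeline.append((name, factories[name]()))
    return pipeline


meta_json = '{"lang": "en", "pipeline": ["tagger", "parser", "ner"]}'
pipeline = build_pipeline(meta_json, FACTORIES)

doc = {"text": "Hello world", "applied": []}
for name, component in pipeline:
    doc = component(doc)
print(doc["applied"])
```

p
    | Requesting a name that has no matching factory raises an error, which
    | mirrors the point above: a pipeline can only contain components that
    | the model is able to supply.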

p
    | Although you can mix and match pipeline components, their
    | #[strong order and combination] are usually important. Some components may
    | require certain modifications on the #[code Doc] to process it. As the
    | processing pipeline is applied, spaCy encodes the document's internal
    | #[strong meaning representations] as an array of floats, also called a
    | #[strong tensor]. This includes the tokens and their context, which is
    | required for the first component, the tagger, to make predictions of the
    | part-of-speech tags. Because spaCy's models are neural network models,
    | they only "speak" tensors and expect the input #[code Doc] to have
    | a #[code tensor].
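
p
    | To make the point about ordering concrete, here is a minimal sketch that
    | checks whether each component's inputs have been produced by an earlier
    | step. The "requires" and "assigns" lists, the #[code validate_order]
    | helper and the #[code entity_counter] component are all hypothetical and
    | exist only for this example. The sketch does not describe how spaCy
    | itself validates a pipeline.

```python
def validate_order(pipeline, available=("tokens",)):
    """Return (component name, missing inputs) for every component that
    runs before the annotations it needs have been set.

    `pipeline` is a list of (name, requires, assigns) tuples. This is a
    hypothetical convention for the example, not a spaCy data structure.
    """
    available = set(available)  # the tokenizer always runs first
    problems = []
    for name, requires, assigns in pipeline:
        missing = sorted(set(requires) - available)
        if missing:
            problems.append((name, missing))
        # Whatever this component sets is available to later components.
        available.update(assigns)
    return problems


# A hypothetical custom component that counts entities needs "ents",
# so it has to come after the entity recognizer.
good = [
    ("ner", ["tokens"], ["ents"]),
    ("entity_counter", ["ents"], ["ent_count"]),
]
bad = [
    ("entity_counter", ["ents"], ["ent_count"]),
    ("ner", ["tokens"], ["ents"]),
]

print(validate_order(good))
print(validate_order(bad))
```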