spaCy/website/usage/_processing-pipelines/_custom-components.jade
Ines Montani 49cee4af92
💫 Interactive code examples, spaCy Universe and various docs improvements (#2274)
2018-04-29 02:06:46 +02:00


//- 💫 DOCS > USAGE > PROCESSING PIPELINES > CUSTOM COMPONENTS
p
    | A component receives a #[code Doc] object and can modify it, for
    | example by using the current weights to make a prediction and set some
    | annotation on the document. By adding a component to the pipeline,
    | you'll get access to the #[code Doc] at any point
    | #[strong during processing] instead of only being able to modify it
    | afterwards.
+aside-code("Example").
    def my_component(doc):
        # do something to the doc here
        return doc
+table(["Argument", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The #[code Doc] object processed by the previous component.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The #[code Doc] object processed by this pipeline component.
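p
    | As a rough illustration of the contract in the table above, the
    | following sketch chains two components so that the second one reads an
    | annotation set by the first. It deliberately avoids spaCy itself:
    | #[code ToyDoc] and #[code run_pipeline] are hypothetical stand-ins for
    | the #[code Doc] object and the pipeline loop, not part of the library.

```python
# Simplified stand-in for a pipeline that applies its components in order.
# ToyDoc and run_pipeline are hypothetical names, not part of spaCy.


class ToyDoc(object):
    def __init__(self, words):
        self.words = list(words)
        self.annotations = {}


def count_tokens(doc):
    # First component: sets an annotation on the document.
    doc.annotations['n_tokens'] = len(doc.words)
    return doc


def flag_short(doc):
    # Second component: relies on the annotation set by the previous one.
    doc.annotations['is_short'] = doc.annotations['n_tokens'] < 10
    return doc


def run_pipeline(doc, components):
    # Each component receives the doc returned by the component before it.
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(ToyDoc(['This', 'is', 'a', 'sentence', '.']),
                   [count_tokens, flag_short])
print(doc.annotations)
```

p
    | Swapping the order of the two components would raise a
    | #[code KeyError], because #[code flag_short] expects an annotation that
    | has not been set yet. This is why a component must always return the
    | #[code Doc] and why its position in the pipeline matters.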
p
    | Custom components can be added to the pipeline using the
    | #[+api("language#add_pipe") #[code add_pipe]] method. Optionally, you
    | can name an existing component to add yours #[strong before or after]
    | it, tell spaCy to add it #[strong first or last] in the pipeline, or
    | define a #[strong custom name]. If no name is set and no #[code name]
    | attribute is present on your component, the function name is used.
+code-exec.
    import spacy

    def my_component(doc):
        print("After tokenization, this doc has %s tokens." % len(doc))
        if len(doc) < 10:
            print("This is a pretty short document.")
        return doc

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(my_component, name='print_info', first=True)
    print(nlp.pipe_names) # ['print_info', 'tagger', 'parser', 'ner']
    doc = nlp(u"This is a sentence.")
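p
    | The placement options and the name fallback described above can be
    | pictured with a small sketch in plain Python. #[code resolve_name] and
    | #[code place_component] are hypothetical helpers written for this
    | illustration. They only mimic the behaviour described here and are not
    | how #[code add_pipe] is implemented.

```python
# Hypothetical helpers that mimic the naming and placement rules described
# in the text. They operate on a plain list of names, not on a spaCy pipeline.


def resolve_name(component, name=None):
    # An explicit name wins, then a `name` attribute, then the function name.
    if name is not None:
        return name
    if hasattr(component, 'name'):
        return component.name
    return component.__name__


def place_component(pipe_names, new_name, before=None, after=None,
                    first=False, last=False):
    # Assumption for this sketch: at most one placement option may be given.
    if sum(bool(opt) for opt in (before, after, first, last)) > 1:
        raise ValueError("Use only one of before, after, first or last")
    names = list(pipe_names)  # copy, so the input list is left untouched
    if first:
        names.insert(0, new_name)
    elif before is not None:
        names.insert(names.index(before), new_name)
    elif after is not None:
        names.insert(names.index(after) + 1, new_name)
    else:
        names.append(new_name)  # no option given: add at the end
    return names


def my_component(doc):
    return doc


base = ['tagger', 'parser', 'ner']
print(place_component(base, resolve_name(my_component, 'print_info'), first=True))
print(place_component(base, resolve_name(my_component), before='parser'))
print(place_component(base, resolve_name(my_component), after='ner'))
```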
p
    | Of course, you can also wrap your component as a class to allow
    | initialising it with custom settings and holding state within the
    | component. This is useful for #[strong stateful components], especially
    | ones which #[strong depend on shared data]. In the following example,
    | the custom component #[code EntityMatcher] can be initialised with the
    | #[code nlp] object, a terminology list and an entity label. Using the
    | #[+api("phrasematcher") #[code PhraseMatcher]], it then matches the terms
    | in the #[code Doc] and adds them to the existing entities.
+aside("Rule-based entities vs. model", "💡")
    | For complex tasks, it's usually better to train a statistical entity
    | recognition model. However, statistical models require training data, so
    | for many situations, rule-based approaches are more practical. This is
    | especially true at the start of a project: you can use a rule-based
    | approach as part of a data collection process, to help you "bootstrap" a
    | statistical model.
+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Span

    class EntityMatcher(object):
        name = 'entity_matcher'

        def __init__(self, nlp, terms, label):
            patterns = [nlp(text) for text in terms]
            self.matcher = PhraseMatcher(nlp.vocab)
            self.matcher.add(label, None, *patterns)

        def __call__(self, doc):
            matches = self.matcher(doc)
            for match_id, start, end in matches:
                span = Span(doc, start, end, label=match_id)
                doc.ents = list(doc.ents) + [span]
            return doc

    nlp = spacy.load('en_core_web_sm')
    terms = (u'cat', u'dog', u'tree kangaroo', u'giant sea spider')
    entity_matcher = EntityMatcher(nlp, terms, 'ANIMAL')

    nlp.add_pipe(entity_matcher, after='ner')
    print(nlp.pipe_names) # the components in the pipeline

    doc = nlp(u"This is a text about Barack Obama and a tree kangaroo")
    print([(ent.text, ent.label_) for ent in doc.ents])
+h(3, "custom-components-factories") Adding factories
p
    | When spaCy loads a model via its #[code meta.json], it will iterate over
    | the #[code "pipeline"] setting, look up every component name in the
    | internal factories and call
    | #[+api("language#create_pipe") #[code nlp.create_pipe]] to initialise the
    | individual components, like the tagger, parser or entity recogniser. If
    | your model uses custom components, this lookup will fail, so you'll have
    | to tell spaCy #[strong where to find your component]. You can do this by
    | adding an entry to #[code Language.factories]:
+code.
    from spacy.language import Language
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
p
    | You can also ship the above code and your custom component in your
    | packaged model's #[code __init__.py], so it's executed when you load your
    | model. The #[code **cfg] config parameters are passed all the way down
    | from #[+api("spacy#load") #[code spacy.load]], so you can load the model
    | and its components with custom settings:
+code.
    nlp = spacy.load('your_custom_model', terms=(u'tree kangaroo',), label='ANIMAL')
+infobox("Important note", "⚠️")
    | When you load a model via its shortcut or package name, like
    | #[code en_core_web_sm], spaCy will import the package and then call its
    | #[code load()] method. This means that custom code in the model's
    | #[code __init__.py] will be executed, too. This is #[strong not the case]
    | if you're loading a model from a path containing the model data. Here,
    | spaCy will only read in the #[code meta.json]. If you want to use custom
    | factories with a model loaded from a path, you need to add them to
    | #[code Language.factories] #[em before] you load the model.
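p
    | To make the lookup order concrete, here is a simplified sketch of the
    | mechanism in plain Python. The #[code factories] dict,
    | #[code build_pipeline] and #[code EntityMatcherStub] are hypothetical
    | stand-ins for #[code Language.factories], the model loading step and
    | the custom component. spaCy's actual loading code does more than this.

```python
# Simplified, hypothetical model of the factory lookup described above.
# None of these names are part of spaCy.

factories = {}  # stand-in for Language.factories


class EntityMatcherStub(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms=(), label=None):
        # Settings arrive via **cfg from the loading call.
        self.terms = tuple(terms)
        self.label = label

    def __call__(self, doc):
        return doc


def build_pipeline(nlp, meta, **cfg):
    # Look up every name from the "pipeline" setting in the factories
    # and call the factory to initialise the component.
    pipeline = []
    for name in meta['pipeline']:
        if name not in factories:
            raise KeyError("Can't find factory for '%s'" % name)
        pipeline.append((name, factories[name](nlp, **cfg)))
    return pipeline


meta = {'pipeline': ['entity_matcher']}

# Loading before the factory is registered fails.
try:
    build_pipeline(None, meta)
    failed = False
except KeyError:
    failed = True

# Register the factory first, then load with custom settings.
factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcherStub(nlp, **cfg)
pipeline = build_pipeline(None, meta, terms=(u'tree kangaroo',), label='ANIMAL')
print(failed, [name for name, component in pipeline])
```

p
    | The first call fails because the name in the pipeline list has no
    | registered factory yet. This mirrors why factories need to be added
    | #[em before] a model is loaded from a path.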