mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 09:57:26 +03:00)

//- 💫 DOCS > USAGE > PROCESSING PIPELINES > CUSTOM COMPONENTS

p
    |  A component receives a #[code Doc] object and can modify it – for example,
    |  by using the current weights to make a prediction and set some annotation
    |  on the document. By adding a component to the pipeline, you'll get access
    |  to the #[code Doc] at any point #[strong during processing] – instead of
    |  only being able to modify it afterwards.

+aside-code("Example").
    def my_component(doc):
        # do something to the doc here
        return doc

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The #[code Doc] object processed by the previous component.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The #[code Doc] object processed by this pipeline component.

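The contract in the table above can be sketched in plain Python, with no spaCy required: a pipeline is just a sequence of callables, each receiving the object returned by the previous component and returning it for the next one. `FakeDoc` and both components here are illustrative stand-ins, not spaCy API.

```python
# Plain-Python sketch of the component contract: Doc in, Doc out.
# FakeDoc is a hypothetical stand-in for spacy.tokens.Doc.
class FakeDoc:
    def __init__(self, text):
        self.text = text
        self.annotations = {}

def count_chars(doc):
    # set some annotation on the document
    doc.annotations['n_chars'] = len(doc.text)
    return doc

def flag_short(doc):
    # a later component can read what an earlier one wrote
    doc.annotations['is_short'] = doc.annotations['n_chars'] < 25
    return doc

pipeline = [count_chars, flag_short]

doc = FakeDoc("This is a sentence.")
for component in pipeline:
    doc = component(doc)  # each component must return the Doc

print(doc.annotations)
# {'n_chars': 19, 'is_short': True}
```

If a component forgets to `return doc`, the next component in the loop receives `None` – which is exactly the failure mode the table's "returns" row guards against.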
p
    |  Custom components can be added to the pipeline using the
    |  #[+api("language#add_pipe") #[code add_pipe]] method. Optionally, you
    |  can either specify a component to add it #[strong before or after], tell
    |  spaCy to add it #[strong first or last] in the pipeline, or define a
    |  #[strong custom name]. If no name is set and no #[code name] attribute
    |  is present on your component, the function name is used.

+code-exec.
    import spacy

    def my_component(doc):
        print("After tokenization, this doc has %s tokens." % len(doc))
        if len(doc) < 10:
            print("This is a pretty short document.")
        return doc

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(my_component, name='print_info', first=True)
    print(nlp.pipe_names)  # ['print_info', 'tagger', 'parser', 'ner']
    doc = nlp(u"This is a sentence.")

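The placement options described above (first/last, before/after, custom name) boil down to list insertion. The following is a simplified plain-Python illustration of that logic, not spaCy's actual implementation of `Language.add_pipe`; the component names are examples.

```python
# Hypothetical sketch of add_pipe placement logic over a list of
# (name, component) pairs – not spaCy's real code.
def add_pipe(pipeline, component, name, before=None, after=None,
             first=False, last=False):
    names = [n for n, _ in pipeline]
    if first:
        pipeline.insert(0, (name, component))
    elif before is not None:
        pipeline.insert(names.index(before), (name, component))
    elif after is not None:
        pipeline.insert(names.index(after) + 1, (name, component))
    else:
        # appending last is the default behaviour
        pipeline.append((name, component))

pipeline = [('tagger', None), ('parser', None), ('ner', None)]
add_pipe(pipeline, None, 'print_info', first=True)
add_pipe(pipeline, None, 'merger', after='ner')
print([name for name, _ in pipeline])
# ['print_info', 'tagger', 'parser', 'ner', 'merger']
```

Note that `before` and `after` refer to components by name, which is why a stable, unique name for every component matters.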
p
    |  Of course, you can also wrap your component as a class, which lets
    |  you initialise it with custom settings and hold state within the
    |  component. This is useful for #[strong stateful components],
    |  especially ones which #[strong depend on shared data]. In the
    |  following example, the custom component #[code EntityMatcher] can be
    |  initialised with the #[code nlp] object, a terminology list and an
    |  entity label. Using the #[+api("phrasematcher") #[code PhraseMatcher]],
    |  it then matches the terms in the #[code Doc] and adds them to the
    |  existing entities.

+aside("Rule-based entities vs. model", "💡")
    |  For complex tasks, it's usually better to train a statistical entity
    |  recognition model. However, statistical models require training data,
    |  so for many situations, rule-based approaches are more practical. This
    |  is especially true at the start of a project: you can use a rule-based
    |  approach as part of a data collection process, to help you "bootstrap"
    |  a statistical model.

+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Span

    class EntityMatcher(object):
        name = 'entity_matcher'

        def __init__(self, nlp, terms, label):
            patterns = [nlp(text) for text in terms]
            self.matcher = PhraseMatcher(nlp.vocab)
            self.matcher.add(label, None, *patterns)

        def __call__(self, doc):
            matches = self.matcher(doc)
            for match_id, start, end in matches:
                span = Span(doc, start, end, label=match_id)
                doc.ents = list(doc.ents) + [span]
            return doc

    nlp = spacy.load('en_core_web_sm')
    terms = (u'cat', u'dog', u'tree kangaroo', u'giant sea spider')
    entity_matcher = EntityMatcher(nlp, terms, 'ANIMAL')

    nlp.add_pipe(entity_matcher, after='ner')
    print(nlp.pipe_names)  # the components in the pipeline

    doc = nlp(u"This is a text about Barack Obama and a tree kangaroo")
    print([(ent.text, ent.label_) for ent in doc.ents])

+h(3, "custom-components-factories") Adding factories

p
    |  When spaCy loads a model via its #[code meta.json], it will iterate over
    |  the #[code "pipeline"] setting, look up every component name in the
    |  internal factories and call
    |  #[+api("language#create_pipe") #[code nlp.create_pipe]] to initialise the
    |  individual components, like the tagger, parser or entity recogniser. If
    |  your model uses custom components, this won't work – so you'll have to
    |  tell spaCy #[strong where to find your component]. You can do this by
    |  adding an entry to #[code Language.factories]:

+code.
    from spacy.language import Language
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)

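The lookup described above can be pictured as a plain dict. This is a simplified sketch, not spaCy's actual loading code: each name in the model's `"pipeline"` meta is resolved through a factories mapping from names to functions that build components, and the factory strings here are illustrative placeholders.

```python
# Hypothetical sketch of factory-based pipeline construction.
# Each factory receives the nlp object plus config and returns a component.
factories = {
    'tagger': lambda nlp, **cfg: 'tagger component',
    'ner': lambda nlp, **cfg: 'ner component',
    'entity_matcher': lambda nlp, **cfg: 'EntityMatcher(%s)' % cfg.get('label'),
}

meta_pipeline = ['tagger', 'ner', 'entity_matcher']  # from meta.json
nlp = None  # stand-in for the Language object

# resolve each name in the meta to a built component
components = [factories[name](nlp, label='ANIMAL') for name in meta_pipeline]
print(components)
# ['tagger component', 'ner component', 'EntityMatcher(ANIMAL)']
```

Loading fails exactly when a name in `meta_pipeline` has no entry in the dict, which is why the custom `'entity_matcher'` factory must be registered before the model is loaded.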
p
    |  You can also ship the above code and your custom component in your
    |  packaged model's #[code __init__.py], so it's executed when you load your
    |  model. The #[code **cfg] config parameters are passed all the way down
    |  from #[+api("spacy#load") #[code spacy.load]], so you can load the model
    |  and its components with custom settings:

+code.
    nlp = spacy.load('your_custom_model', terms=(u'tree kangaroo',), label='ANIMAL')

+infobox("Important note", "⚠️")
    |  When you load a model via its shortcut or package name, like
    |  #[code en_core_web_sm], spaCy will import the package and then call its
    |  #[code load()] method. This means that custom code in the model's
    |  #[code __init__.py] will be executed, too. This is #[strong not the case]
    |  if you're loading a model from a path containing the model data. Here,
    |  spaCy will only read in the #[code meta.json]. If you want to use custom
    |  factories with a model loaded from a path, you need to add them to
    |  #[code Language.factories] #[em before] you load the model.