//- 💫 DOCS > USAGE > PROCESSING PIPELINES > CUSTOM COMPONENTS

p
    |  A component receives a #[code Doc] object and can modify it – for example,
    |  by using the current weights to make a prediction and set some annotation
    |  on the document. By adding a component to the pipeline, you'll get access
    |  to the #[code Doc] at any point #[strong during processing] – instead of
    |  only being able to modify it afterwards.

+aside-code("Example").
    def my_component(doc):
        # do something to the doc here
        return doc

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The #[code Doc] object processed by the previous component.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The #[code Doc] object processed by this pipeline component.

p
    |  Custom components can be added to the pipeline using the
    |  #[+api("language#add_pipe") #[code add_pipe]] method. Optionally, you
    |  can either specify a component to add it #[strong before or after], tell
    |  spaCy to add it #[strong first or last] in the pipeline, or define a
    |  #[strong custom name]. If no name is set and no #[code name] attribute
    |  is present on your component, the function name is used.

+code-exec.
    import spacy

    def my_component(doc):
        print("After tokenization, this doc has %s tokens." % len(doc))
        if len(doc) < 10:
            print("This is a pretty short document.")
        return doc

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(my_component, name='print_info', first=True)
    print(nlp.pipe_names)  # ['print_info', 'tagger', 'parser', 'ner']
    doc = nlp(u"This is a sentence.")

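p
    |  As a rough sketch (using the same #[code my_component] function and the
    |  built-in component names of #[code en_core_web_sm]), the remaining
    |  placement options look like this. Each call passes a distinct
    |  #[code name], since pipeline component names have to be unique:

+code.
    # sketch of the other placement options (not an official example)
    nlp.add_pipe(my_component, name='before_ner', before='ner')      # insert before the entity recognizer
    nlp.add_pipe(my_component, name='after_parser', after='parser')  # insert after the parser
    nlp.add_pipe(my_component, name='at_the_end', last=True)         # append to the end (the default)
    print(nlp.pipe_names)
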
p
    |  Of course, you can also wrap your component as a class to allow
    |  initialising it with custom settings and to hold state within the
    |  component. This is useful for #[strong stateful components], especially
    |  ones which #[strong depend on shared data]. In the following example, the
    |  custom component #[code EntityMatcher] can be initialised with the
    |  #[code nlp] object, a terminology list and an entity label. Using the
    |  #[+api("phrasematcher") #[code PhraseMatcher]], it then matches the terms
    |  in the #[code Doc] and adds them to the existing entities.

+aside("Rule-based entities vs. model", "💡")
    |  For complex tasks, it's usually better to train a statistical entity
    |  recognition model. However, statistical models require training data, so
    |  for many situations, rule-based approaches are more practical. This is
    |  especially true at the start of a project: you can use a rule-based
    |  approach as part of a data collection process, to help you "bootstrap" a
    |  statistical model.

+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.tokens import Span

    class EntityMatcher(object):
        name = 'entity_matcher'

        def __init__(self, nlp, terms, label):
            patterns = [nlp(text) for text in terms]
            self.matcher = PhraseMatcher(nlp.vocab)
            self.matcher.add(label, None, *patterns)

        def __call__(self, doc):
            matches = self.matcher(doc)
            for match_id, start, end in matches:
                span = Span(doc, start, end, label=match_id)
                doc.ents = list(doc.ents) + [span]
            return doc

    nlp = spacy.load('en_core_web_sm')
    terms = (u'cat', u'dog', u'tree kangaroo', u'giant sea spider')
    entity_matcher = EntityMatcher(nlp, terms, 'ANIMAL')

    nlp.add_pipe(entity_matcher, after='ner')
    print(nlp.pipe_names)  # the components in the pipeline

    doc = nlp(u"This is a text about Barack Obama and a tree kangaroo")
    print([(ent.text, ent.label_) for ent in doc.ents])

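p
    |  Because the terminology list and the label are just arguments to
    |  #[code __init__], you can also add several instances of the same class
    |  under different pipeline names, each holding its own terms. A rough
    |  sketch (not part of the original example, assuming the code above has
    |  run):

+code.
    # hypothetical second instance with its own terms and label
    fruit_matcher = EntityMatcher(nlp, (u'apple', u'dragon fruit'), 'FRUIT')
    nlp.add_pipe(fruit_matcher, name='fruit_matcher', after='entity_matcher')
    print(nlp.pipe_names)
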
+h(3, "custom-components-factories") Adding factories

p
    |  When spaCy loads a model via its #[code meta.json], it will iterate over
    |  the #[code "pipeline"] setting, look up every component name in the
    |  internal factories and call
    |  #[+api("language#create_pipe") #[code nlp.create_pipe]] to initialise the
    |  individual components, like the tagger, parser or entity recogniser. If
    |  your model uses custom components, this won't work – so you'll have to
    |  tell spaCy #[strong where to find your component]. You can do this by
    |  writing to #[code Language.factories]:

+code.
    from spacy.language import Language
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)

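p
    |  Once the factory is registered, #[code nlp.create_pipe] can resolve the
    |  name, which is what spaCy does internally when it reads the
    |  #[code "pipeline"] setting. A rough sketch, assuming the
    |  #[code EntityMatcher] class and the registration above:

+code.
    # hypothetical check that the factory is available
    component = nlp.create_pipe('entity_matcher',
                                config={'terms': (u'tree kangaroo',), 'label': 'ANIMAL'})
    print(component.name)  # 'entity_matcher'
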
p
    |  You can also ship the above code and your custom component in your
    |  packaged model's #[code __init__.py], so it's executed when you load your
    |  model. The #[code **cfg] config parameters are passed all the way down
    |  from #[+api("spacy#load") #[code spacy.load]], so you can load the model
    |  and its components with custom settings:

+code.
    nlp = spacy.load('your_custom_model', terms=(u'tree kangaroo',), label='ANIMAL')

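p
    |  As a rough sketch (the exact layout of your package is up to you), the
    |  model package's #[code __init__.py] could register the factory at import
    |  time, so it's available before the pipeline is put together:

+code.
    # __init__.py of your packaged model (hypothetical layout)
    from spacy.language import Language
    from .entity_matcher import EntityMatcher  # module shipping your component

    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
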
+infobox("Important note", "⚠️")
    |  When you load a model via its shortcut or package name, like
    |  #[code en_core_web_sm], spaCy will import the package and then call its
    |  #[code load()] method. This means that custom code in the model's
    |  #[code __init__.py] will be executed, too. This is #[strong not the case]
    |  if you're loading a model from a path containing the model data. Here,
    |  spaCy will only read in the #[code meta.json]. If you want to use custom
    |  factories with a model loaded from a path, you need to add them to
    |  #[code Language.factories] #[em before] you load the model.
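p
    |  For example, assuming the #[code EntityMatcher] factory from above and a
    |  hypothetical data directory #[code /path/to/model] whose
    |  #[code meta.json] lists #[code "entity_matcher"] in its pipeline, the
    |  registration has to come first:

+code.
    import spacy
    from spacy.language import Language

    # register the custom factory *before* loading from a path
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
    nlp = spacy.load('/path/to/model')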