mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-30 23:47:31 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			126 lines
		
	
	
		
			6.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
---
title: EntityRuler
version: 2.1
teaser: 'Pipeline component for rule-based named entity recognition'
api_string_name: entity_ruler
api_trainable: false
---

<Infobox title="New in v4" variant="warning">

As of spaCy v4, there is no separate `EntityRuler` class. The entity ruler is
implemented as a special case of the `SpanRuler` component.

See the [migration guide](#migrating) below for differences between the v3
`EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler`
component.

See the [`SpanRuler`](/api/spanruler) API docs for the full API.

</Infobox>

The entity ruler lets you add spans to [`Doc.ents`](/api/doc#ents) using
token-based rules or exact phrase matches. It can be combined with the
statistical [`EntityRecognizer`](/api/entityrecognizer) to boost accuracy, or
used on its own to implement a purely rule-based entity recognition system. For
usage examples, see the docs on
[rule-based entity recognition](/usage/rule-based-matching#entityruler).

## Assigned Attributes {id="assigned-attributes"}

This component assigns predictions in essentially the same way as the
[`EntityRecognizer`](/api/entityrecognizer).

Predictions can be accessed under `Doc.ents` as a tuple. Each label is also
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.

When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be raised.

| Location          | Value                                                             |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents`        | The annotated spans. ~~Tuple[Span]~~                              |
| `Token.ent_iob`   | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_`  | The IOB part of the named entity tag. ~~str~~                     |
| `Token.ent_type`  | The label part of the named entity tag (hash). ~~int~~            |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~                   |
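
As a rough illustration of the attributes above, the following pure-Python
sketch shows how a set of non-overlapping entity spans could map to one IOB tag
and one label per token. It is a toy model, not spaCy's implementation: the
`spans_to_iob` helper and the `(start, end, label)` tuples are hypothetical
names introduced only for this example.

```python
from typing import List, Tuple

# Hypothetical span format for this sketch:
# (start token index, end token index (exclusive), label)
Span = Tuple[int, int, str]


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return one (iob, label) pair per token, rejecting overlapping spans."""
    tags = [("O", "")] * n_tokens  # every token starts outside any entity
    for start, end, label in sorted(spans):
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: {(start, end, label)}")
        for i in range(start, end):
            if tags[i][0] != "O":
                # A token can carry only one label, so overlaps are an error
                raise ValueError(f"overlapping span: {(start, end, label)}")
            tags[i] = ("B" if i == start else "I", label)
    return tags


tokens = ["Apple", "opened", "an", "office", "in", "San", "Francisco"]
tags = spans_to_iob(len(tokens), [(0, 1, "ORG"), (5, 7, "GPE")])
for token, (iob, label) in zip(tokens, tags):
    print(token, iob, label)
```

In spaCy itself, the string forms of these two values are exposed as
`Token.ent_iob_` and `Token.ent_type_`, with the integer and hash forms under
`Token.ent_iob` and `Token.ent_type`.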

## Config and implementation {id="config"}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config).

> #### Example
>
> ```python
> import spacy
>
> nlp = spacy.blank("en")
> config = {
>    "phrase_matcher_attr": None,
>    "validate": True,
>    "overwrite_ents": False,
>    "ent_id_sep": "||",
> }
> nlp.add_pipe("entity_ruler", config=config)
> ```

| Setting                                              | Description                                                                                                                                                                                      |
| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `phrase_matcher_attr`                                | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `matcher_fuzzy_compare` <Tag variant="new">3.5</Tag> | The fuzzy comparison method, passed on to the internal `Matcher`. Defaults to `spacy.matcher.levenshtein.levenshtein_compare`. ~~Callable~~                                                      |
| `validate`                                           | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~                                                                                |
| `overwrite_ents`                                     | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~                                                        |
| `ent_id_sep`                                         | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~                                                                                                                          |
| `scorer`                                             | The scoring method. Defaults to [`spacy.scorer.get_ner_prf`](/api/scorer#get_ner_prf). ~~Optional[Callable]~~                                                                                    |
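
The effect of `overwrite_ents` can be pictured with a small pure-Python sketch.
This is a simplified model under stated assumptions, not the component's actual
code: spans are hypothetical `(start, end, label)` tuples, `merge_entities` is
a made-up helper, and preferring longer matches when rule matches overlap each
other is an assumption of this sketch.

```python
from typing import List, Tuple

# Hypothetical span format: (start token, end token (exclusive), label)
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True if the two token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(
    existing: List[Span], matches: List[Span], overwrite_ents: bool = False
) -> List[Span]:
    """Combine existing entities with rule matches without overlaps."""
    # Assumption of this sketch: longer matches win, then earlier ones
    ordered = sorted(matches, key=lambda s: (-(s[1] - s[0]), s[0]))
    if overwrite_ents:
        # Rule matches take priority; existing entities survive only
        # where they do not collide with a kept match
        kept: List[Span] = []
        for span in ordered:
            if not any(_overlaps(span, k) for k in kept):
                kept.append(span)
        kept += [e for e in existing if not any(_overlaps(e, k) for k in kept)]
    else:
        # Existing entities take priority; matches only fill the gaps
        kept = list(existing)
        for span in ordered:
            if not any(_overlaps(span, k) for k in kept):
                kept.append(span)
    return sorted(kept)


existing = [(0, 1, "PERSON")]
matches = [(0, 2, "ORG"), (5, 6, "GPE")]
kept_default = merge_entities(existing, matches, overwrite_ents=False)
kept_overwrite = merge_entities(existing, matches, overwrite_ents=True)
print(kept_default)
print(kept_overwrite)
```

With the default setting the existing `PERSON` span is kept and the colliding
`ORG` match is dropped; with `overwrite_ents=True` the `ORG` match replaces it.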

## Migrating from v3 {id="migrating"}

### Loading patterns

Unlike the v3 `EntityRuler`, the `SpanRuler` cannot load patterns on
initialization with `SpanRuler(patterns=patterns)` or directly from a JSONL file
path with `SpanRuler.from_disk(jsonl_path)`. Patterns should be loaded from the
JSONL file separately and then added through
[`SpanRuler.initialize`](/api/spanruler#initialize) or
[`SpanRuler.add_patterns`](/api/spanruler#add_patterns).

```diff
 ruler = nlp.get_pipe("entity_ruler")
- ruler.from_disk("patterns.jsonl")
+ import srsly
+ patterns = srsly.read_jsonl("patterns.jsonl")
+ ruler.add_patterns(patterns)
```
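
To make the loading step concrete, here is a minimal sketch that reads a JSONL
pattern file using only the standard library `json` module instead of `srsly`.
`load_patterns` is a hypothetical helper name and the sample patterns are made
up for the demo; the resulting list of dicts is what you would then pass to
`ruler.add_patterns`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_patterns(path: Path) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a small sample file so the sketch runs on its own
sample = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "patterns.jsonl"
    jsonl_path.write_text(
        "\n".join(json.dumps(p) for p in sample) + "\n", encoding="utf8"
    )
    patterns = load_patterns(jsonl_path)

print(patterns)
# With a pipeline at hand, the next step would be:
# nlp.get_pipe("entity_ruler").add_patterns(patterns)
```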

### Saving patterns

`SpanRuler.to_disk` always saves the full component data to a directory and does
not include an option to save the patterns to a single JSONL file.

```diff
 ruler = nlp.get_pipe("entity_ruler")
- ruler.to_disk("patterns.jsonl")
+ import srsly
+ srsly.write_jsonl("patterns.jsonl", ruler.patterns)
```
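
The saving step can also be sketched with the standard library alone.
`save_patterns` is a hypothetical helper; in a real pipeline its second
argument would be `ruler.patterns`, which a hard-coded list stands in for here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def save_patterns(path: Path, patterns: Iterable[Dict[str, Any]]) -> None:
    """Write each pattern as one JSON object per line."""
    with path.open("w", encoding="utf8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")


# Stand-in for ruler.patterns (made-up sample data)
ruler_patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "patterns.jsonl"
    save_patterns(out_path, ruler_patterns)
    lines = out_path.read_text(encoding="utf8").splitlines()

print(len(lines))
```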

### Accessing token and phrase patterns

The separate token patterns and phrase patterns are no longer accessible under
`ruler.token_patterns` or `ruler.phrase_patterns`. You can access the combined
patterns in their original format using the property
[`SpanRuler.patterns`](/api/spanruler#patterns).
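
If existing code relied on the separate lists, they can be rebuilt from the
combined patterns. The sketch below assumes the usual pattern format, where a
phrase pattern stores a string under `"pattern"` and a token pattern stores a
list of token dicts. `split_patterns` is a hypothetical helper, grouping by
label is a choice made for this sketch, and the hard-coded list stands in for
`ruler.patterns`.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Pattern = Dict[str, Any]


def split_patterns(
    patterns: List[Pattern],
) -> Tuple[Dict[str, list], Dict[str, list]]:
    """Group token patterns (lists) and phrase patterns (strings) by label."""
    token_patterns = defaultdict(list)
    phrase_patterns = defaultdict(list)
    for entry in patterns:
        if isinstance(entry["pattern"], str):
            phrase_patterns[entry["label"]].append(entry["pattern"])
        else:
            token_patterns[entry["label"]].append(entry["pattern"])
    return dict(token_patterns), dict(phrase_patterns)


# Stand-in for ruler.patterns (made-up sample data)
combined = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    {"label": "ORG", "pattern": "Microsoft"},
]
token_patterns, phrase_patterns = split_patterns(combined)
print(token_patterns)
print(phrase_patterns)
```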

### Removing patterns by ID

[`SpanRuler.remove`](/api/spanruler#remove) removes by label rather than ID. To
remove by ID, use [`SpanRuler.remove_by_id`](/api/spanruler#remove_by_id):

```diff
 ruler = nlp.get_pipe("entity_ruler")
- ruler.remove("id")
+ ruler.remove_by_id("id")
```
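
The difference between the two kinds of removal can be shown on a plain list of
pattern dicts. This is an illustrative sketch only: `remove_by_label` and
`remove_by_id` below are hypothetical free functions that filter a list,
whereas the real `SpanRuler` methods operate on the component itself.

```python
from typing import Any, Dict, List

Pattern = Dict[str, Any]


def remove_by_label(patterns: List[Pattern], label: str) -> List[Pattern]:
    """Drop every pattern whose label matches."""
    return [p for p in patterns if p["label"] != label]


def remove_by_id(patterns: List[Pattern], pattern_id: str) -> List[Pattern]:
    """Drop every pattern whose optional "id" matches."""
    return [p for p in patterns if p.get("id") != pattern_id]


# Made-up sample data in the usual pattern format
patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "ORG", "pattern": "Microsoft", "id": "microsoft"},
    {"label": "GPE", "pattern": "San Francisco", "id": "san-francisco"},
]
without_org = remove_by_label(patterns, "ORG")
without_apple = remove_by_id(patterns, "apple")
print([p["id"] for p in without_org])
print([p["id"] for p in without_apple])
```

Removing by label drops both `ORG` patterns at once, while removing by ID
targets only the pattern registered under `"apple"`.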