---
title: EntityRuler
version: 2.1
teaser: 'Pipeline component for rule-based named entity recognition'
api_string_name: entity_ruler
api_trainable: false
---

<Infobox title="New in v4" variant="warning">

As of spaCy v4, there is no separate `EntityRuler` class. The entity ruler is
implemented as a special case of the `SpanRuler` component. See the
[migration guide](#migrating) below for differences between the v3
`EntityRuler` and v4 `SpanRuler` implementations of the `entity_ruler`
component. See the [`SpanRuler`](/api/spanruler) API docs for the full API.

</Infobox>

The entity ruler lets you add spans to the [`Doc.ents`](/api/doc#ents) using
token-based rules or exact phrase matches. It can be combined with the
statistical [`EntityRecognizer`](/api/entityrecognizer) to boost accuracy, or
used on its own to implement a purely rule-based entity recognition system. For
usage examples, see the docs on
[rule-based entity recognition](/usage/rule-based-matching#entityruler).
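
For example, a minimal setup might look like the following sketch (assuming a
blank English pipeline), which adds one phrase pattern and one token-based
pattern:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# One exact phrase pattern and one token-based pattern
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening an office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple', 'ORG'), ('San Francisco', 'GPE')]
```
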
## Assigned Attributes {id="assigned-attributes"}

This component assigns predictions in the same way as the
[`EntityRecognizer`](/api/entityrecognizer).

Predictions can be accessed under `Doc.ents` as a tuple. Each label will also be
reflected in each underlying token, where it is saved in the `Token.ent_type`
and `Token.ent_iob` fields. Note that by definition each token can only have one
label.

When setting `Doc.ents` to create training data, all the spans must be valid and
non-overlapping, or an error will be thrown.

| Location          | Value                                                              |
| ----------------- | ----------------------------------------------------------------- |
| `Doc.ents` | The annotated spans. ~~Tuple[Span]~~ |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. ~~int~~ |
| `Token.ent_iob_` | The IOB part of the named entity tag. ~~str~~ |
| `Token.ent_type` | The label part of the named entity tag (hash). ~~int~~ |
| `Token.ent_type_` | The label part of the named entity tag. ~~str~~ |
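
For example (a sketch, assuming the pipeline and patterns from the snippet
above), the assigned annotations can be read back from the document and its
tokens:

```python
doc = nlp("Apple is opening an office in San Francisco.")

# The matched spans are available as Doc.ents
print([(ent.text, ent.label_) for ent in doc.ents])

# Each token carries the IOB code and label of the entity span it belongs to
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)
```
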
## Config and implementation {id="config"}

The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
[`config.cfg` for training](/usage/training#config).

> #### Example
>
> ```python
> config = {
> "phrase_matcher_attr": None,
> "validate": True,
> "overwrite_ents": False,
> "ent_id_sep": "||",
> }
> nlp.add_pipe("entity_ruler", config=config)
> ```

| Setting | Description |
| ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `phrase_matcher_attr`                                | Optional attribute name to match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
| `matcher_fuzzy_compare` <Tag variant="new">3.5</Tag> | The fuzzy comparison method, passed on to the internal `Matcher`. Defaults to `spacy.matcher.levenshtein.levenshtein_compare`. ~~Callable~~ |
| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ |
| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"\|\|"`. ~~str~~ |
| `scorer` | The scoring method. Defaults to [`spacy.scorer.get_ner_prf`](/api/scorer#get_ner_prf). ~~Optional[Callable]~~ |
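
As an illustration of one of these settings (a sketch, assuming a blank English
pipeline), setting `phrase_matcher_attr` to `"LOWER"` makes phrase patterns
match on the lowercase token text rather than the exact text:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])

# Matching is done on the LOWER attribute, so the uppercase text still matches
doc = nlp("APPLE reported record earnings.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('APPLE', 'ORG')]
```
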
## Migrating from v3 {id="migrating"}

### Loading patterns

Unlike the v3 `EntityRuler`, the `SpanRuler` cannot load patterns on
initialization with `SpanRuler(patterns=patterns)` or directly from a JSONL file
path with `SpanRuler.from_disk(jsonl_path)`. Patterns should be loaded from the
JSONL file separately and then added through
[`SpanRuler.initialize`](/api/spanruler#initialize) or
[`SpanRuler.add_patterns`](/api/spanruler#add_patterns).
```diff
ruler = nlp.get_pipe("entity_ruler")
- ruler.from_disk("patterns.jsonl")
+ import srsly
+ patterns = srsly.read_jsonl("patterns.jsonl")
+ ruler.add_patterns(patterns)
```
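
Alternatively, the loaded patterns can be passed to
[`SpanRuler.initialize`](/api/spanruler#initialize). A sketch of that approach,
reusing the `patterns.jsonl` file from the diff above (the empty lambda stands
in for `get_examples` when no training examples are needed):

```python
import srsly

patterns = srsly.read_jsonl("patterns.jsonl")
ruler = nlp.get_pipe("entity_ruler")
ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
```
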
### Saving patterns

`SpanRuler.to_disk` always saves the full component data to a directory and does
not include an option to save the patterns to a single JSONL file.
```diff
ruler = nlp.get_pipe("entity_ruler")
- ruler.to_disk("patterns.jsonl")
+ import srsly
+ srsly.write_jsonl("patterns.jsonl", ruler.patterns)
```
### Accessing token and phrase patterns

The separate token patterns and phrase patterns are no longer accessible under
`ruler.token_patterns` or `ruler.phrase_patterns`. You can access the combined
patterns in their original format using the property
[`SpanRuler.patterns`](/api/spanruler#patterns).
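
For example (a sketch, assuming an `entity_ruler` with both pattern types
added), the combined patterns are returned in the same JSON-serializable format
they were added in:

```python
ruler = nlp.get_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

# Token patterns and phrase patterns are combined in one list
print(ruler.patterns)
```
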
### Removing patterns by ID

[`SpanRuler.remove`](/api/spanruler#remove) removes by label rather than ID. To
remove by ID, use [`SpanRuler.remove_by_id`](/api/spanruler#remove_by_id):
```diff
ruler = nlp.get_pipe("entity_ruler")
- ruler.remove("id")
+ ruler.remove_by_id("id")
```
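
As a short sketch (with hypothetical pattern IDs), the difference between
removing by label and removing by ID looks like this:

```python
ruler = nlp.get_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "San Francisco", "id": "sf"},
])

# Removes every pattern with the label "GPE"
ruler.remove("GPE")

# Removes every pattern whose "id" is "apple"
ruler.remove_by_id("apple")
```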