mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 20:28:20 +03:00
e53232533b
* Describing priority rules for overlapping matches * Create Tiljander.md * Describing priority rules for overlapping matches * Update website/docs/api/entityruler.md Co-Authored-By: Ines Montani <ines@ines.io> Co-authored-by: Ines Montani <ines@ines.io>
230 lines
10 KiB
Markdown
---
title: EntityRuler
tag: class
source: spacy/pipeline/entityruler.py
new: 2.1
---

The EntityRuler lets you add spans to the [`Doc.ents`](/api/doc#ents) using
token-based rules or exact phrase matches. It can be combined with the
statistical [`EntityRecognizer`](/api/entityrecognizer) to boost accuracy, or
used on its own to implement a purely rule-based entity recognition system.
After initialization, the component is typically added to the processing
pipeline using [`nlp.add_pipe`](/api/language#add_pipe). For usage examples, see
the docs on
[rule-based entity recognition](/usage/rule-based-matching#entityruler).

## EntityRuler.\_\_init\_\_ {#init tag="method"}

Initialize the entity ruler. If patterns are supplied here, they need to be a
list of dictionaries with a `"label"` and `"pattern"` key. A pattern can either
be a token pattern (list) or a phrase pattern (string). For example:
`{"label": "ORG", "pattern": "Apple"}`.

> #### Example
>
> ```python
> # Construction via create_pipe
> ruler = nlp.create_pipe("entity_ruler")
>
> # Construction from class
> from spacy.pipeline import EntityRuler
> ruler = EntityRuler(nlp, overwrite_ents=True)
> ```

| Name                  | Type          | Description |
| --------------------- | ------------- | ----------- |
| `nlp`                 | `Language`    | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
| `patterns`            | iterable      | Optional patterns to load in. |
| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). Defaults to `None`. |
| `validate`            | bool          | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. |
| `overwrite_ents`      | bool          | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
| `**cfg`               | -             | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
| **RETURNS**           | `EntityRuler` | The newly constructed object. |

## EntityRuler.\_\_len\_\_ {#len tag="method"}

The number of all patterns added to the entity ruler.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> assert len(ruler) == 0
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> assert len(ruler) == 1
> ```

| Name        | Type | Description             |
| ----------- | ---- | ----------------------- |
| **RETURNS** | int  | The number of patterns. |

## EntityRuler.\_\_contains\_\_ {#contains tag="method"}

Whether a label is present in the patterns.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> assert "ORG" in ruler
> assert "PERSON" not in ruler
> ```

| Name        | Type    | Description                                  |
| ----------- | ------- | -------------------------------------------- |
| `label`     | unicode | The label to check.                          |
| **RETURNS** | bool    | Whether the entity ruler contains the label. |

## EntityRuler.\_\_call\_\_ {#call tag="method"}

Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
happens automatically after the component has been added to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
with `overwrite_ents=True`, existing entities will be replaced if they overlap
with the matches. When matches overlap in a `Doc`, the entity ruler prioritizes
longer patterns over shorter ones, and if they are of equal length, the match
occurring first in the `Doc` is chosen.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> nlp.add_pipe(ruler)
>
> doc = nlp("A text about Apple.")
> ents = [(ent.text, ent.label_) for ent in doc.ents]
> assert ents == [("Apple", "ORG")]
> ```

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with added entities, if available.        |
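
The priority rule for overlapping matches can be illustrated in plain Python.
The following is a minimal sketch, not the library's code: it assumes matches
are `(label, start, end)` tuples of token offsets with an exclusive `end`, and
the helper name `resolve_overlaps` is hypothetical.

```python
def resolve_overlaps(matches):
    """Pick non-overlapping matches: longer spans first, then earlier ones.

    Each match is an assumed (label, start, end) tuple, `end` exclusive.
    """
    # Sort by length (descending); for equal lengths, the smaller start wins.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    taken = set()  # token indices already covered by a chosen match
    chosen = []
    for label, start, end in ordered:
        if any(i in taken for i in range(start, end)):
            continue  # overlaps a match with higher priority
        chosen.append((label, start, end))
        taken.update(range(start, end))
    # Return the surviving matches in document order.
    return sorted(chosen, key=lambda m: m[1])


# Tokens: "I like New York City" -> indices 0..4
matches = [("GPE", 2, 4), ("GPE", 2, 5), ("ORG", 3, 5)]
print(resolve_overlaps(matches))
```

Here the three-token match covering "New York City" wins over both two-token
matches. With two overlapping matches of equal length, the one that starts
earlier would be kept.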

## EntityRuler.add_patterns {#add_patterns tag="method"}

Add patterns to the entity ruler. A pattern can either be a token pattern (list
of dicts) or a phrase pattern (string). For more details, see the usage guide on
[rule-based matching](/usage/rule-based-matching).

> #### Example
>
> ```python
> patterns = [
>     {"label": "ORG", "pattern": "Apple"},
>     {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}
> ]
> ruler = EntityRuler(nlp)
> ruler.add_patterns(patterns)
> ```

| Name       | Type | Description          |
| ---------- | ---- | -------------------- |
| `patterns` | list | The patterns to add. |
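
To make the distinction between the two pattern kinds concrete, here is a
minimal pure-Python sketch that groups entries by kind and by label, in the
spirit of the `token_patterns` and `phrase_patterns` attributes. The function
name `split_patterns` is hypothetical, and the library's internal
representation may differ from the plain lists and strings used here.

```python
from collections import defaultdict


def split_patterns(patterns):
    """Group pattern entries by kind (token vs. phrase) and by label."""
    token_patterns = defaultdict(list)
    phrase_patterns = defaultdict(list)
    for entry in patterns:
        label, pattern = entry["label"], entry["pattern"]
        if isinstance(pattern, str):
            # A string is treated as an exact phrase pattern.
            phrase_patterns[label].append(pattern)
        elif isinstance(pattern, list):
            # A list of dicts is treated as a token pattern.
            token_patterns[label].append(pattern)
        else:
            raise ValueError(f"Unsupported pattern type: {type(pattern).__name__}")
    return dict(token_patterns), dict(phrase_patterns)


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
token_patterns, phrase_patterns = split_patterns(patterns)
print(token_patterns)
print(phrase_patterns)
```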

## EntityRuler.to_disk {#to_disk tag="method"}

Save the entity ruler patterns to a directory. The patterns will be saved as
newline-delimited JSON (JSONL). If a file with the suffix `.jsonl` is provided,
only the patterns are saved as JSONL. If a directory name is provided, a
`patterns.jsonl` file and a `cfg` file with the component configuration are
exported.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> ruler.to_disk("/path/to/patterns.jsonl")  # saves patterns only
> ruler.to_disk("/path/to/entity_ruler")    # saves patterns and config
> ```

| Name   | Type             | Description |
| ------ | ---------------- | ----------- |
| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |

## EntityRuler.from_disk {#from_disk tag="method"}

Load the entity ruler from a file. Expects either a file containing
newline-delimited JSON (JSONL) with one entry per line, or a directory
containing a `patterns.jsonl` file and a `cfg` file with the component
configuration.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> ruler.from_disk("/path/to/patterns.jsonl")  # loads patterns only
> ruler.from_disk("/path/to/entity_ruler")    # loads patterns and config
> ```

| Name        | Type             | Description |
| ----------- | ---------------- | ----------- |
| `path`      | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `EntityRuler`    | The modified `EntityRuler` object. |
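
The two on-disk layouts described for `to_disk` and `from_disk` can be sketched
with the standard library alone. This is an illustration of the file layout, not
the library's implementation: the helpers `save_patterns` and `load_patterns`
are hypothetical, the `cfg` file is assumed to hold JSON, and the
`overwrite_ents` key is only an example.

```python
import json
import tempfile
from pathlib import Path


def save_patterns(path, patterns, cfg=None):
    """Write patterns as JSONL, either to a .jsonl file or into a directory."""
    path = Path(path)
    # One JSON object per line.
    lines = "".join(json.dumps(p) + "\n" for p in patterns)
    if path.suffix == ".jsonl":
        path.write_text(lines, encoding="utf8")  # patterns only
    else:
        path.mkdir(parents=True, exist_ok=True)
        (path / "patterns.jsonl").write_text(lines, encoding="utf8")
        # Assumption: the config is stored as JSON in a file named "cfg".
        (path / "cfg").write_text(json.dumps(cfg or {}), encoding="utf8")


def load_patterns(path):
    """Read patterns (and the config, if present) back from either layout."""
    path = Path(path)
    is_file = path.suffix == ".jsonl"
    jsonl = path if is_file else path / "patterns.jsonl"
    text = jsonl.read_text(encoding="utf8")
    patterns = [json.loads(line) for line in text.splitlines() if line.strip()]
    cfg = {} if is_file else json.loads((path / "cfg").read_text(encoding="utf8"))
    return patterns, cfg


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
with tempfile.TemporaryDirectory() as tmp:
    save_patterns(Path(tmp) / "patterns.jsonl", patterns)
    file_result = load_patterns(Path(tmp) / "patterns.jsonl")
    save_patterns(Path(tmp) / "entity_ruler", patterns, cfg={"overwrite_ents": True})
    dir_result = load_patterns(Path(tmp) / "entity_ruler")
```

Both layouts round-trip the same list of pattern dictionaries. Only the
directory layout carries a configuration alongside the patterns.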

## EntityRuler.to_bytes {#to_bytes tag="method"}

Serialize the entity ruler patterns to a bytestring.

> #### Example
>
> ```python
> ruler = EntityRuler(nlp)
> ruler_bytes = ruler.to_bytes()
> ```

| Name        | Type  | Description              |
| ----------- | ----- | ------------------------ |
| **RETURNS** | bytes | The serialized patterns. |

## EntityRuler.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> ruler_bytes = ruler.to_bytes()
> ruler = EntityRuler(nlp)
> ruler.from_bytes(ruler_bytes)
> ```

| Name             | Type          | Description                        |
| ---------------- | ------------- | ---------------------------------- |
| `patterns_bytes` | bytes         | The bytestring to load.            |
| **RETURNS**      | `EntityRuler` | The modified `EntityRuler` object. |

## EntityRuler.labels {#labels tag="property"}

All labels present in the match patterns.

| Name        | Type  | Description        |
| ----------- | ----- | ------------------ |
| **RETURNS** | tuple | The string labels. |

## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"}

All entity ids present in the `id` properties of the match patterns.

| Name        | Type  | Description         |
| ----------- | ----- | ------------------- |
| **RETURNS** | tuple | The string ent_ids. |
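
As an illustration of where these ids come from, the sketch below collects the
distinct `id` values from a list of pattern dictionaries. The helper
`collect_ent_ids` is hypothetical, and sorting the result is an assumption made
here for deterministic output; the order returned by the property may differ.

```python
def collect_ent_ids(patterns):
    """Return the distinct `id` values found in the patterns, sorted."""
    return tuple(sorted({p["id"] for p in patterns if "id" in p}))


patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "ORG", "pattern": "Apple Inc.", "id": "apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}], "id": "san-francisco"},
    {"label": "PERSON", "pattern": "Tim Cook"},  # no id, so it contributes nothing
]
ent_ids = collect_ent_ids(patterns)
print(ent_ids)
```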

## EntityRuler.patterns {#patterns tag="property"}

Get all patterns that were added to the entity ruler.

| Name        | Type | Description                                        |
| ----------- | ---- | -------------------------------------------------- |
| **RETURNS** | list | The original patterns, one dictionary per pattern. |

## Attributes {#attributes}

| Name              | Type                                  | Description                                                      |
| ----------------- | ------------------------------------- | ---------------------------------------------------------------- |
| `matcher`         | [`Matcher`](/api/matcher)             | The underlying matcher used to process token patterns.           |
| `phrase_matcher`  | [`PhraseMatcher`](/api/phrasematcher) | The underlying phrase matcher, used to process phrase patterns.  |
| `token_patterns`  | dict                                  | The token patterns present in the entity ruler, keyed by label.  |
| `phrase_patterns` | dict                                  | The phrase patterns present in the entity ruler, keyed by label. |