title | EntityRuler |
tag | class |
source | spacy/pipeline/entityruler.py |
new | 2.1 |
The EntityRuler lets you add spans to the Doc.ents using token-based rules or
exact phrase matches. It can be combined with the statistical EntityRecognizer
to boost accuracy, or used on its own to implement a purely rule-based entity
recognition system. After initialization, the component is typically added to
the processing pipeline using nlp.add_pipe. For usage examples, see the docs on
rule-based entity recognition.
EntityRuler.__init__
Initialize the entity ruler. If patterns are supplied here, they need to be a
list of dictionaries, each with a "label" and a "pattern" key. A pattern can be
either a token pattern (list) or a phrase pattern (string). For example:
{"label": "ORG", "pattern": "Apple"}.
Example
# Construction via create_pipe
ruler = nlp.create_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
Name | Type | Description |
nlp | Language | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
patterns | iterable | Optional patterns to load in. |
phrase_matcher_attr | int / unicode | Optional attr to pass to the internal PhraseMatcher. Defaults to None. |
validate | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as validate. Defaults to False. |
overwrite_ents | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to False. |
**cfg | - | Other config parameters. If the pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to spacy.load. |
RETURNS | EntityRuler | The newly constructed object. |
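The pattern shape described above (a dictionary with "label" and "pattern" keys, where the pattern is either a string or a list of token dictionaries) can be checked before a list is handed to the ruler. The following is a minimal pure-Python sketch of such a check. The helper name classify_pattern is hypothetical and not part of spaCy, and the check is looser than whatever validation spaCy itself performs.

```python
def classify_pattern(entry):
    """Return "phrase" or "token" for a pattern entry, or raise ValueError.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    if not isinstance(entry, dict):
        raise ValueError("entry must be a dict")
    if "label" not in entry or "pattern" not in entry:
        raise ValueError('entry needs "label" and "pattern" keys')
    pattern = entry["pattern"]
    if isinstance(pattern, str):
        # A plain string is treated as an exact phrase pattern.
        return "phrase"
    if isinstance(pattern, list) and all(isinstance(t, dict) for t in pattern):
        # A list of dicts is treated as a token pattern, one dict per token.
        return "token"
    raise ValueError("pattern must be a string or a list of dicts")


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
kinds = [classify_pattern(p) for p in patterns]
```

Here kinds holds one classification per entry, in the same order as patterns.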
EntityRuler.__len__
The number of all patterns added to the entity ruler.
Example
ruler = EntityRuler(nlp)
assert len(ruler) == 0
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
assert len(ruler) == 1
Name | Type | Description |
RETURNS | int | The number of patterns. |
EntityRuler.__contains__
Whether a label is present in the patterns.
Example
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
assert "ORG" in ruler
assert "PERSON" not in ruler
Name | Type | Description |
label | unicode | The label to check. |
RETURNS | bool | Whether the entity ruler contains the label. |
EntityRuler.__call__
Find matches in the Doc and add them to doc.ents. Typically, this happens
automatically after the component has been added to the pipeline using
nlp.add_pipe. If the entity ruler was initialized with overwrite_ents=True,
existing entities will be replaced if they overlap with the matches. When
matches overlap in a Doc, the entity ruler prioritizes longer patterns over
shorter ones, and if the lengths are equal, the match occurring first in the
Doc is chosen.
Example
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
doc = nlp("A text about Apple.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
assert ents == [("Apple", "ORG")]
Name | Type | Description |
doc | Doc | The Doc object to process, e.g. the Doc in the pipeline. |
RETURNS | Doc | The modified Doc with added entities, if available. |
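The overlap rule stated for this method (longer matches win, and among matches of equal length the one occurring first wins) can be sketched in plain Python. This is an illustration of the rule as described, not spaCy's implementation. The function name resolve_overlaps and the (label, start, end) tuple layout with an exclusive end offset are assumptions made for the example.

```python
def resolve_overlaps(matches):
    """Pick non-overlapping matches: longest first, earliest start on ties.

    `matches` is a list of (label, start, end) tuples holding token offsets,
    with `end` exclusive. Hypothetical helper; not part of spaCy.
    """
    # Longer spans sort first; among equal lengths, the earlier start wins.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    seen_tokens = set()
    chosen = []
    for label, start, end in ordered:
        if start == end:
            continue  # skip empty matches
        if any(i in seen_tokens for i in range(start, end)):
            continue  # overlaps a span that was already chosen
        chosen.append((label, start, end))
        seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(chosen, key=lambda m: m[1])


# Tokens: "New"(0) "York"(1) "City"(2) "Hall"(3)
matches = [("GPE", 0, 2), ("GPE", 0, 3), ("ORG", 2, 4), ("FAC", 3, 4)]
result = resolve_overlaps(matches)
```

In this example the three-token span is chosen first, the two spans that share tokens with it are dropped, and the single-token span at offset 3 survives because it does not overlap.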
EntityRuler.add_patterns
Add patterns to the entity ruler. A pattern can either be a token pattern (list
of dicts) or a phrase pattern (string). For more details, see the usage guide on
rule-based matching.
Example
patterns = [
{"label": "ORG", "pattern": "Apple"},
{"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}
]
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
Name | Type | Description |
patterns | list | The patterns to add. |
EntityRuler.to_disk
Save the entity ruler patterns to a file or directory. The patterns will be
saved as newline-delimited JSON (JSONL). If a file with the suffix .jsonl is
provided, only the patterns are saved as JSONL. If a directory name is provided,
a patterns.jsonl file and a cfg file with the component configuration are
exported.
Example
ruler = EntityRuler(nlp)
ruler.to_disk("/path/to/patterns.jsonl") # saves patterns only
ruler.to_disk("/path/to/entity_ruler") # saves patterns and config
Name | Type | Description |
path | unicode / Path | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |
EntityRuler.from_disk
Load the entity ruler from a file. Expects either a file containing
newline-delimited JSON (JSONL) with one entry per line, or a directory
containing a patterns.jsonl file and a cfg file with the component
configuration.
Example
ruler = EntityRuler(nlp)
ruler.from_disk("/path/to/patterns.jsonl") # loads patterns only
ruler.from_disk("/path/to/entity_ruler") # loads patterns and config
Name | Type | Description |
path | unicode / Path | A path to a JSONL file or directory. Paths may be either strings or Path-like objects. |
RETURNS | EntityRuler | The modified EntityRuler object. |
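The JSONL layout mentioned for to_disk and from_disk (one pattern entry per line) can be illustrated with the standard library alone. The sketch below assumes each line holds one JSON object with the "label" and "pattern" keys. The helpers write_jsonl and read_jsonl are hypothetical names for this example and are not spaCy functions, and the exact bytes spaCy writes may differ.

```python
import json
import os
import tempfile

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]


def write_jsonl(path, entries):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "patterns.jsonl")
    write_jsonl(path, patterns)
    # Count the lines on disk: one per pattern entry.
    with open(path, encoding="utf8") as f:
        line_count = len(f.read().splitlines())
    loaded = read_jsonl(path)
```

A file in this shape is easy to edit by hand or generate from another tool, since each line can be parsed independently of the others.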
EntityRuler.to_bytes
Serialize the entity ruler patterns to a bytestring.
Example
ruler = EntityRuler(nlp)
ruler_bytes = ruler.to_bytes()
Name | Type | Description |
RETURNS | bytes | The serialized patterns. |
EntityRuler.from_bytes
Load the pipe from a bytestring. Modifies the object in place and returns it.
Example
ruler_bytes = ruler.to_bytes()
ruler = EntityRuler(nlp)
ruler.from_bytes(ruler_bytes)
Name | Type | Description |
patterns_bytes | bytes | The bytestring to load. |
RETURNS | EntityRuler | The modified EntityRuler object. |
EntityRuler.labels
All labels present in the match patterns.
Name | Type | Description |
RETURNS | tuple | The string labels. |
EntityRuler.ent_ids
All entity IDs present in the id properties of the match patterns.
Name | Type | Description |
RETURNS | tuple | The string ent_ids. |
EntityRuler.patterns
Get all patterns that were added to the entity ruler.
Name | Type | Description |
RETURNS | list | The original patterns, one dictionary per pattern. |
Attributes
Name | Type | Description |
matcher | Matcher | The underlying matcher used to process token patterns. |
phrase_matcher | PhraseMatcher | The underlying phrase matcher, used to process phrase patterns. |
token_patterns | dict | The token patterns present in the entity ruler, keyed by label. |
phrase_patterns | dict | The phrase patterns present in the entity ruler, keyed by label. |
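To make "keyed by label" concrete, the sketch below groups a list of pattern entries into two dictionaries, one for token patterns and one for phrase patterns. It is a plain-Python illustration under the assumption that string patterns are phrase patterns and everything else is a token pattern. The function group_by_label is hypothetical, and the values spaCy stores internally (for example, processed phrase patterns rather than raw strings) may differ.

```python
from collections import defaultdict


def group_by_label(patterns):
    """Split pattern entries into token and phrase patterns, keyed by label.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    token_patterns = defaultdict(list)
    phrase_patterns = defaultdict(list)
    for entry in patterns:
        # Strings are treated as phrase patterns, lists as token patterns.
        if isinstance(entry["pattern"], str):
            phrase_patterns[entry["label"]].append(entry["pattern"])
        else:
            token_patterns[entry["label"]].append(entry["pattern"])
    return dict(token_patterns), dict(phrase_patterns)


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "ORG", "pattern": "Apple Inc."},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
token_patterns, phrase_patterns = group_by_label(patterns)
```

With this input, the two "ORG" phrase patterns end up in one list under the same key, and the "GPE" token pattern is stored as a list of token dictionaries.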