Improve EntityRuler serialization

Ines Montani 2019-07-10 12:25:45 +02:00
parent 570ab1f481
commit 40cd03fc35
2 changed files with 32 additions and 23 deletions

@@ -293,12 +293,13 @@ class EntityRuler(object):
         """Save the entity ruler patterns to a directory. The patterns will be
         saved as newline-delimited JSON (JSONL).
-        path (unicode / Path): The JSONL file to load.
+        path (unicode / Path): The JSONL file to save.
         **kwargs: Other config paramters, mostly for consistency.
         RETURNS (EntityRuler): The loaded entity ruler.
         DOCS: https://spacy.io/api/entityruler#to_disk
         """
+        path = ensure_path(path)
         cfg = {
             "overwrite": self.overwrite,
             "phrase_matcher_attr": self.phrase_matcher_attr,
@@ -310,5 +311,7 @@ class EntityRuler(object):
             ),
             "cfg": lambda p: srsly.write_json(p, cfg),
         }
-        path = ensure_path(path)
-        to_disk(path, serializers, {})
+        if path.suffix == ".jsonl":  # user wants to save only JSONL
+            srsly.write_jsonl(path, self.patterns)
+        else:
+            to_disk(path, serializers, {})
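
As a rough illustration of the suffix check in the hunk above, the sketch below reproduces the same branching using only the standard library. It is not spaCy's implementation: `save_patterns`, `load_patterns` and `demo` are hypothetical names, `json` stands in for `srsly`, and the sample pattern and config values are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def _write_jsonl(path, records):
    # One JSON object per line (newline-delimited JSON).
    with path.open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def save_patterns(path, patterns, cfg):
    """Hypothetical stand-in for the branching shown in the diff."""
    path = Path(path)
    if path.suffix == ".jsonl":
        # Single-file mode: only the patterns are written.
        _write_jsonl(path, patterns)
    else:
        # Directory mode: patterns plus a separate config file.
        path.mkdir(parents=True, exist_ok=True)
        _write_jsonl(path / "patterns.jsonl", patterns)
        with (path / "cfg").open("w", encoding="utf8") as f:
            json.dump(cfg, f)


def load_patterns(path):
    """Read patterns back from either layout (assumed mirror of the save logic)."""
    path = Path(path)
    source = path if path.suffix == ".jsonl" else path / "patterns.jsonl"
    with source.open(encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo():
    # Illustrative data only; the keys of cfg follow the dict shown in the diff.
    patterns = [{"label": "ORG", "pattern": "Apple"}]
    cfg = {"overwrite": False, "phrase_matcher_attr": None}
    with tempfile.TemporaryDirectory() as tmp_name:
        tmp = Path(tmp_name)
        save_patterns(tmp / "patterns.jsonl", patterns, cfg)
        save_patterns(tmp / "entity_ruler", patterns, cfg)
        files = sorted(p.relative_to(tmp).as_posix() for p in tmp.rglob("*"))
        from_file = load_patterns(tmp / "patterns.jsonl")
        from_dir = load_patterns(tmp / "entity_ruler")
    return files, from_file, from_dir
```

Calling `demo()` writes both layouts into a temporary directory and returns the relative paths that were created, along with the patterns read back from each, so the single-file and directory modes can be compared directly.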

@@ -30,14 +30,14 @@ be a token pattern (list) or a phrase pattern (string). For example:
 > ruler = EntityRuler(nlp, overwrite_ents=True)
 > ```
 | Name | Type | Description |
-| ---------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. |
 | `patterns` | iterable | Optional patterns to load in. |
 | `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phtasematcher). defaults to `None` |
 | `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. |
 | `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. |
 | **RETURNS** | `EntityRuler` | The newly constructed object. |
 ## EntityRuler.\_\_len\_\_ {#len tag="method"}
@@ -123,35 +123,41 @@ of dicts) or a phrase pattern (string). For more details, see the usage guide on
 ## EntityRuler.to_disk {#to_disk tag="method"}
 Save the entity ruler patterns to a directory. The patterns will be saved as
-newline-delimited JSON (JSONL).
+newline-delimited JSON (JSONL). If a file with the suffix `.jsonl` is provided,
+only the patterns are saved as JSONL. If a directory name is provided, a
+`patterns.jsonl` and `cfg` file with the component configuration is exported.
 > #### Example
 >
 > ```python
 > ruler = EntityRuler(nlp)
-> ruler.to_disk("/path/to/rules.jsonl")
+> ruler.to_disk("/path/to/patterns.jsonl")  # saves patterns only
+> ruler.to_disk("/path/to/entity_ruler")  # saves patterns and config
 > ```
 | Name | Type | Description |
-| ------ | ---------------- | ---------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 ## EntityRuler.from_disk {#from_disk tag="method"}
-Load the entity ruler from a file. Expects a file containing newline-delimited
-JSON (JSONL) with one entry per line.
+Load the entity ruler from a file. Expects either a file containing
+newline-delimited JSON (JSONL) with one entry per line, or a directory
+containing a `patterns.jsonl` file and a `cfg` file with the component
+configuration.
 > #### Example
 >
 > ```python
 > ruler = EntityRuler(nlp)
-> ruler.from_disk("/path/to/rules.jsonl")
+> ruler.from_disk("/path/to/patterns.jsonl")  # loads patterns only
+> ruler.from_disk("/path/to/entity_ruler")  # loads patterns and config
 > ```
 | Name | Type | Description |
-| ----------- | ---------------- | --------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a JSONL file. Paths may be either strings or `Path`-like objects. |
+| ----------- | ---------------- | ---------------------------------------------------------------------------------------- |
+| `path` | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
 | **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
 ## EntityRuler.to_bytes {#to_bytes tag="method"}