2017-05-24 20:24:40 +03:00
|
|
|
|
//- 💫 DOCS > USAGE > SPACY 101 > SERIALIZATION
|
|
|
|
|
|
|
|
|
|
p
|
2017-06-01 12:52:58 +03:00
|
|
|
|
| If you've been modifying the pipeline, vocabulary, vectors and entities,
|
|
|
|
|
| or made updates to the model, you'll eventually want to
|
|
|
|
|
| #[strong save your progress] – for example, everything that's in your
|
|
|
|
|
| #[code nlp] object. This means you'll have to translate its contents and
|
|
|
|
|
| structure into a format that can be saved, like a file or a byte string.
|
|
|
|
|
| This process is called serialization. spaCy comes with
|
2017-05-24 20:24:40 +03:00
|
|
|
|
| #[strong built-in serialization methods] and supports the
|
|
|
|
|
| #[+a("http://www.diveintopython3.net/serializing.html#dump") Pickle protocol].
|
|
|
|
|
|
|
|
|
|
+aside("What's pickle?")
|
|
|
|
|
| Pickle is Python's built-in object persistance system. It lets you
|
|
|
|
|
| transfer arbitrary Python objects between processes. This is usually used
|
|
|
|
|
| to load an object to and from disk, but it's also used for distributed
|
|
|
|
|
| computing, e.g. with
|
|
|
|
|
| #[+a("https://spark.apache.org/docs/0.9.0/python-programming-guide.html") PySpark]
|
|
|
|
|
| or #[+a("http://dask.pydata.org/en/latest/") Dask]. When you unpickle an
|
|
|
|
|
| object, you're agreeing to execute whatever code it contains. It's like
|
|
|
|
|
| calling #[code eval()] on a string – so don't unpickle objects from
|
|
|
|
|
| untrusted sources.
|
|
|
|
|
|
|
|
|
|
p
|
2017-05-26 14:22:45 +03:00
|
|
|
|
| All container classes, i.e. #[+api("language") #[code Language]],
|
|
|
|
|
| #[+api("doc") #[code Doc]], #[+api("vocab") #[code Vocab]] and
|
|
|
|
|
| #[+api("stringstore") #[code StringStore]] have the following methods
|
|
|
|
|
| available:
|
2017-05-24 20:24:40 +03:00
|
|
|
|
|
|
|
|
|
+table(["Method", "Returns", "Example"])
|
|
|
|
|
- style = [1, 0, 1]
|
|
|
|
|
+annotation-row(["to_bytes", "bytes", "nlp.to_bytes()"], style)
|
|
|
|
|
+annotation-row(["from_bytes", "object", "nlp.from_bytes(bytes)"], style)
|
|
|
|
|
+annotation-row(["to_disk", "-", "nlp.to_disk('/path')"], style)
|
|
|
|
|
+annotation-row(["from_disk", "object", "nlp.from_disk('/path')"], style)
|
2017-05-25 12:17:21 +03:00
|
|
|
|
|
2017-05-26 13:46:29 +03:00
|
|
|
|
p
|
|
|
|
|
| For example, if you've processed a very large document, you can use
|
|
|
|
|
| #[+api("doc#to_disk") #[code Doc.to_disk]] to save it to a file on your
|
|
|
|
|
| local machine. This will save the document and its tokens, as well as
|
|
|
|
|
| the vocabulary associated with the #[code Doc].
|
|
|
|
|
|
|
|
|
|
+aside("Why saving the vocab?")
|
|
|
|
|
| Saving the vocabulary with the #[code Doc] is important, because the
|
|
|
|
|
| #[code Vocab] holds the context-independent information about the words,
|
2017-05-28 20:25:34 +03:00
|
|
|
|
| tags and labels, and their #[strong hash values]. If the #[code Vocab]
|
2017-05-26 13:46:29 +03:00
|
|
|
|
| wasn't saved with the #[code Doc], spaCy wouldn't know how to resolve
|
2017-06-01 12:52:58 +03:00
|
|
|
|
| those IDs back to strings.
|
2017-05-26 13:46:29 +03:00
|
|
|
|
|
2017-05-25 12:17:21 +03:00
|
|
|
|
+code.
|
|
|
|
|
moby_dick = open('moby_dick.txt', 'r') # open a large document
|
|
|
|
|
doc = nlp(moby_dick) # process it
|
|
|
|
|
doc.to_disk('/moby_dick.bin') # save the processed Doc
|
2017-05-26 13:46:29 +03:00
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| If you need it again later, you can load it back into an empty #[code Doc]
|
|
|
|
|
| with an empty #[code Vocab] by calling
|
|
|
|
|
| #[+api("doc#from_disk") #[code from_disk()]]:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
from spacy.tokens import Doc # to create empty Doc
|
|
|
|
|
from spacy.vocab import Vocab # to create empty Vocab
|
|
|
|
|
|
|
|
|
|
doc = Doc(Vocab()).from_disk('/moby_dick.bin') # load processed Doc
|