spaCy/website/docs/_api-stringstore.jade

//- ----------------------------------
//- 💫 DOCS > API > STRINGSTORE
//- ----------------------------------
+section("stringstore")
+h(2, "stringstore", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/strings.pyx")
| #[+tag class] StringStore
p Intern strings, and map them to sequential integer IDs.
p.
Only the integer IDs are held by spaCy's data
classes (#[code Doc], #[code Token], #[code Span] and #[code Lexeme])
– when you use a string-valued attribute like #[code token.orth_],
you access a property that computes #[code token.vocab.strings[token.orth]].
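p The lookup described above can be sketched with a toy class. #[code ToyToken] and its #[code table] attribute are hypothetical names used only for illustration; this is not spaCy's own #[code Token] implementation.

```python
class ToyToken:
    """Hypothetical stand-in for a token that stores only an integer ID."""

    def __init__(self, table, orth):
        self.table = table  # shared ID -> string table (a plain list here)
        self.orth = orth    # integer ID, the only value the token holds

    @property
    def orth_(self):
        # The string form is computed on access, not stored on the token.
        return self.table[self.orth]


table = ["hello", "world"]
token = ToyToken(table, 1)
```

p Here #[code token.orth] is the integer #[code 1], while #[code token.orth_] resolves it through the shared table on each access.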
+aside("Efficiency").
The mapping table is very efficient, and a small-string optimization
is used to maintain a small memory footprint.
+table(["Usage", "Description"])
+row
+cell #[code string = string_store[int_id]]
+cell.
Retrieve a string from a given integer ID. If the integer ID
is not found, raise #[code IndexError].
+row
+cell #[code int_id = string_store[unicode_string]]
+cell.
Map a unicode string to an integer ID. If the string is
previously unseen, it is interned, and a new ID is returned.
+row
+cell #[code int_id = string_store[utf8_byte_string]]
+cell.
Byte strings are assumed to be UTF-8 encoded. Strings
encoded with other codecs may fail silently. Given a UTF-8 byte
string, the behaviour is the same as for unicode strings.
Internally, strings are stored as UTF-8, so if you start
with a UTF-8 byte string, it's less efficient to decode
it to unicode first, as StringStore will then have to encode it as
UTF-8 once again.
+row
+cell #[code n_strings = len(string_store)]
+cell.
Number of strings in the string store.
+row
+cell #[code for string in string_store]
+cell
p.
Iterate over strings in the string store, in order, such
that the ith string in the sequence has the ID #[code i]:
+code.code-block-small.no-block.
string_store = doc.vocab.strings
for i, string in enumerate(string_store):
    assert i == string_store[string]
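p The usage table above can be illustrated with a small pure-Python sketch. #[code ToyStringStore] is a hypothetical class written for this example, assuming IDs start at 0 and are handed out in insertion order; it is not spaCy's implementation and omits the small-string optimization.

```python
class ToyStringStore:
    """Hypothetical sketch of an interning table; not spaCy's implementation."""

    def __init__(self):
        self._strings = []  # ID -> string
        self._ids = {}      # string -> ID

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer key: look up the string, raising IndexError if unknown.
            if not 0 <= key < len(self._strings):
                raise IndexError(key)
            return self._strings[key]
        if isinstance(key, bytes):
            # Byte strings are assumed to be UTF-8. Unlike the behaviour
            # described above, this toy raises on other encodings.
            key = key.decode("utf8")
        if key not in self._ids:
            # Previously unseen string: intern it under the next free ID.
            self._ids[key] = len(self._strings)
            self._strings.append(key)
        return self._ids[key]

    def __len__(self):
        return len(self._strings)

    def __iter__(self):
        # Yield strings in ID order, so the ith string has ID i.
        return iter(self._strings)


store = ToyStringStore()
apple_id = store["apple"]
pear_id = store["pear"]
```

p Looking up #[code store["apple"]] a second time returns the same ID rather than interning a new entry, and #[code store[b"pear"]] resolves to the same ID as the unicode form.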
+section("stringstore-init")
+h(3, "stringstore-init")
| #[+tag method] StringStore.__init__
+code("python", "Definition").
def __init__(self):
    return None
+section("stringstore-dump")
+h(3, "stringstore-dump")
| #[+tag method] StringStore.dump
p Save the string-to-int mapping to the given file.
+code("python", "Definition").
def dump(self, file):
    return None
+table(["Name", "Type", "Description"])
+row
+cell file
+cell file
+cell.
The file to write the data to.
+section("stringstore-load")
+h(3, "stringstore-load")
| #[+tag method] StringStore.load
p Load the strings from the given file.
+code("python", "Definition").
def load(self, file):
    return None
+table(["Name", "Type", "Description"])
+row
+cell file
+cell file
+cell.
File-like object to load the data from. The format is subject
to change, so if you need to read or write compatible files, see
the #[code strings.pyx] source for details.
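p A dump and load round trip through a file-like object might look like the sketch below. The helper names #[code dump_strings] and #[code load_strings] are hypothetical, and the JSON list layout is an assumption made for illustration only; the actual on-disk format is defined in #[code strings.pyx] and may differ.

```python
import io
import json


def dump_strings(strings, file):
    """Write the strings, in ID order, to a file-like object as a JSON list.

    The JSON layout is an assumption for this sketch, not a documented format.
    """
    json.dump(list(strings), file)


def load_strings(file):
    """Read the list back; a string's position in the list is its ID."""
    return json.load(file)


# An in-memory text buffer stands in for a real file.
buffer = io.StringIO()
dump_strings(["apple", "pear"], buffer)
buffer.seek(0)
restored = load_strings(buffer)
```

p Because the strings are written in ID order, reloading them in the same order preserves the string-to-int mapping in this sketch.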