Lexical Lookup
--------------
Where possible, spaCy computes information over lexical *types* rather than
*tokens*. If you process a large batch of text, the number of unique types
you see grows much more slowly than the number of tokens, so it is much more
efficient to compute over types. And in small samples, we generally want to
know about the distribution of a word in the language at large, which is
again type-based information.

You can access the lexical features via the Token object, but you can also
look them up in the vocabulary directly:

>>> from spacy.en import English
>>> nlp = English()
>>> lexeme = nlp.vocab[u'Amazon']

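
To make the type/token distinction concrete, here is a minimal sketch in plain
Python. It does not use spaCy: ``compute_features`` and ``featurize`` are
hypothetical helpers, and whitespace splitting stands in for real
tokenization. It only illustrates why caching per type saves work.

```python
from collections import Counter


def count_types_and_tokens(text):
    """Return (number of tokens, number of unique types) for whitespace-split text."""
    tokens = text.split()
    types = Counter(tokens)
    return len(tokens), len(types)


def compute_features(word):
    """Hypothetical per-type features; stands in for an expensive computation."""
    return {"lower": word.lower(), "is_title": word.istitle(), "length": len(word)}


def featurize(text, cache):
    """Compute features once per type, then reuse them for every token."""
    rows = []
    for token in text.split():
        if token not in cache:
            cache[token] = compute_features(token)
        rows.append(cache[token])
    return rows


sample = "the cat sat on the mat and the dog sat on the rug"
n_tokens, n_types = count_types_and_tokens(sample)
cache = {}
rows = featurize(sample, cache)
print(n_tokens, n_types, len(cache))
```

Every occurrence of ``the`` shares one cached entry, so the feature function
runs once per type rather than once per token.
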
.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)

  .. py:method:: __len__(self)

    :returns: number of words in the vocabulary
    :rtype: int

  .. py:method:: __getitem__(self, key_int)

    :param int key_int:
      Integer ID

    :returns: A Lexeme object

  .. py:method:: __getitem__(self, key_str)

    :param unicode key_str:
      A string in the vocabulary

    :rtype: Lexeme

  .. py:method:: __setitem__(self, orth_str, props)

    :param unicode orth_str:
      The orth key

    :param dict props:
      A props dictionary

    :returns: None

  .. py:method:: dump(self, loc)

    :param unicode loc:
      Path where the vocabulary should be saved

  .. py:method:: load_lexemes(self, loc)

    :param unicode loc:
      Path to load the lexemes.bin file from

  .. py:method:: load_vectors(self, loc)

    :param unicode loc:
      Path to load the vectors.bin from
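
The container interface above (length, lookup by integer ID or by string,
assignment of a props dictionary) can be pictured with a small stand-in.
``ToyVocab`` below is hypothetical and is not how spaCy implements ``Vocab``.
It stores plain dictionaries instead of Lexeme objects, and it assumes that
looking up an unknown key raises ``KeyError``.

```python
class ToyVocab(object):
    """Hypothetical stand-in for a vocabulary; not spaCy's implementation."""

    def __init__(self):
        self._by_id = {}  # integer ID -> entry dict
        self._ids = {}    # orth string -> integer ID

    def __len__(self):
        # Number of distinct entries stored so far.
        return len(self._by_id)

    def __setitem__(self, orth_str, props):
        # Reuse the existing ID for a known string, else assign the next one.
        key = self._ids.setdefault(orth_str, len(self._ids) + 1)
        entry = dict(props)
        entry["orth"] = orth_str
        entry["id"] = key
        self._by_id[key] = entry

    def __getitem__(self, key):
        # Integer keys index by ID; any other key is treated as a string.
        if isinstance(key, int):
            return self._by_id[key]
        return self._by_id[self._ids[key]]


vocab = ToyVocab()
vocab[u"Amazon"] = {"is_title": True}
entry = vocab[u"Amazon"]
same_entry = vocab[entry["id"]]
print(len(vocab), same_entry is entry)
```

Both lookup paths return the same stored entry, which mirrors the two
``__getitem__`` signatures documented above.
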
.. py:class:: strings.StringStore(self)

  .. py:method:: __len__(self)

    :returns:
      Number of strings in the string-store

  .. py:method:: __getitem__(self, key_int)

    :param int key_int: An integer key

    :returns:
      The string that the integer key maps to

    :rtype: unicode

  .. py:method:: __getitem__(self, key_unicode)

    :param unicode key_unicode:
      A key, as a unicode string

    :returns:
      The integer ID of the string.

    :rtype: int

  .. py:method:: __getitem__(self, key_utf8_bytes)

    :param bytes key_utf8_bytes:
      A key, as a UTF-8 encoded byte-string

    :returns:
      The integer ID of the string.

    :rtype: int

  .. py:method:: dump(self, loc)

    :param loc:
      File path to save the strings.txt to.

  .. py:method:: load(self, loc)

    :param loc:
      File path to load the strings.txt from.
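
The two-way mapping described above can be sketched with a small stand-in
class. ``ToyStringStore`` is hypothetical and is not spaCy's implementation.
It makes three assumptions that the reference above does not state: IDs start
at 1, looking up an unseen string adds it to the store, and the saved file
holds one UTF-8 string per line (so strings containing newlines would not
round-trip).

```python
import io
import os
import tempfile


class ToyStringStore(object):
    """Hypothetical two-way string/ID mapping; not spaCy's implementation."""

    def __init__(self):
        self._strings = []  # position i holds the string with ID i + 1
        self._ids = {}      # string -> integer ID

    def __len__(self):
        return len(self._strings)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer key: return the string stored under that ID.
            if key < 1:
                raise IndexError(key)
            return self._strings[key - 1]
        if isinstance(key, bytes):
            # Byte-string key: decode as UTF-8, then treat as text.
            key = key.decode("utf8")
        if key not in self._ids:
            # Assumption: unseen strings are added on first lookup.
            self._strings.append(key)
            self._ids[key] = len(self._strings)
        return self._ids[key]

    def dump(self, loc):
        # Assumed file layout: one string per line, in ID order.
        with io.open(loc, "w", encoding="utf8") as file_:
            for string in self._strings:
                file_.write(string + u"\n")

    def load(self, loc):
        # Re-adding the strings in file order reproduces the same IDs.
        with io.open(loc, "r", encoding="utf8") as file_:
            for line in file_:
                self[line.rstrip(u"\n")]


store = ToyStringStore()
amazon_id = store[u"Amazon"]
bytes_id = store[u"Amazon".encode("utf8")]

path = os.path.join(tempfile.mkdtemp(), "strings.txt")
store.dump(path)
restored = ToyStringStore()
restored.load(path)
print(restored[amazon_id])
```

Unicode and UTF-8 byte-string keys resolve to the same integer ID, and that ID
maps back to the original string after a dump and load round trip.
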