2015-07-08 19:00:33 +03:00

Lexical Lookup
--------------

Where possible, spaCy computes information over lexical *types*, rather than
*tokens*. If you process a large batch of text, the number of unique types
you see will grow much more slowly than the number of tokens, so it's much
more efficient to compute over types. And in small samples, we generally
want to know about the distribution of a word in the language at large,
which, again, is type-based information.

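The saving from working over types can be illustrated with a small pure-Python
sketch. It does not use spaCy: ``expensive_feature`` and ``featurize`` are
hypothetical names, and the feature dictionary is only a stand-in for whatever
per-word computation you would otherwise repeat for every token.

```python
from collections import Counter


def expensive_feature(word):
    """Stand-in for a costly per-word computation (hypothetical)."""
    return {"lower": word.lower(), "is_title": word.istitle(), "length": len(word)}


def featurize(tokens):
    """Compute features once per unique type and share them across tokens."""
    cache = {}   # maps each type to its feature record
    calls = 0    # how many times the expensive function actually ran
    features = []
    for tok in tokens:
        if tok not in cache:
            cache[tok] = expensive_feature(tok)
            calls += 1
        features.append(cache[tok])
    return features, calls


tokens = "the cat sat on the mat and the dog sat on the cat".split()
type_counts = Counter(tokens)
features, calls = featurize(tokens)

# 13 tokens, but only 7 distinct types, so only 7 expensive calls are made.
print(len(tokens), len(type_counts), calls)
```

Every occurrence of the same word shares one feature record, which is the
sense in which the information is stored per type rather than per token.
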
You can access the lexical features via the Token object, but you can also look them
up in the vocabulary directly:

>>> from spacy.en import English
>>> nlp = English()
>>> lexeme = nlp.vocab[u'Amazon']

.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)

  .. py:method:: __len__(self)

    :returns: number of words in the vocabulary
    :rtype: int

  .. py:method:: __getitem__(self, key_int)

    :param int key_int:
      Integer ID

    :returns: A Lexeme object

  .. py:method:: __getitem__(self, key_str)

    :param unicode key_str:
      A string in the vocabulary

    :rtype: Lexeme

  .. py:method:: __setitem__(self, orth_str, props)

    :param unicode orth_str:
      The orth key

    :param dict props:
      A props dictionary

    :returns: None

  .. py:method:: dump(self, loc)

    :param unicode loc:
      Path where the vocabulary should be saved

  .. py:method:: load_lexemes(self, loc)

    :param unicode loc:
      Path to load the lexemes.bin file from

  .. py:method:: load_vectors(self, loc)

    :param unicode loc:
      Path to load the vectors.bin from

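The dual-key lookup described for ``Vocab`` (an integer ID or a string, both
returning the same lexeme) can be sketched with a toy class. This is not
spaCy's implementation: ``ToyVocab`` and ``ToyLexeme`` are hypothetical names,
and adding an unseen string on lookup is an assumption made for the sketch
rather than documented behaviour.

```python
class ToyLexeme(object):
    """Hypothetical stand-in for a Lexeme: one record per word type."""

    def __init__(self, orth_id, orth_str, props):
        self.orth = orth_id      # integer ID of the word
        self.orth_ = orth_str    # the word as a string
        self.props = props       # arbitrary properties dictionary


class ToyVocab(object):
    """Toy vocabulary that accepts either an integer ID or a string key."""

    def __init__(self):
        self._by_id = {}   # integer ID -> ToyLexeme
        self._ids = {}     # string -> integer ID

    def __len__(self):
        return len(self._by_id)

    def __setitem__(self, orth_str, props):
        # Reuse the existing ID if the string is known, else assign the next one.
        orth_id = self._ids.setdefault(orth_str, len(self._ids))
        self._by_id[orth_id] = ToyLexeme(orth_id, orth_str, props)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._by_id[key]
        if key not in self._ids:
            # Assumption for this sketch: unseen strings get an empty entry.
            self[key] = {}
        return self._by_id[self._ids[key]]


vocab = ToyVocab()
vocab[u'Amazon'] = {'is_title': True}
lexeme = vocab[u'Amazon']     # lookup by string
same = vocab[lexeme.orth]     # lookup by integer ID returns the same object
```

Keeping a single record per type means both lookup paths return the same
object, so any property set on it is visible however the word was reached.
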
.. py:class:: strings.StringStore(self)

  .. py:method:: __len__(self)

    :returns:
      Number of strings in the string-store

  .. py:method:: __getitem__(self, key_int)

    :param int key_int: An integer key

    :returns:
      The string that the integer key maps to

    :rtype: unicode

  .. py:method:: __getitem__(self, key_unicode)

    :param unicode key_unicode:
      A key, as a unicode string

    :returns:
      The integer ID of the string.

    :rtype: int

  .. py:method:: __getitem__(self, key_utf8_bytes)

    :param bytes key_utf8_bytes:
      A key, as a UTF-8 encoded byte-string

    :returns:
      The integer ID of the string.

    :rtype: int

  .. py:method:: dump(self, loc)

    :param loc:
      File path to save the strings.txt to.

  .. py:method:: load(self, loc)

    :param loc:
      File path to load the strings.txt from.

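The two-way mapping that ``StringStore`` exposes (integer to string, and
unicode or UTF-8 bytes to integer) can be sketched as follows. This is a toy
model, not spaCy's implementation: ``ToyStringStore`` is a hypothetical name,
the one-string-per-line file layout is an assumption, and the IDs it assigns
need not match those of the real class.

```python
import io
import os
import tempfile


class ToyStringStore(object):
    """Toy two-way mapping between strings and integer IDs."""

    def __init__(self):
        self._strings = []   # integer ID -> string
        self._ids = {}       # string -> integer ID

    def __len__(self):
        return len(self._strings)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._strings[key]
        if isinstance(key, bytes):
            # Byte-string keys are treated as UTF-8 encoded text.
            key = key.decode('utf8')
        if key not in self._ids:
            self._ids[key] = len(self._strings)
            self._strings.append(key)
        return self._ids[key]

    def dump(self, loc):
        # Assumed layout: one string per line, in ID order.
        with io.open(loc, 'w', encoding='utf8') as f:
            f.write(u'\n'.join(self._strings))

    def load(self, loc):
        with io.open(loc, 'r', encoding='utf8') as f:
            for line in f.read().split(u'\n'):
                if line:
                    self[line]   # re-intern each string in file order


store = ToyStringStore()
amazon_id = store[u'Amazon']

loc = os.path.join(tempfile.mkdtemp(), 'strings.txt')
store.dump(loc)

restored = ToyStringStore()
restored.load(loc)
```

Because the strings are written in ID order and re-interned in the same order,
a store restored from the file assigns each string the ID it had before.
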