Update docs and change integer IDs to hash values

ines 2017-05-28 19:25:34 +02:00
parent 738b4f7187
commit c7b57ea314
7 changed files with 13 additions and 10 deletions


@ -355,7 +355,7 @@ p
+row
+cell #[code ent_id]
+cell int
+cell The integer ID of the named entity the token is an instance of.
+cell The hash value of the named entity the token is an instance of.
+row
+cell #[code ent_id_]


@ -397,13 +397,15 @@ p The L2 norm of the token's vector representation.
+row
+cell #[code shape_]
+cell unicode
+cell
| Transform of the token's string, to show orthographic features.
| For example, "Xxxx" or "dd".
+row
+cell #[code prefix]
+cell int
+cell Integer ID of a length-N substring from the start of the
+cell
| Hash value of a length-N substring from the start of the
| token. Defaults to #[code N=1].
+row
@ -417,7 +419,8 @@ p The L2 norm of the token's vector representation.
+cell #[code suffix]
+cell int
+cell
| Length-N substring from the end of the token. Defaults to #[code N=3].
| Hash value of a length-N substring from the end of the token.
| Defaults to #[code N=3].
+row
+cell #[code suffix_]


@ -36,7 +36,7 @@ p Create the vocabulary.
+cell #[code strings]
+cell #[code StringStore]
+cell
| A #[code StringStore] that maps strings to integers, and vice
| A #[code StringStore] that maps strings to hash values, and vice
| versa.
+footrow
@ -74,7 +74,7 @@ p
+row
+cell #[code id_or_string]
+cell int / unicode
+cell The integer ID of a word, or its unicode string.
+cell The hash value of a word, or its unicode string.
+footrow
+cell returns


@ -12,7 +12,7 @@ p
p
| Linguistic annotations are available as
| #[+api("token#attributes") #[code Token] attributes]. Like many NLP
| libraries, spaCy #[strong encodes all strings to integers] to reduce
| libraries, spaCy #[strong encodes all strings to hash values] to reduce
| memory usage and improve efficiency. So to get the readable string
| representation of an attribute, we need to add an underscore #[code _]
| to its name:


@ -43,7 +43,7 @@ p
+aside("Why save the vocab?")
| Saving the vocabulary with the #[code Doc] is important, because the
| #[code Vocab] holds the context-independent information about the words,
| tags and labels, and their #[strong integer IDs]. If the #[code Vocab]
| tags and labels, and their #[strong hash values]. If the #[code Vocab]
| wasn't saved with the #[code Doc], spaCy wouldn't know how to resolve
| those IDs, for example the word text or the dependency labels. You
| might be saving #[code 446] for "whale", but in a different vocabulary,


@ -48,7 +48,7 @@ p
| #[strong connected by a single arc] in the dependency tree. The term
| #[strong dep] is used for the arc label, which describes the type of
| syntactic relation that connects the child to the head. As with other
| attributes, the value of #[code .dep] is an integer. You can get
| attributes, the value of #[code .dep] is a hash value. You can get
| the string value with #[code .dep_].
+code("Example").


@ -20,7 +20,7 @@ p
| The standard way to access entity annotations is the
| #[+api("doc#ents") #[code doc.ents]] property, which produces a sequence
| of #[+api("span") #[code Span]] objects. The entity type is accessible
| either as an integer ID or as a string, using the attributes
| either as a hash value or as a string, using the attributes
| #[code ent.label] and #[code ent.label_]. The #[code Span] object acts
| as a sequence of tokens, so you can iterate over the entity or index into
| it. You can also get the text form of the whole entity, as though it were
@ -78,7 +78,7 @@ p
doc = nlp(u'Netflix is hiring a new VP of global policy')
# the model didn't recognise any entities :(
ORG = doc.vocab.strings[u'ORG'] # get integer ID of entity label
ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label
netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = [netflix_ent]
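
The string-to-hash lookup that these doc changes describe can be sketched roughly as follows. This is a toy illustration, not spaCy's implementation: `ToyStringStore` and `hash_string` are hypothetical names, and SHA-256 truncated to 64 bits stands in for whatever hash function the library actually uses. It only shows the idea that a string always maps to the same hash value, while resolving a hash value back to a string requires a store that has seen that string.

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string.

    Assumption: SHA-256 is a stand-in here, not the hash spaCy uses.
    """
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyStringStore:
    """Hypothetical two-way lookup between strings and their hash values."""

    def __init__(self, strings=()):
        self._by_hash = {}
        for text in strings:
            self.add(text)

    def add(self, text):
        # Remember the string so its hash can be resolved back later.
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, id_or_string):
        # String in: compute the hash value (works for unseen strings too).
        if isinstance(id_or_string, str):
            return hash_string(id_or_string)
        # Hash value in: look up the string; raises KeyError if never added.
        return self._by_hash[id_or_string]


store = ToyStringStore(["ORG", "whale"])
org_hash = store["ORG"]      # string -> hash value
org_text = store[org_hash]   # hash value -> string
```

In this sketch a second, empty store would still compute the same hash value for "whale", but could not turn that value back into text. That mirrors why the aside above recommends saving the vocabulary together with the `Doc`.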