Update docs and change integer IDs to hash values

ines 2017-05-28 19:25:34 +02:00
parent 738b4f7187
commit c7b57ea314
7 changed files with 13 additions and 10 deletions


@ -355,7 +355,7 @@ p
+row +row
+cell #[code ent_id] +cell #[code ent_id]
+cell int +cell int
+cell The integer ID of the named entity the token is an instance of. +cell The hash value of the named entity the token is an instance of.
+row +row
+cell #[code ent_id_] +cell #[code ent_id_]


@ -397,13 +397,15 @@ p The L2 norm of the token's vector representation.
+row +row
+cell #[code shape_] +cell #[code shape_]
+cell unicode +cell unicode
+cell
| Transform of the token's string, to show orthographic features. | Transform of the token's string, to show orthographic features.
| For example, "Xxxx" or "dd". | For example, "Xxxx" or "dd".
+row +row
+cell #[code prefix] +cell #[code prefix]
+cell int +cell int
+cell Integer ID of a length-N substring from the start of the +cell
| Hash value of a length-N substring from the start of the
| token. Defaults to #[code N=1]. | token. Defaults to #[code N=1].
+row +row
@ -417,7 +419,8 @@ p The L2 norm of the token's vector representation.
+cell #[code suffix] +cell #[code suffix]
+cell int +cell int
+cell +cell
| Length-N substring from the end of the token. Defaults to #[code N=3]. | Hash value of a length-N substring from the end of the token.
| Defaults to #[code N=3].
+row +row
+cell #[code suffix_] +cell #[code suffix_]
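
The `prefix` and `suffix` rows above describe hash values of length-N substrings. The following is a minimal sketch of that idea, not spaCy's implementation: `hash_string`, `prefix_hash` and `suffix_hash` are hypothetical helpers, and SHA-256 truncated to 64 bits is only a stand-in for whatever hash function the library actually uses. The defaults `N=1` and `N=3` are taken from the table.

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string (illustrative stand-in hash)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


def prefix_hash(text, n=1):
    """Hash of the first n characters of the token text."""
    return hash_string(text[:n])


def suffix_hash(text, n=3):
    """Hash of the last n characters of the token text."""
    return hash_string(text[-n:])


word = "apple"
word_prefix = prefix_hash(word)  # hashes "a"
word_suffix = suffix_hash(word)  # hashes "ple"
```

In this sketch the integer attribute is derived from the substring alone, so two tokens with the same prefix or suffix always share the same value.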


@ -36,7 +36,7 @@ p Create the vocabulary.
+cell #[code strings] +cell #[code strings]
+cell #[code StringStore] +cell #[code StringStore]
+cell +cell
| A #[code StringStore] that maps strings to integers, and vice | A #[code StringStore] that maps strings to hash values, and vice
| versa. | versa.
+footrow +footrow
@ -74,7 +74,7 @@ p
+row +row
+cell #[code id_or_string] +cell #[code id_or_string]
+cell int / unicode +cell int / unicode
+cell The integer ID of a word, or its unicode string. +cell The hash value of a word, or its unicode string.
+footrow +footrow
+cell returns +cell returns
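
To illustrate the "maps strings to hash values, and vice versa" wording above, here is a small sketch of a two-way lookup. `ToyStringStore` and `hash_string` are hypothetical names for illustration only; this is not the real `StringStore` class, and the truncated SHA-256 hash is an assumption standing in for the library's own hash function.

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string (illustrative stand-in hash)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyStringStore:
    """Hypothetical stand-in for a string store: strings to hashes and back."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        """Register a string and return its hash value."""
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, id_or_string):
        """Accept either a string (returns its hash) or a hash (returns the string)."""
        if isinstance(id_or_string, str):
            return hash_string(id_or_string)
        return self._by_hash[id_or_string]


store = ToyStringStore()
coffee_hash = store.add("coffee")
coffee_text = store[coffee_hash]
```

The string-to-hash direction needs no stored state, while the hash-to-string direction only works for strings the store has already seen.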


@ -12,7 +12,7 @@ p
p p
| Linguistic annotations are available as | Linguistic annotations are available as
| #[+api("token#attributes") #[code Token] attributes]. Like many NLP | #[+api("token#attributes") #[code Token] attributes]. Like many NLP
| libraries, spaCy #[strong encodes all strings to integers] to reduce | libraries, spaCy #[strong encodes all strings to hash values] to reduce
| memory usage and improve efficiency. So to get the readable string | memory usage and improve efficiency. So to get the readable string
| representation of an attribute, we need to add an underscore #[code _] | representation of an attribute, we need to add an underscore #[code _]
| to its name: | to its name:


@ -43,7 +43,7 @@ p
+aside("Why saving the vocab?") +aside("Why saving the vocab?")
| Saving the vocabulary with the #[code Doc] is important, because the | Saving the vocabulary with the #[code Doc] is important, because the
| #[code Vocab] holds the context-independent information about the words, | #[code Vocab] holds the context-independent information about the words,
| tags and labels, and their #[strong integer IDs]. If the #[code Vocab] | tags and labels, and their #[strong hash values]. If the #[code Vocab]
| wasn't saved with the #[code Doc], spaCy wouldn't know how to resolve | wasn't saved with the #[code Doc], spaCy wouldn't know how to resolve
| those IDs, for example the word text or the dependency labels. You | those IDs, for example the word text or the dependency labels. You
| might be saving #[code 446] for "whale", but in a different vocabulary, | might be saving #[code 446] for "whale", but in a different vocabulary,
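
The aside above argues that stored values cannot be turned back into text without the vocabulary. A rough sketch of that point, using plain dictionaries as hypothetical stand-ins for a saved and a fresh vocabulary (the `hash_string` helper and its truncated SHA-256 hash are assumptions, not the library's API):

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string (illustrative stand-in hash)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


# Vocabulary that was saved together with the document.
saved_vocab = {hash_string(word): word for word in ["whale", "ocean"]}

# A fresh vocabulary that has never seen the word.
fresh_vocab = {}

whale_id = hash_string("whale")

# Only the vocabulary that holds the string can resolve the stored value.
resolved_with_vocab = saved_vocab.get(whale_id)
resolved_without_vocab = fresh_vocab.get(whale_id)
```

Here `resolved_without_vocab` is `None`: the stored value is the same in both cases, but the mapping back to "whale" lives only in the saved vocabulary.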


@ -48,7 +48,7 @@ p
| #[strong connected by a single arc] in the dependency tree. The term | #[strong connected by a single arc] in the dependency tree. The term
| #[strong dep] is used for the arc label, which describes the type of | #[strong dep] is used for the arc label, which describes the type of
| syntactic relation that connects the child to the head. As with other | syntactic relation that connects the child to the head. As with other
| attributes, the value of #[code .dep] is an integer. You can get | attributes, the value of #[code .dep] is a hash value. You can get
| the string value with #[code .dep_]. | the string value with #[code .dep_].
+code("Example"). +code("Example").


@ -20,7 +20,7 @@ p
| The standard way to access entity annotations is the | The standard way to access entity annotations is the
| #[+api("doc#ents") #[code doc.ents]] property, which produces a sequence | #[+api("doc#ents") #[code doc.ents]] property, which produces a sequence
| of #[+api("span") #[code Span]] objects. The entity type is accessible | of #[+api("span") #[code Span]] objects. The entity type is accessible
| either as an integer ID or as a string, using the attributes | either as a hash value or as a string, using the attributes
| #[code ent.label] and #[code ent.label_]. The #[code Span] object acts | #[code ent.label] and #[code ent.label_]. The #[code Span] object acts
| as a sequence of tokens, so you can iterate over the entity or index into | as a sequence of tokens, so you can iterate over the entity or index into
| it. You can also get the text form of the whole entity, as though it were | it. You can also get the text form of the whole entity, as though it were
@ -78,7 +78,7 @@ p
doc = nlp(u'Netflix is hiring a new VP of global policy') doc = nlp(u'Netflix is hiring a new VP of global policy')
# the model didn't recognise any entities :( # the model didn't recognise any entities :(
ORG = doc.vocab.strings[u'ORG'] # get integer ID of entity label ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label
netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = [netflix_ent] doc.ents = [netflix_ent]