mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 17:24:41 +03:00)
Update docs and change integer IDs to hash values
parent 738b4f7187
commit c7b57ea314
@ -355,7 +355,7 @@ p
+row
+cell #[code ent_id]
+cell int
+cell The integer ID of the named entity the token is an instance of.
+cell The hash value of the named entity the token is an instance of.

+row
+cell #[code ent_id_]
@ -397,13 +397,15 @@ p The L2 norm of the token's vector representation.
+row
+cell #[code shape_]
+cell unicode
+cell
| Transform of the tokens's string, to show orthographic features.
| For example, "Xxxx" or "dd".

+row
+cell #[code prefix]
+cell int
+cell Integer ID of a length-N substring from the start of the
+cell
| Hash value of a length-N substring from the start of the
| token. Defaults to #[code N=1].

+row
@ -417,7 +419,8 @@ p The L2 norm of the token's vector representation.
+cell #[code suffix]
+cell int
+cell
| Length-N substring from the end of the token. Defaults to #[code N=3].
| Hash value of a length-N substring from the end of the token.
| Defaults to #[code N=3].

+row
+cell #[code suffix_]
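To make the `shape_`, `prefix_` and `suffix_` descriptions above concrete, here is a minimal plain-Python sketch of the string forms. The helper names `word_shape`, `prefix` and `suffix` are hypothetical and are not spaCy functions. spaCy's own shape function may apply additional rules, so treat this only as an illustration of the idea. The non-underscore attributes (`prefix`, `suffix`) would then be the hash values of these strings.

```python
def word_shape(text):
    """Simplified shape: uppercase -> X, lowercase -> x, digit -> d.

    Hypothetical helper for illustration; other characters are kept as-is.
    """
    shape = []
    for char in text:
        if char.isalpha():
            shape.append("X" if char.isupper() else "x")
        elif char.isdigit():
            shape.append("d")
        else:
            shape.append(char)
    return "".join(shape)


def prefix(text, n=1):
    """Length-n substring from the start of the token (docs default: N=1)."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the token (docs default: N=3)."""
    return text[-n:]


for word in ["Xbox", "42", "Netflix"]:
    print(word, word_shape(word), prefix(word), suffix(word))
```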
@ -36,7 +36,7 @@ p Create the vocabulary.
+cell #[code strings]
+cell #[code StringStore]
+cell
| A #[code StringStore] that maps strings to integers, and vice
| A #[code StringStore] that maps strings to hash values, and vice
| versa.

+footrow
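The two-way mapping described in this hunk (strings to hash values, and back) can be sketched with a toy class. This is not spaCy's `StringStore` implementation: `ToyStringStore` and `hash_string` are hypothetical names, and `hashlib.blake2b` is only a stand-in for whatever 64-bit hash function the library actually uses.

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string (blake2b is an assumption)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Toy lookup table: string -> hash value, and hash value -> string."""

    def __init__(self, strings=()):
        self._by_hash = {}
        for text in strings:
            self.add(text)

    def add(self, text):
        """Store a string so its hash can later be resolved back to text."""
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, id_or_string):
        # A string is hashed; an integer is looked up in the table.
        if isinstance(id_or_string, str):
            return hash_string(id_or_string)
        return self._by_hash[id_or_string]  # KeyError if never added


store = ToyStringStore(["ORG", "whale"])
org = store["ORG"]        # string -> hash value
print(org, store[org])    # hash value -> string
```

Accepting either an integer or a string in `__getitem__` mirrors the `id_or_string` parameter (`int / unicode`) documented in the diff.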
@ -74,7 +74,7 @@ p
+row
+cell #[code id_or_string]
+cell int / unicode
+cell The integer ID of a word, or its unicode string.
+cell The hash value of a word, or its unicode string.

+footrow
+cell returns
@ -12,7 +12,7 @@ p
p
| Linguistic annotations are available as
| #[+api("token#attributes") #[code Token] attributes]. Like many NLP
| libraries, spaCy #[strong encodes all strings to integers] to reduce
| libraries, spaCy #[strong encodes all strings to hash values] to reduce
| memory usage and improve efficiency. So to get the readable string
| representation of an attribute, we need to add an underscore #[code _]
| to its name:
@ -43,7 +43,7 @@ p
+aside("Why saving the vocab?")
| Saving the vocabulary with the #[code Doc] is important, because the
| #[code Vocab] holds the context-independent information about the words,
| tags and labels, and their #[strong integer IDs]. If the #[code Vocab]
| tags and labels, and their #[strong hash values]. If the #[code Vocab]
| wasn't saved with the #[code Doc], spaCy wouldn't know how to resolve
| those IDs – for example, the word text or the dependency labels. You
| might be saving #[code 446] for "whale", but in a different vocabulary,
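The point of the aside above can be illustrated without spaCy. In the sketch below, a toy order-based ID scheme gives the same word different IDs in two vocabularies. A hash of the string is the same everywhere, but it can only be turned back into text by a table that actually contains that string. All names here are hypothetical, and `hashlib.blake2b` is an assumed stand-in for the library's real hash function.

```python
import hashlib


def hash_string(text):
    """Return a 64-bit integer hash of a string (blake2b is an assumption)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def sequential_ids(words):
    """Toy order-based scheme: the ID is the position at which a word was added."""
    return {word: index for index, word in enumerate(words)}


vocab_a = ["whale", "ship", "sea"]
vocab_b = ["sea", "whale"]

# Order-based IDs disagree between the two vocabularies.
ids_a = sequential_ids(vocab_a)
ids_b = sequential_ids(vocab_b)
print(ids_a["whale"], ids_b["whale"])

# Hash values agree, because they depend only on the string itself.
print(hash_string("whale") == hash_string("whale"))

# Resolving a hash back to text still needs a table that holds the string.
table_b = {hash_string(word): word for word in vocab_b}
print(table_b.get(hash_string("whale")))  # found in vocab_b
print(table_b.get(hash_string("ship")))   # missing from vocab_b
```

This is why the string table still has to travel with the data: the hash alone identifies a string consistently but does not contain it.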
@ -48,7 +48,7 @@ p
| #[strong connected by a single arc] in the dependency tree. The term
| #[strong dep] is used for the arc label, which describes the type of
| syntactic relation that connects the child to the head. As with other
| attributes, the value of #[code .dep] is an integer. You can get
| attributes, the value of #[code .dep] is a hash value. You can get
| the string value with #[code .dep_].

+code("Example").
@ -20,7 +20,7 @@ p
| The standard way to access entity annotations is the
| #[+api("doc#ents") #[code doc.ents]] property, which produces a sequence
| of #[+api("span") #[code Span]] objects. The entity type is accessible
| either as an integer ID or as a string, using the attributes
| either as a hash value or as a string, using the attributes
| #[code ent.label] and #[code ent.label_]. The #[code Span] object acts
| as a sequence of tokens, so you can iterate over the entity or index into
| it. You can also get the text form of the whole entity, as though it were
@ -78,7 +78,7 @@ p
doc = nlp(u'Netflix is hiring a new VP of global policy')
# the model didn't recognise any entities :(

ORG = doc.vocab.strings[u'ORG'] # get integer ID of entity label
ORG = doc.vocab.strings[u'ORG'] # get hash value of entity label
netflix_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = [netflix_ent]
