spaCy/website/usage/_vectors-similarity/_custom.jade

//- 💫 DOCS > USAGE > VECTORS & SIMILARITY > CUSTOM VECTORS

p
    |  By default, #[+api("token#vector") #[code Token.vector]] returns the
    |  vector for its underlying #[+api("lexeme") #[code Lexeme]], while
    |  #[+api("doc#vector") #[code Doc.vector]] and
    |  #[+api("span#vector") #[code Span.vector]] return an average of the
    |  vectors of their tokens. You can customize these
    |  behaviours by modifying the #[code doc.user_hooks],
    |  #[code doc.user_span_hooks] and #[code doc.user_token_hooks]
    |  dictionaries.

+infobox
    |  For more details on #[strong adding hooks] and #[strong overwriting] the
    |  built-in #[code Doc], #[code Span] and #[code Token] methods, see the
    |  usage guide on #[+a("/usage/processing-pipelines#user-hooks") user hooks].

+h(3, "custom-vectors-add") Adding vectors
    +tag-new(2)

p
    |  The new #[+api("vectors") #[code Vectors]] class makes it easy to add
    |  your own vectors to spaCy. Just like the #[+api("vocab") #[code Vocab]],
    |  it is initialised with a #[+api("stringstore") #[code StringStore]] or
    |  a list of strings.

+code("Adding vectors one-by-one").
    from spacy.strings import StringStore
    from spacy.vectors import Vectors

    vector_data = {'dog': numpy.random.uniform(-1, 1, (300,)),
                   'cat': numpy.random.uniform(-1, 1, (300,)),
                   'orange': numpy.random.uniform(-1, 1, (300,))}

    vectors = Vectors(StringStore(), 300)
    for word, vector in vector_data.items():
        vectors.add(word, vector)

p
    |  You can also add the vector values directly on initialisation:

+code("Adding vectors on initialisation").
    from spacy.vectors import Vectors

    vector_table = numpy.zeros((3, 300), dtype='f')
    vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)

+h(3, "custom-loading-glove") Loading GloVe vectors
    +tag-new(2)

p
    |  spaCy comes with built-in support for loading
    |  #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
    |  a directory. The #[+api("vectors#from_glove") #[code Vectors.from_glove]]
    |  method assumes a binary format, the vocab provided in a
    |  #[code vocab.txt], and the naming scheme of
    |  #[code vectors.{size}.[fd].bin]. For example:

+aside-code("Directory structure", "yaml").
    └── vectors
        ├── vectors.128.f.bin  # vectors file
        └── vocab.txt          # vocabulary

+table(["File name", "Dimensions", "Data type"])
    +row
        +cell #[code vectors.128.f.bin]
        +cell 128
        +cell float32

    +row
        +cell #[code vectors.300.d.bin]
        +cell 300
        +cell float64 (double)

+code.
    from spacy.vectors import Vectors

    vectors = Vectors([], 128)
    vectors.from_glove('/path/to/vectors')

+h(3, "custom-loading-other") Loading other vectors
    +tag-new(2)

p
    |  You can also choose to load in vectors from other sources, like the
    |  #[+a("https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md") fastText vectors]
    |  for 294 languages, trained on Wikipedia. After reading in the file,
    |  the vectors are added to the #[code Vocab] using the
    |  #[+api("vocab#set_vector") #[code set_vector]] method.

+github("spacy", "examples/vectors_fast_text.py")
Update usage documentation 2017-10-03 15:26:20 +03:00			`//- 💫 DOCS > USAGE > VECTORS & SIMILARITY > CUSTOM VECTORS`

			`p`
			`\| By default, #[+api("token#vector") #[code Token.vector]] returns the`
			`\| vector for its underlying #[+api("lexeme") #[code Lexeme]], while`
			`\| #[+api("doc#vector") #[code Doc.vector]] and`
			`\| #[+api("span#vector") #[code Span.vector]] return an average of the`
			`\| vectors of their tokens. You can customize these`
			`\| behaviours by modifying the #[code doc.user_hooks],`
			`\| #[code doc.user_span_hooks] and #[code doc.user_token_hooks]`
			`\| dictionaries.`

			`+infobox`
			`\| For more details on #[strong adding hooks] and #[strong overwriting] the`
			`\| built-in #[code Doc], #[code Span] and #[code Token] methods, see the`
			`\| usage guide on #[+a("/usage/processing-pipelines#user-hooks") user hooks].`

			`+h(3, "custom-vectors-add") Adding vectors`
			`+tag-new(2)`

			`p`
			`\| The new #[+api("vectors") #[code Vectors]] class makes it easy to add`
			`\| your own vectors to spaCy. Just like the #[+api("vocab") #[code Vocab]],`
			`\| it is initialised with a #[+api("stringstore") #[code StringStore]] or`
			`\| a list of strings.`

			`+code("Adding vectors one-by-one").`
			`from spacy.strings import StringStore`
			`from spacy.vectors import Vectors`

			`vector_data = {'dog': numpy.random.uniform(-1, 1, (300,)),`
			`'cat': numpy.random.uniform(-1, 1, (300,)),`
			`'orange': numpy.random.uniform(-1, 1, (300,))}`

			`vectors = Vectors(StringStore(), 300)`
			`for word, vector in vector_data.items():`
			`vectors.add(word, vector)`

			`p`
			`\| You can also add the vector values directly on initialisation:`

			`+code("Adding vectors on initialisation").`
			`from spacy.vectors import Vectors`

			`vector_table = numpy.zeros((3, 300), dtype='f')`
			`vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)`

			`+h(3, "custom-loading-glove") Loading GloVe vectors`
			`+tag-new(2)`

			`p`
			`\| spaCy comes with built-in support for loading`
			`\| #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from`
			`\| a directory. The #[+api("vectors#from_glove") #[code Vectors.from_glove]]`
			`\| method assumes a binary format, the vocab provided in a`
			`\| #[code vocab.txt], and the naming scheme of`
			`\| #[code vectors.{size}.[fd].bin]. For example:`

			`+aside-code("Directory structure", "yaml").`
			`└── vectors`
			`├── vectors.128.f.bin # vectors file`
			`└── vocab.txt # vocabulary`

			`+table(["File name", "Dimensions", "Data type"])`
			`+row`
			`+cell #[code vectors.128.f.bin]`
			`+cell 128`
			`+cell float32`

			`+row`
			`+cell #[code vectors.300.d.bin]`
			`+cell 300`
			`+cell float64 (double)`

			`+code.`
			`from spacy.vectors import Vectors`

			`vectors = Vectors([], 128)`
			`vectors.from_glove('/path/to/vectors')`

			`+h(3, "custom-loading-other") Loading other vectors`
			`+tag-new(2)`

			`p`
			`\| You can also choose to load in vectors from other sources, like the`
			`\| #[+a("https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md") fastText vectors]`
			`\| for 294 languages, trained on Wikipedia. After reading in the file,`
			`\| the vectors are added to the #[code Vocab] using the`
			`\| #[+api("vocab#set_vector") #[code set_vector]] method.`

			`+github("spacy", "examples/vectors_fast_text.py")`