Henning Peters 2015-09-25 09:42:01 +02:00
commit 6717fdc5f3
3 changed files with 89 additions and 52 deletions


@ -1,14 +1,14 @@
{ {
"PRP": { "PRP": {
"I": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"}, "I": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"me": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"}, "me": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"you": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Nom,Acc"}, "you": {"L": "-PRON-", "PronType": "Prs", "Person": "Two"},
"he": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"}, "he": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"him": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"}, "him": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"she": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"}, "she": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"her": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"}, "her": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"it": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Case": "Nom,Acc"}, "it": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"we": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"}, "we": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"us": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"}, "us": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"they": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"}, "they": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"them": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"}, "them": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
@ -35,25 +35,12 @@
}, },
"PRP$": { "PRP$": {
"my": {"L": "-PRON-", "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"}, "my": {"L": "-PRON-", "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
"your": {"L": "-PRON-", "Person": "Two", "Number": "Sing,Plur", "PronType": "Prs", "Poss": "Yes"}, "your": {"L": "-PRON-", "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
"his": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"}, "his": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
"her": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"}, "her": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
"its": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"}, "its": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
"our": {"L": "-PRON-", "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}, "our": {"L": "-PRON-", "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
"their": {"L": "-PRON-", "Person": "Three", "Number": "Plur", "Gender": "Masc,Fem,Neut", "PronType": "Prs", "Poss": "Yes"} "their": {"L": "-PRON-", "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
},
"JJR": {
"better": {"L": "good", "misc": 1}
},
"JJS": {
"best": {"L": "good", "misc": 2}
},
"RBR": {
"better": {"L": "good", "misc": 1}
},
"RBS": {
"best": {"L": "good", "misc": 2}
} }
} }
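Several of the entries this hunk changes had comma-joined feature values on the removed side (`"Case": "Nom,Acc"`, `"Number": "Sing,Plur"`, `"Gender": "Masc,Fem,Neut"`); the new side drops those features instead. The sketch below shows one way such values could be spotted in a table of this shape. It is illustrative only: `find_multi_values` is a hypothetical helper, not part of spaCy, and the sample data is a small excerpt copied from the removed side of the diff.

```python
import json


def find_multi_values(specials):
    """Yield (tag, word, feature, value) for string values containing a comma.

    Hypothetical helper for illustration; not a spaCy function.
    """
    for tag, words in specials.items():
        for word, features in words.items():
            for feature, value in features.items():
                if isinstance(value, str) and "," in value:
                    yield tag, word, feature, value


# Excerpt of the pre-commit entries, as shown on the removed side of the hunk.
sample = json.loads("""
{
  "PRP": {
    "you": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Nom,Acc"},
    "we": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"}
  },
  "PRP$": {
    "your": {"L": "-PRON-", "Person": "Two", "Number": "Sing,Plur", "PronType": "Prs", "Poss": "Yes"}
  }
}
""")

problems = sorted(find_multi_values(sample))
for tag, word, feature, value in problems:
    print(tag, word, feature, value)
```

Integer values such as `"misc": 1` are skipped by the `isinstance` check, so the helper can be run over the whole file.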

website/README.md Normal file

@ -0,0 +1,30 @@
Source for spacy.io
==============================
This directory contains the source for the official spaCy website at http://spacy.io/.
Fixes, updates and suggestions are welcome.
Releases
--------
Changes made to this directory go live on spacy.io. <When / how often?>
The Stack
--------
The site is built with the [Jade](http://jade-lang.com/) template language.
See [the Makefile](Makefile) for more details.
Developing
--------
To make and test changes:
```
npm install jade --global
cd website
make
python -m SimpleHTTPServer 8000
```
Then visit [localhost:8000/src/...](http://localhost:8000/src/)
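
The `SimpleHTTPServer` module used above exists only in Python 2. On Python 3 the same handler lives in the standard library's `http.server` module. A minimal sketch of an equivalent local server follows, assuming Python 3.7 or later (for the `directory` argument); `make_server` and `serve` are illustrative names, not part of the site's Makefile.

```python
import functools
import http.server


def make_server(directory=".", port=8000):
    """Build (but do not start) a static file server for `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.HTTPServer(("127.0.0.1", port), handler)


def serve(directory=".", port=8000):
    """Serve `directory` until interrupted with Ctrl-C."""
    server = make_server(directory, port)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Calling `serve()` from the `website` directory after running `make` should make the pages available under the same `localhost:8000/src/` address.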


@ -3,12 +3,23 @@ include ../header.jade
+WritePost(Meta) +WritePost(Meta)
p By default spaCy loads a #[code data/vocab/vec.bin] file, where the #[em data] directory is within the #[code spacy.en] module directory. p By default spaCy loads a #[code data/vocab/vec.bin] file, where the #[em data] directory is within the #[code spacy.en] module directory. This file can be replaced, to customize the word vectors that spaCy loads. You can also replace the word vectors at run-time.
p You can customize the word vectors loaded by spaCy in three different ways. For the first two, you'll need to convert your vectors into spaCy's binary file format. The binary format is used because it's smaller and loads faster.
p You can either place the binary file in the location spaCy expects h4 Replacing vec.bin
p The function #[code spacy.vocab.write_binary_vectors] creates a word vectors file in spaCy's binary data format. It expects a #[code bz2] file in the following format:
pre
code
word_key1 0.92 0.45 -0.9 0.0
word_key2 0.3 0.1 0.6 0.3
...
p That is, each line is a single entry. Each entry consists of a key string, followed by a sequence of floats. Each entry should have the same number of floats.
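The text format described above can be produced with the standard library alone. The following is a rough sketch, assuming Python 3 (for `bz2.open`); `write_vectors_bz2` and `read_vectors_bz2` are hypothetical helpers written for this illustration and are not spaCy functions. It assumes keys contain no whitespace, since entries are split on spaces.

```python
import bz2
import os
import tempfile


def write_vectors_bz2(path, vectors):
    """Write {key: [floats]} as one space-separated entry per line, bz2-compressed."""
    widths = {len(values) for values in vectors.values()}
    if len(widths) > 1:
        raise ValueError("all entries must have the same number of floats")
    with bz2.open(path, "wt", encoding="utf8") as file_:
        for key, values in vectors.items():
            floats = " ".join(repr(float(v)) for v in values)
            file_.write(key + " " + floats + "\n")


def read_vectors_bz2(path):
    """Parse the same format back into {key: [floats]}."""
    vectors = {}
    with bz2.open(path, "rt", encoding="utf8") as file_:
        for line in file_:
            pieces = line.split()
            if pieces:
                vectors[pieces[0]] = [float(x) for x in pieces[1:]]
    return vectors


# The two entries from the format example above.
example = {
    "word_key1": [0.92, 0.45, -0.9, 0.0],
    "word_key2": [0.3, 0.1, 0.6, 0.3],
}
example_path = os.path.join(tempfile.mkdtemp(), "vectors.txt.bz2")
write_vectors_bz2(example_path, example)
round_trip = read_vectors_bz2(example_path)
```

The reader is included only to check that the file round-trips; a file written this way is intended as input for the conversion step described in this section.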
p The following example script will replace the #[code vec.bin] file with vectors read from a #[code bz2] archive:
pre pre
code.language-python code.language-python
| from spacy.vocab import write_binary_vectors | from spacy.vocab import write_binary_vectors
@ -23,29 +34,38 @@ include ../header.jade
| if __name__ == '__main__': | if __name__ == '__main__':
| plac.call(main) | plac.call(main)
ol
li Replace the vec.bin, so your vectors will be loaded by default. The function #[code spacy.vocab.write_binary_vectors] is provided to convert files to spaCy's binary format. The advantage of the binary format is that it's smaller and loads faster.
li Load vectors at run-time
Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors
Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
Load new vectors at run-time, optionally converting them
h4 Replace the vectors at run-time, from an archive
p Since v0.93, #[code Vocab] instances can load new vectors directly from #[code bz2] archive files, as follows:
pre pre
code.language-python code.language-python
| import spacy.vocab | >>> from spacy.en import English
| >>> nlp = English()
| def set_spacy_vectors(nlp, binary_loc, bz2_loc=None): | >>> n_dimensions = nlp.vocab.load_vectors('glove.840B.300d.txt.bz2')
| if bz2_loc is not None: | >>> n_dimensions
| spacy.vocab.write_binary_vectors(bz2_loc, binary_loc) | 300
| write_binary_vectors(bz2_input_loc, binary_loc)
|
| nlp.vocab.load_rep_vectors(binary_loc)
h4 Replace vectors at run-time, per word
p Since v0.93, you can assign to the #[code .vector] attribute of #[code Lexeme] instances. Tokens of that lexical type will then inherit the updated vector. For instance:
pre
code.language-python
| >>> from spacy.en import English
| >>> nlp = English()
| >>> apples, oranges = nlp(u'apples oranges')
| >>> apples_lexeme = nlp.vocab[u'apples']
| >>> type(apples), type(apples_lexeme)
| (<type 'spacy.tokens.token.Token'>, <type 'spacy.lexeme.Lexeme'>)
| >>> sum(apples.vector)
| 0.56299778164247982
| >>> apples_lexeme.vector *= 2
| >>> sum(apples.vector)
| 1.1259955632849596
p All tokens whose #[code orth] attribute is #[em apples] will inherit the updated vector.
p Note that the updated vectors are not saved when the process exits. To keep them, write them out yourself and then replace the #[code vec.bin] file as described above.