Merge branch 'develop' of https://github.com/explosion/spaCy into develop

Matthew Honnibal 2017-11-01 16:38:26 +01:00
commit d17a12c71d
109 changed files with 3601 additions and 1750 deletions

.github/contributors/jimregan.md vendored Normal file (+106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” in one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jim O'Regan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-06-24 |
| GitHub username | jimregan |
| Website (optional) | |


@@ -1,7 +1,6 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example of extracting relations between phrases and entities using
"""A simple example of extracting relations between phrases and entities using
spaCy's named entity recognizer and the dependency parse. Here, we extract
money and currency values (entities labelled as MONEY) and then check the
dependency tree to find the noun phrase they are referring to; for example:
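As a rough illustration of that idea (a sketch only, assuming an installed
'en_core_web_sm' model; not the example's full script):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Net income was $9.4 million compared to the prior year of "
          u"$2.7 million.")
for money in [t for t in doc if t.ent_type_ == 'MONEY']:
    if money.dep_ in ('attr', 'dobj'):
        # "X was $Y": the noun phrase is the subject of the head verb
        subjects = [w for w in money.head.lefts if w.dep_ == 'nsubj']
        if subjects:
            print(subjects[0].text, '-->', money.text)
    elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
        # "... of $Y": climb through the preposition to its head noun
        print(money.head.head.text, '-->', money.text)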


@@ -1,8 +1,7 @@
#!/usr/bin/env python
# coding: utf8
"""
This example shows how to navigate the parse tree including subtrees attached
to a word.
"""This example shows how to navigate the parse tree including subtrees
attached to a word.
Based on issue #252:
"In the documents and tutorials the main thing I haven't found is


@@ -1,9 +1,10 @@
#!/usr/bin/env python
# coding: utf8
"""Match a large set of multi-word expressions in O(1) time.
The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to
associate the words with the tags. We then search for tag-sequences that
correspond to valid candidates. Finally, we look up the candidates in the hash
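A toy sketch of that tagging scheme in plain Python (illustrative only, not
spaCy's implementation):

# Tag each vocabulary word by where it may occur in a pattern, keep the
# complete patterns in a hash set, then verify candidate spans against it.
patterns = {('new', 'york'), ('new', 'york', 'city'), ('london',)}
begins = {p[0] for p in patterns if len(p) > 1}
ends = {p[-1] for p in patterns if len(p) > 1}
inside = {w for p in patterns for w in p[1:-1]}
single = {p[0] for p in patterns if len(p) == 1}

def match(words):
    words = [w.lower() for w in words]
    for i, word in enumerate(words):
        if word in single:
            yield i, i + 1
        if word in begins:
            for j in range(i + 1, len(words)):
                if words[j] in ends and tuple(words[i:j + 1]) in patterns:
                    yield i, j + 1
                if words[j] not in inside and words[j] not in ends:
                    break

print(list(match("I moved to New York City".split())))  # [(3, 5), (3, 6)]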


@@ -1,5 +1,6 @@
"""
Example of multi-processing with Joblib. Here, we're exporting
#!/usr/bin/env python
# coding: utf8
"""Example of multi-processing with Joblib. Here, we're exporting
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
each "sentence" on a newline, and spaces between tokens. Data is loaded from
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
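The core pattern is roughly the following minimal sketch (hypothetical batch
sizes and model name; the real example pulls its data from Thinc's loader):

from joblib import Parallel, delayed
import spacy

def process_batch(texts):
    nlp = spacy.load('en_core_web_sm')  # each worker loads its own pipeline
    return ['\n'.join(' '.join('{}|{}'.format(tok.text, tok.tag_)
                               for tok in sent) for sent in doc.sents)
            for doc in nlp.pipe(texts)]

texts = ["This is one review.", "Here is another."] * 8
batches = [texts[i:i + 4] for i in range(0, len(texts), 4)]
results = Parallel(n_jobs=2)(delayed(process_batch)(batch) for batch in batches)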


@@ -94,7 +94,7 @@ def main(model=None, output_dir=None, n_iter=100):
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
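For context, the change in this and the following hunks is an API
simplification: begin_training() no longer needs a dummy data callable. A
minimal sketch:

import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('parser'))
optimizer = nlp.begin_training()  # previously: nlp.begin_training(lambda: [])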


@@ -1,7 +1,6 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy's named entity recognizer, starting off with an
"""Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.
For more details, see the documentation:


@@ -1,7 +1,6 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training an additional entity type
"""Example of training an additional entity type
This script shows how to add a new entity type to an existing pre-trained NER
model. To keep the example short and simple, only four sentences are provided
@@ -88,7 +87,7 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
random.seed(0)
optimizer = nlp.begin_training(lambda: [])
optimizer = nlp.begin_training()
for itn in range(n_iter):
losses = {}
gold_parses = get_gold_parses(nlp.make_doc, TRAIN_DATA)


@@ -1,10 +1,7 @@
#!/usr/bin/env python
# coding: utf8
"""
Example of training spaCy dependency parser, starting off with an existing model
or a blank model.
For more details, see the documentation:
"""Example of training spaCy dependency parser, starting off with an existing
model or a blank model. For more details, see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Dependency Parse: https://alpha.spacy.io/usage/linguistic-features#dependency-parse
@@ -67,7 +64,7 @@ def main(model=None, output_dir=None, n_iter=1000):
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training(lambda: [])
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}


@@ -3,9 +3,8 @@
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults.
For more details, see the documentation:
with a blank Language class and modifies its defaults. For more details, see
the documentation:
* Training: https://alpha.spacy.io/usage/training
* POS Tagging: https://alpha.spacy.io/usage/linguistic-features#pos-tagging
@@ -62,7 +61,7 @@ def main(lang='en', output_dir=None, n_iter=25):
tagger = nlp.create_pipe('tagger')
nlp.add_pipe(tagger)
optimizer = nlp.begin_training(lambda: [])
optimizer = nlp.begin_training()
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}


@@ -3,9 +3,8 @@
"""Train a multi-label convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`.
For more details, see the documentation:
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
see the documentation:
* Training: https://alpha.spacy.io/usage/training
* Text classification: https://alpha.spacy.io/usage/text-classification
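Once trained this way, predictions are read from doc.cats; a brief usage
sketch (hypothetical output path and label):

import spacy

nlp = spacy.load('/tmp/textcat_model')  # directory saved by this script
doc = nlp(u"This movie sucked")
print(doc.cats)                         # e.g. {'POSITIVE': 0.12}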
@@ -27,8 +26,9 @@ from spacy.pipeline import TextCategorizer
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_examples=("Number of texts to train from", "option", "N", int),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20):
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
@@ -51,7 +51,8 @@ def main(model=None, output_dir=None, n_iter=20):
# load the IMBD dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
print("Using %d training examples" % n_texts)
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
train_docs = [nlp.tokenizer(text) for text in train_texts]
train_gold = [GoldParse(doc, cats=cats) for doc, cats in
zip(train_docs, train_cats)]
@@ -60,20 +61,20 @@ def main(model=None, output_dir=None, n_iter=20):
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training(lambda: [])
optimizer = nlp.begin_training()
print("Training the model...")
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4., 128., 1.001))
batches = minibatch(train_data, size=compounding(4., 32., 1.001))
for batch in batches:
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print('{0:.3f}\t{0:.3f}\t{0:.3f}\t{0:.3f}' # print a simple table
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}' # print a simple table
.format(losses['textcat'], scores['textcat_p'],
scores['textcat_r'], scores['textcat_f']))


@@ -0,0 +1,21 @@
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
{"orth": ".", "id": 1, "lower": ".", "norm": ".", "shape": ".", "prefix": ".", "suffix": ".", "length": 1, "cluster": "8", "prob": -3.0678977966308594, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": ",", "id": 2, "lower": ",", "norm": ",", "shape": ",", "prefix": ",", "suffix": ",", "length": 1, "cluster": "4", "prob": -3.4549596309661865, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "the", "id": 3, "lower": "the", "norm": "the", "shape": "xxx", "prefix": "t", "suffix": "the", "length": 3, "cluster": "11", "prob": -3.528766632080078, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "I", "id": 4, "lower": "i", "norm": "I", "shape": "X", "prefix": "I", "suffix": "I", "length": 1, "cluster": "346", "prob": -3.791565179824829, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": false, "is_title": true, "is_upper": true, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "to", "id": 5, "lower": "to", "norm": "to", "shape": "xx", "prefix": "t", "suffix": "to", "length": 2, "cluster": "12", "prob": -3.8560216426849365, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "a", "id": 6, "lower": "a", "norm": "a", "shape": "x", "prefix": "a", "suffix": "a", "length": 1, "cluster": "19", "prob": -3.92978835105896, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "and", "id": 7, "lower": "and", "norm": "and", "shape": "xxx", "prefix": "a", "suffix": "and", "length": 3, "cluster": "20", "prob": -4.113108158111572, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "of", "id": 8, "lower": "of", "norm": "of", "shape": "xx", "prefix": "o", "suffix": "of", "length": 2, "cluster": "28", "prob": -4.27587366104126, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "you", "id": 9, "lower": "you", "norm": "you", "shape": "xxx", "prefix": "y", "suffix": "you", "length": 3, "cluster": "602", "prob": -4.373791217803955, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "it", "id": 10, "lower": "it", "norm": "it", "shape": "xx", "prefix": "i", "suffix": "it", "length": 2, "cluster": "474", "prob": -4.388050079345703, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "is", "id": 11, "lower": "is", "norm": "is", "shape": "xx", "prefix": "i", "suffix": "is", "length": 2, "cluster": "762", "prob": -4.457748889923096, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "that", "id": 12, "lower": "that", "norm": "that", "shape": "xxxx", "prefix": "t", "suffix": "hat", "length": 4, "cluster": "84", "prob": -4.464504718780518, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\n\n", "id": 0, "lower": "\n\n", "norm": "\n\n", "shape": "\n\n", "prefix": "\n", "suffix": "\n\n", "length": 2, "cluster": "0", "prob": -4.606560707092285, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "in", "id": 13, "lower": "in", "norm": "in", "shape": "xx", "prefix": "i", "suffix": "in", "length": 2, "cluster": "60", "prob": -4.619071960449219, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "'s", "id": 14, "lower": "'s", "norm": "'s", "shape": "'x", "prefix": "'", "suffix": "'s", "length": 2, "cluster": "52", "prob": -4.830559253692627, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "n't", "id": 15, "lower": "n't", "norm": "n't", "shape": "x'x", "prefix": "n", "suffix": "n't", "length": 3, "cluster": "74", "prob": -4.859938621520996, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "for", "id": 16, "lower": "for", "norm": "for", "shape": "xxx", "prefix": "f", "suffix": "for", "length": 3, "cluster": "508", "prob": -4.8801093101501465, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\"", "id": 17, "lower": "\"", "norm": "\"", "shape": "\"", "prefix": "\"", "suffix": "\"", "length": 1, "cluster": "0", "prob": -5.02677583694458, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": true, "is_left_punct": true, "is_right_punct": true}
{"orth": "?", "id": 18, "lower": "?", "norm": "?", "shape": "?", "prefix": "?", "suffix": "?", "length": 1, "cluster": "0", "prob": -5.05924654006958, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": " ", "id": 0, "lower": " ", "norm": " ", "shape": " ", "prefix": " ", "suffix": " ", "length": 1, "cluster": "0", "prob": -5.129165172576904, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}


@@ -7,14 +7,13 @@ from __future__ import unicode_literals
import plac
import numpy
import from spacy.language import Language
from spacy.language import Language
@plac.annotations(
vectors_loc=("Path to vectors", "positional", None, str))
def main(vectors_loc):
nlp = Language()
nlp = Language() # start off with a blank Language class
with open(vectors_loc, 'rb') as file_:
header = file_.readline()
nr_row, nr_dim = header.split()
@@ -24,9 +23,11 @@ def main(vectors_loc):
pieces = line.split()
word = pieces[0]
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
nlp.vocab.set_vector(word, vector)
doc = nlp(u'class colspan')
print(doc[0].similarity(doc[1]))
nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
# test the vectors and similarity
text = 'class colspan'
doc = nlp(text)
print(text, doc[0].similarity(doc[1]))
if __name__ == '__main__':


@@ -6,7 +6,7 @@ from __future__ import print_function
if __name__ == '__main__':
import plac
import sys
from spacy.cli import download, link, info, package, train, convert, model
from spacy.cli import download, link, info, package, train, convert
from spacy.cli import vocab, profile, evaluate, validate
from spacy.util import prints
@@ -18,8 +18,7 @@ if __name__ == '__main__':
'evaluate': evaluate,
'convert': convert,
'package': package,
'model': model,
'model': vocab,
'vocab': vocab,
'profile': profile,
'validate': validate
}


@@ -29,6 +29,16 @@ from . import util
VECTORS_KEY = 'spacy_pretrained_vectors'
def cosine(vec1, vec2):
xp = get_array_module(vec1)
norm1 = xp.linalg.norm(vec1)
norm2 = xp.linalg.norm(vec2)
if norm1 == 0. or norm2 == 0.:
return 0
else:
return vec1.dot(vec2) / (norm1 * norm2)
@layerize
def _flatten_add_lengths(seqs, pad=0, drop=0.):
ops = Model.ops
@@ -428,7 +438,7 @@ def build_text_classifier(nr_class, width=64, **cfg):
pretrained_dims = cfg.get('pretrained_dims', 0)
with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
'**': clone}):
if cfg.get('low_data'):
if cfg.get('low_data') and pretrained_dims:
model = (
SpacyVectors
>> flatten_add_lengths


@@ -6,6 +6,5 @@ from .profile import profile
from .train import train
from .evaluate import evaluate
from .convert import convert
from .model import model
from .vocab import make_vocab as vocab
from .validate import validate


@@ -17,14 +17,14 @@ numpy.random.seed(0)
@plac.annotations(
model=("Model name or path", "positional", None, str),
data_path=("Location of JSON-formatted evaluation data", "positional",
model=("model name or path", "positional", None, str),
data_path=("location of JSON-formatted evaluation data", "positional",
None, str),
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
gpu_id=("Use GPU", "option", "g", int),
displacy_path=("Directory to output rendered parses as HTML", "option",
gold_preproc=("use gold preprocessing", "flag", "G", bool),
gpu_id=("use GPU", "option", "g", int),
displacy_path=("directory to output rendered parses as HTML", "option",
"dp", str),
displacy_limit=("Limit of parses to render as HTML", "option", "dl", int))
displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
def evaluate(cmd, model, data_path, gpu_id=-1, gold_preproc=False,
displacy_path=None, displacy_limit=25):
"""


@@ -1,140 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
try:
import bz2
import gzip
except ImportError:
pass
import math
from ast import literal_eval
from pathlib import Path
import numpy as np
import spacy
from preshed.counter import PreshCounter
from .. import util
from ..compat import fix_text
def model(cmd, lang, model_dir, freqs_data, clusters_data, vectors_data,
min_doc_freq=5, min_word_freq=200):
model_path = Path(model_dir)
freqs_path = Path(freqs_data)
clusters_path = Path(clusters_data) if clusters_data else None
vectors_path = Path(vectors_data) if vectors_data else None
check_dirs(freqs_path, clusters_path, vectors_path)
vocab = util.get_lang_class(lang).Defaults.create_vocab()
nlp = spacy.blank(lang)
vocab = nlp.vocab
probs, oov_prob = read_probs(
freqs_path, min_doc_freq=int(min_doc_freq), min_freq=int(min_doc_freq))
clusters = read_clusters(clusters_path) if clusters_path else {}
populate_vocab(vocab, clusters, probs, oov_prob)
add_vectors(vocab, vectors_path)
create_model(model_path, nlp)
def add_vectors(vocab, vectors_path):
with bz2.BZ2File(vectors_path.as_posix()) as f:
num_words, dim = next(f).split()
vocab.clear_vectors(int(dim))
for line in f:
word_w_vector = line.decode("utf8").strip().split(" ")
word = word_w_vector[0]
vector = np.array([float(val) for val in word_w_vector[1:]])
if word in vocab:
vocab.set_vector(word, vector)
def create_model(model_path, model):
if not model_path.exists():
model_path.mkdir()
model.to_disk(model_path.as_posix())
def read_probs(freqs_path, max_length=100, min_doc_freq=5, min_freq=200):
counts = PreshCounter()
total = 0
freqs_file = check_unzip(freqs_path)
for i, line in enumerate(freqs_file):
freq, doc_freq, key = line.rstrip().split('\t', 2)
freq = int(freq)
counts.inc(i + 1, freq)
total += freq
counts.smooth()
log_total = math.log(total)
freqs_file = check_unzip(freqs_path)
probs = {}
for line in freqs_file:
freq, doc_freq, key = line.rstrip().split('\t', 2)
doc_freq = int(doc_freq)
freq = int(freq)
if doc_freq >= min_doc_freq and freq >= min_freq and len(
key) < max_length:
word = literal_eval(key)
smooth_count = counts.smoother(int(freq))
probs[word] = math.log(smooth_count) - log_total
oov_prob = math.log(counts.smoother(0)) - log_total
return probs, oov_prob
def read_clusters(clusters_path):
clusters = {}
with clusters_path.open() as f:
for line in f:
try:
cluster, word, freq = line.split()
word = fix_text(word)
except ValueError:
continue
# If the clusterer has only seen the word a few times, its
# cluster is unreliable.
if int(freq) >= 3:
clusters[word] = cluster
else:
clusters[word] = '0'
# Expand clusters with re-casing
for word, cluster in list(clusters.items()):
if word.lower() not in clusters:
clusters[word.lower()] = cluster
if word.title() not in clusters:
clusters[word.title()] = cluster
if word.upper() not in clusters:
clusters[word.upper()] = cluster
return clusters
def populate_vocab(vocab, clusters, probs, oov_prob):
for word, prob in reversed(
sorted(list(probs.items()), key=lambda item: item[1])):
lexeme = vocab[word]
lexeme.prob = prob
lexeme.is_oov = False
# Decode as a little-endian string, so that we can do & 15 to get
# the first 4 bits. See _parse_features.pyx
if word in clusters:
lexeme.cluster = int(clusters[word][::-1], 2)
else:
lexeme.cluster = 0
def check_unzip(file_path):
file_path_str = file_path.as_posix()
if file_path_str.endswith('gz'):
return gzip.open(file_path_str)
else:
return file_path.open()
def check_dirs(freqs_data, clusters_data, vectors_data):
if not freqs_data.is_file():
util.sys_exit(freqs_data.as_posix(), title="No frequencies file found")
if clusters_data and not clusters_data.is_file():
util.sys_exit(
clusters_data.as_posix(), title="No Brown clusters file found")
if vectors_data and not vectors_data.is_file():
util.sys_exit(
vectors_data.as_posix(), title="No word vectors file found")


@@ -16,10 +16,11 @@ from .. import about
input_dir=("directory with model data", "positional", None, str),
output_dir=("output parent directory", "positional", None, str),
meta_path=("path to meta.json", "option", "m", str),
create_meta=("create meta.json, even if one exists in directory", "flag",
"c", bool),
force=("force overwriting of existing folder in output directory", "flag",
"f", bool))
create_meta=("create meta.json, even if one exists in directory if "
"existing meta is found, entries are shown as defaults in "
"the command line prompt", "flag", "c", bool),
force=("force overwriting of existing model directory in output directory",
"flag", "f", bool))
def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
force=False):
"""
@@ -41,13 +42,13 @@ def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
template_manifest = get_template('MANIFEST.in')
template_init = get_template('xx_model_name/__init__.py')
meta_path = meta_path or input_path / 'meta.json'
if not create_meta and meta_path.is_file():
prints(meta_path, title="Reading meta.json from file")
if meta_path.is_file():
meta = util.read_json(meta_path)
else:
meta = generate_meta(input_dir)
if not create_meta: # only print this if user doesn't want to overwrite
prints(meta_path, title="Loaded meta.json from file")
else:
meta = generate_meta(input_dir, meta)
meta = validate_meta(meta, ['lang', 'name', 'version'])
model_name = meta['lang'] + '_' + meta['name']
model_name_v = model_name + '-' + meta['version']
main_path = output_path / model_name_v
@@ -82,22 +83,24 @@ def create_file(file_path, contents):
file_path.open('w', encoding='utf-8').write(contents)
def generate_meta(model_path):
meta = {}
settings = [('lang', 'Model language', 'en'),
('name', 'Model name', 'model'),
('version', 'Model version', '0.0.0'),
def generate_meta(model_path, existing_meta):
meta = existing_meta or {}
settings = [('lang', 'Model language', meta.get('lang', 'en')),
('name', 'Model name', meta.get('name', 'model')),
('version', 'Model version', meta.get('version', '0.0.0')),
('spacy_version', 'Required spaCy version',
'>=%s,<3.0.0' % about.__version__),
('description', 'Model description', False),
('author', 'Author', False),
('email', 'Author email', False),
('url', 'Author website', False),
('license', 'License', 'CC BY-NC 3.0')]
('description', 'Model description',
meta.get('description', False)),
('author', 'Author', meta.get('author', False)),
('email', 'Author email', meta.get('email', False)),
('url', 'Author website', meta.get('url', False)),
('license', 'License', meta.get('license', 'CC BY-SA 3.0'))]
nlp = util.load_model_from_path(Path(model_path))
meta['pipeline'] = nlp.pipe_names
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
'vectors': len(nlp.vocab.vectors),
'keys': nlp.vocab.vectors.n_keys}
prints("Enter the package settings for your model. The following "
"information will be read from your model data: pipeline, vectors.",
title="Generating meta.json")


@@ -32,7 +32,6 @@ numpy.random.seed(0)
n_sents=("number of sentences", "option", "ns", int),
use_gpu=("Use GPU", "option", "g", int),
vectors=("Model to load vectors from", "option", "v"),
vectors_limit=("Truncate to N vectors (requires -v)", "option", None, int),
no_tagger=("Don't train tagger", "flag", "T", bool),
no_parser=("Don't train parser", "flag", "P", bool),
no_entities=("Don't train NER", "flag", "N", bool),
@@ -41,7 +40,7 @@ numpy.random.seed(0)
meta_path=("Optional path to meta.json. All relevant properties will be "
"overwritten.", "option", "m", Path))
def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
use_gpu=-1, vectors=None, vectors_limit=None, no_tagger=False,
use_gpu=-1, vectors=None, no_tagger=False,
no_parser=False, no_entities=False, gold_preproc=False,
version="0.0.0", meta_path=None):
"""
@@ -95,8 +94,6 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
nlp.meta.update(meta)
if vectors:
util.load_model(vectors, vocab=nlp.vocab)
if vectors_limit is not None:
nlp.vocab.prune_vectors(vectors_limit)
for name in pipeline:
nlp.add_pipe(nlp.create_pipe(name), name=name)
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
@@ -149,7 +146,8 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
'gpu': gpu_wps}
meta['vectors'] = {'width': nlp.vocab.vectors_length,
'entries': len(nlp.vocab.vectors)}
'vectors': len(nlp.vocab.vectors),
'keys': nlp.vocab.vectors.n_keys}
meta['lang'] = nlp.lang
meta['pipeline'] = pipeline
meta['spacy_version'] = '>=%s' % about.__version__


@@ -1,31 +1,37 @@
'''Compile a vocabulary from a lexicon jsonl file and word vectors.'''
# coding: utf8
from __future__ import unicode_literals
from pathlib import Path
import plac
import json
import spacy
import numpy
from spacy.util import ensure_path
from pathlib import Path
from ..vectors import Vectors
from ..util import prints, ensure_path
@plac.annotations(
lang=("model language", "positional", None, str),
output_dir=("output directory to store model in", "positional", None, str),
output_dir=("model output directory", "positional", None, Path),
lexemes_loc=("location of JSONL-formatted lexical data", "positional",
None, str),
vectors_loc=("location of vectors data, as numpy .npz (optional)",
"positional", None, str),
version=("Model version", "option", "V", str),
None, Path),
vectors_loc=("optional: location of vectors data, as numpy .npz",
"positional", None, str),
prune_vectors=("optional: number of vectors to prune to.",
"option", "V", int)
)
def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, version=None):
out_dir = ensure_path(output_dir)
jsonl_loc = ensure_path(lexemes_loc)
def make_vocab(cmd, lang, output_dir, lexemes_loc,
vectors_loc=None, prune_vectors=-1):
"""Compile a vocabulary from a lexicon jsonl file and word vectors."""
if not lexemes_loc.exists():
prints(lexemes_loc, title="Can't find lexical data", exits=1)
vectors_loc = ensure_path(vectors_loc)
nlp = spacy.blank(lang)
for word in nlp.vocab:
word.rank = 0
with jsonl_loc.open() as file_:
lex_added = 0
with lexemes_loc.open() as file_:
for line in file_:
if line.strip():
attrs = json.loads(line)
@@ -35,14 +41,20 @@ def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, version=None):
lex = nlp.vocab[attrs['orth']]
lex.set_attrs(**attrs)
assert lex.rank == attrs['id']
lex_added += 1
if vectors_loc is not None:
vector_data = numpy.load(open(vectors_loc, 'rb'))
nlp.vocab.clear_vectors(width=vector_data.shape[1])
added = 0
vector_data = numpy.load(vectors_loc.open('rb'))
nlp.vocab.vectors = Vectors(data=vector_data)
for word in nlp.vocab:
if word.rank:
nlp.vocab.vectors.add(word.orth_, row=word.rank,
vector=vector_data[word.rank])
added += 1
nlp.to_disk(out_dir)
nlp.vocab.vectors.add(word.orth, row=word.rank)
if prune_vectors >= 1:
remap = nlp.vocab.prune_vectors(prune_vectors)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
vec_added = len(nlp.vocab.vectors)
prints("{} entries, {} vectors".format(lex_added, vec_added), output_dir,
title="Sucessfully compiled vocab and vectors, and saved model")
return nlp
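A hypothetical direct invocation of the command above (normally run through
the CLI wiring shown earlier; all paths are placeholders):

from pathlib import Path
from spacy.cli import vocab

nlp = vocab(None, 'en', Path('/tmp/en_vocab'), Path('lexemes.jsonl'),
            vectors_loc='vectors.npz', prune_vectors=10000)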


@@ -23,4 +23,4 @@ for exc_data in [
_exc[exc_data[ORTH]] = [dict(exc_data)]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -30,4 +30,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -181,4 +181,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -456,4 +456,4 @@ for string in _exclude:
_exc.pop(string)
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -54,4 +54,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -76,4 +76,4 @@ for exc_data in [
_exc[exc_data[ORTH]] = [dict(exc_data)]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc


@@ -147,5 +147,5 @@ _regular_exp += ["^{prefix}[{elision}][{alpha}][{alpha}{elision}{hyphen}\-]*$".f
_regular_exp.append(URL_PATTERN)
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile('|'.join('(?:{})'.format(m) for m in _regular_exp), re.IGNORECASE).match

spacy/lang/ga/__init__.py Normal file (+25 lines)

@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc
class IrishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ga'
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = set(STOP_WORDS)
class Irish(Language):
lang = 'ga'
Defaults = IrishDefaults
__all__ = ['Irish']
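A quick usage sketch for the new language class (assuming this commit is
installed):

import spacy

nlp = spacy.blank('ga')  # resolves to the Irish class above
doc = nlp(u"Tá sé ina chodladh.")
print([(token.text, token.is_stop) for token in doc])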


@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals
class IrishMorph:
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z']
broad_vowels = ['a', 'á', 'o', 'ó', 'u', 'ú']
slender_vowels = ['e', 'é', 'i', 'í']
vowels = broad_vowels + slender_vowels
def ends_dentals(word):
if word != "" and word[-1] in ['d', 'n', 't', 's']:
return True
else:
return False
def devoice(word):
if len(word) > 2 and word[-2] == 's' and word[-1] == 'd':
return word[:-1] + 't'
else:
return word
def ends_with_vowel(word):
return word != "" and word[-1] in vowels
def starts_with_vowel(word):
return word != "" and word[0] in vowels
def deduplicate(word):
if len(word) > 2 and word[-2] == word[-1] and word[-1] in consonants:
return word[:-1]
else:
return word


@@ -0,0 +1,45 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
a ach ag agus an aon ar arna as
ba beirt bhúr
caoga ceathair ceathrar chomh chuig chun cois céad cúig cúigear
daichead dar de deich deichniúr den dhá do don dtí dár
faoi faoin faoina faoinár fara fiche
gach gan go gur
haon hocht
i iad idir in ina ins inár is
le leis lena lenár
mar mo muid
na nach naoi naonúr níor nócha
ocht ochtar ochtó os
roimh
sa seacht seachtar seachtó seasca seisear siad sibh sinn sna
tar thar thú triúr trí trína trínár tríocha
um
ár
é éis
í
ó ón óna ónár
""".split())

spacy/lang/ga/tag_map.py Normal file (+368 lines)

@@ -0,0 +1,368 @@
# coding: utf8
from __future__ import unicode_literals
TAG_MAP = {
"ADJ__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=Gen|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "fem", "Number": "sing"},
"ADJ__Case=Gen|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing"},
"ADJ__Case=Gen|NounType=Strong|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "strong"}},
"ADJ__Case=Gen|NounType=Weak|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "weak"}},
"ADJ__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "plur"},
"ADJ__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
"ADJ__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
"ADJ__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
"ADJ__Case=NomAcc|NounType=NotSlender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "notslender"}},
"ADJ__Case=NomAcc|NounType=Slender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "slender"}},
"ADJ__Degree=Cmp,Sup|Form=Len": {"pos": "ADJ", "Degree": "cmp|sup", "Other": {"Form": "len"}},
"ADJ__Degree=Cmp,Sup": {"pos": "ADJ", "Degree": "cmp|sup"},
"ADJ__Degree=Pos|Form=Ecl": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "ecl"}},
"ADJ__Degree=Pos|Form=HPref": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "hpref"}},
"ADJ__Degree=Pos|Form=Len": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "len"}},
"ADJ__Degree=Pos": {"pos": "ADJ", "Degree": "pos"},
"ADJ__Foreign=Yes": {"pos": "ADJ", "Foreign": "yes"},
"ADJ__Form=Len|VerbForm=Part": {"pos": "ADJ", "VerbForm": "part", "Other": {"Form": "len"}},
"ADJ__Gender=Masc|Number=Sing|PartType=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"},
"ADJ__Gender=Masc|Number=Sing|Case=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"},
"ADJ__Number=Plur|PartType=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"},
"ADJ__Number=Plur|Case=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"},
"ADJ__Number=Plur": {"pos": "ADJ", "Number": "plur"},
"ADJ___": {"pos": "ADJ"},
"ADJ__VerbForm=Part": {"pos": "ADJ", "VerbForm": "part"},
"ADP__Foreign=Yes": {"pos": "ADP", "Foreign": "yes"},
"ADP__Form=Len|Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1, "Other": {"Form": "len"}},
"ADP__Form=Len|Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3, "Other": {"Form": "len"}},
"ADP__Form=Len|Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1, "Other": {"Form": "len"}},
"ADP__Gender=Fem|Number=Sing|Person=3": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3},
"ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"},
"ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Gender=Masc|Number=Sing|Person=3": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3},
"ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"},
"ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"},
"ADP__Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1},
"ADP__Number=Plur|Person=1|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 1, "Poss": "yes"},
"ADP__Number=Plur|Person=1|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 1, "PronType": "emp"},
"ADP__Number=Plur|Person=2": {"pos": "ADP", "Number": "plur", "Person": 2},
"ADP__Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3},
"ADP__Number=Plur|Person=3|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes"},
"ADP__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Number=Plur|Person=3|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 3, "PronType": "emp"},
"ADP__Number=Plur|PronType=Art": {"pos": "ADP", "Number": "plur", "PronType": "art"},
"ADP__Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1},
"ADP__Number=Sing|Person=1|Poss=Yes": {"pos": "ADP", "Number": "sing", "Person": 1, "Poss": "yes"},
"ADP__Number=Sing|Person=1|PronType=Emp": {"pos": "ADP", "Number": "sing", "Person": 1, "PronType": "emp"},
"ADP__Number=Sing|Person=2": {"pos": "ADP", "Number": "sing", "Person": 2},
"ADP__Number=Sing|Person=3": {"pos": "ADP", "Number": "sing", "Person": 3},
"ADP__Number=Sing|PronType=Art": {"pos": "ADP", "Number": "sing", "PronType": "art"},
"ADP__Person=3|Poss=Yes": {"pos": "ADP", "Person": 3, "Poss": "yes"},
"ADP___": {"pos": "ADP"},
"ADP__Poss=Yes": {"pos": "ADP", "Poss": "yes"},
"ADP__PrepForm=Cmpd": {"pos": "ADP", "Other": {"PrepForm": "cmpd"}},
"ADP__PronType=Art": {"pos": "ADP", "PronType": "art"},
"ADV__Form=Len": {"pos": "ADV", "Other": {"Form": "len"}},
"ADV___": {"pos": "ADV"},
"ADV__PronType=Int": {"pos": "ADV", "PronType": "int"},
"AUX__Form=VF|Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Gender=Masc|Number=Sing|Person=3|VerbForm=Cop": {"pos": "AUX", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"VerbForm": "cop"}},
"AUX__Mood=Int|Number=Sing|PronType=Art|VerbForm=Cop": {"pos": "AUX", "Number": "sing", "PronType": "art", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__PartType=Comp|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"PartType": "comp", "VerbForm": "cop"}},
"AUX__Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX___": {"pos": "AUX"},
"AUX__PronType=Dem|VerbForm=Cop": {"pos": "AUX", "PronType": "dem", "Other": {"VerbForm": "cop"}},
"AUX__PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__VerbForm=Cop": {"pos": "AUX", "Other": {"VerbForm": "cop"}},
"CCONJ___": {"pos": "CCONJ"},
"DET__Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"},
"DET__Definite=Def|Form=Ecl": {"pos": "DET", "Definite": "def", "Other": {"Form": "ecl"}},
"DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"},
"DET__Definite=Def|Number=Plur|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "plur", "PronType": "art"},
"DET__Definite=Def|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "sing", "PronType": "art"},
"DET__Definite=Def": {"pos": "DET", "Definite": "def"},
"DET__Form=HPref|PronType=Ind": {"pos": "DET", "PronType": "ind", "Other": {"Form": "hpref"}},
"DET__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"},
"DET__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"},
"DET__Number=Plur|Person=1|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 1, "Poss": "yes"},
"DET__Number=Plur|Person=3|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 3, "Poss": "yes"},
"DET__Number=Sing|Person=1|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 1, "Poss": "yes"},
"DET__Number=Sing|Person=2|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 2, "Poss": "yes"},
"DET__Number=Sing|PronType=Int": {"pos": "DET", "Number": "sing", "PronType": "int"},
"DET___": {"pos": "DET"},
"DET__PronType=Dem": {"pos": "DET", "PronType": "dem"},
"DET__PronType=Ind": {"pos": "DET", "PronType": "ind"},
"NOUN__Case=Dat|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Definite": "ind", "Gender": "fem", "Number": "sing"},
"NOUN__Case=Dat|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Case=Dat|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing"},
"NOUN__Case=Dat|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "masc", "Number": "sing"},
"NOUN__Case=Gen|Definite=Def|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}},
"NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"},
"NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}},
"NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
"NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "sing"},
"NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "ind", "Gender": "fem", "Number": "sing"},
"NOUN__Case=Gen|Form=Ecl|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}},
"NOUN__Case=Gen|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}},
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "weak"}},
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
"NOUN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "strong"}},
"NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "weak"}},
"NOUN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=Gen|Form=Len|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf", "Other": {"Form": "len"}},
"NOUN__Case=Gen|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}},
"NOUN__Case=Gen|Gender=Fem|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "weak"}},
"NOUN__Case=Gen|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur"},
"NOUN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing"},
"NOUN__Case=Gen|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}},
"NOUN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
"NOUN__Case=Gen|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur"},
"NOUN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing"},
"NOUN__Case=Gen|Number=Sing": {"pos": "NOUN", "Case": "gen", "Number": "sing"},
"NOUN__Case=Gen|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf"},
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "plur"},
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"},
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem"},
"NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"},
"NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"},
"NOUN__Case=NomAcc|Definite=Ind|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "ind", "Gender": "masc", "Number": "plur"},
"NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}},
"NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}},
"NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Case=NomAcc|Form=Emp|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "emp"}},
"NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "hpref"}},
"NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
"NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "hpref"}},
"NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}},
"NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "len"}},
"NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}},
"NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur"},
"NOUN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
"NOUN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
"NOUN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
"NOUN__Case=Voc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Definite": "def", "Gender": "masc", "Number": "plur"},
"NOUN__Case=Voc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}},
"NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing"},
"NOUN__Degree=Pos": {"pos": "NOUN", "Degree": "pos"},
"NOUN__Foreign=Yes": {"pos": "NOUN", "Foreign": "yes"},
"NOUN__Form=Ecl|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "ecl"}},
"NOUN__Form=Ecl|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "ecl"}},
"NOUN__Form=Ecl|VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun", "Other": {"Form": "ecl"}},
"NOUN__Form=HPref|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "hpref"}},
"NOUN__Form=Len|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "len"}},
"NOUN__Form=Len|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "len"}},
"NOUN__Gender=Fem|Number=Sing": {"pos": "NOUN", "Gender": "fem", "Number": "sing"},
"NOUN__Number=Sing|PartType=Comp": {"pos": "NOUN", "Number": "sing", "Other": {"PartType": "comp"}},
"NOUN__Number=Sing": {"pos": "NOUN", "Number": "sing"},
"NOUN___": {"pos": "NOUN"},
"NOUN__Reflex=Yes": {"pos": "NOUN", "Reflex": "yes"},
"NOUN__VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf"},
"NOUN__VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun"},
"NUM__Definite=Def|NumType=Card": {"pos": "NUM", "Definite": "def", "NumType": "card"},
"NUM__Form=Ecl|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "ecl"}},
"NUM__Form=Ecl|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "ecl"}},
"NUM__Form=HPref|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "hpref"}},
"NUM__Form=Len|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "len"}},
"NUM__Form=Len|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "len"}},
"NUM__NumType=Card": {"pos": "NUM", "NumType": "card"},
"NUM__NumType=Ord": {"pos": "NUM", "NumType": "ord"},
"NUM___": {"pos": "NUM"},
"PART__Form=Ecl|PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"Form": "ecl", "PartType": "vb"}},
"PART__Mood=Imp|PartType=Vb|Polarity=Neg": {"pos": "PART", "Mood": "imp", "Polarity": "neg", "Other": {"PartType": "vb"}},
"PART__Mood=Imp|PartType=Vb": {"pos": "PART", "Mood": "imp", "Other": {"PartType": "vb"}},
"PART__Mood=Int|PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"Mood": "int", "PartType": "vb"}},
"PART__PartType=Ad": {"pos": "PART", "Other": {"PartType": "ad"}},
"PART__PartType=Cmpl|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "cmpl"}},
"PART__PartType=Cmpl|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "cmpl"}},
"PART__PartType=Cmpl": {"pos": "PART", "Other": {"PartType": "cmpl"}},
"PART__PartType=Comp": {"pos": "PART", "Other": {"PartType": "comp"}},
"PART__PartType=Cop|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "cop"}},
"PART__PartType=Deg": {"pos": "PART", "Other": {"PartType": "deg"}},
"PART__PartType=Inf": {"pos": "PART", "PartType": "inf"},
"PART__PartType=Num": {"pos": "PART", "Other": {"PartType": "num"}},
"PART__PartType=Pat": {"pos": "PART", "Other": {"PartType": "pat"}},
"PART__PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|Polarity=Neg|PronType=Rel": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|Polarity=Neg|PronType=Rel|Tense=Past": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "vb"}},
"PART__PartType=Vb": {"pos": "PART", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|PronType=Rel|Tense=Past": {"pos": "PART", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}},
"PART__PartType=Vb|Tense=Past": {"pos": "PART", "Tense": "past", "Other": {"PartType": "vb"}},
"PART__PartType=Voc": {"pos": "PART", "Other": {"PartType": "voc"}},
"PART___": {"pos": "PART"},
"PART__PronType=Rel": {"pos": "PART", "PronType": "rel"},
"PRON__Form=Len|Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2, "Other": {"Form": "len"}},
"PRON__Form=Len|PronType=Ind": {"pos": "PRON", "PronType": "ind", "Other": {"Form": "len"}},
"PRON__Gender=Fem|Number=Sing|Person=3": {"pos": "PRON", "Gender": "fem", "Number": "sing", "Person": 3},
"PRON__Gender=Masc|Number=Sing|Person=3": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3},
"PRON__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"},
"PRON__Gender=Masc|Person=3": {"pos": "PRON", "Gender": "masc", "Person": 3},
"PRON__Number=Plur|Person=1": {"pos": "PRON", "Number": "plur", "Person": 1},
"PRON__Number=Plur|Person=1|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 1, "PronType": "emp"},
"PRON__Number=Plur|Person=2": {"pos": "PRON", "Number": "plur", "Person": 2},
"PRON__Number=Plur|Person=3": {"pos": "PRON", "Number": "plur", "Person": 3},
"PRON__Number=Plur|Person=3|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 3, "PronType": "emp"},
"PRON__Number=Sing|Person=1": {"pos": "PRON", "Number": "sing", "Person": 1},
"PRON__Number=Sing|Person=1|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 1, "PronType": "emp"},
"PRON__Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2},
"PRON__Number=Sing|Person=2|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 2, "PronType": "emp"},
"PRON__Number=Sing|Person=3": {"pos": "PRON", "Number": "sing", "Person": 3},
"PRON__Number=Sing|PronType=Int": {"pos": "PRON", "Number": "sing", "PronType": "int"},
"PRON__PronType=Dem": {"pos": "PRON", "PronType": "dem"},
"PRON__PronType=Ind": {"pos": "PRON", "PronType": "ind"},
"PRON__PronType=Int": {"pos": "PRON", "PronType": "int"},
"PRON__Reflex=Yes": {"pos": "PRON", "Reflex": "yes"},
"PROPN__Abbr=Yes": {"pos": "PROPN", "Other": {"Abbr": "yes"}},
"PROPN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "dat", "Gender": "fem", "Number": "sing"},
"PROPN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"},
"PROPN__Case=Gen|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}},
"PROPN__Case=Gen|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}},
"PROPN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
"PROPN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"PROPN__Case=Gen|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Other": {"Form": "len"}},
"PROPN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"PROPN__Case=Gen|Form=Len|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Other": {"Form": "len"}},
"PROPN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing"},
"PROPN__Case=Gen|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem"},
"PROPN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
"PROPN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing"},
"PROPN__Case=Gen|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc"},
"PROPN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"},
"PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"},
"PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"},
"PROPN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
"PROPN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
"PROPN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}},
"PROPN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"PROPN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"PROPN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
"PROPN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
"PROPN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
"PROPN__Case=NomAcc|Gender=Masc": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc"},
"PROPN__Case=Voc|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "voc", "Gender": "fem", "Other": {"Form": "len"}},
"PROPN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "voc", "Gender": "masc", "Number": "sing"},
"PROPN__Gender=Masc|Number=Sing": {"pos": "PROPN", "Gender": "masc", "Number": "sing"},
"PROPN___": {"pos": "PROPN"},
"PUNCT___": {"pos": "PUNCT"},
"SCONJ___": {"pos": "SCONJ"},
"SCONJ__Tense=Past|VerbForm=Cop": {"pos": "SCONJ", "Tense": "past", "Other": {"VerbForm": "cop"}},
"SCONJ__VerbForm=Cop": {"pos": "SCONJ", "Other": {"VerbForm": "cop"}},
"SYM__Abbr=Yes": {"pos": "SYM", "Other": {"Abbr": "yes"}},
"VERB__Case=NomAcc|Gender=Masc|Mood=Ind|Number=Sing|Tense=Pres": {"pos": "VERB", "Case": "nom|acc", "Gender": "masc", "Mood": "ind", "Number": "sing", "Tense": "pres"},
"VERB__Dialect=Munster|Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}},
"VERB__Foreign=Yes": {"pos": "VERB", "Foreign": "yes"},
"VERB__Form=Ecl|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl", "Voice": "auto"}},
"VERB__Form=Ecl|Mood=Imp|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}},
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}},
"VERB__Form=Ecl|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl", "Voice": "auto"}},
"VERB__Form=Ecl|Mood=Sub|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Tense": "pres", "Other": {"Form": "ecl"}},
"VERB__Form=Ecl": {"pos": "VERB", "Other": {"Form": "ecl"}},
"VERB__Form=Emp|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}},
"VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "PronType": "rel", "Tense": "pres", "Other": {"Form": "emp"}},
"VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}},
"VERB__Form=Len|Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3, "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Cnd|Number=Sing|Person=2": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 2, "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Imp|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Imp|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Imp|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "fut", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len"}},
"VERB__Form=Len|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}},
"VERB__Form=Len|Mood=Sub|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
"VERB__Form=Len|Polarity=Neg": {"pos": "VERB", "Polarity": "neg", "Other": {"Form": "len"}},
"VERB__Form=Len": {"pos": "VERB", "Other": {"Form": "len"}},
"VERB__Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3},
"VERB__Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1},
"VERB__Mood=Cnd": {"pos": "VERB", "Mood": "cnd"},
"VERB__Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Voice": "auto"}},
"VERB__Mood=Imp|Number=Plur|Person=1|Polarity=Neg": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1, "Polarity": "neg"},
"VERB__Mood=Imp|Number=Plur|Person=1": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1},
"VERB__Mood=Imp|Number=Plur|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 2},
"VERB__Mood=Imp|Number=Sing|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 2},
"VERB__Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past"},
"VERB__Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past"},
"VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres"},
"VERB__Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past"},
"VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres"},
"VERB__Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Voice": "auto"}},
"VERB__Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres"},
"VERB__Mood=Ind|PronType=Rel|Tense=Fut": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "fut"},
"VERB__Mood=Ind|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "pres"},
"VERB__Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut"},
"VERB__Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Voice": "auto"}},
"VERB__Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past"},
"VERB__Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Voice": "auto"}},
"VERB__Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres"},
"VERB__Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Voice": "auto"}},
"VERB___": {"pos": "VERB"},
"X__Abbr=Yes": {"pos": "X", "Other": {"Abbr": "yes"}},
"X__Case=NomAcc|Foreign=Yes|Gender=Fem|Number=Sing": {"pos": "X", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Foreign": "yes"},
"X__Definite=Def|Dialect=Ulster": {"pos": "X", "Definite": "def", "Other": {"Dialect": "ulster"}},
"X__Dialect=Munster|Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "X", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}},
"X__Dialect=Munster|Mood=Imp|Number=Sing|Person=2|Polarity=Neg": {"pos": "X", "Mood": "imp", "Number": "sing", "Person": 2, "Polarity": "neg", "Other": {"Dialect": "munster"}},
"X__Dialect=Munster|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "X", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Voice": "auto"}},
"X__Dialect=Munster": {"pos": "X", "Other": {"Dialect": "munster"}},
"X__Dialect=Munster|PronType=Dem": {"pos": "X", "PronType": "dem", "Other": {"Dialect": "munster"}},
"X__Dialect=Ulster|Gender=Masc|Number=Sing|Person=3": {"pos": "X", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"Dialect": "ulster"}},
"X__Dialect=Ulster|PartType=Vb|Polarity=Neg": {"pos": "X", "Polarity": "neg", "Other": {"Dialect": "ulster", "PartType": "vb"}},
"X__Dialect=Ulster|VerbForm=Cop": {"pos": "X", "Other": {"Dialect": "ulster", "VerbForm": "cop"}},
"X__Foreign=Yes": {"pos": "X", "Foreign": "yes"},
"X___": {"pos": "X"}
}
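
Each entry above maps one fine-grained Irish tag to a coarse POS plus morphological features, with anything spaCy does not treat as a first-class feature tucked under "Other". A minimal sketch of how such an entry resolves, with one entry reproduced as a stand-in for the full mapping:

# Minimal sketch: resolving one fine-grained tag from the map above.
TAG_MAP = {
    "NOUN__Case=NomAcc|Gender=Fem|Number=Sing":
        {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
}

analysis = TAG_MAP["NOUN__Case=NomAcc|Gender=Fem|Number=Sing"]
coarse_pos = analysis["pos"]  # coarse tag the tagger assigns: 'NOUN'
features = {k: v for k, v in analysis.items() if k != "pos"}
# features -> {'Case': 'nom|acc', 'Gender': 'fem', 'Number': 'sing'}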

View File

@ -0,0 +1,86 @@
# encoding: utf8
from __future__ import unicode_literals
from ...symbols import POS, DET, ADP, CCONJ, ADV, NOUN, X, AUX
from ...symbols import ORTH, LEMMA, NORM
_exc = {
"'acha'n": [
{ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET},
{ORTH: "a'n", LEMMA: "aon", NORM: "aon", POS: DET}],
"dem'": [
{ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP},
{ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}],
"ded'": [
{ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP},
{ORTH: "d'", LEMMA: "do", NORM: "do", POS: DET}],
"lem'": [
{ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP},
{ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}],
"led'": [
{ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP},
{ORTH: "d'", LEMMA: "mo", NORM: "do", POS: DET}]
}
for exc_data in [
{ORTH: "'gus", LEMMA: "agus", NORM: "agus", POS: CCONJ},
{ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET},
{ORTH: "ao'", LEMMA: "aon", NORM: "aon"},
{ORTH: "'niar", LEMMA: "aniar", NORM: "aniar", POS: ADV},
{ORTH: "'níos", LEMMA: "aníos", NORM: "aníos", POS: ADV},
{ORTH: "'ndiu", LEMMA: "inniu", NORM: "inniu", POS: ADV},
{ORTH: "'nocht", LEMMA: "anocht", NORM: "anocht", POS: ADV},
{ORTH: "m'", LEMMA: "mo", POS: DET},
{ORTH: "Aib.", LEMMA: "Aibreán", POS: NOUN},
{ORTH: "Ath.", LEMMA: "athair", POS: NOUN},
{ORTH: "Beal.", LEMMA: "Bealtaine", POS: NOUN},
{ORTH: "a.C.n.", LEMMA: "ante Christum natum", POS: X},
{ORTH: "m.sh.", LEMMA: "mar shampla", POS: ADV},
{ORTH: "M.F.", LEMMA: "Meán Fómhair", POS: NOUN},
{ORTH: "M.Fómh.", LEMMA: "Meán Fómhair", POS: NOUN},
{ORTH: "D.F.", LEMMA: "Deireadh Fómhair", POS: NOUN},
{ORTH: "D.Fómh.", LEMMA: "Deireadh Fómhair", POS: NOUN},
{ORTH: "r.C.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "R.C.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "r.Ch.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "r.Chr.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "R.Ch.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "R.Chr.", LEMMA: "roimh Chríost", POS: ADV},
{ORTH: "⁊rl.", LEMMA: "agus araile", POS: ADV},
{ORTH: "srl.", LEMMA: "agus araile", POS: ADV},
{ORTH: "Co.", LEMMA: "contae", POS: NOUN},
{ORTH: "Ean.", LEMMA: "Eanáir", POS: NOUN},
{ORTH: "Feab.", LEMMA: "Feabhra", POS: NOUN},
{ORTH: "gCo.", LEMMA: "contae", POS: NOUN},
{ORTH: ".i.", LEMMA: "eadhon", POS: ADV},
{ORTH: "B'", LEMMA: "ba", POS: AUX},
{ORTH: "b'", LEMMA: "ba", POS: AUX},
{ORTH: "lch.", LEMMA: "leathanach", POS: NOUN},
{ORTH: "Lch.", LEMMA: "leathanach", POS: NOUN},
{ORTH: "lgh.", LEMMA: "leathanach", POS: NOUN},
{ORTH: "Lgh.", LEMMA: "leathanach", POS: NOUN},
{ORTH: "Lún.", LEMMA: "Lúnasa", POS: NOUN},
{ORTH: "Már.", LEMMA: "Márta", POS: NOUN},
{ORTH: "Meith.", LEMMA: "Meitheamh", POS: NOUN},
{ORTH: "Noll.", LEMMA: "Nollaig", POS: NOUN},
{ORTH: "Samh.", LEMMA: "Samhain", POS: NOUN},
{ORTH: "tAth.", LEMMA: "athair", POS: NOUN},
{ORTH: "tUas.", LEMMA: "Uasal", POS: NOUN},
{ORTH: "teo.", LEMMA: "teoranta", POS: NOUN},
{ORTH: "Teo.", LEMMA: "teoranta", POS: NOUN},
{ORTH: "Uas.", LEMMA: "Uasal", POS: NOUN},
{ORTH: "uimh.", LEMMA: "uimhir", POS: NOUN},
{ORTH: "Uimh.", LEMMA: "uimhir", POS: NOUN}]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"d'", "D'"]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc
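
A quick sketch of what the exceptions above buy, assuming the new Irish class is exposed as spacy.lang.ga.Irish (as the conftest fixture further down suggests):

# Sketch: abbreviations like 'm.sh.' stay single tokens; contractions like
# "dem'" split into their underlying words.
from spacy.lang.ga import Irish

nlp = Irish()
doc = nlp(u"m.sh. tusa agus mise")
print([t.text for t in doc])  # ['m.sh.', 'tusa', 'agus', 'mise']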

View File

@ -95,5 +95,5 @@ _nums = "(({ne})|({t})|({on})|({c}))({s})?".format(
c=CURRENCY, s=_suffixes)
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile("^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match

View File

@ -46,5 +46,4 @@ for orth in [
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc

View File

@ -35,4 +35,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc

View File

@ -20,4 +20,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc

View File

@ -72,4 +72,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc

View File

@ -80,4 +80,4 @@ for orth in [
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = dict(_exc)
TOKENIZER_EXCEPTIONS = _exc

View File

@ -2,10 +2,10 @@
# data from Korakot Chaovavanich (https://www.facebook.com/photo.php?fbid=390564854695031&set=p.390564854695031&type=3&permPage=1&ifg=1)
from __future__ import unicode_literals
from ...symbols import *
from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX
from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ
TAG_MAP = {
#NOUN
# NOUN
"NOUN": {POS: NOUN},
"NCMN": {POS: NOUN},
"NTTL": {POS: NOUN},
@ -14,7 +14,7 @@ TAG_MAP = {
"CMTR": {POS: NOUN},
"CFQC": {POS: NOUN},
"CVBL": {POS: NOUN},
#PRON
# PRON
"PRON": {POS: PRON},
"NPRP": {POS: PRON},
# ADJ
@ -28,7 +28,7 @@ TAG_MAP = {
"ADVI": {POS: ADV},
"ADVP": {POS: ADV},
"ADVS": {POS: ADV},
#INT
# INT
"INT": {POS: INTJ},
# PROPN
"PROPN": {POS: PROPN},
@ -50,20 +50,20 @@ TAG_MAP = {
"NCNM": {POS: NUM},
"NLBL": {POS: NUM},
"DCNM": {POS: NUM},
#AUX
# AUX
"AUX": {POS: AUX},
"XVBM": {POS: AUX},
"XVAM": {POS: AUX},
"XVMM": {POS: AUX},
"XVBB": {POS: AUX},
"XVAE": {POS: AUX},
#ADP
# ADP
"ADP": {POS: ADP},
"RPRE": {POS: ADP},
# CCONJ
"CCONJ": {POS: CCONJ},
"JCRG": {POS: CCONJ},
#SCONJ
# SCONJ
"SCONJ": {POS: SCONJ},
"PREL": {POS: SCONJ},
"JSBR": {POS: SCONJ},

View File

@ -1,43 +1,23 @@
# encoding: utf8
from __future__ import unicode_literals
from ...symbols import *
from ...symbols import ORTH, LEMMA
TOKENIZER_EXCEPTIONS = {
"ม.ค.": [
{ORTH: "ม.ค.", LEMMA: "มกราคม"}
],
"ก.พ.": [
{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}
],
"มี.ค.": [
{ORTH: "มี.ค.", LEMMA: "มีนาคม"}
],
"เม.ย.": [
{ORTH: "เม.ย.", LEMMA: "เมษายน"}
],
"พ.ค.": [
{ORTH: "พ.ค.", LEMMA: "พฤษภาคม"}
],
"มิ.ย.": [
{ORTH: "มิ.ย.", LEMMA: "มิถุนายน"}
],
"ก.ค.": [
{ORTH: "ก.ค.", LEMMA: "กรกฎาคม"}
],
"ส.ค.": [
{ORTH: "ส.ค.", LEMMA: "สิงหาคม"}
],
"ก.ย.": [
{ORTH: "ก.ย.", LEMMA: "กันยายน"}
],
"ต.ค.": [
{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}
],
"พ.ย.": [
{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}
],
"ธ.ค.": [
{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}
]
_exc = {
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
"เม.ย.": [{ORTH: "เม.ย.", LEMMA: "เมษายน"}],
"พ.ค.": [{ORTH: "พ.ค.", LEMMA: "พฤษภาคม"}],
"มิ.ย.": [{ORTH: "มิ.ย.", LEMMA: "มิถุนายน"}],
"ก.ค.": [{ORTH: "ก.ค.", LEMMA: "กรกฎาคม"}],
"ส.ค.": [{ORTH: "ส.ค.", LEMMA: "สิงหาคม"}],
"ก.ย.": [{ORTH: "ก.ย.", LEMMA: "กันยายน"}],
"ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
"พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
"ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}]
}
TOKENIZER_EXCEPTIONS = _exc

View File

@ -154,6 +154,9 @@ class Language(object):
self._meta.setdefault('email', '')
self._meta.setdefault('url', '')
self._meta.setdefault('license', '')
self._meta['vectors'] = {'width': self.vocab.vectors_length,
'vectors': len(self.vocab.vectors),
'keys': self.vocab.vectors.n_keys}
self._meta['pipeline'] = self.pipe_names
return self._meta
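
With the lines above, a model's meta now advertises its vector table. A rough compatibility check built on those three fields ('en_core_web_sm' is a hypothetical package name here):

# Sketch: meta['vectors'] mirrors the vocab's vector table.
import spacy

nlp = spacy.load('en_core_web_sm')  # hypothetical package name
assert nlp.meta['vectors']['width'] == nlp.vocab.vectors_length
assert nlp.meta['vectors']['vectors'] == len(nlp.vocab.vectors)
assert nlp.meta['vectors']['keys'] == nlp.vocab.vectors.n_keys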
@ -433,8 +436,10 @@ class Language(object):
**cfg: Config parameters.
RETURNS: An optimizer
"""
if get_gold_tuples is None:
get_gold_tuples = lambda: []
# Populate vocab
if get_gold_tuples is not None:
else:
for _, annots_brackets in get_gold_tuples():
for annots, _ in annots_brackets:
for word in annots[1]:

View File

@ -11,9 +11,9 @@ import ujson
import msgpack
from thinc.api import chain
from thinc.v2v import Softmax
from thinc.v2v import Affine, Softmax
from thinc.t2v import Pooling, max_pool, mean_pool
from thinc.neural.util import to_categorical
from thinc.neural.util import to_categorical, copy_array
from thinc.neural._classes.difference import Siamese, CauchySimilarity
from .tokens.doc cimport Doc
@ -130,6 +130,15 @@ class Pipe(object):
documents and their predicted scores."""
raise NotImplementedError
def add_label(self, label):
"""Add an output label, to be predicted by the model.
It's possible to extend pre-trained models with new labels,
but care should be taken to avoid the "catastrophic forgetting"
problem.
"""
raise NotImplementedError
def begin_training(self, gold_tuples=tuple(), pipeline=None):
"""Initialize the pipe for training, using data exampes if available.
If no model has been initialized yet, the model is added."""
@ -325,6 +334,14 @@ class Tagger(Pipe):
self.cfg.setdefault('pretrained_dims',
self.vocab.vectors.data.shape[1])
@property
def labels(self):
return self.cfg.setdefault('tag_names', [])
@labels.setter
def labels(self, value):
self.cfg['tag_names'] = value
def __call__(self, doc):
tags = self.predict([doc])
self.set_annotations([doc], tags)
@ -352,6 +369,7 @@ class Tagger(Pipe):
cdef Doc doc
cdef int idx = 0
cdef Vocab vocab = self.vocab
tags = list(self.labels)
for i, doc in enumerate(docs):
doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, 'get'):
@ -359,7 +377,7 @@ class Tagger(Pipe):
for j, tag_id in enumerate(doc_tag_ids):
# Don't clobber preset POS tags
if doc.c[j].tag == 0 and doc.c[j].pos == 0:
vocab.morphology.assign_tag_id(&doc.c[j], tag_id)
vocab.morphology.assign_tag(&doc.c[j], tags[tag_id])
idx += 1
doc.is_tagged = True
@ -420,6 +438,17 @@ class Tagger(Pipe):
def Model(cls, n_tags, **cfg):
return build_tagger_model(n_tags, **cfg)
def add_label(self, label):
if label in self.labels:
return 0
smaller = self.model[-1]._layers[-1]
larger = Softmax(len(self.labels)+1, smaller.nI)
copy_array(larger.W[:smaller.nO], smaller.W)
copy_array(larger.b[:smaller.nO], smaller.b)
self.model[-1]._layers[-1] = larger
self.labels.append(label)
return 1
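
The resize above copies the trained softmax weights into a layer one row larger, so only the new label starts untrained. A hedged usage sketch (model and tag names hypothetical):

# Sketch: extending a trained tagger with a new tag at runtime.
import spacy

nlp = spacy.load('en_core_web_sm')   # hypothetical package name
tagger = nlp.get_pipe('tagger')
added = tagger.add_label('NNX')      # 1 if newly added, 0 if already known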
def use_params(self, params):
with self.model.use_params(params):
yield
@ -675,7 +704,7 @@ class TextCategorizer(Pipe):
@property
def labels(self):
return self.cfg.get('labels', ['LABEL'])
return self.cfg.setdefault('labels', ['LABEL'])
@labels.setter
def labels(self, value):
@ -727,6 +756,17 @@ class TextCategorizer(Pipe):
mean_square_error = ((scores-truths)**2).sum(axis=1).mean()
return mean_square_error, d_scores
def add_label(self, label):
if label in self.labels:
return 0
smaller = self.model[-1]._layers[-1]
larger = Affine(len(self.labels)+1, smaller.nI)
copy_array(larger.W[:smaller.nO], smaller.W)
copy_array(larger.b[:smaller.nO], smaller.b)
self.model[-1]._layers[-1] = larger
self.labels.append(label)
return 1
def begin_training(self, gold_tuples=tuple(), pipeline=None):
if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
token_vector_width = pipeline[0].model.nO

View File

@ -14,9 +14,8 @@ from .. import util
# These languages are used for generic tokenizer tests. Only add a language
# here if it's using spaCy's tokenizer (not a different library)
# TODO: re-implement generic tokenizer tests
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'id',
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
'it', 'nb', 'nl', 'pl', 'pt', 'sv', 'xx']
_models = {'en': ['en_core_web_sm'],
'de': ['de_core_news_md'],
'fr': ['fr_depvec_web_lg'],
@ -108,6 +107,11 @@ def bn_tokenizer():
return util.get_lang_class('bn').Defaults.create_tokenizer()
@pytest.fixture
def ga_tokenizer():
return util.get_lang_class('ga').Defaults.create_tokenizer()
@pytest.fixture
def he_tokenizer():
return util.get_lang_class('he').Defaults.create_tokenizer()

View File

@ -208,8 +208,8 @@ def test_doc_api_right_edge(en_tokenizer):
def test_doc_api_has_vector():
vocab = Vocab()
vocab.clear_vectors(2)
vocab.vectors.add('kitten', vector=numpy.asarray([0., 2.], dtype='f'))
vocab.reset_vectors(width=2)
vocab.set_vector('kitten', vector=numpy.asarray([0., 2.], dtype='f'))
doc = Doc(vocab, words=['kitten'])
assert doc.has_vector

View File

@ -72,9 +72,9 @@ def test_doc_token_api_is_properties(en_vocab):
def test_doc_token_api_vectors():
vocab = Vocab()
vocab.clear_vectors(2)
vocab.vectors.add('apples', vector=numpy.asarray([0., 2.], dtype='f'))
vocab.vectors.add('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
vocab.reset_vectors(width=2)
vocab.set_vector('apples', vector=numpy.asarray([0., 2.], dtype='f'))
vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
assert doc.has_vector
@ -155,13 +155,13 @@ def test_doc_token_api_head_setter(en_tokenizer):
assert doc[2].left_edge.i == 0
def test_sent_start(en_tokenizer):
def test_is_sent_start(en_tokenizer):
doc = en_tokenizer(u'This is a sentence. This is another.')
assert not doc[0].sent_start
assert not doc[5].sent_start
doc[5].sent_start = True
assert doc[5].sent_start
assert not doc[0].sent_start
assert doc[5].is_sent_start is None
doc[5].is_sent_start = True
assert doc[5].is_sent_start is True
# Backwards compatibility
assert doc[0].sent_start is False
doc.is_parsed = True
assert len(list(doc.sents)) == 2

View File

@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
GA_TOKEN_EXCEPTION_TESTS = [
('Niall Ó Domhnaill, Rialtas na hÉireann 1977 (lch. 600).', ['Niall', 'Ó', 'Domhnaill', ',', 'Rialtas', 'na', 'hÉireann', '1977', '(', 'lch.', '600', ')', '.']),
('Daoine a bhfuil Gaeilge acu, m.sh. tusa agus mise', ['Daoine', 'a', 'bhfuil', 'Gaeilge', 'acu', ',', 'm.sh.', 'tusa', 'agus', 'mise'])
]
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
tokens = ga_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list

View File

@ -118,8 +118,7 @@ def test_span_to_array(doc):
assert arr[0, 1] == len(span[0])
@pytest.mark.xfail
def test_span_as_doc(doc):
span = doc[4:10]
span_doc = span.as_doc()
assert span.text == span_doc.text
assert span.text == span_doc.text.strip()

View File

@ -79,9 +79,9 @@ def add_vecs_to_vocab(vocab, vectors):
"""Add list of vector tuples to given vocab. All vectors need to have the
same length. Format: [("text", [1, 2, 3])]"""
length = len(vectors[0][1])
vocab.clear_vectors(length)
vocab.reset_vectors(width=length)
for word, vec in vectors:
vocab.set_vector(word, vec)
vocab.set_vector(word, vector=vec)
return vocab

View File

@ -3,6 +3,7 @@ from __future__ import unicode_literals
from ...vectors import Vectors
from ...tokenizer import Tokenizer
from ...strings import hash_string
from ..util import add_vecs_to_vocab, get_doc
import numpy
@ -35,20 +36,19 @@ def vocab(en_vocab, vectors):
def test_init_vectors_with_data(strings, data):
v = Vectors(strings, data=data)
v = Vectors(data=data)
assert v.shape == data.shape
def test_init_vectors_with_width(strings):
v = Vectors(strings, width=3)
for string in strings:
v.add(string)
def test_init_vectors_with_shape(strings):
v = Vectors(shape=(len(strings), 3))
assert v.shape == (len(strings), 3)
def test_get_vector(strings, data):
v = Vectors(strings, data=data)
for string in strings:
v.add(string)
v = Vectors(data=data)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(data[0])
assert list(v[strings[0]]) != list(data[1])
assert list(v[strings[1]]) != list(data[0])
@ -56,9 +56,10 @@ def test_get_vector(strings, data):
def test_set_vector(strings, data):
orig = data.copy()
v = Vectors(strings, data=data)
for string in strings:
v.add(string)
v = Vectors(data=data)
strings = [hash_string(s) for s in strings]
for i, string in enumerate(strings):
v.add(string, row=i)
assert list(v[strings[0]]) == list(orig[0])
assert list(v[strings[0]]) != list(orig[1])
v[strings[0]] = data[1]
@ -66,7 +67,6 @@ def test_set_vector(strings, data):
assert list(v[strings[0]]) != list(orig[0])
@pytest.fixture()
def tokenizer_v(vocab):
return Tokenizer(vocab, {}, None, None, None)

View File

@ -2,14 +2,39 @@
from __future__ import unicode_literals
import numpy
import pytest
from numpy.testing import assert_allclose
from ...vocab import Vocab
from ..._ml import cosine
@pytest.mark.xfail
@pytest.mark.parametrize('text', ["Hello"])
def test_vocab_add_vector(en_vocab, text):
en_vocab.resize_vectors(10)
lex = en_vocab[text]
lex.vector = numpy.ndarray((10,), dtype='float32')
lex = en_vocab[text]
assert lex.vector.shape == (10,)
def test_vocab_add_vector():
vocab = Vocab()
data = numpy.ndarray((5,3), dtype='f')
data[0] = 1.
data[1] = 2.
vocab.set_vector(u'cat', data[0])
vocab.set_vector(u'dog', data[1])
cat = vocab[u'cat']
assert list(cat.vector) == [1., 1., 1.]
dog = vocab[u'dog']
assert list(dog.vector) == [2., 2., 2.]
def test_vocab_prune_vectors():
vocab = Vocab()
_ = vocab[u'cat']
_ = vocab[u'dog']
_ = vocab[u'kitten']
data = numpy.ndarray((5,3), dtype='f')
data[0] = 1.
data[1] = 2.
data[2] = 1.1
vocab.set_vector(u'cat', data[0])
vocab.set_vector(u'dog', data[1])
vocab.set_vector(u'kitten', data[2])
remap = vocab.prune_vectors(2)
assert list(remap.keys()) == [u'kitten']
neighbour, similarity = list(remap.values())[0]
assert neighbour == u'cat', remap
assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-6)

View File

@ -28,7 +28,7 @@ from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, SENT_START
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle
from ..compat import is_config, copy_reg, pickle, basestring_
from .. import about
from .. import util
from .underscore import Underscore
@ -571,7 +571,8 @@ cdef class Doc:
cdef np.ndarray[attr_t, ndim=1] attr_ids
cdef np.ndarray[attr_t, ndim=2] output
# Handle scalar/list inputs of strings/ints for py_attr_ids
if not hasattr(py_attr_ids, '__iter__'):
if not hasattr(py_attr_ids, '__iter__') \
and not isinstance(py_attr_ids, basestring_):
py_attr_ids = [py_attr_ids]
# Allow strings, e.g. 'lemma' or 'LEMMA'

View File

@ -474,17 +474,15 @@ cdef class Span:
"""RETURNS (int): The number of leftward immediate children of the
span, in the syntactic dependency parse.
"""
# TODO: implement
def __get__(self):
raise NotImplementedError
return len(list(self.lefts))
property n_rights:
"""RETURNS (int): The number of rightward immediate children of the
span, in the syntactic dependency parse.
"""
# TODO: implement
def __get__(self):
raise NotImplementedError
return len(list(self.rights))
property subtree:
"""Tokens that descend from tokens in the span, but fall outside it.

View File

@ -302,10 +302,7 @@ cdef class Token:
def __get__(self):
if 'vector' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['vector'](self)
if self.has_vector:
return self.vocab.get_vector(self.c.lex.orth)
else:
return self.doc.tensor[self.i]
return self.vocab.get_vector(self.c.lex.orth)
property vector_norm:
"""The L2 norm of the token's vector representation.
@ -333,9 +330,29 @@ cdef class Token:
return self.c.r_kids
property sent_start:
# TODO: fix and document
# TODO deprecation warning
def __get__(self):
return self.c.sent_start
# Handle broken backwards compatibility case: doc[0].sent_start
# was False.
if self.i == 0:
return False
else:
return self.is_sent_start
def __set__(self, value):
self.is_sent_start = value
property is_sent_start:
"""RETURNS (bool / None): Whether the token starts a sentence.
None if unknown.
"""
def __get__(self):
if self.c.sent_start == 0:
return None
elif self.c.sent_start < 0:
return False
else:
return True
def __set__(self, value):
if self.doc.is_parsed:

View File

@ -236,8 +236,6 @@ def is_in_jupyter():
def get_cuda_stream(require=False):
# TODO: Error and tell to install chainer if not found
# Requires GPU
return CudaStream() if CudaStream is not None else None

View File

@ -10,11 +10,17 @@ cimport numpy as np
from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model
from .strings cimport StringStore
from .strings cimport StringStore, hash_string
from .compat import basestring_, path2str
from . import util
def unpickle_vectors(keys_and_rows, data):
vectors = Vectors(data=data)
for key, row in keys_and_rows:
vectors.add(key, row=row)
return vectors
cdef class Vectors:
"""Store, save and load word vectors.
@ -22,138 +28,36 @@ cdef class Vectors:
instance of numpy.ndarray (for CPU vectors) or cupy.ndarray
(for GPU vectors). `vectors.key2row` is a dictionary mapping word hashes to
rows in the vectors.data table.
Multiple keys can be mapped to the same vector, so len(keys) may be greater
(but not smaller) than data.shape[0].
Multiple keys can be mapped to the same vector, and not all of the rows in
the table need to be assigned --- so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0].
"""
cdef public object data
cdef readonly StringStore strings
cdef public object key2row
cdef public object keys
cdef public int i
cdef public object _unset
def __init__(self, strings, width=0, data=None):
"""Create a new vector store. To keep the vector table empty, pass
`width=0`. You can also create the vector table and add vectors one by
one, or set the vector values directly on initialisation.
def __init__(self, *, shape=None, data=None, keys=None):
"""Create a new vector store.
strings (StringStore or list): List of strings or StringStore that maps
strings to hash values, and vice versa.
width (int): Number of dimensions.
shape (tuple): Size of the table, as (# entries, # columns)
data (numpy.ndarray): The vector data.
keys (iterable): A sequence of keys, aligned with the data.
RETURNS (Vectors): The newly created object.
"""
if isinstance(strings, StringStore):
self.strings = strings
if data is None:
if shape is None:
shape = (0,0)
data = numpy.zeros(shape, dtype='f')
self.data = data
self.key2row = OrderedDict()
if self.data is not None:
self._unset = set(range(self.data.shape[0]))
else:
self.strings = StringStore()
for string in strings:
self.strings.add(string)
if data is not None:
self.data = numpy.asarray(data, dtype='f')
else:
self.data = numpy.zeros((len(self.strings), width), dtype='f')
self.i = 0
self.key2row = {}
self.keys = numpy.zeros((self.data.shape[0],), dtype='uint64')
if data is not None:
for i, string in enumerate(self.strings):
if i >= self.data.shape[0]:
break
self.add(self.strings[string], vector=self.data[i])
def __reduce__(self):
return (Vectors, (self.strings, self.data))
def __getitem__(self, key):
"""Get a vector by key. If key is a string, it is hashed to an integer
ID using the vectors.strings table. If the integer key is not found in
the table, a KeyError is raised.
key (unicode / int): The key to get the vector for.
RETURNS (numpy.ndarray): The vector for the key.
"""
if isinstance(key, basestring):
key = self.strings[key]
i = self.key2row[key]
if i is None:
raise KeyError(key)
else:
return self.data[i]
def __setitem__(self, key, vector):
"""Set a vector for the given key. If key is a string, it is hashed
to an integer ID using the vectors.strings table.
key (unicode / int): The key to set the vector for.
vector (numpy.ndarray): The vector to set.
"""
if isinstance(key, basestring):
key = self.strings.add(key)
i = self.key2row[key]
self.data[i] = vector
def __iter__(self):
"""Yield vectors from the table.
YIELDS (numpy.ndarray): A vector.
"""
yield from self.data
def __len__(self):
"""Return the number of vectors that have been assigned.
RETURNS (int): The number of vectors in the data.
"""
return self.i
def __contains__(self, key):
"""Check whether a key has a vector entry in the table.
key (unicode / int): The key to check.
RETURNS (bool): Whether the key has a vector entry.
"""
if isinstance(key, basestring_):
key = self.strings[key]
return key in self.key2row
def add(self, key, *, vector=None, row=None):
"""Add a key to the table. Keys can be mapped to an existing vector
by setting `row`, or a new vector can be added.
key (unicode / int): The key to add.
vector (numpy.ndarray / None): A vector to add for the key.
row (int / None): The row-number of a vector to map the key to.
"""
if isinstance(key, basestring_):
key = self.strings.add(key)
if key in self.key2row and row is None:
row = self.key2row[key]
elif key in self.key2row and row is not None:
self.key2row[key] = row
elif row is None:
row = self.i
self.i += 1
if row >= self.keys.shape[0]:
self.keys.resize((row*2,))
self.data.resize((row*2, self.data.shape[1]))
self.keys[row] = key
self.key2row[key] = row
self.keys[row] = key
if vector is not None:
self.data[row] = vector
return row
def items(self):
"""Iterate over `(string key, vector)` pairs, in order.
YIELDS (tuple): A key/vector pair.
"""
for i, key in enumerate(self.keys):
string = self.strings[key]
row = self.key2row[key]
yield string, self.data[row]
self._unset = set()
if keys is not None:
for i, key in enumerate(keys):
self.add(key, row=i)
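
A sketch of the constructor variants the new signature allows; string keys are hashed by `add`, so plain unicode works here:

# Sketch: constructing the new Vectors three ways.
import numpy
from spacy.vectors import Vectors

data = numpy.zeros((3, 300), dtype='f')
v1 = Vectors(shape=(1000, 300))    # empty table with 1000 unassigned rows
v2 = Vectors(data=data)            # wrap an existing array, no keys yet
v3 = Vectors(data=data, keys=[u'cat', u'dog', u'rat'])  # keys aligned to rows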
@property
def shape(self):
@ -164,9 +68,219 @@ cdef class Vectors:
"""
return self.data.shape
def most_similar(self, key):
# TODO: implement
raise NotImplementedError
@property
def size(self):
"""RETURNS (int): rows*dims"""
return self.data.shape[0] * self.data.shape[1]
@property
def is_full(self):
"""RETURNS (bool): `True` if no slots are available for new keys."""
return len(self._unset) == 0
@property
def n_keys(self):
"""RETURNS (int) The number of keys in the table. Note that this is the
number of all keys, not just unique vectors."""
return len(self.key2row)
def __reduce__(self):
keys_and_rows = tuple(self.key2row.items())
return (unpickle_vectors, (keys_and_rows, self.data))
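
Since `__reduce__` captures the key-to-row mapping alongside the data, a pickle round trip restores lookups; a quick sketch:

# Sketch: pickling keeps both the array and the key->row mapping.
import pickle
import numpy
from spacy.vectors import Vectors

v = Vectors(data=numpy.ones((2, 4), dtype='f'), keys=[u'cat', u'dog'])
v2 = pickle.loads(pickle.dumps(v))
assert list(v2.keys()) == list(v.keys())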
def __getitem__(self, key):
"""Get a vector by key. If the key is not found, a KeyError is raised.
key (int): The key to get the vector for.
RETURNS (ndarray): The vector for the key.
"""
i = self.key2row[key]
if i is None:
raise KeyError(key)
else:
return self.data[i]
def __setitem__(self, key, vector):
"""Set a vector for the given key.
key (int): The key to set the vector for.
vector (ndarray): The vector to set.
"""
i = self.key2row[key]
self.data[i] = vector
if i in self._unset:
self._unset.remove(i)
def __iter__(self):
"""Iterate over the keys in the table.
YIELDS (int): A key in the table.
"""
yield from self.key2row
def __len__(self):
"""Return the number of vectors in the table.
RETURNS (int): The number of vectors in the data.
"""
return self.data.shape[0]
def __contains__(self, key):
"""Check whether a key has been mapped to a vector entry in the table.
key (int): The key to check.
RETURNS (bool): Whether the key has a vector entry.
"""
return key in self.key2row
def resize(self, shape, inplace=False):
"""Resize the underlying vectors array. If inplace=True, the memory
is reallocated. This may cause other references to the data to become
invalid, so only use inplace=True if you're sure that's what you want.
If the number of vectors is reduced, keys mapped to rows that have been
deleted are removed. These removed items are returned as a list of
`(key, row)` tuples.
"""
if inplace:
self.data.resize(shape, refcheck=False)
else:
xp = get_array_module(self.data)
self.data = xp.resize(self.data, shape)
filled = {row for row in self.key2row.values()}
self._unset = {row for row in range(shape[0]) if row not in filled}
removed_items = []
for key, row in list(self.key2row.items()):
if row >= shape[0]:
self.key2row.pop(key)
removed_items.append((key, row))
return removed_items
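
When the table shrinks, keys whose rows fall past the new end are evicted and reported back, as this sketch shows:

# Sketch: shrinking reports the evicted (key, row) pairs.
import numpy
from spacy.vectors import Vectors

v = Vectors(data=numpy.zeros((4, 2), dtype='f'), keys=[u'a', u'b', u'c', u'd'])
removed = v.resize((2, 2))
assert len(removed) == 2  # the keys mapped to rows 2 and 3 were dropped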
def keys(self):
"""A sequence of the keys in the table.
RETURNS (iterable): The keys.
"""
return self.key2row.keys()
def values(self):
"""Iterate over vectors that have been assigned to at least one key.
Note that some vectors may be unassigned, so the number of vectors
returned may be less than the length of the vectors table.
YIELDS (ndarray): A vector in the table.
"""
for row, vector in enumerate(self.data):
if row not in self._unset:
yield vector
def items(self):
"""Iterate over `(key, vector)` pairs.
YIELDS (tuple): A key/vector pair.
"""
for key, row in self.key2row.items():
yield key, self.data[row]
def find(self, *, key=None, keys=None, row=None, rows=None):
"""Look up one or more keys by row, or vice versa.
key (unicode / int): Find the row that the given key points to.
Returns int, -1 if missing.
keys (iterable): Find rows that the keys point to.
Returns ndarray.
row (int): Find the first key that points to the row.
Returns int.
rows (iterable): Find the keys that point to the rows.
Returns ndarray.
RETURNS: The requested key, keys, row or rows.
"""
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
raise ValueError("One (and only one) keyword arg must be set.")
xp = get_array_module(self.data)
if key is not None:
if isinstance(key, basestring_):
key = hash_string(key)
return self.key2row.get(key, -1)
elif keys is not None:
keys = [hash_string(key) if isinstance(key, basestring_) else key
for key in keys]
rows = [self.key2row.get(key, -1) for key in keys]
return xp.asarray(rows, dtype='i')
else:
targets = set()
if row is not None:
targets.add(row)
else:
targets.update(rows)
results = []
for key, row in self.key2row.items():
if row in targets:
results.append(key)
targets.remove(row)
return xp.asarray(results, dtype='uint64')
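
`find` accepts exactly one of its four keyword arguments and converts in either direction between keys and rows; a sketch:

# Sketch: key<->row lookups with find().
import numpy
from spacy.vectors import Vectors

v = Vectors(data=numpy.zeros((2, 2), dtype='f'), keys=[u'cat', u'dog'])
assert v.find(key=u'cat') == 0        # row for one key, -1 if missing
rows = v.find(keys=[u'cat', u'dog'])  # ndarray([0, 1])
keys = v.find(rows=[0, 1])            # the hashes pointing at each row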
def add(self, key, *, vector=None, row=None):
"""Add a key to the table. Keys can be mapped to an existing vector
by setting `row`, or a new vector can be added.
key (int): The key to add.
vector (ndarray / None): A vector to add for the key.
row (int / None): The row number of a vector to map the key to.
RETURNS (int): The row the vector was added to.
"""
if isinstance(key, basestring_):
key = hash_string(key)
if row is None and key in self.key2row:
row = self.key2row[key]
elif row is None:
if self.is_full:
raise ValueError("Cannot add new key to vectors -- full")
row = min(self._unset)
self.key2row[key] = row
if vector is not None:
self.data[row] = vector
if row in self._unset:
self._unset.remove(row)
return row
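
The two `add` modes in action: store a vector in a fresh row, or alias a second key onto a row that already exists:

# Sketch: add() with a vector vs. add() as an alias.
import numpy
from spacy.vectors import Vectors

v = Vectors(shape=(2, 4))
row = v.add(u'cat', vector=numpy.ones((4,), dtype='f'))  # occupies a free row
v.add(u'kitty', row=row)          # second key, same underlying vector
assert v.find(key=u'kitty') == row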
def most_similar(self, queries, *, batch_size=1024):
"""For each of the given vectors, find the single entry most similar
to it, by cosine.
Queries are by vector. Results are returned as a `(keys, best_rows,
scores)` tuple. If `queries` is large, the calculations are performed in
chunks, to avoid consuming too much memory. You can set the `batch_size`
to control the size/space trade-off during the calculations.
queries (ndarray): An array with one or more vectors.
batch_size (int): The batch size to use.
RETURNS (tuple): The most similar entry as a `(keys, best_rows, scores)`
tuple.
"""
xp = get_array_module(self.data)
vectors = self.data / xp.linalg.norm(self.data, axis=1, keepdims=True)
best_rows = xp.zeros((queries.shape[0],), dtype='i')
scores = xp.zeros((queries.shape[0],), dtype='f')
# Work in batches, to avoid memory problems.
for i in range(0, queries.shape[0], batch_size):
batch = queries[i : i+batch_size]
batch /= xp.linalg.norm(batch, axis=1, keepdims=True)
# batch e.g. (1024, 300)
# vectors e.g. (10000, 300)
# sims e.g. (1024, 10000)
sims = xp.dot(batch, vectors.T)
best_rows[i:i+batch_size] = sims.argmax(axis=1)
scores[i:i+batch_size] = sims.max(axis=1)
row2key = {row: key for key, row in self.key2row.items()}
keys = xp.asarray([row2key[row] for row in best_rows], dtype='uint64')
return (keys, best_rows, scores)
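
A sketch of the batched similarity search; note that the loop above normalises each query batch in place, so pass a copy if the raw queries are still needed:

# Sketch: nearest entry by cosine, batched over the query rows.
import numpy
from spacy.vectors import Vectors

v = Vectors(data=numpy.asarray([[1., 0.], [0., 1.]], dtype='f'),
            keys=[u'left', u'up'])
queries = numpy.asarray([[0.9, 0.1]], dtype='f')
keys, best_rows, scores = v.most_similar(queries.copy())
assert best_rows[0] == 0  # nearest to the 'left' vector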
def from_glove(self, path):
"""Load GloVe vectors from a directory. Assumes binary format,
@ -176,27 +290,32 @@ cdef class Vectors:
By default GloVe outputs 64-bit vectors.
path (unicode / Path): The path to load the GloVe vectors from.
RETURNS: A `StringStore` object, holding the key-to-string mapping.
"""
path = util.ensure_path(path)
width = None
for name in path.iterdir():
if name.parts[-1].startswith('vectors'):
_, dims, dtype, _2 = name.parts[-1].split('.')
self.width = int(dims)
width = int(dims)
break
else:
raise IOError("Expected file named e.g. vectors.128.f.bin")
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
dtype=dtype)
xp = get_array_module(self.data)
self.data = None
with bin_loc.open('rb') as file_:
self.data = numpy.fromfile(file_, dtype='float64')
self.data = numpy.ascontiguousarray(self.data, dtype='float32')
self.data = xp.fromfile(file_, dtype=dtype)
if dtype != 'float32':
self.data = xp.ascontiguousarray(self.data, dtype='float32')
n = 0
strings = StringStore()
with (path / 'vocab.txt').open('r') as file_:
for line in file_:
self.add(line.strip())
n += 1
if (self.data.size % self.width) == 0:
self.data
for i, line in enumerate(file_):
key = strings.add(line.strip())
self.add(key, row=i)
return strings
def to_disk(self, path, **exclude):
"""Save the current state to a directory.
@ -212,7 +331,7 @@ cdef class Vectors:
save_array = lambda arr, file_: xp.save(file_, arr)
serializers = OrderedDict((
('vectors', lambda p: save_array(self.data, p.open('wb'))),
('keys', lambda p: xp.save(p.open('wb'), self.keys))
('key2row', lambda p: msgpack.dump(self.key2row, p.open('wb')))
))
return util.to_disk(path, serializers, exclude)
@ -223,12 +342,18 @@ cdef class Vectors:
path (unicode / Path): Directory path, string or Path-like object.
RETURNS (Vectors): The modified object.
"""
def load_key2row(path):
if path.exists():
self.key2row = msgpack.load(path.open('rb'))
for key, row in self.key2row.items():
if row in self._unset:
self._unset.remove(row)
def load_keys(path):
if path.exists():
self.keys = numpy.load(path2str(path))
for i, key in enumerate(self.keys):
self.keys[i] = key
self.key2row[key] = i
keys = numpy.load(str(path))
for i, key in enumerate(keys):
self.add(key, row=i)
def load_vectors(path):
xp = Model.ops.xp
@ -236,6 +361,7 @@ cdef class Vectors:
self.data = xp.load(path)
serializers = OrderedDict((
('key2row', load_key2row),
('keys', load_keys),
('vectors', load_vectors),
))
@ -254,7 +380,7 @@ cdef class Vectors:
else:
return msgpack.dumps(self.data)
serializers = OrderedDict((
('keys', lambda: msgpack.dumps(self.keys)),
('key2row', lambda: msgpack.dumps(self.key2row)),
('vectors', serialize_weights)
))
return util.to_bytes(serializers, exclude)
@ -272,14 +398,8 @@ cdef class Vectors:
else:
self.data = msgpack.loads(b)
def load_keys(keys):
self.keys.resize((len(keys),))
for i, key in enumerate(keys):
self.keys[i] = key
self.key2row[key] = i
deserializers = OrderedDict((
('keys', lambda b: load_keys(msgpack.loads(b))),
('key2row', lambda b: self.key2row.update(msgpack.loads(b))),
('vectors', deserialize_weights)
))
util.from_bytes(data, deserializers, exclude)
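# A round-trip sketch of the serialization methods above; paths are illustrative.
vectors.to_disk('/path/to/vectors')
restored = Vectors(shape=vectors.shape)
restored.from_disk('/path/to/vectors')
# The same state can also travel as a byte string:
restored.from_bytes(vectors.to_bytes())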

View File

@ -55,7 +55,7 @@ cdef class Vocab:
_ = self[string]
self.lex_attr_getters = lex_attr_getters
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
self.vectors = Vectors(self.strings, width=0)
self.vectors = Vectors()
property lang:
def __get__(self):
@ -192,10 +192,11 @@ cdef class Vocab:
YIELDS (Lexeme): An entry in the vocabulary.
"""
cdef attr_t orth
cdef attr_t key
cdef size_t addr
for orth, addr in self._by_orth.items():
yield Lexeme(self, orth)
for key, addr in self._by_orth.items():
lex = Lexeme(self, key)
yield lex
def __getitem__(self, id_or_string):
"""Retrieve a lexeme, given an int ID or a unicode string. If a
@ -213,7 +214,7 @@ cdef class Vocab:
>>> assert nlp.vocab[apple] == nlp.vocab[u'apple']
"""
cdef attr_t orth
if type(id_or_string) == unicode:
if isinstance(id_or_string, unicode):
orth = self.strings.add(id_or_string)
else:
orth = id_or_string
@ -240,19 +241,23 @@ cdef class Vocab:
def vectors_length(self):
return self.vectors.data.shape[1]
def clear_vectors(self, width=None):
def reset_vectors(self, *, width=None, shape=None):
"""Drop the current vector table. Because all vectors must be the same
width, you have to call this to change the size of the vectors.
"""
if width is None:
width = self.vectors.data.shape[1]
self.vectors = Vectors(self.strings, width=width)
if width is not None and shape is not None:
raise ValueError("Only one of width and shape can be specified")
elif shape is not None:
self.vectors = Vectors(shape=shape)
else:
width = width if width is not None else self.vectors.data.shape[1]
self.vectors = Vectors(shape=(self.vectors.shape[0], width))
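# The two mutually exclusive call styles, as a sketch:
nlp.vocab.reset_vectors(width=300)           # keep rows, change the width
nlp.vocab.reset_vectors(shape=(10000, 300))  # replace with an empty table
# Passing both width and shape raises a ValueError.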
def prune_vectors(self, nr_row, batch_size=1024):
"""Reduce the current vector table to `nr_row` unique entries. Words
mapped to the discarded vectors will be remapped to the closest vector
among those remaining.
For example, suppose the original table had vectors for the words:
['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
two rows, we would discard the vectors for 'feline' and 'reclined'.
@ -263,28 +268,41 @@ cdef class Vocab:
The similarities are judged by cosine. The original vectors may
be large, so the cosines are calculated in minibatches, to reduce
memory usage.
nr_row (int): The number of rows to keep in the vector table.
batch_size (int): Batch size for calculating the similarities.
Larger batch sizes might be faster, while temporarily requiring
more memory.
RETURNS (dict): A dictionary keyed by removed words mapped to
`(string, score)` tuples, where `string` is the entry the removed
word was mapped to, and `score` the similarity score between the
two words.
"""
xp = get_array_module(self.vectors.data)
# Work in batches, to avoid memory problems.
keep = self.vectors.data[:nr_row]
toss = self.vectors.data[nr_row:]
# Normalize the vectors, so cosine similarity is just dot product.
# Note we can't modify the ones we're keeping in-place...
keep = keep / (xp.linalg.norm(keep)+1e-8)
keep = xp.ascontiguousarray(keep.T)
neighbours = xp.zeros((toss.shape[0],), dtype='i')
for i in range(0, toss.shape[0], batch_size):
batch = toss[i : i+batch_size]
batch /= xp.linalg.norm(batch)+1e-8
neighbours[i:i+batch_size] = xp.dot(batch, keep).argmax(axis=1)
for lex in self:
# If we're losing the vector for this word, map it to the nearest
# vector we're keeping.
if lex.rank >= nr_row:
lex.rank = neighbours[lex.rank-nr_row]
self.vectors.add(lex.orth, row=lex.rank)
# Make copy, to encourage the original table to be garbage collected.
self.vectors.data = xp.ascontiguousarray(self.vectors.data[:nr_row])
# Make prob negative so the sort is by descending probability,
# with row (the rank, stored in key2row) as tiebreaker.
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
for lex in self if lex.orth in self.vectors.key2row]
priority.sort()
indices = xp.asarray([i for (prob, i, key) in priority], dtype='i')
keys = xp.asarray([key for (prob, i, key) in priority], dtype='uint64')
keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]])
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
self.vectors = Vectors(data=keep, keys=keys)
syn_keys, syn_rows, scores = self.vectors.most_similar(toss)
remap = {}
for i, key in enumerate(keys[nr_row:]):
self.vectors.add(key, row=syn_rows[i])
word = self.strings[key]
synonym = self.strings[syn_keys[i]]
score = scores[i]
remap[word] = (synonym, score)
link_vectors_to_models(self)
return remap
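# A usage sketch of the remapping returned above; the numbers are illustrative.
remap = nlp.vocab.prune_vectors(10000)
# e.g. remap.get('reclined') might be ('sat', 0.83): 'reclined' now shares
# the row of its nearest remaining neighbour, 'sat'.
for word, (synonym, score) in remap.items():
    print(word, synonym, score)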
def get_vector(self, orth):
"""Retrieve a vector for a word in the vocabulary. Words can be looked
@ -306,8 +324,16 @@ cdef class Vocab:
"""Set a vector for a word in the vocabulary. Words can be referenced
by string or int ID.
"""
if not isinstance(orth, basestring_):
orth = self.strings[orth]
if isinstance(orth, basestring_):
orth = self.strings.add(orth)
if self.vectors.is_full and orth not in self.vectors:
new_rows = max(100, int(self.vectors.shape[0]*1.3))
if self.vectors.shape[1] == 0:
width = vector.size
else:
width = self.vectors.shape[1]
self.vectors.resize((new_rows, width))
self.vectors.add(orth, vector=vector)
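# A sketch of the growth behaviour above; the word and width are illustrative.
import numpy
# If the table is full, it is grown to max(100, rows * 1.3) before adding.
nlp.vocab.set_vector(u'platypus', numpy.random.uniform(-1, 1, (300,)))
assert nlp.vocab.get_vector(u'platypus').shape == (300,)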
def has_vector(self, orth):

View File

@ -84,8 +84,8 @@
],
"ALPHA": true,
"V_CSS": "2.0a1",
"V_JS": "2.0a0",
"V_CSS": "2.0a2",
"V_JS": "2.0a1",
"DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1",
"MAILCHIMP": {

View File

@ -41,9 +41,6 @@
- var comps = path.split('#');
- return "top-level#" + comps[0] + '.' + comps[1];
- }
- else if (path.startsWith('cli#')) {
- return "top-level#" + path.split('#')[1];
- }
- return path;
- }

View File

@ -281,7 +281,12 @@ mixin github(repo, file, height, alt_file, language)
figure.o-block
pre.c-code-block.o-block-small(class="lang-#{(language || DEFAULT_SYNTAX)}" style="height: #{height}px; min-height: #{height}px")
code.c-code-block__content(data-gh-embed="#{repo}/#{branch}/#{file}")
code.c-code-block__content(data-gh-embed="#{repo}/#{branch}/#{file}").
Can't fetch code example from GitHub :(
Please use the link below to view the example. If you've come across
a broken link, we always appreciate a pull request to the repository,
or a report on the issue tracker. Thanks!
footer.o-grid.u-text
.o-block-small.u-flex-full.u-padding-small #[+icon("github")] #[code.u-break.u-break--all=repo + '/' + (alt_file || file)]

View File

@ -20,7 +20,7 @@ for id in CURRENT_MODELS
p(data-tpl=id data-tpl-key="description")
div(data-tpl=id data-tpl-key="error" style="display: none")
div(data-tpl=id data-tpl-key="error")
+infobox
| Unable to load model details from GitHub. To find out more
| about this model, see the overview of the
@ -54,7 +54,7 @@ for id in CURRENT_MODELS
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
.o-empty(data-tpl=id data-tpl-key="compat-versions") &nbsp;
div(data-tpl=id data-tpl-key="compat-versions") &nbsp;
section(data-tpl=id data-tpl-key="benchmarks" style="display: none")
+grid.o-block-small

View File

@ -1,43 +1,86 @@
//- 💫 INCLUDES > SCRIPTS
if quickstart
script(src="/assets/js/quickstart.min.js")
script(src="/assets/js/vendor/quickstart.min.js")
if IS_PAGE
script(src="/assets/js/in-view.min.js")
script(src="/assets/js/vendor/in-view.min.js")
if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js")
script(src="/assets/js/prism.min.js")
script(src="/assets/js/main.js?v#{V_JS}")
script(src="/assets/js/vendor/prism.min.js")
if SECTION == "models"
script(src="/assets/js/vendor/chart.min.js")
script(src="/assets/js/models.js?v#{V_JS}" type="module")
script
| new ProgressBar('.js-progress');
if changelog
| new Changelog('!{SOCIAL.github}', 'spacy');
if quickstart
| new Quickstart("#qs");
if IS_PAGE
| new SectionHighlighter('data-section', 'data-nav');
| new GitHubEmbed('!{SOCIAL.github}', 'data-gh-embed');
| ((window.gitter = {}).chat = {}).options = {
| useStyles: false,
| activationElement: '.js-gitter-button',
| targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}'
| };
if HAS_MODELS
| new ModelLoader('!{MODELS_REPO}', !{JSON.stringify(CURRENT_MODELS)}, !{JSON.stringify(MODEL_LICENSES)}, !{JSON.stringify(MODEL_BENCHMARKS)});
if environment == "deploy"
| window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');
if IS_PAGE
script
| ((window.gitter = {}).chat = {}).options = {
| useStyles: false,
| activationElement: '.js-gitter-button',
| targetElement: '.js-gitter',
| room: '!{SOCIAL.gitter}'
| };
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)
//- JS modules: slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
- ProgressBar = "new ProgressBar('.js-progress');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"
//- Browsers with JS module support.
Will be ignored otherwise.
script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer
//- Browsers with no JS module support.
Won't be fetched or interpreted otherwise.
script(nomodule src="/assets/js/rollup.js")
script(nomodule)
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer

View File

@ -19,5 +19,5 @@ menu.c-sidebar.js-sidebar.u-text
- var counter = 0
for id, title in menu
- counter++
li.c-sidebar__crumb__item(data-nav=id class=(counter == 1) ? "is-active" : null)
li.c-sidebar__crumb__item(data-nav=id)
+a("#section-" + id)=title

View File

@ -62,6 +62,9 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_explosion(viewBox="0 0 500 500")
path(fill="currentColor" d="M111.7 74.9L91.2 93.1l9.1 10.2 17.8-15.8 7.4 8.4-17.8 15.8 10.1 11.4 20.6-18.2 7.7 8.7-30.4 26.9-41.9-47.3 30.3-26.9 7.6 8.6zM190.8 59.6L219 84.3l-14.4 4.5-20.4-18.2-6.4 26.6-14.4 4.5 8.9-36.4-26.9-24.1 14.3-4.5L179 54.2l5.7-25.2 14.3-4.5-8.2 35.1zM250.1 21.2l27.1 3.4c6.1.8 10.8 3.1 14 7.2 3.2 4.1 4.5 9.2 3.7 15.5-.8 6.3-3.2 11-7.4 14.1-4.1 3.1-9.2 4.3-15.3 3.5L258 63.2l-2.8 22.3-13-1.6 7.9-62.7zm11.5 13l-2.2 17.5 12.6 1.6c5.1.6 9.1-2 9.8-7.6.7-5.6-2.5-9.2-7.6-9.9l-12.6-1.6zM329.1 95.4l23.8 13.8-5.8 10L312 98.8l31.8-54.6 11.3 6.6-26 44.6zM440.5 145c-1.3 8.4-5.9 15.4-13.9 21.1s-16.2 7.7-24.6 6.1c-8.4-1.6-15.3-6.3-20.8-14.1-5.5-7.9-7.6-16-6.4-24.4 1.3-8.5 6-15.5 14-21.1 8-5.6 16.2-7.7 24.5-6 8.4 1.6 15.4 6.3 20.9 14.2 5.5 7.6 7.6 15.7 6.3 24.2zM412 119c-5.1-.8-10.3.6-15.6 4.4-5.2 3.7-8.4 8.1-9.4 13.2-1 5.2.2 10.1 3.5 14.8 3.4 4.8 7.5 7.5 12.7 8.2 5.2.8 10.4-.7 15.6-4.4 5.3-3.7 8.4-8.1 9.4-13.2 1.1-5.1-.1-9.9-3.4-14.7-3.4-4.8-7.6-7.6-12.8-8.3zM471.5 237.9c-2.8 4.8-7.1 7.6-13 8.7l-2.6-13.1c5.3-.9 8.1-5 7.2-11-.9-5.8-4.3-8.8-8.9-8.2-2.3.3-3.7 1.4-4.5 3.3-.7 1.9-1.4 5.2-1.7 10.1-.8 7.5-2.2 13.1-4.3 16.9-2.1 3.9-5.7 6.2-10.9 7-6.3.9-11.3-.5-15.2-4.4-3.9-3.8-6.3-9-7.3-15.7-1.1-7.4-.2-13.7 2.6-18.8 2.8-5.1 7.4-8.2 13.7-9.2l2.6 13c-5.6 1.1-8.7 6.6-7.7 13.4 1 6.6 3.9 9.5 8.6 8.8 4.4-.7 5.7-4.5 6.7-14.1.3-3.5.7-6.2 1.1-8.4.4-2.2 1.2-4.4 2.2-6.8 2.1-4.7 6-7.2 11.8-8.1 5.4-.8 10.3.4 14.5 3.7 4.2 3.3 6.9 8.5 8 15.6.9 6.9-.1 12.6-2.9 17.3zM408.6 293.5l2.4-12.9 62 11.7-2.4 12.9-62-11.7zM419.6 396.9c-8.3 2-16.5.3-24.8-5-8.2-5.3-13.2-12.1-14.9-20.5-1.6-8.4.1-16.6 5.3-24.6 5.2-8.1 11.9-13.1 20.2-15.1 8.4-1.9 16.6-.3 24.9 5 8.2 5.3 13.2 12.1 14.8 20.5 1.7 8.4 0 16.6-5.2 24.7-5.2 8-12 13-20.3 15zm13.4-36.3c-1.2-5.1-4.5-9.3-9.9-12.8s-10.6-4.7-15.8-3.7-9.3 4-12.4 8.9-4.1 9.8-2.8 14.8c1.2 5.1 4.5 9.3 9.9 12.8 5.5 3.5 10.7 4.8 15.8 3.7 5.1-.9 9.2-3.8 12.3-8.7s4.1-9.9 2.9-15zM303.6 416.5l9.6-5.4 43.3 20.4-19.2-34 11.4-6.4 31 55-9.6 5.4-43.4-20.5 19.2 34.1-11.3 6.4-31-55zM238.2 468.8c-49 0-96.9-17.4-134.8-49-38.3-32-64-76.7-72.5-125.9-2-11.9-3.1-24-3.1-35.9 0-36.5 9.6-72.6 27.9-104.4 2.1-3.6 6.7-4.9 10.3-2.8 3.6 2.1 4.9 6.7 2.8 10.3-16.9 29.5-25.9 63.1-25.9 96.9 0 11.1 1 22.3 2.9 33.4 7.9 45.7 31.8 87.2 67.3 116.9 35.2 29.3 79.6 45.5 125.1 45.5 11.1 0 22.3-1 33.4-2.9 4.1-.7 8 2 8.7 6.1.7 4.1-2 8-6.1 8.7-11.9 2-24 3.1-36 3.1z")
symbol#svg_prodigy(viewBox="0 0 538.5 157.6")
path(fill="currentColor" d="M70.6 48.6c7 7.3 10.5 17.1 10.5 29.2S77.7 99.7 70.6 107c-6.9 7.3-15.9 11.1-27 11.1-9.4 0-16.8-2.7-21.7-8.2v44.8H0V39h20.7v8.1c4.8-6.4 12.4-9.6 22.9-9.6 11.1 0 20.1 3.7 27 11.1zM21.9 76v3.6c0 12.1 7.3 19.8 18.3 19.8 11.2 0 18.7-7.9 18.7-21.6s-7.5-21.6-18.7-21.6c-11 0-18.3 7.7-18.3 19.8zM133.8 59.4c-12.6 0-20.5 7-20.5 17.8v39.3h-22V39h21.1v8.8c4-6.4 11.2-9.6 21.3-9.6v21.2zM209.5 107.1c-7.6 7.3-17.5 11.1-29.5 11.1s-21.9-3.8-29.7-11.1c-7.6-7.5-11.5-17.2-11.5-29.2 0-12.1 3.9-21.9 11.5-29.2 7.8-7.3 17.7-11.1 29.7-11.1s21.9 3.8 29.5 11.1c7.8 7.3 11.7 17.1 11.7 29.2 0 11.9-3.9 21.7-11.7 29.2zM180 56.2c-5.7 0-10.3 1.9-13.8 5.8-3.5 3.8-5.2 9-5.2 15.7 0 6.7 1.8 12 5.2 15.7 3.4 3.8 8.1 5.7 13.8 5.7s10.3-1.9 13.8-5.7 5.2-9 5.2-15.7c0-6.8-1.8-12-5.2-15.7-3.5-3.8-8.1-5.8-13.8-5.8zM313 116.5h-20.5v-7.9c-4.4 5.5-12.7 9.6-23.1 9.6-10.9 0-19.9-3.8-27-11.1C235.5 99.7 232 90 232 77.8s3.5-21.9 10.3-29.2c7-7.3 16-11.1 27-11.1 9.7 0 17.1 2.7 21.9 8.2V0H313v116.5zm-58.8-38.7c0 13.6 7.5 21.4 18.7 21.4 10.9 0 18.3-7.3 18.3-19.8V76c0-12.2-7.3-19.8-18.3-19.8-11.2 0-18.7 8-18.7 21.6zM354.1 13.6c0 3.6-1.3 6.8-3.9 9.3-5 4.9-13.6 4.9-18.6 0-8.4-7.5-1.6-23.1 9.3-22.5 7.4 0 13.2 5.9 13.2 13.2zm-2.2 102.9H330V39h21.9v77.5zM425.1 47.1V39h20.5v80.4c0 11.2-3.6 20.1-10.6 26.8-7 6.7-16.6 10-28.5 10-23.4 0-36.9-11.4-39.9-29.8l21.7-.8c1 7.6 7.6 12 17.4 12 11.2 0 18.1-5.8 18.1-16.6v-11.1c-5.1 5.5-12.5 8.2-21.9 8.2-10.9 0-19.9-3.8-27-11.1-6.9-7.3-10.3-17.1-10.3-29.2s3.5-21.9 10.3-29.2c7-7.3 16-11.1 27-11.1 10.7 0 18.4 3.1 23.2 9.6zm-38.3 30.7c0 13.6 7.5 21.6 18.7 21.6 11 0 18.3-7.6 18.3-19.8V76c0-12.2-7.3-19.8-18.3-19.8-11.2 0-18.7 8-18.7 21.6zM488.8 154.8H465l19.8-45.1L454.5 39h24.1l17.8 46.2L514.2 39h24.3l-49.7 115.8z")
//- Machine learning & NLP libraries

View File

@ -1,5 +1,7 @@
//- 💫 DOCS > API > ANNOTATION > TRAINING
+h(3, "json-input") JSON input format for training
p
| spaCy takes training data in JSON format. The built-in
| #[+api("cli#convert") #[code convert]] command helps you convert the
@ -46,3 +48,57 @@ p
| Treebank:
+github("spacy", "examples/training/training-data.json", false, false, "json")
+h(3, "vocab-jsonl") Lexical data for vocabulary
+tag-new(2)
p
| To populate a model's vocabulary, you can use the
| #[+api("cli#vocab") #[code spacy vocab]] command and load in a
| #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
| (JSONL) file containing one lexical entry per line. The first line
| defines the language and vocabulary settings. All other lines are
| expected to be JSON objects describing an individual lexeme. The lexical
| attributes will then be set as attributes on spaCy's
| #[+api("lexeme#attributes") #[code Lexeme]] object. The #[code vocab]
| command outputs a ready-to-use spaCy model with a #[code Vocab]
| containing the lexical data.
+code("First line").
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
+code("Entry structure").
{
"orth": string,
"id": int,
"lower": string,
"norm": string,
"shape": string
"prefix": string,
"suffix": string,
"length": int,
"cluster": string,
"prob": float,
"is_alpha": bool,
"is_ascii": bool,
"is_digit": bool,
"is_lower": bool,
"is_punct": bool,
"is_space": bool,
"is_title": bool,
"is_upper": bool,
"like_url": bool,
"like_num": bool,
"like_email": bool,
"is_stop": bool,
"is_oov": bool,
"is_quote": bool,
"is_left_punct": bool,
"is_right_punct": bool
}
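p
    | As a rough sketch, a file in this format could be produced with a few
    | lines of Python. The attribute values below are purely illustrative.

+code("Writing vocab JSONL (sketch)").
    import json
    lines = [{"lang": "en", "settings": {"oov_prob": -20.5}},
             {"orth": "the", "prob": -3.5, "is_alpha": True, "is_stop": True}]
    with open('vocab.jsonl', 'w') as file_:
        for entry in lines:
            file_.write(json.dumps(entry) + '\n')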
p
| Here's an example of the 20 most frequent lexemes in the English
| training data:
+github("spacy", "examples/training/vocab-data.jsonl", false, false, "json")

View File

@ -3,8 +3,10 @@
"Overview": {
"Architecture": "./",
"Annotation Specs": "annotation",
"Command Line": "cli",
"Functions": "top-level"
},
"Containers": {
"Doc": "doc",
"Token": "token",
@ -45,14 +47,19 @@
}
},
"cli": {
"title": "Command Line Interface",
"teaser": "Download, train and package models, and debug spaCy.",
"source": "spacy/cli"
},
"top-level": {
"title": "Top-level Functions",
"menu": {
"spacy": "spacy",
"displacy": "displacy",
"Utility Functions": "util",
"Compatibility": "compat",
"Command Line": "cli"
"Compatibility": "compat"
}
},
@ -213,7 +220,7 @@
"Lemmatization": "lemmatization",
"Dependencies": "dependency-parsing",
"Named Entities": "named-entities",
"Training Data": "training"
"Models & Training": "training"
}
}
}

View File

@ -58,16 +58,16 @@ p
nlp.from_disk(model_data_path) # load in model data
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
| will also raise an error if no model could be loaded and never just
| return an empty #[code Language] object. If you need a blank language,
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
| or import the class explicitly, e.g.
| #[code from spacy.lang.en import English].
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
| will also raise an error if no model could be loaded and never just
| return an empty #[code Language] object. If you need a blank language,
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
| or import the class explicitly, e.g.
| #[code from spacy.lang.en import English].
+code-new nlp = spacy.load('/model')
+code-old nlp = spacy.load('en', path='/model')
+code-wrapper
+code-new nlp = spacy.load('/model')
+code-old nlp = spacy.load('en', path='/model')
+h(3, "spacy.blank") spacy.blank
+tag function
@ -85,7 +85,9 @@ p
+row
+cell #[code name]
+cell unicode
+cell ISO code of the language class to load.
+cell
| #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code]
| of the language class to load.
+row
+cell #[code disable]

View File

@ -99,6 +99,6 @@ p This document describes the target annotations spaCy is trained to predict.
include _annotation/_biluo
+section("training")
+h(2, "json-input") JSON input format for training
+h(2, "training") Models and training data
include _annotation/_training

View File

@ -1,4 +1,6 @@
//- 💫 DOCS > API > TOP-LEVEL > COMMAND LINE INTERFACE
//- 💫 DOCS > API > COMMAND LINE INTERFACE
include ../_includes/_mixins
p
| As of v1.7.0, spaCy comes with new command line helpers to download and
@ -34,6 +36,13 @@ p
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell creates
+cell directory, symlink
+cell
| The installed model package in your #[code site-packages]
| directory and a shortcut link as a symlink in #[code spacy/data].
+aside("Downloading best practices")
| The #[code download] command is mostly intended as a convenient,
| interactive wrapper. It performs compatibility checks and prints
@ -86,6 +95,13 @@ p
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell creates
+cell symlink
+cell
| A shortcut link of the given name as a symlink in
| #[code spacy/data].
+h(3, "info") Info
p
@ -113,6 +129,11 @@ p
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell prints
+cell #[code stdout]
+cell Information about your spaCy installation.
+h(3, "validate") Validate
+tag-new(2)
@ -129,6 +150,12 @@ p
+code(false, "bash", "$").
spacy validate
+table(["Argument", "Type", "Description"])
+row("foot")
+cell prints
+cell #[code stdout]
+cell Details about the compatibility of your installed models.
+h(3, "convert") Convert
p
@ -172,6 +199,11 @@ p
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell creates
+cell JSON
+cell Data in spaCy's #[+a("/api/annotation#json-input") JSON format].
p The following converters are available:
+table(["ID", "Description"])
@ -286,6 +318,11 @@ p
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell creates
+cell model, pickle
+cell A spaCy model on each epoch, and a final #[code .pickle] file.
+h(4, "train-hyperparams") Environment variables for hyperparameters
+tag-new(2)
@ -395,6 +432,50 @@ p
+cell Gradient L2 norm constraint.
+cell #[code 1.0]
+h(3, "vocab") Vocab
+tag-new(2)
p
| Compile a vocabulary from a
| #[+a("/api/annotation#vocab-jsonl") lexicon JSONL] file and optional
| word vectors. Will save out a valid spaCy model that you can load via
| #[+api("spacy#load") #[code spacy.load]] or package using the
| #[+api("cli#package") #[code package]] command.
+code(false, "bash", "$").
spacy vocab [lang] [output_dir] [lexemes_loc] [vectors_loc]
+table(["Argument", "Type", "Description"])
+row
+cell #[code lang]
+cell positional
+cell
| Model language
| #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code],
| e.g. #[code en].
+row
+cell #[code output_dir]
+cell positional
+cell Model output directory. Will be created if it doesn't exist.
+row
+cell #[code lexemes_loc]
+cell positional
+cell
| Location of lexical data in spaCy's
| #[+a("/api/annotation#vocab-jsonl") JSONL format].
+row
+cell #[code vectors_loc]
+cell positional
+cell Optional location of vectors data as numpy #[code .npz] file.
+row("foot")
+cell creates
+cell model
+cell A spaCy model containing the vocab and vectors.
+h(3, "evaluate") Evaluate
+tag-new(2)
@ -447,22 +528,36 @@ p
+cell flag
+cell Use gold preprocessing.
+row("foot")
+cell prints / creates
+cell #[code stdout], HTML
+cell Training results and optional displaCy visualizations.
+h(3, "package") Package
p
| Generate a #[+a("/usage/training#models-generating") model Python package]
| from an existing model data directory. All data files are copied over.
| If the path to a meta.json is supplied, or a meta.json is found in the
| input directory, this file is used. Otherwise, the data can be entered
| directly from the command line. The required file templates are downloaded
| from #[+src(gh("spacy-dev-resources", "templates/model")) GitHub] to make
| If the path to a #[code meta.json] is supplied, or a #[code meta.json] is
| found in the input directory, this file is used. Otherwise, the data can
| be entered directly from the command line. The required file templates
| are downloaded from
| #[+src(gh("spacy-dev-resources", "templates/model")) GitHub] to make
| sure you're always using the latest versions. This means you need to be
| connected to the internet to use this command.
| connected to the internet to use this command. After packaging, you
| can run #[code python setup.py sdist] from the newly created directory
| to turn your model into an installable archive file.
+code(false, "bash", "$", false, false, true).
spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]
+aside-code("Example", "bash").
spacy package /input /output
cd /output/en_model-0.0.0
python setup.py sdist
pip install dist/en_model-0.0.0.tar.gz
+table(["Argument", "Type", "Description"])
+row
+cell #[code input_dir]
@ -477,15 +572,16 @@ p
+row
+cell #[code --meta-path], #[code -m]
+cell option
+cell #[+tag-new(2)] Path to meta.json file (optional).
+cell #[+tag-new(2)] Path to #[code meta.json] file (optional).
+row
+cell #[code --create-meta], #[code -c]
+cell flag
+cell
| #[+tag-new(2)] Create a meta.json file on the command line, even
| if one already exists in the directory.
| #[+tag-new(2)] Create a #[code meta.json] file on the command
| line, even if one already exists in the directory. If an
| existing file is found, its entries will be shown as the defaults
| in the command line prompt.
+row
+cell #[code --force], #[code -f]
+cell flag
@ -495,3 +591,8 @@ p
+cell #[code --help], #[code -h]
+cell flag
+cell Show help message and available arguments.
+row("foot")
+cell creates
+cell directory
+cell A Python package containing the spaCy model.

View File

@ -84,13 +84,13 @@ p
+cell A container for accessing the annotations.
+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
+code-new doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old doc = nlp(u"I don't want parsed", parse=False)
+code-wrapper
+code-new doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old doc = nlp(u"I don't want parsed", parse=False)
+h(2, "pipe") Language.pipe
+tag method
@ -533,15 +533,15 @@ p
+cell The modified #[code Language] object.
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy v2.0, the #[code save_to_directory] method has been
| renamed to #[code to_disk], to improve consistency across classes.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| As of spaCy v2.0, the #[code save_to_directory] method has been
| renamed to #[code to_disk], to improve consistency across classes.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
+code-new nlp = English().from_disk(disable=['tagger', 'ner'])
+code-old nlp = spacy.load('en', tagger=False, entity=False)
+code-wrapper
+code-new nlp = English().from_disk(disable=['tagger', 'ner'])
+code-old nlp = spacy.load('en', tagger=False, entity=False)
+h(2, "to_bytes") Language.to_bytes
+tag method
@ -595,13 +595,13 @@ p Load state from a binary string.
+cell The #[code Language] object.
+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
+code-new nlp = English().from_bytes(bytes, disable=['tagger', 'ner'])
+code-old nlp = English().from_bytes('en', tagger=False, entity=False)
+code-wrapper
+code-new nlp = English().from_bytes(bytes, disable=['tagger', 'ner'])
+code-old nlp = English().from_bytes('en', tagger=False, entity=False)
+h(2, "attributes") Attributes

View File

@ -203,18 +203,18 @@ p
| dict describes a token.
+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
| patterns and a callback for a given match ID.
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
| patterns and a callback for a given match ID.
+code-new.
matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-wrapper
+code-new.
matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-old.
matcher.add_entity('GoogleNow', on_match=merge_phrases)
matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-old.
matcher.add_entity('GoogleNow', on_match=merge_phrases)
matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])
+h(2, "remove") Matcher.remove
+tag method

View File

@ -393,6 +393,37 @@ p A sequence of all the token's syntactic descendents.
+cell #[code Token]
+cell A descendant token such that #[code self.is_ancestor(descendant)].
+h(2, "is_sent_start") Token.is_sent_start
+tag property
+tag-new(2)
p
| A boolean value indicating whether the token starts a sentence.
| #[code None] if unknown.
+aside-code("Example").
doc = nlp(u'Give it back! He pleaded.')
assert doc[4].is_sent_start
assert not doc[5].is_sent_start
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell bool
+cell Whether the token starts a sentence.
+infobox("Deprecation note", "⚠️")
| As of spaCy v2.0, the #[code Token.sent_start] property is deprecated and
| has been replaced with #[code Token.is_sent_start], which returns a
| boolean value instead of a misleading #[code 0] for #[code False] and
| #[code 1] for #[code True]. It also now returns #[code None] if the
| answer is unknown, and fixes a quirk in the old logic that would always
| set the property to #[code 0] for the first word of the document.
+code-wrapper
+code-new assert doc[4].is_sent_start == True
+code-old assert doc[4].sent_start == 1
+h(2, "has_vector") Token.has_vector
+tag property
+tag-model("vectors")

View File

@ -18,7 +18,3 @@ include ../_includes/_mixins
+section("compat")
+h(2, "compat", "spacy/compaty.py") Compatibility functions
include _top-level/_compat
+section("cli", "spacy/cli")
+h(2, "cli") Command line
include _top-level/_cli

View File

@ -5,46 +5,47 @@ include ../_includes/_mixins
p
| Vectors data is kept in the #[code Vectors.data] attribute, which should
| be an instance of #[code numpy.ndarray] (for CPU vectors) or
| #[code cupy.ndarray] (for GPU vectors).
| #[code cupy.ndarray] (for GPU vectors). Multiple keys can be mapped to
| the same vector, and not all of the rows in the table need to be
| assigned, so #[code vectors.n_keys] may be greater or smaller than
| #[code vectors.shape[0]].
+h(2, "init") Vectors.__init__
+tag method
p
| Create a new vector store. To keep the vector table empty, pass
| #[code width=0]. You can also create the vector table and add
| vectors one by one, or set the vector values directly on initialisation.
| Create a new vector store. You can set the vector values and keys
| directly on initialisation, or supply a #[code shape] keyword argument
| to create an empty table you can add vectors to later.
+aside-code("Example").
from spacy.vectors import Vectors
from spacy.strings import StringStore
empty_vectors = Vectors(StringStore())
empty_vectors = Vectors(shape=(10000, 300))
vectors = Vectors([u'cat'], width=300)
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), data=vector_table)
data = numpy.zeros((3, 300), dtype='f')
keys = [u'cat', u'dog', u'rat']
vectors = Vectors(data=data, keys=keys)
+table(["Name", "Type", "Description"])
+row
+cell #[code strings]
+cell #[code StringStore] or list
+cell
| List of strings, or a #[+api("stringstore") #[code StringStore]]
| that maps strings to hash values, and vice versa.
+row
+cell #[code width]
+cell int
+cell Number of dimensions.
+row
+cell #[code data]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell The vector data.
+row
+cell #[code keys]
+cell iterable
+cell A sequence of keys aligned with the data.
+row
+cell #[code shape]
+cell tuple
+cell
| Size of the table as #[code (n_entries, n_columns)], the number
| of entries and number of columns. Not required if you're
| initialising the object with #[code data] and #[code keys].
+row("foot")
+cell returns
+cell #[code Vectors]
@ -54,97 +55,92 @@ p
+tag method
p
| Get a vector by key. If key is a string, it is hashed to an integer ID
| using the #[code Vectors.strings] table. If the integer key is not found
| in the table, a #[code KeyError] is raised.
| Get a vector by key. If the key is not found in the table, a
| #[code KeyError] is raised.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
cat_vector = vectors[u'cat']
cat_id = nlp.vocab.strings[u'cat']
cat_vector = nlp.vocab.vectors[cat_id]
assert cat_vector == nlp.vocab[u'cat'].vector
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell int
+cell The key to get the vector for.
+row
+cell returns
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell The vector for the key.
+h(2, "setitem") Vectors.__setitem__
+tag method
p
| Set a vector for the given key. If key is a string, it is hashed to an
| integer ID using the #[code Vectors.strings] table.
| Set a vector for the given key.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
cat_id = nlp.vocab.strings[u'cat']
vector = numpy.random.uniform(-1, 1, (300,))
nlp.vocab.vectors[cat_id] = vector
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell int
+cell The key to set the vector for.
+row
+cell #[code vector]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell The vector to set.
+h(2, "iter") Vectors.__iter__
+tag method
p Yield vectors from the table.
p Iterate over the keys in the table.
+aside-code("Example").
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
for vector in vectors:
print(vector)
for key in nlp.vocab.vectors:
print(key, nlp.vocab.strings[key])
+table(["Name", "Type", "Description"])
+row("foot")
+cell yields
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell A vector from the table.
+cell int
+cell A key in the table.
+h(2, "len") Vectors.__len__
+tag method
p Return the number of vectors that have been assigned.
p Return the number of vectors in the table.
+aside-code("Example").
vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors(StringStore(), vector_table)
vectors = Vectors(shape=(3, 300))
assert len(vectors) == 3
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell int
+cell The number of vectors in the data.
+cell The number of vectors in the table.
+h(2, "contains") Vectors.__contains__
+tag method
p
| Check whether a key has a vector entry in the table. If key is a string,
| it is hashed to an integer ID using the #[code Vectors.strings] table.
| Check whether a key has been mapped to a vector entry in the table.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
assert u'cat' in vectors
cat_id = nlp.vocab.strings[u'cat']
nlp.vocab.vectors.add(cat_id, vector=numpy.random.uniform(-1, 1, (300,)))
assert cat_id in nlp.vocab.vectors
+table(["Name", "Type", "Description"])
+row
+cell #[code key]
+cell unicode / int
+cell int
+cell The key to check.
+row("foot")
@ -156,13 +152,20 @@ p
+tag method
p
| Add a key to the table, optionally setting a vector value as well. If
| key is a string, it is hashed to an integer ID using the
| #[code Vectors.strings] table.
| Add a key to the table, optionally setting a vector value as well. Keys
| can be mapped to an existing vector by setting #[code row], or a new
| vector can be added. When adding unicode keys, keep in mind that the
| #[code Vectors] class itself has no
| #[+api("stringstore") #[code StringStore]], so you have to store the
| hash-to-string mapping separately. If you need to manage the strings,
| you should use the #[code Vectors] via the
| #[+api("vocab") #[code Vocab]] class, e.g. #[code vocab.vectors].
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
vector = numpy.random.uniform(-1, 1, (300,))
cat_id = nlp.vocab.strings[u'cat']
nlp.vocab.vectors.add(cat_id, vector=vector)
nlp.vocab.vectors.add(u'dog', row=0)
+table(["Name", "Type", "Description"])
+row
@ -172,25 +175,66 @@ p
+row
+cell #[code vector]
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell An optional vector to add.
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell An optional vector to add for the key.
+row
+cell #[code row]
+cell int
+cell An optional row number of a vector to map the key to.
+row("foot")
+cell returns
+cell int
+cell The row the vector was added to.
+h(2, "keys") Vectors.keys
+tag method
p A sequence of the keys in the table.
+aside-code("Example").
for key in nlp.vocab.vectors.keys():
print(key, nlp.vocab.strings[key])
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell iterable
+cell The keys.
+h(2, "values") Vectors.values
+tag method
p
| Iterate over vectors that have been assigned to at least one key. Note
| that some vectors may be unassigned, so the number of vectors returned
| may be less than the length of the vectors table.
+aside-code("Example").
for vector in nlp.vocab.vectors.values():
print(vector)
+table(["Name", "Type", "Description"])
+row("foot")
+cell yields
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell A vector in the table.
+h(2, "items") Vectors.items
+tag method
p Iterate over #[code (string key, vector)] pairs, in order.
p Iterate over #[code (key, vector)] pairs, in order.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
for key, vector in vectors.items():
print(key, vector)
for key, vector in nlp.vocab.vectors.items():
print(key, nlp.vocab.strings[key], vector)
+table(["Name", "Type", "Description"])
+row("foot")
+cell yields
+cell tuple
+cell #[code (string key, vector)] pairs, in order.
+cell #[code (key, vector)] pairs, in order.
+h(2, "shape") Vectors.shape
+tag property
@ -200,7 +244,7 @@ p
| dimensions in the vector table.
+aside-code("Example").
vectors = Vectors(StringStore(), 300)
vectors = Vectors(shape=(1, 300))
vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
rows, dims = vectors.shape
assert rows == 1
@ -212,6 +256,59 @@ p
+cell tuple
+cell A #[code (rows, dims)] pair.
+h(2, "size") Vectors.size
+tag property
p The vector size, i.e. #[code rows * dims].
+aside-code("Example").
vectors = Vectors(shape=(500, 300))
assert vectors.size == 150000
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell int
+cell The vector size.
+h(2, "is_full") Vectors.is_full
+tag property
p
| Whether the vectors table is full and no slots are available for new
| keys. If a table is full, it can be resized using
| #[+api("vectors#resize") #[code Vectors.resize]].
+aside-code("Example").
vectors = Vectors(shape=(1, 300))
vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
assert vectors.is_full
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell bool
+cell Whether the vectors table is full.
+h(2, "n_keys") Vectors.n_keys
+tag property
p
| Get the number of keys in the table. Note that this is the number of
| #[em all] keys, not just unique vectors. If several keys are mapped
| to the same vector, they will be counted individually.
+aside-code("Example").
vectors = Vectors(shape=(10, 300))
assert len(vectors) == 10
assert vectors.n_keys == 0
+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell int
+cell The number of all keys in the table.
+h(2, "from_glove") Vectors.from_glove
+tag method
@ -223,6 +320,10 @@ p
| float32 vectors, #[code vectors.300.d.bin] for 300d float64 (double)
| vectors, etc. By default GloVe outputs 64-bit vectors.
+aside-code("Example").
vectors = Vectors()
vectors.from_glove('/path/to/glove_vectors')
+table(["Name", "Type", "Description"])
+row
+cell #[code path]
@ -323,7 +424,7 @@ p Load state from a binary string.
+table(["Name", "Type", "Description"])
+row
+cell #[code data]
+cell #[code numpy.ndarray] / #[code cupy.ndarray]
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell
| Stored vectors data. #[code numpy] is used for CPU vectors,
| #[code cupy] for GPU vectors.
@ -337,7 +438,7 @@ p Load state from a binary string.
+row
+cell #[code keys]
+cell #[code numpy.ndarray]
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
+cell
| Array keeping the keys in order, such that
| #[code keys[vectors.key2row[key]] == key]

View File

@ -162,7 +162,7 @@ p
+cell int
+cell The integer ID by which the flag value can be checked.
+h(2, "add_flag") Vocab.clear_vectors
+h(2, "clear_vectors") Vocab.clear_vectors
+tag method
+tag-new(2)
@ -181,7 +181,50 @@ p
| Number of dimensions of the new vectors. If #[code None], size
| is not changed.
+h(2, "add_flag") Vocab.get_vector
+h(2, "prune_vectors") Vocab.prune_vectors
+tag method
+tag-new(2)
p
| Reduce the current vector table to #[code nr_row] unique entries. Words
| mapped to the discarded vectors will be remapped to the closest vector
| among those remaining. For example, suppose the original table had
| vectors for the words:
| #[code.u-break ['sat', 'cat', 'feline', 'reclined']]. If we prune the
| vector table to two rows, we would discard the vectors for "feline"
| and "reclined". These words would then be remapped to the closest
| remaining vector so "feline" would have the same vector as "cat",
| and "reclined" would have the same vector as "sat". The similarities are
| judged by cosine. The original vectors may be large, so the cosines are
| calculated in minibatches, to reduce memory usage.
+aside-code("Example").
nlp.vocab.prune_vectors(10000)
assert len(nlp.vocab.vectors) &lt;= 10000
+table(["Name", "Type", "Description"])
+row
+cell #[code nr_row]
+cell int
+cell The number of rows to keep in the vector table.
+row
+cell #[code batch_size]
+cell int
+cell
| Batch size for calculating the similarities. Larger batch
| sizes might be faster, while temporarily requiring more memory.
+row("foot")
+cell returns
+cell dict
+cell
| A dictionary keyed by removed words mapped to
| #[code (string, score)] tuples, where #[code string] is the entry
| the removed word was mapped to, and #[code score] the similarity
| score between the two words.
+h(2, "get_vector") Vocab.get_vector
+tag method
+tag-new(2)
@ -206,7 +249,7 @@ p
| A word vector. Size and shape are determined by the
| #[code Vocab.vectors] instance.
+h(2, "add_flag") Vocab.set_vector
+h(2, "set_vector") Vocab.set_vector
+tag method
+tag-new(2)
@ -228,7 +271,7 @@ p
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+cell The vector to set.
+h(2, "add_flag") Vocab.has_vector
+h(2, "has_vector") Vocab.has_vector
+tag method
+tag-new(2)

View File

@ -163,11 +163,4 @@
height: 1.4em
border: none
text-align-last: center
.o-empty:empty:before
@include size(1em)
border-radius: 50%
content: ""
display: inline-block
background: $color-red
vertical-align: middle
width: 100%

View File

@ -47,7 +47,7 @@
font: 600 1.1rem/#{1} $font-secondary
background: $color-theme
color: $color-back
padding: 0.15em 0.5em 0.35em
padding: 2px 6px 4px
border-radius: 1em
text-transform: uppercase
vertical-align: middle

View File

@ -0,0 +1,72 @@
'use strict';
import { Templater, handleResponse } from './util.js';
export default class Changelog {
/**
* Fetch and render changelog from GitHub. Clones a template node (table row)
* to avoid doubling templating markup in JavaScript.
* @param {string} user - GitHub username.
* @param {string} repo - Repository to fetch releases from.
*/
constructor(user, repo) {
this.url = `https://api.github.com/repos/${user}/${repo}/releases`;
this.template = new Templater('changelog');
this.fetchChangelog()
.then(json => this.render(json))
.catch(this.showError.bind(this));
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
fetchChangelog() {
return new Promise((resolve, reject) =>
fetch(this.url)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(json) : reject()))
}
showError() {
this.template.get('error').style.display = 'block';
}
/**
* Get template section from template row. Hacky, but does make sense.
* @param {node} item - Parent element.
* @param {string} id - ID of child element, set via data-changelog.
*/
getField(item, id) {
return item.querySelector(`[data-changelog="${id}"]`);
}
render(json) {
this.template.get('table').style.display = 'block';
this.row = this.template.get('item');
this.releases = this.template.get('releases');
this.prereleases = this.template.get('prereleases');
Object.values(json)
.filter(release => release.name)
.forEach(release => this.renderRelease(release));
this.row.remove();
}
/**
* Clone the template row and populate with content from API response.
* https://developer.github.com/v3/repos/releases/#list-releases-for-a-repository
* @param {string} name - Release title.
* @param {string} tag (tag_name) - Release tag.
* @param {string} url (html_url) - URL to the release page on GitHub.
* @param {string} date (published_at) - Timestamp of release publication.
* @param {boolean} prerelease - Whether the release is a prerelease.
*/
renderRelease({ name, tag_name: tag, html_url: url, published_at: date, prerelease }) {
const container = prerelease ? this.prereleases : this.releases;
const tagLink = `<a href="${url}" target="_blank"><code>${tag}</code></a>`;
const title = (name.split(': ').length == 2) ? name.split(': ')[1] : name;
const row = this.row.cloneNode(true);
this.getField(row, 'date').textContent = date.split('T')[0];
this.getField(row, 'tag').innerHTML = tagLink;
this.getField(row, 'title').textContent = title;
container.appendChild(row);
}
}

View File

@ -0,0 +1,42 @@
'use strict';
import { $$ } from './util.js';
export default class GitHubEmbed {
/**
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
* raw text and places it inside element.
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code><pre>
* @param {string} user - GitHub user or organization.
* @param {string} attr - Data attribute used to select containers. Attribute
* value should be path to file relative to user.
*/
constructor(user, attr) {
this.url = `https://raw.githubusercontent.com/${user}`;
this.attr = attr;
[...$$(`[${this.attr}]`)].forEach(el => this.embed(el));
}
/**
* Fetch code from GitHub and insert it as element content. File path is
* read off the container's data attribute.
* @param {node} el - The element.
*/
embed(el) {
    el.parentElement.setAttribute('data-loading', '');
    fetch(`${this.url}/${el.getAttribute(this.attr)}`)
        .then(res => res.text().then(text => ({ text, ok: res.ok })))
        .then(({ text, ok }) => ok ? this.render(el, text) : false)
        // only clear the loading state once the fetch has settled
        .then(() => el.parentElement.removeAttribute('data-loading'));
}
/**
* Add text to container and apply syntax highlighting via Prism, if available.
* @param {node} el - The element.
* @param {string} text - The raw code, fetched from GitHub.
*/
render(el, text) {
el.textContent = text;
if (window.Prism) Prism.highlightElement(el);
}
}

View File

@ -1,323 +0,0 @@
//- 💫 MAIN JAVASCRIPT
//- Note: Will be compiled using Babel before deployment.
'use strict'
const $ = document.querySelector.bind(document);
const $$ = document.querySelectorAll.bind(document);
class ProgressBar {
/**
* Animated reading progress bar.
* @param {String} selector CSS selector of progress bar element.
*/
constructor(selector) {
this.el = $(selector);
this.scrollY = 0;
this.sizes = this.updateSizes();
this.el.setAttribute('max', 100);
this.init();
}
init() {
window.addEventListener('scroll', () => {
this.scrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0);
requestAnimationFrame(this.update.bind(this));
}, false);
window.addEventListener('resize', () => {
this.sizes = this.updateSizes();
requestAnimationFrame(this.update.bind(this));
})
}
update() {
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
}
updateSizes() {
const body = document.body;
const html = document.documentElement;
return {
height: Math.max(body.scrollHeight, body.offsetHeight, html.clientHeight, html.scrollHeight, html.offsetHeight),
vh: Math.max(html.clientHeight, window.innerHeight || 0)
}
}
}
class SectionHighlighter {
/**
* Highlight section in viewport in sidebar, using in-view library.
* @param {String} sectionAttr - Data attribute of sections.
* @param {String} navAttr - Data attribute of navigation items.
* @param {String} activeClass - Class name of active element.
*/
constructor(sectionAttr, navAttr, activeClass = 'is-active') {
this.sections = [...$$(`[${navAttr}]`)];
this.navAttr = navAttr;
this.sectionAttr = sectionAttr;
this.activeClass = activeClass;
inView(`[${sectionAttr}]`).on('enter', this.highlightSection.bind(this));
}
highlightSection(section) {
const id = section.getAttribute(this.sectionAttr);
const el = $(`[${this.navAttr}="${id}"]`);
if (el) {
this.sections.forEach(el => el.classList.remove(this.activeClass));
el.classList.add(this.activeClass);
}
}
}
class Templater {
/**
* Mini templating engine based on data attributes. Selects elements based
* on a data-tpl and data-tpl-key attribute and can set textContent
* and innerHTML.
*
* @param {String} templateId - Template section, e.g. value of data-tpl.
*/
constructor(templateId) {
this.templateId = templateId;
}
get(key) {
return $(`[data-tpl="${this.templateId}"][data-tpl-key="${key}"]`);
}
fill(key, value, html = false) {
const el = this.get(key);
if (html) el.innerHTML = value || '';
else el.textContent = value || '';
return el;
}
}
class ModelLoader {
/**
* Load model meta from GitHub and update model details on site. Uses the
* Templater mini template engine to update DOM.
*
* @param {String} repo - Path to GitHub repository containing releases.
* @param {Array} models - List of model IDs, e.g. "en_core_web_sm".
* @param {Object} licenses - License IDs mapped to URLs.
* @param {Object} benchmarkKeys - Available accuracy keys mapped to display labels.
*/
constructor(repo, models = [], licenses = {}, benchmarkKeys = {}) {
this.url = `https://raw.githubusercontent.com/${repo}/master`;
this.repo = `https://github.com/${repo}`;
this.modelIds = models;
this.licenses = licenses;
this.benchKeys = benchmarkKeys;
this.init();
}
init() {
this.modelIds.forEach(modelId =>
new Templater(modelId).get('table').setAttribute('data-loading', ''));
fetch(`${this.url}/compatibility.json`)
.then(res => this.handleResponse(res))
.then(json => json.ok ? this.getModels(json['spacy']) : this.modelIds.forEach(modelId => this.showError(modelId)))
}
handleResponse(res) {
if (res.ok) return res.json().then(json => Object.assign({}, json, { ok: res.ok }))
else return ({ ok: res.ok })
}
convertNumber(num, separator = ',') {
return num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, separator);
}
getModels(compat) {
this.compat = compat;
for (let modelId of this.modelIds) {
const version = this.getLatestVersion(modelId, compat);
if (!version) {
this.showError(modelId); return;
}
fetch(`${this.url}/meta/${modelId}-${version}.json`)
.then(res => this.handleResponse(res))
.then(json => json.ok ? this.render(json) : this.showError(modelId))
}
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
showError(modelId) {
const template = new Templater(modelId);
template.get('table').removeAttribute('data-loading');
template.get('error').style.display = 'block';
for (let key of ['sources', 'pipeline', 'vectors', 'author', 'license']) {
template.get(key).parentElement.parentElement.style.display = 'none';
}
}
/**
* Update model details in tables. Currently quite hacky :(
*/
render({ lang, name, version, sources, pipeline, vectors, url, author, license, accuracy, speed, size, description, notes }) {
const modelId = `${lang}_${name}`;
const model = `${modelId}-${version}`;
const template = new Templater(modelId);
const getSources = s => (s instanceof Array) ? s.join(', ') : s;
const getPipeline = p => p.map(comp => `<code>${comp}</code>`).join(', ');
const getVectors = v => `${this.convertNumber(v.entries)} (${v.width} dimensions)`;
const getLink = (t, l) => `<a href="${l}" target="_blank">${t}</a>`;
const keys = { version, size, description, notes }
Object.keys(keys).forEach(key => template.fill(key, keys[key]));
if (sources) template.fill('sources', getSources(sources));
if (pipeline && pipeline.length) template.fill('pipeline', getPipeline(pipeline), true);
else template.get('pipeline').parentElement.parentElement.style.display = 'none';
if (vectors) template.fill('vectors', getVectors(vectors));
else template.get('vectors').parentElement.parentElement.style.display = 'none';
if (author) template.fill('author', url ? getLink(author, url) : author, true);
if (license) template.fill('license', this.licenses[license] ? getLink(license, this.licenses[license]) : license, true);
template.get('download').setAttribute('href', `${this.repo}/releases/tag/${model}`);
this.renderBenchmarks(template, accuracy, speed);
this.renderCompat(template, modelId);
template.get('table').removeAttribute('data-loading');
}
renderBenchmarks(template, accuracy = {}, speed = {}) {
if (!accuracy && !speed) return;
template.get('benchmarks').style.display = 'block';
this.renderTable(template, 'parser', accuracy, val => val.toFixed(2));
this.renderTable(template, 'ner', accuracy, val => val.toFixed(2));
this.renderTable(template, 'speed', speed, Math.round);
}
renderTable(template, id, benchmarks, convertVal = val => val) {
if (!this.benchKeys[id] || !Object.keys(this.benchKeys[id]).some(key => benchmarks[key])) return;
const keys = Object.keys(this.benchKeys[id]).map(k => benchmarks[k] ? k : false).filter(k => k);
template.get(id).style.display = 'block';
for (let key of keys) {
template
.fill(key, this.convertNumber(convertVal(benchmarks[key])))
.parentElement.style.display = 'table-row';
}
}
renderCompat(template, modelId) {
template.get('compat-wrapper').style.display = 'table-row';
const options = Object.keys(this.compat).map(v => `<option value="${v}">v${v}</option>`).join('');
template
.fill('compat', '<option selected disabled>spaCy version</option>' + options, true)
.addEventListener('change', ev => {
const result = this.compat[ev.target.value][modelId];
if (result) template.fill('compat-versions', `<code>${modelId}-${result[0]}</code>`, true);
else template.fill('compat-versions', '');
});
}
getLatestVersion(model, compat = {}) {
for (let spacy_v of Object.keys(compat)) {
const models = compat[spacy_v];
if (models[model]) return models[model][0];
}
}
}
class Changelog {
/**
* Fetch and render changelog from GitHub. Clones a template node (table row)
* to avoid doubling templating markup in JavaScript.
*
* @param {String} user - GitHub username.
* @param {String} repo - Repository to fetch releases from.
*/
constructor(user, repo) {
this.url = `https://api.github.com/repos/${user}/${repo}/releases`;
this.template = new Templater('changelog');
fetch(this.url)
.then(res => this.handleResponse(res))
.then(json => json.ok ? this.render(json) : false)
}
/**
* Get template section from template row. Slightly hacky, but does make sense.
*/
$(item, id) {
return item.querySelector(`[data-changelog="${id}"]`);
}
handleResponse(res) {
if (res.ok) return res.json().then(json => Object.assign({}, json, { ok: res.ok }))
else return ({ ok: res.ok })
}
render(json) {
this.template.get('error').style.display = 'none';
this.template.get('table').style.display = 'block';
this.row = this.template.get('item');
this.releases = this.template.get('releases');
this.prereleases = this.template.get('prereleases');
Object.values(json)
.filter(release => release.name)
.forEach(release => this.renderRelease(release));
this.row.remove();
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
/**
* Clone the template row and populate with content from API response.
* https://developer.github.com/v3/repos/releases/#list-releases-for-a-repository
*
* @param {String} name - Release title.
* @param {String} tag (tag_name) - Release tag.
* @param {String} url (html_url) - URL to the release page on GitHub.
* @param {String} date (published_at) - Timestamp of release publication.
* @param {Boolean} pre (prerelease) - Whether the release is a prerelease.
*/
renderRelease({ name, tag_name: tag, html_url: url, published_at: date, prerelease: pre }) {
const container = pre ? this.prereleases : this.releases;
const row = this.row.cloneNode(true);
this.$(row, 'date').textContent = date.split('T')[0];
this.$(row, 'tag').innerHTML = `<a href="${url}" target="_blank"><code>${tag}</code></a>`;
this.$(row, 'title').textContent = (name.split(': ').length == 2) ? name.split(': ')[1] : name;
container.appendChild(row);
}
}
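// Usage sketch: arguments are the GitHub owner and repository whose releases
// are listed; assumes a data-tpl="changelog" template exists in the page.
// new Changelog('explosion', 'spaCy');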
class GitHubEmbed {
/**
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
* raw text and places it inside element.
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code></pre>
*
* @param {String} user - GitHub user or organization.
* @param {String} attr - Data attribute used to select containers. Attribute
* value should be path to file relative to user.
*/
constructor(user, attr) {
this.url = `https://raw.githubusercontent.com/${user}`;
this.attr = attr;
this.error = `\nCan't fetch code example from GitHub :(\n\nPlease use the link below to view the example. If you've come across\na broken link, we always appreciate a pull request to the repository,\nor a report on the issue tracker. Thanks!`;
[...$$(`[${this.attr}]`)].forEach(el => this.embed(el));
}
embed(el) {
el.parentElement.setAttribute('data-loading', '');
fetch(`${this.url}/${el.getAttribute(this.attr)}`)
.then(res => res.text().then(text => ({ text, ok: res.ok })))
.then(({ text, ok }) => {
el.textContent = ok ? text : this.error;
if (ok && window.Prism) Prism.highlightElement(el);
})
.catch(() => { el.textContent = this.error; })
// clear the loading state only after the request has settled
.then(() => el.parentElement.removeAttribute('data-loading'));
}
}
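// Usage sketch, matching the markup shown in the docstring above; 'explosion'
// is the GitHub organization the raw files would be fetched from:
// new GitHubEmbed('explosion', 'data-gh-embed');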

website/assets/js/models.js Normal file

@ -0,0 +1,317 @@
'use strict';
import { Templater, handleResponse, convertNumber, abbrNumber } from './util.js';
/**
* Chart.js defaults
*/
const CHART_COLORS = { model1: '#09a3d5', model2: '#066B8C' };
const CHART_FONTS = {
legend: '-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"',
ticks: 'Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace'
};
/**
* Formatters for model details.
* @property {function} author - Format model author with optional link.
* @property {function} license - Format model license with optional link.
* @property {function} sources - Format training data sources (list or string).
* @property {function} pipeline - Format list of pipeline components.
* @property {function} vectors - Format vector data (entries and dimensions).
* @property {function} version - Format model version number.
*/
export const formats = {
author: (author, url) => url ? `<a href="${url}" target="_blank">${author}</a>` : author,
license: (license, url) => url ? `<a href="${url}" target="_blank">${license}</a>` : license,
sources: sources => (sources instanceof Array) ? sources.join(', ') : sources,
pipeline: pipes => (pipes && pipes.length) ? pipes.map(p => `<code>${p}</code>`).join(', ') : '-',
vectors: vec => vec ? `${abbrNumber(vec.keys)} keys, ${abbrNumber(vec.vectors)} unique vectors (${vec.width} dimensions)` : 'n/a',
version: version => `<code>v${version}</code>`
};
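// Quick sanity sketch of the formatters. The values below are illustrative,
// not taken from a real model meta file:
console.assert(formats.version('2.0.0') === '<code>v2.0.0</code>');
console.assert(formats.pipeline(['tagger', 'parser', 'ner']) === '<code>tagger</code>, <code>parser</code>, <code>ner</code>');
console.assert(formats.sources(['OntoNotes 5', 'Common Crawl']) === 'OntoNotes 5, Common Crawl');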
/**
* Find the latest version of a model in a compatibility table.
* @param {string} model - The model name.
* @param {Object} compat - Compatibility table, keyed by spaCy version.
*/
export const getLatestVersion = (model, compat = {}) => {
for (let [spacy_v, models] of Object.entries(compat)) {
if (models[model]) return models[model][0];
}
};
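// Usage sketch with a made-up compatibility table: spaCy versions map to
// models, which map to lists of compatible model versions, newest first.
const exampleCompat = { '2.0.0': { en_core_web_sm: ['2.0.0', '1.2.0'] } };
console.assert(getLatestVersion('en_core_web_sm', exampleCompat) === '2.0.0');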
export class ModelLoader {
/**
* Load model meta from GitHub and update model details on site. Uses the
* Templater mini template engine to update DOM.
* @param {string} repo - Path to GitHub repository containing releases.
* @param {Array} models - List of model IDs, e.g. "en_core_web_sm".
* @param {Object} licenses - License IDs mapped to URLs.
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
* 'parser', 'ner', 'speed', mapped to labels.
*/
constructor(repo, models = [], licenses = {}, benchmarkKeys = {}) {
this.url = `https://raw.githubusercontent.com/${repo}/master`;
this.repo = `https://github.com/${repo}`;
this.modelIds = models;
this.licenses = licenses;
this.benchKeys = benchmarkKeys;
this.init();
}
init() {
this.modelIds.forEach(modelId =>
new Templater(modelId).get('table').setAttribute('data-loading', ''));
this.fetch(`${this.url}/compatibility.json`)
.then(json => this.getModels(json.spacy))
.catch(_ => this.modelIds.forEach(modelId => this.showError(modelId)));
// make sure scroll positions for progress bar etc. are recalculated
window.dispatchEvent(new Event('resize'));
}
fetch(url) {
return new Promise((resolve, reject) =>
fetch(url).then(res => handleResponse(res))
.then(json => json.ok ? resolve(json) : reject()))
}
getModels(compat) {
this.compat = compat;
for (let modelId of this.modelIds) {
const version = getLatestVersion(modelId, compat);
if (version) this.fetch(`${this.url}/meta/${modelId}-${version}.json`)
.then(json => this.render(json))
.catch(_ => this.showError(modelId))
else this.showError(modelId);
}
}
showError(modelId) {
const tpl = new Templater(modelId);
tpl.get('table').removeAttribute('data-loading');
tpl.get('error').style.display = 'block';
for (let key of ['sources', 'pipeline', 'vectors', 'author', 'license']) {
tpl.get(key).parentElement.parentElement.style.display = 'none';
}
}
/**
* Update model details in tables. Currently quite hacky :(
*/
render(data) {
const modelId = `${data.lang}_${data.name}`;
const model = `${modelId}-${data.version}`;
const tpl = new Templater(modelId);
tpl.get('error').style.display = 'none';
this.renderDetails(tpl, data);
this.renderBenchmarks(tpl, data.accuracy, data.speed);
this.renderCompat(tpl, modelId);
tpl.get('download').setAttribute('href', `${this.repo}/releases/tag/${model}`);
tpl.get('table').removeAttribute('data-loading');
}
renderDetails(tpl, { version, size, description, notes, author, url,
license, sources, vectors, pipeline }) {
const basics = { version, size, description, notes };
for (let [key, value] of Object.entries(basics)) {
if (value) tpl.fill(key, value);
}
if (author) tpl.fill('author', formats.author(author, url), true);
if (license) tpl.fill('license', formats.license(license, this.licenses[license]), true);
if (sources) tpl.fill('sources', formats.sources(sources));
if (vectors) tpl.fill('vectors', formats.vectors(vectors));
else tpl.get('vectors').parentElement.parentElement.style.display = 'none';
if (pipeline && pipeline.length) tpl.fill('pipeline', formats.pipeline(pipeline), true);
else tpl.get('pipeline').parentElement.parentElement.style.display = 'none';
}
renderBenchmarks(tpl, accuracy = {}, speed = {}) {
if (!Object.keys(accuracy).length && !Object.keys(speed).length) return;
this.renderTable(tpl, 'parser', accuracy, val => val.toFixed(2));
this.renderTable(tpl, 'ner', accuracy, val => val.toFixed(2));
this.renderTable(tpl, 'speed', speed, Math.round);
tpl.get('benchmarks').style.display = 'block';
}
renderTable(tpl, id, benchmarks, converter = val => val) {
if (!this.benchKeys[id] || !Object.keys(this.benchKeys[id]).some(key => benchmarks[key])) return;
for (let key of Object.keys(this.benchKeys[id])) {
if (benchmarks[key]) tpl
.fill(key, convertNumber(converter(benchmarks[key])))
.parentElement.style.display = 'table-row';
}
tpl.get(id).style.display = 'block';
}
renderCompat(tpl, modelId) {
tpl.get('compat-wrapper').style.display = 'table-row';
const header = '<option selected disabled>spaCy version</option>';
const options = Object.keys(this.compat)
.map(v => `<option value="${v}">v${v}</option>`)
.join('');
tpl
.fill('compat', header + options, true)
.addEventListener('change', ({ target: { value }}) =>
tpl.fill('compat-versions', this.getCompat(value, modelId), true))
}
getCompat(version, model) {
const res = this.compat[version][model];
return res ? `<code>${model}-${res[0]}</code>` : '<em>not compatible</em>';
}
}
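// Usage sketch, mirroring the constructor signature above; the license map
// and benchmark keys below are illustrative, not the site's actual config:
// new ModelLoader('explosion/spacy-models',
//     ['en_core_web_sm', 'en_core_web_lg'],
//     { 'CC BY-SA 3.0': 'https://creativecommons.org/licenses/by-sa/3.0/' },
//     { parser: { uas: 'UAS', las: 'LAS' }, speed: { cpu: 'words/s CPU' } });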
export class ModelComparer {
/**
* Compare two model meta files and render chart and comparison table.
* @param {string} repo - Path to GitHub repository containing releases.
* @param {Object} licenses - License IDs mapped to URLs.
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
* 'parser', 'ner', 'speed', mapped to labels.
* @param {Object} languages - Available languages, ID mapped to name.
* @param {Object} labels - Display labels for model types and genres, ID
*     mapped to name.
* @param {Object} defaultModels - Models to compare on load, 'model1' and
* 'model2' mapped to model names.
*/
constructor(repo, licenses = {}, benchmarkKeys = {}, languages = {}, labels = {}, defaultModels) {
this.url = `https://raw.githubusercontent.com/${repo}/master`;
this.repo = `https://github.com/${repo}`;
this.tpl = new Templater('compare');
this.benchKeys = benchmarkKeys;
this.licenses = licenses;
this.languages = languages;
this.labels = labels;
this.models = {};
this.colors = CHART_COLORS;
this.fonts = CHART_FONTS;
this.defaultModels = defaultModels;
this.tpl.get('result').style.display = 'block';
this.fetchCompat()
.then(compat => this.init(compat))
.catch(this.showError.bind(this))
}
init(compat) {
this.compat = compat;
const selectA = this.tpl.get('model1');
const selectB = this.tpl.get('model2');
selectA.addEventListener('change', this.onSelect.bind(this));
selectB.addEventListener('change', this.onSelect.bind(this));
this.chart = new Chart('chart_compare_accuracy', { type: 'bar', options: {
responsive: true,
legend: { position: 'bottom', labels: { fontFamily: this.fonts.legend, fontSize: 13 }},
scales: {
yAxes: [{ label: 'Accuracy', ticks: { min: 70, fontFamily: this.fonts.ticks }}],
xAxes: [{ barPercentage: 0.75, ticks: { fontFamily: this.fonts.ticks }}]
}
}});
if (this.defaultModels) {
selectA.value = this.defaultModels.model1;
selectB.value = this.defaultModels.model2;
this.getModels(this.defaultModels);
}
}
fetchCompat() {
return new Promise((resolve, reject) =>
fetch(`${this.url}/compatibility.json`)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(json.spacy) : reject()))
}
fetchModel(name) {
const version = getLatestVersion(name, this.compat);
const modelName = `${name}-${version}`;
return new Promise((resolve, reject) => {
// resolve immediately if the model is already cached in this.models
if (this.models[name]) resolve(this.models[name]);
else fetch(`${this.url}/meta/${modelName}.json`)
.then(res => handleResponse(res))
.then(json => json.ok ? resolve(this.saveModel(name, json)) : reject())
})
}
/**
* "Save" meta to this.models so it only has to be fetched from GitHub once.
* @param {string} name - The model name.
* @param {Object} data - The model meta data.
*/
saveModel(name, data) {
this.models[name] = data;
return data;
}
showError(err) {
console.error(err);
this.tpl.get('result').style.display = 'none';
this.tpl.get('error').style.display = 'block';
}
onSelect(ev) {
const modelId = ev.target.value;
const otherId = (ev.target.id == 'model1') ? 'model2' : 'model1';
const otherVal = this.tpl.get(otherId);
const otherModel = otherVal.options[otherVal.selectedIndex].value;
if (otherModel != '') this.getModels({
[ev.target.id]: modelId,
[otherId]: otherModel
})
}
getModels({ model1, model2 }) {
this.tpl.get('result').setAttribute('data-loading', '');
this.fetchModel(model1)
.then(data1 => this.fetchModel(model2)
.then(data2 => this.render({ model1: data1, model2: data2 })))
.catch(this.showError.bind(this))
}
/**
* Render two models, and populate the chart and table. Currently quite hacky :(
* @param {Object} models - The models to render.
* @param {Object} models.model1 - The first model (via first <select>).
* @param {Object} models.model2 - The second model (via second <select>).
*/
render({ model1, model2 }) {
const accKeys = Object.assign({}, this.benchKeys.parser, this.benchKeys.ner);
const allKeys = [...Object.keys(model1.accuracy || {}), ...Object.keys(model2.accuracy || {})];
const metaKeys = Object.keys(accKeys).filter(k => allKeys.includes(k));
const labels = metaKeys.map(key => accKeys[key]);
const datasets = [model1, model2]
.map(({ lang, name, version, accuracy = {} }, i) => ({
label: `${lang}_${name}-${version}`,
backgroundColor: this.colors[`model${i + 1}`],
data: metaKeys.map(key => (accuracy[key] || 0).toFixed(2))
}));
this.chart.data = { labels, datasets };
this.chart.update();
[model1, model2].forEach((model, i) => this.renderTable(metaKeys, i + 1, model));
this.tpl.get('result').removeAttribute('data-loading');
}
renderTable(metaKeys, i, { lang, name, version, size, description,
notes, author, url, license, sources, vectors, pipeline, accuracy = {},
speed = {}}) {
const type = name.split('_')[0]; // extract type from model name
const genre = name.split('_')[1]; // extract genre from model name
this.tpl.fill(`table-head${i}`, `${lang}_${name}`);
this.tpl.get(`link${i}`).setAttribute('href', `/models/${lang}#${lang}_${name}`);
this.tpl.fill(`download${i}`, `spacy download ${lang}_${name}\n`);
this.tpl.fill(`lang${i}`, this.languages[lang] || lang);
this.tpl.fill(`type${i}`, this.labels[type] || type);
this.tpl.fill(`genre${i}`, this.labels[genre] || genre);
this.tpl.fill(`version${i}`, formats.version(version), true);
this.tpl.fill(`size${i}`, size);
this.tpl.fill(`desc${i}`, description || 'n/a');
this.tpl.fill(`pipeline${i}`, formats.pipeline(pipeline), true);
this.tpl.fill(`vectors${i}`, formats.vectors(vectors));
this.tpl.fill(`sources${i}`, formats.sources(sources));
this.tpl.fill(`author${i}`, formats.author(author, url), true);
this.tpl.fill(`license${i}`, formats.license(license, this.licenses[license]), true);
// check if model accuracy or speed includes one of the pre-set keys
for (let key of [...metaKeys, ...Object.keys(this.benchKeys.speed)]) {
if (accuracy[key]) this.tpl.fill(`${key}${i}`, accuracy[key].toFixed(2))
else if (speed[key]) this.tpl.fill(`${key}${i}`, convertNumber(Math.round(speed[key])))
else this.tpl.fill(`${key}${i}`, 'n/a')
}
}
}
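// Usage sketch: the default models match the comparison page config; the
// licenses, benchmark keys, languages and labels arguments are passed in
// from the page and are only placeholders here:
// new ModelComparer('explosion/spacy-models', {}, benchmarkKeys,
//     { en: 'English' }, {}, { model1: 'en_core_web_sm', model2: 'en_core_web_lg' });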

website/assets/js/nav-highlighter.js Normal file

@ -0,0 +1,35 @@
'use strict';
import { $, $$ } from './util.js';
export default class NavHighlighter {
/**
* Highlight the section currently in the viewport in the sidebar, using the in-view library.
* @param {string} sectionAttr - Data attribute of sections.
* @param {string} navAttr - Data attribute of navigation items.
* @param {string} activeClass - Class name of active element.
*/
constructor(sectionAttr, navAttr, activeClass = 'is-active') {
this.sections = [...$$(`[${navAttr}]`)];
// highlight first item regardless
if (this.sections.length) this.sections[0].classList.add(activeClass);
this.navAttr = navAttr;
this.sectionAttr = sectionAttr;
this.activeClass = activeClass;
if (window.inView) inView(`[${sectionAttr}]`)
.on('enter', this.highlightSection.bind(this));
}
/**
* Check if section in view exists in sidebar and mark as active.
* @param {node} section - The section in view.
*/
highlightSection(section) {
const id = section.getAttribute(this.sectionAttr);
const el = $(`[${this.navAttr}="${id}"]`);
if (el) {
this.sections.forEach(el => el.classList.remove(this.activeClass));
el.classList.add(this.activeClass);
}
}
}
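// Usage sketch (attribute names are assumptions, not the site's actual
// markup): sections carry data-section="id", sidebar links carry data-nav="id".
// new NavHighlighter('data-section', 'data-nav');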

website/assets/js/progress.js Normal file

@ -0,0 +1,52 @@
'use strict';
import { $ } from './util.js';
export default class ProgressBar {
/**
* Animated reading progress bar.
* @param {string} selector - CSS selector of progress bar element.
*/
constructor(selector) {
this.scrollY = 0;
this.sizes = this.updateSizes();
this.el = $(selector);
this.el.setAttribute('max', 100);
window.addEventListener('scroll', this.onScroll.bind(this));
window.addEventListener('resize', this.onResize.bind(this));
}
onScroll(ev) {
this.scrollY = (window.pageYOffset || document.documentElement.scrollTop) - (document.documentElement.clientTop || 0);
requestAnimationFrame(this.update.bind(this));
}
onResize(ev) {
this.sizes = this.updateSizes();
requestAnimationFrame(this.update.bind(this));
}
update() {
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
}
/**
* Update scroll and viewport height. Called on load and window resize.
*/
updateSizes() {
return {
height: Math.max(
document.body.scrollHeight,
document.body.offsetHeight,
document.documentElement.clientHeight,
document.documentElement.scrollHeight,
document.documentElement.offsetHeight
),
vh: Math.max(
document.documentElement.clientHeight,
window.innerHeight || 0
)
}
}
}
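// Usage sketch, assuming a <progress class="c-progress"> element in the page
// (the selector is an assumption, not the site's actual markup):
// new ProgressBar('.c-progress');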

website/assets/js/rollup.js Normal file

@ -0,0 +1,23 @@
/**
* This file is bundled by Rollup, compiled with Babel and included as
* <script nomodule> for older browsers that don't yet support JavaScript
* modules. Browsers that do will ignore this bundle and won't even fetch it
* from the server. Details:
* https://github.com/rollup/rollup
* https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
*/
// Import all modules that are instantiated directly in _includes/_scripts.jade
import ProgressBar from './progress.js';
import NavHighlighter from './nav-highlighter.js';
import Changelog from './changelog.js';
import GitHubEmbed from './github-embed.js';
import { ModelLoader, ModelComparer } from './models.js';
// Assign to window so they are bundled by rollup
window.ProgressBar = ProgressBar;
window.NavHighlighter = NavHighlighter;
window.Changelog = Changelog;
window.GitHubEmbed = GitHubEmbed;
window.ModelLoader = ModelLoader;
window.ModelComparer = ModelComparer;

website/assets/js/util.js Normal file

@ -0,0 +1,69 @@
'use strict';
export const $ = document.querySelector.bind(document);
export const $$ = document.querySelectorAll.bind(document);
export class Templater {
/**
* Mini templating engine based on data attributes. Selects elements based
* on a data-tpl and data-tpl-key attribute and can set textContent
* and innerHTML.
* @param {string} templateId - Template section, e.g. value of data-tpl.
*/
constructor(templateId) {
this.templateId = templateId;
}
/**
* Get an element from the template and return it.
* @param {string} key - Name of the key within the current template.
*/
get(key) {
return $(`[data-tpl="${this.templateId}"][data-tpl-key="${key}"]`);
}
/**
* Fill the content of a template element with a value.
* @param {string} key - Name of the key within the current template.
* @param {string} value - Content to insert into template element.
* @param {boolean} html - Insert content as HTML. Defaults to false.
*/
fill(key, value, html = false) {
const el = this.get(key);
if (html) el.innerHTML = value || '';
else el.textContent = value || '';
return el;
}
}
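// Usage sketch: assumes markup like
// <div data-tpl="changelog"><span data-tpl-key="title"></span></div>
// (the template and key names here are illustrative).
// const tpl = new Templater('changelog');
// tpl.fill('title', 'Some release');                // sets textContent
// tpl.fill('title', '<code>v2.0.0</code>', true);   // sets innerHTML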
/**
* Handle API response and assign status to returned JSON.
* @param {Response} res - The response.
*/
export const handleResponse = res => {
if (res.ok) return res.json()
.then(json => Object.assign({}, json, { ok: res.ok }))
else return ({ ok: res.ok })
};
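// Usage sketch: because the status is merged into the payload, callers can
// branch on a single object. fetchJSON is a hypothetical helper, and the URL
// passed to it would be illustrative:
const fetchJSON = url => fetch(url)
    .then(handleResponse)
    .then(json => json.ok ? json : Promise.reject(url));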
/**
* Convert a number to a string and add thousand separator.
* @param {number|string} num - The number to convert.
* @param {string} separator - Thousand separator.
*/
export const convertNumber = (num = 0, separator = ',') =>
num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, separator);
/**
* Abbreviate a number, e.g. 14249930 --> 14.25m.
* @param {number|string} num - The number to convert.
* @param {number} fixed - Number of decimals.
*/
export const abbrNumber = (num = 0, fixed = 2) => {
const suffixes = ['', 'k', 'm', 'b', 't'];
if (num === null || num === 0) return 0;
const b = num.toPrecision(2).split('e');
const k = (b.length === 1) ? 0 : Math.floor(Math.min(b[1].slice(1), 14) / 3);
const c = (k < 1) ? num.toFixed(fixed) : (num / Math.pow(10, k * 3)).toFixed(fixed + 1);
return (c < 0 ? c : Math.abs(c)) + suffixes[k];
}
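// Sanity sketch of the number helpers; the expected abbreviation matches the
// example given in the docstring above:
console.assert(convertNumber(1234567) === '1,234,567');
console.assert(abbrNumber(14249930) === '14.25m');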


@ -1,7 +1,8 @@
{
"sidebar": {
"Models": {
"Overview": "./"
"Overview": "./",
"Comparison": "comparison"
},
"Language models": {
@ -26,6 +27,17 @@
}
},
"comparison": {
"title": "Model Comparison",
"teaser": "Compare spaCy's statistical models and their accuracy.",
"tag": "experimental",
"compare_models": true,
"default_models": {
"model1": "en_core_web_sm",
"model2": "en_core_web_lg"
}
},
"MODELS": {
"en": ["en_core_web_sm", "en_core_web_lg", "en_vectors_web_lg"],
"de": ["de_dep_news_sm"],
@ -88,6 +100,7 @@
"hu": "Hungarian",
"pl": "Polish",
"he": "Hebrew",
"ga": "Irish",
"bn": "Bengali",
"hi": "Hindi",
"id": "Indonesian",
@ -102,6 +115,8 @@
"de": "Dies ist ein Satz.",
"fr": "C'est une phrase.",
"es": "Esto es una frase.",
"pt": "Esta é uma frase.",
"it": "Questa è una frase.",
"xx": "This is a sentence about Facebook."
}
}


@ -0,0 +1,81 @@
//- 💫 DOCS > MODELS > COMPARISON
include ../_includes/_mixins
p
| This experimental tool helps you compare spaCy's statistical models
| by features, accuracy and speed. This can be especially useful to get an
| idea of the trade-offs between larger and smaller models of the same
| type. For example, #[code lg] models tend to be more accurate than
| the corresponding #[code sm] versions, but they're often significantly
| larger in file size and memory usage.
- TPL = "compare"
+grid.o-box
for i in [1, 2]
+grid-col("half", "no-gutter")
label.u-heading.u-text-label.u-text-center.u-color-theme(for="model#{i}") Model #{i}
.o-field.o-grid.o-grid--vcenter.u-padding-small
select.o-field__select.u-text-small(id="model#{i}" data-tpl=TPL data-tpl-key="model#{i}")
option(selected="" disabled="" value="") Select model...
for models, _ in MODELS
for model in models
option(value=model)=model
div(data-tpl=TPL data-tpl-key="error")
+infobox
| Unable to load model details and accuracy figures from GitHub to
| compare the models. For details of the individual models, see the
| overview of the
| #[+a(gh("spacy-models") + "/releases") latest model releases].
div(data-tpl=TPL data-tpl-key="result" style="display: none")
+chart("compare_accuracy", 350)
+aside-code("Download", "text")
for i in [1, 2]
span(data-tpl=TPL data-tpl-key="download#{i}")
+table.o-block-small(data-tpl=TPL data-tpl-key="table")
+row("head")
+head-cell
for i in [1, 2]
+head-cell(style="width: 40%")
a(data-tpl=TPL data-tpl-key="link#{i}")
code(data-tpl=TPL data-tpl-key="table-head#{i}" style="text-transform: initial; font-weight: normal")
for label, id in {lang: "Language", type: "Type", genre: "Genre"}
+row
+cell #[+label=label]
for i in [1, 2]
+cell(data-tpl=TPL data-tpl-key="#{id}#{i}") n/a
for label in ["Version", "Size", "Pipeline", "Vectors", "Sources", "Author", "License"]
- var field = label.toLowerCase()
+row
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
for i in [1, 2]
+cell
span(data-tpl=TPL data-tpl-key=field + i) #[em n/a]
+row
+cell #[+label Description]
for i in [1, 2]
+cell.u-text-tiny(data-tpl=TPL data-tpl-key="desc#{i}") n/a
for benchmark, _ in MODEL_BENCHMARKS
- var counter = 0
for label, field in benchmark
+row((counter == 0) ? "divider" : null)
+cell.u-nowrap
+label=label
if MODEL_META[field]
| #[+help(MODEL_META[field]).u-color-subtle]
for i in [1, 2]
+cell
span(data-tpl=TPL data-tpl-key=field + i) n/a
- counter++


@ -8,13 +8,15 @@
"devDependencies": {
"babel-cli": "^6.14.0",
"harp": "^0.24.0",
"rollup": "^0.50.0",
"uglify-js": "^2.7.3"
},
"dependencies": {},
"scripts": {
"compile": "NODE_ENV=deploy harp compile",
"compile_js": "babel www/assets/js/main.js --out-file www/assets/js/main.js --presets=es2015",
"uglify": "uglifyjs www/assets/js/main.js --output www/assets/js/main.js",
"build": "npm run compile && npm run compile_js && npm run uglify"
"rollup_js": "rollup www/assets/js/rollup.js --output.format iife --output.file www/assets/js/rollup.js",
"compile_rollup": "babel www/assets/js/rollup.js --out-file www/assets/js/rollup.js --presets=es2015",
"uglify": "uglifyjs www/assets/js/rollup.js --output www/assets/js/rollup.js",
"build": "npm run compile && echo 'Compiled website' && npm run rollup_js && echo 'Bundled rollup.js' && npm run compile_rollup && echo 'Compiled rollup.js' && npm run uglify && echo 'Uglified rollup.js'"
}
}


@ -130,10 +130,11 @@ include _includes/_mixins
| capabilities and can be used to mark features that require a
| respective model to be installed.
p.o-block.o-inline-list
+tag I'm a tag
+tag-new(2)
+tag-model("Named entities")
.o-block
p.o-inline-list
+tag I'm a tag
+tag-new(2)
+tag-model("Named entities")
+h(3, "icons", "website/_includes/_svg.jade") Icons
@ -359,18 +360,14 @@ include _includes/_mixins
script(src="/assets/js/chart.min.js")
script new Chart('chart_accuracy', { datasets: [] })
+grid
+grid-col("half")
+chart("accuracy", 400)
+chart("accuracy", 400)
+chart("speed", 300)
+grid-col("half")
+chart("speed", 300)
script(src="/assets/js/chart.min.js")
script(src="/assets/js/vendor/chart.min.js")
script.
Chart.defaults.global.defaultFontFamily = "-apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'";
new Chart('chart_accuracy', { type: 'bar', options: { legend: false, responsive: true, scales: { yAxes: [{ label: 'Accuracy', ticks: { suggestedMin: 70 } }], xAxes: [{ barPercentage: 0.425 }]}}, data: { labels: ['UAS', 'LAS', 'POS', 'NER F', 'NER P', 'NER R'], datasets: [{ label: 'en_core_web_sm', data: [91.49, 89.66, 97.23, 86.46, 86.78, 86.15], backgroundColor: '#09a3d5' }]}});
new Chart('chart_speed', { type: 'horizontalBar', options: { legend: false, responsive: true, scales: { xAxes: [{ label: 'Speed', ticks: { suggestedMin: 0 }}], yAxes: [{ barPercentage: 0.425 }]}}, data: { labels: ['w/s CPU', 'w/s GPU'], datasets: [{ label: 'en_core_web_sm', data: [9575, 25531], backgroundColor: '#09a3d5'}]}});
new Chart('chart_accuracy', { type: 'bar', options: { legend: { position: 'bottom'}, responsive: true, scales: { yAxes: [{ label: 'Accuracy', ticks: { suggestedMin: 70 } }], xAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['UAS', 'LAS', 'POS', 'NER F', 'NER P', 'NER R'], datasets: [{ label: 'en_core_web_sm', data: [91.65, 89.77, 97.05, 84.80, 84.53, 85.06], backgroundColor: '#09a3d5' }, { label: 'en_core_web_lg', data: [91.49, 89.66, 97.23, 86.46, 86.78, 86.15], backgroundColor: '#066B8C'}]}});
new Chart('chart_speed', { type: 'horizontalBar', options: { legend: { position: 'bottom'}, responsive: true, scales: { xAxes: [{ label: 'Speed', ticks: { suggestedMin: 0 }}], yAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['w/s CPU', 'w/s GPU'], datasets: [{ label: 'en_core_web_sm', data: [9575, 25531], backgroundColor: '#09a3d5'}, { label: 'en_core_web_lg', data: [8421, 22092], backgroundColor: '#066B8C'}]}});
+section("embeds")
+h(2, "embeds") Embeds


@ -79,6 +79,7 @@
"title": "What's New in v2.0",
"teaser": "New features, backwards incompatibilities and migration guide.",
"menu": {
"Summary": "summary",
"New features": "features",
"Backwards Incompatibilities": "incompat",
"Migrating from v1.x": "migrating",
@ -116,7 +117,6 @@
"next": "text-classification",
"menu": {
"Basics": "basics",
"Similarity in Context": "in-context",
"Custom Vectors": "custom",
"GPU Usage": "gpu"
}


@ -19,6 +19,7 @@
+qs({package: 'source'}) git clone https://github.com/explosion/spaCy
+qs({package: 'source'}) cd spaCy
+qs({package: 'source'}) export PYTHONPATH=`pwd`
+qs({package: 'source'}) pip install -r requirements.txt
+qs({package: 'source'}) pip install -e .


@ -46,7 +46,6 @@ p
+item #[strong Chinese]: #[+a("https://github.com/fxsjy/jieba") Jieba]
+item #[strong Japanese]: #[+a("https://github.com/mocobeta/janome") Janome]
+item #[strong Thai]: #[+a("https://github.com/wannaphongcom/pythainlp") pythainlp]
+item #[strong Russian]: #[+a("https://github.com/kmike/pymorphy2") pymorphy2]
+h(3, "multi-language") Multi-language support
+tag-new(2)


@ -76,6 +76,16 @@ p
("Google rebrands its business apps", [(0, 6, "ORG")]),
("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
+infobox("Tip: Try the Prodigy annotation tool")
+infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
| If you need to label a lot of data, check out
| #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
| annotation tool we've developed. Prodigy is fast and extensible, and
| comes with a modern #[strong web application] that helps you collect
| training data faster. It integrates seamlessly with spaCy, pre-selects
| the #[strong most relevant examples] for annotation, and lets you
| train and evaluate ready-to-use spaCy models.
+h(3, "annotations") Training with annotations
p
@ -180,9 +190,10 @@ p
+cell #[code optimizer]
+cell Callable to update the model's weights.
+infobox
| For the #[strong full example and more details], see the usage guide on
| #[+a("/usage/training#ner") training the named entity recognizer],
| or the runnable
| #[+src(gh("spaCy", "examples/training/train_ner.py")) training script]
| on GitHub.
p
| Instead of writing your own training loop, you can also use the
| built-in #[+api("cli#train") #[code train]] command, which expects data
| in spaCy's #[+a("/api/annotation#json-input") JSON format]. On each epoch,
| a model will be saved out to the directory. After training, you can
| use the #[+api("cli#package") #[code package]] command to generate an
| installable Python package from your model.


@ -190,7 +190,3 @@ p
+item
| #[strong Test] the model to make sure the parser works as expected.
+h(3, "training-json") JSON format for training
include ../../api/_annotation/_training


@ -0,0 +1,237 @@
//- 💫 DOCS > USAGE > WHAT'S NEW IN V2.0 > NEW FEATURES
p
| This section contains an overview of the most important
| #[strong new features and improvements]. The #[+a("/api") API docs]
| include additional deprecation notes. New methods and functions that
| were introduced in this version are marked with a
| #[span.u-text-tag.u-text-tag--spaced v2.0] tag.
+h(3, "features-models") Convolutional neural network models
+aside-code("Example", "bash")
for model in ["en", "de", "fr", "es", "pt", "it"]
| spacy download #{model} # default #{LANGUAGES[model]} model!{'\n'}
| spacy download xx_ent_wiki_sm # multi-language NER
p
| spaCy v2.0 features new neural models for tagging,
| parsing and entity recognition. The models have
| been designed and implemented from scratch specifically for spaCy, to
| give you an unmatched balance of speed, size and accuracy. The new
| models are #[strong 10&times; smaller], #[strong 20% more accurate],
| and #[strong just as fast] as the previous generation.
| #[strong GPU usage] is now supported via
| #[+a("http://chainer.org") Chainer]'s CuPy module.
+infobox
| #[+label-inline Usage:] #[+a("/models") Models directory],
| #[+a("/models/comparison") Models comparison],
| #[+a("/usage/#gpu") Using spaCy with GPU]
+h(3, "features-pipelines") Improved processing pipelines
+aside-code("Example").
# Set custom attributes
Doc.set_extension('my_attr', default=False)
Token.set_extension('my_attr', getter=my_token_getter)
assert doc._.my_attr, token._.my_attr
# Add components to the pipeline
my_component = lambda doc: doc
nlp.add_pipe(my_component)
p
| It's now much easier to #[strong customise the pipeline] with your own
| components: functions that receive a #[code Doc] object, modify and
| return it. Extensions let you write any
| #[strong attributes, properties and methods] to the #[code Doc],
| #[code Token] and #[code Span]. You can add data, implement new
| features, integrate other libraries with spaCy or plug in your own
| machine learning models.
+image
include ../../assets/img/pipeline.svg
+infobox
| #[+label-inline API:] #[+api("language") #[code Language]],
| #[+api("doc#set_extension") #[code Doc.set_extension]],
| #[+api("span#set_extension") #[code Span.set_extension]],
| #[+api("token#set_extension") #[code Token.set_extension]]
| #[+label-inline Usage:]
| #[+a("/usage/processing-pipelines") Processing pipelines]
| #[+label-inline Code:]
| #[+src("/usage/examples#section-pipeline") Pipeline examples]
+h(3, "features-text-classification") Text classification
+aside-code("Example").
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
optimizer = nlp.begin_training()
for itn in range(100):
for doc, gold in train_data:
nlp.update([doc], [gold], sgd=optimizer)
doc = nlp(u'This is a text.')
print(doc.cats)
p
| spaCy v2.0 lets you add text categorization models to spaCy pipelines.
| The model supports classification with multiple, non-mutually
| exclusive labels, so more than one label can apply at once. You can
| change the model architecture rather easily, but by default, the
| #[code TextCategorizer] class uses a convolutional neural network to
| assign position-sensitive vectors to each word in the document.
+infobox
| #[+label-inline API:] #[+api("textcategorizer") #[code TextCategorizer]],
| #[+api("doc#attributes") #[code Doc.cats]],
| #[+api("goldparse#attributes") #[code GoldParse.cats]]#[br]
| #[+label-inline Usage:] #[+a("/usage/text-classification") Text classification]
+h(3, "features-hash-ids") Hash values instead of integer IDs
+aside-code("Example").
doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
beer_hash = doc.vocab.strings.add(u'beer')
assert doc.vocab.strings[u'beer'] == beer_hash
assert doc.vocab.strings[beer_hash] == u'beer'
p
| The #[+api("stringstore") #[code StringStore]] now resolves all strings
| to hash values instead of integer IDs. This means that the string-to-int
| mapping #[strong no longer depends on the vocabulary state], making a lot
| of workflows much simpler, especially during training. Unlike integer IDs
| in spaCy v1.x, hash values will #[strong always match] even across
| models. Strings can now be added explicitly using the new
| #[+api("stringstore#add") #[code Stringstore.add]] method. A token's hash
| is available via #[code token.orth].
+infobox
| #[+label-inline API:] #[+api("stringstore") #[code StringStore]]
| #[+label-inline Usage:] #[+a("/usage/spacy-101#vocab") Vocab, hashes and lexemes 101]
+h(3, "features-vectors") Improved word vectors support
+aside-code("Example").
for word, vector in vector_data:
nlp.vocab.set_vector(word, vector)
nlp.vocab.vectors.from_glove('/path/to/vectors')
# keep 10000 unique vectors and remap the rest
nlp.vocab.prune_vectors(10000)
nlp.to_disk('/model')
p
| The new #[+api("vectors") #[code Vectors]] class helps the
| #[code Vocab] manage the vectors assigned to strings, and lets you
| assign vectors individually, or
| #[+a("/usage/vectors-similarity#custom-loading-glove") load in GloVe vectors]
| from a directory. To help you strike a good balance between coverage
| and memory usage, the #[code Vectors] class lets you map
| #[strong multiple keys] to the #[strong same row] of the table. If
| you're using the #[+api("cli#vocab") #[code spacy vocab]] command to
| create a vocabulary, pruning the vectors will be taken care of
| automatically. Otherwise, you can use the new
| #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]].
+infobox
| #[+label-inline API:] #[+api("vectors") #[code Vectors]],
| #[+api("vocab") #[code Vocab]]
| #[+label-inline Usage:] #[+a("/usage/vectors-similarity") Word vectors and semantic similarity]
+h(3, "features-serializer") Saving, loading and serialization
+aside-code("Example").
nlp = spacy.load('en') # shortcut link
nlp = spacy.load('en_core_web_sm') # package
nlp = spacy.load('/path/to/en') # unicode path
nlp = spacy.load(Path('/path/to/en')) # pathlib Path
nlp.to_disk('/path/to/nlp')
nlp = English().from_disk('/path/to/nlp')
p
| spaCy's serialization API has been made consistent across classes and
| objects. All container classes, i.e. #[code Language], #[code Doc],
| #[code Vocab] and #[code StringStore], now have #[code to_bytes()],
| #[code from_bytes()], #[code to_disk()] and #[code from_disk()] methods,
| and all support the Pickle protocol.
p
| The improved #[code spacy.load] makes loading models easier and more
| transparent. You can load a model by supplying its
| #[+a("/usage/models#usage") shortcut link], the name of an installed
| #[+a("/usage/saving-loading#generating") model package] or a path.
| The #[code Language] class to initialise will be determined based on the
| model's settings. For a blank language, you can import the class directly,
| e.g. #[code from spacy.lang.en import English].
+infobox
| #[+label-inline API:] #[+api("spacy#load") #[code spacy.load]]
| #[+label-inline Usage:] #[+a("/usage/saving-loading") Saving and loading]
+h(3, "features-displacy") displaCy visualizer with Jupyter support
+aside-code("Example").
from spacy import displacy
doc = nlp(u'This is a sentence about Facebook.')
displacy.serve(doc, style='dep') # run the web server
html = displacy.render(doc, style='ent') # generate HTML
p
| Our popular dependency and named entity visualizers are now an official
| part of the spaCy library. displaCy can run a simple web server, or
| generate raw HTML markup or SVG files to be exported. You can pass in one
| or more docs, and customise the style. displaCy also auto-detects whether
| you're running #[+a("https://jupyter.org") Jupyter] and will render the
| visualizations in your notebook.
+infobox
| #[+label-inline API:] #[+api("displacy") #[code displacy]]
| #[+label-inline Usage:] #[+a("/usage/visualizers") Visualizing spaCy]
+h(3, "features-language") Improved language data and lazy loading
p
| Language-specific data now lives in its own submodule, #[code spacy.lang].
| Languages are lazy-loaded, i.e. only loaded when you import a
| #[code Language] class, or load a model that initialises one. This allows
| languages to contain more custom data, e.g. lemmatizer lookup tables, or
| complex regular expressions. The language data has also been tidied up
| and simplified. spaCy now also supports simple lookup-based
| lemmatization and #[strong #{LANG_COUNT} languages] in total!
+infobox
| #[+label-inline API:] #[+api("language") #[code Language]]
| #[+label-inline Code:] #[+src(gh("spaCy", "spacy/lang")) #[code spacy/lang]]
| #[+label-inline Usage:] #[+a("/usage/adding-languages") Adding languages]
+h(3, "features-matcher") Revised matcher API and phrase matcher
+aside-code("Example").
from spacy.matcher import Matcher, PhraseMatcher
matcher = Matcher(nlp.vocab)
matcher.add('HEARTS', None, [{'ORTH': '❤️', 'OP': '+'}])
phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add('OBAMA', None, nlp(u"Barack Obama"))
p
| Patterns can now be added to the matcher by calling
| #[+api("matcher-add") #[code matcher.add()]] with a match ID, an optional
| callback function to be invoked on each match, and one or more patterns.
| This allows you to write powerful, pattern-specific logic using only one
| matcher. For example, you might only want to merge some entity types,
| and set custom flags for other matched patterns. The new
| #[+api("phrasematcher") #[code PhraseMatcher]] lets you efficiently
| match very large terminology lists using #[code Doc] objects as match
| patterns.
+infobox
| #[+label-inline API:] #[+api("matcher") #[code Matcher]],
| #[+api("phrasematcher") #[code PhraseMatcher]]
| #[+label-inline Usage:] #[+a("/usage/rule-based-matching") Rule-based matching]

Some files were not shown because too many files have changed in this diff.