Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-13 10:46:29 +03:00)

Commit d17a12c71d: Merge branch 'develop' of https://github.com/explosion/spaCy into develop
.github/contributors/jimregan.md (new file, 106 lines, vendored)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry                |
| ----------------------------- | -------------------- |
| Name                          | Jim O'Regan          |
| Company name (if applicable)  |                      |
| Title or role (if applicable) |                      |
| Date                          | 2017-06-24           |
| GitHub username               | jimregan             |
| Website (optional)            |                      |
@@ -1,7 +1,6 @@
 #!/usr/bin/env python
 # coding: utf8
-"""
-A simple example of extracting relations between phrases and entities using
+"""A simple example of extracting relations between phrases and entities using
 spaCy's named entity recognizer and the dependency parse. Here, we extract
 money and currency values (entities labelled as MONEY) and then check the
 dependency tree to find the noun phrase they are referring to – for example:
@@ -1,8 +1,7 @@
 #!/usr/bin/env python
 # coding: utf8
-"""
-This example shows how to navigate the parse tree including subtrees attached
-to a word.
+"""This example shows how to navigate the parse tree including subtrees
+attached to a word.
 
 Based on issue #252:
 "In the documents and tutorials the main thing I haven't found is
@@ -1,9 +1,10 @@
+#!/usr/bin/env python
+# coding: utf8
 """Match a large set of multi-word expressions in O(1) time.
 
 The idea is to associate each word in the vocabulary with a tag, noting whether
 they begin, end, or are inside at least one pattern. An additional tag is used
 for single-word patterns. Complete patterns are also stored in a hash set.
 
 When we process a document, we look up the words in the vocabulary, to
 associate the words with the tags. We then search for tag-sequences that
 correspond to valid candidates. Finally, we look up the candidates in the hash
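The tagging scheme this docstring describes can be sketched in a few lines. The snippet below is an illustrative toy, not the example's actual implementation: each word is tagged with the pattern positions it can fill (B begin, I inside, L last, U single-word unit), tag sequences are expanded into candidate spans, and candidates are confirmed against the pattern hash set.

PATTERNS = {("new", "york"), ("new", "york", "city"), ("chicago",)}

word_tags = {}  # word -> set of position tags it can fill in any pattern
for pattern in PATTERNS:
    if len(pattern) == 1:
        word_tags.setdefault(pattern[0], set()).add("U")
    else:
        word_tags.setdefault(pattern[0], set()).add("B")
        word_tags.setdefault(pattern[-1], set()).add("L")
        for word in pattern[1:-1]:
            word_tags.setdefault(word, set()).add("I")

def match(tokens):
    # yield (start, end) spans confirmed against the pattern hash set
    for start, word in enumerate(tokens):
        tags = word_tags.get(word, set())
        if "U" in tags and (word,) in PATTERNS:
            yield (start, start + 1)
        if "B" in tags:
            for end in range(start + 1, len(tokens)):
                end_tags = word_tags.get(tokens[end], set())
                if "L" in end_tags and tuple(tokens[start:end + 1]) in PATTERNS:
                    yield (start, end + 1)
                if "I" not in end_tags and "L" not in end_tags:
                    break

print(list(match("i moved to new york city".split())))  # [(3, 5), (3, 6)]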
@@ -1,5 +1,6 @@
-"""
-Example of multi-processing with Joblib. Here, we're exporting
+#!/usr/bin/env python
+# coding: utf8
+"""Example of multi-processing with Joblib. Here, we're exporting
 part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
 each "sentence" on a newline, and spaces between tokens. Data is loaded from
 the IMDB movie reviews dataset and will be loaded automatically via Thinc's
@@ -94,7 +94,7 @@ def main(model=None, output_dir=None, n_iter=100):
 
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
     with nlp.disable_pipes(*other_pipes):  # only train parser
-        optimizer = nlp.begin_training(lambda: [])
+        optimizer = nlp.begin_training()
         for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
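This `begin_training(lambda: [])` to `begin_training()` change recurs in the NER, parser, tagger and textcat examples below; the gold-tuples argument evidently gained a default, so the dummy callable is no longer needed. A minimal sketch of the updated idiom, assuming the spaCy 2.0 API:

import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('parser'))
optimizer = nlp.begin_training()  # previously: nlp.begin_training(lambda: [])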
@@ -1,7 +1,6 @@
 #!/usr/bin/env python
 # coding: utf8
-"""
-Example of training spaCy's named entity recognizer, starting off with an
+"""Example of training spaCy's named entity recognizer, starting off with an
 existing model or a blank model.
 
 For more details, see the documentation:
@@ -1,7 +1,6 @@
 #!/usr/bin/env python
 # coding: utf8
-"""
-Example of training an additional entity type
+"""Example of training an additional entity type
 
 This script shows how to add a new entity type to an existing pre-trained NER
 model. To keep the example short and simple, only four sentences are provided
@@ -88,7 +87,7 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
     with nlp.disable_pipes(*other_pipes):  # only train NER
         random.seed(0)
-        optimizer = nlp.begin_training(lambda: [])
+        optimizer = nlp.begin_training()
         for itn in range(n_iter):
             losses = {}
             gold_parses = get_gold_parses(nlp.make_doc, TRAIN_DATA)
@@ -1,10 +1,7 @@
 #!/usr/bin/env python
 # coding: utf8
-"""
-Example of training spaCy dependency parser, starting off with an existing model
-or a blank model.
-
-For more details, see the documentation:
+"""Example of training spaCy dependency parser, starting off with an existing
+model or a blank model. For more details, see the documentation:
 * Training: https://alpha.spacy.io/usage/training
 * Dependency Parse: https://alpha.spacy.io/usage/linguistic-features#dependency-parse
 
@@ -67,7 +64,7 @@ def main(model=None, output_dir=None, n_iter=1000):
     # get names of other pipes to disable them during training
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
     with nlp.disable_pipes(*other_pipes):  # only train parser
-        optimizer = nlp.begin_training(lambda: [])
+        optimizer = nlp.begin_training()
         for itn in range(n_iter):
             random.shuffle(TRAIN_DATA)
             losses = {}
@@ -3,9 +3,8 @@
 """
 A simple example for training a part-of-speech tagger with a custom tag map.
 To allow us to update the tag map with our custom one, this example starts off
-with a blank Language class and modifies its defaults.
-
-For more details, see the documentation:
+with a blank Language class and modifies its defaults. For more details, see
+the documentation:
 * Training: https://alpha.spacy.io/usage/training
 * POS Tagging: https://alpha.spacy.io/usage/linguistic-features#pos-tagging
 
@@ -62,7 +61,7 @@ def main(lang='en', output_dir=None, n_iter=25):
     tagger = nlp.create_pipe('tagger')
     nlp.add_pipe(tagger)
 
-    optimizer = nlp.begin_training(lambda: [])
+    optimizer = nlp.begin_training()
     for i in range(n_iter):
         random.shuffle(TRAIN_DATA)
         losses = {}
@@ -3,9 +3,8 @@
 """Train a multi-label convolutional neural network text classifier on the
 IMDB dataset, using the TextCategorizer component. The dataset will be loaded
 automatically via Thinc's built-in dataset loader. The model is added to
-spacy.pipeline, and predictions are available via `doc.cats`.
-
-For more details, see the documentation:
+spacy.pipeline, and predictions are available via `doc.cats`. For more details,
+see the documentation:
 * Training: https://alpha.spacy.io/usage/training
 * Text classification: https://alpha.spacy.io/usage/text-classification
 
@@ -27,8 +26,9 @@ from spacy.pipeline import TextCategorizer
 @plac.annotations(
     model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
     output_dir=("Optional output directory", "option", "o", Path),
+    n_examples=("Number of texts to train from", "option", "N", int),
     n_iter=("Number of training iterations", "option", "n", int))
-def main(model=None, output_dir=None, n_iter=20):
+def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
     if model is not None:
         nlp = spacy.load(model)  # load existing spaCy model
         print("Loaded model '%s'" % model)
@@ -51,7 +51,8 @@ def main(model=None, output_dir=None, n_iter=20):
 
     # load the IMDB dataset
     print("Loading IMDB data...")
-    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=2000)
+    print("Using %d training examples" % n_texts)
+    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
     train_docs = [nlp.tokenizer(text) for text in train_texts]
     train_gold = [GoldParse(doc, cats=cats) for doc, cats in
                   zip(train_docs, train_cats)]
@@ -60,20 +61,20 @@ def main(model=None, output_dir=None, n_iter=20):
     # get names of other pipes to disable them during training
     other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
     with nlp.disable_pipes(*other_pipes):  # only train textcat
-        optimizer = nlp.begin_training(lambda: [])
+        optimizer = nlp.begin_training()
         print("Training the model...")
         print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
         for i in range(n_iter):
             losses = {}
             # batch up the examples using spaCy's minibatch
-            batches = minibatch(train_data, size=compounding(4., 128., 1.001))
+            batches = minibatch(train_data, size=compounding(4., 32., 1.001))
             for batch in batches:
                 docs, golds = zip(*batch)
                 nlp.update(docs, golds, sgd=optimizer, drop=0.2, losses=losses)
             with textcat.model.use_params(optimizer.averages):
                 # evaluate on the dev data split off in load_data()
                 scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
-            print('{0:.3f}\t{0:.3f}\t{0:.3f}\t{0:.3f}'  # print a simple table
+            print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
                   .format(losses['textcat'], scores['textcat_p'],
                           scores['textcat_r'], scores['textcat_f']))
 
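The last change in this hunk fixes a real output bug: `{0}` selects the first `format()` argument every time, so the old template printed the loss in all four columns and the precision, recall and F-score were never shown. A quick demonstration:

print('{0:.3f}\t{0:.3f}\t{0:.3f}\t{0:.3f}'.format(1.5, 0.8, 0.7, 0.75))
# 1.500   1.500   1.500   1.500
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'.format(1.5, 0.8, 0.7, 0.75))
# 1.500   0.800   0.700   0.750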
examples/training/vocab-data.jsonl (new file, 21 lines)

@@ -0,0 +1,21 @@
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
{"orth": ".", "id": 1, "lower": ".", "norm": ".", "shape": ".", "prefix": ".", "suffix": ".", "length": 1, "cluster": "8", "prob": -3.0678977966308594, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": ",", "id": 2, "lower": ",", "norm": ",", "shape": ",", "prefix": ",", "suffix": ",", "length": 1, "cluster": "4", "prob": -3.4549596309661865, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "the", "id": 3, "lower": "the", "norm": "the", "shape": "xxx", "prefix": "t", "suffix": "the", "length": 3, "cluster": "11", "prob": -3.528766632080078, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "I", "id": 4, "lower": "i", "norm": "I", "shape": "X", "prefix": "I", "suffix": "I", "length": 1, "cluster": "346", "prob": -3.791565179824829, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": false, "is_title": true, "is_upper": true, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "to", "id": 5, "lower": "to", "norm": "to", "shape": "xx", "prefix": "t", "suffix": "to", "length": 2, "cluster": "12", "prob": -3.8560216426849365, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "a", "id": 6, "lower": "a", "norm": "a", "shape": "x", "prefix": "a", "suffix": "a", "length": 1, "cluster": "19", "prob": -3.92978835105896, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "and", "id": 7, "lower": "and", "norm": "and", "shape": "xxx", "prefix": "a", "suffix": "and", "length": 3, "cluster": "20", "prob": -4.113108158111572, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "of", "id": 8, "lower": "of", "norm": "of", "shape": "xx", "prefix": "o", "suffix": "of", "length": 2, "cluster": "28", "prob": -4.27587366104126, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "you", "id": 9, "lower": "you", "norm": "you", "shape": "xxx", "prefix": "y", "suffix": "you", "length": 3, "cluster": "602", "prob": -4.373791217803955, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "it", "id": 10, "lower": "it", "norm": "it", "shape": "xx", "prefix": "i", "suffix": "it", "length": 2, "cluster": "474", "prob": -4.388050079345703, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "is", "id": 11, "lower": "is", "norm": "is", "shape": "xx", "prefix": "i", "suffix": "is", "length": 2, "cluster": "762", "prob": -4.457748889923096, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "that", "id": 12, "lower": "that", "norm": "that", "shape": "xxxx", "prefix": "t", "suffix": "hat", "length": 4, "cluster": "84", "prob": -4.464504718780518, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\n\n", "id": 0, "lower": "\n\n", "norm": "\n\n", "shape": "\n\n", "prefix": "\n", "suffix": "\n\n", "length": 2, "cluster": "0", "prob": -4.606560707092285, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "in", "id": 13, "lower": "in", "norm": "in", "shape": "xx", "prefix": "i", "suffix": "in", "length": 2, "cluster": "60", "prob": -4.619071960449219, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "'s", "id": 14, "lower": "'s", "norm": "'s", "shape": "'x", "prefix": "'", "suffix": "'s", "length": 2, "cluster": "52", "prob": -4.830559253692627, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "n't", "id": 15, "lower": "n't", "norm": "n't", "shape": "x'x", "prefix": "n", "suffix": "n't", "length": 3, "cluster": "74", "prob": -4.859938621520996, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "for", "id": 16, "lower": "for", "norm": "for", "shape": "xxx", "prefix": "f", "suffix": "for", "length": 3, "cluster": "508", "prob": -4.8801093101501465, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": "\"", "id": 17, "lower": "\"", "norm": "\"", "shape": "\"", "prefix": "\"", "suffix": "\"", "length": 1, "cluster": "0", "prob": -5.02677583694458, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": true, "is_left_punct": true, "is_right_punct": true}
{"orth": "?", "id": 18, "lower": "?", "norm": "?", "shape": "?", "prefix": "?", "suffix": "?", "length": 1, "cluster": "0", "prob": -5.05924654006958, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
{"orth": " ", "id": 0, "lower": " ", "norm": " ", "shape": " ", "prefix": " ", "suffix": " ", "length": 1, "cluster": "0", "prob": -5.129165172576904, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
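The format above: the first JSONL line carries corpus-wide settings (here just the out-of-vocabulary probability), and each following line is one lexeme with its orthography, shape features, flags and smoothed log probability. A minimal reader, as a hypothetical helper rather than anything shipped in this commit:

import json

def read_vocab_jsonl(path):
    with open(path, encoding='utf8') as f:
        settings = json.loads(f.readline())
        lexemes = [json.loads(line) for line in f if line.strip()]
    return settings, lexemes

settings, lexemes = read_vocab_jsonl('examples/training/vocab-data.jsonl')
print(settings['settings']['oov_prob'])        # -20.502029418945312
print(lexemes[0]['orth'], lexemes[0]['prob'])  # . -3.0678977966308594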
@@ -7,14 +7,13 @@ from __future__ import unicode_literals
 import plac
 import numpy
 
-import from spacy.language import Language
+from spacy.language import Language
 
 
 @plac.annotations(
     vectors_loc=("Path to vectors", "positional", None, str))
 def main(vectors_loc):
-    nlp = Language()
-
+    nlp = Language()  # start off with a blank Language class
     with open(vectors_loc, 'rb') as file_:
         header = file_.readline()
         nr_row, nr_dim = header.split()
@@ -24,9 +23,11 @@ def main(vectors_loc):
             pieces = line.split()
             word = pieces[0]
             vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
-            nlp.vocab.set_vector(word, vector)
-    doc = nlp(u'class colspan')
-    print(doc[0].similarity(doc[1]))
+            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
+    # test the vectors and similarity
+    text = 'class colspan'
+    doc = nlp(text)
+    print(text, doc[0].similarity(doc[1]))
 
 
 if __name__ == '__main__':
@@ -6,7 +6,7 @@ from __future__ import print_function
 if __name__ == '__main__':
     import plac
     import sys
-    from spacy.cli import download, link, info, package, train, convert, model
+    from spacy.cli import download, link, info, package, train, convert
     from spacy.cli import vocab, profile, evaluate, validate
     from spacy.util import prints
 
@@ -18,8 +18,7 @@ if __name__ == '__main__':
         'evaluate': evaluate,
         'convert': convert,
         'package': package,
-        'model': model,
-        'model': vocab,
+        'vocab': vocab,
         'profile': profile,
         'validate': validate
     }
spacy/_ml.py (12 changed lines)
@@ -29,6 +29,16 @@ from . import util
 VECTORS_KEY = 'spacy_pretrained_vectors'
 
 
+def cosine(vec1, vec2):
+    xp = get_array_module(vec1)
+    norm1 = xp.linalg.norm(vec1)
+    norm2 = xp.linalg.norm(vec2)
+    if norm1 == 0. or norm2 == 0.:
+        return 0
+    else:
+        return vec1.dot(vec2) / (norm1 * norm2)
+
+
 @layerize
 def _flatten_add_lengths(seqs, pad=0, drop=0.):
     ops = Model.ops
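The new `cosine` helper returns 0 instead of dividing by a zero norm when either vector is empty. The same logic, illustrated with plain numpy (which is what `get_array_module` resolves to for CPU arrays):

import numpy

def cosine(vec1, vec2):
    norm1, norm2 = numpy.linalg.norm(vec1), numpy.linalg.norm(vec2)
    if norm1 == 0. or norm2 == 0.:
        return 0
    return vec1.dot(vec2) / (norm1 * norm2)

print(cosine(numpy.array([1., 0.]), numpy.array([1., 1.])))  # 0.7071...
print(cosine(numpy.array([0., 0.]), numpy.array([1., 1.])))  # 0, not ZeroDivisionError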
@@ -428,7 +438,7 @@ def build_text_classifier(nr_class, width=64, **cfg):
     pretrained_dims = cfg.get('pretrained_dims', 0)
     with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
                                  '**': clone}):
-        if cfg.get('low_data'):
+        if cfg.get('low_data') and pretrained_dims:
             model = (
                 SpacyVectors
                 >> flatten_add_lengths
@@ -6,6 +6,5 @@ from .profile import profile
 from .train import train
 from .evaluate import evaluate
 from .convert import convert
-from .model import model
 from .vocab import make_vocab as vocab
 from .validate import validate
@@ -17,14 +17,14 @@ numpy.random.seed(0)
 
 
 @plac.annotations(
-    model=("Model name or path", "positional", None, str),
-    data_path=("Location of JSON-formatted evaluation data", "positional",
+    model=("model name or path", "positional", None, str),
+    data_path=("location of JSON-formatted evaluation data", "positional",
                None, str),
-    gold_preproc=("Use gold preprocessing", "flag", "G", bool),
-    gpu_id=("Use GPU", "option", "g", int),
-    displacy_path=("Directory to output rendered parses as HTML", "option",
+    gold_preproc=("use gold preprocessing", "flag", "G", bool),
+    gpu_id=("use GPU", "option", "g", int),
+    displacy_path=("directory to output rendered parses as HTML", "option",
                    "dp", str),
-    displacy_limit=("Limit of parses to render as HTML", "option", "dl", int))
+    displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
 def evaluate(cmd, model, data_path, gpu_id=-1, gold_preproc=False,
              displacy_path=None, displacy_limit=25):
     """
@@ -1,140 +0,0 @@ (file deleted)
# coding: utf8
from __future__ import unicode_literals

try:
    import bz2
    import gzip
except ImportError:
    pass
import math
from ast import literal_eval
from pathlib import Path

import numpy as np
import spacy
from preshed.counter import PreshCounter

from .. import util
from ..compat import fix_text


def model(cmd, lang, model_dir, freqs_data, clusters_data, vectors_data,
          min_doc_freq=5, min_word_freq=200):
    model_path = Path(model_dir)
    freqs_path = Path(freqs_data)
    clusters_path = Path(clusters_data) if clusters_data else None
    vectors_path = Path(vectors_data) if vectors_data else None

    check_dirs(freqs_path, clusters_path, vectors_path)
    vocab = util.get_lang_class(lang).Defaults.create_vocab()
    nlp = spacy.blank(lang)
    vocab = nlp.vocab
    probs, oov_prob = read_probs(
        freqs_path, min_doc_freq=int(min_doc_freq), min_freq=int(min_doc_freq))
    clusters = read_clusters(clusters_path) if clusters_path else {}
    populate_vocab(vocab, clusters, probs, oov_prob)
    add_vectors(vocab, vectors_path)
    create_model(model_path, nlp)


def add_vectors(vocab, vectors_path):
    with bz2.BZ2File(vectors_path.as_posix()) as f:
        num_words, dim = next(f).split()
        vocab.clear_vectors(int(dim))
        for line in f:
            word_w_vector = line.decode("utf8").strip().split(" ")
            word = word_w_vector[0]
            vector = np.array([float(val) for val in word_w_vector[1:]])
            if word in vocab:
                vocab.set_vector(word, vector)


def create_model(model_path, model):
    if not model_path.exists():
        model_path.mkdir()
    model.to_disk(model_path.as_posix())


def read_probs(freqs_path, max_length=100, min_doc_freq=5, min_freq=200):
    counts = PreshCounter()
    total = 0
    freqs_file = check_unzip(freqs_path)
    for i, line in enumerate(freqs_file):
        freq, doc_freq, key = line.rstrip().split('\t', 2)
        freq = int(freq)
        counts.inc(i + 1, freq)
        total += freq
    counts.smooth()
    log_total = math.log(total)
    freqs_file = check_unzip(freqs_path)
    probs = {}
    for line in freqs_file:
        freq, doc_freq, key = line.rstrip().split('\t', 2)
        doc_freq = int(doc_freq)
        freq = int(freq)
        if doc_freq >= min_doc_freq and freq >= min_freq and len(
                key) < max_length:
            word = literal_eval(key)
            smooth_count = counts.smoother(int(freq))
            probs[word] = math.log(smooth_count) - log_total
    oov_prob = math.log(counts.smoother(0)) - log_total
    return probs, oov_prob


def read_clusters(clusters_path):
    clusters = {}
    with clusters_path.open() as f:
        for line in f:
            try:
                cluster, word, freq = line.split()
                word = fix_text(word)
            except ValueError:
                continue
            # If the clusterer has only seen the word a few times, its
            # cluster is unreliable.
            if int(freq) >= 3:
                clusters[word] = cluster
            else:
                clusters[word] = '0'
    # Expand clusters with re-casing
    for word, cluster in list(clusters.items()):
        if word.lower() not in clusters:
            clusters[word.lower()] = cluster
        if word.title() not in clusters:
            clusters[word.title()] = cluster
        if word.upper() not in clusters:
            clusters[word.upper()] = cluster
    return clusters


def populate_vocab(vocab, clusters, probs, oov_prob):
    for word, prob in reversed(
            sorted(list(probs.items()), key=lambda item: item[1])):
        lexeme = vocab[word]
        lexeme.prob = prob
        lexeme.is_oov = False
        # Decode as a little-endian string, so that we can do & 15 to get
        # the first 4 bits. See _parse_features.pyx
        if word in clusters:
            lexeme.cluster = int(clusters[word][::-1], 2)
        else:
            lexeme.cluster = 0


def check_unzip(file_path):
    file_path_str = file_path.as_posix()
    if file_path_str.endswith('gz'):
        return gzip.open(file_path_str)
    else:
        return file_path.open()


def check_dirs(freqs_data, clusters_data, vectors_data):
    if not freqs_data.is_file():
        util.sys_exit(freqs_data.as_posix(), title="No frequencies file found")
    if clusters_data and not clusters_data.is_file():
        util.sys_exit(
            clusters_data.as_posix(), title="No Brown clusters file found")
    if vectors_data and not vectors_data.is_file():
        util.sys_exit(
            vectors_data.as_posix(), title="No word vectors file found")
@@ -16,10 +16,11 @@ from .. import about
     input_dir=("directory with model data", "positional", None, str),
     output_dir=("output parent directory", "positional", None, str),
     meta_path=("path to meta.json", "option", "m", str),
-    create_meta=("create meta.json, even if one exists in directory", "flag",
-                 "c", bool),
-    force=("force overwriting of existing folder in output directory", "flag",
-           "f", bool))
+    create_meta=("create meta.json, even if one exists in directory – if "
+                 "existing meta is found, entries are shown as defaults in "
+                 "the command line prompt", "flag", "c", bool),
+    force=("force overwriting of existing model directory in output directory",
+           "flag", "f", bool))
 def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
             force=False):
     """

@@ -41,13 +42,13 @@ def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
     template_manifest = get_template('MANIFEST.in')
     template_init = get_template('xx_model_name/__init__.py')
     meta_path = meta_path or input_path / 'meta.json'
-    if not create_meta and meta_path.is_file():
-        prints(meta_path, title="Reading meta.json from file")
+    if meta_path.is_file():
         meta = util.read_json(meta_path)
-    else:
-        meta = generate_meta(input_dir)
+        if not create_meta:  # only print this if user doesn't want to overwrite
+            prints(meta_path, title="Loaded meta.json from file")
+    else:
+        meta = generate_meta(input_dir, meta)
     meta = validate_meta(meta, ['lang', 'name', 'version'])
-
     model_name = meta['lang'] + '_' + meta['name']
     model_name_v = model_name + '-' + meta['version']
     main_path = output_path / model_name_v

@@ -82,22 +83,24 @@ def create_file(file_path, contents):
     file_path.open('w', encoding='utf-8').write(contents)
 
 
-def generate_meta(model_path):
-    meta = {}
-    settings = [('lang', 'Model language', 'en'),
-                ('name', 'Model name', 'model'),
-                ('version', 'Model version', '0.0.0'),
+def generate_meta(model_path, existing_meta):
+    meta = existing_meta or {}
+    settings = [('lang', 'Model language', meta.get('lang', 'en')),
+                ('name', 'Model name', meta.get('name', 'model')),
+                ('version', 'Model version', meta.get('version', '0.0.0')),
                 ('spacy_version', 'Required spaCy version',
                  '>=%s,<3.0.0' % about.__version__),
-                ('description', 'Model description', False),
-                ('author', 'Author', False),
-                ('email', 'Author email', False),
-                ('url', 'Author website', False),
-                ('license', 'License', 'CC BY-NC 3.0')]
+                ('description', 'Model description',
+                 meta.get('description', False)),
+                ('author', 'Author', meta.get('author', False)),
+                ('email', 'Author email', meta.get('email', False)),
+                ('url', 'Author website', meta.get('url', False)),
+                ('license', 'License', meta.get('license', 'CC BY-SA 3.0'))]
     nlp = util.load_model_from_path(Path(model_path))
     meta['pipeline'] = nlp.pipe_names
     meta['vectors'] = {'width': nlp.vocab.vectors_length,
-                       'entries': len(nlp.vocab.vectors)}
+                       'vectors': len(nlp.vocab.vectors),
+                       'keys': nlp.vocab.vectors.n_keys}
     prints("Enter the package settings for your model. The following "
            "information will be read from your model data: pipeline, vectors.",
            title="Generating meta.json")
@@ -32,7 +32,6 @@ numpy.random.seed(0)
     n_sents=("number of sentences", "option", "ns", int),
     use_gpu=("Use GPU", "option", "g", int),
     vectors=("Model to load vectors from", "option", "v"),
-    vectors_limit=("Truncate to N vectors (requires -v)", "option", None, int),
     no_tagger=("Don't train tagger", "flag", "T", bool),
     no_parser=("Don't train parser", "flag", "P", bool),
     no_entities=("Don't train NER", "flag", "N", bool),

@@ -41,7 +40,7 @@ numpy.random.seed(0)
     meta_path=("Optional path to meta.json. All relevant properties will be "
                "overwritten.", "option", "m", Path))
 def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
-          use_gpu=-1, vectors=None, vectors_limit=None, no_tagger=False,
+          use_gpu=-1, vectors=None, no_tagger=False,
           no_parser=False, no_entities=False, gold_preproc=False,
           version="0.0.0", meta_path=None):
     """

@@ -95,8 +94,6 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
     nlp.meta.update(meta)
     if vectors:
         util.load_model(vectors, vocab=nlp.vocab)
-    if vectors_limit is not None:
-        nlp.vocab.prune_vectors(vectors_limit)
     for name in pipeline:
         nlp.add_pipe(nlp.create_pipe(name), name=name)
     optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)

@@ -149,7 +146,8 @@ def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
         meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
                          'gpu': gpu_wps}
         meta['vectors'] = {'width': nlp.vocab.vectors_length,
-                           'entries': len(nlp.vocab.vectors)}
+                           'vectors': len(nlp.vocab.vectors),
+                           'keys': nlp.vocab.vectors.n_keys}
         meta['lang'] = nlp.lang
         meta['pipeline'] = pipeline
         meta['spacy_version'] = '>=%s' % about.__version__
@@ -1,31 +1,37 @@
-'''Compile a vocabulary from a lexicon jsonl file and word vectors.'''
+# coding: utf8
 from __future__ import unicode_literals
 
+from pathlib import Path
 import plac
 import json
 import spacy
 import numpy
-from spacy.util import ensure_path
-from pathlib import Path
+
+from ..vectors import Vectors
+from ..util import prints, ensure_path
 
 
 @plac.annotations(
     lang=("model language", "positional", None, str),
-    output_dir=("output directory to store model in", "positional", None, str),
+    output_dir=("model output directory", "positional", None, Path),
     lexemes_loc=("location of JSONL-formatted lexical data", "positional",
-                 None, str),
-    vectors_loc=("location of vectors data, as numpy .npz (optional)",
-                 "positional", None, str),
-    version=("Model version", "option", "V", str),
+                 None, Path),
+    vectors_loc=("optional: location of vectors data, as numpy .npz",
+                 "positional", None, str),
+    prune_vectors=("optional: number of vectors to prune to.",
+                   "option", "V", int)
 )
-def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, version=None):
-    out_dir = ensure_path(output_dir)
-    jsonl_loc = ensure_path(lexemes_loc)
+def make_vocab(cmd, lang, output_dir, lexemes_loc,
+               vectors_loc=None, prune_vectors=-1):
+    """Compile a vocabulary from a lexicon jsonl file and word vectors."""
+    if not lexemes_loc.exists():
+        prints(lexemes_loc, title="Can't find lexical data", exits=1)
     vectors_loc = ensure_path(vectors_loc)
     nlp = spacy.blank(lang)
     for word in nlp.vocab:
         word.rank = 0
-    with jsonl_loc.open() as file_:
+    lex_added = 0
+    with lexemes_loc.open() as file_:
         for line in file_:
             if line.strip():
                 attrs = json.loads(line)

@@ -35,14 +41,20 @@ def make_vocab(lang, output_dir, lexemes_loc, vectors_loc=None, version=None):
                 lex = nlp.vocab[attrs['orth']]
                 lex.set_attrs(**attrs)
                 assert lex.rank == attrs['id']
+                lex_added += 1
     if vectors_loc is not None:
-        vector_data = numpy.load(open(vectors_loc, 'rb'))
-        nlp.vocab.clear_vectors(width=vector_data.shape[1])
-        added = 0
+        vector_data = numpy.load(vectors_loc.open('rb'))
+        nlp.vocab.vectors = Vectors(data=vector_data)
         for word in nlp.vocab:
             if word.rank:
-                nlp.vocab.vectors.add(word.orth_, row=word.rank,
-                                      vector=vector_data[word.rank])
-                added += 1
-    nlp.to_disk(out_dir)
+                nlp.vocab.vectors.add(word.orth, row=word.rank)
+
+    if prune_vectors >= 1:
+        remap = nlp.vocab.prune_vectors(prune_vectors)
+    if not output_dir.exists():
+        output_dir.mkdir()
+    nlp.to_disk(output_dir)
+    vec_added = len(nlp.vocab.vectors)
+    prints("{} entries, {} vectors".format(lex_added, vec_added), output_dir,
+           title="Successfully compiled vocab and vectors, and saved model")
     return nlp
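The rewritten command can also prune the loaded vectors: per the spaCy v2 docs, `Vocab.prune_vectors(n)` keeps the `n` highest-ranked rows and remaps every removed entry to its nearest surviving vector, returning that mapping. A hedged usage sketch (assumes some vectors model such as `en_core_web_md` is installed):

import spacy

nlp = spacy.load('en_core_web_md')      # hypothetical: any model with vectors
remap = nlp.vocab.prune_vectors(10000)  # keep the 10,000 highest-ranked rows
word, (alias, score) = next(iter(remap.items()))
print(word, '->', alias, score)         # a removed word mapped to its closest kept one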
@@ -23,4 +23,4 @@ for exc_data in [
     _exc[exc_data[ORTH]] = [dict(exc_data)]
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -30,4 +30,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -181,4 +181,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -456,4 +456,4 @@ for string in _exclude:
     _exc.pop(string)
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -54,4 +54,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -76,4 +76,4 @@ for exc_data in [
     _exc[exc_data[ORTH]] = [dict(exc_data)]
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc

@@ -147,5 +147,5 @@ _regular_exp += ["^{prefix}[{elision}][{alpha}][{alpha}{elision}{hyphen}\-]*$".f
 _regular_exp.append(URL_PATTERN)
 
 
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
 TOKEN_MATCH = re.compile('|'.join('(?:{})'.format(m) for m in _regular_exp), re.IGNORECASE).match
spacy/lang/ga/__init__.py (new file, 25 lines)

@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc


class IrishDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ga'

    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = set(STOP_WORDS)

class Irish(Language):
    lang = 'ga'
    Defaults = IrishDefaults


__all__ = ['Irish']
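With `Irish` registered under `spacy.lang.ga`, a blank Irish pipeline becomes available by language code. A quick check, assuming a spaCy build containing this commit:

import spacy

nlp = spacy.blank('ga')  # resolves to spacy.lang.ga.Irish
doc = nlp(u"Tá an madra ag rith sa pháirc.")
print([t.text for t in doc])
print(nlp.Defaults.stop_words.issuperset({'agus', 'ach'}))  # True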
spacy/lang/ga/irish_morphology_helpers.py (new file, 33 lines)

@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals


class IrishMorph:
    consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z']
    broad_vowels = ['a', 'á', 'o', 'ó', 'u', 'ú']
    slender_vowels = ['e', 'é', 'i', 'í']
    vowels = broad_vowels + slender_vowels

    def ends_dentals(word):
        if word != "" and word[-1] in ['d', 'n', 't', 's']:
            return True
        else:
            return False

    def devoice(word):
        if len(word) > 2 and word[-2] == 's' and word[-1] == 'd':
            return word[:-1] + 't'
        else:
            return word

    def ends_with_vowel(word):
        return word != "" and word[-1] in vowels

    def starts_with_vowel(word):
        return word != "" and word[0] in vowels

    def deduplicate(word):
        if len(word) > 2 and word[-2] == word[-1] and word[-1] in consonants:
            return word[:-1]
        else:
            return word
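Note that, as committed, three of these helpers cannot run: inside a class body, methods do not see class attributes as bare names, so `ends_with_vowel`, `starts_with_vowel` and `deduplicate` raise NameError on the `vowels`/`consonants` lookups when called. A module-level arrangement along these lines would behave as intended (a sketch, not the shipped code):

# coding: utf8
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm',
              'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'z']
broad_vowels = ['a', 'á', 'o', 'ó', 'u', 'ú']
slender_vowels = ['e', 'é', 'i', 'í']
vowels = broad_vowels + slender_vowels

def ends_with_vowel(word):
    return word != "" and word[-1] in vowels

def deduplicate(word):
    # drop a doubled final consonant, e.g. deduplicate('ceann') == 'cean'
    if len(word) > 2 and word[-2] == word[-1] and word[-1] in consonants:
        return word[:-1]
    return word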
spacy/lang/ga/stop_words.py (new file, 45 lines)

@@ -0,0 +1,45 @@
# encoding: utf8
from __future__ import unicode_literals


STOP_WORDS = set("""
a ach ag agus an aon ar arna as

ba beirt bhúr

caoga ceathair ceathrar chomh chuig chun cois céad cúig cúigear

daichead dar de deich deichniúr den dhá do don dtí dá dár dó

faoi faoin faoina faoinár fara fiche

gach gan go gur

haon hocht

i iad idir in ina ins inár is

le leis lena lenár

mar mo muid mé

na nach naoi naonúr ná ní níor nó nócha

ocht ochtar ochtó os

roimh

sa seacht seachtar seachtó seasca seisear siad sibh sinn sna sé sí

tar thar thú triúr trí trína trínár tríocha tú

um

ár

é éis

í

ó ón óna ónár
""".split())
spacy/lang/ga/tag_map.py (new file, 368 lines)

@@ -0,0 +1,368 @@
# coding: utf8
from __future__ import unicode_literals


TAG_MAP = {
"ADJ__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=Gen|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "fem", "Number": "sing"},
"ADJ__Case=Gen|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing"},
"ADJ__Case=Gen|NounType=Strong|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "strong"}},
"ADJ__Case=Gen|NounType=Weak|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "weak"}},
"ADJ__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
"ADJ__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "plur"},
"ADJ__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
"ADJ__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
"ADJ__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
"ADJ__Case=NomAcc|NounType=NotSlender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "notslender"}},
"ADJ__Case=NomAcc|NounType=Slender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "slender"}},
"ADJ__Degree=Cmp,Sup|Form=Len": {"pos": "ADJ", "Degree": "cmp|sup", "Other": {"Form": "len"}},
"ADJ__Degree=Cmp,Sup": {"pos": "ADJ", "Degree": "cmp|sup"},
"ADJ__Degree=Pos|Form=Ecl": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "ecl"}},
"ADJ__Degree=Pos|Form=HPref": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "hpref"}},
"ADJ__Degree=Pos|Form=Len": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "len"}},
"ADJ__Degree=Pos": {"pos": "ADJ", "Degree": "pos"},
"ADJ__Foreign=Yes": {"pos": "ADJ", "Foreign": "yes"},
"ADJ__Form=Len|VerbForm=Part": {"pos": "ADJ", "VerbForm": "part", "Other": {"Form": "len"}},
"ADJ__Gender=Masc|Number=Sing|PartType=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"},
"ADJ__Gender=Masc|Number=Sing|Case=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"},
"ADJ__Number=Plur|PartType=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"},
"ADJ__Number=Plur|Case=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"},
"ADJ__Number=Plur": {"pos": "ADJ", "Number": "plur"},
"ADJ___": {"pos": "ADJ"},
"ADJ__VerbForm=Part": {"pos": "ADJ", "VerbForm": "part"},
"ADP__Foreign=Yes": {"pos": "ADP", "Foreign": "yes"},
"ADP__Form=Len|Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1, "Other": {"Form": "len"}},
"ADP__Form=Len|Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3, "Other": {"Form": "len"}},
"ADP__Form=Len|Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1, "Other": {"Form": "len"}},
"ADP__Gender=Fem|Number=Sing|Person=3": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3},
"ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"},
"ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Gender=Masc|Number=Sing|Person=3": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3},
"ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"},
"ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"},
"ADP__Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1},
"ADP__Number=Plur|Person=1|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 1, "Poss": "yes"},
"ADP__Number=Plur|Person=1|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 1, "PronType": "emp"},
"ADP__Number=Plur|Person=2": {"pos": "ADP", "Number": "plur", "Person": 2},
"ADP__Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3},
"ADP__Number=Plur|Person=3|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes"},
"ADP__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes", "PronType": "prs"},
"ADP__Number=Plur|Person=3|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 3, "PronType": "emp"},
"ADP__Number=Plur|PronType=Art": {"pos": "ADP", "Number": "plur", "PronType": "art"},
"ADP__Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1},
"ADP__Number=Sing|Person=1|Poss=Yes": {"pos": "ADP", "Number": "sing", "Person": 1, "Poss": "yes"},
"ADP__Number=Sing|Person=1|PronType=Emp": {"pos": "ADP", "Number": "sing", "Person": 1, "PronType": "emp"},
"ADP__Number=Sing|Person=2": {"pos": "ADP", "Number": "sing", "Person": 2},
"ADP__Number=Sing|Person=3": {"pos": "ADP", "Number": "sing", "Person": 3},
"ADP__Number=Sing|PronType=Art": {"pos": "ADP", "Number": "sing", "PronType": "art"},
"ADP__Person=3|Poss=Yes": {"pos": "ADP", "Person": 3, "Poss": "yes"},
"ADP___": {"pos": "ADP"},
"ADP__Poss=Yes": {"pos": "ADP", "Poss": "yes"},
"ADP__PrepForm=Cmpd": {"pos": "ADP", "Other": {"PrepForm": "cmpd"}},
"ADP__PronType=Art": {"pos": "ADP", "PronType": "art"},
"ADV__Form=Len": {"pos": "ADV", "Other": {"Form": "len"}},
"ADV___": {"pos": "ADV"},
"ADV__PronType=Int": {"pos": "ADV", "PronType": "int"},
"AUX__Form=VF|Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Form=VF|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Form": "vf", "VerbForm": "cop"}},
"AUX__Gender=Masc|Number=Sing|Person=3|VerbForm=Cop": {"pos": "AUX", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"VerbForm": "cop"}},
"AUX__Mood=Int|Number=Sing|PronType=Art|VerbForm=Cop": {"pos": "AUX", "Number": "sing", "PronType": "art", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__Mood=Int|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}},
"AUX__PartType=Comp|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"PartType": "comp", "VerbForm": "cop"}},
"AUX__Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX___": {"pos": "AUX"},
"AUX__PronType=Dem|VerbForm=Cop": {"pos": "AUX", "PronType": "dem", "Other": {"VerbForm": "cop"}},
"AUX__PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"VerbForm": "cop"}},
"AUX__Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"VerbForm": "cop"}},
"AUX__VerbForm=Cop": {"pos": "AUX", "Other": {"VerbForm": "cop"}},
"CCONJ___": {"pos": "CCONJ"},
"DET__Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"},
"DET__Definite=Def|Form=Ecl": {"pos": "DET", "Definite": "def", "Other": {"Form": "ecl"}},
"DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"},
"DET__Definite=Def|Number=Plur|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "plur", "PronType": "art"},
"DET__Definite=Def|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "sing", "PronType": "art"},
"DET__Definite=Def": {"pos": "DET", "Definite": "def"},
"DET__Form=HPref|PronType=Ind": {"pos": "DET", "PronType": "ind", "Other": {"Form": "hpref"}},
"DET__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"},
|
||||
"DET__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"},
|
||||
"DET__Number=Plur|Person=1|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 1, "Poss": "yes"},
|
||||
"DET__Number=Plur|Person=3|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 3, "Poss": "yes"},
|
||||
"DET__Number=Sing|Person=1|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 1, "Poss": "yes"},
|
||||
"DET__Number=Sing|Person=2|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 2, "Poss": "yes"},
|
||||
"DET__Number=Sing|PronType=Int": {"pos": "DET", "Number": "sing", "PronType": "int"},
|
||||
"DET___": {"pos": "DET"},
|
||||
"DET__PronType=Dem": {"pos": "DET", "PronType": "dem"},
|
||||
"DET__PronType=Ind": {"pos": "DET", "PronType": "ind"},
|
||||
"NOUN__Case=Dat|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Definite": "ind", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=Dat|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=Dat|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=Dat|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Definite=Def|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
|
||||
"NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "ind", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Form=Ecl|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "weak"}},
|
||||
"NOUN__Case=Gen|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "weak"}},
|
||||
"NOUN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Gen|Form=Len|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Gen|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Gender=Fem|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "weak"}},
|
||||
"NOUN__Case=Gen|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur"},
|
||||
"NOUN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}},
|
||||
"NOUN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
|
||||
"NOUN__Case=Gen|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur"},
|
||||
"NOUN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Case=Gen|Number=Sing": {"pos": "NOUN", "Case": "gen", "Number": "sing"},
|
||||
"NOUN__Case=Gen|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf"},
|
||||
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "plur"},
|
||||
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=NomAcc|Definite=Def|Gender=Fem": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem"},
|
||||
"NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"},
|
||||
"NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Case=NomAcc|Definite=Ind|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "ind", "Gender": "masc", "Number": "plur"},
|
||||
"NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Case=NomAcc|Form=Emp|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "emp"}},
|
||||
"NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur"},
|
||||
"NOUN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
|
||||
"NOUN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Case=Voc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Definite": "def", "Gender": "masc", "Number": "plur"},
|
||||
"NOUN__Case=Voc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing"},
|
||||
"NOUN__Degree=Pos": {"pos": "NOUN", "Degree": "pos"},
|
||||
"NOUN__Foreign=Yes": {"pos": "NOUN", "Foreign": "yes"},
|
||||
"NOUN__Form=Ecl|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Form=Ecl|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Form=Ecl|VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun", "Other": {"Form": "ecl"}},
|
||||
"NOUN__Form=HPref|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "hpref"}},
|
||||
"NOUN__Form=Len|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"NOUN__Form=Len|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "len"}},
|
||||
"NOUN__Gender=Fem|Number=Sing": {"pos": "NOUN", "Gender": "fem", "Number": "sing"},
|
||||
"NOUN__Number=Sing|PartType=Comp": {"pos": "NOUN", "Number": "sing", "Other": {"PartType": "comp"}},
|
||||
"NOUN__Number=Sing": {"pos": "NOUN", "Number": "sing"},
|
||||
"NOUN___": {"pos": "NOUN"},
|
||||
"NOUN__Reflex=Yes": {"pos": "NOUN", "Reflex": "yes"},
|
||||
"NOUN__VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf"},
|
||||
"NOUN__VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun"},
|
||||
"NUM__Definite=Def|NumType=Card": {"pos": "NUM", "Definite": "def", "NumType": "card"},
|
||||
"NUM__Form=Ecl|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "ecl"}},
|
||||
"NUM__Form=Ecl|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "ecl"}},
|
||||
"NUM__Form=HPref|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "hpref"}},
|
||||
"NUM__Form=Len|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "len"}},
|
||||
"NUM__Form=Len|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "len"}},
|
||||
"NUM__NumType=Card": {"pos": "NUM", "NumType": "card"},
|
||||
"NUM__NumType=Ord": {"pos": "NUM", "NumType": "ord"},
|
||||
"NUM___": {"pos": "NUM"},
|
||||
"PART__Form=Ecl|PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"Form": "ecl", "PartType": "vb"}},
|
||||
"PART__Mood=Imp|PartType=Vb|Polarity=Neg": {"pos": "PART", "Mood": "imp", "Polarity": "neg", "Other": {"PartType": "vb"}},
|
||||
"PART__Mood=Imp|PartType=Vb": {"pos": "PART", "Mood": "imp", "Other": {"PartType": "vb"}},
|
||||
"PART__Mood=Int|PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"Mood": "int", "PartType": "vb"}},
|
||||
"PART__PartType=Ad": {"pos": "PART", "Other": {"PartType": "ad"}},
|
||||
"PART__PartType=Cmpl|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "cmpl"}},
|
||||
"PART__PartType=Cmpl|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "cmpl"}},
|
||||
"PART__PartType=Cmpl": {"pos": "PART", "Other": {"PartType": "cmpl"}},
|
||||
"PART__PartType=Comp": {"pos": "PART", "Other": {"PartType": "comp"}},
|
||||
"PART__PartType=Cop|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "cop"}},
|
||||
"PART__PartType=Deg": {"pos": "PART", "Other": {"PartType": "deg"}},
|
||||
"PART__PartType=Inf": {"pos": "PART", "PartType": "inf"},
|
||||
"PART__PartType=Num": {"pos": "PART", "Other": {"PartType": "num"}},
|
||||
"PART__PartType=Pat": {"pos": "PART", "Other": {"PartType": "pat"}},
|
||||
"PART__PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|Polarity=Neg|PronType=Rel": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|Polarity=Neg|PronType=Rel|Tense=Past": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb": {"pos": "PART", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|PronType=Rel|Tense=Past": {"pos": "PART", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Vb|Tense=Past": {"pos": "PART", "Tense": "past", "Other": {"PartType": "vb"}},
|
||||
"PART__PartType=Voc": {"pos": "PART", "Other": {"PartType": "voc"}},
|
||||
"PART___": {"pos": "PART"},
|
||||
"PART__PronType=Rel": {"pos": "PART", "PronType": "rel"},
|
||||
"PRON__Form=Len|Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2, "Other": {"Form": "len"}},
|
||||
"PRON__Form=Len|PronType=Ind": {"pos": "PRON", "PronType": "ind", "Other": {"Form": "len"}},
|
||||
"PRON__Gender=Fem|Number=Sing|Person=3": {"pos": "PRON", "Gender": "fem", "Number": "sing", "Person": 3},
|
||||
"PRON__Gender=Masc|Number=Sing|Person=3": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3},
|
||||
"PRON__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"},
|
||||
"PRON__Gender=Masc|Person=3": {"pos": "PRON", "Gender": "masc", "Person": 3},
|
||||
"PRON__Number=Plur|Person=1": {"pos": "PRON", "Number": "plur", "Person": 1},
|
||||
"PRON__Number=Plur|Person=1|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 1, "PronType": "emp"},
|
||||
"PRON__Number=Plur|Person=2": {"pos": "PRON", "Number": "plur", "Person": 2},
|
||||
"PRON__Number=Plur|Person=3": {"pos": "PRON", "Number": "plur", "Person": 3},
|
||||
"PRON__Number=Plur|Person=3|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 3, "PronType": "emp"},
|
||||
"PRON__Number=Sing|Person=1": {"pos": "PRON", "Number": "sing", "Person": 1},
|
||||
"PRON__Number=Sing|Person=1|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 1, "PronType": "emp"},
|
||||
"PRON__Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2},
|
||||
"PRON__Number=Sing|Person=2|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 2, "PronType": "emp"},
|
||||
"PRON__Number=Sing|Person=3": {"pos": "PRON", "Number": "sing", "Person": 3},
|
||||
"PRON__Number=Sing|PronType=Int": {"pos": "PRON", "Number": "sing", "PronType": "int"},
|
||||
"PRON__PronType=Dem": {"pos": "PRON", "PronType": "dem"},
|
||||
"PRON__PronType=Ind": {"pos": "PRON", "PronType": "ind"},
|
||||
"PRON__PronType=Int": {"pos": "PRON", "PronType": "int"},
|
||||
"PRON__Reflex=Yes": {"pos": "PRON", "Reflex": "yes"},
|
||||
"PROPN__Abbr=Yes": {"pos": "PROPN", "Other": {"Abbr": "yes"}},
|
||||
"PROPN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "dat", "Gender": "fem", "Number": "sing"},
|
||||
"PROPN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"},
|
||||
"PROPN__Case=Gen|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}},
|
||||
"PROPN__Case=Gen|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}},
|
||||
"PROPN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}},
|
||||
"PROPN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=Gen|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=Gen|Form=Len|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing"},
|
||||
"PROPN__Case=Gen|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem"},
|
||||
"PROPN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}},
|
||||
"PROPN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing"},
|
||||
"PROPN__Case=Gen|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc"},
|
||||
"PROPN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"},
|
||||
"PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"},
|
||||
"PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"},
|
||||
"PROPN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"PROPN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}},
|
||||
"PROPN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}},
|
||||
"PROPN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"},
|
||||
"PROPN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"},
|
||||
"PROPN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"},
|
||||
"PROPN__Case=NomAcc|Gender=Masc": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc"},
|
||||
"PROPN__Case=Voc|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "voc", "Gender": "fem", "Other": {"Form": "len"}},
|
||||
"PROPN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "voc", "Gender": "masc", "Number": "sing"},
|
||||
"PROPN__Gender=Masc|Number=Sing": {"pos": "PROPN", "Gender": "masc", "Number": "sing"},
|
||||
"PROPN___": {"pos": "PROPN"},
|
||||
"PUNCT___": {"pos": "PUNCT"},
|
||||
"SCONJ___": {"pos": "SCONJ"},
|
||||
"SCONJ__Tense=Past|VerbForm=Cop": {"pos": "SCONJ", "Tense": "past", "Other": {"VerbForm": "cop"}},
|
||||
"SCONJ__VerbForm=Cop": {"pos": "SCONJ", "Other": {"VerbForm": "cop"}},
|
||||
"SYM__Abbr=Yes": {"pos": "SYM", "Other": {"Abbr": "yes"}},
|
||||
"VERB__Case=NomAcc|Gender=Masc|Mood=Ind|Number=Sing|Tense=Pres": {"pos": "VERB", "Case": "nom|acc", "Gender": "masc", "Mood": "ind", "Number": "sing", "Tense": "pres"},
|
||||
"VERB__Dialect=Munster|Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}},
|
||||
"VERB__Foreign=Yes": {"pos": "VERB", "Foreign": "yes"},
|
||||
"VERB__Form=Ecl|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl", "Voice": "auto"}},
|
||||
"VERB__Form=Ecl|Mood=Imp|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl", "Voice": "auto"}},
|
||||
"VERB__Form=Ecl|Mood=Sub|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Tense": "pres", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Ecl": {"pos": "VERB", "Other": {"Form": "ecl"}},
|
||||
"VERB__Form=Emp|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}},
|
||||
"VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "PronType": "rel", "Tense": "pres", "Other": {"Form": "emp"}},
|
||||
"VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}},
|
||||
"VERB__Form=Len|Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3, "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Cnd|Number=Sing|Person=2": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 2, "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Imp|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Imp|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Imp|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "fut", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}},
|
||||
"VERB__Form=Len|Mood=Sub|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len|Polarity=Neg": {"pos": "VERB", "Polarity": "neg", "Other": {"Form": "len"}},
|
||||
"VERB__Form=Len": {"pos": "VERB", "Other": {"Form": "len"}},
|
||||
"VERB__Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3},
|
||||
"VERB__Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1},
|
||||
"VERB__Mood=Cnd": {"pos": "VERB", "Mood": "cnd"},
|
||||
"VERB__Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Voice": "auto"}},
|
||||
"VERB__Mood=Imp|Number=Plur|Person=1|Polarity=Neg": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1, "Polarity": "neg"},
|
||||
"VERB__Mood=Imp|Number=Plur|Person=1": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1},
|
||||
"VERB__Mood=Imp|Number=Plur|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 2},
|
||||
"VERB__Mood=Imp|Number=Sing|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 2},
|
||||
"VERB__Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past"},
|
||||
"VERB__Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past"},
|
||||
"VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres"},
|
||||
"VERB__Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past"},
|
||||
"VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres"},
|
||||
"VERB__Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Voice": "auto"}},
|
||||
"VERB__Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres"},
|
||||
"VERB__Mood=Ind|PronType=Rel|Tense=Fut": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "fut"},
|
||||
"VERB__Mood=Ind|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "pres"},
|
||||
"VERB__Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut"},
|
||||
"VERB__Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Voice": "auto"}},
|
||||
"VERB__Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past"},
|
||||
"VERB__Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Voice": "auto"}},
|
||||
"VERB__Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres"},
|
||||
"VERB__Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Voice": "auto"}},
|
||||
"VERB___": {"pos": "VERB"},
|
||||
"X__Abbr=Yes": {"pos": "X", "Other": {"Abbr": "yes"}},
|
||||
"X__Case=NomAcc|Foreign=Yes|Gender=Fem|Number=Sing": {"pos": "X", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Foreign": "yes"},
|
||||
"X__Definite=Def|Dialect=Ulster": {"pos": "X", "Definite": "def", "Other": {"Dialect": "ulster"}},
|
||||
"X__Dialect=Munster|Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "X", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}},
|
||||
"X__Dialect=Munster|Mood=Imp|Number=Sing|Person=2|Polarity=Neg": {"pos": "X", "Mood": "imp", "Number": "sing", "Person": 2, "Polarity": "neg", "Other": {"Dialect": "munster"}},
|
||||
"X__Dialect=Munster|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "X", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Voice": "auto"}},
|
||||
"X__Dialect=Munster": {"pos": "X", "Other": {"Dialect": "munster"}},
|
||||
"X__Dialect=Munster|PronType=Dem": {"pos": "X", "PronType": "dem", "Other": {"Dialect": "munster"}},
|
||||
"X__Dialect=Ulster|Gender=Masc|Number=Sing|Person=3": {"pos": "X", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"Dialect": "ulster"}},
|
||||
"X__Dialect=Ulster|PartType=Vb|Polarity=Neg": {"pos": "X", "Polarity": "neg", "Other": {"Dialect": "ulster", "PartType": "vb"}},
|
||||
"X__Dialect=Ulster|VerbForm=Cop": {"pos": "X", "Other": {"Dialect": "ulster", "VerbForm": "cop"}},
|
||||
"X__Foreign=Yes": {"pos": "X", "Foreign": "yes"},
|
||||
"X___": {"pos": "X"}
|
||||
}
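The keys above all follow one convention: a coarse POS tag, a `__` separator, then `|`-separated Universal Dependencies features, with features the morphology layer doesn't model natively bucketed under "Other". A hypothetical helper (not part of spaCy) that splits such a key:

# Hypothetical helper, not part of spaCy: split a tag-map key like
# "NOUN__Case=Gen|Gender=Fem" into its coarse POS and a feature dict.
# Special values (e.g. NomAcc -> "nom|acc") and the "Other" bucket are
# handled separately in the real table above.
def split_tag_key(key):
    pos, _, feat_str = key.partition('__')
    feats = {}
    for pair in feat_str.split('|'):
        name, eq, value = pair.partition('=')
        if eq:  # skip the bare "_" used for featureless tags
            feats[name] = value.lower()
    return pos, feats

assert split_tag_key("NOUN__Case=Gen|Gender=Fem") == \
    ("NOUN", {"Case": "gen", "Gender": "fem"})
assert split_tag_key("CCONJ___") == ("CCONJ", {})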
86
spacy/lang/ga/tokenizer_exceptions.py
Normal file
@@ -0,0 +1,86 @@
# encoding: utf8
from __future__ import unicode_literals

from ...symbols import POS, DET, ADP, CCONJ, ADV, NOUN, X, AUX
from ...symbols import ORTH, LEMMA, NORM


_exc = {
    "'acha'n": [
        {ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET},
        {ORTH: "a'n", LEMMA: "aon", NORM: "aon", POS: DET}],

    "dem'": [
        {ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP},
        {ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}],

    "ded'": [
        {ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP},
        {ORTH: "d'", LEMMA: "do", NORM: "do", POS: DET}],

    "lem'": [
        {ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP},
        {ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}],

    "led'": [
        {ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP},
        {ORTH: "d'", LEMMA: "do", NORM: "do", POS: DET}]
}


for exc_data in [
    {ORTH: "'gus", LEMMA: "agus", NORM: "agus", POS: CCONJ},
    {ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET},
    {ORTH: "ao'", LEMMA: "aon", NORM: "aon"},
    {ORTH: "'niar", LEMMA: "aniar", NORM: "aniar", POS: ADV},
    {ORTH: "'níos", LEMMA: "aníos", NORM: "aníos", POS: ADV},
    {ORTH: "'ndiu", LEMMA: "inniu", NORM: "inniu", POS: ADV},
    {ORTH: "'nocht", LEMMA: "anocht", NORM: "anocht", POS: ADV},
    {ORTH: "m'", LEMMA: "mo", POS: DET},
    {ORTH: "Aib.", LEMMA: "Aibreán", POS: NOUN},
    {ORTH: "Ath.", LEMMA: "athair", POS: NOUN},
    {ORTH: "Beal.", LEMMA: "Bealtaine", POS: NOUN},
    {ORTH: "a.C.n.", LEMMA: "ante Christum natum", POS: X},
    {ORTH: "m.sh.", LEMMA: "mar shampla", POS: ADV},
    {ORTH: "M.F.", LEMMA: "Meán Fómhair", POS: NOUN},
    {ORTH: "M.Fómh.", LEMMA: "Meán Fómhair", POS: NOUN},
    {ORTH: "D.F.", LEMMA: "Deireadh Fómhair", POS: NOUN},
    {ORTH: "D.Fómh.", LEMMA: "Deireadh Fómhair", POS: NOUN},
    {ORTH: "r.C.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "R.C.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "r.Ch.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "r.Chr.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "R.Ch.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "R.Chr.", LEMMA: "roimh Chríost", POS: ADV},
    {ORTH: "⁊rl.", LEMMA: "agus araile", POS: ADV},
    {ORTH: "srl.", LEMMA: "agus araile", POS: ADV},
    {ORTH: "Co.", LEMMA: "contae", POS: NOUN},
    {ORTH: "Ean.", LEMMA: "Eanáir", POS: NOUN},
    {ORTH: "Feab.", LEMMA: "Feabhra", POS: NOUN},
    {ORTH: "gCo.", LEMMA: "contae", POS: NOUN},
    {ORTH: ".i.", LEMMA: "eadhon", POS: ADV},
    {ORTH: "B'", LEMMA: "ba", POS: AUX},
    {ORTH: "b'", LEMMA: "ba", POS: AUX},
    {ORTH: "lch.", LEMMA: "leathanach", POS: NOUN},
    {ORTH: "Lch.", LEMMA: "leathanach", POS: NOUN},
    {ORTH: "lgh.", LEMMA: "leathanach", POS: NOUN},
    {ORTH: "Lgh.", LEMMA: "leathanach", POS: NOUN},
    {ORTH: "Lún.", LEMMA: "Lúnasa", POS: NOUN},
    {ORTH: "Már.", LEMMA: "Márta", POS: NOUN},
    {ORTH: "Meith.", LEMMA: "Meitheamh", POS: NOUN},
    {ORTH: "Noll.", LEMMA: "Nollaig", POS: NOUN},
    {ORTH: "Samh.", LEMMA: "Samhain", POS: NOUN},
    {ORTH: "tAth.", LEMMA: "athair", POS: NOUN},
    {ORTH: "tUas.", LEMMA: "Uasal", POS: NOUN},
    {ORTH: "teo.", LEMMA: "teoranta", POS: NOUN},
    {ORTH: "Teo.", LEMMA: "teoranta", POS: NOUN},
    {ORTH: "Uas.", LEMMA: "Uasal", POS: NOUN},
    {ORTH: "uimh.", LEMMA: "uimhir", POS: NOUN},
    {ORTH: "Uimh.", LEMMA: "uimhir", POS: NOUN}]:
    _exc[exc_data[ORTH]] = [exc_data]


for orth in [
    "d'", "D'"]:
    _exc[orth] = [{ORTH: orth}]


TOKENIZER_EXCEPTIONS = _exc
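A quick way to sanity-check the exceptions above (a minimal sketch, assuming this branch registers an Irish language class under spacy.lang.ga; adjust the import if the class name differs):

# Minimal sketch, assuming spacy.lang.ga exposes an Irish class on this branch.
from __future__ import unicode_literals
from spacy.lang.ga import Irish

nlp = Irish()
doc = nlp(u"Daoine a bhfuil Gaeilge acu, m.sh. tusa agus mise")
print([t.text for t in doc])  # "m.sh." should survive as a single token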
@@ -95,5 +95,5 @@ _nums = "(({ne})|({t})|({on})|({c}))({s})?".format(
     c=CURRENCY, s=_suffixes)


-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
 TOKEN_MATCH = re.compile("^({u})|({n})$".format(u=URL_PATTERN, n=_nums)).match
@@ -46,5 +46,4 @@ for orth in [
 ]:
     _exc[orth] = [{ORTH: orth}]

-
-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
@@ -35,4 +35,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]


-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
@@ -20,4 +20,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]


-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
@@ -72,4 +72,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]


-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
@@ -80,4 +80,4 @@ for orth in [
     _exc[orth] = [{ORTH: orth}]


-TOKENIZER_EXCEPTIONS = dict(_exc)
+TOKENIZER_EXCEPTIONS = _exc
@@ -2,10 +2,10 @@
 # data from Korakot Chaovavanich (https://www.facebook.com/photo.php?fbid=390564854695031&set=p.390564854695031&type=3&permPage=1&ifg=1)
 from __future__ import unicode_literals

-from ...symbols import *
-
+from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX
+from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ
 TAG_MAP = {
-    #NOUN
+    # NOUN
     "NOUN": {POS: NOUN},
     "NCMN": {POS: NOUN},
     "NTTL": {POS: NOUN},

@@ -14,7 +14,7 @@ TAG_MAP = {
     "CMTR": {POS: NOUN},
     "CFQC": {POS: NOUN},
     "CVBL": {POS: NOUN},
-    #PRON
+    # PRON
     "PRON": {POS: PRON},
     "NPRP": {POS: PRON},
     # ADJ

@@ -28,7 +28,7 @@ TAG_MAP = {
     "ADVI": {POS: ADV},
     "ADVP": {POS: ADV},
     "ADVS": {POS: ADV},
-    #INT
+    # INT
     "INT": {POS: INTJ},
     # PRON
     "PROPN": {POS: PROPN},

@@ -50,20 +50,20 @@ TAG_MAP = {
     "NCNM": {POS: NUM},
     "NLBL": {POS: NUM},
     "DCNM": {POS: NUM},
-    #AUX
+    # AUX
     "AUX": {POS: AUX},
     "XVBM": {POS: AUX},
     "XVAM": {POS: AUX},
     "XVMM": {POS: AUX},
     "XVBB": {POS: AUX},
     "XVAE": {POS: AUX},
-    #ADP
+    # ADP
     "ADP": {POS: ADP},
     "RPRE": {POS: ADP},
     # CCONJ
     "CCONJ": {POS: CCONJ},
     "JCRG": {POS: CCONJ},
-    #SCONJ
+    # SCONJ
     "SCONJ": {POS: SCONJ},
     "PREL": {POS: SCONJ},
     "JSBR": {POS: SCONJ},
@@ -1,43 +1,23 @@
 # encoding: utf8
 from __future__ import unicode_literals

-from ...symbols import *
+from ...symbols import ORTH, LEMMA

-TOKENIZER_EXCEPTIONS = {
-    "ม.ค.": [
-        {ORTH: "ม.ค.", LEMMA: "มกราคม"}
-    ],
-    "ก.พ.": [
-        {ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}
-    ],
-    "มี.ค.": [
-        {ORTH: "มี.ค.", LEMMA: "มีนาคม"}
-    ],
-    "เม.ย.": [
-        {ORTH: "เม.ย.", LEMMA: "เมษายน"}
-    ],
-    "พ.ค.": [
-        {ORTH: "พ.ค.", LEMMA: "พฤษภาคม"}
-    ],
-    "มิ.ย.": [
-        {ORTH: "มิ.ย.", LEMMA: "มิถุนายน"}
-    ],
-    "ก.ค.": [
-        {ORTH: "ก.ค.", LEMMA: "กรกฎาคม"}
-    ],
-    "ส.ค.": [
-        {ORTH: "ส.ค.", LEMMA: "สิงหาคม"}
-    ],
-    "ก.ย.": [
-        {ORTH: "ก.ย.", LEMMA: "กันยายน"}
-    ],
-    "ต.ค.": [
-        {ORTH: "ต.ค.", LEMMA: "ตุลาคม"}
-    ],
-    "พ.ย.": [
-        {ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}
-    ],
-    "ธ.ค.": [
-        {ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}
-    ]
-}
+_exc = {
+    "ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
+    "ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
+    "มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
+    "เม.ย.": [{ORTH: "เม.ย.", LEMMA: "เมษายน"}],
+    "พ.ค.": [{ORTH: "พ.ค.", LEMMA: "พฤษภาคม"}],
+    "มิ.ย.": [{ORTH: "มิ.ย.", LEMMA: "มิถุนายน"}],
+    "ก.ค.": [{ORTH: "ก.ค.", LEMMA: "กรกฎาคม"}],
+    "ส.ค.": [{ORTH: "ส.ค.", LEMMA: "สิงหาคม"}],
+    "ก.ย.": [{ORTH: "ก.ย.", LEMMA: "กันยายน"}],
+    "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
+    "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
+    "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}]
+}
+
+
+TOKENIZER_EXCEPTIONS = _exc
@@ -154,6 +154,9 @@ class Language(object):
         self._meta.setdefault('email', '')
         self._meta.setdefault('url', '')
         self._meta.setdefault('license', '')
+        self._meta['vectors'] = {'width': self.vocab.vectors_length,
+                                 'vectors': len(self.vocab.vectors),
+                                 'keys': self.vocab.vectors.n_keys}
         self._meta['pipeline'] = self.pipe_names
         return self._meta
@@ -433,8 +436,10 @@ class Language(object):
         **cfg: Config parameters.
         RETURNS: An optimizer
         """
+        if get_gold_tuples is None:
+            get_gold_tuples = lambda: []
         # Populate vocab
-        if get_gold_tuples is not None:
-        else:
         for _, annots_brackets in get_gold_tuples():
             for annots, _ in annots_brackets:
                 for word in annots[1]:
@@ -11,9 +11,9 @@ import ujson
 import msgpack

 from thinc.api import chain
-from thinc.v2v import Softmax
+from thinc.v2v import Affine, Softmax
 from thinc.t2v import Pooling, max_pool, mean_pool
-from thinc.neural.util import to_categorical
+from thinc.neural.util import to_categorical, copy_array
 from thinc.neural._classes.difference import Siamese, CauchySimilarity

 from .tokens.doc cimport Doc
@@ -130,6 +130,15 @@ class Pipe(object):
         documents and their predicted scores."""
         raise NotImplementedError

+    def add_label(self, label):
+        """Add an output label, to be predicted by the model.
+
+        It's possible to extend pre-trained models with new labels,
+        but care should be taken to avoid the "catastrophic forgetting"
+        problem.
+        """
+        raise NotImplementedError
+
     def begin_training(self, gold_tuples=tuple(), pipeline=None):
         """Initialize the pipe for training, using data examples if available.
         If no model has been initialized yet, the model is added."""
@@ -325,6 +334,14 @@ class Tagger(Pipe):
         self.cfg.setdefault('pretrained_dims',
                             self.vocab.vectors.data.shape[1])

+    @property
+    def labels(self):
+        return self.cfg.setdefault('tag_names', [])
+
+    @labels.setter
+    def labels(self, value):
+        self.cfg['tag_names'] = value
+
     def __call__(self, doc):
         tags = self.predict([doc])
         self.set_annotations([doc], tags)
@@ -352,6 +369,7 @@ class Tagger(Pipe):
         cdef Doc doc
         cdef int idx = 0
         cdef Vocab vocab = self.vocab
+        tags = list(self.labels)
         for i, doc in enumerate(docs):
             doc_tag_ids = batch_tag_ids[i]
             if hasattr(doc_tag_ids, 'get'):
@@ -359,7 +377,7 @@ class Tagger(Pipe):
             for j, tag_id in enumerate(doc_tag_ids):
                 # Don't clobber preset POS tags
                 if doc.c[j].tag == 0 and doc.c[j].pos == 0:
-                    vocab.morphology.assign_tag_id(&doc.c[j], tag_id)
+                    vocab.morphology.assign_tag(&doc.c[j], tags[tag_id])
                 idx += 1
             doc.is_tagged = True

@@ -420,6 +438,17 @@ class Tagger(Pipe):
     def Model(cls, n_tags, **cfg):
         return build_tagger_model(n_tags, **cfg)

+    def add_label(self, label):
+        if label in self.labels:
+            return 0
+        smaller = self.model[-1]._layers[-1]
+        larger = Softmax(len(self.labels)+1, smaller.nI)
+        copy_array(larger.W[:smaller.nO], smaller.W)
+        copy_array(larger.b[:smaller.nO], smaller.b)
+        self.model[-1]._layers[-1] = larger
+        self.labels.append(label)
+        return 1
+
     def use_params(self, params):
         with self.model.use_params(params):
             yield
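The resizing trick in add_label above is generic: allocate an output layer with one extra row and copy the trained weights across, so existing predictions are preserved. A standalone numpy sketch of the same idea (illustration only, not the thinc API):

import numpy

def grow_output_layer(W, b):
    # Return copies of W (nO x nI) and b (nO,) with one extra output row,
    # initialised to zero, preserving the trained weights.
    nO, nI = W.shape
    W_new = numpy.zeros((nO + 1, nI), dtype=W.dtype)
    b_new = numpy.zeros((nO + 1,), dtype=b.dtype)
    W_new[:nO] = W
    b_new[:nO] = b
    return W_new, b_new

W, b = numpy.ones((3, 4), dtype='f'), numpy.zeros((3,), dtype='f')
W2, b2 = grow_output_layer(W, b)
assert W2.shape == (4, 4) and (W2[:3] == W).all()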
@@ -675,7 +704,7 @@ class TextCategorizer(Pipe):

     @property
     def labels(self):
-        return self.cfg.get('labels', ['LABEL'])
+        return self.cfg.setdefault('labels', ['LABEL'])

     @labels.setter
     def labels(self, value):
@@ -727,6 +756,17 @@ class TextCategorizer(Pipe):
         mean_square_error = ((scores-truths)**2).sum(axis=1).mean()
         return mean_square_error, d_scores

+    def add_label(self, label):
+        if label in self.labels:
+            return 0
+        smaller = self.model[-1]._layers[-1]
+        larger = Affine(len(self.labels)+1, smaller.nI)
+        copy_array(larger.W[:smaller.nO], smaller.W)
+        copy_array(larger.b[:smaller.nO], smaller.b)
+        self.model[-1]._layers[-1] = larger
+        self.labels.append(label)
+        return 1
+
     def begin_training(self, gold_tuples=tuple(), pipeline=None):
         if pipeline and getattr(pipeline[0], 'name', None) == 'tensorizer':
             token_vector_width = pipeline[0].model.nO
@@ -14,9 +14,8 @@ from .. import util
 # These languages are used for generic tokenizer tests – only add a language
 # here if it's using spaCy's tokenizer (not a different library)
-# TODO: re-implement generic tokenizer tests
-_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'id',
+_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
               'it', 'nb', 'nl', 'pl', 'pt', 'sv', 'xx']

 _models = {'en': ['en_core_web_sm'],
            'de': ['de_core_news_md'],
            'fr': ['fr_depvec_web_lg'],
@@ -108,6 +107,11 @@ def bn_tokenizer():
     return util.get_lang_class('bn').Defaults.create_tokenizer()


+@pytest.fixture
+def ga_tokenizer():
+    return util.get_lang_class('ga').Defaults.create_tokenizer()
+
+
 @pytest.fixture
 def he_tokenizer():
     return util.get_lang_class('he').Defaults.create_tokenizer()
@@ -208,8 +208,8 @@ def test_doc_api_right_edge(en_tokenizer):

 def test_doc_api_has_vector():
     vocab = Vocab()
-    vocab.clear_vectors(2)
-    vocab.vectors.add('kitten', vector=numpy.asarray([0., 2.], dtype='f'))
+    vocab.reset_vectors(width=2)
+    vocab.set_vector('kitten', vector=numpy.asarray([0., 2.], dtype='f'))
     doc = Doc(vocab, words=['kitten'])
     assert doc.has_vector
@@ -72,9 +72,9 @@ def test_doc_token_api_is_properties(en_vocab):

 def test_doc_token_api_vectors():
     vocab = Vocab()
-    vocab.clear_vectors(2)
-    vocab.vectors.add('apples', vector=numpy.asarray([0., 2.], dtype='f'))
-    vocab.vectors.add('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
+    vocab.reset_vectors(width=2)
+    vocab.set_vector('apples', vector=numpy.asarray([0., 2.], dtype='f'))
+    vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
     doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
     assert doc.has_vector
@@ -155,13 +155,13 @@ def test_doc_token_api_head_setter(en_tokenizer):
     assert doc[2].left_edge.i == 0


-def test_sent_start(en_tokenizer):
+def test_is_sent_start(en_tokenizer):
     doc = en_tokenizer(u'This is a sentence. This is another.')
-    assert not doc[0].sent_start
-    assert not doc[5].sent_start
-    doc[5].sent_start = True
-    assert doc[5].sent_start
-    assert not doc[0].sent_start
+    assert doc[5].is_sent_start is None
+    doc[5].is_sent_start = True
+    assert doc[5].is_sent_start is True
+    # Backwards compatibility
+    assert doc[0].sent_start is False
     doc.is_parsed = True
     assert len(list(doc.sents)) == 2
0
spacy/tests/lang/ga/__init__.py
Normal file
17
spacy/tests/lang/ga/test_tokenizer.py
Normal file
@@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


GA_TOKEN_EXCEPTION_TESTS = [
    ('Niall Ó Domhnaill, Rialtas na hÉireann 1977 (lch. 600).', ['Niall', 'Ó', 'Domhnaill', ',', 'Rialtas', 'na', 'hÉireann', '1977', '(', 'lch.', '600', ')', '.']),
    ('Daoine a bhfuil Gaeilge acu, m.sh. tusa agus mise', ['Daoine', 'a', 'bhfuil', 'Gaeilge', 'acu', ',', 'm.sh.', 'tusa', 'agus', 'mise'])
]


@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
    tokens = ga_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
@@ -118,8 +118,7 @@ def test_span_to_array(doc):
     assert arr[0, 1] == len(span[0])


-@pytest.mark.xfail
 def test_span_as_doc(doc):
     span = doc[4:10]
     span_doc = span.as_doc()
-    assert span.text == span_doc.text
+    assert span.text == span_doc.text.strip()
@@ -79,9 +79,9 @@ def add_vecs_to_vocab(vocab, vectors):
     """Add list of vector tuples to given vocab. All vectors need to have the
     same length. Format: [("text", [1, 2, 3])]"""
     length = len(vectors[0][1])
-    vocab.clear_vectors(length)
+    vocab.reset_vectors(width=length)
     for word, vec in vectors:
-        vocab.set_vector(word, vec)
+        vocab.set_vector(word, vector=vec)
     return vocab
@@ -3,6 +3,7 @@ from __future__ import unicode_literals

 from ...vectors import Vectors
 from ...tokenizer import Tokenizer
+from ...strings import hash_string
 from ..util import add_vecs_to_vocab, get_doc

 import numpy
@@ -35,20 +36,19 @@ def vocab(en_vocab, vectors):


 def test_init_vectors_with_data(strings, data):
-    v = Vectors(strings, data=data)
+    v = Vectors(data=data)
     assert v.shape == data.shape

-def test_init_vectors_with_width(strings):
-    v = Vectors(strings, width=3)
-    for string in strings:
-        v.add(string)
+def test_init_vectors_with_shape(strings):
+    v = Vectors(shape=(len(strings), 3))
+    assert v.shape == (len(strings), 3)


 def test_get_vector(strings, data):
-    v = Vectors(strings, data=data)
-    for string in strings:
-        v.add(string)
+    v = Vectors(data=data)
+    strings = [hash_string(s) for s in strings]
+    for i, string in enumerate(strings):
+        v.add(string, row=i)
     assert list(v[strings[0]]) == list(data[0])
     assert list(v[strings[0]]) != list(data[1])
     assert list(v[strings[1]]) != list(data[0])
@@ -56,9 +56,10 @@ def test_get_vector(strings, data):

 def test_set_vector(strings, data):
     orig = data.copy()
-    v = Vectors(strings, data=data)
-    for string in strings:
-        v.add(string)
+    v = Vectors(data=data)
+    strings = [hash_string(s) for s in strings]
+    for i, string in enumerate(strings):
+        v.add(string, row=i)
     assert list(v[strings[0]]) == list(orig[0])
     assert list(v[strings[0]]) != list(orig[1])
     v[strings[0]] = data[1]
@@ -66,7 +67,6 @@ def test_set_vector(strings, data):
     assert list(v[strings[0]]) != list(orig[0])


-
 @pytest.fixture()
 def tokenizer_v(vocab):
     return Tokenizer(vocab, {}, None, None, None)
@@ -2,14 +2,39 @@
 from __future__ import unicode_literals

 import numpy
-import pytest
+from numpy.testing import assert_allclose
 from ...vocab import Vocab
+from ..._ml import cosine


-@pytest.mark.xfail
-@pytest.mark.parametrize('text', ["Hello"])
-def test_vocab_add_vector(en_vocab, text):
-    en_vocab.resize_vectors(10)
-    lex = en_vocab[text]
-    lex.vector = numpy.ndarray((10,), dtype='float32')
-    lex = en_vocab[text]
-    assert lex.vector.shape == (10,)
+def test_vocab_add_vector():
+    vocab = Vocab()
+    data = numpy.ndarray((5,3), dtype='f')
+    data[0] = 1.
+    data[1] = 2.
+    vocab.set_vector(u'cat', data[0])
+    vocab.set_vector(u'dog', data[1])
+    cat = vocab[u'cat']
+    assert list(cat.vector) == [1., 1., 1.]
+    dog = vocab[u'dog']
+    assert list(dog.vector) == [2., 2., 2.]
+
+
+def test_vocab_prune_vectors():
+    vocab = Vocab()
+    _ = vocab[u'cat']
+    _ = vocab[u'dog']
+    _ = vocab[u'kitten']
+    data = numpy.ndarray((5,3), dtype='f')
+    data[0] = 1.
+    data[1] = 2.
+    data[2] = 1.1
+    vocab.set_vector(u'cat', data[0])
+    vocab.set_vector(u'dog', data[1])
+    vocab.set_vector(u'kitten', data[2])
+
+    remap = vocab.prune_vectors(2)
+    assert list(remap.keys()) == [u'kitten']
+    neighbour, similarity = list(remap.values())[0]
+    assert neighbour == u'cat', remap
+    assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-6)
@@ -28,7 +28,7 @@ from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
 from ..attrs cimport ENT_TYPE, SENT_START
 from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
 from ..util import normalize_slice
-from ..compat import is_config, copy_reg, pickle
+from ..compat import is_config, copy_reg, pickle, basestring_
 from .. import about
 from .. import util
 from .underscore import Underscore
@@ -571,7 +571,8 @@ cdef class Doc:
         cdef np.ndarray[attr_t, ndim=1] attr_ids
         cdef np.ndarray[attr_t, ndim=2] output
         # Handle scalar/list inputs of strings/ints for py_attr_ids
-        if not hasattr(py_attr_ids, '__iter__'):
+        if not hasattr(py_attr_ids, '__iter__') \
+        and not isinstance(py_attr_ids, basestring_):
             py_attr_ids = [py_attr_ids]

         # Allow strings, e.g. 'lemma' or 'LEMMA'
@@ -474,17 +474,15 @@ cdef class Span:
         """RETURNS (int): The number of leftward immediate children of the
         span, in the syntactic dependency parse.
         """
-        # TODO: implement
         def __get__(self):
-            raise NotImplementedError
+            return len(list(self.lefts))

     property n_rights:
         """RETURNS (int): The number of rightward immediate children of the
         span, in the syntactic dependency parse.
         """
-        # TODO: implement
         def __get__(self):
-            raise NotImplementedError
+            return len(list(self.rights))

     property subtree:
         """Tokens that descend from tokens in the span, but fall outside it.
@@ -302,10 +302,7 @@ cdef class Token:
         def __get__(self):
             if 'vector' in self.doc.user_token_hooks:
                 return self.doc.user_token_hooks['vector'](self)
-            if self.has_vector:
-                return self.vocab.get_vector(self.c.lex.orth)
-            else:
-                return self.doc.tensor[self.i]
+            return self.vocab.get_vector(self.c.lex.orth)

     property vector_norm:
         """The L2 norm of the token's vector representation.
@@ -333,9 +330,29 @@ cdef class Token:
            return self.c.r_kids

    property sent_start:
        # TODO: fix and document
        # TODO deprecation warning
        def __get__(self):
            return self.c.sent_start
            # Handle broken backwards compatibility case: doc[0].sent_start
            # was False.
            if self.i == 0:
                return False
            else:
                return self.is_sent_start

        def __set__(self, value):
            self.is_sent_start = value

    property is_sent_start:
        """RETURNS (bool / None): Whether the token starts a sentence.
        None if unknown.
        """
        def __get__(self):
            if self.c.sent_start == 0:
                return None
            elif self.c.sent_start < 0:
                return False
            else:
                return True

        def __set__(self, value):
            if self.doc.is_parsed:

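The getter above packs three states into a single C integer. A standalone Python sketch of that mapping (a hypothetical helper, not part of the diff):

    def decode_sent_start(c_sent_start):
        # 0 means unknown, a negative value means "not a sentence start",
        # a positive value means "is a sentence start".
        if c_sent_start == 0:
            return None
        return c_sent_start > 0
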
@@ -236,8 +236,6 @@ def is_in_jupyter():


def get_cuda_stream(require=False):
    # TODO: Error and tell to install chainer if not found
    # Requires GPU
    return CudaStream() if CudaStream is not None else None

@@ -10,11 +10,17 @@ cimport numpy as np
from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model

from .strings cimport StringStore
from .strings cimport StringStore, hash_string
from .compat import basestring_, path2str
from . import util


def unpickle_vectors(keys_and_rows, data):
    vectors = Vectors(data=data)
    for key, row in keys_and_rows:
        vectors.add(key, row=row)
    return vectors  # without this, pickle.loads() would hand back None


cdef class Vectors:
    """Store, save and load word vectors.

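A quick roundtrip sketch, assuming the constructor signature introduced further down in this diff (`Vectors(data=..., keys=...)`):

    import pickle
    import numpy
    from spacy.vectors import Vectors

    data = numpy.zeros((2, 300), dtype='f')
    vectors = Vectors(data=data, keys=[u'cat', u'dog'])
    # __reduce__ hands (keys_and_rows, data) to unpickle_vectors, which
    # rebuilds the table and its key-to-row mapping.
    restored = pickle.loads(pickle.dumps(vectors))
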
@@ -22,138 +28,36 @@ cdef class Vectors:
    instance of numpy.ndarray (for CPU vectors) or cupy.ndarray
    (for GPU vectors). `vectors.key2row` is a dictionary mapping word hashes to
    rows in the vectors.data table.

    Multiple keys can be mapped to the same vector, so len(keys) may be greater
    (but not smaller) than data.shape[0].

    Multiple keys can be mapped to the same vector, and not all of the rows in
    the table need to be assigned --- so len(list(vectors.keys())) may be
    greater or smaller than vectors.shape[0].
    """
    cdef public object data
    cdef readonly StringStore strings
    cdef public object key2row
    cdef public object keys
    cdef public int i
    cdef public object _unset

    def __init__(self, strings, width=0, data=None):
        """Create a new vector store. To keep the vector table empty, pass
        `width=0`. You can also create the vector table and add vectors one by
        one, or set the vector values directly on initialisation.
    def __init__(self, *, shape=None, data=None, keys=None):
        """Create a new vector store.

        strings (StringStore or list): List of strings or StringStore that maps
            strings to hash values, and vice versa.
        width (int): Number of dimensions.
        shape (tuple): Size of the table, as (# entries, # columns)
        data (numpy.ndarray): The vector data.
        keys (iterable): A sequence of keys, aligned with the data.
        RETURNS (Vectors): The newly created object.
        """
        if isinstance(strings, StringStore):
            self.strings = strings
        if data is None:
            if shape is None:
                shape = (0,0)
            data = numpy.zeros(shape, dtype='f')
        self.data = data
        self.key2row = OrderedDict()
        if self.data is not None:
            self._unset = set(range(self.data.shape[0]))
        else:
            self.strings = StringStore()
            for string in strings:
                self.strings.add(string)
        if data is not None:
            self.data = numpy.asarray(data, dtype='f')
        else:
            self.data = numpy.zeros((len(self.strings), width), dtype='f')
        self.i = 0
        self.key2row = {}
        self.keys = numpy.zeros((self.data.shape[0],), dtype='uint64')
        if data is not None:
            for i, string in enumerate(self.strings):
                if i >= self.data.shape[0]:
                    break
                self.add(self.strings[string], vector=self.data[i])

    def __reduce__(self):
        return (Vectors, (self.strings, self.data))

    def __getitem__(self, key):
        """Get a vector by key. If key is a string, it is hashed to an integer
        ID using the vectors.strings table. If the integer key is not found in
        the table, a KeyError is raised.

        key (unicode / int): The key to get the vector for.
        RETURNS (numpy.ndarray): The vector for the key.
        """
        if isinstance(key, basestring):
            key = self.strings[key]
        i = self.key2row[key]
        if i is None:
            raise KeyError(key)
        else:
            return self.data[i]

    def __setitem__(self, key, vector):
        """Set a vector for the given key. If key is a string, it is hashed
        to an integer ID using the vectors.strings table.

        key (unicode / int): The key to set the vector for.
        vector (numpy.ndarray): The vector to set.
        """
        if isinstance(key, basestring):
            key = self.strings.add(key)
        i = self.key2row[key]
        self.data[i] = vector

    def __iter__(self):
        """Yield vectors from the table.

        YIELDS (numpy.ndarray): A vector.
        """
        yield from self.data

    def __len__(self):
        """Return the number of vectors that have been assigned.

        RETURNS (int): The number of vectors in the data.
        """
        return self.i

    def __contains__(self, key):
        """Check whether a key has a vector entry in the table.

        key (unicode / int): The key to check.
        RETURNS (bool): Whether the key has a vector entry.
        """
        if isinstance(key, basestring_):
            key = self.strings[key]
        return key in self.key2row

    def add(self, key, *, vector=None, row=None):
        """Add a key to the table. Keys can be mapped to an existing vector
        by setting `row`, or a new vector can be added.

        key (unicode / int): The key to add.
        vector (numpy.ndarray / None): A vector to add for the key.
        row (int / None): The row-number of a vector to map the key to.
        """
        if isinstance(key, basestring_):
            key = self.strings.add(key)
        if key in self.key2row and row is None:
            row = self.key2row[key]
        elif key in self.key2row and row is not None:
            self.key2row[key] = row
        elif row is None:
            row = self.i
            self.i += 1
            if row >= self.keys.shape[0]:
                self.keys.resize((row*2,))
                self.data.resize((row*2, self.data.shape[1]))
                self.keys[row] = key

        self.key2row[key] = row
        self.keys[row] = key
        if vector is not None:
            self.data[row] = vector
        return row

    def items(self):
        """Iterate over `(string key, vector)` pairs, in order.

        YIELDS (tuple): A key/vector pair.
        """
        for i, key in enumerate(self.keys):
            string = self.strings[key]
            row = self.key2row[key]
            yield string, self.data[row]
        self._unset = set()
        if keys is not None:
            for i, key in enumerate(keys):
                self.add(key, row=i)

    @property
    def shape(self):

@@ -164,9 +68,219 @@ cdef class Vectors:
        """
        return self.data.shape

    def most_similar(self, key):
        # TODO: implement
        raise NotImplementedError
    @property
    def size(self):
        """RETURNS (int): rows*dims"""
        return self.data.shape[0] * self.data.shape[1]

    @property
    def is_full(self):
        """RETURNS (bool): `True` if no slots are available for new keys."""
        return len(self._unset) == 0

    @property
    def n_keys(self):
        """RETURNS (int): The number of keys in the table. Note that this is the
        number of all keys, not just unique vectors."""
        return len(self.key2row)

    def __reduce__(self):
        keys_and_rows = self.key2row.items()
        return (unpickle_vectors, (keys_and_rows, self.data))

    def __getitem__(self, key):
        """Get a vector by key. If the key is not found, a KeyError is raised.

        key (int): The key to get the vector for.
        RETURNS (ndarray): The vector for the key.
        """
        i = self.key2row[key]
        if i is None:
            raise KeyError(key)
        else:
            return self.data[i]

    def __setitem__(self, key, vector):
        """Set a vector for the given key.

        key (int): The key to set the vector for.
        vector (ndarray): The vector to set.
        """
        i = self.key2row[key]
        self.data[i] = vector
        if i in self._unset:
            self._unset.remove(i)

    def __iter__(self):
        """Iterate over the keys in the table.

        YIELDS (int): A key in the table.
        """
        yield from self.key2row

    def __len__(self):
        """Return the number of vectors in the table.

        RETURNS (int): The number of vectors in the data.
        """
        return self.data.shape[0]

    def __contains__(self, key):
        """Check whether a key has been mapped to a vector entry in the table.

        key (int): The key to check.
        RETURNS (bool): Whether the key has a vector entry.
        """
        return key in self.key2row

    def resize(self, shape, inplace=False):
        """Resize the underlying vectors array. If inplace=True, the memory
        is reallocated. This may cause other references to the data to become
        invalid, so only use inplace=True if you're sure that's what you want.

        If the number of vectors is reduced, keys mapped to rows that have been
        deleted are removed. These removed items are returned as a list of
        `(key, row)` tuples.
        """
        if inplace:
            self.data.resize(shape, refcheck=False)
        else:
            xp = get_array_module(self.data)
            self.data = xp.resize(self.data, shape)
        filled = {row for row in self.key2row.values()}
        self._unset = {row for row in range(shape[0]) if row not in filled}
        removed_items = []
        for key, row in list(self.key2row.items()):
            if row >= shape[0]:
                self.key2row.pop(key)
                removed_items.append((key, row))
        return removed_items
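A usage sketch for `resize` when shrinking the table, under the API above:

    vectors = Vectors(shape=(4, 300))
    vectors.add(u'cat', row=0)
    vectors.add(u'dog', row=3)
    # Row 3 disappears when we shrink to 2 rows, so the key mapped to it
    # is dropped and reported back as a (key, row) pair.
    removed = vectors.resize((2, 300))
    assert len(removed) == 1
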

    def keys(self):
        """A sequence of the keys in the table.

        RETURNS (iterable): The keys.
        """
        return self.key2row.keys()

    def values(self):
        """Iterate over vectors that have been assigned to at least one key.

        Note that some vectors may be unassigned, so the number of vectors
        returned may be less than the length of the vectors table.

        YIELDS (ndarray): A vector in the table.
        """
        for row, vector in enumerate(range(self.data.shape[0])):
            if row not in self._unset:
                yield vector

    def items(self):
        """Iterate over `(key, vector)` pairs.

        YIELDS (tuple): A key/vector pair.
        """
        for key, row in self.key2row.items():
            yield key, self.data[row]

    def find(self, *, key=None, keys=None, row=None, rows=None):
        """Look up one or more keys by row, or vice versa.

        key (unicode / int): Find the row that the given key points to.
            Returns int, -1 if missing.
        keys (iterable): Find rows that the keys point to.
            Returns ndarray.
        row (int): Find the first key that points to the row.
            Returns int.
        rows (iterable): Find the keys that point to the rows.
            Returns ndarray.
        RETURNS: The requested key, keys, row or rows.
        """
        if sum(arg is None for arg in (key, keys, row, rows)) != 3:
            raise ValueError("One (and only one) keyword arg must be set.")
        xp = get_array_module(self.data)
        if key is not None:
            if isinstance(key, basestring_):
                key = hash_string(key)
            return self.key2row.get(key, -1)
        elif keys is not None:
            keys = [hash_string(key) if isinstance(key, basestring_) else key
                    for key in keys]
            rows = [self.key2row.get(key, -1.) for key in keys]
            return xp.asarray(rows, dtype='i')
        else:
            targets = set()
            if row is not None:
                targets.add(row)
            else:
                targets.update(rows)
            results = []
            for key, row in self.key2row.items():
                if row in targets:
                    results.append(key)
                    targets.remove(row)
            return xp.asarray(results, dtype='uint64')
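Exactly one keyword may be passed per call; a usage sketch:

    row = vectors.find(key=u'cat')              # int row, or -1 if missing
    rows = vectors.find(keys=[u'cat', u'dog'])  # ndarray of rows
    keys = vectors.find(rows=[0, 1])            # ndarray of uint64 key hashes
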

    def add(self, key, *, vector=None, row=None):
        """Add a key to the table. Keys can be mapped to an existing vector
        by setting `row`, or a new vector can be added.

        key (int): The key to add.
        vector (ndarray / None): A vector to add for the key.
        row (int / None): The row number of a vector to map the key to.
        RETURNS (int): The row the vector was added to.
        """
        if isinstance(key, basestring_):
            key = hash_string(key)
        if row is None and key in self.key2row:
            row = self.key2row[key]
        elif row is None:
            if self.is_full:
                raise ValueError("Cannot add new key to vectors -- full")
            row = min(self._unset)

        self.key2row[key] = row
        if vector is not None:
            self.data[row] = vector
        if row in self._unset:
            self._unset.remove(row)
        return row
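Because `row` can point at an existing entry, several keys can share one vector; continuing the sketch above:

    import numpy

    vectors = Vectors(shape=(2, 300))
    row = vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
    vectors.add(u'kitty', row=row)  # alias: both keys map to the same row
    assert vectors.find(key=u'cat') == vectors.find(key=u'kitty')
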

    def most_similar(self, queries, *, batch_size=1024):
        """For each of the given vectors, find the single entry most similar
        to it, by cosine.

        Queries are by vector. Results are returned as a `(keys, best_rows,
        scores)` tuple. If `queries` is large, the calculations are performed in
        chunks, to avoid consuming too much memory. You can set the `batch_size`
        to control the size/space trade-off during the calculations.

        queries (ndarray): An array with one or more vectors.
        batch_size (int): The batch size to use.
        RETURNS (tuple): The most similar entry as a `(keys, best_rows, scores)`
            tuple.
        """
        xp = get_array_module(self.data)

        vectors = self.data / xp.linalg.norm(self.data, axis=1, keepdims=True)

        best_rows = xp.zeros((queries.shape[0],), dtype='i')
        scores = xp.zeros((queries.shape[0],), dtype='f')
        # Work in batches, to avoid memory problems.
        for i in range(0, queries.shape[0], batch_size):
            batch = queries[i : i+batch_size]
            batch /= xp.linalg.norm(batch, axis=1, keepdims=True)
            # batch   e.g. (1024, 300)
            # vectors e.g. (10000, 300)
            # sims    e.g. (1024, 10000)
            sims = xp.dot(batch, vectors.T)
            best_rows[i:i+batch_size] = sims.argmax(axis=1)
            scores[i:i+batch_size] = sims.max(axis=1)

        xp = get_array_module(self.data)
        row2key = {row: key for key, row in self.key2row.items()}
        keys = xp.asarray([row2key[row] for row in best_rows], dtype='uint64')
        return (keys, best_rows, scores)
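A usage sketch: queries are given as a 2D array, and each output aligns row-wise with the queries.

    import numpy

    queries = numpy.random.uniform(-1, 1, (2, 300)).astype('f')
    keys, best_rows, scores = vectors.most_similar(queries)
    # keys[i]: hash of the entry closest to queries[i] by cosine;
    # best_rows[i]: its row in the table; scores[i]: the similarity.
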

    def from_glove(self, path):
        """Load GloVe vectors from a directory. Assumes binary format,

@@ -176,27 +290,32 @@ cdef class Vectors:
        By default GloVe outputs 64-bit vectors.

        path (unicode / Path): The path to load the GloVe vectors from.
        RETURNS: A `StringStore` object, holding the key-to-string mapping.
        """
        path = util.ensure_path(path)
        width = None
        for name in path.iterdir():
            if name.parts[-1].startswith('vectors'):
                _, dims, dtype, _2 = name.parts[-1].split('.')
                self.width = int(dims)
                width = int(dims)
                break
        else:
            raise IOError("Expected file named e.g. vectors.128.f.bin")
        bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
                                                             dtype=dtype)
        xp = get_array_module(self.data)
        self.data = None
        with bin_loc.open('rb') as file_:
            self.data = numpy.fromfile(file_, dtype='float64')
            self.data = numpy.ascontiguousarray(self.data, dtype='float32')
            self.data = xp.fromfile(file_, dtype=dtype)
            if dtype != 'float32':
                self.data = xp.ascontiguousarray(self.data, dtype='float32')
        n = 0
        strings = StringStore()
        with (path / 'vocab.txt').open('r') as file_:
            for line in file_:
                self.add(line.strip())
                n += 1
                if (self.data.size % self.width) == 0:
                    self.data
            for i, line in enumerate(file_):
                key = strings.add(line.strip())
                self.add(key, row=i)
        return strings

    def to_disk(self, path, **exclude):
        """Save the current state to a directory.

@@ -212,7 +331,7 @@ cdef class Vectors:
        save_array = lambda arr, file_: xp.save(file_, arr)
        serializers = OrderedDict((
            ('vectors', lambda p: save_array(self.data, p.open('wb'))),
            ('keys', lambda p: xp.save(p.open('wb'), self.keys))
            ('key2row', lambda p: msgpack.dump(self.key2row, p.open('wb')))
        ))
        return util.to_disk(path, serializers, exclude)

@@ -223,12 +342,18 @@ cdef class Vectors:
        path (unicode / Path): Directory path, string or Path-like object.
        RETURNS (Vectors): The modified object.
        """
        def load_key2row(path):
            if path.exists():
                self.key2row = msgpack.load(path.open('rb'))
            for key, row in self.key2row.items():
                if row in self._unset:
                    self._unset.remove(row)

        def load_keys(path):
            if path.exists():
                self.keys = numpy.load(path2str(path))
                for i, key in enumerate(self.keys):
                    self.keys[i] = key
                    self.key2row[key] = i
                keys = numpy.load(str(path))
                for i, key in enumerate(keys):
                    self.add(key, row=i)

        def load_vectors(path):
            xp = Model.ops.xp

@@ -236,6 +361,7 @@ cdef class Vectors:
            self.data = xp.load(path)

        serializers = OrderedDict((
            ('key2row', load_key2row),
            ('keys', load_keys),
            ('vectors', load_vectors),
        ))

@@ -254,7 +380,7 @@ cdef class Vectors:
        else:
            return msgpack.dumps(self.data)
        serializers = OrderedDict((
            ('keys', lambda: msgpack.dumps(self.keys)),
            ('key2row', lambda: msgpack.dumps(self.key2row)),
            ('vectors', serialize_weights)
        ))
        return util.to_bytes(serializers, exclude)

@@ -272,14 +398,8 @@ cdef class Vectors:
        else:
            self.data = msgpack.loads(b)

        def load_keys(keys):
            self.keys.resize((len(keys),))
            for i, key in enumerate(keys):
                self.keys[i] = key
                self.key2row[key] = i

        deserializers = OrderedDict((
            ('keys', lambda b: load_keys(msgpack.loads(b))),
            ('key2row', lambda b: self.key2row.update(msgpack.loads(b))),
            ('vectors', deserialize_weights)
        ))
        util.from_bytes(data, deserializers, exclude)

@@ -55,7 +55,7 @@ cdef class Vocab:
            _ = self[string]
        self.lex_attr_getters = lex_attr_getters
        self.morphology = Morphology(self.strings, tag_map, lemmatizer)
        self.vectors = Vectors(self.strings, width=0)
        self.vectors = Vectors()

    property lang:
        def __get__(self):

@@ -192,10 +192,11 @@ cdef class Vocab:

        YIELDS (Lexeme): An entry in the vocabulary.
        """
        cdef attr_t orth
        cdef attr_t key
        cdef size_t addr
        for orth, addr in self._by_orth.items():
            yield Lexeme(self, orth)
        for key, addr in self._by_orth.items():
            lex = Lexeme(self, key)
            yield lex

    def __getitem__(self, id_or_string):
        """Retrieve a lexeme, given an int ID or a unicode string. If a

@@ -213,7 +214,7 @@ cdef class Vocab:
        >>> assert nlp.vocab[apple] == nlp.vocab[u'apple']
        """
        cdef attr_t orth
        if type(id_or_string) == unicode:
        if isinstance(id_or_string, unicode):
            orth = self.strings.add(id_or_string)
        else:
            orth = id_or_string

@@ -240,19 +241,23 @@ cdef class Vocab:
    def vectors_length(self):
        return self.vectors.data.shape[1]

    def clear_vectors(self, width=None):
    def reset_vectors(self, *, width=None, shape=None):
        """Drop the current vector table. Because all vectors must be the same
        width, you have to call this to change the size of the vectors.
        """
        if width is None:
            width = self.vectors.data.shape[1]
        self.vectors = Vectors(self.strings, width=width)
        if width is not None and shape is not None:
            raise ValueError("Only one of width and shape can be specified")
        elif shape is not None:
            self.vectors = Vectors(shape=shape)
        else:
            width = width if width is not None else self.vectors.data.shape[1]
            self.vectors = Vectors(shape=(self.vectors.shape[0], width))

    def prune_vectors(self, nr_row, batch_size=1024):
        """Reduce the current vector table to `nr_row` unique entries. Words
        mapped to the discarded vectors will be remapped to the closest vector
        among those remaining.

        For example, suppose the original table had vectors for the words:
        ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
        two rows, we would discard the vectors for 'feline' and 'reclined'.

@@ -263,28 +268,41 @@ cdef class Vocab:
        The similarities are judged by cosine. The original vectors may
        be large, so the cosines are calculated in minibatches, to reduce
        memory usage.

        nr_row (int): The number of rows to keep in the vector table.
        batch_size (int): Batch of vectors for calculating the similarities.
            Larger batch sizes might be faster, while temporarily requiring
            more memory.
        RETURNS (dict): A dictionary keyed by removed words mapped to
            `(string, score)` tuples, where `string` is the entry the removed
            word was mapped to, and `score` the similarity score between the
            two words.
        """
        xp = get_array_module(self.vectors.data)
        # Work in batches, to avoid memory problems.
        keep = self.vectors.data[:nr_row]
        toss = self.vectors.data[nr_row:]
        # Normalize the vectors, so cosine similarity is just dot product.
        # Note we can't modify the ones we're keeping in-place...
        keep = keep / (xp.linalg.norm(keep)+1e-8)
        keep = xp.ascontiguousarray(keep.T)
        neighbours = xp.zeros((toss.shape[0],), dtype='i')
        for i in range(0, toss.shape[0], batch_size):
            batch = toss[i : i+batch_size]
            batch /= xp.linalg.norm(batch)+1e-8
            neighbours[i:i+batch_size] = xp.dot(batch, keep).argmax(axis=1)
        for lex in self:
            # If we're losing the vector for this word, map it to the nearest
            # vector we're keeping.
            if lex.rank >= nr_row:
                lex.rank = neighbours[lex.rank-nr_row]
                self.vectors.add(lex.orth, row=lex.rank)
        # Make copy, to encourage the original table to be garbage collected.
        self.vectors.data = xp.ascontiguousarray(self.vectors.data[:nr_row])
        # Make prob negative so it sorts by rank ascending
        # (key2row contains the rank)
        priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
                    for lex in self if lex.orth in self.vectors.key2row]
        priority.sort()
        indices = xp.asarray([i for (prob, i, key) in priority], dtype='i')
        keys = xp.asarray([key for (prob, i, key) in priority], dtype='uint64')

        keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]])
        toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])

        self.vectors = Vectors(data=keep, keys=keys)

        syn_keys, syn_rows, scores = self.vectors.most_similar(toss)

        remap = {}
        for i, key in enumerate(keys[nr_row:]):
            self.vectors.add(key, row=syn_rows[i])
            word = self.strings[key]
            synonym = self.strings[syn_keys[i]]
            score = scores[i]
            remap[word] = (synonym, score)
        link_vectors_to_models(self)
        return remap
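A usage sketch of the remap returned above (the model name is illustrative, not part of this diff):

    import spacy

    nlp = spacy.load('en_core_web_md')  # hypothetical vectors model
    remap = nlp.vocab.prune_vectors(10000)
    # e.g. remap[u'feline'] == (u'cat', 0.83): 'feline' now shares the
    # row kept for 'cat', with cosine similarity 0.83 between the two
    # original vectors.
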

    def get_vector(self, orth):
        """Retrieve a vector for a word in the vocabulary. Words can be looked

@@ -306,8 +324,16 @@ cdef class Vocab:
        """Set a vector for a word in the vocabulary. Words can be referenced
        by string or int ID.
        """
        if not isinstance(orth, basestring_):
            orth = self.strings[orth]
        if isinstance(orth, basestring_):
            orth = self.strings.add(orth)
        if self.vectors.is_full and orth not in self.vectors:
            new_rows = max(100, int(self.vectors.shape[0]*1.3))
            if self.vectors.shape[1] == 0:
                width = vector.size
            else:
                width = self.vectors.shape[1]
            self.vectors.resize((new_rows, width))
            self.vectors.add(orth, vector=vector)
        self.vectors.add(orth, vector=vector)

    def has_vector(self, orth):

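The resize branch above grows the table by about 30% (and at least 100 rows) whenever it is full, so repeated inserts stay cheap; a usage sketch, assuming a loaded `nlp` object:

    import numpy

    vector = numpy.random.uniform(-1, 1, (300,)).astype('f')
    nlp.vocab.set_vector(u'avocado', vector)
    assert nlp.vocab.has_vector(u'avocado')
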
@@ -84,8 +84,8 @@
],

"ALPHA": true,
"V_CSS": "2.0a1",
"V_JS": "2.0a0",
"V_CSS": "2.0a2",
"V_JS": "2.0a1",
"DEFAULT_SYNTAX": "python",
"ANALYTICS": "UA-58931649-1",
"MAILCHIMP": {

@@ -41,9 +41,6 @@
- var comps = path.split('#');
- return "top-level#" + comps[0] + '.' + comps[1];
- }
- else if (path.startsWith('cli#')) {
- return "top-level#" + path.split('#')[1];
- }
- return path;
- }

@@ -281,7 +281,12 @@ mixin github(repo, file, height, alt_file, language)

figure.o-block
pre.c-code-block.o-block-small(class="lang-#{(language || DEFAULT_SYNTAX)}" style="height: #{height}px; min-height: #{height}px")
code.c-code-block__content(data-gh-embed="#{repo}/#{branch}/#{file}")
code.c-code-block__content(data-gh-embed="#{repo}/#{branch}/#{file}").
    Can't fetch code example from GitHub :(

    Please use the link below to view the example. If you've come across
    a broken link, we always appreciate a pull request to the repository,
    or a report on the issue tracker. Thanks!

footer.o-grid.u-text
.o-block-small.u-flex-full.u-padding-small #[+icon("github")] #[code.u-break.u-break--all=repo + '/' + (alt_file || file)]

@@ -20,7 +20,7 @@ for id in CURRENT_MODELS

p(data-tpl=id data-tpl-key="description")

div(data-tpl=id data-tpl-key="error" style="display: none")
div(data-tpl=id data-tpl-key="error")
+infobox
| Unable to load model details from GitHub. To find out more
| about this model, see the overview of the

@@ -54,7 +54,7 @@ for id in CURRENT_MODELS
+cell
.o-field.u-float-left
select.o-field__select.u-text-small(data-tpl=id data-tpl-key="compat")
.o-empty(data-tpl=id data-tpl-key="compat-versions")
div(data-tpl=id data-tpl-key="compat-versions")

section(data-tpl=id data-tpl-key="benchmarks" style="display: none")
+grid.o-block-small

@@ -1,43 +1,86 @@
//- 💫 INCLUDES > SCRIPTS

if quickstart
script(src="/assets/js/quickstart.min.js")
script(src="/assets/js/vendor/quickstart.min.js")

if IS_PAGE
script(src="/assets/js/in-view.min.js")
script(src="/assets/js/vendor/in-view.min.js")

if environment == "deploy"
script(async src="https://www.google-analytics.com/analytics.js")

script(src="/assets/js/prism.min.js")
script(src="/assets/js/main.js?v#{V_JS}")
script(src="/assets/js/vendor/prism.min.js")

if SECTION == "models"
script(src="/assets/js/vendor/chart.min.js")
script(src="/assets/js/models.js?v#{V_JS}" type="module")

script
| new ProgressBar('.js-progress');

if changelog
| new Changelog('!{SOCIAL.github}', 'spacy');

if quickstart
| new Quickstart("#qs");

if IS_PAGE
| new SectionHighlighter('data-section', 'data-nav');
| new GitHubEmbed('!{SOCIAL.github}', 'data-gh-embed');
| ((window.gitter = {}).chat = {}).options = {
|     useStyles: false,
|     activationElement: '.js-gitter-button',
|     targetElement: '.js-gitter',
|     room: '!{SOCIAL.gitter}'
| };

if HAS_MODELS
| new ModelLoader('!{MODELS_REPO}', !{JSON.stringify(CURRENT_MODELS)}, !{JSON.stringify(MODEL_LICENSES)}, !{JSON.stringify(MODEL_BENCHMARKS)});

if environment == "deploy"
| window.ga=window.ga||function(){
| (ga.q=ga.q||[]).push(arguments)}; ga.l=+new Date;
| ga('create', '#{ANALYTICS}', 'auto'); ga('send', 'pageview');


if IS_PAGE
script
| ((window.gitter = {}).chat = {}).options = {
|     useStyles: false,
|     activationElement: '.js-gitter-button',
|     targetElement: '.js-gitter',
|     room: '!{SOCIAL.gitter}'
| };
script(src="https://sidecar.gitter.im/dist/sidecar.v1.js" async defer)


//- JS modules – slightly hacky, but necessary to dynamically instantiate the
classes with data from the Harp JSON files, while still being able to
support older browsers that can't handle JS modules. More details:
https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7

- ProgressBar = "new ProgressBar('.js-progress');"
- Changelog = "new Changelog('" + SOCIAL.github + "', 'spacy');"
- NavHighlighter = "new NavHighlighter('data-section', 'data-nav');"
- GitHubEmbed = "new GitHubEmbed('" + SOCIAL.github + "', 'data-gh-embed');"
- ModelLoader = "new ModelLoader('" + MODELS_REPO + "'," + JSON.stringify(CURRENT_MODELS) + "," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + ");"
- ModelComparer = "new ModelComparer('" + MODELS_REPO + "'," + JSON.stringify(MODEL_LICENSES) + "," + JSON.stringify(MODEL_BENCHMARKS) + "," + JSON.stringify(LANGUAGES) + "," + JSON.stringify(MODEL_META) + "," + JSON.stringify(default_models || false) + ");"

//- Browsers with JS module support.
Will be ignored otherwise.

script(type="module")
| import ProgressBar from '/assets/js/progress.js';
!=ProgressBar
if changelog
| import Changelog from '/assets/js/changelog.js';
!=Changelog
if IS_PAGE
| import NavHighlighter from '/assets/js/nav-highlighter.js';
!=NavHighlighter
| import GitHubEmbed from '/assets/js/github-embed.js';
!=GitHubEmbed
if HAS_MODELS
| import { ModelLoader } from '/assets/js/models.js';
!=ModelLoader
if compare_models
| import { ModelComparer } from '/assets/js/models.js';
!=ModelComparer

//- Browsers with no JS module support.
Won't be fetched or interpreted otherwise.

script(nomodule src="/assets/js/rollup.js")
script(nomodule)
!=ProgressBar
if changelog
!=Changelog
if IS_PAGE
!=NavHighlighter
!=GitHubEmbed
if HAS_MODELS
!=ModelLoader
if compare_models
!=ModelComparer

@@ -19,5 +19,5 @@ menu.c-sidebar.js-sidebar.u-text
- var counter = 0
for id, title in menu
- counter++
li.c-sidebar__crumb__item(data-nav=id class=(counter == 1) ? "is-active" : null)
li.c-sidebar__crumb__item(data-nav=id)
+a("#section-" + id)=title

@@ -62,6 +62,9 @@ svg(style="position: absolute; visibility: hidden; width: 0; height: 0;" width="
symbol#svg_explosion(viewBox="0 0 500 500")
path(fill="currentColor" d="M111.7 74.9L91.2 93.1l9.1 10.2 17.8-15.8 7.4 8.4-17.8 15.8 10.1 11.4 20.6-18.2 7.7 8.7-30.4 26.9-41.9-47.3 30.3-26.9 7.6 8.6zM190.8 59.6L219 84.3l-14.4 4.5-20.4-18.2-6.4 26.6-14.4 4.5 8.9-36.4-26.9-24.1 14.3-4.5L179 54.2l5.7-25.2 14.3-4.5-8.2 35.1zM250.1 21.2l27.1 3.4c6.1.8 10.8 3.1 14 7.2 3.2 4.1 4.5 9.2 3.7 15.5-.8 6.3-3.2 11-7.4 14.1-4.1 3.1-9.2 4.3-15.3 3.5L258 63.2l-2.8 22.3-13-1.6 7.9-62.7zm11.5 13l-2.2 17.5 12.6 1.6c5.1.6 9.1-2 9.8-7.6.7-5.6-2.5-9.2-7.6-9.9l-12.6-1.6zM329.1 95.4l23.8 13.8-5.8 10L312 98.8l31.8-54.6 11.3 6.6-26 44.6zM440.5 145c-1.3 8.4-5.9 15.4-13.9 21.1s-16.2 7.7-24.6 6.1c-8.4-1.6-15.3-6.3-20.8-14.1-5.5-7.9-7.6-16-6.4-24.4 1.3-8.5 6-15.5 14-21.1 8-5.6 16.2-7.7 24.5-6 8.4 1.6 15.4 6.3 20.9 14.2 5.5 7.6 7.6 15.7 6.3 24.2zM412 119c-5.1-.8-10.3.6-15.6 4.4-5.2 3.7-8.4 8.1-9.4 13.2-1 5.2.2 10.1 3.5 14.8 3.4 4.8 7.5 7.5 12.7 8.2 5.2.8 10.4-.7 15.6-4.4 5.3-3.7 8.4-8.1 9.4-13.2 1.1-5.1-.1-9.9-3.4-14.7-3.4-4.8-7.6-7.6-12.8-8.3zM471.5 237.9c-2.8 4.8-7.1 7.6-13 8.7l-2.6-13.1c5.3-.9 8.1-5 7.2-11-.9-5.8-4.3-8.8-8.9-8.2-2.3.3-3.7 1.4-4.5 3.3-.7 1.9-1.4 5.2-1.7 10.1-.8 7.5-2.2 13.1-4.3 16.9-2.1 3.9-5.7 6.2-10.9 7-6.3.9-11.3-.5-15.2-4.4-3.9-3.8-6.3-9-7.3-15.7-1.1-7.4-.2-13.7 2.6-18.8 2.8-5.1 7.4-8.2 13.7-9.2l2.6 13c-5.6 1.1-8.7 6.6-7.7 13.4 1 6.6 3.9 9.5 8.6 8.8 4.4-.7 5.7-4.5 6.7-14.1.3-3.5.7-6.2 1.1-8.4.4-2.2 1.2-4.4 2.2-6.8 2.1-4.7 6-7.2 11.8-8.1 5.4-.8 10.3.4 14.5 3.7 4.2 3.3 6.9 8.5 8 15.6.9 6.9-.1 12.6-2.9 17.3zM408.6 293.5l2.4-12.9 62 11.7-2.4 12.9-62-11.7zM419.6 396.9c-8.3 2-16.5.3-24.8-5-8.2-5.3-13.2-12.1-14.9-20.5-1.6-8.4.1-16.6 5.3-24.6 5.2-8.1 11.9-13.1 20.2-15.1 8.4-1.9 16.6-.3 24.9 5 8.2 5.3 13.2 12.1 14.8 20.5 1.7 8.4 0 16.6-5.2 24.7-5.2 8-12 13-20.3 15zm13.4-36.3c-1.2-5.1-4.5-9.3-9.9-12.8s-10.6-4.7-15.8-3.7-9.3 4-12.4 8.9-4.1 9.8-2.8 14.8c1.2 5.1 4.5 9.3 9.9 12.8 5.5 3.5 10.7 4.8 15.8 3.7 5.1-.9 9.2-3.8 12.3-8.7s4.1-9.9 2.9-15zM303.6 416.5l9.6-5.4 43.3 20.4-19.2-34 11.4-6.4 31 55-9.6 5.4-43.4-20.5 19.2 34.1-11.3 6.4-31-55zM238.2 468.8c-49 0-96.9-17.4-134.8-49-38.3-32-64-76.7-72.5-125.9-2-11.9-3.1-24-3.1-35.9 0-36.5 9.6-72.6 27.9-104.4 2.1-3.6 6.7-4.9 10.3-2.8 3.6 2.1 4.9 6.7 2.8 10.3-16.9 29.5-25.9 63.1-25.9 96.9 0 11.1 1 22.3 2.9 33.4 7.9 45.7 31.8 87.2 67.3 116.9 35.2 29.3 79.6 45.5 125.1 45.5 11.1 0 22.3-1 33.4-2.9 4.1-.7 8 2 8.7 6.1.7 4.1-2 8-6.1 8.7-11.9 2-24 3.1-36 3.1z")

symbol#svg_prodigy(viewBox="0 0 538.5 157.6")
path(fill="currentColor" d="M70.6 48.6c7 7.3 10.5 17.1 10.5 29.2S77.7 99.7 70.6 107c-6.9 7.3-15.9 11.1-27 11.1-9.4 0-16.8-2.7-21.7-8.2v44.8H0V39h20.7v8.1c4.8-6.4 12.4-9.6 22.9-9.6 11.1 0 20.1 3.7 27 11.1zM21.9 76v3.6c0 12.1 7.3 19.8 18.3 19.8 11.2 0 18.7-7.9 18.7-21.6s-7.5-21.6-18.7-21.6c-11 0-18.3 7.7-18.3 19.8zM133.8 59.4c-12.6 0-20.5 7-20.5 17.8v39.3h-22V39h21.1v8.8c4-6.4 11.2-9.6 21.3-9.6v21.2zM209.5 107.1c-7.6 7.3-17.5 11.1-29.5 11.1s-21.9-3.8-29.7-11.1c-7.6-7.5-11.5-17.2-11.5-29.2 0-12.1 3.9-21.9 11.5-29.2 7.8-7.3 17.7-11.1 29.7-11.1s21.9 3.8 29.5 11.1c7.8 7.3 11.7 17.1 11.7 29.2 0 11.9-3.9 21.7-11.7 29.2zM180 56.2c-5.7 0-10.3 1.9-13.8 5.8-3.5 3.8-5.2 9-5.2 15.7 0 6.7 1.8 12 5.2 15.7 3.4 3.8 8.1 5.7 13.8 5.7s10.3-1.9 13.8-5.7 5.2-9 5.2-15.7c0-6.8-1.8-12-5.2-15.7-3.5-3.8-8.1-5.8-13.8-5.8zM313 116.5h-20.5v-7.9c-4.4 5.5-12.7 9.6-23.1 9.6-10.9 0-19.9-3.8-27-11.1C235.5 99.7 232 90 232 77.8s3.5-21.9 10.3-29.2c7-7.3 16-11.1 27-11.1 9.7 0 17.1 2.7 21.9 8.2V0H313v116.5zm-58.8-38.7c0 13.6 7.5 21.4 18.7 21.4 10.9 0 18.3-7.3 18.3-19.8V76c0-12.2-7.3-19.8-18.3-19.8-11.2 0-18.7 8-18.7 21.6zM354.1 13.6c0 3.6-1.3 6.8-3.9 9.3-5 4.9-13.6 4.9-18.6 0-8.4-7.5-1.6-23.1 9.3-22.5 7.4 0 13.2 5.9 13.2 13.2zm-2.2 102.9H330V39h21.9v77.5zM425.1 47.1V39h20.5v80.4c0 11.2-3.6 20.1-10.6 26.8-7 6.7-16.6 10-28.5 10-23.4 0-36.9-11.4-39.9-29.8l21.7-.8c1 7.6 7.6 12 17.4 12 11.2 0 18.1-5.8 18.1-16.6v-11.1c-5.1 5.5-12.5 8.2-21.9 8.2-10.9 0-19.9-3.8-27-11.1-6.9-7.3-10.3-17.1-10.3-29.2s3.5-21.9 10.3-29.2c7-7.3 16-11.1 27-11.1 10.7 0 18.4 3.1 23.2 9.6zm-38.3 30.7c0 13.6 7.5 21.6 18.7 21.6 11 0 18.3-7.6 18.3-19.8V76c0-12.2-7.3-19.8-18.3-19.8-11.2 0-18.7 8-18.7 21.6zM488.8 154.8H465l19.8-45.1L454.5 39h24.1l17.8 46.2L514.2 39h24.3l-49.7 115.8z")


//- Machine learning & NLP libraries

@@ -1,5 +1,7 @@
//- 💫 DOCS > API > ANNOTATION > TRAINING

+h(3, "json-input") JSON input format for training

p
| spaCy takes training data in JSON format. The built-in
| #[+api("cli#convert") #[code convert]] command helps you convert the

@@ -46,3 +48,57 @@ p
| Treebank:

+github("spacy", "examples/training/training-data.json", false, false, "json")

+h(3, "vocab-jsonl") Lexical data for vocabulary
+tag-new(2)

p
| To populate a model's vocabulary, you can use the
| #[+api("cli#vocab") #[code spacy vocab]] command and load in a
| #[+a("https://jsonlines.readthedocs.io/en/latest/") newline-delimited JSON]
| (JSONL) file containing one lexical entry per line. The first line
| defines the language and vocabulary settings. All other lines are
| expected to be JSON objects describing an individual lexeme. The lexical
| attributes will then be set as attributes on spaCy's
| #[+api("lexeme#attributes") #[code Lexeme]] object. The #[code vocab]
| command outputs a ready-to-use spaCy model with a #[code Vocab]
| containing the lexical data.

+code("First line").
    {"lang": "en", "settings": {"oov_prob": -20.502029418945312}}

+code("Entry structure").
    {
        "orth": string,
        "id": int,
        "lower": string,
        "norm": string,
        "shape": string,
        "prefix": string,
        "suffix": string,
        "length": int,
        "cluster": string,
        "prob": float,
        "is_alpha": bool,
        "is_ascii": bool,
        "is_digit": bool,
        "is_lower": bool,
        "is_punct": bool,
        "is_space": bool,
        "is_title": bool,
        "is_upper": bool,
        "like_url": bool,
        "like_num": bool,
        "like_email": bool,
        "is_stop": bool,
        "is_oov": bool,
        "is_quote": bool,
        "is_left_punct": bool,
        "is_right_punct": bool
    }

p
| Here's an example of the 20 most frequent lexemes in the English
| training data:

+github("spacy", "examples/training/vocab-data.jsonl", false, false, "json")

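A sketch of producing such a JSONL file from Python (the field values here are illustrative, not taken from real data):

    import json

    header = {"lang": "en", "settings": {"oov_prob": -20.5}}
    lexemes = [
        {"orth": u"the", "lower": u"the", "norm": u"the", "shape": u"xxx",
         "prob": -3.5, "is_alpha": True, "is_stop": True},
    ]
    with open('vocab.jsonl', 'w') as file_:
        # First line: language and vocabulary settings.
        file_.write(json.dumps(header) + '\n')
        # One JSON object per lexeme on each following line.
        for lex in lexemes:
            file_.write(json.dumps(lex) + '\n')
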
|
@ -3,8 +3,10 @@
|
|||
"Overview": {
|
||||
"Architecture": "./",
|
||||
"Annotation Specs": "annotation",
|
||||
"Command Line": "cli",
|
||||
"Functions": "top-level"
|
||||
},
|
||||
|
||||
"Containers": {
|
||||
"Doc": "doc",
|
||||
"Token": "token",
|
||||
|
@ -45,14 +47,19 @@
|
|||
}
|
||||
},
|
||||
|
||||
"cli": {
|
||||
"title": "Command Line Interface",
|
||||
"teaser": "Download, train and package models, and debug spaCy.",
|
||||
"source": "spacy/cli"
|
||||
},
|
||||
|
||||
"top-level": {
|
||||
"title": "Top-level Functions",
|
||||
"menu": {
|
||||
"spacy": "spacy",
|
||||
"displacy": "displacy",
|
||||
"Utility Functions": "util",
|
||||
"Compatibility": "compat",
|
||||
"Command Line": "cli"
|
||||
"Compatibility": "compat"
|
||||
}
|
||||
},
|
||||
|
||||
|
@ -213,7 +220,7 @@
|
|||
"Lemmatization": "lemmatization",
|
||||
"Dependencies": "dependency-parsing",
|
||||
"Named Entities": "named-entities",
|
||||
"Training Data": "training"
|
||||
"Models & Training": "training"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@@ -58,16 +58,16 @@ p
    nlp.from_disk(model_data_path) # load in model data

+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
| will also raise an error if no model could be loaded and never just
| return an empty #[code Language] object. If you need a blank language,
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
| or import the class explicitly, e.g.
| #[code from spacy.lang.en import English].
| As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
| will also raise an error if no model could be loaded and never just
| return an empty #[code Language] object. If you need a blank language,
| you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
| or import the class explicitly, e.g.
| #[code from spacy.lang.en import English].

+code-new nlp = spacy.load('/model')
+code-old nlp = spacy.load('en', path='/model')
+code-wrapper
+code-new nlp = spacy.load('/model')
+code-old nlp = spacy.load('en', path='/model')

+h(3, "spacy.blank") spacy.blank
+tag function

@@ -85,7 +85,9 @@ p
+row
+cell #[code name]
+cell unicode
+cell ISO code of the language class to load.
+cell
| #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code]
| of the language class to load.

+row
+cell #[code disable]

@@ -99,6 +99,6 @@ p This document describes the target annotations spaCy is trained to predict.
    include _annotation/_biluo

+section("training")
    +h(2, "json-input") JSON input format for training
    +h(2, "training") Models and training data

    include _annotation/_training

@@ -1,4 +1,6 @@
//- 💫 DOCS > API > TOP-LEVEL > COMMAND LINE INTERFACE
//- 💫 DOCS > API > COMMAND LINE INTERFACE

include ../_includes/_mixins

p
| As of v1.7.0, spaCy comes with new command line helpers to download and

@@ -34,6 +36,13 @@ p
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell creates
+cell directory, symlink
+cell
| The installed model package in your #[code site-packages]
| directory and a shortcut link as a symlink in #[code spacy/data].

+aside("Downloading best practices")
| The #[code download] command is mostly intended as a convenient,
| interactive wrapper – it performs compatibility checks and prints

@@ -86,6 +95,13 @@ p
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell creates
+cell symlink
+cell
| A shortcut link of the given name as a symlink in
| #[code spacy/data].

+h(3, "info") Info

p

@@ -113,6 +129,11 @@ p
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell prints
+cell #[code stdout]
+cell Information about your spaCy installation.

+h(3, "validate") Validate
+tag-new(2)

@@ -129,6 +150,12 @@ p
+code(false, "bash", "$").
    spacy validate

+table(["Argument", "Type", "Description"])
+row("foot")
+cell prints
+cell #[code stdout]
+cell Details about the compatibility of your installed models.

+h(3, "convert") Convert

p

@@ -172,6 +199,11 @@ p
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell creates
+cell JSON
+cell Data in spaCy's #[+a("/api/annotation#json-input") JSON format].

p The following converters are available:

+table(["ID", "Description"])

@@ -286,6 +318,11 @@ p
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell creates
+cell model, pickle
+cell A spaCy model on each epoch, and a final #[code .pickle] file.

+h(4, "train-hyperparams") Environment variables for hyperparameters
+tag-new(2)

@@ -395,6 +432,50 @@ p
+cell Gradient L2 norm constraint.
+cell #[code 1.0]

+h(3, "vocab") Vocab
+tag-new(2)

p
| Compile a vocabulary from a
| #[+a("/api/annotation#vocab-jsonl") lexicon JSONL] file and optional
| word vectors. Will save out a valid spaCy model that you can load via
| #[+api("spacy#load") #[code spacy.load]] or package using the
| #[+api("cli#package") #[code package]] command.

+code(false, "bash", "$").
    spacy vocab [lang] [output_dir] [lexemes_loc] [vectors_loc]

+table(["Argument", "Type", "Description"])
+row
+cell #[code lang]
+cell positional
+cell
| Model language
| #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code],
| e.g. #[code en].

+row
+cell #[code output_dir]
+cell positional
+cell Model output directory. Will be created if it doesn't exist.

+row
+cell #[code lexemes_loc]
+cell positional
+cell
| Location of lexical data in spaCy's
| #[+a("/api/annotation#vocab-jsonl") JSONL format].

+row
+cell #[code vectors_loc]
+cell positional
+cell Optional location of vectors data as numpy #[code .npz] file.

+row("foot")
+cell creates
+cell model
+cell A spaCy model containing the vocab and vectors.

+h(3, "evaluate") Evaluate
+tag-new(2)

@@ -447,22 +528,36 @@ p
+cell flag
+cell Use gold preprocessing.

+row("foot")
+cell prints / creates
+cell #[code stdout], HTML
+cell Training results and optional displaCy visualizations.


+h(3, "package") Package

p
| Generate a #[+a("/usage/training#models-generating") model Python package]
| from an existing model data directory. All data files are copied over.
| If the path to a meta.json is supplied, or a meta.json is found in the
| input directory, this file is used. Otherwise, the data can be entered
| directly from the command line. The required file templates are downloaded
| from #[+src(gh("spacy-dev-resources", "templates/model")) GitHub] to make
| If the path to a #[code meta.json] is supplied, or a #[code meta.json] is
| found in the input directory, this file is used. Otherwise, the data can
| be entered directly from the command line. The required file templates
| are downloaded from
| #[+src(gh("spacy-dev-resources", "templates/model")) GitHub] to make
| sure you're always using the latest versions. This means you need to be
| connected to the internet to use this command.
| connected to the internet to use this command. After packaging, you
| can run #[code python setup.py sdist] from the newly created directory
| to turn your model into an installable archive file.

+code(false, "bash", "$", false, false, true).
    spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]

+aside-code("Example", "bash").
    spacy package /input /output
    cd /output/en_model-0.0.0
    python setup.py sdist
    pip install dist/en_model-0.0.0.tar.gz

+table(["Argument", "Type", "Description"])
+row
+cell #[code input_dir]

@@ -477,15 +572,16 @@ p
+row
+cell #[code --meta-path], #[code -m]
+cell option
+cell #[+tag-new(2)] Path to meta.json file (optional).
+cell #[+tag-new(2)] Path to #[code meta.json] file (optional).

+row
+cell #[code --create-meta], #[code -c]
+cell flag
+cell
| #[+tag-new(2)] Create a meta.json file on the command line, even
| if one already exists in the directory.

| #[+tag-new(2)] Create a #[code meta.json] file on the command
| line, even if one already exists in the directory. If an
| existing file is found, its entries will be shown as the defaults
| in the command line prompt.
+row
+cell #[code --force], #[code -f]
+cell flag

@@ -495,3 +591,8 @@ p
+cell #[code --help], #[code -h]
+cell flag
+cell Show help message and available arguments.

+row("foot")
+cell creates
+cell directory
+cell A Python package containing the spaCy model.
@@ -84,13 +84,13 @@ p
+cell A container for accessing the annotations.

+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.

+code-new doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old doc = nlp(u"I don't want parsed", parse=False)
+code-wrapper
+code-new doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old doc = nlp(u"I don't want parsed", parse=False)

+h(2, "pipe") Language.pipe
+tag method

@@ -533,15 +533,15 @@ p
+cell The modified #[code Language] object.

+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy v2.0, the #[code save_to_directory] method has been
| renamed to #[code to_disk], to improve consistency across classes.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| As of spaCy v2.0, the #[code save_to_directory] method has been
| renamed to #[code to_disk], to improve consistency across classes.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.

+code-new nlp = English().from_disk(disable=['tagger', 'ner'])
+code-old nlp = spacy.load('en', tagger=False, entity=False)
+code-wrapper
+code-new nlp = English().from_disk(disable=['tagger', 'ner'])
+code-old nlp = spacy.load('en', tagger=False, entity=False)

+h(2, "to_bytes") Language.to_bytes
+tag method

@@ -595,13 +595,13 @@ p Load state from a binary string.
+cell The #[code Language] object.

+infobox("Deprecation note", "⚠️")
.o-block
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.
| Pipeline components to prevent from being loaded can now be added as
| a list to #[code disable], instead of specifying one keyword argument
| per component.

+code-new nlp = English().from_bytes(bytes, disable=['tagger', 'ner'])
+code-old nlp = English().from_bytes('en', tagger=False, entity=False)
+code-wrapper
+code-new nlp = English().from_bytes(bytes, disable=['tagger', 'ner'])
+code-old nlp = English().from_bytes('en', tagger=False, entity=False)

+h(2, "attributes") Attributes

@@ -203,18 +203,18 @@ p
| dict describes a token.

+infobox("Deprecation note", "⚠️")
.o-block
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
| patterns and a callback for a given match ID.
| As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
| are deprecated and have been replaced with a simpler
| #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
| patterns and a callback for a given match ID.

+code-new.
    matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-wrapper
+code-new.
    matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])

+code-old.
    matcher.add_entity('GoogleNow', on_match=merge_phrases)
    matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-old.
    matcher.add_entity('GoogleNow', on_match=merge_phrases)
    matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])

+h(2, "remove") Matcher.remove
+tag method

@@ -393,6 +393,37 @@ p A sequence of all the token's syntactic descendents.
+cell #[code Token]
+cell A descendant token such that #[code self.is_ancestor(descendant)].

+h(2, "is_sent_start") Token.is_sent_start
+tag property
+tag-new(2)

p
| A boolean value indicating whether the token starts a sentence.
| #[code None] if unknown.

+aside-code("Example").
    doc = nlp(u'Give it back! He pleaded.')
    assert doc[4].is_sent_start
    assert not doc[5].is_sent_start

+table(["Name", "Type", "Description"])
+row("foot")
+cell returns
+cell bool
+cell Whether the token starts a sentence.

+infobox("Deprecation note", "⚠️")
| As of spaCy v2.0, the #[code Token.sent_start] property is deprecated and
| has been replaced with #[code Token.is_sent_start], which returns a
| boolean value instead of a misleading #[code 0] for #[code False] and
| #[code 1] for #[code True]. It also now returns #[code None] if the
| answer is unknown, and fixes a quirk in the old logic that would always
| set the property to #[code 0] for the first word of the document.

+code-wrapper
+code-new assert doc[4].is_sent_start == True
+code-old assert doc[4].sent_start == 1

+h(2, "has_vector") Token.has_vector
+tag property
+tag-model("vectors")

@@ -18,7 +18,3 @@ include ../_includes/_mixins
+section("compat")
    +h(2, "compat", "spacy/compat.py") Compatibility functions
    include _top-level/_compat

+section("cli", "spacy/cli")
    +h(2, "cli") Command line
    include _top-level/_cli

@ -5,46 +5,47 @@ include ../_includes/_mixins
|
|||
p
|
||||
| Vectors data is kept in the #[code Vectors.data] attribute, which should
|
||||
| be an instance of #[code numpy.ndarray] (for CPU vectors) or
|
||||
| #[code cupy.ndarray] (for GPU vectors).
|
||||
| #[code cupy.ndarray] (for GPU vectors). Multiple keys can be mapped to
|
||||
| the same vector, and not all of the rows in the table need to be
|
||||
| assigned – so #[code vectors.n_keys] may be greater or smaller than
|
||||
| #[code vectors.shape[0]].
|
||||
|
||||
+h(2, "init") Vectors.__init__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Create a new vector store. To keep the vector table empty, pass
|
||||
| #[code width=0]. You can also create the vector table and add
|
||||
| vectors one by one, or set the vector values directly on initialisation.
|
||||
| Create a new vector store. You can set the vector values and keys
|
||||
| directly on initialisation, or supply a #[code shape] keyword argument
|
||||
| to create an empty table you can add vectors to later.
|
||||
|
||||
+aside-code("Example").
|
||||
from spacy.vectors import Vectors
|
||||
from spacy.strings import StringStore
|
||||
|
||||
empty_vectors = Vectors(StringStore())
|
||||
empty_vectors = Vectors(shape=(10000, 300))
|
||||
|
||||
vectors = Vectors([u'cat'], width=300)
|
||||
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
|
||||
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), data=vector_table)
|
||||
data = numpy.zeros((3, 300), dtype='f')
|
||||
keys = [u'cat', u'dog', u'rat']
|
||||
vectors = Vectors(data=data, keys=keys)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code strings]
|
||||
+cell #[code StringStore] or list
|
||||
+cell
|
||||
| List of strings, or a #[+api("stringstore") #[code StringStore]]
|
||||
| that maps strings to hash values, and vice versa.
|
||||
|
||||
+row
|
||||
+cell #[code width]
|
||||
+cell int
|
||||
+cell Number of dimensions.
|
||||
|
||||
+row
|
||||
+cell #[code data]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector data.
|
||||
|
||||
+row
|
||||
+cell #[code keys]
|
||||
+cell iterable
|
||||
+cell A sequence of keys aligned with the data.
|
||||
|
||||
+row
|
||||
+cell #[code shape]
|
||||
+cell tuple
|
||||
+cell
|
||||
| Size of the table as #[code (n_entries, n_columns)], the number
|
||||
| of entries and number of columns. Not required if you're
|
||||
| initialising the object with #[code data] and #[code keys].
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell #[code Vectors]
|
||||
|
@ -54,97 +55,92 @@ p
|
|||
+tag method
|
||||
|
||||
p
|
||||
| Get a vector by key. If key is a string, it is hashed to an integer ID
|
||||
| using the #[code Vectors.strings] table. If the integer key is not found
|
||||
| in the table, a #[code KeyError] is raised.
|
||||
| Get a vector by key. If the key is not found in the table, a
|
||||
| #[code KeyError] is raised.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
cat_vector = vectors[u'cat']
|
||||
cat_id = nlp.vocab.strings[u'cat']
|
||||
cat_vector = nlp.vocab.vectors[cat_id]
|
||||
assert (cat_vector == nlp.vocab[u'cat'].vector).all()
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell int
|
||||
+cell The key to get the vector for.
|
||||
|
||||
+row
|
||||
+cell returns
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector for the key.
|
||||
|
||||
+h(2, "setitem") Vectors.__setitem__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Set a vector for the given key. If key is a string, it is hashed to an
|
||||
| integer ID using the #[code Vectors.strings] table.
|
||||
| Set a vector for the given key.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
|
||||
cat_id = nlp.vocab.strings[u'cat']
|
||||
vector = numpy.random.uniform(-1, 1, (300,))
|
||||
nlp.vocab.vectors[cat_id] = vector
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell int
|
||||
+cell The key to set the vector for.
|
||||
|
||||
+row
|
||||
+cell #[code vector]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector to set.
|
||||
|
||||
+h(2, "iter") Vectors.__iter__
|
||||
+tag method
|
||||
|
||||
p Yield vectors from the table.
|
||||
p Iterate over the keys in the table.
|
||||
|
||||
+aside-code("Example").
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), vector_table)
|
||||
for vector in vectors:
|
||||
print(vector)
|
||||
for key in nlp.vocab.vectors:
|
||||
print(key, nlp.vocab.strings[key])
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell A vector from the table.
|
||||
+cell int
|
||||
+cell A key in the table.
|
||||
|
||||
+h(2, "len") Vectors.__len__
|
||||
+tag method
|
||||
|
||||
p Return the number of vectors that have been assigned.
|
||||
p Return the number of vectors in the table.
|
||||
|
||||
+aside-code("Example").
|
||||
vector_table = numpy.zeros((3, 300), dtype='f')
|
||||
vectors = Vectors(StringStore(), vector_table)
|
||||
vectors = Vectors(shape=(3, 300))
|
||||
assert len(vectors) == 3
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of vectors in the data.
|
||||
+cell The number of vectors in the table.
|
||||
|
||||
+h(2, "contains") Vectors.__contains__
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Check whether a key has a vector entry in the table. If key is a string,
|
||||
| it is hashed to an integer ID using the #[code Vectors.strings] table.
|
||||
| Check whether a key has been mapped to a vector entry in the table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
assert u'cat' in vectors
|
||||
cat_id = nlp.vocab.strings[u'cat']
|
||||
nlp.vocab.vectors.add(cat_id, vector=numpy.random.uniform(-1, 1, (300,)))
|
||||
assert cat_id in nlp.vocab.vectors
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code key]
|
||||
+cell unicode / int
|
||||
+cell int
|
||||
+cell The key to check.
|
||||
|
||||
+row("foot")
|
||||
|
@ -156,13 +152,20 @@ p
|
|||
+tag method
|
||||
|
||||
p
|
||||
| Add a key to the table, optionally setting a vector value as well. If
|
||||
| key is a string, it is hashed to an integer ID using the
|
||||
| #[code Vectors.strings] table.
|
||||
| Add a key to the table, optionally setting a vector value as well. Keys
|
||||
| can be mapped to an existing vector by setting #[code row], or a new
|
||||
| vector can be added. When adding unicode keys, keep in mind that the
|
||||
| #[code Vectors] class itself has no
|
||||
| #[+api("stringstore") #[code StringStore]], so you have to store the
|
||||
| hash-to-string mapping separately. If you need to manage the strings,
|
||||
| you should use the #[code Vectors] via the
|
||||
| #[+api("vocab") #[code Vocab]] class, e.g. #[code vocab.vectors].
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
vector = numpy.random.uniform(-1, 1, (300,))
|
||||
cat_id = nlp.vocab.strings[u'cat']
|
||||
nlp.vocab.vectors.add(cat_id, vector=vector)
|
||||
nlp.vocab.vectors.add(u'dog', row=0)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
|
@ -172,25 +175,66 @@ p
|
|||
|
||||
+row
|
||||
+cell #[code vector]
|
||||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell An optional vector to add.
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
|
||||
+cell An optional vector to add for the key.
|
||||
|
||||
+row
|
||||
+cell #[code row]
|
||||
+cell int
|
||||
+cell An optional row number of a vector to map the key to.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The row the vector was added to.
|
||||
|
||||
+h(2, "keys") Vectors.keys
|
||||
+tag method
|
||||
|
||||
p A sequence of the keys in the table.
|
||||
|
||||
+aside-code("Example").
|
||||
for key in nlp.vocab.vectors.keys():
|
||||
print(key, nlp.vocab.strings[key])
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell iterable
|
||||
+cell The keys.
|
||||
|
||||
+h(2, "values") Vectors.values
|
||||
+tag method
|
||||
|
||||
p
|
||||
| Iterate over vectors that have been assigned to at least one key. Note
|
||||
| that some vectors may be unassigned, so the number of vectors returned
|
||||
| may be less than the length of the vectors table.
|
||||
|
||||
+aside-code("Example").
|
||||
for vector in nlp.vocab.vectors.values():
|
||||
print(vector)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='float32']]
|
||||
+cell A vector in the table.
|
||||
|
||||
+h(2, "items") Vectors.items
|
||||
+tag method
|
||||
|
||||
p Iterate over #[code (string key, vector)] pairs, in order.
|
||||
p Iterate over #[code (key, vector)] pairs, in order.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
|
||||
for key, vector in vectors.items():
|
||||
print(key, vector)
|
||||
for key, vector in nlp.vocab.vectors.items():
|
||||
print(key, nlp.vocab.strings[key], vector)
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell yields
|
||||
+cell tuple
|
||||
+cell #[code (string key, vector)] pairs, in order.
|
||||
+cell #[code (key, vector)] pairs, in order.
|
||||
|
||||
+h(2, "shape") Vectors.shape
|
||||
+tag property
|
||||
|
@ -200,7 +244,7 @@ p
|
|||
| dimensions in the vector table.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(StringStore(), 300)
|
||||
vectors = Vectors(shape=(1, 300))
|
||||
vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
|
||||
rows, dims = vectors.shape
|
||||
assert rows == 1
|
||||
|
@ -212,6 +256,59 @@ p
|
|||
+cell tuple
|
||||
+cell A #[code (rows, dims)] pair.
|
||||
|
||||
+h(2, "size") Vectors.size
|
||||
+tag property
|
||||
|
||||
p The vector size, i.e. #[code rows * dims].
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(shape=(500, 300))
|
||||
assert vectors.size == 150000
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The vector size.
|
||||
|
||||
+h(2, "is_full") Vectors.is_full
|
||||
+tag property
|
||||
|
||||
p
|
||||
| Whether the vectors table is full and no slots are available for new
|
||||
| keys. If a table is full, it can be resized using
|
||||
| #[+api("vectors#resize") #[code Vectors.resize]].
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(shape=(1, 300))
|
||||
vectors.add(u'cat', vector=numpy.random.uniform(-1, 1, (300,)))
|
||||
assert vectors.is_full
|
||||
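p
    | A short sketch of freeing up space in a full table (assuming
    | #[code resize] accepts a #[code (rows, dims)] shape tuple, as linked
    | above):

+aside-code("Example").
    vectors.resize((2, 300))     # grow the table from one row to two
    assert not vectors.is_full   # room for one more key now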
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell bool
|
||||
+cell Whether the vectors table is full.
|
||||
|
||||
+h(2, "n_keys") Vectors.n_keys
|
||||
+tag property
|
||||
|
||||
p
|
||||
| Get the number of keys in the table. Note that this is the number of
|
||||
| #[em all] keys, not just unique vectors. If several keys are mapped
|
||||
| to the same vector, they will be counted individually.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors(shape=(10, 300))
|
||||
assert len(vectors) == 10
|
||||
assert vectors.n_keys == 0
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell int
|
||||
+cell The number of all keys in the table.
|
||||
|
||||
+h(2, "from_glove") Vectors.from_glove
|
||||
+tag method
|
||||
|
||||
|
@ -223,6 +320,10 @@ p
|
|||
| float32 vectors, #[code vectors.300.d.bin] for 300d float64 (double)
|
||||
| vectors, etc. By default GloVe outputs 64-bit vectors.
|
||||
|
||||
+aside-code("Example").
|
||||
vectors = Vectors()
|
||||
vectors.from_glove('/path/to/glove_vectors')
|
||||
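p
    | For example (illustrative path), a directory of 300d float32 vectors
    | would contain a binary file named according to the scheme above:

+aside-code("Example", "text").
    /path/to/glove_vectors/vectors.300.f.bin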
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code path]
|
||||
|
@ -323,7 +424,7 @@ p Load state from a binary string.
|
|||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code data]
|
||||
+cell #[code numpy.ndarray] / #[code cupy.ndarray]
|
||||
+cell #[code.u-break ndarray[ndim=2, dtype='float32']]
|
||||
+cell
|
||||
| Stored vectors data. #[code numpy] is used for CPU vectors,
|
||||
| #[code cupy] for GPU vectors.
|
||||
|
@ -337,7 +438,7 @@ p Load state from a binary string.
|
|||
|
||||
+row
|
||||
+cell #[code keys]
|
||||
+cell #[code numpy.ndarray]
|
||||
+cell #[code.u-break ndarray[ndim=1, dtype='uint64']]
|
||||
+cell
|
||||
| Array keeping the keys in order, such that
|
||||
| #[code keys[vectors.key2row[key]] == key]
|
||||
|
|
|
@ -162,7 +162,7 @@ p
|
|||
+cell int
|
||||
+cell The integer ID by which the flag value can be checked.
|
||||
|
||||
+h(2, "add_flag") Vocab.clear_vectors
|
||||
+h(2, "clear_vectors") Vocab.clear_vectors
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -181,7 +181,50 @@ p
|
|||
| Number of dimensions of the new vectors. If #[code None], size
|
||||
| is not changed.
|
||||
|
||||
+h(2, "add_flag") Vocab.get_vector
|
||||
+h(2, "prune_vectors") Vocab.prune_vectors
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
p
|
||||
| Reduce the current vector table to #[code nr_row] unique entries. Words
|
||||
| mapped to the discarded vectors will be remapped to the closest vector
|
||||
| among those remaining. For example, suppose the original table had
|
||||
| vectors for the words:
|
||||
| #[code.u-break ['sat', 'cat', 'feline', 'reclined']]. If we prune the
|
||||
| vector table to two rows, we would discard the vectors for "feline"
|
||||
| and "reclined". These words would then be remapped to the closest
|
||||
| remaining vector – so "feline" would have the same vector as "cat",
|
||||
| and "reclined" would have the same vector as "sat". The similarities are
|
||||
| judged by cosine. The original vectors may be large, so the cosines are
|
||||
| calculated in minibatches, to reduce memory usage.
|
||||
|
||||
+aside-code("Example").
|
||||
nlp.vocab.prune_vectors(10000)
|
||||
assert len(nlp.vocab.vectors) <= 10000
|
||||
|
||||
+table(["Name", "Type", "Description"])
|
||||
+row
|
||||
+cell #[code nr_row]
|
||||
+cell int
|
||||
+cell The number of rows to keep in the vector table.
|
||||
|
||||
+row
|
||||
+cell #[code batch_size]
|
||||
+cell int
|
||||
+cell
|
||||
| Batch of vectors for calculating the similarities. Larger batch
|
||||
| sizes might be faster, while temporarily requiring more memory.
|
||||
|
||||
+row("foot")
|
||||
+cell returns
|
||||
+cell dict
|
||||
+cell
|
||||
| A dictionary keyed by removed words mapped to
|
||||
| #[code (string, score)] tuples, where #[code string] is the entry
|
||||
| the removed word was mapped to, and #[code score] the similarity
|
||||
| score between the two words.
|
||||
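p
    | A sketch of inspecting the returned remap table (the words and the
    | score shown are illustrative):

+aside-code("Example").
    remap = nlp.vocab.prune_vectors(10000)
    # e.g. remap[u'feline'] == (u'cat', 0.83), where 0.83 is the cosine
    # similarity between "feline" and "cat"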
|
||||
+h(2, "get_vector") Vocab.get_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -206,7 +249,7 @@ p
|
|||
| A word vector. Size and shape are determined by the
|
||||
| #[code Vocab.vectors] instance.
|
||||
|
||||
+h(2, "add_flag") Vocab.set_vector
|
||||
+h(2, "set_vector") Vocab.set_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
@ -228,7 +271,7 @@ p
|
|||
+cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
|
||||
+cell The vector to set.
|
||||
|
||||
+h(2, "add_flag") Vocab.has_vector
|
||||
+h(2, "has_vector") Vocab.has_vector
|
||||
+tag method
|
||||
+tag-new(2)
|
||||
|
||||
|
|
|
@ -163,11 +163,4 @@
|
|||
height: 1.4em
|
||||
border: none
|
||||
text-align-last: center
|
||||
|
||||
.o-empty:empty:before
|
||||
@include size(1em)
|
||||
border-radius: 50%
|
||||
content: ""
|
||||
display: inline-block
|
||||
background: $color-red
|
||||
vertical-align: middle
|
||||
width: 100%
|
||||
|
|
|
@ -47,7 +47,7 @@
|
|||
font: 600 1.1rem/#{1} $font-secondary
|
||||
background: $color-theme
|
||||
color: $color-back
|
||||
padding: 0.15em 0.5em 0.35em
|
||||
padding: 2px 6px 4px
|
||||
border-radius: 1em
|
||||
text-transform: uppercase
|
||||
vertical-align: middle
|
||||
|
|
72
website/assets/js/changelog.js
Normal file
|
@ -0,0 +1,72 @@
|
|||
'use strict';
|
||||
|
||||
import { Templater, handleResponse } from './util.js';
|
||||
|
||||
export default class Changelog {
|
||||
/**
|
||||
* Fetch and render changelog from GitHub. Clones a template node (table row)
|
||||
* to avoid doubling templating markup in JavaScript.
|
||||
* @param {string} user - GitHub username.
|
||||
* @param {string} repo - Repository to fetch releases from.
|
||||
*/
|
||||
constructor(user, repo) {
|
||||
this.url = `https://api.github.com/repos/${user}/${repo}/releases`;
|
||||
this.template = new Templater('changelog');
|
||||
this.fetchChangelog()
|
||||
.then(json => this.render(json))
|
||||
.catch(this.showError.bind(this));
|
||||
// make sure scroll positions for progress bar etc. are recalculated
|
||||
window.dispatchEvent(new Event('resize'));
|
||||
}
|
||||
|
||||
fetchChangelog() {
|
||||
return new Promise((resolve, reject) =>
|
||||
fetch(this.url)
|
||||
.then(res => handleResponse(res))
|
||||
.then(json => json.ok ? resolve(json) : reject()))
|
||||
}
|
||||
|
||||
showError() {
|
||||
this.template.get('error').style.display = 'block';
|
||||
}
|
||||
|
||||
/**
|
||||
* Get template section from template row. Hacky, but does make sense.
|
||||
* @param {node} item - Parent element.
|
||||
* @param {string} id - ID of child element, set via data-changelog.
|
||||
*/
|
||||
getField(item, id) {
|
||||
return item.querySelector(`[data-changelog="${id}"]`);
|
||||
}
|
||||
|
||||
render(json) {
|
||||
this.template.get('table').style.display = 'block';
|
||||
this.row = this.template.get('item');
|
||||
this.releases = this.template.get('releases');
|
||||
this.prereleases = this.template.get('prereleases');
|
||||
Object.values(json)
|
||||
.filter(release => release.name)
|
||||
.forEach(release => this.renderRelease(release));
|
||||
this.row.remove();
|
||||
}
|
||||
|
||||
/**
|
||||
* Clone the template row and populate with content from API response.
|
||||
* https://developer.github.com/v3/repos/releases/#list-releases-for-a-repository
|
||||
* @param {string} name - Release title.
|
||||
* @param {string} tag (tag_name) - Release tag.
|
||||
* @param {string} url (html_url) - URL to the release page on GitHub.
|
||||
* @param {string} date (published_at) - Timestamp of release publication.
|
||||
* @param {boolean} prerelease - Whether the release is a prerelease.
|
||||
*/
|
||||
renderRelease({ name, tag_name: tag, html_url: url, published_at: date, prerelease }) {
|
||||
const container = prerelease ? this.prereleases : this.releases;
|
||||
const tagLink = `<a href="${url}" target="_blank"><code>${tag}</code></a>`;
|
||||
const title = (name.split(': ').length == 2) ? name.split(': ')[1] : name;
|
||||
const row = this.row.cloneNode(true);
|
||||
this.getField(row, 'date').textContent = date.split('T')[0];
|
||||
this.getField(row, 'tag').innerHTML = tagLink;
|
||||
this.getField(row, 'title').textContent = title;
|
||||
container.appendChild(row);
|
||||
}
|
||||
}
|
42
website/assets/js/github-embed.js
Normal file
|
@ -0,0 +1,42 @@
|
|||
'use strict';
|
||||
|
||||
import { $$ } from './util.js';
|
||||
|
||||
export default class GitHubEmbed {
|
||||
/**
|
||||
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
|
||||
* raw text and places it inside element.
|
||||
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code><pre>
|
||||
* @param {string} user - GitHub user or organization.
|
||||
* @param {string} attr - Data attribute used to select containers. Attribute
|
||||
* value should be path to file relative to user.
|
||||
*/
|
||||
constructor(user, attr) {
|
||||
this.url = `https://raw.githubusercontent.com/${user}`;
|
||||
this.attr = attr;
|
||||
[...$$(`[${this.attr}]`)].forEach(el => this.embed(el));
|
||||
}
|
||||
|
||||
/**
|
||||
* Fetch code from GitHub and insert it as element content. File path is
|
||||
* read off the container's data attribute.
|
||||
* @param {node} el - The element.
|
||||
*/
|
||||
embed(el) {
|
||||
el.parentElement.setAttribute('data-loading', '');
|
||||
fetch(`${this.url}/${el.getAttribute(this.attr)}`)
|
||||
.then(res => res.text().then(text => ({ text, ok: res.ok })))
|
||||
.then(({ text, ok }) => ok ? this.render(el, text) : false)
|
||||
.then(() => el.parentElement.removeAttribute('data-loading'));  // clear loading state after the fetch settles
|
||||
}
|
||||
|
||||
/**
|
||||
* Add text to container and apply syntax highlighting via Prism, if available.
|
||||
* @param {node} el - The element.
|
||||
* @param {string} text - The raw code, fetched from GitHub.
|
||||
*/
|
||||
render(el, text) {
|
||||
el.textContent = text;
|
||||
if (window.Prism) Prism.highlightElement(el);
|
||||
}
|
||||
}
|
|
@ -1,323 +0,0 @@
|
|||
//- 💫 MAIN JAVASCRIPT
|
||||
//- Note: Will be compiled using Babel before deployment.
|
||||
|
||||
'use strict'
|
||||
|
||||
const $ = document.querySelector.bind(document);
|
||||
const $$ = document.querySelectorAll.bind(document);
|
||||
|
||||
|
||||
class ProgressBar {
|
||||
/**
|
||||
* Animated reading progress bar.
|
||||
* @param {String} selector – CSS selector of progress bar element.
|
||||
*/
|
||||
constructor(selector) {
|
||||
this.el = $(selector);
|
||||
this.scrollY = 0;
|
||||
this.sizes = this.updateSizes();
|
||||
this.el.setAttribute('max', 100);
|
||||
this.init();
|
||||
}
|
||||
|
||||
init() {
|
||||
window.addEventListener('scroll', () => {
|
||||
this.scrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0);
|
||||
requestAnimationFrame(this.update.bind(this));
|
||||
}, false);
|
||||
window.addEventListener('resize', () => {
|
||||
this.sizes = this.updateSizes();
|
||||
requestAnimationFrame(this.update.bind(this));
|
||||
})
|
||||
}
|
||||
|
||||
update() {
|
||||
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
|
||||
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
|
||||
}
|
||||
|
||||
updateSizes() {
|
||||
const body = document.body;
|
||||
const html = document.documentElement;
|
||||
return {
|
||||
height: Math.max(body.scrollHeight, body.offsetHeight, html.clientHeight, html.scrollHeight, html.offsetHeight),
|
||||
vh: Math.max(html.clientHeight, window.innerHeight || 0)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class SectionHighlighter {
|
||||
/**
|
||||
* Hightlight section in viewport in sidebar, using in-view library.
|
||||
* @param {String} sectionAttr - Data attribute of sections.
|
||||
* @param {String} navAttr - Data attribute of navigation items.
|
||||
* @param {String} activeClass – Class name of active element.
|
||||
*/
|
||||
constructor(sectionAttr, navAttr, activeClass = 'is-active') {
|
||||
this.sections = [...$$(`[${navAttr}]`)];
|
||||
this.navAttr = navAttr;
|
||||
this.sectionAttr = sectionAttr;
|
||||
this.activeClass = activeClass;
|
||||
inView(`[${sectionAttr}]`).on('enter', this.highlightSection.bind(this));
|
||||
}
|
||||
|
||||
highlightSection(section) {
|
||||
const id = section.getAttribute(this.sectionAttr);
|
||||
const el = $(`[${this.navAttr}="${id}"]`);
|
||||
if (el) {
|
||||
this.sections.forEach(el => el.classList.remove(this.activeClass));
|
||||
el.classList.add(this.activeClass);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class Templater {
|
||||
/**
|
||||
* Mini templating engine based on data attributes. Selects elements based
|
||||
* on a data-tpl and data-tpl-key attribute and can set textContent
|
||||
* and innterHtml.
|
||||
*
|
||||
* @param {String} templateId - Template section, e.g. value of data-tpl.
|
||||
*/
|
||||
constructor(templateId) {
|
||||
this.templateId = templateId;
|
||||
}
|
||||
|
||||
get(key) {
|
||||
return $(`[data-tpl="${this.templateId}"][data-tpl-key="${key}"]`);
|
||||
}
|
||||
|
||||
fill(key, value, html = false) {
|
||||
const el = this.get(key);
|
||||
if (html) el.innerHTML = value || '';
|
||||
else el.textContent = value || '';
|
||||
return el;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class ModelLoader {
|
||||
/**
|
||||
* Load model meta from GitHub and update model details on site. Uses the
|
||||
* Templater mini template engine to update DOM.
|
||||
*
|
||||
* @param {String} repo - Path tp GitHub repository containing releases.
|
||||
* @param {Array} models - List of model IDs, e.g. "en_core_web_sm".
|
||||
* @param {Object} licenses - License IDs mapped to URLs.
|
||||
* @param {Object} accKeys - Available accuracy keys mapped to display labels.
|
||||
*/
|
||||
constructor(repo, models = [], licenses = {}, benchmarkKeys = {}) {
|
||||
this.url = `https://raw.githubusercontent.com/${repo}/master`;
|
||||
this.repo = `https://github.com/${repo}`;
|
||||
this.modelIds = models;
|
||||
this.licenses = licenses;
|
||||
this.benchKeys = benchmarkKeys;
|
||||
this.init();
|
||||
}
|
||||
|
||||
init() {
|
||||
this.modelIds.forEach(modelId =>
|
||||
new Templater(modelId).get('table').setAttribute('data-loading', ''));
|
||||
fetch(`${this.url}/compatibility.json`)
|
||||
.then(res => this.handleResponse(res))
|
||||
.then(json => json.ok ? this.getModels(json['spacy']) : this.modelIds.forEach(modelId => this.showError(modelId)))
|
||||
}
|
||||
|
||||
handleResponse(res) {
|
||||
if (res.ok) return res.json().then(json => Object.assign({}, json, { ok: res.ok }))
|
||||
else return ({ ok: res.ok })
|
||||
}
|
||||
|
||||
convertNumber(num, separator = ',') {
|
||||
return num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, separator);
|
||||
}
|
||||
|
||||
getModels(compat) {
|
||||
this.compat = compat;
|
||||
for (let modelId of this.modelIds) {
|
||||
const version = this.getLatestVersion(modelId, compat);
|
||||
if (!version) {
|
||||
this.showError(modelId); return;
|
||||
}
|
||||
fetch(`${this.url}/meta/${modelId}-${version}.json`)
|
||||
.then(res => this.handleResponse(res))
|
||||
.then(json => json.ok ? this.render(json) : this.showError(modelId))
|
||||
}
|
||||
// make sure scroll positions for progress bar etc. are recalculated
|
||||
window.dispatchEvent(new Event('resize'));
|
||||
}
|
||||
|
||||
showError(modelId) {
|
||||
const template = new Templater(modelId);
|
||||
template.get('table').removeAttribute('data-loading');
|
||||
template.get('error').style.display = 'block';
|
||||
for (let key of ['sources', 'pipeline', 'vectors', 'author', 'license']) {
|
||||
template.get(key).parentElement.parentElement.style.display = 'none';
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Update model details in tables. Currently quite hacky :(
|
||||
*/
|
||||
render({ lang, name, version, sources, pipeline, vectors, url, author, license, accuracy, speed, size, description, notes }) {
|
||||
const modelId = `${lang}_${name}`;
|
||||
const model = `${modelId}-${version}`;
|
||||
const template = new Templater(modelId);
|
||||
|
||||
const getSources = s => (s instanceof Array) ? s.join(', ') : s;
|
||||
const getPipeline = p => p.map(comp => `<code>${comp}</code>`).join(', ');
|
||||
const getVectors = v => `${this.convertNumber(v.entries)} (${v.width} dimensions)`;
|
||||
const getLink = (t, l) => `<a href="${l}" target="_blank">${t}</a>`;
|
||||
|
||||
const keys = { version, size, description, notes }
|
||||
Object.keys(keys).forEach(key => template.fill(key, keys[key]));
|
||||
|
||||
if (sources) template.fill('sources', getSources(sources));
|
||||
if (pipeline && pipeline.length) template.fill('pipeline', getPipeline(pipeline), true);
|
||||
else template.get('pipeline').parentElement.parentElement.style.display = 'none';
|
||||
if (vectors) template.fill('vectors', getVectors(vectors));
|
||||
else template.get('vectors').parentElement.parentElement.style.display = 'none';
|
||||
|
||||
if (author) template.fill('author', url ? getLink(author, url) : author, true);
|
||||
if (license) template.fill('license', this.licenses[license] ? getLink(license, this.licenses[license]) : license, true);
|
||||
|
||||
template.get('download').setAttribute('href', `${this.repo}/releases/tag/${model}`);
|
||||
|
||||
this.renderBenchmarks(template, accuracy, speed);
|
||||
this.renderCompat(template, modelId);
|
||||
template.get('table').removeAttribute('data-loading');
|
||||
}
|
||||
|
||||
renderBenchmarks(template, accuracy = {}, speed = {}) {
|
||||
if (!accuracy && !speed) return;
|
||||
template.get('benchmarks').style.display = 'block';
|
||||
this.renderTable(template, 'parser', accuracy, val => val.toFixed(2));
|
||||
this.renderTable(template, 'ner', accuracy, val => val.toFixed(2));
|
||||
this.renderTable(template, 'speed', speed, Math.round);
|
||||
}
|
||||
|
||||
renderTable(template, id, benchmarks, convertVal = val => val) {
|
||||
if (!this.benchKeys[id] || !Object.keys(this.benchKeys[id]).some(key => benchmarks[key])) return;
|
||||
const keys = Object.keys(this.benchKeys[id]).map(k => benchmarks[k] ? k : false).filter(k => k);
|
||||
template.get(id).style.display = 'block';
|
||||
for (let key of keys) {
|
||||
template
|
||||
.fill(key, this.convertNumber(convertVal(benchmarks[key])))
|
||||
.parentElement.style.display = 'table-row';
|
||||
}
|
||||
}
|
||||
|
||||
renderCompat(template, modelId) {
|
||||
template.get('compat-wrapper').style.display = 'table-row';
|
||||
const options = Object.keys(this.compat).map(v => `<option value="${v}">v${v}</option>`).join('');
|
||||
template
|
||||
.fill('compat', '<option selected disabled>spaCy version</option>' + options, true)
|
||||
.addEventListener('change', ev => {
|
||||
const result = this.compat[ev.target.value][modelId];
|
||||
if (result) template.fill('compat-versions', `<code>${modelId}-${result[0]}</code>`, true);
|
||||
else template.fill('compat-versions', '');
|
||||
});
|
||||
}
|
||||
|
||||
getLatestVersion(model, compat = {}) {
|
||||
for (let spacy_v of Object.keys(compat)) {
|
||||
const models = compat[spacy_v];
|
||||
if (models[model]) return models[model][0];
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class Changelog {
|
||||
/**
|
||||
* Fetch and render changelog from GitHub. Clones a template node (table row)
|
||||
* to avoid doubling templating markup in JavaScript.
|
||||
*
|
||||
* @param {String} user - GitHub username.
|
||||
* @param {String} repo - Repository to fetch releases from.
|
||||
*/
|
||||
constructor(user, repo) {
|
||||
this.url = `https://api.github.com/repos/${user}/${repo}/releases`;
|
||||
this.template = new Templater('changelog');
|
||||
fetch(this.url)
|
||||
.then(res => this.handleResponse(res))
|
||||
.then(json => json.ok ? this.render(json) : false)
|
||||
}
|
||||
|
||||
/**
|
||||
* Get template section from template row. Slightly hacky, but does make sense.
|
||||
*/
|
||||
$(item, id) {
|
||||
return item.querySelector(`[data-changelog="${id}"]`);
|
||||
}
|
||||
|
||||
handleResponse(res) {
|
||||
if (res.ok) return res.json().then(json => Object.assign({}, json, { ok: res.ok }))
|
||||
else return ({ ok: res.ok })
|
||||
}
|
||||
|
||||
render(json) {
|
||||
this.template.get('error').style.display = 'none';
|
||||
this.template.get('table').style.display = 'block';
|
||||
this.row = this.template.get('item');
|
||||
this.releases = this.template.get('releases');
|
||||
this.prereleases = this.template.get('prereleases');
|
||||
Object.values(json)
|
||||
.filter(release => release.name)
|
||||
.forEach(release => this.renderRelease(release));
|
||||
this.row.remove();
|
||||
// make sure scroll positions for progress bar etc. are recalculated
|
||||
window.dispatchEvent(new Event('resize'));
|
||||
}
|
||||
|
||||
/**
|
||||
* Clone the template row and populate with content from API response.
|
||||
* https://developer.github.com/v3/repos/releases/#list-releases-for-a-repository
|
||||
*
|
||||
* @param {String} name - Release title.
|
||||
* @param {String} tag (tag_name) - Release tag.
|
||||
* @param {String} url (html_url) - URL to the release page on GitHub.
|
||||
* @param {String} date (published_at) - Timestamp of release publication.
|
||||
* @param {Boolean} pre (prerelease) - Whether the release is a prerelease.
|
||||
*/
|
||||
renderRelease({ name, tag_name: tag, html_url: url, published_at: date, prerelease: pre }) {
|
||||
const container = pre ? this.prereleases : this.releases;
|
||||
const row = this.row.cloneNode(true);
|
||||
this.$(row, 'date').textContent = date.split('T')[0];
|
||||
this.$(row, 'tag').innerHTML = `<a href="${url}" target="_blank"><code>${tag}</code></a>`;
|
||||
this.$(row, 'title').textContent = (name.split(': ').length == 2) ? name.split(': ')[1] : name;
|
||||
container.appendChild(row);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
class GitHubEmbed {
|
||||
/**
|
||||
* Embed code from GitHub repositories, similar to Gist embeds. Fetches the
|
||||
* raw text and places it inside element.
|
||||
* Usage: <pre><code data-gh-embed="spacy/master/examples/x.py"></code><pre>
|
||||
*
|
||||
* @param {String} user - GitHub user or organization.
|
||||
* @param {String} attr - Data attribute used to select containers. Attribute
|
||||
* value should be path to file relative to user.
|
||||
*/
|
||||
constructor(user, attr) {
|
||||
this.url = `https://raw.githubusercontent.com/${user}`;
|
||||
this.attr = attr;
|
||||
this.error = `\nCan't fetch code example from GitHub :(\n\nPlease use the link below to view the example. If you've come across\na broken link, we always appreciate a pull request to the repository,\nor a report on the issue tracker. Thanks!`;
|
||||
[...$$(`[${this.attr}]`)].forEach(el => this.embed(el));
|
||||
}
|
||||
|
||||
embed(el) {
|
||||
el.parentElement.setAttribute('data-loading', '');
|
||||
fetch(`${this.url}/${el.getAttribute(this.attr)}`)
|
||||
.then(res => res.text().then(text => ({ text, ok: res.ok })))
|
||||
.then(({ text, ok }) => {
|
||||
el.textContent = ok ? text : this.error;
|
||||
if (ok && window.Prism) Prism.highlightElement(el);
|
||||
})
|
||||
el.parentElement.removeAttribute('data-loading');
|
||||
}
|
||||
}
|
317
website/assets/js/models.js
Normal file
|
@ -0,0 +1,317 @@
|
|||
'use strict';
|
||||
|
||||
import { Templater, handleResponse, convertNumber, abbrNumber } from './util.js';
|
||||
|
||||
/**
|
||||
* Chart.js defaults
|
||||
*/
|
||||
const CHART_COLORS = { model1: '#09a3d5', model2: '#066B8C' };
|
||||
const CHART_FONTS = {
|
||||
legend: '-apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"',
|
||||
ticks: 'Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace'
|
||||
};
|
||||
|
||||
/**
|
||||
* Formatters for model details.
|
||||
* @property {function} author – Format model author with optional link.
|
||||
* @property {function} license - Format model license with optional link.
|
||||
* @property {function} sources - Format training data sources (list or string).
|
||||
* @property {function} pipeline - Format list of pipeline components.
|
||||
* @property {function} vectors - Format vector data (entries and dimensions).
|
||||
* @property {function} version - Format model version number.
|
||||
*/
|
||||
export const formats = {
|
||||
author: (author, url) => url ? `<a href="${url}" target="_blank">${author}</a>` : author,
|
||||
license: (license, url) => url ? `<a href="${url}" target="_blank">${license}</a>` : license,
|
||||
sources: sources => (sources instanceof Array) ? sources.join(', ') : sources,
|
||||
pipeline: pipes => (pipes && pipes.length) ? pipes.map(p => `<code>${p}</code>`).join(', ') : '-',
|
||||
vectors: vec => vec ? `${abbrNumber(vec.keys)} keys, ${abbrNumber(vec.vectors)} unique vectors (${vec.width} dimensions)` : 'n/a',
|
||||
version: version => `<code>v${version}</code>`
|
||||
};
|
||||
|
||||
/**
|
||||
* Find the latest version of a model in a compatibility table.
|
||||
* @param {string} model - The model name.
|
||||
* @param {Object} compat - Compatibility table, keyed by spaCy version.
|
||||
*/
|
||||
export const getLatestVersion = (model, compat = {}) => {
|
||||
for (let [spacy_v, models] of Object.entries(compat)) {
|
||||
if (models[model]) return models[model][0];
|
||||
}
|
||||
};
|
||||
|
||||
export class ModelLoader {
|
||||
/**
|
||||
* Load model meta from GitHub and update model details on site. Uses the
|
||||
* Templater mini template engine to update DOM.
|
||||
* @param {string} repo - Path to GitHub repository containing releases.
|
||||
* @param {Array} models - List of model IDs, e.g. "en_core_web_sm".
|
||||
* @param {Object} licenses - License IDs mapped to URLs.
|
||||
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
|
||||
* 'parser', 'ner', 'speed', mapped to labels.
|
||||
*/
|
||||
constructor(repo, models = [], licenses = {}, benchmarkKeys = {}) {
|
||||
this.url = `https://raw.githubusercontent.com/${repo}/master`;
|
||||
this.repo = `https://github.com/${repo}`;
|
||||
this.modelIds = models;
|
||||
this.licenses = licenses;
|
||||
this.benchKeys = benchmarkKeys;
|
||||
this.init();
|
||||
}
|
||||
|
||||
init() {
|
||||
this.modelIds.forEach(modelId =>
|
||||
new Templater(modelId).get('table').setAttribute('data-loading', ''));
|
||||
this.fetch(`${this.url}/compatibility.json`)
|
||||
.then(json => this.getModels(json.spacy))
|
||||
.catch(_ => this.modelIds.forEach(modelId => this.showError(modelId)));
|
||||
// make sure scroll positions for progress bar etc. are recalculated
|
||||
window.dispatchEvent(new Event('resize'));
|
||||
}
|
||||
|
||||
fetch(url) {
|
||||
return new Promise((resolve, reject) =>
|
||||
fetch(url).then(res => handleResponse(res))
|
||||
.then(json => json.ok ? resolve(json) : reject()))
|
||||
}
|
||||
|
||||
getModels(compat) {
|
||||
this.compat = compat;
|
||||
for (let modelId of this.modelIds) {
|
||||
const version = getLatestVersion(modelId, compat);
|
||||
if (version) this.fetch(`${this.url}/meta/${modelId}-${version}.json`)
|
||||
.then(json => this.render(json))
|
||||
.catch(_ => this.showError(modelId))
|
||||
else this.showError(modelId);
|
||||
}
|
||||
}
|
||||
|
||||
showError(modelId) {
|
||||
const tpl = new Templater(modelId);
|
||||
tpl.get('table').removeAttribute('data-loading');
|
||||
tpl.get('error').style.display = 'block';
|
||||
for (let key of ['sources', 'pipeline', 'vectors', 'author', 'license']) {
|
||||
tpl.get(key).parentElement.parentElement.style.display = 'none';
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Update model details in tables. Currently quite hacky :(
|
||||
*/
|
||||
render(data) {
|
||||
const modelId = `${data.lang}_${data.name}`;
|
||||
const model = `${modelId}-${data.version}`;
|
||||
const tpl = new Templater(modelId);
|
||||
tpl.get('error').style.display = 'none';
|
||||
this.renderDetails(tpl, data)
|
||||
this.renderBenchmarks(tpl, data.accuracy, data.speed);
|
||||
this.renderCompat(tpl, modelId);
|
||||
tpl.get('download').setAttribute('href', `${this.repo}/releases/tag/${model}`);
|
||||
tpl.get('table').removeAttribute('data-loading');
|
||||
}
|
||||
|
||||
renderDetails(tpl, { version, size, description, notes, author, url,
|
||||
license, sources, vectors, pipeline }) {
|
||||
const basics = { version, size, description, notes }
|
||||
for (let [key, value] of Object.entries(basics)) {
|
||||
if (value) tpl.fill(key, value);
|
||||
}
|
||||
if (author) tpl.fill('author', formats.author(author, url), true);
|
||||
if (license) tpl.fill('license', formats.license(license, this.licenses[license]), true);
|
||||
if (sources) tpl.fill('sources', formats.sources(sources));
|
||||
if (vectors) tpl.fill('vectors', formats.vectors(vectors));
|
||||
else tpl.get('vectors').parentElement.parentElement.style.display = 'none';
|
||||
if (pipeline && pipeline.length) tpl.fill('pipeline', formats.pipeline(pipeline), true);
|
||||
else tpl.get('pipeline').parentElement.parentElement.style.display = 'none';
|
||||
}
|
||||
|
||||
renderBenchmarks(tpl, accuracy = {}, speed = {}) {
|
||||
if (!accuracy && !speed) return;
|
||||
this.renderTable(tpl, 'parser', accuracy, val => val.toFixed(2));
|
||||
this.renderTable(tpl, 'ner', accuracy, val => val.toFixed(2));
|
||||
this.renderTable(tpl, 'speed', speed, Math.round);
|
||||
tpl.get('benchmarks').style.display = 'block';
|
||||
}
|
||||
|
||||
renderTable(tpl, id, benchmarks, converter = val => val) {
|
||||
if (!this.benchKeys[id] || !Object.keys(this.benchKeys[id]).some(key => benchmarks[key])) return;
|
||||
for (let key of Object.keys(this.benchKeys[id])) {
|
||||
if (benchmarks[key]) tpl
|
||||
.fill(key, convertNumber(converter(benchmarks[key])))
|
||||
.parentElement.style.display = 'table-row';
|
||||
}
|
||||
tpl.get(id).style.display = 'block';
|
||||
}
|
||||
|
||||
renderCompat(tpl, modelId) {
|
||||
tpl.get('compat-wrapper').style.display = 'table-row';
|
||||
const header = '<option selected disabled>spaCy version</option>';
|
||||
const options = Object.keys(this.compat)
|
||||
.map(v => `<option value="${v}">v${v}</option>`)
|
||||
.join('');
|
||||
tpl
|
||||
.fill('compat', header + options, true)
|
||||
.addEventListener('change', ({ target: { value }}) =>
|
||||
tpl.fill('compat-versions', this.getCompat(value, modelId), true))
|
||||
}
|
||||
|
||||
getCompat(version, model) {
|
||||
const res = this.compat[version][model];
|
||||
return res ? `<code>${model}-${res[0]}</code>` : '<em>not compatible</em>';
|
||||
}
|
||||
}
|
||||
|
||||
export class ModelComparer {
|
||||
/**
|
||||
* Compare two model meta files and render a chart and comparison table.
|
||||
* @param {string} repo - Path to GitHub repository containing releases.
|
||||
* @param {Object} licenses - License IDs mapped to URLs.
|
||||
* @param {Object} benchmarkKeys - Objects of available keys by type, e.g.
|
||||
* 'parser', 'ner', 'speed', mapped to labels.
|
||||
* @param {Object} languages - Available languages, ID mapped to name.
|
||||
* @param {Object} defaultModels - Models to compare on load, 'model1' and
|
||||
* 'model2' mapped to model names.
|
||||
*/
|
||||
constructor(repo, licenses = {}, benchmarkKeys = {}, languages = {}, labels = {}, defaultModels) {
|
||||
this.url = `https://raw.githubusercontent.com/${repo}/master`;
|
||||
this.repo = `https://github.com/${repo}`;
|
||||
this.tpl = new Templater('compare');
|
||||
this.benchKeys = benchmarkKeys;
|
||||
this.licenses = licenses;
|
||||
this.languages = languages;
|
||||
this.labels = labels;
|
||||
this.models = {};
|
||||
this.colors = CHART_COLORS;
|
||||
this.fonts = CHART_FONTS;
|
||||
this.defaultModels = defaultModels;
|
||||
this.tpl.get('result').style.display = 'block';
|
||||
this.fetchCompat()
|
||||
.then(compat => this.init(compat))
|
||||
.catch(this.showError.bind(this))
|
||||
}
|
||||
|
||||
init(compat) {
|
||||
this.compat = compat;
|
||||
const selectA = this.tpl.get('model1');
|
||||
const selectB = this.tpl.get('model2');
|
||||
selectA.addEventListener('change', this.onSelect.bind(this));
|
||||
selectB.addEventListener('change', this.onSelect.bind(this));
|
||||
this.chart = new Chart('chart_compare_accuracy', { type: 'bar', options: {
|
||||
responsive: true,
|
||||
legend: { position: 'bottom', labels: { fontFamily: this.fonts.legend, fontSize: 13 }},
|
||||
scales: {
|
||||
yAxes: [{ label: 'Accuracy', ticks: { min: 70, fontFamily: this.fonts.ticks }}],
|
||||
xAxes: [{ barPercentage: 0.75, ticks: { fontFamily: this.fonts.ticks }}]
|
||||
}
|
||||
}});
|
||||
if (this.defaultModels) {
|
||||
selectA.value = this.defaultModels.model1;
|
||||
selectB.value = this.defaultModels.model2;
|
||||
this.getModels(this.defaultModels);
|
||||
}
|
||||
}
|
||||
|
||||
fetchCompat() {
|
||||
return new Promise((resolve, reject) =>
|
||||
fetch(`${this.url}/compatibility.json`)
|
||||
.then(res => handleResponse(res))
|
||||
.then(json => json.ok ? resolve(json.spacy) : reject()))
|
||||
}
|
||||
|
||||
fetchModel(name) {
|
||||
const version = getLatestVersion(name, this.compat);
|
||||
const modelName = `${name}-${version}`;
|
||||
return new Promise((resolve, reject) => {
|
||||
// resolve immediately if model already loaded, e.g. in this.models
|
||||
if (this.models[name]) resolve(this.models[name]);
|
||||
else fetch(`${this.url}/meta/${modelName}.json`)
|
||||
.then(res => handleResponse(res))
|
||||
.then(json => json.ok ? resolve(this.saveModel(name, json)) : reject())
|
||||
})
|
||||
}
|
||||
|
||||
/**
|
||||
* "Save" meta to this.models so it only has to be fetched from GitHub once.
|
||||
* @param {string} name - The model name.
|
||||
* @param {Object} data - The model meta data.
|
||||
*/
|
||||
saveModel(name, data) {
|
||||
this.models[name] = data;
|
||||
return data;
|
||||
}
|
||||
|
||||
showError(err) {
|
||||
console.error(err);
|
||||
this.tpl.get('result').style.display = 'none';
|
||||
this.tpl.get('error').style.display = 'block';
|
||||
}
|
||||
|
||||
onSelect(ev) {
|
||||
const modelId = ev.target.value;
|
||||
const otherId = (ev.target.id == 'model1') ? 'model2' : 'model1';
|
||||
const otherVal = this.tpl.get(otherId);
|
||||
const otherModel = otherVal.options[otherVal.selectedIndex].value;
|
||||
if (otherModel != '') this.getModels({
|
||||
[ev.target.id]: modelId,
|
||||
[otherId]: otherModel
|
||||
})
|
||||
}
|
||||
|
||||
getModels({ model1, model2 }) {
|
||||
this.tpl.get('result').setAttribute('data-loading', '');
|
||||
this.fetchModel(model1)
|
||||
.then(data1 => this.fetchModel(model2)
|
||||
.then(data2 => this.render({ model1: data1, model2: data2 })))
|
||||
.catch(this.showError.bind(this))
|
||||
}
|
||||
|
||||
/**
|
||||
* Render two models, and populate the chart and table. Currently quite hacky :(
|
||||
* @param {Object} models - The models to render.
|
||||
* @param {Object} models.model1 - The first model (via first <select>).
|
||||
* @param {Object} models.model2 - The second model (via second <select>).
|
||||
*/
|
||||
render({ model1, model2 }) {
|
||||
const accKeys = Object.assign({}, this.benchKeys.parser, this.benchKeys.ner);
|
||||
const allKeys = [...Object.keys(model1.accuracy || []), ...Object.keys(model2.accuracy || [])];
|
||||
const metaKeys = Object.keys(accKeys).filter(k => allKeys.includes(k));
|
||||
const labels = metaKeys.map(key => accKeys[key]);
|
||||
const datasets = [model1, model2]
|
||||
.map(({ lang, name, version, accuracy = {} }, i) => ({
|
||||
label: `${lang}_${name}-${version}`,
|
||||
backgroundColor: this.colors[`model${i + 1}`],
|
||||
data: metaKeys.map(key => (accuracy[key] || 0).toFixed(2))
|
||||
}));
|
||||
this.chart.data = { labels, datasets };
|
||||
this.chart.update();
|
||||
[model1, model2].forEach((model, i) => this.renderTable(metaKeys, i + 1, model));
|
||||
this.tpl.get('result').removeAttribute('data-loading');
|
||||
}
|
||||
|
||||
renderTable(metaKeys, i, { lang, name, version, size, description,
|
||||
notes, author, url, license, sources, vectors, pipeline, accuracy = {},
|
||||
speed = {}}) {
|
||||
const type = name.split('_')[0]; // extract type from model name
|
||||
const genre = name.split('_')[1]; // extract genre from model name
|
||||
this.tpl.fill(`table-head${i}`, `${lang}_${name}`);
|
||||
this.tpl.get(`link${i}`).setAttribute('href', `/models/${lang}#${lang}_${name}`);
|
||||
this.tpl.fill(`download${i}`, `spacy download ${lang}_${name}\n`);
|
||||
this.tpl.fill(`lang${i}`, this.languages[lang] || lang);
|
||||
this.tpl.fill(`type${i}`, this.labels[type] || type);
|
||||
this.tpl.fill(`genre${i}`, this.labels[genre] || genre);
|
||||
this.tpl.fill(`version${i}`, formats.version(version), true);
|
||||
this.tpl.fill(`size${i}`, size);
|
||||
this.tpl.fill(`desc${i}`, description || 'n/a');
|
||||
this.tpl.fill(`pipeline${i}`, formats.pipeline(pipeline), true);
|
||||
this.tpl.fill(`vectors${i}`, formats.vectors(vectors));
|
||||
this.tpl.fill(`sources${i}`, formats.sources(sources));
|
||||
this.tpl.fill(`author${i}`, formats.author(author, url), true);
|
||||
this.tpl.fill(`license${i}`, formats.license(license, this.licenses[license]), true);
|
||||
// check if model accuracy or speed includes one of the pre-set keys
|
||||
for (let key of [...metaKeys, ...Object.keys(this.benchKeys.speed)]) {
|
||||
if (accuracy[key]) this.tpl.fill(`${key}${i}`, accuracy[key].toFixed(2))
|
||||
else if (speed[key]) this.tpl.fill(`${key}${i}`, convertNumber(Math.round(speed[key])))
|
||||
else this.tpl.fill(`${key}${i}`, 'n/a')
|
||||
}
|
||||
}
|
||||
}
|
35
website/assets/js/nav-highlighter.js
Normal file
|
@ -0,0 +1,35 @@
|
|||
'use strict';
|
||||
|
||||
import { $, $$ } from './util.js';
|
||||
|
||||
export default class NavHighlighter {
|
||||
/**
|
||||
* Highlight section in viewport in sidebar, using the in-view library.
|
||||
* @param {string} sectionAttr - Data attribute of sections.
|
||||
* @param {string} navAttr - Data attribute of navigation items.
|
||||
* @param {string} activeClass – Class name of active element.
|
||||
*/
|
||||
constructor(sectionAttr, navAttr, activeClass = 'is-active') {
|
||||
this.sections = [...$$(`[${navAttr}]`)];
|
||||
// highlight first item regardless
|
||||
if (this.sections.length) this.sections[0].classList.add(activeClass);
|
||||
this.navAttr = navAttr;
|
||||
this.sectionAttr = sectionAttr;
|
||||
this.activeClass = activeClass;
|
||||
if (window.inView) inView(`[${sectionAttr}]`)
|
||||
.on('enter', this.highlightSection.bind(this));
|
||||
}
|
||||
|
||||
/**
|
||||
* Check if section in view exists in sidebar and mark as active.
|
||||
* @param {node} section - The section in view.
|
||||
*/
|
||||
highlightSection(section) {
|
||||
const id = section.getAttribute(this.sectionAttr);
|
||||
const el = $(`[${this.navAttr}="${id}"]`);
|
||||
if (el) {
|
||||
this.sections.forEach(el => el.classList.remove(this.activeClass));
|
||||
el.classList.add(this.activeClass);
|
||||
}
|
||||
}
|
||||
}
|
52
website/assets/js/progress.js
Normal file
|
@ -0,0 +1,52 @@
|
|||
'use strict';
|
||||
|
||||
import { $ } from './util.js';
|
||||
|
||||
export default class ProgressBar {
|
||||
/**
|
||||
* Animated reading progress bar.
|
||||
* @param {string} selector – CSS selector of progress bar element.
|
||||
*/
|
||||
constructor(selector) {
|
||||
this.scrollY = 0;
|
||||
this.sizes = this.updateSizes();
|
||||
this.el = $(selector);
|
||||
this.el.setAttribute('max', 100);
|
||||
window.addEventListener('scroll', this.onScroll.bind(this));
|
||||
window.addEventListener('resize', this.onResize.bind(this));
|
||||
}
|
||||
|
||||
onScroll(ev) {
|
||||
this.scrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0);
|
||||
requestAnimationFrame(this.update.bind(this));
|
||||
}
|
||||
|
||||
onResize(ev) {
|
||||
this.sizes = this.updateSizes();
|
||||
requestAnimationFrame(this.update.bind(this));
|
||||
}
|
||||
|
||||
update() {
|
||||
const offset = 100 - ((this.sizes.height - this.scrollY - this.sizes.vh) / this.sizes.height * 100);
|
||||
this.el.setAttribute('value', (this.scrollY == 0) ? 0 : offset || 0);
|
||||
}
|
||||
|
||||
/**
|
||||
* Update scroll and viewport height. Called on load and window resize.
|
||||
*/
|
||||
updateSizes() {
|
||||
return {
|
||||
height: Math.max(
|
||||
document.body.scrollHeight,
|
||||
document.body.offsetHeight,
|
||||
document.documentElement.clientHeight,
|
||||
document.documentElement.scrollHeight,
|
||||
document.documentElement.offsetHeight
|
||||
),
|
||||
vh: Math.max(
|
||||
document.documentElement.clientHeight,
|
||||
window.innerHeight || 0
|
||||
)
|
||||
}
|
||||
}
|
||||
}
|
23
website/assets/js/rollup.js
Normal file
|
@ -0,0 +1,23 @@
|
|||
/**
|
||||
* This file is bundled by Rollup, compiled with Babel and included as
|
||||
* <script nomodule> for older browsers that don't yet support JavaScript
|
||||
* modules. Browsers that do will ignore this bundle and won't even fetch it
|
||||
* from the server. Details:
|
||||
* https://github.com/rollup/rollup
|
||||
* https://medium.com/dev-channel/es6-modules-in-chrome-canary-m60-ba588dfb8ab7
|
||||
*/
|
||||
|
||||
// Import all modules that are instantiated directly in _includes/_scripts.jade
|
||||
import ProgressBar from './progress.js';
|
||||
import NavHighlighter from './nav-highlighter.js';
|
||||
import Changelog from './changelog.js';
|
||||
import GitHubEmbed from './github-embed.js';
|
||||
import { ModelLoader, ModelComparer } from './models.js';
|
||||
|
||||
// Assign to window so they are bundled by rollup
|
||||
window.ProgressBar = ProgressBar;
|
||||
window.NavHighlighter = NavHighlighter;
|
||||
window.Changelog = Changelog;
|
||||
window.GitHubEmbed = GitHubEmbed;
|
||||
window.ModelLoader = ModelLoader;
|
||||
window.ModelComparer = ModelComparer;
|
69
website/assets/js/util.js
Normal file
|
@ -0,0 +1,69 @@
|
|||
'use strict';
|
||||
|
||||
export const $ = document.querySelector.bind(document);
|
||||
export const $$ = document.querySelectorAll.bind(document);
|
||||
|
||||
export class Templater {
|
||||
/**
|
||||
* Mini templating engine based on data attributes. Selects elements based
|
||||
* on a data-tpl and data-tpl-key attribute and can set textContent
|
||||
* and innerHTML.
|
||||
* @param {string} templateId - Template section, e.g. value of data-tpl.
|
||||
*/
|
||||
constructor(templateId) {
|
||||
this.templateId = templateId;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get an element from the template and return it.
|
||||
* @param {string} key - Name of the key within the current template.
|
||||
*/
|
||||
get(key) {
|
||||
return $(`[data-tpl="${this.templateId}"][data-tpl-key="${key}"]`);
|
||||
}
|
||||
|
||||
/**
|
||||
* Fill the content of a template element with a value.
|
||||
* @param {string} key - Name of the key within the current template.
|
||||
* @param {string} value - Content to insert into template element.
|
||||
* @param {boolean} html - Insert content as HTML. Defaults to false.
|
||||
*/
|
||||
fill(key, value, html = false) {
|
||||
const el = this.get(key);
|
||||
if (html) el.innerHTML = value || '';
|
||||
else el.textContent = value || '';
|
||||
return el;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Handle API response and assign status to returned JSON.
|
||||
* @param {Response} res – The response.
|
||||
*/
|
||||
export const handleResponse = res => {
|
||||
if (res.ok) return res.json()
|
||||
.then(json => Object.assign({}, json, { ok: res.ok }))
|
||||
else return ({ ok: res.ok })
|
||||
};
|
||||
|
||||
/**
|
||||
* Convert a number to a string and add thousand separator.
|
||||
* @param {number|string} num - The number to convert.
|
||||
* @param {string} separator – Thousand separator.
|
||||
*/
|
||||
export const convertNumber = (num = 0, separator = ',') =>
|
||||
num.toString().replace(/\B(?=(\d{3})+(?!\d))/g, separator);
|
||||
|
||||
/**
|
||||
* Abbreviate a number, e.g. 14249930 --> 14.25m.
|
||||
* @param {number|string} num - The number to convert.
|
||||
* @param {number} fixed - Number of decimals.
|
||||
*/
|
||||
export const abbrNumber = (num = 0, fixed = 2) => {
|
||||
const suffixes = ['', 'k', 'm', 'b', 't'];
|
||||
if (num === null || num === 0) return 0;
|
||||
const b = num.toPrecision(2).split('e');
|
||||
const k = (b.length === 1) ? 0 : Math.floor(Math.min(b[1].slice(1), 14) / 3);
|
||||
const c = (k < 1) ? num.toFixed(fixed) : (num / Math.pow(10, k * 3)).toFixed(fixed + 1);
|
||||
return (c < 0 ? c : Math.abs(c)) + suffixes[k];
|
||||
}
|
|
@ -1,7 +1,8 @@
|
|||
{
|
||||
"sidebar": {
|
||||
"Models": {
|
||||
"Overview": "./"
|
||||
"Overview": "./",
|
||||
"Comparison": "comparison"
|
||||
},
|
||||
|
||||
"Language models": {
|
||||
|
@ -26,6 +27,17 @@
|
|||
}
|
||||
},
|
||||
|
||||
"comparison": {
|
||||
"title": "Model Comparison",
|
||||
"teaser": "Compare spaCy's statistical models and their accuracy.",
|
||||
"tag": "experimental",
|
||||
"compare_models": true,
|
||||
"default_models": {
|
||||
"model1": "en_core_web_sm",
|
||||
"model2": "en_core_web_lg"
|
||||
}
|
||||
},
|
||||
|
||||
"MODELS": {
|
||||
"en": ["en_core_web_sm", "en_core_web_lg", "en_vectors_web_lg"],
|
||||
"de": ["de_dep_news_sm"],
|
||||
|
@ -88,6 +100,7 @@
|
|||
"hu": "Hungarian",
|
||||
"pl": "Polish",
|
||||
"he": "Hebrew",
|
||||
"ga": "Irish",
|
||||
"bn": "Bengali",
|
||||
"hi": "Hindi",
|
||||
"id": "Indonesian",
|
||||
|
@ -102,6 +115,8 @@
|
|||
"de": "Dies ist ein Satz.",
|
||||
"fr": "C'est une phrase.",
|
||||
"es": "Esto es una frase.",
|
||||
"pt": "Esta é uma frase.",
|
||||
"it": "Questa è una frase.",
|
||||
"xx": "This is a sentence about Facebook."
|
||||
}
|
||||
}
|
||||
|
|
81
website/models/comparison.jade
Normal file
|
@@ -0,0 +1,81 @@
//- 💫 DOCS > MODELS > COMPARISON

include ../_includes/_mixins

p
    | This experimental tool helps you compare spaCy's statistical models
    | by features, accuracy and speed. This can be especially useful to get an
    | idea of the trade-offs between larger and smaller models of the same
    | type. For example, #[code lg] models tend to be more accurate than
    | the corresponding #[code sm] versions – but they're often significantly
    | larger in file size and memory usage.

- TPL = "compare"

+grid.o-box
    for i in [1, 2]
        +grid-col("half", "no-gutter")
            label.u-heading.u-text-label.u-text-center.u-color-theme(for="model#{i}") Model #{i}
            .o-field.o-grid.o-grid--vcenter.u-padding-small
                select.o-field__select.u-text-small(id="model#{i}" data-tpl=TPL data-tpl-key="model#{i}")
                    option(selected="" disabled="" value="") Select model...
                    for models, _ in MODELS
                        for model in models
                            option(value=model)=model

div(data-tpl=TPL data-tpl-key="error")
    +infobox
        | Unable to load model details and accuracy figures from GitHub to
        | compare the models. For details of the individual models, see the
        | overview of the
        | #[+a(gh("spacy-models") + "/releases") latest model releases].

div(data-tpl=TPL data-tpl-key="result" style="display: none")
    +chart("compare_accuracy", 350)

    +aside-code("Download", "text")
        for i in [1, 2]
            span(data-tpl=TPL data-tpl-key="download#{i}")

    +table.o-block-small(data-tpl=TPL data-tpl-key="table")
        +row("head")
            +head-cell
            for i in [1, 2]
                +head-cell(style="width: 40%")
                    a(data-tpl=TPL data-tpl-key="link#{i}")
                        code(data-tpl=TPL data-tpl-key="table-head#{i}" style="text-transform: initial; font-weight: normal")

        for label, id in {lang: "Language", type: "Type", genre: "Genre"}
            +row
                +cell #[+label=label]
                for i in [1, 2]
                    +cell(data-tpl=TPL data-tpl-key="#{id}#{i}") n/a

        for label in ["Version", "Size", "Pipeline", "Vectors", "Sources", "Author", "License"]
            - var field = label.toLowerCase()
            +row
                +cell.u-nowrap
                    +label=label
                    if MODEL_META[field]
                        |  #[+help(MODEL_META[field]).u-color-subtle]
                for i in [1, 2]
                    +cell
                        span(data-tpl=TPL data-tpl-key=field + i) #[em n/a]

        +row
            +cell #[+label Description]
            for i in [1, 2]
                +cell.u-text-tiny(data-tpl=TPL data-tpl-key="desc#{i}") n/a

        for benchmark, _ in MODEL_BENCHMARKS
            - var counter = 0
            for label, field in benchmark
                +row((counter == 0) ? "divider" : null)
                    +cell.u-nowrap
                        +label=label
                        if MODEL_META[field]
                            |  #[+help(MODEL_META[field]).u-color-subtle]
                    for i in [1, 2]
                        +cell
                            span(data-tpl=TPL data-tpl-key=field + i) n/a
                - counter++
@@ -8,13 +8,15 @@
    "devDependencies": {
        "babel-cli": "^6.14.0",
        "harp": "^0.24.0",
        "rollup": "^0.50.0",
        "uglify-js": "^2.7.3"
    },
    "dependencies": {},
    "scripts": {
        "compile": "NODE_ENV=deploy harp compile",
        "compile_js": "babel www/assets/js/main.js --out-file www/assets/js/main.js --presets=es2015",
        "uglify": "uglifyjs www/assets/js/main.js --output www/assets/js/main.js",
        "build": "npm run compile && npm run compile_js && npm run uglify"
        "rollup_js": "rollup www/assets/js/rollup.js --output.format iife --output.file www/assets/js/rollup.js",
        "compile_rollup": "babel www/assets/js/rollup.js --out-file www/assets/js/rollup.js --presets=es2015",
        "uglify": "uglifyjs www/assets/js/rollup.js --output www/assets/js/rollup.js",
        "build": "npm run compile && echo 'Compiled website' && npm run rollup_js && echo 'Bundled rollup.js' && npm run compile_rollup && echo 'Compiled rollup.js' && npm run uglify && echo 'Uglified rollup.js'"
    }
}
@@ -130,10 +130,11 @@ include _includes/_mixins
    | capabilities and can be used to mark features that require a
    | respective model to be installed.

p.o-block.o-inline-list
    +tag I'm a tag
    +tag-new(2)
    +tag-model("Named entities")
.o-block
    p.o-inline-list
        +tag I'm a tag
        +tag-new(2)
        +tag-model("Named entities")

+h(3, "icons", "website/_includes/_svg.jade") Icons
@@ -359,18 +360,14 @@ include _includes/_mixins
    script(src="/assets/js/chart.min.js")
    script new Chart('chart_accuracy', { datasets: [] })

+grid
    +grid-col("half")
        +chart("accuracy", 400)
    +chart("accuracy", 400)
    +chart("speed", 300)

    +grid-col("half")
        +chart("speed", 300)

script(src="/assets/js/chart.min.js")
script(src="/assets/js/vendor/chart.min.js")
script.
    Chart.defaults.global.defaultFontFamily = "-apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'";
    new Chart('chart_accuracy', { type: 'bar', options: { legend: false, responsive: true, scales: { yAxes: [{ label: 'Accuracy', ticks: { suggestedMin: 70 } }], xAxes: [{ barPercentage: 0.425 }]}}, data: { labels: ['UAS', 'LAS', 'POS', 'NER F', 'NER P', 'NER R'], datasets: [{ label: 'en_core_web_sm', data: [91.49, 89.66, 97.23, 86.46, 86.78, 86.15], backgroundColor: '#09a3d5' }]}});
    new Chart('chart_speed', { type: 'horizontalBar', options: { legend: false, responsive: true, scales: { xAxes: [{ label: 'Speed', ticks: { suggestedMin: 0 }}], yAxes: [{ barPercentage: 0.425 }]}}, data: { labels: ['w/s CPU', 'w/s GPU'], datasets: [{ label: 'en_core_web_sm', data: [9575, 25531], backgroundColor: '#09a3d5'}]}});
    new Chart('chart_accuracy', { type: 'bar', options: { legend: { position: 'bottom'}, responsive: true, scales: { yAxes: [{ label: 'Accuracy', ticks: { suggestedMin: 70 } }], xAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['UAS', 'LAS', 'POS', 'NER F', 'NER P', 'NER R'], datasets: [{ label: 'en_core_web_sm', data: [91.65, 89.77, 97.05, 84.80, 84.53, 85.06], backgroundColor: '#09a3d5' }, { label: 'en_core_web_lg', data: [91.49, 89.66, 97.23, 86.46, 86.78, 86.15], backgroundColor: '#066B8C'}]}});
    new Chart('chart_speed', { type: 'horizontalBar', options: { legend: { position: 'bottom'}, responsive: true, scales: { xAxes: [{ label: 'Speed', ticks: { suggestedMin: 0 }}], yAxes: [{ barPercentage: 0.75 }]}}, data: { labels: ['w/s CPU', 'w/s GPU'], datasets: [{ label: 'en_core_web_sm', data: [9575, 25531], backgroundColor: '#09a3d5'}, { label: 'en_core_web_lg', data: [8421, 22092], backgroundColor: '#066B8C'}]}});

+section("embeds")
    +h(2, "embeds") Embeds
@@ -79,6 +79,7 @@
    "title": "What's New in v2.0",
    "teaser": "New features, backwards incompatibilities and migration guide.",
    "menu": {
        "Summary": "summary",
        "New features": "features",
        "Backwards Incompatibilities": "incompat",
        "Migrating from v1.x": "migrating",

@@ -116,7 +117,6 @@
    "next": "text-classification",
    "menu": {
        "Basics": "basics",
        "Similarity in Context": "in-context",
        "Custom Vectors": "custom",
        "GPU Usage": "gpu"
    }
@@ -19,6 +19,7 @@
+qs({package: 'source'}) git clone https://github.com/explosion/spaCy
+qs({package: 'source'}) cd spaCy
+qs({package: 'source'}) export PYTHONPATH=`pwd`
+qs({package: 'source'}) pip install -r requirements.txt
+qs({package: 'source'}) pip install -e .
@@ -46,7 +46,6 @@ p
    +item #[strong Chinese]: #[+a("https://github.com/fxsjy/jieba") Jieba]
    +item #[strong Japanese]: #[+a("https://github.com/mocobeta/janome") Janome]
    +item #[strong Thai]: #[+a("https://github.com/wannaphongcom/pythainlp") pythainlp]
    +item #[strong Russian]: #[+a("https://github.com/kmike/pymorphy2") pymorphy2]

+h(3, "multi-language") Multi-language support
    +tag-new(2)
@@ -76,6 +76,16 @@ p
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]

+infobox("Tip: Try the Prodigy annotation tool")
    +infobox-logos(["prodigy", 100, 29, "https://prodi.gy"])
    | If you need to label a lot of data, check out
    | #[+a("https://prodi.gy", true) Prodigy], a new, active learning-powered
    | annotation tool we've developed. Prodigy is fast and extensible, and
    | comes with a modern #[strong web application] that helps you collect
    | training data faster. It integrates seamlessly with spaCy, pre-selects
    | the #[strong most relevant examples] for annotation, and lets you
    | train and evaluate ready-to-use spaCy models.

+h(3, "annotations") Training with annotations

p
@@ -180,9 +190,10 @@ p
    +cell #[code optimizer]
    +cell Callable to update the model's weights.

+infobox
    | For the #[strong full example and more details], see the usage guide on
    | #[+a("/usage/training#ner") training the named entity recognizer],
    | or the runnable
    | #[+src(gh("spaCy", "examples/training/train_ner.py")) training script]
    | on GitHub.
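
//- A minimal sketch of the loop such a script implements, assuming
//- train_data yields (doc, gold) pairs; the iteration count and the
//- shuffling are illustrative, not taken from the script itself.
+aside-code("Example").
    import random
    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(train_data)
        for doc, gold in train_data:
            nlp.update([doc], [gold], sgd=optimizer)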

p
    | Instead of writing your own training loop, you can also use the
    | built-in #[+api("cli#train") #[code train]] command, which expects data
    | in spaCy's #[+a("/api/annotation#json-input") JSON format]. On each epoch,
    | a model will be saved out to the directory. After training, you can
    | use the #[+api("cli#package") #[code package]] command to generate an
    | installable Python package from your model.
@@ -190,7 +190,3 @@ p
    +item
        | #[strong Test] the model to make sure the parser works as expected.

+h(3, "training-json") JSON format for training

include ../../api/_annotation/_training
237
website/usage/_v2/_features.jade
Normal file

@@ -0,0 +1,237 @@
//- 💫 DOCS > USAGE > WHAT'S NEW IN V2.0 > NEW FEATURES

p
    | This section contains an overview of the most important
    | #[strong new features and improvements]. The #[+a("/api") API docs]
    | include additional deprecation notes. New methods and functions that
    | were introduced in this version are marked with a
    | #[span.u-text-tag.u-text-tag--spaced v2.0] tag.

+h(3, "features-models") Convolutional neural network models

+aside-code("Example", "bash")
    for model in ["en", "de", "fr", "es", "pt", "it"]
        | spacy download #{model} # default #{LANGUAGES[model]} model!{'\n'}
    | spacy download xx_ent_wiki_sm # multi-language NER

p
    | spaCy v2.0 features new neural models for tagging,
    | parsing and entity recognition. The models have
    | been designed and implemented from scratch specifically for spaCy, to
    | give you an unmatched balance of speed, size and accuracy. The new
    | models are #[strong 10× smaller], #[strong 20% more accurate],
    | and #[strong just as fast] as the previous generation.
    | #[strong GPU usage] is now supported via
    | #[+a("http://chainer.org") Chainer]'s CuPy module.

+infobox
    | #[+label-inline Usage:] #[+a("/models") Models directory],
    | #[+a("/models/comparison") Models comparison],
    | #[+a("/usage/#gpu") Using spaCy with GPU]

+h(3, "features-pipelines") Improved processing pipelines

+aside-code("Example").
    # Set custom attributes
    Doc.set_extension('my_attr', default=False)
    Token.set_extension('my_attr', getter=my_token_getter)
    assert doc._.my_attr, token._.my_attr

    # Add components to the pipeline
    my_component = lambda doc: doc
    nlp.add_pipe(my_component)

p
    | It's now much easier to #[strong customise the pipeline] with your own
    | components: functions that receive a #[code Doc] object, modify and
    | return it. Extensions let you write any
    | #[strong attributes, properties and methods] to the #[code Doc],
    | #[code Token] and #[code Span]. You can add data, implement new
    | features, integrate other libraries with spaCy or plug in your own
    | machine learning models.
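
//- A minimal sketch of such a component, assuming the #[code my_attr]
//- extension defined in the example above has been registered.
+aside-code("Example").
    def my_component(doc):
        doc._.my_attr = True  # set the custom attribute defined above
        return doc
    nlp.add_pipe(my_component, first=True)  # run before the other components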

+image
    include ../../assets/img/pipeline.svg

+infobox
    | #[+label-inline API:] #[+api("language") #[code Language]],
    | #[+api("doc#set_extension") #[code Doc.set_extension]],
    | #[+api("span#set_extension") #[code Span.set_extension]],
    | #[+api("token#set_extension") #[code Token.set_extension]]
    | #[+label-inline Usage:]
    | #[+a("/usage/processing-pipelines") Processing pipelines]
    | #[+label-inline Code:]
    | #[+src("/usage/examples#section-pipeline") Pipeline examples]

+h(3, "features-text-classification") Text classification

+aside-code("Example").
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat, last=True)
    optimizer = nlp.begin_training()
    for itn in range(100):
        for doc, gold in train_data:
            nlp.update([doc], [gold], sgd=optimizer)
    doc = nlp(u'This is a text.')
    print(doc.cats)

p
    | spaCy v2.0 lets you add text categorization models to spaCy pipelines.
    | The model supports classification with multiple, non-mutually
    | exclusive labels – so multiple labels can apply at once. You can
    | change the model architecture rather easily, but by default, the
    | #[code TextCategorizer] class uses a convolutional neural network to
    | assign position-sensitive vectors to each word in the document.
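
//- Before training, each category has to be added as a label – a small
//- sketch, assuming the #[code textcat] component created above and
//- illustrative label names.
+aside-code("Example").
    textcat.add_label('POSITIVE')
    textcat.add_label('NEGATIVE')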

+infobox
    | #[+label-inline API:] #[+api("textcategorizer") #[code TextCategorizer]],
    | #[+api("doc#attributes") #[code Doc.cats]],
    | #[+api("goldparse#attributes") #[code GoldParse.cats]]#[br]
    | #[+label-inline Usage:] #[+a("/usage/text-classification") Text classification]

+h(3, "features-hash-ids") Hash values instead of integer IDs

+aside-code("Example").
    doc = nlp(u'I love coffee')
    assert doc.vocab.strings[u'coffee'] == 3197928453018144401
    assert doc.vocab.strings[3197928453018144401] == u'coffee'

    beer_hash = doc.vocab.strings.add(u'beer')
    assert doc.vocab.strings[u'beer'] == beer_hash
    assert doc.vocab.strings[beer_hash] == u'beer'

p
    | The #[+api("stringstore") #[code StringStore]] now resolves all strings
    | to hash values instead of integer IDs. This means that the string-to-int
    | mapping #[strong no longer depends on the vocabulary state], making a lot
    | of workflows much simpler, especially during training. Unlike integer IDs
    | in spaCy v1.x, hash values will #[strong always match] – even across
    | models. Strings can now be added explicitly using the new
    | #[+api("stringstore#add") #[code StringStore.add]] method. A token's hash
    | is available via #[code token.orth].
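
//- For example, the hash returned by #[code token.orth] matches the one
//- stored for the same string in the #[code StringStore] – a small sketch:
+aside-code("Example").
    doc = nlp(u'I love coffee')
    assert doc[2].orth == doc.vocab.strings[u'coffee']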

+infobox
    | #[+label-inline API:] #[+api("stringstore") #[code StringStore]]
    | #[+label-inline Usage:] #[+a("/usage/spacy-101#vocab") Vocab, hashes and lexemes 101]

+h(3, "features-vectors") Improved word vectors support

+aside-code("Example").
    for word, vector in vector_data:
        nlp.vocab.set_vector(word, vector)
    nlp.vocab.vectors.from_glove('/path/to/vectors')
    # keep 10000 unique vectors and remap the rest
    nlp.vocab.prune_vectors(10000)
    nlp.to_disk('/model')

p
    | The new #[+api("vectors") #[code Vectors]] class helps the
    | #[code Vocab] manage the vectors assigned to strings, and lets you
    | assign vectors individually, or
    | #[+a("/usage/vectors-similarity#custom-loading-glove") load in GloVe vectors]
    | from a directory. To help you strike a good balance between coverage
    | and memory usage, the #[code Vectors] class lets you map
    | #[strong multiple keys] to the #[strong same row] of the table. If
    | you're using the #[+api("cli#vocab") #[code spacy vocab]] command to
    | create a vocabulary, pruning the vectors will be taken care of
    | automatically. Otherwise, you can use the new
    | #[+api("vocab#prune_vectors") #[code Vocab.prune_vectors]].

+infobox
    | #[+label-inline API:] #[+api("vectors") #[code Vectors]],
    | #[+api("vocab") #[code Vocab]]
    | #[+label-inline Usage:] #[+a("/usage/vectors-similarity") Word vectors and semantic similarity]

+h(3, "features-serializer") Saving, loading and serialization

+aside-code("Example").
    nlp = spacy.load('en') # shortcut link
    nlp = spacy.load('en_core_web_sm') # package
    nlp = spacy.load('/path/to/en') # unicode path
    nlp = spacy.load(Path('/path/to/en')) # pathlib Path

    nlp.to_disk('/path/to/nlp')
    nlp = English().from_disk('/path/to/nlp')

p
    | spaCy's serialization API has been made consistent across classes and
    | objects. All container classes, i.e. #[code Language], #[code Doc],
    | #[code Vocab] and #[code StringStore], now have #[code to_bytes()],
    | #[code from_bytes()], #[code to_disk()] and #[code from_disk()] methods
    | and support the Pickle protocol.
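
//- A minimal sketch of the byte-string API described above.
+aside-code("Example").
    doc_bytes = doc.to_bytes()
    # assumes: from spacy.tokens import Doc
    new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
    assert new_doc.text == doc.text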

p
    | The improved #[code spacy.load] makes loading models easier and more
    | transparent. You can load a model by supplying its
    | #[+a("/usage/models#usage") shortcut link], the name of an installed
    | #[+a("/usage/saving-loading#generating") model package] or a path.
    | The #[code Language] class to initialise will be determined based on the
    | model's settings. For a blank language, you can import the class directly,
    | e.g. #[code from spacy.lang.en import English].

+infobox
    | #[+label-inline API:] #[+api("spacy#load") #[code spacy.load]]
    | #[+label-inline Usage:] #[+a("/usage/saving-loading") Saving and loading]

+h(3, "features-displacy") displaCy visualizer with Jupyter support

+aside-code("Example").
    from spacy import displacy
    doc = nlp(u'This is a sentence about Facebook.')
    displacy.serve(doc, style='dep') # run the web server
    html = displacy.render(doc, style='ent') # generate HTML

p
    | Our popular dependency and named entity visualizers are now an official
    | part of the spaCy library. displaCy can run a simple web server, or
    | generate raw HTML markup or SVG files to be exported. You can pass in one
    | or more docs, and customise the style. displaCy also auto-detects whether
    | you're running #[+a("https://jupyter.org") Jupyter] and will render the
    | visualizations in your notebook.
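
//- A sketch of passing several docs with custom display options (the
//- texts and options shown are illustrative).
+aside-code("Example").
    docs = [nlp(u'This is a sentence.'), nlp(u'This is another one.')]
    displacy.serve(docs, style='dep', options={'compact': True})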

+infobox
    | #[+label-inline API:] #[+api("displacy") #[code displacy]]
    | #[+label-inline Usage:] #[+a("/usage/visualizers") Visualizing spaCy]

+h(3, "features-language") Improved language data and lazy loading

p
    | Language-specific data now lives in its own submodule, #[code spacy.lang].
    | Languages are lazy-loaded, i.e. only loaded when you import a
    | #[code Language] class, or load a model that initialises one. This allows
    | languages to contain more custom data, e.g. lemmatizer lookup tables, or
    | complex regular expressions. The language data has also been tidied up
    | and simplified. spaCy now also supports simple lookup-based
    | lemmatization – and #[strong #{LANG_COUNT} languages] in total!

+infobox
    | #[+label-inline API:] #[+api("language") #[code Language]]
    | #[+label-inline Code:] #[+src(gh("spaCy", "spacy/lang")) #[code spacy/lang]]
    | #[+label-inline Usage:] #[+a("/usage/adding-languages") Adding languages]

+h(3, "features-matcher") Revised matcher API and phrase matcher

+aside-code("Example").
    from spacy.matcher import Matcher, PhraseMatcher

    matcher = Matcher(nlp.vocab)
    matcher.add('HEARTS', None, [{'ORTH': '❤️', 'OP': '+'}])

    phrasematcher = PhraseMatcher(nlp.vocab)
    phrasematcher.add('OBAMA', None, nlp(u"Barack Obama"))

p
    | Patterns can now be added to the matcher by calling
    | #[+api("matcher-add") #[code matcher.add()]] with a match ID, an optional
    | callback function to be invoked on each match, and one or more patterns.
    | This allows you to write powerful, pattern-specific logic using only one
    | matcher. For example, you might only want to merge some entity types,
    | and set custom flags for other matched patterns. The new
    | #[+api("phrasematcher") #[code PhraseMatcher]] lets you efficiently
    | match very large terminology lists using #[code Doc] objects as match
    | patterns.
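
//- A sketch of the optional on-match callback mentioned above, passed in
//- place of #[code None] in the earlier example.
+aside-code("Example").
    def on_match(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        print('Matched:', doc[start:end].text)
    matcher.add('HEARTS', on_match, [{'ORTH': '❤️', 'OP': '+'}])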

+infobox
    | #[+label-inline API:] #[+api("matcher") #[code Matcher]],
    | #[+api("phrasematcher") #[code PhraseMatcher]]
    | #[+label-inline Usage:] #[+a("/usage/rule-based-matching") Rule-based matching]
Some files were not shown because too many files have changed in this diff.