Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-13 10:46:29 +03:00)

Commit 309da78bf0: Merge branch 'master' into tokenizer_exceptions

.github/contributors/wallinm1.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry          |
|-------------------------------|----------------|
| Name                          | Michael Wallin |
| Company name (if applicable)  |                |
| Title or role (if applicable) |                |
| Date                          | 2017-02-04     |
| GitHub username               | wallinm1       |
| Website (optional)            |                |
@@ -19,7 +19,7 @@ First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplos

 If you're looking for help with your code, consider posting a question on [StackOverflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you tag it `spacy` and `python`, more people will see it and hopefully be able to help.

-When opening an issue, use a descriptive title and include your environment (operating system, Python version, spaCy version). Our [issue template](https://github.com/explosion/spaCy/issues/new) helps you remember the most important details to include.
+When opening an issue, use a descriptive title and include your environment (operating system, Python version, spaCy version). Our [issue template](https://github.com/explosion/spaCy/issues/new) helps you remember the most important details to include. **Pro tip:** If you need to share long blocks of code or logs, you can wrap them in `<details>` and `</details>`. This [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) so it only becomes visible on click, making the issue easier to read and follow.

 If you've discovered a bug, you can also submit a [regression test](#fixing-bugs) straight away. When you're opening an issue to report the bug, simply refer to your pull request in the issue body.
@@ -22,12 +22,14 @@ This is a list of everyone who has made significant contributions to spaCy, in a
 * Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
 * Matthew Honnibal, [@honnibal](https://github.com/honnibal)
 * Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
+* Michael Wallin, [@wallinm1](https://github.com/wallinm1)
 * Oleg Zd, [@olegzd](https://github.com/olegzd)
 * Pokey Rule, [@pokey](https://github.com/pokey)
 * Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
 * Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
 * Sam Bozek, [@sambozek](https://github.com/sambozek)
 * Sasho Savkov [@savkov](https://github.com/savkov)
+* Thomas Tanon, [@Tpt](https://github.com/Tpt)
 * Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
 * Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
 * Wah Loon Keng, [@kengz](https://github.com/kengz)
@@ -5,8 +5,8 @@ spaCy is a library for advanced natural language processing in Python and
 Cython. spaCy is built on the very latest research, but it isn't researchware.
 It was designed from day one to be used in real products. spaCy currently supports
 English and German, as well as tokenization for Chinese, Spanish, Italian, French,
-Portuguese, Dutch, Swedish and Hungarian. It's commercial open-source software,
-released under the MIT license.
+Portuguese, Dutch, Swedish, Finnish and Hungarian. It's commercial open-source
+software, released under the MIT license.

 💫 **Version 1.6 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
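The hunk above only touches the language list, but it is worth spelling out what "supports" means here. A minimal usage sketch, assuming the English model data is installed (spaCy 1.x API, as used elsewhere in this commit):

    import spacy

    nlp = spacy.load('en')                  # full pipeline for English
    doc = nlp(u'London is a big city in the United Kingdom.')
    print([(token.text, token.pos_) for token in doc])

For the languages listed as tokenization-only (Chinese, Spanish, ..., Finnish), no statistical models ship yet; only the rule-based tokenization added in commits like this one works out of the box. The Finnish section further down shows how to call such a tokenizer directly.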
@ -12,17 +12,23 @@ from spacy_hook import create_similarity_pipeline
|
||||||
|
|
||||||
from keras_decomposable_attention import build_model
|
from keras_decomposable_attention import build_model
|
||||||
|
|
||||||
|
try:
|
||||||
|
import cPickle as pickle
|
||||||
|
except ImportError:
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
|
||||||
def train(model_dir, train_loc, dev_loc, shape, settings):
|
def train(model_dir, train_loc, dev_loc, shape, settings):
|
||||||
train_texts1, train_texts2, train_labels = read_snli(train_loc)
|
train_texts1, train_texts2, train_labels = read_snli(train_loc)
|
||||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||||
|
|
||||||
print("Loading spaCy")
|
print("Loading spaCy")
|
||||||
nlp = spacy.load('en')
|
nlp = spacy.load('en')
|
||||||
|
assert nlp.path is not None
|
||||||
print("Compiling network")
|
print("Compiling network")
|
||||||
model = build_model(get_embeddings(nlp.vocab), shape, settings)
|
model = build_model(get_embeddings(nlp.vocab), shape, settings)
|
||||||
print("Processing texts...")
|
print("Processing texts...")
|
||||||
Xs = []
|
Xs = []
|
||||||
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
|
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
|
||||||
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
|
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
|
||||||
max_length=shape[0],
|
max_length=shape[0],
|
||||||
|
@ -36,35 +42,41 @@ def train(model_dir, train_loc, dev_loc, shape, settings):
|
||||||
validation_data=([dev_X1, dev_X2], dev_labels),
|
validation_data=([dev_X1, dev_X2], dev_labels),
|
||||||
nb_epoch=settings['nr_epoch'],
|
nb_epoch=settings['nr_epoch'],
|
||||||
batch_size=settings['batch_size'])
|
batch_size=settings['batch_size'])
|
||||||
|
if not (nlp.path / 'similarity').exists():
|
||||||
|
(nlp.path / 'similarity').mkdir()
|
||||||
|
print("Saving to", model_dir / 'similarity')
|
||||||
|
weights = model.get_weights()
|
||||||
|
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
|
||||||
|
pickle.dump(weights[1:], file_)
|
||||||
|
with (nlp.path / 'similarity' / 'config.json').open('wb') as file_:
|
||||||
|
file_.write(model.to_json())
|
||||||
|
|
||||||
|
|
||||||
def evaluate(model_dir, dev_loc):
|
def evaluate(model_dir, dev_loc):
|
||||||
nlp = spacy.load('en', path=model_dir,
|
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||||
tagger=False, parser=False, entity=False, matcher=False,
|
nlp = spacy.load('en',
|
||||||
create_pipeline=create_similarity_pipeline)
|
create_pipeline=create_similarity_pipeline)
|
||||||
n = 0
|
total = 0.
|
||||||
correct = 0
|
correct = 0.
|
||||||
for (text1, text2), label in zip(dev_texts, dev_labels):
|
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
||||||
doc1 = nlp(text1)
|
doc1 = nlp(text1)
|
||||||
doc2 = nlp(text2)
|
doc2 = nlp(text2)
|
||||||
sim = doc1.similarity(doc2)
|
sim = doc1.similarity(doc2)
|
||||||
if bool(sim >= 0.5) == label:
|
if sim.argmax() == label.argmax():
|
||||||
correct += 1
|
correct += 1
|
||||||
n += 1
|
total += 1
|
||||||
return correct, total
|
return correct, total
|
||||||
|
|
||||||
|
|
||||||
def demo(model_dir):
|
def demo(model_dir):
|
||||||
nlp = spacy.load('en', path=model_dir,
|
nlp = spacy.load('en', path=model_dir,
|
||||||
tagger=False, parser=False, entity=False, matcher=False,
|
|
||||||
create_pipeline=create_similarity_pipeline)
|
create_pipeline=create_similarity_pipeline)
|
||||||
doc1 = nlp(u'Worst fries ever! Greasy and horrible...')
|
doc1 = nlp(u'What were the best crime fiction books in 2016?')
|
||||||
doc2 = nlp(u'The milkshakes are good. The fries are bad.')
|
doc2 = nlp(
|
||||||
print('doc1.similarity(doc2)', doc1.similarity(doc2))
|
u'What should I read that was published last year? I like crime stories.')
|
||||||
sent1a, sent1b = doc1.sents
|
print(doc1)
|
||||||
print('sent1a.similarity(sent1b)', sent1a.similarity(sent1b))
|
print(doc2)
|
||||||
print('sent1a.similarity(doc2)', sent1a.similarity(doc2))
|
print("Similarity", doc1.similarity(doc2))
|
||||||
print('sent1b.similarity(doc2)', sent1b.similarity(doc2))
|
|
||||||
|
|
||||||
|
|
||||||
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
|
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
|
||||||
|
@ -119,7 +131,8 @@ def main(mode, model_dir, train_loc, dev_loc,
|
||||||
if mode == 'train':
|
if mode == 'train':
|
||||||
train(model_dir, train_loc, dev_loc, shape, settings)
|
train(model_dir, train_loc, dev_loc, shape, settings)
|
||||||
elif mode == 'evaluate':
|
elif mode == 'evaluate':
|
||||||
evaluate(model_dir, dev_loc)
|
correct, total = evaluate(model_dir, dev_loc)
|
||||||
|
print(correct, '/', total, correct / total)
|
||||||
else:
|
else:
|
||||||
demo(model_dir)
|
demo(model_dir)
|
||||||
|
|
||||||
|
|
|
@ -12,6 +12,8 @@ from keras.models import Sequential, Model, model_from_json
|
||||||
from keras.regularizers import l2
|
from keras.regularizers import l2
|
||||||
from keras.optimizers import Adam
|
from keras.optimizers import Adam
|
||||||
from keras.layers.normalization import BatchNormalization
|
from keras.layers.normalization import BatchNormalization
|
||||||
|
from keras.layers.pooling import GlobalAveragePooling1D, GlobalMaxPooling1D
|
||||||
|
from keras.layers import Merge
|
||||||
|
|
||||||
|
|
||||||
def build_model(vectors, shape, settings):
|
def build_model(vectors, shape, settings):
|
||||||
|
@ -29,11 +31,11 @@ def build_model(vectors, shape, settings):
|
||||||
align = _SoftAlignment(max_length, nr_hidden)
|
align = _SoftAlignment(max_length, nr_hidden)
|
||||||
compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout'])
|
compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout'])
|
||||||
entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout'])
|
entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout'])
|
||||||
|
|
||||||
# Declare the model as a computational graph.
|
# Declare the model as a computational graph.
|
||||||
sent1 = embed(ids1) # Shape: (i, n)
|
sent1 = embed(ids1) # Shape: (i, n)
|
||||||
sent2 = embed(ids2) # Shape: (j, n)
|
sent2 = embed(ids2) # Shape: (j, n)
|
||||||
|
|
||||||
if settings['gru_encode']:
|
if settings['gru_encode']:
|
||||||
sent1 = encode(sent1)
|
sent1 = encode(sent1)
|
||||||
sent2 = encode(sent2)
|
sent2 = encode(sent2)
|
||||||
|
@ -42,12 +44,12 @@ def build_model(vectors, shape, settings):
|
||||||
|
|
||||||
align1 = align(sent2, attention)
|
align1 = align(sent2, attention)
|
||||||
align2 = align(sent1, attention, transpose=True)
|
align2 = align(sent1, attention, transpose=True)
|
||||||
|
|
||||||
feats1 = compare(sent1, align1)
|
feats1 = compare(sent1, align1)
|
||||||
feats2 = compare(sent2, align2)
|
feats2 = compare(sent2, align2)
|
||||||
|
|
||||||
scores = entail(feats1, feats2)
|
scores = entail(feats1, feats2)
|
||||||
|
|
||||||
# Now that we have the input/output, we can construct the Model object...
|
# Now that we have the input/output, we can construct the Model object...
|
||||||
model = Model(input=[ids1, ids2], output=[scores])
|
model = Model(input=[ids1, ids2], output=[scores])
|
||||||
|
|
||||||
|
@ -93,7 +95,7 @@ class _StaticEmbedding(object):
|
||||||
def get_output_shape(shapes):
|
def get_output_shape(shapes):
|
||||||
print(shapes)
|
print(shapes)
|
||||||
return shapes[0]
|
return shapes[0]
|
||||||
mod_sent = self.mod_ids(sentence)
|
mod_sent = self.mod_ids(sentence)
|
||||||
tuning = self.tune(mod_sent)
|
tuning = self.tune(mod_sent)
|
||||||
#tuning = merge([tuning, mod_sent],
|
#tuning = merge([tuning, mod_sent],
|
||||||
# mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
|
# mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
|
||||||
|
@ -129,7 +131,7 @@ class _Attention(object):
|
||||||
self.model.add(Dense(nr_hidden, name='attend2',
|
self.model.add(Dense(nr_hidden, name='attend2',
|
||||||
init='he_normal', W_regularizer=l2(L2), activation='relu'))
|
init='he_normal', W_regularizer=l2(L2), activation='relu'))
|
||||||
self.model = TimeDistributed(self.model)
|
self.model = TimeDistributed(self.model)
|
||||||
|
|
||||||
def __call__(self, sent1, sent2):
|
def __call__(self, sent1, sent2):
|
||||||
def _outer(AB):
|
def _outer(AB):
|
||||||
att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
|
att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
|
||||||
|
@ -158,7 +160,7 @@ class _SoftAlignment(object):
|
||||||
return K.batch_dot(sm_att, mat)
|
return K.batch_dot(sm_att, mat)
|
||||||
return merge([attention, sentence], mode=_normalize_attention,
|
return merge([attention, sentence], mode=_normalize_attention,
|
||||||
output_shape=(self.max_length, self.nr_hidden)) # Shape: (i, n)
|
output_shape=(self.max_length, self.nr_hidden)) # Shape: (i, n)
|
||||||
|
|
||||||
|
|
||||||
class _Comparison(object):
|
class _Comparison(object):
|
||||||
def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
|
def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
|
||||||
|
@ -176,10 +178,12 @@ class _Comparison(object):
|
||||||
|
|
||||||
def __call__(self, sent, align, **kwargs):
|
def __call__(self, sent, align, **kwargs):
|
||||||
result = self.model(merge([sent, align], mode='concat')) # Shape: (i, n)
|
result = self.model(merge([sent, align], mode='concat')) # Shape: (i, n)
|
||||||
result = _GlobalSumPooling1D()(result, mask=self.words)
|
avged = GlobalAveragePooling1D()(result, mask=self.words)
|
||||||
result = BatchNormalization()(result)
|
maxed = GlobalMaxPooling1D()(result, mask=self.words)
|
||||||
|
merged = merge([avged, maxed])
|
||||||
|
result = BatchNormalization()(merged)
|
||||||
return result
|
return result
|
||||||
|
|
||||||
|
|
||||||
class _Entailment(object):
|
class _Entailment(object):
|
||||||
def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
|
def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
|
||||||
|
@ -251,7 +255,7 @@ def test_fit_model():
|
||||||
shape = (10, 16, 3)
|
shape = (10, 16, 3)
|
||||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
|
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
|
||||||
model = build_model(vectors, shape, settings)
|
model = build_model(vectors, shape, settings)
|
||||||
|
|
||||||
train_X = _generate_X(20, shape[0], vectors.shape[1])
|
train_X = _generate_X(20, shape[0], vectors.shape[1])
|
||||||
train_Y = _generate_Y(20, shape[2])
|
train_Y = _generate_Y(20, shape[2])
|
||||||
dev_X = _generate_X(15, shape[0], vectors.shape[1])
|
dev_X = _generate_X(15, shape[0], vectors.shape[1])
|
||||||
|
@ -261,6 +265,4 @@ def test_fit_model():
|
||||||
batch_size=4)
|
batch_size=4)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
__all__ = [build_model]
|
__all__ = [build_model]
|
||||||
|
|
|
@ -1,33 +1,40 @@
|
||||||
from keras.models import model_from_json
|
from keras.models import model_from_json
|
||||||
import numpy
|
import numpy
|
||||||
import numpy.random
|
import numpy.random
|
||||||
|
import json
|
||||||
|
from spacy.tokens.span import Span
|
||||||
|
|
||||||
|
try:
|
||||||
|
import cPickle as pickle
|
||||||
|
except ImportError:
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
|
||||||
class KerasSimilarityShim(object):
|
class KerasSimilarityShim(object):
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, nlp, get_features=None):
|
def load(cls, path, nlp, get_features=None, max_length=100):
|
||||||
if get_features is None:
|
if get_features is None:
|
||||||
get_features = doc2ids
|
get_features = get_word_ids
|
||||||
with (path / 'config.json').open() as file_:
|
with (path / 'config.json').open() as file_:
|
||||||
config = json.load(file_)
|
model = model_from_json(file_.read())
|
||||||
model = model_from_json(config['model'])
|
|
||||||
with (path / 'model').open('rb') as file_:
|
with (path / 'model').open('rb') as file_:
|
||||||
weights = pickle.load(file_)
|
weights = pickle.load(file_)
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
embeddings = get_embeddings(nlp.vocab)
|
||||||
model.set_weights([embeddings] + weights)
|
model.set_weights([embeddings] + weights)
|
||||||
return cls(model, get_features=get_features)
|
return cls(model, get_features=get_features, max_length=max_length)
|
||||||
|
|
||||||
def __init__(self, model, get_features=None):
|
def __init__(self, model, get_features=None, max_length=100):
|
||||||
self.model = model
|
self.model = model
|
||||||
self.get_features = get_features
|
self.get_features = get_features
|
||||||
|
self.max_length = max_length
|
||||||
|
|
||||||
def __call__(self, doc):
|
def __call__(self, doc):
|
||||||
doc.user_hooks['similarity'] = self.predict
|
doc.user_hooks['similarity'] = self.predict
|
||||||
doc.user_span_hooks['similarity'] = self.predict
|
doc.user_span_hooks['similarity'] = self.predict
|
||||||
|
|
||||||
def predict(self, doc1, doc2):
|
def predict(self, doc1, doc2):
|
||||||
x1 = self.get_features(doc1)
|
x1 = self.get_features([doc1], max_length=self.max_length, tree_truncate=True)
|
||||||
x2 = self.get_features(doc2)
|
x2 = self.get_features([doc2], max_length=self.max_length, tree_truncate=True)
|
||||||
scores = self.model.predict([x1, x2])
|
scores = self.model.predict([x1, x2])
|
||||||
return scores[0]
|
return scores[0]
|
||||||
|
|
||||||
|
@ -45,7 +52,10 @@ def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr
|
||||||
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
|
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
|
||||||
for i, doc in enumerate(docs):
|
for i, doc in enumerate(docs):
|
||||||
if tree_truncate:
|
if tree_truncate:
|
||||||
queue = [sent.root for sent in doc.sents]
|
if isinstance(doc, Span):
|
||||||
|
queue = [doc.root]
|
||||||
|
else:
|
||||||
|
queue = [sent.root for sent in doc.sents]
|
||||||
else:
|
else:
|
||||||
queue = list(doc)
|
queue = list(doc)
|
||||||
words = []
|
words = []
|
||||||
|
@ -71,7 +81,9 @@ def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr
|
||||||
|
|
||||||
|
|
||||||
def create_similarity_pipeline(nlp):
|
def create_similarity_pipeline(nlp):
|
||||||
return [SimilarityModel.load(
|
return [
|
||||||
nlp.path / 'similarity',
|
nlp.tagger,
|
||||||
nlp,
|
nlp.entity,
|
||||||
feature_extracter=get_features)]
|
nlp.parser,
|
||||||
|
KerasSimilarityShim.load(nlp.path / 'similarity', nlp, max_length=10)
|
||||||
|
]
|
||||||
|
|
setup.py (+1 line)

@@ -31,6 +31,7 @@ PACKAGES = [
     'spacy.pt',
     'spacy.nl',
     'spacy.sv',
+    'spacy.fi',
     'spacy.language_data',
     'spacy.serialize',
     'spacy.syntax',
@@ -13,7 +13,7 @@ from . import fr
 from . import pt
 from . import nl
 from . import sv
+from . import fi

 try:
     basestring

@@ -31,6 +31,8 @@ set_lang_class(hu.Hungarian.lang, hu.Hungarian)
 set_lang_class(zh.Chinese.lang, zh.Chinese)
 set_lang_class(nl.Dutch.lang, nl.Dutch)
 set_lang_class(sv.Swedish.lang, sv.Swedish)
+set_lang_class(fi.Finnish.lang, fi.Finnish)


 def load(name, **overrides):
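The second hunk registers the new class in spaCy's language registry, keyed by its ISO code. The effect is roughly the following (a simplified illustration, not spaCy's actual implementation; the _LANGUAGES name and the error message are stand-ins):

    # Simplified sketch of a lang-code -> Language-class registry.
    _LANGUAGES = {}

    def set_lang_class(name, cls):
        # e.g. set_lang_class('fi', Finnish)
        _LANGUAGES[name] = cls

    def get_lang_class(name):
        # spacy.load('fi', ...) can now resolve the 'fi' code to a class
        if name not in _LANGUAGES:
            raise RuntimeError('Language not supported: %s' % name)
        return _LANGUAGES[name]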
spacy/fi/__init__.py (new file, 17 lines)

@@ -0,0 +1,17 @@
# encoding: utf8
from __future__ import unicode_literals, print_function

from ..language import Language
from ..attrs import LANG
from .language_data import *


class Finnish(Language):
    lang = 'fi'

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'fi'

        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        stop_words = STOP_WORDS
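With this module in place, the Finnish data can be exercised without any statistical model. A short usage sketch (the create_tokenizer() call mirrors the test fixture added later in this commit):

    from spacy.fi import Finnish

    tokenizer = Finnish.Defaults.create_tokenizer()
    doc = tokenizer(u'Hyvää uutta vuotta t. siht. Niemelä!')
    print([t.text for t in doc])
    # 't.' and 'siht.' survive as single tokens thanks to the
    # TOKENIZER_EXCEPTIONS wired up in Defaults above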
spacy/fi/language_data.py (new file, 17 lines)

@@ -0,0 +1,17 @@
# encoding: utf8
from __future__ import unicode_literals

from .. import language_data as base
from ..language_data import update_exc, strings_to_exc

from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS


STOP_WORDS = set(STOP_WORDS)

TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))


__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]
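The two helpers do the merging work here: strings_to_exc turns a plain list of strings (such as the shared EMOTICONS) into exception entries, and update_exc folds them into the language-specific table. Roughly the following, as a sketch of the intended behaviour rather than the actual helper source (which lives in spacy/language_data):

    from spacy.attrs import ORTH

    def strings_to_exc(strings):
        # "(^_^)" -> {"(^_^)": [{ORTH: "(^_^)"}]}
        return {s: [{ORTH: s}] for s in strings}

    def update_exc(exc, additions):
        # merge the new special cases into the existing exception dict
        exc.update(additions)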
114
spacy/fi/stop_words.py
Normal file
114
spacy/fi/stop_words.py
Normal file
|
@ -0,0 +1,114 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
|
||||||
|
# Reformatted with some minor corrections
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
|
||||||
|
aiemmin aika aikaa aikaan aikaisemmin aikaisin aikana aikoina aikoo aikovat
|
||||||
|
aina ainakaan ainakin ainoa ainoat aiomme aion aiotte aivan ajan alas alemmas
|
||||||
|
alkuisin alkuun alla alle aloitamme aloitan aloitat aloitatte aloitattivat
|
||||||
|
aloitettava aloitettavaksi aloitettu aloitimme aloitin aloitit aloititte
|
||||||
|
aloittaa aloittamatta aloitti aloittivat alta aluksi alussa alusta annettavaksi
|
||||||
|
annettava annettu ansiosta antaa antamatta antoi apu asia asiaa asian asiasta
|
||||||
|
asiat asioiden asioihin asioita asti avuksi avulla avun avutta
|
||||||
|
|
||||||
|
edelle edelleen edellä edeltä edemmäs edes edessä edestä ehkä ei eikä eilen
|
||||||
|
eivät eli ellei elleivät ellemme ellen ellet ellette emme en enemmän eniten
|
||||||
|
ennen ensi ensimmäinen ensimmäiseksi ensimmäisen ensimmäisenä ensimmäiset
|
||||||
|
ensimmäisiksi ensimmäisinä ensimmäisiä ensimmäistä ensin entinen entisen
|
||||||
|
entisiä entisten entistä enää eri erittäin erityisesti eräiden eräs eräät esi
|
||||||
|
esiin esillä esimerkiksi et eteen etenkin ette ettei että
|
||||||
|
|
||||||
|
halua haluaa haluamatta haluamme haluan haluat haluatte haluavat halunnut
|
||||||
|
halusi halusimme halusin halusit halusitte halusivat halutessa haluton he hei
|
||||||
|
heidän heidät heihin heille heillä heiltä heissä heistä heitä helposti heti
|
||||||
|
hetkellä hieman hitaasti huolimatta huomenna hyvien hyviin hyviksi hyville
|
||||||
|
hyviltä hyvin hyvinä hyvissä hyvistä hyviä hyvä hyvät hyvää hän häneen hänelle
|
||||||
|
hänellä häneltä hänen hänessä hänestä hänet häntä
|
||||||
|
|
||||||
|
ihan ilman ilmeisesti itse itsensä itseään
|
||||||
|
|
||||||
|
ja jo johon joiden joihin joiksi joilla joille joilta joina joissa joista joita
|
||||||
|
joka jokainen jokin joko joksi joku jolla jolle jolloin jolta jompikumpi jona
|
||||||
|
jonka jonkin jonne joo jopa jos joskus jossa josta jota jotain joten jotenkin
|
||||||
|
jotenkuten jotka jotta jouduimme jouduin jouduit jouduitte joudumme joudun
|
||||||
|
joudutte joukkoon joukossa joukosta joutua joutui joutuivat joutumaan joutuu
|
||||||
|
joutuvat juuri jälkeen jälleen jää
|
||||||
|
|
||||||
|
kahdeksan kahdeksannen kahdella kahdelle kahdelta kahden kahdessa kahdesta
|
||||||
|
kahta kahteen kai kaiken kaikille kaikilta kaikkea kaikki kaikkia kaikkiaan
|
||||||
|
kaikkialla kaikkialle kaikkialta kaikkien kaikkiin kaksi kannalta kannattaa
|
||||||
|
kanssa kanssaan kanssamme kanssani kanssanne kanssasi kauan kauemmas kaukana
|
||||||
|
kautta kehen keiden keihin keiksi keille keillä keiltä keinä keissä keistä
|
||||||
|
keitten keittä keitä keneen keneksi kenelle kenellä keneltä kenen kenenä
|
||||||
|
kenessä kenestä kenet kenettä kenties kerran kerta kertaa keskellä kesken
|
||||||
|
keskimäärin ketkä ketä kiitos kohti koko kokonaan kolmas kolme kolmen kolmesti
|
||||||
|
koska koskaan kovin kuin kuinka kuinkaan kuitenkaan kuitenkin kuka kukaan kukin
|
||||||
|
kumpainen kumpainenkaan kumpi kumpikaan kumpikin kun kuten kuuden kuusi kuutta
|
||||||
|
kylliksi kyllä kymmenen kyse
|
||||||
|
|
||||||
|
liian liki lisäksi lisää lla luo luona lähekkäin lähelle lähellä läheltä
|
||||||
|
lähemmäs lähes lähinnä lähtien läpi
|
||||||
|
|
||||||
|
mahdollisimman mahdollista me meidän meidät meihin meille meillä meiltä meissä
|
||||||
|
meistä meitä melkein melko menee menemme menen menet menette menevät meni
|
||||||
|
menimme menin menit menivät mennessä mennyt menossa mihin miksi mikä mikäli
|
||||||
|
mikään mille milloin milloinkan millä miltä minkä minne minua minulla minulle
|
||||||
|
minulta minun minussa minusta minut minuun minä missä mistä miten mitkä mitä
|
||||||
|
mitään moi molemmat mones monesti monet moni moniaalla moniaalle moniaalta
|
||||||
|
monta muassa muiden muita muka mukaan mukaansa mukana mutta muu muualla muualle
|
||||||
|
muualta muuanne muulloin muun muut muuta muutama muutaman muuten myöhemmin myös
|
||||||
|
myöskin myöskään myötä
|
||||||
|
|
||||||
|
ne neljä neljän neljää niiden niihin niiksi niille niillä niiltä niin niinä
|
||||||
|
niissä niistä niitä noiden noihin noiksi noilla noille noilta noin noina noissa
|
||||||
|
noista noita nopeammin nopeasti nopeiten nro nuo nyt näiden näihin näiksi
|
||||||
|
näille näillä näiltä näin näinä näissä näistä näitä nämä
|
||||||
|
|
||||||
|
ohi oikea oikealla oikein ole olemme olen olet olette oleva olevan olevat oli
|
||||||
|
olimme olin olisi olisimme olisin olisit olisitte olisivat olit olitte olivat
|
||||||
|
olla olleet ollut oma omaa omaan omaksi omalle omalta oman omassa omat omia
|
||||||
|
omien omiin omiksi omille omilta omissa omista on onkin onko ovat
|
||||||
|
|
||||||
|
paikoittain paitsi pakosti paljon paremmin parempi parhaillaan parhaiten
|
||||||
|
perusteella peräti pian pieneen pieneksi pienelle pienellä pieneltä pienempi
|
||||||
|
pienestä pieni pienin poikki puolesta puolestaan päälle
|
||||||
|
|
||||||
|
runsaasti
|
||||||
|
|
||||||
|
saakka sama samaa samaan samalla saman samat samoin sata sataa satojen se
|
||||||
|
seitsemän sekä sen seuraavat siellä sieltä siihen siinä siis siitä sijaan siksi
|
||||||
|
sille silloin sillä silti siltä sinne sinua sinulla sinulle sinulta sinun
|
||||||
|
sinussa sinusta sinut sinuun sinä sisäkkäin sisällä siten sitten sitä ssa sta
|
||||||
|
suoraan suuntaan suuren suuret suuri suuria suurin suurten
|
||||||
|
|
||||||
|
taa taas taemmas tahansa tai takaa takaisin takana takia tallä tapauksessa
|
||||||
|
tarpeeksi tavalla tavoitteena te teidän teidät teihin teille teillä teiltä
|
||||||
|
teissä teistä teitä tietysti todella toinen toisaalla toisaalle toisaalta
|
||||||
|
toiseen toiseksi toisella toiselle toiselta toisemme toisen toisensa toisessa
|
||||||
|
toisesta toista toistaiseksi toki tosin tuhannen tuhat tule tulee tulemme tulen
|
||||||
|
tulet tulette tulevat tulimme tulin tulisi tulisimme tulisin tulisit tulisitte
|
||||||
|
tulisivat tulit tulitte tulivat tulla tulleet tullut tuntuu tuo tuohon tuoksi
|
||||||
|
tuolla tuolle tuolloin tuolta tuon tuona tuonne tuossa tuosta tuota tuskin tykö
|
||||||
|
tähän täksi tälle tällä tällöin tältä tämä tämän tänne tänä tänään tässä tästä
|
||||||
|
täten tätä täysin täytyvät täytyy täällä täältä
|
||||||
|
|
||||||
|
ulkopuolella usea useasti useimmiten usein useita uudeksi uudelleen uuden uudet
|
||||||
|
uusi uusia uusien uusinta uuteen uutta
|
||||||
|
|
||||||
|
vaan vai vaiheessa vaikea vaikean vaikeat vaikeilla vaikeille vaikeilta
|
||||||
|
vaikeissa vaikeista vaikka vain varmasti varsin varsinkin varten vasen
|
||||||
|
vasemmalla vasta vastaan vastakkain vastan verran vielä vierekkäin vieressä
|
||||||
|
vieri viiden viime viimeinen viimeisen viimeksi viisi voi voidaan voimme voin
|
||||||
|
voisi voit voitte voivat vuoden vuoksi vuosi vuosien vuosina vuotta vähemmän
|
||||||
|
vähintään vähiten vähän välillä
|
||||||
|
|
||||||
|
yhdeksän yhden yhdessä yhteen yhteensä yhteydessä yhteyteen yhtä yhtäälle
|
||||||
|
yhtäällä yhtäältä yhtään yhä yksi yksin yksittäin yleensä ylemmäs yli ylös
|
||||||
|
ympäri
|
||||||
|
|
||||||
|
älköön älä
|
||||||
|
|
||||||
|
""".split())
|
spacy/fi/tokenizer_exceptions.py (new file, 202 lines)

@@ -0,0 +1,202 @@
# encoding: utf8
from __future__ import unicode_literals

from ..symbols import *
from ..language_data import PRON_LEMMA

# Source https://www.cs.tut.fi/~jkorpela/kielenopas/5.5.html

TOKENIZER_EXCEPTIONS = {
    "aik.": [{ORTH: "aik.", LEMMA: "aikaisempi"}],
    "alk.": [{ORTH: "alk.", LEMMA: "alkaen"}],
    "alv.": [{ORTH: "alv.", LEMMA: "arvonlisävero"}],
    "ark.": [{ORTH: "ark.", LEMMA: "arkisin"}],
    "as.": [{ORTH: "as.", LEMMA: "asunto"}],
    "ed.": [{ORTH: "ed.", LEMMA: "edellinen"}],
    "esim.": [{ORTH: "esim.", LEMMA: "esimerkki"}],
    "huom.": [{ORTH: "huom.", LEMMA: "huomautus"}],
    "jne.": [{ORTH: "jne.", LEMMA: "ja niin edelleen"}],
    "joht.": [{ORTH: "joht.", LEMMA: "johtaja"}],
    "k.": [{ORTH: "k.", LEMMA: "kuollut"}],
    "ks.": [{ORTH: "ks.", LEMMA: "katso"}],
    "lk.": [{ORTH: "lk.", LEMMA: "luokka"}],
    "lkm.": [{ORTH: "lkm.", LEMMA: "lukumäärä"}],
    "lyh.": [{ORTH: "lyh.", LEMMA: "lyhenne"}],
    "läh.": [{ORTH: "läh.", LEMMA: "lähettäjä"}],
    "miel.": [{ORTH: "miel.", LEMMA: "mieluummin"}],
    "milj.": [{ORTH: "milj.", LEMMA: "miljoona"}],
    "mm.": [{ORTH: "mm.", LEMMA: "muun muassa"}],
    "myöh.": [{ORTH: "myöh.", LEMMA: "myöhempi"}],
    "n.": [{ORTH: "n.", LEMMA: "noin"}],
    "nimim.": [{ORTH: "nimim.", LEMMA: "nimimerkki"}],
    "ns.": [{ORTH: "ns.", LEMMA: "niin sanottu"}],
    "nyk.": [{ORTH: "nyk.", LEMMA: "nykyinen"}],
    "oik.": [{ORTH: "oik.", LEMMA: "oikealla"}],
    "os.": [{ORTH: "os.", LEMMA: "osoite"}],
    "p.": [{ORTH: "p.", LEMMA: "päivä"}],
    "par.": [{ORTH: "par.", LEMMA: "paremmin"}],
    "per.": [{ORTH: "per.", LEMMA: "perustettu"}],
    "pj.": [{ORTH: "pj.", LEMMA: "puheenjohtaja"}],
    "puh.joht.": [{ORTH: "puh.joht.", LEMMA: "puheenjohtaja"}],
    "prof.": [{ORTH: "prof.", LEMMA: "professori"}],
    "puh.": [{ORTH: "puh.", LEMMA: "puhelin"}],
    "pvm.": [{ORTH: "pvm.", LEMMA: "päivämäärä"}],
    "rak.": [{ORTH: "rak.", LEMMA: "rakennettu"}],
    "ry.": [{ORTH: "ry.", LEMMA: "rekisteröity yhdistys"}],
    "s.": [{ORTH: "s.", LEMMA: "sivu"}],
    "siht.": [{ORTH: "siht.", LEMMA: "sihteeri"}],
    "synt.": [{ORTH: "synt.", LEMMA: "syntynyt"}],
    "t.": [{ORTH: "t.", LEMMA: "toivoo"}],
    "tark.": [{ORTH: "tark.", LEMMA: "tarkastanut"}],
    "til.": [{ORTH: "til.", LEMMA: "tilattu"}],
    "tms.": [{ORTH: "tms.", LEMMA: "tai muuta sellaista"}],
    "toim.": [{ORTH: "toim.", LEMMA: "toimittanut"}],
    "v.": [{ORTH: "v.", LEMMA: "vuosi"}],
    "vas.": [{ORTH: "vas.", LEMMA: "vasen"}],
    "vast.": [{ORTH: "vast.", LEMMA: "vastaus"}],
    "vrt.": [{ORTH: "vrt.", LEMMA: "vertaa"}],
    "yht.": [{ORTH: "yht.", LEMMA: "yhteensä"}],
    "yl.": [{ORTH: "yl.", LEMMA: "yleinen"}],
    "ym.": [{ORTH: "ym.", LEMMA: "ynnä muuta"}],
    "yms.": [{ORTH: "yms.", LEMMA: "ynnä muuta sellaista"}],
    "yo.": [{ORTH: "yo.", LEMMA: "ylioppilas"}],
    "yliopp.": [{ORTH: "yliopp.", LEMMA: "ylioppilas"}],
    "ao.": [{ORTH: "ao.", LEMMA: "asianomainen"}],
    "em.": [{ORTH: "em.", LEMMA: "edellä mainittu"}],
    "ko.": [{ORTH: "ko.", LEMMA: "kyseessä oleva"}],
    "ml.": [{ORTH: "ml.", LEMMA: "mukaan luettuna"}],
    "po.": [{ORTH: "po.", LEMMA: "puheena oleva"}],
    "so.": [{ORTH: "so.", LEMMA: "se on"}],
    "ts.": [{ORTH: "ts.", LEMMA: "toisin sanoen"}],
    "vm.": [{ORTH: "vm.", LEMMA: "viimeksi mainittu"}],
    "srk.": [{ORTH: "srk.", LEMMA: "seurakunta"}]
}
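Every entry above follows the same one-token shape, so the table could equally be generated from a plain string-to-lemma mapping. A possible refactor, shown only for illustration (this is not what the commit does; make_abbreviation_exc is a hypothetical helper name):

    def make_abbreviation_exc(abbreviations):
        # {"esim.": "esimerkki"} -> {"esim.": [{ORTH: "esim.", LEMMA: "esimerkki"}]}
        return {orth: [{ORTH: orth, LEMMA: lemma}]
                for orth, lemma in abbreviations.items()}

    TOKENIZER_EXCEPTIONS = make_abbreviation_exc({
        "esim.": "esimerkki",
        "ks.": "katso",
        "n.": "noin",
        # ... remaining abbreviations as listed above
    })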
@@ -50,6 +50,7 @@ EMOTICONS = set("""
 :/
 :-/
 =/
+=|
 :|
 :-|
 :1
@@ -72,7 +72,7 @@ HYPHENS = _HYPHENS.strip().replace(' ', '|')
 # Prefixes

 TOKENIZER_PREFIXES = (
-    ['§', '%', r'\+'] +
+    ['§', '%', '=', r'\+'] +
     LIST_PUNCT +
     LIST_ELLIPSES +
     LIST_QUOTES +

@@ -106,7 +106,7 @@ TOKENIZER_INFIXES = (
     r'(?<=[0-9])[+\-\*^](?=[0-9-])',
     r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
     r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
-    r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
+    r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
     r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA)
 ]
 )
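The only behavioural change in the infix rules is the new [?";:=,.]* run, which lets a hyphen cluster be split off even when punctuation sits between the word and the hyphens (the cases covered by test_issue801 further down). A quick way to see the difference, with ALPHA and HYPHENS simplified to plain ASCII classes purely for this demo (the real character classes are built in this module):

    import re

    old = re.compile(r'(?<=[a-zA-Z])(?:--|-)(?=[a-zA-Z])')
    new = re.compile(r'(?<=[a-zA-Z])[?";:=,.]*(?:--|-)(?=[a-zA-Z])')

    text = 'day.--Is'
    print(old.search(text))          # None: the '.' blocks the lookbehind
    print(new.search(text).group())  # '.--' is now found as an infix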
@@ -5,12 +5,14 @@ from .. import language_data as base
 from ..language_data import update_exc, strings_to_exc

 from .stop_words import STOP_WORDS
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY


 STOP_WORDS = set(STOP_WORDS)

-TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
+TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
+update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
+update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
 update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
spacy/sv/lemma_rules.py (new file, 45 lines)

@@ -0,0 +1,45 @@
# encoding: utf8
from __future__ import unicode_literals


LEMMA_RULES = {
    "noun": [
        ["t", ""],
        ["n", ""],
        ["na", ""],
        ["na", "e"],
        ["or", "a"],
        ["orna", "a"],
        ["et", ""],
        ["en", ""],
        ["en", "e"],
        ["er", ""],
        ["erna", ""],
        ["ar", "e"],
        ["ar", ""],
        ["lar", "el"],
        ["arna", "e"],
        ["arna", ""],
        ["larna", "el"]
    ],

    "adj": [
        ["are", ""],
        ["ast", ""],
        ["re", ""],
        ["st", ""],
        ["ägre", "åg"],
        ["ägst", "åg"],
        ["ängre", "ång"],
        ["ängst", "ång"],
        ["örre", "or"],
        ["örst", "or"],
    ],

    "punct": [
        ["“", "\""],
        ["”", "\""],
        ["\u2018", "'"],
        ["\u2019", "'"]
    ]
}
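Each pair is a (suffix, replacement) rewrite: the rule-based lemmatizer tries to strip the suffix and append the replacement, keeping the candidates it can validate. A minimal sketch of that idea, given the LEMMA_RULES above (illustrative only; spaCy's real Lemmatizer also consults index and exception tables):

    def apply_rules(string, rules):
        """Yield lemma candidates for one word from [suffix, replacement] pairs."""
        for old_suffix, new_suffix in rules:
            if string.endswith(old_suffix):
                yield string[:len(string) - len(old_suffix)] + new_suffix

    # Among the candidates, the 'orna' -> 'a' rule turns "flickorna"
    # (the girls) into "flicka" (girl).
    print(list(apply_rules("flickorna", LEMMA_RULES["noun"])))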
@@ -5,7 +5,31 @@ from ..symbols import *
 from ..language_data import PRON_LEMMA


-TOKENIZER_EXCEPTIONS = {
+EXC = {}
+
+# Verbs
+
+for verb_data in [
+    {ORTH: "driver"},
+    {ORTH: "kör"},
+    {ORTH: "hörr", LEMMA: "hör"},
+    {ORTH: "fattar"},
+    {ORTH: "hajar", LEMMA: "förstår"},
+    {ORTH: "lever"},
+    {ORTH: "serr", LEMMA: "ser"},
+    {ORTH: "fixar"}
+]:
+    verb_data_tc = dict(verb_data)
+    verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
+
+    for data in [verb_data, verb_data_tc]:
+        EXC[data[ORTH] + "u"] = [
+            dict(data),
+            {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}
+        ]
+
+
+ABBREVIATIONS = {
     "jan.": [
         {ORTH: "jan.", LEMMA: "januari"}
     ],

@@ -63,6 +87,63 @@ TOKENIZER_EXCEPTIONS = {
     "sön.": [
         {ORTH: "sön.", LEMMA: "söndag"}
     ],
+    "Jan.": [{ORTH: "Jan.", LEMMA: "Januari"}],
+    "Febr.": [{ORTH: "Febr.", LEMMA: "Februari"}],
+    "Feb.": [{ORTH: "Feb.", LEMMA: "Februari"}],
+    "Apr.": [{ORTH: "Apr.", LEMMA: "April"}],
+    "Jun.": [{ORTH: "Jun.", LEMMA: "Juni"}],
+    "Jul.": [{ORTH: "Jul.", LEMMA: "Juli"}],
+    "Aug.": [{ORTH: "Aug.", LEMMA: "Augusti"}],
+    "Sept.": [{ORTH: "Sept.", LEMMA: "September"}],
+    "Sep.": [{ORTH: "Sep.", LEMMA: "September"}],
+    "Okt.": [{ORTH: "Okt.", LEMMA: "Oktober"}],
+    "Nov.": [{ORTH: "Nov.", LEMMA: "November"}],
+    "Dec.": [{ORTH: "Dec.", LEMMA: "December"}],
+    "Mån.": [{ORTH: "Mån.", LEMMA: "Måndag"}],
+    "Tis.": [{ORTH: "Tis.", LEMMA: "Tisdag"}],
+    "Ons.": [{ORTH: "Ons.", LEMMA: "Onsdag"}],
+    "Tors.": [{ORTH: "Tors.", LEMMA: "Torsdag"}],
+    "Fre.": [{ORTH: "Fre.", LEMMA: "Fredag"}],
+    "Lör.": [{ORTH: "Lör.", LEMMA: "Lördag"}],
+    "Sön.": [{ORTH: "Sön.", LEMMA: "Söndag"}],
     "sthlm": [
         {ORTH: "sthlm", LEMMA: "Stockholm"}
     ],

@@ -72,6 +153,10 @@ TOKENIZER_EXCEPTIONS = {
 }


+TOKENIZER_EXCEPTIONS = dict(EXC)
+TOKENIZER_EXCEPTIONS.update(ABBREVIATIONS)
+
+
 ORTH_ONLY = [
     "ang.",
     "anm.",

@@ -107,7 +192,6 @@ ORTH_ONLY = [
     "p.g.a.",
     "ref.",
     "resp.",
-    "s.",
     "s.a.s.",
     "s.k.",
     "st.",
@@ -10,6 +10,7 @@ from ..pt import Portuguese
 from ..nl import Dutch
 from ..sv import Swedish
 from ..hu import Hungarian
+from ..fi import Finnish
 from ..tokens import Doc
 from ..strings import StringStore
 from ..lemmatizer import Lemmatizer

@@ -23,7 +24,7 @@ import pytest


 LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
-             Swedish, Hungarian]
+             Swedish, Hungarian, Finnish]


 @pytest.fixture(params=LANGUAGES)

@@ -62,6 +63,16 @@ def hu_tokenizer():
     return Hungarian.Defaults.create_tokenizer()


+@pytest.fixture
+def fi_tokenizer():
+    return Finnish.Defaults.create_tokenizer()
+
+
+@pytest.fixture
+def sv_tokenizer():
+    return Swedish.Defaults.create_tokenizer()
+
+
 @pytest.fixture
 def stringstore():
     return StringStore()
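With these fixtures available, a language-specific tokenizer test only needs to request the right fixture by name. A hypothetical example in the style of the test files added below (not part of the commit; it assumes a standalone abbreviation is matched as a single special-case token):

    import pytest

    @pytest.mark.parametrize('text', ['esim.', 'ym.', 'jne.'])
    def test_fi_tokenizer_keeps_abbreviations(fi_tokenizer, text):
        tokens = fi_tokenizer(text)
        assert len(tokens) == 1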
spacy/tests/fi/__init__.py (new file, empty)

spacy/tests/fi/test_tokenizer.py (new file, 18 lines)

@@ -0,0 +1,18 @@
# encoding: utf8
from __future__ import unicode_literals

import pytest

ABBREVIATION_TESTS = [
    ('Hyvää uutta vuotta t. siht. Niemelä!', ['Hyvää', 'uutta', 'vuotta', 't.', 'siht.', 'Niemelä', '!']),
    ('Paino on n. 2.2 kg', ['Paino', 'on', 'n.', '2.2', 'kg'])
]

TESTCASES = ABBREVIATION_TESTS


@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
    tokens = fi_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
spacy/tests/regression/test_issue792.py (new file, 12 lines)

@@ -0,0 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.xfail
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
def test_issue792(en_tokenizer, text):
    """Test for Issue #792: Trailing whitespace is removed after parsing."""
    doc = en_tokenizer(text)
    assert doc.text_with_ws == text
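The property being pinned down by this (currently xfail'ed) test is that tokenization is lossless: each token's text_with_ws carries its trailing whitespace, so joining the pieces reproduces the input. For example, with the en_tokenizer provided by the fixtures above:

    doc = en_tokenizer(u'Hello world!')
    assert u''.join(token.text_with_ws for token in doc) == u'Hello world!'
    assert doc.text_with_ws == u'Hello world!'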
spacy/tests/regression/test_issue801.py (new file, 19 lines)

@@ -0,0 +1,19 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text,tokens', [
    ('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
    ("exception;--exclusive", ["exception", ";--", "exclusive"]),
    ("day.--Is", ["day", ".--", "Is"]),
    ("refinement:--just", ["refinement", ":--", "just"]),
    ("memories?--To", ["memories", "?--", "To"]),
    ("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
    ("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
def test_issue801(en_tokenizer, text, tokens):
    """Test that special characters + hyphens are split correctly."""
    doc = en_tokenizer(text)
    assert len(doc) == len(tokens)
    assert [t.text for t in doc] == tokens
spacy/tests/regression/test_issue805.py (new file, 15 lines)

@@ -0,0 +1,15 @@
# encoding: utf8
from __future__ import unicode_literals

import pytest

SV_TOKEN_EXCEPTION_TESTS = [
    ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
    ('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
]

@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_issue805(sv_tokenizer, text, expected_tokens):
    tokens = sv_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
spacy/tests/sv/__init__.py (new file, empty)

spacy/tests/sv/test_tokenizer.py (new file, 24 lines)

@@ -0,0 +1,24 @@
# encoding: utf8
from __future__ import unicode_literals

import pytest


SV_TOKEN_EXCEPTION_TESTS = [
    ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
    ('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
]


@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
    tokens = sv_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list


@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
    tokens = sv_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[1].text == "u"
@@ -500,7 +500,8 @@ cdef class Doc:
        by the values of the given attribute ID.

        Example:
-            from spacy.en import English, attrs
+            from spacy.en import English
+            from spacy import attrs
            nlp = English()
            tokens = nlp(u'apple apple orange banana')
            tokens.count_by(attrs.ORTH)

@@ -585,9 +586,6 @@ cdef class Doc:
        elif attr_id == POS:
            for i in range(length):
                tokens[i].pos = <univ_pos_t>values[i]
-        elif attr_id == TAG:
-            for i in range(length):
-                tokens[i].tag = <univ_pos_t>values[i]
        elif attr_id == DEP:
            for i in range(length):
                tokens[i].dep = values[i]
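The docstring fix above is what makes the count_by example runnable: attrs lives in spacy.attrs, not in spacy.en. The counts come back keyed by attribute value IDs, which can be mapped back to text through the StringStore. A short sketch, assuming the English model data is installed:

    from spacy.en import English
    from spacy import attrs

    nlp = English()
    tokens = nlp(u'apple apple orange banana')
    counts = tokens.count_by(attrs.ORTH)
    print({nlp.vocab.strings[orth_id]: n for orth_id, n in counts.items()})
    # apple: 2, orange: 1, banana: 1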
@@ -55,7 +55,7 @@
     },

     "V_CSS": "1.15",
-    "V_JS": "1.0",
+    "V_JS": "1.1",
     "DEFAULT_SYNTAX": "python",
     "ANALYTICS": "UA-58931649-1",
     "MAILCHIMP": {
@@ -14,11 +14,11 @@
     const updateNav = () => {
         const vh = updateVh()
         const newScrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0)
-        scrollUp = newScrollY <= scrollY
+        if (newScrollY != scrollY) scrollUp = newScrollY <= scrollY
         scrollY = newScrollY

         if(scrollUp && !(isNaN(scrollY) || scrollY <= vh)) nav.classList.add(fixedClass)
-        else if(!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) nav.classList.remove(fixedClass)
+        else if (!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) nav.classList.remove(fixedClass)
     }

     window.addEventListener('scroll', () => requestAnimationFrame(updateNav))
@@ -19,21 +19,6 @@ p spaCy currently supports the following languages and capabilities:
         each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
             +cell.u-text-center #[+procon(icon)]
-
-    +row
-        +cell Chinese #[code zh]
-        each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
-            +cell.u-text-center #[+procon(icon)]
-
-    +row
-        +cell Spanish #[code es]
-        each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
-            +cell.u-text-center #[+procon(icon)]
-
-p
-    | Chinese tokenization requires the
-    | #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
-    | models are coming soon.

 +h(2, "alpha-support") Alpha support

@@ -42,8 +27,13 @@ p
     | the existing language data and extending the tokenization patterns.

 +table([ "Language", "Source" ])
-    each language, code in { it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", hu: "Hungarian" }
+    each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", hu: "Hungarian" }
         +row
             +cell #{language} #[code=code]
             +cell
                 +src(gh("spaCy", "spacy/" + code)) spacy/#{code}
+
+p
+    | Chinese tokenization requires the
+    | #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
+    | models are coming soon.
@@ -54,7 +54,7 @@ p
     doc = nlp(u'London is a big city in the United Kingdom.')
     doc.ents = []
     assert doc[0].ent_type_ == ''
-    doc.ents = [Span(0, 1, label='GPE')]
+    doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
     assert doc[0].ent_type_ == 'GPE'
     doc.ents = []
    doc.ents = [(u'LondonCity', u'GPE', 0, 1)]
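The corrected Span call takes the parent doc and an integer label ID, as shown in the hunk. After the assignment the entities can be read back from doc.ents (or from the token-level ent_type_ attribute) to confirm the label round-trips:

    doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
    print([(ent.text, ent.label_, ent.start, ent.end) for ent in doc.ents])
    # [(u'London', u'GPE', 0, 1)]
    assert doc[0].ent_type_ == 'GPE'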
@@ -20,7 +20,7 @@ p
     | Once we've added the pattern, we can use the #[code matcher] as a
     | callable, to receive a list of #[code (ent_id, start, end)] tuples.
     | Note that #[code LOWER] and #[code IS_PUNCT] are data attributes
-    | of #[code Matcher.attrs].
+    | of #[code spacy.attrs].

 +code.
     from spacy.matcher import Matcher