Mirror of https://github.com/explosion/spaCy.git (synced 2024-12-25 17:36:30 +03:00)
Commit 309da78bf0: Merge branch 'master' into tokenizer_exceptions

.github/contributors/wallinm1.md (new file, 106 lines, vendored)
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------------------- |
|
||||
| Name | Michael Wallin |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2017-02-04 |
|
||||
| GitHub username | wallinm1 |
|
||||
| Website (optional) | |
|
|
@ -19,7 +19,7 @@ First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplos
|
|||
|
||||
If you're looking for help with your code, consider posting a question on [StackOverflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you tag it `spacy` and `python`, more people will see it and hopefully be able to help.
|
||||
|
||||
When opening an issue, use a descriptive title and include your environment (operating system, Python version, spaCy version). Our [issue template](https://github.com/explosion/spaCy/issues/new) helps you remember the most important details to include.
|
||||
When opening an issue, use a descriptive title and include your environment (operating system, Python version, spaCy version). Our [issue template](https://github.com/explosion/spaCy/issues/new) helps you remember the most important details to include. **Pro tip:** If you need to share long blocks of code or logs, you can wrap them in `<details>` and `</details>`. This [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) so it only becomes visible on click, making the issue easier to read and follow.
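For example, a long traceback or log can be wrapped like this (illustrative snippet only):

<details>
<summary>Full traceback</summary>
(paste the long output here)
</details>

This keeps the issue body short while preserving all the details for anyone who expands it.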
|
||||
|
||||
If you've discovered a bug, you can also submit a [regression test](#fixing-bugs) straight away. When you're opening an issue to report the bug, simply refer to your pull request in the issue body.
|
||||
|
||||
|
|
|
@ -22,12 +22,14 @@ This is a list of everyone who has made significant contributions to spaCy, in a
|
|||
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
|
||||
* Matthew Honnibal, [@honnibal](https://github.com/honnibal)
|
||||
* Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
|
||||
* Michael Wallin, [@wallinm1](https://github.com/wallinm1)
|
||||
* Oleg Zd, [@olegzd](https://github.com/olegzd)
|
||||
* Pokey Rule, [@pokey](https://github.com/pokey)
|
||||
* Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
|
||||
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
|
||||
* Sam Bozek, [@sambozek](https://github.com/sambozek)
|
||||
* Sasho Savkov, [@savkov](https://github.com/savkov)
|
||||
* Thomas Tanon, [@Tpt](https://github.com/Tpt)
|
||||
* Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
|
||||
* Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
|
||||
* Wah Loon Keng, [@kengz](https://github.com/kengz)
|
||||
|
|
|
@ -5,8 +5,8 @@ spaCy is a library for advanced natural language processing in Python and
|
|||
Cython. spaCy is built on the very latest research, but it isn't researchware.
|
||||
It was designed from day one to be used in real products. spaCy currently supports
|
||||
English and German, as well as tokenization for Chinese, Spanish, Italian, French,
|
||||
Portuguese, Dutch, Swedish and Hungarian. It's commercial open-source software,
|
||||
released under the MIT license.
|
||||
Portuguese, Dutch, Swedish, Finnish and Hungarian. It's commercial open-source
|
||||
software, released under the MIT license.
|
||||
|
||||
💫 **Version 1.6 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
|
||||
|
||||
|
|
|
@ -12,17 +12,23 @@ from spacy_hook import create_similarity_pipeline
|
|||
|
||||
from keras_decomposable_attention import build_model
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
except ImportError:
|
||||
import pickle
|
||||
|
||||
|
||||
def train(model_dir, train_loc, dev_loc, shape, settings):
|
||||
train_texts1, train_texts2, train_labels = read_snli(train_loc)
|
||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
|
||||
|
||||
print("Loading spaCy")
|
||||
nlp = spacy.load('en')
|
||||
assert nlp.path is not None
|
||||
print("Compiling network")
|
||||
model = build_model(get_embeddings(nlp.vocab), shape, settings)
|
||||
print("Processing texts...")
|
||||
Xs = []
|
||||
Xs = []
|
||||
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
|
||||
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
|
||||
max_length=shape[0],
|
||||
|
@ -36,35 +42,41 @@ def train(model_dir, train_loc, dev_loc, shape, settings):
|
|||
validation_data=([dev_X1, dev_X2], dev_labels),
|
||||
nb_epoch=settings['nr_epoch'],
|
||||
batch_size=settings['batch_size'])
|
||||
if not (nlp.path / 'similarity').exists():
|
||||
(nlp.path / 'similarity').mkdir()
|
||||
print("Saving to", model_dir / 'similarity')
|
||||
weights = model.get_weights()
|
||||
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
|
||||
pickle.dump(weights[1:], file_)
|
||||
with (nlp.path / 'similarity' / 'config.json').open('wb') as file_:
|
||||
file_.write(model.to_json())
|
||||
|
||||
|
||||
def evaluate(model_dir, dev_loc):
|
||||
nlp = spacy.load('en', path=model_dir,
|
||||
tagger=False, parser=False, entity=False, matcher=False,
|
||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
nlp = spacy.load('en',
|
||||
create_pipeline=create_similarity_pipeline)
|
||||
n = 0
|
||||
correct = 0
|
||||
for (text1, text2), label in zip(dev_texts, dev_labels):
|
||||
total = 0.
|
||||
correct = 0.
|
||||
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
||||
doc1 = nlp(text1)
|
||||
doc2 = nlp(text2)
|
||||
sim = doc1.similarity(doc2)
|
||||
if bool(sim >= 0.5) == label:
|
||||
if sim.argmax() == label.argmax():
|
||||
correct += 1
|
||||
n += 1
|
||||
total += 1
|
||||
return correct, total
|
||||
|
||||
|
||||
def demo(model_dir):
|
||||
nlp = spacy.load('en', path=model_dir,
|
||||
tagger=False, parser=False, entity=False, matcher=False,
|
||||
create_pipeline=create_similarity_pipeline)
|
||||
doc1 = nlp(u'Worst fries ever! Greasy and horrible...')
|
||||
doc2 = nlp(u'The milkshakes are good. The fries are bad.')
|
||||
print('doc1.similarity(doc2)', doc1.similarity(doc2))
|
||||
sent1a, sent1b = doc1.sents
|
||||
print('sent1a.similarity(sent1b)', sent1a.similarity(sent1b))
|
||||
print('sent1a.similarity(doc2)', sent1a.similarity(doc2))
|
||||
print('sent1b.similarity(doc2)', sent1b.similarity(doc2))
|
||||
doc1 = nlp(u'What were the best crime fiction books in 2016?')
|
||||
doc2 = nlp(
|
||||
u'What should I read that was published last year? I like crime stories.')
|
||||
print(doc1)
|
||||
print(doc2)
|
||||
print("Similarity", doc1.similarity(doc2))
|
||||
|
||||
|
||||
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
|
||||
|
@ -119,7 +131,8 @@ def main(mode, model_dir, train_loc, dev_loc,
|
|||
if mode == 'train':
|
||||
train(model_dir, train_loc, dev_loc, shape, settings)
|
||||
elif mode == 'evaluate':
|
||||
evaluate(model_dir, dev_loc)
|
||||
correct, total = evaluate(model_dir, dev_loc)
|
||||
print(correct, '/', total, correct / total)
|
||||
else:
|
||||
demo(model_dir)
|
||||
|
||||
|
|
|
@ -12,6 +12,8 @@ from keras.models import Sequential, Model, model_from_json
|
|||
from keras.regularizers import l2
|
||||
from keras.optimizers import Adam
|
||||
from keras.layers.normalization import BatchNormalization
|
||||
from keras.layers.pooling import GlobalAveragePooling1D, GlobalMaxPooling1D
|
||||
from keras.layers import Merge
|
||||
|
||||
|
||||
def build_model(vectors, shape, settings):
|
||||
|
@ -29,11 +31,11 @@ def build_model(vectors, shape, settings):
|
|||
align = _SoftAlignment(max_length, nr_hidden)
|
||||
compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout'])
|
||||
entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout'])
|
||||
|
||||
|
||||
# Declare the model as a computational graph.
|
||||
sent1 = embed(ids1) # Shape: (i, n)
|
||||
sent2 = embed(ids2) # Shape: (j, n)
|
||||
|
||||
|
||||
if settings['gru_encode']:
|
||||
sent1 = encode(sent1)
|
||||
sent2 = encode(sent2)
|
||||
|
@ -42,12 +44,12 @@ def build_model(vectors, shape, settings):
|
|||
|
||||
align1 = align(sent2, attention)
|
||||
align2 = align(sent1, attention, transpose=True)
|
||||
|
||||
|
||||
feats1 = compare(sent1, align1)
|
||||
feats2 = compare(sent2, align2)
|
||||
|
||||
|
||||
scores = entail(feats1, feats2)
|
||||
|
||||
|
||||
# Now that we have the input/output, we can construct the Model object...
|
||||
model = Model(input=[ids1, ids2], output=[scores])
|
||||
|
||||
|
@ -93,7 +95,7 @@ class _StaticEmbedding(object):
|
|||
def get_output_shape(shapes):
|
||||
print(shapes)
|
||||
return shapes[0]
|
||||
mod_sent = self.mod_ids(sentence)
|
||||
mod_sent = self.mod_ids(sentence)
|
||||
tuning = self.tune(mod_sent)
|
||||
#tuning = merge([tuning, mod_sent],
|
||||
# mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
|
||||
|
@ -129,7 +131,7 @@ class _Attention(object):
|
|||
self.model.add(Dense(nr_hidden, name='attend2',
|
||||
init='he_normal', W_regularizer=l2(L2), activation='relu'))
|
||||
self.model = TimeDistributed(self.model)
|
||||
|
||||
|
||||
def __call__(self, sent1, sent2):
|
||||
def _outer(AB):
|
||||
att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
|
||||
|
@ -158,7 +160,7 @@ class _SoftAlignment(object):
|
|||
return K.batch_dot(sm_att, mat)
|
||||
return merge([attention, sentence], mode=_normalize_attention,
|
||||
output_shape=(self.max_length, self.nr_hidden)) # Shape: (i, n)
|
||||
|
||||
|
||||
|
||||
class _Comparison(object):
|
||||
def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
|
||||
|
@ -176,10 +178,12 @@ class _Comparison(object):
|
|||
|
||||
def __call__(self, sent, align, **kwargs):
|
||||
result = self.model(merge([sent, align], mode='concat')) # Shape: (i, n)
|
||||
result = _GlobalSumPooling1D()(result, mask=self.words)
|
||||
result = BatchNormalization()(result)
|
||||
avged = GlobalAveragePooling1D()(result, mask=self.words)
|
||||
maxed = GlobalMaxPooling1D()(result, mask=self.words)
|
||||
merged = merge([avged, maxed])
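        # Editorial note (hedged): merge() without an explicit mode defaults to
        # element-wise 'sum' in Keras 1, so the average- and max-pooled vectors
        # are summed here before batch normalization.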
|
||||
result = BatchNormalization()(merged)
|
||||
return result
|
||||
|
||||
|
||||
|
||||
class _Entailment(object):
|
||||
def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
|
||||
|
@ -251,7 +255,7 @@ def test_fit_model():
|
|||
shape = (10, 16, 3)
|
||||
settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
|
||||
model = build_model(vectors, shape, settings)
|
||||
|
||||
|
||||
train_X = _generate_X(20, shape[0], vectors.shape[1])
|
||||
train_Y = _generate_Y(20, shape[2])
|
||||
dev_X = _generate_X(15, shape[0], vectors.shape[1])
|
||||
|
@ -261,6 +265,4 @@ def test_fit_model():
|
|||
batch_size=4)
|
||||
|
||||
|
||||
|
||||
|
||||
__all__ = [build_model]
|
||||
|
|
|
@ -1,33 +1,40 @@
|
|||
from keras.models import model_from_json
|
||||
import numpy
|
||||
import numpy.random
|
||||
import json
|
||||
from spacy.tokens.span import Span
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
except ImportError:
|
||||
import pickle
|
||||
|
||||
|
||||
class KerasSimilarityShim(object):
|
||||
@classmethod
|
||||
def load(cls, path, nlp, get_features=None):
|
||||
def load(cls, path, nlp, get_features=None, max_length=100):
|
||||
if get_features is None:
|
||||
get_features = doc2ids
|
||||
get_features = get_word_ids
|
||||
with (path / 'config.json').open() as file_:
|
||||
config = json.load(file_)
|
||||
model = model_from_json(config['model'])
|
||||
model = model_from_json(file_.read())
|
||||
with (path / 'model').open('rb') as file_:
|
||||
weights = pickle.load(file_)
|
||||
embeddings = get_embeddings(nlp.vocab)
|
||||
model.set_weights([embeddings] + weights)
|
||||
return cls(model, get_features=get_features)
|
||||
return cls(model, get_features=get_features, max_length=max_length)
|
||||
|
||||
def __init__(self, model, get_features=None):
|
||||
def __init__(self, model, get_features=None, max_length=100):
|
||||
self.model = model
|
||||
self.get_features = get_features
|
||||
self.max_length = max_length
|
||||
|
||||
def __call__(self, doc):
|
||||
doc.user_hooks['similarity'] = self.predict
|
||||
doc.user_span_hooks['similarity'] = self.predict
|
||||
|
||||
|
||||
def predict(self, doc1, doc2):
|
||||
x1 = self.get_features(doc1)
|
||||
x2 = self.get_features(doc2)
|
||||
x1 = self.get_features([doc1], max_length=self.max_length, tree_truncate=True)
|
||||
x2 = self.get_features([doc2], max_length=self.max_length, tree_truncate=True)
|
||||
scores = self.model.predict([x1, x2])
|
||||
return scores[0]
|
||||
|
||||
|
@ -45,7 +52,10 @@ def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr
|
|||
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
|
||||
for i, doc in enumerate(docs):
|
||||
if tree_truncate:
|
||||
queue = [sent.root for sent in doc.sents]
|
||||
if isinstance(doc, Span):
|
||||
queue = [doc.root]
|
||||
else:
|
||||
queue = [sent.root for sent in doc.sents]
|
||||
else:
|
||||
queue = list(doc)
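        # Editorial note (hedged): with tree_truncate the token queue starts from the
        # parse-tree root(s), so truncating at max_length keeps syntactically central
        # words; otherwise the document is taken in plain surface order.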
|
||||
words = []
|
||||
|
@ -71,7 +81,9 @@ def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr
|
|||
|
||||
|
||||
def create_similarity_pipeline(nlp):
|
||||
return [SimilarityModel.load(
|
||||
nlp.path / 'similarity',
|
||||
nlp,
|
||||
feature_extracter=get_features)]
|
||||
return [
|
||||
nlp.tagger,
|
||||
nlp.entity,
|
||||
nlp.parser,
|
||||
KerasSimilarityShim.load(nlp.path / 'similarity', nlp, max_length=10)
|
||||
]
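# Editorial usage sketch (hedged, mirrors train()/evaluate() above, and assumes a
# model has already been trained and saved under nlp.path / 'similarity'): once
# this pipeline is installed, Doc.similarity is routed through the Keras model via
# the user_hooks set in KerasSimilarityShim.__call__:
#
#     nlp = spacy.load('en', create_pipeline=create_similarity_pipeline)
#     doc1 = nlp(u'The milkshakes are good.')
#     doc2 = nlp(u'The fries are bad.')
#     scores = doc1.similarity(doc2)  # per-class scores from KerasSimilarityShim.predict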
|
||||
|
|
setup.py (1 line changed)
|
@ -31,6 +31,7 @@ PACKAGES = [
|
|||
'spacy.pt',
|
||||
'spacy.nl',
|
||||
'spacy.sv',
|
||||
'spacy.fi',
|
||||
'spacy.language_data',
|
||||
'spacy.serialize',
|
||||
'spacy.syntax',
|
||||
|
|
|
@ -13,7 +13,7 @@ from . import fr
|
|||
from . import pt
|
||||
from . import nl
|
||||
from . import sv
|
||||
|
||||
from . import fi
|
||||
|
||||
try:
|
||||
basestring
|
||||
|
@ -31,6 +31,8 @@ set_lang_class(hu.Hungarian.lang, hu.Hungarian)
|
|||
set_lang_class(zh.Chinese.lang, zh.Chinese)
|
||||
set_lang_class(nl.Dutch.lang, nl.Dutch)
|
||||
set_lang_class(sv.Swedish.lang, sv.Swedish)
|
||||
set_lang_class(fi.Finnish.lang, fi.Finnish)
|
||||
|
||||
|
||||
|
||||
def load(name, **overrides):
|
||||
|
|
spacy/fi/__init__.py (new file, 17 lines)
|
@ -0,0 +1,17 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from ..language import Language
|
||||
from ..attrs import LANG
|
||||
from .language_data import *
|
||||
|
||||
|
||||
class Finnish(Language):
|
||||
lang = 'fi'
|
||||
|
||||
class Defaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'fi'
|
||||
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
stop_words = STOP_WORDS
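# Editorial note (hedged, not part of the commit): the conftest fixture added later
# in this commit builds a standalone Finnish tokenizer from these defaults via
# Finnish.Defaults.create_tokenizer(), which is the simplest way to exercise the
# tokenizer_exceptions and stop_words wired up above.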
|
spacy/fi/language_data.py (new file, 17 lines)
|
@ -0,0 +1,17 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .. import language_data as base
|
||||
from ..language_data import update_exc, strings_to_exc
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
|
||||
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
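# Editorial note (hedged): update_exc() merges the shared emoticon strings from the
# base language data into the Finnish exception table, so emoticons such as ":-/"
# are kept as single tokens alongside the Finnish abbreviations.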
|
||||
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]
|
spacy/fi/stop_words.py (new file, 114 lines)
|
@ -0,0 +1,114 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt
|
||||
# Reformatted with some minor corrections
|
||||
|
||||
STOP_WORDS = set("""
|
||||
|
||||
aiemmin aika aikaa aikaan aikaisemmin aikaisin aikana aikoina aikoo aikovat
|
||||
aina ainakaan ainakin ainoa ainoat aiomme aion aiotte aivan ajan alas alemmas
|
||||
alkuisin alkuun alla alle aloitamme aloitan aloitat aloitatte aloitattivat
|
||||
aloitettava aloitettavaksi aloitettu aloitimme aloitin aloitit aloititte
|
||||
aloittaa aloittamatta aloitti aloittivat alta aluksi alussa alusta annettavaksi
|
||||
annettava annettu ansiosta antaa antamatta antoi apu asia asiaa asian asiasta
|
||||
asiat asioiden asioihin asioita asti avuksi avulla avun avutta
|
||||
|
||||
edelle edelleen edellä edeltä edemmäs edes edessä edestä ehkä ei eikä eilen
|
||||
eivät eli ellei elleivät ellemme ellen ellet ellette emme en enemmän eniten
|
||||
ennen ensi ensimmäinen ensimmäiseksi ensimmäisen ensimmäisenä ensimmäiset
|
||||
ensimmäisiksi ensimmäisinä ensimmäisiä ensimmäistä ensin entinen entisen
|
||||
entisiä entisten entistä enää eri erittäin erityisesti eräiden eräs eräät esi
|
||||
esiin esillä esimerkiksi et eteen etenkin ette ettei että
|
||||
|
||||
halua haluaa haluamatta haluamme haluan haluat haluatte haluavat halunnut
|
||||
halusi halusimme halusin halusit halusitte halusivat halutessa haluton he hei
|
||||
heidän heidät heihin heille heillä heiltä heissä heistä heitä helposti heti
|
||||
hetkellä hieman hitaasti huolimatta huomenna hyvien hyviin hyviksi hyville
|
||||
hyviltä hyvin hyvinä hyvissä hyvistä hyviä hyvä hyvät hyvää hän häneen hänelle
|
||||
hänellä häneltä hänen hänessä hänestä hänet häntä
|
||||
|
||||
ihan ilman ilmeisesti itse itsensä itseään
|
||||
|
||||
ja jo johon joiden joihin joiksi joilla joille joilta joina joissa joista joita
|
||||
joka jokainen jokin joko joksi joku jolla jolle jolloin jolta jompikumpi jona
|
||||
jonka jonkin jonne joo jopa jos joskus jossa josta jota jotain joten jotenkin
|
||||
jotenkuten jotka jotta jouduimme jouduin jouduit jouduitte joudumme joudun
|
||||
joudutte joukkoon joukossa joukosta joutua joutui joutuivat joutumaan joutuu
|
||||
joutuvat juuri jälkeen jälleen jää
|
||||
|
||||
kahdeksan kahdeksannen kahdella kahdelle kahdelta kahden kahdessa kahdesta
|
||||
kahta kahteen kai kaiken kaikille kaikilta kaikkea kaikki kaikkia kaikkiaan
|
||||
kaikkialla kaikkialle kaikkialta kaikkien kaikkiin kaksi kannalta kannattaa
|
||||
kanssa kanssaan kanssamme kanssani kanssanne kanssasi kauan kauemmas kaukana
|
||||
kautta kehen keiden keihin keiksi keille keillä keiltä keinä keissä keistä
|
||||
keitten keittä keitä keneen keneksi kenelle kenellä keneltä kenen kenenä
|
||||
kenessä kenestä kenet kenettä kenties kerran kerta kertaa keskellä kesken
|
||||
keskimäärin ketkä ketä kiitos kohti koko kokonaan kolmas kolme kolmen kolmesti
|
||||
koska koskaan kovin kuin kuinka kuinkaan kuitenkaan kuitenkin kuka kukaan kukin
|
||||
kumpainen kumpainenkaan kumpi kumpikaan kumpikin kun kuten kuuden kuusi kuutta
|
||||
kylliksi kyllä kymmenen kyse
|
||||
|
||||
liian liki lisäksi lisää lla luo luona lähekkäin lähelle lähellä läheltä
|
||||
lähemmäs lähes lähinnä lähtien läpi
|
||||
|
||||
mahdollisimman mahdollista me meidän meidät meihin meille meillä meiltä meissä
|
||||
meistä meitä melkein melko menee menemme menen menet menette menevät meni
|
||||
menimme menin menit menivät mennessä mennyt menossa mihin miksi mikä mikäli
|
||||
mikään mille milloin milloinkan millä miltä minkä minne minua minulla minulle
|
||||
minulta minun minussa minusta minut minuun minä missä mistä miten mitkä mitä
|
||||
mitään moi molemmat mones monesti monet moni moniaalla moniaalle moniaalta
|
||||
monta muassa muiden muita muka mukaan mukaansa mukana mutta muu muualla muualle
|
||||
muualta muuanne muulloin muun muut muuta muutama muutaman muuten myöhemmin myös
|
||||
myöskin myöskään myötä
|
||||
|
||||
ne neljä neljän neljää niiden niihin niiksi niille niillä niiltä niin niinä
|
||||
niissä niistä niitä noiden noihin noiksi noilla noille noilta noin noina noissa
|
||||
noista noita nopeammin nopeasti nopeiten nro nuo nyt näiden näihin näiksi
|
||||
näille näillä näiltä näin näinä näissä näistä näitä nämä
|
||||
|
||||
ohi oikea oikealla oikein ole olemme olen olet olette oleva olevan olevat oli
|
||||
olimme olin olisi olisimme olisin olisit olisitte olisivat olit olitte olivat
|
||||
olla olleet ollut oma omaa omaan omaksi omalle omalta oman omassa omat omia
|
||||
omien omiin omiksi omille omilta omissa omista on onkin onko ovat
|
||||
|
||||
paikoittain paitsi pakosti paljon paremmin parempi parhaillaan parhaiten
|
||||
perusteella peräti pian pieneen pieneksi pienelle pienellä pieneltä pienempi
|
||||
pienestä pieni pienin poikki puolesta puolestaan päälle
|
||||
|
||||
runsaasti
|
||||
|
||||
saakka sama samaa samaan samalla saman samat samoin sata sataa satojen se
|
||||
seitsemän sekä sen seuraavat siellä sieltä siihen siinä siis siitä sijaan siksi
|
||||
sille silloin sillä silti siltä sinne sinua sinulla sinulle sinulta sinun
|
||||
sinussa sinusta sinut sinuun sinä sisäkkäin sisällä siten sitten sitä ssa sta
|
||||
suoraan suuntaan suuren suuret suuri suuria suurin suurten
|
||||
|
||||
taa taas taemmas tahansa tai takaa takaisin takana takia tallä tapauksessa
|
||||
tarpeeksi tavalla tavoitteena te teidän teidät teihin teille teillä teiltä
|
||||
teissä teistä teitä tietysti todella toinen toisaalla toisaalle toisaalta
|
||||
toiseen toiseksi toisella toiselle toiselta toisemme toisen toisensa toisessa
|
||||
toisesta toista toistaiseksi toki tosin tuhannen tuhat tule tulee tulemme tulen
|
||||
tulet tulette tulevat tulimme tulin tulisi tulisimme tulisin tulisit tulisitte
|
||||
tulisivat tulit tulitte tulivat tulla tulleet tullut tuntuu tuo tuohon tuoksi
|
||||
tuolla tuolle tuolloin tuolta tuon tuona tuonne tuossa tuosta tuota tuskin tykö
|
||||
tähän täksi tälle tällä tällöin tältä tämä tämän tänne tänä tänään tässä tästä
|
||||
täten tätä täysin täytyvät täytyy täällä täältä
|
||||
|
||||
ulkopuolella usea useasti useimmiten usein useita uudeksi uudelleen uuden uudet
|
||||
uusi uusia uusien uusinta uuteen uutta
|
||||
|
||||
vaan vai vaiheessa vaikea vaikean vaikeat vaikeilla vaikeille vaikeilta
|
||||
vaikeissa vaikeista vaikka vain varmasti varsin varsinkin varten vasen
|
||||
vasemmalla vasta vastaan vastakkain vastan verran vielä vierekkäin vieressä
|
||||
vieri viiden viime viimeinen viimeisen viimeksi viisi voi voidaan voimme voin
|
||||
voisi voit voitte voivat vuoden vuoksi vuosi vuosien vuosina vuotta vähemmän
|
||||
vähintään vähiten vähän välillä
|
||||
|
||||
yhdeksän yhden yhdessä yhteen yhteensä yhteydessä yhteyteen yhtä yhtäälle
|
||||
yhtäällä yhtäältä yhtään yhä yksi yksin yksittäin yleensä ylemmäs yli ylös
|
||||
ympäri
|
||||
|
||||
älköön älä
|
||||
|
||||
""".split())
|
spacy/fi/tokenizer_exceptions.py (new file, 202 lines)
|
@ -0,0 +1,202 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA
|
||||
|
||||
# Source https://www.cs.tut.fi/~jkorpela/kielenopas/5.5.html
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
"aik.": [
|
||||
{ORTH: "aik.", LEMMA: "aikaisempi"}
|
||||
],
|
||||
"alk.": [
|
||||
{ORTH: "alk.", LEMMA: "alkaen"}
|
||||
],
|
||||
"alv.": [
|
||||
{ORTH: "alv.", LEMMA: "arvonlisävero"}
|
||||
],
|
||||
"ark.": [
|
||||
{ORTH: "ark.", LEMMA: "arkisin"}
|
||||
],
|
||||
"as.": [
|
||||
{ORTH: "as.", LEMMA: "asunto"}
|
||||
],
|
||||
"ed.": [
|
||||
{ORTH: "ed.", LEMMA: "edellinen"}
|
||||
],
|
||||
"esim.": [
|
||||
{ORTH: "esim.", LEMMA: "esimerkki"}
|
||||
],
|
||||
"huom.": [
|
||||
{ORTH: "huom.", LEMMA: "huomautus"}
|
||||
],
|
||||
"jne.": [
|
||||
{ORTH: "jne.", LEMMA: "ja niin edelleen"}
|
||||
],
|
||||
"joht.": [
|
||||
{ORTH: "joht.", LEMMA: "johtaja"}
|
||||
],
|
||||
"k.": [
|
||||
{ORTH: "k.", LEMMA: "kuollut"}
|
||||
],
|
||||
"ks.": [
|
||||
{ORTH: "ks.", LEMMA: "katso"}
|
||||
],
|
||||
"lk.": [
|
||||
{ORTH: "lk.", LEMMA: "luokka"}
|
||||
],
|
||||
"lkm.": [
|
||||
{ORTH: "lkm.", LEMMA: "lukumäärä"}
|
||||
],
|
||||
"lyh.": [
|
||||
{ORTH: "lyh.", LEMMA: "lyhenne"}
|
||||
],
|
||||
"läh.": [
|
||||
{ORTH: "läh.", LEMMA: "lähettäjä"}
|
||||
],
|
||||
"miel.": [
|
||||
{ORTH: "miel.", LEMMA: "mieluummin"}
|
||||
],
|
||||
"milj.": [
|
||||
{ORTH: "milj.", LEMMA: "miljoona"}
|
||||
],
|
||||
"mm.": [
|
||||
{ORTH: "mm.", LEMMA: "muun muassa"}
|
||||
],
|
||||
"myöh.": [
|
||||
{ORTH: "myöh.", LEMMA: "myöhempi"}
|
||||
],
|
||||
"n.": [
|
||||
{ORTH: "n.", LEMMA: "noin"}
|
||||
],
|
||||
"nimim.": [
|
||||
{ORTH: "nimim.", LEMMA: "nimimerkki"}
|
||||
],
|
||||
"ns.": [
|
||||
{ORTH: "ns.", LEMMA: "niin sanottu"}
|
||||
],
|
||||
"nyk.": [
|
||||
{ORTH: "nyk.", LEMMA: "nykyinen"}
|
||||
],
|
||||
"oik.": [
|
||||
{ORTH: "oik.", LEMMA: "oikealla"}
|
||||
],
|
||||
"os.": [
|
||||
{ORTH: "os.", LEMMA: "osoite"}
|
||||
],
|
||||
"p.": [
|
||||
{ORTH: "p.", LEMMA: "päivä"}
|
||||
],
|
||||
"par.": [
|
||||
{ORTH: "par.", LEMMA: "paremmin"}
|
||||
],
|
||||
"per.": [
|
||||
{ORTH: "per.", LEMMA: "perustettu"}
|
||||
],
|
||||
"pj.": [
|
||||
{ORTH: "pj.", LEMMA: "puheenjohtaja"}
|
||||
],
|
||||
"puh.joht.": [
|
||||
{ORTH: "puh.joht.", LEMMA: "puheenjohtaja"}
|
||||
],
|
||||
"prof.": [
|
||||
{ORTH: "prof.", LEMMA: "professori"}
|
||||
],
|
||||
"puh.": [
|
||||
{ORTH: "puh.", LEMMA: "puhelin"}
|
||||
],
|
||||
"pvm.": [
|
||||
{ORTH: "pvm.", LEMMA: "päivämäärä"}
|
||||
],
|
||||
"rak.": [
|
||||
{ORTH: "rak.", LEMMA: "rakennettu"}
|
||||
],
|
||||
"ry.": [
|
||||
{ORTH: "ry.", LEMMA: "rekisteröity yhdistys"}
|
||||
],
|
||||
"s.": [
|
||||
{ORTH: "s.", LEMMA: "sivu"}
|
||||
],
|
||||
"siht.": [
|
||||
{ORTH: "siht.", LEMMA: "sihteeri"}
|
||||
],
|
||||
"synt.": [
|
||||
{ORTH: "synt.", LEMMA: "syntynyt"}
|
||||
],
|
||||
"t.": [
|
||||
{ORTH: "t.", LEMMA: "toivoo"}
|
||||
],
|
||||
"tark.": [
|
||||
{ORTH: "tark.", LEMMA: "tarkastanut"}
|
||||
],
|
||||
"til.": [
|
||||
{ORTH: "til.", LEMMA: "tilattu"}
|
||||
],
|
||||
"tms.": [
|
||||
{ORTH: "tms.", LEMMA: "tai muuta sellaista"}
|
||||
],
|
||||
"toim.": [
|
||||
{ORTH: "toim.", LEMMA: "toimittanut"}
|
||||
],
|
||||
"v.": [
|
||||
{ORTH: "v.", LEMMA: "vuosi"}
|
||||
],
|
||||
"vas.": [
|
||||
{ORTH: "vas.", LEMMA: "vasen"}
|
||||
],
|
||||
"vast.": [
|
||||
{ORTH: "vast.", LEMMA: "vastaus"}
|
||||
],
|
||||
"vrt.": [
|
||||
{ORTH: "vrt.", LEMMA: "vertaa"}
|
||||
],
|
||||
"yht.": [
|
||||
{ORTH: "yht.", LEMMA: "yhteensä"}
|
||||
],
|
||||
"yl.": [
|
||||
{ORTH: "yl.", LEMMA: "yleinen"}
|
||||
],
|
||||
"ym.": [
|
||||
{ORTH: "ym.", LEMMA: "ynnä muuta"}
|
||||
],
|
||||
"yms.": [
|
||||
{ORTH: "yms.", LEMMA: "ynnä muuta sellaista"}
|
||||
],
|
||||
"yo.": [
|
||||
{ORTH: "yo.", LEMMA: "ylioppilas"}
|
||||
],
|
||||
"yliopp.": [
|
||||
{ORTH: "yliopp.", LEMMA: "ylioppilas"}
|
||||
],
|
||||
"ao.": [
|
||||
{ORTH: "ao.", LEMMA: "asianomainen"}
|
||||
],
|
||||
"em.": [
|
||||
{ORTH: "em.", LEMMA: "edellä mainittu"}
|
||||
],
|
||||
"ko.": [
|
||||
{ORTH: "ko.", LEMMA: "kyseessä oleva"}
|
||||
],
|
||||
"ml.": [
|
||||
{ORTH: "ml.", LEMMA: "mukaan luettuna"}
|
||||
],
|
||||
"po.": [
|
||||
{ORTH: "po.", LEMMA: "puheena oleva"}
|
||||
],
|
||||
"so.": [
|
||||
{ORTH: "so.", LEMMA: "se on"}
|
||||
],
|
||||
"ts.": [
|
||||
{ORTH: "ts.", LEMMA: "toisin sanoen"}
|
||||
],
|
||||
"vm.": [
|
||||
{ORTH: "vm.", LEMMA: "viimeksi mainittu"}
|
||||
],
|
||||
"siht.": [
|
||||
{ORTH: "siht.", LEMMA: "sihteeri"}
|
||||
],
|
||||
"srk.": [
|
||||
{ORTH: "srk.", LEMMA: "seurakunta"}
|
||||
]
|
||||
}
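# Editorial usage sketch (hedged, not part of the commit): these entries keep the
# listed abbreviations as single tokens instead of splitting off the trailing
# period. A minimal check, assuming this branch of spaCy is installed, mirroring
# the Finnish tokenizer test added in this commit:
#
#     from spacy.fi import Finnish
#     tokenizer = Finnish.Defaults.create_tokenizer()
#     tokens = [t.text for t in tokenizer(u"Paino on n. 2.2 kg")]
#     assert tokens == [u"Paino", u"on", u"n.", u"2.2", u"kg"]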
|
|
@ -50,6 +50,7 @@ EMOTICONS = set("""
|
|||
:/
|
||||
:-/
|
||||
=/
|
||||
=|
|
||||
:|
|
||||
:-|
|
||||
:1
|
||||
|
|
|
@ -72,7 +72,7 @@ HYPHENS = _HYPHENS.strip().replace(' ', '|')
|
|||
# Prefixes
|
||||
|
||||
TOKENIZER_PREFIXES = (
|
||||
['§', '%', r'\+'] +
|
||||
['§', '%', '=', r'\+'] +
|
||||
LIST_PUNCT +
|
||||
LIST_ELLIPSES +
|
||||
LIST_QUOTES +
|
||||
|
@ -106,7 +106,7 @@ TOKENIZER_INFIXES = (
|
|||
r'(?<=[0-9])[+\-\*^](?=[0-9-])',
|
||||
r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
|
||||
r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
|
||||
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
|
||||
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA)
|
||||
]
|
||||
)
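# Editorial note (hedged): the widened infix pattern above (optional punctuation
# before a hyphen run between letters) is what lets strings like "day.--Is" split
# into ["day", ".--", "Is"]; see the regression test for issue #801 added in this
# commit.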
|
||||
|
|
|
@ -5,12 +5,14 @@ from .. import language_data as base
|
|||
from ..language_data import update_exc, strings_to_exc
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
|
||||
|
||||
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
|
||||
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
|
||||
|
||||
|
||||
|
|
spacy/sv/lemma_rules.py (new file, 45 lines)
|
@ -0,0 +1,45 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
LEMMA_RULES = {
|
||||
"noun": [
|
||||
["t", ""],
|
||||
["n", ""],
|
||||
["na", ""],
|
||||
["na", "e"],
|
||||
["or", "a"],
|
||||
["orna", "a"],
|
||||
["et", ""],
|
||||
["en", ""],
|
||||
["en", "e"],
|
||||
["er", ""],
|
||||
["erna", ""],
|
||||
["ar", "e"],
|
||||
["ar", ""],
|
||||
["lar", "el"],
|
||||
["arna", "e"],
|
||||
["arna", ""],
|
||||
["larna", "el"]
|
||||
],
|
||||
|
||||
"adj": [
|
||||
["are", ""],
|
||||
["ast", ""],
|
||||
["re", ""],
|
||||
["st", ""],
|
||||
["ägre", "åg"],
|
||||
["ägst", "åg"],
|
||||
["ängre", "ång"],
|
||||
["ängst", "ång"],
|
||||
["örre", "or"],
|
||||
["örst", "or"],
|
||||
],
|
||||
|
||||
"punct": [
|
||||
["“", "\""],
|
||||
["”", "\""],
|
||||
["\u2018", "'"],
|
||||
["\u2019", "'"]
|
||||
]
|
||||
}
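# Editorial sketch (hedged, not part of the commit): spaCy's rule-based lemmatizer
# reads each entry above as a [suffix, replacement] pair. A minimal standalone
# illustration of that idea (hypothetical helper, not the library's actual code):
def _example_apply_rule(string, suffix, repl):
    # strip the matching suffix and append the replacement form
    if string.endswith(suffix):
        return string[:-len(suffix)] + repl
    return string

# e.g. _example_apply_rule(u"flickor", u"or", u"a") == u"flicka"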
|
|
@ -5,7 +5,31 @@ from ..symbols import *
|
|||
from ..language_data import PRON_LEMMA
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
EXC = {}
|
||||
|
||||
# Verbs
|
||||
|
||||
for verb_data in [
|
||||
{ORTH: "driver"},
|
||||
{ORTH: "kör"},
|
||||
{ORTH: "hörr", LEMMA: "hör"},
|
||||
{ORTH: "fattar"},
|
||||
{ORTH: "hajar", LEMMA: "förstår"},
|
||||
{ORTH: "lever"},
|
||||
{ORTH: "serr", LEMMA: "ser"},
|
||||
{ORTH: "fixar"}
|
||||
]:
|
||||
verb_data_tc = dict(verb_data)
|
||||
verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
|
||||
|
||||
for data in [verb_data, verb_data_tc]:
|
||||
EXC[data[ORTH] + "u"] = [
|
||||
dict(data),
|
||||
{ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}
|
||||
]
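# Editorial note (hedged): after this loop, EXC holds colloquial verb+clitic forms
# in both lower and title case, e.g.
#   EXC["driveru"] == [{ORTH: "driver"}, {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}]
#   EXC["Serru"]   == [{ORTH: "Serr", LEMMA: "ser"}, {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}]
# which is what the "driveru"/"Serru" cases in the Swedish tokenizer tests rely on.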
|
||||
|
||||
|
||||
ABBREVIATIONS = {
|
||||
"jan.": [
|
||||
{ORTH: "jan.", LEMMA: "januari"}
|
||||
],
|
||||
|
@ -63,6 +87,63 @@ TOKENIZER_EXCEPTIONS = {
|
|||
"sön.": [
|
||||
{ORTH: "sön.", LEMMA: "söndag"}
|
||||
],
|
||||
"Jan.": [
|
||||
{ORTH: "Jan.", LEMMA: "Januari"}
|
||||
],
|
||||
"Febr.": [
|
||||
{ORTH: "Febr.", LEMMA: "Februari"}
|
||||
],
|
||||
"Feb.": [
|
||||
{ORTH: "Feb.", LEMMA: "Februari"}
|
||||
],
|
||||
"Apr.": [
|
||||
{ORTH: "Apr.", LEMMA: "April"}
|
||||
],
|
||||
"Jun.": [
|
||||
{ORTH: "Jun.", LEMMA: "Juni"}
|
||||
],
|
||||
"Jul.": [
|
||||
{ORTH: "Jul.", LEMMA: "Juli"}
|
||||
],
|
||||
"Aug.": [
|
||||
{ORTH: "Aug.", LEMMA: "Augusti"}
|
||||
],
|
||||
"Sept.": [
|
||||
{ORTH: "Sept.", LEMMA: "September"}
|
||||
],
|
||||
"Sep.": [
|
||||
{ORTH: "Sep.", LEMMA: "September"}
|
||||
],
|
||||
"Okt.": [
|
||||
{ORTH: "Okt.", LEMMA: "Oktober"}
|
||||
],
|
||||
"Nov.": [
|
||||
{ORTH: "Nov.", LEMMA: "November"}
|
||||
],
|
||||
"Dec.": [
|
||||
{ORTH: "Dec.", LEMMA: "December"}
|
||||
],
|
||||
"Mån.": [
|
||||
{ORTH: "Mån.", LEMMA: "Måndag"}
|
||||
],
|
||||
"Tis.": [
|
||||
{ORTH: "Tis.", LEMMA: "Tisdag"}
|
||||
],
|
||||
"Ons.": [
|
||||
{ORTH: "Ons.", LEMMA: "Onsdag"}
|
||||
],
|
||||
"Tors.": [
|
||||
{ORTH: "Tors.", LEMMA: "Torsdag"}
|
||||
],
|
||||
"Fre.": [
|
||||
{ORTH: "Fre.", LEMMA: "Fredag"}
|
||||
],
|
||||
"Lör.": [
|
||||
{ORTH: "Lör.", LEMMA: "Lördag"}
|
||||
],
|
||||
"Sön.": [
|
||||
{ORTH: "Sön.", LEMMA: "Söndag"}
|
||||
],
|
||||
"sthlm": [
|
||||
{ORTH: "sthlm", LEMMA: "Stockholm"}
|
||||
],
|
||||
|
@ -72,6 +153,10 @@ TOKENIZER_EXCEPTIONS = {
|
|||
}
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = dict(EXC)
|
||||
TOKENIZER_EXCEPTIONS.update(ABBREVIATIONS)
|
||||
|
||||
|
||||
ORTH_ONLY = [
|
||||
"ang.",
|
||||
"anm.",
|
||||
|
@ -107,7 +192,6 @@ ORTH_ONLY = [
|
|||
"p.g.a.",
|
||||
"ref.",
|
||||
"resp.",
|
||||
"s.",
|
||||
"s.a.s.",
|
||||
"s.k.",
|
||||
"st.",
|
||||
|
|
|
@ -10,6 +10,7 @@ from ..pt import Portuguese
|
|||
from ..nl import Dutch
|
||||
from ..sv import Swedish
|
||||
from ..hu import Hungarian
|
||||
from ..fi import Finnish
|
||||
from ..tokens import Doc
|
||||
from ..strings import StringStore
|
||||
from ..lemmatizer import Lemmatizer
|
||||
|
@ -23,7 +24,7 @@ import pytest
|
|||
|
||||
|
||||
LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
|
||||
Swedish, Hungarian]
|
||||
Swedish, Hungarian, Finnish]
|
||||
|
||||
|
||||
@pytest.fixture(params=LANGUAGES)
|
||||
|
@ -62,6 +63,16 @@ def hu_tokenizer():
|
|||
return Hungarian.Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def fi_tokenizer():
|
||||
return Finnish.Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sv_tokenizer():
|
||||
return Swedish.Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def stringstore():
|
||||
return StringStore()
|
||||
|
|
spacy/tests/fi/__init__.py (new empty file)
spacy/tests/fi/test_tokenizer.py (new file, 18 lines)
|
@ -0,0 +1,18 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
ABBREVIATION_TESTS = [
|
||||
('Hyvää uutta vuotta t. siht. Niemelä!', ['Hyvää', 'uutta', 'vuotta', 't.', 'siht.', 'Niemelä', '!']),
|
||||
('Paino on n. 2.2 kg', ['Paino', 'on', 'n.', '2.2', 'kg'])
|
||||
]
|
||||
|
||||
TESTCASES = ABBREVIATION_TESTS
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
|
||||
tokens = fi_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
spacy/tests/regression/test_issue792.py (new file, 12 lines)
|
@ -0,0 +1,12 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
|
||||
def test_issue792(en_tokenizer, text):
|
||||
"""Test for Issue #792: Trailing whitespace is removed after parsing."""
|
||||
doc = en_tokenizer(text)
|
||||
assert doc.text_with_ws == text
|
spacy/tests/regression/test_issue801.py (new file, 19 lines)
|
@ -0,0 +1,19 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,tokens', [
|
||||
('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
|
||||
("exception;--exclusive", ["exception", ";--", "exclusive"]),
|
||||
("day.--Is", ["day", ".--", "Is"]),
|
||||
("refinement:--just", ["refinement", ":--", "just"]),
|
||||
("memories?--To", ["memories", "?--", "To"]),
|
||||
("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
|
||||
("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
|
||||
def test_issue801(en_tokenizer, text, tokens):
|
||||
"""Test that special characters + hyphens are split correctly."""
|
||||
doc = en_tokenizer(text)
|
||||
assert len(doc) == len(tokens)
|
||||
assert [t.text for t in doc] == tokens
|
spacy/tests/regression/test_issue805.py (new file, 15 lines)
|
@ -0,0 +1,15 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
SV_TOKEN_EXCEPTION_TESTS = [
|
||||
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
|
||||
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
|
||||
]
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
|
||||
def test_issue805(sv_tokenizer, text, expected_tokens):
|
||||
tokens = sv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
spacy/tests/sv/__init__.py (new empty file)
spacy/tests/sv/test_tokenizer.py (new file, 24 lines)
|
@ -0,0 +1,24 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
SV_TOKEN_EXCEPTION_TESTS = [
|
||||
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
|
||||
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
|
||||
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
|
||||
tokens = sv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
|
||||
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
|
||||
tokens = sv_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[1].text == "u"
|
|
@ -500,7 +500,8 @@ cdef class Doc:
|
|||
by the values of the given attribute ID.
|
||||
|
||||
Example:
|
||||
from spacy.en import English, attrs
|
||||
from spacy.en import English
|
||||
from spacy import attrs
|
||||
nlp = English()
|
||||
tokens = nlp(u'apple apple orange banana')
|
||||
tokens.count_by(attrs.ORTH)
|
||||
|
@ -585,9 +586,6 @@ cdef class Doc:
|
|||
elif attr_id == POS:
|
||||
for i in range(length):
|
||||
tokens[i].pos = <univ_pos_t>values[i]
|
||||
elif attr_id == TAG:
|
||||
for i in range(length):
|
||||
tokens[i].tag = <univ_pos_t>values[i]
|
||||
elif attr_id == DEP:
|
||||
for i in range(length):
|
||||
tokens[i].dep = values[i]
|
||||
|
|
|
@ -55,7 +55,7 @@
|
|||
},
|
||||
|
||||
"V_CSS": "1.15",
|
||||
"V_JS": "1.0",
|
||||
"V_JS": "1.1",
|
||||
"DEFAULT_SYNTAX": "python",
|
||||
"ANALYTICS": "UA-58931649-1",
|
||||
"MAILCHIMP": {
|
||||
|
|
|
@ -14,11 +14,11 @@
|
|||
const updateNav = () => {
|
||||
const vh = updateVh()
|
||||
const newScrollY = (window.pageYOffset || document.scrollTop) - (document.clientTop || 0)
|
||||
scrollUp = newScrollY <= scrollY
|
||||
if (newScrollY != scrollY) scrollUp = newScrollY <= scrollY
|
||||
scrollY = newScrollY
|
||||
|
||||
if(scrollUp && !(isNaN(scrollY) || scrollY <= vh)) nav.classList.add(fixedClass)
|
||||
else if(!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) nav.classList.remove(fixedClass)
|
||||
else if (!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) nav.classList.remove(fixedClass)
|
||||
}
|
||||
|
||||
window.addEventListener('scroll', () => requestAnimationFrame(updateNav))
|
||||
|
|
|
@ -19,21 +19,6 @@ p spaCy currently supports the following languages and capabilities:
|
|||
each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Chinese #[code zh]
|
||||
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Spanish #[code es]
|
||||
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
p
|
||||
| Chinese tokenization requires the
|
||||
| #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
|
||||
| models are coming soon.
|
||||
|
||||
|
||||
+h(2, "alpha-support") Alpha support
|
||||
|
||||
|
@ -42,8 +27,13 @@ p
|
|||
| the existing language data and extending the tokenization patterns.
|
||||
|
||||
+table([ "Language", "Source" ])
|
||||
each language, code in { it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", hu: "Hungarian" }
|
||||
each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", hu: "Hungarian" }
|
||||
+row
|
||||
+cell #{language} #[code=code]
|
||||
+cell
|
||||
+src(gh("spaCy", "spacy/" + code)) spacy/#{code}
|
||||
|
||||
p
|
||||
| Chinese tokenization requires the
|
||||
| #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
|
||||
| models are coming soon.
|
||||
|
|
|
@ -54,7 +54,7 @@ p
|
|||
doc = nlp(u'London is a big city in the United Kingdom.')
|
||||
doc.ents = []
|
||||
assert doc[0].ent_type_ == ''
|
||||
doc.ents = [Span(0, 1, label='GPE')]
|
||||
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
|
||||
assert doc[0].ent_type_ == 'GPE'
|
||||
doc.ents = []
|
||||
doc.ents = [(u'LondonCity', u'GPE', 0, 1)]
|
||||
|
|
|
@ -20,7 +20,7 @@ p
|
|||
| Once we've added the pattern, we can use the #[code matcher] as a
|
||||
| callable, to receive a list of #[code (ent_id, start, end)] tuples.
|
||||
| Note that #[code LOWER] and #[code IS_PUNCT] are data attributes
|
||||
| of #[code Matcher.attrs].
|
||||
| of #[code spacy.attrs].
|
||||
|
||||
+code.
|
||||
from spacy.matcher import Matcher
|
||||
|
|