Merge branch 'master' of ssh://github.com/explosion/spaCy

Matthew Honnibal, 2016-12-27 21:04:10 +01:00, commit cade536d1e
45 changed files with 1946 additions and 497 deletions

.github/contributors/magnusburton.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------------- |
| Name | Magnus Burton |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 17-12-2016 |
| GitHub username | magnusburton |
| Website (optional) | |

.github/contributors/oroszgy.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | György Orosz |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2016-12-26 |
| GitHub username | oroszgy |
| Website (optional) | gyorgy.orosz.link |


@ -8,6 +8,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Christoph Schwienheer, [@chssch](https://github.com/chssch)
* Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk)
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
* György Orosz, [@oroszgy](https://github.com/oroszgy)
* Henning Peters, [@henningpeters](https://github.com/henningpeters)
* Ines Montani, [@ines](https://github.com/ines)
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
@ -16,6 +17,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
* Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
* Liling Tan, [@alvations](https://github.com/alvations)
* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
* Matthew Honnibal, [@honnibal](https://github.com/honnibal)
* Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)


@ -1,8 +1,6 @@
The MIT License (MIT)
Copyright (C) 2015 Matthew Honnibal
2016 spaCy GmbH
2016 ExplosionAI UG (haftungsbeschränkt)
Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, 2015 Matthew Honnibal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal


@ -78,7 +78,7 @@ Features
See `facts, figures and benchmarks <https://spacy.io/docs/api/>`_.
Top Peformance
Top Performance
==============
* Fastest in the world: <50ms per document. No faster system has ever been


@ -3,6 +3,9 @@ import plac
import random
import six
import cProfile
import pstats
import pathlib
import cPickle as pickle
from itertools import izip
@ -81,7 +84,7 @@ class SentimentModel(Chain):
def __init__(self, nlp, shape, **settings):
Chain.__init__(self,
embed=_Embed(shape['nr_vector'], shape['nr_dim'], shape['nr_hidden'],
initialW=lambda arr: set_vectors(arr, nlp.vocab)),
set_vectors=lambda arr: set_vectors(arr, nlp.vocab)),
encode=_Encode(shape['nr_hidden'], shape['nr_hidden']),
attend=_Attend(shape['nr_hidden'], shape['nr_hidden']),
predict=_Predict(shape['nr_hidden'], shape['nr_class']))
@ -95,11 +98,11 @@ class SentimentModel(Chain):
class _Embed(Chain):
def __init__(self, nr_vector, nr_dim, nr_out):
def __init__(self, nr_vector, nr_dim, nr_out, set_vectors=None):
Chain.__init__(self,
embed=L.EmbedID(nr_vector, nr_dim),
embed=L.EmbedID(nr_vector, nr_dim, initialW=set_vectors),
project=L.Linear(None, nr_out, nobias=True))
#self.embed.unchain_backward()
self.embed.W.volatile = False
def __call__(self, sentence):
return [self.project(self.embed(ts)) for ts in F.transpose(sentence)]
@ -214,7 +217,6 @@ def set_vectors(vectors, vocab):
vectors[lex.rank + 1] = lex.vector
else:
lex.norm = 0
vectors.unchain_backwards()
return vectors
@ -223,7 +225,9 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
by_sentence=True):
nlp = spacy.load('en', entity=False)
if 'nr_vector' not in lstm_shape:
lstm_shape['nr_vector'] = max(lex.rank+1 for lex in vocab if lex.has_vector)
lstm_shape['nr_vector'] = max(lex.rank+1 for lex in nlp.vocab if lex.has_vector)
if 'nr_dim' not in lstm_shape:
lstm_shape['nr_dim'] = nlp.vocab.vectors_length
print("Make model")
model = Classifier(SentimentModel(nlp, lstm_shape, **lstm_settings))
print("Parsing texts...")
@ -240,7 +244,7 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)
updater = chainer.training.StandardUpdater(train_iter, optimizer, device=0)
trainer = chainer.training.Trainer(updater, (20, 'epoch'), out='result')
trainer = chainer.training.Trainer(updater, (1, 'epoch'), out='result')
trainer.extend(extensions.Evaluator(dev_iter, model, device=0))
trainer.extend(extensions.LogReport())
@ -305,11 +309,14 @@ def main(model_dir, train_dir, dev_dir,
dev_labels = xp.asarray(dev_labels, dtype='i')
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 2,
'nr_vector': 2000, 'nr_dim': 32},
'nr_vector': 5000},
{'dropout': 0.5, 'lr': learn_rate},
{},
nb_epoch=nb_epoch, batch_size=batch_size)
if __name__ == '__main__':
#cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
#s = pstats.Stats("Profile.prof")
#s.strip_dirs().sort_stats("time").print_stats()
plac.call(main)
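For context on the embedding change above: the pre-trained vectors are now handed to the embedding layer at construction time through Chainer's `initialW` argument rather than copied into the table afterwards. A minimal sketch of that pattern follows (Chainer 1.x, as used by this example; the zero matrix is a stand-in for the output of `set_vectors(arr, nlp.vocab)`):

```python
# Sketch only: pre-trained vectors are passed to L.EmbedID via initialW,
# mirroring the change in SentimentModel/_Embed above.
import numpy
import chainer.links as L

nr_vector, nr_dim = 1000, 50
pretrained = numpy.zeros((nr_vector, nr_dim), dtype='float32')  # stand-in for set_vectors(arr, nlp.vocab)
embed = L.EmbedID(nr_vector, nr_dim, initialW=pretrained)       # embedding table initialised from the matrix
embed.W.volatile = False                                        # mirrors the line added in _Embed above
```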


@ -111,10 +111,9 @@ def compile_lstm(embeddings, shape, settings):
mask_zero=True
)
)
model.add(TimeDistributed(Dense(shape['nr_hidden'] * 2, bias=False)))
model.add(Dropout(settings['dropout']))
model.add(Bidirectional(LSTM(shape['nr_hidden'])))
model.add(Dropout(settings['dropout']))
model.add(TimeDistributed(Dense(shape['nr_hidden'], bias=False)))
model.add(Bidirectional(LSTM(shape['nr_hidden'], dropout_U=settings['dropout'],
dropout_W=settings['dropout'])))
model.add(Dense(shape['nr_class'], activation='sigmoid'))
model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy',
metrics=['accuracy'])
@ -195,7 +194,7 @@ def main(model_dir, train_dir, dev_dir,
dev_labels = numpy.asarray(dev_labels, dtype='int32')
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 1},
{'dropout': 0.5, 'lr': learn_rate},
{'dropout': dropout, 'lr': learn_rate},
{},
nb_epoch=nb_epoch, batch_size=batch_size)
weights = lstm.get_weights()
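The model change above replaces the TimeDistributed projection and separate Dropout layers with dropout inside the LSTM itself. In the Keras 1.x API this example targets, `dropout_W` drops input connections and `dropout_U` drops recurrent connections. A minimal sketch of the resulting stack, with placeholder sizes rather than values from the example:

```python
# Sketch of the stack after the change; Keras 1.x argument names assumed.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

nr_hidden, nr_class, max_length, dropout = 64, 1, 100, 0.5
model = Sequential()
model.add(Embedding(20000, nr_hidden, input_length=max_length, mask_zero=True))
model.add(Bidirectional(LSTM(nr_hidden, dropout_W=dropout, dropout_U=dropout)))  # input vs. recurrent dropout
model.add(Dense(nr_class, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```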


@ -27,8 +27,10 @@ PACKAGES = [
'spacy.es',
'spacy.fr',
'spacy.it',
'spacy.hu',
'spacy.pt',
'spacy.nl',
'spacy.sv',
'spacy.language_data',
'spacy.serialize',
'spacy.syntax',
@ -95,7 +97,7 @@ LINK_OPTIONS = {
'other' : []
}
# I don't understand this very well yet. See Issue #267
# Fingers crossed!
#if os.environ.get('USE_OPENMP') == '1':


@ -8,9 +8,11 @@ from . import de
from . import zh
from . import es
from . import it
from . import hu
from . import fr
from . import pt
from . import nl
from . import sv
try:
@ -25,8 +27,10 @@ set_lang_class(es.Spanish.lang, es.Spanish)
set_lang_class(pt.Portuguese.lang, pt.Portuguese)
set_lang_class(fr.French.lang, fr.French)
set_lang_class(it.Italian.lang, it.Italian)
set_lang_class(hu.Hungarian.lang, hu.Hungarian)
set_lang_class(zh.Chinese.lang, zh.Chinese)
set_lang_class(nl.Dutch.lang, nl.Dutch)
set_lang_class(sv.Swedish.lang, sv.Swedish)
def load(name, **overrides):

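The block above wires each new language (Hungarian, Swedish, and the rest) into the same lookup that `load()` uses: every class is stored under its two-letter code. A standalone sketch of that registry pattern follows; this is an illustration only, not spaCy's actual implementation:

```python
# Toy registry illustrating the set_lang_class pattern above (not spaCy internals).
_LANG_CLASSES = {}

def set_lang_class(name, cls):
    _LANG_CLASSES[name] = cls          # store the class under its ISO code

def get_lang_class(name):
    return _LANG_CLASSES[name]         # look it up again, e.g. inside load()

class Hungarian(object):               # stand-in for spacy.hu.Hungarian
    lang = 'hu'

set_lang_class(Hungarian.lang, Hungarian)
assert get_lang_class('hu') is Hungarian
```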

@ -2,7 +2,7 @@
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
from ..language_data import PRON_LEMMA, DET_LEMMA
TOKENIZER_EXCEPTIONS = {
@ -15,23 +15,27 @@ TOKENIZER_EXCEPTIONS = {
],
"'S": [
{ORTH: "'S", LEMMA: PRON_LEMMA}
{ORTH: "'S", LEMMA: PRON_LEMMA, TAG: "PPER"}
],
"'n": [
{ORTH: "'n", LEMMA: "ein"}
{ORTH: "'n", LEMMA: DET_LEMMA, NORM: "ein"}
],
"'ne": [
{ORTH: "'ne", LEMMA: "eine"}
{ORTH: "'ne", LEMMA: DET_LEMMA, NORM: "eine"}
],
"'nen": [
{ORTH: "'nen", LEMMA: "einen"}
{ORTH: "'nen", LEMMA: DET_LEMMA, NORM: "einen"}
],
"'nem": [
{ORTH: "'nem", LEMMA: DET_LEMMA, NORM: "einem"}
],
"'s": [
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER"}
],
"Abb.": [
@ -195,7 +199,7 @@ TOKENIZER_EXCEPTIONS = {
],
"S'": [
{ORTH: "S'", LEMMA: PRON_LEMMA}
{ORTH: "S'", LEMMA: PRON_LEMMA, TAG: "PPER"}
],
"Sa.": [
@ -244,7 +248,7 @@ TOKENIZER_EXCEPTIONS = {
"auf'm": [
{ORTH: "auf", LEMMA: "auf"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem" }
],
"bspw.": [
@ -268,8 +272,8 @@ TOKENIZER_EXCEPTIONS = {
],
"du's": [
{ORTH: "du", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"ebd.": [
@ -285,8 +289,8 @@ TOKENIZER_EXCEPTIONS = {
],
"er's": [
{ORTH: "er", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"evtl.": [
@ -315,7 +319,7 @@ TOKENIZER_EXCEPTIONS = {
"hinter'm": [
{ORTH: "hinter", LEMMA: "hinter"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
],
"i.O.": [
@ -327,13 +331,13 @@ TOKENIZER_EXCEPTIONS = {
],
"ich's": [
{ORTH: "ich", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"ihr's": [
{ORTH: "ihr", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"incl.": [
@ -385,7 +389,7 @@ TOKENIZER_EXCEPTIONS = {
],
"s'": [
{ORTH: "s'", LEMMA: PRON_LEMMA}
{ORTH: "s'", LEMMA: PRON_LEMMA, TAG: "PPER"}
],
"s.o.": [
@ -393,8 +397,8 @@ TOKENIZER_EXCEPTIONS = {
],
"sie's": [
{ORTH: "sie", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"sog.": [
@ -423,7 +427,7 @@ TOKENIZER_EXCEPTIONS = {
"unter'm": [
{ORTH: "unter", LEMMA: "unter"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
],
"usf.": [
@ -464,12 +468,12 @@ TOKENIZER_EXCEPTIONS = {
"vor'm": [
{ORTH: "vor", LEMMA: "vor"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
],
"wir's": [
{ORTH: "wir", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
{ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
],
"z.B.": [
@ -506,7 +510,7 @@ TOKENIZER_EXCEPTIONS = {
"über'm": [
{ORTH: "über", LEMMA: "über"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
]
}
@ -625,5 +629,5 @@ ORTH_ONLY = [
"wiss.",
"x.",
"y.",
"z.",
"z."
]
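A note on the pattern used throughout this file: each exception maps a surface form to a list of token dicts, and this change moves the clitic articles from the pronoun placeholder to the new determiner placeholder while recording the expanded written form in NORM. A small illustration of one such entry, built with the real attribute constants; `DET_LEMMA` here mirrors the "-DET-" placeholder added to `spacy.language_data` in this merge:

```python
# Illustration only: the shape of the updated "auf'm" entry, not new language data.
from spacy.symbols import ORTH, LEMMA, NORM

DET_LEMMA = "-DET-"  # mirrors the constant added to spacy.language_data in this merge

auf_m = [
    {ORTH: "auf", LEMMA: "auf"},                  # the preposition keeps its own lemma
    {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"},  # clitic article: shared determiner lemma, NORM restores "dem"
]
```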


@ -44,6 +44,7 @@ def _fix_deprecated_glove_vectors_loading(overrides):
else:
path = overrides['path']
data_path = path.parent
vec_path = None
if 'add_vectors' not in overrides:
if 'vectors' in overrides:
vec_path = match_best_version(overrides['vectors'], None, data_path)


@ -11,7 +11,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Theydve": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -68,7 +68,7 @@ TOKENIZER_EXCEPTIONS = {
],
"itll": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
@ -113,7 +113,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Idve": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -124,23 +124,23 @@ TOKENIZER_EXCEPTIONS = {
],
"Ive": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"they'd": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
"Youdve": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"theyve": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -160,12 +160,12 @@ TOKENIZER_EXCEPTIONS = {
],
"I'm": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
],
"She'd've": [
{ORTH: "She", LEMMA: PRON_LEMMA},
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -191,7 +191,7 @@ TOKENIZER_EXCEPTIONS = {
],
"they've": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -226,12 +226,12 @@ TOKENIZER_EXCEPTIONS = {
],
"i'll": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"you'd": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -287,7 +287,7 @@ TOKENIZER_EXCEPTIONS = {
],
"youll": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
@ -307,7 +307,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Youre": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "re", LEMMA: "be"}
],
@ -369,7 +369,7 @@ TOKENIZER_EXCEPTIONS = {
],
"You'll": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
@ -379,7 +379,7 @@ TOKENIZER_EXCEPTIONS = {
],
"i'd": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -394,7 +394,7 @@ TOKENIZER_EXCEPTIONS = {
],
"i'm": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
],
@ -425,7 +425,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Hes": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "s"}
],
@ -435,7 +435,7 @@ TOKENIZER_EXCEPTIONS = {
],
"It's": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
@ -445,7 +445,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Hed": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
@ -464,12 +464,12 @@ TOKENIZER_EXCEPTIONS = {
],
"It'd": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
"theydve": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -489,7 +489,7 @@ TOKENIZER_EXCEPTIONS = {
],
"I've": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -499,13 +499,13 @@ TOKENIZER_EXCEPTIONS = {
],
"Itdve": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"I'ma": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ma"}
],
@ -515,7 +515,7 @@ TOKENIZER_EXCEPTIONS = {
],
"They'd": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -525,7 +525,7 @@ TOKENIZER_EXCEPTIONS = {
],
"You've": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -546,7 +546,7 @@ TOKENIZER_EXCEPTIONS = {
],
"I'd've": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -557,13 +557,13 @@ TOKENIZER_EXCEPTIONS = {
],
"it'd": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
"what're": [
{ORTH: "what"},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"Wasn't": [
@ -577,18 +577,18 @@ TOKENIZER_EXCEPTIONS = {
],
"he'd've": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
"She'd": [
{ORTH: "She", LEMMA: PRON_LEMMA},
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
"shedve": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -599,12 +599,12 @@ TOKENIZER_EXCEPTIONS = {
],
"She's": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
"i'd've": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -631,7 +631,7 @@ TOKENIZER_EXCEPTIONS = {
],
"you'd've": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -647,7 +647,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Youd": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
@ -678,12 +678,12 @@ TOKENIZER_EXCEPTIONS = {
],
"ive": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"It'd've": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -693,7 +693,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Itll": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
@ -708,12 +708,12 @@ TOKENIZER_EXCEPTIONS = {
],
"im": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
],
"they'd've": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -735,19 +735,19 @@ TOKENIZER_EXCEPTIONS = {
],
"youdve": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"Shedve": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"theyd": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
@ -763,11 +763,11 @@ TOKENIZER_EXCEPTIONS = {
"What're": [
{ORTH: "What"},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"He'll": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
@ -777,8 +777,8 @@ TOKENIZER_EXCEPTIONS = {
],
"They're": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"shouldnt": [
@ -796,7 +796,7 @@ TOKENIZER_EXCEPTIONS = {
],
"youve": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -816,7 +816,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Youve": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -841,12 +841,12 @@ TOKENIZER_EXCEPTIONS = {
],
"they're": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"idve": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -857,8 +857,8 @@ TOKENIZER_EXCEPTIONS = {
],
"youre": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "re"}
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "re", LEMMA: "be", NORM: "are"}
],
"Didn't": [
@ -877,8 +877,8 @@ TOKENIZER_EXCEPTIONS = {
],
"Im": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be", NORM: "am"}
],
"howd": [
@ -887,22 +887,22 @@ TOKENIZER_EXCEPTIONS = {
],
"you've": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
"You're": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"she'll": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"Theyll": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
@ -912,12 +912,12 @@ TOKENIZER_EXCEPTIONS = {
],
"itd": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
"Hedve": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -933,8 +933,8 @@ TOKENIZER_EXCEPTIONS = {
],
"We're": [
{ORTH: "We", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"\u2018S": [
@ -951,7 +951,7 @@ TOKENIZER_EXCEPTIONS = {
],
"ima": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ma"}
],
@ -961,7 +961,7 @@ TOKENIZER_EXCEPTIONS = {
],
"he's": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
@ -981,13 +981,13 @@ TOKENIZER_EXCEPTIONS = {
],
"hedve": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
"he'd": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1029,7 +1029,7 @@ TOKENIZER_EXCEPTIONS = {
],
"You'd've": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -1072,12 +1072,12 @@ TOKENIZER_EXCEPTIONS = {
],
"wont": [
{ORTH: "wo"},
{ORTH: "wo", LEMMA: "will"},
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
],
"she'd've": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -1088,7 +1088,7 @@ TOKENIZER_EXCEPTIONS = {
],
"theyre": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "re"}
],
@ -1129,7 +1129,7 @@ TOKENIZER_EXCEPTIONS = {
],
"They'll": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
@ -1139,7 +1139,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Wedve": [
{ORTH: "We"},
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -1156,7 +1156,7 @@ TOKENIZER_EXCEPTIONS = {
],
"we'd": [
{ORTH: "we"},
{ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1193,7 +1193,7 @@ TOKENIZER_EXCEPTIONS = {
"why're": [
{ORTH: "why"},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"Doesnt": [
@ -1207,12 +1207,12 @@ TOKENIZER_EXCEPTIONS = {
],
"they'll": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"I'd": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1237,12 +1237,12 @@ TOKENIZER_EXCEPTIONS = {
],
"you're": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}
],
"They've": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -1272,12 +1272,12 @@ TOKENIZER_EXCEPTIONS = {
],
"She'll": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"You'd": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1297,8 +1297,8 @@ TOKENIZER_EXCEPTIONS = {
],
"Theyre": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "re"}
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "re", LEMMA: "be", NORM: "are"}
],
"Won't": [
@ -1312,33 +1312,33 @@ TOKENIZER_EXCEPTIONS = {
],
"it's": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
"it'll": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"They'd've": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
"Ima": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ma"}
],
"gonna": [
{ORTH: "gon", LEMMA: "go"},
{ORTH: "gon", LEMMA: "go", NORM: "going"},
{ORTH: "na", LEMMA: "to"}
],
"Gonna": [
{ORTH: "Gon", LEMMA: "go"},
{ORTH: "Gon", LEMMA: "go", NORM: "going"},
{ORTH: "na", LEMMA: "to"}
],
@ -1359,7 +1359,7 @@ TOKENIZER_EXCEPTIONS = {
],
"youd": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
@ -1390,7 +1390,7 @@ TOKENIZER_EXCEPTIONS = {
],
"He'd've": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -1427,17 +1427,17 @@ TOKENIZER_EXCEPTIONS = {
],
"hes": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "s"}
],
"he'll": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"hed": [
{ORTH: "he", LEMMA: PRON_LEMMA},
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
@ -1447,8 +1447,8 @@ TOKENIZER_EXCEPTIONS = {
],
"we're": [
{ORTH: "we", LEMMA: PRON_LEMMA},
{ORTH: "'re", LEMMA: "be"}
{ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM :"are"}
],
"Hadnt": [
@ -1457,12 +1457,12 @@ TOKENIZER_EXCEPTIONS = {
],
"Shant": [
{ORTH: "Sha"},
{ORTH: "Sha", LEMMA: "shall"},
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
],
"Theyve": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -1477,7 +1477,7 @@ TOKENIZER_EXCEPTIONS = {
],
"i've": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
@ -1487,7 +1487,7 @@ TOKENIZER_EXCEPTIONS = {
],
"i'ma": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ma"}
],
@ -1502,7 +1502,7 @@ TOKENIZER_EXCEPTIONS = {
],
"shant": [
{ORTH: "sha"},
{ORTH: "sha", LEMMA: "shall"},
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
],
@ -1513,7 +1513,7 @@ TOKENIZER_EXCEPTIONS = {
],
"I'll": [
{ORTH: "I", LEMMA: PRON_LEMMA},
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
@ -1571,7 +1571,7 @@ TOKENIZER_EXCEPTIONS = {
],
"shes": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "s"}
],
@ -1586,12 +1586,12 @@ TOKENIZER_EXCEPTIONS = {
],
"Hasnt": [
{ORTH: "Has"},
{ORTH: "Has", LEMMA: "have"},
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
],
"He's": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
@ -1611,12 +1611,12 @@ TOKENIZER_EXCEPTIONS = {
],
"He'd": [
{ORTH: "He", LEMMA: PRON_LEMMA},
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
"Shes": [
{ORTH: "i", LEMMA: PRON_LEMMA},
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "s"}
],
@ -1626,7 +1626,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Youll": [
{ORTH: "You", LEMMA: PRON_LEMMA},
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
@ -1636,18 +1636,18 @@ TOKENIZER_EXCEPTIONS = {
],
"theyll": [
{ORTH: "they", LEMMA: PRON_LEMMA},
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
],
"it'd've": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
],
"itdve": [
{ORTH: "it", LEMMA: PRON_LEMMA},
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
],
@ -1674,7 +1674,7 @@ TOKENIZER_EXCEPTIONS = {
],
"Wont": [
{ORTH: "Wo"},
{ORTH: "Wo", LEMMA: "will"},
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
],
@ -1691,7 +1691,7 @@ TOKENIZER_EXCEPTIONS = {
"Whatre": [
{ORTH: "What"},
{ORTH: "re"}
{ORTH: "re", LEMMA: "be", NORM: "are"}
],
"'s": [
@ -1719,12 +1719,12 @@ TOKENIZER_EXCEPTIONS = {
],
"It'll": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"We'd": [
{ORTH: "We"},
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1738,12 +1738,12 @@ TOKENIZER_EXCEPTIONS = {
],
"Itd": [
{ORTH: "It", LEMMA: PRON_LEMMA},
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
"she'd": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
],
@ -1758,17 +1758,17 @@ TOKENIZER_EXCEPTIONS = {
],
"you'll": [
{ORTH: "you", LEMMA: PRON_LEMMA},
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
],
"Theyd": [
{ORTH: "They", LEMMA: PRON_LEMMA},
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", TAG: "MD"}
],
"she's": [
{ORTH: "she", LEMMA: PRON_LEMMA},
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
{ORTH: "'s"}
],
@ -1783,7 +1783,7 @@ TOKENIZER_EXCEPTIONS = {
],
"'em": [
{ORTH: "'em", LEMMA: PRON_LEMMA}
{ORTH: "'em", LEMMA: PRON_LEMMA, NORM: "them"}
],
"ol'": [


@ -3,17 +3,48 @@ from __future__ import unicode_literals
from .. import language_data as base
from ..language_data import update_exc, strings_to_exc
from ..symbols import ORTH, LEMMA
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
def get_time_exc(hours):
exc = {
"12m.": [
{ORTH: "12"},
{ORTH: "m.", LEMMA: "p.m."}
]
}
for hour in hours:
exc["%da.m." % hour] = [
{ORTH: hour},
{ORTH: "a.m."}
]
exc["%dp.m." % hour] = [
{ORTH: hour},
{ORTH: "p.m."}
]
exc["%dam" % hour] = [
{ORTH: hour},
{ORTH: "am", LEMMA: "a.m."}
]
exc["%dpm" % hour] = [
{ORTH: hour},
{ORTH: "pm", LEMMA: "p.m."}
]
return exc
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
STOP_WORDS = set(STOP_WORDS)
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, 12 + 1)))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]


@ -2,317 +2,138 @@
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
from ..language_data import PRON_LEMMA, DET_LEMMA
TOKENIZER_EXCEPTIONS = {
"accidentarse": [
{ORTH: "accidentar", LEMMA: "accidentar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"aceptarlo": [
{ORTH: "aceptar", LEMMA: "aceptar", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"acompañarla": [
{ORTH: "acompañar", LEMMA: "acompañar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"advertirle": [
{ORTH: "advertir", LEMMA: "advertir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"al": [
{ORTH: "a", LEMMA: "a", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
{ORTH: "a", LEMMA: "a", TAG: ADP},
{ORTH: "el", LEMMA: "el", TAG: DET}
],
"anunciarnos": [
{ORTH: "anunciar", LEMMA: "anunciar", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
"consigo": [
{ORTH: "con", LEMMA: "con"},
{ORTH: "sigo", LEMMA: PRON_LEMMA, NORM: ""}
],
"asegurándole": [
{ORTH: "asegurando", LEMMA: "asegurar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
"conmigo": [
{ORTH: "con", LEMMA: "con"},
{ORTH: "migo", LEMMA: PRON_LEMMA, NORM: ""}
],
"considerarle": [
{ORTH: "considerar", LEMMA: "considerar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirle": [
{ORTH: "decir", LEMMA: "decir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirles": [
{ORTH: "decir", LEMMA: "decir", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirte": [
{ORTH: "Decir", LEMMA: "decir", POS: AUX},
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejarla": [
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejarnos": [
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejándole": [
{ORTH: "dejando", LEMMA: "dejar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
"contigo": [
{ORTH: "con", LEMMA: "con"},
{ORTH: "tigo", LEMMA: PRON_LEMMA, NORM: "ti"}
],
"del": [
{ORTH: "de", LEMMA: "de", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
],
"demostrarles": [
{ORTH: "demostrar", LEMMA: "demostrar", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"diciéndole": [
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"diciéndoles": [
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"diferenciarse": [
{ORTH: "diferenciar", LEMMA: "diferenciar", POS: AUX},
{ORTH: "se", LEMMA: "él", POS: PRON}
],
"divirtiéndome": [
{ORTH: "divirtiendo", LEMMA: "divertir", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"ensanchándose": [
{ORTH: "ensanchando", LEMMA: "ensanchar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"explicarles": [
{ORTH: "explicar", LEMMA: "explicar", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberla": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlas": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlo": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlos": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberme": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberse": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"hacerle": [
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"hacerles": [
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"hallarse": [
{ORTH: "hallar", LEMMA: "hallar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"imaginaros": [
{ORTH: "imaginar", LEMMA: "imaginar", POS: AUX},
{ORTH: "os", LEMMA: PRON_LEMMA, POS: PRON}
],
"insinuarle": [
{ORTH: "insinuar", LEMMA: "insinuar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"justificarla": [
{ORTH: "justificar", LEMMA: "justificar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerlas": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerlos": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerme": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"pasarte": [
{ORTH: "pasar", LEMMA: "pasar", POS: AUX},
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
],
"pedirle": [
{ORTH: "pedir", LEMMA: "pedir", POS: AUX},
{ORTH: "le", LEMMA: "él", POS: PRON}
{ORTH: "de", LEMMA: "de", TAG: ADP},
{ORTH: "l", LEMMA: "el", TAG: DET}
],
"pel": [
{ORTH: "per", LEMMA: "per", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
{ORTH: "pe", LEMMA: "per", TAG: ADP},
{ORTH: "l", LEMMA: "el", TAG: DET}
],
"pidiéndonos": [
{ORTH: "pidiendo", LEMMA: "pedir", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
"pal": [
{ORTH: "pa", LEMMA: "para"},
{ORTH: "l", LEMMA: DET_LEMMA, NORM: "el"}
],
"poderle": [
{ORTH: "poder", LEMMA: "poder", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
"pala": [
{ORTH: "pa", LEMMA: "para"},
{ORTH: "la", LEMMA: DET_LEMMA}
],
"preguntarse": [
{ORTH: "preguntar", LEMMA: "preguntar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
"aprox.": [
{ORTH: "aprox.", LEMMA: "aproximadamente"}
],
"preguntándose": [
{ORTH: "preguntando", LEMMA: "preguntar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
"dna.": [
{ORTH: "dna.", LEMMA: "docena"}
],
"presentarla": [
{ORTH: "presentar", LEMMA: "presentar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
"esq.": [
{ORTH: "esq.", LEMMA: "esquina"}
],
"pudiéndolo": [
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
"pág.": [
{ORTH: "pág.", LEMMA: "página"}
],
"pudiéndose": [
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
"p.ej.": [
{ORTH: "p.ej.", LEMMA: "por ejemplo"}
],
"quererle": [
{ORTH: "querer", LEMMA: "querer", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
"Ud.": [
{ORTH: "Ud.", LEMMA: PRON_LEMMA, NORM: "usted"}
],
"rasgarse": [
{ORTH: "Rasgar", LEMMA: "rasgar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
"Vd.": [
{ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"}
],
"repetirlo": [
{ORTH: "repetir", LEMMA: "repetir", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
"Uds.": [
{ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}
],
"robarle": [
{ORTH: "robar", LEMMA: "robar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"seguirlos": [
{ORTH: "seguir", LEMMA: "seguir", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"serle": [
{ORTH: "ser", LEMMA: "ser", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"serlo": [
{ORTH: "ser", LEMMA: "ser", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"señalándole": [
{ORTH: "señalando", LEMMA: "señalar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"suplicarle": [
{ORTH: "suplicar", LEMMA: "suplicar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"tenerlos": [
{ORTH: "tener", LEMMA: "tener", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"vengarse": [
{ORTH: "vengar", LEMMA: "vengar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"verla": [
{ORTH: "ver", LEMMA: "ver", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"verle": [
{ORTH: "ver", LEMMA: "ver", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"volverlo": [
{ORTH: "volver", LEMMA: "volver", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
"Vds.": [
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}
]
}
ORTH_ONLY = [
"a.",
"a.C.",
"a.J.C.",
"apdo.",
"Av.",
"Avda.",
"b.",
"c.",
"Cía.",
"d.",
"e.",
"etc.",
"f.",
"g.",
"Gob.",
"Gral.",
"h.",
"i.",
"Ing.",
"j.",
"J.C.",
"k.",
"l.",
"Lic.",
"m.",
"m.n.",
"n.",
"no.",
"núm.",
"o.",
"p.",
"P.D.",
"Prof.",
"Profa.",
"q.",
"q.e.p.d."
"r.",
"s.",
"S.A.",
"S.L.",
"s.s.s.",
"Sr.",
"Sra.",
"Srta.",
"t.",
"u.",
"v.",
"w.",
"x.",
"y.",
"z."
]

spacy/hu/__init__.py

@ -0,0 +1,23 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from .language_data import *
from ..attrs import LANG
from ..language import Language
class Hungarian(Language):
lang = 'hu'
class Defaults(Language.Defaults):
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'hu'
prefixes = tuple(TOKENIZER_PREFIXES)
suffixes = tuple(TOKENIZER_SUFFIXES)
infixes = tuple(TOKENIZER_INFIXES)
stop_words = set(STOP_WORDS)

spacy/hu/language_data.py

@ -0,0 +1,24 @@
# encoding: utf8
from __future__ import unicode_literals
import six
from spacy.language_data import strings_to_exc, update_exc
from .punctuations import *
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import ABBREVIATIONS
from .tokenizer_exceptions import OTHER_EXC
from .. import language_data as base
STOP_WORDS = set(STOP_WORDS)
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
TOKENIZER_PREFIXES = base.TOKENIZER_PREFIXES + TOKENIZER_PREFIXES
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES
TOKENIZER_INFIXES = TOKENIZER_INFIXES
# HYPHENS = [six.unichr(cp) for cp in [173, 8211, 8212, 8213, 8722, 9472]]
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(OTHER_EXC))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ABBREVIATIONS))
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]

spacy/hu/punctuations.py

@ -0,0 +1,89 @@
# encoding: utf8
from __future__ import unicode_literals
TOKENIZER_PREFIXES = r'''
+
'''.strip().split('\n')
TOKENIZER_SUFFIXES = r'''
,
\"
\)
\]
\}
\*
\!
\?
\$
>
:
;
'
«
_
''
\.\.
\.\.\.
\.\.\.\.
(?<=[a-züóőúéáűí)\]"'´«‘’%\)²“”+-])\.
(?<=[a-züóőúéáűí)])-e
\-\-
´
(?<=[0-9])\+
(?<=[a-z0-9üóőúéáűí][\)\]"'%\)§/])\.
(?<=[0-9])km²
(?<=[0-9])m²
(?<=[0-9])cm²
(?<=[0-9])mm²
(?<=[0-9])km³
(?<=[0-9])m³
(?<=[0-9])cm³
(?<=[0-9])mm³
(?<=[0-9])ha
(?<=[0-9])km
(?<=[0-9])m
(?<=[0-9])cm
(?<=[0-9])mm
(?<=[0-9])µm
(?<=[0-9])nm
(?<=[0-9])yd
(?<=[0-9])in
(?<=[0-9])ft
(?<=[0-9])kg
(?<=[0-9])g
(?<=[0-9])mg
(?<=[0-9])µg
(?<=[0-9])t
(?<=[0-9])lb
(?<=[0-9])oz
(?<=[0-9])m/s
(?<=[0-9])km/h
(?<=[0-9])mph
(?<=°[FCK])\.
(?<=[0-9])hPa
(?<=[0-9])Pa
(?<=[0-9])mbar
(?<=[0-9])mb
(?<=[0-9])T
(?<=[0-9])G
(?<=[0-9])M
(?<=[0-9])K
(?<=[0-9])kb
'''.strip().split('\n')
TOKENIZER_INFIXES = r'''
\.\.+
(?<=[a-züóőúéáűí])\.(?=[A-ZÜÓŐÚÉÁŰÍ])
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ0-9])"(?=[\-a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])--(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
(?<=[0-9])[+\-\*/^](?=[0-9])
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]),(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
'''.strip().split('\n')
__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]

spacy/hu/stop_words.py

@ -0,0 +1,64 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben
amelyeket amelyet amelynek ami amikor amit amolyan amíg annak arra arról az
azok azon azonban azt aztán azután azzal azért
be belül benne bár
cikk cikkek cikkeket csak
de
e ebben eddig egy egyes egyetlen egyik egyre egyéb egész ehhez ekkor el ellen
elo eloször elott elso elég előtt emilyen ennek erre ez ezek ezen ezt ezzel
ezért
fel felé
ha hanem hiszen hogy hogyan hát
ide igen ill ill. illetve ilyen ilyenkor inkább is ismét ison itt
jobban jól
kell kellett keressünk keresztül ki kívül között közül
le legalább legyen lehet lehetett lenne lenni lesz lett
ma maga magát majd meg mellett mely melyek mert mi miatt mikor milyen minden
mindenki mindent mindig mint mintha mit mivel miért mondta most már más másik
még míg
nagy nagyobb nagyon ne nekem neki nem nincs néha néhány nélkül
o oda ok oket olyan ott
pedig persze például
s saját sem semmi sok sokat sokkal stb. szemben szerint szinte számára szét
talán te tehát teljes ti tovább továbbá több túl ugyanis
utolsó után utána
vagy vagyis vagyok valaki valami valamint való van vannak vele vissza viszont
volna volt voltak voltam voltunk
által általában át
én éppen és
így
ön össze
úgy új újabb újra
ő őket
""".split())


@ -0,0 +1,549 @@
# encoding: utf8
from __future__ import unicode_literals
ABBREVIATIONS = """
AkH.
.
B.CS.
B.S.
B.Sc.
B.ú.é.k.
BE.
BEK.
BSC.
BSc.
BTK.
Be.
Bek.
Bfok.
Bk.
Bp.
Btk.
Btke.
Btét.
CSC.
Cal.
Co.
Colo.
Comp.
Copr.
Cs.
Csc.
Csop.
Ctv.
D.
DR.
Dipl.
Dr.
Dsz.
Dzs.
Fla.
Főszerk.
GM.
Gy.
HKsz.
Hmvh.
Inform.
K.m.f.
KER.
KFT.
KRT.
Ker.
Kft.
Kong.
Korm.
Kr.
Kr.e.
Kr.u.
Krt.
M.A.
M.S.
M.SC.
M.Sc.
MA.
MSC.
MSc.
Mass.
Mlle.
Mme.
Mo.
Mr.
Mrs.
Ms.
Mt.
N.N.
NB.
NBr.
Nat.
Nr.
Ny.
Nyh.
Nyr.
Op.
P.H.
P.S.
PH.D.
PHD.
PROF.
Ph.D
PhD.
Pp.
Proc.
Prof.
Ptk.
Rer.
S.B.
SZOLG.
Salg.
St.
Sz.
Szfv.
Szjt.
Szolg.
Szt.
Sztv.
TEL.
Tel.
Ty.
Tyr.
Ui.
Vcs.
Vhr.
X.Y.
Zs.
a.
a.C.
ac.
adj.
adm.
ag.
agit.
alez.
alk.
altbgy.
an.
ang.
arch.
at.
aug.
b.
b.a.
b.s.
b.sc.
bek.
belker.
berend.
biz.
bizt.
bo.
bp.
br.
bsc.
bt.
btk.
c.
ca.
cc.
cca.
cf.
cif.
co.
corp.
cos.
cs.
csc.
csüt.
cső.
ctv.
d.
dbj.
dd.
ddr.
de.
dec.
dikt.
dipl.
dj.
dk.
dny.
dolg.
dr.
du.
dzs.
e.
ea.
ed.
eff.
egyh.
ell.
elv.
elvt.
em.
eng.
eny.
et.
etc.
ev.
ezr.
.
f.
f.h.
f.é.
fam.
febr.
fej.
felv.
felügy.
ff.
ffi.
fhdgy.
fil.
fiz.
fm.
foglalk.
ford.
fp.
fr.
frsz.
fszla.
fszt.
ft.
fuv.
főig.
főisk.
főtörm.
főv.
g.
gazd.
gimn.
gk.
gkv.
gondn.
gr.
grav.
gy.
gyak.
gyártm.
gör.
h.
hads.
hallg.
hdm.
hdp.
hds.
hg.
hiv.
hk.
hm.
ho.
honv.
hp.
hr.
hrsz.
hsz.
ht.
htb.
hv.
hőm.
i.e.
i.sz.
id.
ifj.
ig.
igh.
ill.
imp.
inc.
ind.
inform.
inic.
int.
io.
ip.
ir.
irod.
isk.
ism.
izr.
.
j.
jan.
jav.
jegyz.
jjv.
jkv.
jogh.
jogt.
jr.
jvb.
júl.
jún.
k.
karb.
kat.
kb.
kcs.
kd.
ker.
kf.
kft.
kht.
kir.
kirend.
kisip.
kiv.
kk.
kkt.
klin.
kp.
krt.
kt.
ktsg.
kult.
kv.
kve.
képv.
kísérl.
kóth.
könyvt.
körz.
köv.
közj.
közl.
közp.
közt.
.
l.
lat.
ld.
legs.
lg.
lgv.
loc.
lt.
ltd.
ltp.
luth.
m.
m.a.
m.s.
m.sc.
ma.
mat.
mb.
med.
megh.
met.
mf.
mfszt.
min.
miss.
mjr.
mjv.
mk.
mlle.
mme.
mn.
mozg.
mr.
mrs.
ms.
msc.
.
máj.
márc.
.
mélt.
.
műh.
műsz.
műv.
művez.
n.
nagyker.
nagys.
nat.
nb.
neg.
nk.
nov.
nu.
ny.
nyilv.
nyrt.
nyug.
o.
obj.
okl.
okt.
olv.
orsz.
ort.
ov.
ovh.
p.
pf.
pg.
ph.d
ph.d.
phd.
pk.
pl.
plb.
plc.
pld.
plur.
pol.
polg.
poz.
pp.
proc.
prof.
prot.
pság.
ptk.
pu.
.
q.
r.
r.k.
rac.
rad.
red.
ref.
reg.
rer.
rev.
rf.
rkp.
rkt.
rt.
rtg.
röv.
s.
s.b.
s.k.
sa.
sel.
sgt.
sm.
st.
stat.
stb.
strat.
sz.
szakm.
szaksz.
szakszerv.
szd.
szds.
szept.
szerk.
szf.
szimf.
szjt.
szkv.
szla.
szn.
szolg.
szt.
szubj.
szöv.
szül.
t.
tanm.
tb.
tbk.
tc.
techn.
tek.
tel.
tf.
tgk.
ti.
tip.
tisztv.
titks.
tk.
tkp.
tny.
tp.
tszf.
tszk.
tszkv.
tv.
tvr.
ty.
törv.
.
u.
ua.
ui.
unit.
uo.
uv.
v.
vas.
vb.
vegy.
vh.
vhol.
vill.
vizsg.
vk.
vkf.
vkny.
vm.
vol.
vs.
vsz.
vv.
vál.
vízv.
.
w.
y.
z.
zrt.
zs.
Ész.
Új-Z.
ÚjZ.
á.
ált.
ápr.
ásv.
é.
ék.
ény.
érk.
évf.
í.
ó.
ö.
össz.
ötk.
özv.
ú.
úm.
ún.
út.
ü.
üag.
üd.
üdv.
üe.
ümk.
ütk.
üv.
ő.
ű.
őrgy.
őrpk.
őrv.
""".strip().split()
OTHER_EXC = """
''
-e
""".strip().split()


@ -5,6 +5,7 @@ from ..symbols import *
PRON_LEMMA = "-PRON-"
DET_LEMMA = "-DET-"
ENT_ID = "ent_id"

spacy/sv/__init__.py

@ -0,0 +1,19 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
from .language_data import *
class Swedish(Language):
lang = 'sv'
class Defaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'sv'
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS
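A minimal usage sketch for the new Swedish entry point, mirroring how the Hungarian test module later in this commit instantiates its language class; the sample sentence is arbitrary and the snippet is illustrative rather than part of the commit:

```python
# Sketch: build the Swedish language defaults and run only the tokenizer.
from __future__ import unicode_literals
from spacy.sv import Swedish

nlp = Swedish()                        # vocab + tokenizer from the language data
tokens = nlp.tokenizer('Hon gav honom boken.')
print([token.orth_ for token in tokens])
```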

14
spacy/sv/language_data.py Normal file
View File

@ -0,0 +1,14 @@
# encoding: utf8
from __future__ import unicode_literals
from .. import language_data as base
from ..language_data import update_exc, strings_to_exc
from .stop_words import STOP_WORDS
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
STOP_WORDS = set(STOP_WORDS)
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]

68
spacy/sv/morph_rules.py Normal file
View File

@ -0,0 +1,68 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
# Based on the table of pronouns at https://sv.wiktionary.org/wiki/deras
MORPH_RULES = {
"PRP": {
"jag": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"mig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"mej": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom"},
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"honom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"hon": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"henne": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"oss": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"ni": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom"},
"er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Acc"},
"de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"dom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"dom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"mitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"mina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"ditt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"dina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"dess": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"dess": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"vår": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"våran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"vårt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"vårat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"våra": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"eran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"ert": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"erat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"era": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"deras": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}
},
"VBZ": {
"är": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"är": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"är": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
},
"VBP": {
"är": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
},
"VBD": {
"var": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"vart": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
}
}
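Several keys in the PRP dict literal above are repeated ("er", "dom", "hans", "hennes", "dess"), as is "är" under VBZ. A Python dict literal keeps only the last occurrence of a duplicated key, so the earlier feature bundles are silently discarded; a small illustration (not part of the commit):

```python
# Duplicate keys in a dict literal: only the last value survives.
rules = {
    "dom": {"Case": "Nom"},   # overwritten by the next entry
    "dom": {"Case": "Acc"},
}
print(rules["dom"])           # {'Case': 'Acc'}; the Nom reading is gone
print(len(rules))             # 1
```

If both readings are intended, the table would need a different shape, for example a list of feature dicts per surface form.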

47
spacy/sv/stop_words.py Normal file
View File

@ -0,0 +1,47 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras annan annat ännu artonde arton åtminstone att åtta åttio åttionde åttonde av även
båda bådas bakom bara bäst bättre behöva behövas behövde behövt beslut beslutat beslutit bland blev bli blir blivit bort borta bra
dag dagar dagarna dagen där därför de del delen dem den deras dess det detta dig din dina dit ditt dock du
efter eftersom elfte eller elva en enkel enkelt enkla enligt er era ert ett ettusen
fanns får fått fem femte femtio femtionde femton femtonde fick fin finnas finns fjärde fjorton fjortonde fler flera flesta följande för före förlåt förra första fram framför från fyra fyrtio fyrtionde
gälla gäller gällt går gärna gått genast genom gick gjorde gjort god goda godare godast gör göra gott
ha hade haft han hans har här heller hellre helst helt henne hennes hit hög höger högre högst hon honom hundra hundraen hundraett hur
i ibland idag igår igen imorgon in inför inga ingen ingenting inget innan inne inom inte inuti
ja jag jämfört
kan kanske knappast kom komma kommer kommit kr kunde kunna kunnat kvar
länge längre långsam långsammare långsammast långsamt längst långt lätt lättare lättast legat ligga ligger lika likställd likställda lilla lite liten litet
man många måste med mellan men mer mera mest mig min mina mindre minst mitt mittemot möjlig möjligen möjligt möjligtvis mot mycket
någon någonting något några när nästa ned nederst nedersta nedre nej ner ni nio nionde nittio nittionde nitton nittonde nödvändig nödvändiga nödvändigt nödvändigtvis nog noll nr nu nummer
och också ofta oftast olika olikt om oss
över övermorgon överst övre
rakt rätt redan
sade säga säger sagt samma sämre sämst sedan senare senast sent sex sextio sextionde sexton sextonde sig sin sina sist sista siste sitt sjätte sju sjunde sjuttio sjuttionde sjutton sjuttonde ska skall skulle slutligen små smått snart som stor stora större störst stort
tack tidig tidigare tidigast tidigt till tills tillsammans tio tionde tjugo tjugoen tjugoett tjugonde tjugotre tjugotvå tjungo tolfte tolv tre tredje trettio trettionde tretton trettonde två tvåhundra
under upp ur ursäkt ut utan utanför ute
vad vänster vänstra var vår vara våra varför varifrån varit varken värre varsågod vart vårt vem vems verkligen vi vid vidare viktig viktigare viktigast viktigt vilka vilken vilket vill
""".split())

View File

@ -0,0 +1,58 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
TOKENIZER_EXCEPTIONS = {
}
ORTH_ONLY = [
"ang.",
"anm.",
"bil.",
"bl.a.",
"dvs.",
"e.Kr.",
"el.",
"e.d.",
"eng.",
"etc.",
"exkl.",
"f.d.",
"fid.",
"f.Kr.",
"forts.",
"fr.o.m.",
"f.ö.",
"förf.",
"inkl.",
"jur.",
"kl.",
"kr.",
"lat.",
"m.a.o.",
"max.",
"m.fl.",
"min.",
"m.m.",
"obs.",
"o.d.",
"osv.",
"p.g.a.",
"ref.",
"resp.",
"s.",
"s.a.s.",
"s.k.",
"st.",
"s:t",
"t.ex.",
"t.o.m.",
"ung.",
"äv.",
"övers."
]
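The sv/language_data.py hunk above only folds the shared emoticons into TOKENIZER_EXCEPTIONS, so the wiring for ORTH_ONLY is not visible in this dump. A hedged sketch of how spaCy's other languages typically combine the two, written as it might appear inside spacy/sv/language_data.py (the commit's actual wiring may differ):

```python
# Sketch only: fold the ORTH_ONLY abbreviations into the exception table,
# following the same pattern used for base.EMOTICONS above.
from ..language_data import update_exc, strings_to_exc
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY

TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
```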

View File

@ -0,0 +1,233 @@
# encoding: utf8
from __future__ import unicode_literals
import pytest
from spacy.hu import Hungarian
_DEFAULT_TESTS = [('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
('A .hu.', ['A', '.hu', '.']),
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
('A pl.', ['A', 'pl.']),
('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']),
('Egy..ket.', ['Egy', '..', 'ket', '.']),
('Valami... van.', ['Valami', '...', 'van', '.']),
('Valami ...van...', ['Valami', '...', 'van', '...']),
('Valami...', ['Valami', '...']),
('Valami ...', ['Valami', '...']),
('Valami ... más.', ['Valami', '...', 'más', '.'])]
_HYPHEN_TESTS = [
('Egy -nak, -jaiért, -magyar, bel- van.', ['Egy', '-nak', ',', '-jaiért', ',', '-magyar', ',', 'bel-', 'van', '.']),
('Egy -nak.', ['Egy', '-nak', '.']),
('Egy bel-.', ['Egy', 'bel-', '.']),
('Dinnye-domb-.', ['Dinnye-domb-', '.']),
('Ezen -e elcsatangolt.', ['Ezen', '-e', 'elcsatangolt', '.']),
('Lakik-e', ['Lakik', '-e']),
('Lakik-e?', ['Lakik', '-e', '?']),
('Lakik-e.', ['Lakik', '-e', '.']),
('Lakik-e...', ['Lakik', '-e', '...']),
('Lakik-e... van.', ['Lakik', '-e', '...', 'van', '.']),
('Lakik-e van?', ['Lakik', '-e', 'van', '?']),
('Lakik-elem van?', ['Lakik-elem', 'van', '?']),
('Van lakik-elem.', ['Van', 'lakik-elem', '.']),
('A 7-es busz?', ['A', '7-es', 'busz', '?']),
('A 7-es?', ['A', '7-es', '?']),
('A 7-es.', ['A', '7-es', '.']),
('Ez (lakik)-e?', ['Ez', '(', 'lakik', ')', '-e', '?']),
('A %-sal.', ['A', '%-sal', '.']),
('A CD-ROM-okrol.', ['A', 'CD-ROM-okrol', '.'])]
_NUMBER_TESTS = [('A 2b van.', ['A', '2b', 'van', '.']),
('A 2b-ben van.', ['A', '2b-ben', 'van', '.']),
('A 2b.', ['A', '2b', '.']),
('A 2b-ben.', ['A', '2b-ben', '.']),
('A 3.b van.', ['A', '3.b', 'van', '.']),
('A 3.b-ben van.', ['A', '3.b-ben', 'van', '.']),
('A 3.b.', ['A', '3.b', '.']),
('A 3.b-ben.', ['A', '3.b-ben', '.']),
('A 1:20:36.7 van.', ['A', '1:20:36.7', 'van', '.']),
('A 1:20:36.7-ben van.', ['A', '1:20:36.7-ben', 'van', '.']),
('A 1:20:36.7-ben.', ['A', '1:20:36.7-ben', '.']),
('A 1:35 van.', ['A', '1:35', 'van', '.']),
('A 1:35-ben van.', ['A', '1:35-ben', 'van', '.']),
('A 1:35-ben.', ['A', '1:35-ben', '.']),
('A 1.35 van.', ['A', '1.35', 'van', '.']),
('A 1.35-ben van.', ['A', '1.35-ben', 'van', '.']),
('A 1.35-ben.', ['A', '1.35-ben', '.']),
('A 4:01,95 van.', ['A', '4:01,95', 'van', '.']),
('A 4:01,95-ben van.', ['A', '4:01,95-ben', 'van', '.']),
('A 4:01,95-ben.', ['A', '4:01,95-ben', '.']),
('A 10--12 van.', ['A', '10--12', 'van', '.']),
('A 10--12-ben van.', ['A', '10--12-ben', 'van', '.']),
('A 10--12-ben.', ['A', '10--12-ben', '.']),
('A 1012 van.', ['A', '1012', 'van', '.']),
('A 1012-ben van.', ['A', '1012-ben', 'van', '.']),
('A 1012-ben.', ['A', '1012-ben', '.']),
('A 1012 van.', ['A', '1012', 'van', '.']),
('A 1012-ben van.', ['A', '1012-ben', 'van', '.']),
('A 1012-ben.', ['A', '1012-ben', '.']),
('A 1012 van.', ['A', '1012', 'van', '.']),
('A 1012-ben van.', ['A', '1012-ben', 'van', '.']),
('A 1012-ben.', ['A', '1012-ben', '.']),
('A 1012 van.', ['A', '1012', 'van', '.']),
('A 1012-ben van.', ['A', '1012-ben', 'van', '.']),
('A 1012-ben.', ['A', '1012-ben', '.']),
('A 10—12 van.', ['A', '10—12', 'van', '.']),
('A 10—12-ben van.', ['A', '10—12-ben', 'van', '.']),
('A 10—12-ben.', ['A', '10—12-ben', '.']),
('A 10―12 van.', ['A', '10―12', 'van', '.']),
('A 10―12-ben van.', ['A', '10―12-ben', 'van', '.']),
('A 10―12-ben.', ['A', '10―12-ben', '.']),
('A -23,12 van.', ['A', '-23,12', 'van', '.']),
('A -23,12-ben van.', ['A', '-23,12-ben', 'van', '.']),
('A -23,12-ben.', ['A', '-23,12-ben', '.']),
('A 2+3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2 +3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2+ 3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2 + 3 van.', ['A', '2', '+', '3', 'van', '.']),
('A 2*3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2 *3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2* 3 van.', ['A', '2', '*', '3', 'van', '.']),
('A 2 * 3 van.', ['A', '2', '*', '3', 'van', '.']),
('A C++ van.', ['A', 'C++', 'van', '.']),
('A C++-ben van.', ['A', 'C++-ben', 'van', '.']),
('A C++.', ['A', 'C++', '.']),
('A C++-ben.', ['A', 'C++-ben', '.']),
('A 2003. I. 06. van.', ['A', '2003.', 'I.', '06.', 'van', '.']),
('A 2003. I. 06-ben van.', ['A', '2003.', 'I.', '06-ben', 'van', '.']),
('A 2003. I. 06.', ['A', '2003.', 'I.', '06.']),
('A 2003. I. 06-ben.', ['A', '2003.', 'I.', '06-ben', '.']),
('A 2003. 01. 06. van.', ['A', '2003.', '01.', '06.', 'van', '.']),
('A 2003. 01. 06-ben van.', ['A', '2003.', '01.', '06-ben', 'van', '.']),
('A 2003. 01. 06.', ['A', '2003.', '01.', '06.']),
('A 2003. 01. 06-ben.', ['A', '2003.', '01.', '06-ben', '.']),
('A IV. 12. van.', ['A', 'IV.', '12.', 'van', '.']),
('A IV. 12-ben van.', ['A', 'IV.', '12-ben', 'van', '.']),
('A IV. 12.', ['A', 'IV.', '12.']),
('A IV. 12-ben.', ['A', 'IV.', '12-ben', '.']),
('A 2003.01.06. van.', ['A', '2003.01.06.', 'van', '.']),
('A 2003.01.06-ben van.', ['A', '2003.01.06-ben', 'van', '.']),
('A 2003.01.06.', ['A', '2003.01.06.']),
('A 2003.01.06-ben.', ['A', '2003.01.06-ben', '.']),
('A IV.12. van.', ['A', 'IV.12.', 'van', '.']),
('A IV.12-ben van.', ['A', 'IV.12-ben', 'van', '.']),
('A IV.12.', ['A', 'IV.12.']),
('A IV.12-ben.', ['A', 'IV.12-ben', '.']),
('A 1.1.2. van.', ['A', '1.1.2.', 'van', '.']),
('A 1.1.2-ben van.', ['A', '1.1.2-ben', 'van', '.']),
('A 1.1.2.', ['A', '1.1.2.']),
('A 1.1.2-ben.', ['A', '1.1.2-ben', '.']),
('A 1,5--2,5 van.', ['A', '1,5--2,5', 'van', '.']),
('A 1,5--2,5-ben van.', ['A', '1,5--2,5-ben', 'van', '.']),
('A 1,5--2,5-ben.', ['A', '1,5--2,5-ben', '.']),
('A 3,14 van.', ['A', '3,14', 'van', '.']),
('A 3,14-ben van.', ['A', '3,14-ben', 'van', '.']),
('A 3,14-ben.', ['A', '3,14-ben', '.']),
('A 3.14 van.', ['A', '3.14', 'van', '.']),
('A 3.14-ben van.', ['A', '3.14-ben', 'van', '.']),
('A 3.14-ben.', ['A', '3.14-ben', '.']),
('A 15. van.', ['A', '15.', 'van', '.']),
('A 15-ben van.', ['A', '15-ben', 'van', '.']),
('A 15-ben.', ['A', '15-ben', '.']),
('A 15.-ben van.', ['A', '15.-ben', 'van', '.']),
('A 15.-ben.', ['A', '15.-ben', '.']),
('A 2002--2003. van.', ['A', '2002--2003.', 'van', '.']),
('A 2002--2003-ben van.', ['A', '2002--2003-ben', 'van', '.']),
('A 2002--2003-ben.', ['A', '2002--2003-ben', '.']),
('A -0,99% van.', ['A', '-0,99%', 'van', '.']),
('A -0,99%-ben van.', ['A', '-0,99%-ben', 'van', '.']),
('A -0,99%.', ['A', '-0,99%', '.']),
('A -0,99%-ben.', ['A', '-0,99%-ben', '.']),
('A 10--20% van.', ['A', '10--20%', 'van', '.']),
('A 10--20%-ben van.', ['A', '10--20%-ben', 'van', '.']),
('A 10--20%.', ['A', '10--20%', '.']),
('A 10--20%-ben.', ['A', '10--20%-ben', '.']),
('A 99§ van.', ['A', '99§', 'van', '.']),
('A 99§-ben van.', ['A', '99§-ben', 'van', '.']),
('A 99§-ben.', ['A', '99§-ben', '.']),
('A 10--20§ van.', ['A', '10--20§', 'van', '.']),
('A 10--20§-ben van.', ['A', '10--20§-ben', 'van', '.']),
('A 10--20§-ben.', ['A', '10--20§-ben', '.']),
('A 99° van.', ['A', '99°', 'van', '.']),
('A 99°-ben van.', ['A', '99°-ben', 'van', '.']),
('A 99°-ben.', ['A', '99°-ben', '.']),
('A 10--20° van.', ['A', '10--20°', 'van', '.']),
('A 10--20°-ben van.', ['A', '10--20°-ben', 'van', '.']),
('A 10--20°-ben.', ['A', '10--20°-ben', '.']),
('A °C van.', ['A', '°C', 'van', '.']),
('A °C-ben van.', ['A', '°C-ben', 'van', '.']),
('A °C.', ['A', '°C', '.']),
('A °C-ben.', ['A', '°C-ben', '.']),
('A 100°C van.', ['A', '100°C', 'van', '.']),
('A 100°C-ben van.', ['A', '100°C-ben', 'van', '.']),
('A 100°C.', ['A', '100°C', '.']),
('A 100°C-ben.', ['A', '100°C-ben', '.']),
('A 800x600 van.', ['A', '800x600', 'van', '.']),
('A 800x600-ben van.', ['A', '800x600-ben', 'van', '.']),
('A 800x600-ben.', ['A', '800x600-ben', '.']),
('A 1x2x3x4 van.', ['A', '1x2x3x4', 'van', '.']),
('A 1x2x3x4-ben van.', ['A', '1x2x3x4-ben', 'van', '.']),
('A 1x2x3x4-ben.', ['A', '1x2x3x4-ben', '.']),
('A 5/J van.', ['A', '5/J', 'van', '.']),
('A 5/J-ben van.', ['A', '5/J-ben', 'van', '.']),
('A 5/J-ben.', ['A', '5/J-ben', '.']),
('A 5/J. van.', ['A', '5/J.', 'van', '.']),
('A 5/J.-ben van.', ['A', '5/J.-ben', 'van', '.']),
('A 5/J.-ben.', ['A', '5/J.-ben', '.']),
('A III/1 van.', ['A', 'III/1', 'van', '.']),
('A III/1-ben van.', ['A', 'III/1-ben', 'van', '.']),
('A III/1-ben.', ['A', 'III/1-ben', '.']),
('A III/1. van.', ['A', 'III/1.', 'van', '.']),
('A III/1.-ben van.', ['A', 'III/1.-ben', 'van', '.']),
('A III/1.-ben.', ['A', 'III/1.-ben', '.']),
('A III/c van.', ['A', 'III/c', 'van', '.']),
('A III/c-ben van.', ['A', 'III/c-ben', 'van', '.']),
('A III/c.', ['A', 'III/c', '.']),
('A III/c-ben.', ['A', 'III/c-ben', '.']),
('A TU154 van.', ['A', 'TU154', 'van', '.']),
('A TU154-ben van.', ['A', 'TU154-ben', 'van', '.']),
('A TU154-ben.', ['A', 'TU154-ben', '.'])]
_QUOTE_TESTS = [('Az "Ime, hat"-ban irja.', ['Az', '"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
('"Ime, hat"-ban irja.', ['"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
('Az "Ime, hat".', ['Az', '"', 'Ime', ',', 'hat', '"', '.']),
('Egy 24"-os monitor.', ['Egy', '24', '"', '-os', 'monitor', '.']),
("A don't van.", ['A', "don't", 'van', '.'])]
_DOT_TESTS = [('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
('A .hu.', ['A', '.hu', '.']),
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
('A pl.', ['A', 'pl.']),
('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']),
('Egy..ket.', ['Egy', '..', 'ket', '.']),
('Valami... van.', ['Valami', '...', 'van', '.']),
('Valami ...van...', ['Valami', '...', 'van', '...']),
('Valami...', ['Valami', '...']),
('Valami ...', ['Valami', '...']),
('Valami ... más.', ['Valami', '...', 'más', '.'])]
@pytest.fixture(scope="session")
def HU():
return Hungarian()
@pytest.fixture(scope="module")
def hu_tokenizer(HU):
return HU.tokenizer
@pytest.mark.parametrize(("input", "expected_tokens"),
_DEFAULT_TESTS + _HYPHEN_TESTS + _NUMBER_TESTS + _DOT_TESTS + _QUOTE_TESTS)
def test_testcases(hu_tokenizer, input, expected_tokens):
tokens = hu_tokenizer(input)
token_list = [token.orth_ for token in tokens if not token.is_space]
assert expected_tokens == token_list

View File

@ -53,7 +53,7 @@ cdef class Vocab:
'''
@classmethod
def load(cls, path, lex_attr_getters=None, lemmatizer=True,
tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs):
tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs):
"""
Load the vocabulary from a path.
@ -96,6 +96,8 @@ cdef class Vocab:
if serializer_freqs is True and (path / 'vocab' / 'serializer.json').exists():
with (path / 'vocab' / 'serializer.json').open('r', encoding='utf8') as file_:
serializer_freqs = json.load(file_)
else:
serializer_freqs = None
cdef Vocab self = cls(lex_attr_getters=lex_attr_getters, tag_map=tag_map,
lemmatizer=lemmatizer, serializer_freqs=serializer_freqs)
@ -124,7 +126,7 @@ cdef class Vocab:
Vocab: The newly constructed vocab object.
'''
util.check_renamed_kwargs({'get_lex_attr': 'lex_attr_getters'}, deprecated_kwargs)
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
tag_map = tag_map if tag_map is not None else {}
if lemmatizer in (None, True, False):
@ -149,10 +151,10 @@ cdef class Vocab:
self.lex_attr_getters = lex_attr_getters
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
self.serializer_freqs = serializer_freqs
self.length = 1
self._serializer = None
property serializer:
# Having the serializer live here is super messy :(
def __get__(self):
@ -177,7 +179,7 @@ cdef class Vocab:
vectors if necessary. The memory will be zeroed.
Arguments:
new_size (int): The new size of the vectors.
new_size (int): The new size of the vectors.
'''
cdef hash_t key
cdef size_t addr
@ -190,11 +192,11 @@ cdef class Vocab:
def add_flag(self, flag_getter, int flag_id=-1):
'''Set a new boolean flag to words in the vocabulary.
The flag_getter function will be called over the words currently in the
vocab, and then applied to new words as they occur. You'll then be able
to access the flag value on each token, using token.check_flag(flag_id).
See also:
Lexeme.set_flag, Lexeme.check_flag, Token.set_flag, Token.check_flag.
@ -204,7 +206,7 @@ cdef class Vocab:
flag_id (int):
An integer between 1 and 63 (inclusive), specifying the bit at which the
flag will be stored. If -1, the lowest available bit will be
flag will be stored. If -1, the lowest available bit will be
chosen.
Returns:
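A hedged usage sketch for add_flag, assembled only from pieces visible in this commit (Vocab, Doc, check_flag); the flag, the lambda and the toy words are illustrative assumptions:

```python
# Sketch: register a custom boolean flag and read it back per token.
from __future__ import unicode_literals
from spacy.vocab import Vocab
from spacy.tokens import Doc

vocab = Vocab()
is_shouting = vocab.add_flag(lambda text: text.isupper())   # returns the bit id
doc = Doc(vocab, words=['STOP', 'shouting'])
print([(token.orth_, token.check_flag(is_shouting)) for token in doc])
# expected: [('STOP', True), ('shouting', False)]
```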
@ -322,7 +324,7 @@ cdef class Vocab:
Arguments:
id_or_string (int or unicode):
The integer ID of a word, or its unicode string.
If an int >= Lexicon.size, IndexError is raised. If id_or_string
is neither an int nor a unicode string, ValueError is raised.
@ -349,7 +351,7 @@ cdef class Vocab:
for attr_id, value in props.items():
Token.set_struct_attr(token, attr_id, value)
return tokens
def dump(self, loc):
"""Save the lexemes binary data to the given location.
@ -443,7 +445,7 @@ cdef class Vocab:
cdef int32_t word_len
cdef bytes word_str
cdef char* chars
cdef Lexeme lexeme
cdef CFile out_file = CFile(out_loc, 'wb')
for lexeme in self:
@ -460,7 +462,7 @@ cdef class Vocab:
out_file.close()
def load_vectors(self, file_):
"""Load vectors from a text-based file.
"""Load vectors from a text-based file.
Arguments:
file_ (buffer): The file to read from. Entries should be separated by newlines,

View File

@ -12,7 +12,7 @@ writing. You can read more about our approach in our blog post, ["Rebuilding a W
```bash
sudo npm install --global harp
git clone https://github.com/explosion/spaCy
cd website
cd spaCy/website
harp server
```

View File

@ -11,7 +11,7 @@ footer.o-footer.u-text.u-border-dotted
each url, item in group
li
+a(url)(target=url.includes("http") ? "_blank" : false)=item
+a(url)=item
if SECTION != "docs"
+grid-col("quarter")

View File

@ -20,7 +20,8 @@ mixin h(level, id)
info: https://mathiasbynens.github.io/rel-noopener/
mixin a(url, trusted)
a(href=url target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes)
- external = url.includes("http")
a(href=url target=external ? "_blank" : null rel=external && !trusted ? "noopener nofollow" : null)&attributes(attributes)
block
@ -33,7 +34,7 @@ mixin src(url)
+a(url)
block
| #[+icon("code", 16).u-color-subtle]
| #[+icon("code", 16).o-icon--inline.u-color-subtle]
//- API link (with added tag and automatically generated path)
@ -43,7 +44,7 @@ mixin api(path)
+a("/docs/api/" + path, true)(target="_self").u-no-border.u-inline-block
block
| #[+icon("book", 18).o-help-icon.u-color-subtle]
| #[+icon("book", 18).o-icon--inline.u-help.u-color-subtle]
//- Aside for text
@ -74,7 +75,8 @@ mixin aside-code(label, language)
see assets/css/_components/_buttons.sass
mixin button(url, trusted, ...style)
a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes)
- external = url.includes("http")
a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target=external ? "_blank" : null rel=external && !trusted ? "noopener nofollow" : null)&attributes(attributes)
block
@ -148,7 +150,7 @@ mixin tag()
mixin list(type, start)
if type
ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : false)&attributes(attributes)
ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : null)&attributes(attributes)
block
else

View File

@ -2,7 +2,7 @@
include _mixins
nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : false)
nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
a(href='/') #[+logo]
if SUBSECTION != "index"
@ -11,7 +11,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : false)
ul.c-nav__menu
each url, item in NAVIGATION
li.c-nav__menu__item
a(href=url target=url.includes("http") ? "_blank" : false)=item
+a(url)=item
li.c-nav__menu__item
+a(gh("spaCy"))(aria-label="GitHub").u-hidden-xs #[+icon("github", 20)]

View File

@ -18,7 +18,7 @@ main.o-main.o-main--sidebar.o-main--aside
- data = public.docs[SUBSECTION]._data[next]
.o-inline-list
span #[strong.u-text-label Read next:] #[a(href=next).u-link=data.title]
span #[strong.u-text-label Read next:] #[+a(next).u-link=data.title]
+grid-col("half").u-text-right
.o-inline-list

View File

@ -9,5 +9,5 @@ menu.c-sidebar.js-sidebar.u-text
li.u-text-label.u-color-subtle=menu
each url, item in items
li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? "is-active" : false)
+a(url)(target=url.includes("http") ? "_blank" : false)=item
li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? "is-active" : null)
+a(url)=item

View File

@ -67,9 +67,8 @@
.o-icon
vertical-align: middle
.o-help-icon
cursor: help
margin: 0 0.5rem 0 0.25rem
&.o-icon--inline
margin: 0 0.5rem 0 0.25rem
//- Inline List

View File

@ -141,6 +141,12 @@
background: $pattern
//- Cursors
.u-help
cursor: help
//- Hidden elements
.u-hidden

View File

@ -50,6 +50,13 @@ p A "lemma" is the uninflected form of a word. In English, this means:
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
+aside("About spaCy's custom pronoun lemma")
| Unlike verbs and common nouns, there's no clear base form of a personal
| pronoun. Should the lemma of "me" be "I", or should we normalize person
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
| novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
| all personal pronouns.
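A quick illustrative check of the behaviour this aside describes, assuming the English model data is installed; the sentence is arbitrary:

```python
# Sketch: personal pronouns all share the custom lemma -PRON-.
from __future__ import unicode_literals
from spacy.en import English

nlp = English()
doc = nlp('I told her about me')
print([(token.orth_, token.lemma_) for token in doc])
# 'I', 'her' and 'me' should all map to '-PRON-'
```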
p
| The lemmatization data is taken from
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
@ -58,11 +65,16 @@ p
+h(2, "dependency-parsing") Syntactic Dependency Parsing
p
| The parser is trained on data produced by the
| #[+a("http://www.clearnlp.com") ClearNLP] converter. Details of the
| annotation scheme can be found
| #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].
+table(["Language", "Converter", "Scheme"])
+row
+cell English
+cell #[+a("http://www.clearnlp.com") ClearNLP]
+cell #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") CLEAR Style]
+row
+cell German
+cell #[+a("https://github.com/wbwseeker/tiger2dep") TIGER]
+cell #[+a("http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html") TIGER]
+h(2, "named-entities") Named Entity Recognition

View File

@ -2,7 +2,7 @@
include ../../_includes/_mixins
p You can download data packs that add the following capabilities to spaCy.
p spaCy currently supports the following languages and capabilities:
+aside-code("Download language models", "bash").
python -m spacy.en.download all
@ -24,8 +24,27 @@ p You can download data packs that add the following capabilities to spaCy.
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
+cell.u-text-center #[+procon(icon)]
+row
+cell Spanish #[code es]
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
+cell.u-text-center #[+procon(icon)]
p
| Chinese tokenization requires the
| #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
| models are coming soon. Tokenizers for Spanish, French, Italian and
| Portuguese are now under development.
| models are coming soon.
+h(2, "alpha-support") Alpha support
p
| Work has started on the following languages. You can help by improving
| the existing language data and extending the tokenization patterns.
+table([ "Language", "Source" ])
each language, code in { it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish" }
+row
+cell #{language} #[code=code]
+cell
+src(gh("spaCy", "spacy/" + code)) spacy/#{code}

View File

@ -2,7 +2,8 @@
"sidebar": {
"Get started": {
"Installation": "./",
"Lightning tour": "lightning-tour"
"Lightning tour": "lightning-tour",
"Resources": "resources"
},
"Workflows": {
"Loading the pipeline": "language-processing-pipeline",
@ -31,7 +32,12 @@
},
"lightning-tour": {
"title": "Lightning tour"
"title": "Lightning tour",
"next": "resources"
},
"resources": {
"title": "Resources"
},
"language-processing-pipeline": {

View File

@ -23,7 +23,7 @@ p
+item
| #[strong Build the vocabulary] including
| #[a(href="#word-probabilities") word probabilities],
| #[a(href="#word-frequencies") word frequencies],
| #[a(href="#brown-clusters") Brown clusters] and
| #[a(href="#word-vectors") word vectors].
@ -245,6 +245,12 @@ p
+cell
| Special value for pronoun lemmas (#[code &quot;-PRON-&quot;]).
+row
+cell #[code DET_LEMMA]
+cell
| Special value for determiner lemmas, used in languages with
| inflected determiners (#[code &quot;-DET-&quot;]).
+row
+cell #[code ENT_ID]
+cell
@ -392,7 +398,7 @@ p
| vectors files, you can use the
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
| script from our
| #[+a(gh("spacy-developer-resources")) developer resources] to create a
| #[+a(gh("spacy-dev-resources")) developer resources] to create a
| spaCy data directory:
+code(false, "bash").
@ -424,16 +430,22 @@ p
+h(3, "word-frequencies") Word frequencies
p
| The #[code init.py] script expects a tab-separated word frequencies file
| with three columns: the number of times the word occurred in your language
| sample, the number of distinct documents the word occurred in, and the
| word itself. You should make sure you use the spaCy tokenizer for your
| The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
| script expects a tab-separated word frequencies file with three columns:
+list("numbers")
+item The number of times the word occurred in your language sample.
+item The number of distinct documents the word occurred in.
+item The word itself.
p
| You should make sure you use the spaCy tokenizer for your
| language to segment the text for your word frequencies. This will ensure
| that the frequencies refer to the same segmentation standards you'll be
| using at run-time. For instance, spaCy's English tokenizer segments "can't"
| into two tokens. If we segmented the text by whitespace to produce the
| frequency counts, we'll have incorrect frequency counts for the tokens
| "ca" and "n't".
| using at run-time. For instance, spaCy's English tokenizer segments
| "can't" into two tokens. If we segmented the text by whitespace to
| produce the frequency counts, we'll have incorrect frequency counts for
| the tokens "ca" and "n't".
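To make the expected format concrete, here is a hedged sketch that builds such a file with the spaCy tokenizer, as recommended above; the file names, the one-document-per-line assumption and the choice of English are illustrative:

```python
# Sketch: write a tab-separated frequencies file (count, document count, word),
# segmenting with the spaCy tokenizer so "can't" is counted as "ca" + "n't".
from __future__ import unicode_literals
import io
from collections import Counter
from spacy.en import English

nlp = English()                              # only the tokenizer is used below
word_freqs, doc_freqs = Counter(), Counter()
with io.open('corpus.txt', encoding='utf8') as texts:
    for line in texts:                       # assume one document per line
        words = [t.orth_ for t in nlp.tokenizer(line.strip())]
        word_freqs.update(words)
        doc_freqs.update(set(words))
with io.open('word_freqs.tsv', 'w', encoding='utf8') as out:
    for word, freq in word_freqs.most_common():
        out.write('%d\t%d\t%s\n' % (freq, doc_freqs[word], word))
```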
+h(3, "brown-clusters") Training the Brown clusters

View File

@ -3,7 +3,7 @@
include ../../_includes/_mixins
p
| The following examples code snippets give you an overview of spaCy's
| The following examples and code snippets give you an overview of spaCy's
| functionality and its usage.
+h(2, "examples-resources") Load resources and process text

View File

@ -0,0 +1,118 @@
//- 💫 DOCS > USAGE > RESOURCES
include ../../_includes/_mixins
p Many of the associated tools and resources that we're developing alongside spaCy can be found in their own repositories.
+h(2, "developer") Developer tools
+table(["Name", "Description"])
+row
+cell
+src(gh("spacy-dev-resources")) spaCy Dev Resources
+cell
| Scripts, tools and resources for developing spaCy, adding new
| languages and training new models.
+row
+cell
+src("spacy-benchmarks") spaCy Benchmarks
+cell
| Runtime performance comparison of spaCy against other NLP
| libraries.
+row
+cell
+src(gh("spacy-services")) spaCy Services
+cell
| REST microservices for spaCy demos and visualisers.
+h(2, "libraries") Libraries and projects
+table(["Name", "Description"])
+row
+cell
+src(gh("sense2vec")) sense2vec
+cell
| Use spaCy to go beyond vanilla
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec].
+h(2, "utility") Utility libraries and dependencies
+table(["Name", "Description"])
+row
+cell
+src(gh("thinc")) Thinc
+cell
| Super sparse multi-class machine learning with Cython.
+row
+cell
+src(gh("sputnik")) Sputnik
+cell
| Data package manager library for spaCy.
+row
+cell
+src(gh("sputnik-server")) Sputnik Server
+cell
| Index service for the Sputnik data package manager for spaCy.
+row
+cell
+src(gh("cymem")) Cymem
+cell
| Gate Cython calls to malloc/free behind Python ref-counted
| objects.
+row
+cell
+src(gh("preshed")) Preshed
+cell
| Cython hash tables that assume keys are pre-hashed.
+row
+cell
+src(gh("murmurhash")) MurmurHash
+cell
| Cython bindings for
| #[+a("https://en.wikipedia.org/wiki/MurmurHash") MurmurHash2].
+h(2, "visualizers") Visualisers and demos
+table(["Name", "Description"])
+row
+cell
+src(gh("displacy")) displaCy.js
+cell
| A lightweight dependency visualisation library for the modern
| web, built with JavaScript, CSS and SVG.
| #[+a(DEMOS_URL + "/displacy") Demo here].
+row
+cell
+src(gh("displacy-ent")) displaCy#[sup ENT]
+cell
| A lightweight and modern named entity visualisation library
| built with JavaScript and CSS.
| #[+a(DEMOS_URL + "/displacy-ent") Demo here].
+row
+cell
+src(gh("sense2vec-demo")) sense2vec Demo
+cell
| Source of our Semantic Analysis of the Reddit Hivemind
| #[+a(DEMOS_URL + "/sense2vec") demo] using
| #[+a(gh("sense2vec")) sense2vec].

View File

@ -13,14 +13,17 @@ p
+code.
from spacy.vocab import Vocab
from spacy.pipeline import Tagger
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab)
doc = Doc(vocab, words=['I', 'like', 'stuff'])
tagger.update(doc, ['N', 'V', 'N'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
tagger.update(doc, gold)
tagger.model.end_training()

View File

@ -22,7 +22,7 @@ include _includes/_mixins
| process entire web dumps, spaCy is the library you want to
| be using.
+button("/docs/api", true, "primary")(target="_self")
+button("/docs/api", true, "primary")
| Facts & figures
+grid-col("third").o-card
@ -35,7 +35,7 @@ include _includes/_mixins
| think of spaCy as the Ruby on Rails of Natural Language
| Processing.
+button("/docs/usage", true, "primary")(target="_self")
+button("/docs/usage", true, "primary")
| Get started
+grid-col("third").o-card
@ -51,7 +51,7 @@ include _includes/_mixins
| connect the statistical models trained by these libraries
| to the rest of your application.
+button("/docs/usage/deep-learning", true, "primary")(target="_self")
+button("/docs/usage/deep-learning", true, "primary")
| Read more
.o-inline-list.o-block.u-border-bottom.u-text-small.u-text-center.u-padding-small
@ -105,7 +105,7 @@ include _includes/_mixins
+item Robust, rigorously evaluated accuracy
.o-inline-list
+button("/docs/usage/lightning-tour", true, "secondary")(target="_self")
+button("/docs/usage/lightning-tour", true, "secondary")
| See examples
.o-block.u-text-center.u-padding
@ -138,7 +138,7 @@ include _includes/_mixins
| all others.
p
| spaCy's #[a(href="/docs/api/philosophy") mission] is to make
| spaCy's #[+a("/docs/api/philosophy") mission] is to make
| cutting-edge NLP practical and commonly available. That's
| why I left academia in 2014, to build a production-quality
| open-source NLP library. It's why