Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 02:06:31 +03:00)
Commit cade536d1e: Merge branch 'master' of ssh://github.com/explosion/spaCy
106
.github/contributors/magnusburton.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                            |
|------------------------------- | -------------------------------- |
| Name                           | Magnus Burton                    |
| Company name (if applicable)   |                                  |
| Title or role (if applicable)  |                                  |
| Date                           | 17-12-2016                       |
| GitHub username                | magnusburton                     |
| Website (optional)             |                                  |
|
106
.github/contributors/oroszgy.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | György Orosz         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2016-12-26           |
| GitHub username                | oroszgy              |
| Website (optional)             | gyorgy.orosz.link    |
|
|
@ -8,6 +8,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
|
|||
* Christoph Schwienheer, [@chssch](https://github.com/chssch)
|
||||
* Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk)
|
||||
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
|
||||
* György Orosz, [@oroszgy](https://github.com/oroszgy)
|
||||
* Henning Peters, [@henningpeters](https://github.com/henningpeters)
|
||||
* Ines Montani, [@ines](https://github.com/ines)
|
||||
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
|
||||
|
@ -16,6 +17,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
|
|||
* Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
|
||||
* Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
|
||||
* Liling Tan, [@alvations](https://github.com/alvations)
|
||||
* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
|
||||
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
|
||||
* Matthew Honnibal, [@honnibal](https://github.com/honnibal)
|
||||
* Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
|
||||
|
|
4
LICENSE
|
@ -1,8 +1,6 @@
 The MIT License (MIT)
 
-Copyright (C) 2015 Matthew Honnibal
-2016 spaCy GmbH
-2016 ExplosionAI UG (haftungsbeschränkt)
+Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, 2015 Matthew Honnibal
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
|
|
|
@ -78,7 +78,7 @@ Features
 
 See `facts, figures and benchmarks <https://spacy.io/docs/api/>`_.
 
-Top Peformance
+Top Performance
 ==============
 
 * Fastest in the world: <50ms per document. No faster system has ever been
|
|
|
@ -3,6 +3,9 @@ import plac
|
|||
import random
|
||||
import six
|
||||
|
||||
import cProfile
|
||||
import pstats
|
||||
|
||||
import pathlib
|
||||
import cPickle as pickle
|
||||
from itertools import izip
|
||||
|
@ -81,7 +84,7 @@ class SentimentModel(Chain):
|
|||
def __init__(self, nlp, shape, **settings):
|
||||
Chain.__init__(self,
|
||||
embed=_Embed(shape['nr_vector'], shape['nr_dim'], shape['nr_hidden'],
|
||||
initialW=lambda arr: set_vectors(arr, nlp.vocab)),
|
||||
set_vectors=lambda arr: set_vectors(arr, nlp.vocab)),
|
||||
encode=_Encode(shape['nr_hidden'], shape['nr_hidden']),
|
||||
attend=_Attend(shape['nr_hidden'], shape['nr_hidden']),
|
||||
predict=_Predict(shape['nr_hidden'], shape['nr_class']))
|
||||
|
@ -95,11 +98,11 @@ class SentimentModel(Chain):
|
|||
|
||||
|
||||
class _Embed(Chain):
|
||||
def __init__(self, nr_vector, nr_dim, nr_out):
|
||||
def __init__(self, nr_vector, nr_dim, nr_out, set_vectors=None):
|
||||
Chain.__init__(self,
|
||||
embed=L.EmbedID(nr_vector, nr_dim),
|
||||
embed=L.EmbedID(nr_vector, nr_dim, initialW=set_vectors),
|
||||
project=L.Linear(None, nr_out, nobias=True))
|
||||
#self.embed.unchain_backward()
|
||||
self.embed.W.volatile = False
|
||||
|
||||
def __call__(self, sentence):
|
||||
return [self.project(self.embed(ts)) for ts in F.transpose(sentence)]
|
||||
|
@ -214,7 +217,6 @@ def set_vectors(vectors, vocab):
|
|||
vectors[lex.rank + 1] = lex.vector
|
||||
else:
|
||||
lex.norm = 0
|
||||
vectors.unchain_backwards()
|
||||
return vectors
|
||||
|
||||
|
||||
|
@ -223,7 +225,9 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
|
|||
by_sentence=True):
|
||||
nlp = spacy.load('en', entity=False)
|
||||
if 'nr_vector' not in lstm_shape:
|
||||
lstm_shape['nr_vector'] = max(lex.rank+1 for lex in vocab if lex.has_vector)
|
||||
lstm_shape['nr_vector'] = max(lex.rank+1 for lex in nlp.vocab if lex.has_vector)
|
||||
if 'nr_dim' not in lstm_shape:
|
||||
lstm_shape['nr_dim'] = nlp.vocab.vectors_length
|
||||
print("Make model")
|
||||
model = Classifier(SentimentModel(nlp, lstm_shape, **lstm_settings))
|
||||
print("Parsing texts...")
|
||||
|
@ -240,7 +244,7 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
|
|||
optimizer = chainer.optimizers.Adam()
|
||||
optimizer.setup(model)
|
||||
updater = chainer.training.StandardUpdater(train_iter, optimizer, device=0)
|
||||
trainer = chainer.training.Trainer(updater, (20, 'epoch'), out='result')
|
||||
trainer = chainer.training.Trainer(updater, (1, 'epoch'), out='result')
|
||||
|
||||
trainer.extend(extensions.Evaluator(dev_iter, model, device=0))
|
||||
trainer.extend(extensions.LogReport())
|
||||
|
@ -305,11 +309,14 @@ def main(model_dir, train_dir, dev_dir,
|
|||
dev_labels = xp.asarray(dev_labels, dtype='i')
|
||||
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
|
||||
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 2,
|
||||
'nr_vector': 2000, 'nr_dim': 32},
|
||||
'nr_vector': 5000},
|
||||
{'dropout': 0.5, 'lr': learn_rate},
|
||||
{},
|
||||
nb_epoch=nb_epoch, batch_size=batch_size)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
#cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
||||
#s = pstats.Stats("Profile.prof")
|
||||
#s.strip_dirs().sort_stats("time").print_stats()
|
||||
plac.call(main)
|
||||
|
|
|
@ -111,10 +111,9 @@ def compile_lstm(embeddings, shape, settings):
|
|||
mask_zero=True
|
||||
)
|
||||
)
|
||||
model.add(TimeDistributed(Dense(shape['nr_hidden'] * 2, bias=False)))
|
||||
model.add(Dropout(settings['dropout']))
|
||||
model.add(Bidirectional(LSTM(shape['nr_hidden'])))
|
||||
model.add(Dropout(settings['dropout']))
|
||||
model.add(TimeDistributed(Dense(shape['nr_hidden'], bias=False)))
|
||||
model.add(Bidirectional(LSTM(shape['nr_hidden'], dropout_U=settings['dropout'],
|
||||
dropout_W=settings['dropout'])))
|
||||
model.add(Dense(shape['nr_class'], activation='sigmoid'))
|
||||
model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy',
|
||||
metrics=['accuracy'])
|
||||
|
@ -195,7 +194,7 @@ def main(model_dir, train_dir, dev_dir,
|
|||
dev_labels = numpy.asarray(dev_labels, dtype='int32')
|
||||
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
|
||||
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 1},
|
||||
{'dropout': 0.5, 'lr': learn_rate},
|
||||
{'dropout': dropout, 'lr': learn_rate},
|
||||
{},
|
||||
nb_epoch=nb_epoch, batch_size=batch_size)
|
||||
weights = lstm.get_weights()
|
||||
|
|
4
setup.py
|
@ -27,8 +27,10 @@ PACKAGES = [
|
|||
'spacy.es',
|
||||
'spacy.fr',
|
||||
'spacy.it',
|
||||
'spacy.hu',
|
||||
'spacy.pt',
|
||||
'spacy.nl',
|
||||
'spacy.sv',
|
||||
'spacy.language_data',
|
||||
'spacy.serialize',
|
||||
'spacy.syntax',
|
||||
|
@ -95,7 +97,7 @@ LINK_OPTIONS = {
|
|||
'other' : []
|
||||
}
|
||||
|
||||
|
||||
|
||||
# I don't understand this very well yet. See Issue #267
|
||||
# Fingers crossed!
|
||||
#if os.environ.get('USE_OPENMP') == '1':
|
||||
|
|
|
@ -8,9 +8,11 @@ from . import de
|
|||
from . import zh
|
||||
from . import es
|
||||
from . import it
|
||||
from . import hu
|
||||
from . import fr
|
||||
from . import pt
|
||||
from . import nl
|
||||
from . import sv
|
||||
|
||||
|
||||
try:
|
||||
|
@ -25,8 +27,10 @@ set_lang_class(es.Spanish.lang, es.Spanish)
|
|||
set_lang_class(pt.Portuguese.lang, pt.Portuguese)
|
||||
set_lang_class(fr.French.lang, fr.French)
|
||||
set_lang_class(it.Italian.lang, it.Italian)
|
||||
set_lang_class(hu.Hungarian.lang, hu.Hungarian)
|
||||
set_lang_class(zh.Chinese.lang, zh.Chinese)
|
||||
set_lang_class(nl.Dutch.lang, nl.Dutch)
|
||||
set_lang_class(sv.Swedish.lang, sv.Swedish)
|
||||
|
||||
|
||||
def load(name, **overrides):
|
||||
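The hunks above register the new Hungarian and Swedish classes under their language codes with `set_lang_class`. As a rough illustration of that pattern, here is a minimal sketch of a language registry. It is not spaCy's actual implementation: the `LANGUAGES` dict, the `get_lang_class` lookup, and the two stub classes are assumptions made for this example only.

```python
# Hypothetical stand-in for a language registry; spaCy's own helper may
# behave differently (for example in how it reports unknown codes).
LANGUAGES = {}


def set_lang_class(name, cls):
    """Register a language class under its language code."""
    LANGUAGES[name] = cls


def get_lang_class(name):
    """Look up a registered class, failing loudly for unknown codes."""
    if name not in LANGUAGES:
        raise KeyError("No language class registered for %r" % name)
    return LANGUAGES[name]


# Stub classes standing in for hu.Hungarian and sv.Swedish.
class Hungarian(object):
    lang = 'hu'


class Swedish(object):
    lang = 'sv'


set_lang_class(Hungarian.lang, Hungarian)
set_lang_class(Swedish.lang, Swedish)
```

Keying the registry on the class's own `lang` attribute, as the diff does, keeps the code and the class from drifting apart.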
|
|
|
@ -2,7 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA
|
||||
from ..language_data import PRON_LEMMA, DET_LEMMA
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
|
@ -15,23 +15,27 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"'S": [
|
||||
{ORTH: "'S", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'S", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"'n": [
|
||||
{ORTH: "'n", LEMMA: "ein"}
|
||||
{ORTH: "'n", LEMMA: DET_LEMMA, NORM: "ein"}
|
||||
],
|
||||
|
||||
"'ne": [
|
||||
{ORTH: "'ne", LEMMA: "eine"}
|
||||
{ORTH: "'ne", LEMMA: DET_LEMMA, NORM: "eine"}
|
||||
],
|
||||
|
||||
"'nen": [
|
||||
{ORTH: "'nen", LEMMA: "einen"}
|
||||
{ORTH: "'nen", LEMMA: DET_LEMMA, NORM: "einen"}
|
||||
],
|
||||
|
||||
"'nem": [
|
||||
{ORTH: "'nem", LEMMA: DET_LEMMA, NORM: "einem"}
|
||||
],
|
||||
|
||||
"'s": [
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"Abb.": [
|
||||
|
@ -195,7 +199,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"S'": [
|
||||
{ORTH: "S'", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "S'", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"Sa.": [
|
||||
|
@ -244,7 +248,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"auf'm": [
|
||||
{ORTH: "auf", LEMMA: "auf"},
|
||||
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem" }
|
||||
],
|
||||
|
||||
"bspw.": [
|
||||
|
@ -268,8 +272,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"du's": [
|
||||
{ORTH: "du", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"ebd.": [
|
||||
|
@ -285,8 +289,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"er's": [
|
||||
{ORTH: "er", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"evtl.": [
|
||||
|
@ -315,7 +319,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"hinter'm": [
|
||||
{ORTH: "hinter", LEMMA: "hinter"},
|
||||
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
|
||||
],
|
||||
|
||||
"i.O.": [
|
||||
|
@ -327,13 +331,13 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"ich's": [
|
||||
{ORTH: "ich", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"ihr's": [
|
||||
{ORTH: "ihr", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"incl.": [
|
||||
|
@ -385,7 +389,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"s'": [
|
||||
{ORTH: "s'", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "s'", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"s.o.": [
|
||||
|
@ -393,8 +397,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"sie's": [
|
||||
{ORTH: "sie", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"sog.": [
|
||||
|
@ -423,7 +427,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"unter'm": [
|
||||
{ORTH: "unter", LEMMA: "unter"},
|
||||
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
|
||||
],
|
||||
|
||||
"usf.": [
|
||||
|
@ -464,12 +468,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"vor'm": [
|
||||
{ORTH: "vor", LEMMA: "vor"},
|
||||
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
|
||||
],
|
||||
|
||||
"wir's": [
|
||||
{ORTH: "wir", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"z.B.": [
|
||||
|
@ -506,7 +510,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"über'm": [
|
||||
{ORTH: "über", LEMMA: "über"},
|
||||
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
|
||||
]
|
||||
}
|
||||
|
||||
|
@ -625,5 +629,5 @@ ORTH_ONLY = [
|
|||
"wiss.",
|
||||
"x.",
|
||||
"y.",
|
||||
"z.",
|
||||
"z."
|
||||
]
|
||||
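The German hunks above add a `NORM` value to contracted forms such as `'m` and `'s`, and switch the article contractions to `DET_LEMMA`. The sketch below shows one way such entries could be read back. The constants are plain-string stand-ins for the values imported from spaCy's `symbols` and `language_data` modules (the real values may differ), and the fallback from `NORM` to `ORTH` is an assumption of this sketch, not documented library behaviour.

```python
# Plain-string stand-ins for spaCy's symbol constants (assumed values).
ORTH, LEMMA, NORM, TAG = "orth", "lemma", "norm", "tag"
PRON_LEMMA, DET_LEMMA = "-PRON-", "-DET-"

# Two entries copied from the new side of the diff.
TOKENIZER_EXCEPTIONS = {
    "auf'm": [
        {ORTH: "auf", LEMMA: "auf"},
        {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"},
    ],
    "du's": [
        {ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
    ],
}


def normalized(text, exceptions):
    """Return the normalized form of each piece of a contraction.

    Falls back to the ORTH value when an entry has no NORM
    (an assumption made for this sketch).
    """
    return [piece.get(NORM, piece[ORTH]) for piece in exceptions[text]]


print(normalized("auf'm", TOKENIZER_EXCEPTIONS))  # ['auf', 'dem']
print(normalized("du's", TOKENIZER_EXCEPTIONS))   # ['du', 'es']
```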
|
|
|
@ -44,6 +44,7 @@ def _fix_deprecated_glove_vectors_loading(overrides):
|
|||
else:
|
||||
path = overrides['path']
|
||||
data_path = path.parent
|
||||
vec_path = None
|
||||
if 'add_vectors' not in overrides:
|
||||
if 'vectors' in overrides:
|
||||
vec_path = match_best_version(overrides['vectors'], None, data_path)
|
||||
|
|
|
@ -11,7 +11,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Theydve": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -68,7 +68,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"itll": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -113,7 +113,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Idve": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -124,23 +124,23 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Ive": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"they'd": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"Youdve": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"theyve": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -160,12 +160,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"I'm": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
|
||||
],
|
||||
|
||||
"She'd've": [
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -191,7 +191,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"they've": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -226,12 +226,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"i'll": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"you'd": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -287,7 +287,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"youll": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -307,7 +307,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Youre": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "re", LEMMA: "be"}
|
||||
],
|
||||
|
||||
|
@ -369,7 +369,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"You'll": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -379,7 +379,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"i'd": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -394,7 +394,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"i'm": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
|
||||
],
|
||||
|
||||
|
@ -425,7 +425,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Hes": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "s"}
|
||||
],
|
||||
|
||||
|
@ -435,7 +435,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"It's": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
|
@ -445,7 +445,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Hed": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -464,12 +464,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"It'd": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"theydve": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -489,7 +489,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"I've": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -499,13 +499,13 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Itdve": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"I'ma": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ma"}
|
||||
],
|
||||
|
||||
|
@ -515,7 +515,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"They'd": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -525,7 +525,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"You've": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -546,7 +546,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"I'd've": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -557,13 +557,13 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"it'd": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"what're": [
|
||||
{ORTH: "what"},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"Wasn't": [
|
||||
|
@ -577,18 +577,18 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"he'd've": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"She'd": [
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"shedve": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -599,12 +599,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"She's": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
"i'd've": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -631,7 +631,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"you'd've": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -647,7 +647,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Youd": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -678,12 +678,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"ive": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"It'd've": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -693,7 +693,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Itll": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -708,12 +708,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"im": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
|
||||
],
|
||||
|
||||
"they'd've": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -735,19 +735,19 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"youdve": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"Shedve": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"theyd": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -763,11 +763,11 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"What're": [
|
||||
{ORTH: "What"},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"He'll": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -777,8 +777,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"They're": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"shouldnt": [
|
||||
|
@ -796,7 +796,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"youve": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -816,7 +816,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Youve": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -841,12 +841,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"they're": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"idve": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -857,8 +857,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"youre": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "re"}
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"Didn't": [
|
||||
|
@ -877,8 +877,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Im": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be"}
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "m", TAG: "VBP", "tenspect": 1, "number": 1, LEMMA: "be", NORM: "am"}
|
||||
],
|
||||
|
||||
"howd": [
|
||||
|
@ -887,22 +887,22 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"you've": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"You're": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"she'll": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"Theyll": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -912,12 +912,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"itd": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"Hedve": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -933,8 +933,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"We're": [
|
||||
{ORTH: "We", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"\u2018S": [
|
||||
|
@ -951,7 +951,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"ima": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ma"}
|
||||
],
|
||||
|
||||
|
@ -961,7 +961,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"he's": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
|
@ -981,13 +981,13 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"hedve": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"he'd": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1029,7 +1029,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"You'd've": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -1072,12 +1072,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"wont": [
|
||||
{ORTH: "wo"},
|
||||
{ORTH: "wo", LEMMA: "will"},
|
||||
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
|
||||
],
|
||||
|
||||
"she'd've": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -1088,7 +1088,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"theyre": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "re"}
|
||||
],
|
||||
|
||||
|
@ -1129,7 +1129,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"They'll": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1139,7 +1139,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Wedve": [
|
||||
{ORTH: "We"},
|
||||
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -1156,7 +1156,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"we'd": [
|
||||
{ORTH: "we"},
|
||||
{ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1193,7 +1193,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"why're": [
|
||||
{ORTH: "why"},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"Doesnt": [
|
||||
|
@ -1207,12 +1207,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"they'll": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"I'd": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1237,12 +1237,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"you're": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"They've": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -1272,12 +1272,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"She'll": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"You'd": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1297,8 +1297,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Theyre": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "re"}
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"Won't": [
|
||||
|
@ -1312,33 +1312,33 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"it's": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
"it'll": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"They'd've": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"Ima": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ma"}
|
||||
],
|
||||
|
||||
"gonna": [
|
||||
{ORTH: "gon", LEMMA: "go"},
|
||||
{ORTH: "gon", LEMMA: "go", NORM: "going"},
|
||||
{ORTH: "na", LEMMA: "to"}
|
||||
],
|
||||
|
||||
"Gonna": [
|
||||
{ORTH: "Gon", LEMMA: "go"},
|
||||
{ORTH: "Gon", LEMMA: "go", NORM: "going"},
|
||||
{ORTH: "na", LEMMA: "to"}
|
||||
],
|
||||
|
||||
|
@ -1359,7 +1359,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"youd": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1390,7 +1390,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"He'd've": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -1427,17 +1427,17 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"hes": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "s"}
|
||||
],
|
||||
|
||||
"he'll": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"hed": [
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "he", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1447,8 +1447,8 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"we're": [
|
||||
{ORTH: "we", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "'re", LEMMA: "be"}
|
||||
{ORTH: "we", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"Hadnt": [
|
||||
|
@ -1457,12 +1457,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Shant": [
|
||||
{ORTH: "Sha"},
|
||||
{ORTH: "Sha", LEMMA: "shall"},
|
||||
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
|
||||
],
|
||||
|
||||
"Theyve": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -1477,7 +1477,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"i've": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
|
@ -1487,7 +1487,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"i'ma": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ma"}
|
||||
],
|
||||
|
||||
|
@ -1502,7 +1502,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"shant": [
|
||||
{ORTH: "sha"},
|
||||
{ORTH: "sha", LEMMA: "shall"},
|
||||
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
|
||||
],
|
||||
|
||||
|
@ -1513,7 +1513,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"I'll": [
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "I", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1571,7 +1571,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"shes": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "s"}
|
||||
],
|
||||
|
||||
|
@ -1586,12 +1586,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Hasnt": [
|
||||
{ORTH: "Has"},
|
||||
{ORTH: "Has", LEMMA: "have"},
|
||||
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
|
||||
],
|
||||
|
||||
"He's": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
|
@ -1611,12 +1611,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"He'd": [
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "He", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"Shes": [
|
||||
{ORTH: "i", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "s"}
|
||||
],
|
||||
|
||||
|
@ -1626,7 +1626,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Youll": [
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "You", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1636,18 +1636,18 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"theyll": [
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "they", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"it'd've": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
||||
"itdve": [
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "it", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"},
|
||||
{ORTH: "ve", LEMMA: "have", TAG: "VB"}
|
||||
],
|
||||
|
@ -1674,7 +1674,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Wont": [
|
||||
{ORTH: "Wo"},
|
||||
{ORTH: "Wo", LEMMA: "will"},
|
||||
{ORTH: "nt", LEMMA: "not", TAG: "RB"}
|
||||
],
|
||||
|
||||
|
@ -1691,7 +1691,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
|
||||
"Whatre": [
|
||||
{ORTH: "What"},
|
||||
{ORTH: "re"}
|
||||
{ORTH: "re", LEMMA: "be", NORM: "are"}
|
||||
],
|
||||
|
||||
"'s": [
|
||||
|
@ -1719,12 +1719,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"It'll": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"We'd": [
|
||||
{ORTH: "We"},
|
||||
{ORTH: "We", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1738,12 +1738,12 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"Itd": [
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "It", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"she'd": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
|
@ -1758,17 +1758,17 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"you'll": [
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "you", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'ll", LEMMA: "will", TAG: "MD"}
|
||||
],
|
||||
|
||||
"Theyd": [
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "They", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "d", LEMMA: "would", TAG: "MD"}
|
||||
],
|
||||
|
||||
"she's": [
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA},
|
||||
{ORTH: "she", LEMMA: PRON_LEMMA, TAG: "PRP"},
|
||||
{ORTH: "'s"}
|
||||
],
|
||||
|
||||
|
@ -1783,7 +1783,7 @@ TOKENIZER_EXCEPTIONS = {
|
|||
],
|
||||
|
||||
"'em": [
|
||||
{ORTH: "'em", LEMMA: PRON_LEMMA}
|
||||
{ORTH: "'em", LEMMA: PRON_LEMMA, NORM: "them"}
|
||||
],
|
||||
|
||||
"ol'": [
|
||||
|
|
|
@ -3,17 +3,48 @@ from __future__ import unicode_literals
|
|||
|
||||
from .. import language_data as base
|
||||
from ..language_data import update_exc, strings_to_exc
|
||||
from ..symbols import ORTH, LEMMA
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
|
||||
|
||||
|
||||
def get_time_exc(hours):
|
||||
exc = {
|
||||
"12m.": [
|
||||
{ORTH: "12"},
|
||||
{ORTH: "m.", LEMMA: "p.m."}
|
||||
]
|
||||
}
|
||||
|
||||
for hour in hours:
|
||||
exc["%da.m." % hour] = [
|
||||
{ORTH: hour},
|
||||
{ORTH: "a.m."}
|
||||
]
|
||||
|
||||
exc["%dp.m." % hour] = [
|
||||
{ORTH: hour},
|
||||
{ORTH: "p.m."}
|
||||
]
|
||||
|
||||
exc["%dam" % hour] = [
|
||||
{ORTH: hour},
|
||||
{ORTH: "am", LEMMA: "a.m."}
|
||||
]
|
||||
|
||||
exc["%dpm" % hour] = [
|
||||
{ORTH: hour},
|
||||
{ORTH: "pm", LEMMA: "p.m."}
|
||||
]
|
||||
return exc
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, 12 + 1)))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
|
||||
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]
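The time exceptions built by `get_time_exc` above can be sketched in isolation as follows. This is a minimal illustration, not the module itself: `ORTH` and `LEMMA` are plain-string stand-ins for spaCy's symbols, and the hour is formatted as a string on the assumption that `ORTH` values are meant to be text.

```python
# Stand-ins for spaCy's ORTH and LEMMA symbols (assumption: plain strings here).
ORTH = "orth"
LEMMA = "lemma"


def get_time_exc(hours):
    """Build tokenizer exceptions that split an hour from its am/pm marker."""
    exc = {"12m.": [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}]}
    for hour in hours:
        text = "%d" % hour  # assumption: ORTH values are kept as strings
        exc[text + "a.m."] = [{ORTH: text}, {ORTH: "a.m."}]
        exc[text + "p.m."] = [{ORTH: text}, {ORTH: "p.m."}]
        exc[text + "am"] = [{ORTH: text}, {ORTH: "am", LEMMA: "a.m."}]
        exc[text + "pm"] = [{ORTH: text}, {ORTH: "pm", LEMMA: "p.m."}]
    return exc


time_exc = get_time_exc(range(1, 12 + 1))
print(len(time_exc))    # 12 hours * 4 spellings, plus the "12m." entry
print(time_exc["3pm"])  # two tokens: the hour and the normalised marker
```

Each key is the unsplit surface form, and each value lists the tokens it should be split into, so `"3pm"` becomes `3` followed by `pm` with the lemma `p.m.`.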
|
||||
|
|
|
@ -2,317 +2,138 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA
|
||||
from ..language_data import PRON_LEMMA, DET_LEMMA
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
"accidentarse": [
|
||||
{ORTH: "accidentar", LEMMA: "accidentar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"aceptarlo": [
|
||||
{ORTH: "aceptar", LEMMA: "aceptar", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"acompañarla": [
|
||||
{ORTH: "acompañar", LEMMA: "acompañar", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"advertirle": [
|
||||
{ORTH: "advertir", LEMMA: "advertir", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"al": [
|
||||
{ORTH: "a", LEMMA: "a", POS: ADP},
|
||||
{ORTH: "el", LEMMA: "el", POS: DET}
|
||||
{ORTH: "a", LEMMA: "a", TAG: ADP},
|
||||
{ORTH: "l", LEMMA: "el", TAG: DET}
|
||||
],
|
||||
|
||||
"anunciarnos": [
|
||||
{ORTH: "anunciar", LEMMA: "anunciar", POS: AUX},
|
||||
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"consigo": [
|
||||
{ORTH: "con", LEMMA: "con"},
|
||||
{ORTH: "sigo", LEMMA: PRON_LEMMA, NORM: "sí"}
|
||||
],
|
||||
|
||||
"asegurándole": [
|
||||
{ORTH: "asegurando", LEMMA: "asegurar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"conmigo": [
|
||||
{ORTH: "con", LEMMA: "con"},
|
||||
{ORTH: "migo", LEMMA: PRON_LEMMA, NORM: "mí"}
|
||||
],
|
||||
|
||||
"considerarle": [
|
||||
{ORTH: "considerar", LEMMA: "considerar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"decirle": [
|
||||
{ORTH: "decir", LEMMA: "decir", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"decirles": [
|
||||
{ORTH: "decir", LEMMA: "decir", POS: AUX},
|
||||
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"decirte": [
|
||||
{ORTH: "Decir", LEMMA: "decir", POS: AUX},
|
||||
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"dejarla": [
|
||||
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"dejarnos": [
|
||||
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
|
||||
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"dejándole": [
|
||||
{ORTH: "dejando", LEMMA: "dejar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"contigo": [
|
||||
{ORTH: "con", LEMMA: "con"},
|
||||
{ORTH: "tigo", LEMMA: PRON_LEMMA, NORM: "ti"}
|
||||
],
|
||||
|
||||
"del": [
|
||||
{ORTH: "de", LEMMA: "de", POS: ADP},
|
||||
{ORTH: "el", LEMMA: "el", POS: DET}
|
||||
],
|
||||
|
||||
"demostrarles": [
|
||||
{ORTH: "demostrar", LEMMA: "demostrar", POS: AUX},
|
||||
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"diciéndole": [
|
||||
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"diciéndoles": [
|
||||
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
|
||||
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"diferenciarse": [
|
||||
{ORTH: "diferenciar", LEMMA: "diferenciar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: "él", POS: PRON}
|
||||
],
|
||||
|
||||
"divirtiéndome": [
|
||||
{ORTH: "divirtiendo", LEMMA: "divertir", POS: AUX},
|
||||
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"ensanchándose": [
|
||||
{ORTH: "ensanchando", LEMMA: "ensanchar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"explicarles": [
|
||||
{ORTH: "explicar", LEMMA: "explicar", POS: AUX},
|
||||
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberla": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberlas": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberlo": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberlos": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberme": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"haberse": [
|
||||
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"hacerle": [
|
||||
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"hacerles": [
|
||||
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
|
||||
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"hallarse": [
|
||||
{ORTH: "hallar", LEMMA: "hallar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"imaginaros": [
|
||||
{ORTH: "imaginar", LEMMA: "imaginar", POS: AUX},
|
||||
{ORTH: "os", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"insinuarle": [
|
||||
{ORTH: "insinuar", LEMMA: "insinuar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"justificarla": [
|
||||
{ORTH: "justificar", LEMMA: "justificar", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"mantenerlas": [
|
||||
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
|
||||
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"mantenerlos": [
|
||||
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
|
||||
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"mantenerme": [
|
||||
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
|
||||
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"pasarte": [
|
||||
{ORTH: "pasar", LEMMA: "pasar", POS: AUX},
|
||||
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"pedirle": [
|
||||
{ORTH: "pedir", LEMMA: "pedir", POS: AUX},
|
||||
{ORTH: "le", LEMMA: "él", POS: PRON}
|
||||
{ORTH: "de", LEMMA: "de", TAG: ADP},
|
||||
{ORTH: "l", LEMMA: "el", TAG: DET}
|
||||
],
|
||||
|
||||
"pel": [
|
||||
{ORTH: "per", LEMMA: "per", POS: ADP},
|
||||
{ORTH: "el", LEMMA: "el", POS: DET}
|
||||
{ORTH: "pe", LEMMA: "per", TAG: ADP},
|
||||
{ORTH: "l", LEMMA: "el", TAG: DET}
|
||||
],
|
||||
|
||||
"pidiéndonos": [
|
||||
{ORTH: "pidiendo", LEMMA: "pedir", POS: AUX},
|
||||
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"pal": [
|
||||
{ORTH: "pa", LEMMA: "para"},
|
||||
{ORTH: "l", LEMMA: DET_LEMMA, NORM: "el"}
|
||||
],
|
||||
|
||||
"poderle": [
|
||||
{ORTH: "poder", LEMMA: "poder", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"pala": [
|
||||
{ORTH: "pa", LEMMA: "para"},
|
||||
{ORTH: "la", LEMMA: DET_LEMMA}
|
||||
],
|
||||
|
||||
"preguntarse": [
|
||||
{ORTH: "preguntar", LEMMA: "preguntar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"aprox.": [
|
||||
{ORTH: "aprox.", LEMMA: "aproximadamente"}
|
||||
],
|
||||
|
||||
"preguntándose": [
|
||||
{ORTH: "preguntando", LEMMA: "preguntar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"dna.": [
|
||||
{ORTH: "dna.", LEMMA: "docena"}
|
||||
],
|
||||
|
||||
"presentarla": [
|
||||
{ORTH: "presentar", LEMMA: "presentar", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"esq.": [
|
||||
{ORTH: "esq.", LEMMA: "esquina"}
|
||||
],
|
||||
|
||||
"pudiéndolo": [
|
||||
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"pág.": [
|
||||
{ORTH: "pág.", LEMMA: "página"}
|
||||
],
|
||||
|
||||
"pudiéndose": [
|
||||
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"p.ej.": [
|
||||
{ORTH: "p.ej.", LEMMA: "por ejemplo"}
|
||||
],
|
||||
|
||||
"quererle": [
|
||||
{ORTH: "querer", LEMMA: "querer", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"Ud.": [
|
||||
{ORTH: "Ud.", LEMMA: PRON_LEMMA, NORM: "usted"}
|
||||
],
|
||||
|
||||
"rasgarse": [
|
||||
{ORTH: "Rasgar", LEMMA: "rasgar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"Vd.": [
|
||||
{ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"}
|
||||
],
|
||||
|
||||
"repetirlo": [
|
||||
{ORTH: "repetir", LEMMA: "repetir", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"Uds.": [
|
||||
{ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}
|
||||
],
|
||||
|
||||
"robarle": [
|
||||
{ORTH: "robar", LEMMA: "robar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"seguirlos": [
|
||||
{ORTH: "seguir", LEMMA: "seguir", POS: AUX},
|
||||
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"serle": [
|
||||
{ORTH: "ser", LEMMA: "ser", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"serlo": [
|
||||
{ORTH: "ser", LEMMA: "ser", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"señalándole": [
|
||||
{ORTH: "señalando", LEMMA: "señalar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"suplicarle": [
|
||||
{ORTH: "suplicar", LEMMA: "suplicar", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"tenerlos": [
|
||||
{ORTH: "tener", LEMMA: "tener", POS: AUX},
|
||||
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"vengarse": [
|
||||
{ORTH: "vengar", LEMMA: "vengar", POS: AUX},
|
||||
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"verla": [
|
||||
{ORTH: "ver", LEMMA: "ver", POS: AUX},
|
||||
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"verle": [
|
||||
{ORTH: "ver", LEMMA: "ver", POS: AUX},
|
||||
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
],
|
||||
|
||||
"volverlo": [
|
||||
{ORTH: "volver", LEMMA: "volver", POS: AUX},
|
||||
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||
"Vds.": [
|
||||
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
ORTH_ONLY = [
|
||||
|
||||
"a.",
|
||||
"a.C.",
|
||||
"a.J.C.",
|
||||
"apdo.",
|
||||
"Av.",
|
||||
"Avda.",
|
||||
"b.",
|
||||
"c.",
|
||||
"Cía.",
|
||||
"d.",
|
||||
"e.",
|
||||
"etc.",
|
||||
"f.",
|
||||
"g.",
|
||||
"Gob.",
|
||||
"Gral.",
|
||||
"h.",
|
||||
"i.",
|
||||
"Ing.",
|
||||
"j.",
|
||||
"J.C.",
|
||||
"k.",
|
||||
"l.",
|
||||
"Lic.",
|
||||
"m.",
|
||||
"m.n.",
|
||||
"n.",
|
||||
"no.",
|
||||
"núm.",
|
||||
"o.",
|
||||
"p.",
|
||||
"P.D.",
|
||||
"Prof.",
|
||||
"Profa.",
|
||||
"q.",
|
||||
"q.e.p.d.",
|
||||
"r.",
|
||||
"s.",
|
||||
"S.A.",
|
||||
"S.L.",
|
||||
"s.s.s.",
|
||||
"Sr.",
|
||||
"Sra.",
|
||||
"Srta.",
|
||||
"t.",
|
||||
"u.",
|
||||
"v.",
|
||||
"w.",
|
||||
"x.",
|
||||
"y.",
|
||||
"z."
|
||||
]
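Two easy mistakes in exception tables like these are token pieces that do not join back into their key, and a missing comma in a list of strings. The sketch below is a hypothetical checker, not part of spaCy: `find_mismatched` and the sample data are invented for illustration, `ORTH` is a plain-string stand-in, and it assumes the convention that the `ORTH` values of an entry concatenate to the key.

```python
ORTH = "orth"  # stand-in for spaCy's ORTH symbol (assumption)


def find_mismatched(exceptions):
    """Return keys whose token ORTH values do not join back into the key."""
    return sorted(
        key for key, tokens in exceptions.items()
        if "".join(token[ORTH] for token in tokens) != key
    )


# Made-up sample entries; the last one is deliberately broken.
sample = {
    "del": [{ORTH: "de"}, {ORTH: "l"}],
    "pal": [{ORTH: "pa"}, {ORTH: "l"}],
    "xy": [{ORTH: "x"}, {ORTH: "yy"}],  # joins to "xyy", not "xy"
}
mismatched = find_mismatched(sample)
print(mismatched)

# Adjacent string literals are concatenated by Python, so a missing comma
# silently merges two list entries into one.
merged = ["a." "b.", "c."]
print(merged)
```

Running a check like this over a table before registering it would catch both kinds of slip early.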
|
||||
|
|
23
spacy/hu/__init__.py
Normal file
|
@ -0,0 +1,23 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from .language_data import *
|
||||
from ..attrs import LANG
|
||||
from ..language import Language
|
||||
|
||||
|
||||
class Hungarian(Language):
|
||||
lang = 'hu'
|
||||
|
||||
class Defaults(Language.Defaults):
|
||||
tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS)
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'hu'
|
||||
|
||||
prefixes = tuple(TOKENIZER_PREFIXES)
|
||||
|
||||
suffixes = tuple(TOKENIZER_SUFFIXES)
|
||||
|
||||
infixes = tuple(TOKENIZER_INFIXES)
|
||||
|
||||
stop_words = set(STOP_WORDS)
|
24
spacy/hu/language_data.py
Normal file
|
@ -0,0 +1,24 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import six
|
||||
|
||||
from spacy.language_data import strings_to_exc, update_exc
|
||||
from .punctuations import *
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import ABBREVIATIONS
|
||||
from .tokenizer_exceptions import OTHER_EXC
|
||||
from .. import language_data as base
|
||||
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
|
||||
TOKENIZER_PREFIXES = base.TOKENIZER_PREFIXES + TOKENIZER_PREFIXES
|
||||
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES
|
||||
TOKENIZER_INFIXES = TOKENIZER_INFIXES
|
||||
|
||||
# HYPHENS = [six.unichr(cp) for cp in [173, 8211, 8212, 8213, 8722, 9472]]
|
||||
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(OTHER_EXC))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ABBREVIATIONS))
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
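As a rough illustration of how the exception table in this file is assembled, the sketch below uses simplified stand-ins for `strings_to_exc` and `update_exc`. Both stand-ins are assumptions written for this example: the real helpers are imported from `spacy.language_data` and are not shown in this diff, and the plain `"orth"` key here replaces the `ORTH` symbol. The three abbreviations are taken from the Hungarian list added in this commit.

```python
# Hypothetical stand-ins for the spaCy helpers; not the library's actual code.
def strings_to_exc(orths):
    """Map each string to a one-token exception that keeps it intact."""
    return {orth: [{"orth": orth}] for orth in orths}


def update_exc(exc, additions):
    """Merge additions into exc, refusing to overwrite existing entries."""
    overlap = set(exc).intersection(additions)
    if overlap:
        raise ValueError("duplicate exceptions: %s" % sorted(overlap))
    exc.update(additions)


# Same construction as ABBREVIATIONS: one entry per line, split on whitespace.
ABBREVIATIONS = """
Dr.
pl.
stb.
""".strip().split()

exceptions = strings_to_exc([":-)"])          # stands in for base.EMOTICONS
update_exc(exceptions, strings_to_exc(ABBREVIATIONS))
```

The overlap check is a design choice of this sketch: failing loudly on a repeated key makes it obvious when two lists define the same exception.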
|
89
spacy/hu/punctuations.py
Normal file
|
@ -0,0 +1,89 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
TOKENIZER_PREFIXES = r'''
|
||||
+
|
||||
'''.strip().split('\n')
|
||||
|
||||
TOKENIZER_SUFFIXES = r'''
|
||||
,
|
||||
\"
|
||||
\)
|
||||
\]
|
||||
\}
|
||||
\*
|
||||
\!
|
||||
\?
|
||||
\$
|
||||
>
|
||||
:
|
||||
;
|
||||
'
|
||||
”
|
||||
“
|
||||
«
|
||||
_
|
||||
''
|
||||
’
|
||||
‘
|
||||
€
|
||||
\.\.
|
||||
\.\.\.
|
||||
\.\.\.\.
|
||||
(?<=[a-züóőúéáűí)\]"'´«‘’%\)²“”+-])\.
|
||||
(?<=[a-züóőúéáűí)])-e
|
||||
\-\-
|
||||
´
|
||||
(?<=[0-9])\+
|
||||
(?<=[a-z0-9üóőúéáűí][\)\]”"'%\)§/])\.
|
||||
(?<=[0-9])km²
|
||||
(?<=[0-9])m²
|
||||
(?<=[0-9])cm²
|
||||
(?<=[0-9])mm²
|
||||
(?<=[0-9])km³
|
||||
(?<=[0-9])m³
|
||||
(?<=[0-9])cm³
|
||||
(?<=[0-9])mm³
|
||||
(?<=[0-9])ha
|
||||
(?<=[0-9])km
|
||||
(?<=[0-9])m
|
||||
(?<=[0-9])cm
|
||||
(?<=[0-9])mm
|
||||
(?<=[0-9])µm
|
||||
(?<=[0-9])nm
|
||||
(?<=[0-9])yd
|
||||
(?<=[0-9])in
|
||||
(?<=[0-9])ft
|
||||
(?<=[0-9])kg
|
||||
(?<=[0-9])g
|
||||
(?<=[0-9])mg
|
||||
(?<=[0-9])µg
|
||||
(?<=[0-9])t
|
||||
(?<=[0-9])lb
|
||||
(?<=[0-9])oz
|
||||
(?<=[0-9])m/s
|
||||
(?<=[0-9])km/h
|
||||
(?<=[0-9])mph
|
||||
(?<=°[FCK])\.
|
||||
(?<=[0-9])hPa
|
||||
(?<=[0-9])Pa
|
||||
(?<=[0-9])mbar
|
||||
(?<=[0-9])mb
|
||||
(?<=[0-9])T
|
||||
(?<=[0-9])G
|
||||
(?<=[0-9])M
|
||||
(?<=[0-9])K
|
||||
(?<=[0-9])kb
|
||||
'''.strip().split('\n')
|
||||
|
||||
TOKENIZER_INFIXES = r'''
|
||||
…
|
||||
\.\.+
|
||||
(?<=[a-züóőúéáűí])\.(?=[A-ZÜÓŐÚÉÁŰÍ])
|
||||
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ0-9])"(?=[\-a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
|
||||
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])--(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
|
||||
(?<=[0-9])[+\-\*/^](?=[0-9])
|
||||
(?<=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ]),(?=[a-zA-ZüóőúéáűíÜÓŐÚÉÁŰÍ])
|
||||
'''.strip().split('\n')
|
||||
|
||||
__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
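The unit suffixes in this file rely on a lookbehind, `(?<=[0-9])`, so that `km` or `m` is only split off when it directly follows a digit. The sketch below shows that behaviour with the standard `re` module on four patterns copied from `TOKENIZER_SUFFIXES`. How spaCy itself combines and applies the patterns is not shown in this diff; anchoring each pattern at the end of the chunk and the `split_suffix` helper are assumptions made for the example.

```python
import re

# Four suffix patterns copied from TOKENIZER_SUFFIXES above.
SUFFIXES = [r"\?", r"\.\.\.", r"(?<=[0-9])km", r"(?<=[0-9])m"]

# Assumption for this sketch: each pattern is anchored to the end of the chunk.
SUFFIX_RE = re.compile("|".join("(?:%s)$" % s for s in SUFFIXES))


def split_suffix(chunk):
    """Return (head, suffix); suffix is '' when no pattern matches."""
    match = SUFFIX_RE.search(chunk)
    if match is None:
        return chunk, ""
    return chunk[:match.start()], chunk[match.start():]


print(split_suffix("12km"))       # unit after a digit is split off
print(split_suffix("elem"))       # trailing "m" after a letter is kept
print(split_suffix("Valami..."))  # ellipsis is split off
```

Without the lookbehind, the bare pattern `m` would also strip the final letter of ordinary words such as `elem`.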
|
64
spacy/hu/stop_words.py
Normal file
|
@ -0,0 +1,64 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set("""
|
||||
a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben
|
||||
amelyeket amelyet amelynek ami amikor amit amolyan amíg annak arra arról az
|
||||
azok azon azonban azt aztán azután azzal azért
|
||||
|
||||
be belül benne bár
|
||||
|
||||
cikk cikkek cikkeket csak
|
||||
|
||||
de
|
||||
|
||||
e ebben eddig egy egyes egyetlen egyik egyre egyéb egész ehhez ekkor el ellen
|
||||
elo eloször elott elso elég előtt emilyen ennek erre ez ezek ezen ezt ezzel
|
||||
ezért
|
||||
|
||||
fel felé
|
||||
|
||||
ha hanem hiszen hogy hogyan hát
|
||||
|
||||
ide igen ill ill. illetve ilyen ilyenkor inkább is ismét ison itt
|
||||
|
||||
jobban jó jól
|
||||
|
||||
kell kellett keressünk keresztül ki kívül között közül
|
||||
|
||||
le legalább legyen lehet lehetett lenne lenni lesz lett
|
||||
|
||||
ma maga magát majd meg mellett mely melyek mert mi miatt mikor milyen minden
|
||||
mindenki mindent mindig mint mintha mit mivel miért mondta most már más másik
|
||||
még míg
|
||||
|
||||
nagy nagyobb nagyon ne nekem neki nem nincs néha néhány nélkül
|
||||
|
||||
o oda ok oket olyan ott
|
||||
|
||||
pedig persze például
|
||||
|
||||
rá
|
||||
|
||||
s saját sem semmi sok sokat sokkal stb. szemben szerint szinte számára szét
|
||||
|
||||
talán te tehát teljes ti tovább továbbá több túl ugyanis
|
||||
|
||||
utolsó után utána
|
||||
|
||||
vagy vagyis vagyok valaki valami valamint való van vannak vele vissza viszont
|
||||
volna volt voltak voltam voltunk
|
||||
|
||||
által általában át
|
||||
|
||||
én éppen és
|
||||
|
||||
így
|
||||
|
||||
ön össze
|
||||
|
||||
úgy új újabb újra
|
||||
|
||||
ő őket
|
||||
""".split())
|
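A stop-word set built this way is typically used for simple membership tests. The sketch below takes a handful of words from the Hungarian list above and filters a token list with them; the `content_words` helper and the lowercasing step are assumptions of this example, not part of the commit.

```python
# A small subset of the Hungarian stop words listed above.
STOP_WORDS = set("""
a az egy és hogy is nem
""".split())


def content_words(tokens):
    """Keep tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(content_words(["A", "kutya", "és", "a", "macska", "nem", "alszik"]))
```

Splitting a triple-quoted string on whitespace keeps the list easy to edit, and the `set` makes each lookup constant time on average.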
549
spacy/hu/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,549 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
ABBREVIATIONS = """
|
||||
AkH.
|
||||
Aö.
|
||||
B.CS.
|
||||
B.S.
|
||||
B.Sc.
|
||||
B.ú.é.k.
|
||||
BE.
|
||||
BEK.
|
||||
BSC.
|
||||
BSc.
|
||||
BTK.
|
||||
Be.
|
||||
Bek.
|
||||
Bfok.
|
||||
Bk.
|
||||
Bp.
|
||||
Btk.
|
||||
Btke.
|
||||
Btét.
|
||||
CSC.
|
||||
Cal.
|
||||
Co.
|
||||
Colo.
|
||||
Comp.
|
||||
Copr.
|
||||
Cs.
|
||||
Csc.
|
||||
Csop.
|
||||
Ctv.
|
||||
D.
|
||||
DR.
|
||||
Dipl.
|
||||
Dr.
|
||||
Dsz.
|
||||
Dzs.
|
||||
Fla.
|
||||
Főszerk.
|
||||
GM.
|
||||
Gy.
|
||||
HKsz.
|
||||
Hmvh.
|
||||
Inform.
|
||||
K.m.f.
|
||||
KER.
|
||||
KFT.
|
||||
KRT.
|
||||
Ker.
|
||||
Kft.
|
||||
Kong.
|
||||
Korm.
|
||||
Kr.
|
||||
Kr.e.
|
||||
Kr.u.
|
||||
Krt.
|
||||
M.A.
|
||||
M.S.
|
||||
M.SC.
|
||||
M.Sc.
|
||||
MA.
|
||||
MSC.
|
||||
MSc.
|
||||
Mass.
|
||||
Mlle.
|
||||
Mme.
|
||||
Mo.
|
||||
Mr.
|
||||
Mrs.
|
||||
Ms.
|
||||
Mt.
|
||||
N.N.
|
||||
NB.
|
||||
NBr.
|
||||
Nat.
|
||||
Nr.
|
||||
Ny.
|
||||
Nyh.
|
||||
Nyr.
|
||||
Op.
|
||||
P.H.
|
||||
P.S.
|
||||
PH.D.
|
||||
PHD.
|
||||
PROF.
|
||||
Ph.D
|
||||
PhD.
|
||||
Pp.
|
||||
Proc.
|
||||
Prof.
|
||||
Ptk.
|
||||
Rer.
|
||||
S.B.
|
||||
SZOLG.
|
||||
Salg.
|
||||
St.
|
||||
Sz.
|
||||
Szfv.
|
||||
Szjt.
|
||||
Szolg.
|
||||
Szt.
|
||||
Sztv.
|
||||
TEL.
|
||||
Tel.
|
||||
Ty.
|
||||
Tyr.
|
||||
Ui.
|
||||
Vcs.
|
||||
Vhr.
|
||||
X.Y.
|
||||
Zs.
|
||||
a.
|
||||
a.C.
|
||||
ac.
|
||||
adj.
|
||||
adm.
|
||||
ag.
|
||||
agit.
|
||||
alez.
|
||||
alk.
|
||||
altbgy.
|
||||
an.
|
||||
ang.
|
||||
arch.
|
||||
at.
|
||||
aug.
|
||||
b.
|
||||
b.a.
|
||||
b.s.
|
||||
b.sc.
|
||||
bek.
|
||||
belker.
|
||||
berend.
|
||||
biz.
|
||||
bizt.
|
||||
bo.
|
||||
bp.
|
||||
br.
|
||||
bsc.
|
||||
bt.
|
||||
btk.
|
||||
c.
|
||||
ca.
|
||||
cc.
|
||||
cca.
|
||||
cf.
|
||||
cif.
|
||||
co.
|
||||
corp.
|
||||
cos.
|
||||
cs.
|
||||
csc.
|
||||
csüt.
|
||||
cső.
|
||||
ctv.
|
||||
d.
|
||||
dbj.
|
||||
dd.
|
||||
ddr.
|
||||
de.
|
||||
dec.
|
||||
dikt.
|
||||
dipl.
|
||||
dj.
|
||||
dk.
|
||||
dny.
|
||||
dolg.
|
||||
dr.
|
||||
du.
|
||||
dzs.
|
||||
e.
|
||||
ea.
|
||||
ed.
|
||||
eff.
|
||||
egyh.
|
||||
ell.
|
||||
elv.
|
||||
elvt.
|
||||
em.
|
||||
eng.
|
||||
eny.
|
||||
et.
|
||||
etc.
|
||||
ev.
|
||||
ezr.
|
||||
eü.
|
||||
f.
|
||||
f.h.
|
||||
f.é.
|
||||
fam.
|
||||
febr.
|
||||
fej.
|
||||
felv.
|
||||
felügy.
|
||||
ff.
|
||||
ffi.
|
||||
fhdgy.
|
||||
fil.
|
||||
fiz.
|
||||
fm.
|
||||
foglalk.
|
||||
ford.
|
||||
fp.
|
||||
fr.
|
||||
frsz.
|
||||
fszla.
|
||||
fszt.
|
||||
ft.
|
||||
fuv.
|
||||
főig.
|
||||
főisk.
|
||||
főtörm.
|
||||
főv.
|
||||
g.
|
||||
gazd.
|
||||
gimn.
|
||||
gk.
|
||||
gkv.
|
||||
gondn.
|
||||
gr.
|
||||
grav.
|
||||
gy.
|
||||
gyak.
|
||||
gyártm.
|
||||
gör.
|
||||
h.
|
||||
hads.
|
||||
hallg.
|
||||
hdm.
|
||||
hdp.
|
||||
hds.
|
||||
hg.
|
||||
hiv.
|
||||
hk.
|
||||
hm.
|
||||
ho.
|
||||
honv.
|
||||
hp.
|
||||
hr.
|
||||
hrsz.
|
||||
hsz.
|
||||
ht.
|
||||
htb.
|
||||
hv.
|
||||
hőm.
|
||||
i.e.
|
||||
i.sz.
|
||||
id.
|
||||
ifj.
|
||||
ig.
|
||||
igh.
|
||||
ill.
|
||||
imp.
|
||||
inc.
|
||||
ind.
|
||||
inform.
|
||||
inic.
|
||||
int.
|
||||
io.
|
||||
ip.
|
||||
ir.
|
||||
irod.
|
||||
isk.
|
||||
ism.
|
||||
izr.
|
||||
iá.
|
||||
j.
|
||||
jan.
|
||||
jav.
|
||||
jegyz.
|
||||
jjv.
|
||||
jkv.
|
||||
jogh.
|
||||
jogt.
|
||||
jr.
|
||||
jvb.
|
||||
júl.
|
||||
jún.
|
||||
k.
|
||||
karb.
|
||||
kat.
|
||||
kb.
|
||||
kcs.
|
||||
kd.
|
||||
ker.
|
||||
kf.
|
||||
kft.
|
||||
kht.
|
||||
kir.
|
||||
kirend.
|
||||
kisip.
|
||||
kiv.
|
||||
kk.
|
||||
kkt.
|
||||
klin.
|
||||
kp.
|
||||
krt.
|
||||
kt.
|
||||
ktsg.
|
||||
kult.
|
||||
kv.
|
||||
kve.
|
||||
képv.
|
||||
kísérl.
|
||||
kóth.
|
||||
könyvt.
|
||||
körz.
|
||||
köv.
|
||||
közj.
|
||||
közl.
|
||||
közp.
|
||||
közt.
|
||||
kü.
|
||||
l.
|
||||
lat.
|
||||
ld.
|
||||
legs.
|
||||
lg.
|
||||
lgv.
|
||||
loc.
|
||||
lt.
|
||||
ltd.
|
||||
ltp.
|
||||
luth.
|
||||
m.
|
||||
m.a.
|
||||
m.s.
|
||||
m.sc.
|
||||
ma.
|
||||
mat.
|
||||
mb.
|
||||
med.
|
||||
megh.
|
||||
met.
|
||||
mf.
|
||||
mfszt.
|
||||
min.
|
||||
miss.
|
||||
mjr.
|
||||
mjv.
|
||||
mk.
|
||||
mlle.
|
||||
mme.
|
||||
mn.
|
||||
mozg.
|
||||
mr.
|
||||
mrs.
|
||||
ms.
|
||||
msc.
|
||||
má.
|
||||
máj.
|
||||
márc.
|
||||
mé.
|
||||
mélt.
|
||||
mü.
|
||||
műh.
|
||||
műsz.
|
||||
műv.
|
||||
művez.
|
||||
n.
|
||||
nagyker.
|
||||
nagys.
|
||||
nat.
|
||||
nb.
|
||||
neg.
|
||||
nk.
|
||||
nov.
|
||||
nu.
|
||||
ny.
|
||||
nyilv.
|
||||
nyrt.
|
||||
nyug.
|
||||
o.
|
||||
obj.
|
||||
okl.
|
||||
okt.
|
||||
olv.
|
||||
orsz.
|
||||
ort.
|
||||
ov.
|
||||
ovh.
|
||||
p.
|
||||
pf.
|
||||
pg.
|
||||
ph.d
|
||||
ph.d.
|
||||
phd.
|
||||
pk.
|
||||
pl.
|
||||
plb.
|
||||
plc.
|
||||
pld.
|
||||
plur.
|
||||
pol.
|
||||
polg.
|
||||
poz.
|
||||
pp.
|
||||
proc.
|
||||
prof.
|
||||
prot.
|
||||
pság.
|
||||
ptk.
|
||||
pu.
|
||||
pü.
|
||||
q.
|
||||
r.
|
||||
r.k.
|
||||
rac.
|
||||
rad.
|
||||
red.
|
||||
ref.
|
||||
reg.
|
||||
rer.
|
||||
rev.
|
||||
rf.
|
||||
rkp.
|
||||
rkt.
|
||||
rt.
|
||||
rtg.
|
||||
röv.
|
||||
s.
|
||||
s.b.
|
||||
s.k.
|
||||
sa.
|
||||
sel.
|
||||
sgt.
|
||||
sm.
|
||||
st.
|
||||
stat.
|
||||
stb.
|
||||
strat.
|
||||
sz.
|
||||
szakm.
|
||||
szaksz.
|
||||
szakszerv.
|
||||
szd.
|
||||
szds.
|
||||
szept.
|
||||
szerk.
|
||||
szf.
|
||||
szimf.
|
||||
szjt.
|
||||
szkv.
|
||||
szla.
|
||||
szn.
|
||||
szolg.
|
||||
szt.
|
||||
szubj.
|
||||
szöv.
|
||||
szül.
|
||||
t.
|
||||
tanm.
|
||||
tb.
|
||||
tbk.
|
||||
tc.
|
||||
techn.
|
||||
tek.
|
||||
tel.
|
||||
tf.
|
||||
tgk.
|
||||
ti.
|
||||
tip.
|
||||
tisztv.
|
||||
titks.
|
||||
tk.
|
||||
tkp.
|
||||
tny.
|
||||
tp.
|
||||
tszf.
|
||||
tszk.
|
||||
tszkv.
|
||||
tv.
|
||||
tvr.
|
||||
ty.
|
||||
törv.
|
||||
tü.
|
||||
u.
|
||||
ua.
|
||||
ui.
|
||||
unit.
|
||||
uo.
|
||||
uv.
|
||||
v.
|
||||
vas.
|
||||
vb.
|
||||
vegy.
|
||||
vh.
|
||||
vhol.
|
||||
vill.
|
||||
vizsg.
|
||||
vk.
|
||||
vkf.
|
||||
vkny.
|
||||
vm.
|
||||
vol.
|
||||
vs.
|
||||
vsz.
|
||||
vv.
|
||||
vál.
|
||||
vízv.
|
||||
vö.
|
||||
w.
|
||||
y.
|
||||
z.
|
||||
zrt.
|
||||
zs.
|
||||
Ész.
|
||||
Új-Z.
|
||||
ÚjZ.
|
||||
á.
|
||||
ált.
|
||||
ápr.
|
||||
ásv.
|
||||
é.
|
||||
ék.
|
||||
ény.
|
||||
érk.
|
||||
évf.
|
||||
í.
|
||||
ó.
|
||||
ö.
|
||||
össz.
|
||||
ötk.
|
||||
özv.
|
||||
ú.
|
||||
úm.
|
||||
ún.
|
||||
út.
|
||||
ü.
|
||||
üag.
|
||||
üd.
|
||||
üdv.
|
||||
üe.
|
||||
ümk.
|
||||
ütk.
|
||||
üv.
|
||||
ő.
|
||||
ű.
|
||||
őrgy.
|
||||
őrpk.
|
||||
őrv.
|
||||
""".strip().split()
|
||||
|
||||
OTHER_EXC = """
|
||||
''
|
||||
-e
|
||||
""".strip().split()
|
|
@ -5,6 +5,7 @@ from ..symbols import *
|
|||
|
||||
|
||||
PRON_LEMMA = "-PRON-"
|
||||
DET_LEMMA = "-DET-"
|
||||
ENT_ID = "ent_id"
|
||||
|
||||
|
||||
|
|
19
spacy/sv/__init__.py
Normal file
|
@ -0,0 +1,19 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from os import path
|
||||
|
||||
from ..language import Language
|
||||
from ..attrs import LANG
|
||||
from .language_data import *
|
||||
|
||||
|
||||
class Swedish(Language):
|
||||
lang = 'sv'
|
||||
|
||||
class Defaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'sv'
|
||||
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
stop_words = STOP_WORDS
|
14
spacy/sv/language_data.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .. import language_data as base
|
||||
from ..language_data import update_exc, strings_to_exc
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]
|
68
spacy/sv/morph_rules.py
Normal file
|
@ -0,0 +1,68 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA
|
||||
|
||||
# Used the table of pronouns at https://sv.wiktionary.org/wiki/deras
|
||||
|
||||
MORPH_RULES = {
|
||||
"PRP": {
|
||||
"jag": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
|
||||
"mig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
|
||||
"mej": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
|
||||
"du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom"},
|
||||
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
|
||||
"honom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
|
||||
"hon": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
|
||||
"henne": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
|
||||
"det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
|
||||
"vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
|
||||
"oss": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
|
||||
"ni": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom"},
|
||||
"er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Acc"},
|
||||
"de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
|
||||
"dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
|
||||
# "dom" is the colloquial form of both "de" and "dem", so no Case is set.
|
||||
"dom": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur"},
|
||||
|
||||
"min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"mitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"mina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"ditt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"dina": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"dess": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"vår": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"våran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"vårt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"vårat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"våra": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"eran": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"ert": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"erat": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"era": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||
"deras": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}
|
||||
},
|
||||
|
||||
"VBZ": {
|
||||
"är": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
|
||||
},
|
||||
|
||||
"VBP": {
|
||||
"är": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
|
||||
},
|
||||
|
||||
"VBD": {
|
||||
"var": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
|
||||
"vart": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
|
||||
}
|
||||
}
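A Python dict literal keeps only the last value for a repeated key, which matters for a table like `MORPH_RULES` where one surface form can have several readings (for instance, `"er"` is listed both as an object pronoun and as a possessive). The sketch below uses made-up miniature data and a hypothetical `duplicate_keys` helper to show the override and one way to detect repeated keys with the standard `ast` module.

```python
import ast
from collections import Counter

# Miniature stand-in for a rules table; "er" is deliberately repeated.
SOURCE = '''
RULES = {
    "er": {"Case": "Acc"},
    "era": {"Poss": "Yes"},
    "er": {"Poss": "Yes"},
}
'''


def duplicate_keys(source):
    """Return the sorted constant keys that occur more than once in a dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dupes.extend(k for k, n in Counter(keys).items() if n > 1)
    return sorted(dupes)


namespace = {}
exec(SOURCE, namespace)                  # evaluate the literal as Python would
surviving = namespace["RULES"]["er"]     # only the last "er" entry is left
repeated = duplicate_keys(SOURCE)
```

A check like this could run as a unit test over the language data, since the interpreter itself raises no error for repeated keys.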
|
47
spacy/sv/stop_words.py
Normal file
|
@ -0,0 +1,47 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
STOP_WORDS = set("""
|
||||
aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras annan annat ännu artonde arton åtminstone att åtta åttio åttionde åttonde av även
|
||||
|
||||
båda bådas bakom bara bäst bättre behöva behövas behövde behövt beslut beslutat beslutit bland blev bli blir blivit bort borta bra
|
||||
|
||||
då dag dagar dagarna dagen där därför de del delen dem den deras dess det detta dig din dina dit ditt dock du
|
||||
|
||||
efter eftersom elfte eller elva en enkel enkelt enkla enligt er era ert ett ettusen
|
||||
|
||||
få fanns får fått fem femte femtio femtionde femton femtonde fick fin finnas finns fjärde fjorton fjortonde fler flera flesta följande för före förlåt förra första fram framför från fyra fyrtio fyrtionde
|
||||
|
||||
gå gälla gäller gällt går gärna gått genast genom gick gjorde gjort god goda godare godast gör göra gott
|
||||
|
||||
ha hade haft han hans har här heller hellre helst helt henne hennes hit hög höger högre högst hon honom hundra hundraen hundraett hur
|
||||
|
||||
i ibland idag igår igen imorgon in inför inga ingen ingenting inget innan inne inom inte inuti
|
||||
|
||||
ja jag jämfört
|
||||
|
||||
kan kanske knappast kom komma kommer kommit kr kunde kunna kunnat kvar
|
||||
|
||||
länge längre långsam långsammare långsammast långsamt längst långt lätt lättare lättast legat ligga ligger lika likställd likställda lilla lite liten litet
|
||||
|
||||
man många måste med mellan men mer mera mest mig min mina mindre minst mitt mittemot möjlig möjligen möjligt möjligtvis mot mycket
|
||||
|
||||
någon någonting något några när nästa ned nederst nedersta nedre nej ner ni nio nionde nittio nittionde nitton nittonde nödvändig nödvändiga nödvändigt nödvändigtvis nog noll nr nu nummer
|
||||
|
||||
och också ofta oftast olika olikt om oss
|
||||
|
||||
över övermorgon överst övre
|
||||
|
||||
på
|
||||
|
||||
rakt rätt redan
|
||||
|
||||
så sade säga säger sagt samma sämre sämst sedan senare senast sent sex sextio sextionde sexton sextonde sig sin sina sist sista siste sitt sjätte sju sjunde sjuttio sjuttionde sjutton sjuttonde ska skall skulle slutligen små smått snart som stor stora större störst stort
|
||||
|
||||
tack tidig tidigare tidigast tidigt till tills tillsammans tio tionde tjugo tjugoen tjugoett tjugonde tjugotre tjugotvå tolfte tolv tre tredje trettio trettionde tretton trettonde två tvåhundra
|
||||
|
||||
under upp ur ursäkt ut utan utanför ute
|
||||
|
||||
vad vänster vänstra var vår vara våra varför varifrån varit varken värre varsågod vart vårt vem vems verkligen vi vid vidare viktig viktigare viktigast viktigt vilka vilken vilket vill
|
||||
""".split())
|
58
spacy/sv/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,58 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
|
||||
}
|
||||
|
||||
|
||||
ORTH_ONLY = [
|
||||
"ang.",
|
||||
"anm.",
|
||||
"bil.",
|
||||
"bl.a.",
|
||||
"dvs.",
|
||||
"e.Kr.",
|
||||
"el.",
|
||||
"e.d.",
|
||||
"eng.",
|
||||
"etc.",
|
||||
"exkl.",
|
||||
"f.d.",
|
||||
"fid.",
|
||||
"f.Kr.",
|
||||
"forts.",
|
||||
"fr.o.m.",
|
||||
"f.ö.",
|
||||
"förf.",
|
||||
"inkl.",
|
||||
"jur.",
|
||||
"kl.",
|
||||
"kr.",
|
||||
"lat.",
|
||||
"m.a.o.",
|
||||
"max.",
|
||||
"m.fl.",
|
||||
"min.",
|
||||
"m.m.",
|
||||
"obs.",
|
||||
"o.d.",
|
||||
"osv.",
|
||||
"p.g.a.",
|
||||
"ref.",
|
||||
"resp.",
|
||||
"s.",
|
||||
"s.a.s.",
|
||||
"s.k.",
|
||||
"st.",
|
||||
"s:t",
|
||||
"t.ex.",
|
||||
"t.o.m.",
|
||||
"ung.",
|
||||
"äv.",
|
||||
"övers."
|
||||
]
|
0
spacy/tests/hu/__init__.py
Normal file
0
spacy/tests/hu/tokenizer/__init__.py
Normal file
233
spacy/tests/hu/tokenizer/test_tokenizer.py
Normal file
|
@ -0,0 +1,233 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.hu import Hungarian
|
||||
|
||||
_DEFAULT_TESTS = [('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
|
||||
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
|
||||
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
|
||||
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
|
||||
('A .hu.', ['A', '.hu', '.']),
|
||||
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
|
||||
('A pl.', ['A', 'pl.']),
|
||||
('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']),
|
||||
('Egy..ket.', ['Egy', '..', 'ket', '.']),
|
||||
('Valami... van.', ['Valami', '...', 'van', '.']),
|
||||
('Valami ...van...', ['Valami', '...', 'van', '...']),
|
||||
('Valami...', ['Valami', '...']),
|
||||
('Valami ...', ['Valami', '...']),
|
||||
('Valami ... más.', ['Valami', '...', 'más', '.'])]
|
||||
|
||||
_HYPHEN_TESTS = [
|
||||
('Egy -nak, -jaiért, -magyar, bel- van.', ['Egy', '-nak', ',', '-jaiért', ',', '-magyar', ',', 'bel-', 'van', '.']),
|
||||
('Egy -nak.', ['Egy', '-nak', '.']),
|
||||
('Egy bel-.', ['Egy', 'bel-', '.']),
|
||||
('Dinnye-domb-.', ['Dinnye-domb-', '.']),
|
||||
('Ezen -e elcsatangolt.', ['Ezen', '-e', 'elcsatangolt', '.']),
|
||||
('Lakik-e', ['Lakik', '-e']),
|
||||
('Lakik-e?', ['Lakik', '-e', '?']),
|
||||
('Lakik-e.', ['Lakik', '-e', '.']),
|
||||
('Lakik-e...', ['Lakik', '-e', '...']),
|
||||
('Lakik-e... van.', ['Lakik', '-e', '...', 'van', '.']),
|
||||
('Lakik-e van?', ['Lakik', '-e', 'van', '?']),
|
||||
('Lakik-elem van?', ['Lakik-elem', 'van', '?']),
|
||||
('Van lakik-elem.', ['Van', 'lakik-elem', '.']),
|
||||
('A 7-es busz?', ['A', '7-es', 'busz', '?']),
|
||||
('A 7-es?', ['A', '7-es', '?']),
|
||||
('A 7-es.', ['A', '7-es', '.']),
|
||||
('Ez (lakik)-e?', ['Ez', '(', 'lakik', ')', '-e', '?']),
|
||||
('A %-sal.', ['A', '%-sal', '.']),
|
||||
('A CD-ROM-okrol.', ['A', 'CD-ROM-okrol', '.'])]
|
||||
|
||||
_NUMBER_TESTS = [('A 2b van.', ['A', '2b', 'van', '.']),
|
||||
('A 2b-ben van.', ['A', '2b-ben', 'van', '.']),
|
||||
('A 2b.', ['A', '2b', '.']),
|
||||
('A 2b-ben.', ['A', '2b-ben', '.']),
|
||||
('A 3.b van.', ['A', '3.b', 'van', '.']),
|
||||
('A 3.b-ben van.', ['A', '3.b-ben', 'van', '.']),
|
||||
('A 3.b.', ['A', '3.b', '.']),
|
||||
('A 3.b-ben.', ['A', '3.b-ben', '.']),
|
||||
('A 1:20:36.7 van.', ['A', '1:20:36.7', 'van', '.']),
|
||||
('A 1:20:36.7-ben van.', ['A', '1:20:36.7-ben', 'van', '.']),
|
||||
('A 1:20:36.7-ben.', ['A', '1:20:36.7-ben', '.']),
|
||||
('A 1:35 van.', ['A', '1:35', 'van', '.']),
|
||||
('A 1:35-ben van.', ['A', '1:35-ben', 'van', '.']),
|
||||
('A 1:35-ben.', ['A', '1:35-ben', '.']),
|
||||
('A 1.35 van.', ['A', '1.35', 'van', '.']),
|
||||
('A 1.35-ben van.', ['A', '1.35-ben', 'van', '.']),
|
||||
('A 1.35-ben.', ['A', '1.35-ben', '.']),
|
||||
('A 4:01,95 van.', ['A', '4:01,95', 'van', '.']),
|
||||
('A 4:01,95-ben van.', ['A', '4:01,95-ben', 'van', '.']),
|
||||
('A 4:01,95-ben.', ['A', '4:01,95-ben', '.']),
|
||||
('A 10--12 van.', ['A', '10--12', 'van', '.']),
|
||||
('A 10--12-ben van.', ['A', '10--12-ben', 'van', '.']),
|
||||
('A 10--12-ben.', ['A', '10--12-ben', '.']),
|
||||
('A 10‐12 van.', ['A', '10‐12', 'van', '.']),
|
||||
('A 10‐12-ben van.', ['A', '10‐12-ben', 'van', '.']),
|
||||
('A 10‐12-ben.', ['A', '10‐12-ben', '.']),
|
||||
('A 10‑12 van.', ['A', '10‑12', 'van', '.']),
|
||||
('A 10‑12-ben van.', ['A', '10‑12-ben', 'van', '.']),
|
||||
('A 10‑12-ben.', ['A', '10‑12-ben', '.']),
|
||||
('A 10‒12 van.', ['A', '10‒12', 'van', '.']),
|
||||
('A 10‒12-ben van.', ['A', '10‒12-ben', 'van', '.']),
|
||||
('A 10‒12-ben.', ['A', '10‒12-ben', '.']),
|
||||
('A 10–12 van.', ['A', '10–12', 'van', '.']),
|
||||
('A 10–12-ben van.', ['A', '10–12-ben', 'van', '.']),
|
||||
('A 10–12-ben.', ['A', '10–12-ben', '.']),
|
||||
('A 10—12 van.', ['A', '10—12', 'van', '.']),
|
||||
('A 10—12-ben van.', ['A', '10—12-ben', 'van', '.']),
|
||||
('A 10—12-ben.', ['A', '10—12-ben', '.']),
|
||||
('A 10―12 van.', ['A', '10―12', 'van', '.']),
|
||||
('A 10―12-ben van.', ['A', '10―12-ben', 'van', '.']),
|
||||
('A 10―12-ben.', ['A', '10―12-ben', '.']),
|
||||
('A -23,12 van.', ['A', '-23,12', 'van', '.']),
|
||||
('A -23,12-ben van.', ['A', '-23,12-ben', 'van', '.']),
|
||||
('A -23,12-ben.', ['A', '-23,12-ben', '.']),
|
||||
('A 2+3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2 +3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2+ 3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2 + 3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2*3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2 *3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2* 3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2 * 3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A C++ van.', ['A', 'C++', 'van', '.']),
|
||||
('A C++-ben van.', ['A', 'C++-ben', 'van', '.']),
|
||||
('A C++.', ['A', 'C++', '.']),
|
||||
('A C++-ben.', ['A', 'C++-ben', '.']),
|
||||
('A 2003. I. 06. van.', ['A', '2003.', 'I.', '06.', 'van', '.']),
|
||||
('A 2003. I. 06-ben van.', ['A', '2003.', 'I.', '06-ben', 'van', '.']),
|
||||
('A 2003. I. 06.', ['A', '2003.', 'I.', '06.']),
|
||||
('A 2003. I. 06-ben.', ['A', '2003.', 'I.', '06-ben', '.']),
|
||||
('A 2003. 01. 06. van.', ['A', '2003.', '01.', '06.', 'van', '.']),
|
||||
('A 2003. 01. 06-ben van.', ['A', '2003.', '01.', '06-ben', 'van', '.']),
|
||||
('A 2003. 01. 06.', ['A', '2003.', '01.', '06.']),
|
||||
('A 2003. 01. 06-ben.', ['A', '2003.', '01.', '06-ben', '.']),
|
||||
('A IV. 12. van.', ['A', 'IV.', '12.', 'van', '.']),
|
||||
('A IV. 12-ben van.', ['A', 'IV.', '12-ben', 'van', '.']),
|
||||
('A IV. 12.', ['A', 'IV.', '12.']),
|
||||
('A IV. 12-ben.', ['A', 'IV.', '12-ben', '.']),
|
||||
('A 2003.01.06. van.', ['A', '2003.01.06.', 'van', '.']),
|
||||
('A 2003.01.06-ben van.', ['A', '2003.01.06-ben', 'van', '.']),
|
||||
('A 2003.01.06.', ['A', '2003.01.06.']),
|
||||
('A 2003.01.06-ben.', ['A', '2003.01.06-ben', '.']),
|
||||
('A IV.12. van.', ['A', 'IV.12.', 'van', '.']),
|
||||
('A IV.12-ben van.', ['A', 'IV.12-ben', 'van', '.']),
|
||||
('A IV.12.', ['A', 'IV.12.']),
|
||||
('A IV.12-ben.', ['A', 'IV.12-ben', '.']),
|
||||
('A 1.1.2. van.', ['A', '1.1.2.', 'van', '.']),
|
||||
('A 1.1.2-ben van.', ['A', '1.1.2-ben', 'van', '.']),
|
||||
('A 1.1.2.', ['A', '1.1.2.']),
|
||||
('A 1.1.2-ben.', ['A', '1.1.2-ben', '.']),
|
||||
('A 1,5--2,5 van.', ['A', '1,5--2,5', 'van', '.']),
|
||||
('A 1,5--2,5-ben van.', ['A', '1,5--2,5-ben', 'van', '.']),
|
||||
('A 1,5--2,5-ben.', ['A', '1,5--2,5-ben', '.']),
|
||||
('A 3,14 van.', ['A', '3,14', 'van', '.']),
|
||||
('A 3,14-ben van.', ['A', '3,14-ben', 'van', '.']),
|
||||
('A 3,14-ben.', ['A', '3,14-ben', '.']),
|
||||
('A 3.14 van.', ['A', '3.14', 'van', '.']),
|
||||
('A 3.14-ben van.', ['A', '3.14-ben', 'van', '.']),
|
||||
('A 3.14-ben.', ['A', '3.14-ben', '.']),
|
||||
('A 15. van.', ['A', '15.', 'van', '.']),
|
||||
('A 15-ben van.', ['A', '15-ben', 'van', '.']),
|
||||
('A 15-ben.', ['A', '15-ben', '.']),
|
||||
('A 15.-ben van.', ['A', '15.-ben', 'van', '.']),
|
||||
('A 15.-ben.', ['A', '15.-ben', '.']),
|
||||
('A 2002--2003. van.', ['A', '2002--2003.', 'van', '.']),
|
||||
('A 2002--2003-ben van.', ['A', '2002--2003-ben', 'van', '.']),
|
||||
('A 2002--2003-ben.', ['A', '2002--2003-ben', '.']),
|
||||
('A -0,99% van.', ['A', '-0,99%', 'van', '.']),
|
||||
('A -0,99%-ben van.', ['A', '-0,99%-ben', 'van', '.']),
|
||||
('A -0,99%.', ['A', '-0,99%', '.']),
|
||||
('A -0,99%-ben.', ['A', '-0,99%-ben', '.']),
|
||||
('A 10--20% van.', ['A', '10--20%', 'van', '.']),
|
||||
('A 10--20%-ben van.', ['A', '10--20%-ben', 'van', '.']),
|
||||
('A 10--20%.', ['A', '10--20%', '.']),
|
||||
('A 10--20%-ben.', ['A', '10--20%-ben', '.']),
|
||||
('A 99§ van.', ['A', '99§', 'van', '.']),
|
||||
('A 99§-ben van.', ['A', '99§-ben', 'van', '.']),
|
||||
('A 99§-ben.', ['A', '99§-ben', '.']),
|
||||
('A 10--20§ van.', ['A', '10--20§', 'van', '.']),
|
||||
('A 10--20§-ben van.', ['A', '10--20§-ben', 'van', '.']),
|
||||
('A 10--20§-ben.', ['A', '10--20§-ben', '.']),
|
||||
('A 99° van.', ['A', '99°', 'van', '.']),
|
||||
('A 99°-ben van.', ['A', '99°-ben', 'van', '.']),
|
||||
('A 99°-ben.', ['A', '99°-ben', '.']),
|
||||
('A 10--20° van.', ['A', '10--20°', 'van', '.']),
|
||||
('A 10--20°-ben van.', ['A', '10--20°-ben', 'van', '.']),
|
||||
('A 10--20°-ben.', ['A', '10--20°-ben', '.']),
|
||||
('A °C van.', ['A', '°C', 'van', '.']),
|
||||
('A °C-ben van.', ['A', '°C-ben', 'van', '.']),
|
||||
('A °C.', ['A', '°C', '.']),
|
||||
('A °C-ben.', ['A', '°C-ben', '.']),
|
||||
('A 100°C van.', ['A', '100°C', 'van', '.']),
|
||||
('A 100°C-ben van.', ['A', '100°C-ben', 'van', '.']),
|
||||
('A 100°C.', ['A', '100°C', '.']),
|
||||
('A 100°C-ben.', ['A', '100°C-ben', '.']),
|
||||
('A 800x600 van.', ['A', '800x600', 'van', '.']),
|
||||
('A 800x600-ben van.', ['A', '800x600-ben', 'van', '.']),
|
||||
('A 800x600-ben.', ['A', '800x600-ben', '.']),
|
||||
('A 1x2x3x4 van.', ['A', '1x2x3x4', 'van', '.']),
|
||||
('A 1x2x3x4-ben van.', ['A', '1x2x3x4-ben', 'van', '.']),
|
||||
('A 1x2x3x4-ben.', ['A', '1x2x3x4-ben', '.']),
|
||||
('A 5/J van.', ['A', '5/J', 'van', '.']),
|
||||
('A 5/J-ben van.', ['A', '5/J-ben', 'van', '.']),
|
||||
('A 5/J-ben.', ['A', '5/J-ben', '.']),
|
||||
('A 5/J. van.', ['A', '5/J.', 'van', '.']),
|
||||
('A 5/J.-ben van.', ['A', '5/J.-ben', 'van', '.']),
|
||||
('A 5/J.-ben.', ['A', '5/J.-ben', '.']),
|
||||
('A III/1 van.', ['A', 'III/1', 'van', '.']),
|
||||
('A III/1-ben van.', ['A', 'III/1-ben', 'van', '.']),
|
||||
('A III/1-ben.', ['A', 'III/1-ben', '.']),
|
||||
('A III/1. van.', ['A', 'III/1.', 'van', '.']),
|
||||
('A III/1.-ben van.', ['A', 'III/1.-ben', 'van', '.']),
|
||||
('A III/1.-ben.', ['A', 'III/1.-ben', '.']),
|
||||
('A III/c van.', ['A', 'III/c', 'van', '.']),
|
||||
('A III/c-ben van.', ['A', 'III/c-ben', 'van', '.']),
|
||||
('A III/c.', ['A', 'III/c', '.']),
|
||||
('A III/c-ben.', ['A', 'III/c-ben', '.']),
|
||||
('A TU–154 van.', ['A', 'TU–154', 'van', '.']),
|
||||
('A TU–154-ben van.', ['A', 'TU–154-ben', 'van', '.']),
|
||||
('A TU–154-ben.', ['A', 'TU–154-ben', '.'])]
|
||||
|
||||
_QUOTE_TESTS = [('Az "Ime, hat"-ban irja.', ['Az', '"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
|
||||
('"Ime, hat"-ban irja.', ['"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
|
||||
('Az "Ime, hat".', ['Az', '"', 'Ime', ',', 'hat', '"', '.']),
|
||||
('Egy 24"-os monitor.', ['Egy', '24', '"', '-os', 'monitor', '.']),
|
||||
("A don't van.", ['A', "don't", 'van', '.'])]
|
||||
|
||||
_DOT_TESTS = [('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
|
||||
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
|
||||
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
|
||||
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
|
||||
('A .hu.', ['A', '.hu', '.']),
|
||||
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
|
||||
('A pl.', ['A', 'pl.']),
|
||||
('A S.M.A.R.T.', ['A', 'S.M.A.R.T.']),
|
||||
('Egy..ket.', ['Egy', '..', 'ket', '.']),
|
||||
('Valami... van.', ['Valami', '...', 'van', '.']),
|
||||
('Valami ...van...', ['Valami', '...', 'van', '...']),
|
||||
('Valami...', ['Valami', '...']),
|
||||
('Valami ...', ['Valami', '...']),
|
||||
('Valami ... más.', ['Valami', '...', 'más', '.'])]
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def HU():
|
||||
return Hungarian()
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def hu_tokenizer(HU):
|
||||
return HU.tokenizer
|
||||
|
||||
|
||||
@pytest.mark.parametrize(("input", "expected_tokens"),
|
||||
_DEFAULT_TESTS + _HYPHEN_TESTS + _NUMBER_TESTS + _DOT_TESTS + _QUOTE_TESTS)
|
||||
def test_testcases(hu_tokenizer, input, expected_tokens):
|
||||
tokens = hu_tokenizer(input)
|
||||
token_list = [token.orth_ for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -53,7 +53,7 @@ cdef class Vocab:
|
|||
'''
|
||||
@classmethod
|
||||
def load(cls, path, lex_attr_getters=None, lemmatizer=True,
|
||||
tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs):
|
||||
tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs):
|
||||
"""
|
||||
Load the vocabulary from a path.
|
||||
|
||||
|
@ -96,6 +96,8 @@ cdef class Vocab:
|
|||
if serializer_freqs is True and (path / 'vocab' / 'serializer.json').exists():
|
||||
with (path / 'vocab' / 'serializer.json').open('r', encoding='utf8') as file_:
|
||||
serializer_freqs = json.load(file_)
|
||||
else:
|
||||
serializer_freqs = None
|
||||
|
||||
cdef Vocab self = cls(lex_attr_getters=lex_attr_getters, tag_map=tag_map,
|
||||
lemmatizer=lemmatizer, serializer_freqs=serializer_freqs)
|
||||
|
@ -124,7 +126,7 @@ cdef class Vocab:
|
|||
Vocab: The newly constructed vocab object.
|
||||
'''
|
||||
util.check_renamed_kwargs({'get_lex_attr': 'lex_attr_getters'}, deprecated_kwargs)
|
||||
|
||||
|
||||
lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
|
||||
tag_map = tag_map if tag_map is not None else {}
|
||||
if lemmatizer in (None, True, False):
|
||||
|
@ -149,10 +151,10 @@ cdef class Vocab:
|
|||
self.lex_attr_getters = lex_attr_getters
|
||||
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
|
||||
self.serializer_freqs = serializer_freqs
|
||||
|
||||
|
||||
self.length = 1
|
||||
self._serializer = None
|
||||
|
||||
|
||||
property serializer:
|
||||
# Having the serializer live here is super messy :(
|
||||
def __get__(self):
|
||||
|
@ -177,7 +179,7 @@ cdef class Vocab:
|
|||
vectors if necessary. The memory will be zeroed.
|
||||
|
||||
Arguments:
|
||||
new_size (int): The new size of the vectors.
|
||||
new_size (int): The new size of the vectors.
|
||||
'''
|
||||
cdef hash_t key
|
||||
cdef size_t addr
|
||||
|
@ -190,11 +192,11 @@ cdef class Vocab:
|
|||
|
||||
def add_flag(self, flag_getter, int flag_id=-1):
|
||||
'''Set a new boolean flag to words in the vocabulary.
|
||||
|
||||
|
||||
The flag_setter function will be called over the words currently in the
|
||||
vocab, and then applied to new words as they occur. You'll then be able
|
||||
to access the flag value on each token, using token.check_flag(flag_id).
|
||||
|
||||
|
||||
See also:
|
||||
Lexeme.set_flag, Lexeme.check_flag, Token.set_flag, Token.check_flag.
|
||||
|
||||
|
@ -204,7 +206,7 @@ cdef class Vocab:
|
|||
|
||||
flag_id (int):
|
||||
An integer between 1 and 63 (inclusive), specifying the bit at which the
|
||||
flag will be stored. If -1, the lowest available bit will be
|
||||
flag will be stored. If -1, the lowest available bit will be
|
||||
chosen.
|
||||
|
||||
Returns:
|
||||
|
@ -322,7 +324,7 @@ cdef class Vocab:
|
|||
Arguments:
|
||||
id_or_string (int or unicode):
|
||||
The integer ID of a word, or its unicode string.
|
||||
|
||||
|
||||
If an int >= Lexicon.size, IndexError is raised. If id_or_string
|
||||
is neither an int nor a unicode string, ValueError is raised.
|
||||
|
||||
|
@ -349,7 +351,7 @@ cdef class Vocab:
|
|||
for attr_id, value in props.items():
|
||||
Token.set_struct_attr(token, attr_id, value)
|
||||
return tokens
|
||||
|
||||
|
||||
def dump(self, loc):
|
||||
"""Save the lexemes binary data to the given location.
|
||||
|
||||
|
@ -443,7 +445,7 @@ cdef class Vocab:
|
|||
cdef int32_t word_len
|
||||
cdef bytes word_str
|
||||
cdef char* chars
|
||||
|
||||
|
||||
cdef Lexeme lexeme
|
||||
cdef CFile out_file = CFile(out_loc, 'wb')
|
||||
for lexeme in self:
|
||||
|
@ -460,7 +462,7 @@ cdef class Vocab:
|
|||
out_file.close()
|
||||
|
||||
def load_vectors(self, file_):
|
||||
"""Load vectors from a text-based file.
|
||||
"""Load vectors from a text-based file.
|
||||
|
||||
Arguments:
|
||||
file_ (buffer): The file to read from. Entries should be separated by newlines,
|
||||
|
|
|
@ -12,7 +12,7 @@ writing. You can read more about our approach in our blog post, ["Rebuilding a W
|
|||
```bash
|
||||
sudo npm install --global harp
|
||||
git clone https://github.com/explosion/spaCy
|
||||
cd website
|
||||
cd spaCy/website
|
||||
harp server
|
||||
```
|
||||
|
||||
|
|
|
@ -11,7 +11,7 @@ footer.o-footer.u-text.u-border-dotted
|
|||
|
||||
each url, item in group
|
||||
li
|
||||
+a(url)(target=url.includes("http") ? "_blank" : false)=item
|
||||
+a(url)=item
|
||||
|
||||
if SECTION != "docs"
|
||||
+grid-col("quarter")
|
||||
|
|
|
@ -20,7 +20,8 @@ mixin h(level, id)
|
|||
info: https://mathiasbynens.github.io/rel-noopener/
|
||||
|
||||
mixin a(url, trusted)
|
||||
a(href=url target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes)
|
||||
- external = url.includes("http")
|
||||
a(href=url target=external ? "_blank" : null rel=external && !trusted ? "noopener nofollow" : null)&attributes(attributes)
|
||||
block
|
||||
|
||||
|
||||
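The new `a` mixin above decides the `target` and `rel` attributes from the URL rather than hard-coding `target="_blank"`. The rule can be sketched in Python as follows. This is a minimal sketch, assuming a hypothetical helper named `link_attributes` that is not part of the site's code, and using `None` to stand for an attribute that is left out.

```python
def link_attributes(url, trusted=False):
    """Return the attributes the mixin would emit for a link (hypothetical helper)."""
    # Same test as the mixin: any URL containing "http" is treated as external.
    external = "http" in url
    return {
        "href": url,
        # Only external links open in a new tab.
        "target": "_blank" if external else None,
        # Only untrusted external links get the noopener/nofollow hint.
        "rel": "noopener nofollow" if external and not trusted else None,
    }


internal = link_attributes("/docs/api")
external = link_attributes("https://wordnet.princeton.edu")
trusted_external = link_attributes("https://explosion.ai", trusted=True)
```

With this rule, callers no longer need to pass `target` themselves, which is why the other hunks in this commit drop the explicit `(target=...)` arguments.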
|
@ -33,7 +34,7 @@ mixin src(url)
|
|||
+a(url)
|
||||
block
|
||||
|
||||
| #[+icon("code", 16).u-color-subtle]
|
||||
| #[+icon("code", 16).o-icon--inline.u-color-subtle]
|
||||
|
||||
|
||||
//- API link (with added tag and automatically generated path)
|
||||
|
@ -43,7 +44,7 @@ mixin api(path)
|
|||
+a("/docs/api/" + path, true)(target="_self").u-no-border.u-inline-block
|
||||
block
|
||||
|
||||
| #[+icon("book", 18).o-help-icon.u-color-subtle]
|
||||
| #[+icon("book", 18).o-icon--inline.u-help.u-color-subtle]
|
||||
|
||||
|
||||
//- Aside for text
|
||||
|
@ -74,7 +75,8 @@ mixin aside-code(label, language)
|
|||
see assets/css/_components/_buttons.sass
|
||||
|
||||
mixin button(url, trusted, ...style)
|
||||
a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target="_blank" rel=!trusted ? "noopener nofollow" : false)&attributes(attributes)
|
||||
- external = url.includes("http")
|
||||
a.c-button.u-text-label(href=url class=prefixArgs(style, "c-button") role="button" target=external ? "_blank" : null rel=external && !trusted ? "noopener nofollow" : null)&attributes(attributes)
|
||||
block
|
||||
|
||||
|
||||
|
@ -148,7 +150,7 @@ mixin tag()
|
|||
|
||||
mixin list(type, start)
|
||||
if type
|
||||
ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : false)&attributes(attributes)
|
||||
ol.c-list.o-block.u-text(class="c-list--#{type}" style=(start === 0 || start) ? "counter-reset: li #{(start - 1)}" : null)&attributes(attributes)
|
||||
block
|
||||
|
||||
else
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
include _mixins
|
||||
|
||||
nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : false)
|
||||
nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : null)
|
||||
a(href='/') #[+logo]
|
||||
|
||||
if SUBSECTION != "index"
|
||||
|
@ -11,7 +11,7 @@ nav.c-nav.u-text.js-nav(class=landing ? "c-nav--theme" : false)
|
|||
ul.c-nav__menu
|
||||
each url, item in NAVIGATION
|
||||
li.c-nav__menu__item
|
||||
a(href=url target=url.includes("http") ? "_blank" : false)=item
|
||||
+a(url)=item
|
||||
|
||||
li.c-nav__menu__item
|
||||
+a(gh("spaCy"))(aria-label="GitHub").u-hidden-xs #[+icon("github", 20)]
|
||||
|
|
|
@ -18,7 +18,7 @@ main.o-main.o-main--sidebar.o-main--aside
|
|||
- data = public.docs[SUBSECTION]._data[next]
|
||||
|
||||
.o-inline-list
|
||||
span #[strong.u-text-label Read next:] #[a(href=next).u-link=data.title]
|
||||
span #[strong.u-text-label Read next:] #[+a(next).u-link=data.title]
|
||||
|
||||
+grid-col("half").u-text-right
|
||||
.o-inline-list
|
||||
|
|
|
@ -9,5 +9,5 @@ menu.c-sidebar.js-sidebar.u-text
|
|||
li.u-text-label.u-color-subtle=menu
|
||||
|
||||
each url, item in items
|
||||
li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? "is-active" : false)
|
||||
+a(url)(target=url.includes("http") ? "_blank" : false)=item
|
||||
li(class=(CURRENT == url || (CURRENT == "index" && url == "./")) ? "is-active" : null)
|
||||
+a(url)=item
|
||||
|
|
|
@ -67,9 +67,8 @@
|
|||
.o-icon
|
||||
vertical-align: middle
|
||||
|
||||
.o-help-icon
|
||||
cursor: help
|
||||
margin: 0 0.5rem 0 0.25rem
|
||||
&.o-icon--inline
|
||||
margin: 0 0.5rem 0 0.25rem
|
||||
|
||||
|
||||
//- Inline List
|
||||
|
|
|
@ -141,6 +141,12 @@
|
|||
background: $pattern
|
||||
|
||||
|
||||
//- Cursors
|
||||
|
||||
.u-help
|
||||
cursor: help
|
||||
|
||||
|
||||
//- Hidden elements
|
||||
|
||||
.u-hidden
|
||||
|
|
|
@ -50,6 +50,13 @@ p A "lemma" is the uninflected form of a word. In English, this means:
|
|||
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
|
||||
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
|
||||
|
||||
+aside("About spaCy's custom pronoun lemma")
|
||||
| Unlike verbs and common nouns, there's no clear base form of a personal
|
||||
| pronoun. Should the lemma of "me" be "I", or should we normalize person
|
||||
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
|
||||
| novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
|
||||
| all personal pronouns.
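The idea in the aside above can be illustrated with a toy lookup. This is only a sketch under stated assumptions, not spaCy's lemmatizer: the pronoun set and the lemma table are small hypothetical samples, and `toy_lemma` is an invented name.

```python
PRON_LEMMA = "-PRON-"

# Illustrative subset of English personal pronouns (an assumption, not spaCy's list).
PERSONAL_PRONOUNS = {
    "i", "me", "you", "he", "him", "she", "her", "it", "we", "us", "they", "them",
}

# Toy lookup table standing in for real lemmatization data.
LEMMA_TABLE = {"dogs": "dog", "children": "child", "wrote": "write", "written": "write"}


def toy_lemma(word):
    """Map personal pronouns to one shared symbol, other words to a base form."""
    lowered = word.lower()
    if lowered in PERSONAL_PRONOUNS:
        return PRON_LEMMA
    # Fall back to the lowercased word when the table has no entry.
    return LEMMA_TABLE.get(lowered, lowered)
```

Using one shared symbol avoids having to choose between "I", "it" or "he" as the base form of "me".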
|
||||
|
||||
p
|
||||
| The lemmatization data is taken from
|
||||
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
|
||||
|
@ -58,11 +65,16 @@ p
|
|||
|
||||
+h(2, "dependency-parsing") Syntactic Dependency Parsing
|
||||
|
||||
p
|
||||
| The parser is trained on data produced by the
|
||||
| #[+a("http://www.clearnlp.com") ClearNLP] converter. Details of the
|
||||
| annotation scheme can be found
|
||||
| #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].
|
||||
+table(["Language", "Converter", "Scheme"])
|
||||
+row
|
||||
+cell English
|
||||
+cell #[+a("http://www.clearnlp.com") ClearNLP]
|
||||
+cell #[+a("http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") CLEAR Style]
|
||||
|
||||
+row
|
||||
+cell German
|
||||
+cell #[+a("https://github.com/wbwseeker/tiger2dep") TIGER]
|
||||
+cell #[+a("http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html") TIGER]
|
||||
|
||||
+h(2, "named-entities") Named Entity Recognition
|
||||
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p You can download data packs that add the following capabilities to spaCy.
|
||||
p spaCy currently supports the following languages and capabilities:
|
||||
|
||||
+aside-code("Download language models", "bash").
|
||||
python -m spacy.en.download all
|
||||
|
@ -24,8 +24,27 @@ p You can download data packs that add the following capabilities to spaCy.
|
|||
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
+row
|
||||
+cell Spanish #[code es]
|
||||
each icon in [ "pro", "con", "con", "con", "con", "con", "con", "con" ]
|
||||
+cell.u-text-center #[+procon(icon)]
|
||||
|
||||
p
|
||||
| Chinese tokenization requires the
|
||||
| #[+a("https://github.com/fxsjy/jieba") Jieba] library. Statistical
|
||||
| models are coming soon. Tokenizers for Spanish, French, Italian and
|
||||
| Portuguese are now under development.
|
||||
| models are coming soon.
|
||||
|
||||
|
||||
|
||||
+h(2, "alpha-support") Alpha support
|
||||
|
||||
p
|
||||
| Work has started on the following languages. You can help by improving
|
||||
| the existing language data and extending the tokenization patterns.
|
||||
|
||||
+table([ "Language", "Source" ])
|
||||
each language, code in { it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish" }
|
||||
+row
|
||||
+cell #{language} #[code=code]
|
||||
+cell
|
||||
+src(gh("spaCy", "spacy/" + code)) spacy/#{code}
|
||||
|
|
|
@ -2,7 +2,8 @@
|
|||
"sidebar": {
|
||||
"Get started": {
|
||||
"Installation": "./",
|
||||
"Lightning tour": "lightning-tour"
|
||||
"Lightning tour": "lightning-tour",
|
||||
"Resources": "resources"
|
||||
},
|
||||
"Workflows": {
|
||||
"Loading the pipeline": "language-processing-pipeline",
|
||||
|
@ -31,7 +32,12 @@
|
|||
},
|
||||
|
||||
"lightning-tour": {
|
||||
"title": "Lightning tour"
|
||||
"title": "Lightning tour",
|
||||
"next": "resources"
|
||||
},
|
||||
|
||||
"resources": {
|
||||
"title": "Resources"
|
||||
},
|
||||
|
||||
"language-processing-pipeline": {
|
||||
|
|
|
@ -23,7 +23,7 @@ p
|
|||
|
||||
+item
|
||||
| #[strong Build the vocabulary] including
|
||||
| #[a(href="#word-probabilities") word probabilities],
|
||||
| #[a(href="#word-frequencies") word frequencies],
|
||||
| #[a(href="#brown-clusters") Brown clusters] and
|
||||
| #[a(href="#word-vectors") word vectors].
|
||||
|
||||
|
@ -245,6 +245,12 @@ p
|
|||
+cell
|
||||
| Special value for pronoun lemmas (#[code "-PRON-"]).
|
||||
|
||||
+row
|
||||
+cell #[code DET_LEMMA]
|
||||
+cell
|
||||
| Special value for determiner lemmas, used in languages with
|
||||
| inflected determiners (#[code "-DET-"]).
|
||||
|
||||
+row
|
||||
+cell #[code ENT_ID]
|
||||
+cell
|
||||
|
@ -392,7 +398,7 @@ p
|
|||
| vectors files, you can use the
|
||||
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
||||
| script from our
|
||||
| #[+a(gh("spacy-developer-resources")) developer resources] to create a
|
||||
| #[+a(gh("spacy-dev-resources")) developer resources] to create a
|
||||
| spaCy data directory:
|
||||
|
||||
+code(false, "bash").
|
||||
|
@ -424,16 +430,22 @@ p
|
|||
+h(3, "word-frequencies") Word frequencies
|
||||
|
||||
p
|
||||
| The #[code init.py] script expects a tab-separated word frequencies file
|
||||
| with three columns: the number of times the word occurred in your language
|
||||
| sample, the number of distinct documents the word occurred in, and the
|
||||
| word itself. You should make sure you use the spaCy tokenizer for your
|
||||
| The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
||||
| script expects a tab-separated word frequencies file with three columns:
|
||||
|
||||
+list("numbers")
|
||||
+item The number of times the word occurred in your language sample.
|
||||
+item The number of distinct documents the word occurred in.
|
||||
+item The word itself.
|
||||
|
||||
p
|
||||
| You should make sure you use the spaCy tokenizer for your
|
||||
| language to segment the text for your word frequencies. This will ensure
|
||||
| that the frequencies refer to the same segmentation standards you'll be
|
||||
| using at run-time. For instance, spaCy's English tokenizer segments "can't"
|
||||
| into two tokens. If we segmented the text by whitespace to produce the
|
||||
| frequency counts, we'll have incorrect frequency counts for the tokens
|
||||
| "ca" and "n't".
|
||||
| using at run-time. For instance, spaCy's English tokenizer segments
|
||||
| "can't" into two tokens. If we segmented the text by whitespace to
|
||||
| produce the frequency counts, we'll have incorrect frequency counts for
|
||||
| the tokens "ca" and "n't".
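The three-column layout described above could be produced roughly as follows. This is a minimal sketch: `tokenize` is a whitespace placeholder that should be replaced by the spaCy tokenizer for your language, and `write_frequencies` is a hypothetical helper, not part of `init.py`.

```python
import io
from collections import Counter


def tokenize(text):
    # Placeholder only: swap in the spaCy tokenizer for your language so the
    # counts match the segmentation used at run-time.
    return text.split()


def write_frequencies(docs, out_file):
    """Write one 'count<TAB>doc_count<TAB>word' line per word."""
    word_counts = Counter()
    doc_counts = Counter()
    for text in docs:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # Count each word at most once per document.
        doc_counts.update(set(tokens))
    for word, count in word_counts.most_common():
        out_file.write("{}\t{}\t{}\n".format(count, doc_counts[word], word))


docs = ["the cat sat", "the dog sat on the mat"]
buffer = io.StringIO()
write_frequencies(docs, buffer)
lines = buffer.getvalue().splitlines()
```

Each entry in `lines` holds the total count, the document count and the word, separated by tabs.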
|
||||
|
||||
+h(3, "brown-clusters") Training the Brown clusters
|
||||
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
include ../../_includes/_mixins
|
||||
|
||||
p
|
||||
| The following examples code snippets give you an overview of spaCy's
|
||||
| The following examples and code snippets give you an overview of spaCy's
|
||||
| functionality and its usage.
|
||||
|
||||
+h(2, "examples-resources") Load resources and process text
|
||||
|
|
118
website/docs/usage/resources.jade
Normal file
|
@ -0,0 +1,118 @@
|
|||
//- 💫 DOCS > USAGE > RESOURCES
|
||||
|
||||
include ../../_includes/_mixins
|
||||
|
||||
p Many of the associated tools and resources that we're developing alongside spaCy can be found in their own repositories.
|
||||
|
||||
+h(2, "developer") Developer tools
|
||||
|
||||
+table(["Name", "Description"])
|
||||
+row
|
||||
+cell
|
||||
+src(gh("spacy-dev-resources")) spaCy Dev Resources
|
||||
|
||||
+cell
|
||||
| Scripts, tools and resources for developing spaCy, adding new
|
||||
| languages and training new models.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src("spacy-benchmarks") spaCy Benchmarks
|
||||
|
||||
+cell
|
||||
| Runtime performance comparison of spaCy against other NLP
|
||||
| libraries.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("spacy-services")) spaCy Services
|
||||
|
||||
+cell
|
||||
| REST microservices for spaCy demos and visualisers.
|
||||
|
||||
+h(2, "libraries") Libraries and projects
|
||||
+table(["Name", "Description"])
|
||||
+row
|
||||
+cell
|
||||
+src(gh("sense2vec")) sense2vec
|
||||
|
||||
+cell
|
||||
| Use spaCy to go beyond vanilla
|
||||
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec].
|
||||
|
||||
+h(2, "utility") Utility libraries and dependencies
|
||||
|
||||
+table(["Name", "Description"])
|
||||
+row
|
||||
+cell
|
||||
+src(gh("thinc")) Thinc
|
||||
|
||||
+cell
|
||||
| Super sparse multi-class machine learning with Cython.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("sputnik")) Sputnik
|
||||
|
||||
+cell
|
||||
| Data package manager library for spaCy.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("sputnik-server")) Sputnik Server
|
||||
|
||||
+cell
|
||||
| Index service for the Sputnik data package manager for spaCy.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("cymem")) Cymem
|
||||
|
||||
+cell
|
||||
| Gate Cython calls to malloc/free behind Python ref-counted
|
||||
| objects.
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("preshed")) Preshed
|
||||
|
||||
+cell
|
||||
| Cython hash tables that assume keys are pre-hashed
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("murmurhash")) MurmurHash
|
||||
|
||||
+cell
|
||||
| Cython bindings for
|
||||
| #[+a("https://en.wikipedia.org/wiki/MurmurHash") MurmurHash2].
|
||||
|
||||
+h(2, "visualizers") Visualisers and demos
|
||||
|
||||
+table(["Name", "Description"])
|
||||
+row
|
||||
+cell
|
||||
+src(gh("displacy")) displaCy.js
|
||||
|
||||
+cell
|
||||
| A lightweight dependency visualisation library for the modern
|
||||
| web, built with JavaScript, CSS and SVG.
|
||||
| #[+a(DEMOS_URL + "/displacy") Demo here].
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("displacy-ent")) displaCy#[sup ENT]
|
||||
|
||||
+cell
|
||||
| A lightweight and modern named entity visualisation library
|
||||
| built with JavaScript and CSS.
|
||||
| #[+a(DEMOS_URL + "/displacy-ent") Demo here].
|
||||
|
||||
+row
|
||||
+cell
|
||||
+src(gh("sense2vec-demo")) sense2vec Demo
|
||||
|
||||
+cell
|
||||
| Source of our Semantic Analysis of the Reddit Hivemind
|
||||
| #[+a(DEMOS_URL + "/sense2vec") demo] using
|
||||
| #[+a(gh("sense2vec")) sense2vec].
|
|
@ -13,14 +13,17 @@ p
|
|||
|
||||
+code.
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.pipeline import Tagger
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
|
||||
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
|
||||
tagger = Tagger(vocab)
|
||||
|
||||
doc = Doc(vocab, words=['I', 'like', 'stuff'])
|
||||
tagger.update(doc, ['N', 'V', 'N'])
|
||||
gold = GoldParse(doc, tags=['N', 'V', 'N'])
|
||||
tagger.update(doc, gold)
|
||||
|
||||
tagger.model.end_training()
|
||||
|
||||
|
|
|
@ -22,7 +22,7 @@ include _includes/_mixins
|
|||
| process entire web dumps, spaCy is the library you want to
|
||||
| be using.
|
||||
|
||||
+button("/docs/api", true, "primary")(target="_self")
|
||||
+button("/docs/api", true, "primary")
|
||||
| Facts & figures
|
||||
|
||||
+grid-col("third").o-card
|
||||
|
@ -35,7 +35,7 @@ include _includes/_mixins
|
|||
| think of spaCy as the Ruby on Rails of Natural Language
|
||||
| Processing.
|
||||
|
||||
+button("/docs/usage", true, "primary")(target="_self")
|
||||
+button("/docs/usage", true, "primary")
|
||||
| Get started
|
||||
|
||||
+grid-col("third").o-card
|
||||
|
@ -51,7 +51,7 @@ include _includes/_mixins
|
|||
| connect the statistical models trained by these libraries
|
||||
| to the rest of your application.
|
||||
|
||||
+button("/docs/usage/deep-learning", true, "primary")(target="_self")
|
||||
+button("/docs/usage/deep-learning", true, "primary")
|
||||
| Read more
|
||||
|
||||
.o-inline-list.o-block.u-border-bottom.u-text-small.u-text-center.u-padding-small
|
||||
|
@ -105,7 +105,7 @@ include _includes/_mixins
|
|||
+item Robust, rigorously evaluated accuracy
|
||||
|
||||
.o-inline-list
|
||||
+button("/docs/usage/lightning-tour", true, "secondary")(target="_self")
|
||||
+button("/docs/usage/lightning-tour", true, "secondary")
|
||||
| See examples
|
||||
|
||||
.o-block.u-text-center.u-padding
|
||||
|
@ -138,7 +138,7 @@ include _includes/_mixins
|
|||
| all others.
|
||||
|
||||
p
|
||||
| spaCy's #[a(href="/docs/api/philosophy") mission] is to make
|
||||
| spaCy's #[+a("/docs/api/philosophy") mission] is to make
|
||||
| cutting-edge NLP practical and commonly available. That's
|
||||
| why I left academia in 2014, to build a production-quality
|
||||
| open-source NLP library. It's why
|
||||
|
|