💫 Tidy up and auto-format .py files (#2983)

<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain the new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means); see the quick sketch below
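
For readers who don't scroll all the way down to the CONTRIBUTING diff, the two escape hatches mentioned above look roughly like this (a minimal, made-up sketch rather than code from this commit):

```python
# The import is only re-exported, so flake8's unused-import warning (F401)
# is silenced for this one line.
from collections import OrderedDict  # noqa: F401

# black won't reformat anything between these markers, so manually aligned
# language data and test fixtures can keep their layout.
# fmt: off
words = ["I", "like", "London", "and", "Berlin", "."]
heads = [ 1,   0,      -1,       -1,    -2,       -4]
# fmt: on
```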

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results.

At the moment, the code style and linting aren't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
Ines Montani 2018-11-30 17:03:03 +01:00 committed by GitHub
parent 852bc2ac16
commit eddeb36c96
268 changed files with 25626 additions and 17854 deletions

.flake8
View File

@ -1,4 +1,13 @@
[flake8]
ignore = E203, E266, E501, W503
ignore = E203, E266, E501, E731, W503
max-line-length = 80
select = B,C,E,F,W,T4,B9
exclude =
.env,
.git,
__pycache__,
lemmatizer.py,
lookup.py,
_tokenizer_exceptions_list.py,
spacy/lang/fr/lemmatizer,
spacy/lang/nb/lemmatizer
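
To make the ignore list a little more concrete: `E731` (newly added here) covers lambda assignments, while `E203` and `W503` cover constructs that `black` itself produces. A rough, hypothetical illustration (the code meanings come from pycodestyle, not from this diff):

```python
# E731: assignment of a lambda expression (flake8 would suggest a def)
get_text = lambda token: token.text

# W503: line break before a binary operator, which black's style prefers
total = (
    1
    + 2
    + 3
)

# E203: whitespace before ":", which black inserts in complex slices
items = list(range(10))
offset = 2
middle = items[offset + 1 : offset + 4]
```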

View File

@ -186,13 +186,99 @@ sure your test passes and reference the issue in your commit message.
## Code conventions
Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
Regular line length is **80 characters**, with some tolerance for lines up to
90 characters if the alternative would be worse — for instance, if your list
comprehension comes to 82 characters, it's better not to split it over two lines.
You can also use a linter like [`flake8`](https://pypi.python.org/pypi/flake8)
or [`frosted`](https://pypi.python.org/pypi/frosted); just keep in mind that
it won't work very well for `.pyx` files and will complain about Cython syntax
like `<int*>` or `cimport`.
As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.
**⚠️ Note that formatting and linting is currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:
```json
{
"python.formatting.provider": "black",
"[python]": {
"editor.formatOnSave": true
}
}
```
[See here](https://github.com/ambv/black#editor-integration) for the full
list of available editor integrations.
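
If you haven't used `black` before, its effect looks roughly like this (a hypothetical before/after, not an exhaustive description of its rules):

```python
# Before formatting:
#     x = { 'a':1 ,'b':2 }
#     def total(a,b = 1,*, c): return a+b+c
# After black has run: quotes are normalised, spacing is made consistent and
# the one-line function body is moved onto its own line.
x = {"a": 1, "b": 2}


def total(a, b=1, *, c):
    return a + b + c
```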
#### Disabling formatting
There are a few cases where auto-formatting doesn't improve readability, for
example in some of the language data files like the `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:
```python
# fmt: off
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
"nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
"poss", "nsubj", "ccomp", "punct"]
# fmt: on
```
### Code linting
[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
style. It scans one or more files and outputs errors and warnings. This feedback
can help you stick to general standards and conventions, and can be very useful
for spotting potential mistakes and inconsistencies in your code. The most
important things to watch out for are syntax errors and undefined names, but you
also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.
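
For instance, a module along these lines would trigger several of those warnings (a made-up snippet; the codes in the comments are flake8's, added for illustration):

```python
def build_entry():
    unused = "never read"  # F841: local variable assigned but never used
    return {
        "lemma": "be",
        "lemma": "been",  # pyflakes also reports the repeated, overwritten key
    }


def lookup(table):
    return table[key]  # F821: undefined name "key"
```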
The [`.flake8`](.flake8) config defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.
Ideally, running the following command from within the repo directory should
not return any errors or warnings:
```bash
flake8 spacy
```
#### Disabling linting
Sometimes, you explicitly want to write code that's not compatible with our
rules. For example, a module's `__init__.py` might import a function so other
modules can import it from there, but `flake8` will complain about an unused
import. And although it's generally discouraged, there might be cases where it
makes sense to use a bare `except`.
To ignore a given line, you can add a comment like `# noqa: F401`, specifying
the code of the error or warning you want to ignore. It's also possible to
ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
are some examples:
```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass # noqa: F401
try:
do_something()
except: # noqa: E722
# This bare except is justified, for some specific reason
do_something_else()
```
### Python conventions

View File

@ -35,41 +35,49 @@ import subprocess
import argparse
HASH_FILE = 'cythonize.json'
HASH_FILE = "cythonize.json"
def process_pyx(fromfile, tofile, language_level='-2'):
print('Processing %s' % fromfile)
def process_pyx(fromfile, tofile, language_level="-2"):
print("Processing %s" % fromfile)
try:
from Cython.Compiler.Version import version as cython_version
from distutils.version import LooseVersion
if LooseVersion(cython_version) < LooseVersion('0.19'):
raise Exception('Require Cython >= 0.19')
if LooseVersion(cython_version) < LooseVersion("0.19"):
raise Exception("Require Cython >= 0.19")
except ImportError:
pass
flags = ['--fast-fail', language_level]
if tofile.endswith('.cpp'):
flags += ['--cplus']
flags = ["--fast-fail", language_level]
if tofile.endswith(".cpp"):
flags += ["--cplus"]
try:
try:
r = subprocess.call(['cython'] + flags + ['-o', tofile, fromfile],
env=os.environ) # See Issue #791
r = subprocess.call(
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
) # See Issue #791
if r != 0:
raise Exception('Cython failed')
raise Exception("Cython failed")
except OSError:
# There are ways of installing Cython that don't result in a cython
# executable on the path, see gh-2397.
r = subprocess.call([sys.executable, '-c',
'import sys; from Cython.Compiler.Main import '
'setuptools_main as main; sys.exit(main())'] + flags +
['-o', tofile, fromfile])
r = subprocess.call(
[
sys.executable,
"-c",
"import sys; from Cython.Compiler.Main import "
"setuptools_main as main; sys.exit(main())",
]
+ flags
+ ["-o", tofile, fromfile]
)
if r != 0:
raise Exception('Cython failed')
raise Exception("Cython failed")
except OSError:
raise OSError('Cython needs to be installed')
raise OSError("Cython needs to be installed")
def preserve_cwd(path, func, *args):
@ -89,12 +97,12 @@ def load_hashes(filename):
def save_hashes(hash_db, filename):
with open(filename, 'w') as f:
with open(filename, "w") as f:
f.write(json.dumps(hash_db))
def get_hash(path):
return hashlib.md5(open(path, 'rb').read()).hexdigest()
return hashlib.md5(open(path, "rb").read()).hexdigest()
def hash_changed(base, path, db):
@ -109,25 +117,27 @@ def hash_add(base, path, db):
def process(base, filename, db):
root, ext = os.path.splitext(filename)
if ext in ['.pyx', '.cpp']:
if hash_changed(base, filename, db) or not os.path.isfile(os.path.join(base, root + '.cpp')):
preserve_cwd(base, process_pyx, root + '.pyx', root + '.cpp')
hash_add(base, root + '.cpp', db)
hash_add(base, root + '.pyx', db)
if ext in [".pyx", ".cpp"]:
if hash_changed(base, filename, db) or not os.path.isfile(
os.path.join(base, root + ".cpp")
):
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
hash_add(base, root + ".cpp", db)
hash_add(base, root + ".pyx", db)
def check_changes(root, db):
res = False
new_db = {}
setup_filename = 'setup.py'
hash_add('.', setup_filename, new_db)
if hash_changed('.', setup_filename, db):
setup_filename = "setup.py"
hash_add(".", setup_filename, new_db)
if hash_changed(".", setup_filename, db):
res = True
for base, _, files in os.walk(root):
for filename in files:
if filename.endswith('.pxd'):
if filename.endswith(".pxd"):
hash_add(base, filename, new_db)
if hash_changed(base, filename, db):
res = True
@ -150,8 +160,10 @@ def run(root):
save_hashes(db, HASH_FILE)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Cythonize pyx files into C++ files as needed')
parser.add_argument('root', help='root directory')
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Cythonize pyx files into C++ files as needed"
)
parser.add_argument("root", help="root directory")
args = parser.parse_args()
run(args.root)

View File

@ -15,12 +15,13 @@ _unset = object()
class Reddit(object):
"""Stream cleaned comments from Reddit."""
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
def __init__(self, file_path, meta_keys={'subreddit': 'section'}):
pre_format_re = re.compile(r"^[\`\*\~]")
post_format_re = re.compile(r"[\`\*\~]$")
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
def __init__(self, file_path, meta_keys={"subreddit": "section"}):
"""
file_path (unicode / Path): Path to archive or directory of archives.
meta_keys (dict): Meta data key included in the Reddit corpus, mapped
@ -45,28 +46,30 @@ class Reddit(object):
continue
comment = ujson.loads(line)
if self.is_valid(comment):
text = self.strip_tags(comment['body'])
yield {'text': text}
text = self.strip_tags(comment["body"])
yield {"text": text}
def get_meta(self, item):
return {name: item.get(key, 'n/a') for key, name in self.meta.items()}
return {name: item.get(key, "n/a") for key, name in self.meta.items()}
def iter_files(self):
for file_path in self.files:
yield file_path
def strip_tags(self, text):
text = self.link_re.sub(r'\1', text)
text = text.replace('&gt;', '>').replace('&lt;', '<')
text = self.pre_format_re.sub('', text)
text = self.post_format_re.sub('', text)
text = re.sub(r'\s+', ' ', text)
text = self.link_re.sub(r"\1", text)
text = text.replace("&gt;", ">").replace("&lt;", "<")
text = self.pre_format_re.sub("", text)
text = self.post_format_re.sub("", text)
text = re.sub(r"\s+", " ", text)
return text.strip()
def is_valid(self, comment):
return comment['body'] is not None \
and comment['body'] != '[deleted]' \
and comment['body'] != '[removed]'
return (
comment["body"] is not None
and comment["body"] != "[deleted]"
and comment["body"] != "[removed]"
)
def main(path):
@ -75,16 +78,18 @@ def main(path):
print(ujson.dumps(comment))
if __name__ == '__main__':
if __name__ == "__main__":
import socket
try:
BrokenPipeError
except NameError:
BrokenPipeError = socket.error
try:
plac.call(main)
except BrokenPipeError:
except BrokenPipeError:
import os, sys
# Python flushes standard streams on exit; redirect remaining output
# to devnull to avoid another BrokenPipeError at shutdown
devnull = os.open(os.devnull, os.O_WRONLY)

View File

@ -11,7 +11,10 @@ ujson>=1.35
dill>=0.2,<0.3
regex==2018.01.10
requests>=2.13.0,<3.0.0
pathlib==1.0.1; python_version < "3.4"
# Development dependencies
pytest>=4.0.0,<5.0.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
pathlib==1.0.1; python_version < "3.4"
black==18.9b0
flake8>=3.5.0,<3.6.0

View File

@ -14,8 +14,7 @@ from thinc.api import uniqued, wrap, noop
from thinc.api import with_square_sequences
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array
from thinc.neural._lsuv import svd_orthonormal
from thinc.neural.util import get_array_module
from thinc.neural.optimizers import Adam
from thinc import describe
@ -33,36 +32,36 @@ try:
except:
torch = None
VECTORS_KEY = 'spacy_pretrained_vectors'
VECTORS_KEY = "spacy_pretrained_vectors"
def cosine(vec1, vec2):
xp = get_array_module(vec1)
norm1 = xp.linalg.norm(vec1)
norm2 = xp.linalg.norm(vec2)
if norm1 == 0. or norm2 == 0.:
if norm1 == 0.0 or norm2 == 0.0:
return 0
else:
return vec1.dot(vec2) / (norm1 * norm2)
def create_default_optimizer(ops, **cfg):
learn_rate = util.env_opt('learn_rate', 0.001)
beta1 = util.env_opt('optimizer_B1', 0.8)
beta2 = util.env_opt('optimizer_B2', 0.8)
eps = util.env_opt('optimizer_eps', 0.00001)
L2 = util.env_opt('L2_penalty', 1e-6)
max_grad_norm = util.env_opt('grad_norm_clip', 5.)
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,
beta2=beta2, eps=eps)
learn_rate = util.env_opt("learn_rate", 0.001)
beta1 = util.env_opt("optimizer_B1", 0.8)
beta2 = util.env_opt("optimizer_B2", 0.8)
eps = util.env_opt("optimizer_eps", 0.00001)
L2 = util.env_opt("L2_penalty", 1e-6)
max_grad_norm = util.env_opt("grad_norm_clip", 5.0)
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, beta2=beta2, eps=eps)
optimizer.max_grad_norm = max_grad_norm
optimizer.device = ops.device
return optimizer
@layerize
def _flatten_add_lengths(seqs, pad=0, drop=0.):
def _flatten_add_lengths(seqs, pad=0, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=pad)
@ -74,14 +73,15 @@ def _flatten_add_lengths(seqs, pad=0, drop=0.):
def _zero_init(model):
def _zero_init_impl(self, X, y):
self.W.fill(0)
model.on_data_hooks.append(_zero_init_impl)
if model.W is not None:
model.W.fill(0.)
model.W.fill(0.0)
return model
@layerize
def _preprocess_doc(docs, drop=0.):
def _preprocess_doc(docs, drop=0.0):
keys = [doc.to_array(LOWER) for doc in docs]
ops = Model.ops
# The dtype here matches what thinc is expecting -- which differs per
@ -89,11 +89,12 @@ def _preprocess_doc(docs, drop=0.):
# is fixed on Thinc's side.
lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
keys = ops.xp.concatenate(keys)
vals = ops.allocate(keys.shape) + 1.
vals = ops.allocate(keys.shape) + 1.0
return (keys, vals, lengths), None
@layerize
def _preprocess_doc_bigrams(docs, drop=0.):
def _preprocess_doc_bigrams(docs, drop=0.0):
unigrams = [doc.to_array(LOWER) for doc in docs]
ops = Model.ops
bigrams = [ops.ngrams(2, doc_unis) for doc_unis in unigrams]
@ -104,27 +105,29 @@ def _preprocess_doc_bigrams(docs, drop=0.):
# is fixed on Thinc's side.
lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
keys = ops.xp.concatenate(keys)
vals = ops.asarray(ops.xp.concatenate(vals), dtype='f')
vals = ops.asarray(ops.xp.concatenate(vals), dtype="f")
return (keys, vals, lengths), None
@describe.on_data(_set_dimensions_if_needed,
lambda model, X, y: model.init_weights(model))
@describe.on_data(
_set_dimensions_if_needed, lambda model, X, y: model.init_weights(model)
)
@describe.attributes(
nI=Dimension("Input size"),
nF=Dimension("Number of features"),
nO=Dimension("Output size"),
nP=Dimension("Maxout pieces"),
W=Synapses("Weights matrix",
lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
b=Biases("Bias vector",
lambda obj: (obj.nO, obj.nP)),
pad=Synapses("Pad",
W=Synapses("Weights matrix", lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
b=Biases("Bias vector", lambda obj: (obj.nO, obj.nP)),
pad=Synapses(
"Pad",
lambda obj: (1, obj.nF, obj.nO, obj.nP),
lambda M, ops: ops.normal_init(M, 1.)),
lambda M, ops: ops.normal_init(M, 1.0),
),
d_W=Gradient("W"),
d_pad=Gradient("pad"),
d_b=Gradient("b"))
d_b=Gradient("b"),
)
class PrecomputableAffine(Model):
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
Model.__init__(self, **kwargs)
@ -133,9 +136,10 @@ class PrecomputableAffine(Model):
self.nI = nI
self.nF = nF
def begin_update(self, X, drop=0.):
Yf = self.ops.gemm(X,
self.W.reshape((self.nF*self.nO*self.nP, self.nI)), trans2=True)
def begin_update(self, X, drop=0.0):
Yf = self.ops.gemm(
X, self.W.reshape((self.nF * self.nO * self.nP, self.nI)), trans2=True
)
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
Yf = self._add_padding(Yf)
@ -146,15 +150,16 @@ class PrecomputableAffine(Model):
Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI))
self.d_b += dY.sum(axis=0)
dY = dY.reshape((dY.shape[0], self.nO*self.nP))
dY = dY.reshape((dY.shape[0], self.nO * self.nP))
Wopfi = self.W.transpose((1, 2, 0, 3))
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
Wopfi = Wopfi.reshape((self.nO*self.nP, self.nF * self.nI))
dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO*self.nP)), Wopfi)
Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI))
dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO * self.nP)), Wopfi)
# Reuse the buffer
dWopfi = Wopfi; dWopfi.fill(0.)
dWopfi = Wopfi
dWopfi.fill(0.0)
self.ops.gemm(dY, Xf, out=dWopfi, trans1=True)
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
# (o, p, f, i) --> (f, o, p, i)
@ -163,6 +168,7 @@ class PrecomputableAffine(Model):
if sgd is not None:
sgd(self._mem.weights, self._mem.gradient, key=self.id)
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
return Yf, backward
def _add_padding(self, Yf):
@ -171,7 +177,7 @@ class PrecomputableAffine(Model):
def _backprop_padding(self, dY, ids):
# (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0
mask = ids < 0.
mask = ids < 0.0
mask = mask.sum(axis=1)
d_pad = dY * mask.reshape((ids.shape[0], 1, 1))
self.d_pad += d_pad.sum(axis=0)
@ -179,33 +185,36 @@ class PrecomputableAffine(Model):
@staticmethod
def init_weights(model):
'''This is like the 'layer sequential unit variance', but instead
"""This is like the 'layer sequential unit variance', but instead
of taking the actual inputs, we randomly generate whitened data.
Why's this all so complicated? We have a huge number of inputs,
and the maxout unit makes guessing the dynamics tricky. Instead
we set the maxout weights to values that empirically result in
whitened outputs given whitened inputs.
'''
if (model.W**2).sum() != 0.:
"""
if (model.W ** 2).sum() != 0.0:
return
ops = model.ops
xp = ops.xp
ops.normal_init(model.W, model.nF * model.nI, inplace=True)
ids = ops.allocate((5000, model.nF), dtype='f')
ids = ops.allocate((5000, model.nF), dtype="f")
ids += xp.random.uniform(0, 1000, ids.shape)
ids = ops.asarray(ids, dtype='i')
tokvecs = ops.allocate((5000, model.nI), dtype='f')
tokvecs += xp.random.normal(loc=0., scale=1.,
size=tokvecs.size).reshape(tokvecs.shape)
ids = ops.asarray(ids, dtype="i")
tokvecs = ops.allocate((5000, model.nI), dtype="f")
tokvecs += xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
tokvecs.shape
)
def predict(ids, tokvecs):
# nS ids. nW tokvecs. Exclude the padding array.
hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype='f')
hiddens = model(tokvecs[:-1]) # (nW, f, o, p)
vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype="f")
# need nS vectors
hiddens = hiddens.reshape((hiddens.shape[0] * model.nF, model.nO * model.nP))
hiddens = hiddens.reshape(
(hiddens.shape[0] * model.nF, model.nO * model.nP)
)
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP))
vectors += model.b
@ -238,7 +247,8 @@ def link_vectors_to_models(vocab):
if vectors.data.size != 0:
print(
"Warning: Unnamed vectors -- this won't allow multiple vectors "
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape)
"models to be loaded. (Shape: (%d, %d))" % vectors.data.shape
)
ops = Model.ops
for word in vocab:
if word.orth in vectors.key2row:
@ -254,28 +264,31 @@ def link_vectors_to_models(vocab):
def PyTorchBiLSTM(nO, nI, depth, dropout=0.2):
if depth == 0:
return layerize(noop())
model = torch.nn.LSTM(nI, nO//2, depth, bidirectional=True, dropout=dropout)
model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout)
return with_square_sequences(PyTorchWrapperRNN(model))
def Tok2Vec(width, embed_size, **kwargs):
pretrained_vectors = kwargs.get('pretrained_vectors', None)
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
subword_features = kwargs.get('subword_features', True)
conv_depth = kwargs.get('conv_depth', 4)
bilstm_depth = kwargs.get('bilstm_depth', 0)
pretrained_vectors = kwargs.get("pretrained_vectors", None)
cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 2)
subword_features = kwargs.get("subword_features", True)
conv_depth = kwargs.get("conv_depth", 4)
bilstm_depth = kwargs.get("bilstm_depth", 0)
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
'+': add, '*': reapply}):
norm = HashEmbed(width, embed_size, column=cols.index(NORM),
name='embed_norm')
with Model.define_operators(
{">>": chain, "|": concatenate, "**": clone, "+": add, "*": reapply}
):
norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm")
if subword_features:
prefix = HashEmbed(width, embed_size//2, column=cols.index(PREFIX),
name='embed_prefix')
suffix = HashEmbed(width, embed_size//2, column=cols.index(SUFFIX),
name='embed_suffix')
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
name='embed_shape')
prefix = HashEmbed(
width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix"
)
suffix = HashEmbed(
width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix"
)
shape = HashEmbed(
width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape"
)
else:
prefix, suffix, shape = (None, None, None)
if pretrained_vectors is not None:
@ -284,28 +297,29 @@ def Tok2Vec(width, embed_size, **kwargs):
if subword_features:
embed = uniqued(
(glove | norm | prefix | suffix | shape)
>> LN(Maxout(width, width*5, pieces=3)), column=cols.index(ORTH))
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = uniqued(
(glove | norm)
>> LN(Maxout(width, width*2, pieces=3)), column=cols.index(ORTH))
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
column=cols.index(ORTH),
)
elif subword_features:
embed = uniqued(
(norm | prefix | suffix | shape)
>> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
>> LN(Maxout(width, width * 4, pieces=3)),
column=cols.index(ORTH),
)
else:
embed = norm
convolution = Residual(
ExtractWindow(nW=1)
>> LN(Maxout(width, width*3, pieces=cnn_maxout_pieces))
>> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces))
)
tok2vec = (
FeatureExtracter(cols)
>> with_flatten(
embed
>> convolution ** conv_depth, pad=conv_depth
)
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth)
@ -316,7 +330,7 @@ def Tok2Vec(width, embed_size, **kwargs):
def reapply(layer, n_times):
def reapply_fwd(X, drop=0.):
def reapply_fwd(X, drop=0.0):
backprops = []
for i in range(n_times):
Y, backprop = layer.begin_update(X, drop=drop)
@ -334,12 +348,14 @@ def reapply(layer, n_times):
return dX
return Y, reapply_bwd
return wrap(reapply_fwd, layer)
def asarray(ops, dtype):
def forward(X, drop=0.):
def forward(X, drop=0.0):
return ops.asarray(X, dtype=dtype), None
return layerize(forward)
@ -347,7 +363,7 @@ def _divide_array(X, size):
parts = []
index = 0
while index < len(X):
parts.append(X[index:index + size])
parts.append(X[index : index + size])
index += size
return parts
@ -356,7 +372,7 @@ def get_col(idx):
if idx < 0:
raise IndexError(Errors.E066.format(value=idx))
def forward(X, drop=0.):
def forward(X, drop=0.0):
if isinstance(X, numpy.ndarray):
ops = NumpyOps()
else:
@ -377,7 +393,7 @@ def doc2feats(cols=None):
if cols is None:
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
def forward(docs, drop=0.):
def forward(docs, drop=0.0):
feats = []
for doc in docs:
feats.append(doc.to_array(cols))
@ -389,13 +405,14 @@ def doc2feats(cols=None):
def print_shape(prefix):
def forward(X, drop=0.):
def forward(X, drop=0.0):
return X, lambda dX, **kwargs: dX
return layerize(forward)
@layerize
def get_token_vectors(tokens_attrs_vectors, drop=0.):
def get_token_vectors(tokens_attrs_vectors, drop=0.0):
tokens, attrs, vectors = tokens_attrs_vectors
def backward(d_output, sgd=None):
@ -405,17 +422,17 @@ def get_token_vectors(tokens_attrs_vectors, drop=0.):
@layerize
def logistic(X, drop=0.):
def logistic(X, drop=0.0):
xp = get_array_module(X)
if not isinstance(X, xp.ndarray):
X = xp.asarray(X)
# Clip to range (-10, 10)
X = xp.minimum(X, 10., X)
X = xp.maximum(X, -10., X)
Y = 1. / (1. + xp.exp(-X))
X = xp.minimum(X, 10.0, X)
X = xp.maximum(X, -10.0, X)
Y = 1.0 / (1.0 + xp.exp(-X))
def logistic_bwd(dY, sgd=None):
dX = dY * (Y * (1-Y))
dX = dY * (Y * (1 - Y))
return dX
return Y, logistic_bwd
@ -424,12 +441,13 @@ def logistic(X, drop=0.):
def zero_init(model):
def _zero_init_impl(self, X, y):
self.W.fill(0)
model.on_data_hooks.append(_zero_init_impl)
return model
@layerize
def preprocess_doc(docs, drop=0.):
def preprocess_doc(docs, drop=0.0):
keys = [doc.to_array([LOWER]) for doc in docs]
ops = Model.ops
lengths = ops.asarray([arr.shape[0] for arr in keys])
@ -439,31 +457,32 @@ def preprocess_doc(docs, drop=0.):
def getitem(i):
def getitem_fwd(X, drop=0.):
def getitem_fwd(X, drop=0.0):
return X[i], None
return layerize(getitem_fwd)
def build_tagger_model(nr_class, **cfg):
embed_size = util.env_opt('embed_size', 2000)
if 'token_vector_width' in cfg:
token_vector_width = cfg['token_vector_width']
embed_size = util.env_opt("embed_size", 2000)
if "token_vector_width" in cfg:
token_vector_width = cfg["token_vector_width"]
else:
token_vector_width = util.env_opt('token_vector_width', 96)
pretrained_vectors = cfg.get('pretrained_vectors')
subword_features = cfg.get('subword_features', True)
with Model.define_operators({'>>': chain, '+': add}):
if 'tok2vec' in cfg:
tok2vec = cfg['tok2vec']
token_vector_width = util.env_opt("token_vector_width", 96)
pretrained_vectors = cfg.get("pretrained_vectors")
subword_features = cfg.get("subword_features", True)
with Model.define_operators({">>": chain, "+": add}):
if "tok2vec" in cfg:
tok2vec = cfg["tok2vec"]
else:
tok2vec = Tok2Vec(token_vector_width, embed_size,
subword_features=subword_features,
pretrained_vectors=pretrained_vectors)
tok2vec = Tok2Vec(
token_vector_width,
embed_size,
subword_features=subword_features,
pretrained_vectors=pretrained_vectors,
)
softmax = with_flatten(Softmax(nr_class, token_vector_width))
model = (
tok2vec
>> softmax
)
model = tok2vec >> softmax
model.nI = None
model.tok2vec = tok2vec
model.softmax = softmax
@ -471,10 +490,10 @@ def build_tagger_model(nr_class, **cfg):
@layerize
def SpacyVectors(docs, drop=0.):
def SpacyVectors(docs, drop=0.0):
batch = []
for doc in docs:
indices = numpy.zeros((len(doc),), dtype='i')
indices = numpy.zeros((len(doc),), dtype="i")
for i, word in enumerate(doc):
if word.orth in doc.vocab.vectors.key2row:
indices[i] = doc.vocab.vectors.key2row[word.orth]
@ -486,12 +505,11 @@ def SpacyVectors(docs, drop=0.):
def build_text_classifier(nr_class, width=64, **cfg):
depth = cfg.get('depth', 2)
nr_vector = cfg.get('nr_vector', 5000)
pretrained_dims = cfg.get('pretrained_dims', 0)
with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
'**': clone}):
if cfg.get('low_data') and pretrained_dims:
depth = cfg.get("depth", 2)
nr_vector = cfg.get("nr_vector", 5000)
pretrained_dims = cfg.get("pretrained_dims", 0)
with Model.define_operators({">>": chain, "+": add, "|": concatenate, "**": clone}):
if cfg.get("low_data") and pretrained_dims:
model = (
SpacyVectors
>> flatten_add_lengths
@ -505,41 +523,35 @@ def build_text_classifier(nr_class, width=64, **cfg):
return model
lower = HashEmbed(width, nr_vector, column=1)
prefix = HashEmbed(width//2, nr_vector, column=2)
suffix = HashEmbed(width//2, nr_vector, column=3)
shape = HashEmbed(width//2, nr_vector, column=4)
prefix = HashEmbed(width // 2, nr_vector, column=2)
suffix = HashEmbed(width // 2, nr_vector, column=3)
shape = HashEmbed(width // 2, nr_vector, column=4)
trained_vectors = (
FeatureExtracter([ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID])
>> with_flatten(
uniqued(
(lower | prefix | suffix | shape)
>> LN(Maxout(width, width+(width//2)*3)),
column=0
)
trained_vectors = FeatureExtracter(
[ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID]
) >> with_flatten(
uniqued(
(lower | prefix | suffix | shape)
>> LN(Maxout(width, width + (width // 2) * 3)),
column=0,
)
)
if pretrained_dims:
static_vectors = (
SpacyVectors
>> with_flatten(Affine(width, pretrained_dims))
static_vectors = SpacyVectors >> with_flatten(
Affine(width, pretrained_dims)
)
# TODO Make concatenate support lists
vectors = concatenate_lists(trained_vectors, static_vectors)
vectors_width = width*2
vectors_width = width * 2
else:
vectors = trained_vectors
vectors_width = width
static_vectors = None
tok2vec = (
vectors
>> with_flatten(
LN(Maxout(width, vectors_width))
>> Residual(
(ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
) ** depth, pad=depth
)
tok2vec = vectors >> with_flatten(
LN(Maxout(width, vectors_width))
>> Residual((ExtractWindow(nW=1) >> LN(Maxout(width, width * 3)))) ** depth,
pad=depth,
)
cnn_model = (
tok2vec
@ -550,13 +562,10 @@ def build_text_classifier(nr_class, width=64, **cfg):
>> zero_init(Affine(nr_class, width, drop_factor=0.0))
)
linear_model = (
_preprocess_doc
>> LinearModel(nr_class)
)
linear_model = _preprocess_doc >> LinearModel(nr_class)
model = (
(linear_model | cnn_model)
>> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
>> zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
>> logistic
)
model.tok2vec = tok2vec
@ -566,9 +575,9 @@ def build_text_classifier(nr_class, width=64, **cfg):
@layerize
def flatten(seqs, drop=0.):
def flatten(seqs, drop=0.0):
ops = Model.ops
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
lengths = ops.asarray([len(seq) for seq in seqs], dtype="i")
def finish_update(d_X, sgd=None):
return ops.unflatten(d_X, lengths, pad=0)
@ -583,14 +592,14 @@ def concatenate_lists(*layers, **kwargs): # pragma: no cover
"""
if not layers:
return noop()
drop_factor = kwargs.get('drop_factor', 1.0)
drop_factor = kwargs.get("drop_factor", 1.0)
ops = layers[0].ops
layers = [chain(layer, flatten) for layer in layers]
concat = concatenate(*layers)
def concatenate_lists_fwd(Xs, drop=0.):
def concatenate_lists_fwd(Xs, drop=0.0):
drop *= drop_factor
lengths = ops.asarray([len(X) for X in Xs], dtype='i')
lengths = ops.asarray([len(X) for X in Xs], dtype="i")
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
ys = ops.unflatten(flat_y, lengths)

View File

@ -1,16 +1,17 @@
# inspired from:
# https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
# fmt: off
__title__ = 'spacy-nightly'
__version__ = '2.1.0a3'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Explosion AI'
__email__ = 'contact@explosion.ai'
__license__ = 'MIT'
__title__ = "spacy-nightly"
__version__ = "2.1.0a3"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"
__email__ = "contact@explosion.ai"
__license__ = "MIT"
__release__ = False
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json"

View File

@ -6,7 +6,6 @@ import sys
import ujson
import itertools
import locale
import os
from thinc.neural.util import copy_array
@ -31,9 +30,9 @@ except ImportError:
cupy = None
try:
from thinc.neural.optimizers import Optimizer
from thinc.neural.optimizers import Optimizer # noqa: F401
except ImportError:
from thinc.neural.optimizers import Adam as Optimizer
from thinc.neural.optimizers import Adam as Optimizer # noqa: F401
pickle = pickle
copy_reg = copy_reg

View File

@ -12,8 +12,15 @@ _html = {}
IS_JUPYTER = is_in_jupyter()
def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
options={}, manual=False):
def render(
docs,
style="dep",
page=False,
minify=False,
jupyter=IS_JUPYTER,
options={},
manual=False,
):
"""Render displaCy visualisation.
docs (list or Doc): Document(s) to visualise.
@ -25,8 +32,10 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
RETURNS (unicode): Rendered HTML markup.
"""
factories = {'dep': (DependencyRenderer, parse_deps),
'ent': (EntityRenderer, parse_ents)}
factories = {
"dep": (DependencyRenderer, parse_deps),
"ent": (EntityRenderer, parse_ents),
}
if style not in factories:
raise ValueError(Errors.E087.format(style=style))
if isinstance(docs, (Doc, Span, dict)):
@ -37,16 +46,18 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
renderer, converter = factories[style]
renderer = renderer(options=options)
parsed = [converter(doc, options) for doc in docs] if not manual else docs
_html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
html = _html['parsed']
_html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip()
html = _html["parsed"]
if jupyter: # return HTML rendered by IPython display()
from IPython.core.display import display, HTML
return display(HTML(html))
return html
def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
port=5000):
def serve(
docs, style="dep", page=True, minify=False, options={}, manual=False, port=5000
):
"""Serve displaCy visualisation.
docs (list or Doc): Document(s) to visualise.
@ -58,11 +69,13 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
port (int): Port to serve visualisation.
"""
from wsgiref import simple_server
render(docs, style=style, page=page, minify=minify, options=options,
manual=manual)
httpd = simple_server.make_server('0.0.0.0', port, app)
prints("Using the '{}' visualizer".format(style),
title="Serving on port {}...".format(port))
render(docs, style=style, page=page, minify=minify, options=options, manual=manual)
httpd = simple_server.make_server("0.0.0.0", port, app)
prints(
"Using the '{}' visualizer".format(style),
title="Serving on port {}...".format(port),
)
try:
httpd.serve_forever()
except KeyboardInterrupt:
@ -72,11 +85,10 @@ def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
def app(environ, start_response):
# headers and status need to be bytes in Python 2, see #1227
headers = [(b_to_str(b'Content-type'),
b_to_str(b'text/html; charset=utf-8'))]
start_response(b_to_str(b'200 OK'), headers)
res = _html['parsed'].encode(encoding='utf-8')
# Headers and status need to be bytes in Python 2, see #1227
headers = [(b_to_str(b"Content-type"), b_to_str(b"text/html; charset=utf-8"))]
start_response(b_to_str(b"200 OK"), headers)
res = _html["parsed"].encode(encoding="utf-8")
return [res]
@ -89,11 +101,10 @@ def parse_deps(orig_doc, options={}):
doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
if not doc.is_parsed:
user_warning(Warnings.W005)
if options.get('collapse_phrases', False):
if options.get("collapse_phrases", False):
for np in list(doc.noun_chunks):
np.merge(tag=np.root.tag_, lemma=np.root.lemma_,
ent_type=np.root.ent_type_)
if options.get('collapse_punct', True):
np.merge(tag=np.root.tag_, lemma=np.root.lemma_, ent_type=np.root.ent_type_)
if options.get("collapse_punct", True):
spans = []
for word in doc[:-1]:
if word.is_punct or not word.nbor(1).is_punct:
@ -103,23 +114,31 @@ def parse_deps(orig_doc, options={}):
while end < len(doc) and doc[end].is_punct:
end += 1
span = doc[start:end]
spans.append((span.start_char, span.end_char, word.tag_,
word.lemma_, word.ent_type_))
spans.append(
(span.start_char, span.end_char, word.tag_, word.lemma_, word.ent_type_)
)
for start, end, tag, lemma, ent_type in spans:
doc.merge(start, end, tag=tag, lemma=lemma, ent_type=ent_type)
if options.get('fine_grained'):
words = [{'text': w.text, 'tag': w.tag_} for w in doc]
if options.get("fine_grained"):
words = [{"text": w.text, "tag": w.tag_} for w in doc]
else:
words = [{'text': w.text, 'tag': w.pos_} for w in doc]
words = [{"text": w.text, "tag": w.pos_} for w in doc]
arcs = []
for word in doc:
if word.i < word.head.i:
arcs.append({'start': word.i, 'end': word.head.i,
'label': word.dep_, 'dir': 'left'})
arcs.append(
{"start": word.i, "end": word.head.i, "label": word.dep_, "dir": "left"}
)
elif word.i > word.head.i:
arcs.append({'start': word.head.i, 'end': word.i,
'label': word.dep_, 'dir': 'right'})
return {'words': words, 'arcs': arcs}
arcs.append(
{
"start": word.head.i,
"end": word.i,
"label": word.dep_,
"dir": "right",
}
)
return {"words": words, "arcs": arcs}
def parse_ents(doc, options={}):
@ -128,10 +147,11 @@ def parse_ents(doc, options={}):
doc (Doc): Document do parse.
RETURNS (dict): Generated entities keyed by text (original text) and ents.
"""
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents]
ents = [
{"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
for ent in doc.ents
]
if not ents:
user_warning(Warnings.W006)
title = (doc.user_data.get('title', None)
if hasattr(doc, 'user_data') else None)
return {'text': doc.text, 'ents': ents, 'title': title}
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
return {"text": doc.text, "ents": ents, "title": title}

View File

@ -10,7 +10,8 @@ from ..util import minify_html, escape_html
class DependencyRenderer(object):
"""Render dependency parses as SVGs."""
style = 'dep'
style = "dep"
def __init__(self, options={}):
"""Initialise dependency renderer.
@ -19,18 +20,16 @@ class DependencyRenderer(object):
arrow_spacing, arrow_width, arrow_stroke, distance, offset_x,
color, bg, font)
"""
self.compact = options.get('compact', False)
self.word_spacing = options.get('word_spacing', 45)
self.arrow_spacing = options.get('arrow_spacing',
12 if self.compact else 20)
self.arrow_width = options.get('arrow_width',
6 if self.compact else 10)
self.arrow_stroke = options.get('arrow_stroke', 2)
self.distance = options.get('distance', 150 if self.compact else 175)
self.offset_x = options.get('offset_x', 50)
self.color = options.get('color', '#000000')
self.bg = options.get('bg', '#ffffff')
self.font = options.get('font', 'Arial')
self.compact = options.get("compact", False)
self.word_spacing = options.get("word_spacing", 45)
self.arrow_spacing = options.get("arrow_spacing", 12 if self.compact else 20)
self.arrow_width = options.get("arrow_width", 6 if self.compact else 10)
self.arrow_stroke = options.get("arrow_stroke", 2)
self.distance = options.get("distance", 150 if self.compact else 175)
self.offset_x = options.get("offset_x", 50)
self.color = options.get("color", "#000000")
self.bg = options.get("bg", "#ffffff")
self.font = options.get("font", "Arial")
def render(self, parsed, page=False, minify=False):
"""Render complete markup.
@ -43,14 +42,15 @@ class DependencyRenderer(object):
# Create a random ID prefix to make sure parses don't receive the
# same ID, even if they're identical
id_prefix = random.randint(0, 999)
rendered = [self.render_svg('{}-{}'.format(id_prefix, i), p['words'], p['arcs'])
for i, p in enumerate(parsed)]
rendered = [
self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
for i, p in enumerate(parsed)
]
if page:
content = ''.join([TPL_FIGURE.format(content=svg)
for svg in rendered])
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
markup = TPL_PAGE.format(content=content)
else:
markup = ''.join(rendered)
markup = "".join(rendered)
if minify:
return minify_html(markup)
return markup
@ -65,19 +65,25 @@ class DependencyRenderer(object):
"""
self.levels = self.get_levels(arcs)
self.highest_level = len(self.levels)
self.offset_y = self.distance/2*self.highest_level+self.arrow_stroke
self.width = self.offset_x+len(words)*self.distance
self.height = self.offset_y+3*self.word_spacing
self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke
self.width = self.offset_x + len(words) * self.distance
self.height = self.offset_y + 3 * self.word_spacing
self.id = render_id
words = [self.render_word(w['text'], w['tag'], i)
for i, w in enumerate(words)]
arcs = [self.render_arrow(a['label'], a['start'],
a['end'], a['dir'], i)
for i, a in enumerate(arcs)]
content = ''.join(words) + ''.join(arcs)
return TPL_DEP_SVG.format(id=self.id, width=self.width,
height=self.height, color=self.color,
bg=self.bg, font=self.font, content=content)
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
arcs = [
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
for i, a in enumerate(arcs)
]
content = "".join(words) + "".join(arcs)
return TPL_DEP_SVG.format(
id=self.id,
width=self.width,
height=self.height,
color=self.color,
bg=self.bg,
font=self.font,
content=content,
)
def render_word(self, text, tag, i):
"""Render individual word.
@ -87,12 +93,11 @@ class DependencyRenderer(object):
i (int): Unique ID, typically word index.
RETURNS (unicode): Rendered SVG markup.
"""
y = self.offset_y+self.word_spacing
x = self.offset_x+i*self.distance
y = self.offset_y + self.word_spacing
x = self.offset_x + i * self.distance
html_text = escape_html(text)
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
def render_arrow(self, label, start, end, direction, i):
"""Render indivicual arrow.
@ -103,20 +108,30 @@ class DependencyRenderer(object):
i (int): Unique ID, typically arrow index.
RETURNS (unicode): Rendered SVG markup.
"""
level = self.levels.index(end-start)+1
x_start = self.offset_x+start*self.distance+self.arrow_spacing
level = self.levels.index(end - start) + 1
x_start = self.offset_x + start * self.distance + self.arrow_spacing
y = self.offset_y
x_end = (self.offset_x+(end-start)*self.distance+start*self.distance
- self.arrow_spacing*(self.highest_level-level)/4)
y_curve = self.offset_y-level*self.distance/2
x_end = (
self.offset_x
+ (end - start) * self.distance
+ start * self.distance
- self.arrow_spacing * (self.highest_level - level) / 4
)
y_curve = self.offset_y - level * self.distance / 2
if self.compact:
y_curve = self.offset_y-level*self.distance/6
y_curve = self.offset_y - level * self.distance / 6
if y_curve == 0 and len(self.levels) > 5:
y_curve = -self.distance
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
arc = self.get_arc(x_start, y, y_curve, x_end)
return TPL_DEP_ARCS.format(id=self.id, i=i, stroke=self.arrow_stroke,
head=arrowhead, label=label, arc=arc)
return TPL_DEP_ARCS.format(
id=self.id,
i=i,
stroke=self.arrow_stroke,
head=arrowhead,
label=label,
arc=arc,
)
def get_arc(self, x_start, y, y_curve, x_end):
"""Render individual arc.
@ -141,13 +156,22 @@ class DependencyRenderer(object):
end (int): X-coordinate of arrow end point.
RETURNS (unicode): Definition of the arrow head path ('d' attribute).
"""
if direction == 'left':
pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
if direction == "left":
pos1, pos2, pos3 = (x, x - self.arrow_width + 2, x + self.arrow_width - 2)
else:
pos1, pos2, pos3 = (end, end+self.arrow_width-2,
end-self.arrow_width+2)
arrowhead = (pos1, y+2, pos2, y-self.arrow_width, pos3,
y-self.arrow_width)
pos1, pos2, pos3 = (
end,
end + self.arrow_width - 2,
end - self.arrow_width + 2,
)
arrowhead = (
pos1,
y + 2,
pos2,
y - self.arrow_width,
pos3,
y - self.arrow_width,
)
return "M{},{} L{},{} {},{}".format(*arrowhead)
def get_levels(self, arcs):
@ -157,30 +181,44 @@ class DependencyRenderer(object):
args (list): Individual arcs and their start, end, direction and label.
RETURNS (list): Arc levels sorted from lowest to highest.
"""
levels = set(map(lambda arc: arc['end'] - arc['start'], arcs))
levels = set(map(lambda arc: arc["end"] - arc["start"], arcs))
return sorted(list(levels))
class EntityRenderer(object):
"""Render named entities as HTML."""
style = 'ent'
style = "ent"
def __init__(self, options={}):
"""Initialise dependency renderer.
options (dict): Visualiser-specific options (colors, ents)
"""
colors = {'ORG': '#7aecec', 'PRODUCT': '#bfeeb7', 'GPE': '#feca74',
'LOC': '#ff9561', 'PERSON': '#aa9cfc', 'NORP': '#c887fb',
'FACILITY': '#9cc9cc', 'EVENT': '#ffeb80', 'LAW': '#ff8197',
'LANGUAGE': '#ff8197', 'WORK_OF_ART': '#f0d0ff',
'DATE': '#bfe1d9', 'TIME': '#bfe1d9', 'MONEY': '#e4e7d2',
'QUANTITY': '#e4e7d2', 'ORDINAL': '#e4e7d2',
'CARDINAL': '#e4e7d2', 'PERCENT': '#e4e7d2'}
colors.update(options.get('colors', {}))
self.default_color = '#ddd'
colors = {
"ORG": "#7aecec",
"PRODUCT": "#bfeeb7",
"GPE": "#feca74",
"LOC": "#ff9561",
"PERSON": "#aa9cfc",
"NORP": "#c887fb",
"FACILITY": "#9cc9cc",
"EVENT": "#ffeb80",
"LAW": "#ff8197",
"LANGUAGE": "#ff8197",
"WORK_OF_ART": "#f0d0ff",
"DATE": "#bfe1d9",
"TIME": "#bfe1d9",
"MONEY": "#e4e7d2",
"QUANTITY": "#e4e7d2",
"ORDINAL": "#e4e7d2",
"CARDINAL": "#e4e7d2",
"PERCENT": "#e4e7d2",
}
colors.update(options.get("colors", {}))
self.default_color = "#ddd"
self.colors = colors
self.ents = options.get('ents', None)
self.ents = options.get("ents", None)
def render(self, parsed, page=False, minify=False):
"""Render complete markup.
@ -190,14 +228,14 @@ class EntityRenderer(object):
minify (bool): Minify HTML markup.
RETURNS (unicode): Rendered HTML markup.
"""
rendered = [self.render_ents(p['text'], p['ents'],
p.get('title', None)) for p in parsed]
rendered = [
self.render_ents(p["text"], p["ents"], p.get("title", None)) for p in parsed
]
if page:
docs = ''.join([TPL_FIGURE.format(content=doc)
for doc in rendered])
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
markup = TPL_PAGE.format(content=docs)
else:
markup = ''.join(rendered)
markup = "".join(rendered)
if minify:
return minify_html(markup)
return markup
@ -209,18 +247,18 @@ class EntityRenderer(object):
spans (list): Individual entity spans and their start, end and label.
title (unicode or None): Document title set in Doc.user_data['title'].
"""
markup = ''
markup = ""
offset = 0
for span in spans:
label = span['label']
start = span['start']
end = span['end']
label = span["label"]
start = span["start"]
end = span["end"]
entity = text[start:end]
fragments = text[offset:start].split('\n')
fragments = text[offset:start].split("\n")
for i, fragment in enumerate(fragments):
markup += fragment
if len(fragments) > 1 and i != len(fragments)-1:
markup += '</br>'
if len(fragments) > 1 and i != len(fragments) - 1:
markup += "</br>"
if self.ents is None or label.upper() in self.ents:
color = self.colors.get(label.upper(), self.default_color)
markup += TPL_ENT.format(label=label, text=entity, bg=color)

View File

@ -2,7 +2,7 @@
from __future__ import unicode_literals
# setting explicit height and max-width: none on the SVG is required for
# Setting explicit height and max-width: none on the SVG is required for
# Jupyter to render it properly in a cell
TPL_DEP_SVG = """

View File

@ -8,13 +8,17 @@ import inspect
def add_codes(err_cls):
"""Add error codes to string messages via class attribute names."""
class ErrorsWithCodes(object):
def __getattribute__(self, code):
msg = getattr(err_cls, code)
return '[{code}] {msg}'.format(code=code, msg=msg)
return "[{code}] {msg}".format(code=code, msg=msg)
return ErrorsWithCodes()
# fmt: off
@add_codes
class Warnings(object):
W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. "
@ -260,7 +264,7 @@ class Errors(object):
E095 = ("Can't write to frozen dictionary. This is likely an internal "
"error. Are you writing to a default function argument?")
E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
"Span objects, or dicts if set to manual=True.")
"Span objects, or dicts if set to manual=True.")
E097 = ("Invalid pattern: expected token pattern (list of dicts) or "
"phrase pattern (string) but got:\n{pattern}")
E098 = ("Invalid pattern specified: expected both SPEC and PATTERN.")
@ -275,6 +279,7 @@ class Errors(object):
" can only be part of one entity, so make sure the entities you're "
"setting don't overlap.")
@add_codes
class TempErrors(object):
T001 = ("Max length currently 10 for phrase matching")
@ -292,55 +297,57 @@ class TempErrors(object):
"(pretrained_dims) but not the new name (pretrained_vectors).")
# fmt: on
class ModelsWarning(UserWarning):
pass
WARNINGS = {
'user': UserWarning,
'deprecation': DeprecationWarning,
'models': ModelsWarning,
"user": UserWarning,
"deprecation": DeprecationWarning,
"models": ModelsWarning,
}
def _get_warn_types(arg):
if arg == '': # don't show any warnings
if arg == "": # don't show any warnings
return []
if not arg or arg == 'all': # show all available warnings
if not arg or arg == "all": # show all available warnings
return WARNINGS.keys()
return [w_type.strip() for w_type in arg.split(',')
if w_type.strip() in WARNINGS]
return [w_type.strip() for w_type in arg.split(",") if w_type.strip() in WARNINGS]
def _get_warn_excl(arg):
if not arg:
return []
return [w_id.strip() for w_id in arg.split(',')]
return [w_id.strip() for w_id in arg.split(",")]
SPACY_WARNING_FILTER = os.environ.get('SPACY_WARNING_FILTER')
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get('SPACY_WARNING_TYPES'))
SPACY_WARNING_IGNORE = _get_warn_excl(os.environ.get('SPACY_WARNING_IGNORE'))
SPACY_WARNING_FILTER = os.environ.get("SPACY_WARNING_FILTER")
SPACY_WARNING_TYPES = _get_warn_types(os.environ.get("SPACY_WARNING_TYPES"))
SPACY_WARNING_IGNORE = _get_warn_excl(os.environ.get("SPACY_WARNING_IGNORE"))
def user_warning(message):
_warn(message, 'user')
_warn(message, "user")
def deprecation_warning(message):
_warn(message, 'deprecation')
_warn(message, "deprecation")
def models_warning(message):
_warn(message, 'models')
_warn(message, "models")
def _warn(message, warn_type='user'):
def _warn(message, warn_type="user"):
"""
message (unicode): The message to display.
category (Warning): The Warning to show.
"""
w_id = message.split('[', 1)[1].split(']', 1)[0] # get ID from string
w_id = message.split("[", 1)[1].split("]", 1)[0] # get ID from string
if warn_type in SPACY_WARNING_TYPES and w_id not in SPACY_WARNING_IGNORE:
category = WARNINGS[warn_type]
stack = inspect.stack()[-1]

View File

@ -21,295 +21,272 @@ GLOSSARY = {
# POS tags
# Universal POS Tags
# http://universaldependencies.org/u/pos/
'ADJ': 'adjective',
'ADP': 'adposition',
'ADV': 'adverb',
'AUX': 'auxiliary',
'CONJ': 'conjunction',
'CCONJ': 'coordinating conjunction',
'DET': 'determiner',
'INTJ': 'interjection',
'NOUN': 'noun',
'NUM': 'numeral',
'PART': 'particle',
'PRON': 'pronoun',
'PROPN': 'proper noun',
'PUNCT': 'punctuation',
'SCONJ': 'subordinating conjunction',
'SYM': 'symbol',
'VERB': 'verb',
'X': 'other',
'EOL': 'end of line',
'SPACE': 'space',
"ADJ": "adjective",
"ADP": "adposition",
"ADV": "adverb",
"AUX": "auxiliary",
"CONJ": "conjunction",
"CCONJ": "coordinating conjunction",
"DET": "determiner",
"INTJ": "interjection",
"NOUN": "noun",
"NUM": "numeral",
"PART": "particle",
"PRON": "pronoun",
"PROPN": "proper noun",
"PUNCT": "punctuation",
"SCONJ": "subordinating conjunction",
"SYM": "symbol",
"VERB": "verb",
"X": "other",
"EOL": "end of line",
"SPACE": "space",
# POS tags (English)
# OntoNotes 5 / Penn Treebank
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
'.': 'punctuation mark, sentence closer',
',': 'punctuation mark, comma',
'-LRB-': 'left round bracket',
'-RRB-': 'right round bracket',
'``': 'opening quotation mark',
'""': 'closing quotation mark',
"''": 'closing quotation mark',
':': 'punctuation mark, colon or ellipsis',
'$': 'symbol, currency',
'#': 'symbol, number sign',
'AFX': 'affix',
'CC': 'conjunction, coordinating',
'CD': 'cardinal number',
'DT': 'determiner',
'EX': 'existential there',
'FW': 'foreign word',
'HYPH': 'punctuation mark, hyphen',
'IN': 'conjunction, subordinating or preposition',
'JJ': 'adjective',
'JJR': 'adjective, comparative',
'JJS': 'adjective, superlative',
'LS': 'list item marker',
'MD': 'verb, modal auxiliary',
'NIL': 'missing tag',
'NN': 'noun, singular or mass',
'NNP': 'noun, proper singular',
'NNPS': 'noun, proper plural',
'NNS': 'noun, plural',
'PDT': 'predeterminer',
'POS': 'possessive ending',
'PRP': 'pronoun, personal',
'PRP$': 'pronoun, possessive',
'RB': 'adverb',
'RBR': 'adverb, comparative',
'RBS': 'adverb, superlative',
'RP': 'adverb, particle',
'TO': 'infinitival to',
'UH': 'interjection',
'VB': 'verb, base form',
'VBD': 'verb, past tense',
'VBG': 'verb, gerund or present participle',
'VBN': 'verb, past participle',
'VBP': 'verb, non-3rd person singular present',
'VBZ': 'verb, 3rd person singular present',
'WDT': 'wh-determiner',
'WP': 'wh-pronoun, personal',
'WP$': 'wh-pronoun, possessive',
'WRB': 'wh-adverb',
'SP': 'space',
'ADD': 'email',
'NFP': 'superfluous punctuation',
'GW': 'additional word in multi-word expression',
'XX': 'unknown',
'BES': 'auxiliary "be"',
'HVS': 'forms of "have"',
".": "punctuation mark, sentence closer",
",": "punctuation mark, comma",
"-LRB-": "left round bracket",
"-RRB-": "right round bracket",
"``": "opening quotation mark",
'""': "closing quotation mark",
"''": "closing quotation mark",
":": "punctuation mark, colon or ellipsis",
"$": "symbol, currency",
"#": "symbol, number sign",
"AFX": "affix",
"CC": "conjunction, coordinating",
"CD": "cardinal number",
"DT": "determiner",
"EX": "existential there",
"FW": "foreign word",
"HYPH": "punctuation mark, hyphen",
"IN": "conjunction, subordinating or preposition",
"JJ": "adjective",
"JJR": "adjective, comparative",
"JJS": "adjective, superlative",
"LS": "list item marker",
"MD": "verb, modal auxiliary",
"NIL": "missing tag",
"NN": "noun, singular or mass",
"NNP": "noun, proper singular",
"NNPS": "noun, proper plural",
"NNS": "noun, plural",
"PDT": "predeterminer",
"POS": "possessive ending",
"PRP": "pronoun, personal",
"PRP$": "pronoun, possessive",
"RB": "adverb",
"RBR": "adverb, comparative",
"RBS": "adverb, superlative",
"RP": "adverb, particle",
"TO": "infinitival to",
"UH": "interjection",
"VB": "verb, base form",
"VBD": "verb, past tense",
"VBG": "verb, gerund or present participle",
"VBN": "verb, past participle",
"VBP": "verb, non-3rd person singular present",
"VBZ": "verb, 3rd person singular present",
"WDT": "wh-determiner",
"WP": "wh-pronoun, personal",
"WP$": "wh-pronoun, possessive",
"WRB": "wh-adverb",
"SP": "space",
"ADD": "email",
"NFP": "superfluous punctuation",
"GW": "additional word in multi-word expression",
"XX": "unknown",
"BES": 'auxiliary "be"',
"HVS": 'forms of "have"',
# POS Tags (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
'$(': 'other sentence-internal punctuation mark',
'$,': 'comma',
'$.': 'sentence-final punctuation mark',
'ADJA': 'adjective, attributive',
'ADJD': 'adjective, adverbial or predicative',
'APPO': 'postposition',
'APPR': 'preposition; circumposition left',
'APPRART': 'preposition with article',
'APZR': 'circumposition right',
'ART': 'definite or indefinite article',
'CARD': 'cardinal number',
'FM': 'foreign language material',
'ITJ': 'interjection',
'KOKOM': 'comparative conjunction',
'KON': 'coordinate conjunction',
'KOUI': 'subordinate conjunction with "zu" and infinitive',
'KOUS': 'subordinate conjunction with sentence',
'NE': 'proper noun',
'NNE': 'proper noun',
'PAV': 'pronominal adverb',
'PROAV': 'pronominal adverb',
'PDAT': 'attributive demonstrative pronoun',
'PDS': 'substituting demonstrative pronoun',
'PIAT': 'attributive indefinite pronoun without determiner',
'PIDAT': 'attributive indefinite pronoun with determiner',
'PIS': 'substituting indefinite pronoun',
'PPER': 'non-reflexive personal pronoun',
'PPOSAT': 'attributive possessive pronoun',
'PPOSS': 'substituting possessive pronoun',
'PRELAT': 'attributive relative pronoun',
'PRELS': 'substituting relative pronoun',
'PRF': 'reflexive personal pronoun',
'PTKA': 'particle with adjective or adverb',
'PTKANT': 'answer particle',
'PTKNEG': 'negative particle',
'PTKVZ': 'separable verbal particle',
'PTKZU': '"zu" before infinitive',
'PWAT': 'attributive interrogative pronoun',
'PWAV': 'adverbial interrogative or relative pronoun',
'PWS': 'substituting interrogative pronoun',
'TRUNC': 'word remnant',
'VAFIN': 'finite verb, auxiliary',
'VAIMP': 'imperative, auxiliary',
'VAINF': 'infinitive, auxiliary',
'VAPP': 'perfect participle, auxiliary',
'VMFIN': 'finite verb, modal',
'VMINF': 'infinitive, modal',
'VMPP': 'perfect participle, modal',
'VVFIN': 'finite verb, full',
'VVIMP': 'imperative, full',
'VVINF': 'infinitive, full',
'VVIZU': 'infinitive with "zu", full',
'VVPP': 'perfect participle, full',
'XY': 'non-word containing non-letter',
"$(": "other sentence-internal punctuation mark",
"$,": "comma",
"$.": "sentence-final punctuation mark",
"ADJA": "adjective, attributive",
"ADJD": "adjective, adverbial or predicative",
"APPO": "postposition",
"APPR": "preposition; circumposition left",
"APPRART": "preposition with article",
"APZR": "circumposition right",
"ART": "definite or indefinite article",
"CARD": "cardinal number",
"FM": "foreign language material",
"ITJ": "interjection",
"KOKOM": "comparative conjunction",
"KON": "coordinate conjunction",
"KOUI": 'subordinate conjunction with "zu" and infinitive',
"KOUS": "subordinate conjunction with sentence",
"NE": "proper noun",
"NNE": "proper noun",
"PAV": "pronominal adverb",
"PROAV": "pronominal adverb",
"PDAT": "attributive demonstrative pronoun",
"PDS": "substituting demonstrative pronoun",
"PIAT": "attributive indefinite pronoun without determiner",
"PIDAT": "attributive indefinite pronoun with determiner",
"PIS": "substituting indefinite pronoun",
"PPER": "non-reflexive personal pronoun",
"PPOSAT": "attributive possessive pronoun",
"PPOSS": "substituting possessive pronoun",
"PRELAT": "attributive relative pronoun",
"PRELS": "substituting relative pronoun",
"PRF": "reflexive personal pronoun",
"PTKA": "particle with adjective or adverb",
"PTKANT": "answer particle",
"PTKNEG": "negative particle",
"PTKVZ": "separable verbal particle",
"PTKZU": '"zu" before infinitive',
"PWAT": "attributive interrogative pronoun",
"PWAV": "adverbial interrogative or relative pronoun",
"PWS": "substituting interrogative pronoun",
"TRUNC": "word remnant",
"VAFIN": "finite verb, auxiliary",
"VAIMP": "imperative, auxiliary",
"VAINF": "infinitive, auxiliary",
"VAPP": "perfect participle, auxiliary",
"VMFIN": "finite verb, modal",
"VMINF": "infinitive, modal",
"VMPP": "perfect participle, modal",
"VVFIN": "finite verb, full",
"VVIMP": "imperative, full",
"VVINF": "infinitive, full",
"VVIZU": 'infinitive with "zu", full',
"VVPP": "perfect participle, full",
"XY": "non-word containing non-letter",
# Noun chunks
'NP': 'noun phrase',
'PP': 'prepositional phrase',
'VP': 'verb phrase',
'ADVP': 'adverb phrase',
'ADJP': 'adjective phrase',
'SBAR': 'subordinating conjunction',
'PRT': 'particle',
'PNP': 'prepositional noun phrase',
"NP": "noun phrase",
"PP": "prepositional phrase",
"VP": "verb phrase",
"ADVP": "adverb phrase",
"ADJP": "adjective phrase",
"SBAR": "subordinating conjunction",
"PRT": "particle",
"PNP": "prepositional noun phrase",
# Dependency Labels (English)
# ClearNLP / Universal Dependencies
# https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
'acomp': 'adjectival complement',
'advcl': 'adverbial clause modifier',
'advmod': 'adverbial modifier',
'agent': 'agent',
'amod': 'adjectival modifier',
'appos': 'appositional modifier',
'attr': 'attribute',
'aux': 'auxiliary',
'auxpass': 'auxiliary (passive)',
'cc': 'coordinating conjunction',
'ccomp': 'clausal complement',
'complm': 'complementizer',
'conj': 'conjunct',
'cop': 'copula',
'csubj': 'clausal subject',
'csubjpass': 'clausal subject (passive)',
'dep': 'unclassified dependent',
'det': 'determiner',
'dobj': 'direct object',
'expl': 'expletive',
'hmod': 'modifier in hyphenation',
'hyph': 'hyphen',
'infmod': 'infinitival modifier',
'intj': 'interjection',
'iobj': 'indirect object',
'mark': 'marker',
'meta': 'meta modifier',
'neg': 'negation modifier',
'nmod': 'modifier of nominal',
'nn': 'noun compound modifier',
'npadvmod': 'noun phrase as adverbial modifier',
'nsubj': 'nominal subject',
'nsubjpass': 'nominal subject (passive)',
'num': 'number modifier',
'number': 'number compound modifier',
'oprd': 'object predicate',
'obj': 'object',
'obl': 'oblique nominal',
'parataxis': 'parataxis',
'partmod': 'participal modifier',
'pcomp': 'complement of preposition',
'pobj': 'object of preposition',
'poss': 'possession modifier',
'possessive': 'possessive modifier',
'preconj': 'pre-correlative conjunction',
'prep': 'prepositional modifier',
'prt': 'particle',
'punct': 'punctuation',
'quantmod': 'modifier of quantifier',
'rcmod': 'relative clause modifier',
'root': 'root',
'xcomp': 'open clausal complement',
"acomp": "adjectival complement",
"advcl": "adverbial clause modifier",
"advmod": "adverbial modifier",
"agent": "agent",
"amod": "adjectival modifier",
"appos": "appositional modifier",
"attr": "attribute",
"aux": "auxiliary",
"auxpass": "auxiliary (passive)",
"cc": "coordinating conjunction",
"ccomp": "clausal complement",
"complm": "complementizer",
"conj": "conjunct",
"cop": "copula",
"csubj": "clausal subject",
"csubjpass": "clausal subject (passive)",
"dep": "unclassified dependent",
"det": "determiner",
"dobj": "direct object",
"expl": "expletive",
"hmod": "modifier in hyphenation",
"hyph": "hyphen",
"infmod": "infinitival modifier",
"intj": "interjection",
"iobj": "indirect object",
"mark": "marker",
"meta": "meta modifier",
"neg": "negation modifier",
"nmod": "modifier of nominal",
"nn": "noun compound modifier",
"npadvmod": "noun phrase as adverbial modifier",
"nsubj": "nominal subject",
"nsubjpass": "nominal subject (passive)",
"num": "number modifier",
"number": "number compound modifier",
"oprd": "object predicate",
"obj": "object",
"obl": "oblique nominal",
"parataxis": "parataxis",
"partmod": "participal modifier",
"pcomp": "complement of preposition",
"pobj": "object of preposition",
"poss": "possession modifier",
"possessive": "possessive modifier",
"preconj": "pre-correlative conjunction",
"prep": "prepositional modifier",
"prt": "particle",
"punct": "punctuation",
"quantmod": "modifier of quantifier",
"rcmod": "relative clause modifier",
"root": "root",
"xcomp": "open clausal complement",
# Dependency labels (German)
# TIGER Treebank
# http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
# currently missing: 'cc' (comparative complement) because of conflict
# with English labels
'ac': 'adpositional case marker',
'adc': 'adjective component',
'ag': 'genitive attribute',
'ams': 'measure argument of adjective',
'app': 'apposition',
'avc': 'adverbial phrase component',
'cd': 'coordinating conjunction',
'cj': 'conjunct',
'cm': 'comparative conjunction',
'cp': 'complementizer',
'cvc': 'collocational verb construction',
'da': 'dative',
'dh': 'discourse-level head',
'dm': 'discourse marker',
'ep': 'expletive es',
'hd': 'head',
'ju': 'junctor',
'mnr': 'postnominal modifier',
'mo': 'modifier',
'ng': 'negation',
'nk': 'noun kernel element',
'nmc': 'numerical component',
'oa': 'accusative object',
'oc': 'clausal object',
'og': 'genitive object',
'op': 'prepositional object',
'par': 'parenthetical element',
'pd': 'predicate',
'pg': 'phrasal genitive',
'ph': 'placeholder',
'pm': 'morphological particle',
'pnc': 'proper noun component',
'rc': 'relative clause',
're': 'repeated element',
'rs': 'reported speech',
'sb': 'subject',
"ac": "adpositional case marker",
"adc": "adjective component",
"ag": "genitive attribute",
"ams": "measure argument of adjective",
"app": "apposition",
"avc": "adverbial phrase component",
"cd": "coordinating conjunction",
"cj": "conjunct",
"cm": "comparative conjunction",
"cp": "complementizer",
"cvc": "collocational verb construction",
"da": "dative",
"dh": "discourse-level head",
"dm": "discourse marker",
"ep": "expletive es",
"hd": "head",
"ju": "junctor",
"mnr": "postnominal modifier",
"mo": "modifier",
"ng": "negation",
"nk": "noun kernel element",
"nmc": "numerical component",
"oa": "accusative object",
"oc": "clausal object",
"og": "genitive object",
"op": "prepositional object",
"par": "parenthetical element",
"pd": "predicate",
"pg": "phrasal genitive",
"ph": "placeholder",
"pm": "morphological particle",
"pnc": "proper noun component",
"rc": "relative clause",
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
# Named Entity Recognition
# OntoNotes 5
# https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
'PERSON': 'People, including fictional',
'NORP': 'Nationalities or religious or political groups',
'FACILITY': 'Buildings, airports, highways, bridges, etc.',
'FAC': 'Buildings, airports, highways, bridges, etc.',
'ORG': 'Companies, agencies, institutions, etc.',
'GPE': 'Countries, cities, states',
'LOC': 'Non-GPE locations, mountain ranges, bodies of water',
'PRODUCT': 'Objects, vehicles, foods, etc. (not services)',
'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
'WORK_OF_ART': 'Titles of books, songs, etc.',
'LAW': 'Named documents made into laws.',
'LANGUAGE': 'Any named language',
'DATE': 'Absolute or relative dates or periods',
'TIME': 'Times smaller than a day',
'PERCENT': 'Percentage, including "%"',
'MONEY': 'Monetary values, including unit',
'QUANTITY': 'Measurements, as of weight or distance',
'ORDINAL': '"first", "second", etc.',
'CARDINAL': 'Numerals that do not fall under another type',
"PERSON": "People, including fictional",
"NORP": "Nationalities or religious or political groups",
"FACILITY": "Buildings, airports, highways, bridges, etc.",
"FAC": "Buildings, airports, highways, bridges, etc.",
"ORG": "Companies, agencies, institutions, etc.",
"GPE": "Countries, cities, states",
"LOC": "Non-GPE locations, mountain ranges, bodies of water",
"PRODUCT": "Objects, vehicles, foods, etc. (not services)",
"EVENT": "Named hurricanes, battles, wars, sports events, etc.",
"WORK_OF_ART": "Titles of books, songs, etc.",
"LAW": "Named documents made into laws.",
"LANGUAGE": "Any named language",
"DATE": "Absolute or relative dates or periods",
"TIME": "Times smaller than a day",
"PERCENT": 'Percentage, including "%"',
"MONEY": "Monetary values, including unit",
"QUANTITY": "Measurements, as of weight or distance",
"ORDINAL": '"first", "second", etc.',
"CARDINAL": "Numerals that do not fall under another type",
# Named Entity Recognition
# Wikipedia
# http://www.sciencedirect.com/science/article/pii/S0004370212000276
# https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
'PER': 'Named person or family.',
'MISC': ('Miscellaneous entities, e.g. events, nationalities, '
'products or works of art'),
"PER": "Named person or family.",
"MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
}
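
This glossary backs spaCy's `spacy.explain` helper: given a tag, dependency label or entity type, it returns the human-readable description listed above. A minimal sketch of the lookup:

```python
import spacy

# Look up descriptions from the glossary above; unknown terms return None.
print(spacy.explain("NN"))     # 'noun, singular or mass'
print(spacy.explain("acomp"))  # 'adjectival complement'
print(spacy.explain("GPE"))    # 'Countries, cities, states'
```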

View File

@ -16,16 +16,18 @@ from ...util import update_exc, add_lookups
class ArabicDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'ar'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
lex_attr_getters[LANG] = lambda text: "ar"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
class Arabic(Language):
lang = 'ar'
lang = "ar"
Defaults = ArabicDefaults
__all__ = ['Arabic']
__all__ = ["Arabic"]
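
The `Defaults` subclass above is how the language wires its tokenizer exceptions, stop words and suffixes into the shared `Language` base class. A minimal sketch of loading it (no statistical model needed; the example sentence is taken from the language's own examples file):

```python
from spacy.lang.ar import Arabic

nlp = Arabic()  # equivalent to spacy.blank("ar")
doc = nlp("هل بالإمكان أن نلتقي غدا؟")
print([token.text for token in doc])
```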

View File

@ -10,11 +10,11 @@ Example sentences to test spaCy and its language models.
sentences = [
"نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
"أين تقع دمشق ؟"
"أين تقع دمشق ؟",
"كيف حالك ؟",
"هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
"هل بالإمكان أن نلتقي غدا؟",
"هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم",
]
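
The trailing commas added here fix a real bug, not just style: without them, Python concatenates adjacent string literals, so two example sentences silently merge into one list item. A quick illustration:

```python
sentences = [
    "first sentence"
    "second sentence",  # missing comma above -> one concatenated string
    "third sentence",
]
print(len(sentences))  # 2, not 3
```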

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = set("""
_num_words = set(
"""
صفر
واحد
إثنان
@ -52,9 +53,11 @@ _num_words = set("""
مليون
مليار
مليارات
""".split())
""".split()
)
_ordinal_words = set("""
_ordinal_words = set(
"""
اول
أول
حاد
@ -69,20 +72,21 @@ _ordinal_words = set("""
ثامن
تاسع
عاشر
""".split())
""".split()
)
def like_num(text):
"""
check if text resembles a number
Check if text resembles a number
"""
if text.startswith(('+', '-', '±', '~')):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(',', '').replace('.', '')
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
@ -92,6 +96,4 @@ def like_num(text):
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
LEX_ATTRS = {LIKE_NUM: like_num}
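
`LEX_ATTRS` replaces the default lexical attribute getters, so the customised `like_num` above ends up exposed on every token as the `like_num` flag. A rough usage sketch (the sentence fragment is taken from the examples file):

```python
from spacy.lang.ar import Arabic

nlp = Arabic()
doc = nlp("هناك نحو 382 مليون شخص")
# Both the digits and the Arabic number words should be flagged.
print([(token.text, token.like_num) for token in doc])
```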

View File

@ -1,15 +1,20 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..char_classes import UNITS, ALPHA_UPPER
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
[r'(?<=[0-9])\+',
# Arabic is written from Right-To-Left
r'(?<=[0-9])(?:{})'.format(CURRENCY),
r'(?<=[0-9])(?:{})'.format(UNITS),
r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
_suffixes = (
LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ [
r"(?<=[0-9])\+",
# Arabic is written from Right-To-Left
r"(?<=[0-9])(?:{})".format(CURRENCY),
r"(?<=[0-9])(?:{})".format(UNITS),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
TOKENIZER_SUFFIXES = _suffixes
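
The suffix list above isn't used directly: it's compiled into a single regex whose `.search` method the tokenizer calls to peel trailing punctuation, currency symbols and units off a token. A rough sketch with a small illustrative subset of the rules:

```python
from spacy.util import compile_suffix_regex

suffixes = [r"(?<=[0-9])\+", r"(?<=[0-9])(?:€|\$)"]  # illustrative subset
suffix_regex = compile_suffix_regex(suffixes)
print(suffix_regex.search("100€").group())  # '€'
```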

View File

@ -1,7 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
STOP_WORDS = set(
"""
من
نحو
لعل
@ -388,4 +389,5 @@ STOP_WORDS = set("""
وإن
ولو
يا
""".split())
""".split()
)
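
The stop list itself is plain data; it's consumed via the language defaults and surfaces as the `is_stop` flag on lexemes and tokens. A minimal sketch (the looked-up words are illustrative):

```python
from spacy.lang.ar import Arabic

nlp = Arabic()
print(nlp.vocab["من"].is_stop)    # True: listed above
print(nlp.vocab["دمشق"].is_stop)  # False: not a stop word
```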

View File

@ -1,21 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
import re
from ...symbols import ORTH, LEMMA
_exc = {}
# time
# Time
for exc_data in [
{LEMMA: "قبل الميلاد", ORTH: "ق.م"},
{LEMMA: "بعد الميلاد", ORTH: "ب. م"},
{LEMMA: "ميلادي", ORTH: ""},
{LEMMA: "هجري", ORTH: ".هـ"},
{LEMMA: "توفي", ORTH: ""}]:
{LEMMA: "توفي", ORTH: ""},
]:
_exc[exc_data[ORTH]] = [exc_data]
# scientific abv.
# Scientific abv.
for exc_data in [
{LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
{LEMMA: "الشارح", ORTH: "الشـ"},
@ -28,20 +30,20 @@ for exc_data in [
{LEMMA: "أنبأنا", ORTH: "أنا"},
{LEMMA: "أخبرنا", ORTH: "نا"},
{LEMMA: "مصدر سابق", ORTH: "م. س"},
{LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
{LEMMA: "مصدر نفسه", ORTH: "م. ن"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# other abv.
# Other abv.
for exc_data in [
{LEMMA: "دكتور", ORTH: "د."},
{LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
{LEMMA: "أستاذ", ORTH: "أ."},
{LEMMA: "بروفيسور", ORTH: "ب."}]:
{LEMMA: "بروفيسور", ORTH: "ب."},
]:
_exc[exc_data[ORTH]] = [exc_data]
for exc_data in [
{LEMMA: "تلفون", ORTH: "ت."},
{LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
for exc_data in [{LEMMA: "تلفون", ORTH: "ت."}, {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
_exc[exc_data[ORTH]] = [exc_data]
TOKENIZER_EXCEPTIONS = _exc
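
Each exception maps a surface string to the list of tokens it should be split into, with every token described as a dict of attributes; `update_exc` then merges these on top of the shared base exceptions. A rough sketch of the format, reusing one entry from above:

```python
from spacy.symbols import ORTH, LEMMA
from spacy.util import update_exc
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS

# "د." stays a single token but gets a lemma attached.
_exc = {"د.": [{ORTH: "د.", LEMMA: "دكتور"}]}
exceptions = update_exc(BASE_EXCEPTIONS, _exc)
```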

View File

@ -15,7 +15,7 @@ from ...util import update_exc
class BengaliDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'bn'
lex_attr_getters[LANG] = lambda text: "bn"
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
@ -26,8 +26,8 @@ class BengaliDefaults(Language.Defaults):
class Bengali(Language):
lang = 'bn'
lang = "bn"
Defaults = BengaliDefaults
__all__ = ['Bengali']
__all__ = ["Bengali"]

View File

@ -13,11 +13,9 @@ LEMMA_RULES = {
["গাছা", ""],
["গাছি", ""],
["ছড়া", ""],
["কে", ""],
["", ""],
["তে", ""],
["", ""],
["রা", ""],
["রে", ""],
@ -28,7 +26,6 @@ LEMMA_RULES = {
["গুলা", ""],
["গুলো", ""],
["গুলি", ""],
["কুল", ""],
["গণ", ""],
["দল", ""],
@ -45,7 +42,6 @@ LEMMA_RULES = {
["সকল", ""],
["মহল", ""],
["াবলি", ""], # আবলি
# Bengali digit representations
["০", "0"],
["১", "1"],
@ -58,11 +54,5 @@ LEMMA_RULES = {
["৮", "8"],
["৯", "9"],
],
"punct": [
["", "\""],
["", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
"punct": [["", '"'], ["", '"'], ["\u2018", "'"], ["\u2019", "'"]],
}
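
These pairs feed the rule-based lemmatizer: each entry is a suffix to strip and a replacement to append when a word ends in that suffix. Conceptually the lookup works like this (a simplified sketch, not the Lemmatizer's exact code):

```python
def apply_suffix_rules(word, rules):
    # Try each [suffix, replacement] pair in order; first match wins.
    for suffix, replacement in rules:
        if suffix and word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word
```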

View File

@ -5,64 +5,253 @@ from ...symbols import LEMMA, PRON_LEMMA
MORPH_RULES = {
"PRP": {
'': {LEMMA: PRON_LEMMA, 'PronType': 'Dem'},
'আমাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Case': 'Acc'},
'কি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'সে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Nom'},
'কিসে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'তাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'স্বয়ং': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কোনগুলো': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Gender': 'Neut', 'PronType': 'Int', 'Case': 'Acc'},
'তুমি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তুই': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Acc'},
'আমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One ', 'PronType': 'Prs', 'Case': 'Nom'},
'যিনি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
'আমাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One', 'PronType': 'Prs', 'Case': 'Acc'},
'কোন': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'কারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোমাকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোকে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'খোদ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'যারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Rel', 'Case': 'Nom'},
'যে': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Rel', 'Case': 'Nom'},
'তোমরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোরা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Nom'},
'তোমাদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'তোদেরকে': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Case': 'Acc'},
'আপন': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'': {LEMMA: PRON_LEMMA, 'PronType': 'Dem'},
'নিজ': {LEMMA: PRON_LEMMA, 'Reflex': 'Yes', 'PronType': 'Ref'},
'কার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'PronType': 'Int', 'Case': 'Acc'},
'যা': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Gender': 'Neut', 'PronType': 'Rel', 'Case': 'Nom'},
'তারা': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Case': 'Nom'},
'আমি': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Case': 'Nom'}
"PRP": {
"": {LEMMA: PRON_LEMMA, "PronType": "Dem"},
"আমাকে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "One",
"PronType": "Prs",
"Case": "Acc",
},
"কি": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Gender": "Neut",
"PronType": "Int",
"Case": "Acc",
},
"সে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Three",
"PronType": "Prs",
"Case": "Nom",
},
"কিসে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Gender": "Neut",
"PronType": "Int",
"Case": "Acc",
},
"তাকে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Three",
"PronType": "Prs",
"Case": "Acc",
},
"স্বয়ং": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"},
"কোনগুলো": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Gender": "Neut",
"PronType": "Int",
"Case": "Acc",
},
"তুমি": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Case": "Nom",
},
"তুই": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Case": "Nom",
},
"তাদেরকে": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Three",
"PronType": "Prs",
"Case": "Acc",
},
"আমরা": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "One ",
"PronType": "Prs",
"Case": "Nom",
},
"যিনি": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Rel", "Case": "Nom"},
"আমাদেরকে": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "One",
"PronType": "Prs",
"Case": "Acc",
},
"কোন": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"},
"কারা": {LEMMA: PRON_LEMMA, "Number": "Plur", "PronType": "Int", "Case": "Acc"},
"তোমাকে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Case": "Acc",
},
"তোকে": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Case": "Acc",
},
"খোদ": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"},
"কে": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"},
"যারা": {LEMMA: PRON_LEMMA, "Number": "Plur", "PronType": "Rel", "Case": "Nom"},
"যে": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Rel", "Case": "Nom"},
"তোমরা": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Case": "Nom",
},
"তোরা": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Case": "Nom",
},
"তোমাদেরকে": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Case": "Acc",
},
"তোদেরকে": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Case": "Acc",
},
"আপন": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"},
"": {LEMMA: PRON_LEMMA, "PronType": "Dem"},
"নিজ": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"},
"কার": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"},
"যা": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Gender": "Neut",
"PronType": "Rel",
"Case": "Nom",
},
"তারা": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Three",
"PronType": "Prs",
"Case": "Nom",
},
"আমি": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "One",
"PronType": "Prs",
"Case": "Nom",
},
},
"PRP$": {
'আমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'মোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোমাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'আমাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'One', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোমার': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তোর': {LEMMA: PRON_LEMMA, 'Number': 'Sing', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'তাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Three', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'কাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
'তোদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'Person': 'Two', 'PronType': 'Prs', 'Poss': 'Yes',
'Case': 'Nom'},
'যাদের': {LEMMA: PRON_LEMMA, 'Number': 'Plur', 'PronType': 'Int', 'Case': 'Acc'},
}
"আমার": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "One",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"মোর": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "One",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"মোদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "One",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"তার": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Three",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"তোমাদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"আমাদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "One",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"তোমার": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"তোর": {
LEMMA: PRON_LEMMA,
"Number": "Sing",
"Person": "Two",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"তাদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Three",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"কাদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"PronType": "Int",
"Case": "Acc",
},
"তোদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"Person": "Two",
"PronType": "Prs",
"Poss": "Yes",
"Case": "Nom",
},
"যাদের": {
LEMMA: PRON_LEMMA,
"Number": "Plur",
"PronType": "Int",
"Case": "Acc",
},
},
}
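
Morph rules are keyed first by the fine-grained tag and then by the exact token text; when both match, the listed lemma and morphological features are assigned to that token. The lookup is essentially a nested dict access (simplified sketch):

```python
def lookup_morph(tag, text, morph_rules):
    # e.g. lookup_morph("PRP", "আমি", MORPH_RULES) -> lemma + feature dict
    return morph_rules.get(tag, {}).get(text, {})
```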

View File

@ -2,29 +2,45 @@
from __future__ import unicode_literals
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS
from ..char_classes import ALPHA_LOWER, ALPHA_UPPER, ALPHA, HYPHENS, QUOTES, UNITS
from ..char_classes import ALPHA_LOWER, ALPHA, HYPHENS, QUOTES, UNITS
_currency = r"\$|¢|£|€|¥|฿|৳"
_quotes = QUOTES.replace("'", '')
_list_punct = LIST_PUNCT + '। ॥'.strip().split()
_quotes = QUOTES.replace("'", "")
_list_punct = LIST_PUNCT + "। ॥".strip().split()
_prefixes = ([r'\+'] + _list_punct + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS)
_prefixes = [r"\+"] + _list_punct + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS
_suffixes = (_list_punct + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
[r'(?<=[0-9])\+',
r'(?<=°[FfCcKk])\.',
r'(?<=[0-9])(?:{})'.format(_currency),
r'(?<=[0-9])(?:{})'.format(UNITS),
r'(?<=[{}(?:{})])\.'.format('|'.join([ALPHA_LOWER, r'%²\-\)\]\+', QUOTES]), _currency)])
_suffixes = (
_list_punct
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_ICONS
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{})".format(_currency),
r"(?<=[0-9])(?:{})".format(UNITS),
r"(?<=[{}(?:{})])\.".format(
"|".join([ALPHA_LOWER, r"%²\-\)\]\+", QUOTES]), _currency
),
]
)
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[0-9{zero}-{nine}])[+\-\*^=](?=[0-9{zero}-{nine}-])'.format(zero=u'০', nine=u'৯'),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])[{h}](?={ae})'.format(a=ALPHA, h=HYPHENS, ae=u''),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA)])
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9{zero}-{nine}])[+\-\*^=](?=[0-9{zero}-{nine}-])".format(
zero="০", nine="৯"
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])[{h}](?={ae})".format(a=ALPHA, h=HYPHENS, ae=""),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA),
]
)
TOKENIZER_PREFIXES = _prefixes

View File

@ -2,43 +2,45 @@
from __future__ import unicode_literals
STOP_WORDS = set("""
STOP_WORDS = set(
"""
অতএব অথচ অথব অন অন অন অন অনতত অবধি অবশ অর অন অন অরধভ
আগ আগ আগ আছ আজ আদযভ আপন আপনি আব আমর আম আম আম আমি আর আরও
ইতি ইহ
আগ আগ আগ আছ আজ আদযভ আপন আপনি আব আমর আম আম আম আমি আর আরও
ইতি ইহ
উচি উনি উপর উপর উততর
এই এক একই একজন একট একটি একব এক এখন এখনও এখ এখ এট এস
এট এটি এত এতট এত এদ এব এব এমন এমনি এমনকি এর এর এল এস এস
ওই ওক ওখ ওদ ওর ওর
এট এটি এত এতট এত এদ এব এব এমন এমনি এমনকি এর এর এল এস এস
ওই ওক ওখ ওদ ওর ওর
কখনও কত কথ কব কয কযকটি করছ করছ করত করব করব করল কয় কয়কটি করি করি কর
করল কর কর কর কর করি করি করি করি কর কর করি কর কর কর উক
করল কর কর কর কর করি করি করি করি কর কর করি কর কর কর উক
রও রণ ি ি ি ি ি উই নও মন ি
ি ি ি ি ি
চল
চল
ি ি
জন জনক জন জন জন নত ি ি ি ি
ি
ি
তখন তত তথ তব তব রপর রই হল িনই
ি
ি
তখন তত তথ তব তব রপর রই হল িনই
িি িি ি মন
কব কব
ি ি ি ি ি ি ি ি ওয ওয খত
ি ি ি ি ি ি ি ি ওয ওয খত
ি ি ওয় ওয় ি
ধর ধর
ধর ধর
নয ি ি ি ি ি ি ি ি ওয় ওয নয় নত
পক পর পর পর পর পরযন ওয ি রতি রভি ওয় রথম থমি
ফল ি
ফল ি
বছর বদল বর বলত বলল বলল বল বল বল বল বস বহ ি িি ি িষযি যবহ বকতব বন ি
মত মত মত মধযভ মধ মধ মধ মন যম
যখন যত যতট যথ যদি যদি ওয ওয িি
মন
রকম রয রয়
লক
রণ মন সঙ সঙ সব সব সমস সমরতি সময় সহ সহি তর ি পষ বয
মত মত মত মধযভ মধ মধ মধ মন যম
যখন যত যতট যথ যদি যদি ওয ওয িি
মন
রকম রয রয়
লক
রণ মন সঙ সঙ সব সব সমস সমরতি সময় সহ সহি তর ি পষ বয
হইত হইব হইয হওয হওয হওয হচ হত হত হত হন হব হব হয হয হযি হয হয হযি হয
হয় হল হল হল হল হল ি ি হয় হয় হয় হইয় হয়ি হয় হয়নি হয় হয়ত হওয় হওয় হওয়
""".split())
""".split()
)

View File

@ -6,72 +6,77 @@ from ...symbols import CCONJ, NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SYM
TAG_MAP = {
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"SP": {POS: SPACE},
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {
POS: VERB,
"VerbForm": "fin",
"Tense": "pres",
"Number": "sing",
"Person": 3,
},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"SP": {POS: SPACE},
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
}

View File

@ -19,7 +19,8 @@ for exc_data in [
{ORTH: "কি.মি", LEMMA: "কিলোমিটার"},
{ORTH: "সে.মি.", LEMMA: "সেন্টিমিটার"},
{ORTH: "সে.মি", LEMMA: "সেন্টিমিটার"},
{ORTH: "মি.লি.", LEMMA: "মিলিলিটার"}]:
{ORTH: "মি.লি.", LEMMA: "মিলিলিটার"},
]:
_exc[exc_data[ORTH]] = [exc_data]

View File

@ -4,13 +4,6 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
# uncomment if files are available
# from .norm_exceptions import NORM_EXCEPTIONS
# from .tag_map import TAG_MAP
# from .morph_rules import MORPH_RULES
# uncomment if lookup-based lemmatizer is available
from .lemmatizer import LOOKUP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
@ -19,46 +12,22 @@ from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
# Create a Language subclass
# Documentation: https://spacy.io/docs/usage/adding-languages
# This file should be placed in spacy/lang/ca (ISO code of language).
# Before submitting a pull request, make sure the remove all comments from the
# language data files, and run at least the basic tokenizer tests. Simply add the
# language ID to the list of languages in spacy/tests/conftest.py to include it
# in the basic tokenizer sanity tests. You can optionally add a fixture for the
# language's tokenizer and add more specific tests. For more info, see the
# tests documentation: https://github.com/explosion/spaCy/tree/master/spacy/tests
class CatalanDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'ca' # ISO code
# add more norm exception dictionaries here
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
# overwrite functions for lexical attributes
lex_attr_getters[LANG] = lambda text: "ca"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
lex_attr_getters.update(LEX_ATTRS)
# add custom tokenizer exceptions to base exceptions
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# add stop words
stop_words = STOP_WORDS
# if available: add tag map
# tag_map = dict(TAG_MAP)
# if available: add morph rules
# morph_rules = dict(MORPH_RULES)
lemma_lookup = LOOKUP
class Catalan(Language):
lang = 'ca' # ISO code
Defaults = CatalanDefaults # set Defaults to custom language defaults
lang = "ca"
Defaults = CatalanDefaults
# set default export this allows the language class to be lazy-loaded
__all__ = ['Catalan']
__all__ = ["Catalan"]
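
The boilerplate comments removed above point at the test setup for new languages: add the language ID to `spacy/tests/conftest.py` and provide a tokenizer fixture. A rough sketch of such a fixture (the fixture name is illustrative):

```python
import pytest
from spacy.util import get_lang_class

@pytest.fixture(scope="session")
def ca_tokenizer():
    # Build just the tokenizer from the language defaults, no pipeline needed.
    return get_lang_class("ca").Defaults.create_tokenizer()
```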

View File

@ -5,7 +5,7 @@ from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.es.examples import sentences
>>> from spacy.lang.ca.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

View File

@ -1,33 +1,57 @@
# coding: utf8
from __future__ import unicode_literals
# import the symbols for the attrs you want to overwrite
from ...attrs import LIKE_NUM
# Overwriting functions for lexical attributes
# Documentation: https://localhost:1234/docs/usage/adding-languages#lex-attrs
# Most of these functions, like is_lower or like_url should be language-
# independent. Others, like like_num (which includes both digits and number
# words), requires customisation.
# Example: check if token resembles a number
_num_words = ['zero', 'un', 'dos', 'tres', 'quatre', 'cinc', 'sis', 'set',
'vuit', 'nou', 'deu', 'onze', 'dotze', 'tretze', 'catorze',
'quinze', 'setze', 'disset', 'divuit', 'dinou', 'vint',
'trenta', 'quaranta', 'cinquanta', 'seixanta', 'setanta', 'vuitanta', 'noranta',
'cent', 'mil', 'milió', 'bilió', 'trilió', 'quatrilió',
'gazilió', 'bazilió']
_num_words = [
"zero",
"un",
"dos",
"tres",
"quatre",
"cinc",
"sis",
"set",
"vuit",
"nou",
"deu",
"onze",
"dotze",
"tretze",
"catorze",
"quinze",
"setze",
"disset",
"divuit",
"dinou",
"vint",
"trenta",
"quaranta",
"cinquanta",
"seixanta",
"setanta",
"vuitanta",
"noranta",
"cent",
"mil",
"milió",
"bilió",
"trilió",
"quatrilió",
"gazilió",
"bazilió",
]
def like_num(text):
text = text.replace(',', '').replace('.', '')
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
@ -35,9 +59,4 @@ def like_num(text):
return False
# Create dictionary of functions to overwrite. The default lex_attr_getters are
# updated with this one, so only the functions defined here are overwritten.
LEX_ATTRS = {
LIKE_NUM: like_num
}
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -2,9 +2,8 @@
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
STOP_WORDS = set(
"""
a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò
als altra altre altres amb ambdues ambdós anar ans apa aquell aquella aquelles aquells
aquest aquesta aquestes aquests aquí
@ -53,4 +52,5 @@ un una unes uns us últim ús
va vaig vam van vas veu vosaltres vostra vostre vostres
""".split())
""".split()
)

View File

@ -5,32 +5,24 @@ from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE},
}
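
As the removed comments explain, every key in the tag map must carry a `POS` entry whose value is a Universal Dependencies tag, and may add morphological features on top. A minimal custom entry might look like this (the "NMS" tag is made up for illustration):

```python
from spacy.symbols import POS, NOUN

TAG_MAP = {
    # fine-grained tag -> UD part-of-speech plus optional features
    "NMS": {POS: NOUN, "Number": "sing", "Gender": "masc"},
}
```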

View File

@ -1,8 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
# import symbols if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
from ...symbols import ORTH, LEMMA
_exc = {}
@ -25,27 +24,18 @@ for exc_data in [
{ORTH: "Srta.", LEMMA: "senyoreta"},
{ORTH: "núm", LEMMA: "número"},
{ORTH: "St.", LEMMA: "sant"},
{ORTH: "Sta.", LEMMA: "santa"}]:
{ORTH: "Sta.", LEMMA: "santa"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# Times
_exc["12m."] = [
{ORTH: "12"},
{ORTH: "m.", LEMMA: "p.m."}]
_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}]
for h in range(1, 12 + 1):
for period in ["a.m.", "am"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "a.m."}]
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}]
for period in ["p.m.", "pm"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "p.m."}]
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}]
# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:
TOKENIZER_EXCEPTIONS = _exc

View File

@ -4,23 +4,23 @@ from __future__ import unicode_literals
import regex as re
re.DEFAULT_VERSION = re.VERSION1
merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
split_chars = lambda char: list(char.strip().split(' '))
merge_chars = lambda char: char.strip().replace(' ', '|')
merge_char_classes = lambda classes: "[{}]".format("||".join(classes))
split_chars = lambda char: list(char.strip().split(" "))
merge_chars = lambda char: char.strip().replace(" ", "|")
_bengali = r'[\p{L}&&\p{Bengali}]'
_hebrew = r'[\p{L}&&\p{Hebrew}]'
_latin_lower = r'[\p{Ll}&&\p{Latin}]'
_latin_upper = r'[\p{Lu}&&\p{Latin}]'
_latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
_persian = r'[\p{L}&&\p{Arabic}]'
_russian_lower = r'[ёа-я]'
_russian_upper = r'[ЁА-Я]'
_sinhala = r'[\p{L}&&\p{Sinhala}]'
_tatar_lower = r'[әөүҗңһ]'
_tatar_upper = r'[ӘӨҮҖҢҺ]'
_greek_lower = r'[α-ωάέίόώήύ]'
_greek_upper = r'[Α-ΩΆΈΊΌΏΉΎ]'
_bengali = r"[\p{L}&&\p{Bengali}]"
_hebrew = r"[\p{L}&&\p{Hebrew}]"
_latin_lower = r"[\p{Ll}&&\p{Latin}]"
_latin_upper = r"[\p{Lu}&&\p{Latin}]"
_latin = r"[[\p{Ll}||\p{Lu}]&&\p{Latin}]"
_persian = r"[\p{L}&&\p{Arabic}]"
_russian_lower = r"[ёа-я]"
_russian_upper = r"[ЁА-Я]"
_sinhala = r"[\p{L}&&\p{Sinhala}]"
_tatar_lower = r"[әөүҗңһ]"
_tatar_upper = r"[ӘӨҮҖҢҺ]"
_greek_lower = r"[α-ωάέίόώήύ]"
_greek_upper = r"[Α-ΩΆΈΊΌΏΉΎ]"
_upper = [_latin_upper, _russian_upper, _tatar_upper, _greek_upper]
_lower = [_latin_lower, _russian_lower, _tatar_lower, _greek_lower]
@ -30,23 +30,27 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
ALPHA_LOWER = merge_char_classes(_lower + _uncased)
ALPHA_UPPER = merge_char_classes(_upper + _uncased)
_units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб'
'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'
_units = (
"km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft "
"kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb "
"TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм "
"кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб"
"كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب"
)
_currency = r"\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼"
# These expressions contain various unicode variations, including characters
# used in Chinese (see #1333, #1340, #1351) unless there are cross-language
# conflicts, spaCy's base tokenizer should handle all of those by default
_punct = r'… …… , : ; \! \? ¿ ؟ ¡ \( \) \[ \] \{ \} < > _ # \* & 。 · । ، ؛ ٪'
_punct = (
r"… …… , : ; \! \? ¿ ؟ ¡ \( \) \[ \] \{ \} < > _ # \* & 。 · । ، ؛ ٪"
)
_quotes = r'\' \'\' " ” “ `` ` ´ , „ » « 「 」 『 』 【 】 《 》 〈 〉'
_hyphens = '- — -- --- —— ~'
_hyphens = "- — -- --- —— ~"
# Various symbols like dingbats, but also emoji
# Details: https://www.compart.com/en/unicode/category/So
_other_symbols = r'[\p{So}]'
_other_symbols = r"[\p{So}]"
UNITS = merge_chars(_units)
CURRENCY = merge_chars(_currency)
@ -60,5 +64,5 @@ LIST_CURRENCY = split_chars(_currency)
LIST_QUOTES = split_chars(_quotes)
LIST_PUNCT = split_chars(_punct)
LIST_HYPHENS = split_chars(_hyphens)
LIST_ELLIPSES = [r'\.\.+', '']
LIST_ELLIPSES = [r"\.\.+", ""]
LIST_ICONS = [_other_symbols]
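
The lambdas at the top of this file are what turn the character lists into regex building blocks: `split_chars` yields individual entries for the affix lists, `merge_chars` builds alternations for `(?:...)` groups, and `merge_char_classes` wraps classes into one outer `[...||...]` class. A quick sketch of their output:

```python
split_chars = lambda char: list(char.strip().split(" "))
merge_chars = lambda char: char.strip().replace(" ", "|")

print(split_chars("… , :"))  # ['…', ',', ':'] -> entries for prefix/suffix lists
print(merge_chars("$ £ €"))  # '$|£|€'         -> interpolated into (?:...) groups
```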

View File

@ -20,9 +20,10 @@ from ...util import update_exc, add_lookups
class DanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'da'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
BASE_NORMS, NORM_EXCEPTIONS)
lex_attr_getters[LANG] = lambda text: "da"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
morph_rules = MORPH_RULES
infixes = TOKENIZER_INFIXES
@ -33,8 +34,8 @@ class DanishDefaults(Language.Defaults):
class Danish(Language):
lang = 'da'
lang = "da"
Defaults = DanishDefaults
__all__ = ['Danish']
__all__ = ["Danish"]

View File

@ -14,5 +14,5 @@ sentences = [
"Apple overvejer at købe et britisk startup for 1 milliard dollar",
"Selvkørende biler flytter forsikringsansvaret over på producenterne",
"San Francisco overvejer at forbyde udbringningsrobotter på fortov",
"London er en stor by i Storbritannien"
"London er en stor by i Storbritannien",
]

View File

@ -3,8 +3,8 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM
# Source http://fjern-uv.dk/tal.php
# Source http://fjern-uv.dk/tal.php
_num_words = """nul
en et to tre fire fem seks syv otte ni ti
elleve tolv tretten fjorten femten seksten sytten atten nitten tyve
@ -19,8 +19,8 @@ enoghalvfems tooghalvfems treoghalvfems fireoghalvfems femoghalvfems seksoghalvf
million milliard billion billiard trillion trilliard
""".split()
# source http://www.duda.dk/video/dansk/grammatik/talord/talord.html
# Source: http://www.duda.dk/video/dansk/grammatik/talord/talord.html
_ordinal_words = """nulte
første anden tredje fjerde femte sjette syvende ottende niende tiende
elfte tolvte trettende fjortende femtende sekstende syttende attende nittende tyvende
@ -33,14 +33,15 @@ enogfirsindstyvende toogfirsindstyvende treogfirsindstyvende fireogfirsindstyven
enoghalvfemsindstyvende tooghalvfemsindstyvende treoghalvfemsindstyvende fireoghalvfemsindstyvende femoghalvfemsindstyvende seksoghalvfemsindstyvende syvoghalvfemsindstyvende otteoghalvfemsindstyvende nioghalvfemsindstyvende
""".split()
def like_num(text):
if text.startswith(('+', '-', '±', '~')):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(',', '').replace('.', '')
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
@ -49,6 +50,5 @@ def like_num(text):
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -11,53 +11,299 @@ from ...symbols import LEMMA, PRON_LEMMA
MORPH_RULES = {
"PRON": {
"jeg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Sing|Person=1|PronType=Prs
"mig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Sing|Person=1|PronType=Prs
"min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Com"}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"mit": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Neut"}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"vor": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Com"}, # Gender=Com|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"vort": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Neut"}, # Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Sing|Person=2|PronType=Prs
"dig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
"din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Com"}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"dit": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Neut"}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs
"hun": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs
"den": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs, See note above.
"det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"}, # Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs See note above.
"ham": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs
"hende": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs
"sin": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Com", "Reflex": "Yes"}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"sit": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neut", "Reflex": "Yes"}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Plur|Person=1|PronType=Prs
"os": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Plur|Person=1|PronType=Prs
"mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"}, # Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"vore": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"}, # Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"I": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Number=Plur|Person=2|PronType=Prs
"jer": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Number=Plur|Person=2|PronType=Prs
"dine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes"}, # Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"}, # Case=Nom|Number=Plur|Person=3|PronType=Prs
"dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"}, # Case=Acc|Number=Plur|Person=3|PronType=Prs
"sine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"}, # Number=Plur|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"vores": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Poss": "Yes"}, # Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs
"De": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Nom", "Gender": "Com"}, # Case=Nom|Gender=Com|Person=2|Polite=Form|PronType=Prs
"Dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Gender": "Com"}, # Case=Acc|Gender=Com|Person=2|Polite=Form|PronType=Prs
"Deres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Poss": "Yes"}, # Person=2|Polite=Form|Poss=Yes|PronType=Prs
"jeres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Poss": "Yes"}, # Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs
"sig": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Case": "Acc", "Reflex": "Yes"}, # Case=Acc|Person=3|PronType=Prs|Reflex=Yes
"hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes"}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"hendes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes"}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"dens": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes"}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"dets": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes"}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"deres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Poss": "Yes"}, # Number[psor]=Plur|Person=3|Poss=Yes|PronType=Prs
"jeg": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Sing|Person=1|PronType=Prs
"mig": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Sing|Person=1|PronType=Prs
"min": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Com",
}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"mit": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Neut",
}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"vor": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Com",
}, # Gender=Com|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"vort": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Neut",
}, # Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"du": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Sing",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Sing|Person=2|PronType=Prs
"dig": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Sing",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs
"din": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Com",
}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"dit": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Neut",
}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"han": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs
"hun": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs
"den": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs, See note above.
"det": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Neut",
}, # Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs See note above.
"ham": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs
"hende": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs
"sin": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Com",
"Reflex": "Yes",
}, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"sit": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Poss": "Yes",
"Gender": "Neut",
"Reflex": "Yes",
}, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"vi": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Plur|Person=1|PronType=Prs
"os": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Plur|Person=1|PronType=Prs
"mine": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Poss": "Yes",
}, # Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs
"vore": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Poss": "Yes",
}, # Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form
"I": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Plur",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Number=Plur|Person=2|PronType=Prs
"jer": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Plur",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Number=Plur|Person=2|PronType=Prs
"dine": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Plur",
"Poss": "Yes",
}, # Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs
"de": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Case": "Nom",
}, # Case=Nom|Number=Plur|Person=3|PronType=Prs
"dem": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Case": "Acc",
}, # Case=Acc|Number=Plur|Person=3|PronType=Prs
"sine": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Poss": "Yes",
"Reflex": "Yes",
}, # Number=Plur|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes
"vores": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Poss": "Yes",
}, # Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs
"De": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Case": "Nom",
"Gender": "Com",
}, # Case=Nom|Gender=Com|Person=2|Polite=Form|PronType=Prs
"Dem": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Case": "Acc",
"Gender": "Com",
}, # Case=Acc|Gender=Com|Person=2|Polite=Form|PronType=Prs
"Deres": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Poss": "Yes",
}, # Person=2|Polite=Form|Poss=Yes|PronType=Prs
"jeres": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Poss": "Yes",
}, # Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs
"sig": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Case": "Acc",
"Reflex": "Yes",
}, # Case=Acc|Person=3|PronType=Prs|Reflex=Yes
"hans": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Poss": "Yes",
}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"hendes": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Poss": "Yes",
}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"dens": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Poss": "Yes",
}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"dets": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Poss": "Yes",
}, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs
"deres": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Poss": "Yes",
}, # Number[psor]=Plur|Person=3|Poss=Yes|PronType=Prs
},
"VERB": {
"er": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Pres"},
"var": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Past"}
}
"er": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Pres"},
"var": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Past"},
},
}
for tag, rules in MORPH_RULES.items():
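For readers skimming this part of the diff: `MORPH_RULES` maps a coarse-grained tag to per-form dictionaries carrying the lemma plus morphological features, with the trailing comments preserving the full UD feature strings. The snippet below is a minimal, self-contained sketch of how such an entry resolves a tagged token form — `lookup_morph` and the placeholder constants are illustrative only, not spaCy API.

```python
# Illustrative only: resolve a (tag, lower-cased form) pair against a
# MORPH_RULES-style table. LEMMA/PRON_LEMMA stand in for spaCy's symbols.
LEMMA = "lemma"
PRON_LEMMA = "-PRON-"

MORPH_RULES = {
    "PRON": {
        "jeg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing"}
    }
}

def lookup_morph(tag, form):
    """Return (lemma, morph features) for a tagged form, or None if absent."""
    attrs = MORPH_RULES.get(tag, {}).get(form)
    if attrs is None:
        return None
    attrs = dict(attrs)  # copy so the table itself is untouched
    return attrs.pop(LEMMA), attrs

print(lookup_morph("PRON", "jeg"))
# ('-PRON-', {'PronType': 'Prs', 'Person': 'One', 'Number': 'Sing'})
```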

View File

@ -516,7 +516,7 @@ _exc = {
"øjeåbner": "øjenåbner", # 1
"økonomiministerium": "økonomiministerie", # 1
"ørenring": "ørering", # 2
"øvehefte": "øvehæfte" # 1
"øvehefte": "øvehæfte", # 1
}

View File

@ -6,17 +6,26 @@ from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES
_quotes = QUOTES.replace("'", '')
_quotes = QUOTES.replace("'", "")
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])'.format(a=ALPHA, q=_quotes),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA)])
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[{}])\.(?=[{}])".format(ALPHA_LOWER, ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])".format(a=ALPHA, q=_quotes),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
]
)
_suffixes = [suffix for suffix in TOKENIZER_SUFFIXES if suffix not in ["'s", "'S", "s", "S", r"\'"]]
_suffixes = [
suffix
for suffix in TOKENIZER_SUFFIXES
if suffix not in ["'s", "'S", "s", "S", r"\'"]
]
_suffixes += [r"(?<=[^sSxXzZ])\'"]

View File

@ -3,7 +3,8 @@ from __future__ import unicode_literals
# Source: Handpicked by Jens Dahl Møllerhøj.
STOP_WORDS = set("""
STOP_WORDS = set(
"""
af aldrig alene alle allerede alligevel alt altid anden andet andre at
bag begge blandt blev blive bliver burde bør
@ -43,4 +44,5 @@ ud uden udover under undtagen
var ved vi via vil ville vore vores vær være været
øvrigt
""".split())
""".split()
)

View File

@ -15,130 +15,540 @@ _exc = {}
# (for "torsdag") are left out because they are ambiguous. The same is the case
# for abbreviations "jul." and "Jul." ("juli").
for exc_data in [
{ORTH: "Kbh.", LEMMA: "København", NORM: "København"},
{ORTH: "jan.", LEMMA: "januar"},
{ORTH: "febr.", LEMMA: "februar"},
{ORTH: "feb.", LEMMA: "februar"},
{ORTH: "mar.", LEMMA: "marts"},
{ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"},
{ORTH: "aug.", LEMMA: "august"},
{ORTH: "sept.", LEMMA: "september"},
{ORTH: "sep.", LEMMA: "september"},
{ORTH: "okt.", LEMMA: "oktober"},
{ORTH: "nov.", LEMMA: "november"},
{ORTH: "dec.", LEMMA: "december"},
{ORTH: "man.", LEMMA: "mandag"},
{ORTH: "tirs.", LEMMA: "tirsdag"},
{ORTH: "ons.", LEMMA: "onsdag"},
{ORTH: "tor.", LEMMA: "torsdag"},
{ORTH: "tors.", LEMMA: "torsdag"},
{ORTH: "fre.", LEMMA: "fredag"},
{ORTH: "lør.", LEMMA: "lørdag"},
{ORTH: "Jan.", LEMMA: "januar"},
{ORTH: "Febr.", LEMMA: "februar"},
{ORTH: "Feb.", LEMMA: "februar"},
{ORTH: "Mar.", LEMMA: "marts"},
{ORTH: "Apr.", LEMMA: "april"},
{ORTH: "Jun.", LEMMA: "juni"},
{ORTH: "Aug.", LEMMA: "august"},
{ORTH: "Sept.", LEMMA: "september"},
{ORTH: "Sep.", LEMMA: "september"},
{ORTH: "Okt.", LEMMA: "oktober"},
{ORTH: "Nov.", LEMMA: "november"},
{ORTH: "Dec.", LEMMA: "december"},
{ORTH: "Man.", LEMMA: "mandag"},
{ORTH: "Tirs.", LEMMA: "tirsdag"},
{ORTH: "Ons.", LEMMA: "onsdag"},
{ORTH: "Fre.", LEMMA: "fredag"},
{ORTH: "Lør.", LEMMA: "lørdag"}]:
{ORTH: "Kbh.", LEMMA: "København", NORM: "København"},
{ORTH: "jan.", LEMMA: "januar"},
{ORTH: "febr.", LEMMA: "februar"},
{ORTH: "feb.", LEMMA: "februar"},
{ORTH: "mar.", LEMMA: "marts"},
{ORTH: "apr.", LEMMA: "april"},
{ORTH: "jun.", LEMMA: "juni"},
{ORTH: "aug.", LEMMA: "august"},
{ORTH: "sept.", LEMMA: "september"},
{ORTH: "sep.", LEMMA: "september"},
{ORTH: "okt.", LEMMA: "oktober"},
{ORTH: "nov.", LEMMA: "november"},
{ORTH: "dec.", LEMMA: "december"},
{ORTH: "man.", LEMMA: "mandag"},
{ORTH: "tirs.", LEMMA: "tirsdag"},
{ORTH: "ons.", LEMMA: "onsdag"},
{ORTH: "tor.", LEMMA: "torsdag"},
{ORTH: "tors.", LEMMA: "torsdag"},
{ORTH: "fre.", LEMMA: "fredag"},
{ORTH: "lør.", LEMMA: "lørdag"},
{ORTH: "Jan.", LEMMA: "januar"},
{ORTH: "Febr.", LEMMA: "februar"},
{ORTH: "Feb.", LEMMA: "februar"},
{ORTH: "Mar.", LEMMA: "marts"},
{ORTH: "Apr.", LEMMA: "april"},
{ORTH: "Jun.", LEMMA: "juni"},
{ORTH: "Aug.", LEMMA: "august"},
{ORTH: "Sept.", LEMMA: "september"},
{ORTH: "Sep.", LEMMA: "september"},
{ORTH: "Okt.", LEMMA: "oktober"},
{ORTH: "Nov.", LEMMA: "november"},
{ORTH: "Dec.", LEMMA: "december"},
{ORTH: "Man.", LEMMA: "mandag"},
{ORTH: "Tirs.", LEMMA: "tirsdag"},
{ORTH: "Ons.", LEMMA: "onsdag"},
{ORTH: "Fre.", LEMMA: "fredag"},
{ORTH: "Lør.", LEMMA: "lørdag"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# Specified case only
for orth in [
"diam.", "ib.", "mia.", "mik.", "pers.", "A.D.", "A/S", "B.C.", "BK.",
"Dr.", "Boul.", "Chr.", "Dronn.", "H.K.H.", "H.M.", "Hf.", "i/s", "I/S",
"Kprs.", "L.A.", "Ll.", "m/s", "M/S", "Mag.", "Mr.", "Ndr.", "Ph.d.",
"Prs.", "Rcp.", "Sdr.", "Skt.", "Spl.", "Vg."]:
"diam.",
"ib.",
"mia.",
"mik.",
"pers.",
"A.D.",
"A/S",
"B.C.",
"BK.",
"Dr.",
"Boul.",
"Chr.",
"Dronn.",
"H.K.H.",
"H.M.",
"Hf.",
"i/s",
"I/S",
"Kprs.",
"L.A.",
"Ll.",
"m/s",
"M/S",
"Mag.",
"Mr.",
"Ndr.",
"Ph.d.",
"Prs.",
"Rcp.",
"Sdr.",
"Skt.",
"Spl.",
"Vg.",
]:
_exc[orth] = [{ORTH: orth}]
for orth in [
"aarh.", "ac.", "adj.", "adr.", "adsk.", "adv.", "afb.", "afd.", "afg.",
"afk.", "afs.", "aht.", "alg.", "alk.", "alm.", "amer.", "ang.", "ank.",
"anl.", "anv.", "arb.", "arr.", "att.", "bd.", "bdt.", "beg.", "begr.",
"beh.", "bet.", "bev.", "bhk.", "bib.", "bibl.", "bidr.", "bildl.",
"bill.", "biol.", "bk.", "bl.", "bl.a.", "borgm.", "br.", "brolægn.",
"bto.", "bygn.", "ca.", "cand.", "d.d.", "d.m.", "d.s.", "d.s.s.",
"d.y.", "d.å.", "d.æ.", "dagl.", "dat.", "dav.", "def.", "dek.", "dep.",
"desl.", "dir.", "disp.", "distr.", "div.", "dkr.", "dl.", "do.",
"dobb.", "dr.h.c", "dr.phil.", "ds.", "dvs.", "e.b.", "e.l.", "e.o.",
"e.v.t.", "eftf.", "eftm.", "egl.", "eks.", "eksam.", "ekskl.", "eksp.",
"ekspl.", "el.lign.", "emer.", "endv.", "eng.", "enk.", "etc.", "etym.",
"eur.", "evt.", "exam.", "f.eks.", "f.m.", "f.n.", "f.o.", "f.o.m.",
"f.s.v.", "f.t.", "f.v.t.", "f.å.", "fa.", "fakt.", "fam.", "ff.",
"fg.", "fhv.", "fig.", "filol.", "filos.", "fl.", "flg.", "fm.", "fmd.",
"fol.", "forb.", "foreg.", "foren.", "forf.", "fork.", "forr.", "fors.",
"forsk.", "forts.", "fr.", "fr.u.", "frk.", "fsva.", "fuldm.", "fung.",
"fx.", "fys.", "fær.", "g.d.", "g.m.", "gd.", "gdr.", "genuds.", "gl.",
"gn.", "gns.", "gr.", "grdl.", "gross.", "h.a.", "h.c.", "hdl.",
"henv.", "hhv.", "hj.hj.", "hj.spl.", "hort.", "hosp.", "hpl.", "hr.",
"hrs.", "hum.", "hvp.", "i.e.", "id.", "if.", "iflg.", "ifm.", "ift.",
"iht.", "ill.", "indb.", "indreg.", "inf.", "ing.", "inh.", "inj.",
"inkl.", "insp.", "instr.", "isl.", "istf.", "it.", "ital.", "iv.",
"jap.", "jf.", "jfr.", "jnr.", "j.nr.", "jr.", "jur.", "jvf.", "kap.",
"kbh.", "kem.", "kgl.", "kl.", "kld.", "knsp.", "komm.", "kons.",
"korr.", "kp.", "kr.", "kst.", "kt.", "ktr.", "kv.", "kvt.", "l.c.",
"lab.", "lat.", "lb.m.", "lb.nr.", "lejl.", "lgd.", "lic.", "lign.",
"lin.", "ling.merc.", "litt.", "loc.cit.", "lok.", "lrs.", "ltr.",
"m.a.o.", "m.fl.", "m.m.", "m.v.", "m.v.h.", "maks.", "md.", "mdr.",
"mdtl.", "mezz.", "mfl.", "m.h.p.", "m.h.t.", "mht.", "mill.", "mio.",
"modt.", "mrk.", "mul.", "mv.", "n.br.", "n.f.", "nb.", "nedenst.",
"nl.", "nr.", "nto.", "nuv.", "o/m", "o.a.", "o.fl.", "o.h.", "o.l.",
"o.lign.", "o.m.a.", "o.s.fr.", "obl.", "obs.", "odont.", "oecon.",
"off.", "ofl.", "omg.", "omkr.", "omr.", "omtr.", "opg.", "opl.",
"opr.", "org.", "orig.", "osv.", "ovenst.", "overs.", "ovf.", "p.a.",
"p.b.a", "p.b.v", "p.c.", "p.m.", "p.m.v.", "p.n.", "p.p.", "p.p.s.",
"p.s.", "p.t.", "p.v.a.", "p.v.c.", "pag.", "pass.", "pcs.", "pct.",
"pd.", "pens.", "pft.", "pg.", "pga.", "pgl.", "pinx.", "pk.", "pkt.",
"polit.", "polyt.", "pos.", "pp.", "ppm.", "pr.", "prc.", "priv.",
"prod.", "prof.", "pron.", "præd.", "præf.", "præt.", "psych.", "pt.",
"pæd.", "q.e.d.", "rad.", "red.", "ref.", "reg.", "regn.", "rel.",
"rep.", "repr.", "resp.", "rest.", "rm.", "rtg.", "russ.", "s.br.",
"s.d.", "s.f.", "s.m.b.a.", "s.u.", "s.å.", "sa.", "sb.", "sc.",
"scient.", "scil.", "sek.", "sekr.", "self.", "sem.", "shj.", "sign.",
"sing.", "sj.", "skr.", "slutn.", "sml.", "smp.", "snr.", "soc.",
"soc.dem.", "sp.", "spec.", "spm.", "spr.", "spsk.", "statsaut.", "st.",
"stk.", "str.", "stud.", "subj.", "subst.", "suff.", "sup.", "suppl.",
"sv.", "såk.", "sædv.", "t/r", "t.h.", "t.o.", "t.o.m.", "t.v.", "tbl.",
"tcp/ip", "td.", "tdl.", "tdr.", "techn.", "tekn.", "temp.", "th.",
"theol.", "tidl.", "tilf.", "tilh.", "till.", "tilsv.", "tjg.", "tkr.",
"tlf.", "tlgr.", "tr.", "trp.", "tsk.", "tv.", "ty.", "u/b", "udb.",
"udbet.", "ugtl.", "undt.", "v.f.", "vb.", "vedk.", "vedl.", "vedr.",
"vejl.", "vh.", "vha.", "vs.", "vsa.", "vær.", "zool.", "ø.lgd.",
"øvr.", "årg.", "årh."]:
"aarh.",
"ac.",
"adj.",
"adr.",
"adsk.",
"adv.",
"afb.",
"afd.",
"afg.",
"afk.",
"afs.",
"aht.",
"alg.",
"alk.",
"alm.",
"amer.",
"ang.",
"ank.",
"anl.",
"anv.",
"arb.",
"arr.",
"att.",
"bd.",
"bdt.",
"beg.",
"begr.",
"beh.",
"bet.",
"bev.",
"bhk.",
"bib.",
"bibl.",
"bidr.",
"bildl.",
"bill.",
"biol.",
"bk.",
"bl.",
"bl.a.",
"borgm.",
"br.",
"brolægn.",
"bto.",
"bygn.",
"ca.",
"cand.",
"d.d.",
"d.m.",
"d.s.",
"d.s.s.",
"d.y.",
"d.å.",
"d.æ.",
"dagl.",
"dat.",
"dav.",
"def.",
"dek.",
"dep.",
"desl.",
"dir.",
"disp.",
"distr.",
"div.",
"dkr.",
"dl.",
"do.",
"dobb.",
"dr.h.c",
"dr.phil.",
"ds.",
"dvs.",
"e.b.",
"e.l.",
"e.o.",
"e.v.t.",
"eftf.",
"eftm.",
"egl.",
"eks.",
"eksam.",
"ekskl.",
"eksp.",
"ekspl.",
"el.lign.",
"emer.",
"endv.",
"eng.",
"enk.",
"etc.",
"etym.",
"eur.",
"evt.",
"exam.",
"f.eks.",
"f.m.",
"f.n.",
"f.o.",
"f.o.m.",
"f.s.v.",
"f.t.",
"f.v.t.",
"f.å.",
"fa.",
"fakt.",
"fam.",
"ff.",
"fg.",
"fhv.",
"fig.",
"filol.",
"filos.",
"fl.",
"flg.",
"fm.",
"fmd.",
"fol.",
"forb.",
"foreg.",
"foren.",
"forf.",
"fork.",
"forr.",
"fors.",
"forsk.",
"forts.",
"fr.",
"fr.u.",
"frk.",
"fsva.",
"fuldm.",
"fung.",
"fx.",
"fys.",
"fær.",
"g.d.",
"g.m.",
"gd.",
"gdr.",
"genuds.",
"gl.",
"gn.",
"gns.",
"gr.",
"grdl.",
"gross.",
"h.a.",
"h.c.",
"hdl.",
"henv.",
"hhv.",
"hj.hj.",
"hj.spl.",
"hort.",
"hosp.",
"hpl.",
"hr.",
"hrs.",
"hum.",
"hvp.",
"i.e.",
"id.",
"if.",
"iflg.",
"ifm.",
"ift.",
"iht.",
"ill.",
"indb.",
"indreg.",
"inf.",
"ing.",
"inh.",
"inj.",
"inkl.",
"insp.",
"instr.",
"isl.",
"istf.",
"it.",
"ital.",
"iv.",
"jap.",
"jf.",
"jfr.",
"jnr.",
"j.nr.",
"jr.",
"jur.",
"jvf.",
"kap.",
"kbh.",
"kem.",
"kgl.",
"kl.",
"kld.",
"knsp.",
"komm.",
"kons.",
"korr.",
"kp.",
"kr.",
"kst.",
"kt.",
"ktr.",
"kv.",
"kvt.",
"l.c.",
"lab.",
"lat.",
"lb.m.",
"lb.nr.",
"lejl.",
"lgd.",
"lic.",
"lign.",
"lin.",
"ling.merc.",
"litt.",
"loc.cit.",
"lok.",
"lrs.",
"ltr.",
"m.a.o.",
"m.fl.",
"m.m.",
"m.v.",
"m.v.h.",
"maks.",
"md.",
"mdr.",
"mdtl.",
"mezz.",
"mfl.",
"m.h.p.",
"m.h.t.",
"mht.",
"mill.",
"mio.",
"modt.",
"mrk.",
"mul.",
"mv.",
"n.br.",
"n.f.",
"nb.",
"nedenst.",
"nl.",
"nr.",
"nto.",
"nuv.",
"o/m",
"o.a.",
"o.fl.",
"o.h.",
"o.l.",
"o.lign.",
"o.m.a.",
"o.s.fr.",
"obl.",
"obs.",
"odont.",
"oecon.",
"off.",
"ofl.",
"omg.",
"omkr.",
"omr.",
"omtr.",
"opg.",
"opl.",
"opr.",
"org.",
"orig.",
"osv.",
"ovenst.",
"overs.",
"ovf.",
"p.a.",
"p.b.a",
"p.b.v",
"p.c.",
"p.m.",
"p.m.v.",
"p.n.",
"p.p.",
"p.p.s.",
"p.s.",
"p.t.",
"p.v.a.",
"p.v.c.",
"pag.",
"pass.",
"pcs.",
"pct.",
"pd.",
"pens.",
"pft.",
"pg.",
"pga.",
"pgl.",
"pinx.",
"pk.",
"pkt.",
"polit.",
"polyt.",
"pos.",
"pp.",
"ppm.",
"pr.",
"prc.",
"priv.",
"prod.",
"prof.",
"pron.",
"præd.",
"præf.",
"præt.",
"psych.",
"pt.",
"pæd.",
"q.e.d.",
"rad.",
"red.",
"ref.",
"reg.",
"regn.",
"rel.",
"rep.",
"repr.",
"resp.",
"rest.",
"rm.",
"rtg.",
"russ.",
"s.br.",
"s.d.",
"s.f.",
"s.m.b.a.",
"s.u.",
"s.å.",
"sa.",
"sb.",
"sc.",
"scient.",
"scil.",
"sek.",
"sekr.",
"self.",
"sem.",
"shj.",
"sign.",
"sing.",
"sj.",
"skr.",
"slutn.",
"sml.",
"smp.",
"snr.",
"soc.",
"soc.dem.",
"sp.",
"spec.",
"spm.",
"spr.",
"spsk.",
"statsaut.",
"st.",
"stk.",
"str.",
"stud.",
"subj.",
"subst.",
"suff.",
"sup.",
"suppl.",
"sv.",
"såk.",
"sædv.",
"t/r",
"t.h.",
"t.o.",
"t.o.m.",
"t.v.",
"tbl.",
"tcp/ip",
"td.",
"tdl.",
"tdr.",
"techn.",
"tekn.",
"temp.",
"th.",
"theol.",
"tidl.",
"tilf.",
"tilh.",
"till.",
"tilsv.",
"tjg.",
"tkr.",
"tlf.",
"tlgr.",
"tr.",
"trp.",
"tsk.",
"tv.",
"ty.",
"u/b",
"udb.",
"udbet.",
"ugtl.",
"undt.",
"v.f.",
"vb.",
"vedk.",
"vedl.",
"vedr.",
"vejl.",
"vh.",
"vha.",
"vs.",
"vsa.",
"vær.",
"zool.",
"ø.lgd.",
"øvr.",
"årg.",
"årh.",
]:
_exc[orth] = [{ORTH: orth}]
capitalized = orth.capitalize()
_exc[capitalized] = [{ORTH: capitalized}]
for exc_data in [
{ORTH: "s'gu", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "S'gu", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "sgu'", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "Sgu'", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "sku'", LEMMA: "skal", NORM: "skulle"},
{ORTH: "ku'", LEMMA: "kan", NORM: "kunne"},
{ORTH: "Ku'", LEMMA: "kan", NORM: "kunne"},
{ORTH: "ka'", LEMMA: "kan", NORM: "kan"},
{ORTH: "Ka'", LEMMA: "kan", NORM: "kan"},
{ORTH: "gi'", LEMMA: "give", NORM: "giv"},
{ORTH: "Gi'", LEMMA: "give", NORM: "giv"},
{ORTH: "li'", LEMMA: "lide", NORM: "lide"},
{ORTH: "ha'", LEMMA: "have", NORM: "have"},
{ORTH: "Ha'", LEMMA: "have", NORM: "have"},
{ORTH: "ik'", LEMMA: "ikke", NORM: "ikke"},
{ORTH: "Ik'", LEMMA: "ikke", NORM: "ikke"}]:
{ORTH: "s'gu", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "S'gu", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "sgu'", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "Sgu'", LEMMA: "s'gu", NORM: "s'gu"},
{ORTH: "sku'", LEMMA: "skal", NORM: "skulle"},
{ORTH: "ku'", LEMMA: "kan", NORM: "kunne"},
{ORTH: "Ku'", LEMMA: "kan", NORM: "kunne"},
{ORTH: "ka'", LEMMA: "kan", NORM: "kan"},
{ORTH: "Ka'", LEMMA: "kan", NORM: "kan"},
{ORTH: "gi'", LEMMA: "give", NORM: "giv"},
{ORTH: "Gi'", LEMMA: "give", NORM: "giv"},
{ORTH: "li'", LEMMA: "lide", NORM: "lide"},
{ORTH: "ha'", LEMMA: "have", NORM: "have"},
{ORTH: "Ha'", LEMMA: "have", NORM: "have"},
{ORTH: "ik'", LEMMA: "ikke", NORM: "ikke"},
{ORTH: "Ik'", LEMMA: "ikke", NORM: "ikke"},
]:
_exc[exc_data[ORTH]] = [exc_data]
@ -147,11 +557,7 @@ for h in range(1, 31 + 1):
for period in ["."]:
_exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}]
_custom_base_exc = {
"i.": [
{ORTH: "i", LEMMA: "i", NORM: "i"},
{ORTH: ".", TAG: PUNCT}]
}
_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: ".", TAG: PUNCT}]}
_exc.update(_custom_base_exc)
TOKENIZER_EXCEPTIONS = _exc
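To make the intent of these tables easier to see at a glance: each key is a surface form and each value lists the subtokens it should be split into, optionally with lemma/norm overrides. A small illustration — the plain string keys stand in for the `ORTH`/`LEMMA`/`NORM` symbols, and `split_exception` is made up for the example:

```python
# Illustrative only: how an exception entry describes the subtokens of a
# surface form. String keys stand in for spaCy's ORTH/LEMMA/NORM symbols.
ORTH, LEMMA, NORM = "orth", "lemma", "norm"

_exc = {
    "i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}],
    "ka'": [{ORTH: "ka'", LEMMA: "kan", NORM: "kan"}],
}

def split_exception(text):
    """Return the subtoken texts an exception would produce for `text`."""
    return [t[ORTH] for t in _exc.get(text, [{ORTH: text}])]

print(split_exception("i."))    # ['i', '.']
print(split_exception("ka'"))   # ["ka'"]
print(split_exception("hund"))  # ['hund'] (no exception, kept as one token)
```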

View File

@ -18,9 +18,10 @@ from ...util import update_exc, add_lookups
class GermanDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'de'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
NORM_EXCEPTIONS, BASE_NORMS)
lex_attr_getters[LANG] = lambda text: "de"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
infixes = TOKENIZER_INFIXES
tag_map = TAG_MAP
@ -30,8 +31,8 @@ class GermanDefaults(Language.Defaults):
class German(Language):
lang = 'de'
lang = "de"
Defaults = GermanDefaults
__all__ = ['German']
__all__ = ["German"]

View File

@ -18,5 +18,5 @@ sentences = [
"San Francisco erwägt Verbot von Lieferrobotern",
"Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
"Wo bist du?",
"Was ist die Hauptstadt von Deutschland?"
"Was ist die Hauptstadt von Deutschland?",
]

View File

@ -6,9 +6,7 @@ from __future__ import unicode_literals
# old vs. new spelling rules, and all possible cases.
_exc = {
"daß": "dass"
}
_exc = {"daß": "dass"}
NORM_EXCEPTIONS = {}

View File

@ -5,16 +5,21 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
_quotes = QUOTES.replace("'", '')
_quotes = QUOTES.replace("'", "")
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])'.format(a=ALPHA, q=_quotes),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
r'(?<=[0-9])-(?=[0-9])'])
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[{}])\.(?=[{}])".format(ALPHA_LOWER, ALPHA_UPPER),
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])".format(a=ALPHA, q=_quotes),
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
r"(?<=[0-9])-(?=[0-9])",
]
)
TOKENIZER_INFIXES = _infixes

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
STOP_WORDS = set("""
STOP_WORDS = set(
"""
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
aller allerdings alles allgemeinen als also am an andere anderen andern anders
auch auf aus ausser außer ausserdem außerdem
@ -78,4 +79,5 @@ wollt wollte wollten worden wurde würde wurden würden
zehn zehnte zehnten zehnter zehntes zeit zu zuerst zugleich zum zunächst zur
zurück zusammen zwanzig zwar zwei zweite zweiten zweiter zweites zwischen
""".split())
""".split()
)

View File

@ -13,26 +13,37 @@ def noun_chunks(obj):
# measurement construction, the span is sometimes extended to the right of
# the NOUN. Example: "eine Tasse Tee" (a cup (of) tea) returns "eine Tasse Tee"
# and not just "eine Tasse", same for "das Thema Familie".
labels = ['sb', 'oa', 'da', 'nk', 'mo', 'ag', 'ROOT', 'root', 'cj', 'pd', 'og', 'app']
doc = obj.doc # Ensure works on both Doc and Span.
np_label = doc.vocab.strings.add('NP')
labels = [
"sb",
"oa",
"da",
"nk",
"mo",
"ag",
"ROOT",
"root",
"cj",
"pd",
"og",
"app",
]
doc = obj.doc # Ensure works on both Doc and Span.
np_label = doc.vocab.strings.add("NP")
np_deps = set(doc.vocab.strings.add(label) for label in labels)
close_app = doc.vocab.strings.add('nk')
close_app = doc.vocab.strings.add("nk")
rbracket = 0
for i, word in enumerate(obj):
if i < rbracket:
continue
if word.pos in (NOUN, PROPN, PRON) and word.dep in np_deps:
rbracket = word.i+1
rbracket = word.i + 1
# try to extend the span to the right
# to capture close apposition/measurement constructions
for rdep in doc[word.i].rights:
if rdep.pos in (NOUN, PROPN) and rdep.dep == close_app:
rbracket = rdep.i+1
rbracket = rdep.i + 1
yield word.left_edge.i, rbracket, np_label
SYNTAX_ITERATORS = {
'noun_chunks': noun_chunks
}
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
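A rough usage sketch for the iterator above: `doc.noun_chunks` is driven by this `SYNTAX_ITERATORS` entry, and the close-apposition handling described in the comments is what extends spans like "eine Tasse Tee". This assumes a German model such as `de_core_news_sm` is installed; the sentence is just an example.

```python
# Rough usage sketch -- requires a German model, e.g. de_core_news_sm.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Er bestellte eine Tasse Tee.")
for chunk in doc.noun_chunks:
    print(chunk.text)
# With the close-apposition extension, the object chunk should come out as
# "eine Tasse Tee" rather than just "eine Tasse".
```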

View File

@ -6,61 +6,61 @@ from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
TAG_MAP = {
"$(": {POS: PUNCT, "PunctType": "brck"},
"$,": {POS: PUNCT, "PunctType": "comm"},
"$.": {POS: PUNCT, "PunctType": "peri"},
"ADJA": {POS: ADJ},
"ADJD": {POS: ADJ, "Variant": "short"},
"ADV": {POS: ADV},
"APPO": {POS: ADP, "AdpType": "post"},
"APPR": {POS: ADP, "AdpType": "prep"},
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
"APZR": {POS: ADP, "AdpType": "circ"},
"ART": {POS: DET, "PronType": "art"},
"CARD": {POS: NUM, "NumType": "card"},
"FM": {POS: X, "Foreign": "yes"},
"ITJ": {POS: INTJ},
"KOKOM": {POS: CONJ, "ConjType": "comp"},
"KON": {POS: CONJ},
"KOUI": {POS: SCONJ},
"KOUS": {POS: SCONJ},
"NE": {POS: PROPN},
"NNE": {POS: PROPN},
"NN": {POS: NOUN},
"PAV": {POS: ADV, "PronType": "dem"},
"PROAV": {POS: ADV, "PronType": "dem"},
"PDAT": {POS: DET, "PronType": "dem"},
"PDS": {POS: PRON, "PronType": "dem"},
"PIAT": {POS: DET, "PronType": "ind|neg|tot"},
"PIDAT": {POS: DET, "AdjType": "pdt", "PronType": "ind|neg|tot"},
"PIS": {POS: PRON, "PronType": "ind|neg|tot"},
"PPER": {POS: PRON, "PronType": "prs"},
"PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"},
"PPOSS": {POS: PRON, "Poss": "yes", "PronType": "prs"},
"PRELAT": {POS: DET, "PronType": "rel"},
"PRELS": {POS: PRON, "PronType": "rel"},
"PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"},
"PTKA": {POS: PART},
"PTKANT": {POS: PART, "PartType": "res"},
"PTKNEG": {POS: PART, "Polarity": "Neg"},
"PTKVZ": {POS: PART, "PartType": "vbp"},
"PTKZU": {POS: PART, "PartType": "inf"},
"PWAT": {POS: DET, "PronType": "int"},
"PWAV": {POS: ADV, "PronType": "int"},
"PWS": {POS: PRON, "PronType": "int"},
"TRUNC": {POS: X, "Hyph": "yes"},
"VAFIN": {POS: AUX, "Mood": "ind", "VerbForm": "fin"},
"VAIMP": {POS: AUX, "Mood": "imp", "VerbForm": "fin"},
"VAINF": {POS: AUX, "VerbForm": "inf"},
"VAPP": {POS: AUX, "Aspect": "perf", "VerbForm": "part"},
"VMFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin", "VerbType": "mod"},
"VMINF": {POS: VERB, "VerbForm": "inf", "VerbType": "mod"},
"VMPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part", "VerbType": "mod"},
"VVFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin"},
"VVIMP": {POS: VERB, "Mood": "imp", "VerbForm": "fin"},
"VVINF": {POS: VERB, "VerbForm": "inf"},
"VVIZU": {POS: VERB, "VerbForm": "inf"},
"VVPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part"},
"XY": {POS: X},
"_SP": {POS: SPACE}
"$(": {POS: PUNCT, "PunctType": "brck"},
"$,": {POS: PUNCT, "PunctType": "comm"},
"$.": {POS: PUNCT, "PunctType": "peri"},
"ADJA": {POS: ADJ},
"ADJD": {POS: ADJ, "Variant": "short"},
"ADV": {POS: ADV},
"APPO": {POS: ADP, "AdpType": "post"},
"APPR": {POS: ADP, "AdpType": "prep"},
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
"APZR": {POS: ADP, "AdpType": "circ"},
"ART": {POS: DET, "PronType": "art"},
"CARD": {POS: NUM, "NumType": "card"},
"FM": {POS: X, "Foreign": "yes"},
"ITJ": {POS: INTJ},
"KOKOM": {POS: CONJ, "ConjType": "comp"},
"KON": {POS: CONJ},
"KOUI": {POS: SCONJ},
"KOUS": {POS: SCONJ},
"NE": {POS: PROPN},
"NNE": {POS: PROPN},
"NN": {POS: NOUN},
"PAV": {POS: ADV, "PronType": "dem"},
"PROAV": {POS: ADV, "PronType": "dem"},
"PDAT": {POS: DET, "PronType": "dem"},
"PDS": {POS: PRON, "PronType": "dem"},
"PIAT": {POS: DET, "PronType": "ind|neg|tot"},
"PIDAT": {POS: DET, "AdjType": "pdt", "PronType": "ind|neg|tot"},
"PIS": {POS: PRON, "PronType": "ind|neg|tot"},
"PPER": {POS: PRON, "PronType": "prs"},
"PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"},
"PPOSS": {POS: PRON, "Poss": "yes", "PronType": "prs"},
"PRELAT": {POS: DET, "PronType": "rel"},
"PRELS": {POS: PRON, "PronType": "rel"},
"PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"},
"PTKA": {POS: PART},
"PTKANT": {POS: PART, "PartType": "res"},
"PTKNEG": {POS: PART, "Polarity": "Neg"},
"PTKVZ": {POS: PART, "PartType": "vbp"},
"PTKZU": {POS: PART, "PartType": "inf"},
"PWAT": {POS: DET, "PronType": "int"},
"PWAV": {POS: ADV, "PronType": "int"},
"PWS": {POS: PRON, "PronType": "int"},
"TRUNC": {POS: X, "Hyph": "yes"},
"VAFIN": {POS: AUX, "Mood": "ind", "VerbForm": "fin"},
"VAIMP": {POS: AUX, "Mood": "imp", "VerbForm": "fin"},
"VAINF": {POS: AUX, "VerbForm": "inf"},
"VAPP": {POS: AUX, "Aspect": "perf", "VerbForm": "part"},
"VMFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin", "VerbType": "mod"},
"VMINF": {POS: VERB, "VerbForm": "inf", "VerbType": "mod"},
"VMPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part", "VerbType": "mod"},
"VVFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin"},
"VVIMP": {POS: VERB, "Mood": "imp", "VerbForm": "fin"},
"VVINF": {POS: VERB, "VerbForm": "inf"},
"VVIZU": {POS: VERB, "VerbForm": "inf"},
"VVPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part"},
"XY": {POS: X},
"_SP": {POS: SPACE},
}

View File

@ -5,49 +5,41 @@ from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
_exc = {
"auf'm": [
{ORTH: "auf", LEMMA: "auf"},
{ORTH: "'m", LEMMA: "der", NORM: "dem"}],
"auf'm": [{ORTH: "auf", LEMMA: "auf"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}],
"du's": [
{ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"er's": [
{ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"hinter'm": [
{ORTH: "hinter", LEMMA: "hinter"},
{ORTH: "'m", LEMMA: "der", NORM: "dem"}],
{ORTH: "'m", LEMMA: "der", NORM: "dem"},
],
"ich's": [
{ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"ihr's": [
{ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"sie's": [
{ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"unter'm": [
{ORTH: "unter", LEMMA: "unter"},
{ORTH: "'m", LEMMA: "der", NORM: "dem"}],
"vor'm": [
{ORTH: "vor", LEMMA: "vor"},
{ORTH: "'m", LEMMA: "der", NORM: "dem"}],
{ORTH: "'m", LEMMA: "der", NORM: "dem"},
],
"vor'm": [{ORTH: "vor", LEMMA: "vor"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}],
"wir's": [
{ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"},
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}],
"über'm": [
{ORTH: "über", LEMMA: "über"},
{ORTH: "'m", LEMMA: "der", NORM: "dem"}]
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"},
],
"über'm": [{ORTH: "über", LEMMA: "über"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}],
}
@ -162,21 +154,95 @@ for exc_data in [
{ORTH: "z.Zt.", LEMMA: "zur Zeit"},
{ORTH: "z.b.", LEMMA: "zum Beispiel"},
{ORTH: "zzgl.", LEMMA: "zuzüglich"},
{ORTH: "österr.", LEMMA: "österreichisch", NORM: "österreichisch"}]:
{ORTH: "österr.", LEMMA: "österreichisch", NORM: "österreichisch"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"A.C.", "a.D.", "A.D.", "A.G.", "a.M.", "a.Z.", "Abs.", "adv.", "al.",
"B.A.", "B.Sc.", "betr.", "biol.", "Biol.", "ca.", "Chr.", "Cie.", "co.",
"Co.", "D.C.", "Dipl.-Ing.", "Dipl.", "Dr.", "e.g.", "e.V.", "ehem.",
"entspr.", "erm.", "etc.", "ev.", "G.m.b.H.", "geb.", "Gebr.", "gem.",
"h.c.", "Hg.", "hrsg.", "Hrsg.", "i.A.", "i.e.", "i.G.", "i.Tr.", "i.V.",
"Ing.", "jr.", "Jr.", "jun.", "jur.", "K.O.", "L.A.", "lat.", "M.A.",
"m.E.", "m.M.", "M.Sc.", "Mr.", "N.Y.", "N.Y.C.", "nat.", "o.a.",
"o.ä.", "o.g.", "o.k.", "O.K.", "p.a.", "p.s.", "P.S.", "pers.", "phil.",
"q.e.d.", "R.I.P.", "rer.", "sen.", "St.", "std.", "u.a.", "U.S.", "U.S.A.",
"U.S.S.", "Vol.", "vs.", "wiss."]:
"A.C.",
"a.D.",
"A.D.",
"A.G.",
"a.M.",
"a.Z.",
"Abs.",
"adv.",
"al.",
"B.A.",
"B.Sc.",
"betr.",
"biol.",
"Biol.",
"ca.",
"Chr.",
"Cie.",
"co.",
"Co.",
"D.C.",
"Dipl.-Ing.",
"Dipl.",
"Dr.",
"e.g.",
"e.V.",
"ehem.",
"entspr.",
"erm.",
"etc.",
"ev.",
"G.m.b.H.",
"geb.",
"Gebr.",
"gem.",
"h.c.",
"Hg.",
"hrsg.",
"Hrsg.",
"i.A.",
"i.e.",
"i.G.",
"i.Tr.",
"i.V.",
"Ing.",
"jr.",
"Jr.",
"jun.",
"jur.",
"K.O.",
"L.A.",
"lat.",
"M.A.",
"m.E.",
"m.M.",
"M.Sc.",
"Mr.",
"N.Y.",
"N.Y.C.",
"nat.",
"o.a.",
"o.ä.",
"o.g.",
"o.k.",
"O.K.",
"p.a.",
"p.s.",
"P.S.",
"pers.",
"phil.",
"q.e.d.",
"R.I.P.",
"rer.",
"sen.",
"St.",
"std.",
"u.a.",
"U.S.",
"U.S.A.",
"U.S.S.",
"Vol.",
"vs.",
"wiss.",
]:
_exc[orth] = [{ORTH: orth}]

View File

@ -21,9 +21,10 @@ from ...util import update_exc, add_lookups
class GreekDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'el' # ISO code
lex_attr_getters[LANG] = lambda text: "el" # ISO code
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS)
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
tag_map = TAG_MAP
@ -37,15 +38,16 @@ class GreekDefaults(Language.Defaults):
lemma_rules = LEMMA_RULES
lemma_index = LEMMA_INDEX
lemma_exc = LEMMA_EXC
return GreekLemmatizer(index=lemma_index, exceptions=lemma_exc,
rules=lemma_rules)
return GreekLemmatizer(
index=lemma_index, exceptions=lemma_exc, rules=lemma_rules
)
class Greek(Language):
lang = 'el' # ISO code
lang = "el" # ISO code
Defaults = GreekDefaults # set Defaults to custom language defaults
# set default export this allows the language class to be lazy-loaded
__all__ = ['Greek']
__all__ = ["Greek"]

View File

@ -9,20 +9,20 @@ Example sentences to test spaCy and its language models.
"""
sentences = [
'''Η άνιση κατανομή του πλούτου και του εισοδήματος, η οποία έχει λάβει
τρομερές διαστάσεις, δεν δείχνει τάσεις βελτίωσης.''',
'''Ο στόχος της σύντομης αυτής έκθεσης είναι να συνοψίσει τα κυριότερα
συμπεράσματα των επισκοπήσεων κάθε μιας χώρας.''',
'''Μέχρι αργά χθες το βράδυ ο πλοιοκτήτης παρέμενε έξω από το γραφείο του
"""Η άνιση κατανομή του πλούτου και του εισοδήματος, η οποία έχει λάβει
τρομερές διαστάσεις, δεν δείχνει τάσεις βελτίωσης.""",
"""Ο στόχος της σύντομης αυτής έκθεσης είναι να συνοψίσει τα κυριότερα
συμπεράσματα των επισκοπήσεων κάθε μιας χώρας.""",
"""Μέχρι αργά χθες το βράδυ ο πλοιοκτήτης παρέμενε έξω από το γραφείο του
γενικού γραμματέα του υπουργείου, ενώ είχε μόνον τηλεφωνική επικοινωνία με
τον υπουργό.''',
'''Σύμφωνα με καλά ενημερωμένη πηγή, από την επεξεργασία του προέκυψε ότι
τον υπουργό.""",
"""Σύμφωνα με καλά ενημερωμένη πηγή, από την επεξεργασία του προέκυψε ότι
οι δράστες της επίθεσης ήταν δύο, καθώς και ότι προσέγγισαν και αποχώρησαν
από το σημείο με μοτοσικλέτα.''',
από το σημείο με μοτοσικλέτα.""",
"Η υποδομή καταλυμάτων στην Ελλάδα είναι πλήρης και ανανεώνεται συνεχώς.",
'''Το επείγον ταχυδρομείο (ήτοι το παραδοτέο εντός 48 ωρών το πολύ) μπορεί
"""Το επείγον ταχυδρομείο (ήτοι το παραδοτέο εντός 48 ωρών το πολύ) μπορεί
να μεταφέρεται αεροπορικώς μόνον εφόσον εφαρμόζονται οι κανόνες
ασφαλείας''',
''''Στις ορεινές περιοχές του νησιού οι χιονοπτώσεις και οι παγετοί είναι
περιορισμένοι ενώ στις παραθαλάσσιες περιοχές σημειώνονται σπανίως.'''
ασφαλείας""",
"""'Στις ορεινές περιοχές του νησιού οι χιονοπτώσεις και οι παγετοί είναι
περιορισμένοι ενώ στις παραθαλάσσιες περιοχές σημειώνονται σπανίως.""",
]

View File

@ -12,10 +12,19 @@ from ._verbs import VERBS
from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES, PUNCT_RULES
LEMMA_INDEX = {'adj': ADJECTIVES, 'adv': ADVERBS, 'noun': NOUNS, 'verb': VERBS}
LEMMA_INDEX = {"adj": ADJECTIVES, "adv": ADVERBS, "noun": NOUNS, "verb": VERBS}
LEMMA_RULES = {'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES, 'verb': VERB_RULES,
'punct': PUNCT_RULES}
LEMMA_RULES = {
"adj": ADJECTIVE_RULES,
"noun": NOUN_RULES,
"verb": VERB_RULES,
"punct": PUNCT_RULES,
}
LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'noun': NOUNS_IRREG, 'det': DETS_IRREG, 'verb': VERBS_IRREG}
LEMMA_EXC = {
"adj": ADJECTIVES_IRREG,
"noun": NOUNS_IRREG,
"det": DETS_IRREG,
"verb": VERBS_IRREG,
}

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
ADJECTIVES = set("""
ADJECTIVES = set(
"""
n-διάστατος µεταφυτρωτικός άβαθος άβαλτος άβαρος άβατος άβαφος άβγαλτος άβιος
άβλαπτος άβλεπτος άβολος άβουλος άβραστος άβρεχτος άβροχος άβυθος άγαμος
άγγιχτος άγδαρτος άγδυτος άγευστος άγιος άγλυκος άγλωσσος άγναθος άγναντος
@ -2438,4 +2440,5 @@ ADJECTIVES = set("""
όμορφος όνειος όξινος όρθιος όσιος όφκαιρος όψια όψιμος ύπανδρος ύπατος
ύπουλος ύπτιος ύστατος ύστερος ύψιστος ώριμος ώριος ἀγκυλωτός ἀκαταμέτρητος
ἄπειρος ἄτροπος ἐλαφρός ἐνεστώς ἐνυπόστατος ἔναυλος ἥττων ἰσχυρός ἵστωρ
""".split())
""".split()
)

View File

@ -32,5 +32,4 @@ ADJECTIVES_IRREG = {
"πολύς": ("πολύ",),
"πολλύ": ("πολύ",),
"πολλύς": ("πολύ",),
}

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
ADVERBS = set("""
ADVERBS = set(
"""
άβλαβα άβολα άβουλα άγαν άγαρμπα άγγιχτα άγνωμα άγρια άγρυπνα άδηλα άδικα
άδοξα άθελα άθλια άκαιρα άκακα άκαμπτα άκαρδα άκαρπα άκεφα άκομψα άκοπα άκοσμα
άκρως άκυρα άλαλα άλιωτα άλλοθεν άλλοτε άλλως άλλωστε άλογα άλυπα άμεμπτα
@ -861,4 +863,5 @@ ADVERBS = set("""
ψυχραντικά ψωροπερήφανα ψόφια ψύχραιμα ωδικώς ωμά ωρίμως ωραία ωραιότατα
ωριαία ωριαίως ως ωσαύτως ωσεί ωφέλιμα ωφελίμως ωφελιμιστικά ωχρά όθε όθεν όλο
όμορφα όντως όξω όπισθεν όπου όπως όρθια όρτσα όσια όσο όχι όψιμα ύπερθεν
""".split())
""".split()
)

View File

@ -1,5 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
DETS = set("""
DETS = set(
"""
ένας η ο το τη
""".split())
""".split()
)

View File

@ -8,5 +8,5 @@ DETS_IRREG = {
"τους": ("το",),
"τις": ("τη",),
"τα": ("το",),
"οι": ("ο","η"),
"οι": ("ο", "η"),
}

View File

@ -140,17 +140,7 @@ VERB_RULES = [
["ξουμε", "ζω"],
["ξετε", "ζω"],
["ξουν", "ζω"],
]
PUNCT_RULES = [
["", "\""],
["", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
PUNCT_RULES = [["", '"'], ["", '"'], ["\u2018", "'"], ["\u2019", "'"]]

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
NOUNS = set("""
NOUNS = set(
"""
-αλγία -βατώ -βατῶ -ούλα -πληξία -ώνυμο sofa table άβακας άβατο άβατον άβυσσος
άγανο άγαρ άγγελμα άγγελος άγγιγμα άγγισμα άγγλος άγημα άγιασμα άγιο φως
άγκλισμα άγκυρα άγμα άγνοια άγνωστος άγονο άγος άγουρος άγουσα άγρα άγρευμα
@ -6066,4 +6068,5 @@ NOUNS = set("""
ἐντευκτήριον ἐντόσθια ἐξοικείωσις ἐξοχή ἐξωκκλήσιον ἐπίσκεψις ἐπίσχεστρον
ἐρωτίς ἑρμηνεία ἔκθλιψις ἔκτισις ἔκτρωμα ἔπαλξις ἱππάρχας ἱππάρχης ἴς ἵππαρχος
ὑστερικός ὕστερον ὠάριον ὠοθήκη ὠοθηκῖτις ὠοθυλάκιον ὠορρηξία ὠοσκόπιον
""".split())
""".split()
)

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
PARTICIPLES = set("""
PARTICIPLES = set(
"""
έρποντας έχοντας αβανιάζοντας αβγατισμένος αγαπημένος αγαπώντας αγγίζοντας
αγγιγμένος αγιασμένος αγιογραφώντας αγιοποιημένος αγιοποιώντας αγκαζαρισμένος
αγκιστρωμένος αγκυλωμένος αγκυροβολημένος αγλακώντας αγνοημένος αγνοούμενος
@ -941,4 +943,5 @@ PARTICIPLES = set("""
ψιλούμενος ψοφολογώντας ψυχογραφώντας ψυχολογημένος ψυχομαχώντας ψυχομαχώντας
ψυχορραγώντας ψυχρηλατώντας ψυχωμένος ψωμοζητώντας ψωμοζώντας ψωμωμένος
ωθηθείς ωθώντας ωραιοποιημένος ωραιοποιώντας ωρυόμενος ωτοσκοπώντας όντας
""".split())
""".split()
)

View File

@ -1,5 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
PROPER_NAMES = set("""
PROPER_NAMES = set(
"""
άαχεν άβαρος άβδηρα άβελ άβιλα άβολα άγγελοι άγγελος άγιο πνεύμα
άγιοι τόποι άγιον όρος άγιος αθανάσιος άγιος αναστάσιος άγιος αντώνιος
άγιος αριστείδης άγιος βαρθολομαίος άγιος βασίλειος άγιος βασίλης
@ -641,4 +644,5 @@ PROPER_NAMES = set("""
ωρολόγιον ωρωπός ωσηέ όγκα όγκατα όγκι όθρυς όθων όιτα όλγα όλιβερ όλυμπος
όμουρα όμπιδος όνειρος όνο όρεγκον όσακι όσατο όσκαρ όσλο όταμα ότσου όφενμπαχ
όχιρα ύδρα ύδρος ύψιστος ώλενος ώρες ώρχους ώστιν ἀλεξανδρούπολις ἀμαλιούπολις
""".split())
""".split()
)

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
VERBS = set("""
VERBS = set(
"""
'γγίζω άγομαι άγχομαι άγω άδω άπτομαι άπωσον άρχομαι άρχω άφτω έγκειται έκιοσε
έπομαι έρπω έρχομαι έστω έχω ήγγικεν ήθελε ίπταμαι ίσταμαι αίρομαι αίρω
αβαντάρω αβαντζάρω αβαντσάρω αβαράρω αβασκαίνω αβγατίζω αβγαταίνω αβγοκόβω
@ -1186,4 +1188,5 @@ VERBS = set("""
ωρύομαι ωτακουστώ ωτοσκοπώ ωφελούμαι ωφελώ ωχραίνω ωχριώ όζω όψομαι ἀδικῶ
ἀκροῶμαι ἀλέθω ἀμελῶ ἀναπτερυγιάζω ἀναπτερώνω ἀναπτερώνω ἀνασαίνω ἀναταράσσω
ἀναφτερουγίζω ἀναφτερουγιάζω ἀναφτερώνω ἀναχωρίζω ἀντιμετρῶ ἀράζω ἀφοδεύω
""".split())
""".split()
)

View File

@ -1,200 +1,198 @@
# coding: utf8
from __future__ import unicode_literals
VERBS_IRREG = {
"είσαι": ("είμαι",),
"είναι": ("είμαι",),
"είμαστε": ("είμαι",),
"είστε": ("είμαι",),
"είσαστε": ("είμαι",),
"ήμουν": ("είμαι",),
"ήσουν": ("είμαι",),
"ήταν": ("είμαι",),
"ήμαστε": ("είμαι",),
"ήμασταν": ("είμαι",),
"ήταν": ("είμαι",),
"είπα": ("λέω",),
"είπες": ("λέω",),
"είπε": ("λέω",),
"είπαμε": ("λέω",),
"είπατε": ("λέω",),
"είπαν": ("λέω",),
"είπανε": ("λέω",),
"πει": ("λέω"),
"πω": ("λέω"),
"πάω": ("πηγαίνω",),
"πάς": ("πηγαίνω",),
"πας": ("πηγαίνω",),
"πάει": ("πηγαίνω",),
"πάμε": ("πηγαίνω",),
"πάτε": ("πηγαίνω",),
"πάνε": ("πηγαίνω",),
"πήγα": ("πηγαίνω",),
"πήγες": ("πηγαίνω",),
"πήγε": ("πηγαίνω",),
"πήγαμε": ("πηγαίνω",),
"πήγατε": ("πηγαίνω",),
"πήγαν": ("πηγαίνω",),
"πήγανε": ("πηγαίνω",),
"έπαιζα": ("παίζω",),
"έπαιζες": ("παίζω",),
"έπαιζε": ("παίζω",),
"έπαιζαν": ("παίζω,",),
"έπαιξα": ("παίζω",),
"έπαιξες": ("παίζω",),
"έπαιξε": ("παίζω",),
"έτρωγα": ("τρώω",),
"έτρωγες": ("τρώω",),
"έτρωγε": ("τρώω",),
"έτρωγαν": ("τρώω",),
"είχα": ("έχω",),
"είχες": ("έχω",),
"είχε": ("έχω",),
"είχαμε": ("έχω",),
"είχατε": ("έχω",),
"είχαν": ("έχω",),
"είχανε": ("έχω",),
"έπαιρνα": ("παίρνω",),
"έπαιρνες": ("παίρνω",),
"έπαιρνε": ("παίρνω",),
"έπαιρναν": ("παίρνω",),
"εδίνα": ("δίνω",),
"εδίνες": ("δίνω",),
"εδίνε": ("δίνω",),
"εδίναν": ("δίνω",),
"έκανα": ("κάνω",),
"έκανες": ("κάνω",),
"έκανε": ("κάνω",),
"έκαναν": ("κάνω",),
"ήθελα": ("θέλω",),
"ήθελες": ("θέλω",),
"ήθελε": ("θέλω",),
"ήθελαν": ("θέλω",),
"έβλεπα": ("βλέπω",),
"έβλεπες": ("βλέπω",),
"έβλεπε": ("βλέπω",),
"έβλεπαν": ("βλέπω",),
"είδα": ("βλέπω",),
"είδες": ("βλέπω",),
"είδε": ("βλέπω",),
"είδαμε": ("βλέπω",),
"είδατε": ("βλέπω",),
"είδαν": ("βλέπω",),
"έφερνα": ("φέρνω",),
"έφερνες": ("φέρνω",),
"έφερνε": ("φέρνω",),
"έφερναν": ("φέρνω",),
"έφερα": ("φέρω",),
"έφερες": ("φέρω",),
"έφερε": ("φέρω",),
"έφεραν": ("φέρω",),
"έλαβα": ("λαμβάνω",),
"έλαβες": ("λαμβάνω",),
"έλαβε": ("λαμβάνω",),
"έλαβαν": ("λαμβάνω",),
"έβρισκα": ("βρίσκω",),
"έβρισκες": ("βρίσκω",),
"έβρισκε": ("βρίσκω",),
"έβρισκαν": ("βρίσκω",),
"ήξερα": ("ξέρω",),
"ήξερες": ("ξέρω",),
"ήξερε": ("ξέρω",),
"ήξεραν": ("ξέρω",),
"ανέφερα": ("αναφέρω",),
"ανέφερες": ("αναφέρω",),
"ανέφερε": ("αναφέρω",),
"ανέφεραν": ("αναφέρω",),
"έβαζα": ("βάζω",),
"έβαζες": ("βάζω",),
"έβαζε": ("βάζω",),
"έβαζαν": ("βάζω",),
"έμεινα": ("μένω",),
"έμεινες": ("μένω",),
"έμεινε": ("μένω",),
"έμειναν": ("μένω",),
"έβγαζα": ("βγάζω",),
"έβγαζες": ("βγάζω",),
"έβγαζε": ("βγάζω",),
"έβγαζαν": ("βγάζω",),
"έμπαινα": ("μπαίνω",),
"έμπαινες": ("μπαίνω",),
"έμπαινε": ("μπαίνω",),
"έμπαιναν": ("μπαίνω",),
"βγήκα": ("βγαίνω",),
"βγήκες": ("βγαίνω",),
"βγήκε": ("βγαίνω",),
"βγήκαμε": ("βγαίνω",),
"βγήκατε": ("βγαίνω",),
"βγήκαν": ("βγαίνω",),
"έπεφτα": ("πέφτω",),
"έπεφτες": ("πέφτω",),
"έπεφτε": ("πέφτω",),
"έπεφταν": ("πέφτω",),
"έπεσα": ("πέφτω",),
"έπεσες": ("πέφτω",),
"έπεσε": ("πέφτω",),
"έπεσαν": ("πέφτω",),
"έστειλα": ("στέλνω",),
"έστειλες": ("στέλνω",),
"έστειλε": ("στέλνω",),
"έστειλαν": ("στέλνω",),
"έφυγα": ("φεύγω",),
"έφυγες": ("φεύγω",),
"έφυγες": ("φεύγω",),
"έφυγαν": ("φεύγω",),
"έμαθα": ("μαθαίνω",),
"έμαθες": ("μαθαίνω",),
"έμαθε": ("μαθαίνω",),
"έμαθαν": ("μαθαίνω",),
"υπέβαλλα": ("υποβάλλω",),
"υπέβαλλες": ("υποβάλλω",),
"υπέβαλλε": ("υποβάλλω",),
"υπέβαλλαν": ("υποβάλλω",),
"έπινα": ("πίνω",),
"έπινες": ("πίνω",),
"έπινε": ("πίνω",),
"έπιναν": ("πίνω",),
"ήπια": ("πίνω",),
"ήπιες": ("πίνω",),
"ήπιε": ("πίνω",),
"ήπιαμε": ("πίνω",),
"ήπιατε": ("πίνω",),
"ήπιαν": ("πίνω",),
"ετύχα": ("τυχαίνω",),
"ετύχες": ("τυχαίνω",),
"ετύχε": ("τυχαίνω",),
"ετύχαν": ("τυχαίνω",),
"φάω": ("τρώω",),
"φάς": ("τρώω",),
"φάει": ("τρώω",),
"φάμε": ("τρώω",),
"φάτε": ("τρώω",),
"φάνε": ("τρώω",),
"φάν": ("τρώω",),
"έτρωγα": ("τρώω",),
"έτρωγες": ("τρώω",),
"τρώγαμε": ("τρώω",),
"τρώγατε": ("τρώω",),
"τρώγανε": ("τρώω",),
"τρώγαν": ("τρώω",),
"πέρασα": ("περνώ",),
"πέρασες": ("περνώ",),
"πέρασε": ("περνώ",),
"πέρασαμε": ("περνώ",),
"πέρασατε": ("περνώ",),
"πέρασαν": ("περνώ",),
"έγδαρα": ("γδάρω",),
"έγδαρες": ("γδάρω",),
"έγδαρε": ("γδάρω",),
"έγδαραν": ("γδάρω",),
"έβγαλα": ("βγάλω",),
"έβγαλες": ("βγάλω",),
"έβγαλε": ("βγάλω",),
"έβγαλαν": ("βγάλω",),
"έφθασα": ("φτάνω",),
"έφθασες": ("φτάνω",),
"έφθασε": ("φτάνω",),
"έφθασαν": ("φτάνω",),
"είσαι": ("είμαι",),
"είναι": ("είμαι",),
"είμαστε": ("είμαι",),
"είστε": ("είμαι",),
"είσαστε": ("είμαι",),
"ήμουν": ("είμαι",),
"ήσουν": ("είμαι",),
"ήταν": ("είμαι",),
"ήμαστε": ("είμαι",),
"ήμασταν": ("είμαι",),
"ήταν": ("είμαι",),
"είπα": ("λέω",),
"είπες": ("λέω",),
"είπε": ("λέω",),
"είπαμε": ("λέω",),
"είπατε": ("λέω",),
"είπαν": ("λέω",),
"είπανε": ("λέω",),
"πει": ("λέω"),
"πω": ("λέω"),
"πάω": ("πηγαίνω",),
"πάς": ("πηγαίνω",),
"πας": ("πηγαίνω",),
"πάει": ("πηγαίνω",),
"πάμε": ("πηγαίνω",),
"πάτε": ("πηγαίνω",),
"πάνε": ("πηγαίνω",),
"πήγα": ("πηγαίνω",),
"πήγες": ("πηγαίνω",),
"πήγε": ("πηγαίνω",),
"πήγαμε": ("πηγαίνω",),
"πήγατε": ("πηγαίνω",),
"πήγαν": ("πηγαίνω",),
"πήγανε": ("πηγαίνω",),
"έπαιζα": ("παίζω",),
"έπαιζες": ("παίζω",),
"έπαιζε": ("παίζω",),
"έπαιζαν": ("παίζω,",),
"έπαιξα": ("παίζω",),
"έπαιξες": ("παίζω",),
"έπαιξε": ("παίζω",),
"έτρωγα": ("τρώω",),
"έτρωγες": ("τρώω",),
"έτρωγε": ("τρώω",),
"έτρωγαν": ("τρώω",),
"είχα": ("έχω",),
"είχες": ("έχω",),
"είχε": ("έχω",),
"είχαμε": ("έχω",),
"είχατε": ("έχω",),
"είχαν": ("έχω",),
"είχανε": ("έχω",),
"έπαιρνα": ("παίρνω",),
"έπαιρνες": ("παίρνω",),
"έπαιρνε": ("παίρνω",),
"έπαιρναν": ("παίρνω",),
"εδίνα": ("δίνω",),
"εδίνες": ("δίνω",),
"εδίνε": ("δίνω",),
"εδίναν": ("δίνω",),
"έκανα": ("κάνω",),
"έκανες": ("κάνω",),
"έκανε": ("κάνω",),
"έκαναν": ("κάνω",),
"ήθελα": ("θέλω",),
"ήθελες": ("θέλω",),
"ήθελε": ("θέλω",),
"ήθελαν": ("θέλω",),
"έβλεπα": ("βλέπω",),
"έβλεπες": ("βλέπω",),
"έβλεπε": ("βλέπω",),
"έβλεπαν": ("βλέπω",),
"είδα": ("βλέπω",),
"είδες": ("βλέπω",),
"είδε": ("βλέπω",),
"είδαμε": ("βλέπω",),
"είδατε": ("βλέπω",),
"είδαν": ("βλέπω",),
"έφερνα": ("φέρνω",),
"έφερνες": ("φέρνω",),
"έφερνε": ("φέρνω",),
"έφερναν": ("φέρνω",),
"έφερα": ("φέρω",),
"έφερες": ("φέρω",),
"έφερε": ("φέρω",),
"έφεραν": ("φέρω",),
"έλαβα": ("λαμβάνω",),
"έλαβες": ("λαμβάνω",),
"έλαβε": ("λαμβάνω",),
"έλαβαν": ("λαμβάνω",),
"έβρισκα": ("βρίσκω",),
"έβρισκες": ("βρίσκω",),
"έβρισκε": ("βρίσκω",),
"έβρισκαν": ("βρίσκω",),
"ήξερα": ("ξέρω",),
"ήξερες": ("ξέρω",),
"ήξερε": ("ξέρω",),
"ήξεραν": ("ξέρω",),
"ανέφερα": ("αναφέρω",),
"ανέφερες": ("αναφέρω",),
"ανέφερε": ("αναφέρω",),
"ανέφεραν": ("αναφέρω",),
"έβαζα": ("βάζω",),
"έβαζες": ("βάζω",),
"έβαζε": ("βάζω",),
"έβαζαν": ("βάζω",),
"έμεινα": ("μένω",),
"έμεινες": ("μένω",),
"έμεινε": ("μένω",),
"έμειναν": ("μένω",),
"έβγαζα": ("βγάζω",),
"έβγαζες": ("βγάζω",),
"έβγαζε": ("βγάζω",),
"έβγαζαν": ("βγάζω",),
"έμπαινα": ("μπαίνω",),
"έμπαινες": ("μπαίνω",),
"έμπαινε": ("μπαίνω",),
"έμπαιναν": ("μπαίνω",),
"βγήκα": ("βγαίνω",),
"βγήκες": ("βγαίνω",),
"βγήκε": ("βγαίνω",),
"βγήκαμε": ("βγαίνω",),
"βγήκατε": ("βγαίνω",),
"βγήκαν": ("βγαίνω",),
"έπεφτα": ("πέφτω",),
"έπεφτες": ("πέφτω",),
"έπεφτε": ("πέφτω",),
"έπεφταν": ("πέφτω",),
"έπεσα": ("πέφτω",),
"έπεσες": ("πέφτω",),
"έπεσε": ("πέφτω",),
"έπεσαν": ("πέφτω",),
"έστειλα": ("στέλνω",),
"έστειλες": ("στέλνω",),
"έστειλε": ("στέλνω",),
"έστειλαν": ("στέλνω",),
"έφυγα": ("φεύγω",),
"έφυγες": ("φεύγω",),
"έφυγες": ("φεύγω",),
"έφυγαν": ("φεύγω",),
"έμαθα": ("μαθαίνω",),
"έμαθες": ("μαθαίνω",),
"έμαθε": ("μαθαίνω",),
"έμαθαν": ("μαθαίνω",),
"υπέβαλλα": ("υποβάλλω",),
"υπέβαλλες": ("υποβάλλω",),
"υπέβαλλε": ("υποβάλλω",),
"υπέβαλλαν": ("υποβάλλω",),
"έπινα": ("πίνω",),
"έπινες": ("πίνω",),
"έπινε": ("πίνω",),
"έπιναν": ("πίνω",),
"ήπια": ("πίνω",),
"ήπιες": ("πίνω",),
"ήπιε": ("πίνω",),
"ήπιαμε": ("πίνω",),
"ήπιατε": ("πίνω",),
"ήπιαν": ("πίνω",),
"ετύχα": ("τυχαίνω",),
"ετύχες": ("τυχαίνω",),
"ετύχε": ("τυχαίνω",),
"ετύχαν": ("τυχαίνω",),
"φάω": ("τρώω",),
"φάς": ("τρώω",),
"φάει": ("τρώω",),
"φάμε": ("τρώω",),
"φάτε": ("τρώω",),
"φάνε": ("τρώω",),
"φάν": ("τρώω",),
"έτρωγα": ("τρώω",),
"έτρωγες": ("τρώω",),
"τρώγαμε": ("τρώω",),
"τρώγατε": ("τρώω",),
"τρώγανε": ("τρώω",),
"τρώγαν": ("τρώω",),
"πέρασα": ("περνώ",),
"πέρασες": ("περνώ",),
"πέρασε": ("περνώ",),
"πέρασαμε": ("περνώ",),
"πέρασατε": ("περνώ",),
"πέρασαν": ("περνώ",),
"έγδαρα": ("γδάρω",),
"έγδαρες": ("γδάρω",),
"έγδαρε": ("γδάρω",),
"έγδαραν": ("γδάρω",),
"έβγαλα": ("βγάλω",),
"έβγαλες": ("βγάλω",),
"έβγαλε": ("βγάλω",),
"έβγαλαν": ("βγάλω",),
"έφθασα": ("φτάνω",),
"έφθασες": ("φτάνω",),
"έφθασε": ("φτάνω",),
"έφθασαν": ("φτάνω",),
}

View File

@ -1,34 +1,45 @@
# coding: utf8
from __future__ import unicode_literals
import re
import pickle
from gensim.corpora.wikicorpus import extract_pages
regex = re.compile(r'==={{(\w+)\|el}}===')
regex2 = re.compile(r'==={{(\w+ \w+)\|el}}===')
regex = re.compile(r"==={{(\w+)\|el}}===")
regex2 = re.compile(r"==={{(\w+ \w+)\|el}}===")
# get words based on the Wiktionary dump
# check only for specific parts
# ==={{κύριο όνομα|el}}===
expected_parts = ['μετοχή', 'ρήμα', 'επίθετο',
'επίρρημα', 'ουσιαστικό', 'κύριο όνομα', 'άρθρο']
expected_parts = [
"μετοχή",
"ρήμα",
"επίθετο",
"επίρρημα",
"ουσιαστικό",
"κύριο όνομα",
"άρθρο",
]
unwanted_parts = '''
unwanted_parts = """
{'αναγραμματισμοί': 2, 'σύνδεσμος': 94, 'απαρέμφατο': 1, 'μορφή άρθρου': 1, 'ένθημα': 1, 'μερική συνωνυμία': 57, 'ορισμός': 1, 'σημείωση': 3, 'πρόσφυμα': 3, 'ταυτόσημα': 8, 'χαρακτήρας': 51, 'μορφή επιρρήματος': 1, 'εκφράσεις': 22, 'ρηματικό σχήμα': 3, 'πολυλεκτικό επίρρημα': 2, 'μόριο': 35, 'προφορά': 412, 'ρηματική έκφραση': 15, 'λογοπαίγνια': 2, 'πρόθεση': 46, 'ρηματικό επίθετο': 1, 'κατάληξη επιρρημάτων': 10, 'συναφείς όροι': 1, 'εξωτερικοί σύνδεσμοι': 1, 'αρσενικό γένος': 1, 'πρόθημα': 169, 'κατάληξη': 3, 'υπώνυμα': 7, 'επιφώνημα': 197, 'ρηματικός τύπος': 1, 'συντομομορφή': 560, 'μορφή ρήματος': 68282, 'μορφή επιθέτου': 61779, 'μορφές': 71, 'ιδιωματισμός': 2, 'πολυλεκτικός όρος': 719, 'πολυλεκτικό ουσιαστικό': 180, 'παράγωγα': 25, 'μορφή μετοχής': 806, 'μορφή αριθμητικού': 3, 'άκλιτο': 1, 'επίθημα': 181, 'αριθμητικό': 129, 'συγγενικά': 94, 'σημειώσεις': 45, 'Ιδιωματισμός': 1, 'ρητά': 12, 'φράση': 9, 'συνώνυμα': 556, 'μεταφράσεις': 1, 'κατάληξη ρημάτων': 15, 'σύνθετα': 27, 'υπερώνυμα': 1, 'εναλλακτικός τύπος': 22, 'μορφή ουσιαστικού': 35122, 'επιρρηματική έκφραση': 12, 'αντώνυμα': 76, 'βλέπε': 7, 'μορφή αντωνυμίας': 51, 'αντωνυμία': 100, 'κλίση': 11, 'σύνθετοι τύποι': 1, 'παροιμία': 5, 'μορφή_επιθέτου': 2, 'έκφραση': 738, 'σύμβολο': 8, 'πολυλεκτικό επίθετο': 1, 'ετυμολογία': 867}
'''
"""
wiktionary_file_path = '/data/gsoc2018-spacy/spacy/lang/el/res/elwiktionary-latest-pages-articles.xml'
wiktionary_file_path = (
"/data/gsoc2018-spacy/spacy/lang/el/res/elwiktionary-latest-pages-articles.xml"
)
proper_names_dict={
'ουσιαστικό':'nouns',
'επίθετο':'adjectives',
'άρθρο':'dets',
'επίρρημα':'adverbs',
'κύριο όνομα': 'proper_names',
'μετοχή': 'participles',
'ρήμα': 'verbs'
proper_names_dict = {
"ουσιαστικό": "nouns",
"επίθετο": "adjectives",
"άρθρο": "dets",
"επίρρημα": "adverbs",
"κύριο όνομα": "proper_names",
"μετοχή": "participles",
"ρήμα": "verbs",
}
expected_parts_dict = {}
for expected_part in expected_parts:
@ -36,7 +47,7 @@ for expected_part in expected_parts:
other_parts = {}
for title, text, pageid in extract_pages(wiktionary_file_path):
if text.startswith('#REDIRECT'):
if text.startswith("#REDIRECT"):
continue
title = title.lower()
all_regex = regex.findall(text)
@ -47,20 +58,17 @@ for title, text, pageid in extract_pages(wiktionary_file_path):
for i in expected_parts_dict:
with open('_{0}.py'.format(proper_names_dict[i]), 'w') as f:
f.write('from __future__ import unicode_literals\n')
f.write('{} = set(\"\"\"\n'.format(proper_names_dict[i].upper()))
with open("_{0}.py".format(proper_names_dict[i]), "w") as f:
f.write("from __future__ import unicode_literals\n")
f.write('{} = set("""\n'.format(proper_names_dict[i].upper()))
words = sorted(expected_parts_dict[i])
line = ''
line = ""
to_write = []
for word in words:
if len(line + ' ' + word) > 79:
for word in words:
if len(line + " " + word) > 79:
to_write.append(line)
line = ''
line = ""
else:
line = line + ' ' + word
f.write('\n'.join(to_write))
f.write('\n\"\"\".split())')
line = line + " " + word
f.write("\n".join(to_write))
f.write('\n""".split())')
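Since the two regexes at the top of this helper script carry most of its logic, here is a quick standalone check of what they extract from a Wiktionary section heading (run under Python 3, where `\w` matches Greek letters by default):

```python
# Quick check of the section-heading regexes used in the script above.
import re

regex = re.compile(r"==={{(\w+)\|el}}===")
regex2 = re.compile(r"==={{(\w+ \w+)\|el}}===")

print(regex.findall("==={{ρήμα|el}}==="))          # ['ρήμα']
print(regex2.findall("==={{κύριο όνομα|el}}==="))  # ['κύριο όνομα']
```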

View File

@ -3,18 +3,18 @@ from __future__ import unicode_literals
from ....symbols import NOUN, VERB, ADJ, PUNCT
'''
Greek language lemmatizer applies the default rule based lemmatization
procedure with some modifications for better Greek language support.
The first modification is that it checks if the word for lemmatization is
already a lemma and if yes, it just returns it.
The second modification is about removing the base forms function which is
not applicable for Greek language.
'''
class GreekLemmatizer(object):
"""
Greek language lemmatizer applies the default rule based lemmatization
procedure with some modifications for better Greek language support.
The first modification is that it checks if the word for lemmatization is
already a lemma and if yes, it just returns it.
The second modification is about removing the base forms function which is
not applicable for Greek language.
"""
@classmethod
def load(cls, path, index=None, exc=None, rules=None, lookup=None):
return cls(index, exc, rules, lookup)
@ -28,26 +28,29 @@ class GreekLemmatizer(object):
def __call__(self, string, univ_pos, morphology=None):
if not self.rules:
return [self.lookup_table.get(string, string)]
if univ_pos in (NOUN, 'NOUN', 'noun'):
univ_pos = 'noun'
elif univ_pos in (VERB, 'VERB', 'verb'):
univ_pos = 'verb'
elif univ_pos in (ADJ, 'ADJ', 'adj'):
univ_pos = 'adj'
elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
univ_pos = 'punct'
if univ_pos in (NOUN, "NOUN", "noun"):
univ_pos = "noun"
elif univ_pos in (VERB, "VERB", "verb"):
univ_pos = "verb"
elif univ_pos in (ADJ, "ADJ", "adj"):
univ_pos = "adj"
elif univ_pos in (PUNCT, "PUNCT", "punct"):
univ_pos = "punct"
else:
return list(set([string.lower()]))
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
self.exc.get(univ_pos, {}),
self.rules.get(univ_pos, []))
lemmas = lemmatize(
string,
self.index.get(univ_pos, {}),
self.exc.get(univ_pos, {}),
self.rules.get(univ_pos, []),
)
return lemmas
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
if (string in index):
if string in index:
forms.append(string)
return forms
forms.extend(exceptions.get(string, []))
@ -55,7 +58,7 @@ def lemmatize(string, index, exceptions, rules):
if not forms:
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
form = string[: len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
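The two modifications named in the docstring are easiest to see in isolation, so here is a condensed, self-contained rendering of the `lemmatize` helper with toy data (the real index, exceptions and rules live under `spacy/lang/el/lemmatizer/`). It is a sketch of the logic, not the exact code:

```python
# Condensed sketch of the rule-based lookup above, with toy Greek data.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    if string in index:                      # already a lemma: return as-is
        return [string]
    forms = list(exceptions.get(string, []))
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[: len(string) - len(old)] + new
                if form and (form in index or not form.isalpha()):
                    forms.append(form)
    return forms or [string]

print(lemmatize("γάτες", {"γάτα"}, {}, [["ες", "α"]]))  # ['γάτα']
```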

View File

@ -4,43 +4,100 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['μηδέν', 'ένας', 'δυο', 'δυό', 'τρεις', 'τέσσερις', 'πέντε',
'έξι', 'εφτά', 'επτά', 'οκτώ', 'οχτώ',
'εννιά', 'εννέα', 'δέκα', 'έντεκα', 'ένδεκα', 'δώδεκα',
'δεκατρείς', 'δεκατέσσερις', 'δεκαπέντε', 'δεκαέξι', 'δεκαεπτά',
'δεκαοχτώ', 'δεκαεννέα', 'δεκαεννεα', 'είκοσι', 'τριάντα',
'σαράντα', 'πενήντα', 'εξήντα', 'εβδομήντα', 'ογδόντα',
'ενενήντα', 'εκατό', 'διακόσιοι', 'διακόσοι', 'τριακόσιοι',
'τριακόσοι', 'τετρακόσιοι', 'τετρακόσοι', 'πεντακόσιοι',
'πεντακόσοι', 'εξακόσιοι', 'εξακόσοι', 'εφτακόσιοι', 'εφτακόσοι',
'επτακόσιοι', 'επτακόσοι', 'οχτακόσιοι', 'οχτακόσοι',
'οκτακόσιοι', 'οκτακόσοι', 'εννιακόσιοι', 'χίλιοι', 'χιλιάδα',
'εκατομμύριο', 'δισεκατομμύριο', 'τρισεκατομμύριο', 'τετράκις',
'πεντάκις', 'εξάκις', 'επτάκις', 'οκτάκις', 'εννεάκις', 'ένα',
'δύο', 'τρία', 'τέσσερα', 'δις', 'χιλιάδες']
_num_words = [
"μηδέν",
"ένας",
"δυο",
"δυό",
"τρεις",
"τέσσερις",
"πέντε",
"έξι",
"εφτά",
"επτά",
"οκτώ",
"οχτώ",
"εννιά",
"εννέα",
"δέκα",
"έντεκα",
"ένδεκα",
"δώδεκα",
"δεκατρείς",
"δεκατέσσερις",
"δεκαπέντε",
"δεκαέξι",
"δεκαεπτά",
"δεκαοχτώ",
"δεκαεννέα",
"δεκαεννεα",
"είκοσι",
"τριάντα",
"σαράντα",
"πενήντα",
"εξήντα",
"εβδομήντα",
"ογδόντα",
"ενενήντα",
"εκατό",
"διακόσιοι",
"διακόσοι",
"τριακόσιοι",
"τριακόσοι",
"τετρακόσιοι",
"τετρακόσοι",
"πεντακόσιοι",
"πεντακόσοι",
"εξακόσιοι",
"εξακόσοι",
"εφτακόσιοι",
"εφτακόσοι",
"επτακόσιοι",
"επτακόσοι",
"οχτακόσιοι",
"οχτακόσοι",
"οκτακόσιοι",
"οκτακόσοι",
"εννιακόσιοι",
"χίλιοι",
"χιλιάδα",
"εκατομμύριο",
"δισεκατομμύριο",
"τρισεκατομμύριο",
"τετράκις",
"πεντάκις",
"εξάκις",
"επτάκις",
"οκτάκις",
"εννεάκις",
"ένα",
"δύο",
"τρία",
"τέσσερα",
"δις",
"χιλιάδες",
]
def like_num(text):
if text.startswith(('+', '-', '±', '~')):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(',', '').replace('.', '')
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.count('^') == 1:
num, denom = text.split('^')
if text.count("^") == 1:
num, denom = text.split("^")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words or text.lower().split(' ')[0] in _num_words:
if text.lower() in _num_words or text.lower().split(" ")[0] in _num_words:
return True
if text in _num_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -3,8 +3,6 @@ from __future__ import unicode_literals
# These exceptions are used to add NORM values based on a token's ORTH value.
# Norms are only set if no alternative is provided in the tokenizer exceptions.
_exc = {

View File

@ -6,66 +6,91 @@ from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from ..char_classes import LIST_ICONS, ALPHA_LOWER, ALPHA_UPPER, ALPHA, HYPHENS
from ..char_classes import QUOTES, CURRENCY
_units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
'TB T G M K км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
_units = (
"km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft "
"kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb "
"TB T G M K км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм "
"кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб"
)
def merge_chars(char): return char.strip().replace(' ', '|')
def merge_chars(char):
return char.strip().replace(" ", "|")
UNITS = merge_chars(_units)
_prefixes = (['\'\'', '§', '%', '=', r'\+[0-9]+%', # 90%
r'\'([0-9]){2}([\-]\'([0-9]){2})*', # '12'-13
r'\-([0-9]){1,9}\.([0-9]){1,9}', # -12.13
r'\'([Α-Ωα-ωίϊΐόάέύϋΰήώ]+)\'', # 'αβγ'
r'([Α-Ωα-ωίϊΐόάέύϋΰήώ]){1,3}\'', # αβγ'
r'http://www.[A-Za-z]+\-[A-Za-z]+(\.[A-Za-z]+)+(\/[A-Za-z]+)*(\.[A-Za-z]+)*',
r'[ΈΆΊΑ-Ωα-ωίϊΐόάέύϋΰήώ]+\*', # όνομα*
r'\$([0-9])+([\,\.]([0-9])+){0,1}',
] + LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
LIST_CURRENCY + LIST_ICONS)
_prefixes = (
[
"''",
"§",
"%",
"=",
r"\+[0-9]+%", # 90%
r"\'([0-9]){2}([\-]\'([0-9]){2})*", # '12'-13
r"\-([0-9]){1,9}\.([0-9]){1,9}", # -12.13
r"\'([Α-Ωα-ωίϊΐόάέύϋΰήώ]+)\'", # 'αβγ'
r"([Α-Ωα-ωίϊΐόάέύϋΰήώ]){1,3}\'", # αβγ'
r"http://www.[A-Za-z]+\-[A-Za-z]+(\.[A-Za-z]+)+(\/[A-Za-z]+)*(\.[A-Za-z]+)*",
r"[ΈΆΊΑ-Ωα-ωίϊΐόάέύϋΰήώ]+\*", # όνομα*
r"\$([0-9])+([\,\.]([0-9])+){0,1}",
]
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
[r'(?<=[0-9])\+', # 12+
r'([0-9])+\'', # 12'
r'([A-Za-z])?\'', # a'
r'^([0-9]){1,2}\.', # 12.
r' ([0-9]){1,2}\.', # 12.
r'([0-9]){1}\) ', # 12)
r'^([0-9]){1}\)$', # 12)
r'(?<=°[FfCcKk])\.',
r'([0-9])+\&', # 12&
r'(?<=[0-9])(?:{})'.format(CURRENCY),
r'(?<=[0-9])(?:{})'.format(UNITS),
r'(?<=[0-9{}{}(?:{})])\.'.format(ALPHA_LOWER, r'²\-\)\]\+', QUOTES),
r'(?<=[{a}][{a}])\.'.format(a=ALPHA_UPPER),
r'(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\-', # όνομα-
r'(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\.',
r'^[Α-Ω]{1}\.',
r'\ [Α-Ω]{1}\.',
# πρώτος-δεύτερος , πρώτος-δεύτερος-τρίτος
r'[ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+([\-]([ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+))+',
r'([0-9]+)mg', # 13mg
r'([0-9]+)\.([0-9]+)m' # 1.2m
])
_suffixes = (
LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_ICONS
+ [
r"(?<=[0-9])\+", # 12+
r"([0-9])+\'", # 12'
r"([A-Za-z])?\'", # a'
r"^([0-9]){1,2}\.", # 12.
r" ([0-9]){1,2}\.", # 12.
r"([0-9]){1}\) ", # 12)
r"^([0-9]){1}\)$", # 12)
r"(?<=°[FfCcKk])\.",
r"([0-9])+\&", # 12&
r"(?<=[0-9])(?:{})".format(CURRENCY),
r"(?<=[0-9])(?:{})".format(UNITS),
r"(?<=[0-9{}{}(?:{})])\.".format(ALPHA_LOWER, r"²\-\)\]\+", QUOTES),
r"(?<=[{a}][{a}])\.".format(a=ALPHA_UPPER),
r"(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\-", # όνομα-
r"(?<=[Α-Ωα-ωίϊΐόάέύϋΰήώ])\.",
r"^[Α-Ω]{1}\.",
r"\ [Α-Ω]{1}\.",
# πρώτος-δεύτερος , πρώτος-δεύτερος-τρίτος
r"[ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+([\-]([ΈΆΊΑΌ-Ωα-ωίϊΐόάέύϋΰήώ]+))+",
r"([0-9]+)mg", # 13mg
r"([0-9]+)\.([0-9]+)m", # 1.2m
]
)
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[0-9])[+\/\-\*^](?=[0-9])', # 1/2 , 1-2 , 1*2
r'([a-zA-Z]+)\/([a-zA-Z]+)\/([a-zA-Z]+)', # name1/name2/name3
r'([0-9])+(\.([0-9]+))*([\-]([0-9])+)+', # 10.9 , 10.9.9 , 10.9-6
r'([0-9])+[,]([0-9])+[\-]([0-9])+[,]([0-9])+', # 10,11,12
r'([0-9])+[ης]+([\-]([0-9])+)+', # 1ης-2
# 15/2 , 15/2/17 , 2017/2/15
r'([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}',
r'[A-Za-z]+\@[A-Za-z]+(\-[A-Za-z]+)*\.[A-Za-z]+', # abc@cde-fgh.a
r'([a-zA-Z]+)(\-([a-zA-Z]+))+', # abc-abc
r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA)])
_infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\/\-\*^](?=[0-9])", # 1/2 , 1-2 , 1*2
r"([a-zA-Z]+)\/([a-zA-Z]+)\/([a-zA-Z]+)", # name1/name2/name3
r"([0-9])+(\.([0-9]+))*([\-]([0-9])+)+", # 10.9 , 10.9.9 , 10.9-6
r"([0-9])+[,]([0-9])+[\-]([0-9])+[,]([0-9])+", # 10,11,12
r"([0-9])+[ης]+([\-]([0-9])+)+", # 1ης-2
# 15/2 , 15/2/17 , 2017/2/15
r"([0-9]){1,4}[\/]([0-9]){1,2}([\/]([0-9]){0,4}){0,1}",
r"[A-Za-z]+\@[A-Za-z]+(\-[A-Za-z]+)*\.[A-Za-z]+", # abc@cde-fgh.a
r"([a-zA-Z]+)(\-([a-zA-Z]+))+", # abc-abc
r"(?<=[{}])\.(?=[{}])".format(ALPHA_LOWER, ALPHA_UPPER),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA),
]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
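
To see how rule lists like these end up being used, here is a minimal sketch that compiles two of the suffix patterns above with spaCy's regex helpers (the helper is the public `spacy.util.compile_suffix_regex`; the sample strings are made up):

```python
from spacy.util import compile_suffix_regex

# Two of the suffix patterns above, compiled the way the tokenizer consumes them.
toy_suffixes = [r"(?<=[0-9])\+", r"([0-9]+)mg"]
suffix_re = compile_suffix_regex(toy_suffixes)

print(suffix_re.search("12+"))   # matches the trailing "+" after a digit
print(suffix_re.search("13mg"))  # matches the "mg" unit suffix
```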

View File

@ -1,13 +1,11 @@
# -*- coding: utf-8 -*-
# coding: utf8
from __future__ import unicode_literals
# Stop words
# Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0
STOP_WORDS = set("""
STOP_WORDS = set(
"""
αδιάκοπα αι ακόμα ακόμη ακριβώς άλλα αλλά αλλαχού άλλες άλλη άλλην
άλλης αλλιώς αλλιώτικα άλλο άλλοι αλλοιώς αλλοιώτικα άλλον άλλος άλλοτε αλλού
άλλους άλλων άμα άμεσα αμέσως αν ανά ανάμεσα αναμεταξύ άνευ αντί αντίπερα αντίς
@ -27,10 +25,10 @@ STOP_WORDS = set("""
έκαστες έκαστη έκαστην έκαστης έκαστο έκαστοι έκαστον έκαστος εκάστου εκάστους εκάστων
εκεί εκείνα εκείνες εκείνη εκείνην εκείνης εκείνο εκείνοι εκείνον εκείνος εκείνου
εκείνους εκείνων εκτός εμάς εμείς εμένα εμπρός εν ένα έναν ένας ενός εντελώς εντός
εναντίον εξής εξαιτίας επιπλέον επόμενη εντωμεταξύ ενώ εξ έξαφνα εξήσ εξίσου έξω επάνω
επειδή έπειτα επί επίσης επομένως εσάς εσείς εσένα έστω εσύ ετέρα ετέραι ετέρας έτερες
έτερη έτερης έτερο έτεροι έτερον έτερος ετέρου έτερους ετέρων ετούτα ετούτες ετούτη ετούτην
ετούτης ετούτο ετούτοι ετούτον ετούτος ετούτου ετούτους ετούτων έτσι εύγε ευθύς ευτυχώς εφεξής
εναντίον εξής εξαιτίας επιπλέον επόμενη εντωμεταξύ ενώ εξ έξαφνα εξήσ εξίσου έξω επάνω
επειδή έπειτα επί επίσης επομένως εσάς εσείς εσένα έστω εσύ ετέρα ετέραι ετέρας έτερες
έτερη έτερης έτερο έτεροι έτερον έτερος ετέρου έτερους ετέρων ετούτα ετούτες ετούτη ετούτην
ετούτης ετούτο ετούτοι ετούτον ετούτος ετούτου ετούτους ετούτων έτσι εύγε ευθύς ευτυχώς εφεξής
έχει έχεις έχετε έχομε έχουμε έχουν εχτές έχω έως έγιναν έγινε έκανε έξι έχοντας
η ήδη ήμασταν ήμαστε ήμουν ήσασταν ήσαστε ήσουν ήταν ήτανε ήτοι ήττον
@ -77,7 +75,7 @@ STOP_WORDS = set("""
συχνός συχνού συχνούς συχνών συχνώς σχεδόν
τα τάδε ταύτα ταύτες ταύτη ταύτην ταύτης ταύτοταύτον ταύτος ταύτου ταύτων τάχα τάχατε
τελευταία τελευταίο τελευταίος τού τρία τρίτη τρεις τελικά τελικώς τες τέτοια τέτοιαν
τελευταία τελευταίο τελευταίος τού τρία τρίτη τρεις τελικά τελικώς τες τέτοια τέτοιαν
τέτοιας τέτοιες τέτοιο τέτοιοι τέτοιον τέτοιος τέτοιου
τέτοιους τέτοιων τη την της τι τίποτα τίποτε τις το τοι τον τοσ τόσα τόσες τόση τόσην
τόσης τόσο τόσοι τόσον τόσος τόσου τόσους τόσων τότε του τουλάχιστο τουλάχιστον τους τούς τούτα
@ -89,4 +87,5 @@ STOP_WORDS = set("""
χωρίς χωριστά
ω ως ωσάν ωσότου ώσπου ώστε ωστόσο ωχ
""".split())
""".split()
)

View File

@ -8,18 +8,16 @@ def noun_chunks(obj):
"""
Detect base noun phrases. Works on both Doc and Span.
"""
# it follows the logic of the noun chunks finder of English language,
# It follows the logic of the noun chunks finder of English language,
# adjusted for some characteristics specific to the Greek language.
# The obj tag corrects some DEP tagger mistakes.
# Further improvement of the models will eliminate the need for this tag.
labels = ['nsubj', 'obj', 'iobj', 'appos', 'ROOT', 'obl']
labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"]
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add('conj')
nmod = doc.vocab.strings.add('nmod')
np_label = doc.vocab.strings.add('NP')
conj = doc.vocab.strings.add("conj")
nmod = doc.vocab.strings.add("nmod")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
@ -31,16 +29,17 @@ def noun_chunks(obj):
if any(w.i in seen for w in word.subtree):
continue
flag = False
if (word.pos == NOUN):
if word.pos == NOUN:
# check for patterns such as γραμμή παραγωγής
for potential_nmod in word.rights:
if (potential_nmod.dep == nmod):
seen.update(j for j in range(
word.left_edge.i, potential_nmod.i + 1))
if potential_nmod.dep == nmod:
seen.update(
j for j in range(word.left_edge.i, potential_nmod.i + 1)
)
yield word.left_edge.i, potential_nmod.i + 1, np_label
flag = True
break
if (flag is False):
if flag is False:
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
@ -56,6 +55,4 @@ def noun_chunks(obj):
yield word.left_edge.i, word.i + 1, np_label
SYNTAX_ITERATORS = {
'noun_chunks': noun_chunks
}
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
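
A quick usage sketch for the iterator above. `noun_chunks` needs a dependency parse, so this assumes a trained Greek model such as `el_core_news_sm` is installed; a blank pipeline is not enough:

```python
import spacy

nlp = spacy.load("el_core_news_sm")  # assumed to be installed
doc = nlp("Η γραμμή παραγωγής σταμάτησε χθες.")
for chunk in doc.noun_chunks:
    # e.g. "Η γραμμή παραγωγής" for the pattern handled by the nmod check above
    print(chunk.text)
```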

File diff suppressed because it is too large

View File

@ -1,3 +1,4 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
@ -22,5 +23,5 @@ TAG_MAP = {
"AUX": {POS: AUX},
"SPACE": {POS: SPACE},
"DET": {POS: DET},
"X": {POS: X}
"X": {POS: X},
}

View File

@ -1,303 +1,132 @@
# -*- coding: utf-8 -*-
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, NORM
_exc = {}
for token in ["Απ'", "ΑΠ'", "αφ'", "Αφ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "από", NORM: "από"}
]
_exc[token] = [{ORTH: token, LEMMA: "από", NORM: "από"}]
for token in ["Αλλ'", "αλλ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "αλλά", NORM: "αλλά"}
]
_exc[token] = [{ORTH: token, LEMMA: "αλλά", NORM: "αλλά"}]
for token in ["παρ'", "Παρ'", "ΠΑΡ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "παρά", NORM: "παρά"}
]
_exc[token] = [{ORTH: token, LEMMA: "παρά", NORM: "παρά"}]
for token in ["καθ'", "Καθ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "κάθε", NORM: "κάθε"}
]
_exc[token] = [{ORTH: token, LEMMA: "κάθε", NORM: "κάθε"}]
for token in ["κατ'", "Κατ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "κατά", NORM: "κατά"}
]
_exc[token] = [{ORTH: token, LEMMA: "κατά", NORM: "κατά"}]
for token in ["'ΣΟΥΝ", "'ναι", "'ταν", "'τανε", "'μαστε", "'μουνα", "'μουν"]:
_exc[token] = [
{ORTH: token, LEMMA: "είμαι", NORM: "είμαι"}
]
_exc[token] = [{ORTH: token, LEMMA: "είμαι", NORM: "είμαι"}]
for token in ["Επ'", "επ'", "εφ'", "Εφ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "επί", NORM: "επί"}
]
_exc[token] = [{ORTH: token, LEMMA: "επί", NORM: "επί"}]
for token in ["Δι'", "δι'"]:
_exc[token] = [
{ORTH: token, LEMMA: "δια", NORM: "δια"}
]
_exc[token] = [{ORTH: token, LEMMA: "δια", NORM: "δια"}]
for token in ["'χουν", "'χουμε", "'χαμε", "'χα", "'χε", "'χεις", "'χει"]:
_exc[token] = [
{ORTH: token, LEMMA: "έχω", NORM: "έχω"}
]
_exc[token] = [{ORTH: token, LEMMA: "έχω", NORM: "έχω"}]
for token in ["υπ'", "Υπ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "υπό", NORM: "υπό"}
]
_exc[token] = [{ORTH: token, LEMMA: "υπό", NORM: "υπό"}]
for token in ["Μετ'", "ΜΕΤ'", "'μετ"]:
_exc[token] = [
{ORTH: token, LEMMA: "μετά", NORM: "μετά"}
]
_exc[token] = [{ORTH: token, LEMMA: "μετά", NORM: "μετά"}]
for token in ["Μ'", "μ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "με", NORM: "με"}
]
_exc[token] = [{ORTH: token, LEMMA: "με", NORM: "με"}]
for token in ["Γι'", "ΓΙ'", "γι'"]:
_exc[token] = [
{ORTH: token, LEMMA: "για", NORM: "για"}
]
_exc[token] = [{ORTH: token, LEMMA: "για", NORM: "για"}]
for token in ["Σ'", "σ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "σε", NORM: "σε"}
]
_exc[token] = [{ORTH: token, LEMMA: "σε", NORM: "σε"}]
for token in ["Θ'", "θ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "θα", NORM: "θα"}
]
_exc[token] = [{ORTH: token, LEMMA: "θα", NORM: "θα"}]
for token in ["Ν'", "ν'"]:
_exc[token] = [
{ORTH: token, LEMMA: "να", NORM: "να"}
]
_exc[token] = [{ORTH: token, LEMMA: "να", NORM: "να"}]
for token in ["Τ'", "τ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "να", NORM: "να"}
]
_exc[token] = [{ORTH: token, LEMMA: "να", NORM: "να"}]
for token in ["'γω", "'σένα", "'μεις"]:
_exc[token] = [
{ORTH: token, LEMMA: "εγώ", NORM: "εγώ"}
]
_exc[token] = [{ORTH: token, LEMMA: "εγώ", NORM: "εγώ"}]
for token in ["Τ'", "τ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "το", NORM: "το"}
]
_exc[token] = [{ORTH: token, LEMMA: "το", NORM: "το"}]
for token in ["Φέρ'", "Φερ'", "φέρ'", "φερ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "φέρνω", NORM: "φέρνω"}
]
_exc[token] = [{ORTH: token, LEMMA: "φέρνω", NORM: "φέρνω"}]
for token in ["'ρθούνε", "'ρθουν", "'ρθει", "'ρθεί", "'ρθε", "'ρχεται"]:
_exc[token] = [
{ORTH: token, LEMMA: "έρχομαι", NORM: "έρχομαι"}
]
_exc[token] = [{ORTH: token, LEMMA: "έρχομαι", NORM: "έρχομαι"}]
for token in ["'πανε", "'λεγε", "'λεγαν", "'πε", "'λεγα"]:
_exc[token] = [
{ORTH: token, LEMMA: "λέγω", NORM: "λέγω"}
]
_exc[token] = [{ORTH: token, LEMMA: "λέγω", NORM: "λέγω"}]
for token in ["Πάρ'", "πάρ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "παίρνω", NORM: "παίρνω"}
]
_exc[token] = [{ORTH: token, LEMMA: "παίρνω", NORM: "παίρνω"}]
for token in ["μέσ'", "Μέσ'", "μεσ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "μέσα", NORM: "μέσα"}
]
_exc[token] = [{ORTH: token, LEMMA: "μέσα", NORM: "μέσα"}]
for token in ["Δέσ'", "Δεσ'", "δεσ'"]:
_exc[token] = [
{ORTH: token, LEMMA: "δένω", NORM: "δένω"}
]
_exc[token] = [{ORTH: token, LEMMA: "δένω", NORM: "δένω"}]
for token in ["'κανε", "Κάν'"]:
_exc[token] = [
{ORTH: token, LEMMA: "κάνω", NORM: "κάνω"}
]
_exc[token] = [{ORTH: token, LEMMA: "κάνω", NORM: "κάνω"}]
_other_exc = {
"κι": [
{ORTH: "κι", LEMMA: "και", NORM: "και"},
],
"Παίξ'": [
{ORTH: "Παίξ'", LEMMA: "παίζω", NORM: "παίζω"},
],
"Αντ'": [
{ORTH: "Αντ'", LEMMA: "αντί", NORM: "αντί"},
],
"ολ'": [
{ORTH: "ολ'", LEMMA: "όλος", NORM: "όλος"},
],
"ύστερ'": [
{ORTH: "ύστερ'", LEMMA: "ύστερα", NORM: "ύστερα"},
],
"'πρεπε": [
{ORTH: "'πρεπε", LEMMA: "πρέπει", NORM: "πρέπει"},
],
"Δύσκολ'": [
{ORTH: "Δύσκολ'", LEMMA: "δύσκολος", NORM: "δύσκολος"},
],
"'θελα": [
{ORTH: "'θελα", LEMMA: "θέλω", NORM: "θέλω"},
],
"'γραφα": [
{ORTH: "'γραφα", LEMMA: "γράφω", NORM: "γράφω"},
],
"'παιρνα": [
{ORTH: "'παιρνα", LEMMA: "παίρνω", NORM: "παίρνω"},
],
"'δειξε": [
{ORTH: "'δειξε", LEMMA: "δείχνω", NORM: "δείχνω"},
],
"όμουρφ'": [
{ORTH: "όμουρφ'", LEMMA: "όμορφος", NORM: "όμορφος"},
],
"κ'τσή": [
{ORTH: "κ'τσή", LEMMA: "κουτσός", NORM: "κουτσός"},
],
"μηδ'": [
{ORTH: "μηδ'", LEMMA: "μήδε", NORM: "μήδε"},
],
"κι": [{ORTH: "κι", LEMMA: "και", NORM: "και"}],
"Παίξ'": [{ORTH: "Παίξ'", LEMMA: "παίζω", NORM: "παίζω"}],
"Αντ'": [{ORTH: "Αντ'", LEMMA: "αντί", NORM: "αντί"}],
"ολ'": [{ORTH: "ολ'", LEMMA: "όλος", NORM: "όλος"}],
"ύστερ'": [{ORTH: "ύστερ'", LEMMA: "ύστερα", NORM: "ύστερα"}],
"'πρεπε": [{ORTH: "'πρεπε", LEMMA: "πρέπει", NORM: "πρέπει"}],
"Δύσκολ'": [{ORTH: "Δύσκολ'", LEMMA: "δύσκολος", NORM: "δύσκολος"}],
"'θελα": [{ORTH: "'θελα", LEMMA: "θέλω", NORM: "θέλω"}],
"'γραφα": [{ORTH: "'γραφα", LEMMA: "γράφω", NORM: "γράφω"}],
"'παιρνα": [{ORTH: "'παιρνα", LEMMA: "παίρνω", NORM: "παίρνω"}],
"'δειξε": [{ORTH: "'δειξε", LEMMA: "δείχνω", NORM: "δείχνω"}],
"όμουρφ'": [{ORTH: "όμουρφ'", LEMMA: "όμορφος", NORM: "όμορφος"}],
"κ'τσή": [{ORTH: "κ'τσή", LEMMA: "κουτσός", NORM: "κουτσός"}],
"μηδ'": [{ORTH: "μηδ'", LEMMA: "μήδε", NORM: "μήδε"}],
"'ξομολογήθηκε": [
{ORTH: "'ξομολογήθηκε", LEMMA: "εξομολογούμαι", NORM: "εξομολογούμαι"},
{ORTH: "'ξομολογήθηκε", LEMMA: "εξομολογούμαι", NORM: "εξομολογούμαι"}
],
"'μας": [
{ORTH: "'μας", LEMMA: "εμάς", NORM: "εμάς"},
],
"'ξερες": [
{ORTH: "'ξερες", LEMMA: "ξέρω", NORM: "ξέρω"},
],
"έφθασ'": [
{ORTH: "έφθασ'", LEMMA: "φθάνω", NORM: "φθάνω"},
],
"εξ'": [
{ORTH: "εξ'", LEMMA: "εκ", NORM: "εκ"},
],
"δώσ'": [
{ORTH: "δώσ'", LEMMA: "δίνω", NORM: "δίνω"},
],
"τίποτ'": [
{ORTH: "τίποτ'", LEMMA: "τίποτα", NORM: "τίποτα"},
],
"Λήξ'": [
{ORTH: "Λήξ'", LEMMA: "λήγω", NORM: "λήγω"},
],
"άσ'": [
{ORTH: "άσ'", LEMMA: "αφήνω", NORM: "αφήνω"},
],
"Στ'": [
{ORTH: "Στ'", LEMMA: "στο", NORM: "στο"},
],
"Δωσ'": [
{ORTH: "Δωσ'", LEMMA: "δίνω", NORM: "δίνω"},
],
"Βάψ'": [
{ORTH: "Βάψ'", LEMMA: "βάφω", NORM: "βάφω"},
],
"Αλλ'": [
{ORTH: "Αλλ'", LEMMA: "αλλά", NORM: "αλλά"},
],
"Αμ'": [
{ORTH: "Αμ'", LEMMA: "άμα", NORM: "άμα"},
],
"Αγόρασ'": [
{ORTH: "Αγόρασ'", LEMMA: "αγοράζω", NORM: "αγοράζω"},
],
"'φύγε": [
{ORTH: "'φύγε", LEMMA: "φεύγω", NORM: "φεύγω"},
],
"'φερε": [
{ORTH: "'φερε", LEMMA: "φέρνω", NORM: "φέρνω"},
],
"'φαγε": [
{ORTH: "'φαγε", LEMMA: "τρώω", NORM: "τρώω"},
],
"'σπαγαν": [
{ORTH: "'σπαγαν", LEMMA: "σπάω", NORM: "σπάω"},
],
"'σκασε": [
{ORTH: "'σκασε", LEMMA: "σκάω", NORM: "σκάω"},
],
"'σβηνε": [
{ORTH: "'σβηνε", LEMMA: "σβήνω", NORM: "σβήνω"},
],
"'ριξε": [
{ORTH: "'ριξε", LEMMA: "ρίχνω", NORM: "ρίχνω"},
],
"'κλεβε": [
{ORTH: "'κλεβε", LEMMA: "κλέβω", NORM: "κλέβω"},
],
"'κει": [
{ORTH: "'κει", LEMMA: "εκεί", NORM: "εκεί"},
],
"'βλεπε": [
{ORTH: "'βλεπε", LEMMA: "βλέπω", NORM: "βλέπω"},
],
"'βγαινε": [
{ORTH: "'βγαινε", LEMMA: "βγαίνω", NORM: "βγαίνω"},
]
"'μας": [{ORTH: "'μας", LEMMA: "εμάς", NORM: "εμάς"}],
"'ξερες": [{ORTH: "'ξερες", LEMMA: "ξέρω", NORM: "ξέρω"}],
"έφθασ'": [{ORTH: "έφθασ'", LEMMA: "φθάνω", NORM: "φθάνω"}],
"εξ'": [{ORTH: "εξ'", LEMMA: "εκ", NORM: "εκ"}],
"δώσ'": [{ORTH: "δώσ'", LEMMA: "δίνω", NORM: "δίνω"}],
"τίποτ'": [{ORTH: "τίποτ'", LEMMA: "τίποτα", NORM: "τίποτα"}],
"Λήξ'": [{ORTH: "Λήξ'", LEMMA: "λήγω", NORM: "λήγω"}],
"άσ'": [{ORTH: "άσ'", LEMMA: "αφήνω", NORM: "αφήνω"}],
"Στ'": [{ORTH: "Στ'", LEMMA: "στο", NORM: "στο"}],
"Δωσ'": [{ORTH: "Δωσ'", LEMMA: "δίνω", NORM: "δίνω"}],
"Βάψ'": [{ORTH: "Βάψ'", LEMMA: "βάφω", NORM: "βάφω"}],
"Αλλ'": [{ORTH: "Αλλ'", LEMMA: "αλλά", NORM: "αλλά"}],
"Αμ'": [{ORTH: "Αμ'", LEMMA: "άμα", NORM: "άμα"}],
"Αγόρασ'": [{ORTH: "Αγόρασ'", LEMMA: "αγοράζω", NORM: "αγοράζω"}],
"'φύγε": [{ORTH: "'φύγε", LEMMA: "φεύγω", NORM: "φεύγω"}],
"'φερε": [{ORTH: "'φερε", LEMMA: "φέρνω", NORM: "φέρνω"}],
"'φαγε": [{ORTH: "'φαγε", LEMMA: "τρώω", NORM: "τρώω"}],
"'σπαγαν": [{ORTH: "'σπαγαν", LEMMA: "σπάω", NORM: "σπάω"}],
"'σκασε": [{ORTH: "'σκασε", LEMMA: "σκάω", NORM: "σκάω"}],
"'σβηνε": [{ORTH: "'σβηνε", LEMMA: "σβήνω", NORM: "σβήνω"}],
"'ριξε": [{ORTH: "'ριξε", LEMMA: "ρίχνω", NORM: "ρίχνω"}],
"'κλεβε": [{ORTH: "'κλεβε", LEMMA: "κλέβω", NORM: "κλέβω"}],
"'κει": [{ORTH: "'κει", LEMMA: "εκεί", NORM: "εκεί"}],
"'βλεπε": [{ORTH: "'βλεπε", LEMMA: "βλέπω", NORM: "βλέπω"}],
"'βγαινε": [{ORTH: "'βγαινε", LEMMA: "βγαίνω", NORM: "βγαίνω"}],
}
_exc.update(_other_exc)
@ -307,12 +136,14 @@ for h in range(1, 12 + 1):
for period in ["π.μ.", "πμ"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "π.μ.", NORM: "π.μ."}]
{ORTH: period, LEMMA: "π.μ.", NORM: "π.μ."},
]
for period in ["μ.μ.", "μμ"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "μ.μ.", NORM: "μ.μ."}]
{ORTH: period, LEMMA: "μ.μ.", NORM: "μ.μ."},
]
for exc_data in [
{ORTH: "ΑΓΡ.", LEMMA: "Αγροτικός", NORM: "Αγροτικός"},
@ -339,43 +170,228 @@ for exc_data in [
for orth in [
"$ΗΠΑ",
"Α'", "Α.Ε.", "Α.Ε.Β.Ε.", "Α.Ε.Ι.", "Α.Ε.Π.", "Α.Μ.Α.", "Α.Π.Θ.", "Α.Τ.", "Α.Χ.", "ΑΝ.", "Αγ.", "Αλ.", "Αν.",
"Αντ.", "Απ.",
"Β'", "Β)", "Β.Ζ.", "Β.Ι.Ο.", "Β.Κ.", "Β.Μ.Α.", "Βασ.",
"Γ'", "Γ)", "Γ.Γ.", "Γ.Δ.", "Γκ.",
"Δ.Ε.Η.", "Δ.Ε.Σ.Ε.", "Δ.Ν.", "Δ.Ο.Υ.", "Δ.Σ.", "Δ.Υ.", "ΔΙ.ΚΑ.Τ.Σ.Α.", "Δηλ.", "Διον.",
"Ε.Α.", "Ε.Α.Κ.", "Ε.Α.Π.", "Ε.Ε.", "Ε.Κ.", "Ε.ΚΕ.ΠΙΣ.", "Ε.Λ.Α.", "Ε.Λ.Ι.Α.", "Ε.Π.Σ.", "Ε.Π.Τ.Α.", "Ε.Σ.Ε.Ε.Κ.",
"Ε.Υ.Κ.", "ΕΕ.", "ΕΚ.", "ΕΛ.", "ΕΛ.ΑΣ.", "Εθν.", "Ελ.", "Εμ.", "Επ.", "Ευ.",
"Η'", "Η.Π.Α.",
"ΘΕ.", "Θεμ.", "Θεοδ.", "Θρ.",
"Ι.Ε.Κ.", "Ι.Κ.Α.", "Ι.Κ.Υ.", "Ι.Σ.Θ.", "Ι.Χ.", "ΙΖ'", "ΙΧ.",
"Κ.Α.Α.", "Κ.Α.Ε.", "Κ.Β.Σ.", "Κ.Δ.", "Κ.Ε.", "Κ.Ε.Κ.", "Κ.Ι.", "Κ.Κ.", "Κ.Ι.Θ.", "Κ.Ι.Θ.", "Κ.ΚΕΚ.", "Κ.Ο.",
"Κ.Π.Ρ.", "ΚΑΤ.", "ΚΚ.", "Καν.", "Καρ.", "Κατ.", "Κυρ.", "Κων.",
"Λ.Α.", "Λ.χ.", "Λ.Χ.", "Λεωφ.", "Λι.",
"Μ.Δ.Ε.", "Μ.Ε.Ο.", "Μ.Ζ.", "Μ.Μ.Ε.", "Μ.Ο.", "Μεγ.", "Μιλτ.", "Μιχ.",
"Ν.Δ.", "Ν.Ε.Α.", "Ν.Κ.", "Ν.Ο.", "Ν.Ο.Θ.", "Ν.Π.Δ.Δ.", "Ν.Υ.", "ΝΔ.", "Νικ.", "Ντ'", "Ντ.",
"Ο'", "Ο.Α.", "Ο.Α.Ε.Δ.", "Ο.Δ.", "Ο.Ε.Ε.", "Ο.Ε.Ε.Κ.", "Ο.Η.Ε.", "Ο.Κ.",
"Π.Δ.", "Π.Ε.Κ.Δ.Υ.", "Π.Ε.Π.", "Π.Μ.Σ.", "ΠΟΛ.", "Π.Χ.", "Παρ.", "Πλ.", "Πρ.",
"Σ.Δ.Ο.Ε.", "Σ.Ε.", "Σ.Ε.Κ.", "Σ.Π.Δ.Ω.Β.", "Σ.Τ.", "Σαβ.", "Στ.", "ΣτΕ.", "Στρ.",
"Τ.Α.", "Τ.Ε.Ε.", "Τ.Ε.Ι.", "ΤΡ.", "Τζ.", "Τηλ.",
"Υ.Γ.", "ΥΓ.", "ΥΠ.Ε.Π.Θ.",
"Φ.Α.Β.Ε.", "Φ.Κ.", "Φ.Σ.", "Φ.Χ.", "Φ.Π.Α.", "Φιλ.",
"Χ.Α.Α.", "ΧΡ.", "Χ.Χ.", "Χαρ.", "Χιλ.", "Χρ.",
"άγ.", "άρθρ.", "αι.", "αν.", "απ.", "αρ.", "αριθ.", "αριθμ.",
"β'", "βλ.",
"γ.γ.", "γεν.", "γραμμ.",
"δ.δ.", "δ.σ.", "δηλ.", "δισ.", "δολ.", "δρχ.",
"εκ.", "εκατ.", "ελ.",
"Α'",
"Α.Ε.",
"Α.Ε.Β.Ε.",
"Α.Ε.Ι.",
"Α.Ε.Π.",
"Α.Μ.Α.",
"Α.Π.Θ.",
"Α.Τ.",
"Α.Χ.",
"ΑΝ.",
"Αγ.",
"Αλ.",
"Αν.",
"Αντ.",
"Απ.",
"Β'",
"Β)",
"Β.Ζ.",
"Β.Ι.Ο.",
"Β.Κ.",
"Β.Μ.Α.",
"Βασ.",
"Γ'",
"Γ)",
"Γ.Γ.",
"Γ.Δ.",
"Γκ.",
"Δ.Ε.Η.",
"Δ.Ε.Σ.Ε.",
"Δ.Ν.",
"Δ.Ο.Υ.",
"Δ.Σ.",
"Δ.Υ.",
"ΔΙ.ΚΑ.Τ.Σ.Α.",
"Δηλ.",
"Διον.",
"Ε.Α.",
"Ε.Α.Κ.",
"Ε.Α.Π.",
"Ε.Ε.",
"Ε.Κ.",
"Ε.ΚΕ.ΠΙΣ.",
"Ε.Λ.Α.",
"Ε.Λ.Ι.Α.",
"Ε.Π.Σ.",
"Ε.Π.Τ.Α.",
"Ε.Σ.Ε.Ε.Κ.",
"Ε.Υ.Κ.",
"ΕΕ.",
"ΕΚ.",
"ΕΛ.",
"ΕΛ.ΑΣ.",
"Εθν.",
"Ελ.",
"Εμ.",
"Επ.",
"Ευ.",
"Η'",
"Η.Π.Α.",
"ΘΕ.",
"Θεμ.",
"Θεοδ.",
"Θρ.",
"Ι.Ε.Κ.",
"Ι.Κ.Α.",
"Ι.Κ.Υ.",
"Ι.Σ.Θ.",
"Ι.Χ.",
"ΙΖ'",
"ΙΧ.",
"Κ.Α.Α.",
"Κ.Α.Ε.",
"Κ.Β.Σ.",
"Κ.Δ.",
"Κ.Ε.",
"Κ.Ε.Κ.",
"Κ.Ι.",
"Κ.Κ.",
"Κ.Ι.Θ.",
"Κ.Ι.Θ.",
"Κ.ΚΕΚ.",
"Κ.Ο.",
"Κ.Π.Ρ.",
"ΚΑΤ.",
"ΚΚ.",
"Καν.",
"Καρ.",
"Κατ.",
"Κυρ.",
"Κων.",
"Λ.Α.",
"Λ.χ.",
"Λ.Χ.",
"Λεωφ.",
"Λι.",
"Μ.Δ.Ε.",
"Μ.Ε.Ο.",
"Μ.Ζ.",
"Μ.Μ.Ε.",
"Μ.Ο.",
"Μεγ.",
"Μιλτ.",
"Μιχ.",
"Ν.Δ.",
"Ν.Ε.Α.",
"Ν.Κ.",
"Ν.Ο.",
"Ν.Ο.Θ.",
"Ν.Π.Δ.Δ.",
"Ν.Υ.",
"ΝΔ.",
"Νικ.",
"Ντ'",
"Ντ.",
"Ο'",
"Ο.Α.",
"Ο.Α.Ε.Δ.",
"Ο.Δ.",
"Ο.Ε.Ε.",
"Ο.Ε.Ε.Κ.",
"Ο.Η.Ε.",
"Ο.Κ.",
"Π.Δ.",
"Π.Ε.Κ.Δ.Υ.",
"Π.Ε.Π.",
"Π.Μ.Σ.",
"ΠΟΛ.",
"Π.Χ.",
"Παρ.",
"Πλ.",
"Πρ.",
"Σ.Δ.Ο.Ε.",
"Σ.Ε.",
"Σ.Ε.Κ.",
"Σ.Π.Δ.Ω.Β.",
"Σ.Τ.",
"Σαβ.",
"Στ.",
"ΣτΕ.",
"Στρ.",
"Τ.Α.",
"Τ.Ε.Ε.",
"Τ.Ε.Ι.",
"ΤΡ.",
"Τζ.",
"Τηλ.",
"Υ.Γ.",
"ΥΓ.",
"ΥΠ.Ε.Π.Θ.",
"Φ.Α.Β.Ε.",
"Φ.Κ.",
"Φ.Σ.",
"Φ.Χ.",
"Φ.Π.Α.",
"Φιλ.",
"Χ.Α.Α.",
"ΧΡ.",
"Χ.Χ.",
"Χαρ.",
"Χιλ.",
"Χρ.",
"άγ.",
"άρθρ.",
"αι.",
"αν.",
"απ.",
"αρ.",
"αριθ.",
"αριθμ.",
"β'",
"βλ.",
"γ.γ.",
"γεν.",
"γραμμ.",
"δ.δ.",
"δ.σ.",
"δηλ.",
"δισ.",
"δολ.",
"δρχ.",
"εκ.",
"εκατ.",
"ελ.",
"θιν'",
"κ.", "κ.ά.", "κ.α.", "κ.κ.", "κ.λπ.", "κ.ο.κ.", "κ.τ.λ.", "κλπ.", "κτλ.", "κυβ.",
"κ.",
"κ.ά.",
"κ.α.",
"κ.κ.",
"κ.λπ.",
"κ.ο.κ.",
"κ.τ.λ.",
"κλπ.",
"κτλ.",
"κυβ.",
"λ.χ.",
"μ.", "μ.Χ.", "μ.μ.", "μιλ.",
"μ.",
"μ.Χ.",
"μ.μ.",
"μιλ.",
"ντ'",
"π.Χ.", "π.β.", "π.δ.", "π.μ.", "π.χ.",
"σ.", "σ.α.λ.", "σ.σ.", "σελ.", "στρ.",
"τ'ς", "τ.μ.", "τετ.", "τετρ.", "τηλ.", "τρισ.", "τόν.",
"π.Χ.",
"π.β.",
"π.δ.",
"π.μ.",
"π.χ.",
"σ.",
"σ.α.λ.",
"σ.σ.",
"σελ.",
"στρ.",
"τ'ς",
"τ.μ.",
"τετ.",
"τετρ.",
"τηλ.",
"τρισ.",
"τόν.",
"υπ.",
"χ.μ.", "χγρ.", "χιλ.", "χλμ."
"χ.μ.",
"χγρ.",
"χιλ.",
"χλμ.",
]:
_exc[orth] = [{ORTH: orth}]
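
For context, a minimal sketch of what one of these `_exc` entries encodes: a special case tells the tokenizer to keep the string together (or split it) and to attach attributes such as NORM to the resulting tokens. It is shown here on a blank English pipeline purely for illustration, since the Greek language class loads the exceptions above automatically:

```python
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("Απ'", [{ORTH: "Απ'", NORM: "από"}])
doc = nlp("Απ' εδώ")
print([(t.text, t.norm_) for t in doc])  # expected: [("Απ'", "από"), ("εδώ", "εδώ")]
```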

View File

@ -16,15 +16,18 @@ from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
def _return_en(_):
return 'en'
return "en"
class EnglishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = _return_en
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
BASE_NORMS, NORM_EXCEPTIONS)
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
@ -37,8 +40,8 @@ class EnglishDefaults(Language.Defaults):
class English(Language):
lang = 'en'
lang = "en"
Defaults = EnglishDefaults
__all__ = ['English']
__all__ = ["English"]
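
A quick usage sketch for the module above: the exported `English` class can be instantiated directly, which is roughly what `spacy.blank("en")` does under the hood:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a sentence.")
print([t.text for t in doc])  # ['This', 'is', 'a', 'sentence', '.']
```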

View File

@ -18,5 +18,5 @@ sentences = [
"Where are you?",
"Who is the president of France?",
"What is the capital of the United States?",
"When was Barack Obama born?"
"When was Barack Obama born?",
]

View File

@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
from .lookup import LOOKUP
from .lookup import LOOKUP # noqa: F401
from ._adjectives import ADJECTIVES
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs import ADVERBS
@ -13,10 +13,18 @@ from ._verbs_irreg import VERBS_IRREG
from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES, PUNCT_RULES
LEMMA_INDEX = {'adj': ADJECTIVES, 'adv': ADVERBS, 'noun': NOUNS, 'verb': VERBS}
LEMMA_INDEX = {"adj": ADJECTIVES, "adv": ADVERBS, "noun": NOUNS, "verb": VERBS}
LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'adv': ADVERBS_IRREG, 'noun': NOUNS_IRREG,
'verb': VERBS_IRREG}
LEMMA_EXC = {
"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG,
}
LEMMA_RULES = {'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES, 'verb': VERB_RULES,
'punct': PUNCT_RULES}
LEMMA_RULES = {
"adj": ADJECTIVE_RULES,
"noun": NOUN_RULES,
"verb": VERB_RULES,
"punct": PUNCT_RULES,
}
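
These tables are all keyed by coarse part of speech first, so lookups and rules can be restricted per POS. A rough sketch, assuming the v2.x import path for the English lemmatizer data and using entries that appear in the tables above:

```python
from spacy.lang.en.lemmatizer import LEMMA_RULES, LEMMA_EXC

print(LEMMA_RULES["noun"])         # suffix rewrites such as ['ches', 'ch'] and ['ies', 'y']
print(LEMMA_EXC["adj"]["better"])  # ('good', 'well'), per the irregular adjectives table
```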

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
ADJECTIVES = set("""
ADJECTIVES = set(
"""
.22-caliber .22-calibre .38-caliber .38-calibre .45-caliber .45-calibre 0 1 10
10-membered 100 1000 1000th 100th 101 101st 105 105th 10th 11 110 110th 115
115th 11th 12 120 120th 125 125th 12th 13 130 130th 135 135th 13th 14 140 140th
@ -2824,4 +2825,5 @@ zealous zenithal zero zeroth zestful zesty zig-zag zigzag zillion zimbabwean
zionist zippy zodiacal zoftig zoic zolaesque zonal zonary zoological zoonotic
zoophagous zoroastrian zygodactyl zygomatic zygomorphic zygomorphous zygotic
zymoid zymolytic zymotic
""".split())
""".split()
)

View File

@ -48,8 +48,7 @@ ADJECTIVES_IRREG = {
"bendier": ("bendy",),
"bendiest": ("bendy",),
"best": ("good",),
"better": ("good",
"well",),
"better": ("good", "well"),
"bigger": ("big",),
"biggest": ("big",),
"bitchier": ("bitchy",),
@ -289,10 +288,8 @@ ADJECTIVES_IRREG = {
"doughtiest": ("doughty",),
"dowdier": ("dowdy",),
"dowdiest": ("dowdy",),
"dowier": ("dowie",
"dowy",),
"dowiest": ("dowie",
"dowy",),
"dowier": ("dowie", "dowy"),
"dowiest": ("dowie", "dowy"),
"downer": ("downer",),
"downier": ("downy",),
"downiest": ("downy",),
@ -1494,5 +1491,5 @@ ADJECTIVES_IRREG = {
"zanier": ("zany",),
"zaniest": ("zany",),
"zippier": ("zippy",),
"zippiest": ("zippy",)
"zippiest": ("zippy",),
}

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
ADVERBS = set("""
ADVERBS = set(
"""
'tween a.d. a.k.a. a.m. aback abaft abaxially abeam abed abjectly ably
abnormally aboard abominably aborad abortively about above aboveboard abreast
abroad abruptly absently absentmindedly absolutely abstemiously abstractedly
@ -540,4 +541,5 @@ wordlessly worriedly worryingly worse worst worthily worthlessly wrathfully
wretchedly wrong wrongfully wrongheadedly wrongly wryly yea yeah yearly
yearningly yesterday yet yieldingly yon yonder youthfully zealously zestfully
zestily zigzag
""".split())
""".split()
)

View File

@ -9,5 +9,5 @@ ADVERBS_IRREG = {
"farther": ("far",),
"further": ("far",),
"harder": ("hard",),
"hardest": ("hard",)
"hardest": ("hard",),
}

View File

@ -2,12 +2,7 @@
from __future__ import unicode_literals
ADJECTIVE_RULES = [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
NOUN_RULES = [
@ -19,7 +14,7 @@ NOUN_RULES = [
["ches", "ch"],
["shes", "sh"],
["men", "man"],
["ies", "y"]
["ies", "y"],
]
@ -31,13 +26,8 @@ VERB_RULES = [
["ed", "e"],
["ed", ""],
["ing", "e"],
["ing", ""]
["ing", ""],
]
PUNCT_RULES = [
["", "\""],
["", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
PUNCT_RULES = [["", '"'], ["", '"'], ["\u2018", "'"], ["\u2019", "'"]]

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
NOUNS = set("""
NOUNS = set(
"""
'hood .22 0 1 1-dodecanol 1-hitter 10 100 1000 10000 100000 1000000 1000000000
1000000000000 11 11-plus 12 120 13 14 144 15 1530s 16 17 1728 1750s 1760s 1770s
1780s 1790s 18 1820s 1830s 1840s 1850s 1860s 1870s 1880s 1890s 19 1900s 1920s
@ -7110,4 +7111,5 @@ zurvanism zweig zwieback zwingli zworykin zydeco zygnema zygnemales
zygnemataceae zygnematales zygocactus zygoma zygomatic zygomycetes zygomycota
zygomycotina zygophyllaceae zygophyllum zygoptera zygospore zygote zygotene
zyloprim zymase zymogen zymology zymolysis zymosis zymurgy zyrian
""".split())
""".split()
)

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
VERBS = set("""
VERBS = set(
"""
aah abacinate abandon abase abash abate abbreviate abdicate abduce abduct
aberrate abet abhor abide abjure ablactate ablate abnegate abolish abominate
abort abound about-face abrade abrase abreact abridge abrogate abscise abscond
@ -912,4 +913,5 @@ wreck wrench wrest wrestle wrick wriggle wring wrinkle write writhe wrong x-ray
xerox yacht yack yak yammer yank yap yarn yarn-dye yaup yaw yawl yawn yawp yearn
yell yellow yelp yen yield yip yodel yoke yowl zap zero zest zigzag zinc zip
zipper zone zoom
""".split())
""".split()
)

View File

@ -4,22 +4,54 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
'gajillion', 'bazillion']
_num_words = [
"zero",
"one",
"two",
"three",
"four",
"five",
"six",
"seven",
"eight",
"nine",
"ten",
"eleven",
"twelve",
"thirteen",
"fourteen",
"fifteen",
"sixteen",
"seventeen",
"eighteen",
"nineteen",
"twenty",
"thirty",
"forty",
"fifty",
"sixty",
"seventy",
"eighty",
"ninety",
"hundred",
"thousand",
"million",
"billion",
"trillion",
"quadrillion",
"gajillion",
"bazillion",
]
def like_num(text):
if text.startswith(('+', '-', '±', '~')):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(',', '').replace('.', '')
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
@ -27,6 +59,4 @@ def like_num(text):
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
LEX_ATTRS = {LIKE_NUM: like_num}

View File

@ -6,66 +6,321 @@ from ...symbols import LEMMA, PRON_LEMMA
MORPH_RULES = {
"PRP": {
"I": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"me": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"you": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two"},
"he": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"him": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"she": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"her": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"it": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"we": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"us": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"they": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"them": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"his": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hers": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"its": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
"ours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"yours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"theirs": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"myself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"yourself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"himself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Masc", "Reflex": "Yes"},
"herself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Fem", "Reflex": "Yes"},
"itself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Neut", "Reflex": "Yes"},
"themself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"ourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"},
"yourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"themselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"}
"I": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Case": "Nom",
},
"me": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Case": "Acc",
},
"you": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two"},
"he": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Masc",
"Case": "Nom",
},
"him": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Masc",
"Case": "Acc",
},
"she": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Fem",
"Case": "Nom",
},
"her": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Fem",
"Case": "Acc",
},
"it": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Neut",
},
"we": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Case": "Nom",
},
"us": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Case": "Acc",
},
"they": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Case": "Nom",
},
"them": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Case": "Acc",
},
"mine": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Poss": "Yes",
"Reflex": "Yes",
},
"his": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Masc",
"Poss": "Yes",
"Reflex": "Yes",
},
"hers": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Fem",
"Poss": "Yes",
"Reflex": "Yes",
},
"its": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Gender": "Neut",
"Poss": "Yes",
"Reflex": "Yes",
},
"ours": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Poss": "Yes",
"Reflex": "Yes",
},
"yours": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Number": "Plur",
"Poss": "Yes",
"Reflex": "Yes",
},
"theirs": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Poss": "Yes",
"Reflex": "Yes",
},
"myself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Sing",
"Case": "Acc",
"Reflex": "Yes",
},
"yourself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Case": "Acc",
"Reflex": "Yes",
},
"himself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Gender": "Masc",
"Reflex": "Yes",
},
"herself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Gender": "Fem",
"Reflex": "Yes",
},
"itself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Gender": "Neut",
"Reflex": "Yes",
},
"themself": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Sing",
"Case": "Acc",
"Reflex": "Yes",
},
"ourselves": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "One",
"Number": "Plur",
"Case": "Acc",
"Reflex": "Yes",
},
"yourselves": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Two",
"Case": "Acc",
"Reflex": "Yes",
},
"themselves": {
LEMMA: PRON_LEMMA,
"PronType": "Prs",
"Person": "Three",
"Number": "Plur",
"Case": "Acc",
"Reflex": "Yes",
},
},
"PRP$": {
"my": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
"your": {LEMMA: PRON_LEMMA, "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
"his": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
"her": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
"its": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
"our": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
"their": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
"my": {
LEMMA: PRON_LEMMA,
"Person": "One",
"Number": "Sing",
"PronType": "Prs",
"Poss": "Yes",
},
"your": {LEMMA: PRON_LEMMA, "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
"his": {
LEMMA: PRON_LEMMA,
"Person": "Three",
"Number": "Sing",
"Gender": "Masc",
"PronType": "Prs",
"Poss": "Yes",
},
"her": {
LEMMA: PRON_LEMMA,
"Person": "Three",
"Number": "Sing",
"Gender": "Fem",
"PronType": "Prs",
"Poss": "Yes",
},
"its": {
LEMMA: PRON_LEMMA,
"Person": "Three",
"Number": "Sing",
"Gender": "Neut",
"PronType": "Prs",
"Poss": "Yes",
},
"our": {
LEMMA: PRON_LEMMA,
"Person": "One",
"Number": "Plur",
"PronType": "Prs",
"Poss": "Yes",
},
"their": {
LEMMA: PRON_LEMMA,
"Person": "Three",
"Number": "Plur",
"PronType": "Prs",
"Poss": "Yes",
},
},
"VBZ": {
"am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
"'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
"am": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "One",
"Tense": "Pres",
"Mood": "Ind",
},
"are": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "Two",
"Tense": "Pres",
"Mood": "Ind",
},
"is": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "Three",
"Tense": "Pres",
"Mood": "Ind",
},
"'re": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "Two",
"Tense": "Pres",
"Mood": "Ind",
},
"'s": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "Three",
"Tense": "Pres",
"Mood": "Ind",
},
},
"VBP": {
"are": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"},
"'re": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"},
"am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"are": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"},
"'re": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"},
"am": {
LEMMA: "be",
"VerbForm": "Fin",
"Person": "One",
"Tense": "Pres",
"Mood": "Ind",
},
},
"VBD": {
"was": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"were": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
}
"was": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"were": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"},
},
}
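
A rough sketch of how these rules are read: the fine-grained tag plus the token text select a feature dict that is applied to the token's morphology. The import path assumes this version of the code:

```python
from spacy.lang.en.morph_rules import MORPH_RULES

print(MORPH_RULES["PRP"]["me"])   # includes "Number": "Sing" and "Case": "Acc"
print(MORPH_RULES["VBD"]["was"])  # lemma "be", "Tense": "Past", "Number": "Sing"
```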

View File

@ -12,7 +12,6 @@ _exc = {
"plz": "please",
"pls": "please",
"thx": "thanks",
# US vs. UK spelling
"accessorise": "accessorize",
"accessorised": "accessorized",
@ -690,7 +689,7 @@ _exc = {
"globalising": "globalizing",
"glueing ": "gluing ",
"goin": "going",
"goin'":"going",
"goin'": "going",
"goitre": "goiter",
"goitres": "goiters",
"gonorrhoea": "gonorrhea",
@ -1758,7 +1757,7 @@ _exc = {
"yoghourt": "yogurt",
"yoghourts": "yogurts",
"yoghurt": "yogurt",
"yoghurts": "yogurts"
"yoghurts": "yogurts",
}

View File

@ -3,8 +3,8 @@ from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
STOP_WORDS = set(
"""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at
@ -68,4 +68,5 @@ whither who whoever whole whom whose why will with within without would
yet you your yours yourself yourselves
'd 'll 'm 're 's 've
""".split())
""".split()
)

View File

@ -8,12 +8,21 @@ def noun_chunks(obj):
"""
Detect base noun phrases from a dependency parse. Works on both Doc and Span.
"""
labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj', 'dative', 'appos',
'attr', 'ROOT']
doc = obj.doc # Ensure works on both Doc and Span.
labels = [
"nsubj",
"dobj",
"nsubjpass",
"pcomp",
"pobj",
"dative",
"appos",
"attr",
"ROOT",
]
doc = obj.doc # Ensure works on both Doc and Span.
np_deps = [doc.vocab.strings.add(label) for label in labels]
conj = doc.vocab.strings.add('conj')
np_label = doc.vocab.strings.add('NP')
conj = doc.vocab.strings.add("conj")
np_label = doc.vocab.strings.add("NP")
seen = set()
for i, word in enumerate(obj):
if word.pos not in (NOUN, PROPN, PRON):
@ -24,8 +33,8 @@ def noun_chunks(obj):
if word.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i+1))
yield word.left_edge.i, word.i+1, np_label
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label
elif word.dep == conj:
head = word.head
while head.dep == conj and head.head.i < head.i:
@ -34,10 +43,8 @@ def noun_chunks(obj):
if head.dep in np_deps:
if any(w.i in seen for w in word.subtree):
continue
seen.update(j for j in range(word.left_edge.i, word.i+1))
yield word.left_edge.i, word.i+1, np_label
seen.update(j for j in range(word.left_edge.i, word.i + 1))
yield word.left_edge.i, word.i + 1, np_label
SYNTAX_ITERATORS = {
'noun_chunks': noun_chunks
}
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}

View File

@ -6,61 +6,67 @@ from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
TAG_MAP = {
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"$": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CCONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SP": {POS: SPACE},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
"GW": {POS: X},
"XX": {POS: X},
"BES": {POS: VERB},
"HVS": {POS: VERB},
"_SP": {POS: SPACE},
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"$": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CCONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SP": {POS: SPACE},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {
POS: VERB,
"VerbForm": "fin",
"Tense": "pres",
"Number": "sing",
"Person": 3,
},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
"GW": {POS: X},
"XX": {POS: X},
"BES": {POS: VERB},
"HVS": {POS: VERB},
"_SP": {POS: SPACE},
}
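
A rough sketch of how the tag map is consumed: a fine-grained tag string maps to the coarse POS plus extra features. The import path assumes this version of the code:

```python
from spacy.lang.en.tag_map import TAG_MAP
from spacy.symbols import POS, VERB

print(TAG_MAP["VBZ"][POS] == VERB)  # True: VBZ maps to the VERB part of speech
print(TAG_MAP["VBZ"]["Tense"])      # "pres"
```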

View File

@ -5,103 +5,143 @@ from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
_exc = {}
_exclude = ["Ill", "ill", "Its", "its", "Hell", "hell", "Shell", "shell",
"Shed", "shed", "were", "Were", "Well", "well", "Whore", "whore"]
_exclude = [
"Ill",
"ill",
"Its",
"its",
"Hell",
"hell",
"Shell",
"shell",
"Shed",
"shed",
"were",
"Were",
"Well",
"well",
"Whore",
"whore",
]
# Pronouns
for pron in ["i"]:
for orth in [pron, pron.title()]:
_exc[orth + "'m"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'m", LEMMA: "be", NORM: "am", TAG: "VBP", "tenspect": 1, "number": 1}]
{
ORTH: "'m",
LEMMA: "be",
NORM: "am",
TAG: "VBP",
"tenspect": 1,
"number": 1,
},
]
_exc[orth + "m"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "m", LEMMA: "be", TAG: "VBP", "tenspect": 1, "number": 1 }]
{ORTH: "m", LEMMA: "be", TAG: "VBP", "tenspect": 1, "number": 1},
]
_exc[orth + "'ma"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'m", LEMMA: "be", NORM: "am"},
{ORTH: "a", LEMMA: "going to", NORM: "gonna"}]
{ORTH: "a", LEMMA: "going to", NORM: "gonna"},
]
_exc[orth + "ma"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "m", LEMMA: "be", NORM: "am"},
{ORTH: "a", LEMMA: "going to", NORM: "gonna"}]
{ORTH: "a", LEMMA: "going to", NORM: "gonna"},
]
for pron in ["i", "you", "he", "she", "it", "we", "they"]:
for orth in [pron, pron.title()]:
_exc[orth + "'ll"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}]
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"},
]
_exc[orth + "ll"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}]
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"},
]
_exc[orth + "'ll've"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "llve"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "'d"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"}]
{ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
]
_exc[orth + "d"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"}]
{ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
]
_exc[orth + "'d've"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "dve"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
for pron in ["i", "you", "we", "they"]:
for orth in [pron, pron.title()]:
_exc[orth + "'ve"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "ve"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
for pron in ["you", "we", "they"]:
for orth in [pron, pron.title()]:
_exc[orth + "'re"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'re", LEMMA: "be", NORM: "are"}]
{ORTH: "'re", LEMMA: "be", NORM: "are"},
]
_exc[orth + "re"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "re", LEMMA: "be", NORM: "are", TAG: "VBZ"}]
{ORTH: "re", LEMMA: "be", NORM: "are", TAG: "VBZ"},
]
for pron in ["he", "she", "it"]:
for orth in [pron, pron.title()]:
_exc[orth + "'s"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "'s", NORM: "'s"}]
{ORTH: "'s", NORM: "'s"},
]
_exc[orth + "s"] = [
{ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"},
{ORTH: "s"}]
{ORTH: "s"},
]
# W-words, relative pronouns, prepositions etc.
@ -110,63 +150,71 @@ for word in ["who", "what", "when", "where", "why", "how", "there", "that"]:
for orth in [word, word.title()]:
_exc[orth + "'s"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'s", NORM: "'s"}]
{ORTH: "'s", NORM: "'s"},
]
_exc[orth + "s"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "s"}]
_exc[orth + "s"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "s"}]
_exc[orth + "'ll"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}]
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"},
]
_exc[orth + "ll"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}]
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"},
]
_exc[orth + "'ll've"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "llve"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "'re"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'re", LEMMA: "be", NORM: "are"}]
{ORTH: "'re", LEMMA: "be", NORM: "are"},
]
_exc[orth + "re"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "re", LEMMA: "be", NORM: "are"}]
{ORTH: "re", LEMMA: "be", NORM: "are"},
]
_exc[orth + "'ve"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", TAG: "VB"},
]
_exc[orth + "ve"] = [
{ORTH: orth, LEMMA: word},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "'d"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'d", NORM: "'d"}]
{ORTH: "'d", NORM: "'d"},
]
_exc[orth + "d"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "d"}]
_exc[orth + "d"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "d"}]
_exc[orth + "'d've"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[orth + "dve"] = [
{ORTH: orth, LEMMA: word, NORM: word},
{ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
# Verbs
@ -186,27 +234,32 @@ for verb_data in [
{ORTH: "sha", LEMMA: "shall", NORM: "shall", TAG: "MD"},
{ORTH: "should", NORM: "should", TAG: "MD"},
{ORTH: "wo", LEMMA: "will", NORM: "will", TAG: "MD"},
{ORTH: "would", NORM: "would", TAG: "MD"}]:
{ORTH: "would", NORM: "would", TAG: "MD"},
]:
verb_data_tc = dict(verb_data)
verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
for data in [verb_data, verb_data_tc]:
_exc[data[ORTH] + "n't"] = [
dict(data),
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"},
]
_exc[data[ORTH] + "nt"] = [
dict(data),
{ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"}]
{ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"},
]
_exc[data[ORTH] + "n't've"] = [
dict(data),
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
_exc[data[ORTH] + "ntve"] = [
dict(data),
{ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}]
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
]
for verb_data in [
@ -214,17 +267,14 @@ for verb_data in [
{ORTH: "might", NORM: "might", TAG: "MD"},
{ORTH: "must", NORM: "must", TAG: "MD"},
{ORTH: "should", NORM: "should", TAG: "MD"},
{ORTH: "would", NORM: "would", TAG: "MD"}]:
{ORTH: "would", NORM: "would", TAG: "MD"},
]:
verb_data_tc = dict(verb_data)
verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
for data in [verb_data, verb_data_tc]:
_exc[data[ORTH] + "'ve"] = [
dict(data),
{ORTH: "'ve", LEMMA: "have", TAG: "VB"}]
_exc[data[ORTH] + "'ve"] = [dict(data), {ORTH: "'ve", LEMMA: "have", TAG: "VB"}]
_exc[data[ORTH] + "ve"] = [
dict(data),
{ORTH: "ve", LEMMA: "have", TAG: "VB"}]
_exc[data[ORTH] + "ve"] = [dict(data), {ORTH: "ve", LEMMA: "have", TAG: "VB"}]
for verb_data in [
@ -235,17 +285,20 @@ for verb_data in [
{ORTH: "were", LEMMA: "be", NORM: "were"},
{ORTH: "have", NORM: "have"},
{ORTH: "has", LEMMA: "have", NORM: "has"},
{ORTH: "dare", NORM: "dare"}]:
{ORTH: "dare", NORM: "dare"},
]:
verb_data_tc = dict(verb_data)
verb_data_tc[ORTH] = verb_data_tc[ORTH].title()
for data in [verb_data, verb_data_tc]:
_exc[data[ORTH] + "n't"] = [
dict(data),
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"},
]
_exc[data[ORTH] + "nt"] = [
dict(data),
{ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"}]
{ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"},
]
# Other contractions with trailing apostrophe
@ -256,7 +309,8 @@ for exc_data in [
{ORTH: "nothin", LEMMA: "nothing", NORM: "nothing"},
{ORTH: "nuthin", LEMMA: "nothing", NORM: "nothing"},
{ORTH: "ol", LEMMA: "old", NORM: "old"},
{ORTH: "somethin", LEMMA: "something", NORM: "something"}]:
{ORTH: "somethin", LEMMA: "something", NORM: "something"},
]:
exc_data_tc = dict(exc_data)
exc_data_tc[ORTH] = exc_data_tc[ORTH].title()
for data in [exc_data, exc_data_tc]:
@ -272,7 +326,8 @@ for exc_data in [
{ORTH: "cause", LEMMA: "because", NORM: "because"},
{ORTH: "em", LEMMA: PRON_LEMMA, NORM: "them"},
{ORTH: "ll", LEMMA: "will", NORM: "will"},
{ORTH: "nuff", LEMMA: "enough", NORM: "enough"}]:
{ORTH: "nuff", LEMMA: "enough", NORM: "enough"},
]:
exc_data_apos = dict(exc_data)
exc_data_apos[ORTH] = "'" + exc_data_apos[ORTH]
for data in [exc_data, exc_data_apos]:
@ -285,81 +340,69 @@ for h in range(1, 12 + 1):
for period in ["a.m.", "am"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "a.m.", NORM: "a.m."}]
{ORTH: period, LEMMA: "a.m.", NORM: "a.m."},
]
for period in ["p.m.", "pm"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "p.m.", NORM: "p.m."}]
{ORTH: period, LEMMA: "p.m.", NORM: "p.m."},
]
# Rest
_other_exc = {
"y'all": [
{ORTH: "y'", LEMMA: PRON_LEMMA, NORM: "you"},
{ORTH: "all"}],
"yall": [
{ORTH: "y", LEMMA: PRON_LEMMA, NORM: "you"},
{ORTH: "all"}],
"y'all": [{ORTH: "y'", LEMMA: PRON_LEMMA, NORM: "you"}, {ORTH: "all"}],
"yall": [{ORTH: "y", LEMMA: PRON_LEMMA, NORM: "you"}, {ORTH: "all"}],
"how'd'y": [
{ORTH: "how", LEMMA: "how"},
{ORTH: "'d", LEMMA: "do"},
{ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"}],
{ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"},
],
"How'd'y": [
{ORTH: "How", LEMMA: "how", NORM: "how"},
{ORTH: "'d", LEMMA: "do"},
{ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"}],
{ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"},
],
"not've": [
{ORTH: "not", LEMMA: "not", TAG: "RB"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}],
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
],
"notve": [
{ORTH: "not", LEMMA: "not", TAG: "RB"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}],
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
],
"Not've": [
{ORTH: "Not", LEMMA: "not", NORM: "not", TAG: "RB"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}],
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"},
],
"Notve": [
{ORTH: "Not", LEMMA: "not", NORM: "not", TAG: "RB"},
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}],
{ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"},
],
"cannot": [
{ORTH: "can", LEMMA: "can", TAG: "MD"},
{ORTH: "not", LEMMA: "not", TAG: "RB"}],
{ORTH: "not", LEMMA: "not", TAG: "RB"},
],
"Cannot": [
{ORTH: "Can", LEMMA: "can", NORM: "can", TAG: "MD"},
{ORTH: "not", LEMMA: "not", TAG: "RB"}],
{ORTH: "not", LEMMA: "not", TAG: "RB"},
],
"gonna": [
{ORTH: "gon", LEMMA: "go", NORM: "going"},
{ORTH: "na", LEMMA: "to", NORM: "to"}],
{ORTH: "na", LEMMA: "to", NORM: "to"},
],
"Gonna": [
{ORTH: "Gon", LEMMA: "go", NORM: "going"},
{ORTH: "na", LEMMA: "to", NORM: "to"}],
"gotta": [
{ORTH: "got"},
{ORTH: "ta", LEMMA: "to", NORM: "to"}],
"Gotta": [
{ORTH: "Got", NORM: "got"},
{ORTH: "ta", LEMMA: "to", NORM: "to"}],
"let's": [
{ORTH: "let"},
{ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"}],
{ORTH: "na", LEMMA: "to", NORM: "to"},
],
"gotta": [{ORTH: "got"}, {ORTH: "ta", LEMMA: "to", NORM: "to"}],
"Gotta": [{ORTH: "Got", NORM: "got"}, {ORTH: "ta", LEMMA: "to", NORM: "to"}],
"let's": [{ORTH: "let"}, {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"}],
"Let's": [
{ORTH: "Let", LEMMA: "let", NORM: "let"},
{ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"}]
{ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"},
],
}
_exc.update(_other_exc)
@ -402,8 +445,6 @@ for exc_data in [
{ORTH: "Goin'", LEMMA: "go", NORM: "going"},
{ORTH: "goin", LEMMA: "go", NORM: "going"},
{ORTH: "Goin", LEMMA: "go", NORM: "going"},
{ORTH: "Mt.", LEMMA: "Mount", NORM: "Mount"},
{ORTH: "Ak.", LEMMA: "Alaska", NORM: "Alaska"},
{ORTH: "Ala.", LEMMA: "Alabama", NORM: "Alabama"},
@ -456,15 +497,47 @@ for exc_data in [
{ORTH: "Tenn.", LEMMA: "Tennessee", NORM: "Tennessee"},
{ORTH: "Va.", LEMMA: "Virginia", NORM: "Virginia"},
{ORTH: "Wash.", LEMMA: "Washington", NORM: "Washington"},
{ORTH: "Wis.", LEMMA: "Wisconsin", NORM: "Wisconsin"}]:
{ORTH: "Wis.", LEMMA: "Wisconsin", NORM: "Wisconsin"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"'d", "a.m.", "Adm.", "Bros.", "co.", "Co.", "Corp.", "D.C.", "Dr.", "e.g.",
"E.g.", "E.G.", "Gen.", "Gov.", "i.e.", "I.e.", "I.E.", "Inc.", "Jr.",
"Ltd.", "Md.", "Messrs.", "Mo.", "Mont.", "Mr.", "Mrs.", "Ms.", "p.m.",
"Ph.D.", "Rep.", "Rev.", "Sen.", "St.", "vs."]:
"'d",
"a.m.",
"Adm.",
"Bros.",
"co.",
"Co.",
"Corp.",
"D.C.",
"Dr.",
"e.g.",
"E.g.",
"E.G.",
"Gen.",
"Gov.",
"i.e.",
"I.e.",
"I.E.",
"Inc.",
"Jr.",
"Ltd.",
"Md.",
"Messrs.",
"Mo.",
"Mont.",
"Mr.",
"Mrs.",
"Ms.",
"p.m.",
"Ph.D.",
"Rep.",
"Rev.",
"Sen.",
"St.",
"vs.",
]:
_exc[orth] = [{ORTH: orth}]
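The `_exc` table above maps a single surface string to the sequence of sub-tokens it should split into, each carrying its own `LEMMA`, `NORM` and `TAG`. A minimal sketch of the same mechanism through the public special-case API, assuming a plain spaCy v2.x install; the contraction `whatcha` and its split are invented for illustration and are not part of this diff:

```python
from spacy.lang.en import English
from spacy.symbols import ORTH, LEMMA, NORM

nlp = English()  # tokenizer-only pipeline, built from the language data above

# Same dict format as the _exc entries: one dict per sub-token
nlp.tokenizer.add_special_case(
    "whatcha", [{ORTH: "what"}, {ORTH: "cha", LEMMA: "you", NORM: "you"}]
)

print([t.text for t in nlp("whatcha gonna do?")])
# roughly: ['what', 'cha', 'gon', 'na', 'do', '?'] — "gonna" is covered by the
# exceptions defined in this file
```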

View File

@ -30,8 +30,9 @@ for name, tag, patterns in [
("Facebook", "ORG", [[{LOWER: "facebook"}]]),
("Blizzard", "ORG", [[{LOWER: "blizzard"}]]),
("Ubuntu", "ORG", [[{LOWER: "ubuntu"}]]),
("YouTube", "PRODUCT", [[{LOWER: "youtube"}]]),]:
ENTITY_RULES.append({ENT_ID: name, 'attrs': {ENT_TYPE: tag}, 'patterns': patterns})
("YouTube", "PRODUCT", [[{LOWER: "youtube"}]]),
]:
ENTITY_RULES.append({ENT_ID: name, "attrs": {ENT_TYPE: tag}, "patterns": patterns})
FALSE_POSITIVES = [
@ -46,5 +47,5 @@ FALSE_POSITIVES = [
[{ORTH: "Yay"}],
[{ORTH: "Ahh"}],
[{ORTH: "Yea"}],
[{ORTH: "Bah"}]
[{ORTH: "Bah"}],
]
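The patterns in `ENTITY_RULES` use spaCy's token-pattern format: `[[{LOWER: "youtube"}]]` is a list of patterns, each a list of per-token attribute dicts. This file is consumed internally, but the same format drives the rule-based `Matcher` — a hedged sketch, assuming spaCy v2.x:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# {LOWER: "youtube"} above corresponds to {"LOWER": "youtube"} in plain-dict form
matcher.add("YouTube", None, [{"LOWER": "youtube"}])

doc = nlp("I watched it on YouTube yesterday.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```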

View File

@ -16,8 +16,10 @@ from ...util import update_exc, add_lookups
class SpanishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'es'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
lex_attr_getters[LANG] = lambda text: "es"
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tag_map = TAG_MAP
stop_words = STOP_WORDS
@ -26,8 +28,8 @@ class SpanishDefaults(Language.Defaults):
class Spanish(Language):
lang = 'es'
lang = "es"
Defaults = SpanishDefaults
__all__ = ['Spanish']
__all__ = ["Spanish"]
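`SpanishDefaults` is what a bare `Spanish()` pipeline is built from: the lexical attribute getters, tokenizer exceptions, tag map and stop words wired up above, with no statistical model involved. A small sketch, assuming spaCy v2.x is installed:

```python
from spacy.lang.es import Spanish

nlp = Spanish()  # no model download needed; tokenization + lexical attributes only
doc = nlp("El pingüino incuba en su nido.")
print([(t.text, t.norm_, t.is_stop) for t in doc])
```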

View File

@ -18,5 +18,5 @@ sentences = [
"El gato come pescado",
"Veo al hombre con el telescopio",
"La araña come moscas",
"El pingüino incuba en su nido"
"El pingüino incuba en su nido",
]

View File

@ -2,7 +2,8 @@
from __future__ import unicode_literals
STOP_WORDS = set("""
STOP_WORDS = set(
"""
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas
@ -81,4 +82,5 @@ va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero
vez vosotras vosotros voy vuestra vuestras vuestro vuestros
ya yo
""".split())
""".split()
)
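`STOP_WORDS` is just a `set` built by splitting the block string on whitespace, so the reformatted call changes nothing at runtime. A trivial membership check, assuming the spaCy v2.x module path:

```python
from spacy.lang.es.stop_words import STOP_WORDS

print("además" in STOP_WORDS)  # True — listed in the block above
print(len(STOP_WORDS) > 0)
```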

View File

@ -8,18 +8,20 @@ def noun_chunks(obj):
doc = obj.doc
if not len(doc):
return
np_label = doc.vocab.strings.add('NP')
left_labels = ['det', 'fixed', 'neg'] #['nunmod', 'det', 'appos', 'fixed']
right_labels = ['flat', 'fixed', 'compound', 'neg']
stop_labels = ['punct']
np_label = doc.vocab.strings.add("NP")
left_labels = ["det", "fixed", "neg"] # ['nunmod', 'det', 'appos', 'fixed']
right_labels = ["flat", "fixed", "compound", "neg"]
stop_labels = ["punct"]
np_left_deps = [doc.vocab.strings.add(label) for label in left_labels]
np_right_deps = [doc.vocab.strings.add(label) for label in right_labels]
stop_deps = [doc.vocab.strings.add(label) for label in stop_labels]
token = doc[0]
while token and token.i < len(doc):
if token.pos in [PROPN, NOUN, PRON]:
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
yield left.i, right.i+1, np_label
left, right = noun_bounds(
doc, token, np_left_deps, np_right_deps, stop_deps
)
yield left.i, right.i + 1, np_label
token = right
token = next_token(token)
@ -31,7 +33,7 @@ def is_verb_token(token):
def next_token(token):
try:
return token.nbor()
except:
except IndexError:
return None
@ -42,16 +44,20 @@ def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps):
left_bound = token
right_bound = root
for token in root.rights:
if (token.dep in np_right_deps):
left, right = noun_bounds(doc, token, np_left_deps, np_right_deps, stop_deps)
if list(filter(lambda t: is_verb_token(t) or t.dep in stop_deps,
doc[left_bound.i: right.i])):
if token.dep in np_right_deps:
left, right = noun_bounds(
doc, token, np_left_deps, np_right_deps, stop_deps
)
if list(
filter(
lambda t: is_verb_token(t) or t.dep in stop_deps,
doc[left_bound.i : right.i],
)
):
break
else:
right_bound = right
return left_bound, right_bound
SYNTAX_ITERATORS = {
'noun_chunks': noun_chunks
}
SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
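`noun_chunks` walks the dependency parse (via the left/right/stop label sets above), so it only yields spans on a parsed `Doc`. A hedged sketch — the model name `es_core_news_sm` is an assumption and not part of this diff; any parsed Spanish pipeline would do:

```python
import spacy

nlp = spacy.load("es_core_news_sm")  # assumed model; provides the parser
doc = nlp("La araña come moscas en el jardín.")
for chunk in doc.noun_chunks:  # backed by the noun_chunks iterator above
    print(chunk.text, chunk.root.dep_)
```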

View File

@ -4,7 +4,7 @@ from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX, CONJ
# fmt: off
TAG_MAP = {
"ADJ___": {"morph": "_", POS: ADJ},
"ADJ__AdpType=Prep": {"morph": "AdpType=Prep", POS: ADJ},
@ -29,7 +29,7 @@ TAG_MAP = {
"ADP__AdpType=Preppron|Gender=Fem|Number=Sing": {"morph": "AdpType=Preppron|Gender=Fem|Number=Sing", POS: ADP},
"ADP__AdpType=Preppron|Gender=Masc|Number=Plur": {"morph": "AdpType=Preppron|Gender=Masc|Number=Plur", POS: ADP},
"ADP__AdpType=Preppron|Gender=Masc|Number=Sing": {"morph": "AdpType=Preppron|Gender=Masc|Number=Sing", POS: ADP},
"ADP": { POS: ADP},
"ADP": {POS: ADP},
"ADV___": {"morph": "_", POS: ADV},
"ADV__AdpType=Prep": {"morph": "AdpType=Prep", POS: ADV},
"ADV__AdpType=Preppron|Gender=Masc|Number=Sing": {"morph": "AdpType=Preppron|Gender=Masc|Number=Sing", POS: ADV},
@ -135,7 +135,7 @@ TAG_MAP = {
"DET__Number=Sing|PronType=Ind": {"morph": "Number=Sing|PronType=Ind", POS: DET},
"DET__PronType=Int": {"morph": "PronType=Int", POS: DET},
"DET__PronType=Rel": {"morph": "PronType=Rel", POS: DET},
"DET": { POS: DET},
"DET": {POS: DET},
"INTJ___": {"morph": "_", POS: INTJ},
"NOUN___": {"morph": "_", POS: NOUN},
"NOUN__AdvType=Tim": {"morph": "AdvType=Tim", POS: NOUN},
@ -307,3 +307,4 @@ TAG_MAP = {
"X___": {"morph": "_", POS: X},
"_SP": {"morph": "_", POS: SPACE},
}
# fmt: on
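The new `# fmt: off` / `# fmt: on` pair tells `black` to leave everything between the markers untouched, which is what keeps the wide `TAG_MAP` entries on single lines. A generic illustration of the pattern (all names below are made up):

```python
# fmt: off
LOOKUP_TABLE = {
    "KEY_A":  {"morph": "Case=Nom", "pos": "NOUN"},   # deliberately aligned;
    "KEY_BB": {"morph": "Case=Acc", "pos": "NOUN"},   # black will not reflow this
}
# fmt: on


def lookup(tag):
    # Outside the off/on block, black formats code as usual.
    return LOOKUP_TABLE.get(tag)
```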

View File

@ -1,17 +1,12 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET, PRON_LEMMA
from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA
_exc = {
"pal": [
{ORTH: "pa", LEMMA: "para"},
{ORTH: "l", LEMMA: "el", NORM: "el"}],
"pala": [
{ORTH: "pa", LEMMA: "para"},
{ORTH: "la", LEMMA: "la", NORM: "la"}]
"pal": [{ORTH: "pa", LEMMA: "para"}, {ORTH: "l", LEMMA: "el", NORM: "el"}],
"pala": [{ORTH: "pa", LEMMA: "para"}, {ORTH: "la", LEMMA: "la", NORM: "la"}],
}
@ -24,32 +19,50 @@ for exc_data in [
{ORTH: "Ud.", LEMMA: PRON_LEMMA, NORM: "usted"},
{ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"},
{ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}]:
{ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"},
]:
_exc[exc_data[ORTH]] = [exc_data]
# Times
_exc["12m."] = [
{ORTH: "12"},
{ORTH: "m.", LEMMA: "p.m."}]
_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}]
for h in range(1, 12 + 1):
for period in ["a.m.", "am"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "a.m."}]
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}]
for period in ["p.m.", "pm"]:
_exc["%d%s" % (h, period)] = [
{ORTH: "%d" % h},
{ORTH: period, LEMMA: "p.m."}]
_exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}]
for orth in [
"a.C.", "a.J.C.", "apdo.", "Av.", "Avda.", "Cía.", "etc.", "Gob.", "Gral.",
"Ing.", "J.C.", "Lic.", "m.n.", "no.", "núm.", "P.D.", "Prof.", "Profa.",
"q.e.p.d.", "S.A.", "S.L.", "s.s.s.", "Sr.", "Sra.", "Srta."]:
"a.C.",
"a.J.C.",
"apdo.",
"Av.",
"Avda.",
"Cía.",
"etc.",
"Gob.",
"Gral.",
"Ing.",
"J.C.",
"Lic.",
"m.n.",
"no.",
"núm.",
"P.D.",
"Prof.",
"Profa.",
"q.e.p.d.",
"S.A.",
"S.L.",
"s.s.s.",
"Sr.",
"Sra.",
"Srta.",
]:
_exc[orth] = [{ORTH: orth}]

View File

@ -12,11 +12,14 @@ from .tag_map import TAG_MAP
from .punctuation import TOKENIZER_SUFFIXES
from .lemmatizer import LEMMA_RULES, LEMMA_INDEX, LEMMA_EXC
class PersianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
lex_attr_getters[LANG] = lambda text: 'fa'
lex_attr_getters[NORM] = add_lookups(
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
)
lex_attr_getters[LANG] = lambda text: "fa"
tokenizer_exceptions = update_exc(TOKENIZER_EXCEPTIONS)
lemma_rules = LEMMA_RULES
lemma_index = LEMMA_INDEX
@ -27,8 +30,8 @@ class PersianDefaults(Language.Defaults):
class Persian(Language):
lang = 'fa'
lang = "fa"
Defaults = PersianDefaults
__all__ = ['Persian']
__all__ = ["Persian"]

View File

@ -12,8 +12,8 @@ Example sentences to test spaCy and its language models.
sentences = [
"این یک جمله نمونه می باشد.",
"قرار ما، امروز ساعت ۲:۳۰ بعدازظهر هست!"
"قرار ما، امروز ساعت ۲:۳۰ بعدازظهر هست!",
"دیروز علی به من ۲۰۰۰.۱﷼ پول نقد داد.",
"چطور می‌توان از تهران به کاشان رفت؟"
"حدود ۸۰٪ هوا از نیتروژن تشکیل شده است."
"چطور می‌توان از تهران به کاشان رفت؟",
"حدود ۸۰٪ هوا از نیتروژن تشکیل شده است.",
]

View File

@ -10,23 +10,13 @@ from ._verbs_exc import VERBS_EXC
from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES, PUNCT_RULES
LEMMA_INDEX = {
'adj': ADJECTIVES,
'noun': NOUNS,
'verb': VERBS
}
LEMMA_INDEX = {"adj": ADJECTIVES, "noun": NOUNS, "verb": VERBS}
LEMMA_RULES = {
'adj': ADJECTIVE_RULES,
'noun': NOUN_RULES,
'verb': VERB_RULES,
'punct': PUNCT_RULES
"adj": ADJECTIVE_RULES,
"noun": NOUN_RULES,
"verb": VERB_RULES,
"punct": PUNCT_RULES,
}
LEMMA_EXC = {
'adj': ADJECTIVES_EXC,
'noun': NOUNS_EXC,
'verb': VERBS_EXC
}
LEMMA_EXC = {"adj": ADJECTIVES_EXC, "noun": NOUNS_EXC, "verb": VERBS_EXC}
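`LEMMA_INDEX`, `LEMMA_EXC` and `LEMMA_RULES` are the index/exceptions/rules trio that spaCy v2's rule-based lemmatizer consumes. A hedged sketch with toy tables (the data below is invented for illustration, not the Persian tables from this file):

```python
from spacy.lemmatizer import Lemmatizer

index = {"noun": {"book"}}
exceptions = {"noun": {"children": ("child",)}}
rules = {"noun": [["s", ""]]}

lemmatizer = Lemmatizer(index, exceptions, rules)
print(lemmatizer("books", "noun"))     # ['book']  — rule strips "s", hit in index
print(lemmatizer("children", "noun"))  # ['child'] — exception table takes priority
```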

Some files were not shown because too many files have changed in this diff.