fixed tag_map.py merge conflict

Jeanne Choo 2019-04-04 14:18:27 +08:00
parent eba4f77526
commit 80e15af76c
50 changed files with 262625 additions and 116 deletions

106
.github/contributors/ivigamberdiev.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Igor Igamberdiev |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | April 2, 2019 |
| GitHub username | ivigamberdiev |
| Website (optional) | |

106
.github/contributors/nlptown.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Yves Peirsman |
| Company name (if applicable) | NLP Town (Island Constraints BVBA) |
| Title or role (if applicable) | Co-founder |
| Date | 14.03.2019 |
| GitHub username | nlptown |
| Website (optional) | http://www.nlp.town |

106
.github/contributors/socool.md vendored Normal file

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Kamolsit Mongkolsrisawat |
| Company name (if applicable) | Mojito |
| Title or role (if applicable) | |
| Date | 02-4-2019 |
| GitHub username | socool |
| Website (optional) | |

View File

@@ -17,7 +17,7 @@ released under the MIT license.
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square)](https://github.com/explosion/spaCy/releases)
-[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
+[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
@@ -42,7 +42,7 @@ released under the MIT license.
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
-[changelog]: https://spacy.io/usage/#changelog
+[changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
 ## 💬 Where to ask questions
@@ -60,7 +60,7 @@ valuable if it's shared publicly, so that more people can benefit from it.
 | 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
 [github issue tracker]: https://github.com/explosion/spaCy/issues
-[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
+[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp
@@ -95,7 +95,7 @@ For detailed installation instructions, see the
 - **Python version**: Python 2.7, 3.5+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)
-[pip]: https://pypi.python.org/pypi/spacy
+[pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy
 ### pip
@@ -219,7 +219,7 @@ source. That is the common way if you want to make changes to the code base.
 You'll need to make sure that you have a development environment consisting of a
 Python distribution including header files, a compiler,
 [pip](https://pip.pypa.io/en/latest/installing/),
-[virtualenv](https://virtualenv.pypa.io/) and [git](https://git-scm.com)
+[virtualenv](https://virtualenv.pypa.io/en/latest/) and [git](https://git-scm.com)
 installed. The compiler part is the trickiest. How to do that depends on your
 system. See notes on Ubuntu, OS X and Windows for details.
@@ -239,8 +239,8 @@ python setup.py build_ext --inplace
 Compared to regular install via pip, [requirements.txt](requirements.txt)
 additionally installs developer dependencies such as Cython. For more details
 and instructions, see the documentation on
-[compiling spaCy from source](https://spacy.io/usage/#source) and the
-[quickstart widget](https://spacy.io/usage/#section-quickstart) to get
+[compiling spaCy from source](https://spacy.io/usage#source) and the
+[quickstart widget](https://spacy.io/usage#section-quickstart) to get
 the right commands for your platform and Python version.
 ### Ubuntu
@@ -260,7 +260,7 @@ and git preinstalled.
 ### Windows
 Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or
-[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/)
+[Visual Studio Express](https://visualstudio.microsoft.com/vs/express/)
 that matches the version that was used to compile your Python
 interpreter. For official distributions these are VS 2008 (Python 2.7),
 VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
@@ -282,5 +282,5 @@ pip install -r path/to/requirements.txt
 python -m pytest <spacy-directory>
 ```
-See [the documentation](https://spacy.io/usage/#tests) for more details and
+See [the documentation](https://spacy.io/usage#tests) for more details and
 examples.

View File

@@ -23,7 +23,7 @@ For more details, see the documentation:
 * Training: https://spacy.io/usage/training
 * NER: https://spacy.io/usage/linguistic-features#named-entities
-Compatible with: spaCy v2.0.0+
+Compatible with: spaCy v2.1.0+
 Last tested with: v2.1.0
 """
 from __future__ import unicode_literals, print_function

View File

@@ -86,7 +86,7 @@ def with_cpu(ops, model):
     as necessary."""
     model.to_cpu()
-    def with_cpu_forward(inputs, drop=0.):
+    def with_cpu_forward(inputs, drop=0.0):
         cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
         gpu_outputs = _to_device(ops, cpu_outputs)
@@ -106,7 +106,7 @@ def _to_cpu(X):
         return tuple([_to_cpu(x) for x in X])
     elif isinstance(X, list):
         return [_to_cpu(x) for x in X]
-    elif hasattr(X, 'get'):
+    elif hasattr(X, "get"):
         return X.get()
     else:
         return X
@@ -142,7 +142,9 @@ class extract_ngrams(Model):
         # The dtype here matches what thinc is expecting -- which differs per
         # platform (by int definition). This should be fixed once the problem
         # is fixed on Thinc's side.
-        lengths = self.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_)
+        lengths = self.ops.asarray(
+            [arr.shape[0] for arr in batch_keys], dtype=numpy.int_
+        )
         batch_keys = self.ops.xp.concatenate(batch_keys)
         batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f")
         return (batch_keys, batch_vals, lengths), None
@@ -592,32 +594,27 @@ def build_text_classifier(nr_class, width=64, **cfg):
         )
         linear_model = build_bow_text_classifier(
-            nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False)
-        if cfg.get('exclusive_classes'):
+            nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False
+        )
+        if cfg.get("exclusive_classes"):
             output_layer = Softmax(nr_class, nr_class * 2)
         else:
             output_layer = (
-                zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
-                >> logistic
+                zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic
             )
-        model = (
-            (linear_model | cnn_model)
-            >> output_layer
-        )
+        model = (linear_model | cnn_model) >> output_layer
         model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
     model.lsuv = False
     return model
-def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
-        no_output_layer=False, **cfg):
+def build_bow_text_classifier(
+    nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg
+):
     with Model.define_operators({">>": chain}):
-        model = (
-            with_cpu(Model.ops,
-                extract_ngrams(ngram_size, attr=ORTH)
-                >> LinearModel(nr_class)
-            )
+        model = with_cpu(
+            Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class)
         )
         if not no_output_layer:
             model = model >> (cpu_softmax if exclusive_classes else logistic)
@@ -626,11 +623,9 @@ def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
 @layerize
-def cpu_softmax(X, drop=0.):
+def cpu_softmax(X, drop=0.0):
     ops = NumpyOps()
-    Y = ops.softmax(X)
     def cpu_softmax_backward(dY, sgd=None):
         return dY
@@ -648,7 +643,9 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
         if exclusive_classes:
             output_layer = Softmax(nr_class, tok2vec.nO)
         else:
-            output_layer = zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+            output_layer = (
+                zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+            )
         model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
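
Note on the `_to_cpu` hunk in this file: the function walks nested tuples and lists and calls `.get()` on anything that exposes it, which is how GPU-backed arrays are usually copied back to host memory. The following is a minimal sketch of that recursion only, not the code from the commit. `FakeDeviceArray` and `to_cpu` are hypothetical names introduced for illustration, and the `numpy.ndarray` branch that sits outside the visible hunk is left out.

```python
class FakeDeviceArray(object):
    """Hypothetical stand-in for a device array that exposes .get()."""

    def __init__(self, data):
        self.data = data

    def get(self):
        # Return a plain host-side copy of the data.
        return list(self.data)


def to_cpu(X):
    """Recursively convert nested tuples/lists of device arrays to host data."""
    if isinstance(X, tuple):
        return tuple([to_cpu(x) for x in X])
    elif isinstance(X, list):
        return [to_cpu(x) for x in X]
    elif hasattr(X, "get"):
        return X.get()
    else:
        return X


nested = (FakeDeviceArray([1, 2]), [FakeDeviceArray([3]), 4])
converted = to_cpu(nested)
```

One caveat of the `hasattr(X, "get")` check is that it is duck-typed: any object with a `get` attribute, a `dict` for instance, would take that branch as well.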

View File

@@ -125,7 +125,9 @@ def pretrain(
                 max_length=max_length,
                 min_length=min_length,
             )
-            loss = make_update(model, docs, optimizer, objective=loss_func, drop=dropout)
+            loss = make_update(
+                model, docs, optimizer, objective=loss_func, drop=dropout
+            )
             progress = tracker.update(epoch, loss, docs)
             if progress:
                 msg.row(progress, **row_settings)
@@ -215,8 +217,8 @@ def get_cossim_loss(yh, y):
     norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
     mul_norms = norm_yh * norm_y
     cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
-    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
-    loss = xp.abs(cosine-1).sum()
+    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2))
+    loss = xp.abs(cosine - 1).sum()
     return loss, -d_yh
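
Note on `get_cossim_loss`: the hunk only changes whitespace, but the arithmetic is easier to follow on a single pair of vectors. The sketch below is a simplified pure-Python version under stated assumptions. It handles one prediction `yh` and one target `y` instead of a batch, and it uses `math` rather than the `xp` array module. The name `cossim_loss` is hypothetical.

```python
import math


def cossim_loss(yh, y):
    """Return (loss, gradient w.r.t. yh) for one predicted and one target vector."""
    norm_yh = math.sqrt(sum(a * a for a in yh))
    norm_y = math.sqrt(sum(b * b for b in y))
    mul_norms = norm_yh * norm_y
    cosine = sum(a * b for a, b in zip(yh, y)) / mul_norms
    # Derivative of the cosine with respect to yh, the same expression as d_yh in the hunk.
    d_yh = [b / mul_norms - cosine * (a / norm_yh ** 2) for a, b in zip(yh, y)]
    # cosine <= 1, so |cosine - 1| equals 1 - cosine and its gradient is -d_yh.
    loss = abs(cosine - 1)
    return loss, [-g for g in d_yh]


orthogonal_loss, orthogonal_grad = cossim_loss([1.0, 0.0], [0.0, 1.0])
parallel_loss, parallel_grad = cossim_loss([2.0, 0.0], [1.0, 0.0])
```

For orthogonal vectors the cosine is 0, so the loss is 1. For parallel vectors the cosine is 1, so both the loss and the gradient vanish.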

View File

@@ -50,8 +50,9 @@ class DependencyRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
             render_id = "{}-{}".format(id_prefix, i)
             svg = self.render_svg(render_id, p["words"], p["arcs"])
             rendered.append(svg)
@@ -254,9 +255,10 @@ class EntityRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
-            rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
+            rendered.append(self.render_ents(p["text"], p["ents"], p.get("title")))
         if page:
             docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
             markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
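
Note on the renderer hunks: both replace `p["settings"]` and `p["title"]` with `.get()` lookups, so a manually built input dict without those keys no longer raises `KeyError`. The sketch below isolates that lookup pattern. `read_settings` is a hypothetical helper, and the two default values are assumptions made for illustration.

```python
DEFAULT_DIR = "ltr"   # assumed default, for illustration only
DEFAULT_LANG = "en"   # assumed default, for illustration only


def read_settings(parsed_doc):
    """Read optional rendering settings and title from one parsed-doc dict."""
    settings = parsed_doc.get("settings", {})
    direction = settings.get("direction", DEFAULT_DIR)
    lang = settings.get("lang", DEFAULT_LANG)
    title = parsed_doc.get("title")  # None when the key is absent
    return direction, lang, title


minimal = read_settings({"text": "But Google is starting from behind.", "ents": []})
explicit = read_settings(
    {"text": "", "ents": [], "settings": {"direction": "rtl", "lang": "ar"}, "title": "Demo"}
)
```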

View File

@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals
-from ...symbols import LEMMA, PRON_LEMMA, AUX
+from ...symbols import LEMMA, PRON_LEMMA
 _subordinating_conjunctions = [
     "that",
@@ -457,7 +457,6 @@ MORPH_RULES = {
         "have": {"POS": "AUX"},
         "'m": {"POS": "AUX", LEMMA: "be"},
         "'ve": {"POS": "AUX"},
-        "'re": {"POS": "AUX", LEMMA: "be"},
         "'s": {"POS": "AUX"},
         "is": {"POS": "AUX"},
         "'d": {"POS": "AUX"},

View File

@@ -39,7 +39,7 @@ made make many may me meanwhile might mine more moreover most mostly move much
 must my myself
 name namely neither never nevertheless next nine no nobody none noone nor not
-nothing now nowhere n't
+nothing now nowhere
 of off often on once one only onto or other others otherwise our ours ourselves
 out over own
@@ -66,7 +66,13 @@ whereafter whereas whereby wherein whereupon wherever whether which while
 whither who whoever whole whom whose why will with within without would
 yet you your yours yourself yourselves
-'d 'll 'm 're 's 've
 """.split()
 )
+contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
+STOP_WORDS.update(contractions)
+for apostrophe in ["‘", "’"]:
+    for stopword in contractions:
+        STOP_WORDS.add(stopword.replace("'", apostrophe))
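
Note on the stop-word hunk: the contractions move out of the whitespace-split string into a list, and each one is then added again with the ASCII apostrophe swapped for other apostrophe characters. The sketch below reproduces that loop on a small stand-in set. It assumes the two variants are the typographic single quotes U+2018 and U+2019. `is_stop` is a hypothetical helper, and the six base words are a sample rather than the full list.

```python
# Small stand-in for the real stop-word set.
STOP_WORDS = set("nothing now nowhere yet you your".split())

contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
STOP_WORDS.update(contractions)

# Assumption: the variants are the left and right typographic single quotes.
for apostrophe in ["\u2018", "\u2019"]:
    for stopword in contractions:
        STOP_WORDS.add(stopword.replace("'", apostrophe))


def is_stop(token):
    """Hypothetical helper: case-insensitive membership test."""
    return token.lower() in STOP_WORDS
```

With this, `is_stop("n\u2019t")` and `is_stop("n't")` both return `True`, so text with typographic quotes is filtered the same way as plain ASCII text.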

View File

@@ -2,7 +2,11 @@
 from __future__ import unicode_literals
 from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+<<<<<<< HEAD
 from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN
+=======
+from ...symbols import NOUN, PRON, AUX, SCONJ
+>>>>>>> 4faf62d5154c2d2adb6def32da914d18d5e9c8fe
 # POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014
@@ -92,4 +96,3 @@ TAG_MAP = {
     "D--+PS2":{POS: ADV},
     "PP3+T—": {POS: PRON}
 }


@ -4,6 +4,11 @@ from __future__ import unicode_literals
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES
from .lemmatizer.lemmatizer import DutchLemmatizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS from ..norm_exceptions import BASE_NORMS
@ -13,20 +18,33 @@ from ...util import update_exc, add_lookups
class DutchDefaults(Language.Defaults): class DutchDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters) lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS) lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "nl" lex_attr_getters[LANG] = lambda text: 'nl'
lex_attr_getters[NORM] = add_lookups( lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS BASE_NORMS)
) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP
infixes = TOKENIZER_INFIXES
suffixes = TOKENIZER_SUFFIXES
@classmethod
def create_lemmatizer(cls, nlp=None):
rules = RULES
lemma_index = LEMMA_INDEX
lemma_exc = LEMMA_EXC
lemma_lookup = LOOKUP
return DutchLemmatizer(index=lemma_index,
exceptions=lemma_exc,
lookup=lemma_lookup,
rules=rules)
class Dutch(Language): class Dutch(Language):
lang = "nl" lang = 'nl'
Defaults = DutchDefaults Defaults = DutchDefaults
__all__ = ["Dutch"] __all__ = ['Dutch']


@ -14,5 +14,5 @@ sentences = [
"Apple overweegt om voor 1 miljard een U.K. startup te kopen", "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
"Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten", "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
"San Francisco overweegt robots op voetpaden te verbieden", "San Francisco overweegt robots op voetpaden te verbieden",
"Londen is een grote stad in het Verenigd Koninkrijk", "Londen is een grote stad in het Verenigd Koninkrijk"
] ]


@ -0,0 +1,40 @@
# coding: utf8
from __future__ import unicode_literals
from ._verbs_irreg import VERBS_IRREG
from ._nouns_irreg import NOUNS_IRREG
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs_irreg import ADVERBS_IRREG
from ._adpositions_irreg import ADPOSITIONS_IRREG
from ._determiners_irreg import DETERMINERS_IRREG
from ._pronouns_irreg import PRONOUNS_IRREG
from ._verbs import VERBS
from ._nouns import NOUNS
from ._adjectives import ADJECTIVES
from ._adpositions import ADPOSITIONS
from ._determiners import DETERMINERS
from .lookup import LOOKUP
from ._lemma_rules import RULES
from .lemmatizer import DutchLemmatizer
LEMMA_INDEX = {"adj": ADJECTIVES,
"noun": NOUNS,
"verb": VERBS,
"adp": ADPOSITIONS,
"det": DETERMINERS}
LEMMA_EXC = {"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"adp": ADPOSITIONS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG,
"det": DETERMINERS_IRREG,
"pron": PRONOUNS_IRREG}

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals
ADPOSITIONS = set(
('aan aangaande aanwezig achter af afgezien al als an annex anno anti '
'behalve behoudens beneden benevens benoorden beoosten betreffende bewesten '
'bezijden bezuiden bij binnen binnenuit binst bladzij blijkens boven bovenop '
'buiten conform contra cq daaraan daarbij daarbuiten daarin daarnaar '
'daaronder daartegenover daarvan dankzij deure dichtbij door doordat doorheen '
'echter eraf erop erover errond eruit ervoor evenals exclusief gedaan '
'gedurende gegeven getuige gezien halfweg halverwege heen hierdoorheen hierop '
'houdende in inclusief indien ingaande ingevolge inzake jegens kortweg '
'krachtens kralj langs langsheen langst lastens linksom lopende luidens mede '
'mee met middels midden middenop mits na naan naar naartoe naast naat nabij '
'nadat namens neer neffe neffen neven nevenst niettegenstaande nopens '
'officieel om omheen omstreeks omtrent onafgezien ondanks onder onderaan '
'ondere ongeacht ooit op open over per plus pro qua rechtover rond rondom '
"sedert sinds spijts strekkende te tegen tegenaan tegenop tegenover telde "
'teneinde terug tijdens toe tot totdat trots tussen tégen uit uitgenomen '
'ultimo van vanaf vandaan vandoor vanop vanuit vanwege versus via vinnen '
'vlakbij volgens voor voor- voorbij voordat voort voren vòòr vóór waaraan '
'waarbij waardoor waaronder weg wegens weleens zijdens zoals zodat zonder '
'zónder à').split())


@ -0,0 +1,12 @@
# coding: utf8
from __future__ import unicode_literals
ADPOSITIONS_IRREG = {
"'t": ('te',),
'me': ('mee',),
'meer': ('mee',),
'on': ('om',),
'ten': ('te',),
'ter': ('te',)
}


@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
ADVERBS_IRREG = {
"'ns": ('eens',),
"'s": ('eens',),
"'t": ('het',),
"d'r": ('er',),
"d'raf": ('eraf',),
"d'rbij": ('erbij',),
"d'rheen": ('erheen',),
"d'rin": ('erin',),
"d'rna": ('erna',),
"d'rnaar": ('ernaar',),
'hele': ('heel',),
'nevenst': ('nevens',),
'overend': ('overeind',)
}


@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals
DETERMINERS = set(
("al allebei allerhande allerminst alletwee "
"beide clip-on d'n d'r dat datgeen datgene de dees degeen degene den dewelke "
'deze dezelfde die diegeen diegene diehien dien diene diens diezelfde dit '
'ditgene e een eene eigen elk elkens elkes enig enkel enne ettelijke eure '
'euren evenveel ewe ge geen ginds géén haar haaren halfelf het hetgeen '
'hetwelk hetzelfde heur heure hulder hulle hullen hullie hun hunder hunderen '
'ieder iederes ja je jen jouw jouwen jouwes jullie junder keiveel keiweinig '
"m'ne me meer meerder meerdere menen menig mijn mijnes minst méér niemendal "
'oe ons onse se sommig sommigeder superveel telken teveel titulair ulder '
'uldere ulderen ulle under une uw vaak veel veels véél wat weinig welk welken '
"welkene welksten z'nen ze zenen zijn zo'n zo'ne zoiet zoveel zovele zovelen "
'zuk zulk zulkdanig zulken zulks zullie zíjn àlle álle').split())


@ -0,0 +1,69 @@
# coding: utf8
from __future__ import unicode_literals
DETERMINERS_IRREG = {
"'r": ('haar',),
"'s": ('de',),
"'t": ('het',),
"'tgene": ('hetgeen',),
'alle': ('al',),
'allen': ('al',),
'aller': ('al',),
'beiden': ('beide',),
'beider': ('beide',),
"d'": ('het',),
"d'r": ('haar',),
'der': ('de',),
'des': ('de',),
'dezer': ('deze',),
'dienen': ('die',),
'dier': ('die',),
'elke': ('elk',),
'ene': ('een',),
'enen': ('een',),
'ener': ('een',),
'enige': ('enig',),
'enigen': ('enig',),
'er': ('haar',),
'gene': ('geen',),
'genen': ('geen',),
'hare': ('haar',),
'haren': ('haar',),
'harer': ('haar',),
'hunne': ('hun',),
'hunnen': ('hun',),
'jou': ('jouw',),
'jouwe': ('jouw',),
'julliejen': ('jullie',),
"m'n": ('mijn',),
'mee': ('meer',),
'meer': ('veel',),
'meerderen': ('meerdere',),
'meest': ('veel',),
'meesten': ('veel',),
'meet': ('veel',),
'menige': ('menig',),
'mij': ('mijn',),
'mijnen': ('mijn',),
'minder': ('weinig',),
'mindere': ('weinig',),
'minst': ('weinig',),
'minste': ('minst',),
'ne': ('een',),
'onze': ('ons',),
'onzent': ('ons',),
'onzer': ('ons',),
'ouw': ('uw',),
'sommige': ('sommig',),
'sommigen': ('sommig',),
'u': ('uw',),
'vaker': ('vaak',),
'vele': ('veel',),
'velen': ('veel',),
'welke': ('welk',),
'zijne': ('zijn',),
'zijnen': ('zijn',),
'zijns': ('zijn',),
'één': ('een',)
}


@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals
ADJECTIVE_SUFFIX_RULES = [
["sten", ""],
["ste", ""],
["st", ""],
["er", ""],
["en", ""],
["e", ""],
["ende", "end"]
]
VERB_SUFFIX_RULES = [
["dt", "den"],
["de", "en"],
["te", "en"],
["dde", "den"],
["tte", "ten"],
["dden", "den"],
["tten", "ten"],
["end", "en"],
]
NOUN_SUFFIX_RULES = [
["en", ""],
["ën", ""],
["'er", ""],
["s", ""],
["tje", ""],
["kje", ""],
["'s", ""],
["ici", "icus"],
["heden", "heid"],
["elen", "eel"],
["ezen", "ees"],
["even", "eef"],
["ssen", "s"],
["rren", "r"],
["kken", "k"],
["bben", "b"]
]
NUM_SUFFIX_RULES = [
["ste", ""],
["sten", ""],
["ën", ""],
["en", ""],
["de", ""],
["er", ""],
["ër", ""],
["tjes", ""]
]
PUNCT_SUFFIX_RULES = [
["\u201c", "\""],
["\u201d", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
# In-place sort guaranteeing that longer -- more specific -- rules are
# applied first.
for rule_set in (ADJECTIVE_SUFFIX_RULES,
NOUN_SUFFIX_RULES,
NUM_SUFFIX_RULES,
VERB_SUFFIX_RULES):
rule_set.sort(key=lambda r: len(r[0]), reverse=True)
RULES = {
"adj": ADJECTIVE_SUFFIX_RULES,
"noun": NOUN_SUFFIX_RULES,
"verb": VERB_SUFFIX_RULES,
"num": NUM_SUFFIX_RULES,
"punct": PUNCT_SUFFIX_RULES
}
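The effect of the longest-first sort can be illustrated with a small, self-contained sketch. It uses a subset of the noun rules from this file and a hypothetical `first_match` helper that applies only the first matching rule. The lemmatizer in this commit keeps trying rules until a candidate is found in the index, so this is a simplification meant only to show why rule order can matter.

```python
# Subset of NOUN_SUFFIX_RULES from the diff, in their original order.
noun_rules = [
    ["en", ""],
    ["s", ""],
    ["heden", "heid"],
    ["elen", "eel"],
]


def first_match(word, rules):
    """Hypothetical helper: apply only the first rule whose suffix matches."""
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


# Unsorted, the generic "en" rule fires before the specific "heden" rule.
before = first_match("mogelijkheden", noun_rules)

# Same in-place sort as in the diff: longest suffix first.
noun_rules.sort(key=lambda r: len(r[0]), reverse=True)
after = first_match("mogelijkheden", noun_rules)

print(before)  # mogelijkhed
print(after)   # mogelijkheid
```

Python's sort is stable, so rules of equal suffix length keep their original relative order.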

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
NUMBERS_IRREG = {
'achten': ('acht',),
'biljoenen': ('biljoen',),
'drieën': ('drie',),
'duizenden': ('duizend',),
'eentjes': ('één',),
'elven': ('elf',),
'miljoenen': ('miljoen',),
'negenen': ('negen',),
'negentiger': ('negentig',),
'tienduizenden': ('tienduizend',),
'tienen': ('tien',),
'tientjes': ('tien',),
'twaalven': ('twaalf',),
'tweeën': ('twee',),
'twintiger': ('twintig',),
'twintigsten': ('twintig',),
'vieren': ('vier',),
'vijftiger': ('vijftig',),
'vijven': ('vijf',),
'zessen': ('zes',),
'zestiger': ('zestig',),
'zevenen': ('zeven',),
'zeventiger': ('zeventig',),
'zovele': ('zoveel',),
'zovelen': ('zoveel',)
}


@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals
PRONOUNS_IRREG = {
"'r": ('haar',),
"'rzelf": ('haarzelf',),
"'t": ('het',),
"d'r": ('haar',),
'da': ('dat',),
'dienen': ('die',),
'diens': ('die',),
'dies': ('die',),
'elkaars': ('elkaar',),
'elkanders': ('elkander',),
'ene': ('een',),
'enen': ('een',),
'fik': ('ik',),
'gaat': ('gaan',),
'gene': ('geen',),
'harer': ('haar',),
'ieders': ('ieder',),
'iemands': ('iemand',),
'ikke': ('ik',),
'mijnen': ('mijn',),
'oe': ('je',),
'onzer': ('ons',),
'wa': ('wat',),
'watte': ('wat',),
'wier': ('wie',),
'zijns': ('zijn',),
'zoietsken': ('zoietske',),
'zulks': ('zulk',),
'één': ('een',)
}

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,130 @@
# coding: utf8
from __future__ import unicode_literals
from ....symbols import POS, NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV
class DutchLemmatizer(object):
# Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB.
univ_pos_name_variants = {
NOUN: "noun", "NOUN": "noun", "noun": "noun",
VERB: "verb", "VERB": "verb", "verb": "verb",
AUX: "verb", "AUX": "verb", "aux": "verb",
ADJ: "adj", "ADJ": "adj", "adj": "adj",
ADV: "adv", "ADV": "adv", "adv": "adv",
PRON: "pron", "PRON": "pron", "pron": "pron",
DET: "det", "DET": "det", "det": "det",
ADP: "adp", "ADP": "adp", "adp": "adp",
NUM: "num", "NUM": "num", "num": "num"
}
@classmethod
def load(cls, path, index=None, exc=None, rules=None, lookup=None):
return cls(index, exc, rules, lookup)
def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
self.index = index
self.exc = exceptions
self.rules = rules or {}
self.lookup_table = lookup if lookup is not None else {}
def __call__(self, string, univ_pos, morphology=None):
# Difference 1: self.rules is assumed to be non-None, so no
# 'is None' check required.
# String lowercased from the get-go. All lemmatization results in
# lowercased strings. For most applications, this shouldn't pose
# any problems, and it keeps the exceptions indexes small. If this
# creates problems for proper nouns, we can introduce a check for
# univ_pos == "PROPN".
string = string.lower()
try:
univ_pos = self.univ_pos_name_variants[univ_pos]
except KeyError:
# Because PROPN not in self.univ_pos_name_variants, proper names
# are not lemmatized. They are lowercased, however.
return [string]
# if string in self.lemma_index.get(univ_pos)
lemma_index = self.index.get(univ_pos, {})
# string is already lemma
if string in lemma_index:
return [string]
exceptions = self.exc.get(univ_pos, {})
# string is irregular token contained in exceptions index.
try:
lemma = exceptions[string]
return [lemma[0]]
except KeyError:
pass
# string corresponds to key in lookup table
lookup_table = self.lookup_table
looked_up_lemma = lookup_table.get(string)
if looked_up_lemma and looked_up_lemma in lemma_index:
return [looked_up_lemma]
forms, is_known = lemmatize(
string,
lemma_index,
exceptions,
self.rules.get(univ_pos, []))
# Back-off through remaining return value candidates.
if forms:
if is_known:
return forms
else:
for form in forms:
if form in exceptions:
return [form]
if looked_up_lemma:
return [looked_up_lemma]
else:
return forms
elif looked_up_lemma:
return [looked_up_lemma]
else:
return [string]
# Overrides parent method so that a lowercased version of the string is
# used to search the lookup table. This is necessary because our lookup
# table consists entirely of lowercase keys.
def lookup(self, string):
string = string.lower()
return self.lookup_table.get(string, string)
def noun(self, string, morphology=None):
return self(string, 'noun', morphology)
def verb(self, string, morphology=None):
return self(string, 'verb', morphology)
def adj(self, string, morphology=None):
return self(string, 'adj', morphology)
def det(self, string, morphology=None):
return self(string, 'det', morphology)
def pron(self, string, morphology=None):
return self(string, 'pron', morphology)
def adp(self, string, morphology=None):
return self(string, 'adp', morphology)
def punct(self, string, morphology=None):
return self(string, 'punct', morphology)
# Reimplemented to focus more on application of suffix rules and to return
# as early as possible.
def lemmatize(string, index, exceptions, rules):
# returns (forms, is_known: bool)
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index:
return [form], True # True = Is known (is lemma)
else:
oov_forms.append(form)
return list(set(oov_forms)), False
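A minimal sketch of how the `lemmatize` helper behaves, with indentation restored and run against toy data. The index and rules below are hypothetical stand-ins for the much larger tables in this commit; only the control flow mirrors the diff.

```python
def lemmatize(string, index, exceptions, rules):
    """Indented copy of the helper from the diff; returns (forms, is_known).

    Note that `exceptions` is accepted but not consulted by this helper.
    """
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if not form:
                continue  # stripping the suffix left nothing; skip this rule
            if form in index:
                return [form], True  # known lemma: return as early as possible
            oov_forms.append(form)
    return list(set(oov_forms)), False


# Hypothetical toy data for illustration only.
index = {"huis", "boek"}
rules = [["en", ""], ["s", ""]]

known = lemmatize("boeken", index, {}, rules)
unknown = lemmatize("fietsen", index, {}, rules)

print(known)    # (['boek'], True)
print(unknown)  # (['fiets'], False)
```

When `is_known` is false, `DutchLemmatizer.__call__` falls back on the exceptions and the lookup table before returning the out-of-vocabulary candidates.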

File diff suppressed because it is too large


@ -4,22 +4,18 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM from ...attrs import LIKE_NUM
_num_words = set( _num_words = set("""
"""
nul een één twee drie vier vijf zes zeven acht negen tien elf twaalf dertien nul een één twee drie vier vijf zes zeven acht negen tien elf twaalf dertien
veertien twintig dertig veertig vijftig zestig zeventig tachtig negentig honderd veertien twintig dertig veertig vijftig zestig zeventig tachtig negentig honderd
duizend miljoen miljard biljoen biljard triljoen triljard duizend miljoen miljard biljoen biljard triljoen triljard
""".split() """.split())
)
_ordinal_words = set( _ordinal_words = set("""
"""
eerste tweede derde vierde vijfde zesde zevende achtste negende tiende elfde eerste tweede derde vierde vijfde zesde zevende achtste negende tiende elfde
twaalfde dertiende veertiende twintigste dertigste veertigste vijftigste twaalfde dertiende veertiende twintigste dertigste veertigste vijftigste
zestigste zeventigste tachtigste negentigste honderdste duizendste miljoenste zestigste zeventigste tachtigste negentigste honderdste duizendste miljoenste
miljardste biljoenste biljardste triljoenste triljardste miljardste biljoenste biljardste triljoenste triljardste
""".split() """.split())
)
def like_num(text): def like_num(text):
@ -27,13 +23,11 @@ def like_num(text):
# or matches one of the number words. In order to handle numbers like # or matches one of the number words. In order to handle numbers like
# "drieëntwintig", more work is required. # "drieëntwintig", more work is required.
# See this discussion: https://github.com/explosion/spaCy/pull/1177 # See this discussion: https://github.com/explosion/spaCy/pull/1177
if text.startswith(("+", "-", "±", "~")): text = text.replace(',', '').replace('.', '')
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit(): if text.isdigit():
return True return True
if text.count("/") == 1: if text.count('/') == 1:
num, denom = text.split("/") num, denom = text.split('/')
if num.isdigit() and denom.isdigit(): if num.isdigit() and denom.isdigit():
return True return True
if text.lower() in _num_words: if text.lower() in _num_words:
@ -43,4 +37,6 @@ def like_num(text):
return False return False
LEX_ATTRS = {LIKE_NUM: like_num} LEX_ATTRS = {
LIKE_NUM: like_num
}
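For reference, a self-contained sketch of the `like_num` logic as it appears on the new side of this diff. The word sets are shortened, and the ordinal check is an assumption because it falls between the two hunks shown. Note that the new side no longer strips a leading sign, so a token such as "-5" is not treated as number-like in this sketch.

```python
# Shortened, illustrative word sets; the real ones are in the diff above.
_num_words = {"nul", "een", "twee", "drie", "tien", "honderd"}
_ordinal_words = {"eerste", "tweede", "derde"}


def like_num(text):
    # Drop thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    # Assumed: the elided lines check ordinals in the same way.
    if text.lower() in _ordinal_words:
        return True
    return False


print(like_num("1.000"))          # True
print(like_num("3/4"))            # True
print(like_num("Twee"))           # True
print(like_num("drieëntwintig"))  # False
print(like_num("-5"))             # False
```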


@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
from ..punctuation import TOKENIZER_SUFFIXES as DEFAULT_TOKENIZER_SUFFIXES
# Copied from `de` package. Main purpose is to ensure that hyphens are not
# split on.
_quotes = CONCAT_QUOTES.replace("'", '')
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])'.format(a=ALPHA, q=_quotes),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
r'(?<=[0-9])-(?=[0-9])'])
# Remove "'s" suffix from suffix list. In Dutch, "'s" is a plural ending when
# it occurs as a suffix and a clitic for "eens" in standalone use. To avoid
# ambiguity it's better to just leave it attached when it occurs as a suffix.
default_suffix_blacklist = ("'s", "'S", 's', 'S')
_suffixes = [suffix for suffix in DEFAULT_TOKENIZER_SUFFIXES
if suffix not in default_suffix_blacklist]
TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
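The effect of dropping "'s" from the suffix list can be sketched with plain `re`. The short `default_suffixes` list and the `compile_suffixes` helper are hypothetical stand-ins, not spaCy's own defaults or compilation helper; they only show that once the pattern is filtered out, a token such as "auto's" no longer has a suffix split off.

```python
import re

# Hypothetical, shortened stand-in for the default suffix patterns.
default_suffixes = [r"'s", r"'S", r"\)", r"\."]

# Same filtering as in the diff.
default_suffix_blacklist = ("'s", "'S", "s", "S")
dutch_suffixes = [s for s in default_suffixes
                  if s not in default_suffix_blacklist]


def compile_suffixes(patterns):
    """Hypothetical helper: match any of the patterns at the end of a string."""
    return re.compile("|".join("(?:{})$".format(p) for p in patterns))


default_re = compile_suffixes(default_suffixes)
dutch_re = compile_suffixes(dutch_suffixes)

print(default_re.search("auto's").group())  # 's
print(dutch_re.search("auto's"))            # None
print(dutch_re.search("kopen.").group())    # .
```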


@ -1,45 +1,73 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
# The original stop words list (added in f46ffe3) was taken from
# http://www.damienvanholten.com/downloads/dutch-stop-words.txt
# and consisted of about 100 tokens.
# In order to achieve parity with some of the better-supported
# languages, e.g., English, French, and German, this original list has been
# extended with 200 additional tokens. The main source of inspiration was
# https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt.
# However, quite a bit of manual editing has taken place as well.
# Tokens whose status as a stop word is not entirely clear were admitted or
# rejected by deferring to their counterparts in the stop words lists for English
# and French. Similarly, those lists were used to identify and fill in gaps so
# that -- in principle -- each token contained in the English stop words list
# should have a Dutch counterpart here.
# Stop words are retrieved from http://www.damienvanholten.com/downloads/dutch-stop-words.txt
STOP_WORDS = set( STOP_WORDS = set("""
""" aan af al alle alles allebei alleen allen als altijd ander anders andere anderen aangaande aangezien achter achterna
aan af al alles als altijd andere afgelopen aldus alhoewel anderzijds
ben bij ben bij bijna bijvoorbeeld behalve beide beiden beneden bent bepaald beter betere betreffende binnen binnenin boven
bovenal bovendien bovenstaand buiten
daar dan dat de der deze die dit doch doen door dus daar dan dat de der den deze die dit doch doen door dus daarheen daarin daarna daarnet daarom daarop des dezelfde dezen
dien dikwijls doet doorgaand doorgaans
een eens en er een eens en er echter enige eerder eerst eerste eersten effe eigen elk elke enkel enkele enz erdoor etc even eveneens
evenwel
ge geen geweest ff
haar had heb hebben heeft hem het hier hij hoe hun ge geen geweest gauw gedurende gegeven gehad geheel gekund geleden gelijk gemogen geven geweest gewoon gewoonweg
geworden gij
iemand iets ik in is haar had heb hebben heeft hem het hier hij hoe hun hadden hare hebt hele hen hierbeneden hierboven hierin hoewel hun
ja je iemand iets ik in is idd ieder ikke ikzelf indien inmiddels inz inzake
kan kon kunnen ja je jou jouw jullie jezelf jij jijzelf jouwe juist
maar me meer men met mij mijn moet kan kon kunnen klaar konden krachtens kunnen kunt
na naar niet niets nog nu lang later liet liever
of om omdat ons ook op over maar me meer men met mij mijn moet mag mede meer meesten mezelf mijzelf min minder misschien mocht mochten moest moesten
moet moeten mogelijk mogen
reeds na naar niet niets nog nu nabij nadat net nogal nooit nr nu
te tegen toch toen tot of om omdat ons ook op over omhoog omlaag omstreeks omtrent omver onder ondertussen ongeveer onszelf onze ooit opdat
opnieuw opzij over overigens
u uit uw pas pp precies prof publ
van veel voor reeds rond rondom
want waren was wat we wel werd wezen wie wij wil worden sedert sinds sindsdien slechts sommige spoedig steeds
zal ze zei zelf zich zij zijn zo zonder zou t 't te tegen toch toen tot tamelijk ten tenzij ter terwijl thans tijdens toe totdat tussen
""".split()
) u uit uw uitgezonderd uwe uwen
van veel voor vaak vanaf vandaan vanuit vanwege veeleer verder verre vervolgens vgl volgens vooraf vooral vooralsnog
voorbij voordat voordien voorheen voorop voort voorts vooruit vrij vroeg
want waren was wat we wel werd wezen wie wij wil worden waar waarom wanneer want weer weg wegens weinig weinige weldra
welk welke welken werd werden wiens wier wilde wordt
zal ze zei zelf zich zij zijn zo zonder zou zeer zeker zekere zelfde zelfs zichzelf zijnde zijne zon zoals zodra zouden
zoveel zowat zulk zulke zulks zullen zult
""".split())


@ -5,7 +5,6 @@ from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ
# fmt: off
TAG_MAP = { TAG_MAP = {
"ADJ__Number=Sing": {POS: ADJ}, "ADJ__Number=Sing": {POS: ADJ},
"ADJ___": {POS: ADJ}, "ADJ___": {POS: ADJ},
@ -811,4 +810,3 @@ TAG_MAP = {
"X___": {POS: X}, "X___": {POS: X},
"_SP": {POS: SPACE} "_SP": {POS: SPACE}
} }
# fmt: on


@ -0,0 +1,340 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
# Extensive list of both common and uncommon Dutch abbreviations copied from
# github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based
# sentence boundary detection (MIT, Copyright 2015 Kevin S. Dias).
# Source file: https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/languages/dutch.rb
# (Last commit: 4d1477b)
# Main purpose of such an extensive list: considerably improved sentence
# segmentation.
# Note: This list has been copied over largely as-is. Some of the abbreviations
# are extremely domain-specific. Tokenizer performance may benefit from some
# slight pruning, although no performance regression has been observed so far.
abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.',
'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.',
'a.t.d.f.', 'a.u.b.', 'a.v.a.', 'a.w.', 'aanbev.',
'aanbev.comm.', 'aant.', 'aanv.st.', 'aanw.', 'vnw.',
'aanw.vnw.', 'abd.', 'abm.', 'abs.', 'acc.act.',
'acc.bedr.m.', 'acc.bedr.t.', 'achterv.', 'act.dr.',
'act.dr.fam.', 'act.fisc.', 'act.soc.', 'adm.akk.',
'adm.besl.', 'adm.lex.', 'adm.onderr.', 'adm.ov.', 'adv.',
'adv.', 'gen.', 'adv.bl.', 'afd.', 'afl.', 'aggl.verord.',
'agr.', 'al.', 'alg.', 'alg.richts.', 'amén.', 'ann.dr.',
'ann.dr.lg.', 'ann.dr.sc.pol.', 'ann.ét.eur.',
'ann.fac.dr.lg.', 'ann.jur.créd.',
'ann.jur.créd.règl.coll.', 'ann.not.', 'ann.parl.',
'ann.prat.comm.', 'app.', 'arb.', 'aud.', 'arbbl.',
'arbh.', 'arbit.besl.', 'arbrb.', 'arr.', 'arr.cass.',
'arr.r.v.st.', 'arr.verbr.', 'arrondrb.', 'art.', 'artw.',
'aud.', 'b.', 'b.', 'b.&w.', 'b.a.', 'b.a.s.', 'b.b.o.',
'b.best.dep.', 'b.br.ex.', 'b.coll.fr.gem.comm.',
'b.coll.vl.gem.comm.', 'b.d.cult.r.', 'b.d.gem.ex.',
'b.d.gem.reg.', 'b.dep.', 'b.e.b.', 'b.f.r.',
'b.fr.gem.ex.', 'b.fr.gem.reg.', 'b.i.h.', 'b.inl.j.d.',
'b.inl.s.reg.', 'b.j.', 'b.l.', 'b.o.z.', 'b.prov.r.',
'b.r.h.', 'b.s.', 'b.sr.', 'b.stb.', 'b.t.i.r.',
'b.t.s.z.', 'b.t.w.rev.', 'b.v.',
'b.ver.coll.gem.gem.comm.', 'b.verg.r.b.', 'b.versl.',
'b.vl.ex.', 'b.voorl.reg.', 'b.w.', 'b.w.gew.ex.',
'b.z.d.g.', 'b.z.v.', 'bab.', 'bedr.org.', 'begins.',
'beheersov.', 'bekendm.comm.', 'bel.', 'bel.besch.',
'bel.w.p.', 'beleidsov.', 'belg.', 'grondw.', 'ber.',
'ber.w.', 'besch.', 'besl.', 'beslagr.', 'bestuurswet.',
'bet.', 'betr.', 'betr.', 'vnw.', 'bevest.', 'bew.',
'bijbl.', 'ind.', 'eig.', 'bijbl.n.bijdr.', 'bijl.',
'bijv.', 'bijw.', 'bijz.decr.', 'bin.b.', 'bkh.', 'bl.',
'blz.', 'bm.', 'bn.', 'rh.', 'bnw.', 'bouwr.', 'br.parl.',
'bs.', 'bull.', 'bull.adm.pénit.', 'bull.ass.',
'bull.b.m.m.', 'bull.bel.', 'bull.best.strafinr.',
'bull.bmm.', 'bull.c.b.n.', 'bull.c.n.c.', 'bull.cbn.',
'bull.centr.arb.', 'bull.cnc.', 'bull.contr.',
'bull.doc.min.fin.', 'bull.f.e.b.', 'bull.feb.',
'bull.fisc.fin.r.', 'bull.i.u.m.',
'bull.inf.ass.secr.soc.', 'bull.inf.i.e.c.',
'bull.inf.i.n.a.m.i.', 'bull.inf.i.r.e.', 'bull.inf.iec.',
'bull.inf.inami.', 'bull.inf.ire.', 'bull.inst.arb.',
'bull.ium.', 'bull.jur.imm.', 'bull.lég.b.', 'bull.off.',
'bull.trim.b.dr.comp.', 'bull.us.', 'bull.v.b.o.',
'bull.vbo.', 'bv.', 'bw.', 'bxh.', 'byz.', 'c.', 'c.a.',
'c.a.-a.', 'c.a.b.g.', 'c.c.', 'c.c.i.', 'c.c.s.',
'c.conc.jur.', 'c.d.e.', 'c.d.p.k.', 'c.e.', 'c.ex.',
'c.f.', 'c.h.a.', 'c.i.f.', 'c.i.f.i.c.', 'c.j.', 'c.l.',
'c.n.', 'c.o.d.', 'c.p.', 'c.pr.civ.', 'c.q.', 'c.r.',
'c.r.a.', 'c.s.', 'c.s.a.', 'c.s.q.n.', 'c.v.', 'c.v.a.',
'c.v.o.', 'ca.', 'cadeaust.', 'cah.const.',
'cah.dr.europ.', 'cah.dr.immo.', 'cah.dr.jud.', 'cal.',
'2d.', 'cal.', '3e.', 'cal.', 'rprt.', 'cap.', 'carg.',
'cass.', 'cass.', 'verw.', 'cert.', 'cf.', 'ch.', 'chron.',
'chron.d.s.', 'chron.dr.not.', 'cie.', 'cie.',
'verz.schr.', 'cir.', 'circ.', 'circ.z.', 'cit.',
'cit.loc.', 'civ.', 'cl.et.b.', 'cmt.', 'co.',
'cognoss.v.', 'coll.', 'v.', 'b.', 'colp.w.', 'com.',
'com.', 'cas.', 'com.v.min.', 'comm.', 'comm.', 'v.',
'comm.bijz.ov.', 'comm.erf.', 'comm.fin.', 'comm.ger.',
'comm.handel.', 'comm.pers.', 'comm.pub.', 'comm.straf.',
'comm.v.', 'comm.venn.', 'comm.verz.', 'comm.voor.',
'comp.', 'compt.w.', 'computerr.', 'con.m.', 'concl.',
'concr.', 'conf.', 'confl.w.', 'confl.w.huwbetr.', 'cons.',
'conv.', 'coöp.', 'ver.', 'corr.', 'corr.bl.',
'cour.fisc.', 'cour.immo.', 'cridon.', 'crim.', 'cur.',
'cur.', 'crt.', 'curs.', 'd.', 'd.-g.', 'd.a.', 'd.a.v.',
'd.b.f.', 'd.c.', 'd.c.c.r.', 'd.d.', 'd.d.p.', 'd.e.t.',
'd.gem.r.', 'd.h.', 'd.h.z.', 'd.i.', 'd.i.t.', 'd.j.',
'd.l.r.', 'd.m.', 'd.m.v.', 'd.o.v.', 'd.parl.', 'd.w.z.',
'dact.', 'dat.', 'dbesch.', 'dbesl.', 'decr.', 'decr.d.',
'decr.fr.', 'decr.vl.', 'decr.w.', 'def.', 'dep.opv.',
'dep.rtl.', 'derg.', 'desp.', 'det.mag.', 'deurw.regl.',
'dez.', 'dgl.', 'dhr.', 'disp.', 'diss.', 'div.',
'div.act.', 'div.bel.', 'dl.', 'dln.', 'dnotz.', 'doc.',
'hist.', 'doc.jur.b.', 'doc.min.fin.', 'doc.parl.',
'doctr.', 'dpl.', 'dpl.besl.', 'dr.', 'dr.banc.fin.',
'dr.circ.', 'dr.inform.', 'dr.mr.', 'dr.pén.entr.',
'dr.q.m.', 'drs.', 'dtp.', 'dwz.', 'dyn.', 'e.', 'e.a.',
'e.b.', 'tek.mod.', 'e.c.', 'e.c.a.', 'e.d.', 'e.e.',
'e.e.a.', 'e.e.g.', 'e.g.', 'e.g.a.', 'e.h.a.', 'e.i.',
'e.j.', 'e.m.a.', 'e.n.a.c.', 'e.o.', 'e.p.c.', 'e.r.c.',
'e.r.f.', 'e.r.h.', 'e.r.o.', 'e.r.p.', 'e.r.v.',
'e.s.r.a.', 'e.s.t.', 'e.v.', 'e.v.a.', 'e.w.', 'e&o.e.',
'ec.pol.r.', 'econ.', 'ed.', 'ed(s).', 'eff.', 'eig.',
'eig.mag.', 'eil.', 'elektr.', 'enmb.', 'enz.', 'err.',
'etc.', 'etq.', 'eur.', 'parl.', 'eur.t.s.', 'ev.', 'evt.',
'ex.', 'ex.crim.', 'exec.', 'f.', 'f.a.o.', 'f.a.q.',
'f.a.s.', 'f.i.b.', 'f.j.f.', 'f.o.b.', 'f.o.r.', 'f.o.s.',
'f.o.t.', 'f.r.', 'f.supp.', 'f.suppl.', 'fa.', 'facs.',
'fasc.', 'fg.', 'fid.ber.', 'fig.', 'fin.verh.w.', 'fisc.',
'fisc.', 'tijdschr.', 'fisc.act.', 'fisc.koer.', 'fl.',
'form.', 'foro.', 'it.', 'fr.', 'fr.cult.r.', 'fr.gem.r.',
'fr.parl.', 'fra.', 'ft.', 'g.', 'g.a.', 'g.a.v.',
'g.a.w.v.', 'g.g.d.', 'g.m.t.', 'g.o.', 'g.omt.e.', 'g.p.',
'g.s.', 'g.v.', 'g.w.w.', 'geb.', 'gebr.', 'gebrs.',
'gec.', 'gec.decr.', 'ged.', 'ged.st.', 'gedipl.',
'gedr.st.', 'geh.', 'gem.', 'gem.', 'gem.',
'gem.gem.comm.', 'gem.st.', 'gem.stem.', 'gem.w.',
'gemeensch.optr.', 'gemeensch.standp.', 'gemeensch.strat.',
'gemeent.', 'gemeent.b.', 'gemeent.regl.',
'gemeent.verord.', 'geol.', 'geopp.', 'gepubl.',
'ger.deurw.', 'ger.w.', 'gerekw.', 'gereq.', 'gesch.',
'get.', 'getr.', 'gev.m.', 'gev.maatr.', 'gew.', 'ghert.',
'gir.eff.verk.', 'gk.', 'gr.', 'gramm.', 'grat.w.',
'grootb.w.', 'grs.', 'grvm.', 'grw.', 'gst.', 'gw.',
'h.a.', 'h.a.v.o.', 'h.b.o.', 'h.e.a.o.', 'h.e.g.a.',
'h.e.geb.', 'h.e.gestr.', 'h.l.', 'h.m.', 'h.o.', 'h.r.',
'h.t.l.', 'h.t.m.', 'h.w.geb.', 'hand.', 'handelsn.w.',
'handelspr.', 'handelsr.w.', 'handelsreg.w.', 'handv.',
'harv.l.rev.', 'hc.', 'herald.', 'hert.', 'herz.',
'hfdst.', 'hfst.', 'hgrw.', 'hhr.', 'hist.', 'hooggel.',
'hoogl.', 'hosp.', 'hpw.', 'hr.', 'hr.', 'ms.', 'hr.ms.',
'hregw.', 'hrg.', 'hst.', 'huis.just.', 'huisv.w.',
'huurbl.', 'hv.vn.', 'hw.', 'hyp.w.', 'i.b.s.', 'i.c.',
'i.c.m.h.', 'i.e.', 'i.f.', 'i.f.p.', 'i.g.v.', 'i.h.',
'i.h.a.', 'i.h.b.', 'i.l.pr.', 'i.o.', 'i.p.o.', 'i.p.r.',
'i.p.v.', 'i.pl.v.', 'i.r.d.i.', 'i.s.m.', 'i.t.t.',
'i.v.', 'i.v.m.', 'i.v.s.', 'i.w.tr.', 'i.z.', 'ib.',
'ibid.', 'icip-ing.cons.', 'iem.', 'indic.soc.', 'indiv.',
'inf.', 'inf.i.d.a.c.', 'inf.idac.', 'inf.r.i.z.i.v.',
'inf.riziv.', 'inf.soc.secr.', 'ing.', 'ing.', 'cons.',
'ing.cons.', 'inst.', 'int.', 'int.', 'rechtsh.',
'strafz.', 'interm.', 'intern.fisc.act.',
'intern.vervoerr.', 'inv.', 'inv.', 'f.', 'inv.w.',
'inv.wet.', 'invord.w.', 'inz.', 'ir.', 'irspr.', 'iwtr.',
'j.', 'j.-cl.', 'j.c.b.', 'j.c.e.', 'j.c.fl.', 'j.c.j.',
'j.c.p.', 'j.d.e.', 'j.d.f.', 'j.d.s.c.', 'j.dr.jeun.',
'j.j.d.', 'j.j.p.', 'j.j.pol.', 'j.l.', 'j.l.m.b.',
'j.l.o.', 'j.p.a.', 'j.r.s.', 'j.t.', 'j.t.d.e.',
'j.t.dr.eur.', 'j.t.o.', 'j.t.t.', 'jaarl.', 'jb.hand.',
'jb.kred.', 'jb.kred.c.s.', 'jb.l.r.b.', 'jb.lrb.',
'jb.markt.', 'jb.mens.', 'jb.t.r.d.', 'jb.trd.',
'jeugdrb.', 'jeugdwerkg.w.', 'jg.', 'jis.', 'jl.',
'journ.jur.', 'journ.prat.dr.fisc.fin.', 'journ.proc.',
'jrg.', 'jur.', 'jur.comm.fl.', 'jur.dr.soc.b.l.n.',
'jur.f.p.e.', 'jur.fpe.', 'jur.niv.', 'jur.trav.brux.',
'jurambt.', 'jv.cass.', 'jv.h.r.j.', 'jv.hrj.', 'jw.',
'k.', 'k.', 'k.b.', 'k.g.', 'k.k.', 'k.m.b.o.', 'k.o.o.',
'k.v.k.', 'k.v.v.v.', 'kadasterw.', 'kaderb.', 'kador.',
'kbo-nr.', 'kg.', 'kh.', 'kiesw.', 'kind.bes.v.', 'kkr.',
'koopv.', 'kr.', 'krankz.w.', 'ksbel.', 'kt.', 'ktg.',
'ktr.', 'kvdm.', 'kw.r.', 'kymr.', 'kzr.', 'kzw.', 'l.',
'l.b.', 'l.b.o.', 'l.bas.', 'l.c.', 'l.gew.', 'l.j.',
'l.k.', 'l.l.', 'l.o.', 'l.r.b.', 'l.u.v.i.', 'l.v.r.',
'l.v.w.', 'l.w.', "l'exp.-compt.b..", 'lexp.-compt.b.',
'landinr.w.', 'landscrt.', 'lat.', 'law.ed.', 'lett.',
'levensverz.', 'lgrs.', 'lidw.', 'limb.rechtsl.', 'lit.',
'litt.', 'liw.', 'liwet.', 'lk.', 'll.', 'll.(l.)l.r.',
'loonw.', 'losbl.', 'ltd.', 'luchtv.', 'luchtv.w.', 'm.',
'm.', 'not.', 'm.a.v.o.', 'm.a.w.', 'm.b.', 'm.b.o.',
'm.b.r.', 'm.b.t.', 'm.d.g.o.', 'm.e.a.o.', 'm.e.r.',
'm.h.', 'm.h.d.', 'm.i.v.', 'm.j.t.', 'm.k.', 'm.m.',
'm.m.a.', 'm.m.h.h.', 'm.m.v.', 'm.n.', 'm.not.fisc.',
'm.nt.', 'm.o.', 'm.r.', 'm.s.a.', 'm.u.p.', 'm.v.a.',
'm.v.h.n.', 'm.v.t.', 'm.z.', 'maatr.teboekgest.luchtv.',
'maced.', 'mand.', 'max.', 'mbl.not.', 'me.', 'med.',
'med.', 'v.b.o.', 'med.b.u.f.r.', 'med.bufr.', 'med.vbo.',
'meerv.', 'meetbr.w.', 'mém.adm.', 'mgr.', 'mgrs.', 'mhd.',
'mi.verantw.', 'mil.', 'mil.bed.', 'mil.ger.', 'min.',
'min.', 'aanbev.', 'min.', 'circ.', 'min.', 'fin.',
'min.j.omz.', 'min.just.circ.', 'mitt.', 'mnd.', 'mod.',
'mon.', 'mouv.comm.', 'mr.', 'ms.', 'muz.', 'mv.', 'n.',
'chr.', 'n.a.', 'n.a.g.', 'n.a.v.', 'n.b.', 'n.c.',
'n.chr.', 'n.d.', 'n.d.r.', 'n.e.a.', 'n.g.', 'n.h.b.c.',
'n.j.', 'n.j.b.', 'n.j.w.', 'n.l.', 'n.m.', 'n.m.m.',
'n.n.', 'n.n.b.', 'n.n.g.', 'n.n.k.', 'n.o.m.', 'n.o.t.k.',
'n.rapp.', 'n.tijd.pol.', 'n.v.', 'n.v.d.r.', 'n.v.d.v.',
'n.v.o.b.', 'n.v.t.', 'nat.besch.w.', 'nat.omb.',
'nat.pers.', 'ned.cult.r.', 'neg.verkl.', 'nhd.', 'wisk.',
'njcm-bull.', 'nl.', 'nnd.', 'no.', 'not.fisc.m.',
'not.w.', 'not.wet.', 'nr.', 'nrs.', 'nste.', 'nt.',
'numism.', 'o.', 'o.a.', 'o.b.', 'o.c.', 'o.g.', 'o.g.v.',
'o.i.', 'o.i.d.', 'o.m.', 'o.o.', 'o.o.d.', 'o.o.v.',
'o.p.', 'o.r.', 'o.regl.', 'o.s.', 'o.t.s.', 'o.t.t.',
'o.t.t.t.', 'o.t.t.z.', 'o.tk.t.', 'o.v.t.', 'o.v.t.t.',
'o.v.tk.t.', 'o.v.v.', 'ob.', 'obsv.', 'octr.',
'octr.gem.regl.', 'octr.regl.', 'oe.', 'off.pol.', 'ofra.',
'ohd.', 'omb.', 'omnil.', 'omz.', 'on.ww.', 'onderr.',
'onfrank.', 'onteig.w.', 'ontw.', 'b.w.', 'onuitg.',
'onz.', 'oorl.w.', 'op.cit.', 'opin.pa.', 'opm.', 'or.',
'ord.br.', 'ord.gem.', 'ors.', 'orth.', 'os.', 'osm.',
'ov.', 'ov.w.i.', 'ov.w.ii.', 'ov.ww.', 'overg.w.',
'overw.', 'ovkst.', 'oz.', 'p.', 'p.a.', 'p.a.o.',
'p.b.o.', 'p.e.', 'p.g.', 'p.j.', 'p.m.', 'p.m.a.', 'p.o.',
'p.o.j.t.', 'p.p.', 'p.v.', 'p.v.s.', 'pachtw.', 'pag.',
'pan.', 'pand.b.', 'pand.pér.', 'parl.gesch.',
'parl.gesch.', 'inv.', 'parl.st.', 'part.arb.', 'pas.',
'pasin.', 'pat.', 'pb.c.', 'pb.l.', 'pens.',
'pensioenverz.', 'per.ber.i.b.r.', 'per.ber.ibr.', 'pers.',
'st.', 'pft.', 'pk.', 'pktg.', 'plv.', 'po.', 'pol.',
'pol.off.', 'pol.r.', 'pol.w.', 'postbankw.', 'postw.',
'pp.', 'pr.', 'preadv.', 'pres.', 'prf.', 'prft.', 'prg.',
'prijz.w.', 'proc.', 'procesregl.', 'prof.', 'prot.',
'prov.', 'prov.b.', 'prov.instr.h.m.g.', 'prov.regl.',
'prov.verord.', 'prov.w.', 'publ.', 'pun.', 'pw.',
'q.b.d.', 'q.e.d.', 'q.q.', 'q.r.', 'r.', 'r.a.b.g.',
'r.a.c.e.', 'r.a.j.b.', 'r.b.d.c.', 'r.b.d.i.', 'r.b.s.s.',
'r.c.', 'r.c.b.', 'r.c.d.c.', 'r.c.j.b.', 'r.c.s.j.',
'r.cass.', 'r.d.c.', 'r.d.i.', 'r.d.i.d.c.', 'r.d.j.b.',
'r.d.j.p.', 'r.d.p.c.', 'r.d.s.', 'r.d.t.i.', 'r.e.',
'r.f.s.v.p.', 'r.g.a.r.', 'r.g.c.f.', 'r.g.d.c.', 'r.g.f.',
'r.g.z.', 'r.h.a.', 'r.i.c.', 'r.i.d.a.', 'r.i.e.j.',
'r.i.n.', 'r.i.s.a.', 'r.j.d.a.', 'r.j.i.', 'r.k.', 'r.l.',
'r.l.g.b.', 'r.med.', 'r.med.rechtspr.', 'r.n.b.', 'r.o.',
'r.ov.', 'r.p.', 'r.p.d.b.', 'r.p.o.t.', 'r.p.r.j.',
'r.p.s.', 'r.r.d.', 'r.r.s.', 'r.s.', 'r.s.v.p.',
'r.stvb.', 'r.t.d.f.', 'r.t.d.h.', 'r.t.l.',
'r.trim.dr.eur.', 'r.v.a.', 'r.verkb.', 'r.w.', 'r.w.d.',
'rap.ann.c.a.', 'rap.ann.c.c.', 'rap.ann.c.e.',
'rap.ann.c.s.j.', 'rap.ann.ca.', 'rap.ann.cass.',
'rap.ann.cc.', 'rap.ann.ce.', 'rap.ann.csj.', 'rapp.',
'rb.', 'rb.kh.', 'rdn.', 'rdnr.', 're.pers.', 'rec.',
'rec.c.i.j.', 'rec.c.j.c.e.', 'rec.cij.', 'rec.cjce.',
'rec.gén.enr.not.', 'rechtsk.t.', 'rechtspl.zeem.',
'rechtspr.arb.br.', 'rechtspr.b.f.e.', 'rechtspr.bfe.',
'rechtspr.soc.r.b.l.n.', 'recl.reg.', 'rect.', 'red.',
'reg.', 'reg.huiz.bew.', 'reg.w.', 'registr.w.', 'regl.',
'regl.', 'r.v.k.', 'regl.besl.', 'regl.onderr.',
'regl.r.t.', 'rep.', 'rép.fisc.', 'rép.not.', 'rep.r.j.',
'rep.rj.', 'req.', 'res.', 'resp.', 'rev.', 'rev.',
'comp.', 'rev.', 'trim.', 'civ.', 'rev.', 'trim.', 'comm.',
'rev.acc.trav.', 'rev.adm.', 'rev.b.compt.',
'rev.b.dr.const.', 'rev.b.dr.intern.', 'rev.b.séc.soc.',
'rev.banc.fin.', 'rev.comm.', 'rev.cons.prud.',
'rev.dr.b.', 'rev.dr.commun.', 'rev.dr.étr.',
'rev.dr.fam.', 'rev.dr.intern.comp.', 'rev.dr.mil.',
'rev.dr.min.', 'rev.dr.pén.', 'rev.dr.pén.mil.',
'rev.dr.rur.', 'rev.dr.u.l.b.', 'rev.dr.ulb.', 'rev.exp.',
'rev.faill.', 'rev.fisc.', 'rev.gd.', 'rev.hist.dr.',
'rev.i.p.c.', 'rev.ipc.', 'rev.not.b.',
'rev.prat.dr.comm.', 'rev.prat.not.b.', 'rev.prat.soc.',
'rev.rec.', 'rev.rw.', 'rev.trav.', 'rev.trim.d.h.',
'rev.trim.dr.fam.', 'rev.urb.', 'richtl.', 'riv.dir.int.',
'riv.dir.int.priv.proc.', 'rk.', 'rln.', 'roln.', 'rom.',
'rondz.', 'rov.', 'rtl.', 'rubr.', 'ruilv.wet.',
'rv.verdr.', 'rvkb.', 's.', 's.', 's.a.', 's.b.n.',
's.ct.', 's.d.', 's.e.c.', 's.e.et.o.', 's.e.w.',
's.exec.rept.', 's.hrg.', 's.j.b.', 's.l.', 's.l.e.a.',
's.l.n.d.', 's.p.a.', 's.s.', 's.t.', 's.t.b.', 's.v.',
's.v.p.', 'samenw.', 'sc.', 'sch.', 'scheidsr.uitspr.',
'schepel.besl.', 'secr.comm.', 'secr.gen.', 'sect.soc.',
'sess.', 'cas.', 'sir.', 'soc.', 'best.', 'soc.', 'handv.',
'soc.', 'verz.', 'soc.act.', 'soc.best.', 'soc.kron.',
'soc.r.', 'soc.sw.', 'soc.weg.', 'sofi-nr.', 'somm.',
'somm.ann.', 'sp.c.c.', 'sr.', 'ss.', 'st.doc.b.c.n.a.r.',
'st.doc.bcnar.', 'st.vw.', 'stagever.', 'stas.', 'stat.',
'stb.', 'stbl.', 'stcrt.', 'stud.dipl.', 'su.', 'subs.',
'subst.', 'succ.w.', 'suppl.', 'sv.', 'sw.', 't.', 't.a.',
't.a.a.', 't.a.n.', 't.a.p.', 't.a.s.n.', 't.a.v.',
't.a.v.w.', 't.aann.', 't.acc.', 't.agr.r.', 't.app.',
't.b.b.r.', 't.b.h.', 't.b.m.', 't.b.o.', 't.b.p.',
't.b.r.', 't.b.s.', 't.b.v.', 't.bankw.', 't.belg.not.',
't.desk.', 't.e.m.', 't.e.p.', 't.f.r.', 't.fam.',
't.fin.r.', 't.g.r.', 't.g.t.', 't.g.v.', 't.gem.',
't.gez.', 't.huur.', 't.i.n.', 't.j.k.', 't.l.l.',
't.l.v.', 't.m.', 't.m.r.', 't.m.w.', 't.mil.r.',
't.mil.strafr.', 't.not.', 't.o.', 't.o.r.b.', 't.o.v.',
't.ontv.', 't.p.r.', 't.pol.', 't.r.', 't.r.g.',
't.r.o.s.', 't.r.v.', 't.s.r.', 't.strafr.', 't.t.',
't.u.', 't.v.c.', 't.v.g.', 't.v.m.r.', 't.v.o.', 't.v.v.',
't.v.v.d.b.', 't.v.w.', 't.verz.', 't.vred.', 't.vreemd.',
't.w.', 't.w.k.', 't.w.v.', 't.w.v.r.', 't.wrr.', 't.z.',
't.z.t.', 't.z.v.', 'taalk.', 'tar.burg.z.', 'td.',
'techn.', 'telecomm.', 'toel.', 'toel.st.v.w.', 'toep.',
'toep.regl.', 'tom.', 'top.', 'trans.b.', 'transp.r.',
'trb.', 'trib.', 'trib.civ.', 'trib.gr.inst.', 'ts.',
'ts.', 'best.', 'ts.', 'verv.', 'turnh.rechtsl.', 'tvpol.',
'tvpr.', 'tvrechtsgesch.', 'tw.', 'u.', 'u.a.', 'u.a.r.',
'u.a.v.', 'u.c.', 'u.c.c.', 'u.g.', 'u.p.', 'u.s.',
'u.s.d.c.', 'uitdr.', 'uitl.w.', 'uitv.besch.div.b.',
'uitv.besl.', 'uitv.besl.', 'succ.w.', 'uitv.besl.bel.rv.',
'uitv.besl.l.b.', 'uitv.reg.', 'inv.w.', 'uitv.reg.bel.d.',
'uitv.reg.afd.verm.', 'uitv.reg.lb.', 'uitv.reg.succ.w.',
'univ.', 'univ.verkl.', 'v.', 'v.', 'chr.', 'v.a.',
'v.a.v.', 'v.c.', 'v.chr.', 'v.h.', 'v.huw.verm.', 'v.i.',
'v.i.o.', 'v.k.a.', 'v.m.', 'v.o.f.', 'v.o.n.',
'v.onderh.verpl.', 'v.p.', 'v.r.', 'v.s.o.', 'v.t.t.',
'v.t.t.t.', 'v.tk.t.', 'v.toep.r.vert.', 'v.v.b.',
'v.v.g.', 'v.v.t.', 'v.v.t.t.', 'v.v.tk.t.', 'v.w.b.',
'v.z.m.', 'vb.', 'vb.bo.', 'vbb.', 'vc.', 'vd.', 'veldw.',
'ver.k.', 'ver.verg.gem.', 'gem.comm.', 'verbr.', 'verd.',
'verdr.', 'verdr.v.', 'tek.mod.', 'verenw.', 'verg.',
'verg.fr.gem.', 'comm.', 'verkl.', 'verkl.herz.gw.',
'verl.', 'deelw.', 'vern.', 'verord.', 'vers.r.',
'versch.', 'versl.c.s.w.', 'versl.csw.', 'vert.', 'verw.',
'verz.', 'verz.w.', 'verz.wett.besl.',
'verz.wett.decr.besl.', 'vgl.', 'vid.', 'viss.w.',
'vl.parl.', 'vl.r.', 'vl.t.gez.', 'vl.w.reg.',
'vl.w.succ.', 'vlg.', 'vn.', 'vnl.', 'vnw.', 'vo.',
'vo.bl.', 'voegw.', 'vol.', 'volg.', 'volt.', 'deelw.',
'voorl.', 'voorz.', 'vord.w.', 'vorst.d.', 'vr.', 'vred.',
'vrg.', 'vnw.', 'vrijgrs.', 'vs.', 'vt.', 'vw.', 'vz.',
'vzngr.', 'vzr.', 'w.', 'w.a.', 'w.b.r.', 'w.c.h.',
'w.conf.huw.', 'w.conf.huwelijksb.', 'w.consum.kr.',
'w.f.r.', 'w.g.', 'w.gew.r.', 'w.ident.pl.', 'w.just.doc.',
'w.kh.', 'w.l.r.', 'w.l.v.', 'w.mil.straf.spr.', 'w.n.',
'w.not.ambt.', 'w.o.', 'w.o.d.huurcomm.', 'w.o.d.k.',
'w.openb.manif.', 'w.parl.', 'w.r.', 'w.reg.', 'w.succ.',
'w.u.b.', 'w.uitv.pl.verord.', 'w.v.', 'w.v.k.',
'w.v.m.s.', 'w.v.r.', 'w.v.w.', 'w.venn.', 'wac.', 'wd.',
'wetb.', 'n.v.h.', 'wgb.', 'winkelt.w.', 'wisk.',
'wka-verkl.', 'wnd.', 'won.w.', 'woningw.', 'woonr.w.',
'wrr.', 'wrr.ber.', 'wrsch.', 'ws.', 'wsch.', 'wsr.',
'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.',
'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.',
'zr.', 'zr.', 'ms.', 'zr.ms.']
_exc = {}
for orth in abbrevs:
    _exc[orth] = [{ORTH: orth}]
    uppered = orth.upper()
    capsed = orth.capitalize()
    for i in [uppered, capsed]:
        _exc[i] = [{ORTH: i}]
TOKENIZER_EXCEPTIONS = _exc
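
The loop above registers every abbreviation together with its upper-case and capitalized forms, each as a single-token exception. A minimal standalone sketch of that expansion follows. `ORTH` here is a plain-string stand-in for the symbol the real module imports, and `build_case_variants` is a hypothetical helper name used only for illustration, not part of spaCy.

```python
# Stand-in for spaCy's ORTH symbol (assumption: the real module imports it
# from the package's symbols module; a plain string is enough for a demo).
ORTH = "orth"


def build_case_variants(abbrevs):
    """Map each abbreviation and its upper-case / capitalized forms
    to a single-token exception entry."""
    exc = {}
    for orth in abbrevs:
        for form in (orth, orth.upper(), orth.capitalize()):
            exc[form] = [{ORTH: form}]
    return exc


# Three abbreviations taken from the list above.
exceptions = build_case_variants(["o.a.", "nr.", "jur.comm.fl."])
for key in sorted(exceptions):
    print(key, exceptions[key])
```

Note that `str.capitalize()` upper-cases only the first character and lower-cases the rest, so `"o.a."` yields `"O.a."`, while `str.upper()` yields `"O.A."`. Mixed forms such as `"O.A.v."` would not be covered by this scheme.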

@ -1,7 +1,7 @@
# encoding: utf8
from __future__ import unicode_literals
-from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX,VERB
+from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX, VERB
from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ
# Source: Korakot Chaovavanich
@ -17,8 +17,8 @@ TAG_MAP = {
"CFQC": {POS: NOUN},
"CVBL": {POS: NOUN},
# VERB
-"VACT":{POS:VERB},
-"VSTA":{POS:VERB},
+"VACT": {POS: VERB},
+"VSTA": {POS: VERB},
# PRON
"PRON": {POS: PRON},
"NPRP": {POS: PRON},

@ -5,6 +5,320 @@ from ...symbols import ORTH, LEMMA
_exc = {
#หน่วยงานรัฐ / government agency
"กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}],
"กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}],
"กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}],
"กบข.": [{ORTH: "กบข.", LEMMA: "กองทุนบำเหน็จบำนาญข้าราชการพลเรือน"}],
"กบว.": [{ORTH: "กบว.", LEMMA: "คณะกรรมการบริหารวิทยุกระจายเสียงและวิทยุโทรทัศน์"}],
"กปน.": [{ORTH: "กปน.", LEMMA: "การประปานครหลวง"}],
"กปภ.": [{ORTH: "กปภ.", LEMMA: "การประปาส่วนภูมิภาค"}],
"กปส.": [{ORTH: "กปส.", LEMMA: "กรมประชาสัมพันธ์"}],
"กผม.": [{ORTH: "กผม.", LEMMA: "กองผังเมือง"}],
"กฟน.": [{ORTH: "กฟน.", LEMMA: "การไฟฟ้านครหลวง"}],
"กฟผ.": [{ORTH: "กฟผ.", LEMMA: "การไฟฟ้าฝ่ายผลิตแห่งประเทศไทย"}],
"กฟภ.": [{ORTH: "กฟภ.", LEMMA: "การไฟฟ้าส่วนภูมิภาค"}],
"ก.ช.น.": [{ORTH: "ก.ช.น.", LEMMA: "คณะกรรมการช่วยเหลือชาวนาชาวไร่"}],
"กยศ.": [{ORTH: "กยศ.", LEMMA: "กองทุนเงินให้กู้ยืมเพื่อการศึกษา"}],
"ก.ล.ต.": [{ORTH: "ก.ล.ต.", LEMMA: "คณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์"}],
"กศ.บ.": [{ORTH: "กศ.บ.", LEMMA: "การศึกษาบัณฑิต"}],
"กศน.": [{ORTH: "กศน.", LEMMA: "กรมการศึกษานอกโรงเรียน"}],
"กสท.": [{ORTH: "กสท.", LEMMA: "การสื่อสารแห่งประเทศไทย"}],
"กอ.รมน.": [{ORTH: "กอ.รมน.", LEMMA: "กองอำนวยการรักษาความมั่นคงภายใน"}],
"กร.": [{ORTH: "กร.", LEMMA: "กองเรือยุทธการ"}],
"ขสมก.": [{ORTH: "ขสมก.", LEMMA: "องค์การขนส่งมวลชนกรุงเทพ"}],
"คตง.": [{ORTH: "คตง.", LEMMA: "คณะกรรมการตรวจเงินแผ่นดิน"}],
"ครม.": [{ORTH: "ครม.", LEMMA: "คณะรัฐมนตรี"}],
"คมช.": [{ORTH: "คมช.", LEMMA: "คณะมนตรีความมั่นคงแห่งชาติ"}],
"ตชด.": [{ORTH: "ตชด.", LEMMA: "ตำรวจตะเวนชายเดน"}],
"ตม.": [{ORTH: "ตม.", LEMMA: "กองตรวจคนเข้าเมือง"}],
"ตร.": [{ORTH: "ตร.", LEMMA: "ตำรวจ"}],
"ททท.": [{ORTH: "ททท.", LEMMA: "การท่องเที่ยวแห่งประเทศไทย"}],
"ททบ.": [{ORTH: "ททบ.", LEMMA: "สถานีวิทยุโทรทัศน์กองทัพบก"}],
"ทบ.": [{ORTH: "ทบ.", LEMMA: "กองทัพบก"}],
"ทร.": [{ORTH: "ทร.", LEMMA: "กองทัพเรือ"}],
"ทอ.": [{ORTH: "ทอ.", LEMMA: "กองทัพอากาศ"}],
"ทอท.": [{ORTH: "ทอท.", LEMMA: "การท่าอากาศยานแห่งประเทศไทย"}],
"ธ.ก.ส.": [{ORTH: "ธ.ก.ส.", LEMMA: "ธนาคารเพื่อการเกษตรและสหกรณ์การเกษตร"}],
"ธปท.": [{ORTH: "ธปท.", LEMMA: "ธนาคารแห่งประเทศไทย"}],
"ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}],
"นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}],
"ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}],
"ป.ป.ช.": [{ORTH: "ป.ป.ช.", LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ"}],
"ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}],
"บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}],
"บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}],
"พสวท.": [{ORTH: "พสวท.", LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี"}],
"มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}],
"ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}],
"รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}],
"รฟท.": [{ORTH: "รฟท.", LEMMA: "การรถไฟแห่งประเทศไทย"}],
"รฟม.": [{ORTH: "รฟม.", LEMMA: "การรถไฟฟ้าขนส่งมวลชนแห่งประเทศไทย"}],
"ศธ.": [{ORTH: "ศธ.", LEMMA: "กระทรวงศึกษาธิการ"}],
"ศนธ.": [{ORTH: "ศนธ.", LEMMA: "ศูนย์กลางนิสิตนักศึกษาแห่งประเทศไทย"}],
"สกจ.": [{ORTH: "สกจ.", LEMMA: "สหกรณ์จังหวัด"}],
"สกท.": [{ORTH: "สกท.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมการลงทุน"}],
"สกว.": [{ORTH: "สกว.", LEMMA: "สำนักงานกองทุนสนับสนุนการวิจัย"}],
"สคบ.": [{ORTH: "สคบ.", LEMMA: "สำนักงานคณะกรรมการคุ้มครองผู้บริโภค"}],
"สจร.": [{ORTH: "สจร.", LEMMA: "สำนักงานคณะกรรมการจัดระบบการจราจรทางบก"}],
"สตง.": [{ORTH: "สตง.", LEMMA: "สำนักงานตรวจเงินแผ่นดิน"}],
"สทท.": [{ORTH: "สทท.", LEMMA: "สถานีวิทยุโทรทัศน์แห่งประเทศไทย"}],
"สทร.": [{ORTH: "สทร.", LEMMA: "สำนักงานกลางทะเบียนราษฎร์"}],
"สธ": [{ORTH: "สธ", LEMMA: "กระทรวงสาธารณสุข"}],
"สนช.": [{ORTH: "สนช.", LEMMA: "สภานิติบัญญัติแห่งชาติ,สำนักงานนวัตกรรมแห่งชาติ"}],
"สนนท.": [{ORTH: "สนนท.", LEMMA: "สหพันธ์นิสิตนักศึกษาแห่งประเทศไทย"}],
"สปก.": [{ORTH: "สปก.", LEMMA: "สำนักงานการปฏิรูปที่ดินเพื่อเกษตรกรรม"}],
"สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}],
"สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}],
"สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}],
"สยช.": [{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}],
"สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}],
"สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}],
"สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}],
"สคช.": [{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}],
"สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}],
"สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}],
"สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}],
"อตก.": [{ORTH: "อตก.", LEMMA: "องค์การตลาดเพื่อเกษตรกร"}],
"อบจ.": [{ORTH: "อบจ.", LEMMA: "องค์การบริหารส่วนจังหวัด"}],
"อบต.": [{ORTH: "อบต.", LEMMA: "องค์การบริหารส่วนตำบล"}],
"อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}],
"อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}],
"อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}],
#มหาวิทยาลัย / สถานศึกษา / university / college
"มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}],
"มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}],
"ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}],
"มทร.": [{ORTH: "มทร.", LEMMA: "มหาวิทยาลัยเทคโนโลยีราชมงคล"}],
"มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}],
"วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}],
"สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}],
#ยศ / rank
"ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}],
"ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}],
"จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}],
"จ.ท.": [{ORTH: "จ.ท.", LEMMA: "จ่าโท"}],
"จ.ส.ต.": [{ORTH: "จ.ส.ต.", LEMMA: "จ่าสิบตรี (ทหารบก)"}],
"จสต.": [{ORTH: "จสต.", LEMMA: "จ่าสิบตำรวจ"}],
"จ.ส.ท.": [{ORTH: "จ.ส.ท.", LEMMA: "จ่าสิบโท"}],
"จ.ส.อ.": [{ORTH: "จ.ส.อ.", LEMMA: "จ่าสิบเอก"}],
"จ.อ.": [{ORTH: "จ.อ.", LEMMA: "จ่าเอก"}],
"ทพญ.": [{ORTH: "ทพญ.", LEMMA: "ทันตแพทย์หญิง"}],
"ทนพ.": [{ORTH: "ทนพ.", LEMMA: "เทคนิคการแพทย์"}],
"นจอ.": [{ORTH: "นจอ.", LEMMA: "นักเรียนจ่าอากาศ"}],
"น.ช.": [{ORTH: "น.ช.", LEMMA: "นักโทษชาย"}],
"น.ญ.": [{ORTH: "น.ญ.", LEMMA: "นักโทษหญิง"}],
"น.ต.": [{ORTH: "น.ต.", LEMMA: "นาวาตรี"}],
"น.ท.": [{ORTH: "น.ท.", LEMMA: "นาวาโท"}],
"นตท.": [{ORTH: "นตท.", LEMMA: "นักเรียนเตรียมทหาร"}],
"นนส.": [{ORTH: "นนส.", LEMMA: "นักเรียนนายสิบทหารบก"}],
"นนร.": [{ORTH: "นนร.", LEMMA: "นักเรียนนายร้อย"}],
"นนอ.": [{ORTH: "นนอ.", LEMMA: "นักเรียนนายเรืออากาศ"}],
"นพ.": [{ORTH: "นพ.", LEMMA: "นายแพทย์"}],
"นพท.": [{ORTH: "นพท.", LEMMA: "นายแพทย์ทหาร"}],
"นรจ.": [{ORTH: "นรจ.", LEMMA: "นักเรียนจ่าทหารเรือ"}],
"นรต.": [{ORTH: "นรต.", LEMMA: "นักเรียนนายร้อยตำรวจ"}],
"นศพ.": [{ORTH: "นศพ.", LEMMA: "นักศึกษาแพทย์"}],
"นศท.": [{ORTH: "นศท.", LEMMA: "นักศึกษาวิชาทหาร"}],
"น.สพ.": [{ORTH: "น.สพ.", LEMMA: "นายสัตวแพทย์ (พ.ร.บ.วิชาชีพการสัตวแพทย์)"}],
"น.อ.": [{ORTH: "น.อ.", LEMMA: "นาวาเอก"}],
"บช.ก.": [{ORTH: "บช.ก.", LEMMA: "กองบัญชาการตำรวจสอบสวนกลาง"}],
"บช.น.": [{ORTH: "บช.น.", LEMMA: "กองบัญชาการตำรวจนครบาล"}],
"ผกก.": [{ORTH: "ผกก.", LEMMA: "ผู้กำกับการ"}],
"ผกก.ภ.": [{ORTH: "ผกก.ภ.", LEMMA: "ผู้กำกับการตำรวจภูธร"}],
"ผจก.": [{ORTH: "ผจก.", LEMMA: "ผู้จัดการ"}],
"ผช.": [{ORTH: "ผช.", LEMMA: "ผู้ช่วย"}],
"ผชก.": [{ORTH: "ผชก.", LEMMA: "ผู้ชำนาญการ"}],
"ผช.ผอ.": [{ORTH: "ผช.ผอ.", LEMMA: "ผู้ช่วยผู้อำนวยการ"}],
"ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}],
"ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}],
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}],
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับการ (ตำรวจ)"}],
"ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}],
"ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}],
"ผบก.ปค.": [{ORTH: "ผบก.ปค.", LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)"}],
"ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}],
"ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}],
"ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}],
"ผบช.ก.": [{ORTH: "ผบช.ก.", LEMMA: "ผู้บัญชาการตำรวจสอบสวนกลาง"}],
"ผบช.ตชด.": [{ORTH: "ผบช.ตชด.", LEMMA: "ผู้บัญชาการตำรวจตระเวนชายแดน"}],
"ผบช.น.": [{ORTH: "ผบช.น.", LEMMA: "ผู้บัญชาการตำรวจนครบาล"}],
"ผบช.ภ.": [{ORTH: "ผบช.ภ.", LEMMA: "ผู้บัญชาการตำรวจภูธร"}],
"ผบ.ทบ.": [{ORTH: "ผบ.ทบ.", LEMMA: "ผู้บัญชาการทหารบก"}],
"ผบ.ตร.": [{ORTH: "ผบ.ตร.", LEMMA: "ผู้บัญชาการตำรวจแห่งชาติ"}],
"ผบ.ทร.": [{ORTH: "ผบ.ทร.", LEMMA: "ผู้บัญชาการทหารเรือ"}],
"ผบ.ทอ.": [{ORTH: "ผบ.ทอ.", LEMMA: "ผู้บัญชาการทหารอากาศ"}],
"ผบ.ทสส.": [{ORTH: "ผบ.ทสส.", LEMMA: "ผู้บัญชาการทหารสูงสุด"}],
"ผวจ.": [{ORTH: "ผวจ.", LEMMA: "ผู้ว่าราชการจังหวัด"}],
"ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ", LEMMA: "ผู้ว่าราชการจังหวัด"}],
"พ.จ.ต.": [{ORTH: "พ.จ.ต.", LEMMA: "พันจ่าตรี"}],
"พ.จ.ท.": [{ORTH: "พ.จ.ท.", LEMMA: "พันจ่าโท"}],
"พ.จ.อ.": [{ORTH: "พ.จ.อ.", LEMMA: "พันจ่าเอก"}],
"พญ.": [{ORTH: "พญ.", LEMMA: "แพทย์หญิง"}],
"ฯพณฯ": [{ORTH: "ฯพณฯ", LEMMA: "พณท่าน"}],
"พ.ต.": [{ORTH: "พ.ต.", LEMMA: "พันตรี"}],
"พ.ท.": [{ORTH: "พ.ท.", LEMMA: "พันโท"}],
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
"พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ", LEMMA: "พันตำรวจเอกพิเศษ"}],
"พลฯ": [{ORTH: "พลฯ", LEMMA: "พลทหาร"}],
"พล.๑ รอ.": [{ORTH: "พล.๑ รอ.", LEMMA: "กองพลที่ ๑ รักษาพระองค์ กองทัพบก"}],
"พล.ต.": [{ORTH: "พล.ต.", LEMMA: "พลตรี"}],
"พล.ต.ต.": [{ORTH: "พล.ต.ต.", LEMMA: "พลตำรวจตรี"}],
"พล.ต.ท.": [{ORTH: "พล.ต.ท.", LEMMA: "พลตำรวจโท"}],
"พล.ต.อ.": [{ORTH: "พล.ต.อ.", LEMMA: "พลตำรวจเอก"}],
"พล.ท.": [{ORTH: "พล.ท.", LEMMA: "พลโท"}],
"พล.ปตอ.": [{ORTH: "พล.ปตอ.", LEMMA: "กองพลทหารปืนใหญ่ต่อสู่อากาศยาน"}],
"พล.ม.": [{ORTH: "พล.ม.", LEMMA: "กองพลทหารม้า"}],
"พล.ม.๒": [{ORTH: "พล.ม.๒", LEMMA: "กองพลทหารม้าที่ ๒"}],
"พล.ร.ต.": [{ORTH: "พล.ร.ต.", LEMMA: "พลเรือตรี"}],
"พล.ร.ท.": [{ORTH: "พล.ร.ท.", LEMMA: "พลเรือโท"}],
"พล.ร.อ.": [{ORTH: "พล.ร.อ.", LEMMA: "พลเรือเอก"}],
"พล.อ.": [{ORTH: "พล.อ.", LEMMA: "พลเอก"}],
"พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}],
"พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}],
"พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}],
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
"พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}],
"พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}],
"พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}],
"พ.อ.อ.": [{ORTH: "พ.อ.อ.", LEMMA: "พันจ่าอากาศเอก"}],
"ภกญ.": [{ORTH: "ภกญ.", LEMMA: "เภสัชกรหญิง"}],
"ม.จ.": [{ORTH: "ม.จ.", LEMMA: "หม่อมเจ้า"}],
"มท1": [{ORTH: "มท1", LEMMA: "รัฐมนตรีว่าการกระทรวงมหาดไทย"}],
"ม.ร.ว.": [{ORTH: "ม.ร.ว.", LEMMA: "หม่อมราชวงศ์"}],
"มล.": [{ORTH: "มล.", LEMMA: "หม่อมหลวง"}],
"ร.ต.": [{ORTH: "ร.ต.", LEMMA: "ร้อยตรี,เรือตรี,เรืออากาศตรี"}],
"ร.ต.ต.": [{ORTH: "ร.ต.ต.", LEMMA: "ร้อยตำรวจตรี"}],
"ร.ต.ท.": [{ORTH: "ร.ต.ท.", LEMMA: "ร้อยตำรวจโท"}],
"ร.ต.อ.": [{ORTH: "ร.ต.อ.", LEMMA: "ร้อยตำรวจเอก"}],
"ร.ท.": [{ORTH: "ร.ท.", LEMMA: "ร้อยโท,เรือโท,เรืออากาศโท"}],
"รมช.": [{ORTH: "รมช.", LEMMA: "รัฐมนตรีช่วยว่าการกระทรวง"}],
"รมต.": [{ORTH: "รมต.", LEMMA: "รัฐมนตรี"}],
"รมว.": [{ORTH: "รมว.", LEMMA: "รัฐมนตรีว่าการกระทรวง"}],
"รศ.": [{ORTH: "รศ.", LEMMA: "รองศาสตราจารย์"}],
"ร.อ.": [{ORTH: "ร.อ.", LEMMA: "ร้อยเอก,เรือเอก,เรืออากาศเอก"}],
"ศ.": [{ORTH: "ศ.", LEMMA: "ศาสตราจารย์"}],
"ส.ต.": [{ORTH: "ส.ต.", LEMMA: "สิบตรี"}],
"ส.ต.ต.": [{ORTH: "ส.ต.ต.", LEMMA: "สิบตำรวจตรี"}],
"ส.ต.ท.": [{ORTH: "ส.ต.ท.", LEMMA: "สิบตำรวจโท"}],
"ส.ต.อ.": [{ORTH: "ส.ต.อ.", LEMMA: "สิบตำรวจเอก"}],
"ส.ท.": [{ORTH: "ส.ท.", LEMMA: "สิบโท"}],
"สพ.": [{ORTH: "สพ.", LEMMA: "สัตวแพทย์"}],
"สพ.ญ.": [{ORTH: "สพ.ญ.", LEMMA: "สัตวแพทย์หญิง"}],
"สพ.ช.": [{ORTH: "สพ.ช.", LEMMA: "สัตวแพทย์ชาย"}],
"ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}],
"อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}],
"อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}],
#วุฒิ / academic qualification
"ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}],
"ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}],
"ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}],
"ปวช.": [{ORTH: "ปวช.", LEMMA: "ประกาศนียบัตรวิชาชีพ"}],
"ปวท.": [{ORTH: "ปวท.", LEMMA: "ประกาศนียบัตรวิชาชีพเทคนิค"}],
"ปวส.": [{ORTH: "ปวส.", LEMMA: "ประกาศนียบัตรวิชาชีพชั้นสูง"}],
"ปทส.": [{ORTH: "ปทส.", LEMMA: "ประกาศนียบัตรครูเทคนิคชั้นสูง"}],
"กษ.บ.": [{ORTH: "กษ.บ.", LEMMA: "เกษตรศาสตรบัณฑิต"}],
"กษ.ม.": [{ORTH: "กษ.ม.", LEMMA: "เกษตรศาสตรมหาบัณฑิต"}],
"กษ.ด.": [{ORTH: "กษ.ด.", LEMMA: "เกษตรศาสตรดุษฎีบัณฑิต"}],
"ค.บ.": [{ORTH: "ค.บ.", LEMMA: "ครุศาสตรบัณฑิต"}],
"คศ.บ.": [{ORTH: "คศ.บ.", LEMMA: "คหกรรมศาสตรบัณฑิต"}],
"คศ.ม.": [{ORTH: "คศ.ม.", LEMMA: "คหกรรมศาสตรมหาบัณฑิต"}],
"คศ.ด.": [{ORTH: "คศ.ด.", LEMMA: "คหกรรมศาสตรดุษฎีบัณฑิต"}],
"ค.อ.บ.": [{ORTH: "ค.อ.บ.", LEMMA: "ครุศาสตรอุตสาหกรรมบัณฑิต"}],
"ค.อ.ม.": [{ORTH: "ค.อ.ม.", LEMMA: "ครุศาสตรอุตสาหกรรมมหาบัณฑิต"}],
"ค.อ.ด.": [{ORTH: "ค.อ.ด.", LEMMA: "ครุศาสตรอุตสาหกรรมดุษฎีบัณฑิต"}],
"ทก.บ.": [{ORTH: "ทก.บ.", LEMMA: "เทคโนโลยีการเกษตรบัณฑิต"}],
"ทก.ม.": [{ORTH: "ทก.ม.", LEMMA: "เทคโนโลยีการเกษตรมหาบัณฑิต"}],
"ทก.ด.": [{ORTH: "ทก.ด.", LEMMA: "เทคโนโลยีการเกษตรดุษฎีบัณฑิต"}],
"ท.บ.": [{ORTH: "ท.บ.", LEMMA: "ทันตแพทยศาสตรบัณฑิต"}],
"ท.ม.": [{ORTH: "ท.ม.", LEMMA: "ทันตแพทยศาสตรมหาบัณฑิต"}],
"ท.ด.": [{ORTH: "ท.ด.", LEMMA: "ทันตแพทยศาสตรดุษฎีบัณฑิต"}],
"น.บ.": [{ORTH: "น.บ.", LEMMA: "นิติศาสตรบัณฑิต"}],
"น.ม.": [{ORTH: "น.ม.", LEMMA: "นิติศาสตรมหาบัณฑิต"}],
"น.ด.": [{ORTH: "น.ด.", LEMMA: "นิติศาสตรดุษฎีบัณฑิต"}],
"นศ.บ.": [{ORTH: "นศ.บ.", LEMMA: "นิเทศศาสตรบัณฑิต"}],
"นศ.ม.": [{ORTH: "นศ.ม.", LEMMA: "นิเทศศาสตรมหาบัณฑิต"}],
"นศ.ด.": [{ORTH: "นศ.ด.", LEMMA: "นิเทศศาสตรดุษฎีบัณฑิต"}],
"บช.บ.": [{ORTH: "บช.บ.", LEMMA: "บัญชีบัณฑิต"}],
"บช.ม.": [{ORTH: "บช.ม.", LEMMA: "บัญชีมหาบัณฑิต"}],
"บช.ด.": [{ORTH: "บช.ด.", LEMMA: "บัญชีดุษฎีบัณฑิต"}],
"บธ.บ.": [{ORTH: "บธ.บ.", LEMMA: "บริหารธุรกิจบัณฑิต"}],
"บธ.ม.": [{ORTH: "บธ.ม.", LEMMA: "บริหารธุรกิจมหาบัณฑิต"}],
"บธ.ด.": [{ORTH: "บธ.ด.", LEMMA: "บริหารธุรกิจดุษฎีบัณฑิต"}],
"พณ.บ.": [{ORTH: "พณ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
"พณ.ม.": [{ORTH: "พณ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
"พณ.ด.": [{ORTH: "พณ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
"พ.บ.": [{ORTH: "พ.บ.", LEMMA: "แพทยศาสตรบัณฑิต"}],
"พ.ม.": [{ORTH: "พ.ม.", LEMMA: "แพทยศาสตรมหาบัณฑิต"}],
"พ.ด.": [{ORTH: "พ.ด.", LEMMA: "แพทยศาสตรดุษฎีบัณฑิต"}],
"พธ.บ.": [{ORTH: "พธ.บ.", LEMMA: "พุทธศาสตรบัณฑิต"}],
"พธ.ม.": [{ORTH: "พธ.ม.", LEMMA: "พุทธศาสตรมหาบัณฑิต"}],
"พธ.ด.": [{ORTH: "พธ.ด.", LEMMA: "พุทธศาสตรดุษฎีบัณฑิต"}],
"พบ.บ.": [{ORTH: "พบ.บ.", LEMMA: "พัฒนบริหารศาสตรบัณฑิต"}],
"พบ.ม.": [{ORTH: "พบ.ม.", LEMMA: "พัฒนบริหารศาสตรมหาบัณฑิต"}],
"พบ.ด.": [{ORTH: "พบ.ด.", LEMMA: "พัฒนบริหารศาสตรดุษฎีบัณฑิต"}],
"พย.บ.": [{ORTH: "พย.บ.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
"พย.ม.": [{ORTH: "พย.ม.", LEMMA: "พยาบาลศาสตรมหาบัณฑิต"}],
"พย.ด.": [{ORTH: "พย.ด.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
"พศ.บ.": [{ORTH: "พศ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
"พศ.ม.": [{ORTH: "พศ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
"พศ.ด.": [{ORTH: "พศ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
"ภ.บ.": [{ORTH: "ภ.บ.", LEMMA: "เภสัชศาสตรบัณฑิต"}],
"ภ.ม.": [{ORTH: "ภ.ม.", LEMMA: "เภสัชศาสตรมหาบัณฑิต"}],
"ภ.ด.": [{ORTH: "ภ.ด.", LEMMA: "เภสัชศาสตรดุษฎีบัณฑิต"}],
"ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ.", LEMMA: "ภูมิสถาปัตยกรรมศาสตรบัณฑิต"}],
"รป.บ.": [{ORTH: "รป.บ.", LEMMA: "รัฐประศาสนศาสตร์บัณฑิต"}],
"รป.ม.": [{ORTH: "รป.ม.", LEMMA: "รัฐประศาสนศาสตร์มหาบัณฑิต"}],
"วท.บ.": [{ORTH: "วท.บ.", LEMMA: "วิทยาศาสตรบัณฑิต"}],
"วท.ม.": [{ORTH: "วท.ม.", LEMMA: "วิทยาศาสตรมหาบัณฑิต"}],
"วท.ด.": [{ORTH: "วท.ด.", LEMMA: "วิทยาศาสตรดุษฎีบัณฑิต"}],
"ศ.บ.": [{ORTH: "ศ.บ.", LEMMA: "ศิลปบัณฑิต"}],
"ศศ.บ.": [{ORTH: "ศศ.บ.", LEMMA: "ศิลปศาสตรบัณฑิต"}],
"ศษ.บ.": [{ORTH: "ศษ.บ.", LEMMA: "ศึกษาศาสตรบัณฑิต"}],
"ศส.บ.": [{ORTH: "ศส.บ.", LEMMA: "เศรษฐศาสตรบัณฑิต"}],
"สถ.บ.": [{ORTH: "สถ.บ.", LEMMA: "สถาปัตยกรรมศาสตรบัณฑิต"}],
"สถ.ม.": [{ORTH: "สถ.ม.", LEMMA: "สถาปัตยกรรมศาสตรมหาบัณฑิต"}],
"สถ.ด.": [{ORTH: "สถ.ด.", LEMMA: "สถาปัตยกรรมศาสตรดุษฎีบัณฑิต"}],
"สพ.บ.": [{ORTH: "สพ.บ.", LEMMA: "สัตวแพทยศาสตรบัณฑิต"}],
"อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}],
"อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}],
"อ.ด.": [{ORTH: "อ.ด.", LEMMA: "อักษรศาสตรดุษฎีบัณฑิต"}],
#ปี / เวลา / year / time
"ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}],
"จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}],
"ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}],
"ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}],
"ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}],
#ระยะทาง / distance
"ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}],
"ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}],
"ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}],
"มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}],
"ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}],
"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
#น้ำหนัก / weight
"น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}],
"ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}],
"ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}],
"ดก.": [{ORTH: "ดก.", LEMMA: "เดซิกรัม"}],
"ซก.": [{ORTH: "ซก.", LEMMA: "เซนติกรัม"}],
"มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}],
"ก.": [{ORTH: "ก.", LEMMA: "กรัม"}],
"กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}],
#ปริมาตร / volume
"ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}],
"ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}],
"ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}],
"ซล.": [{ORTH: "ซล.", LEMMA: "เซนติลิตร"}],
"ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}],
"กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}],
"ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}],
#พื้นที่ / area
"ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}],
"ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}],
"ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}],
"ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}],
#เดือน / month
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}], "ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}], "ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}], "มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
@ -17,6 +331,114 @@ _exc = {
"ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}], "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
"พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}], "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
"ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}], "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}],
#เพศ / gender
"ช.": [{ORTH: "ช.", LEMMA: "ชาย"}],
"ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}],
"ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}],
"ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}],
#ที่อยู่ / address
"ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}],
"ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}],
"อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}],
"จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}],
#สรรพนาม / pronoun
"ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}],
"ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}],
"น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}],
"โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}],
#การเมือง / politics
"ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}],
"ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}],
"นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}],
"ปชป.": [{ORTH: "ปชป.", LEMMA: "พรรคประชาธิปัตย์"}],
"ผกค.": [{ORTH: "ผกค.", LEMMA: "ผู้ก่อการร้ายคอมมิวนิสต์"}],
"พท.": [{ORTH: "พท.", LEMMA: "พรรคเพื่อไทย"}],
"พ.ร.ก.": [{ORTH: "พ.ร.ก.", LEMMA: "พระราชกำหนด"}],
"พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ.", LEMMA: "พระราชกฤษฎีกา"}],
"พ.ร.บ.": [{ORTH: "พ.ร.บ.", LEMMA: "พระราชบัญญัติ"}],
"รธน.": [{ORTH: "รธน.", LEMMA: "รัฐธรรมนูญ"}],
"รบ.": [{ORTH: "รบ.", LEMMA: "รัฐบาล"}],
"รสช.": [{ORTH: "รสช.", LEMMA: "คณะรักษาความสงบเรียบร้อยแห่งชาติ"}],
"ส.ก.": [{ORTH: "ส.ก.", LEMMA: "สมาชิกสภากรุงเทพมหานคร"}],
"สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}],
"สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}],
"ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}],
#ทั่วไป / general
"ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}],
"กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
"กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}],
"ขรก.": [{ORTH: "ขรก.", LEMMA: "ข้าราชการ"}],
"ขส": [{ORTH: "ขส.", LEMMA: "ขนส่ง"}],
"ค.ร.น.": [{ORTH: "ค.ร.น.", LEMMA: "คูณร่วมน้อย"}],
"ค.ร.ม.": [{ORTH: "ค.ร.ม.", LEMMA: "คูณร่วมมาก"}],
"ง.ด.": [{ORTH: "ง.ด.", LEMMA: "เงินเดือน"}],
"งป.": [{ORTH: "งป.", LEMMA: "งบประมาณ"}],
"จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}],
"จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}],
"จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}],
"จ.ป.ร.": [{ORTH: "จ.ป.ร.", LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)"}],
"จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}],
"จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}],
"จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}],
"ตจว.": [{ORTH: "ตจว.", LEMMA: "ต่างจังหวัด"}],
"โทร.": [{ORTH: "โทร.", LEMMA: "โทรศัพท์"}],
"ธ.": [{ORTH: "ธ.", LEMMA: "ธนาคาร"}],
"น.ร.": [{ORTH: "น.ร.", LEMMA: "นักเรียน"}],
"น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}],
"น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}],
"น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}],
"น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}],
"นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}],
"บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}],
"บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}],
"บงล.": [{ORTH: "บงล.", LEMMA: "บริษัทเงินทุนและหลักทรัพย์จำกัด"}],
"บบส.": [{ORTH: "บบส.", LEMMA: "บรรษัทบริหารสินทรัพย์สถาบันการเงิน"}],
"บมจ.": [{ORTH: "บมจ.", LEMMA: "บริษัทมหาชนจำกัด"}],
"บลจ.": [{ORTH: "บลจ.", LEMMA: "บริษัทหลักทรัพย์จัดการกองทุนรวมจำกัด"}],
"บ/ช": [{ORTH: "บ/ช", LEMMA: "บัญชี"}],
"บร.": [{ORTH: "บร.", LEMMA: "บรรณารักษ์"}],
"ปชช.": [{ORTH: "ปชช.", LEMMA: "ประชาชน"}],
"ปณ.": [{ORTH: "ปณ.", LEMMA: "ที่ทำการไปรษณีย์"}],
"ปณก.": [{ORTH: "ปณก.", LEMMA: "ที่ทำการไปรษณีย์กลาง"}],
"ปณส.": [{ORTH: "ปณส.", LEMMA: "ที่ทำการไปรษณีย์สาขา"}],
"ปธ.": [{ORTH: "ปธ.", LEMMA: "ประธาน"}],
"ปธน.": [{ORTH: "ปธน.", LEMMA: "ประธานาธิบดี"}],
"ปอ.": [{ORTH: "ปอ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศ"}],
"ปอ.พ.": [{ORTH: "ปอ.พ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศพิเศษ"}],
"พ.ก.ง.": [{ORTH: "พ.ก.ง.", LEMMA: "พัสดุเก็บเงินปลายทาง"}],
"พ.ก.ส.": [{ORTH: "พ.ก.ส.", LEMMA: "พนักงานเก็บค่าโดยสาร"}],
"พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}],
"ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}],
"ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}],
"ภ.ป.ร.": [{ORTH: "ภ.ป.ร.", LEMMA: "ภูมิพลอดุยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)"}],
"ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}],
"ร.": [{ORTH: "ร.", LEMMA: "รัชกาล"}],
"ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}],
"ร.ด.": [{ORTH: "ร.ด.", LEMMA: "รักษาดินแดน"}],
"รปภ.": [{ORTH: "รปภ.", LEMMA: "รักษาความปลอดภัย"}],
"รพ.": [{ORTH: "รพ.", LEMMA: "โรงพยาบาล"}],
"ร.พ.": [{ORTH: "ร.พ.", LEMMA: "โรงพิมพ์"}],
"รร.": [{ORTH: "รร.", LEMMA: "โรงเรียน,โรงแรม"}],
"รสก.": [{ORTH: "รสก.", LEMMA: "รัฐวิสาหกิจ"}],
"ส.ค.ส.": [{ORTH: "ส.ค.ส.", LEMMA: "ส่งความสุขปีใหม่"}],
"สต.": [{ORTH: "สต.", LEMMA: "สตางค์"}],
"สน.": [{ORTH: "สน.", LEMMA: "สถานีตำรวจ"}],
"สนข.": [{ORTH: "สนข.", LEMMA: "สำนักงานเขต"}],
"สนง.": [{ORTH: "สนง.", LEMMA: "สำนักงาน"}],
"สนญ.": [{ORTH: "สนญ.", LEMMA: "สำนักงานใหญ่"}],
"ส.ป.ช.": [{ORTH: "ส.ป.ช.", LEMMA: "สร้างเสริมประสบการณ์ชีวิต"}],
"สภ.": [{ORTH: "สภ.", LEMMA: "สถานีตำรวจภูธร"}],
"ส.ล.น.": [{ORTH: "ส.ล.น.", LEMMA: "สร้างเสริมลักษณะนิสัย"}],
"สวญ.": [{ORTH: "สวญ.", LEMMA: "สารวัตรใหญ่"}],
"สวป.": [{ORTH: "สวป.", LEMMA: "สารวัตรป้องกันปราบปราม"}],
"สว.สส.": [{ORTH: "สว.สส.", LEMMA: "สารวัตรสืบสวน"}],
"ส.ห.": [{ORTH: "ส.ห.", LEMMA: "สารวัตรทหาร"}],
"สอ.": [{ORTH: "สอ.", LEMMA: "สถานีอนามัย"}],
"สอท.": [{ORTH: "สอท.", LEMMA: "สถานเอกอัครราชทูต"}],
"เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}],
"หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}],
"ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}],
} }


@ -134,6 +134,11 @@ def nl_tokenizer():
     return get_lang_class("nl").Defaults.create_tokenizer()


+# scope belongs in the fixture decorator, not in the function signature
+@pytest.fixture(scope="session")
+def nl_lemmatizer():
+    return get_lang_class("nl").Defaults.create_lemmatizer()
+
+
 @pytest.fixture(scope="session")
 def pl_tokenizer():
     return get_lang_class("pl").Defaults.create_tokenizer()


@ -0,0 +1,143 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
# Calling the Lemmatizer directly
# Imitates behavior of:
# Tagger.set_annotations()
# -> vocab.morphology.assign_tag_id()
# -> vocab.morphology.assign_tag_id()
# -> Token.tag.__set__
# -> vocab.morphology.assign_tag(...)
# -> ... -> Morphology.assign_tag(...)
# -> self.lemmatize(analysis.tag.pos, token.lex.orth,
noun_irreg_lemmatization_cases = [
("volkeren", "volk"),
("vaatje", "vat"),
("verboden", "verbod"),
("ijsje", "ijsje"),
("slagen", "slag"),
("verdragen", "verdrag"),
("verloven", "verlof"),
("gebeden", "gebed"),
("gaten", "gat"),
("staven", "staf"),
("aquariums", "aquarium"),
("podia", "podium"),
("holen", "hol"),
("lammeren", "lam"),
("bevelen", "bevel"),
("wegen", "weg"),
("moeilijkheden", "moeilijkheid"),
("aanwezigheden", "aanwezigheid"),
("goden", "god"),
("loten", "lot"),
("kaarsen", "kaars"),
("leden", "lid"),
("glaasje", "glas"),
("eieren", "ei"),
("vatten", "vat"),
("kalveren", "kalf"),
("padden", "pad"),
("smeden", "smid"),
("genen", "gen"),
("beenderen", "been"),
]
verb_irreg_lemmatization_cases = [
("liep", "lopen"),
("hief", "heffen"),
("begon", "beginnen"),
("sla", "slaan"),
("aangekomen", "aankomen"),
("sproot", "spruiten"),
("waart", "zijn"),
("snoof", "snuiven"),
("spoot", "spuiten"),
("ontbeet", "ontbijten"),
("gehouwen", "houwen"),
("afgewassen", "afwassen"),
("deed", "doen"),
("schoven", "schuiven"),
("gelogen", "liegen"),
("woog", "wegen"),
("gebraden", "braden"),
("smolten", "smelten"),
("riep", "roepen"),
("aangedaan", "aandoen"),
("vermeden", "vermijden"),
("stootten", "stoten"),
("ging", "gaan"),
("geschoren", "scheren"),
("gesponnen", "spinnen"),
("reden", "rijden"),
("zochten", "zoeken"),
("leed", "lijden"),
("verzonnen", "verzinnen"),
]
@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
def test_nl_lemmatizer_noun_lemmas_irreg(nl_lemmatizer, text, lemma):
    pos = "noun"
    lemmas_pred = nl_lemmatizer(text, pos)
    assert lemma == sorted(lemmas_pred)[0]


@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
def test_nl_lemmatizer_verb_lemmas_irreg(nl_lemmatizer, text, lemma):
    pos = "verb"
    lemmas_pred = nl_lemmatizer(text, pos)
    assert lemma == sorted(lemmas_pred)[0]


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_verb_lemmas_reg(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_adjective_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_determiner_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_adverb_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_pronoun_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


# Using the lemma lookup table only
@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
def test_nl_lemmatizer_lookup_noun(nl_lemmatizer, text, lemma):
    lemma_pred = nl_lemmatizer.lookup(text)
    assert lemma_pred in (lemma, text)


@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
def test_nl_lemmatizer_lookup_verb(nl_lemmatizer, text, lemma):
    lemma_pred = nl_lemmatizer.lookup(text)
    assert lemma_pred in (lemma, text)
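
The two lookup tests accept either the expected lemma or the unchanged input. That is what a table lookup produces when it falls back to the surface form for words it does not know. A minimal sketch of that behaviour, assuming a plain dict as the table; `LOOKUP` and `lookup` are illustrative names, not spaCy's actual data or implementation:

```python
# Hypothetical toy table; the two entries are taken from the test cases above.
LOOKUP = {"volkeren": "volk", "liep": "lopen"}


def lookup(string, table=LOOKUP):
    """Return the lemma for `string`, or `string` itself if it is not in the table."""
    return table.get(string, string)


for text, lemma in [("volkeren", "volk"), ("liep", "lopen"), ("gaten", "gat")]:
    lemma_pred = lookup(text)
    # "gaten" is missing from the toy table, so it comes back unchanged.
    assert lemma_pred in (lemma, text)
```

Because the assertion tolerates the fallback, these tests catch wrong table entries but not missing ones.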


@ -9,3 +9,19 @@ from spacy.lang.nl.lex_attrs import like_num
 def test_nl_lex_attrs_capitals(word):
     assert like_num(word)
     assert like_num(word.upper())
+
+
+@pytest.mark.parametrize(
+    "text,num_tokens",
+    [
+        (
+            "De aftredende minister-president benadrukte al dat zijn partij inhoudelijk weinig gemeen heeft met de groenen.",
+            16,
+        ),
+        ("Hij is sociaal-cultureel werker.", 5),
+        ("Er staan een aantal dure auto's in de garage.", 10),
+    ],
+)
+def test_tokenizer_doesnt_split_hyphens(nl_tokenizer, text, num_tokens):
+    tokens = nl_tokenizer(text)
+    assert len(tokens) == num_tokens


@ -1,6 +1,8 @@
-import pytest
+# coding: utf8
+from __future__ import unicode_literals
+
 import re
-from ... import compat
+from spacy import compat

 prefix_search = (
     b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
@ -67,4 +69,4 @@ if compat.is_python2:
     # string above in the xpass message.
     def test_issue3356():
         pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
-        assert not pattern.search(u"hello")
+        assert not pattern.search("hello")


@ -1,10 +1,14 @@
+# coding: utf8
+from __future__ import unicode_literals
+
 from spacy.util import decaying

-def test_decaying():
-    sizes = decaying(10., 1., .5)
+
+def test_issue3447():
+    sizes = decaying(10.0, 1.0, 0.5)
     size = next(sizes)
-    assert size == 10.
+    assert size == 10.0
     size = next(sizes)
-    assert size == 10. - 0.5
+    assert size == 10.0 - 0.5
     size = next(sizes)
-    assert size == 10. - 0.5 - 0.5
+    assert size == 10.0 - 0.5 - 0.5
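
The assertions in this test are consistent with a generator that starts at `start` and subtracts `decay` on every step. A minimal sketch under that assumption follows. `linear_decay` is a hypothetical stand-in, not the source of `spacy.util.decaying`, and the clamping at `stop` is an assumption the test does not exercise:

```python
def linear_decay(start, stop, decay):
    """Yield start, start - decay, start - 2 * decay, ... but never less than stop."""
    curr = float(start)
    while True:
        # Assumed behaviour: once the value would drop below `stop`, keep yielding `stop`.
        yield max(curr, stop)
        curr -= decay


sizes = linear_decay(10.0, 1.0, 0.5)
assert next(sizes) == 10.0
assert next(sizes) == 10.0 - 0.5
assert next(sizes) == 10.0 - 0.5 - 0.5
```

The exact float comparisons hold here because 10.0, 9.5 and 9.0 are all exactly representable. With a decay such as 0.1 a tolerance-based comparison would be safer.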


@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
@pytest.mark.xfail(reason="Current default suffix rules avoid one upper-case letter before a dot.")
def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    # Three separators after "I.": single space, double space, newline
    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"
    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)
    assert t1[5].text == 'I'
    assert t2[5].text == 'I'
    assert t3[5].text == 'I'


@ -1,7 +1,6 @@
 # coding: utf8
 from __future__ import unicode_literals

-import pytest
 from spacy.lang.en import English
 from spacy.tokens import Doc


@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize(
    "word",
    [
        "don't",
        "dont",
        "I'd",
        "Id",
    ],
)
def test_issue3521(en_tokenizer, word):
    tok = en_tokenizer(word)[1]
    # 'not' and 'would' should be stopwords, also in their abbreviated forms
    assert tok.is_stop


@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals
from spacy import displacy
def test_issue3531():
    """Test that displaCy renderer doesn't require "settings" key."""
    example_dep = {
        "words": [
            {"text": "But", "tag": "CCONJ"},
            {"text": "Google", "tag": "PROPN"},
            {"text": "is", "tag": "VERB"},
            {"text": "starting", "tag": "VERB"},
            {"text": "from", "tag": "ADP"},
            {"text": "behind.", "tag": "ADV"},
        ],
        "arcs": [
            {"start": 0, "end": 3, "label": "cc", "dir": "left"},
            {"start": 1, "end": 3, "label": "nsubj", "dir": "left"},
            {"start": 2, "end": 3, "label": "aux", "dir": "left"},
            {"start": 3, "end": 4, "label": "prep", "dir": "right"},
            {"start": 4, "end": 5, "label": "pcomp", "dir": "right"},
        ],
    }
    example_ent = {
        "text": "But Google is starting from behind.",
        "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    }
    dep_html = displacy.render(example_dep, style="dep", manual=True)
    assert dep_html
    ent_html = displacy.render(example_ent, style="ent", manual=True)
    assert ent_html


@ -26,6 +26,7 @@ def symlink_setup_target(request, symlink_target, symlink):
     os.mkdir(path2str(symlink_target))
     # yield -- need to cleanup even if assertion fails
     # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
+
     def cleanup():
         symlink_remove(symlink)
         os.rmdir(path2str(symlink_target))


@ -160,20 +160,14 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p
 ### Visualizing spaCy vectors in TensorBoard {#tensorboard}

-These two scripts let you load any spaCy model containing word vectors into
+This script lets you load any spaCy model containing word vectors into
 [TensorBoard](https://projector.tensorflow.org/) to create an
 [embedding visualization](https://www.tensorflow.org/versions/r1.1/get_started/embedding_viz).
-The first example uses TensorBoard, the second example TensorBoard's standalone
-embedding projector.

 ```python
 https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py
 ```

-```python
-https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard_standalone.py
-```
-
 ## Deep Learning {#deep-learning hidden="true"}

 ### Text classification with Keras {#keras}


@ -35,7 +35,7 @@ const SEO = ({ description, lang, title, section, sectionTitle, bodyClass }) =>
     siteMetadata.slogan,
     sectionTitle
 )
-const socialImage = getImage(section)
+const socialImage = siteMetadata.siteUrl + getImage(section)
 const meta = [
     {
         name: 'description',
@ -126,6 +126,7 @@ const query = graphql`
 title
 description
 slogan
+siteUrl
 social {
     twitter
 }


@ -164,9 +164,9 @@ const Landing = ({ data }) => {
 We're pleased to invite the spaCy community and other folks working on Natural
 Language Processing to Berlin this summer for a small and intimate event{' '}
 <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
-teams using spaCy in production, followed by a one-track conference. We booked a
-beautiful venue, hand-picked an awesome lineup of speakers and scheduled plenty
-of social time to get to know each other and exchange ideas.
+teams using spaCy in production, followed by a one-track conference. We've
+booked a beautiful venue, hand-picked an awesome lineup of speakers and
+scheduled plenty of social time to get to know each other and exchange ideas.
 </LandingBanner>

 <LandingBanner