mirror of https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00

commit 80e15af76c: fixed tag_map.py merge conflict
parent eba4f77526
.github/contributors/ivigamberdiev.md (vendored, new file, +106)

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" in one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Igor Igamberdiev     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | April 2, 2019        |
| GitHub username                | ivigamberdiev        |
| Website (optional)             |                      |
.github/contributors/nlptown.md (vendored, new file, +106)

# spaCy contributor agreement

The agreement text is identical to the file above, except that the term
**"us"** is defined as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal),
and the second statement under point 7 is marked:

* [ ] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                              |
|------------------------------- | ---------------------------------- |
| Name                           | Yves Peirsman                      |
| Company name (if applicable)   | NLP Town (Island Constraints BVBA) |
| Title or role (if applicable)  | Co-founder                         |
| Date                           | 14.03.2019                         |
| GitHub username                | nlptown                            |
| Website (optional)             | http://www.nlp.town                |
.github/contributors/socool.md (vendored, new file, +106)

# spaCy contributor agreement

The agreement text is identical to the file above, except that the term
**"us"** is defined as [ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal).

* [x] I am signing on behalf of myself as an individual and no other person
  or entity, including my employer, has or will have rights with respect to my
  contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
  actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Kamolsit Mongkolsrisawat |
| Company name (if applicable)   | Mojito                   |
| Title or role (if applicable)  |                          |
| Date                           | 02-4-2019                |
| GitHub username                | socool                   |
| Website (optional)             |                          |
README.md (18 changes)

````diff
@@ -17,7 +17,7 @@ released under the MIT license.
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square)](https://github.com/explosion/spaCy/releases)
-[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
+[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)
@@ -42,7 +42,7 @@ released under the MIT license.
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
-[changelog]: https://spacy.io/usage/#changelog
+[changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions
@@ -60,7 +60,7 @@ valuable if it's shared publicly, so that more people can benefit from it.
 | 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
-[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
+[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp
@@ -95,7 +95,7 @@ For detailed installation instructions, see the
 - **Python version**: Python 2.7, 3.5+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)

-[pip]: https://pypi.python.org/pypi/spacy
+[pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy

 ### pip
@@ -219,7 +219,7 @@ source. That is the common way if you want to make changes to the code base.
 You'll need to make sure that you have a development environment consisting of a
 Python distribution including header files, a compiler,
 [pip](https://pip.pypa.io/en/latest/installing/),
-[virtualenv](https://virtualenv.pypa.io/) and [git](https://git-scm.com)
+[virtualenv](https://virtualenv.pypa.io/en/latest/) and [git](https://git-scm.com)
 installed. The compiler part is the trickiest. How to do that depends on your
 system. See notes on Ubuntu, OS X and Windows for details.
@@ -239,8 +239,8 @@ python setup.py build_ext --inplace
 Compared to regular install via pip, [requirements.txt](requirements.txt)
 additionally installs developer dependencies such as Cython. For more details
 and instructions, see the documentation on
-[compiling spaCy from source](https://spacy.io/usage/#source) and the
-[quickstart widget](https://spacy.io/usage/#section-quickstart) to get
+[compiling spaCy from source](https://spacy.io/usage#source) and the
+[quickstart widget](https://spacy.io/usage#section-quickstart) to get
 the right commands for your platform and Python version.

 ### Ubuntu
@@ -260,7 +260,7 @@ and git preinstalled.
 ### Windows

 Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or
-[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/)
+[Visual Studio Express](https://visualstudio.microsoft.com/vs/express/)
 that matches the version that was used to compile your Python
 interpreter. For official distributions these are VS 2008 (Python 2.7),
 VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
@@ -282,5 +282,5 @@ pip install -r path/to/requirements.txt
 python -m pytest <spacy-directory>
 ```

-See [the documentation](https://spacy.io/usage/#tests) for more details and
+See [the documentation](https://spacy.io/usage#tests) for more details and
 examples.
````
```diff
@@ -23,7 +23,7 @@ For more details, see the documentation:
 * Training: https://spacy.io/usage/training
 * NER: https://spacy.io/usage/linguistic-features#named-entities

-Compatible with: spaCy v2.0.0+
+Compatible with: spaCy v2.1.0+
 Last tested with: v2.1.0
 """
 from __future__ import unicode_literals, print_function
```
spacy/_ml.py (41 changes)

```diff
@@ -86,7 +86,7 @@ def with_cpu(ops, model):
     as necessary."""
     model.to_cpu()

-    def with_cpu_forward(inputs, drop=0.):
+    def with_cpu_forward(inputs, drop=0.0):
         cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
         gpu_outputs = _to_device(ops, cpu_outputs)
@@ -106,7 +106,7 @@ def _to_cpu(X):
         return tuple([_to_cpu(x) for x in X])
     elif isinstance(X, list):
         return [_to_cpu(x) for x in X]
-    elif hasattr(X, 'get'):
+    elif hasattr(X, "get"):
         return X.get()
     else:
         return X
@@ -142,7 +142,9 @@ class extract_ngrams(Model):
         # The dtype here matches what thinc is expecting -- which differs per
         # platform (by int definition). This should be fixed once the problem
         # is fixed on Thinc's side.
-        lengths = self.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_)
+        lengths = self.ops.asarray(
+            [arr.shape[0] for arr in batch_keys], dtype=numpy.int_
+        )
         batch_keys = self.ops.xp.concatenate(batch_keys)
         batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f")
         return (batch_keys, batch_vals, lengths), None
@@ -592,32 +594,27 @@ def build_text_classifier(nr_class, width=64, **cfg):
     )

     linear_model = build_bow_text_classifier(
-        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False)
-    if cfg.get('exclusive_classes'):
+        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False
+    )
+    if cfg.get("exclusive_classes"):
         output_layer = Softmax(nr_class, nr_class * 2)
     else:
         output_layer = (
-            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
-            >> logistic
+            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic
         )
-    model = (
-        (linear_model | cnn_model)
-        >> output_layer
-    )
+    model = (linear_model | cnn_model) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
     model.lsuv = False
     return model


-def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
-                              no_output_layer=False, **cfg):
+def build_bow_text_classifier(
+    nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg
+):
     with Model.define_operators({">>": chain}):
-        model = (
-            with_cpu(Model.ops,
-                extract_ngrams(ngram_size, attr=ORTH)
-                >> LinearModel(nr_class)
-            )
+        model = with_cpu(
+            Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class)
+        )
         if not no_output_layer:
             model = model >> (cpu_softmax if exclusive_classes else logistic)
@@ -626,11 +623,9 @@ def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,


 @layerize
-def cpu_softmax(X, drop=0.):
+def cpu_softmax(X, drop=0.0):
     ops = NumpyOps()

     Y = ops.softmax(X)

     def cpu_softmax_backward(dY, sgd=None):
         return dY
@@ -648,7 +643,9 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
     if exclusive_classes:
         output_layer = Softmax(nr_class, tok2vec.nO)
     else:
-        output_layer = zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        output_layer = (
+            zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        )
     model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
```
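The `_to_cpu` hunk above relies on a duck-typing convention: GPU arrays (e.g. CuPy's) expose a `.get()` method that copies data back to host memory, while NumPy arrays do not, so `hasattr(X, "get")` moves data only when needed. A standalone sketch of that pattern (plain NumPy only; the recursive structure mirrors the diff, the function name here is illustrative):

```python
import numpy

def to_cpu(X):
    # Recurse through tuples and lists, copying any array that exposes
    # .get() (e.g. a CuPy array) back to host memory; NumPy arrays and
    # plain Python values pass through unchanged.
    if isinstance(X, tuple):
        return tuple(to_cpu(x) for x in X)
    elif isinstance(X, list):
        return [to_cpu(x) for x in X]
    elif hasattr(X, "get"):
        return X.get()
    return X

nested = ([numpy.ones(3)], numpy.zeros(2), "label")
out = to_cpu(nested)  # NumPy arrays have no .get(), so the structure survives as-is
```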
```diff
@@ -125,7 +125,9 @@ def pretrain(
             max_length=max_length,
             min_length=min_length,
         )
-        loss = make_update(model, docs, optimizer, objective=loss_func, drop=dropout)
+        loss = make_update(
+            model, docs, optimizer, objective=loss_func, drop=dropout
+        )
         progress = tracker.update(epoch, loss, docs)
         if progress:
             msg.row(progress, **row_settings)
@@ -215,8 +217,8 @@ def get_cossim_loss(yh, y):
     norm_y = xp.linalg.norm(y, axis=1, keepdims=True)
     mul_norms = norm_yh * norm_y
     cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
-    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh**2))
-    loss = xp.abs(cosine-1).sum()
+    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2))
+    loss = xp.abs(cosine - 1).sum()
     return loss, -d_yh
```
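The `get_cossim_loss` change above is purely cosmetic (spacing), but the function is compact enough to sketch in full: the loss is the summed absolute deviation of the row-wise cosine similarity from 1, and `d_yh` is the gradient of the cosine with respect to the predictions. A plain-NumPy version (assuming `xp` resolves to NumPy, as it does on CPU):

```python
import numpy as np

def get_cossim_loss(yh, y):
    # Row-wise cosine similarity between predictions yh and targets y.
    norm_yh = np.linalg.norm(yh, axis=1, keepdims=True)
    norm_y = np.linalg.norm(y, axis=1, keepdims=True)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
    # Gradient of the cosine w.r.t. yh; the returned update is its negative,
    # i.e. the direction that increases similarity.
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2))
    loss = np.abs(cosine - 1).sum()
    return loss, -d_yh

# Identical rows have cosine similarity 1, so the loss (and gradient) is 0.
loss, grad = get_cossim_loss(np.eye(3), np.eye(3))
```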
```diff
@@ -50,8 +50,9 @@ class DependencyRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
             render_id = "{}-{}".format(id_prefix, i)
             svg = self.render_svg(render_id, p["words"], p["arcs"])
             rendered.append(svg)
@@ -254,9 +255,10 @@ class EntityRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
-            rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
+            rendered.append(self.render_ents(p["text"], p["ents"], p.get("title")))
         if page:
             docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
             markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
```
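The renderer change above swaps direct `p["settings"]` indexing for `p.get("settings", {})`, so input dicts that omit the `settings` (or `title`) key no longer raise `KeyError`. A minimal illustration of the defensive-access pattern, using hypothetical defaults rather than displaCy's actual ones:

```python
DEFAULT_LANG = "en"
DEFAULT_DIR = "ltr"

def read_settings(parsed):
    # Tolerate parse dicts that omit the "settings" key entirely,
    # and settings dicts that omit individual keys.
    settings = parsed.get("settings", {})
    return settings.get("lang", DEFAULT_LANG), settings.get("direction", DEFAULT_DIR)

print(read_settings({"words": [], "arcs": []}))     # no settings: falls back to defaults
print(read_settings({"settings": {"lang": "nl"}}))  # partial settings still work
```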
```diff
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import LEMMA, PRON_LEMMA, AUX
+from ...symbols import LEMMA, PRON_LEMMA

 _subordinating_conjunctions = [
     "that",
@@ -457,7 +457,6 @@ MORPH_RULES = {
     "have": {"POS": "AUX"},
     "'m": {"POS": "AUX", LEMMA: "be"},
-    "'ve": {"POS": "AUX"},
     "'re": {"POS": "AUX", LEMMA: "be"},
     "'s": {"POS": "AUX"},
     "is": {"POS": "AUX"},
     "'d": {"POS": "AUX"},
```
```diff
@@ -39,7 +39,7 @@ made make many may me meanwhile might mine more moreover most mostly move much
 must my myself

 name namely neither never nevertheless next nine no nobody none noone nor not
-nothing now nowhere n't
+nothing now nowhere

 of off often on once one only onto or other others otherwise our ours ourselves
 out over own
@@ -66,7 +66,13 @@ whereafter whereas whereby wherein whereupon wherever whether which while
 whither who whoever whole whom whose why will with within without would

 yet you your yours yourself yourselves
-
-'d 'll 'm 're 's 've
 """.split()
 )
+
+contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
+STOP_WORDS.update(contractions)
+
+for apostrophe in ["‘", "’"]:
+    for stopword in contractions:
+        STOP_WORDS.add(stopword.replace("'", apostrophe))
```
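The added block registers each contraction under the plain ASCII apostrophe and then again under each typographic single-quote variant, since tokenized text can contain either form. The expansion can be sketched in isolation with a fresh set:

```python
stop_words = set()
contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
stop_words.update(contractions)

# Also cover the typographic single quotes U+2018 and U+2019,
# producing e.g. "n’t" alongside "n't".
for apostrophe in ["‘", "’"]:
    for stopword in contractions:
        stop_words.add(stopword.replace("'", apostrophe))

# 7 contractions x 3 apostrophe forms = 21 distinct entries
print(len(stop_words))
```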
```diff
@@ -2,7 +2,11 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-<<<<<<< HEAD
 from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN
-=======
-from ...symbols import NOUN, PRON, AUX, SCONJ
->>>>>>> 4faf62d5154c2d2adb6def32da914d18d5e9c8fe


 # POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014
@@ -92,4 +96,3 @@ TAG_MAP = {
     "D--+PS2": {POS: ADV},
     "PP3+T—": {POS: PRON},
 }
```
spacy/lang/nl/__init__.py

```diff
@@ -4,6 +4,11 @@ from __future__ import unicode_literals

 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
+from .tag_map import TAG_MAP
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
+from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES
+from .lemmatizer.lemmatizer import DutchLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -13,20 +18,33 @@ from ...util import update_exc, add_lookups


 class DutchDefaults(Language.Defaults):

     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = lambda text: "nl"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
-    )
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
+    lex_attr_getters[LANG] = lambda text: 'nl'
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
+                                         BASE_NORMS)
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    tag_map = TAG_MAP
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+
+    @classmethod
+    def create_lemmatizer(cls, nlp=None):
+        rules = RULES
+        lemma_index = LEMMA_INDEX
+        lemma_exc = LEMMA_EXC
+        lemma_lookup = LOOKUP
+        return DutchLemmatizer(index=lemma_index,
+                               exceptions=lemma_exc,
+                               lookup=lemma_lookup,
+                               rules=rules)


 class Dutch(Language):
-    lang = "nl"
+    lang = 'nl'
     Defaults = DutchDefaults


-__all__ = ["Dutch"]
+__all__ = ['Dutch']
```
@@ -14,5 +14,5 @@ sentences = [
    "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
    "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
+    "San Francisco overweegt robots op voetpaden te verbieden",
-    "Londen is een grote stad in het Verenigd Koninkrijk",
+    "Londen is een grote stad in het Verenigd Koninkrijk"
]
40  spacy/lang/nl/lemmatizer/__init__.py  Normal file
@@ -0,0 +1,40 @@
# coding: utf8
from __future__ import unicode_literals

from ._verbs_irreg import VERBS_IRREG
from ._nouns_irreg import NOUNS_IRREG
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs_irreg import ADVERBS_IRREG

from ._adpositions_irreg import ADPOSITIONS_IRREG
from ._determiners_irreg import DETERMINERS_IRREG
from ._pronouns_irreg import PRONOUNS_IRREG

from ._verbs import VERBS
from ._nouns import NOUNS
from ._adjectives import ADJECTIVES

from ._adpositions import ADPOSITIONS
from ._determiners import DETERMINERS

from .lookup import LOOKUP

from ._lemma_rules import RULES

from .lemmatizer import DutchLemmatizer


LEMMA_INDEX = {"adj": ADJECTIVES,
               "noun": NOUNS,
               "verb": VERBS,
               "adp": ADPOSITIONS,
               "det": DETERMINERS}

LEMMA_EXC = {"adj": ADJECTIVES_IRREG,
             "adv": ADVERBS_IRREG,
             "adp": ADPOSITIONS_IRREG,
             "noun": NOUNS_IRREG,
             "verb": VERBS_IRREG,
             "det": DETERMINERS_IRREG,
             "pron": PRONOUNS_IRREG}
3461  spacy/lang/nl/lemmatizer/_adjectives.py  Normal file
File diff suppressed because it is too large
3033  spacy/lang/nl/lemmatizer/_adjectives_irreg.py  Normal file
File diff suppressed because it is too large
24  spacy/lang/nl/lemmatizer/_adpositions.py  Normal file
@@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS = set(
    ('aan aangaande aanwezig achter af afgezien al als an annex anno anti '
     'behalve behoudens beneden benevens benoorden beoosten betreffende bewesten '
     'bezijden bezuiden bij binnen binnenuit binst bladzij blijkens boven bovenop '
     'buiten conform contra cq daaraan daarbij daarbuiten daarin daarnaar '
     'daaronder daartegenover daarvan dankzij deure dichtbij door doordat doorheen '
     'echter eraf erop erover errond eruit ervoor evenals exclusief gedaan '
     'gedurende gegeven getuige gezien halfweg halverwege heen hierdoorheen hierop '
     'houdende in inclusief indien ingaande ingevolge inzake jegens kortweg '
     'krachtens kralj langs langsheen langst lastens linksom lopende luidens mede '
     'mee met middels midden middenop mits na naan naar naartoe naast naat nabij '
     'nadat namens neer neffe neffen neven nevenst niettegenstaande nopens '
     'officieel om omheen omstreeks omtrent onafgezien ondanks onder onderaan '
     'ondere ongeacht ooit op open over per plus pro qua rechtover rond rondom '
     "sedert sinds spijts strekkende te tegen tegenaan tegenop tegenover telde "
     'teneinde terug tijdens toe tot totdat trots tussen tégen uit uitgenomen '
     'ultimo van vanaf vandaan vandoor vanop vanuit vanwege versus via vinnen '
     'vlakbij volgens voor voor- voorbij voordat voort voren vòòr vóór waaraan '
     'waarbij waardoor waaronder weg wegens weleens zijdens zoals zodat zonder '
     'zónder à').split())
12  spacy/lang/nl/lemmatizer/_adpositions_irreg.py  Normal file
@@ -0,0 +1,12 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS_IRREG = {
    "'t": ('te',),
    'me': ('mee',),
    'meer': ('mee',),
    'on': ('om',),
    'ten': ('te',),
    'ter': ('te',)
}
19  spacy/lang/nl/lemmatizer/_adverbs_irreg.py  Normal file
@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals


ADVERBS_IRREG = {
    "'ns": ('eens',),
    "'s": ('eens',),
    "'t": ('het',),
    "d'r": ('er',),
    "d'raf": ('eraf',),
    "d'rbij": ('erbij',),
    "d'rheen": ('erheen',),
    "d'rin": ('erin',),
    "d'rna": ('erna',),
    "d'rnaar": ('ernaar',),
    'hele': ('heel',),
    'nevenst': ('nevens',),
    'overend': ('overeind',)
}
17  spacy/lang/nl/lemmatizer/_determiners.py  Normal file
@@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS = set(
    ("al allebei allerhande allerminst alletwee "
     "beide clip-on d'n d'r dat datgeen datgene de dees degeen degene den dewelke "
     'deze dezelfde die diegeen diegene diehien dien diene diens diezelfde dit '
     'ditgene e een eene eigen elk elkens elkes enig enkel enne ettelijke eure '
     'euren evenveel ewe ge geen ginds géén haar haaren halfelf het hetgeen '
     'hetwelk hetzelfde heur heure hulder hulle hullen hullie hun hunder hunderen '
     'ieder iederes ja je jen jouw jouwen jouwes jullie junder keiveel keiweinig '
     "m'ne me meer meerder meerdere menen menig mijn mijnes minst méér niemendal "
     'oe ons onse se sommig sommigeder superveel telken teveel titulair ulder '
     'uldere ulderen ulle under une uw vaak veel veels véél wat weinig welk welken '
     "welkene welksten z'nen ze zenen zijn zo'n zo'ne zoiet zoveel zovele zovelen "
     'zuk zulk zulkdanig zulken zulks zullie zíjn àlle álle').split())
69  spacy/lang/nl/lemmatizer/_determiners_irreg.py  Normal file
@@ -0,0 +1,69 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS_IRREG = {
    "'r": ('haar',),
    "'s": ('de',),
    "'t": ('het',),
    "'tgene": ('hetgeen',),
    'alle': ('al',),
    'allen': ('al',),
    'aller': ('al',),
    'beiden': ('beide',),
    'beider': ('beide',),
    "d'": ('het',),
    "d'r": ('haar',),
    'der': ('de',),
    'des': ('de',),
    'dezer': ('deze',),
    'dienen': ('die',),
    'dier': ('die',),
    'elke': ('elk',),
    'ene': ('een',),
    'enen': ('een',),
    'ener': ('een',),
    'enige': ('enig',),
    'enigen': ('enig',),
    'er': ('haar',),
    'gene': ('geen',),
    'genen': ('geen',),
    'hare': ('haar',),
    'haren': ('haar',),
    'harer': ('haar',),
    'hunne': ('hun',),
    'hunnen': ('hun',),
    'jou': ('jouw',),
    'jouwe': ('jouw',),
    'julliejen': ('jullie',),
    "m'n": ('mijn',),
    'mee': ('meer',),
    'meer': ('veel',),
    'meerderen': ('meerdere',),
    'meest': ('veel',),
    'meesten': ('veel',),
    'meet': ('veel',),
    'menige': ('menig',),
    'mij': ('mijn',),
    'mijnen': ('mijn',),
    'minder': ('weinig',),
    'mindere': ('weinig',),
    'minst': ('weinig',),
    'minste': ('minst',),
    'ne': ('een',),
    'onze': ('ons',),
    'onzent': ('ons',),
    'onzer': ('ons',),
    'ouw': ('uw',),
    'sommige': ('sommig',),
    'sommigen': ('sommig',),
    'u': ('uw',),
    'vaker': ('vaak',),
    'vele': ('veel',),
    'velen': ('veel',),
    'welke': ('welk',),
    'zijne': ('zijn',),
    'zijnen': ('zijn',),
    'zijns': ('zijn',),
    'één': ('een',)
}
79  spacy/lang/nl/lemmatizer/_lemma_rules.py  Normal file
@@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals


ADJECTIVE_SUFFIX_RULES = [
    ["sten", ""],
    ["ste", ""],
    ["st", ""],
    ["er", ""],
    ["en", ""],
    ["e", ""],
    ["ende", "end"]
]

VERB_SUFFIX_RULES = [
    ["dt", "den"],
    ["de", "en"],
    ["te", "en"],
    ["dde", "den"],
    ["tte", "ten"],
    ["dden", "den"],
    ["tten", "ten"],
    ["end", "en"],
]

NOUN_SUFFIX_RULES = [
    ["en", ""],
    ["ën", ""],
    ["'er", ""],
    ["s", ""],
    ["tje", ""],
    ["kje", ""],
    ["'s", ""],
    ["ici", "icus"],
    ["heden", "heid"],
    ["elen", "eel"],
    ["ezen", "ees"],
    ["even", "eef"],
    ["ssen", "s"],
    ["rren", "r"],
    ["kken", "k"],
    ["bben", "b"]
]

NUM_SUFFIX_RULES = [
    ["ste", ""],
    ["sten", ""],
    ["ën", ""],
    ["en", ""],
    ["de", ""],
    ["er", ""],
    ["ër", ""],
    ["tjes", ""]
]

PUNCT_SUFFIX_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]


# In-place sort guaranteeing that longer -- more specific -- rules are
# applied first.
for rule_set in (ADJECTIVE_SUFFIX_RULES,
                 NOUN_SUFFIX_RULES,
                 NUM_SUFFIX_RULES,
                 VERB_SUFFIX_RULES):
    rule_set.sort(key=lambda r: len(r[0]), reverse=True)


RULES = {
    "adj": ADJECTIVE_SUFFIX_RULES,
    "noun": NOUN_SUFFIX_RULES,
    "verb": VERB_SUFFIX_RULES,
    "num": NUM_SUFFIX_RULES,
    "punct": PUNCT_SUFFIX_RULES
}
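The effect of the longest-suffix-first sort above can be shown with a toy rule list. The two rules below are drawn from NOUN_SUFFIX_RULES, and the first-match loop is a simplified version of how a rule set is consumed; without the sort, the shorter `en` rule would fire first and mangle the word:

```python
# Two (suffix, replacement) rules from NOUN_SUFFIX_RULES; sort longest first
# so the more specific rule wins when both suffixes match.
rules = [["en", ""], ["heden", "heid"]]
rules.sort(key=lambda r: len(r[0]), reverse=True)

def apply_first(string, rules):
    # Apply the first matching suffix rule (simplified consumption loop).
    for old, new in rules:
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    return string

print(apply_first("mogelijkheden", rules))  # "mogelijkheid", not "mogelijkhed"
```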
27890  spacy/lang/nl/lemmatizer/_nouns.py  Normal file
File diff suppressed because it is too large
3240  spacy/lang/nl/lemmatizer/_nouns_irreg.py  Normal file
File diff suppressed because it is too large
31  spacy/lang/nl/lemmatizer/_numbers_irreg.py  Normal file
@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals


NUMBERS_IRREG = {
    'achten': ('acht',),
    'biljoenen': ('biljoen',),
    'drieën': ('drie',),
    'duizenden': ('duizend',),
    'eentjes': ('één',),
    'elven': ('elf',),
    'miljoenen': ('miljoen',),
    'negenen': ('negen',),
    'negentiger': ('negentig',),
    'tienduizenden': ('tienduizend',),
    'tienen': ('tien',),
    'tientjes': ('tien',),
    'twaalven': ('twaalf',),
    'tweeën': ('twee',),
    'twintiger': ('twintig',),
    'twintigsten': ('twintig',),
    'vieren': ('vier',),
    'vijftiger': ('vijftig',),
    'vijven': ('vijf',),
    'zessen': ('zes',),
    'zestiger': ('zestig',),
    'zevenen': ('zeven',),
    'zeventiger': ('zeventig',),
    'zovele': ('zoveel',),
    'zovelen': ('zoveel',)
}
35  spacy/lang/nl/lemmatizer/_pronouns_irreg.py  Normal file
@@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals


PRONOUNS_IRREG = {
    "'r": ('haar',),
    "'rzelf": ('haarzelf',),
    "'t": ('het',),
    "d'r": ('haar',),
    'da': ('dat',),
    'dienen': ('die',),
    'diens': ('die',),
    'dies': ('die',),
    'elkaars': ('elkaar',),
    'elkanders': ('elkander',),
    'ene': ('een',),
    'enen': ('een',),
    'fik': ('ik',),
    'gaat': ('gaan',),
    'gene': ('geen',),
    'harer': ('haar',),
    'ieders': ('ieder',),
    'iemands': ('iemand',),
    'ikke': ('ik',),
    'mijnen': ('mijn',),
    'oe': ('je',),
    'onzer': ('ons',),
    'wa': ('wat',),
    'watte': ('wat',),
    'wier': ('wie',),
    'zijns': ('zijn',),
    'zoietsken': ('zoietske',),
    'zulks': ('zulk',),
    'één': ('een',)
}
2873  spacy/lang/nl/lemmatizer/_verbs.py  Normal file
File diff suppressed because it is too large
7201  spacy/lang/nl/lemmatizer/_verbs_irreg.py  Normal file
File diff suppressed because it is too large
130  spacy/lang/nl/lemmatizer/lemmatizer.py  Normal file
@@ -0,0 +1,130 @@
# coding: utf8
from __future__ import unicode_literals

from ....symbols import POS, NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV


class DutchLemmatizer(object):
    # Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB.
    univ_pos_name_variants = {
        NOUN: "noun", "NOUN": "noun", "noun": "noun",
        VERB: "verb", "VERB": "verb", "verb": "verb",
        AUX: "verb", "AUX": "verb", "aux": "verb",
        ADJ: "adj", "ADJ": "adj", "adj": "adj",
        ADV: "adv", "ADV": "adv", "adv": "adv",
        PRON: "pron", "PRON": "pron", "pron": "pron",
        DET: "det", "DET": "det", "det": "det",
        ADP: "adp", "ADP": "adp", "adp": "adp",
        NUM: "num", "NUM": "num", "num": "num"
    }

    @classmethod
    def load(cls, path, index=None, exc=None, rules=None, lookup=None):
        return cls(index, exc, rules, lookup)

    def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
        self.index = index
        self.exc = exceptions
        self.rules = rules or {}
        self.lookup_table = lookup if lookup is not None else {}

    def __call__(self, string, univ_pos, morphology=None):
        # Difference 1: self.rules is assumed to be non-None, so no
        # 'is None' check required.
        # String lowercased from the get-go. All lemmatization results in
        # lowercased strings. For most applications, this shouldn't pose
        # any problems, and it keeps the exceptions indexes small. If this
        # creates problems for proper nouns, we can introduce a check for
        # univ_pos == "PROPN".
        string = string.lower()
        try:
            univ_pos = self.univ_pos_name_variants[univ_pos]
        except KeyError:
            # Because PROPN not in self.univ_pos_name_variants, proper names
            # are not lemmatized. They are lowercased, however.
            return [string]
        # if string in self.lemma_index.get(univ_pos)
        lemma_index = self.index.get(univ_pos, {})
        # string is already lemma
        if string in lemma_index:
            return [string]
        exceptions = self.exc.get(univ_pos, {})
        # string is irregular token contained in exceptions index.
        try:
            lemma = exceptions[string]
            return [lemma[0]]
        except KeyError:
            pass
        # string corresponds to key in lookup table
        lookup_table = self.lookup_table
        looked_up_lemma = lookup_table.get(string)
        if looked_up_lemma and looked_up_lemma in lemma_index:
            return [looked_up_lemma]

        forms, is_known = lemmatize(
            string,
            lemma_index,
            exceptions,
            self.rules.get(univ_pos, []))

        # Back-off through remaining return value candidates.
        if forms:
            if is_known:
                return forms
            else:
                for form in forms:
                    if form in exceptions:
                        return [form]
                if looked_up_lemma:
                    return [looked_up_lemma]
                else:
                    return forms
        elif looked_up_lemma:
            return [looked_up_lemma]
        else:
            return [string]

    # Overrides parent method so that a lowercased version of the string is
    # used to search the lookup table. This is necessary because our lookup
    # table consists entirely of lowercase keys.
    def lookup(self, string):
        string = string.lower()
        return self.lookup_table.get(string, string)

    def noun(self, string, morphology=None):
        return self(string, 'noun', morphology)

    def verb(self, string, morphology=None):
        return self(string, 'verb', morphology)

    def adj(self, string, morphology=None):
        return self(string, 'adj', morphology)

    def det(self, string, morphology=None):
        return self(string, 'det', morphology)

    def pron(self, string, morphology=None):
        return self(string, 'pron', morphology)

    def adp(self, string, morphology=None):
        return self(string, 'adp', morphology)

    def punct(self, string, morphology=None):
        return self(string, 'punct', morphology)


# Reimplemented to focus more on application of suffix rules and to return
# as early as possible.
def lemmatize(string, index, exceptions, rules):
    # returns (forms, is_known: bool)
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index:
                return [form], True  # True = Is known (is lemma)
            else:
                oov_forms.append(form)
    return list(set(oov_forms)), False
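The `lemmatize` helper at the end of the file runs standalone. The sketch below copies it and feeds it a toy index and rule set (assumptions for illustration, not spaCy's real tables) to show the known/OOV distinction in its return value:

```python
# Copy of the module-level `lemmatize` helper: a rule-derived form found in
# the index is returned immediately as a known lemma; otherwise all candidate
# forms are collected as out-of-vocabulary guesses.
def lemmatize(string, index, exceptions, rules):
    # returns (forms, is_known: bool)
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index:
                return [form], True
            else:
                oov_forms.append(form)
    return list(set(oov_forms)), False

# Toy index and rules (assumptions, not the real lemmatizer data).
index = {"boek"}
rules = [["en", ""]]
print(lemmatize("boeken", index, {}, rules))   # (['boek'], True)
print(lemmatize("fietsen", index, {}, rules))  # (['fiets'], False)
```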
212951  spacy/lang/nl/lemmatizer/lookup.py  Normal file
File diff suppressed because it is too large
@@ -4,22 +4,18 @@ from __future__ import unicode_literals
from ...attrs import LIKE_NUM


-_num_words = set(
-    """
+_num_words = set("""
nul een één twee drie vier vijf zes zeven acht negen tien elf twaalf dertien
veertien twintig dertig veertig vijftig zestig zeventig tachtig negentig honderd
duizend miljoen miljard biljoen biljard triljoen triljard
-""".split()
-)
+""".split())

-_ordinal_words = set(
-    """
+_ordinal_words = set("""
eerste tweede derde vierde vijfde zesde zevende achtste negende tiende elfde
twaalfde dertiende veertiende twintigste dertigste veertigste vijftigste
zestigste zeventigste tachtigste negentigste honderdste duizendste miljoenste
miljardste biljoenste biljardste triljoenste triljardste
-""".split()
-)
+""".split())


def like_num(text):

@@ -27,13 +23,11 @@ def like_num(text):
    # or matches one of the number words. In order to handle numbers like
    # "drieëntwintig", more work is required.
    # See this discussion: https://github.com/explosion/spaCy/pull/1177
-    if text.startswith(("+", "-", "±", "~")):
-        text = text[1:]
-    text = text.replace(",", "").replace(".", "")
+    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
-    if text.count("/") == 1:
-        num, denom = text.split("/")
+    if text.count('/') == 1:
+        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:

@@ -43,4 +37,6 @@ def like_num(text):
    return False


-LEX_ATTRS = {LIKE_NUM: like_num}
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
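The `like_num` logic in the hunks above is self-contained: drop separators, then check digits, simple fractions, and the closed number-word sets. The sketch below reproduces it with toy subsets of `_num_words` and `_ordinal_words` (an assumption for brevity) and keeps the sign-stripping step shown in the removed lines:

```python
# Toy subsets of the Dutch number-word sets (assumption, not the full lists).
_num_words = {"nul", "een", "twee", "drie", "tien", "honderd", "duizend"}
_ordinal_words = {"eerste", "tweede", "derde", "tiende"}

def like_num(text):
    # Strip a leading sign, drop thousands/decimal separators, then test
    # plain digits, simple fractions, and the number-word sets.
    if text.startswith(("+", "-", "\u00b1", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    if text.lower() in _ordinal_words:
        return True
    return False

print(like_num("1.000"), like_num("3/4"), like_num("Twee"), like_num("auto"))
# True True True False
```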
33  spacy/lang/nl/punctuation.py  Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER

from ..punctuation import TOKENIZER_SUFFIXES as DEFAULT_TOKENIZER_SUFFIXES


# Copied from `de` package. Main purpose is to ensure that hyphens are not
# split on.

_quotes = CONCAT_QUOTES.replace("'", '')

_infixes = (LIST_ELLIPSES + LIST_ICONS +
            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
             r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])'.format(a=ALPHA, q=_quotes),
             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
             r'(?<=[0-9])-(?=[0-9])'])


# Remove "'s" suffix from suffix list. In Dutch, "'s" is a plural ending when
# it occurs as a suffix and a clitic for "eens" in standalone use. To avoid
# ambiguity it's better to just leave it attached when it occurs as a suffix.
default_suffix_blacklist = ("'s", "'S", '’s', '’S')
_suffixes = [suffix for suffix in DEFAULT_TOKENIZER_SUFFIXES
             if suffix not in default_suffix_blacklist]

TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
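The suffix-blacklist filtering above is a plain list comprehension. The sketch below applies it to a stand-in suffix list (an assumption; the real `DEFAULT_TOKENIZER_SUFFIXES` comes from spaCy's shared punctuation module) to show that only the apostrophe-s variants are dropped:

```python
# Stand-in for spaCy's default suffix list (assumption for illustration):
# straight and curly apostrophe-s variants plus two regular suffix patterns.
DEFAULT_TOKENIZER_SUFFIXES = ["'s", "'S", "\u2019s", "\u2019S", "\\.", ","]

# Same filtering as in the nl punctuation module: drop "'s" variants so the
# clitic stays attached to the preceding token.
default_suffix_blacklist = ("'s", "'S", "\u2019s", "\u2019S")
_suffixes = [suffix for suffix in DEFAULT_TOKENIZER_SUFFIXES
             if suffix not in default_suffix_blacklist]

print(_suffixes)  # ['\\.', ',']
```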
@@ -1,45 +1,73 @@
# coding: utf8
from __future__ import unicode_literals

+# The original stop words list (added in f46ffe3) was taken from
+# http://www.damienvanholten.com/downloads/dutch-stop-words.txt
+# and consisted of about 100 tokens.
+# In order to achieve parity with some of the better-supported
+# languages, e.g., English, French, and German, this original list has been
+# extended with 200 additional tokens. The main source of inspiration was
+# https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt.
+# However, quite a bit of manual editing has taken place as well.
+# Tokens whose status as a stop word is not entirely clear were admitted or
+# rejected by deferring to their counterparts in the stop words lists for English
+# and French. Similarly, those lists were used to identify and fill in gaps so
+# that -- in principle -- each token contained in the English stop words list
+# should have a Dutch counterpart here.

-# Stop words are retrieved from http://www.damienvanholten.com/downloads/dutch-stop-words.txt

-STOP_WORDS = set(
-    """
-aan af al alles als altijd andere
+STOP_WORDS = set("""
+aan af al alle alles allebei alleen allen als altijd ander anders andere anderen aangaande aangezien achter achterna
+afgelopen aldus alhoewel anderzijds

-ben bij
+ben bij bijna bijvoorbeeld behalve beide beiden beneden bent bepaald beter betere betreffende binnen binnenin boven
+bovenal bovendien bovenstaand buiten

-daar dan dat de der deze die dit doch doen door dus
+daar dan dat de der den deze die dit doch doen door dus daarheen daarin daarna daarnet daarom daarop des dezelfde dezen
+dien dikwijls doet doorgaand doorgaans

-een eens en er
+een eens en er echter enige eerder eerst eerste eersten effe eigen elk elke enkel enkele enz erdoor etc even eveneens
+evenwel

-ge geen geweest
+ff

-haar had heb hebben heeft hem het hier hij hoe hun
+ge geen geweest gauw gedurende gegeven gehad geheel gekund geleden gelijk gemogen geven geweest gewoon gewoonweg
+geworden gij

-iemand iets ik in is
+haar had heb hebben heeft hem het hier hij hoe hun hadden hare hebt hele hen hierbeneden hierboven hierin hoewel hun

-ja je
+iemand iets ik in is idd ieder ikke ikzelf indien inmiddels inz inzake

-kan kon kunnen
+ja je jou jouw jullie jezelf jij jijzelf jouwe juist

-maar me meer men met mij mijn moet
+kan kon kunnen klaar konden krachtens kunnen kunt

-na naar niet niets nog nu
+lang later liet liever

-of om omdat ons ook op over
+maar me meer men met mij mijn moet mag mede meer meesten mezelf mijzelf min minder misschien mocht mochten moest moesten
+moet moeten mogelijk mogen

-reeds
+na naar niet niets nog nu nabij nadat net nogal nooit nr nu

-te tegen toch toen tot
+of om omdat ons ook op over omhoog omlaag omstreeks omtrent omver onder ondertussen ongeveer onszelf onze ooit opdat
+opnieuw opzij over overigens

-u uit uw
+pas pp precies prof publ

-van veel voor
+reeds rond rondom

-want waren was wat we wel werd wezen wie wij wil worden
+sedert sinds sindsdien slechts sommige spoedig steeds

-zal ze zei zelf zich zij zijn zo zonder zou
-""".split()
-)
+‘t 't te tegen toch toen tot tamelijk ten tenzij ter terwijl thans tijdens toe totdat tussen
+
+u uit uw uitgezonderd uwe uwen
+
+van veel voor vaak vanaf vandaan vanuit vanwege veeleer verder verre vervolgens vgl volgens vooraf vooral vooralsnog
+voorbij voordat voordien voorheen voorop voort voorts vooruit vrij vroeg
+
+want waren was wat we wel werd wezen wie wij wil worden waar waarom wanneer want weer weg wegens weinig weinige weldra
+welk welke welken werd werden wiens wier wilde wordt
+
+zal ze zei zelf zich zij zijn zo zonder zou zeer zeker zekere zelfde zelfs zichzelf zijnde zijne zo’n zoals zodra zouden
+zoveel zowat zulk zulke zulks zullen zult
+""".split())
@@ -5,7 +5,6 @@ from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ


-# fmt: off
TAG_MAP = {
    "ADJ__Number=Sing": {POS: ADJ},
    "ADJ___": {POS: ADJ},

@@ -811,4 +810,3 @@ TAG_MAP = {
    "X___": {POS: X},
    "_SP": {POS: SPACE}
}
-# fmt: on
340  spacy/lang/nl/tokenizer_exceptions.py  Normal file
@@ -0,0 +1,340 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA

# Extensive list of both common and uncommon dutch abbreviations copied from
# github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based
# sentence boundary detection (MIT, Copyright 2015 Kevin S. Dias).
# Source file: https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/languages/dutch.rb
# (Last commit: 4d1477b)

# Main purpose of such an extensive list: considerably improved sentence
# segmentation.

# Note: This list has been copied over largely as-is. Some of the abbreviations
# are extremely domain-specific. Tokenizer performance may benefit from some
# slight pruning, although no performance regression has been observed so far.


abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
           'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.',
           'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.',
           'a.t.d.f.', 'a.u.b.', 'a.v.a.', 'a.w.', 'aanbev.',
           'aanbev.comm.', 'aant.', 'aanv.st.', 'aanw.', 'vnw.',
           'aanw.vnw.', 'abd.', 'abm.', 'abs.', 'acc.act.',
           'acc.bedr.m.', 'acc.bedr.t.', 'achterv.', 'act.dr.',
           'act.dr.fam.', 'act.fisc.', 'act.soc.', 'adm.akk.',
           'adm.besl.', 'adm.lex.', 'adm.onderr.', 'adm.ov.', 'adv.',
           'adv.', 'gen.', 'adv.bl.', 'afd.', 'afl.', 'aggl.verord.',
           'agr.', 'al.', 'alg.', 'alg.richts.', 'amén.', 'ann.dr.',
           'ann.dr.lg.', 'ann.dr.sc.pol.', 'ann.ét.eur.',
           'ann.fac.dr.lg.', 'ann.jur.créd.',
           'ann.jur.créd.règl.coll.', 'ann.not.', 'ann.parl.',
           'ann.prat.comm.', 'app.', 'arb.', 'aud.', 'arbbl.',
           'arbh.', 'arbit.besl.', 'arbrb.', 'arr.', 'arr.cass.',
           'arr.r.v.st.', 'arr.verbr.', 'arrondrb.', 'art.', 'artw.',
           'aud.', 'b.', 'b.', 'b.&w.', 'b.a.', 'b.a.s.', 'b.b.o.',
           'b.best.dep.', 'b.br.ex.', 'b.coll.fr.gem.comm.',
           'b.coll.vl.gem.comm.', 'b.d.cult.r.', 'b.d.gem.ex.',
           'b.d.gem.reg.', 'b.dep.', 'b.e.b.', 'b.f.r.',
           'b.fr.gem.ex.', 'b.fr.gem.reg.', 'b.i.h.', 'b.inl.j.d.',
           'b.inl.s.reg.', 'b.j.', 'b.l.', 'b.o.z.', 'b.prov.r.',
           'b.r.h.', 'b.s.', 'b.sr.', 'b.stb.', 'b.t.i.r.',
           'b.t.s.z.', 'b.t.w.rev.', 'b.v.',
           'b.ver.coll.gem.gem.comm.', 'b.verg.r.b.', 'b.versl.',
           'b.vl.ex.', 'b.voorl.reg.', 'b.w.', 'b.w.gew.ex.',
           'b.z.d.g.', 'b.z.v.', 'bab.', 'bedr.org.', 'begins.',
           'beheersov.', 'bekendm.comm.', 'bel.', 'bel.besch.',
           'bel.w.p.', 'beleidsov.', 'belg.', 'grondw.', 'ber.',
           'ber.w.', 'besch.', 'besl.', 'beslagr.', 'bestuurswet.',
           'bet.', 'betr.', 'betr.', 'vnw.', 'bevest.', 'bew.',
           'bijbl.', 'ind.', 'eig.', 'bijbl.n.bijdr.', 'bijl.',
           'bijv.', 'bijw.', 'bijz.decr.', 'bin.b.', 'bkh.', 'bl.',
           'blz.', 'bm.', 'bn.', 'rh.', 'bnw.', 'bouwr.', 'br.parl.',
           'bs.', 'bull.', 'bull.adm.pénit.', 'bull.ass.',
           'bull.b.m.m.', 'bull.bel.', 'bull.best.strafinr.',
           'bull.bmm.', 'bull.c.b.n.', 'bull.c.n.c.', 'bull.cbn.',
           'bull.centr.arb.', 'bull.cnc.', 'bull.contr.',
           'bull.doc.min.fin.', 'bull.f.e.b.', 'bull.feb.',
           'bull.fisc.fin.r.', 'bull.i.u.m.',
           'bull.inf.ass.secr.soc.', 'bull.inf.i.e.c.',
           'bull.inf.i.n.a.m.i.', 'bull.inf.i.r.e.', 'bull.inf.iec.',
           'bull.inf.inami.', 'bull.inf.ire.', 'bull.inst.arb.',
           'bull.ium.', 'bull.jur.imm.', 'bull.lég.b.', 'bull.off.',
           'bull.trim.b.dr.comp.', 'bull.us.', 'bull.v.b.o.',
           'bull.vbo.', 'bv.', 'bw.', 'bxh.', 'byz.', 'c.', 'c.a.',
           'c.a.-a.', 'c.a.b.g.', 'c.c.', 'c.c.i.', 'c.c.s.',
           'c.conc.jur.', 'c.d.e.', 'c.d.p.k.', 'c.e.', 'c.ex.',
           'c.f.', 'c.h.a.', 'c.i.f.', 'c.i.f.i.c.', 'c.j.', 'c.l.',
           'c.n.', 'c.o.d.', 'c.p.', 'c.pr.civ.', 'c.q.', 'c.r.',
           'c.r.a.', 'c.s.', 'c.s.a.', 'c.s.q.n.', 'c.v.', 'c.v.a.',
           'c.v.o.', 'ca.', 'cadeaust.', 'cah.const.',
           'cah.dr.europ.', 'cah.dr.immo.', 'cah.dr.jud.', 'cal.',
           '2d.', 'cal.', '3e.', 'cal.', 'rprt.', 'cap.', 'carg.',
           'cass.', 'cass.', 'verw.', 'cert.', 'cf.', 'ch.', 'chron.',
           'chron.d.s.', 'chron.dr.not.', 'cie.', 'cie.',
           'verz.schr.', 'cir.', 'circ.', 'circ.z.', 'cit.',
           'cit.loc.', 'civ.', 'cl.et.b.', 'cmt.', 'co.',
           'cognoss.v.', 'coll.', 'v.', 'b.', 'colp.w.', 'com.',
           'com.', 'cas.', 'com.v.min.', 'comm.', 'comm.', 'v.',
           'comm.bijz.ov.', 'comm.erf.', 'comm.fin.', 'comm.ger.',
           'comm.handel.', 'comm.pers.', 'comm.pub.', 'comm.straf.',
           'comm.v.', 'comm.venn.', 'comm.verz.', 'comm.voor.',
           'comp.', 'compt.w.', 'computerr.', 'con.m.', 'concl.',
           'concr.', 'conf.', 'confl.w.', 'confl.w.huwbetr.', 'cons.',
           'conv.', 'coöp.', 'ver.', 'corr.', 'corr.bl.',
           'cour.fisc.', 'cour.immo.', 'cridon.', 'crim.', 'cur.',
           'cur.', 'crt.', 'curs.', 'd.', 'd.-g.', 'd.a.', 'd.a.v.',
           'd.b.f.', 'd.c.', 'd.c.c.r.', 'd.d.', 'd.d.p.', 'd.e.t.',
           'd.gem.r.', 'd.h.', 'd.h.z.', 'd.i.', 'd.i.t.', 'd.j.',
           'd.l.r.', 'd.m.', 'd.m.v.', 'd.o.v.', 'd.parl.', 'd.w.z.',
           'dact.', 'dat.', 'dbesch.', 'dbesl.', 'decr.', 'decr.d.',
           'decr.fr.', 'decr.vl.', 'decr.w.', 'def.', 'dep.opv.',
           'dep.rtl.', 'derg.', 'desp.', 'det.mag.', 'deurw.regl.',
           'dez.', 'dgl.', 'dhr.', 'disp.', 'diss.', 'div.',
           'div.act.', 'div.bel.', 'dl.', 'dln.', 'dnotz.', 'doc.',
           'hist.', 'doc.jur.b.', 'doc.min.fin.', 'doc.parl.',
           'doctr.', 'dpl.', 'dpl.besl.', 'dr.', 'dr.banc.fin.',
           'dr.circ.', 'dr.inform.', 'dr.mr.', 'dr.pén.entr.',
           'dr.q.m.', 'drs.', 'dtp.', 'dwz.', 'dyn.', 'e.', 'e.a.',
           'e.b.', 'tek.mod.', 'e.c.', 'e.c.a.', 'e.d.', 'e.e.',
           'e.e.a.', 'e.e.g.', 'e.g.', 'e.g.a.', 'e.h.a.', 'e.i.',
           'e.j.', 'e.m.a.', 'e.n.a.c.', 'e.o.', 'e.p.c.', 'e.r.c.',
           'e.r.f.', 'e.r.h.', 'e.r.o.', 'e.r.p.', 'e.r.v.',
           'e.s.r.a.', 'e.s.t.', 'e.v.', 'e.v.a.', 'e.w.', 'e&o.e.',
           'ec.pol.r.', 'econ.', 'ed.', 'ed(s).', 'eff.', 'eig.',
           'eig.mag.', 'eil.', 'elektr.', 'enmb.', 'enz.', 'err.',
           'etc.', 'etq.', 'eur.', 'parl.', 'eur.t.s.', 'ev.', 'evt.',
           'ex.', 'ex.crim.', 'exec.', 'f.', 'f.a.o.', 'f.a.q.',
           'f.a.s.', 'f.i.b.', 'f.j.f.', 'f.o.b.', 'f.o.r.', 'f.o.s.',
           'f.o.t.', 'f.r.', 'f.supp.', 'f.suppl.', 'fa.', 'facs.',
           'fasc.', 'fg.', 'fid.ber.', 'fig.', 'fin.verh.w.', 'fisc.',
           'fisc.', 'tijdschr.', 'fisc.act.', 'fisc.koer.', 'fl.',
           'form.', 'foro.', 'it.', 'fr.', 'fr.cult.r.', 'fr.gem.r.',
           'fr.parl.', 'fra.', 'ft.', 'g.', 'g.a.', 'g.a.v.',
           'g.a.w.v.', 'g.g.d.', 'g.m.t.', 'g.o.', 'g.omt.e.', 'g.p.',
           'g.s.', 'g.v.', 'g.w.w.', 'geb.', 'gebr.', 'gebrs.',
           'gec.', 'gec.decr.', 'ged.', 'ged.st.', 'gedipl.',
           'gedr.st.', 'geh.', 'gem.', 'gem.', 'gem.',
           'gem.gem.comm.', 'gem.st.', 'gem.stem.', 'gem.w.',
           'gemeensch.optr.', 'gemeensch.standp.', 'gemeensch.strat.',
           'gemeent.', 'gemeent.b.', 'gemeent.regl.',
           'gemeent.verord.', 'geol.', 'geopp.', 'gepubl.',
|
||||
'ger.deurw.', 'ger.w.', 'gerekw.', 'gereq.', 'gesch.',
|
||||
'get.', 'getr.', 'gev.m.', 'gev.maatr.', 'gew.', 'ghert.',
|
||||
'gir.eff.verk.', 'gk.', 'gr.', 'gramm.', 'grat.w.',
|
||||
'grootb.w.', 'grs.', 'grvm.', 'grw.', 'gst.', 'gw.',
|
||||
'h.a.', 'h.a.v.o.', 'h.b.o.', 'h.e.a.o.', 'h.e.g.a.',
|
||||
'h.e.geb.', 'h.e.gestr.', 'h.l.', 'h.m.', 'h.o.', 'h.r.',
|
||||
'h.t.l.', 'h.t.m.', 'h.w.geb.', 'hand.', 'handelsn.w.',
|
||||
'handelspr.', 'handelsr.w.', 'handelsreg.w.', 'handv.',
|
||||
'harv.l.rev.', 'hc.', 'herald.', 'hert.', 'herz.',
|
||||
'hfdst.', 'hfst.', 'hgrw.', 'hhr.', 'hist.', 'hooggel.',
|
||||
'hoogl.', 'hosp.', 'hpw.', 'hr.', 'hr.', 'ms.', 'hr.ms.',
|
||||
'hregw.', 'hrg.', 'hst.', 'huis.just.', 'huisv.w.',
|
||||
'huurbl.', 'hv.vn.', 'hw.', 'hyp.w.', 'i.b.s.', 'i.c.',
|
||||
'i.c.m.h.', 'i.e.', 'i.f.', 'i.f.p.', 'i.g.v.', 'i.h.',
|
||||
'i.h.a.', 'i.h.b.', 'i.l.pr.', 'i.o.', 'i.p.o.', 'i.p.r.',
|
||||
'i.p.v.', 'i.pl.v.', 'i.r.d.i.', 'i.s.m.', 'i.t.t.',
|
||||
'i.v.', 'i.v.m.', 'i.v.s.', 'i.w.tr.', 'i.z.', 'ib.',
|
||||
'ibid.', 'icip-ing.cons.', 'iem.', 'indic.soc.', 'indiv.',
|
||||
'inf.', 'inf.i.d.a.c.', 'inf.idac.', 'inf.r.i.z.i.v.',
|
||||
'inf.riziv.', 'inf.soc.secr.', 'ing.', 'ing.', 'cons.',
|
||||
'ing.cons.', 'inst.', 'int.', 'int.', 'rechtsh.',
|
||||
'strafz.', 'interm.', 'intern.fisc.act.',
|
||||
'intern.vervoerr.', 'inv.', 'inv.', 'f.', 'inv.w.',
|
||||
'inv.wet.', 'invord.w.', 'inz.', 'ir.', 'irspr.', 'iwtr.',
|
||||
'j.', 'j.-cl.', 'j.c.b.', 'j.c.e.', 'j.c.fl.', 'j.c.j.',
|
||||
'j.c.p.', 'j.d.e.', 'j.d.f.', 'j.d.s.c.', 'j.dr.jeun.',
|
||||
'j.j.d.', 'j.j.p.', 'j.j.pol.', 'j.l.', 'j.l.m.b.',
|
||||
'j.l.o.', 'j.p.a.', 'j.r.s.', 'j.t.', 'j.t.d.e.',
|
||||
'j.t.dr.eur.', 'j.t.o.', 'j.t.t.', 'jaarl.', 'jb.hand.',
|
||||
'jb.kred.', 'jb.kred.c.s.', 'jb.l.r.b.', 'jb.lrb.',
|
||||
'jb.markt.', 'jb.mens.', 'jb.t.r.d.', 'jb.trd.',
|
||||
'jeugdrb.', 'jeugdwerkg.w.', 'jg.', 'jis.', 'jl.',
|
||||
'journ.jur.', 'journ.prat.dr.fisc.fin.', 'journ.proc.',
|
||||
'jrg.', 'jur.', 'jur.comm.fl.', 'jur.dr.soc.b.l.n.',
|
||||
'jur.f.p.e.', 'jur.fpe.', 'jur.niv.', 'jur.trav.brux.',
|
||||
'jurambt.', 'jv.cass.', 'jv.h.r.j.', 'jv.hrj.', 'jw.',
|
||||
'k.', 'k.', 'k.b.', 'k.g.', 'k.k.', 'k.m.b.o.', 'k.o.o.',
|
||||
'k.v.k.', 'k.v.v.v.', 'kadasterw.', 'kaderb.', 'kador.',
|
||||
'kbo-nr.', 'kg.', 'kh.', 'kiesw.', 'kind.bes.v.', 'kkr.',
|
||||
'koopv.', 'kr.', 'krankz.w.', 'ksbel.', 'kt.', 'ktg.',
|
||||
'ktr.', 'kvdm.', 'kw.r.', 'kymr.', 'kzr.', 'kzw.', 'l.',
|
||||
'l.b.', 'l.b.o.', 'l.bas.', 'l.c.', 'l.gew.', 'l.j.',
|
||||
'l.k.', 'l.l.', 'l.o.', 'l.r.b.', 'l.u.v.i.', 'l.v.r.',
|
||||
'l.v.w.', 'l.w.', "l'exp.-compt.b..", 'l’exp.-compt.b.',
|
||||
'landinr.w.', 'landscrt.', 'lat.', 'law.ed.', 'lett.',
|
||||
'levensverz.', 'lgrs.', 'lidw.', 'limb.rechtsl.', 'lit.',
|
||||
'litt.', 'liw.', 'liwet.', 'lk.', 'll.', 'll.(l.)l.r.',
|
||||
'loonw.', 'losbl.', 'ltd.', 'luchtv.', 'luchtv.w.', 'm.',
|
||||
'm.', 'not.', 'm.a.v.o.', 'm.a.w.', 'm.b.', 'm.b.o.',
|
||||
'm.b.r.', 'm.b.t.', 'm.d.g.o.', 'm.e.a.o.', 'm.e.r.',
|
||||
'm.h.', 'm.h.d.', 'm.i.v.', 'm.j.t.', 'm.k.', 'm.m.',
|
||||
'm.m.a.', 'm.m.h.h.', 'm.m.v.', 'm.n.', 'm.not.fisc.',
|
||||
'm.nt.', 'm.o.', 'm.r.', 'm.s.a.', 'm.u.p.', 'm.v.a.',
|
||||
'm.v.h.n.', 'm.v.t.', 'm.z.', 'maatr.teboekgest.luchtv.',
|
||||
'maced.', 'mand.', 'max.', 'mbl.not.', 'me.', 'med.',
|
||||
'med.', 'v.b.o.', 'med.b.u.f.r.', 'med.bufr.', 'med.vbo.',
|
||||
'meerv.', 'meetbr.w.', 'mém.adm.', 'mgr.', 'mgrs.', 'mhd.',
|
||||
'mi.verantw.', 'mil.', 'mil.bed.', 'mil.ger.', 'min.',
|
||||
'min.', 'aanbev.', 'min.', 'circ.', 'min.', 'fin.',
|
||||
'min.j.omz.', 'min.just.circ.', 'mitt.', 'mnd.', 'mod.',
|
||||
'mon.', 'mouv.comm.', 'mr.', 'ms.', 'muz.', 'mv.', 'n.',
|
||||
'chr.', 'n.a.', 'n.a.g.', 'n.a.v.', 'n.b.', 'n.c.',
|
||||
'n.chr.', 'n.d.', 'n.d.r.', 'n.e.a.', 'n.g.', 'n.h.b.c.',
|
||||
'n.j.', 'n.j.b.', 'n.j.w.', 'n.l.', 'n.m.', 'n.m.m.',
|
||||
'n.n.', 'n.n.b.', 'n.n.g.', 'n.n.k.', 'n.o.m.', 'n.o.t.k.',
|
||||
'n.rapp.', 'n.tijd.pol.', 'n.v.', 'n.v.d.r.', 'n.v.d.v.',
|
||||
'n.v.o.b.', 'n.v.t.', 'nat.besch.w.', 'nat.omb.',
|
||||
'nat.pers.', 'ned.cult.r.', 'neg.verkl.', 'nhd.', 'wisk.',
|
||||
'njcm-bull.', 'nl.', 'nnd.', 'no.', 'not.fisc.m.',
|
||||
'not.w.', 'not.wet.', 'nr.', 'nrs.', 'nste.', 'nt.',
|
||||
'numism.', 'o.', 'o.a.', 'o.b.', 'o.c.', 'o.g.', 'o.g.v.',
|
||||
'o.i.', 'o.i.d.', 'o.m.', 'o.o.', 'o.o.d.', 'o.o.v.',
|
||||
'o.p.', 'o.r.', 'o.regl.', 'o.s.', 'o.t.s.', 'o.t.t.',
|
||||
'o.t.t.t.', 'o.t.t.z.', 'o.tk.t.', 'o.v.t.', 'o.v.t.t.',
|
||||
'o.v.tk.t.', 'o.v.v.', 'ob.', 'obsv.', 'octr.',
|
||||
'octr.gem.regl.', 'octr.regl.', 'oe.', 'off.pol.', 'ofra.',
|
||||
'ohd.', 'omb.', 'omnil.', 'omz.', 'on.ww.', 'onderr.',
|
||||
'onfrank.', 'onteig.w.', 'ontw.', 'b.w.', 'onuitg.',
|
||||
'onz.', 'oorl.w.', 'op.cit.', 'opin.pa.', 'opm.', 'or.',
|
||||
'ord.br.', 'ord.gem.', 'ors.', 'orth.', 'os.', 'osm.',
|
||||
'ov.', 'ov.w.i.', 'ov.w.ii.', 'ov.ww.', 'overg.w.',
|
||||
'overw.', 'ovkst.', 'oz.', 'p.', 'p.a.', 'p.a.o.',
|
||||
'p.b.o.', 'p.e.', 'p.g.', 'p.j.', 'p.m.', 'p.m.a.', 'p.o.',
|
||||
'p.o.j.t.', 'p.p.', 'p.v.', 'p.v.s.', 'pachtw.', 'pag.',
|
||||
'pan.', 'pand.b.', 'pand.pér.', 'parl.gesch.',
|
||||
'parl.gesch.', 'inv.', 'parl.st.', 'part.arb.', 'pas.',
|
||||
'pasin.', 'pat.', 'pb.c.', 'pb.l.', 'pens.',
|
||||
'pensioenverz.', 'per.ber.i.b.r.', 'per.ber.ibr.', 'pers.',
|
||||
'st.', 'pft.', 'pk.', 'pktg.', 'plv.', 'po.', 'pol.',
|
||||
'pol.off.', 'pol.r.', 'pol.w.', 'postbankw.', 'postw.',
|
||||
'pp.', 'pr.', 'preadv.', 'pres.', 'prf.', 'prft.', 'prg.',
|
||||
'prijz.w.', 'proc.', 'procesregl.', 'prof.', 'prot.',
|
||||
'prov.', 'prov.b.', 'prov.instr.h.m.g.', 'prov.regl.',
|
||||
'prov.verord.', 'prov.w.', 'publ.', 'pun.', 'pw.',
|
||||
'q.b.d.', 'q.e.d.', 'q.q.', 'q.r.', 'r.', 'r.a.b.g.',
|
||||
'r.a.c.e.', 'r.a.j.b.', 'r.b.d.c.', 'r.b.d.i.', 'r.b.s.s.',
|
||||
'r.c.', 'r.c.b.', 'r.c.d.c.', 'r.c.j.b.', 'r.c.s.j.',
|
||||
'r.cass.', 'r.d.c.', 'r.d.i.', 'r.d.i.d.c.', 'r.d.j.b.',
|
||||
'r.d.j.p.', 'r.d.p.c.', 'r.d.s.', 'r.d.t.i.', 'r.e.',
|
||||
'r.f.s.v.p.', 'r.g.a.r.', 'r.g.c.f.', 'r.g.d.c.', 'r.g.f.',
|
||||
'r.g.z.', 'r.h.a.', 'r.i.c.', 'r.i.d.a.', 'r.i.e.j.',
|
||||
'r.i.n.', 'r.i.s.a.', 'r.j.d.a.', 'r.j.i.', 'r.k.', 'r.l.',
|
||||
'r.l.g.b.', 'r.med.', 'r.med.rechtspr.', 'r.n.b.', 'r.o.',
|
||||
'r.ov.', 'r.p.', 'r.p.d.b.', 'r.p.o.t.', 'r.p.r.j.',
|
||||
'r.p.s.', 'r.r.d.', 'r.r.s.', 'r.s.', 'r.s.v.p.',
|
||||
'r.stvb.', 'r.t.d.f.', 'r.t.d.h.', 'r.t.l.',
|
||||
'r.trim.dr.eur.', 'r.v.a.', 'r.verkb.', 'r.w.', 'r.w.d.',
|
||||
'rap.ann.c.a.', 'rap.ann.c.c.', 'rap.ann.c.e.',
|
||||
'rap.ann.c.s.j.', 'rap.ann.ca.', 'rap.ann.cass.',
|
||||
'rap.ann.cc.', 'rap.ann.ce.', 'rap.ann.csj.', 'rapp.',
|
||||
'rb.', 'rb.kh.', 'rdn.', 'rdnr.', 're.pers.', 'rec.',
|
||||
'rec.c.i.j.', 'rec.c.j.c.e.', 'rec.cij.', 'rec.cjce.',
|
||||
'rec.gén.enr.not.', 'rechtsk.t.', 'rechtspl.zeem.',
|
||||
'rechtspr.arb.br.', 'rechtspr.b.f.e.', 'rechtspr.bfe.',
|
||||
'rechtspr.soc.r.b.l.n.', 'recl.reg.', 'rect.', 'red.',
|
||||
'reg.', 'reg.huiz.bew.', 'reg.w.', 'registr.w.', 'regl.',
|
||||
'regl.', 'r.v.k.', 'regl.besl.', 'regl.onderr.',
|
||||
'regl.r.t.', 'rep.', 'rép.fisc.', 'rép.not.', 'rep.r.j.',
|
||||
'rep.rj.', 'req.', 'res.', 'resp.', 'rev.', 'rev.',
|
||||
'comp.', 'rev.', 'trim.', 'civ.', 'rev.', 'trim.', 'comm.',
|
||||
'rev.acc.trav.', 'rev.adm.', 'rev.b.compt.',
|
||||
'rev.b.dr.const.', 'rev.b.dr.intern.', 'rev.b.séc.soc.',
|
||||
'rev.banc.fin.', 'rev.comm.', 'rev.cons.prud.',
|
||||
'rev.dr.b.', 'rev.dr.commun.', 'rev.dr.étr.',
|
||||
'rev.dr.fam.', 'rev.dr.intern.comp.', 'rev.dr.mil.',
|
||||
'rev.dr.min.', 'rev.dr.pén.', 'rev.dr.pén.mil.',
|
||||
'rev.dr.rur.', 'rev.dr.u.l.b.', 'rev.dr.ulb.', 'rev.exp.',
|
||||
'rev.faill.', 'rev.fisc.', 'rev.gd.', 'rev.hist.dr.',
|
||||
'rev.i.p.c.', 'rev.ipc.', 'rev.not.b.',
|
||||
'rev.prat.dr.comm.', 'rev.prat.not.b.', 'rev.prat.soc.',
|
||||
'rev.rec.', 'rev.rw.', 'rev.trav.', 'rev.trim.d.h.',
|
||||
'rev.trim.dr.fam.', 'rev.urb.', 'richtl.', 'riv.dir.int.',
|
||||
'riv.dir.int.priv.proc.', 'rk.', 'rln.', 'roln.', 'rom.',
|
||||
'rondz.', 'rov.', 'rtl.', 'rubr.', 'ruilv.wet.',
|
||||
'rv.verdr.', 'rvkb.', 's.', 's.', 's.a.', 's.b.n.',
|
||||
's.ct.', 's.d.', 's.e.c.', 's.e.et.o.', 's.e.w.',
|
||||
's.exec.rept.', 's.hrg.', 's.j.b.', 's.l.', 's.l.e.a.',
|
||||
's.l.n.d.', 's.p.a.', 's.s.', 's.t.', 's.t.b.', 's.v.',
|
||||
's.v.p.', 'samenw.', 'sc.', 'sch.', 'scheidsr.uitspr.',
|
||||
'schepel.besl.', 'secr.comm.', 'secr.gen.', 'sect.soc.',
|
||||
'sess.', 'cas.', 'sir.', 'soc.', 'best.', 'soc.', 'handv.',
|
||||
'soc.', 'verz.', 'soc.act.', 'soc.best.', 'soc.kron.',
|
||||
'soc.r.', 'soc.sw.', 'soc.weg.', 'sofi-nr.', 'somm.',
|
||||
'somm.ann.', 'sp.c.c.', 'sr.', 'ss.', 'st.doc.b.c.n.a.r.',
|
||||
'st.doc.bcnar.', 'st.vw.', 'stagever.', 'stas.', 'stat.',
|
||||
'stb.', 'stbl.', 'stcrt.', 'stud.dipl.', 'su.', 'subs.',
|
||||
'subst.', 'succ.w.', 'suppl.', 'sv.', 'sw.', 't.', 't.a.',
|
||||
't.a.a.', 't.a.n.', 't.a.p.', 't.a.s.n.', 't.a.v.',
|
||||
't.a.v.w.', 't.aann.', 't.acc.', 't.agr.r.', 't.app.',
|
||||
't.b.b.r.', 't.b.h.', 't.b.m.', 't.b.o.', 't.b.p.',
|
||||
't.b.r.', 't.b.s.', 't.b.v.', 't.bankw.', 't.belg.not.',
|
||||
't.desk.', 't.e.m.', 't.e.p.', 't.f.r.', 't.fam.',
|
||||
't.fin.r.', 't.g.r.', 't.g.t.', 't.g.v.', 't.gem.',
|
||||
't.gez.', 't.huur.', 't.i.n.', 't.j.k.', 't.l.l.',
|
||||
't.l.v.', 't.m.', 't.m.r.', 't.m.w.', 't.mil.r.',
|
||||
't.mil.strafr.', 't.not.', 't.o.', 't.o.r.b.', 't.o.v.',
|
||||
't.ontv.', 't.p.r.', 't.pol.', 't.r.', 't.r.g.',
|
||||
't.r.o.s.', 't.r.v.', 't.s.r.', 't.strafr.', 't.t.',
|
||||
't.u.', 't.v.c.', 't.v.g.', 't.v.m.r.', 't.v.o.', 't.v.v.',
|
||||
't.v.v.d.b.', 't.v.w.', 't.verz.', 't.vred.', 't.vreemd.',
|
||||
't.w.', 't.w.k.', 't.w.v.', 't.w.v.r.', 't.wrr.', 't.z.',
|
||||
't.z.t.', 't.z.v.', 'taalk.', 'tar.burg.z.', 'td.',
|
||||
'techn.', 'telecomm.', 'toel.', 'toel.st.v.w.', 'toep.',
|
||||
'toep.regl.', 'tom.', 'top.', 'trans.b.', 'transp.r.',
|
||||
'trb.', 'trib.', 'trib.civ.', 'trib.gr.inst.', 'ts.',
|
||||
'ts.', 'best.', 'ts.', 'verv.', 'turnh.rechtsl.', 'tvpol.',
|
||||
'tvpr.', 'tvrechtsgesch.', 'tw.', 'u.', 'u.a.', 'u.a.r.',
|
||||
'u.a.v.', 'u.c.', 'u.c.c.', 'u.g.', 'u.p.', 'u.s.',
|
||||
'u.s.d.c.', 'uitdr.', 'uitl.w.', 'uitv.besch.div.b.',
|
||||
'uitv.besl.', 'uitv.besl.', 'succ.w.', 'uitv.besl.bel.rv.',
|
||||
'uitv.besl.l.b.', 'uitv.reg.', 'inv.w.', 'uitv.reg.bel.d.',
|
||||
'uitv.reg.afd.verm.', 'uitv.reg.lb.', 'uitv.reg.succ.w.',
|
||||
'univ.', 'univ.verkl.', 'v.', 'v.', 'chr.', 'v.a.',
|
||||
'v.a.v.', 'v.c.', 'v.chr.', 'v.h.', 'v.huw.verm.', 'v.i.',
|
||||
'v.i.o.', 'v.k.a.', 'v.m.', 'v.o.f.', 'v.o.n.',
|
||||
'v.onderh.verpl.', 'v.p.', 'v.r.', 'v.s.o.', 'v.t.t.',
|
||||
'v.t.t.t.', 'v.tk.t.', 'v.toep.r.vert.', 'v.v.b.',
|
||||
'v.v.g.', 'v.v.t.', 'v.v.t.t.', 'v.v.tk.t.', 'v.w.b.',
|
||||
'v.z.m.', 'vb.', 'vb.bo.', 'vbb.', 'vc.', 'vd.', 'veldw.',
|
||||
'ver.k.', 'ver.verg.gem.', 'gem.comm.', 'verbr.', 'verd.',
|
||||
'verdr.', 'verdr.v.', 'tek.mod.', 'verenw.', 'verg.',
|
||||
'verg.fr.gem.', 'comm.', 'verkl.', 'verkl.herz.gw.',
|
||||
'verl.', 'deelw.', 'vern.', 'verord.', 'vers.r.',
|
||||
'versch.', 'versl.c.s.w.', 'versl.csw.', 'vert.', 'verw.',
|
||||
'verz.', 'verz.w.', 'verz.wett.besl.',
|
||||
'verz.wett.decr.besl.', 'vgl.', 'vid.', 'viss.w.',
|
||||
'vl.parl.', 'vl.r.', 'vl.t.gez.', 'vl.w.reg.',
|
||||
'vl.w.succ.', 'vlg.', 'vn.', 'vnl.', 'vnw.', 'vo.',
|
||||
'vo.bl.', 'voegw.', 'vol.', 'volg.', 'volt.', 'deelw.',
|
||||
'voorl.', 'voorz.', 'vord.w.', 'vorst.d.', 'vr.', 'vred.',
|
||||
'vrg.', 'vnw.', 'vrijgrs.', 'vs.', 'vt.', 'vw.', 'vz.',
|
||||
'vzngr.', 'vzr.', 'w.', 'w.a.', 'w.b.r.', 'w.c.h.',
|
||||
'w.conf.huw.', 'w.conf.huwelijksb.', 'w.consum.kr.',
|
||||
'w.f.r.', 'w.g.', 'w.gew.r.', 'w.ident.pl.', 'w.just.doc.',
|
||||
'w.kh.', 'w.l.r.', 'w.l.v.', 'w.mil.straf.spr.', 'w.n.',
|
||||
'w.not.ambt.', 'w.o.', 'w.o.d.huurcomm.', 'w.o.d.k.',
|
||||
'w.openb.manif.', 'w.parl.', 'w.r.', 'w.reg.', 'w.succ.',
|
||||
'w.u.b.', 'w.uitv.pl.verord.', 'w.v.', 'w.v.k.',
|
||||
'w.v.m.s.', 'w.v.r.', 'w.v.w.', 'w.venn.', 'wac.', 'wd.',
|
||||
'wetb.', 'n.v.h.', 'wgb.', 'winkelt.w.', 'wisk.',
|
||||
'wka-verkl.', 'wnd.', 'won.w.', 'woningw.', 'woonr.w.',
|
||||
'wrr.', 'wrr.ber.', 'wrsch.', 'ws.', 'wsch.', 'wsr.',
|
||||
'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.',
|
||||
'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.',
|
||||
'zr.', 'zr.', 'ms.', 'zr.ms.']
|
||||
|
||||
|
||||
_exc = {}
|
||||
for orth in abbrevs:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
uppered = orth.upper()
|
||||
capsed = orth.capitalize()
|
||||
for i in [uppered, capsed]:
|
||||
_exc[i] = [{ORTH: i}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
@@ -1,7 +1,7 @@
 # encoding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX,VERB
+from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX, VERB
 from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ

 # Source: Korakot Chaovavanich
@@ -17,8 +17,8 @@ TAG_MAP = {
     "CFQC": {POS: NOUN},
     "CVBL": {POS: NOUN},
     # VERB
-    "VACT":{POS:VERB},
-    "VSTA":{POS:VERB},
+    "VACT": {POS: VERB},
+    "VSTA": {POS: VERB},
     # PRON
     "PRON": {POS: PRON},
     "NPRP": {POS: PRON},
@ -5,6 +5,320 @@ from ...symbols import ORTH, LEMMA
|
|||
|
||||
|
||||
_exc = {
|
||||
#หน่วยงานรัฐ / government agency
|
||||
"กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}],
|
||||
"กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}],
|
||||
"กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}],
|
||||
"กบข.": [{ORTH: "กบข.", LEMMA: "กองทุนบำเหน็จบำนาญข้าราชการพลเรือน"}],
|
||||
"กบว.": [{ORTH: "กบว.", LEMMA: "คณะกรรมการบริหารวิทยุกระจายเสียงและวิทยุโทรทัศน์"}],
|
||||
"กปน.": [{ORTH: "กปน.", LEMMA: "การประปานครหลวง"}],
|
||||
"กปภ.": [{ORTH: "กปภ.", LEMMA: "การประปาส่วนภูมิภาค"}],
|
||||
"กปส.": [{ORTH: "กปส.", LEMMA: "กรมประชาสัมพันธ์"}],
|
||||
"กผม.": [{ORTH: "กผม.", LEMMA: "กองผังเมือง"}],
|
||||
"กฟน.": [{ORTH: "กฟน.", LEMMA: "การไฟฟ้านครหลวง"}],
|
||||
"กฟผ.": [{ORTH: "กฟผ.", LEMMA: "การไฟฟ้าฝ่ายผลิตแห่งประเทศไทย"}],
|
||||
"กฟภ.": [{ORTH: "กฟภ.", LEMMA: "การไฟฟ้าส่วนภูมิภาค"}],
|
||||
"ก.ช.น.": [{ORTH: "ก.ช.น.", LEMMA: "คณะกรรมการช่วยเหลือชาวนาชาวไร่"}],
|
||||
"กยศ.": [{ORTH: "กยศ.", LEMMA: "กองทุนเงินให้กู้ยืมเพื่อการศึกษา"}],
|
||||
"ก.ล.ต.": [{ORTH: "ก.ล.ต.", LEMMA: "คณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์"}],
|
||||
"กศ.บ.": [{ORTH: "กศ.บ.", LEMMA: "การศึกษาบัณฑิต"}],
|
||||
"กศน.": [{ORTH: "กศน.", LEMMA: "กรมการศึกษานอกโรงเรียน"}],
|
||||
"กสท.": [{ORTH: "กสท.", LEMMA: "การสื่อสารแห่งประเทศไทย"}],
|
||||
"กอ.รมน.": [{ORTH: "กอ.รมน.", LEMMA: "กองอำนวยการรักษาความมั่นคงภายใน"}],
|
||||
"กร.": [{ORTH: "กร.", LEMMA: "กองเรือยุทธการ"}],
|
||||
"ขสมก.": [{ORTH: "ขสมก.", LEMMA: "องค์การขนส่งมวลชนกรุงเทพ"}],
|
||||
"คตง.": [{ORTH: "คตง.", LEMMA: "คณะกรรมการตรวจเงินแผ่นดิน"}],
|
||||
"ครม.": [{ORTH: "ครม.", LEMMA: "คณะรัฐมนตรี"}],
|
||||
"คมช.": [{ORTH: "คมช.", LEMMA: "คณะมนตรีความมั่นคงแห่งชาติ"}],
|
||||
"ตชด.": [{ORTH: "ตชด.", LEMMA: "ตำรวจตะเวนชายเดน"}],
|
||||
"ตม.": [{ORTH: "ตม.", LEMMA: "กองตรวจคนเข้าเมือง"}],
|
||||
"ตร.": [{ORTH: "ตร.", LEMMA: "ตำรวจ"}],
|
||||
"ททท.": [{ORTH: "ททท.", LEMMA: "การท่องเที่ยวแห่งประเทศไทย"}],
|
||||
"ททบ.": [{ORTH: "ททบ.", LEMMA: "สถานีวิทยุโทรทัศน์กองทัพบก"}],
|
||||
"ทบ.": [{ORTH: "ทบ.", LEMMA: "กองทัพบก"}],
|
||||
"ทร.": [{ORTH: "ทร.", LEMMA: "กองทัพเรือ"}],
|
||||
"ทอ.": [{ORTH: "ทอ.", LEMMA: "กองทัพอากาศ"}],
|
||||
"ทอท.": [{ORTH: "ทอท.", LEMMA: "การท่าอากาศยานแห่งประเทศไทย"}],
|
||||
"ธ.ก.ส.": [{ORTH: "ธ.ก.ส.", LEMMA: "ธนาคารเพื่อการเกษตรและสหกรณ์การเกษตร"}],
|
||||
"ธปท.": [{ORTH: "ธปท.", LEMMA: "ธนาคารแห่งประเทศไทย"}],
|
||||
"ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}],
|
||||
"นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}],
|
||||
"ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}],
|
||||
"ป.ป.ช.": [{ORTH: "ป.ป.ช.", LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ"}],
|
||||
"ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}],
|
||||
"บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}],
|
||||
"บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}],
|
||||
"พสวท.": [{ORTH: "พสวท.", LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี"}],
|
||||
"มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}],
|
||||
"ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}],
|
||||
"รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}],
|
||||
"รฟท.": [{ORTH: "รฟท.", LEMMA: "การรถไฟแห่งประเทศไทย"}],
|
||||
"รฟม.": [{ORTH: "รฟม.", LEMMA: "การรถไฟฟ้าขนส่งมวลชนแห่งประเทศไทย"}],
|
||||
"ศธ.": [{ORTH: "ศธ.", LEMMA: "กระทรวงศึกษาธิการ"}],
|
||||
"ศนธ.": [{ORTH: "ศนธ.", LEMMA: "ศูนย์กลางนิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||
"สกจ.": [{ORTH: "สกจ.", LEMMA: "สหกรณ์จังหวัด"}],
|
||||
"สกท.": [{ORTH: "สกท.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมการลงทุน"}],
|
||||
"สกว.": [{ORTH: "สกว.", LEMMA: "สำนักงานกองทุนสนับสนุนการวิจัย"}],
|
||||
"สคบ.": [{ORTH: "สคบ.", LEMMA: "สำนักงานคณะกรรมการคุ้มครองผู้บริโภค"}],
|
||||
"สจร.": [{ORTH: "สจร.", LEMMA: "สำนักงานคณะกรรมการจัดระบบการจราจรทางบก"}],
|
||||
"สตง.": [{ORTH: "สตง.", LEMMA: "สำนักงานตรวจเงินแผ่นดิน"}],
|
||||
"สทท.": [{ORTH: "สทท.", LEMMA: "สถานีวิทยุโทรทัศน์แห่งประเทศไทย"}],
|
||||
"สทร.": [{ORTH: "สทร.", LEMMA: "สำนักงานกลางทะเบียนราษฎร์"}],
|
||||
"สธ": [{ORTH: "สธ", LEMMA: "กระทรวงสาธารณสุข"}],
|
||||
"สนช.": [{ORTH: "สนช.", LEMMA: "สภานิติบัญญัติแห่งชาติ,สำนักงานนวัตกรรมแห่งชาติ"}],
|
||||
"สนนท.": [{ORTH: "สนนท.", LEMMA: "สหพันธ์นิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||
"สปก.": [{ORTH: "สปก.", LEMMA: "สำนักงานการปฏิรูปที่ดินเพื่อเกษตรกรรม"}],
|
||||
"สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}],
|
||||
"สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}],
|
||||
"สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}],
|
||||
"สยช.": [{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}],
|
||||
"สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}],
|
||||
"สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}],
|
||||
"สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}],
|
||||
"สคช.": [{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}],
|
||||
"สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}],
|
||||
"สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}],
|
||||
"สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}],
|
||||
"อตก.": [{ORTH: "อตก.", LEMMA: "องค์การตลาดเพื่อเกษตรกร"}],
|
||||
"อบจ.": [{ORTH: "อบจ.", LEMMA: "องค์การบริหารส่วนจังหวัด"}],
|
||||
"อบต.": [{ORTH: "อบต.", LEMMA: "องค์การบริหารส่วนตำบล"}],
|
||||
"อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}],
|
||||
"อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}],
|
||||
"อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}],
|
||||
#มหาวิทยาลัย / สถานศึกษา / university / college
|
||||
"มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}],
|
||||
"มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}],
|
||||
"ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}],
|
||||
"มทร.": [{ORTH: "มทร.", LEMMA: "มหาวิทยาลัยเทคโนโลยีราชมงคล"}],
|
||||
"มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}],
|
||||
"วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}],
|
||||
"สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}],
|
||||
#ยศ / rank
|
||||
"ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}],
|
||||
"ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}],
|
||||
"จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}],
|
||||
"จ.ท.": [{ORTH: "จ.ท.", LEMMA: "จ่าโท"}],
|
||||
"จ.ส.ต.": [{ORTH: "จ.ส.ต.", LEMMA: "จ่าสิบตรี (ทหารบก)"}],
|
||||
"จสต.": [{ORTH: "จสต.", LEMMA: "จ่าสิบตำรวจ"}],
|
||||
"จ.ส.ท.": [{ORTH: "จ.ส.ท.", LEMMA: "จ่าสิบโท"}],
|
||||
"จ.ส.อ.": [{ORTH: "จ.ส.อ.", LEMMA: "จ่าสิบเอก"}],
|
||||
"จ.อ.": [{ORTH: "จ.อ.", LEMMA: "จ่าเอก"}],
|
||||
"ทพญ.": [{ORTH: "ทพญ.", LEMMA: "ทันตแพทย์หญิง"}],
|
||||
"ทนพ.": [{ORTH: "ทนพ.", LEMMA: "เทคนิคการแพทย์"}],
|
||||
"นจอ.": [{ORTH: "นจอ.", LEMMA: "นักเรียนจ่าอากาศ"}],
|
||||
"น.ช.": [{ORTH: "น.ช.", LEMMA: "นักโทษชาย"}],
|
||||
"น.ญ.": [{ORTH: "น.ญ.", LEMMA: "นักโทษหญิง"}],
|
||||
"น.ต.": [{ORTH: "น.ต.", LEMMA: "นาวาตรี"}],
|
||||
"น.ท.": [{ORTH: "น.ท.", LEMMA: "นาวาโท"}],
|
||||
"นตท.": [{ORTH: "นตท.", LEMMA: "นักเรียนเตรียมทหาร"}],
|
||||
"นนส.": [{ORTH: "นนส.", LEMMA: "นักเรียนนายสิบทหารบก"}],
|
||||
"นนร.": [{ORTH: "นนร.", LEMMA: "นักเรียนนายร้อย"}],
|
||||
"นนอ.": [{ORTH: "นนอ.", LEMMA: "นักเรียนนายเรืออากาศ"}],
|
||||
"นพ.": [{ORTH: "นพ.", LEMMA: "นายแพทย์"}],
|
||||
"นพท.": [{ORTH: "นพท.", LEMMA: "นายแพทย์ทหาร"}],
|
||||
"นรจ.": [{ORTH: "นรจ.", LEMMA: "นักเรียนจ่าทหารเรือ"}],
|
||||
"นรต.": [{ORTH: "นรต.", LEMMA: "นักเรียนนายร้อยตำรวจ"}],
|
||||
"นศพ.": [{ORTH: "นศพ.", LEMMA: "นักศึกษาแพทย์"}],
|
||||
"นศท.": [{ORTH: "นศท.", LEMMA: "นักศึกษาวิชาทหาร"}],
|
||||
"น.สพ.": [{ORTH: "น.สพ.", LEMMA: "นายสัตวแพทย์ (พ.ร.บ.วิชาชีพการสัตวแพทย์)"}],
|
||||
"น.อ.": [{ORTH: "น.อ.", LEMMA: "นาวาเอก"}],
|
||||
"บช.ก.": [{ORTH: "บช.ก.", LEMMA: "กองบัญชาการตำรวจสอบสวนกลาง"}],
|
||||
"บช.น.": [{ORTH: "บช.น.", LEMMA: "กองบัญชาการตำรวจนครบาล"}],
|
||||
"ผกก.": [{ORTH: "ผกก.", LEMMA: "ผู้กำกับการ"}],
|
||||
"ผกก.ภ.": [{ORTH: "ผกก.ภ.", LEMMA: "ผู้กำกับการตำรวจภูธร"}],
|
||||
"ผจก.": [{ORTH: "ผจก.", LEMMA: "ผู้จัดการ"}],
|
||||
"ผช.": [{ORTH: "ผช.", LEMMA: "ผู้ช่วย"}],
|
||||
"ผชก.": [{ORTH: "ผชก.", LEMMA: "ผู้ชำนาญการ"}],
|
||||
"ผช.ผอ.": [{ORTH: "ผช.ผอ.", LEMMA: "ผู้ช่วยผู้อำนวยการ"}],
|
||||
"ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}],
|
||||
"ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}],
|
||||
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}],
|
||||
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับการ (ตำรวจ)"}],
|
||||
"ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}],
|
||||
"ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}],
|
||||
"ผบก.ปค.": [{ORTH: "ผบก.ปค.", LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)"}],
|
||||
"ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}],
|
||||
"ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}],
|
||||
"ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}],
|
||||
"ผบช.ก.": [{ORTH: "ผบช.ก.", LEMMA: "ผู้บัญชาการตำรวจสอบสวนกลาง"}],
|
||||
"ผบช.ตชด.": [{ORTH: "ผบช.ตชด.", LEMMA: "ผู้บัญชาการตำรวจตระเวนชายแดน"}],
|
||||
"ผบช.น.": [{ORTH: "ผบช.น.", LEMMA: "ผู้บัญชาการตำรวจนครบาล"}],
|
||||
"ผบช.ภ.": [{ORTH: "ผบช.ภ.", LEMMA: "ผู้บัญชาการตำรวจภูธร"}],
|
||||
"ผบ.ทบ.": [{ORTH: "ผบ.ทบ.", LEMMA: "ผู้บัญชาการทหารบก"}],
|
||||
"ผบ.ตร.": [{ORTH: "ผบ.ตร.", LEMMA: "ผู้บัญชาการตำรวจแห่งชาติ"}],
|
||||
"ผบ.ทร.": [{ORTH: "ผบ.ทร.", LEMMA: "ผู้บัญชาการทหารเรือ"}],
|
||||
"ผบ.ทอ.": [{ORTH: "ผบ.ทอ.", LEMMA: "ผู้บัญชาการทหารอากาศ"}],
|
||||
"ผบ.ทสส.": [{ORTH: "ผบ.ทสส.", LEMMA: "ผู้บัญชาการทหารสูงสุด"}],
|
||||
"ผวจ.": [{ORTH: "ผวจ.", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||
"ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||
"พ.จ.ต.": [{ORTH: "พ.จ.ต.", LEMMA: "พันจ่าตรี"}],
|
||||
"พ.จ.ท.": [{ORTH: "พ.จ.ท.", LEMMA: "พันจ่าโท"}],
|
||||
"พ.จ.อ.": [{ORTH: "พ.จ.อ.", LEMMA: "พันจ่าเอก"}],
|
||||
"พญ.": [{ORTH: "พญ.", LEMMA: "แพทย์หญิง"}],
|
||||
"ฯพณฯ": [{ORTH: "ฯพณฯ", LEMMA: "พณท่าน"}],
|
||||
"พ.ต.": [{ORTH: "พ.ต.", LEMMA: "พันตรี"}],
|
||||
"พ.ท.": [{ORTH: "พ.ท.", LEMMA: "พันโท"}],
|
||||
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||
"พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ", LEMMA: "พันตำรวจเอกพิเศษ"}],
|
||||
"พลฯ": [{ORTH: "พลฯ", LEMMA: "พลทหาร"}],
|
||||
"พล.๑ รอ.": [{ORTH: "พล.๑ รอ.", LEMMA: "กองพลที่ ๑ รักษาพระองค์ กองทัพบก"}],
|
||||
"พล.ต.": [{ORTH: "พล.ต.", LEMMA: "พลตรี"}],
|
||||
"พล.ต.ต.": [{ORTH: "พล.ต.ต.", LEMMA: "พลตำรวจตรี"}],
|
||||
"พล.ต.ท.": [{ORTH: "พล.ต.ท.", LEMMA: "พลตำรวจโท"}],
|
||||
"พล.ต.อ.": [{ORTH: "พล.ต.อ.", LEMMA: "พลตำรวจเอก"}],
|
||||
"พล.ท.": [{ORTH: "พล.ท.", LEMMA: "พลโท"}],
|
||||
"พล.ปตอ.": [{ORTH: "พล.ปตอ.", LEMMA: "กองพลทหารปืนใหญ่ต่อสู่อากาศยาน"}],
|
||||
"พล.ม.": [{ORTH: "พล.ม.", LEMMA: "กองพลทหารม้า"}],
|
||||
"พล.ม.๒": [{ORTH: "พล.ม.๒", LEMMA: "กองพลทหารม้าที่ ๒"}],
|
||||
"พล.ร.ต.": [{ORTH: "พล.ร.ต.", LEMMA: "พลเรือตรี"}],
|
||||
"พล.ร.ท.": [{ORTH: "พล.ร.ท.", LEMMA: "พลเรือโท"}],
|
||||
"พล.ร.อ.": [{ORTH: "พล.ร.อ.", LEMMA: "พลเรือเอก"}],
|
||||
"พล.อ.": [{ORTH: "พล.อ.", LEMMA: "พลเอก"}],
|
||||
"พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}],
|
||||
"พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}],
|
||||
"พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}],
|
||||
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||
"พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}],
|
||||
"พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}],
|
||||
"พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}],
|
||||
"พ.อ.อ.": [{ORTH: "พ.อ.อ.", LEMMA: "พันจ่าอากาศเอก"}],
|
||||
"ภกญ.": [{ORTH: "ภกญ.", LEMMA: "เภสัชกรหญิง"}],
|
||||
"ม.จ.": [{ORTH: "ม.จ.", LEMMA: "หม่อมเจ้า"}],
|
||||
"มท1": [{ORTH: "มท1", LEMMA: "รัฐมนตรีว่าการกระทรวงมหาดไทย"}],
|
||||
"ม.ร.ว.": [{ORTH: "ม.ร.ว.", LEMMA: "หม่อมราชวงศ์"}],
|
||||
"มล.": [{ORTH: "มล.", LEMMA: "หม่อมหลวง"}],
|
||||
"ร.ต.": [{ORTH: "ร.ต.", LEMMA: "ร้อยตรี,เรือตรี,เรืออากาศตรี"}],
|
||||
"ร.ต.ต.": [{ORTH: "ร.ต.ต.", LEMMA: "ร้อยตำรวจตรี"}],
|
||||
"ร.ต.ท.": [{ORTH: "ร.ต.ท.", LEMMA: "ร้อยตำรวจโท"}],
|
||||
"ร.ต.อ.": [{ORTH: "ร.ต.อ.", LEMMA: "ร้อยตำรวจเอก"}],
|
||||
"ร.ท.": [{ORTH: "ร.ท.", LEMMA: "ร้อยโท,เรือโท,เรืออากาศโท"}],
|
||||
"รมช.": [{ORTH: "รมช.", LEMMA: "รัฐมนตรีช่วยว่าการกระทรวง"}],
|
||||
"รมต.": [{ORTH: "รมต.", LEMMA: "รัฐมนตรี"}],
|
||||
"รมว.": [{ORTH: "รมว.", LEMMA: "รัฐมนตรีว่าการกระทรวง"}],
|
||||
"รศ.": [{ORTH: "รศ.", LEMMA: "รองศาสตราจารย์"}],
|
||||
"ร.อ.": [{ORTH: "ร.อ.", LEMMA: "ร้อยเอก,เรือเอก,เรืออากาศเอก"}],
|
||||
"ศ.": [{ORTH: "ศ.", LEMMA: "ศาสตราจารย์"}],
|
||||
"ส.ต.": [{ORTH: "ส.ต.", LEMMA: "สิบตรี"}],
|
||||
"ส.ต.ต.": [{ORTH: "ส.ต.ต.", LEMMA: "สิบตำรวจตรี"}],
|
||||
"ส.ต.ท.": [{ORTH: "ส.ต.ท.", LEMMA: "สิบตำรวจโท"}],
|
||||
"ส.ต.อ.": [{ORTH: "ส.ต.อ.", LEMMA: "สิบตำรวจเอก"}],
|
||||
"ส.ท.": [{ORTH: "ส.ท.", LEMMA: "สิบโท"}],
|
||||
"สพ.": [{ORTH: "สพ.", LEMMA: "สัตวแพทย์"}],
|
||||
"สพ.ญ.": [{ORTH: "สพ.ญ.", LEMMA: "สัตวแพทย์หญิง"}],
|
||||
"สพ.ช.": [{ORTH: "สพ.ช.", LEMMA: "สัตวแพทย์ชาย"}],
|
||||
"ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}],
|
||||
"อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}],
|
||||
"อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}],
|
||||
#วุฒิ / bachelor degree
|
||||
"ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}],
|
||||
"ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}],
|
||||
"ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}],
|
||||
"ปวช.": [{ORTH: "ปวช.", LEMMA: "ประกาศนียบัตรวิชาชีพ"}],
|
||||
"ปวท.": [{ORTH: "ปวท.", LEMMA: "ประกาศนียบัตรวิชาชีพเทคนิค"}],
|
||||
"ปวส.": [{ORTH: "ปวส.", LEMMA: "ประกาศนียบัตรวิชาชีพชั้นสูง"}],
|
||||
"ปทส.": [{ORTH: "ปทส.", LEMMA: "ประกาศนียบัตรครูเทคนิคชั้นสูง"}],
|
||||
"กษ.บ.": [{ORTH: "กษ.บ.", LEMMA: "เกษตรศาสตรบัณฑิต"}],
|
||||
"กษ.ม.": [{ORTH: "กษ.ม.", LEMMA: "เกษตรศาสตรมหาบัณฑิต"}],
|
||||
"กษ.ด.": [{ORTH: "กษ.ด.", LEMMA: "เกษตรศาสตรดุษฎีบัณฑิต"}],
|
||||
"ค.บ.": [{ORTH: "ค.บ.", LEMMA: "ครุศาสตรบัณฑิต"}],
|
||||
"คศ.บ.": [{ORTH: "คศ.บ.", LEMMA: "คหกรรมศาสตรบัณฑิต"}],
|
||||
"คศ.ม.": [{ORTH: "คศ.ม.", LEMMA: "คหกรรมศาสตรมหาบัณฑิต"}],
|
||||
"คศ.ด.": [{ORTH: "คศ.ด.", LEMMA: "คหกรรมศาสตรดุษฎีบัณฑิต"}],
|
||||
"ค.อ.บ.": [{ORTH: "ค.อ.บ.", LEMMA: "ครุศาสตรอุตสาหกรรมบัณฑิต"}],
|
||||
"ค.อ.ม.": [{ORTH: "ค.อ.ม.", LEMMA: "ครุศาสตรอุตสาหกรรมมหาบัณฑิต"}],
|
||||
    "ค.อ.ด.": [{ORTH: "ค.อ.ด.", LEMMA: "ครุศาสตรอุตสาหกรรมดุษฎีบัณฑิต"}],
    "ทก.บ.": [{ORTH: "ทก.บ.", LEMMA: "เทคโนโลยีการเกษตรบัณฑิต"}],
    "ทก.ม.": [{ORTH: "ทก.ม.", LEMMA: "เทคโนโลยีการเกษตรมหาบัณฑิต"}],
    "ทก.ด.": [{ORTH: "ทก.ด.", LEMMA: "เทคโนโลยีการเกษตรดุษฎีบัณฑิต"}],
    "ท.บ.": [{ORTH: "ท.บ.", LEMMA: "ทันตแพทยศาสตรบัณฑิต"}],
    "ท.ม.": [{ORTH: "ท.ม.", LEMMA: "ทันตแพทยศาสตรมหาบัณฑิต"}],
    "ท.ด.": [{ORTH: "ท.ด.", LEMMA: "ทันตแพทยศาสตรดุษฎีบัณฑิต"}],
    "น.บ.": [{ORTH: "น.บ.", LEMMA: "นิติศาสตรบัณฑิต"}],
    "น.ม.": [{ORTH: "น.ม.", LEMMA: "นิติศาสตรมหาบัณฑิต"}],
    "น.ด.": [{ORTH: "น.ด.", LEMMA: "นิติศาสตรดุษฎีบัณฑิต"}],
    "นศ.บ.": [{ORTH: "นศ.บ.", LEMMA: "นิเทศศาสตรบัณฑิต"}],
    "นศ.ม.": [{ORTH: "นศ.ม.", LEMMA: "นิเทศศาสตรมหาบัณฑิต"}],
    "นศ.ด.": [{ORTH: "นศ.ด.", LEMMA: "นิเทศศาสตรดุษฎีบัณฑิต"}],
    "บช.บ.": [{ORTH: "บช.บ.", LEMMA: "บัญชีบัณฑิต"}],
    "บช.ม.": [{ORTH: "บช.ม.", LEMMA: "บัญชีมหาบัณฑิต"}],
    "บช.ด.": [{ORTH: "บช.ด.", LEMMA: "บัญชีดุษฎีบัณฑิต"}],
    "บธ.บ.": [{ORTH: "บธ.บ.", LEMMA: "บริหารธุรกิจบัณฑิต"}],
    "บธ.ม.": [{ORTH: "บธ.ม.", LEMMA: "บริหารธุรกิจมหาบัณฑิต"}],
    "บธ.ด.": [{ORTH: "บธ.ด.", LEMMA: "บริหารธุรกิจดุษฎีบัณฑิต"}],
    "พณ.บ.": [{ORTH: "พณ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
    "พณ.ม.": [{ORTH: "พณ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
    "พณ.ด.": [{ORTH: "พณ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
    "พ.บ.": [{ORTH: "พ.บ.", LEMMA: "แพทยศาสตรบัณฑิต"}],
    "พ.ม.": [{ORTH: "พ.ม.", LEMMA: "แพทยศาสตรมหาบัณฑิต"}],
    "พ.ด.": [{ORTH: "พ.ด.", LEMMA: "แพทยศาสตรดุษฎีบัณฑิต"}],
    "พธ.บ.": [{ORTH: "พธ.บ.", LEMMA: "พุทธศาสตรบัณฑิต"}],
    "พธ.ม.": [{ORTH: "พธ.ม.", LEMMA: "พุทธศาสตรมหาบัณฑิต"}],
    "พธ.ด.": [{ORTH: "พธ.ด.", LEMMA: "พุทธศาสตรดุษฎีบัณฑิต"}],
    "พบ.บ.": [{ORTH: "พบ.บ.", LEMMA: "พัฒนบริหารศาสตรบัณฑิต"}],
    "พบ.ม.": [{ORTH: "พบ.ม.", LEMMA: "พัฒนบริหารศาสตรมหาบัณฑิต"}],
    "พบ.ด.": [{ORTH: "พบ.ด.", LEMMA: "พัฒนบริหารศาสตรดุษฎีบัณฑิต"}],
    "พย.บ.": [{ORTH: "พย.บ.", LEMMA: "พยาบาลศาสตรบัณฑิต"}],
    "พย.ม.": [{ORTH: "พย.ม.", LEMMA: "พยาบาลศาสตรมหาบัณฑิต"}],
    "พย.ด.": [{ORTH: "พย.ด.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
    "พศ.บ.": [{ORTH: "พศ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
    "พศ.ม.": [{ORTH: "พศ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
    "พศ.ด.": [{ORTH: "พศ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
    "ภ.บ.": [{ORTH: "ภ.บ.", LEMMA: "เภสัชศาสตรบัณฑิต"}],
    "ภ.ม.": [{ORTH: "ภ.ม.", LEMMA: "เภสัชศาสตรมหาบัณฑิต"}],
    "ภ.ด.": [{ORTH: "ภ.ด.", LEMMA: "เภสัชศาสตรดุษฎีบัณฑิต"}],
    "ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ.", LEMMA: "ภูมิสถาปัตยกรรมศาสตรบัณฑิต"}],
    "รป.บ.": [{ORTH: "รป.บ.", LEMMA: "รัฐประศาสนศาสตร์บัณฑิต"}],
    "รป.ม.": [{ORTH: "รป.ม.", LEMMA: "รัฐประศาสนศาสตร์มหาบัณฑิต"}],
    "วท.บ.": [{ORTH: "วท.บ.", LEMMA: "วิทยาศาสตรบัณฑิต"}],
    "วท.ม.": [{ORTH: "วท.ม.", LEMMA: "วิทยาศาสตรมหาบัณฑิต"}],
    "วท.ด.": [{ORTH: "วท.ด.", LEMMA: "วิทยาศาสตรดุษฎีบัณฑิต"}],
    "ศ.บ.": [{ORTH: "ศ.บ.", LEMMA: "ศิลปบัณฑิต"}],
    "ศศ.บ.": [{ORTH: "ศศ.บ.", LEMMA: "ศิลปศาสตรบัณฑิต"}],
    "ศษ.บ.": [{ORTH: "ศษ.บ.", LEMMA: "ศึกษาศาสตรบัณฑิต"}],
    "ศส.บ.": [{ORTH: "ศส.บ.", LEMMA: "เศรษฐศาสตรบัณฑิต"}],
    "สถ.บ.": [{ORTH: "สถ.บ.", LEMMA: "สถาปัตยกรรมศาสตรบัณฑิต"}],
    "สถ.ม.": [{ORTH: "สถ.ม.", LEMMA: "สถาปัตยกรรมศาสตรมหาบัณฑิต"}],
    "สถ.ด.": [{ORTH: "สถ.ด.", LEMMA: "สถาปัตยกรรมศาสตรดุษฎีบัณฑิต"}],
    "สพ.บ.": [{ORTH: "สพ.บ.", LEMMA: "สัตวแพทยศาสตรบัณฑิต"}],
    "อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}],
    "อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}],
    "อ.ด.": [{ORTH: "อ.ด.", LEMMA: "อักษรศาสตรดุษฎีบัณฑิต"}],
    # ปี / เวลา / year / time
    "ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}],
    "จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}],
    "ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}],
    "ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}],
    "ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}],
    # ระยะทาง / distance
    "ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}],
    "ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}],
    "ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}],
    "มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}],
    "ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}],
    "กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
    # น้ำหนัก / weight
    "น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}],
    "ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}],
    "ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}],
    "ดก.": [{ORTH: "ดก.", LEMMA: "เดซิกรัม"}],
    "ซก.": [{ORTH: "ซก.", LEMMA: "เซนติกรัม"}],
    "มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}],
    "ก.": [{ORTH: "ก.", LEMMA: "กรัม"}],
    "กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}],
    # ปริมาตร / volume
    "ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}],
    "ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}],
    "ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}],
    "ซล.": [{ORTH: "ซล.", LEMMA: "เซนติลิตร"}],
    "ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}],
    "กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}],
    "ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}],
    # พื้นที่ / area
    "ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}],
    "ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}],
    "ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}],
    "ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}],
    # เดือน / month
    "ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
    "ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
    "มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
@@ -17,6 +331,114 @@ _exc = {
    "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
    "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
    "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}],
    # เพศ / gender
    "ช.": [{ORTH: "ช.", LEMMA: "ชาย"}],
    "ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}],
    "ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}],
    "ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}],
    # ที่อยู่ / address
    "ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}],
    "ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}],
    "อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}],
    "จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}],
    # สรรพนาม / pronoun
    "ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}],
    "ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}],
    "น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}],
    "โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}],
    # การเมือง / politics
    "ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}],
    "ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}],
    "นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}],
    "ปชป.": [{ORTH: "ปชป.", LEMMA: "พรรคประชาธิปัตย์"}],
    "ผกค.": [{ORTH: "ผกค.", LEMMA: "ผู้ก่อการร้ายคอมมิวนิสต์"}],
    "พท.": [{ORTH: "พท.", LEMMA: "พรรคเพื่อไทย"}],
    "พ.ร.ก.": [{ORTH: "พ.ร.ก.", LEMMA: "พระราชกำหนด"}],
    "พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ.", LEMMA: "พระราชกฤษฎีกา"}],
    "พ.ร.บ.": [{ORTH: "พ.ร.บ.", LEMMA: "พระราชบัญญัติ"}],
    "รธน.": [{ORTH: "รธน.", LEMMA: "รัฐธรรมนูญ"}],
    "รบ.": [{ORTH: "รบ.", LEMMA: "รัฐบาล"}],
    "รสช.": [{ORTH: "รสช.", LEMMA: "คณะรักษาความสงบเรียบร้อยแห่งชาติ"}],
    "ส.ก.": [{ORTH: "ส.ก.", LEMMA: "สมาชิกสภากรุงเทพมหานคร"}],
    "สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}],
    "สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}],
    "ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}],
    # ทั่วไป / general
    "ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}],
    "กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
    "กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}],
    "ขรก.": [{ORTH: "ขรก.", LEMMA: "ข้าราชการ"}],
    "ขส.": [{ORTH: "ขส.", LEMMA: "ขนส่ง"}],
    "ค.ร.น.": [{ORTH: "ค.ร.น.", LEMMA: "คูณร่วมน้อย"}],
    "ค.ร.ม.": [{ORTH: "ค.ร.ม.", LEMMA: "คูณร่วมมาก"}],
    "ง.ด.": [{ORTH: "ง.ด.", LEMMA: "เงินเดือน"}],
    "งป.": [{ORTH: "งป.", LEMMA: "งบประมาณ"}],
    "จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}],
    "จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}],
    "จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}],
    "จ.ป.ร.": [{ORTH: "จ.ป.ร.", LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)"}],
    "จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}],
    "จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}],
    "จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}],
    "ตจว.": [{ORTH: "ตจว.", LEMMA: "ต่างจังหวัด"}],
    "โทร.": [{ORTH: "โทร.", LEMMA: "โทรศัพท์"}],
    "ธ.": [{ORTH: "ธ.", LEMMA: "ธนาคาร"}],
    "น.ร.": [{ORTH: "น.ร.", LEMMA: "นักเรียน"}],
    "น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}],
    "น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}],
    "น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}],
    "น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก.", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}],
    "นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}],
    "บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}],
    "บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}],
    "บงล.": [{ORTH: "บงล.", LEMMA: "บริษัทเงินทุนและหลักทรัพย์จำกัด"}],
    "บบส.": [{ORTH: "บบส.", LEMMA: "บรรษัทบริหารสินทรัพย์สถาบันการเงิน"}],
    "บมจ.": [{ORTH: "บมจ.", LEMMA: "บริษัทมหาชนจำกัด"}],
    "บลจ.": [{ORTH: "บลจ.", LEMMA: "บริษัทหลักทรัพย์จัดการกองทุนรวมจำกัด"}],
    "บ/ช": [{ORTH: "บ/ช", LEMMA: "บัญชี"}],
    "บร.": [{ORTH: "บร.", LEMMA: "บรรณารักษ์"}],
    "ปชช.": [{ORTH: "ปชช.", LEMMA: "ประชาชน"}],
    "ปณ.": [{ORTH: "ปณ.", LEMMA: "ที่ทำการไปรษณีย์"}],
    "ปณก.": [{ORTH: "ปณก.", LEMMA: "ที่ทำการไปรษณีย์กลาง"}],
    "ปณส.": [{ORTH: "ปณส.", LEMMA: "ที่ทำการไปรษณีย์สาขา"}],
    "ปธ.": [{ORTH: "ปธ.", LEMMA: "ประธาน"}],
    "ปธน.": [{ORTH: "ปธน.", LEMMA: "ประธานาธิบดี"}],
    "ปอ.": [{ORTH: "ปอ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศ"}],
    "ปอ.พ.": [{ORTH: "ปอ.พ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศพิเศษ"}],
    "พ.ก.ง.": [{ORTH: "พ.ก.ง.", LEMMA: "พัสดุเก็บเงินปลายทาง"}],
    "พ.ก.ส.": [{ORTH: "พ.ก.ส.", LEMMA: "พนักงานเก็บค่าโดยสาร"}],
    "พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}],
    "ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}],
    "ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}],
    "ภ.ป.ร.": [{ORTH: "ภ.ป.ร.", LEMMA: "ภูมิพลอดุยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)"}],
    "ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}],
    "ร.": [{ORTH: "ร.", LEMMA: "รัชกาล"}],
    "ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}],
    "ร.ด.": [{ORTH: "ร.ด.", LEMMA: "รักษาดินแดน"}],
    "รปภ.": [{ORTH: "รปภ.", LEMMA: "รักษาความปลอดภัย"}],
    "รพ.": [{ORTH: "รพ.", LEMMA: "โรงพยาบาล"}],
    "ร.พ.": [{ORTH: "ร.พ.", LEMMA: "โรงพิมพ์"}],
    "รร.": [{ORTH: "รร.", LEMMA: "โรงเรียน,โรงแรม"}],
    "รสก.": [{ORTH: "รสก.", LEMMA: "รัฐวิสาหกิจ"}],
    "ส.ค.ส.": [{ORTH: "ส.ค.ส.", LEMMA: "ส่งความสุขปีใหม่"}],
    "สต.": [{ORTH: "สต.", LEMMA: "สตางค์"}],
    "สน.": [{ORTH: "สน.", LEMMA: "สถานีตำรวจ"}],
    "สนข.": [{ORTH: "สนข.", LEMMA: "สำนักงานเขต"}],
    "สนง.": [{ORTH: "สนง.", LEMMA: "สำนักงาน"}],
    "สนญ.": [{ORTH: "สนญ.", LEMMA: "สำนักงานใหญ่"}],
    "ส.ป.ช.": [{ORTH: "ส.ป.ช.", LEMMA: "สร้างเสริมประสบการณ์ชีวิต"}],
    "สภ.": [{ORTH: "สภ.", LEMMA: "สถานีตำรวจภูธร"}],
    "ส.ล.น.": [{ORTH: "ส.ล.น.", LEMMA: "สร้างเสริมลักษณะนิสัย"}],
    "สวญ.": [{ORTH: "สวญ.", LEMMA: "สารวัตรใหญ่"}],
    "สวป.": [{ORTH: "สวป.", LEMMA: "สารวัตรป้องกันปราบปราม"}],
    "สว.สส.": [{ORTH: "สว.สส.", LEMMA: "สารวัตรสืบสวน"}],
    "ส.ห.": [{ORTH: "ส.ห.", LEMMA: "สารวัตรทหาร"}],
    "สอ.": [{ORTH: "สอ.", LEMMA: "สถานีอนามัย"}],
    "สอท.": [{ORTH: "สอท.", LEMMA: "สถานเอกอัครราชทูต"}],
    "เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}],
    "หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}],
    "ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}],
}
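Each entry above follows spaCy's tokenizer-exception shape: a surface string mapping to a list of per-token attribute dicts. A minimal self-contained sketch of how such a table is consulted (plain string keys stand in for spaCy's `ORTH`/`LEMMA` attribute symbols, and `lemma_for` is a hypothetical helper, not spaCy API):

```python
# Stand-ins for spaCy's ORTH/LEMMA attribute IDs -- plain strings here so the
# sketch runs without spaCy installed (an assumption for illustration).
ORTH, LEMMA = "ORTH", "LEMMA"

# A tiny excerpt of the exception table above.
_exc = {
    "กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
    "กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
}

def lemma_for(token):
    """Return the lemma recorded for an abbreviation, or the token unchanged."""
    analyses = _exc.get(token)
    return analyses[0][LEMMA] if analyses else token

assert lemma_for("กม.") == "กิโลเมตร"
assert lemma_for("หมา") == "หมา"  # unknown strings pass through
```

In spaCy itself such a dict is merged into the language defaults via `update_exc`, so the tokenizer emits these abbreviations as single tokens with the attributes listed.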
@@ -134,6 +134,11 @@ def nl_tokenizer():
    return get_lang_class("nl").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def nl_lemmatizer():
    return get_lang_class("nl").Defaults.create_lemmatizer()


@pytest.fixture(scope="session")
def pl_tokenizer():
    return get_lang_class("pl").Defaults.create_tokenizer()
143 spacy/tests/lang/nl/test_lemmatizer.py Normal file
@@ -0,0 +1,143 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


# Calling the Lemmatizer directly
# Imitates behavior of:
# Tagger.set_annotations()
#     -> vocab.morphology.assign_tag_id()
#     -> Token.tag.__set__
#     -> vocab.morphology.assign_tag(...)
#     -> ... -> Morphology.assign_tag(...)
#     -> self.lemmatize(analysis.tag.pos, token.lex.orth,

noun_irreg_lemmatization_cases = [
    ("volkeren", "volk"),
    ("vaatje", "vat"),
    ("verboden", "verbod"),
    ("ijsje", "ijsje"),
    ("slagen", "slag"),
    ("verdragen", "verdrag"),
    ("verloven", "verlof"),
    ("gebeden", "gebed"),
    ("gaten", "gat"),
    ("staven", "staf"),
    ("aquariums", "aquarium"),
    ("podia", "podium"),
    ("holen", "hol"),
    ("lammeren", "lam"),
    ("bevelen", "bevel"),
    ("wegen", "weg"),
    ("moeilijkheden", "moeilijkheid"),
    ("aanwezigheden", "aanwezigheid"),
    ("goden", "god"),
    ("loten", "lot"),
    ("kaarsen", "kaars"),
    ("leden", "lid"),
    ("glaasje", "glas"),
    ("eieren", "ei"),
    ("vatten", "vat"),
    ("kalveren", "kalf"),
    ("padden", "pad"),
    ("smeden", "smid"),
    ("genen", "gen"),
    ("beenderen", "been"),
]

verb_irreg_lemmatization_cases = [
    ("liep", "lopen"),
    ("hief", "heffen"),
    ("begon", "beginnen"),
    ("sla", "slaan"),
    ("aangekomen", "aankomen"),
    ("sproot", "spruiten"),
    ("waart", "zijn"),
    ("snoof", "snuiven"),
    ("spoot", "spuiten"),
    ("ontbeet", "ontbijten"),
    ("gehouwen", "houwen"),
    ("afgewassen", "afwassen"),
    ("deed", "doen"),
    ("schoven", "schuiven"),
    ("gelogen", "liegen"),
    ("woog", "wegen"),
    ("gebraden", "braden"),
    ("smolten", "smelten"),
    ("riep", "roepen"),
    ("aangedaan", "aandoen"),
    ("vermeden", "vermijden"),
    ("stootten", "stoten"),
    ("ging", "gaan"),
    ("geschoren", "scheren"),
    ("gesponnen", "spinnen"),
    ("reden", "rijden"),
    ("zochten", "zoeken"),
    ("leed", "lijden"),
    ("verzonnen", "verzinnen"),
]

@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
def test_nl_lemmatizer_noun_lemmas_irreg(nl_lemmatizer, text, lemma):
    pos = "noun"
    lemmas_pred = nl_lemmatizer(text, pos)
    assert lemma == sorted(lemmas_pred)[0]


@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
def test_nl_lemmatizer_verb_lemmas_irreg(nl_lemmatizer, text, lemma):
    pos = "verb"
    lemmas_pred = nl_lemmatizer(text, pos)
    assert lemma == sorted(lemmas_pred)[0]


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_verb_lemmas_reg(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_adjective_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_determiner_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.skip
@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_adverb_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


@pytest.mark.parametrize("text,lemma", [])
def test_nl_lemmatizer_pronoun_lemmas(nl_lemmatizer, text, lemma):
    # TODO: add test
    pass


# Using the lemma lookup table only
@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
def test_nl_lemmatizer_lookup_noun(nl_lemmatizer, text, lemma):
    lemma_pred = nl_lemmatizer.lookup(text)
    assert lemma_pred in (lemma, text)


@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
def test_nl_lemmatizer_lookup_verb(nl_lemmatizer, text, lemma):
    lemma_pred = nl_lemmatizer.lookup(text)
    assert lemma_pred in (lemma, text)
@@ -9,3 +9,19 @@ from spacy.lang.nl.lex_attrs import like_num
def test_nl_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())


@pytest.mark.parametrize(
    "text,num_tokens",
    [
        (
            "De aftredende minister-president benadrukte al dat zijn partij inhoudelijk weinig gemeen heeft met de groenen.",
            16,
        ),
        ("Hij is sociaal-cultureel werker.", 5),
        ("Er staan een aantal dure auto's in de garage.", 10),
    ],
)
def test_tokenizer_doesnt_split_hyphens(nl_tokenizer, text, num_tokens):
    tokens = nl_tokenizer(text)
    assert len(tokens) == num_tokens
@@ -1,6 +1,8 @@
import pytest
# coding: utf8
from __future__ import unicode_literals

import re
from ... import compat
from spacy import compat

prefix_search = (
    b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
@@ -67,4 +69,4 @@ if compat.is_python2:
    # string above in the xpass message.
    def test_issue3356():
        pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search(u"hello")
        assert not pattern.search("hello")
@@ -1,10 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.util import decaying


def test_decaying():
    sizes = decaying(10., 1., .5)

def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.
    assert size == 10.0
    size = next(sizes)
    assert size == 10. - 0.5
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10. - 0.5 - 0.5
    assert size == 10.0 - 0.5 - 0.5
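The assertions above pin down `decaying`'s behavior: each draw decreases linearly by `decay`, starting at `start`. A self-contained sketch consistent with those assertions (that `spacy.util.decaying` also clamps at `stop` is an assumption here, not asserted by the test):

```python
import itertools


def decaying(start, stop, decay):
    """Yield start, start - decay, start - 2*decay, ... never going below stop."""
    for step in itertools.count():
        yield max(start - decay * step, stop)


sizes = decaying(10.0, 1.0, 0.5)
assert next(sizes) == 10.0
assert next(sizes) == 9.5
assert next(sizes) == 9.0
```

Such a generator is typically used to anneal a hyperparameter (e.g. batch size or dropout) over training iterations.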
25 spacy/tests/regression/test_issue3449.py Normal file
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English


@pytest.mark.xfail(reason="Current default suffix rules avoid one upper-case letter before a dot.")
def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe('sentencizer'))

    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I. Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"

    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)

    assert t1[5].text == 'I'
    assert t2[5].text == 'I'
    assert t3[5].text == 'I'
@@ -1,7 +1,6 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.lang.en import English
from spacy.tokens import Doc
19 spacy/tests/regression/test_issue3521.py Normal file
@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize(
    "word",
    [
        "don't",
        "don’t",
        "I'd",
        "I’d",
    ],
)
def test_issue3521(en_tokenizer, word):
    tok = en_tokenizer(word)[1]
    # 'not' and 'would' should be stopwords, also in their abbreviated forms
    assert tok.is_stop
33 spacy/tests/regression/test_issue3531.py Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from spacy import displacy


def test_issue3531():
    """Test that displaCy renderer doesn't require "settings" key."""
    example_dep = {
        "words": [
            {"text": "But", "tag": "CCONJ"},
            {"text": "Google", "tag": "PROPN"},
            {"text": "is", "tag": "VERB"},
            {"text": "starting", "tag": "VERB"},
            {"text": "from", "tag": "ADP"},
            {"text": "behind.", "tag": "ADV"},
        ],
        "arcs": [
            {"start": 0, "end": 3, "label": "cc", "dir": "left"},
            {"start": 1, "end": 3, "label": "nsubj", "dir": "left"},
            {"start": 2, "end": 3, "label": "aux", "dir": "left"},
            {"start": 3, "end": 4, "label": "prep", "dir": "right"},
            {"start": 4, "end": 5, "label": "pcomp", "dir": "right"},
        ],
    }
    example_ent = {
        "text": "But Google is starting from behind.",
        "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    }
    dep_html = displacy.render(example_dep, style="dep", manual=True)
    assert dep_html
    ent_html = displacy.render(example_ent, style="ent", manual=True)
    assert ent_html
@@ -26,6 +26,7 @@ def symlink_setup_target(request, symlink_target, symlink):
    os.mkdir(path2str(symlink_target))
    # yield -- need to cleanup even if assertion fails
    # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240

    def cleanup():
        symlink_remove(symlink)
        os.rmdir(path2str(symlink_target))
@@ -160,20 +160,14 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.p
### Visualizing spaCy vectors in TensorBoard {#tensorboard}

These two scripts let you load any spaCy model containing word vectors into
This script lets you load any spaCy model containing word vectors into
[TensorBoard](https://projector.tensorflow.org/) to create an
[embedding visualization](https://www.tensorflow.org/versions/r1.1/get_started/embedding_viz).
The first example uses TensorBoard, the second example TensorBoard's standalone
embedding projector.

```python
https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py
```

```python
https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard_standalone.py
```
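The standalone projector consumes plain TSV files, so the core of such an export script can be sketched without spaCy at all (the toy `vectors` dict below is a stand-in for a model's `nlp.vocab.vectors`, and the file names are illustrative):

```python
# Toy word vectors standing in for a spaCy model's vocab vectors.
vectors = {"apple": [0.1, 0.2, 0.3], "orange": [0.2, 0.1, 0.4]}

# TensorBoard's standalone embedding projector loads a tab-separated vectors
# file plus a matching metadata file with one label per row.
with open("vectors.tsv", "w", encoding="utf8") as vec_file, open(
    "metadata.tsv", "w", encoding="utf8"
) as meta_file:
    for word, vec in vectors.items():
        vec_file.write("\t".join(str(v) for v in vec) + "\n")
        meta_file.write(word + "\n")
```

Both files are then uploaded directly at projector.tensorflow.org; row order must match between the two files.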

## Deep Learning {#deep-learning hidden="true"}

### Text classification with Keras {#keras}
@@ -35,7 +35,7 @@ const SEO = ({ description, lang, title, section, sectionTitle, bodyClass }) =>
        siteMetadata.slogan,
        sectionTitle
    )
    const socialImage = getImage(section)
    const socialImage = siteMetadata.siteUrl + getImage(section)
    const meta = [
        {
            name: 'description',
@@ -126,6 +126,7 @@ const query = graphql`
            title
            description
            slogan
            siteUrl
            social {
                twitter
            }
@@ -164,9 +164,9 @@ const Landing = ({ data }) => {
We're pleased to invite the spaCy community and other folks working on Natural
Language Processing to Berlin this summer for a small and intimate event{' '}
<strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
teams using spaCy in production, followed by a one-track conference. We booked a
beautiful venue, hand-picked an awesome lineup of speakers and scheduled plenty
of social time to get to know each other and exchange ideas.
teams using spaCy in production, followed by a one-track conference. We've
booked a beautiful venue, hand-picked an awesome lineup of speakers and
scheduled plenty of social time to get to know each other and exchange ideas.
</LandingBanner>

<LandingBanner