mirror of https://github.com/explosion/spaCy.git
synced 2024-12-26 01:46:28 +03:00

fixed tag_map.py merge conflict

parent eba4f77526 · commit 80e15af76c
.github/contributors/ivigamberdiev.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Igor Igamberdiev     |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | April 2, 2019        |
| GitHub username                | ivigamberdiev        |
| Website (optional)             |                      |
.github/contributors/nlptown.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

The agreement text is identical to ivigamberdiev.md above, except that the term
**"us"** is defined as
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal), and the
employer statement is the one marked:

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                              |
|------------------------------- | ---------------------------------- |
| Name                           | Yves Peirsman                      |
| Company name (if applicable)   | NLP Town (Island Constraints BVBA) |
| Title or role (if applicable)  | Co-founder                         |
| Date                           | 14.03.2019                         |
| GitHub username                | nlptown                            |
| Website (optional)             | http://www.nlp.town                |
.github/contributors/socool.md (vendored, new file, +106 lines)

@@ -0,0 +1,106 @@

The agreement text is again identical (with **"us"** defined as
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal)), signed as
an individual:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

## Contributor Details

| Field                          | Entry                    |
|------------------------------- | ------------------------ |
| Name                           | Kamolsit Mongkolsrisawat |
| Company name (if applicable)   | Mojito                   |
| Title or role (if applicable)  |                          |
| Date                           | 02-4-2019                |
| GitHub username                | socool                   |
| Website (optional)             |                          |
README.md (18 lines changed)

@@ -17,7 +17,7 @@ released under the MIT license.
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square)](https://github.com/explosion/spaCy/releases)
-[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.python.org/pypi/spacy)
+[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square)](https://anaconda.org/conda-forge/spacy)
 [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)

@@ -42,7 +42,7 @@ released under the MIT license.
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
-[changelog]: https://spacy.io/usage/#changelog
+[changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md

 ## 💬 Where to ask questions

@@ -60,7 +60,7 @@ valuable if it's shared publicly, so that more people can benefit from it.
 | 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |

 [github issue tracker]: https://github.com/explosion/spaCy/issues
-[stack overflow]: http://stackoverflow.com/questions/tagged/spacy
+[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
 [gitter chat]: https://gitter.im/explosion/spaCy
 [reddit user group]: https://www.reddit.com/r/spacynlp

@@ -95,7 +95,7 @@ For detailed installation instructions, see the
 - **Python version**: Python 2.7, 3.5+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)

-[pip]: https://pypi.python.org/pypi/spacy
+[pip]: https://pypi.org/project/spacy/
 [conda]: https://anaconda.org/conda-forge/spacy

 ### pip

@@ -219,7 +219,7 @@ source. That is the common way if you want to make changes to the code base.
 You'll need to make sure that you have a development environment consisting of a
 Python distribution including header files, a compiler,
 [pip](https://pip.pypa.io/en/latest/installing/),
-[virtualenv](https://virtualenv.pypa.io/) and [git](https://git-scm.com)
+[virtualenv](https://virtualenv.pypa.io/en/latest/) and [git](https://git-scm.com)
 installed. The compiler part is the trickiest. How to do that depends on your
 system. See notes on Ubuntu, OS X and Windows for details.

@@ -239,8 +239,8 @@ python setup.py build_ext --inplace
 Compared to regular install via pip, [requirements.txt](requirements.txt)
 additionally installs developer dependencies such as Cython. For more details
 and instructions, see the documentation on
-[compiling spaCy from source](https://spacy.io/usage/#source) and the
-[quickstart widget](https://spacy.io/usage/#section-quickstart) to get
+[compiling spaCy from source](https://spacy.io/usage#source) and the
+[quickstart widget](https://spacy.io/usage#section-quickstart) to get
 the right commands for your platform and Python version.

 ### Ubuntu

@@ -260,7 +260,7 @@ and git preinstalled.
 ### Windows

 Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or
-[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/)
+[Visual Studio Express](https://visualstudio.microsoft.com/vs/express/)
 that matches the version that was used to compile your Python
 interpreter. For official distributions these are VS 2008 (Python 2.7),
 VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

@@ -282,5 +282,5 @@ pip install -r path/to/requirements.txt
 python -m pytest <spacy-directory>
 ```

-See [the documentation](https://spacy.io/usage/#tests) for more details and
+See [the documentation](https://spacy.io/usage#tests) for more details and
 examples.
@@ -23,7 +23,7 @@ For more details, see the documentation:
 * Training: https://spacy.io/usage/training
 * NER: https://spacy.io/usage/linguistic-features#named-entities

-Compatible with: spaCy v2.0.0+
+Compatible with: spaCy v2.1.0+
 Last tested with: v2.1.0
 """
 from __future__ import unicode_literals, print_function
spacy/_ml.py (41 lines changed)

@@ -86,7 +86,7 @@ def with_cpu(ops, model):
     as necessary."""
     model.to_cpu()

-    def with_cpu_forward(inputs, drop=0.):
+    def with_cpu_forward(inputs, drop=0.0):
         cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
         gpu_outputs = _to_device(ops, cpu_outputs)

@@ -106,7 +106,7 @@ def _to_cpu(X):
         return tuple([_to_cpu(x) for x in X])
     elif isinstance(X, list):
         return [_to_cpu(x) for x in X]
-    elif hasattr(X, 'get'):
+    elif hasattr(X, "get"):
         return X.get()
     else:
         return X

@@ -142,7 +142,9 @@ class extract_ngrams(Model):
         # The dtype here matches what thinc is expecting -- which differs per
         # platform (by int definition). This should be fixed once the problem
         # is fixed on Thinc's side.
-        lengths = self.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_)
+        lengths = self.ops.asarray(
+            [arr.shape[0] for arr in batch_keys], dtype=numpy.int_
+        )
         batch_keys = self.ops.xp.concatenate(batch_keys)
         batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f")
         return (batch_keys, batch_vals, lengths), None

@@ -592,32 +594,27 @@ def build_text_classifier(nr_class, width=64, **cfg):
     )

     linear_model = build_bow_text_classifier(
-        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False)
-    if cfg.get('exclusive_classes'):
+        nr_class, ngram_size=cfg.get("ngram_size", 1), exclusive_classes=False
+    )
+    if cfg.get("exclusive_classes"):
         output_layer = Softmax(nr_class, nr_class * 2)
     else:
         output_layer = (
-            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0))
-            >> logistic
+            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic
         )
-    model = (
-        (linear_model | cnn_model)
-        >> output_layer
-    )
+    model = (linear_model | cnn_model) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
     model.lsuv = False
     return model


-def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
-                              no_output_layer=False, **cfg):
+def build_bow_text_classifier(
+    nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg
+):
     with Model.define_operators({">>": chain}):
-        model = (
-            with_cpu(Model.ops,
-                extract_ngrams(ngram_size, attr=ORTH)
-                >> LinearModel(nr_class)
-            )
-        )
+        model = with_cpu(
+            Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class)
+        )
         if not no_output_layer:
             model = model >> (cpu_softmax if exclusive_classes else logistic)

@@ -626,11 +623,9 @@ def build_bow_text_classifier(nr_class, ngram_size=1, exclusive_classes=False,
 @layerize
-def cpu_softmax(X, drop=0.):
+def cpu_softmax(X, drop=0.0):
     ops = NumpyOps()

+    Y = ops.softmax(X)

     def cpu_softmax_backward(dY, sgd=None):
         return dY

@@ -648,7 +643,9 @@ def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False,
     if exclusive_classes:
         output_layer = Softmax(nr_class, tok2vec.nO)
     else:
-        output_layer = zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        output_layer = (
+            zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic
+        )
     model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer
     model.tok2vec = chain(tok2vec, flatten)
     model.nO = nr_class
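Most of the `spacy/_ml.py` hunks above are mechanical black-style reformatting; the one piece of notation worth unpacking is the `>>` operator, which is thinc's `chain` combinator bound as an operator inside a `Model.define_operators` block. A minimal sketch, assuming the thinc 7.x-era imports that `spacy/_ml.py` itself uses (the layer sizes here are arbitrary examples):

```
# Minimal sketch of the operator overloading used in spacy/_ml.py.
from thinc.api import chain
from thinc.v2v import Model, Affine, Softmax

with Model.define_operators({">>": chain}):
    # ">>" pipes one layer's output into the next, so this is
    # equivalent to chain(Affine(64, 128), Softmax(10, 64)).
    model = Affine(64, 128) >> Softmax(10, 64)
```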
@@ -125,7 +125,9 @@ def pretrain(
             max_length=max_length,
             min_length=min_length,
         )
-        loss = make_update(model, docs, optimizer, objective=loss_func, drop=dropout)
+        loss = make_update(
+            model, docs, optimizer, objective=loss_func, drop=dropout
+        )
         progress = tracker.update(epoch, loss, docs)
         if progress:
             msg.row(progress, **row_settings)
@@ -50,8 +50,9 @@ class DependencyRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
             render_id = "{}-{}".format(id_prefix, i)
             svg = self.render_svg(render_id, p["words"], p["arcs"])
             rendered.append(svg)

@@ -254,9 +255,10 @@ class EntityRenderer(object):
         rendered = []
         for i, p in enumerate(parsed):
             if i == 0:
-                self.direction = p["settings"].get("direction", DEFAULT_DIR)
-                self.lang = p["settings"].get("lang", DEFAULT_LANG)
-            rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
+                settings = p.get("settings", {})
+                self.direction = settings.get("direction", DEFAULT_DIR)
+                self.lang = settings.get("lang", DEFAULT_LANG)
+            rendered.append(self.render_ents(p["text"], p["ents"], p.get("title")))
         if page:
             docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
             markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
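The change above makes both renderers tolerate input dicts that carry no "settings" key (and, for the entity renderer, no "title"), which is the common case when callers pass manually prepared data. A sketch of such a call; the example text and entity offsets are invented for illustration:

```
# Manually prepared displacy input with neither "settings" nor "title";
# with the fix above this renders instead of raising a KeyError.
from spacy import displacy

doc = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
}
html = displacy.render(doc, style="ent", manual=True)
```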
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import LEMMA, PRON_LEMMA, AUX
+from ...symbols import LEMMA, PRON_LEMMA

 _subordinating_conjunctions = [
     "that",

@@ -457,7 +457,6 @@ MORPH_RULES = {
     "have": {"POS": "AUX"},
     "'m": {"POS": "AUX", LEMMA: "be"},
     "'ve": {"POS": "AUX"},
-    "'re": {"POS": "AUX", LEMMA: "be"},
     "'s": {"POS": "AUX"},
     "is": {"POS": "AUX"},
     "'d": {"POS": "AUX"},
@@ -39,7 +39,7 @@ made make many may me meanwhile might mine more moreover most mostly move much
 must my myself

 name namely neither never nevertheless next nine no nobody none noone nor not
-nothing now nowhere n't
+nothing now nowhere

 of off often on once one only onto or other others otherwise our ours ourselves
 out over own

@@ -66,7 +66,13 @@ whereafter whereas whereby wherein whereupon wherever whether which while
 whither who whoever whole whom whose why will with within without would

 yet you your yours yourself yourselves
-
-'d 'll 'm 're 's 've
 """.split()
 )
+
+contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
+STOP_WORDS.update(contractions)
+
+for apostrophe in ["‘", "’"]:
+    for stopword in contractions:
+        STOP_WORDS.add(stopword.replace("'", apostrophe))
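The loop added above registers typographic-apostrophe variants of each contraction alongside the ASCII forms, so texts that use curly quotes are covered too. A standalone sketch of the same idea; the miniature STOP_WORDS set here is invented for illustration:

```
# Standalone sketch: generate curly-apostrophe variants of contraction
# stop words, mirroring the loop added in the diff above.
STOP_WORDS = set("nothing now nowhere".split())

contractions = ["n't", "'d", "'ll", "'m", "'re", "'s", "'ve"]
STOP_WORDS.update(contractions)

for apostrophe in ["‘", "’"]:
    for stopword in contractions:
        STOP_WORDS.add(stopword.replace("'", apostrophe))

assert "n’t" in STOP_WORDS  # the right-single-quote variant is now covered
```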
@@ -2,7 +2,11 @@
 from __future__ import unicode_literals

 from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+<<<<<<< HEAD
 from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN
+=======
+from ...symbols import NOUN, PRON, AUX, SCONJ
+>>>>>>> 4faf62d5154c2d2adb6def32da914d18d5e9c8fe


 # POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014

@@ -92,4 +96,3 @@ TAG_MAP = {
     "D--+PS2":{POS: ADV},
     "PP3+T—": {POS: PRON}
 }
@@ -4,6 +4,11 @@ from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .tag_map import TAG_MAP
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES
+
+from .lemmatizer import LOOKUP, LEMMA_EXC, LEMMA_INDEX, RULES
+from .lemmatizer.lemmatizer import DutchLemmatizer

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS

@@ -13,20 +18,33 @@ from ...util import update_exc, add_lookups


 class DutchDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = lambda text: "nl"
-    lex_attr_getters[NORM] = add_lookups(
-        Language.Defaults.lex_attr_getters[NORM], BASE_NORMS
-    )
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
+    lex_attr_getters[LANG] = lambda text: 'nl'
+    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
+                                         BASE_NORMS)
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
     tag_map = TAG_MAP
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+
+    @classmethod
+    def create_lemmatizer(cls, nlp=None):
+        rules = RULES
+        lemma_index = LEMMA_INDEX
+        lemma_exc = LEMMA_EXC
+        lemma_lookup = LOOKUP
+        return DutchLemmatizer(index=lemma_index,
+                               exceptions=lemma_exc,
+                               lookup=lemma_lookup,
+                               rules=rules)


 class Dutch(Language):
-    lang = "nl"
+    lang = 'nl'
     Defaults = DutchDefaults


-__all__ = ["Dutch"]
+__all__ = ['Dutch']
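For orientation (not part of the commit): once `create_lemmatizer` is defined on the defaults, a plain v2.x-era Dutch pipeline should pick up the lookup- and rule-backed lemmatizer automatically. A rough usage sketch, with the sample sentence borrowed from the examples file below:

```
# Rough sketch: the Dutch language class now wires in DutchLemmatizer,
# so lemmas come from the lookup/rule tables even without a tagger.
from spacy.lang.nl import Dutch

nlp = Dutch()
doc = nlp("Londen is een grote stad in het Verenigd Koninkrijk")
print([token.lemma_ for token in doc])
```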
@@ -14,5 +14,5 @@ sentences = [
     "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
     "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
     "San Francisco overweegt robots op voetpaden te verbieden",
-    "Londen is een grote stad in het Verenigd Koninkrijk",
+    "Londen is een grote stad in het Verenigd Koninkrijk"
 ]
spacy/lang/nl/lemmatizer/__init__.py (new file, +40 lines)

@@ -0,0 +1,40 @@
# coding: utf8
from __future__ import unicode_literals

from ._verbs_irreg import VERBS_IRREG
from ._nouns_irreg import NOUNS_IRREG
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs_irreg import ADVERBS_IRREG

from ._adpositions_irreg import ADPOSITIONS_IRREG
from ._determiners_irreg import DETERMINERS_IRREG
from ._pronouns_irreg import PRONOUNS_IRREG

from ._verbs import VERBS
from ._nouns import NOUNS
from ._adjectives import ADJECTIVES

from ._adpositions import ADPOSITIONS
from ._determiners import DETERMINERS

from .lookup import LOOKUP

from ._lemma_rules import RULES

from .lemmatizer import DutchLemmatizer


LEMMA_INDEX = {"adj": ADJECTIVES,
               "noun": NOUNS,
               "verb": VERBS,
               "adp": ADPOSITIONS,
               "det": DETERMINERS}

LEMMA_EXC = {"adj": ADJECTIVES_IRREG,
             "adv": ADVERBS_IRREG,
             "adp": ADPOSITIONS_IRREG,
             "noun": NOUNS_IRREG,
             "verb": VERBS_IRREG,
             "det": DETERMINERS_IRREG,
             "pron": PRONOUNS_IRREG}
spacy/lang/nl/lemmatizer/_adjectives.py (new file, +3461 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_adjectives_irreg.py (new file, +3033 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_adpositions.py (new file, +24 lines)

@@ -0,0 +1,24 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS = set(
    ('aan aangaande aanwezig achter af afgezien al als an annex anno anti '
     'behalve behoudens beneden benevens benoorden beoosten betreffende bewesten '
     'bezijden bezuiden bij binnen binnenuit binst bladzij blijkens boven bovenop '
     'buiten conform contra cq daaraan daarbij daarbuiten daarin daarnaar '
     'daaronder daartegenover daarvan dankzij deure dichtbij door doordat doorheen '
     'echter eraf erop erover errond eruit ervoor evenals exclusief gedaan '
     'gedurende gegeven getuige gezien halfweg halverwege heen hierdoorheen hierop '
     'houdende in inclusief indien ingaande ingevolge inzake jegens kortweg '
     'krachtens kralj langs langsheen langst lastens linksom lopende luidens mede '
     'mee met middels midden middenop mits na naan naar naartoe naast naat nabij '
     'nadat namens neer neffe neffen neven nevenst niettegenstaande nopens '
     'officieel om omheen omstreeks omtrent onafgezien ondanks onder onderaan '
     'ondere ongeacht ooit op open over per plus pro qua rechtover rond rondom '
     "sedert sinds spijts strekkende te tegen tegenaan tegenop tegenover telde "
     'teneinde terug tijdens toe tot totdat trots tussen tégen uit uitgenomen '
     'ultimo van vanaf vandaan vandoor vanop vanuit vanwege versus via vinnen '
     'vlakbij volgens voor voor- voorbij voordat voort voren vòòr vóór waaraan '
     'waarbij waardoor waaronder weg wegens weleens zijdens zoals zodat zonder '
     'zónder à').split())
spacy/lang/nl/lemmatizer/_adpositions_irreg.py (new file, +12 lines)

@@ -0,0 +1,12 @@
# coding: utf8
from __future__ import unicode_literals


ADPOSITIONS_IRREG = {
    "'t": ('te',),
    'me': ('mee',),
    'meer': ('mee',),
    'on': ('om',),
    'ten': ('te',),
    'ter': ('te',)
}
spacy/lang/nl/lemmatizer/_adverbs_irreg.py (new file, +19 lines)

@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals


ADVERBS_IRREG = {
    "'ns": ('eens',),
    "'s": ('eens',),
    "'t": ('het',),
    "d'r": ('er',),
    "d'raf": ('eraf',),
    "d'rbij": ('erbij',),
    "d'rheen": ('erheen',),
    "d'rin": ('erin',),
    "d'rna": ('erna',),
    "d'rnaar": ('ernaar',),
    'hele': ('heel',),
    'nevenst': ('nevens',),
    'overend': ('overeind',)
}
spacy/lang/nl/lemmatizer/_determiners.py (new file, +17 lines)

@@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS = set(
    ("al allebei allerhande allerminst alletwee"
     "beide clip-on d'n d'r dat datgeen datgene de dees degeen degene den dewelke "
     'deze dezelfde die diegeen diegene diehien dien diene diens diezelfde dit '
     'ditgene e een eene eigen elk elkens elkes enig enkel enne ettelijke eure '
     'euren evenveel ewe ge geen ginds géén haar haaren halfelf het hetgeen '
     'hetwelk hetzelfde heur heure hulder hulle hullen hullie hun hunder hunderen '
     'ieder iederes ja je jen jouw jouwen jouwes jullie junder keiveel keiweinig '
     "m'ne me meer meerder meerdere menen menig mijn mijnes minst méér niemendal "
     'oe ons onse se sommig sommigeder superveel telken teveel titulair ulder '
     'uldere ulderen ulle under une uw vaak veel veels véél wat weinig welk welken '
     "welkene welksten z'nen ze zenen zijn zo'n zo'ne zoiet zoveel zovele zovelen "
     'zuk zulk zulkdanig zulken zulks zullie zíjn àlle álle').split())
spacy/lang/nl/lemmatizer/_determiners_irreg.py (new file, +69 lines)

@@ -0,0 +1,69 @@
# coding: utf8
from __future__ import unicode_literals


DETERMINERS_IRREG = {
    "'r": ('haar',),
    "'s": ('de',),
    "'t": ('het',),
    "'tgene": ('hetgeen',),
    'alle': ('al',),
    'allen': ('al',),
    'aller': ('al',),
    'beiden': ('beide',),
    'beider': ('beide',),
    "d'": ('het',),
    "d'r": ('haar',),
    'der': ('de',),
    'des': ('de',),
    'dezer': ('deze',),
    'dienen': ('die',),
    'dier': ('die',),
    'elke': ('elk',),
    'ene': ('een',),
    'enen': ('een',),
    'ener': ('een',),
    'enige': ('enig',),
    'enigen': ('enig',),
    'er': ('haar',),
    'gene': ('geen',),
    'genen': ('geen',),
    'hare': ('haar',),
    'haren': ('haar',),
    'harer': ('haar',),
    'hunne': ('hun',),
    'hunnen': ('hun',),
    'jou': ('jouw',),
    'jouwe': ('jouw',),
    'julliejen': ('jullie',),
    "m'n": ('mijn',),
    'mee': ('meer',),
    'meer': ('veel',),
    'meerderen': ('meerdere',),
    'meest': ('veel',),
    'meesten': ('veel',),
    'meet': ('veel',),
    'menige': ('menig',),
    'mij': ('mijn',),
    'mijnen': ('mijn',),
    'minder': ('weinig',),
    'mindere': ('weinig',),
    'minst': ('weinig',),
    'minste': ('minst',),
    'ne': ('een',),
    'onze': ('ons',),
    'onzent': ('ons',),
    'onzer': ('ons',),
    'ouw': ('uw',),
    'sommige': ('sommig',),
    'sommigen': ('sommig',),
    'u': ('uw',),
    'vaker': ('vaak',),
    'vele': ('veel',),
    'velen': ('veel',),
    'welke': ('welk',),
    'zijne': ('zijn',),
    'zijnen': ('zijn',),
    'zijns': ('zijn',),
    'één': ('een',)
}
spacy/lang/nl/lemmatizer/_lemma_rules.py (new file, +79 lines)

@@ -0,0 +1,79 @@
# coding: utf8
from __future__ import unicode_literals


ADJECTIVE_SUFFIX_RULES = [
    ["sten", ""],
    ["ste", ""],
    ["st", ""],
    ["er", ""],
    ["en", ""],
    ["e", ""],
    ["ende", "end"]
]

VERB_SUFFIX_RULES = [
    ["dt", "den"],
    ["de", "en"],
    ["te", "en"],
    ["dde", "den"],
    ["tte", "ten"],
    ["dden", "den"],
    ["tten", "ten"],
    ["end", "en"],
]

NOUN_SUFFIX_RULES = [
    ["en", ""],
    ["ën", ""],
    ["'er", ""],
    ["s", ""],
    ["tje", ""],
    ["kje", ""],
    ["'s", ""],
    ["ici", "icus"],
    ["heden", "heid"],
    ["elen", "eel"],
    ["ezen", "ees"],
    ["even", "eef"],
    ["ssen", "s"],
    ["rren", "r"],
    ["kken", "k"],
    ["bben", "b"]
]

NUM_SUFFIX_RULES = [
    ["ste", ""],
    ["sten", ""],
    ["ën", ""],
    ["en", ""],
    ["de", ""],
    ["er", ""],
    ["ër", ""],
    ["tjes", ""]
]

PUNCT_SUFFIX_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]


# In-place sort guaranteeing that longer -- more specific -- rules are
# applied first.
for rule_set in (ADJECTIVE_SUFFIX_RULES,
                 NOUN_SUFFIX_RULES,
                 NUM_SUFFIX_RULES,
                 VERB_SUFFIX_RULES):
    rule_set.sort(key=lambda r: len(r[0]), reverse=True)


RULES = {
    "adj": ADJECTIVE_SUFFIX_RULES,
    "noun": NOUN_SUFFIX_RULES,
    "verb": VERB_SUFFIX_RULES,
    "num": NUM_SUFFIX_RULES,
    "punct": PUNCT_SUFFIX_RULES
}
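The in-place sort matters because the lemmatizer (see lemmatizer.py below) tries rules in order and several suffixes are prefixes of one another, e.g. "st" vs "sten". A standalone sketch of the effect; the word and the simplified helper are invented for illustration:

```
# Standalone sketch: longest-suffix-first ordering ensures the most
# specific rule is tried before its shorter prefixes ("ste" before "e").
rules = [["sten", ""], ["ste", ""], ["st", ""], ["er", ""],
         ["en", ""], ["e", ""], ["ende", "end"]]
rules.sort(key=lambda r: len(r[0]), reverse=True)

def first_candidate(form, rules):
    # Simplified version of the suffix handling in lemmatize():
    # strip the first matching suffix and substitute its replacement.
    for old, new in rules:
        if form.endswith(old):
            return form[:len(form) - len(old)] + new
    return form

print(first_candidate("mooiste", rules))  # -> "mooi" (via "ste", not "e")
```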
spacy/lang/nl/lemmatizer/_nouns.py (new file, +27890 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_nouns_irreg.py (new file, +3240 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_numbers_irreg.py (new file, +31 lines)

@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals


NUMBERS_IRREG = {
    'achten': ('acht',),
    'biljoenen': ('biljoen',),
    'drieën': ('drie',),
    'duizenden': ('duizend',),
    'eentjes': ('één',),
    'elven': ('elf',),
    'miljoenen': ('miljoen',),
    'negenen': ('negen',),
    'negentiger': ('negentig',),
    'tienduizenden': ('tienduizend',),
    'tienen': ('tien',),
    'tientjes': ('tien',),
    'twaalven': ('twaalf',),
    'tweeën': ('twee',),
    'twintiger': ('twintig',),
    'twintigsten': ('twintig',),
    'vieren': ('vier',),
    'vijftiger': ('vijftig',),
    'vijven': ('vijf',),
    'zessen': ('zes',),
    'zestiger': ('zestig',),
    'zevenen': ('zeven',),
    'zeventiger': ('zeventig',),
    'zovele': ('zoveel',),
    'zovelen': ('zoveel',)
}
spacy/lang/nl/lemmatizer/_pronouns_irreg.py (new file, +35 lines)

@@ -0,0 +1,35 @@
# coding: utf8
from __future__ import unicode_literals


PRONOUNS_IRREG = {
    "'r": ('haar',),
    "'rzelf": ('haarzelf',),
    "'t": ('het',),
    "d'r": ('haar',),
    'da': ('dat',),
    'dienen': ('die',),
    'diens': ('die',),
    'dies': ('die',),
    'elkaars': ('elkaar',),
    'elkanders': ('elkander',),
    'ene': ('een',),
    'enen': ('een',),
    'fik': ('ik',),
    'gaat': ('gaan',),
    'gene': ('geen',),
    'harer': ('haar',),
    'ieders': ('ieder',),
    'iemands': ('iemand',),
    'ikke': ('ik',),
    'mijnen': ('mijn',),
    'oe': ('je',),
    'onzer': ('ons',),
    'wa': ('wat',),
    'watte': ('wat',),
    'wier': ('wie',),
    'zijns': ('zijn',),
    'zoietsken': ('zoietske',),
    'zulks': ('zulk',),
    'één': ('een',)
}
spacy/lang/nl/lemmatizer/_verbs.py (new file, +2873 lines; diff suppressed because it is too large)
spacy/lang/nl/lemmatizer/_verbs_irreg.py (new file, +7201 lines; diff suppressed because it is too large)

spacy/lang/nl/lemmatizer/lemmatizer.py (new file, +130 lines)
|
@@ -0,0 +1,130 @@
# coding: utf8
from __future__ import unicode_literals

from ....symbols import POS, NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV


class DutchLemmatizer(object):
    # Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB.
    univ_pos_name_variants = {
        NOUN: "noun", "NOUN": "noun", "noun": "noun",
        VERB: "verb", "VERB": "verb", "verb": "verb",
        AUX: "verb", "AUX": "verb", "aux": "verb",
        ADJ: "adj", "ADJ": "adj", "adj": "adj",
        ADV: "adv", "ADV": "adv", "adv": "adv",
        PRON: "pron", "PRON": "pron", "pron": "pron",
        DET: "det", "DET": "det", "det": "det",
        ADP: "adp", "ADP": "adp", "adp": "adp",
        NUM: "num", "NUM": "num", "num": "num"
    }

    @classmethod
    def load(cls, path, index=None, exc=None, rules=None, lookup=None):
        return cls(index, exc, rules, lookup)

    def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
        self.index = index
        self.exc = exceptions
        self.rules = rules or {}
        self.lookup_table = lookup if lookup is not None else {}

    def __call__(self, string, univ_pos, morphology=None):
        # Difference 1: self.rules is assumed to be non-None, so no
        # 'is None' check required.
        # The string is lowercased from the get-go: all lemmatization
        # results are lowercased strings. For most applications this
        # shouldn't pose any problems, and it keeps the exceptions indexes
        # small. If this creates problems for proper nouns, we can
        # introduce a check for univ_pos == "PROPN".
        string = string.lower()
        try:
            univ_pos = self.univ_pos_name_variants[univ_pos]
        except KeyError:
            # Because PROPN is not in self.univ_pos_name_variants, proper
            # names are not lemmatized. They are lowercased, however.
            return [string]
        # if string in self.lemma_index.get(univ_pos)
        lemma_index = self.index.get(univ_pos, {})
        # string is already a lemma
        if string in lemma_index:
            return [string]
        exceptions = self.exc.get(univ_pos, {})
        # string is an irregular token contained in the exceptions index
        try:
            lemma = exceptions[string]
            return [lemma[0]]
        except KeyError:
            pass
        # string corresponds to a key in the lookup table
        lookup_table = self.lookup_table
        looked_up_lemma = lookup_table.get(string)
        if looked_up_lemma and looked_up_lemma in lemma_index:
            return [looked_up_lemma]

        forms, is_known = lemmatize(
            string,
            lemma_index,
            exceptions,
            self.rules.get(univ_pos, []))

        # Back off through the remaining return-value candidates.
        if forms:
            if is_known:
                return forms
            else:
                for form in forms:
                    if form in exceptions:
                        return [form]
                if looked_up_lemma:
                    return [looked_up_lemma]
                else:
                    return forms
        elif looked_up_lemma:
            return [looked_up_lemma]
        else:
            return [string]

    # Overrides the parent method so that a lowercased version of the
    # string is used to search the lookup table. This is necessary because
    # our lookup table consists entirely of lowercase keys.
    def lookup(self, string):
        string = string.lower()
        return self.lookup_table.get(string, string)

    def noun(self, string, morphology=None):
        return self(string, 'noun', morphology)

    def verb(self, string, morphology=None):
        return self(string, 'verb', morphology)

    def adj(self, string, morphology=None):
        return self(string, 'adj', morphology)

    def det(self, string, morphology=None):
        return self(string, 'det', morphology)

    def pron(self, string, morphology=None):
        return self(string, 'pron', morphology)

    def adp(self, string, morphology=None):
        return self(string, 'adp', morphology)

    def punct(self, string, morphology=None):
        return self(string, 'punct', morphology)


# Reimplemented to focus more on the application of suffix rules and to
# return as early as possible.
def lemmatize(string, index, exceptions, rules):
    # returns (forms, is_known: bool)
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index:
                return [form], True  # True = is known (is a lemma)
            else:
                oov_forms.append(form)
    return list(set(oov_forms)), False
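To make the back-off order above concrete, here is a minimal usage sketch with made-up toy tables (the real tables live in the suppressed _verbs.py, _verbs_irreg.py and lookup.py files; the [old_suffix, new_suffix] rule format simply mirrors the `for old, new in rules` loop):

# Sketch: exercising the back-off chain with tiny, made-up tables.
index = {"verb": {"gaan", "lopen"}}     # known lemmas per POS
exc = {"verb": {"gaat": ("gaan",)}}     # irregular forms per POS
rules = {"verb": [["t", "en"]]}         # toy suffix rewrite rule
lookup = {"liep": "lopen"}              # fallback lookup table

lemmatizer = DutchLemmatizer(index, exc, rules, lookup)

print(lemmatizer("gaat", "verb"))        # ['gaan']   -- exceptions hit
print(lemmatizer("gaan", "verb"))        # ['gaan']   -- already a lemma
print(lemmatizer("liep", "verb"))        # ['lopen']  -- lookup table hit
print(lemmatizer("loopt", "verb"))       # ['loopen'] -- OOV rule candidate,
                                         #               returned as-is
print(lemmatizer("Amsterdam", "PROPN"))  # ['amsterdam'] -- unknown POS, so
                                         #               lowercased only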
212951    spacy/lang/nl/lemmatizer/lookup.py    Normal file
File diff suppressed because it is too large
@@ -4,22 +4,18 @@ from __future__ import unicode_literals
 from ...attrs import LIKE_NUM


-_num_words = set(
-    """
+_num_words = set("""
 nul een één twee drie vier vijf zes zeven acht negen tien elf twaalf dertien
 veertien twintig dertig veertig vijftig zestig zeventig tachtig negentig honderd
 duizend miljoen miljard biljoen biljard triljoen triljard
-""".split()
-)
+""".split())


-_ordinal_words = set(
-    """
+_ordinal_words = set("""
 eerste tweede derde vierde vijfde zesde zevende achtste negende tiende elfde
 twaalfde dertiende veertiende twintigste dertigste veertigste vijftigste
 zestigste zeventigste tachtigste negentigste honderdste duizendste miljoenste
 miljardste biljoenste biljardste triljoenste triljardste
-""".split()
-)
+""".split())


 def like_num(text):
@@ -27,13 +23,11 @@ def like_num(text):
     # or matches one of the number words. In order to handle numbers like
     # "drieëntwintig", more work is required.
     # See this discussion: https://github.com/explosion/spaCy/pull/1177
-    if text.startswith(("+", "-", "±", "~")):
-        text = text[1:]
-    text = text.replace(",", "").replace(".", "")
+    text = text.replace(',', '').replace('.', '')
     if text.isdigit():
         return True
-    if text.count("/") == 1:
-        num, denom = text.split("/")
+    if text.count('/') == 1:
+        num, denom = text.split('/')
         if num.isdigit() and denom.isdigit():
             return True
     if text.lower() in _num_words:
@@ -43,4 +37,6 @@ def like_num(text):
     return False


-LEX_ATTRS = {LIKE_NUM: like_num}
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
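As a quick illustration of what the updated like_num accepts (a sketch; the sample strings are made up):

# like_num on some Dutch inputs:
print(like_num("twaalf"))          # True  -- listed number word
print(like_num("Elf"))             # True  -- lowercased before the lookup
print(like_num("1.000"))           # True  -- separators are stripped first
print(like_num("3/4"))             # True  -- simple fractions are recognised
print(like_num("drieëntwintig"))   # False -- compound numerals not handled yet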
33    spacy/lang/nl/punctuation.py    Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER

from ..punctuation import TOKENIZER_SUFFIXES as DEFAULT_TOKENIZER_SUFFIXES


# Copied from `de` package. Main purpose is to ensure that hyphens are not
# split on.

_quotes = CONCAT_QUOTES.replace("'", '')

_infixes = (LIST_ELLIPSES + LIST_ICONS +
            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
             r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])'.format(a=ALPHA, q=_quotes),
             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
             r'(?<=[0-9])-(?=[0-9])'])


# Remove "'s" suffix from the suffix list. In Dutch, "'s" is a plural ending
# when it occurs as a suffix and a clitic for "eens" in standalone use. To
# avoid ambiguity it's better to just leave it attached when it occurs as a
# suffix.
default_suffix_blacklist = ("'s", "'S", '’s', '’S')
_suffixes = [suffix for suffix in DEFAULT_TOKENIZER_SUFFIXES
             if suffix not in default_suffix_blacklist]

TOKENIZER_INFIXES = _infixes
TOKENIZER_SUFFIXES = _suffixes
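A quick way to probe the hyphen behaviour these infixes encode (a sketch with TOKENIZER_INFIXES as defined above; compile_infix_regex is a standard spaCy utility, and the sample strings are made up):

from spacy.util import compile_infix_regex

infix_re = compile_infix_regex(TOKENIZER_INFIXES)
# No letter-letter hyphen pattern is defined, so hyphenated compounds
# yield no infix match and stay one token:
print([m.group() for m in infix_re.finditer("mond-tot-mondreclame")])  # []
# Digit-digit hyphens *are* listed, so number ranges do get split:
print([m.group() for m in infix_re.finditer("1940-1945")])             # ['-']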
@@ -1,45 +1,73 @@
 # coding: utf8
 from __future__ import unicode_literals

-# Stop words are retrieved from http://www.damienvanholten.com/downloads/dutch-stop-words.txt
+# The original stop words list (added in f46ffe3) was taken from
+# http://www.damienvanholten.com/downloads/dutch-stop-words.txt
+# and consisted of about 100 tokens.
+# In order to achieve parity with some of the better-supported
+# languages, e.g., English, French, and German, this original list has been
+# extended with 200 additional tokens. The main source of inspiration was
+# https://raw.githubusercontent.com/stopwords-iso/stopwords-nl/master/stopwords-nl.txt.
+# However, quite a bit of manual editing has taken place as well.
+# Tokens whose status as a stop word is not entirely clear were admitted or
+# rejected by deferring to their counterparts in the stop words lists for English
+# and French. Similarly, those lists were used to identify and fill in gaps so
+# that -- in principle -- each token contained in the English stop words list
+# should have a Dutch counterpart here.

-STOP_WORDS = set(
-    """
-aan af al alles als altijd andere
+STOP_WORDS = set("""
+aan af al alle alles allebei alleen allen als altijd ander anders andere anderen aangaangde aangezien achter achterna
+afgelopen aldus alhoewel anderzijds

-ben bij
+ben bij bijna bijvoorbeeld behalve beide beiden beneden bent bepaald beter betere betreffende binnen binnenin boven
+bovenal bovendien bovenstaand buiten

-daar dan dat de der deze die dit doch doen door dus
+daar dan dat de der den deze die dit doch doen door dus daarheen daarin daarna daarnet daarom daarop des dezelfde dezen
+dien dikwijls doet doorgaand doorgaans

-een eens en er
+een eens en er echter enige eerder eerst eerste eersten effe eigen elk elke enkel enkele enz erdoor etc even eveneens
+evenwel

+ff

-ge geen geweest
+ge geen geweest gauw gedurende gegeven gehad geheel gekund geleden gelijk gemogen geven geweest gewoon gewoonweg
+geworden gij

-haar had heb hebben heeft hem het hier hij hoe hun
+haar had heb hebben heeft hem het hier hij hoe hun hadden hare hebt hele hen hierbeneden hierboven hierin hoewel hun

-iemand iets ik in is
+iemand iets ik in is idd ieder ikke ikzelf indien inmiddels inz inzake

-ja je
+ja je jou jouw jullie jezelf jij jijzelf jouwe juist

-kan kon kunnen
+kan kon kunnen klaar konden krachtens kunnen kunt

+lang later liet liever

-maar me meer men met mij mijn moet
+maar me meer men met mij mijn moet mag mede meer meesten mezelf mijzelf min minder misschien mocht mochten moest moesten
+moet moeten mogelijk mogen

-na naar niet niets nog nu
+na naar niet niets nog nu nabij nadat net nogal nooit nr nu

-of om omdat ons ook op over
+of om omdat ons ook op over omhoog omlaag omstreeks omtrent omver onder ondertussen ongeveer onszelf onze ooit opdat
+opnieuw opzij over overigens

+pas pp precies prof publ

-reeds
+reeds rond rondom

+sedert sinds sindsdien slechts sommige spoedig steeds

-te tegen toch toen tot
+‘t 't te tegen toch toen tot tamelijk ten tenzij ter terwijl thans tijdens toe totdat tussen

-u uit uw
+u uit uw uitgezonderd uwe uwen

-van veel voor
+van veel voor vaak vanaf vandaan vanuit vanwege veeleer verder verre vervolgens vgl volgens vooraf vooral vooralsnog
+voorbij voordat voordien voorheen voorop voort voorts vooruit vrij vroeg

-want waren was wat we wel werd wezen wie wij wil worden
+want waren was wat we wel werd wezen wie wij wil worden waar waarom wanneer want weer weg wegens weinig weinige weldra
+welk welke welken werd werden wiens wier wilde wordt

-zal ze zei zelf zich zij zijn zo zonder zou
-""".split()
-)
+zal ze zei zelf zich zij zijn zo zonder zou zeer zeker zekere zelfde zelfs zichzelf zijnde zijne zo’n zoals zodra zouden
+zoveel zowat zulk zulke zulks zullen zult
+""".split())
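A quick sanity check of the extended list (a sketch; assumes the module above is importable under the usual spaCy layout):

from spacy.lang.nl.stop_words import STOP_WORDS

assert "aan" in STOP_WORDS        # token from the original list
assert "alhoewel" in STOP_WORDS   # one of the newly added tokens
print(len(STOP_WORDS))            # roughly 300 tokens after the extension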
@@ -5,7 +5,6 @@ from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB
 from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ


-# fmt: off
 TAG_MAP = {
     "ADJ__Number=Sing": {POS: ADJ},
     "ADJ___": {POS: ADJ},
@@ -811,4 +810,3 @@ TAG_MAP = {
     "X___": {POS: X},
     "_SP": {POS: SPACE}
 }
-# fmt: on
340    spacy/lang/nl/tokenizer_exceptions.py    Normal file
@@ -0,0 +1,340 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA

# Extensive list of both common and uncommon Dutch abbreviations copied from
# github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based
# sentence boundary detection (MIT, Copyright 2015 Kevin S. Dias).
# Source file: https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/languages/dutch.rb
# (Last commit: 4d1477b)

# Main purpose of such an extensive list: considerably improved sentence
# segmentation.

# Note: This list has been copied over largely as-is. Some of the abbreviations
# are extremely domain-specific. Tokenizer performance may benefit from some
# slight pruning, although no performance regression has been observed so far.


abbrevs = ['a.2d.', 'a.a.', 'a.a.j.b.', 'a.f.t.', 'a.g.j.b.',
           'a.h.v.', 'a.h.w.', 'a.hosp.', 'a.i.', 'a.j.b.', 'a.j.t.',
           'a.m.', 'a.m.r.', 'a.p.m.', 'a.p.r.', 'a.p.t.', 'a.s.',
           'a.t.d.f.', 'a.u.b.', 'a.v.a.', 'a.w.', 'aanbev.',
           'aanbev.comm.', 'aant.', 'aanv.st.', 'aanw.', 'vnw.',
           'aanw.vnw.', 'abd.', 'abm.', 'abs.', 'acc.act.',
           'acc.bedr.m.', 'acc.bedr.t.', 'achterv.', 'act.dr.',
           'act.dr.fam.', 'act.fisc.', 'act.soc.', 'adm.akk.',
           'adm.besl.', 'adm.lex.', 'adm.onderr.', 'adm.ov.', 'adv.',
           'adv.', 'gen.', 'adv.bl.', 'afd.', 'afl.', 'aggl.verord.',
           'agr.', 'al.', 'alg.', 'alg.richts.', 'amén.', 'ann.dr.',
           'ann.dr.lg.', 'ann.dr.sc.pol.', 'ann.ét.eur.',
           'ann.fac.dr.lg.', 'ann.jur.créd.',
           'ann.jur.créd.règl.coll.', 'ann.not.', 'ann.parl.',
           'ann.prat.comm.', 'app.', 'arb.', 'aud.', 'arbbl.',
           'arbh.', 'arbit.besl.', 'arbrb.', 'arr.', 'arr.cass.',
           'arr.r.v.st.', 'arr.verbr.', 'arrondrb.', 'art.', 'artw.',
           'aud.', 'b.', 'b.', 'b.&w.', 'b.a.', 'b.a.s.', 'b.b.o.',
           'b.best.dep.', 'b.br.ex.', 'b.coll.fr.gem.comm.',
           'b.coll.vl.gem.comm.', 'b.d.cult.r.', 'b.d.gem.ex.',
           'b.d.gem.reg.', 'b.dep.', 'b.e.b.', 'b.f.r.',
           'b.fr.gem.ex.', 'b.fr.gem.reg.', 'b.i.h.', 'b.inl.j.d.',
           'b.inl.s.reg.', 'b.j.', 'b.l.', 'b.o.z.', 'b.prov.r.',
           'b.r.h.', 'b.s.', 'b.sr.', 'b.stb.', 'b.t.i.r.',
           'b.t.s.z.', 'b.t.w.rev.', 'b.v.',
           'b.ver.coll.gem.gem.comm.', 'b.verg.r.b.', 'b.versl.',
           'b.vl.ex.', 'b.voorl.reg.', 'b.w.', 'b.w.gew.ex.',
           'b.z.d.g.', 'b.z.v.', 'bab.', 'bedr.org.', 'begins.',
           'beheersov.', 'bekendm.comm.', 'bel.', 'bel.besch.',
           'bel.w.p.', 'beleidsov.', 'belg.', 'grondw.', 'ber.',
           'ber.w.', 'besch.', 'besl.', 'beslagr.', 'bestuurswet.',
           'bet.', 'betr.', 'betr.', 'vnw.', 'bevest.', 'bew.',
           'bijbl.', 'ind.', 'eig.', 'bijbl.n.bijdr.', 'bijl.',
           'bijv.', 'bijw.', 'bijz.decr.', 'bin.b.', 'bkh.', 'bl.',
           'blz.', 'bm.', 'bn.', 'rh.', 'bnw.', 'bouwr.', 'br.parl.',
           'bs.', 'bull.', 'bull.adm.pénit.', 'bull.ass.',
           'bull.b.m.m.', 'bull.bel.', 'bull.best.strafinr.',
           'bull.bmm.', 'bull.c.b.n.', 'bull.c.n.c.', 'bull.cbn.',
           'bull.centr.arb.', 'bull.cnc.', 'bull.contr.',
           'bull.doc.min.fin.', 'bull.f.e.b.', 'bull.feb.',
           'bull.fisc.fin.r.', 'bull.i.u.m.',
           'bull.inf.ass.secr.soc.', 'bull.inf.i.e.c.',
           'bull.inf.i.n.a.m.i.', 'bull.inf.i.r.e.', 'bull.inf.iec.',
           'bull.inf.inami.', 'bull.inf.ire.', 'bull.inst.arb.',
           'bull.ium.', 'bull.jur.imm.', 'bull.lég.b.', 'bull.off.',
           'bull.trim.b.dr.comp.', 'bull.us.', 'bull.v.b.o.',
           'bull.vbo.', 'bv.', 'bw.', 'bxh.', 'byz.', 'c.', 'c.a.',
           'c.a.-a.', 'c.a.b.g.', 'c.c.', 'c.c.i.', 'c.c.s.',
           'c.conc.jur.', 'c.d.e.', 'c.d.p.k.', 'c.e.', 'c.ex.',
           'c.f.', 'c.h.a.', 'c.i.f.', 'c.i.f.i.c.', 'c.j.', 'c.l.',
           'c.n.', 'c.o.d.', 'c.p.', 'c.pr.civ.', 'c.q.', 'c.r.',
           'c.r.a.', 'c.s.', 'c.s.a.', 'c.s.q.n.', 'c.v.', 'c.v.a.',
           'c.v.o.', 'ca.', 'cadeaust.', 'cah.const.',
           'cah.dr.europ.', 'cah.dr.immo.', 'cah.dr.jud.', 'cal.',
           '2d.', 'cal.', '3e.', 'cal.', 'rprt.', 'cap.', 'carg.',
           'cass.', 'cass.', 'verw.', 'cert.', 'cf.', 'ch.', 'chron.',
           'chron.d.s.', 'chron.dr.not.', 'cie.', 'cie.',
           'verz.schr.', 'cir.', 'circ.', 'circ.z.', 'cit.',
           'cit.loc.', 'civ.', 'cl.et.b.', 'cmt.', 'co.',
           'cognoss.v.', 'coll.', 'v.', 'b.', 'colp.w.', 'com.',
           'com.', 'cas.', 'com.v.min.', 'comm.', 'comm.', 'v.',
           'comm.bijz.ov.', 'comm.erf.', 'comm.fin.', 'comm.ger.',
           'comm.handel.', 'comm.pers.', 'comm.pub.', 'comm.straf.',
           'comm.v.', 'comm.venn.', 'comm.verz.', 'comm.voor.',
           'comp.', 'compt.w.', 'computerr.', 'con.m.', 'concl.',
           'concr.', 'conf.', 'confl.w.', 'confl.w.huwbetr.', 'cons.',
           'conv.', 'coöp.', 'ver.', 'corr.', 'corr.bl.',
           'cour.fisc.', 'cour.immo.', 'cridon.', 'crim.', 'cur.',
           'cur.', 'crt.', 'curs.', 'd.', 'd.-g.', 'd.a.', 'd.a.v.',
           'd.b.f.', 'd.c.', 'd.c.c.r.', 'd.d.', 'd.d.p.', 'd.e.t.',
           'd.gem.r.', 'd.h.', 'd.h.z.', 'd.i.', 'd.i.t.', 'd.j.',
           'd.l.r.', 'd.m.', 'd.m.v.', 'd.o.v.', 'd.parl.', 'd.w.z.',
           'dact.', 'dat.', 'dbesch.', 'dbesl.', 'decr.', 'decr.d.',
           'decr.fr.', 'decr.vl.', 'decr.w.', 'def.', 'dep.opv.',
           'dep.rtl.', 'derg.', 'desp.', 'det.mag.', 'deurw.regl.',
           'dez.', 'dgl.', 'dhr.', 'disp.', 'diss.', 'div.',
           'div.act.', 'div.bel.', 'dl.', 'dln.', 'dnotz.', 'doc.',
           'hist.', 'doc.jur.b.', 'doc.min.fin.', 'doc.parl.',
           'doctr.', 'dpl.', 'dpl.besl.', 'dr.', 'dr.banc.fin.',
           'dr.circ.', 'dr.inform.', 'dr.mr.', 'dr.pén.entr.',
           'dr.q.m.', 'drs.', 'dtp.', 'dwz.', 'dyn.', 'e.', 'e.a.',
           'e.b.', 'tek.mod.', 'e.c.', 'e.c.a.', 'e.d.', 'e.e.',
           'e.e.a.', 'e.e.g.', 'e.g.', 'e.g.a.', 'e.h.a.', 'e.i.',
           'e.j.', 'e.m.a.', 'e.n.a.c.', 'e.o.', 'e.p.c.', 'e.r.c.',
           'e.r.f.', 'e.r.h.', 'e.r.o.', 'e.r.p.', 'e.r.v.',
           'e.s.r.a.', 'e.s.t.', 'e.v.', 'e.v.a.', 'e.w.', 'e&o.e.',
           'ec.pol.r.', 'econ.', 'ed.', 'ed(s).', 'eff.', 'eig.',
           'eig.mag.', 'eil.', 'elektr.', 'enmb.', 'enz.', 'err.',
           'etc.', 'etq.', 'eur.', 'parl.', 'eur.t.s.', 'ev.', 'evt.',
           'ex.', 'ex.crim.', 'exec.', 'f.', 'f.a.o.', 'f.a.q.',
           'f.a.s.', 'f.i.b.', 'f.j.f.', 'f.o.b.', 'f.o.r.', 'f.o.s.',
           'f.o.t.', 'f.r.', 'f.supp.', 'f.suppl.', 'fa.', 'facs.',
           'fasc.', 'fg.', 'fid.ber.', 'fig.', 'fin.verh.w.', 'fisc.',
           'fisc.', 'tijdschr.', 'fisc.act.', 'fisc.koer.', 'fl.',
           'form.', 'foro.', 'it.', 'fr.', 'fr.cult.r.', 'fr.gem.r.',
           'fr.parl.', 'fra.', 'ft.', 'g.', 'g.a.', 'g.a.v.',
           'g.a.w.v.', 'g.g.d.', 'g.m.t.', 'g.o.', 'g.omt.e.', 'g.p.',
           'g.s.', 'g.v.', 'g.w.w.', 'geb.', 'gebr.', 'gebrs.',
           'gec.', 'gec.decr.', 'ged.', 'ged.st.', 'gedipl.',
           'gedr.st.', 'geh.', 'gem.', 'gem.', 'gem.',
           'gem.gem.comm.', 'gem.st.', 'gem.stem.', 'gem.w.',
           'gemeensch.optr.', 'gemeensch.standp.', 'gemeensch.strat.',
           'gemeent.', 'gemeent.b.', 'gemeent.regl.',
           'gemeent.verord.', 'geol.', 'geopp.', 'gepubl.',
           'ger.deurw.', 'ger.w.', 'gerekw.', 'gereq.', 'gesch.',
           'get.', 'getr.', 'gev.m.', 'gev.maatr.', 'gew.', 'ghert.',
           'gir.eff.verk.', 'gk.', 'gr.', 'gramm.', 'grat.w.',
           'grootb.w.', 'grs.', 'grvm.', 'grw.', 'gst.', 'gw.',
           'h.a.', 'h.a.v.o.', 'h.b.o.', 'h.e.a.o.', 'h.e.g.a.',
           'h.e.geb.', 'h.e.gestr.', 'h.l.', 'h.m.', 'h.o.', 'h.r.',
           'h.t.l.', 'h.t.m.', 'h.w.geb.', 'hand.', 'handelsn.w.',
           'handelspr.', 'handelsr.w.', 'handelsreg.w.', 'handv.',
           'harv.l.rev.', 'hc.', 'herald.', 'hert.', 'herz.',
           'hfdst.', 'hfst.', 'hgrw.', 'hhr.', 'hist.', 'hooggel.',
           'hoogl.', 'hosp.', 'hpw.', 'hr.', 'hr.', 'ms.', 'hr.ms.',
           'hregw.', 'hrg.', 'hst.', 'huis.just.', 'huisv.w.',
           'huurbl.', 'hv.vn.', 'hw.', 'hyp.w.', 'i.b.s.', 'i.c.',
           'i.c.m.h.', 'i.e.', 'i.f.', 'i.f.p.', 'i.g.v.', 'i.h.',
           'i.h.a.', 'i.h.b.', 'i.l.pr.', 'i.o.', 'i.p.o.', 'i.p.r.',
           'i.p.v.', 'i.pl.v.', 'i.r.d.i.', 'i.s.m.', 'i.t.t.',
           'i.v.', 'i.v.m.', 'i.v.s.', 'i.w.tr.', 'i.z.', 'ib.',
           'ibid.', 'icip-ing.cons.', 'iem.', 'indic.soc.', 'indiv.',
           'inf.', 'inf.i.d.a.c.', 'inf.idac.', 'inf.r.i.z.i.v.',
           'inf.riziv.', 'inf.soc.secr.', 'ing.', 'ing.', 'cons.',
           'ing.cons.', 'inst.', 'int.', 'int.', 'rechtsh.',
           'strafz.', 'interm.', 'intern.fisc.act.',
           'intern.vervoerr.', 'inv.', 'inv.', 'f.', 'inv.w.',
           'inv.wet.', 'invord.w.', 'inz.', 'ir.', 'irspr.', 'iwtr.',
           'j.', 'j.-cl.', 'j.c.b.', 'j.c.e.', 'j.c.fl.', 'j.c.j.',
           'j.c.p.', 'j.d.e.', 'j.d.f.', 'j.d.s.c.', 'j.dr.jeun.',
           'j.j.d.', 'j.j.p.', 'j.j.pol.', 'j.l.', 'j.l.m.b.',
           'j.l.o.', 'j.p.a.', 'j.r.s.', 'j.t.', 'j.t.d.e.',
           'j.t.dr.eur.', 'j.t.o.', 'j.t.t.', 'jaarl.', 'jb.hand.',
           'jb.kred.', 'jb.kred.c.s.', 'jb.l.r.b.', 'jb.lrb.',
           'jb.markt.', 'jb.mens.', 'jb.t.r.d.', 'jb.trd.',
           'jeugdrb.', 'jeugdwerkg.w.', 'jg.', 'jis.', 'jl.',
           'journ.jur.', 'journ.prat.dr.fisc.fin.', 'journ.proc.',
           'jrg.', 'jur.', 'jur.comm.fl.', 'jur.dr.soc.b.l.n.',
           'jur.f.p.e.', 'jur.fpe.', 'jur.niv.', 'jur.trav.brux.',
           'jurambt.', 'jv.cass.', 'jv.h.r.j.', 'jv.hrj.', 'jw.',
           'k.', 'k.', 'k.b.', 'k.g.', 'k.k.', 'k.m.b.o.', 'k.o.o.',
           'k.v.k.', 'k.v.v.v.', 'kadasterw.', 'kaderb.', 'kador.',
           'kbo-nr.', 'kg.', 'kh.', 'kiesw.', 'kind.bes.v.', 'kkr.',
           'koopv.', 'kr.', 'krankz.w.', 'ksbel.', 'kt.', 'ktg.',
           'ktr.', 'kvdm.', 'kw.r.', 'kymr.', 'kzr.', 'kzw.', 'l.',
           'l.b.', 'l.b.o.', 'l.bas.', 'l.c.', 'l.gew.', 'l.j.',
           'l.k.', 'l.l.', 'l.o.', 'l.r.b.', 'l.u.v.i.', 'l.v.r.',
           'l.v.w.', 'l.w.', "l'exp.-compt.b..", 'l’exp.-compt.b.',
           'landinr.w.', 'landscrt.', 'lat.', 'law.ed.', 'lett.',
           'levensverz.', 'lgrs.', 'lidw.', 'limb.rechtsl.', 'lit.',
           'litt.', 'liw.', 'liwet.', 'lk.', 'll.', 'll.(l.)l.r.',
           'loonw.', 'losbl.', 'ltd.', 'luchtv.', 'luchtv.w.', 'm.',
           'm.', 'not.', 'm.a.v.o.', 'm.a.w.', 'm.b.', 'm.b.o.',
           'm.b.r.', 'm.b.t.', 'm.d.g.o.', 'm.e.a.o.', 'm.e.r.',
           'm.h.', 'm.h.d.', 'm.i.v.', 'm.j.t.', 'm.k.', 'm.m.',
           'm.m.a.', 'm.m.h.h.', 'm.m.v.', 'm.n.', 'm.not.fisc.',
           'm.nt.', 'm.o.', 'm.r.', 'm.s.a.', 'm.u.p.', 'm.v.a.',
           'm.v.h.n.', 'm.v.t.', 'm.z.', 'maatr.teboekgest.luchtv.',
           'maced.', 'mand.', 'max.', 'mbl.not.', 'me.', 'med.',
           'med.', 'v.b.o.', 'med.b.u.f.r.', 'med.bufr.', 'med.vbo.',
           'meerv.', 'meetbr.w.', 'mém.adm.', 'mgr.', 'mgrs.', 'mhd.',
           'mi.verantw.', 'mil.', 'mil.bed.', 'mil.ger.', 'min.',
           'min.', 'aanbev.', 'min.', 'circ.', 'min.', 'fin.',
           'min.j.omz.', 'min.just.circ.', 'mitt.', 'mnd.', 'mod.',
           'mon.', 'mouv.comm.', 'mr.', 'ms.', 'muz.', 'mv.', 'n.',
           'chr.', 'n.a.', 'n.a.g.', 'n.a.v.', 'n.b.', 'n.c.',
           'n.chr.', 'n.d.', 'n.d.r.', 'n.e.a.', 'n.g.', 'n.h.b.c.',
           'n.j.', 'n.j.b.', 'n.j.w.', 'n.l.', 'n.m.', 'n.m.m.',
           'n.n.', 'n.n.b.', 'n.n.g.', 'n.n.k.', 'n.o.m.', 'n.o.t.k.',
           'n.rapp.', 'n.tijd.pol.', 'n.v.', 'n.v.d.r.', 'n.v.d.v.',
           'n.v.o.b.', 'n.v.t.', 'nat.besch.w.', 'nat.omb.',
           'nat.pers.', 'ned.cult.r.', 'neg.verkl.', 'nhd.', 'wisk.',
           'njcm-bull.', 'nl.', 'nnd.', 'no.', 'not.fisc.m.',
           'not.w.', 'not.wet.', 'nr.', 'nrs.', 'nste.', 'nt.',
           'numism.', 'o.', 'o.a.', 'o.b.', 'o.c.', 'o.g.', 'o.g.v.',
           'o.i.', 'o.i.d.', 'o.m.', 'o.o.', 'o.o.d.', 'o.o.v.',
           'o.p.', 'o.r.', 'o.regl.', 'o.s.', 'o.t.s.', 'o.t.t.',
           'o.t.t.t.', 'o.t.t.z.', 'o.tk.t.', 'o.v.t.', 'o.v.t.t.',
           'o.v.tk.t.', 'o.v.v.', 'ob.', 'obsv.', 'octr.',
           'octr.gem.regl.', 'octr.regl.', 'oe.', 'off.pol.', 'ofra.',
           'ohd.', 'omb.', 'omnil.', 'omz.', 'on.ww.', 'onderr.',
           'onfrank.', 'onteig.w.', 'ontw.', 'b.w.', 'onuitg.',
           'onz.', 'oorl.w.', 'op.cit.', 'opin.pa.', 'opm.', 'or.',
           'ord.br.', 'ord.gem.', 'ors.', 'orth.', 'os.', 'osm.',
           'ov.', 'ov.w.i.', 'ov.w.ii.', 'ov.ww.', 'overg.w.',
           'overw.', 'ovkst.', 'oz.', 'p.', 'p.a.', 'p.a.o.',
           'p.b.o.', 'p.e.', 'p.g.', 'p.j.', 'p.m.', 'p.m.a.', 'p.o.',
           'p.o.j.t.', 'p.p.', 'p.v.', 'p.v.s.', 'pachtw.', 'pag.',
           'pan.', 'pand.b.', 'pand.pér.', 'parl.gesch.',
           'parl.gesch.', 'inv.', 'parl.st.', 'part.arb.', 'pas.',
           'pasin.', 'pat.', 'pb.c.', 'pb.l.', 'pens.',
           'pensioenverz.', 'per.ber.i.b.r.', 'per.ber.ibr.', 'pers.',
           'st.', 'pft.', 'pk.', 'pktg.', 'plv.', 'po.', 'pol.',
           'pol.off.', 'pol.r.', 'pol.w.', 'postbankw.', 'postw.',
           'pp.', 'pr.', 'preadv.', 'pres.', 'prf.', 'prft.', 'prg.',
           'prijz.w.', 'proc.', 'procesregl.', 'prof.', 'prot.',
           'prov.', 'prov.b.', 'prov.instr.h.m.g.', 'prov.regl.',
           'prov.verord.', 'prov.w.', 'publ.', 'pun.', 'pw.',
           'q.b.d.', 'q.e.d.', 'q.q.', 'q.r.', 'r.', 'r.a.b.g.',
           'r.a.c.e.', 'r.a.j.b.', 'r.b.d.c.', 'r.b.d.i.', 'r.b.s.s.',
           'r.c.', 'r.c.b.', 'r.c.d.c.', 'r.c.j.b.', 'r.c.s.j.',
           'r.cass.', 'r.d.c.', 'r.d.i.', 'r.d.i.d.c.', 'r.d.j.b.',
           'r.d.j.p.', 'r.d.p.c.', 'r.d.s.', 'r.d.t.i.', 'r.e.',
           'r.f.s.v.p.', 'r.g.a.r.', 'r.g.c.f.', 'r.g.d.c.', 'r.g.f.',
           'r.g.z.', 'r.h.a.', 'r.i.c.', 'r.i.d.a.', 'r.i.e.j.',
           'r.i.n.', 'r.i.s.a.', 'r.j.d.a.', 'r.j.i.', 'r.k.', 'r.l.',
           'r.l.g.b.', 'r.med.', 'r.med.rechtspr.', 'r.n.b.', 'r.o.',
           'r.ov.', 'r.p.', 'r.p.d.b.', 'r.p.o.t.', 'r.p.r.j.',
           'r.p.s.', 'r.r.d.', 'r.r.s.', 'r.s.', 'r.s.v.p.',
           'r.stvb.', 'r.t.d.f.', 'r.t.d.h.', 'r.t.l.',
           'r.trim.dr.eur.', 'r.v.a.', 'r.verkb.', 'r.w.', 'r.w.d.',
           'rap.ann.c.a.', 'rap.ann.c.c.', 'rap.ann.c.e.',
           'rap.ann.c.s.j.', 'rap.ann.ca.', 'rap.ann.cass.',
           'rap.ann.cc.', 'rap.ann.ce.', 'rap.ann.csj.', 'rapp.',
           'rb.', 'rb.kh.', 'rdn.', 'rdnr.', 're.pers.', 'rec.',
           'rec.c.i.j.', 'rec.c.j.c.e.', 'rec.cij.', 'rec.cjce.',
           'rec.gén.enr.not.', 'rechtsk.t.', 'rechtspl.zeem.',
           'rechtspr.arb.br.', 'rechtspr.b.f.e.', 'rechtspr.bfe.',
           'rechtspr.soc.r.b.l.n.', 'recl.reg.', 'rect.', 'red.',
           'reg.', 'reg.huiz.bew.', 'reg.w.', 'registr.w.', 'regl.',
           'regl.', 'r.v.k.', 'regl.besl.', 'regl.onderr.',
           'regl.r.t.', 'rep.', 'rép.fisc.', 'rép.not.', 'rep.r.j.',
           'rep.rj.', 'req.', 'res.', 'resp.', 'rev.', 'rev.',
           'comp.', 'rev.', 'trim.', 'civ.', 'rev.', 'trim.', 'comm.',
           'rev.acc.trav.', 'rev.adm.', 'rev.b.compt.',
           'rev.b.dr.const.', 'rev.b.dr.intern.', 'rev.b.séc.soc.',
           'rev.banc.fin.', 'rev.comm.', 'rev.cons.prud.',
           'rev.dr.b.', 'rev.dr.commun.', 'rev.dr.étr.',
           'rev.dr.fam.', 'rev.dr.intern.comp.', 'rev.dr.mil.',
           'rev.dr.min.', 'rev.dr.pén.', 'rev.dr.pén.mil.',
           'rev.dr.rur.', 'rev.dr.u.l.b.', 'rev.dr.ulb.', 'rev.exp.',
           'rev.faill.', 'rev.fisc.', 'rev.gd.', 'rev.hist.dr.',
           'rev.i.p.c.', 'rev.ipc.', 'rev.not.b.',
           'rev.prat.dr.comm.', 'rev.prat.not.b.', 'rev.prat.soc.',
           'rev.rec.', 'rev.rw.', 'rev.trav.', 'rev.trim.d.h.',
           'rev.trim.dr.fam.', 'rev.urb.', 'richtl.', 'riv.dir.int.',
           'riv.dir.int.priv.proc.', 'rk.', 'rln.', 'roln.', 'rom.',
           'rondz.', 'rov.', 'rtl.', 'rubr.', 'ruilv.wet.',
           'rv.verdr.', 'rvkb.', 's.', 's.', 's.a.', 's.b.n.',
           's.ct.', 's.d.', 's.e.c.', 's.e.et.o.', 's.e.w.',
           's.exec.rept.', 's.hrg.', 's.j.b.', 's.l.', 's.l.e.a.',
           's.l.n.d.', 's.p.a.', 's.s.', 's.t.', 's.t.b.', 's.v.',
           's.v.p.', 'samenw.', 'sc.', 'sch.', 'scheidsr.uitspr.',
           'schepel.besl.', 'secr.comm.', 'secr.gen.', 'sect.soc.',
           'sess.', 'cas.', 'sir.', 'soc.', 'best.', 'soc.', 'handv.',
           'soc.', 'verz.', 'soc.act.', 'soc.best.', 'soc.kron.',
           'soc.r.', 'soc.sw.', 'soc.weg.', 'sofi-nr.', 'somm.',
           'somm.ann.', 'sp.c.c.', 'sr.', 'ss.', 'st.doc.b.c.n.a.r.',
           'st.doc.bcnar.', 'st.vw.', 'stagever.', 'stas.', 'stat.',
           'stb.', 'stbl.', 'stcrt.', 'stud.dipl.', 'su.', 'subs.',
           'subst.', 'succ.w.', 'suppl.', 'sv.', 'sw.', 't.', 't.a.',
           't.a.a.', 't.a.n.', 't.a.p.', 't.a.s.n.', 't.a.v.',
           't.a.v.w.', 't.aann.', 't.acc.', 't.agr.r.', 't.app.',
           't.b.b.r.', 't.b.h.', 't.b.m.', 't.b.o.', 't.b.p.',
           't.b.r.', 't.b.s.', 't.b.v.', 't.bankw.', 't.belg.not.',
           't.desk.', 't.e.m.', 't.e.p.', 't.f.r.', 't.fam.',
           't.fin.r.', 't.g.r.', 't.g.t.', 't.g.v.', 't.gem.',
           't.gez.', 't.huur.', 't.i.n.', 't.j.k.', 't.l.l.',
           't.l.v.', 't.m.', 't.m.r.', 't.m.w.', 't.mil.r.',
           't.mil.strafr.', 't.not.', 't.o.', 't.o.r.b.', 't.o.v.',
           't.ontv.', 't.p.r.', 't.pol.', 't.r.', 't.r.g.',
           't.r.o.s.', 't.r.v.', 't.s.r.', 't.strafr.', 't.t.',
           't.u.', 't.v.c.', 't.v.g.', 't.v.m.r.', 't.v.o.', 't.v.v.',
           't.v.v.d.b.', 't.v.w.', 't.verz.', 't.vred.', 't.vreemd.',
           't.w.', 't.w.k.', 't.w.v.', 't.w.v.r.', 't.wrr.', 't.z.',
           't.z.t.', 't.z.v.', 'taalk.', 'tar.burg.z.', 'td.',
           'techn.', 'telecomm.', 'toel.', 'toel.st.v.w.', 'toep.',
           'toep.regl.', 'tom.', 'top.', 'trans.b.', 'transp.r.',
           'trb.', 'trib.', 'trib.civ.', 'trib.gr.inst.', 'ts.',
           'ts.', 'best.', 'ts.', 'verv.', 'turnh.rechtsl.', 'tvpol.',
           'tvpr.', 'tvrechtsgesch.', 'tw.', 'u.', 'u.a.', 'u.a.r.',
           'u.a.v.', 'u.c.', 'u.c.c.', 'u.g.', 'u.p.', 'u.s.',
           'u.s.d.c.', 'uitdr.', 'uitl.w.', 'uitv.besch.div.b.',
           'uitv.besl.', 'uitv.besl.', 'succ.w.', 'uitv.besl.bel.rv.',
           'uitv.besl.l.b.', 'uitv.reg.', 'inv.w.', 'uitv.reg.bel.d.',
           'uitv.reg.afd.verm.', 'uitv.reg.lb.', 'uitv.reg.succ.w.',
           'univ.', 'univ.verkl.', 'v.', 'v.', 'chr.', 'v.a.',
           'v.a.v.', 'v.c.', 'v.chr.', 'v.h.', 'v.huw.verm.', 'v.i.',
           'v.i.o.', 'v.k.a.', 'v.m.', 'v.o.f.', 'v.o.n.',
           'v.onderh.verpl.', 'v.p.', 'v.r.', 'v.s.o.', 'v.t.t.',
           'v.t.t.t.', 'v.tk.t.', 'v.toep.r.vert.', 'v.v.b.',
           'v.v.g.', 'v.v.t.', 'v.v.t.t.', 'v.v.tk.t.', 'v.w.b.',
           'v.z.m.', 'vb.', 'vb.bo.', 'vbb.', 'vc.', 'vd.', 'veldw.',
           'ver.k.', 'ver.verg.gem.', 'gem.comm.', 'verbr.', 'verd.',
           'verdr.', 'verdr.v.', 'tek.mod.', 'verenw.', 'verg.',
           'verg.fr.gem.', 'comm.', 'verkl.', 'verkl.herz.gw.',
           'verl.', 'deelw.', 'vern.', 'verord.', 'vers.r.',
           'versch.', 'versl.c.s.w.', 'versl.csw.', 'vert.', 'verw.',
           'verz.', 'verz.w.', 'verz.wett.besl.',
           'verz.wett.decr.besl.', 'vgl.', 'vid.', 'viss.w.',
           'vl.parl.', 'vl.r.', 'vl.t.gez.', 'vl.w.reg.',
           'vl.w.succ.', 'vlg.', 'vn.', 'vnl.', 'vnw.', 'vo.',
           'vo.bl.', 'voegw.', 'vol.', 'volg.', 'volt.', 'deelw.',
           'voorl.', 'voorz.', 'vord.w.', 'vorst.d.', 'vr.', 'vred.',
           'vrg.', 'vnw.', 'vrijgrs.', 'vs.', 'vt.', 'vw.', 'vz.',
           'vzngr.', 'vzr.', 'w.', 'w.a.', 'w.b.r.', 'w.c.h.',
           'w.conf.huw.', 'w.conf.huwelijksb.', 'w.consum.kr.',
           'w.f.r.', 'w.g.', 'w.gew.r.', 'w.ident.pl.', 'w.just.doc.',
           'w.kh.', 'w.l.r.', 'w.l.v.', 'w.mil.straf.spr.', 'w.n.',
           'w.not.ambt.', 'w.o.', 'w.o.d.huurcomm.', 'w.o.d.k.',
           'w.openb.manif.', 'w.parl.', 'w.r.', 'w.reg.', 'w.succ.',
           'w.u.b.', 'w.uitv.pl.verord.', 'w.v.', 'w.v.k.',
           'w.v.m.s.', 'w.v.r.', 'w.v.w.', 'w.venn.', 'wac.', 'wd.',
           'wetb.', 'n.v.h.', 'wgb.', 'winkelt.w.', 'wisk.',
           'wka-verkl.', 'wnd.', 'won.w.', 'woningw.', 'woonr.w.',
           'wrr.', 'wrr.ber.', 'wrsch.', 'ws.', 'wsch.', 'wsr.',
           'wtvb.', 'ww.', 'x.d.', 'z.a.', 'z.g.', 'z.i.', 'z.j.',
           'z.o.z.', 'z.p.', 'z.s.m.', 'zg.', 'zgn.', 'zn.', 'znw.',
           'zr.', 'zr.', 'ms.', 'zr.ms.']


_exc = {}
for orth in abbrevs:
    _exc[orth] = [{ORTH: orth}]
    uppered = orth.upper()
    capsed = orth.capitalize()
    for i in [uppered, capsed]:
        _exc[i] = [{ORTH: i}]


TOKENIZER_EXCEPTIONS = _exc
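To see the exceptions in action, a small sketch (assuming a spaCy installation that ships this nl data; the sample sentence is made up):

import spacy

nlp = spacy.blank("nl")
doc = nlp("Stuur het verslag a.u.b. vóór vrijdag naar de afd. personeel.")
print([t.text for t in doc])
# 'a.u.b.' and 'afd.' are matched by the exceptions above and survive as
# single tokens instead of being split on their internal periods.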
|
@ -5,6 +5,320 @@ from ...symbols import ORTH, LEMMA
|
||||||
|
|
||||||
|
|
||||||
_exc = {
|
_exc = {
|
||||||
|
#หน่วยงานรัฐ / government agency
|
||||||
|
"กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}],
|
||||||
|
"กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}],
|
||||||
|
"กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}],
|
||||||
|
"กบข.": [{ORTH: "กบข.", LEMMA: "กองทุนบำเหน็จบำนาญข้าราชการพลเรือน"}],
|
||||||
|
"กบว.": [{ORTH: "กบว.", LEMMA: "คณะกรรมการบริหารวิทยุกระจายเสียงและวิทยุโทรทัศน์"}],
|
||||||
|
"กปน.": [{ORTH: "กปน.", LEMMA: "การประปานครหลวง"}],
|
||||||
|
"กปภ.": [{ORTH: "กปภ.", LEMMA: "การประปาส่วนภูมิภาค"}],
|
||||||
|
"กปส.": [{ORTH: "กปส.", LEMMA: "กรมประชาสัมพันธ์"}],
|
||||||
|
"กผม.": [{ORTH: "กผม.", LEMMA: "กองผังเมือง"}],
|
||||||
|
"กฟน.": [{ORTH: "กฟน.", LEMMA: "การไฟฟ้านครหลวง"}],
|
||||||
|
"กฟผ.": [{ORTH: "กฟผ.", LEMMA: "การไฟฟ้าฝ่ายผลิตแห่งประเทศไทย"}],
|
||||||
|
"กฟภ.": [{ORTH: "กฟภ.", LEMMA: "การไฟฟ้าส่วนภูมิภาค"}],
|
||||||
|
"ก.ช.น.": [{ORTH: "ก.ช.น.", LEMMA: "คณะกรรมการช่วยเหลือชาวนาชาวไร่"}],
|
||||||
|
"กยศ.": [{ORTH: "กยศ.", LEMMA: "กองทุนเงินให้กู้ยืมเพื่อการศึกษา"}],
|
||||||
|
"ก.ล.ต.": [{ORTH: "ก.ล.ต.", LEMMA: "คณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์"}],
|
||||||
|
"กศ.บ.": [{ORTH: "กศ.บ.", LEMMA: "การศึกษาบัณฑิต"}],
|
||||||
|
"กศน.": [{ORTH: "กศน.", LEMMA: "กรมการศึกษานอกโรงเรียน"}],
|
||||||
|
"กสท.": [{ORTH: "กสท.", LEMMA: "การสื่อสารแห่งประเทศไทย"}],
|
||||||
|
"กอ.รมน.": [{ORTH: "กอ.รมน.", LEMMA: "กองอำนวยการรักษาความมั่นคงภายใน"}],
|
||||||
|
"กร.": [{ORTH: "กร.", LEMMA: "กองเรือยุทธการ"}],
|
||||||
|
"ขสมก.": [{ORTH: "ขสมก.", LEMMA: "องค์การขนส่งมวลชนกรุงเทพ"}],
|
||||||
|
"คตง.": [{ORTH: "คตง.", LEMMA: "คณะกรรมการตรวจเงินแผ่นดิน"}],
|
||||||
|
"ครม.": [{ORTH: "ครม.", LEMMA: "คณะรัฐมนตรี"}],
|
||||||
|
"คมช.": [{ORTH: "คมช.", LEMMA: "คณะมนตรีความมั่นคงแห่งชาติ"}],
|
||||||
|
"ตชด.": [{ORTH: "ตชด.", LEMMA: "ตำรวจตะเวนชายเดน"}],
|
||||||
|
"ตม.": [{ORTH: "ตม.", LEMMA: "กองตรวจคนเข้าเมือง"}],
|
||||||
|
"ตร.": [{ORTH: "ตร.", LEMMA: "ตำรวจ"}],
|
||||||
|
"ททท.": [{ORTH: "ททท.", LEMMA: "การท่องเที่ยวแห่งประเทศไทย"}],
|
||||||
|
"ททบ.": [{ORTH: "ททบ.", LEMMA: "สถานีวิทยุโทรทัศน์กองทัพบก"}],
|
||||||
|
"ทบ.": [{ORTH: "ทบ.", LEMMA: "กองทัพบก"}],
|
||||||
|
"ทร.": [{ORTH: "ทร.", LEMMA: "กองทัพเรือ"}],
|
||||||
|
"ทอ.": [{ORTH: "ทอ.", LEMMA: "กองทัพอากาศ"}],
|
||||||
|
"ทอท.": [{ORTH: "ทอท.", LEMMA: "การท่าอากาศยานแห่งประเทศไทย"}],
|
||||||
|
"ธ.ก.ส.": [{ORTH: "ธ.ก.ส.", LEMMA: "ธนาคารเพื่อการเกษตรและสหกรณ์การเกษตร"}],
|
||||||
|
"ธปท.": [{ORTH: "ธปท.", LEMMA: "ธนาคารแห่งประเทศไทย"}],
|
||||||
|
"ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}],
|
||||||
|
"นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}],
|
||||||
|
"ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}],
|
||||||
|
"ป.ป.ช.": [{ORTH: "ป.ป.ช.", LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ"}],
|
||||||
|
"ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}],
|
||||||
|
"บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}],
|
||||||
|
"บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}],
|
||||||
|
"พสวท.": [{ORTH: "พสวท.", LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี"}],
|
||||||
|
"มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}],
|
||||||
|
"ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}],
|
||||||
|
"รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}],
|
||||||
|
"รฟท.": [{ORTH: "รฟท.", LEMMA: "การรถไฟแห่งประเทศไทย"}],
|
||||||
|
"รฟม.": [{ORTH: "รฟม.", LEMMA: "การรถไฟฟ้าขนส่งมวลชนแห่งประเทศไทย"}],
|
||||||
|
"ศธ.": [{ORTH: "ศธ.", LEMMA: "กระทรวงศึกษาธิการ"}],
|
||||||
|
"ศนธ.": [{ORTH: "ศนธ.", LEMMA: "ศูนย์กลางนิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||||
|
"สกจ.": [{ORTH: "สกจ.", LEMMA: "สหกรณ์จังหวัด"}],
|
||||||
|
"สกท.": [{ORTH: "สกท.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมการลงทุน"}],
|
||||||
|
"สกว.": [{ORTH: "สกว.", LEMMA: "สำนักงานกองทุนสนับสนุนการวิจัย"}],
|
||||||
|
"สคบ.": [{ORTH: "สคบ.", LEMMA: "สำนักงานคณะกรรมการคุ้มครองผู้บริโภค"}],
|
||||||
|
"สจร.": [{ORTH: "สจร.", LEMMA: "สำนักงานคณะกรรมการจัดระบบการจราจรทางบก"}],
|
||||||
|
"สตง.": [{ORTH: "สตง.", LEMMA: "สำนักงานตรวจเงินแผ่นดิน"}],
|
||||||
|
"สทท.": [{ORTH: "สทท.", LEMMA: "สถานีวิทยุโทรทัศน์แห่งประเทศไทย"}],
|
||||||
|
"สทร.": [{ORTH: "สทร.", LEMMA: "สำนักงานกลางทะเบียนราษฎร์"}],
|
||||||
|
"สธ": [{ORTH: "สธ", LEMMA: "กระทรวงสาธารณสุข"}],
|
||||||
|
"สนช.": [{ORTH: "สนช.", LEMMA: "สภานิติบัญญัติแห่งชาติ,สำนักงานนวัตกรรมแห่งชาติ"}],
|
||||||
|
"สนนท.": [{ORTH: "สนนท.", LEMMA: "สหพันธ์นิสิตนักศึกษาแห่งประเทศไทย"}],
|
||||||
|
"สปก.": [{ORTH: "สปก.", LEMMA: "สำนักงานการปฏิรูปที่ดินเพื่อเกษตรกรรม"}],
|
||||||
|
"สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}],
|
||||||
|
"สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}],
|
||||||
|
"สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}],
|
||||||
|
"สยช.": [{ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"}],
|
||||||
|
"สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}],
|
||||||
|
"สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}],
|
||||||
|
"สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}],
|
||||||
|
"สคช.": [{ORTH: "สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"}],
|
||||||
|
"สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}],
|
||||||
|
"สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}],
|
||||||
|
"สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}],
|
||||||
|
"อตก.": [{ORTH: "อตก.", LEMMA: "องค์การตลาดเพื่อเกษตรกร"}],
|
||||||
|
"อบจ.": [{ORTH: "อบจ.", LEMMA: "องค์การบริหารส่วนจังหวัด"}],
|
||||||
|
"อบต.": [{ORTH: "อบต.", LEMMA: "องค์การบริหารส่วนตำบล"}],
|
||||||
|
"อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}],
|
||||||
|
"อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}],
|
||||||
|
"อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}],
|
||||||
|
#มหาวิทยาลัย / สถานศึกษา / university / college
|
||||||
|
"มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}],
|
||||||
|
"มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}],
|
||||||
|
"ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}],
|
||||||
|
"มทร.": [{ORTH: "มทร.", LEMMA: "มหาวิทยาลัยเทคโนโลยีราชมงคล"}],
|
||||||
|
"มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}],
|
||||||
|
"วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}],
|
||||||
|
"สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}],
|
||||||
|
#ยศ / rank
|
||||||
|
"ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}],
|
||||||
|
"ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}],
|
||||||
|
"จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}],
|
||||||
|
"จ.ท.": [{ORTH: "จ.ท.", LEMMA: "จ่าโท"}],
|
||||||
|
"จ.ส.ต.": [{ORTH: "จ.ส.ต.", LEMMA: "จ่าสิบตรี (ทหารบก)"}],
|
||||||
|
"จสต.": [{ORTH: "จสต.", LEMMA: "จ่าสิบตำรวจ"}],
|
||||||
|
"จ.ส.ท.": [{ORTH: "จ.ส.ท.", LEMMA: "จ่าสิบโท"}],
|
||||||
|
"จ.ส.อ.": [{ORTH: "จ.ส.อ.", LEMMA: "จ่าสิบเอก"}],
|
||||||
|
"จ.อ.": [{ORTH: "จ.อ.", LEMMA: "จ่าเอก"}],
|
||||||
|
"ทพญ.": [{ORTH: "ทพญ.", LEMMA: "ทันตแพทย์หญิง"}],
|
||||||
|
"ทนพ.": [{ORTH: "ทนพ.", LEMMA: "เทคนิคการแพทย์"}],
|
||||||
|
"นจอ.": [{ORTH: "นจอ.", LEMMA: "นักเรียนจ่าอากาศ"}],
|
||||||
|
"น.ช.": [{ORTH: "น.ช.", LEMMA: "นักโทษชาย"}],
|
||||||
|
"น.ญ.": [{ORTH: "น.ญ.", LEMMA: "นักโทษหญิง"}],
|
||||||
|
"น.ต.": [{ORTH: "น.ต.", LEMMA: "นาวาตรี"}],
|
||||||
|
"น.ท.": [{ORTH: "น.ท.", LEMMA: "นาวาโท"}],
|
||||||
|
"นตท.": [{ORTH: "นตท.", LEMMA: "นักเรียนเตรียมทหาร"}],
|
||||||
|
"นนส.": [{ORTH: "นนส.", LEMMA: "นักเรียนนายสิบทหารบก"}],
|
||||||
|
"นนร.": [{ORTH: "นนร.", LEMMA: "นักเรียนนายร้อย"}],
|
||||||
|
"นนอ.": [{ORTH: "นนอ.", LEMMA: "นักเรียนนายเรืออากาศ"}],
|
||||||
|
"นพ.": [{ORTH: "นพ.", LEMMA: "นายแพทย์"}],
|
||||||
|
"นพท.": [{ORTH: "นพท.", LEMMA: "นายแพทย์ทหาร"}],
|
||||||
|
"นรจ.": [{ORTH: "นรจ.", LEMMA: "นักเรียนจ่าทหารเรือ"}],
|
||||||
|
"นรต.": [{ORTH: "นรต.", LEMMA: "นักเรียนนายร้อยตำรวจ"}],
|
||||||
|
"นศพ.": [{ORTH: "นศพ.", LEMMA: "นักศึกษาแพทย์"}],
|
||||||
|
"นศท.": [{ORTH: "นศท.", LEMMA: "นักศึกษาวิชาทหาร"}],
|
||||||
|
"น.สพ.": [{ORTH: "น.สพ.", LEMMA: "นายสัตวแพทย์ (พ.ร.บ.วิชาชีพการสัตวแพทย์)"}],
|
||||||
|
"น.อ.": [{ORTH: "น.อ.", LEMMA: "นาวาเอก"}],
|
||||||
|
"บช.ก.": [{ORTH: "บช.ก.", LEMMA: "กองบัญชาการตำรวจสอบสวนกลาง"}],
|
||||||
|
"บช.น.": [{ORTH: "บช.น.", LEMMA: "กองบัญชาการตำรวจนครบาล"}],
|
||||||
|
"ผกก.": [{ORTH: "ผกก.", LEMMA: "ผู้กำกับการ"}],
|
||||||
|
"ผกก.ภ.": [{ORTH: "ผกก.ภ.", LEMMA: "ผู้กำกับการตำรวจภูธร"}],
|
||||||
|
"ผจก.": [{ORTH: "ผจก.", LEMMA: "ผู้จัดการ"}],
|
||||||
|
"ผช.": [{ORTH: "ผช.", LEMMA: "ผู้ช่วย"}],
|
||||||
|
"ผชก.": [{ORTH: "ผชก.", LEMMA: "ผู้ชำนาญการ"}],
|
||||||
|
"ผช.ผอ.": [{ORTH: "ผช.ผอ.", LEMMA: "ผู้ช่วยผู้อำนวยการ"}],
|
||||||
|
"ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}],
|
||||||
|
"ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}],
|
||||||
|
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}],
|
||||||
|
"ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับการ (ตำรวจ)"}],
|
||||||
|
"ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}],
|
||||||
|
"ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}],
|
||||||
|
"ผบก.ปค.": [{ORTH: "ผบก.ปค.", LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)"}],
|
||||||
|
"ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}],
|
||||||
|
"ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}],
|
||||||
|
"ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}],
|
||||||
|
"ผบช.ก.": [{ORTH: "ผบช.ก.", LEMMA: "ผู้บัญชาการตำรวจสอบสวนกลาง"}],
|
||||||
|
"ผบช.ตชด.": [{ORTH: "ผบช.ตชด.", LEMMA: "ผู้บัญชาการตำรวจตระเวนชายแดน"}],
|
||||||
|
"ผบช.น.": [{ORTH: "ผบช.น.", LEMMA: "ผู้บัญชาการตำรวจนครบาล"}],
|
||||||
|
"ผบช.ภ.": [{ORTH: "ผบช.ภ.", LEMMA: "ผู้บัญชาการตำรวจภูธร"}],
|
||||||
|
"ผบ.ทบ.": [{ORTH: "ผบ.ทบ.", LEMMA: "ผู้บัญชาการทหารบก"}],
|
||||||
|
"ผบ.ตร.": [{ORTH: "ผบ.ตร.", LEMMA: "ผู้บัญชาการตำรวจแห่งชาติ"}],
|
||||||
|
"ผบ.ทร.": [{ORTH: "ผบ.ทร.", LEMMA: "ผู้บัญชาการทหารเรือ"}],
|
||||||
|
"ผบ.ทอ.": [{ORTH: "ผบ.ทอ.", LEMMA: "ผู้บัญชาการทหารอากาศ"}],
|
||||||
|
"ผบ.ทสส.": [{ORTH: "ผบ.ทสส.", LEMMA: "ผู้บัญชาการทหารสูงสุด"}],
|
||||||
|
"ผวจ.": [{ORTH: "ผวจ.", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||||
|
"ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ", LEMMA: "ผู้ว่าราชการจังหวัด"}],
|
||||||
|
"พ.จ.ต.": [{ORTH: "พ.จ.ต.", LEMMA: "พันจ่าตรี"}],
|
||||||
|
"พ.จ.ท.": [{ORTH: "พ.จ.ท.", LEMMA: "พันจ่าโท"}],
|
||||||
|
"พ.จ.อ.": [{ORTH: "พ.จ.อ.", LEMMA: "พันจ่าเอก"}],
|
||||||
|
"พญ.": [{ORTH: "พญ.", LEMMA: "แพทย์หญิง"}],
|
||||||
|
"ฯพณฯ": [{ORTH: "ฯพณฯ", LEMMA: "พณท่าน"}],
|
||||||
|
"พ.ต.": [{ORTH: "พ.ต.", LEMMA: "พันตรี"}],
|
||||||
|
"พ.ท.": [{ORTH: "พ.ท.", LEMMA: "พันโท"}],
|
||||||
|
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||||
|
"พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ", LEMMA: "พันตำรวจเอกพิเศษ"}],
|
||||||
|
"พลฯ": [{ORTH: "พลฯ", LEMMA: "พลทหาร"}],
|
||||||
|
"พล.๑ รอ.": [{ORTH: "พล.๑ รอ.", LEMMA: "กองพลที่ ๑ รักษาพระองค์ กองทัพบก"}],
|
||||||
|
"พล.ต.": [{ORTH: "พล.ต.", LEMMA: "พลตรี"}],
|
||||||
|
"พล.ต.ต.": [{ORTH: "พล.ต.ต.", LEMMA: "พลตำรวจตรี"}],
|
||||||
|
"พล.ต.ท.": [{ORTH: "พล.ต.ท.", LEMMA: "พลตำรวจโท"}],
|
||||||
|
"พล.ต.อ.": [{ORTH: "พล.ต.อ.", LEMMA: "พลตำรวจเอก"}],
|
||||||
|
"พล.ท.": [{ORTH: "พล.ท.", LEMMA: "พลโท"}],
|
||||||
|
"พล.ปตอ.": [{ORTH: "พล.ปตอ.", LEMMA: "กองพลทหารปืนใหญ่ต่อสู่อากาศยาน"}],
|
||||||
|
"พล.ม.": [{ORTH: "พล.ม.", LEMMA: "กองพลทหารม้า"}],
|
||||||
|
"พล.ม.๒": [{ORTH: "พล.ม.๒", LEMMA: "กองพลทหารม้าที่ ๒"}],
|
||||||
|
"พล.ร.ต.": [{ORTH: "พล.ร.ต.", LEMMA: "พลเรือตรี"}],
|
||||||
|
"พล.ร.ท.": [{ORTH: "พล.ร.ท.", LEMMA: "พลเรือโท"}],
|
||||||
|
"พล.ร.อ.": [{ORTH: "พล.ร.อ.", LEMMA: "พลเรือเอก"}],
|
||||||
|
"พล.อ.": [{ORTH: "พล.อ.", LEMMA: "พลเอก"}],
|
||||||
|
"พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}],
|
||||||
|
"พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}],
|
||||||
|
"พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}],
|
||||||
|
"พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}],
|
||||||
|
"พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}],
|
||||||
|
"พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}],
|
||||||
|
"พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}],
|
||||||
|
"พ.อ.อ.": [{ORTH: "พ.อ.อ.", LEMMA: "พันจ่าอากาศเอก"}],
|
||||||
|
"ภกญ.": [{ORTH: "ภกญ.", LEMMA: "เภสัชกรหญิง"}],
|
||||||
|
"ม.จ.": [{ORTH: "ม.จ.", LEMMA: "หม่อมเจ้า"}],
|
||||||
|
"มท1": [{ORTH: "มท1", LEMMA: "รัฐมนตรีว่าการกระทรวงมหาดไทย"}],
|
||||||
|
"ม.ร.ว.": [{ORTH: "ม.ร.ว.", LEMMA: "หม่อมราชวงศ์"}],
|
||||||
|
"มล.": [{ORTH: "มล.", LEMMA: "หม่อมหลวง"}],
|
||||||
|
"ร.ต.": [{ORTH: "ร.ต.", LEMMA: "ร้อยตรี,เรือตรี,เรืออากาศตรี"}],
|
||||||
|
"ร.ต.ต.": [{ORTH: "ร.ต.ต.", LEMMA: "ร้อยตำรวจตรี"}],
|
||||||
|
"ร.ต.ท.": [{ORTH: "ร.ต.ท.", LEMMA: "ร้อยตำรวจโท"}],
|
||||||
|
"ร.ต.อ.": [{ORTH: "ร.ต.อ.", LEMMA: "ร้อยตำรวจเอก"}],
|
||||||
|
"ร.ท.": [{ORTH: "ร.ท.", LEMMA: "ร้อยโท,เรือโท,เรืออากาศโท"}],
|
||||||
|
"รมช.": [{ORTH: "รมช.", LEMMA: "รัฐมนตรีช่วยว่าการกระทรวง"}],
|
||||||
|
"รมต.": [{ORTH: "รมต.", LEMMA: "รัฐมนตรี"}],
|
||||||
|
"รมว.": [{ORTH: "รมว.", LEMMA: "รัฐมนตรีว่าการกระทรวง"}],
|
||||||
|
"รศ.": [{ORTH: "รศ.", LEMMA: "รองศาสตราจารย์"}],
|
||||||
|
"ร.อ.": [{ORTH: "ร.อ.", LEMMA: "ร้อยเอก,เรือเอก,เรืออากาศเอก"}],
|
||||||
|
"ศ.": [{ORTH: "ศ.", LEMMA: "ศาสตราจารย์"}],
|
||||||
|
"ส.ต.": [{ORTH: "ส.ต.", LEMMA: "สิบตรี"}],
|
||||||
|
"ส.ต.ต.": [{ORTH: "ส.ต.ต.", LEMMA: "สิบตำรวจตรี"}],
|
||||||
|
"ส.ต.ท.": [{ORTH: "ส.ต.ท.", LEMMA: "สิบตำรวจโท"}],
|
||||||
|
"ส.ต.อ.": [{ORTH: "ส.ต.อ.", LEMMA: "สิบตำรวจเอก"}],
|
||||||
|
"ส.ท.": [{ORTH: "ส.ท.", LEMMA: "สิบโท"}],
|
||||||
|
"สพ.": [{ORTH: "สพ.", LEMMA: "สัตวแพทย์"}],
|
||||||
|
"สพ.ญ.": [{ORTH: "สพ.ญ.", LEMMA: "สัตวแพทย์หญิง"}],
|
||||||
|
"สพ.ช.": [{ORTH: "สพ.ช.", LEMMA: "สัตวแพทย์ชาย"}],
|
||||||
|
"ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}],
|
||||||
|
"อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}],
|
||||||
|
"อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}],
|
||||||
|
#วุฒิ / bachelor degree
|
||||||
|
"ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}],
|
||||||
|
"ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}],
|
||||||
|
"ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}],
|
||||||
|
"ปวช.": [{ORTH: "ปวช.", LEMMA: "ประกาศนียบัตรวิชาชีพ"}],
|
||||||
|
"ปวท.": [{ORTH: "ปวท.", LEMMA: "ประกาศนียบัตรวิชาชีพเทคนิค"}],
|
||||||
|
"ปวส.": [{ORTH: "ปวส.", LEMMA: "ประกาศนียบัตรวิชาชีพชั้นสูง"}],
|
||||||
|
"ปทส.": [{ORTH: "ปทส.", LEMMA: "ประกาศนียบัตรครูเทคนิคชั้นสูง"}],
|
||||||
|
"กษ.บ.": [{ORTH: "กษ.บ.", LEMMA: "เกษตรศาสตรบัณฑิต"}],
|
||||||
|
"กษ.ม.": [{ORTH: "กษ.ม.", LEMMA: "เกษตรศาสตรมหาบัณฑิต"}],
|
||||||
|
"กษ.ด.": [{ORTH: "กษ.ด.", LEMMA: "เกษตรศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ค.บ.": [{ORTH: "ค.บ.", LEMMA: "ครุศาสตรบัณฑิต"}],
|
||||||
|
"คศ.บ.": [{ORTH: "คศ.บ.", LEMMA: "คหกรรมศาสตรบัณฑิต"}],
|
||||||
|
"คศ.ม.": [{ORTH: "คศ.ม.", LEMMA: "คหกรรมศาสตรมหาบัณฑิต"}],
|
||||||
|
"คศ.ด.": [{ORTH: "คศ.ด.", LEMMA: "คหกรรมศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ค.อ.บ.": [{ORTH: "ค.อ.บ.", LEMMA: "ครุศาสตรอุตสาหกรรมบัณฑิต"}],
|
||||||
|
"ค.อ.ม.": [{ORTH: "ค.อ.ม.", LEMMA: "ครุศาสตรอุตสาหกรรมมหาบัณฑิต"}],
|
||||||
|
"ค.อ.ด.": [{ORTH: "ค.อ.ด.", LEMMA: "ครุศาสตรอุตสาหกรรมดุษฎีบัณฑิต"}],
|
||||||
|
"ทก.บ.": [{ORTH: "ทก.บ.", LEMMA: "เทคโนโลยีการเกษตรบัณฑิต"}],
|
||||||
|
"ทก.ม.": [{ORTH: "ทก.ม.", LEMMA: "เทคโนโลยีการเกษตรมหาบัณฑิต"}],
|
||||||
|
"ทก.ด.": [{ORTH: "ทก.ด.", LEMMA: "เทคโนโลยีการเกษตรดุษฎีบัณฑิต"}],
|
||||||
|
"ท.บ.": [{ORTH: "ท.บ.", LEMMA: "ทันตแพทยศาสตรบัณฑิต"}],
|
||||||
|
"ท.ม.": [{ORTH: "ท.ม.", LEMMA: "ทันตแพทยศาสตรมหาบัณฑิต"}],
|
||||||
|
"ท.ด.": [{ORTH: "ท.ด.", LEMMA: "ทันตแพทยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"น.บ.": [{ORTH: "น.บ.", LEMMA: "นิติศาสตรบัณฑิต"}],
|
||||||
|
"น.ม.": [{ORTH: "น.ม.", LEMMA: "นิติศาสตรมหาบัณฑิต"}],
|
||||||
|
"น.ด.": [{ORTH: "น.ด.", LEMMA: "นิติศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"นศ.บ.": [{ORTH: "นศ.บ.", LEMMA: "นิเทศศาสตรบัณฑิต"}],
|
||||||
|
"นศ.ม.": [{ORTH: "นศ.ม.", LEMMA: "นิเทศศาสตรมหาบัณฑิต"}],
|
||||||
|
"นศ.ด.": [{ORTH: "นศ.ด.", LEMMA: "นิเทศศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"บช.บ.": [{ORTH: "บช.บ.", LEMMA: "บัญชีบัณฑิต"}],
|
||||||
|
"บช.ม.": [{ORTH: "บช.ม.", LEMMA: "บัญชีมหาบัณฑิต"}],
|
||||||
|
"บช.ด.": [{ORTH: "บช.ด.", LEMMA: "บัญชีดุษฎีบัณฑิต"}],
|
||||||
|
"บธ.บ.": [{ORTH: "บธ.บ.", LEMMA: "บริหารธุรกิจบัณฑิต"}],
|
||||||
|
"บธ.ม.": [{ORTH: "บธ.ม.", LEMMA: "บริหารธุรกิจมหาบัณฑิต"}],
|
||||||
|
"บธ.ด.": [{ORTH: "บธ.ด.", LEMMA: "บริหารธุรกิจดุษฎีบัณฑิต"}],
|
||||||
|
"พณ.บ.": [{ORTH: "พณ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
|
||||||
|
"พณ.ม.": [{ORTH: "พณ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พณ.ด.": [{ORTH: "พณ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พ.บ.": [{ORTH: "พ.บ.", LEMMA: "แพทยศาสตรบัณฑิต"}],
|
||||||
|
"พ.ม.": [{ORTH: "พ.ม.", LEMMA: "แพทยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พ.ด.": [{ORTH: "พ.ด.", LEMMA: "แพทยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พธ.บ.": [{ORTH: "พธ.บ.", LEMMA: "พุทธศาสตรบัณฑิต"}],
|
||||||
|
"พธ.ม.": [{ORTH: "พธ.ม.", LEMMA: "พุทธศาสตรมหาบัณฑิต"}],
|
||||||
|
"พธ.ด.": [{ORTH: "พธ.ด.", LEMMA: "พุทธศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พบ.บ.": [{ORTH: "พบ.บ.", LEMMA: "พัฒนบริหารศาสตรบัณฑิต"}],
|
||||||
|
"พบ.ม.": [{ORTH: "พบ.ม.", LEMMA: "พัฒนบริหารศาสตรมหาบัณฑิต"}],
|
||||||
|
"พบ.ด.": [{ORTH: "พบ.ด.", LEMMA: "พัฒนบริหารศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พย.บ.": [{ORTH: "พย.บ.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พย.ม.": [{ORTH: "พย.ม.", LEMMA: "พยาบาลศาสตรมหาบัณฑิต"}],
|
||||||
|
"พย.ด.": [{ORTH: "พย.ด.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"พศ.บ.": [{ORTH: "พศ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}],
|
||||||
|
"พศ.ม.": [{ORTH: "พศ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}],
|
||||||
|
"พศ.ด.": [{ORTH: "พศ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ภ.บ.": [{ORTH: "ภ.บ.", LEMMA: "เภสัชศาสตรบัณฑิต"}],
|
||||||
|
"ภ.ม.": [{ORTH: "ภ.ม.", LEMMA: "เภสัชศาสตรมหาบัณฑิต"}],
|
||||||
|
"ภ.ด.": [{ORTH: "ภ.ด.", LEMMA: "เภสัชศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ.", LEMMA: "ภูมิสถาปัตยกรรมศาสตรบัณฑิต"}],
|
||||||
|
"รป.บ.": [{ORTH: "รป.บ.", LEMMA: "รัฐประศาสนศาสตร์บัณฑิต"}],
|
||||||
|
"รป.ม.": [{ORTH: "รป.ม.", LEMMA: "รัฐประศาสนศาสตร์มหาบัณฑิต"}],
|
||||||
|
"วท.บ.": [{ORTH: "วท.บ.", LEMMA: "วิทยาศาสตรบัณฑิต"}],
|
||||||
|
"วท.ม.": [{ORTH: "วท.ม.", LEMMA: "วิทยาศาสตรมหาบัณฑิต"}],
|
||||||
|
"วท.ด.": [{ORTH: "วท.ด.", LEMMA: "วิทยาศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"ศ.บ.": [{ORTH: "ศ.บ.", LEMMA: "ศิลปบัณฑิต"}],
|
||||||
|
"ศศ.บ.": [{ORTH: "ศศ.บ.", LEMMA: "ศิลปศาสตรบัณฑิต"}],
|
||||||
|
"ศษ.บ.": [{ORTH: "ศษ.บ.", LEMMA: "ศึกษาศาสตรบัณฑิต"}],
|
||||||
|
"ศส.บ.": [{ORTH: "ศส.บ.", LEMMA: "เศรษฐศาสตรบัณฑิต"}],
|
||||||
|
"สถ.บ.": [{ORTH: "สถ.บ.", LEMMA: "สถาปัตยกรรมศาสตรบัณฑิต"}],
|
||||||
|
"สถ.ม.": [{ORTH: "สถ.ม.", LEMMA: "สถาปัตยกรรมศาสตรมหาบัณฑิต"}],
|
||||||
|
"สถ.ด.": [{ORTH: "สถ.ด.", LEMMA: "สถาปัตยกรรมศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
"สพ.บ.": [{ORTH: "สพ.บ.", LEMMA: "สัตวแพทยศาสตรบัณฑิต"}],
|
||||||
|
"อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}],
|
||||||
|
"อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}],
|
||||||
|
"อ.ด.": [{ORTH: "อ.ด.", LEMMA: "อักษรศาสตรดุษฎีบัณฑิต"}],
|
||||||
|
#ปี / เวลา / year / time
|
||||||
|
"ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}],
|
||||||
|
"จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}],
|
||||||
|
"ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}],
|
||||||
|
"ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}],
|
||||||
|
"ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}],
|
||||||
|
#ระยะทาง / distance
|
||||||
|
"ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}],
|
||||||
|
"ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}],
|
||||||
|
"ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}],
|
||||||
|
"มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}],
|
||||||
|
"ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}],
|
||||||
|
"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}],
|
||||||
|
#น้ำหนัก / weight
|
||||||
|
"น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}],
|
||||||
|
"ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}],
|
||||||
|
"ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}],
|
||||||
|
"ดก.": [{ORTH: "ดก.", LEMMA: "เดซิกรัม"}],
|
||||||
|
"ซก.": [{ORTH: "ซก.", LEMMA: "เซนติกรัม"}],
|
||||||
|
"มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}],
|
||||||
|
"ก.": [{ORTH: "ก.", LEMMA: "กรัม"}],
|
||||||
|
"กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}],
|
||||||
|
#ปริมาตร / volume
|
||||||
|
"ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}],
|
||||||
|
"ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}],
|
||||||
|
"ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}],
|
||||||
|
"ซล.": [{ORTH: "ซล.", LEMMA: "เซนติลิตร"}],
|
||||||
|
"ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}],
|
||||||
|
"กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}],
|
||||||
|
"ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}],
|
||||||
|
#พื้นที่ / area
|
||||||
|
"ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}],
|
||||||
|
"ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}],
|
||||||
|
"ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}],
|
||||||
|
"ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}],
|
||||||
|
#เดือน / month
|
||||||
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
|
"ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}],
|
||||||
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
|
"ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}],
|
||||||
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
|
"มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}],
|
||||||
|
@ -17,6 +331,114 @@ _exc = {
    "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}],
    "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}],
    "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}],
    # เพศ / gender
    "ช.": [{ORTH: "ช.", LEMMA: "ชาย"}],
    "ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}],
    "ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}],
    "ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}],
    # ที่อยู่ / address
    "ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}],
    "ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}],
    "อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}],
    "จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}],
    # สรรพนาม / pronoun
    "ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}],
    "ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}],
    "น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}],
    "โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}],
    # การเมือง / politics
    "ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}],
    "ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}],
    "นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}],
    "ปชป.": [{ORTH: "ปชป.", LEMMA: "พรรคประชาธิปัตย์"}],
    "ผกค.": [{ORTH: "ผกค.", LEMMA: "ผู้ก่อการร้ายคอมมิวนิสต์"}],
    "พท.": [{ORTH: "พท.", LEMMA: "พรรคเพื่อไทย"}],
    "พ.ร.ก.": [{ORTH: "พ.ร.ก.", LEMMA: "พระราชกำหนด"}],
    "พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ.", LEMMA: "พระราชกฤษฎีกา"}],
    "พ.ร.บ.": [{ORTH: "พ.ร.บ.", LEMMA: "พระราชบัญญัติ"}],
    "รธน.": [{ORTH: "รธน.", LEMMA: "รัฐธรรมนูญ"}],
    "รบ.": [{ORTH: "รบ.", LEMMA: "รัฐบาล"}],
    "รสช.": [{ORTH: "รสช.", LEMMA: "คณะรักษาความสงบเรียบร้อยแห่งชาติ"}],
    "ส.ก.": [{ORTH: "ส.ก.", LEMMA: "สมาชิกสภากรุงเทพมหานคร"}],
    "สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}],
    "สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}],
    "ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}],
    # ทั่วไป / general
    "ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}],
    "กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}],
    "กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}],
    "ขรก.": [{ORTH: "ขรก.", LEMMA: "ข้าราชการ"}],
    "ขส.": [{ORTH: "ขส.", LEMMA: "ขนส่ง"}],
    "ค.ร.น.": [{ORTH: "ค.ร.น.", LEMMA: "คูณร่วมน้อย"}],
    "ค.ร.ม.": [{ORTH: "ค.ร.ม.", LEMMA: "คูณร่วมมาก"}],
    "ง.ด.": [{ORTH: "ง.ด.", LEMMA: "เงินเดือน"}],
    "งป.": [{ORTH: "งป.", LEMMA: "งบประมาณ"}],
    "จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}],
    "จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}],
    "จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}],
    "จ.ป.ร.": [{ORTH: "จ.ป.ร.", LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)"}],
    "จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}],
    "จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}],
    "จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}],
    "ตจว.": [{ORTH: "ตจว.", LEMMA: "ต่างจังหวัด"}],
    "โทร.": [{ORTH: "โทร.", LEMMA: "โทรศัพท์"}],
    "ธ.": [{ORTH: "ธ.", LEMMA: "ธนาคาร"}],
    "น.ร.": [{ORTH: "น.ร.", LEMMA: "นักเรียน"}],
    "น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}],
    "น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}],
    "น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}],
    "น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก.", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"}],
    "นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}],
    "บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}],
    "บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}],
    "บงล.": [{ORTH: "บงล.", LEMMA: "บริษัทเงินทุนและหลักทรัพย์จำกัด"}],
    "บบส.": [{ORTH: "บบส.", LEMMA: "บรรษัทบริหารสินทรัพย์สถาบันการเงิน"}],
    "บมจ.": [{ORTH: "บมจ.", LEMMA: "บริษัทมหาชนจำกัด"}],
    "บลจ.": [{ORTH: "บลจ.", LEMMA: "บริษัทหลักทรัพย์จัดการกองทุนรวมจำกัด"}],
    "บ/ช": [{ORTH: "บ/ช", LEMMA: "บัญชี"}],
    "บร.": [{ORTH: "บร.", LEMMA: "บรรณารักษ์"}],
    "ปชช.": [{ORTH: "ปชช.", LEMMA: "ประชาชน"}],
    "ปณ.": [{ORTH: "ปณ.", LEMMA: "ที่ทำการไปรษณีย์"}],
    "ปณก.": [{ORTH: "ปณก.", LEMMA: "ที่ทำการไปรษณีย์กลาง"}],
    "ปณส.": [{ORTH: "ปณส.", LEMMA: "ที่ทำการไปรษณีย์สาขา"}],
    "ปธ.": [{ORTH: "ปธ.", LEMMA: "ประธาน"}],
    "ปธน.": [{ORTH: "ปธน.", LEMMA: "ประธานาธิบดี"}],
    "ปอ.": [{ORTH: "ปอ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศ"}],
    "ปอ.พ.": [{ORTH: "ปอ.พ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศพิเศษ"}],
    "พ.ก.ง.": [{ORTH: "พ.ก.ง.", LEMMA: "พัสดุเก็บเงินปลายทาง"}],
    "พ.ก.ส.": [{ORTH: "พ.ก.ส.", LEMMA: "พนักงานเก็บค่าโดยสาร"}],
    "พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}],
    "ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}],
    "ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}],
    "ภ.ป.ร.": [{ORTH: "ภ.ป.ร.", LEMMA: "ภูมิพลอดุลยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)"}],
    "ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}],
    "ร.": [{ORTH: "ร.", LEMMA: "รัชกาล"}],
    "ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}],
    "ร.ด.": [{ORTH: "ร.ด.", LEMMA: "รักษาดินแดน"}],
    "รปภ.": [{ORTH: "รปภ.", LEMMA: "รักษาความปลอดภัย"}],
    "รพ.": [{ORTH: "รพ.", LEMMA: "โรงพยาบาล"}],
    "ร.พ.": [{ORTH: "ร.พ.", LEMMA: "โรงพิมพ์"}],
    "รร.": [{ORTH: "รร.", LEMMA: "โรงเรียน,โรงแรม"}],
    "รสก.": [{ORTH: "รสก.", LEMMA: "รัฐวิสาหกิจ"}],
    "ส.ค.ส.": [{ORTH: "ส.ค.ส.", LEMMA: "ส่งความสุขปีใหม่"}],
    "สต.": [{ORTH: "สต.", LEMMA: "สตางค์"}],
    "สน.": [{ORTH: "สน.", LEMMA: "สถานีตำรวจ"}],
    "สนข.": [{ORTH: "สนข.", LEMMA: "สำนักงานเขต"}],
    "สนง.": [{ORTH: "สนง.", LEMMA: "สำนักงาน"}],
    "สนญ.": [{ORTH: "สนญ.", LEMMA: "สำนักงานใหญ่"}],
    "ส.ป.ช.": [{ORTH: "ส.ป.ช.", LEMMA: "สร้างเสริมประสบการณ์ชีวิต"}],
    "สภ.": [{ORTH: "สภ.", LEMMA: "สถานีตำรวจภูธร"}],
    "ส.ล.น.": [{ORTH: "ส.ล.น.", LEMMA: "สร้างเสริมลักษณะนิสัย"}],
    "สวญ.": [{ORTH: "สวญ.", LEMMA: "สารวัตรใหญ่"}],
    "สวป.": [{ORTH: "สวป.", LEMMA: "สารวัตรป้องกันปราบปราม"}],
    "สว.สส.": [{ORTH: "สว.สส.", LEMMA: "สารวัตรสืบสวน"}],
    "ส.ห.": [{ORTH: "ส.ห.", LEMMA: "สารวัตรทหาร"}],
    "สอ.": [{ORTH: "สอ.", LEMMA: "สถานีอนามัย"}],
    "สอท.": [{ORTH: "สอท.", LEMMA: "สถานเอกอัครราชทูต"}],
    "เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}],
    "หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}],
    "ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}],
}
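
Each key in `_exc` must reproduce, character for character, the concatenation of the `ORTH` values in its token list, which is how spaCy validates tokenizer exceptions. A minimal sketch of how a dict like this is typically merged and exposed, assuming spaCy 2.x's standard helpers (the merge itself is not shown in this diff):

```python
# Sketch: merging a language-specific exceptions dict into the shared base
# exceptions, as most spaCy 2.x language modules do. The single entry is
# copied from the dict above.
from spacy.symbols import ORTH, LEMMA
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.util import update_exc

_exc = {"กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}]}  # key == joined ORTH values

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
```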
@@ -134,6 +134,11 @@ def nl_tokenizer():
    return get_lang_class("nl").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def nl_lemmatizer():
    return get_lang_class("nl").Defaults.create_lemmatizer()


@pytest.fixture(scope="session")
def pl_tokenizer():
    return get_lang_class("pl").Defaults.create_tokenizer()
143
spacy/tests/lang/nl/test_lemmatizer.py
Normal file
@@ -0,0 +1,143 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


# Calling the Lemmatizer directly
# Imitates behavior of:
# Tagger.set_annotations()
#     -> vocab.morphology.assign_tag_id()
#     -> Token.tag.__set__
#     -> vocab.morphology.assign_tag(...)
#     -> ... -> Morphology.assign_tag(...)
#     -> self.lemmatize(analysis.tag.pos, token.lex.orth, ...)

noun_irreg_lemmatization_cases = [
    ("volkeren", "volk"),
    ("vaatje", "vat"),
    ("verboden", "verbod"),
    ("ijsje", "ijsje"),
    ("slagen", "slag"),
    ("verdragen", "verdrag"),
    ("verloven", "verlof"),
    ("gebeden", "gebed"),
    ("gaten", "gat"),
    ("staven", "staf"),
    ("aquariums", "aquarium"),
    ("podia", "podium"),
    ("holen", "hol"),
    ("lammeren", "lam"),
    ("bevelen", "bevel"),
    ("wegen", "weg"),
    ("moeilijkheden", "moeilijkheid"),
    ("aanwezigheden", "aanwezigheid"),
    ("goden", "god"),
    ("loten", "lot"),
    ("kaarsen", "kaars"),
    ("leden", "lid"),
    ("glaasje", "glas"),
    ("eieren", "ei"),
    ("vatten", "vat"),
    ("kalveren", "kalf"),
    ("padden", "pad"),
    ("smeden", "smid"),
    ("genen", "gen"),
    ("beenderen", "been"),
]


verb_irreg_lemmatization_cases = [
    ("liep", "lopen"),
    ("hief", "heffen"),
    ("begon", "beginnen"),
    ("sla", "slaan"),
    ("aangekomen", "aankomen"),
    ("sproot", "spruiten"),
    ("waart", "zijn"),
    ("snoof", "snuiven"),
    ("spoot", "spuiten"),
    ("ontbeet", "ontbijten"),
    ("gehouwen", "houwen"),
    ("afgewassen", "afwassen"),
    ("deed", "doen"),
    ("schoven", "schuiven"),
    ("gelogen", "liegen"),
    ("woog", "wegen"),
    ("gebraden", "braden"),
    ("smolten", "smelten"),
    ("riep", "roepen"),
    ("aangedaan", "aandoen"),
    ("vermeden", "vermijden"),
    ("stootten", "stoten"),
    ("ging", "gaan"),
    ("geschoren", "scheren"),
    ("gesponnen", "spinnen"),
    ("reden", "rijden"),
    ("zochten", "zoeken"),
    ("leed", "lijden"),
    ("verzonnen", "verzinnen"),
]


@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_noun_lemmas_irreg(nl_lemmatizer, text, lemma):
|
||||||
|
pos = "noun"
|
||||||
|
lemmas_pred = nl_lemmatizer(text, pos)
|
||||||
|
assert lemma == sorted(lemmas_pred)[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_verb_lemmas_irreg(nl_lemmatizer, text, lemma):
|
||||||
|
pos = "verb"
|
||||||
|
lemmas_pred = nl_lemmatizer(text, pos)
|
||||||
|
assert lemma == sorted(lemmas_pred)[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_verb_lemmas_reg(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_adjective_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_determiner_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_adverb_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", [])
|
||||||
|
def test_nl_lemmatizer_pronoun_lemmas(nl_lemmatizer, text, lemma):
|
||||||
|
# TODO: add test
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
# Using the lemma lookup table only
|
||||||
|
@pytest.mark.parametrize("text,lemma", noun_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_lookup_noun(nl_lemmatizer, text, lemma):
|
||||||
|
lemma_pred = nl_lemmatizer.lookup(text)
|
||||||
|
assert lemma_pred in (lemma, text)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,lemma", verb_irreg_lemmatization_cases)
|
||||||
|
def test_nl_lemmatizer_lookup_verb(nl_lemmatizer, text, lemma):
|
||||||
|
lemma_pred = nl_lemmatizer.lookup(text)
|
||||||
|
assert lemma_pred in (lemma, text)
|
|
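
As a quick orientation to the call chain sketched in the comments at the top of this file, the lemmatizer can also be exercised by hand, using the same `Defaults.create_lemmatizer()` factory as the conftest fixture. A minimal sketch (the noted results are what the tests above expect, not guaranteed output):

```python
# Sketch: calling the Dutch lemmatizer directly, as the fixtures above do.
from spacy.util import get_lang_class

lemmatizer = get_lang_class("nl").Defaults.create_lemmatizer()
print(lemmatizer("liep", "verb"))   # rule/exception path; tests expect "lopen"
print(lemmatizer.lookup("eieren"))  # lookup-table path; tests expect "ei"
```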
@@ -9,3 +9,19 @@ from spacy.lang.nl.lex_attrs import like_num
def test_nl_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())


@pytest.mark.parametrize(
    "text,num_tokens",
    [
        (
            "De aftredende minister-president benadrukte al dat zijn partij inhoudelijk weinig gemeen heeft met de groenen.",
            16,
        ),
        ("Hij is sociaal-cultureel werker.", 5),
        ("Er staan een aantal dure auto's in de garage.", 10),
    ],
)
def test_tokenizer_doesnt_split_hyphens(nl_tokenizer, text, num_tokens):
    tokens = nl_tokenizer(text)
    assert len(tokens) == num_tokens
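
As the capitals test implies, `like_num` is case-insensitive and accepts digit strings as well as number words. A small sketch; treating "elf" (eleven) as a member of the `nl` number-word list is an assumption, not something this diff shows:

```python
# Sketch of the lexical attribute under test. Digits always pass; "elf" is
# assumed to be in the Dutch number-word list, and casing should not matter.
from spacy.lang.nl.lex_attrs import like_num

assert like_num("11")
assert like_num("elf")
assert like_num("ELF")
```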
@@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals

import re
from spacy import compat

prefix_search = (
    b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])"
@@ -67,4 +69,4 @@ if compat.is_python2:
    # string above in the xpass message.
    def test_issue3356():
        pattern = re.compile(compat.unescape_unicode(prefix_search.decode("utf8")))
        assert not pattern.search("hello")
@@ -1,10 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.util import decaying


def test_issue3447():
    sizes = decaying(10.0, 1.0, 0.5)
    size = next(sizes)
    assert size == 10.0
    size = next(sizes)
    assert size == 10.0 - 0.5
    size = next(sizes)
    assert size == 10.0 - 0.5 - 0.5
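
`decaying` returns an infinite generator, which is why the test pulls values with `next()`. A hedged sketch of its typical use for an annealed hyperparameter; that the series never drops below `stop` is an assumption about `spacy.util`, not something this hunk shows:

```python
# Sketch: consuming decaying() in a training loop for a dropout rate that
# starts at 0.6 and decreases by 1e-4 per step (assumed to floor at 0.2).
from spacy.util import decaying

dropout_rates = decaying(0.6, 0.2, 1e-4)
for step in range(3):
    dropout = next(dropout_rates)  # 0.6, 0.5999, 0.5998, ...
```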
25
spacy/tests/regression/test_issue3449.py
Normal file
@@ -0,0 +1,25 @@
# coding: utf8
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English


@pytest.mark.xfail(reason="Current default suffix rules avoid one upper-case letter before a dot.")
def test_issue3449():
    nlp = English()
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

    text1 = "He gave the ball to I. Do you want to go to the movies with I?"
    text2 = "He gave the ball to I.  Do you want to go to the movies with I?"
    text3 = "He gave the ball to I.\nDo you want to go to the movies with I?"

    t1 = nlp(text1)
    t2 = nlp(text2)
    t3 = nlp(text3)

    assert t1[5].text == "I"
    assert t2[5].text == "I"
    assert t3[5].text == "I"
@@ -1,7 +1,6 @@
# coding: utf8
from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.tokens import Doc
19
spacy/tests/regression/test_issue3521.py
Normal file
@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize(
    "word",
    [
        "don't",
        "don’t",
        "I'd",
        "I’d",
    ],
)
def test_issue3521(en_tokenizer, word):
    tok = en_tokenizer(word)[1]
    # 'not' and 'would' should be stopwords, also in their abbreviated forms
    assert tok.is_stop
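
The same check works outside the test fixtures with the bare English defaults; a minimal sketch (the token index and stop-word behavior follow the regression test above):

```python
# Sketch: "n’t" (token 1 of "don’t") should be flagged as a stop word,
# also in its curly-apostrophe form, per the regression test above.
from spacy.lang.en import English

nlp = English()
doc = nlp("don’t")
assert doc[1].is_stop
```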
33
spacy/tests/regression/test_issue3531.py
Normal file
@@ -0,0 +1,33 @@
# coding: utf8
from __future__ import unicode_literals

from spacy import displacy


def test_issue3531():
    """Test that displaCy renderer doesn't require "settings" key."""
    example_dep = {
        "words": [
            {"text": "But", "tag": "CCONJ"},
            {"text": "Google", "tag": "PROPN"},
            {"text": "is", "tag": "VERB"},
            {"text": "starting", "tag": "VERB"},
            {"text": "from", "tag": "ADP"},
            {"text": "behind.", "tag": "ADV"},
        ],
        "arcs": [
            {"start": 0, "end": 3, "label": "cc", "dir": "left"},
            {"start": 1, "end": 3, "label": "nsubj", "dir": "left"},
            {"start": 2, "end": 3, "label": "aux", "dir": "left"},
            {"start": 3, "end": 4, "label": "prep", "dir": "right"},
            {"start": 4, "end": 5, "label": "pcomp", "dir": "right"},
        ],
    }
    example_ent = {
        "text": "But Google is starting from behind.",
        "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    }
    dep_html = displacy.render(example_dep, style="dep", manual=True)
    assert dep_html
    ent_html = displacy.render(example_ent, style="ent", manual=True)
    assert ent_html
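
The manual input format also renders to a standalone page; a sketch using `displacy.render`'s `page` option (the file name is arbitrary and the tiny example is not taken from the test above):

```python
# Sketch: writing a manual dependency visualization to a self-contained
# HTML file instead of asserting on the markup.
from spacy import displacy

example_dep = {
    "words": [{"text": "Google", "tag": "PROPN"}, {"text": "wins", "tag": "VERB"}],
    "arcs": [{"start": 0, "end": 1, "label": "nsubj", "dir": "left"}],
}
html = displacy.render(example_dep, style="dep", manual=True, page=True)
with open("dep.html", "w", encoding="utf8") as f:
    f.write(html)
```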
@@ -26,6 +26,7 @@ def symlink_setup_target(request, symlink_target, symlink):
    os.mkdir(path2str(symlink_target))
    # yield -- need to cleanup even if assertion fails
    # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240

    def cleanup():
        symlink_remove(symlink)
        os.rmdir(path2str(symlink_target))
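
The `cleanup` closure above only helps if it is registered as a finalizer rather than called at the end of the fixture. A generic sketch of the pattern the linked pytest comment recommends; the names here are hypothetical, not taken from the spaCy conftest:

```python
# Sketch: request.addfinalizer guarantees teardown runs even if setup code
# later in the fixture fails, which a yield-based fixture would skip.
import os

import pytest


@pytest.fixture
def tmp_target_dir(request):
    os.mkdir("target")

    def cleanup():
        os.rmdir("target")

    request.addfinalizer(cleanup)
    return "target"
```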
@@ -160,20 +160,14 @@ https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py

### Visualizing spaCy vectors in TensorBoard {#tensorboard}

This script lets you load any spaCy model containing word vectors into
[TensorBoard](https://projector.tensorflow.org/) to create an
[embedding visualization](https://www.tensorflow.org/versions/r1.1/get_started/embedding_viz).

```python
https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py
```
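
For readers who do not want to run the linked script, the core idea fits in a few lines. A sketch (not the script itself) that dumps a model's vectors into the two TSV files the embedding projector accepts; the model name is an assumption, any vectors-enabled package works:

```python
# Sketch: export word vectors plus labels for TensorBoard's embedding
# projector (load vectors.tsv and metadata.tsv via "Load data").
import spacy

nlp = spacy.load("en_core_web_md")  # assumption: a model shipping vectors
with open("vectors.tsv", "w", encoding="utf8") as vecs_file, open(
    "metadata.tsv", "w", encoding="utf8"
) as meta_file:
    for string in nlp.vocab.strings:
        lexeme = nlp.vocab[string]
        if lexeme.has_vector:
            vecs_file.write("\t".join(str(v) for v in lexeme.vector) + "\n")
            meta_file.write(string + "\n")
```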

## Deep Learning {#deep-learning hidden="true"}

### Text classification with Keras {#keras}
@@ -35,7 +35,7 @@ const SEO = ({ description, lang, title, section, sectionTitle, bodyClass }) =>
        siteMetadata.slogan,
        sectionTitle
    )
    const socialImage = siteMetadata.siteUrl + getImage(section)
    const meta = [
        {
            name: 'description',
@@ -126,6 +126,7 @@ const query = graphql`
                title
                description
                slogan
                siteUrl
                social {
                    twitter
                }
@@ -164,9 +164,9 @@ const Landing = ({ data }) => {
                We're pleased to invite the spaCy community and other folks working on Natural
                Language Processing to Berlin this summer for a small and intimate event{' '}
                <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
                teams using spaCy in production, followed by a one-track conference. We've
                booked a beautiful venue, hand-picked an awesome lineup of speakers and
                scheduled plenty of social time to get to know each other and exchange ideas.
            </LandingBanner>

            <LandingBanner