Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 18:26:30 +03:00)
Commit 5a319060b9: Merge branch 'master' of https://github.com/explosion/spaCy
107
.github/contributors/RvanNieuwpoort.md
vendored
Executable file
@@ -0,0 +1,107 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                            |
|------------------------------- | -------------------------------- |
| Name                           | Rob van Nieuwpoort               |
| Signing on behalf of           | Dafne van Kuppevelt, Janneke van der Zwaan, Willem van Hage |
| Company name (if applicable)   | Netherlands eScience center      |
| Title or role (if applicable)  | Director of technology           |
| Date                           | 14-12-2016                       |
| GitHub username                | RvanNieuwpoort                   |
| Website (optional)             | https://www.esciencecenter.nl/   |

106
.github/contributors/magnusburton.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                            |
|------------------------------- | -------------------------------- |
| Name                           | Magnus Burton                    |
| Company name (if applicable)   |                                  |
| Title or role (if applicable)  |                                  |
| Date                           | 17-12-2016                       |
| GitHub username                | magnusburton                     |
| Website (optional)             |                                  |

9
.gitignore
vendored
@@ -29,14 +29,12 @@ spacy/orthography/*.cpp
 ext/murmurhash.cpp
 ext/sparsehash.cpp
 
-data/en/pos
-data/en/ner
-data/en/lexemes
-data/en/strings
+/spacy/data/
 
 _build/
 .env/
 tmp/
+cythonize.json
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -95,6 +93,9 @@ coverage.xml
 # Mac OS X
 *.DS_Store
 
+# Temporary files / Dropbox hack
+*.~*
+
 # Komodo project files
 *.komodoproject
 

@@ -21,7 +21,7 @@ install:
 
 script:
 - "pip install pytest"
-- if [[ "${VIA}" == "compile" ]]; then SPACY_DATA=models/en python -m pytest spacy; fi
+- if [[ "${VIA}" == "compile" ]]; then python -m pytest spacy; fi
 - if [[ "${VIA}" == "pypi" ]]; then python -m pytest `python -c "import pathlib; import spacy; print(pathlib.Path(spacy.__file__).parent.resolve())"`; fi
 - if [[ "${VIA}" == "sdist" ]]; then python -m pytest `python -c "import pathlib; import spacy; print(pathlib.Path(spacy.__file__).parent.resolve())"`; fi
 

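The `pypi` and `sdist` test commands in this hunk find the installed package by importing it and resolving the parent directory of its `__file__`. A minimal Python sketch of that one-liner follows. The helper name `package_dir` is hypothetical, and the standard-library `json` package stands in for `spacy` so the snippet runs without spaCy installed.

```python
import importlib
import pathlib


def package_dir(name):
    """Return the resolved directory that contains an importable package."""
    module = importlib.import_module(name)
    # For a package, __file__ points at its __init__.py, so the parent
    # directory is the package directory itself.
    return pathlib.Path(module.__file__).parent.resolve()


# 'json' is used only as a stand-in; the CI script passes 'spacy' instead.
json_dir = package_dir("json")
print(json_dir)
```

Passing the printed directory to `python -m pytest` runs the tests that ship inside the installed package rather than those in the source checkout.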
@@ -53,6 +53,10 @@ Coming soon.
 
 Coming soon.
 
+### Developer resources
+
+The [spaCy developer resources](https://github.com/explosion/spacy-dev-resources) repo contains useful scripts, tools and templates for developing spaCy, adding new languages and training new models. If you've written a script that might help others, feel free to contribute it to that repository.
+
 ### Contributor agreement
 
 If you've made a substantial contribution to spaCy, you should fill in the [spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that your contribution can be used across the project. If you agree to be bound by the terms of the agreement, fill in the [template]((.github/CONTRIBUTOR_AGREEMENT.md)) and include it with your pull request, or sumit it separately to [`.github/contributors/`](/.github/contributors). The name of the file should be your GitHub username, with the extension `.md`. For example, the user

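The naming rule for agreement files (GitHub username plus the `.md` extension, placed under `.github/contributors/`) can be sketched as a small helper. The function name `agreement_path` and its input check are illustrative assumptions, not part of spaCy's tooling.

```python
from pathlib import PurePosixPath

# Folder named in the contributor agreement.
CONTRIBUTORS_DIR = PurePosixPath(".github/contributors")


def agreement_path(github_username):
    """Build the repository-relative path for a signed agreement file."""
    # Hypothetical sanity check: reject empty names and path separators.
    if not github_username or "/" in github_username:
        raise ValueError("invalid GitHub username: %r" % github_username)
    return CONTRIBUTORS_DIR / (github_username + ".md")


path = agreement_path("example_user")
print(path)
```

The two agreement files added in this commit, `RvanNieuwpoort.md` and `magnusburton.md`, follow the same pattern.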
@@ -6,22 +6,29 @@ This is a list of everyone who has made significant contributions to spaCy, in a
 * Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
 * Chris DuBois, [@chrisdubois](https://github.com/chrisdubois)
 * Christoph Schwienheer, [@chssch](https://github.com/chssch)
+* Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk)
 * Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
 * Henning Peters, [@henningpeters](https://github.com/henningpeters)
 * Ines Montani, [@ines](https://github.com/ines)
 * J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
+* Janneke van der Zwaan, [@jvdzwaan](https://github.com/jvdzwaan)
 * Jordan Suchow, [@suchow](https://github.com/suchow)
 * Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
 * Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
 * Liling Tan, [@alvations](https://github.com/alvations)
+* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
+* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
 * Matthew Honnibal, [@honnibal](https://github.com/honnibal)
 * Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
 * Oleg Zd, [@olegzd](https://github.com/olegzd)
+* Pokey Rule, [@pokey](https://github.com/pokey)
+* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
 * Sam Bozek, [@sambozek](https://github.com/sambozek)
 * Sasho Savkov [@savkov](https://github.com/savkov)
 * Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
 * Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
 * Wah Loon Keng, [@kengz](https://github.com/kengz)
+* Willem van Hage, [@wrvhage](https://github.com/wrvhage)
 * Wolfgang Seeker, [@wbwseeker](https://github.com/wbwseeker)
 * Yanhao Yang, [@YanhaoYang](https://github.com/YanhaoYang)
 * Yubing Dong, [@tomtung](https://github.com/tomtung)

89
README.rst
@@ -6,7 +6,7 @@ Cython. spaCy is built on the very latest research, but it isn't researchware.
 It was designed from day 1 to be used in real products. It's commercial
 open-source software, released under the MIT license.
 
-💫 **Version 1.2 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
+💫 **Version 1.4 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
 
 .. image:: http://i.imgur.com/wFvLZyJ.png
     :target: https://travis-ci.org/explosion/spaCy
@@ -222,11 +222,94 @@ and ``--model`` are optional and enable additional tests:
 
     python -m pytest <spacy-directory> --vectors --model --slow
 
+Download model to custom location
+=================================
+
+You can specify where ``spacy.en.download`` and ``spacy.de.download`` download the language model
+to using the ``--data-path`` or ``-d`` argument:
+
+.. code:: bash
+
+    python -m spacy.en.download all --data-path /some/dir
+
+
+If you choose to download to a custom location, you will need to tell spaCy where to load the model
+from in order to use it. You can do this either by calling ``spacy.util.set_data_path()`` before
+calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.English`` or
+``spacy.de.German`` constructors.
+
 Changelog
 =========
 
-2016-11-04 `v1.2.0 <https://github.com/explosion/spaCy/releases>`_: *Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese*
--------------------------------------------------------------------------------------------------------------------------------------------
+2016-12-18 `v1.4.0 <https://github.com/explosion/spaCy/releases>`_: *Improved language data and alpha Dutch support*
+--------------------------------------------------------------------------------------------------------------------
 
+**✨ Major features and improvements**
+
+* **NEW:** Alpha support for Dutch tokenization.
+* Reorganise and improve format for language data.
+* Add shared tag map, entity rules, emoticons and punctuation to language data.
+* Convert entity rules, morphological rules and lemmatization rules from JSON to Python.
+* Update language data for English, German, Spanish, French, Italian and Portuguese.
+
+**🔴 Bug fixes**
+
+* Fix issue `#649 <https://github.com/explosion/spaCy/issues/649>`_: Update and reorganise stop lists.
+* Fix issue `#672 <https://github.com/explosion/spaCy/issues/672>`_: Make ``token.ent_iob_`` return unicode.
+* Fix issue `#674 <https://github.com/explosion/spaCy/issues/674>`_: Add missing lemmas for contracted forms of "be" to ``TOKENIZER_EXCEPTIONS``.
+* Fix issue `#683 <https://github.com/explosion/spaCy/issues/683>`_ ``Morphology`` class now supplies tag map value for the special space tag if it's missing.
+* Fix issue `#684 <https://github.com/explosion/spaCy/issues/684>`_: Ensure ``spacy.en.English()`` loads the Glove vector data if available. Previously was inconsistent with behaviour of ``spacy.load('en')``.
+* Fix issue `#685 <https://github.com/explosion/spaCy/issues/685>`_: Expand ``TOKENIZER_EXCEPTIONS`` with unicode apostrophe (``’``).
+* Fix issue `#689 <https://github.com/explosion/spaCy/issues/689>`_: Correct typo in ``STOP_WORDS``.
+* Fix issue `#691 <https://github.com/explosion/spaCy/issues/691>`_: Add tokenizer exceptions for "gonna" and "Gonna".
+
+**⚠️ Backwards incompatibilities**
+
+No changes to the public, documented API, but the previously undocumented language data and model initialisation processes have been refactored and reorganised. If you were relying on the ``bin/init_model.py`` script, see the new `spaCy Developer Resources <https://github.com/explosion/spacy-dev-resources>`_ repo. Code that references internals of the ``spacy.en`` or ``spacy.de`` packages should also be reviewed before updating to this version.
+
+**📖 Documentation and examples**
+
+* **NEW:** `"Adding languages" <https://spacy.io/docs/usage/adding-languages>`_ workflow.
+* **NEW:** `"Part-of-speech tagging" <https://spacy.io/docs/usage/pos-tagging>`_ workflow.
+* **NEW:** `spaCy Developer Resources <https://github.com/explosion/spacy-dev-resources>`_ repo – scripts, tools and resources for developing spaCy.
+* Fix various typos and inconsistencies.
+
+**👥 Contributors**
+
+Thanks to `@dafnevk <https://github.com/dafnevk>`_, `@jvdzwaan <https://github.com/jvdzwaan>`_, `@RvanNieuwpoort <https://github.com/RvanNieuwpoort>`_, `@wrvhage <https://github.com/wrvhage>`_, `@jaspb <https://github.com/jaspb>`_, `@savvopoulos <https://github.com/savvopoulos>`_ and `@davedwards <https://github.com/davedwards>`_ for the pull requests!
+
+2016-12-03 `v1.3.0 <https://github.com/explosion/spaCy/releases/tag/v1.3.0>`_: *Improve API consistency*
+--------------------------------------------------------------------------------------------------------
+
+**✨ API improvements**
+
+* Add ``Span.sentiment`` attribute.
+* `#658 <https://github.com/explosion/spaCy/pull/658>`_: Add ``Span.noun_chunks`` iterator (thanks `@pokey <https://github.com/pokey>`_).
+* `#642 <https://github.com/explosion/spaCy/pull/642>`_: Let ``--data-path`` be specified when running download.py scripts (thanks `@ExplodingCabbage <https://github.com/ExplodingCabbage>`_).
+* `#638 <https://github.com/explosion/spaCy/pull/638>`_: Add German stopwords (thanks `@souravsingh <https://github.com/souravsingh>`_).
+* `#614 <https://github.com/explosion/spaCy/pull/614>`_: Fix ``PhraseMatcher`` to work with new ``Matcher`` (thanks `@sadovnychyi <https://github.com/sadovnychyi>`_).
+
+**🔴 Bug fixes**
+
+* Fix issue `#605 <https://github.com/explosion/spaCy/issues/605>`_: ``accept`` argument to ``Matcher`` now rejects matches as expected.
+* Fix issue `#617 <https://github.com/explosion/spaCy/issues/617>`_: ``Vocab.load()`` now works with string paths, as well as ``Path`` objects.
+* Fix issue `#639 <https://github.com/explosion/spaCy/issues/639>`_: Stop words in ``Language`` class now used as expected.
+* Fix issues `#656 <https://github.com/explosion/spaCy/issues/656>`_, `#624 <https://github.com/explosion/spaCy/issues/624>`_: ``Tokenizer`` special-case rules now support arbitrary token attributes.
+
+
+**📖 Documentation and examples**
+
+* Add `"Customizing the tokenizer" <https://spacy.io/docs/usage/customizing-tokenizer>`_ workflow.
+* Add `"Training the tagger, parser and entity recognizer" <https://spacy.io/docs/usage/training>`_ workflow.
+* Add `"Entity recognition" <https://spacy.io/docs/usage/entity-recognition>`_ workflow.
+* Fix various typos and inconsistencies.
+
+**👥 Contributors**
+
+Thanks to `@pokey <https://github.com/pokey>`_, `@ExplodingCabbage <https://github.com/ExplodingCabbage>`_, `@souravsingh <https://github.com/souravsingh>`_, `@sadovnychyi <https://github.com/sadovnychyi>`_, `@manojsakhwar <https://github.com/manojsakhwar>`_, `@TiagoMRodrigues <https://github.com/TiagoMRodrigues>`_, `@savkov <https://github.com/savkov>`_, `@pspiegelhalter <https://github.com/pspiegelhalter>`_, `@chenb67 <https://github.com/chenb67>`_, `@kylepjohnson <https://github.com/kylepjohnson>`_, `@YanhaoYang <https://github.com/YanhaoYang>`_, `@tjrileywisc <https://github.com/tjrileywisc>`_, `@dechov <https://github.com/dechov>`_, `@wjt <https://github.com/wjt>`_, `@jsmootiv <https://github.com/jsmootiv>`_ and `@blarghmatey <https://github.com/blarghmatey>`_ for the pull requests!
+
+2016-11-04 `v1.2.0 <https://github.com/explosion/spaCy/releases/tag/v1.2.0>`_: *Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese*
+------------------------------------------------------------------------------------------------------------------------------------------------------
+
 **✨ Major features and improvements**
 

@@ -1,229 +0,0 @@
"""Set up a model directory.

Requires:

    lang_data --- Rules for the tokenizer
        * prefix.txt
        * suffix.txt
        * infix.txt
        * morphs.json
        * specials.json

    corpora --- Data files
        * WordNet
        * words.sgt.prob --- Smoothed unigram probabilities
        * clusters.txt --- Output of hierarchical clustering, e.g. Brown clusters
        * vectors.bz2 --- output of something like word2vec, compressed with bzip
"""
from __future__ import unicode_literals

from ast import literal_eval
import math
import gzip
import json

import plac
from pathlib import Path

from shutil import copyfile
from shutil import copytree
from collections import defaultdict
import io

from spacy.vocab import Vocab
from spacy.vocab import write_binary_vectors
from spacy.strings import hash_string
from preshed.counter import PreshCounter

from spacy.parts_of_speech import NOUN, VERB, ADJ
from spacy.util import get_lang_class


try:
    unicode
except NameError:
    unicode = str


def setup_tokenizer(lang_data_dir, tok_dir):
    if not tok_dir.exists():
        tok_dir.mkdir()

    for filename in ('infix.txt', 'morphs.json', 'prefix.txt', 'specials.json',
                     'suffix.txt'):
        src = lang_data_dir / filename
        dst = tok_dir / filename
        copyfile(str(src), str(dst))

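`setup_tokenizer` copies five fixed files and fails on the first one that is absent. As a rough illustration, the sketch below reports which of those files are missing from a language data directory before any copying is attempted. `missing_lang_data` is a hypothetical helper and was not part of the removed script.

```python
import tempfile
from pathlib import Path

# The five files that setup_tokenizer copies from the language data directory.
REQUIRED_FILES = ('infix.txt', 'morphs.json', 'prefix.txt', 'specials.json',
                  'suffix.txt')


def missing_lang_data(lang_data_dir):
    """Return the required tokenizer files that are absent from the directory."""
    lang_data_dir = Path(lang_data_dir)
    return [name for name in REQUIRED_FILES
            if not (lang_data_dir / name).exists()]


# Demonstration on a temporary directory that holds only prefix.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'prefix.txt').write_text(u'')
    missing = missing_lang_data(tmp)

print(missing)
```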
def _read_clusters(loc):
    if not loc.exists():
        print("Warning: Clusters file not found")
        return {}
    clusters = {}
    for line in io.open(str(loc), 'r', encoding='utf8'):
        try:
            cluster, word, freq = line.split()
        except ValueError:
            continue
        # If the clusterer has only seen the word a few times, its cluster is
        # unreliable.
        if int(freq) >= 3:
            clusters[word] = cluster
        else:
            clusters[word] = '0'
    # Expand clusters with re-casing
    for word, cluster in list(clusters.items()):
        if word.lower() not in clusters:
            clusters[word.lower()] = cluster
        if word.title() not in clusters:
            clusters[word.title()] = cluster
        if word.upper() not in clusters:
            clusters[word.upper()] = cluster
    return clusters

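The frequency cut-off and re-casing steps in `_read_clusters` can be tried on in-memory data. This is a minimal sketch, assuming the same whitespace-separated `cluster word freq` line format. The name `read_clusters`, its `min_freq` parameter and the sample lines are illustrative and do not come from the removed script.

```python
import io


def read_clusters(lines, min_freq=3):
    """Map each word to a cluster ID, then add re-cased variants."""
    clusters = {}
    for line in lines:
        try:
            cluster, word, freq = line.split()
        except ValueError:
            # Skip lines that do not have exactly three fields.
            continue
        # Words seen fewer than min_freq times get the placeholder cluster '0'.
        clusters[word] = cluster if int(freq) >= min_freq else '0'
    # Add lower-, title- and upper-case variants that are not already present.
    for word, cluster in list(clusters.items()):
        for variant in (word.lower(), word.title(), word.upper()):
            clusters.setdefault(variant, cluster)
    return clusters


# Made-up sample: one frequent word, one rare word, one malformed line.
sample = io.StringIO(u"1010 apple 12\n0110 Paris 2\nmalformed line here extra\n")
result = read_clusters(sample)
print(sorted(result.items()))
```

Because `Paris` appears only twice in the sample, it and its re-cased variants fall back to cluster `'0'`, while `apple` keeps its cluster under all three casings.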
def _read_probs(loc):
    if not loc.exists():
        print("Probabilities file not found. Trying freqs.")
        return {}, 0.0
    probs = {}
    for i, line in enumerate(io.open(str(loc), 'r', encoding='utf8')):
        prob, word = line.split()
        prob = float(prob)
        probs[word] = prob
    return probs, probs['-OOV-']

def _read_freqs(loc, max_length=100, min_doc_freq=5, min_freq=200):
|
|
||||||
if not loc.exists():
|
|
||||||
print("Warning: Frequencies file not found")
|
|
||||||
return {}, 0.0
|
|
||||||
counts = PreshCounter()
|
|
||||||
total = 0
|
|
||||||
if str(loc).endswith('gz'):
|
|
||||||
file_ = gzip.open(str(loc))
|
|
||||||
else:
|
|
||||||
file_ = loc.open()
|
|
||||||
for i, line in enumerate(file_):
|
|
||||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
|
||||||
freq = int(freq)
|
|
||||||
counts.inc(i+1, freq)
|
|
||||||
total += freq
|
|
||||||
counts.smooth()
|
|
||||||
log_total = math.log(total)
|
|
||||||
if str(loc).endswith('gz'):
|
|
||||||
file_ = gzip.open(str(loc))
|
|
||||||
else:
|
|
||||||
file_ = loc.open()
|
|
||||||
probs = {}
|
|
||||||
for line in file_:
|
|
||||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
|
||||||
doc_freq = int(doc_freq)
|
|
||||||
freq = int(freq)
|
|
||||||
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
|
|
||||||
word = literal_eval(key)
|
|
||||||
smooth_count = counts.smoother(int(freq))
|
|
||||||
probs[word] = math.log(smooth_count) - log_total
|
|
||||||
oov_prob = math.log(counts.smoother(0)) - log_total
|
|
||||||
return probs, oov_prob
|
|
||||||
|
|
||||||
|
|
||||||
def _read_senses(loc):
|
|
||||||
lexicon = defaultdict(lambda: defaultdict(list))
|
|
||||||
if not loc.exists():
|
|
||||||
print("Warning: WordNet senses not found")
|
|
||||||
return lexicon
|
|
||||||
sense_names = dict((s, i) for i, s in enumerate(spacy.senses.STRINGS))
|
|
||||||
pos_ids = {'noun': NOUN, 'verb': VERB, 'adjective': ADJ}
|
|
||||||
for line in codecs.open(str(loc), 'r', 'utf8'):
|
|
||||||
sense_strings = line.split()
|
|
||||||
word = sense_strings.pop(0)
|
|
||||||
for sense in sense_strings:
|
|
||||||
pos, sense = sense[3:].split('.')
|
|
||||||
sense_name = '%s_%s' % (pos[0].upper(), sense.lower())
|
|
||||||
if sense_name != 'N_tops':
|
|
||||||
sense_id = sense_names[sense_name]
|
|
||||||
lexicon[word][pos_ids[pos]].append(sense_id)
|
|
||||||
return lexicon
|
|
||||||
|
|
||||||
|
|
||||||
def setup_vocab(lex_attr_getters, tag_map, src_dir, dst_dir):
|
|
||||||
if not dst_dir.exists():
|
|
||||||
dst_dir.mkdir()
|
|
||||||
|
|
||||||
vectors_src = src_dir / 'vectors.bz2'
|
|
||||||
if vectors_src.exists():
|
|
||||||
write_binary_vectors(vectors_src.as_posix(), (dst_dir / 'vec.bin').as_posix())
|
|
||||||
else:
|
|
||||||
print("Warning: Word vectors file not found")
|
|
||||||
vocab = Vocab(lex_attr_getters=lex_attr_getters, tag_map=tag_map)
|
|
||||||
clusters = _read_clusters(src_dir / 'clusters.txt')
|
|
||||||
probs, oov_prob = _read_probs(src_dir / 'words.sgt.prob')
|
|
||||||
if not probs:
|
|
||||||
probs, oov_prob = _read_freqs(src_dir / 'freqs.txt.gz')
|
|
||||||
if not probs:
|
|
||||||
oov_prob = -20
|
|
||||||
else:
|
|
||||||
oov_prob = min(probs.values())
|
|
||||||
for word in clusters:
|
|
||||||
if word not in probs:
|
|
||||||
probs[word] = oov_prob
|
|
||||||
|
|
||||||
lexicon = []
|
|
||||||
for word, prob in reversed(sorted(list(probs.items()), key=lambda item: item[1])):
|
|
||||||
# First encode the strings into the StringStore. This way, we can map
|
|
||||||
# the orth IDs to frequency ranks
|
|
||||||
orth = vocab.strings[word]
|
|
||||||
# Now actually load the vocab
|
|
||||||
for word, prob in reversed(sorted(list(probs.items()), key=lambda item: item[1])):
|
|
||||||
lexeme = vocab[word]
|
|
||||||
lexeme.prob = prob
|
|
||||||
lexeme.is_oov = False
|
|
||||||
# Decode as a little-endian string, so that we can do & 15 to get
|
|
||||||
# the first 4 bits. See _parse_features.pyx
|
|
||||||
if word in clusters:
|
|
||||||
lexeme.cluster = int(clusters[word][::-1], 2)
|
|
||||||
else:
|
|
||||||
lexeme.cluster = 0
|
|
||||||
vocab.dump((dst_dir / 'lexemes.bin').as_posix())
|
|
||||||
with (dst_dir / 'strings.json').open('w') as file_:
|
|
||||||
vocab.strings.dump(file_)
|
|
||||||
with (dst_dir / 'oov_prob').open('w') as file_:
|
|
||||||
file_.write('%f' % oov_prob)
|
|
||||||
|
|
||||||
|
|
||||||
def main(lang_id, lang_data_dir, corpora_dir, model_dir):
|
|
||||||
model_dir = Path(model_dir)
|
|
||||||
lang_data_dir = Path(lang_data_dir) / lang_id
|
|
||||||
corpora_dir = Path(corpora_dir) / lang_id
|
|
||||||
|
|
||||||
assert corpora_dir.exists()
|
|
||||||
assert lang_data_dir.exists()
|
|
||||||
|
|
||||||
if not model_dir.exists():
|
|
||||||
model_dir.mkdir()
|
|
||||||
|
|
||||||
tag_map = json.load((lang_data_dir / 'tag_map.json').open())
|
|
||||||
setup_tokenizer(lang_data_dir, model_dir / 'tokenizer')
|
|
||||||
setup_vocab(get_lang_class(lang_id).Defaults.lex_attr_getters, tag_map, corpora_dir,
|
|
||||||
model_dir / 'vocab')
|
|
||||||
|
|
||||||
if (lang_data_dir / 'gazetteer.json').exists():
|
|
||||||
copyfile((lang_data_dir / 'gazetteer.json').as_posix(),
|
|
||||||
(model_dir / 'vocab' / 'gazetteer.json').as_posix())
|
|
||||||
|
|
||||||
copyfile((lang_data_dir / 'tag_map.json').as_posix(),
|
|
||||||
(model_dir / 'vocab' / 'tag_map.json').as_posix())
|
|
||||||
|
|
||||||
if (lang_data_dir / 'lemma_rules.json').exists():
|
|
||||||
copyfile((lang_data_dir / 'lemma_rules.json').as_posix(),
|
|
||||||
(model_dir / 'vocab' / 'lemma_rules.json').as_posix())
|
|
||||||
|
|
||||||
if not (model_dir / 'wordnet').exists() and (corpora_dir / 'wordnet').exists():
|
|
||||||
copytree((corpora_dir / 'wordnet' / 'dict').as_posix(),
|
|
||||||
(model_dir / 'wordnet').as_posix())
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
plac.call(main)
|
|
|
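The re-casing step in `_read_clusters` and the reversed-bit decoding in `setup_vocab` above can be sketched in isolation. This is a minimal illustration in plain Python, not the script's own code: the helper names `expand_cluster_casing` and `cluster_to_int` are hypothetical, and the sample cluster strings are made up.

```python
def expand_cluster_casing(clusters):
    """Add lower-, title- and upper-cased variants of each word.

    Existing entries are never overwritten, mirroring the
    `if variant not in clusters` checks in the script above.
    """
    expanded = dict(clusters)
    for word, cluster in clusters.items():
        for variant in (word.lower(), word.title(), word.upper()):
            expanded.setdefault(variant, cluster)
    return expanded


def cluster_to_int(bit_string):
    """Reverse the cluster bit string, then parse it as base 2.

    After the reversal, `value & 15` exposes the first four characters
    of the original string as the four lowest bits.
    """
    return int(bit_string[::-1], 2)


# Hypothetical sample data, for illustration only.
clusters = expand_cluster_casing({"apple": "1101", "Paris": "0010"})
print(sorted(clusters))
print(cluster_to_int("1101"), cluster_to_int("1101") & 15)
```

Using `setdefault` keeps the first cluster seen for a given casing, which is the same precedence the original loop gives by iterating over a snapshot of the dictionary.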
@ -100,7 +100,7 @@ def evaluate(Language, gold_tuples, model_dir, gold_preproc=False, verbose=False
|
||||||
nlp.entity(tokens)
|
nlp.entity(tokens)
|
||||||
else:
|
else:
|
||||||
tokens = nlp(raw_text)
|
tokens = nlp(raw_text)
|
||||||
gold = GoldParse(tokens, annot_tuples)
|
gold = GoldParse.from_annot_tuples(tokens, annot_tuples)
|
||||||
scorer.score(tokens, gold, verbose=verbose)
|
scorer.score(tokens, gold, verbose=verbose)
|
||||||
return scorer
|
return scorer
|
||||||
|
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
import plac
|
import plac
|
||||||
import json
|
import json
|
||||||
from os import path
|
from os import path
|
||||||
|
@ -5,106 +6,25 @@ import shutil
|
||||||
import os
|
import os
|
||||||
import random
|
import random
|
||||||
import io
|
import io
|
||||||
|
import pathlib
|
||||||
|
|
||||||
from spacy.syntax.util import Config
|
from spacy.tokens import Doc
|
||||||
|
from spacy.syntax.nonproj import PseudoProjectivity
|
||||||
|
from spacy.language import Language
|
||||||
from spacy.gold import GoldParse
|
from spacy.gold import GoldParse
|
||||||
from spacy.tokenizer import Tokenizer
|
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from spacy.tagger import Tagger
|
from spacy.tagger import Tagger
|
||||||
from spacy.syntax.parser import Parser
|
from spacy.pipeline import DependencyParser
|
||||||
from spacy.syntax.arc_eager import ArcEager
|
|
||||||
from spacy.syntax.parser import get_templates
|
from spacy.syntax.parser import get_templates
|
||||||
|
from spacy.syntax.arc_eager import ArcEager
|
||||||
from spacy.scorer import Scorer
|
from spacy.scorer import Scorer
|
||||||
import spacy.attrs
|
import spacy.attrs
|
||||||
|
import io
|
||||||
from spacy.language import Language
|
|
||||||
|
|
||||||
from spacy.tagger import W_orth
|
|
||||||
|
|
||||||
TAGGER_TEMPLATES = (
|
|
||||||
(W_orth,),
|
|
||||||
)
|
|
||||||
|
|
||||||
try:
|
|
||||||
from codecs import open
|
|
||||||
except ImportError:
|
|
||||||
pass
|
|
||||||
|
|
||||||
|
|
||||||
class TreebankParser(object):
|
|
||||||
@staticmethod
|
|
||||||
def setup_model_dir(model_dir, labels, templates, feat_set='basic', seed=0):
|
|
||||||
dep_model_dir = path.join(model_dir, 'deps')
|
|
||||||
pos_model_dir = path.join(model_dir, 'pos')
|
|
||||||
if path.exists(dep_model_dir):
|
|
||||||
shutil.rmtree(dep_model_dir)
|
|
||||||
if path.exists(pos_model_dir):
|
|
||||||
shutil.rmtree(pos_model_dir)
|
|
||||||
os.mkdir(dep_model_dir)
|
|
||||||
os.mkdir(pos_model_dir)
|
|
||||||
|
|
||||||
Config.write(dep_model_dir, 'config', features=feat_set, seed=seed,
|
|
||||||
labels=labels)
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def from_dir(cls, tag_map, model_dir):
|
|
||||||
vocab = Vocab(tag_map=tag_map, get_lex_attr=Language.default_lex_attrs())
|
|
||||||
vocab.get_lex_attr[spacy.attrs.LANG] = lambda _: 0
|
|
||||||
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
|
||||||
tagger = Tagger.blank(vocab, TAGGER_TEMPLATES)
|
|
||||||
|
|
||||||
cfg = Config.read(path.join(model_dir, 'deps'), 'config')
|
|
||||||
parser = Parser.from_dir(path.join(model_dir, 'deps'), vocab.strings, ArcEager)
|
|
||||||
return cls(vocab, tokenizer, tagger, parser)
|
|
||||||
|
|
||||||
def __init__(self, vocab, tokenizer, tagger, parser):
|
|
||||||
self.vocab = vocab
|
|
||||||
self.tokenizer = tokenizer
|
|
||||||
self.tagger = tagger
|
|
||||||
self.parser = parser
|
|
||||||
|
|
||||||
def train(self, words, tags, heads, deps):
|
|
||||||
tokens = self.tokenizer.tokens_from_list(list(words))
|
|
||||||
self.tagger.train(tokens, tags)
|
|
||||||
|
|
||||||
tokens = self.tokenizer.tokens_from_list(list(words))
|
|
||||||
ids = range(len(words))
|
|
||||||
ner = ['O'] * len(words)
|
|
||||||
gold = GoldParse(tokens, ((ids, words, tags, heads, deps, ner)),
|
|
||||||
make_projective=False)
|
|
||||||
self.tagger(tokens)
|
|
||||||
if gold.is_projective:
|
|
||||||
try:
|
|
||||||
self.parser.train(tokens, gold)
|
|
||||||
except:
|
|
||||||
for id_, word, head, dep in zip(ids, words, heads, deps):
|
|
||||||
print(id_, word, head, dep)
|
|
||||||
raise
|
|
||||||
|
|
||||||
def __call__(self, words, tags=None):
|
|
||||||
tokens = self.tokenizer.tokens_from_list(list(words))
|
|
||||||
if tags is None:
|
|
||||||
self.tagger(tokens)
|
|
||||||
else:
|
|
||||||
self.tagger.tag_from_strings(tokens, tags)
|
|
||||||
self.parser(tokens)
|
|
||||||
return tokens
|
|
||||||
|
|
||||||
def end_training(self, data_dir):
|
|
||||||
self.parser.model.end_training()
|
|
||||||
self.parser.model.dump(path.join(data_dir, 'deps', 'model'))
|
|
||||||
self.tagger.model.end_training()
|
|
||||||
self.tagger.model.dump(path.join(data_dir, 'pos', 'model'))
|
|
||||||
strings_loc = path.join(data_dir, 'vocab', 'strings.json')
|
|
||||||
with io.open(strings_loc, 'w', encoding='utf8') as file_:
|
|
||||||
self.vocab.strings.dump(file_)
|
|
||||||
self.vocab.dump(path.join(data_dir, 'vocab', 'lexemes.bin'))
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def read_conllx(loc):
|
def read_conllx(loc):
|
||||||
with open(loc, 'r', 'utf8') as file_:
|
with io.open(loc, 'r', encoding='utf8') as file_:
|
||||||
text = file_.read()
|
text = file_.read()
|
||||||
for sent in text.strip().split('\n\n'):
|
for sent in text.strip().split('\n\n'):
|
||||||
lines = sent.strip().split('\n')
|
lines = sent.strip().split('\n')
|
||||||
|
@ -113,24 +33,31 @@ def read_conllx(loc):
|
||||||
lines.pop(0)
|
lines.pop(0)
|
||||||
tokens = []
|
tokens = []
|
||||||
for line in lines:
|
for line in lines:
|
||||||
id_, word, lemma, pos, tag, morph, head, dep, _1, _2 = line.split()
|
id_, word, lemma, tag, pos, morph, head, dep, _1, _2 = line.split()
|
||||||
if '-' in id_:
|
if '-' in id_:
|
||||||
continue
|
continue
|
||||||
|
try:
|
||||||
id_ = int(id_) - 1
|
id_ = int(id_) - 1
|
||||||
head = (int(head) - 1) if head != '0' else id_
|
head = (int(head) - 1) if head != '0' else id_
|
||||||
dep = 'ROOT' if dep == 'root' else dep
|
dep = 'ROOT' if dep == 'root' else dep
|
||||||
tokens.append((id_, word, tag, head, dep, 'O'))
|
tokens.append((id_, word, tag, head, dep, 'O'))
|
||||||
tuples = zip(*tokens)
|
except:
|
||||||
yield (None, [(tuples, [])])
|
print(line)
|
||||||
|
raise
|
||||||
|
tuples = [list(t) for t in zip(*tokens)]
|
||||||
|
yield (None, [[tuples, []]])
|
||||||
|
|
||||||
|
|
||||||
def score_model(nlp, gold_docs, verbose=False):
|
def score_model(vocab, tagger, parser, gold_docs, verbose=False):
|
||||||
scorer = Scorer()
|
scorer = Scorer()
|
||||||
for _, gold_doc in gold_docs:
|
for _, gold_doc in gold_docs:
|
||||||
for annot_tuples, _ in gold_doc:
|
for (ids, words, tags, heads, deps, entities), _ in gold_doc:
|
||||||
tokens = nlp(list(annot_tuples[1]), tags=list(annot_tuples[2]))
|
doc = Doc(vocab, words=words)
|
||||||
gold = GoldParse(tokens, annot_tuples)
|
tagger(doc)
|
||||||
scorer.score(tokens, gold, verbose=verbose)
|
parser(doc)
|
||||||
|
PseudoProjectivity.deprojectivize(doc)
|
||||||
|
gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
|
||||||
|
scorer.score(doc, gold, verbose=verbose)
|
||||||
return scorer
|
return scorer
|
||||||
|
|
||||||
|
|
||||||
|
@ -138,22 +65,45 @@ def main(train_loc, dev_loc, model_dir, tag_map_loc):
|
||||||
with open(tag_map_loc) as file_:
|
with open(tag_map_loc) as file_:
|
||||||
tag_map = json.loads(file_.read())
|
tag_map = json.loads(file_.read())
|
||||||
train_sents = list(read_conllx(train_loc))
|
train_sents = list(read_conllx(train_loc))
|
||||||
labels = ArcEager.get_labels(train_sents)
|
train_sents = PseudoProjectivity.preprocess_training_data(train_sents)
|
||||||
templates = get_templates('basic')
|
|
||||||
|
|
||||||
TreebankParser.setup_model_dir(model_dir, labels, templates)
|
actions = ArcEager.get_actions(gold_parses=train_sents)
|
||||||
|
features = get_templates('basic')
|
||||||
|
|
||||||
nlp = TreebankParser.from_dir(tag_map, model_dir)
|
model_dir = pathlib.Path(model_dir)
|
||||||
|
with (model_dir / 'deps' / 'config.json').open('w') as file_:
|
||||||
|
json.dump({'pseudoprojective': True, 'labels': actions, 'features': features}, file_)
|
||||||
|
|
||||||
|
vocab = Vocab(lex_attr_getters=Language.Defaults.lex_attr_getters, tag_map=tag_map)
|
||||||
|
# Populate vocab
|
||||||
|
for _, doc_sents in train_sents:
|
||||||
|
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
|
||||||
|
for word in words:
|
||||||
|
_ = vocab[word]
|
||||||
|
for dep in deps:
|
||||||
|
_ = vocab[dep]
|
||||||
|
for tag in tags:
|
||||||
|
_ = vocab[tag]
|
||||||
|
for tag in tags:
|
||||||
|
assert tag in tag_map, repr(tag)
|
||||||
|
tagger = Tagger(vocab, tag_map=tag_map)
|
||||||
|
parser = DependencyParser(vocab, actions=actions, features=features)
|
||||||
|
|
||||||
for itn in range(15):
|
for itn in range(15):
|
||||||
for _, doc_sents in train_sents:
|
for _, doc_sents in train_sents:
|
||||||
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
|
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
|
||||||
nlp.train(words, tags, heads, deps)
|
doc = Doc(vocab, words=words)
|
||||||
|
gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
|
||||||
|
tagger(doc)
|
||||||
|
parser.update(doc, gold)
|
||||||
|
doc = Doc(vocab, words=words)
|
||||||
|
tagger.update(doc, gold)
|
||||||
random.shuffle(train_sents)
|
random.shuffle(train_sents)
|
||||||
scorer = score_model(nlp, read_conllx(dev_loc))
|
scorer = score_model(vocab, tagger, parser, read_conllx(dev_loc))
|
||||||
print('%d:\t%.3f\t%.3f' % (itn, scorer.uas, scorer.tags_acc))
|
print('%d:\t%.3f\t%.3f' % (itn, scorer.uas, scorer.tags_acc))
|
||||||
|
nlp = Language(vocab=vocab, tagger=tagger, parser=parser)
|
||||||
nlp.end_training(model_dir)
|
nlp.end_training(model_dir)
|
||||||
scorer = score_model(nlp, read_conllx(dev_loc))
|
scorer = score_model(vocab, tagger, parser, read_conllx(dev_loc))
|
||||||
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, scorer.uas, scorer.las, scorer.tags_acc))
|
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, scorer.uas, scorer.las, scorer.tags_acc))
|
||||||
|
|
||||||
|
|
||||||
|
|
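The per-line conversion that `read_conllx` performs in the diff above can be sketched without spaCy. This is a minimal sketch under stated assumptions: it follows the new side of the diff (fourth column read as the tag), the function name `parse_conllx_sentence` is hypothetical, and the two-token sample sentence is made up for illustration.

```python
def parse_conllx_sentence(sent):
    """Turn one CoNLL-X sentence block into (id, word, tag, head, dep, ner) tuples."""
    tokens = []
    for line in sent.strip().split("\n"):
        id_, word, lemma, tag, pos, morph, head, dep, _1, _2 = line.split()
        if "-" in id_:
            # Multi-word token ranges such as "1-2" are skipped.
            continue
        id_ = int(id_) - 1  # CoNLL ids are 1-based
        # A head of 0 marks the root, which is encoded as its own head.
        head = (int(head) - 1) if head != "0" else id_
        dep = "ROOT" if dep == "root" else dep
        tokens.append((id_, word, tag, head, dep, "O"))
    return tokens


# Hypothetical sample input, for illustration only.
SAMPLE = "\n".join([
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tsleeps\tsleep\tVERB\tVBZ\t_\t0\troot\t_\t_",
])
parsed = parse_conllx_sentence(SAMPLE)
# Transpose into parallel lists, as the new side of the diff does.
ids, words, tags, heads, deps, ner = [list(t) for t in zip(*parsed)]
print(words, heads, deps)
```

Converting the zipped columns to lists, as the updated script does, keeps the annotation tuples indexable and reusable on Python 3, where `zip` returns a one-shot iterator.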
14
examples/paddle/sentiment_bilstm/config.py
Normal file
|
@ -0,0 +1,14 @@
|
||||||
|
from paddle.trainer_config_helpers import *
|
||||||
|
|
||||||
|
define_py_data_sources2(train_list='train.list',
|
||||||
|
test_list='test.list',
|
||||||
|
module="dataprovider",
|
||||||
|
obj="process")
|
||||||
|
|
||||||
|
settings(
|
||||||
|
batch_size=128,
|
||||||
|
learning_rate=2e-3,
|
||||||
|
learning_method=AdamOptimizer(),
|
||||||
|
regularization=L2Regularization(8e-4),
|
||||||
|
gradient_clipping_threshold=25
|
||||||
|
)
|
46
examples/paddle/sentiment_bilstm/dataprovider.py
Normal file
|
@ -0,0 +1,46 @@
|
||||||
|
from paddle.trainer.PyDataProvider2 import *
|
||||||
|
from itertools import izip
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
|
||||||
|
def get_features(doc):
|
||||||
|
return numpy.asarray(
|
||||||
|
[t.rank+1 for t in doc
|
||||||
|
if t.has_vector and not t.is_punct and not t.is_space],
|
||||||
|
dtype='int32')
|
||||||
|
|
||||||
|
|
||||||
|
def read_data(data_dir):
|
||||||
|
for subdir, label in (('pos', 1), ('neg', 0)):
|
||||||
|
for filename in (data_dir / subdir).iterdir():
|
||||||
|
with filename.open() as file_:
|
||||||
|
text = file_.read()
|
||||||
|
yield text, label
|
||||||
|
|
||||||
|
|
||||||
|
def on_init(settings, **kwargs):
|
||||||
|
print("Loading spaCy")
|
||||||
|
nlp = spacy.load('en', entity=False)
|
||||||
|
vectors = get_vectors(nlp)
|
||||||
|
settings.input_types = [
|
||||||
|
# The text is a sequence of integer values; each value is a word id.
|
||||||
|
# The whole sequence is the sentence whose sentiment we want to
|
||||||
|
# predict.
|
||||||
|
integer_value(vectors.shape[0], seq_type=SequenceType), # text input
|
||||||
|
|
||||||
|
# label positive/negative
|
||||||
|
integer_value(2)
|
||||||
|
]
|
||||||
|
settings.nlp = nlp
|
||||||
|
settings.vectors = vectors
|
||||||
|
settings['batch_size'] = 32
|
||||||
|
|
||||||
|
|
||||||
|
@provider(init_hook=on_init)
|
||||||
|
def process(settings, data_dir):
|
||||||
|
texts, labels = zip(*read_data(data_dir))
|
||||||
|
for doc, label in izip(settings.nlp.pipe(texts, batch_size=5000, n_threads=3), labels):
|
||||||
|
for sent in doc.sents:
|
||||||
|
ids = get_features(sent)
|
||||||
|
# give data to paddle.
|
||||||
|
yield ids, label
|
19
examples/paddle/sentiment_bilstm/networks.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
from paddle.trainer_config_helpers import *
|
||||||
|
|
||||||
|
|
||||||
|
def bidirectional_lstm_net(input_dim,
|
||||||
|
class_dim=2,
|
||||||
|
emb_dim=128,
|
||||||
|
lstm_dim=128,
|
||||||
|
is_predict=False):
|
||||||
|
data = data_layer("word", input_dim)
|
||||||
|
emb = embedding_layer(input=data, size=emb_dim)
|
||||||
|
bi_lstm = bidirectional_lstm(input=emb, size=lstm_dim)
|
||||||
|
dropout = dropout_layer(input=bi_lstm, dropout_rate=0.5)
|
||||||
|
output = fc_layer(input=dropout, size=class_dim, act=SoftmaxActivation())
|
||||||
|
|
||||||
|
if not is_predict:
|
||||||
|
lbl = data_layer("label", 1)
|
||||||
|
outputs(classification_cost(input=output, label=lbl))
|
||||||
|
else:
|
||||||
|
outputs(output)
|
14
examples/paddle/sentiment_bilstm/train.sh
Executable file
|
@ -0,0 +1,14 @@
|
||||||
|
config=config.py
|
||||||
|
output=./model_output
|
||||||
|
paddle train --config=$config \
|
||||||
|
--save_dir=$output \
|
||||||
|
--job=train \
|
||||||
|
--use_gpu=false \
|
||||||
|
--trainer_count=4 \
|
||||||
|
--num_passes=10 \
|
||||||
|
--log_period=20 \
|
||||||
|
--dot_period=20 \
|
||||||
|
--show_parameter_stats_period=100 \
|
||||||
|
--test_all_data_in_one_period=1 \
|
||||||
|
--config_args=batch_size=100 \
|
||||||
|
2>&1 | tee 'train.log'
|
86
examples/sentiment/main.py
Normal file
|
@ -0,0 +1,86 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from __future__ import print_function
|
||||||
|
|
||||||
|
import plac
|
||||||
|
from pathlib import Path
|
||||||
|
import random
|
||||||
|
|
||||||
|
import spacy.en
|
||||||
|
import model
|
||||||
|
|
||||||
|
|
||||||
|
try:
|
||||||
|
import cPickle as pickle
|
||||||
|
except ImportError:
|
||||||
|
import pickle
|
||||||
|
|
||||||
|
|
||||||
|
def read_data(nlp, data_dir):
|
||||||
|
for subdir, label in (('pos', 1), ('neg', 0)):
|
||||||
|
for filename in (data_dir / subdir).iterdir():
|
||||||
|
text = filename.open().read()
|
||||||
|
doc = nlp(text)
|
||||||
|
yield doc, label
|
||||||
|
|
||||||
|
|
||||||
|
def partition(examples, split_size):
|
||||||
|
examples = list(examples)
|
||||||
|
random.shuffle(examples)
|
||||||
|
n_docs = len(examples)
|
||||||
|
split = int(n_docs * split_size)
|
||||||
|
return examples[:split], examples[split:]
|
||||||
|
|
||||||
|
|
||||||
|
class Dataset(object):
|
||||||
|
def __init__(self, nlp, data_dir, batch_size=24):
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.train, self.dev = partition(read_data(nlp, Path(data_dir)), 0.8)
|
||||||
|
print("Read %d train docs" % len(self.train))
|
||||||
|
print("Pos. Train: ", sum(eg[1] == 1 for eg in self.train))
|
||||||
|
print("Read %d dev docs" % len(self.dev))
|
||||||
|
print("Pos. Dev: ", sum(eg[1] == 1 for eg in self.dev))
|
||||||
|
|
||||||
|
def batches(self, data):
|
||||||
|
for i in range(0, len(data), self.batch_size):
|
||||||
|
yield data[i : i + self.batch_size]
|
||||||
|
|
||||||
|
|
||||||
|
def model_writer(out_dir, name):
|
||||||
|
def save_model(epoch, params):
|
||||||
|
out_path = out_dir / name.format(epoch=epoch)
|
||||||
|
pickle.dump(params, out_path.open('wb'))
|
||||||
|
return save_model
|
||||||
|
|
||||||
|
|
||||||
|
@plac.annotations(
|
||||||
|
data_dir=("Data directory", "positional", None, Path),
|
||||||
|
vocab_size=("Number of words to fine-tune", "option", "w", int),
|
||||||
|
n_iter=("Number of iterations (epochs)", "option", "i", int),
|
||||||
|
vector_len=("Size of embedding vectors", "option", "e", int),
|
||||||
|
hidden_len=("Size of hidden layers", "option", "H", int),
|
||||||
|
depth=("Depth", "option", "d", int),
|
||||||
|
drop_rate=("Drop-out rate", "option", "r", float),
|
||||||
|
rho=("Regularization penalty", "option", "p", float),
|
||||||
|
batch_size=("Batch size", "option", "b", int),
|
||||||
|
out_dir=("Model directory", "positional", None, Path)
|
||||||
|
)
|
||||||
|
def main(data_dir, out_dir, n_iter=10, vector_len=300, vocab_size=20000,
|
||||||
|
hidden_len=300, depth=3, drop_rate=0.3, rho=1e-4, batch_size=24):
|
||||||
|
print("Loading")
|
||||||
|
nlp = spacy.en.English(parser=False)
|
||||||
|
dataset = Dataset(nlp, data_dir / 'train', batch_size)
|
||||||
|
print("Training")
|
||||||
|
network = model.train(dataset, vector_len, hidden_len, 2, vocab_size, depth,
|
||||||
|
drop_rate, rho, n_iter,
|
||||||
|
model_writer(out_dir, 'model_{epoch}.pickle'))
|
||||||
|
score = model.Scorer()
|
||||||
|
print("Evaluating")
|
||||||
|
for doc, label in read_data(nlp, data_dir / 'test'):
|
||||||
|
word_ids, embeddings = model.get_words(doc, 0.0, vocab_size)
|
||||||
|
guess = network.forward(word_ids, embeddings)
|
||||||
|
score += guess == label
|
||||||
|
print(score)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
plac.call(main)
|
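The shuffle-and-split logic of `partition` and the slicing in `Dataset.batches` above can be tried on plain integers. This is a small sketch that assumes nothing beyond the standard library; the free function `batches` is a hypothetical stand-in for the method, and the seed is fixed only to make the run repeatable.

```python
import random


def partition(examples, split_size):
    """Shuffle the examples and split them at the given fraction."""
    examples = list(examples)
    random.shuffle(examples)
    split = int(len(examples) * split_size)
    return examples[:split], examples[split:]


def batches(data, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]


random.seed(0)  # fixed seed, for a repeatable illustration only
train, dev = partition(range(10), 0.8)
sizes = [len(batch) for batch in batches(train, 3)]
print(len(train), len(dev), sizes)
```

The last batch may be shorter than `batch_size`, which is why the training loop in `model.py` passes `len(batch)` to `update` rather than a fixed size.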
188
examples/sentiment/model.py
Normal file
|
@ -0,0 +1,188 @@
|
||||||
|
from __future__ import division
|
||||||
|
from numpy import average, zeros, outer, random, exp, sqrt, concatenate, argmax
|
||||||
|
import numpy
|
||||||
|
|
||||||
|
from .util import Scorer
|
||||||
|
|
||||||
|
|
||||||
|
class Adagrad(object):
|
||||||
|
def __init__(self, dim, lr):
|
||||||
|
self.dim = dim
|
||||||
|
self.eps = 1e-3
|
||||||
|
# initial learning rate
|
||||||
|
self.learning_rate = lr
|
||||||
|
# stores sum of squared gradients
|
||||||
|
self.h = zeros(self.dim)
|
||||||
|
self._curr_rate = zeros(self.h.shape)
|
||||||
|
|
||||||
|
def rescale(self, gradient):
|
||||||
|
self._curr_rate.fill(0)
|
||||||
|
self.h += gradient ** 2
|
||||||
|
self._curr_rate = self.learning_rate / (sqrt(self.h) + self.eps)
|
||||||
|
return self._curr_rate * gradient
|
||||||
|
|
||||||
|
def reset_weights(self):
|
||||||
|
self.h = zeros(self.dim)
|
||||||
|
|
||||||
|
|
||||||
|
class Params(object):
|
||||||
|
@classmethod
|
||||||
|
def zero(cls, depth, n_embed, n_hidden, n_labels, n_vocab):
|
||||||
|
return cls(depth, n_embed, n_hidden, n_labels, n_vocab, lambda x: zeros((x,)))
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def random(cls, depth, nE, nH, nL, nV):
|
||||||
|
return cls(depth, nE, nH, nL, nV, lambda x: (random.rand(x) * 2 - 1) * 0.08)
|
||||||
|
|
||||||
|
def __init__(self, depth, n_embed, n_hidden, n_labels, n_vocab, initializer):
|
||||||
|
nE = n_embed; nH = n_hidden; nL = n_labels; nV = n_vocab
|
||||||
|
n_weights = sum([
|
||||||
|
(nE * nH) + nH,
|
||||||
|
(nH * nH + nH) * depth,
|
||||||
|
(nH * nL) + nL,
|
||||||
|
(nV * nE)
|
||||||
|
])
|
||||||
|
self.data = initializer(n_weights)
|
||||||
|
self.W = []
|
||||||
|
self.b = []
|
||||||
|
i = self._add_layer(0, nE, nH)
|
||||||
|
for _ in range(1, depth):
|
||||||
|
i = self._add_layer(i, nH, nH)
|
||||||
|
i = self._add_layer(i, nL, nH)
|
||||||
|
self.E = self.data[i : i + (nV * nE)].reshape((nV, nE))
|
||||||
|
self.E.fill(0)
|
||||||
|
|
||||||
|
def _add_layer(self, start, x, y):
|
||||||
|
end = start + (x * y)
|
||||||
|
self.W.append(self.data[start : end].reshape((x, y)))
|
||||||
|
self.b.append(self.data[end : end + x].reshape((x, )))
|
||||||
|
return end + x
|
||||||
|
|
||||||
|
|
||||||
|
def softmax(actvn, W, b):
|
||||||
|
w = W.dot(actvn) + b
|
||||||
|
ew = exp(w - max(w))
|
||||||
|
return (ew / sum(ew)).ravel()
|
||||||
|
|
||||||
|
|
||||||
|
def relu(actvn, W, b):
|
||||||
|
x = W.dot(actvn) + b
|
||||||
|
return x * (x > 0)
|
||||||
|
|
||||||
|
|
||||||
|
def d_relu(x):
|
||||||
|
return x > 0
|
||||||
|
|
||||||
|
|
||||||
|
class Network(object):
|
||||||
|
def __init__(self, depth, n_embed, n_hidden, n_labels, n_vocab, rho=1e-4, lr=0.005):
|
||||||
|
self.depth = depth
|
||||||
|
self.n_embed = n_embed
|
||||||
|
self.n_hidden = n_hidden
|
||||||
|
self.n_labels = n_labels
|
||||||
|
self.n_vocab = n_vocab
|
||||||
|
|
||||||
|
self.params = Params.random(depth, n_embed, n_hidden, n_labels, n_vocab)
|
||||||
|
self.gradient = Params.zero(depth, n_embed, n_hidden, n_labels, n_vocab)
|
||||||
|
self.adagrad = Adagrad(self.params.data.shape, lr)
|
||||||
|
self.seen_words = {}
|
||||||
|
|
||||||
|
self.pred = zeros(self.n_labels)
|
||||||
|
self.actvn = zeros((self.depth, self.n_hidden))
|
||||||
|
self.input_vector = zeros((self.n_embed, ))
|
||||||
|
|
||||||
|
def forward(self, word_ids, embeddings):
|
||||||
|
self.input_vector.fill(0)
|
||||||
|
self.input_vector += sum(embeddings)
|
||||||
|
# Apply the fine-tuning we've learned
|
||||||
|
for id_ in word_ids:
|
||||||
|
if id_ < self.n_vocab:
|
||||||
|
self.input_vector += self.params.E[id_]
|
||||||
|
# Average
|
||||||
|
self.input_vector /= len(embeddings)
|
||||||
|
prev = self.input_vector
|
||||||
|
for i in range(self.depth):
|
||||||
|
self.actvn[i] = relu(prev, self.params.W[i], self.params.b[i])
|
||||||
|
prev = self.actvn[i]
|
||||||
|
self.pred = softmax(self.actvn[-1], self.params.W[-1], self.params.b[-1])
|
||||||
|
return argmax(self.pred)
|
||||||
|
|
||||||
|
def backward(self, word_ids, label):
|
||||||
|
target = zeros(self.n_labels)
|
||||||
|
target[label] = 1.0
|
||||||
|
D = self.pred - target
|
||||||
|
|
||||||
|
for i in range(self.depth, 0, -1):
|
||||||
|
self.gradient.b[i] += D
|
||||||
|
self.gradient.W[i] += outer(D, self.actvn[i-1])
|
||||||
|
D = d_relu(self.actvn[i-1]) * self.params.W[i].T.dot(D)
|
||||||
|
|
||||||
|
self.gradient.b[0] += D
|
||||||
|
self.gradient.W[0] += outer(D, self.input_vector)
|
||||||
|
|
||||||
|
grad = self.params.W[0].T.dot(D).reshape((self.n_embed,)) / len(word_ids)
|
||||||
|
for word_id in word_ids:
|
||||||
|
if word_id < self.n_vocab:
|
||||||
|
self.gradient.E[word_id] += grad
|
||||||
|
self.seen_words[word_id] = self.seen_words.get(word_id, 0) + 1
|
||||||
|
|
||||||
|
def update(self, rho, n):
|
||||||
|
# L2 Regularization
|
||||||
|
for i in range(self.depth):
|
||||||
|
self.gradient.W[i] += self.params.W[i] * rho
|
||||||
|
self.gradient.b[i] += self.params.b[i] * rho
|
||||||
|
# Do word embedding tuning
|
||||||
|
for word_id, freq in self.seen_words.items():
|
||||||
|
self.gradient.E[word_id] += (self.params.E[word_id] * freq) * rho
|
||||||
|
|
||||||
|
update = self.gradient.data / n
|
||||||
|
update = self.adagrad.rescale(update)
|
||||||
|
self.params.data -= update
|
||||||
|
self.gradient.data.fill(0)
|
||||||
|
self.seen_words = {}
|
||||||
|
|
||||||
|
|
||||||
|
def get_words(doc, dropout_rate, n_vocab):
|
||||||
|
mask = random.rand(len(doc)) > dropout_rate
|
||||||
|
word_ids = []
|
||||||
|
embeddings = []
|
||||||
|
for word in doc:
|
||||||
|
if mask[word.i] and not word.is_punct:
|
||||||
|
embeddings.append(word.vector)
|
||||||
|
word_ids.append(word.orth)
|
||||||
|
# all examples must have at least one word
|
||||||
|
if not embeddings:
|
||||||
|
return [w.orth for w in doc], [w.vector for w in doc]
|
||||||
|
else:
|
||||||
|
return word_ids, embeddings
|
||||||
|
|
||||||
|
|
||||||
|
def train(dataset, n_embed, n_hidden, n_labels, n_vocab, depth, dropout_rate, rho,
|
||||||
|
n_iter, save_model):
|
||||||
|
model = Network(depth, n_embed, n_hidden, n_labels, n_vocab)
|
||||||
|
best_acc = 0
|
||||||
|
for epoch in range(n_iter):
|
||||||
|
train_score = Scorer()
|
||||||
|
# create mini-batches
|
||||||
|
for batch in dataset.batches(dataset.train):
|
||||||
|
for doc, label in batch:
|
||||||
|
if len(doc) == 0:
|
||||||
|
continue
|
||||||
|
word_ids, embeddings = get_words(doc, dropout_rate, n_vocab)
|
||||||
|
guess = model.forward(word_ids, embeddings)
|
||||||
|
model.backward(word_ids, label)
|
||||||
|
train_score += guess == label
|
||||||
|
model.update(rho, len(batch))
|
||||||
|
test_score = Scorer()
|
||||||
|
for doc, label in dataset.dev:
|
||||||
|
word_ids, embeddings = get_words(doc, 0.0, n_vocab)
|
||||||
|
guess = model.forward(word_ids, embeddings)
|
||||||
|
test_score += guess == label
|
||||||
|
if test_score.true >= best_acc:
|
||||||
|
best_acc = test_score.true
|
||||||
|
save_model(epoch, model.params.data)
|
||||||
|
print("%d\t%s\t%s" % (epoch, train_score, test_score))
|
||||||
|
return model
|
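The update rule in `Adagrad.rescale` above divides the learning rate by the square root of the accumulated squared gradients. The following is a pure-Python sketch of that rule, assuming list-based vectors instead of NumPy arrays so it runs without dependencies; the class name `AdagradSketch` is hypothetical and the gradients are made-up values.

```python
import math


class AdagradSketch(object):
    """List-based illustration of the per-dimension Adagrad step."""

    def __init__(self, dim, lr, eps=1e-3):
        self.lr = lr
        self.eps = eps
        self.h = [0.0] * dim  # running sum of squared gradients

    def rescale(self, gradient):
        update = []
        for i, g in enumerate(gradient):
            self.h[i] += g ** 2
            # The effective rate shrinks as squared gradients accumulate.
            rate = self.lr / (math.sqrt(self.h[i]) + self.eps)
            update.append(rate * g)
        return update


opt = AdagradSketch(dim=2, lr=0.005)
first = opt.rescale([2.0, 0.0])
second = opt.rescale([2.0, 0.0])
print(first, second)
```

Repeating the same gradient gives a smaller step the second time, and a dimension with zero gradient receives no update at all; `eps` only keeps the division defined while `h` is still zero.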
14
examples/sentiment/util.py
Normal file
|
@ -0,0 +1,14 @@
|
||||||
|
class Scorer(object):
|
||||||
|
def __init__(self):
|
||||||
|
self.true = 0
|
||||||
|
self.total = 0
|
||||||
|
|
||||||
|
def __iadd__(self, is_correct):
|
||||||
|
self.true += is_correct
|
||||||
|
self.total += 1
|
||||||
|
return self
|
||||||
|
|
||||||
|
def __str__(self):
|
||||||
|
return '%.3f' % (self.true / self.total)
|
||||||
|
|
||||||
|
|
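The `Scorer` above accumulates results through `+=`, which is how `model.py` uses it (`train_score += guess == label`). Here is a hedged usage sketch: `ScorerSketch` is a hypothetical copy, and both the `float()` conversion and the empty-scorer guard are additions of this sketch, not part of the original class.

```python
class ScorerSketch(object):
    def __init__(self):
        self.true = 0
        self.total = 0

    def __iadd__(self, is_correct):
        self.true += int(is_correct)
        self.total += 1
        return self  # __iadd__ must return the object that `+=` rebinds

    def __str__(self):
        if self.total == 0:
            return "n/a"  # assumption: guard added here, not in the original
        # float() avoids integer division on Python 2 (also an addition).
        return "%.3f" % (self.true / float(self.total))


score = ScorerSketch()
for guess, label in [(1, 1), (0, 1), (0, 0)]:
    score += guess == label
print(score)
```

Without the conversion to float, the original `__str__` relies on true division, which Python 2 provides only when `from __future__ import division` is present in the module.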
22
examples/training/load_ner.py
Normal file
|
@ -0,0 +1,22 @@
|
||||||
|
# Load NER
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import spacy
|
||||||
|
import pathlib
|
||||||
|
from spacy.pipeline import EntityRecognizer
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
|
||||||
|
def load_model(model_dir):
|
||||||
|
model_dir = pathlib.Path(model_dir)
|
||||||
|
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
|
||||||
|
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
|
||||||
|
nlp.vocab.strings.load(file_)
|
||||||
|
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
|
||||||
|
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
|
||||||
|
return (nlp, ner)
|
||||||
|
|
||||||
|
(nlp, ner) = load_model('ner')
|
||||||
|
doc = nlp.make_doc('Who is Shaka Khan?')
|
||||||
|
nlp.tagger(doc)
|
||||||
|
ner(doc)
|
||||||
|
for word in doc:
|
||||||
|
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
|
|
@ -10,6 +10,13 @@ from spacy.tagger import Tagger
|
||||||
|
|
||||||
|
|
||||||
def train_ner(nlp, train_data, entity_types):
|
def train_ner(nlp, train_data, entity_types):
|
||||||
|
# Add new words to vocab.
|
||||||
|
for raw_text, _ in train_data:
|
||||||
|
doc = nlp.make_doc(raw_text)
|
||||||
|
for word in doc:
|
||||||
|
_ = nlp.vocab[word.orth]
|
||||||
|
|
||||||
|
# Train NER.
|
||||||
ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
|
ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
|
||||||
for itn in range(5):
|
for itn in range(5):
|
||||||
random.shuffle(train_data)
|
random.shuffle(train_data)
|
||||||
|
@ -20,21 +27,30 @@ def train_ner(nlp, train_data, entity_types):
|
||||||
ner.model.end_training()
|
ner.model.end_training()
|
||||||
return ner
|
return ner
|
||||||
|
|
||||||
|
def save_model(ner, model_dir):
|
||||||
def main(model_dir=None):
|
|
||||||
if model_dir is not None:
|
|
||||||
model_dir = pathlib.Path(model_dir)
|
model_dir = pathlib.Path(model_dir)
|
||||||
if not model_dir.exists():
|
if not model_dir.exists():
|
||||||
model_dir.mkdir()
|
model_dir.mkdir()
|
||||||
assert model_dir.is_dir()
|
assert model_dir.is_dir()
|
||||||
|
|
||||||
|
with (model_dir / 'config.json').open('w') as file_:
|
||||||
|
json.dump(ner.cfg, file_)
|
||||||
|
ner.model.dump(str(model_dir / 'model'))
|
||||||
|
if not (model_dir / 'vocab').exists():
|
||||||
|
(model_dir / 'vocab').mkdir()
|
||||||
|
ner.vocab.dump(str(model_dir / 'vocab' / 'lexemes.bin'))
|
||||||
|
with (model_dir / 'vocab' / 'strings.json').open('w', encoding='utf8') as file_:
|
||||||
|
ner.vocab.strings.dump(file_)
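The directory handling in `save_model` can be illustrated without spaCy. This is a minimal sketch, assuming Python 3.5+ where `Path.mkdir` accepts `parents` and `exist_ok`; `save_artifacts` and the example config dict are hypothetical stand-ins for `save_model` and `ner.cfg`, and the model and vocab dumps themselves are omitted.

```python
import json
import pathlib
import tempfile


def save_artifacts(config, model_dir):
    """Write `config` to model_dir/config.json and create model_dir/vocab."""
    model_dir = pathlib.Path(model_dir)
    # parents/exist_ok replace the separate exists() checks before mkdir().
    (model_dir / 'vocab').mkdir(parents=True, exist_ok=True)
    with (model_dir / 'config.json').open('w', encoding='utf8') as file_:
        json.dump(config, file_)
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical config; the real one would come from `ner.cfg`.
    out = save_artifacts({'entity_types': ['PERSON', 'LOC']},
                         pathlib.Path(tmp) / 'ner')
    vocab_created = (out / 'vocab').is_dir()
    saved = json.loads((out / 'config.json').read_text(encoding='utf8'))
    print(saved)
```

Creating the `vocab` subdirectory with `parents=True` also creates the model directory, so a second call on an existing directory does not fail.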
|
||||||
|
|
||||||
|
|
||||||
|
def main(model_dir=None):
|
||||||
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
|
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
|
||||||
|
|
||||||
# v1.1.2 onwards
|
# v1.1.2 onwards
|
||||||
if nlp.tagger is None:
|
if nlp.tagger is None:
|
||||||
print('---- WARNING ----')
|
print('---- WARNING ----')
|
||||||
print('Data directory not found')
|
print('Data directory not found')
|
||||||
print('please run: `python -m spacy.en.download –force all` for better performance')
|
print('please run: `python -m spacy.en.download --force all` for better performance')
|
||||||
print('Using feature templates for tagging')
|
print('Using feature templates for tagging')
|
||||||
print('-----------------')
|
print('-----------------')
|
||||||
nlp.tagger = Tagger(nlp.vocab, features=Tagger.feature_templates)
|
nlp.tagger = Tagger(nlp.vocab, features=Tagger.feature_templates)
|
||||||
|
@ -56,16 +72,17 @@ def main(model_dir=None):
|
||||||
nlp.tagger(doc)
|
nlp.tagger(doc)
|
||||||
ner(doc)
|
ner(doc)
|
||||||
for word in doc:
|
for word in doc:
|
||||||
print(word.text, word.tag_, word.ent_type_, word.ent_iob)
|
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
|
||||||
|
|
||||||
if model_dir is not None:
|
if model_dir is not None:
|
||||||
with (model_dir / 'config.json').open('w') as file_:
|
save_model(ner, model_dir)
|
||||||
json.dump(ner.cfg, file_)
|
|
||||||
ner.model.dump(str(model_dir / 'model'))
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
main()
|
main('ner')
|
||||||
# Who "" 2
|
# Who "" 2
|
||||||
# is "" 2
|
# is "" 2
|
||||||
# Shaka "" PERSON 3
|
# Shaka "" PERSON 3
|
||||||
|
|
|
@ -69,7 +69,7 @@ def main(output_dir=None):
|
||||||
print(word.text, word.tag_, word.pos_)
|
print(word.text, word.tag_, word.pos_)
|
||||||
if output_dir is not None:
|
if output_dir is not None:
|
||||||
tagger.model.dump(str(output_dir / 'pos' / 'model'))
|
tagger.model.dump(str(output_dir / 'pos' / 'model'))
|
||||||
with (output_dir / 'vocab' / 'strings.json').open('wb') as file_:
|
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
|
||||||
tagger.vocab.strings.dump(file_)
|
tagger.vocab.strings.dump(file_)
|
||||||
|
|
||||||
|
|
||||||
|
|
163
fabfile.py
vendored
|
@ -13,134 +13,6 @@ PWD = path.dirname(__file__)
|
||||||
VENV_DIR = path.join(PWD, '.env')
|
VENV_DIR = path.join(PWD, '.env')
|
||||||
|
|
||||||
|
|
||||||
def counts():
|
|
||||||
pass
|
|
||||||
# Tokenize the corpus
|
|
||||||
# tokenize()
|
|
||||||
# get_freqs()
|
|
||||||
# Collate the counts
|
|
||||||
# cat freqs | sort -k2 | gather_freqs()
|
|
||||||
# gather_freqs()
|
|
||||||
# smooth()
|
|
||||||
|
|
||||||
|
|
||||||
# clean, make, sdist
|
|
||||||
# cd to new env, install from sdist,
|
|
||||||
# Push changes to server
|
|
||||||
# Pull changes on server
|
|
||||||
# clean make init model
|
|
||||||
# test --vectors --slow
|
|
||||||
# train
|
|
||||||
# test --vectors --slow --models
|
|
||||||
# sdist
|
|
||||||
# upload data to server
|
|
||||||
# change to clean venv
|
|
||||||
# py2: install from sdist, test --slow, download data, test --models --vectors
|
|
||||||
# py3: install from sdist, test --slow, download data, test --models --vectors
|
|
||||||
|
|
||||||
|
|
||||||
def prebuild(build_dir='/tmp/build_spacy'):
|
|
||||||
if file_exists(build_dir):
|
|
||||||
shutil.rmtree(build_dir)
|
|
||||||
os.mkdir(build_dir)
|
|
||||||
spacy_dir = path.dirname(__file__)
|
|
||||||
wn_url = 'http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz'
|
|
||||||
build_venv = path.join(build_dir, '.env')
|
|
||||||
with lcd(build_dir):
|
|
||||||
local('git clone %s .' % spacy_dir)
|
|
||||||
local('virtualenv ' + build_venv)
|
|
||||||
with prefix('cd %s && PYTHONPATH=`pwd` && . %s/bin/activate' % (build_dir, build_venv)):
|
|
||||||
local('pip install cython fabric fabtools pytest')
|
|
||||||
local('pip install --no-cache-dir -r requirements.txt')
|
|
||||||
local('fab clean make')
|
|
||||||
local('cp -r %s/corpora/en/wordnet corpora/en/' % spacy_dir)
|
|
||||||
local('PYTHONPATH=`pwd` python bin/init_model.py en lang_data corpora spacy/en/data')
|
|
||||||
local('PYTHONPATH=`pwd` fab test')
|
|
||||||
local('PYTHONPATH=`pwd` python -m spacy.en.download --force all')
|
|
||||||
local('PYTHONPATH=`pwd` py.test --models spacy/tests/')
|
|
||||||
|
|
||||||
|
|
||||||
def web():
|
|
||||||
def jade(source_name, out_dir):
|
|
||||||
pwd = path.join(path.dirname(__file__), 'website')
|
|
||||||
jade_loc = path.join(pwd, 'src', 'jade', source_name)
|
|
||||||
out_loc = path.join(pwd, 'site', out_dir)
|
|
||||||
local('jade -P %s --out %s' % (jade_loc, out_loc))
|
|
||||||
|
|
||||||
with virtualenv(VENV_DIR):
|
|
||||||
local('./website/create_code_samples spacy/tests/website/ website/src/code/')
|
|
||||||
|
|
||||||
jade('404.jade', '')
|
|
||||||
jade('home/index.jade', '')
|
|
||||||
jade('docs/index.jade', 'docs/')
|
|
||||||
jade('blog/index.jade', 'blog/')
|
|
||||||
|
|
||||||
for collection in ('blog', 'tutorials'):
|
|
||||||
for post_dir in (Path(__file__).parent / 'website' / 'src' / 'jade' / collection).iterdir():
|
|
||||||
if post_dir.is_dir() \
|
|
||||||
and (post_dir / 'index.jade').exists() \
|
|
||||||
and (post_dir / 'meta.jade').exists():
|
|
||||||
jade(str(post_dir / 'index.jade'), path.join(collection, post_dir.parts[-1]))
|
|
||||||
|
|
||||||
|
|
||||||
def web_publish(assets_path):
|
|
||||||
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
|
|
||||||
|
|
||||||
site_path = 'website/site'
|
|
||||||
|
|
||||||
os.environ['S3_USE_SIGV4'] = 'True'
|
|
||||||
conn = S3Connection(host='s3.eu-central-1.amazonaws.com',
|
|
||||||
calling_format=OrdinaryCallingFormat())
|
|
||||||
bucket = conn.get_bucket('spacy.io', validate=False)
|
|
||||||
|
|
||||||
keys_left = set([k.name for k in bucket.list()
|
|
||||||
if not k.name.startswith('resources')])
|
|
||||||
|
|
||||||
for root, dirnames, filenames in os.walk(site_path):
|
|
||||||
for dirname in dirnames:
|
|
||||||
target = os.path.relpath(os.path.join(root, dirname), site_path)
|
|
||||||
source = os.path.join(target, 'index.html')
|
|
||||||
|
|
||||||
if os.path.exists(os.path.join(root, dirname, 'index.html')):
|
|
||||||
key = bucket.new_key(source)
|
|
||||||
key.set_redirect('//%s/%s' % (bucket.name, target))
|
|
||||||
print('adding redirect for %s' % target)
|
|
||||||
|
|
||||||
keys_left.remove(source)
|
|
||||||
|
|
||||||
for filename in filenames:
|
|
||||||
source = os.path.join(root, filename)
|
|
||||||
|
|
||||||
target = os.path.relpath(root, site_path)
|
|
||||||
if target == '.':
|
|
||||||
target = filename
|
|
||||||
elif filename != 'index.html':
|
|
||||||
target = os.path.join(target, filename)
|
|
||||||
|
|
||||||
key = bucket.new_key(target)
|
|
||||||
key.set_metadata('Content-Type', 'text/html')
|
|
||||||
key.set_contents_from_filename(source)
|
|
||||||
print('uploading %s' % target)
|
|
||||||
|
|
||||||
keys_left.remove(target)
|
|
||||||
|
|
||||||
for key_name in keys_left:
|
|
||||||
print('deleting %s' % key_name)
|
|
||||||
bucket.delete_key(key_name)
|
|
||||||
|
|
||||||
local('aws s3 sync --delete %s s3://spacy.io/resources' % assets_path)
|
|
||||||
|
|
||||||
|
|
||||||
def publish(version):
|
|
||||||
with virtualenv(VENV_DIR):
|
|
||||||
local('git push origin master')
|
|
||||||
local('git tag -a %s' % version)
|
|
||||||
local('git push origin %s' % version)
|
|
||||||
local('python setup.py sdist')
|
|
||||||
local('python setup.py register')
|
|
||||||
local('twine upload dist/spacy-%s.tar.gz' % version)
|
|
||||||
|
|
||||||
|
|
||||||
def env(lang="python2.7"):
|
def env(lang="python2.7"):
|
||||||
if file_exists('.env'):
|
if file_exists('.env'):
|
||||||
local('rm -rf .env')
|
local('rm -rf .env')
|
||||||
|
@ -172,38 +44,3 @@ def test():
|
||||||
with virtualenv(VENV_DIR):
|
with virtualenv(VENV_DIR):
|
||||||
with lcd(path.dirname(__file__)):
|
with lcd(path.dirname(__file__)):
|
||||||
local('py.test -x spacy/tests')
|
local('py.test -x spacy/tests')
|
||||||
|
|
||||||
|
|
||||||
def train(json_dir=None, dev_loc=None, model_dir=None):
|
|
||||||
if json_dir is None:
|
|
||||||
json_dir = 'corpora/en/json'
|
|
||||||
if model_dir is None:
|
|
||||||
model_dir = 'models/en/'
|
|
||||||
with virtualenv(VENV_DIR):
|
|
||||||
with lcd(path.dirname(__file__)):
|
|
||||||
local('python bin/init_model.py en lang_data/ corpora/ ' + model_dir)
|
|
||||||
local('python bin/parser/train.py -p en %s/train/ %s/development %s' % (json_dir, json_dir, model_dir))
|
|
||||||
|
|
||||||
|
|
||||||
def travis():
|
|
||||||
local('open https://travis-ci.org/honnibal/thinc')
|
|
||||||
|
|
||||||
|
|
||||||
def pos():
|
|
||||||
with virtualenv(VENV_DIR):
|
|
||||||
local('python tools/train.py ~/work_data/docparse/wsj02-21.conll ~/work_data/docparse/wsj22.conll spacy/en/data')
|
|
||||||
local('python tools/tag.py ~/work_data/docparse/wsj22.raw /tmp/tmp')
|
|
||||||
local('python tools/eval_pos.py ~/work_data/docparse/wsj22.conll /tmp/tmp')
|
|
||||||
|
|
||||||
|
|
||||||
def ner():
|
|
||||||
local('rm -rf data/en/ner')
|
|
||||||
local('python tools/train_ner.py ~/work_data/docparse/wsj02-21.conll data/en/ner')
|
|
||||||
local('python tools/tag_ner.py ~/work_data/docparse/wsj22.raw /tmp/tmp')
|
|
||||||
local('python tools/eval_ner.py ~/work_data/docparse/wsj22.conll /tmp/tmp | tail')
|
|
||||||
|
|
||||||
|
|
||||||
def conll():
|
|
||||||
local('rm -rf data/en/ner')
|
|
||||||
local('python tools/conll03_train.py ~/work_data/ner/conll2003/eng.train data/en/ner/')
|
|
||||||
local('python tools/conll03_eval.py ~/work_data/ner/conll2003/eng.testa')
|
|
||||||
|
|
|
@ -1,319 +0,0 @@
|
||||||
# surface form lemma pos
|
|
||||||
# multiple values are separated by |
|
|
||||||
# empty lines and lines starting with # are being ignored
|
|
||||||
|
|
||||||
'' ''
|
|
||||||
\") \")
|
|
||||||
\n \n <nl> SP
|
|
||||||
\t \t <tab> SP
|
|
||||||
<space> SP
|
|
||||||
|
|
||||||
# example: Wie geht's?
|
|
||||||
's 's es
|
|
||||||
'S 'S es
|
|
||||||
|
|
||||||
# example: Haste mal 'nen Euro?
|
|
||||||
'n 'n ein
|
|
||||||
'ne 'ne eine
|
|
||||||
'nen 'nen einen
|
|
||||||
|
|
||||||
# example: Kommen S’ nur herein!
|
|
||||||
s' s' sie
|
|
||||||
S' S' sie
|
|
||||||
|
|
||||||
# example: Da haben wir's!
|
|
||||||
ich's ich|'s ich|es
|
|
||||||
du's du|'s du|es
|
|
||||||
er's er|'s er|es
|
|
||||||
sie's sie|'s sie|es
|
|
||||||
wir's wir|'s wir|es
|
|
||||||
ihr's ihr|'s ihr|es
|
|
||||||
|
|
||||||
# example: Die katze auf'm dach.
|
|
||||||
auf'm auf|'m auf|dem
|
|
||||||
unter'm unter|'m unter|dem
|
|
||||||
über'm über|'m über|dem
|
|
||||||
vor'm vor|'m vor|dem
|
|
||||||
hinter'm hinter|'m hinter|dem
|
|
||||||
|
|
||||||
# persons
|
|
||||||
B.A. B.A.
|
|
||||||
B.Sc. B.Sc.
|
|
||||||
Dipl. Dipl.
|
|
||||||
Dipl.-Ing. Dipl.-Ing.
|
|
||||||
Dr. Dr.
|
|
||||||
Fr. Fr.
|
|
||||||
Frl. Frl.
|
|
||||||
Hr. Hr.
|
|
||||||
Hrn. Hrn.
|
|
||||||
Frl. Frl.
|
|
||||||
Prof. Prof.
|
|
||||||
St. St.
|
|
||||||
Hrgs. Hrgs.
|
|
||||||
Hg. Hg.
|
|
||||||
a.Z. a.Z.
|
|
||||||
a.D. a.D.
|
|
||||||
h.c. h.c.
|
|
||||||
Jr. Jr.
|
|
||||||
jr. jr.
|
|
||||||
jun. jun.
|
|
||||||
sen. sen.
|
|
||||||
rer. rer.
|
|
||||||
Ing. Ing.
|
|
||||||
M.A. M.A.
|
|
||||||
Mr. Mr.
|
|
||||||
M.Sc. M.Sc.
|
|
||||||
nat. nat.
|
|
||||||
phil. phil.
|
|
||||||
|
|
||||||
# companies
|
|
||||||
Co. Co.
|
|
||||||
co. co.
|
|
||||||
Cie. Cie.
|
|
||||||
A.G. A.G.
|
|
||||||
G.m.b.H. G.m.b.H.
|
|
||||||
i.G. i.G.
|
|
||||||
e.V. e.V.
|
|
||||||
|
|
||||||
# popular german abbreviations
|
|
||||||
Abb. Abb.
|
|
||||||
Abk. Abk.
|
|
||||||
Abs. Abs.
|
|
||||||
Abt. Abt.
|
|
||||||
abzgl. abzgl.
|
|
||||||
allg. allg.
|
|
||||||
a.M. a.M.
|
|
||||||
Bd. Bd.
|
|
||||||
betr. betr.
|
|
||||||
Betr. Betr.
|
|
||||||
Biol. Biol.
|
|
||||||
biol. biol.
|
|
||||||
Bf. Bf.
|
|
||||||
Bhf. Bhf.
|
|
||||||
Bsp. Bsp.
|
|
||||||
bspw. bspw.
|
|
||||||
bzgl. bzgl.
|
|
||||||
bzw. bzw.
|
|
||||||
d.h. d.h.
|
|
||||||
dgl. dgl.
|
|
||||||
ebd. ebd.
|
|
||||||
ehem. ehem.
|
|
||||||
eigtl. eigtl.
|
|
||||||
entspr. entspr.
|
|
||||||
erm. erm.
|
|
||||||
ev. ev.
|
|
||||||
evtl. evtl.
|
|
||||||
Fa. Fa.
|
|
||||||
Fam. Fam.
|
|
||||||
geb. geb.
|
|
||||||
Gebr. Gebr.
|
|
||||||
gem. gem.
|
|
||||||
ggf. ggf.
|
|
||||||
ggü. ggü.
|
|
||||||
ggfs. ggfs.
|
|
||||||
gegr. gegr.
|
|
||||||
Hbf. Hbf.
|
|
||||||
Hrsg. Hrsg.
|
|
||||||
hrsg. hrsg.
|
|
||||||
i.A. i.A.
|
|
||||||
i.d.R. i.d.R.
|
|
||||||
inkl. inkl.
|
|
||||||
insb. insb.
|
|
||||||
i.O. i.O.
|
|
||||||
i.Tr. i.Tr.
|
|
||||||
i.V. i.V.
|
|
||||||
jur. jur.
|
|
||||||
kath. kath.
|
|
||||||
K.O. K.O.
|
|
||||||
lt. lt.
|
|
||||||
max. max.
|
|
||||||
m.E. m.E.
|
|
||||||
m.M. m.M.
|
|
||||||
mtl. mtl.
|
|
||||||
min. min.
|
|
||||||
mind. mind.
|
|
||||||
MwSt. MwSt.
|
|
||||||
Nr. Nr.
|
|
||||||
o.a. o.a.
|
|
||||||
o.ä. o.ä.
|
|
||||||
o.Ä. o.Ä.
|
|
||||||
o.g. o.g.
|
|
||||||
o.k. o.k.
|
|
||||||
O.K. O.K.
|
|
||||||
Orig. Orig.
|
|
||||||
orig. orig.
|
|
||||||
pers. pers.
|
|
||||||
Pkt. Pkt.
|
|
||||||
Red. Red.
|
|
||||||
röm. röm.
|
|
||||||
s.o. s.o.
|
|
||||||
sog. sog.
|
|
||||||
std. std.
|
|
||||||
stellv. stellv.
|
|
||||||
Str. Str.
|
|
||||||
tägl. tägl.
|
|
||||||
Tel. Tel.
|
|
||||||
u.a. u.a.
|
|
||||||
usf. usf.
|
|
||||||
u.s.w. u.s.w.
|
|
||||||
usw. usw.
|
|
||||||
u.U. u.U.
|
|
||||||
u.v.m. u.v.m.
|
|
||||||
uvm. uvm.
|
|
||||||
v.a. v.a.
|
|
||||||
vgl. vgl.
|
|
||||||
vllt. vllt.
|
|
||||||
v.l.n.r. v.l.n.r.
|
|
||||||
vlt. vlt.
|
|
||||||
Vol. Vol.
|
|
||||||
wiss. wiss.
|
|
||||||
Univ. Univ.
|
|
||||||
z.B. z.B.
|
|
||||||
z.b. z.b.
|
|
||||||
z.Bsp. z.Bsp.
|
|
||||||
z.T. z.T.
|
|
||||||
z.Z. z.Z.
|
|
||||||
zzgl. zzgl.
|
|
||||||
z.Zt. z.Zt.
|
|
||||||
|
|
||||||
# popular latin abbreviations
|
|
||||||
vs. vs.
|
|
||||||
adv. adv.
|
|
||||||
Chr. Chr.
|
|
||||||
A.C. A.C.
|
|
||||||
A.D. A.D.
|
|
||||||
e.g. e.g.
|
|
||||||
i.e. i.e.
|
|
||||||
al. al.
|
|
||||||
p.a. p.a.
|
|
||||||
P.S. P.S.
|
|
||||||
q.e.d. q.e.d.
|
|
||||||
R.I.P. R.I.P.
|
|
||||||
etc. etc.
|
|
||||||
incl. incl.
|
|
||||||
ca. ca.
|
|
||||||
n.Chr. n.Chr.
|
|
||||||
p.s. p.s.
|
|
||||||
v.Chr. v.Chr.
|
|
||||||
|
|
||||||
# popular english abbreviations
|
|
||||||
D.C. D.C.
|
|
||||||
N.Y. N.Y.
|
|
||||||
N.Y.C. N.Y.C.
|
|
||||||
U.S. U.S.
|
|
||||||
U.S.A. U.S.A.
|
|
||||||
L.A. L.A.
|
|
||||||
U.S.S. U.S.S.
|
|
||||||
|
|
||||||
# dates & time
|
|
||||||
Jan. Jan.
|
|
||||||
Feb. Feb.
|
|
||||||
Mrz. Mrz.
|
|
||||||
Mär. Mär.
|
|
||||||
Apr. Apr.
|
|
||||||
Jun. Jun.
|
|
||||||
Jul. Jul.
|
|
||||||
Aug. Aug.
|
|
||||||
Sep. Sep.
|
|
||||||
Sept. Sept.
|
|
||||||
Okt. Okt.
|
|
||||||
Nov. Nov.
|
|
||||||
Dez. Dez.
|
|
||||||
Mo. Mo.
|
|
||||||
Di. Di.
|
|
||||||
Mi. Mi.
|
|
||||||
Do. Do.
|
|
||||||
Fr. Fr.
|
|
||||||
Sa. Sa.
|
|
||||||
So. So.
|
|
||||||
Std. Std.
|
|
||||||
Jh. Jh.
|
|
||||||
Jhd. Jhd.
|
|
||||||
|
|
||||||
# numbers
|
|
||||||
Tsd. Tsd.
|
|
||||||
Mio. Mio.
|
|
||||||
Mrd. Mrd.
|
|
||||||
|
|
||||||
# countries & languages
|
|
||||||
engl. engl.
|
|
||||||
frz. frz.
|
|
||||||
lat. lat.
|
|
||||||
österr. österr.
|
|
||||||
|
|
||||||
# smileys
|
|
||||||
:) :)
|
|
||||||
<3 <3
|
|
||||||
;) ;)
|
|
||||||
(: (:
|
|
||||||
:( :(
|
|
||||||
-_- -_-
|
|
||||||
=) =)
|
|
||||||
:/ :/
|
|
||||||
:> :>
|
|
||||||
;-) ;-)
|
|
||||||
:Y :Y
|
|
||||||
:P :P
|
|
||||||
:-P :-P
|
|
||||||
:3 :3
|
|
||||||
=3 =3
|
|
||||||
xD xD
|
|
||||||
^_^ ^_^
|
|
||||||
=] =]
|
|
||||||
=D =D
|
|
||||||
<333 <333
|
|
||||||
:)) :))
|
|
||||||
:0 :0
|
|
||||||
-__- -__-
|
|
||||||
xDD xDD
|
|
||||||
o_o o_o
|
|
||||||
o_O o_O
|
|
||||||
V_V V_V
|
|
||||||
=[[ =[[
|
|
||||||
<33 <33
|
|
||||||
;p ;p
|
|
||||||
;D ;D
|
|
||||||
;-p ;-p
|
|
||||||
;( ;(
|
|
||||||
:p :p
|
|
||||||
:] :]
|
|
||||||
:O :O
|
|
||||||
:-/ :-/
|
|
||||||
:-) :-)
|
|
||||||
:((( :(((
|
|
||||||
:(( :((
|
|
||||||
:') :')
|
|
||||||
(^_^) (^_^)
|
|
||||||
(= (=
|
|
||||||
o.O o.O
|
|
||||||
|
|
||||||
# single letters
|
|
||||||
a. a.
|
|
||||||
b. b.
|
|
||||||
c. c.
|
|
||||||
d. d.
|
|
||||||
e. e.
|
|
||||||
f. f.
|
|
||||||
g. g.
|
|
||||||
h. h.
|
|
||||||
i. i.
|
|
||||||
j. j.
|
|
||||||
k. k.
|
|
||||||
l. l.
|
|
||||||
m. m.
|
|
||||||
n. n.
|
|
||||||
o. o.
|
|
||||||
p. p.
|
|
||||||
q. q.
|
|
||||||
r. r.
|
|
||||||
s. s.
|
|
||||||
t. t.
|
|
||||||
u. u.
|
|
||||||
v. v.
|
|
||||||
w. w.
|
|
||||||
x. x.
|
|
||||||
y. y.
|
|
||||||
z. z.
|
|
||||||
ä. ä.
|
|
||||||
ö. ö.
|
|
||||||
ü. ü.
|
|
|
@ -1,194 +0,0 @@
|
||||||
{
|
|
||||||
"Reddit": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "reddit"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"SeptemberElevenAttacks": [
|
|
||||||
"EVENT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"orth": "9/11"}
|
|
||||||
],
|
|
||||||
[
|
|
||||||
{"lower": "september"},
|
|
||||||
{"orth": "11"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Linux": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "linux"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Haskell": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "haskell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"HaskellCurry": [
|
|
||||||
"PERSON",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"lower": "haskell"},
|
|
||||||
{"lower": "curry"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Javascript": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "javascript"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"CSS": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "css"}],
|
|
||||||
[{"lower": "css3"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"displaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "displacy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"spaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "spaCy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
|
|
||||||
"HTML": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "html"}],
|
|
||||||
[{"lower": "html5"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Python": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Python"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ruby": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ruby"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Digg": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "digg"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"FoxNews": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Fox"}],
|
|
||||||
[{"orth": "News"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Google": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "google"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Mac": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "mac"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Wikipedia": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "wikipedia"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Windows": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Windows"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Dell": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "dell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Facebook": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "facebook"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Blizzard": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Blizzard"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ubuntu": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ubuntu"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Youtube": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "youtube"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"false_positives": [
|
|
||||||
null,
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Shit"}],
|
|
||||||
[{"orth": "Weed"}],
|
|
||||||
[{"orth": "Cool"}],
|
|
||||||
[{"orth": "Btw"}],
|
|
||||||
[{"orth": "Bah"}],
|
|
||||||
[{"orth": "Bullshit"}],
|
|
||||||
[{"orth": "Lol"}],
|
|
||||||
[{"orth": "Yo"}, {"lower": "dawg"}],
|
|
||||||
[{"orth": "Yay"}],
|
|
||||||
[{"orth": "Ahh"}],
|
|
||||||
[{"orth": "Yea"}],
|
|
||||||
[{"orth": "Bah"}]
|
|
||||||
]
|
|
||||||
]
|
|
||||||
}
|
|
|
@ -1,334 +0,0 @@
|
||||||
# coding=utf8
|
|
||||||
import json
|
|
||||||
import io
|
|
||||||
import itertools
|
|
||||||
|
|
||||||
contractions = {}
|
|
||||||
|
|
||||||
# contains the lemmas, parts of speech, number, and tenspect of
|
|
||||||
# potential tokens generated after splitting contractions off
|
|
||||||
token_properties = {}
|
|
||||||
|
|
||||||
# contains starting tokens with their potential contractions
|
|
||||||
# each potential contraction has a list of exceptions
|
|
||||||
# lower - don't generate the lowercase version
|
|
||||||
# upper - don't generate the uppercase version
|
|
||||||
# contrLower - don't generate the lowercase version with apostrophe (') removed
|
|
||||||
# contrUpper - don't generate the uppercase version with apostrophe (') removed
|
|
||||||
# for example, we don't want to create the word "hell" or "Hell" from "he" + "'ll" so
|
|
||||||
# we add "contrLower" and "contrUpper" to the exceptions list
|
|
||||||
starting_tokens = {}
|
|
||||||
|
|
||||||
# other specials that don't really have contractions
|
|
||||||
# so they are hardcoded
|
|
||||||
hardcoded_specials = {
|
|
||||||
"''": [{"F": "''"}],
|
|
||||||
"\")": [{"F": "\")"}],
|
|
||||||
"\n": [{"F": "\n", "pos": "SP"}],
|
|
||||||
"\t": [{"F": "\t", "pos": "SP"}],
|
|
||||||
" ": [{"F": " ", "pos": "SP"}],
|
|
||||||
|
|
||||||
# example: Wie geht's?
|
|
||||||
"'s": [{"F": "'s", "L": "es"}],
|
|
||||||
"'S": [{"F": "'S", "L": "es"}],
|
|
||||||
|
|
||||||
# example: Haste mal 'nen Euro?
|
|
||||||
"'n": [{"F": "'n", "L": "ein"}],
|
|
||||||
"'ne": [{"F": "'ne", "L": "eine"}],
|
|
||||||
"'nen": [{"F": "'nen", "L": "einen"}],
|
|
||||||
|
|
||||||
# example: Kommen S’ nur herein!
|
|
||||||
"s'": [{"F": "s'", "L": "sie"}],
|
|
||||||
"S'": [{"F": "S'", "L": "sie"}],
|
|
||||||
|
|
||||||
# example: Da haben wir's!
|
|
||||||
"ich's": [{"F": "ich"}, {"F": "'s", "L": "es"}],
|
|
||||||
"du's": [{"F": "du"}, {"F": "'s", "L": "es"}],
|
|
||||||
"er's": [{"F": "er"}, {"F": "'s", "L": "es"}],
|
|
||||||
"sie's": [{"F": "sie"}, {"F": "'s", "L": "es"}],
|
|
||||||
"wir's": [{"F": "wir"}, {"F": "'s", "L": "es"}],
|
|
||||||
"ihr's": [{"F": "ihr"}, {"F": "'s", "L": "es"}],
|
|
||||||
|
|
||||||
# example: Die katze auf'm dach.
|
|
||||||
"auf'm": [{"F": "auf"}, {"F": "'m", "L": "dem"}],
|
|
||||||
"unter'm": [{"F": "unter"}, {"F": "'m", "L": "dem"}],
|
|
||||||
"über'm": [{"F": "über"}, {"F": "'m", "L": "dem"}],
|
|
||||||
"vor'm": [{"F": "vor"}, {"F": "'m", "L": "dem"}],
|
|
||||||
"hinter'm": [{"F": "hinter"}, {"F": "'m", "L": "dem"}],
|
|
||||||
|
|
||||||
# persons
|
|
||||||
"Fr.": [{"F": "Fr."}],
|
|
||||||
"Hr.": [{"F": "Hr."}],
|
|
||||||
"Frl.": [{"F": "Frl."}],
|
|
||||||
"Prof.": [{"F": "Prof."}],
|
|
||||||
"Dr.": [{"F": "Dr."}],
|
|
||||||
"St.": [{"F": "St."}],
|
|
||||||
"Hrgs.": [{"F": "Hrgs."}],
|
|
||||||
"Hg.": [{"F": "Hg."}],
|
|
||||||
"a.Z.": [{"F": "a.Z."}],
|
|
||||||
"a.D.": [{"F": "a.D."}],
|
|
||||||
"A.D.": [{"F": "A.D."}],
|
|
||||||
"h.c.": [{"F": "h.c."}],
|
|
||||||
"jun.": [{"F": "jun."}],
|
|
||||||
"sen.": [{"F": "sen."}],
|
|
||||||
"rer.": [{"F": "rer."}],
|
|
||||||
"Dipl.": [{"F": "Dipl."}],
|
|
||||||
"Ing.": [{"F": "Ing."}],
|
|
||||||
"Dipl.-Ing.": [{"F": "Dipl.-Ing."}],
|
|
||||||
|
|
||||||
# companies
|
|
||||||
"Co.": [{"F": "Co."}],
|
|
||||||
"co.": [{"F": "co."}],
|
|
||||||
"Cie.": [{"F": "Cie."}],
|
|
||||||
"A.G.": [{"F": "A.G."}],
|
|
||||||
"G.m.b.H.": [{"F": "G.m.b.H."}],
|
|
||||||
"i.G.": [{"F": "i.G."}],
|
|
||||||
"e.V.": [{"F": "e.V."}],
|
|
||||||
|
|
||||||
# popular german abbreviations
|
|
||||||
"ggü.": [{"F": "ggü."}],
|
|
||||||
"ggf.": [{"F": "ggf."}],
|
|
||||||
"ggfs.": [{"F": "ggfs."}],
|
|
||||||
"Gebr.": [{"F": "Gebr."}],
|
|
||||||
"geb.": [{"F": "geb."}],
|
|
||||||
"gegr.": [{"F": "gegr."}],
|
|
||||||
"erm.": [{"F": "erm."}],
|
|
||||||
"engl.": [{"F": "engl."}],
|
|
||||||
"ehem.": [{"F": "ehem."}],
|
|
||||||
"Biol.": [{"F": "Biol."}],
|
|
||||||
"biol.": [{"F": "biol."}],
|
|
||||||
"Abk.": [{"F": "Abk."}],
|
|
||||||
"Abb.": [{"F": "Abb."}],
|
|
||||||
"abzgl.": [{"F": "abzgl."}],
|
|
||||||
"Hbf.": [{"F": "Hbf."}],
|
|
||||||
"Bhf.": [{"F": "Bhf."}],
|
|
||||||
"Bf.": [{"F": "Bf."}],
|
|
||||||
"i.V.": [{"F": "i.V."}],
|
|
||||||
"inkl.": [{"F": "inkl."}],
|
|
||||||
"insb.": [{"F": "insb."}],
|
|
||||||
"z.B.": [{"F": "z.B."}],
|
|
||||||
"i.Tr.": [{"F": "i.Tr."}],
|
|
||||||
"Jhd.": [{"F": "Jhd."}],
|
|
||||||
"jur.": [{"F": "jur."}],
|
|
||||||
"lt.": [{"F": "lt."}],
|
|
||||||
"nat.": [{"F": "nat."}],
|
|
||||||
"u.a.": [{"F": "u.a."}],
|
|
||||||
"u.s.w.": [{"F": "u.s.w."}],
|
|
||||||
"Nr.": [{"F": "Nr."}],
|
|
||||||
"Univ.": [{"F": "Univ."}],
|
|
||||||
"vgl.": [{"F": "vgl."}],
|
|
||||||
"zzgl.": [{"F": "zzgl."}],
|
|
||||||
"z.Z.": [{"F": "z.Z."}],
|
|
||||||
"betr.": [{"F": "betr."}],
|
|
||||||
"ehem.": [{"F": "ehem."}],
|
|
||||||
|
|
||||||
# popular latin abbreviations
|
|
||||||
"vs.": [{"F": "vs."}],
|
|
||||||
"adv.": [{"F": "adv."}],
|
|
||||||
"Chr.": [{"F": "Chr."}],
|
|
||||||
"A.C.": [{"F": "A.C."}],
|
|
||||||
"A.D.": [{"F": "A.D."}],
|
|
||||||
"e.g.": [{"F": "e.g."}],
|
|
||||||
"i.e.": [{"F": "i.e."}],
|
|
||||||
"al.": [{"F": "al."}],
|
|
||||||
"p.a.": [{"F": "p.a."}],
|
|
||||||
"P.S.": [{"F": "P.S."}],
|
|
||||||
"q.e.d.": [{"F": "q.e.d."}],
|
|
||||||
"R.I.P.": [{"F": "R.I.P."}],
|
|
||||||
"etc.": [{"F": "etc."}],
|
|
||||||
"incl.": [{"F": "incl."}],
|
|
||||||
|
|
||||||
# popular english abbreviations
|
|
||||||
"D.C.": [{"F": "D.C."}],
|
|
||||||
"N.Y.": [{"F": "N.Y."}],
|
|
||||||
"N.Y.C.": [{"F": "N.Y.C."}],
|
|
||||||
|
|
||||||
# dates
|
|
||||||
"Jan.": [{"F": "Jan."}],
|
|
||||||
"Feb.": [{"F": "Feb."}],
|
|
||||||
"Mrz.": [{"F": "Mrz."}],
|
|
||||||
"Mär.": [{"F": "Mär."}],
|
|
||||||
"Apr.": [{"F": "Apr."}],
|
|
||||||
"Jun.": [{"F": "Jun."}],
|
|
||||||
"Jul.": [{"F": "Jul."}],
|
|
||||||
"Aug.": [{"F": "Aug."}],
|
|
||||||
"Sep.": [{"F": "Sep."}],
|
|
||||||
"Sept.": [{"F": "Sept."}],
|
|
||||||
"Okt.": [{"F": "Okt."}],
|
|
||||||
"Nov.": [{"F": "Nov."}],
|
|
||||||
"Dez.": [{"F": "Dez."}],
|
|
||||||
"Mo.": [{"F": "Mo."}],
|
|
||||||
"Di.": [{"F": "Di."}],
|
|
||||||
"Mi.": [{"F": "Mi."}],
|
|
||||||
"Do.": [{"F": "Do."}],
|
|
||||||
"Fr.": [{"F": "Fr."}],
|
|
||||||
"Sa.": [{"F": "Sa."}],
|
|
||||||
"So.": [{"F": "So."}],
|
|
||||||
|
|
||||||
# smileys
|
|
||||||
":)": [{"F": ":)"}],
|
|
||||||
"<3": [{"F": "<3"}],
|
|
||||||
";)": [{"F": ";)"}],
|
|
||||||
"(:": [{"F": "(:"}],
|
|
||||||
":(": [{"F": ":("}],
|
|
||||||
"-_-": [{"F": "-_-"}],
|
|
||||||
"=)": [{"F": "=)"}],
|
|
||||||
":/": [{"F": ":/"}],
|
|
||||||
":>": [{"F": ":>"}],
|
|
||||||
";-)": [{"F": ";-)"}],
|
|
||||||
":Y": [{"F": ":Y"}],
|
|
||||||
":P": [{"F": ":P"}],
|
|
||||||
":-P": [{"F": ":-P"}],
|
|
||||||
":3": [{"F": ":3"}],
|
|
||||||
"=3": [{"F": "=3"}],
|
|
||||||
"xD": [{"F": "xD"}],
|
|
||||||
"^_^": [{"F": "^_^"}],
|
|
||||||
"=]": [{"F": "=]"}],
|
|
||||||
"=D": [{"F": "=D"}],
|
|
||||||
"<333": [{"F": "<333"}],
|
|
||||||
":))": [{"F": ":))"}],
|
|
||||||
":0": [{"F": ":0"}],
|
|
||||||
"-__-": [{"F": "-__-"}],
|
|
||||||
"xDD": [{"F": "xDD"}],
|
|
||||||
"o_o": [{"F": "o_o"}],
|
|
||||||
"o_O": [{"F": "o_O"}],
|
|
||||||
"V_V": [{"F": "V_V"}],
|
|
||||||
"=[[": [{"F": "=[["}],
|
|
||||||
"<33": [{"F": "<33"}],
|
|
||||||
";p": [{"F": ";p"}],
|
|
||||||
";D": [{"F": ";D"}],
|
|
||||||
";-p": [{"F": ";-p"}],
|
|
||||||
";(": [{"F": ";("}],
|
|
||||||
":p": [{"F": ":p"}],
|
|
||||||
":]": [{"F": ":]"}],
|
|
||||||
":O": [{"F": ":O"}],
|
|
||||||
":-/": [{"F": ":-/"}],
|
|
||||||
":-)": [{"F": ":-)"}],
|
|
||||||
":(((": [{"F": ":((("}],
|
|
||||||
":((": [{"F": ":(("}],
|
|
||||||
":')": [{"F": ":')"}],
|
|
||||||
"(^_^)": [{"F": "(^_^)"}],
|
|
||||||
"(=": [{"F": "(="}],
|
|
||||||
"o.O": [{"F": "o.O"}],
|
|
||||||
|
|
||||||
"a.": [{"F": "a."}],
|
|
||||||
"b.": [{"F": "b."}],
|
|
||||||
"c.": [{"F": "c."}],
|
|
||||||
"d.": [{"F": "d."}],
|
|
||||||
"e.": [{"F": "e."}],
|
|
||||||
"f.": [{"F": "f."}],
|
|
||||||
"g.": [{"F": "g."}],
|
|
||||||
"h.": [{"F": "h."}],
|
|
||||||
"i.": [{"F": "i."}],
|
|
||||||
"j.": [{"F": "j."}],
|
|
||||||
"k.": [{"F": "k."}],
|
|
||||||
"l.": [{"F": "l."}],
|
|
||||||
"m.": [{"F": "m."}],
|
|
||||||
"n.": [{"F": "n."}],
|
|
||||||
"o.": [{"F": "o."}],
|
|
||||||
"p.": [{"F": "p."}],
|
|
||||||
"q.": [{"F": "q."}],
|
|
||||||
"r.": [{"F": "r."}],
|
|
||||||
"s.": [{"F": "s."}],
|
|
||||||
"t.": [{"F": "t."}],
|
|
||||||
"u.": [{"F": "u."}],
|
|
||||||
"v.": [{"F": "v."}],
|
|
||||||
"w.": [{"F": "w."}],
|
|
||||||
"x.": [{"F": "x."}],
|
|
||||||
"y.": [{"F": "y."}],
|
|
||||||
"z.": [{"F": "z."}],
|
|
||||||
}
|
|
||||||
|
|
||||||
def get_double_contractions(ending):
    endings = []

    ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])

    while ends_with_contraction:
        for contraction in contractions:
            if ending.endswith(contraction):
                endings.append(contraction)
                # slice the suffix off; str.rstrip() strips a set of characters, not a suffix
                ending = ending[:-len(contraction)]
        ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])

    endings.reverse() # reverse because the last ending is put in the list first
    return endings
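The suffix handling in `get_double_contractions` can be illustrated in isolation. The sketch below is not part of the commit: `strip_suffix` and `split_endings` are hypothetical helper names, and the contraction set is passed in as a parameter rather than read from the module-level `contractions`. It shows why slicing off a matched suffix is safer than `str.rstrip`, which removes any trailing characters drawn from its argument.

```python
def strip_suffix(text, suffix):
    """Remove `suffix` from the end of `text` exactly once, if present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


def split_endings(ending, contractions):
    """Split a compound ending such as "n't've" into its contraction parts."""
    parts = []
    # Try longer contractions first so the longest matching suffix wins.
    ordered = sorted(contractions, key=len, reverse=True)
    changed = True
    while ending and changed:
        changed = False
        for contraction in ordered:
            if ending.endswith(contraction):
                parts.append(contraction)
                ending = strip_suffix(ending, contraction)
                changed = True
                break
    parts.reverse()  # parts were collected right-to-left
    return parts


# rstrip treats its argument as a character set, so it removes too much here:
over_stripped = "cannot".rstrip("not")      # "ca"
exact = strip_suffix("cannot", "not")       # "can"

double = split_endings("n't've", {"n't", "'ve", "'d"})
```

For the endings that appear in the data (`n't've`, `'d've`), both approaches happen to give the same result; the difference only shows up when the text before the suffix ends in one of the suffix's characters.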
|
|
||||||
|
|
||||||
def get_token_properties(token, capitalize=False, remove_contractions=False):
    props = dict(token_properties.get(token)) # ensure we copy the dict so we can add the "F" prop
    if capitalize:
        token = token.capitalize()
    if remove_contractions:
        token = token.replace("'", "")

    props["F"] = token
    return props
|
|
||||||
|
|
||||||
|
|
||||||
def create_entry(token, endings, capitalize=False, remove_contractions=False):
    properties = []
    properties.append(get_token_properties(token, capitalize=capitalize, remove_contractions=remove_contractions))
    for e in endings:
        properties.append(get_token_properties(e, remove_contractions=remove_contractions))
    return properties
|
|
||||||
|
|
||||||
|
|
||||||
FIELDNAMES = ['F','L','pos']
def read_hardcoded(stream):
    hc_specials = {}
    for line in stream:
        line = line.strip()
        if line.startswith('#') or not line:
            continue
        key,_,rest = line.partition('\t')
        values = []
        for annotation in zip(*[ e.split('|') for e in rest.split('\t') ]):
            values.append({ k:v for k,v in itertools.izip_longest(FIELDNAMES,annotation) if v })
        hc_specials[key] = values
    return hc_specials
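`read_hardcoded` expects one special case per line: a key, then tab-separated columns in the order given by `FIELDNAMES`, with `|` separating the values for each sub-token. The following is a minimal Python 3 sketch of the same parsing (the original uses the Python 2 name `itertools.izip_longest`). The sample lines are invented for illustration and are not taken from `abbrev.de.tab`.

```python
import io
from itertools import zip_longest

FIELDNAMES = ["F", "L", "pos"]


def read_hardcoded(stream):
    """Parse 'key<TAB>F|F...<TAB>L|L...<TAB>pos|pos...' lines into a dict."""
    specials = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, rest = line.partition("\t")
        # One list per column, each holding the per-sub-token values.
        columns = [column.split("|") for column in rest.split("\t")]
        values = []
        # zip(*columns) regroups the values by sub-token.
        for annotation in zip(*columns):
            values.append({k: v for k, v in zip_longest(FIELDNAMES, annotation) if v})
        specials[key] = values
    return specials


# Hypothetical input, assumed to follow the format described above.
sample = io.StringIO(
    "# comment\n"
    "\n"
    "z.B.\tz.B.\tzum Beispiel\n"
    "auf'm\tauf|'m\tauf|dem\n"
)
result = read_hardcoded(sample)
```

Columns missing from a line (here `pos`) are padded with `None` by `zip_longest` and then dropped by the `if v` filter.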
|
|
||||||
|
|
||||||
|
|
||||||
def generate_specials():

    specials = {}

    for token in starting_tokens:
        possible_endings = starting_tokens[token]
        for ending in possible_endings:

            endings = []
            if ending.count("'") > 1:
                endings.extend(get_double_contractions(ending))
            else:
                endings.append(ending)

            exceptions = possible_endings[ending]

            if "lower" not in exceptions:
                special = token + ending
                specials[special] = create_entry(token, endings)

            if "upper" not in exceptions:
                special = token.capitalize() + ending
                specials[special] = create_entry(token, endings, capitalize=True)

            if "contrLower" not in exceptions:
                special = token + ending.replace("'", "")
                specials[special] = create_entry(token, endings, remove_contractions=True)

            if "contrUpper" not in exceptions:
                special = token.capitalize() + ending.replace("'", "")
                specials[special] = create_entry(token, endings, capitalize=True, remove_contractions=True)

    # add in hardcoded specials
    # changed it so it generates them from a file
    with io.open('abbrev.de.tab','r',encoding='utf8') as abbrev_:
        hc_specials = read_hardcoded(abbrev_)
    specials = dict(specials, **hc_specials)

    return specials
|
|
||||||
|
|
||||||
if __name__ == "__main__":
    specials = generate_specials()
    with open("specials.json", "w") as f:
        json.dump(specials, f, sort_keys=True, indent=4, separators=(',', ': '))
|
|
|
@ -1,6 +0,0 @@
|
||||||
\.\.\.
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zöäüßA-ZÖÄÜ"]):(?=[a-zöäüßA-ZÖÄÜ])
|
|
||||||
(?<=[a-zöäüßA-ZÖÄÜ"])>(?=[a-zöäüßA-ZÖÄÜ])
|
|
||||||
(?<=[a-zöäüßA-ZÖÄÜ"])<(?=[a-zöäüßA-ZÖÄÜ])
|
|
||||||
(?<=[a-zöäüßA-ZÖÄÜ"])=(?=[a-zöäüßA-ZÖÄÜ])
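The lookbehind/lookahead patterns above each match a single punctuation character only when it sits between letters. A rough sketch of how such patterns could be applied to split a string is shown below. It assumes the patterns act as infix rules combined by alternation; the file name is not visible in this diff, and `split_on_infixes` is a hypothetical helper, not spaCy's tokenizer.

```python
import re

# Three of the patterns listed above, joined into one alternation.
INFIX_PATTERNS = [
    r"\.\.\.",
    r"(?<=[a-z])\.(?=[A-Z])",
    r'(?<=[a-zöäüßA-ZÖÄÜ"]):(?=[a-zöäüßA-ZÖÄÜ])',
]
infix_re = re.compile("|".join(INFIX_PATTERNS))


def split_on_infixes(text):
    """Split `text` around every infix match, keeping the matched infix."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


joined_sentences = split_on_infixes("Ende.Anfang")
abbreviation = split_on_infixes("usw.")
```

Because the period pattern requires a lowercase letter before and an uppercase letter after, a trailing period as in `usw.` is left attached.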
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,71 +0,0 @@
|
||||||
{
|
|
||||||
"PRP": {
|
|
||||||
"ich": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 1},
|
|
||||||
"meiner": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 2},
|
|
||||||
"mir": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 3},
|
|
||||||
"mich": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 4},
|
|
||||||
"du": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"deiner": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"dir": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"dich": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"er": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 1},
|
|
||||||
"seiner": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 2},
|
|
||||||
"ihm": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 3},
|
|
||||||
"ihn": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 4},
|
|
||||||
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 1},
|
|
||||||
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 2},
|
|
||||||
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 3},
|
|
||||||
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 4},
|
|
||||||
"es": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 1},
|
|
||||||
"seiner": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 2},
|
|
||||||
"ihm": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 3},
|
|
||||||
"es": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 4},
|
|
||||||
"wir": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"unser": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"uns": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"uns": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"ihr": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"euer": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"euch": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"euch": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"ihnen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 4}
|
|
||||||
},
|
|
||||||
|
|
||||||
"PRP$": {
|
|
||||||
"mein": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"meines": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"meinem": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"meinen": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"dein": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"deines": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"deinem": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"deinen": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"sein": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 1},
|
|
||||||
"seines": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 2},
|
|
||||||
"seinem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 3},
|
|
||||||
"seinen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 4},
|
|
||||||
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 1},
|
|
||||||
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 2},
|
|
||||||
"ihrem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 3},
|
|
||||||
"ihren": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 4},
|
|
||||||
"sein": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 1},
|
|
||||||
"seines": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 2},
|
|
||||||
"seinem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 3},
|
|
||||||
"seinen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 4},
|
|
||||||
"unser": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"unseres": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"unserem": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"unseren": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"euer": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"eures": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"eurem": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"euren": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
|
|
||||||
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 1},
|
|
||||||
"ihres": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 2},
|
|
||||||
"ihrem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 3},
|
|
||||||
"ihren": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 4}
|
|
||||||
}
|
|
||||||
}
|
|
|
@ -1,27 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
>
|
|
||||||
$
|
|
||||||
£
|
|
||||||
„
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
||||||
‚
|
|
||||||
»
|
|
||||||
_
|
|
||||||
§
|
|
|
@ -1,3 +0,0 @@
|
||||||
Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.
|
|
||||||
|
|
||||||
Mit Biografie: Ein Spiel wandte sich Frisch von der Parabelform seiner Erfolgsstücke Biedermann und die Brandstifter und Andorra ab und postulierte eine „Dramaturgie der Permutation“. Darin sollte nicht, wie im klassischen Theater, Sinn und Schicksal im Mittelpunkt stehen, sondern die Zufälligkeit von Ereignissen und die Möglichkeit ihrer Variation. Dennoch handelt Biografie: Ein Spiel gerade von der Unmöglichkeit seines Protagonisten, seinen Lebenslauf grundlegend zu verändern. Frisch empfand die Wirkung des Stücks im Nachhinein als zu fatalistisch und die Umsetzung seiner theoretischen Absichten als nicht geglückt. Obwohl das Stück 1968 als unpolitisch und nicht zeitgemäß kritisiert wurde und auch später eine geteilte Rezeption erfuhr, gehört es an deutschsprachigen Bühnen zu den häufiger aufgeführten Stücken Frischs.
|
|
File diff suppressed because it is too large
|
@ -1,73 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
“
|
|
||||||
«
|
|
||||||
_
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
‘
|
|
||||||
°
|
|
||||||
€
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-zäöüßÖÄÜ)\]"'´«‘’%\)²“”])\.
|
|
||||||
\-\-
|
|
||||||
´
|
|
||||||
(?<=[0-9])km²
|
|
||||||
(?<=[0-9])m²
|
|
||||||
(?<=[0-9])cm²
|
|
||||||
(?<=[0-9])mm²
|
|
||||||
(?<=[0-9])km³
|
|
||||||
(?<=[0-9])m³
|
|
||||||
(?<=[0-9])cm³
|
|
||||||
(?<=[0-9])mm³
|
|
||||||
(?<=[0-9])ha
|
|
||||||
(?<=[0-9])km
|
|
||||||
(?<=[0-9])m
|
|
||||||
(?<=[0-9])cm
|
|
||||||
(?<=[0-9])mm
|
|
||||||
(?<=[0-9])µm
|
|
||||||
(?<=[0-9])nm
|
|
||||||
(?<=[0-9])yd
|
|
||||||
(?<=[0-9])in
|
|
||||||
(?<=[0-9])ft
|
|
||||||
(?<=[0-9])kg
|
|
||||||
(?<=[0-9])g
|
|
||||||
(?<=[0-9])mg
|
|
||||||
(?<=[0-9])µg
|
|
||||||
(?<=[0-9])t
|
|
||||||
(?<=[0-9])lb
|
|
||||||
(?<=[0-9])oz
|
|
||||||
(?<=[0-9])m/s
|
|
||||||
(?<=[0-9])km/h
|
|
||||||
(?<=[0-9])mph
|
|
||||||
(?<=[0-9])°C
|
|
||||||
(?<=[0-9])°K
|
|
||||||
(?<=[0-9])°F
|
|
||||||
(?<=[0-9])hPa
|
|
||||||
(?<=[0-9])Pa
|
|
||||||
(?<=[0-9])mbar
|
|
||||||
(?<=[0-9])mb
|
|
||||||
(?<=[0-9])T
|
|
||||||
(?<=[0-9])G
|
|
||||||
(?<=[0-9])M
|
|
||||||
(?<=[0-9])K
|
|
||||||
(?<=[0-9])kb
|
|
|
@ -1,59 +0,0 @@
|
||||||
{
|
|
||||||
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
|
|
||||||
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
|
|
||||||
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
|
|
||||||
"ADJA": {"pos": "ADJ"},
|
|
||||||
"ADJD": {"pos": "ADJ", "Variant": "Short"},
|
|
||||||
"ADV": {"pos": "ADV"},
|
|
||||||
"APPO": {"pos": "ADP", "AdpType": "Post"},
|
|
||||||
"APPR": {"pos": "ADP", "AdpType": "Prep"},
|
|
||||||
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
|
|
||||||
"APZR": {"pos": "ADP", "AdpType": "Circ"},
|
|
||||||
"ART": {"pos": "DET", "PronType": "Art"},
|
|
||||||
"CARD": {"pos": "NUM", "NumType": "Card"},
|
|
||||||
"FM": {"pos": "X", "Foreign": "Yes"},
|
|
||||||
"ITJ": {"pos": "INTJ"},
|
|
||||||
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
|
|
||||||
"KON": {"pos": "CONJ"},
|
|
||||||
"KOUI": {"pos": "SCONJ"},
|
|
||||||
"KOUS": {"pos": "SCONJ"},
|
|
||||||
"NE": {"pos": "PROPN"},
|
|
||||||
"NNE": {"pos": "PROPN"},
|
|
||||||
"NN": {"pos": "NOUN"},
|
|
||||||
"PAV": {"pos": "ADV", "PronType": "Dem"},
|
|
||||||
"PROAV": {"pos": "ADV", "PronType": "Dem"},
|
|
||||||
"PDAT": {"pos": "DET", "PronType": "Dem"},
|
|
||||||
"PDS": {"pos": "PRON", "PronType": "Dem"},
|
|
||||||
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PPER": {"pos": "PRON", "PronType": "Prs"},
|
|
||||||
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PRELAT": {"pos": "DET", "PronType": "Rel"},
|
|
||||||
"PRELS": {"pos": "PRON", "PronType": "Rel"},
|
|
||||||
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
|
|
||||||
"PTKA": {"pos": "PART"},
|
|
||||||
"PTKANT": {"pos": "PART", "PartType": "Res"},
|
|
||||||
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
|
|
||||||
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
|
|
||||||
"PTKZU": {"pos": "PART", "PartType": "Inf"},
|
|
||||||
"PWAT": {"pos": "DET", "PronType": "Int"},
|
|
||||||
"PWAV": {"pos": "ADV", "PronType": "Int"},
|
|
||||||
"PWS": {"pos": "PRON", "PronType": "Int"},
|
|
||||||
"TRUNC": {"pos": "X", "Hyph": "Yes"},
|
|
||||||
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
|
|
||||||
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
|
|
||||||
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
|
|
||||||
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
|
|
||||||
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"XY": {"pos": "X"},
|
|
||||||
"SP": {"pos": "SPACE"}
|
|
||||||
}
|
|
|
@ -1,20 +0,0 @@
|
||||||
WordNet Release 3.0 This software and database is being provided to you, the
|
|
||||||
LICENSEE, by Princeton University under the following license. By obtaining,
|
|
||||||
using and/or copying this software and database, you agree that you have read,
|
|
||||||
understood, and will comply with these terms and conditions.: Permission to
|
|
||||||
use, copy, modify and distribute this software and database and its
|
|
||||||
documentation for any purpose and without fee or royalty is hereby granted,
|
|
||||||
provided that you agree to comply with the following copyright notice and
|
|
||||||
statements, including the disclaimer, and that the same appear on ALL copies of
|
|
||||||
the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton
|
|
||||||
University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS"
|
|
||||||
AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
|
|
||||||
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO
|
|
||||||
REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY
|
|
||||||
PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR
|
|
||||||
DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS
|
|
||||||
OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used
|
|
||||||
in advertising or publicity pertaining to distribution of the software and/or
|
|
||||||
database. Title to copyright in this software, database and any associated
|
|
||||||
documentation shall at all times remain with Princeton University and LICENSEE
|
|
||||||
agrees to preserve same.
|
|
|
@ -1,194 +0,0 @@
|
||||||
{
|
|
||||||
"Reddit": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "reddit"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"SeptemberElevenAttacks": [
|
|
||||||
"EVENT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"orth": "9/11"}
|
|
||||||
],
|
|
||||||
[
|
|
||||||
{"lower": "september"},
|
|
||||||
{"orth": "11"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Linux": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "linux"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Haskell": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "haskell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"HaskellCurry": [
|
|
||||||
"PERSON",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"lower": "haskell"},
|
|
||||||
{"lower": "curry"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Javascript": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "javascript"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"CSS": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "css"}],
|
|
||||||
[{"lower": "css3"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"displaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "displacy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"spaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "spaCy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
|
|
||||||
"HTML": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "html"}],
|
|
||||||
[{"lower": "html5"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Python": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Python"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ruby": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ruby"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Digg": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "digg"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"FoxNews": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Fox"}],
|
|
||||||
[{"orth": "News"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Google": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "google"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Mac": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "mac"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Wikipedia": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "wikipedia"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Windows": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Windows"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Dell": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "dell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Facebook": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "facebook"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Blizzard": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Blizzard"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ubuntu": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ubuntu"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Youtube": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "youtube"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"false_positives": [
|
|
||||||
null,
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Shit"}],
|
|
||||||
[{"orth": "Weed"}],
|
|
||||||
[{"orth": "Cool"}],
|
|
||||||
[{"orth": "Btw"}],
|
|
||||||
[{"orth": "Bah"}],
|
|
||||||
[{"orth": "Bullshit"}],
|
|
||||||
[{"orth": "Lol"}],
|
|
||||||
[{"orth": "Yo"}, {"lower": "dawg"}],
|
|
||||||
[{"orth": "Yay"}],
|
|
||||||
[{"orth": "Ahh"}],
|
|
||||||
[{"orth": "Yea"}],
|
|
||||||
[{"orth": "Bah"}]
|
|
||||||
]
|
|
||||||
]
|
|
||||||
}
|
|
|
@ -1,422 +0,0 @@
|
||||||
# -*- coding: utf-8 -*-
|
|
||||||
import json
|
|
||||||
|
|
||||||
contractions = {"n't", "'nt", "not", "'ve", "'d", "'ll", "'s", "'m", "'ma", "'re"}
|
|
||||||
|
|
||||||
# contains the lemmas, parts of speech, number, and tenspect of
|
|
||||||
# potential tokens generated after splitting contractions off
|
|
||||||
token_properties = {
|
|
||||||
|
|
||||||
"ai": {"L": "be", "pos": "VBP", "number": 2},
|
|
||||||
"are": {"L": "be", "pos": "VBP", "number": 2},
|
|
||||||
"ca": {"L": "can", "pos": "MD"},
|
|
||||||
"can": {"L": "can", "pos": "MD"},
|
|
||||||
"could": {"pos": "MD", "L": "could"},
|
|
||||||
"'d": {"L": "would", "pos": "MD"},
|
|
||||||
"did": {"L": "do", "pos": "VBD"},
|
|
||||||
"do": {"L": "do"},
|
|
||||||
"does": {"L": "do", "pos": "VBZ"},
|
|
||||||
"had": {"L": "have", "pos": "VBD"},
|
|
||||||
"has": {"L": "have", "pos": "VBZ"},
|
|
||||||
"have": {"pos": "VB"},
|
|
||||||
"he": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"how": {},
|
|
||||||
"i": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"is": {"L": "be", "pos": "VBZ"},
|
|
||||||
"it": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"'ll": {"L": "will", "pos": "MD"},
|
|
||||||
"'m": {"L": "be", "pos": "VBP", "number": 1, "tenspect": 1},
|
|
||||||
"'ma": {},
|
|
||||||
"might": {},
|
|
||||||
"must": {},
|
|
||||||
"need": {},
|
|
||||||
"not": {"L": "not", "pos": "RB"},
|
|
||||||
"'nt": {"L": "not", "pos": "RB"},
|
|
||||||
"n't": {"L": "not", "pos": "RB"},
|
|
||||||
"'re": {"L": "be", "pos": "VBZ"},
|
|
||||||
"'s": {}, # no POS or lemma for s?
|
|
||||||
"sha": {"L": "shall", "pos": "MD"},
|
|
||||||
"she": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"should": {},
|
|
||||||
"that": {},
|
|
||||||
"there": {},
|
|
||||||
"they": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"was": {},
|
|
||||||
"we": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"were": {},
|
|
||||||
"what": {},
|
|
||||||
"when": {},
|
|
||||||
"where": {},
|
|
||||||
"who": {},
|
|
||||||
"why": {},
|
|
||||||
"wo": {},
|
|
||||||
"would": {},
|
|
||||||
"you": {"L": "-PRON-", "pos": "PRP"},
|
|
||||||
"'ve": {"L": "have", "pos": "VB"}
|
|
||||||
}
|
|
||||||
|
|
||||||
# contains starting tokens with their potential contractions
|
|
||||||
# each potential contraction has a list of exceptions
|
|
||||||
# lower - don't generate the lowercase version
|
|
||||||
# upper - don't generate the uppercase version
|
|
||||||
# contrLower - don't generate the lowercase version with apostrophe (') removed
|
|
||||||
# contrUpper - don't generate the uppercase version with apostrophe (') removed
|
|
||||||
# for example, we don't want to create the word "hell" or "Hell" from "he" + "'ll" so
|
|
||||||
# we add "contrLower" and "contrUpper" to the exceptions list
|
|
||||||
starting_tokens = {
|
|
||||||
|
|
||||||
"ai": {"n't": []},
|
|
||||||
"are": {"n't": []},
|
|
||||||
"ca": {"n't": []},
|
|
||||||
"can": {"not": []},
|
|
||||||
"could": {"'ve": [], "n't": [], "n't've": []},
|
|
||||||
"did": {"n't": []},
|
|
||||||
"does": {"n't": []},
|
|
||||||
"do": {"n't": []},
|
|
||||||
"had": {"n't": [], "n't've": []},
|
|
||||||
"has": {"n't": []},
|
|
||||||
"have": {"n't": []},
|
|
||||||
"he": {"'d": [], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'s": []},
|
|
||||||
"how": {"'d": [], "'ll": [], "'s": []},
|
|
||||||
"i": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'m": [], "'ma": [], "'ve": []},
|
|
||||||
"is": {"n't": []},
|
|
||||||
"it": {"'d": [], "'d've": [], "'ll": [], "'s": ["contrLower", "contrUpper"]},
|
|
||||||
"might": {"n't": [], "n't've": [], "'ve": []},
|
|
||||||
"must": {"n't": [], "'ve": []},
|
|
||||||
"need": {"n't": []},
|
|
||||||
"not": {"'ve": []},
|
|
||||||
"sha": {"n't": []},
|
|
||||||
"she": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'s": []},
|
|
||||||
"should": {"'ve": [], "n't": [], "n't've": []},
|
|
||||||
"that": {"'s": []},
|
|
||||||
"there": {"'d": [], "'d've": [], "'s": ["contrLower", "contrUpper"], "'ll": []},
|
|
||||||
"they": {"'d": [], "'d've": [], "'ll": [], "'re": [], "'ve": []},
|
|
||||||
"was": {"n't": []},
|
|
||||||
"we": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'re": ["contrLower", "contrUpper"], "'ve": []},
|
|
||||||
"were": {"n't": []},
|
|
||||||
"what": {"'ll": [], "'re": [], "'s": [], "'ve": []},
|
|
||||||
"when": {"'s": []},
|
|
||||||
"where": {"'d": [], "'s": [], "'ve": []},
|
|
||||||
"who": {"'d": [], "'ll": [], "'re": ["contrLower", "contrUpper"], "'s": [], "'ve": []},
|
|
||||||
"why": {"'ll": [], "'re": [], "'s": []},
|
|
||||||
"wo": {"n't": []},
|
|
||||||
"would": {"'ve": [], "n't": [], "n't've": []},
|
|
||||||
"you": {"'d": [], "'d've": [], "'ll": [], "'re": [], "'ve": []}
|
|
||||||
|
|
||||||
}
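The four exception flags described in the comments before `starting_tokens` control which surface forms are generated for each token/ending pair. A small sketch of that rule, using a hypothetical helper `surface_variants` that only builds the strings and does not create the token property entries:

```python
def surface_variants(token, ending, exceptions):
    """Return the surface forms to generate for token + ending.

    Each flag in `exceptions` suppresses one of the four variants.
    """
    variants = []
    if "lower" not in exceptions:
        variants.append(token + ending)
    if "upper" not in exceptions:
        variants.append(token.capitalize() + ending)
    if "contrLower" not in exceptions:
        variants.append(token + ending.replace("'", ""))
    if "contrUpper" not in exceptions:
        variants.append(token.capitalize() + ending.replace("'", ""))
    return variants


# "he" + "'ll" must not yield "hell"/"Hell", so both apostrophe-free
# forms are suppressed, as in the table above.
he_will = surface_variants("he", "'ll", ["contrLower", "contrUpper"])
do_not = surface_variants("do", "n't", [])
```

With an empty exception list all four forms are produced, including the apostrophe-free spellings such as `dont`.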
|
|
||||||
|
|
||||||
# other specials that don't really have contractions
|
|
||||||
# so they are hardcoded
|
|
||||||
hardcoded_specials = {
|
|
||||||
"let's": [{"F": "let"}, {"F": "'s", "L": "us"}],
|
|
||||||
"Let's": [{"F": "Let"}, {"F": "'s", "L": "us"}],
|
|
||||||
|
|
||||||
"'s": [{"F": "'s", "L": "'s"}],
|
|
||||||
|
|
||||||
"'S": [{"F": "'S", "L": "'s"}],
|
|
||||||
u"\u2018s": [{"F": u"\u2018s", "L": "'s"}],
|
|
||||||
u"\u2018S": [{"F": u"\u2018S", "L": "'s"}],
|
|
||||||
|
|
||||||
"'em": [{"F": "'em"}],
|
|
||||||
|
|
||||||
"'ol": [{"F": "'ol"}],
|
|
||||||
|
|
||||||
"vs.": [{"F": "vs."}],
|
|
||||||
|
|
||||||
"Ms.": [{"F": "Ms."}],
|
|
||||||
"Mr.": [{"F": "Mr."}],
|
|
||||||
"Dr.": [{"F": "Dr."}],
|
|
||||||
"Mrs.": [{"F": "Mrs."}],
|
|
||||||
"Messrs.": [{"F": "Messrs."}],
|
|
||||||
"Gov.": [{"F": "Gov."}],
|
|
||||||
"Gen.": [{"F": "Gen."}],
|
|
||||||
|
|
||||||
"Mt.": [{"F": "Mt.", "L": "Mount"}],
|
|
||||||
|
|
||||||
"''": [{"F": "''"}],
|
|
||||||
|
|
||||||
"—": [{"F": "—", "L": "--", "pos": ":"}],
|
|
||||||
|
|
||||||
"Corp.": [{"F": "Corp."}],
|
|
||||||
"Inc.": [{"F": "Inc."}],
|
|
||||||
"Co.": [{"F": "Co."}],
|
|
||||||
"co.": [{"F": "co."}],
|
|
||||||
"Ltd.": [{"F": "Ltd."}],
|
|
||||||
"Bros.": [{"F": "Bros."}],
|
|
||||||
|
|
||||||
"Rep.": [{"F": "Rep."}],
|
|
||||||
"Sen.": [{"F": "Sen."}],
|
|
||||||
"Jr.": [{"F": "Jr."}],
|
|
||||||
"Rev.": [{"F": "Rev."}],
|
|
||||||
"Adm.": [{"F": "Adm."}],
|
|
||||||
"St.": [{"F": "St."}],
|
|
||||||
|
|
||||||
"a.m.": [{"F": "a.m."}],
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
|
|
||||||
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
|
|
||||||
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
|
|
||||||
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
|
|
||||||
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
|
|
||||||
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
|
|
||||||
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
|
|
||||||
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
|
|
||||||
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
|
|
||||||
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
|
|
||||||
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
|
|
||||||
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
|
|
||||||
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
|
|
||||||
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
|
|
||||||
|
|
||||||
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
|
|
||||||
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
|
|
||||||
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
|
|
||||||
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
|
|
||||||
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
|
|
||||||
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
|
|
||||||
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
|
|
||||||
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
|
|
||||||
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
|
|
||||||
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
|
|
||||||
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
|
|
||||||
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
|
|
||||||
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
|
|
||||||
"Jan.": [{"F": "Jan."}],
|
|
||||||
"Feb.": [{"F": "Feb."}],
|
|
||||||
"Mar.": [{"F": "Mar."}],
|
|
||||||
"Apr.": [{"F": "Apr."}],
|
|
||||||
"May.": [{"F": "May."}],
|
|
||||||
"Jun.": [{"F": "Jun."}],
|
|
||||||
"Jul.": [{"F": "Jul."}],
|
|
||||||
"Aug.": [{"F": "Aug."}],
|
|
||||||
"Sep.": [{"F": "Sep."}],
|
|
||||||
"Sept.": [{"F": "Sept."}],
|
|
||||||
"Oct.": [{"F": "Oct."}],
|
|
||||||
"Nov.": [{"F": "Nov."}],
|
|
||||||
"Dec.": [{"F": "Dec."}],
|
|
||||||
|
|
||||||
"Ala.": [{"F": "Ala."}],
|
|
||||||
"Ariz.": [{"F": "Ariz."}],
|
|
||||||
"Ark.": [{"F": "Ark."}],
|
|
||||||
"Calif.": [{"F": "Calif."}],
|
|
||||||
"Colo.": [{"F": "Colo."}],
|
|
||||||
"Conn.": [{"F": "Conn."}],
|
|
||||||
"Del.": [{"F": "Del."}],
|
|
||||||
"D.C.": [{"F": "D.C."}],
|
|
||||||
"Fla.": [{"F": "Fla."}],
|
|
||||||
"Ga.": [{"F": "Ga."}],
|
|
||||||
"Ill.": [{"F": "Ill."}],
|
|
||||||
"Ind.": [{"F": "Ind."}],
|
|
||||||
"Kans.": [{"F": "Kans."}],
|
|
||||||
"Kan.": [{"F": "Kan."}],
|
|
||||||
"Ky.": [{"F": "Ky."}],
|
|
||||||
"La.": [{"F": "La."}],
|
|
||||||
"Md.": [{"F": "Md."}],
|
|
||||||
"Mass.": [{"F": "Mass."}],
|
|
||||||
"Mich.": [{"F": "Mich."}],
|
|
||||||
"Minn.": [{"F": "Minn."}],
|
|
||||||
"Miss.": [{"F": "Miss."}],
|
|
||||||
"Mo.": [{"F": "Mo."}],
|
|
||||||
"Mont.": [{"F": "Mont."}],
|
|
||||||
"Nebr.": [{"F": "Nebr."}],
|
|
||||||
"Neb.": [{"F": "Neb."}],
|
|
||||||
"Nev.": [{"F": "Nev."}],
|
|
||||||
"N.H.": [{"F": "N.H."}],
|
|
||||||
"N.J.": [{"F": "N.J."}],
|
|
||||||
"N.M.": [{"F": "N.M."}],
|
|
||||||
"N.Y.": [{"F": "N.Y."}],
|
|
||||||
"N.C.": [{"F": "N.C."}],
|
|
||||||
"N.D.": [{"F": "N.D."}],
|
|
||||||
"Okla.": [{"F": "Okla."}],
|
|
||||||
"Ore.": [{"F": "Ore."}],
|
|
||||||
"Pa.": [{"F": "Pa."}],
|
|
||||||
"Tenn.": [{"F": "Tenn."}],
|
|
||||||
"Va.": [{"F": "Va."}],
|
|
||||||
"Wash.": [{"F": "Wash."}],
|
|
||||||
"Wis.": [{"F": "Wis."}],
|
|
||||||
|
|
||||||
":)": [{"F": ":)"}],
|
|
||||||
"<3": [{"F": "<3"}],
|
|
||||||
";)": [{"F": ";)"}],
|
|
||||||
"(:": [{"F": "(:"}],
|
|
||||||
":(": [{"F": ":("}],
|
|
||||||
"-_-": [{"F": "-_-"}],
|
|
||||||
"=)": [{"F": "=)"}],
|
|
||||||
":/": [{"F": ":/"}],
|
|
||||||
":>": [{"F": ":>"}],
|
|
||||||
";-)": [{"F": ";-)"}],
|
|
||||||
":Y": [{"F": ":Y"}],
|
|
||||||
":P": [{"F": ":P"}],
|
|
||||||
":-P": [{"F": ":-P"}],
|
|
||||||
":3": [{"F": ":3"}],
|
|
||||||
"=3": [{"F": "=3"}],
|
|
||||||
"xD": [{"F": "xD"}],
|
|
||||||
"^_^": [{"F": "^_^"}],
|
|
||||||
"=]": [{"F": "=]"}],
|
|
||||||
"=D": [{"F": "=D"}],
|
|
||||||
"<333": [{"F": "<333"}],
|
|
||||||
":))": [{"F": ":))"}],
|
|
||||||
":0": [{"F": ":0"}],
|
|
||||||
"-__-": [{"F": "-__-"}],
|
|
||||||
"xDD": [{"F": "xDD"}],
|
|
||||||
"o_o": [{"F": "o_o"}],
|
|
||||||
"o_O": [{"F": "o_O"}],
|
|
||||||
"V_V": [{"F": "V_V"}],
|
|
||||||
"=[[": [{"F": "=[["}],
|
|
||||||
"<33": [{"F": "<33"}],
|
|
||||||
";p": [{"F": ";p"}],
|
|
||||||
";D": [{"F": ";D"}],
|
|
||||||
";-p": [{"F": ";-p"}],
|
|
||||||
";(": [{"F": ";("}],
|
|
||||||
":p": [{"F": ":p"}],
|
|
||||||
":]": [{"F": ":]"}],
|
|
||||||
":O": [{"F": ":O"}],
|
|
||||||
":-/": [{"F": ":-/"}],
|
|
||||||
":-)": [{"F": ":-)"}],
|
|
||||||
":(((": [{"F": ":((("}],
|
|
||||||
":((": [{"F": ":(("}],
|
|
||||||
":')": [{"F": ":')"}],
|
|
||||||
"(^_^)": [{"F": "(^_^)"}],
|
|
||||||
"(=": [{"F": "(="}],
|
|
||||||
"o.O": [{"F": "o.O"}],
|
|
||||||
"\")": [{"F": "\")"}],
|
|
||||||
"a.": [{"F": "a."}],
|
|
||||||
"b.": [{"F": "b."}],
|
|
||||||
"c.": [{"F": "c."}],
|
|
||||||
"d.": [{"F": "d."}],
|
|
||||||
"e.": [{"F": "e."}],
|
|
||||||
"f.": [{"F": "f."}],
|
|
||||||
"g.": [{"F": "g."}],
|
|
||||||
"h.": [{"F": "h."}],
|
|
||||||
"i.": [{"F": "i."}],
|
|
||||||
"j.": [{"F": "j."}],
|
|
||||||
"k.": [{"F": "k."}],
|
|
||||||
"l.": [{"F": "l."}],
|
|
||||||
"m.": [{"F": "m."}],
|
|
||||||
"n.": [{"F": "n."}],
|
|
||||||
"o.": [{"F": "o."}],
|
|
||||||
"p.": [{"F": "p."}],
|
|
||||||
"q.": [{"F": "q."}],
|
|
||||||
"r.": [{"F": "r."}],
|
|
||||||
"s.": [{"F": "s."}],
|
|
||||||
"t.": [{"F": "t."}],
|
|
||||||
"u.": [{"F": "u."}],
|
|
||||||
"v.": [{"F": "v."}],
|
|
||||||
"w.": [{"F": "w."}],
|
|
||||||
"x.": [{"F": "x."}],
|
|
||||||
"y.": [{"F": "y."}],
|
|
||||||
"z.": [{"F": "z."}],
|
|
||||||
|
|
||||||
"i.e.": [{"F": "i.e."}],
|
|
||||||
"I.e.": [{"F": "I.e."}],
|
|
||||||
"I.E.": [{"F": "I.E."}],
|
|
||||||
"e.g.": [{"F": "e.g."}],
|
|
||||||
"E.g.": [{"F": "E.g."}],
|
|
||||||
"E.G.": [{"F": "E.G."}],
|
|
||||||
"\n": [{"F": "\n", "pos": "SP"}],
|
|
||||||
"\t": [{"F": "\t", "pos": "SP"}],
|
|
||||||
" ": [{"F": " ", "pos": "SP"}],
|
|
||||||
u"\u00a0": [{"F": u"\u00a0", "pos": "SP", "L": " "}]
|
|
||||||
|
|
||||||
}
|
|
||||||
|
|
||||||
def get_double_contractions(ending):
    endings = []

    ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])

    while ends_with_contraction:
        for contraction in contractions:
            if ending.endswith(contraction):
                endings.append(contraction)
                # slice off the matched suffix; str.rstrip() would strip a character set, not the suffix
                ending = ending[:-len(contraction)]
        ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])

    endings.reverse()  # reverse because the last ending is put in the list first
    return endings
|
|
||||||
|
|
||||||
def get_token_properties(token, capitalize=False, remove_contractions=False):
    props = dict(token_properties.get(token))  # ensure we copy the dict so we can add the "F" prop
    if capitalize:
        token = token.capitalize()
    if remove_contractions:
        token = token.replace("'", "")

    props["F"] = token
    return props
|
|
||||||
|
|
||||||
def create_entry(token, endings, capitalize=False, remove_contractions=False):

    properties = []
    properties.append(get_token_properties(token, capitalize=capitalize, remove_contractions=remove_contractions))
    for e in endings:
        properties.append(get_token_properties(e, remove_contractions=remove_contractions))
    return properties
|
|
||||||
|
|
||||||
def generate_specials():

    specials = {}

    for token in starting_tokens:
        possible_endings = starting_tokens[token]
        for ending in possible_endings:

            endings = []
            if ending.count("'") > 1:
                endings.extend(get_double_contractions(ending))
            else:
                endings.append(ending)

            exceptions = possible_endings[ending]

            if "lower" not in exceptions:
                special = token + ending
                specials[special] = create_entry(token, endings)

            if "upper" not in exceptions:
                special = token.capitalize() + ending
                specials[special] = create_entry(token, endings, capitalize=True)

            if "contrLower" not in exceptions:
                special = token + ending.replace("'", "")
                specials[special] = create_entry(token, endings, remove_contractions=True)

            if "contrUpper" not in exceptions:
                special = token.capitalize() + ending.replace("'", "")
                specials[special] = create_entry(token, endings, capitalize=True, remove_contractions=True)

    # add in hardcoded specials
    specials = dict(specials, **hardcoded_specials)

    return specials
|
|
||||||
|
|
||||||
if __name__ == "__main__":
    specials = generate_specials()
    with open("specials.json", "w") as file_:
        file_.write(json.dumps(specials, indent=2))
|
|
||||||
|
|
|
@ -1,6 +0,0 @@
|
||||||
\.\.\.+
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zA-Z])-(?=[a-zA-Z])
|
|
||||||
(?<=[a-zA-Z])--(?=[a-zA-Z])
|
|
||||||
(?<=[0-9])-(?=[0-9])
|
|
||||||
(?<=[A-Za-z]),(?=[A-Za-z])
|
|
|
@ -1,59 +0,0 @@
|
||||||
{
|
|
||||||
"PRP": {
|
|
||||||
"I": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
|
|
||||||
"me": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
|
|
||||||
"you": {"L": "-PRON-", "PronType": "Prs", "Person": "Two"},
|
|
||||||
"he": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
|
|
||||||
"him": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
|
|
||||||
"she": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
|
|
||||||
"her": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
|
|
||||||
"it": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
|
|
||||||
"we": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
|
|
||||||
"us": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
|
|
||||||
"they": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
|
|
||||||
"them": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
|
|
||||||
|
|
||||||
"mine": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"yours": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"his": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"hers": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"its": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"ours": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"yours": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
"theirs": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
|
||||||
|
|
||||||
"myself": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
|
|
||||||
"yourself": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
|
|
||||||
"himself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Masc", "Reflex": "Yes"},
|
|
||||||
"herself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Fem", "Reflex": "Yes"},
|
|
||||||
"itself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Neut", "Reflex": "Yes"},
|
|
||||||
"themself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
|
|
||||||
"ourselves": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"},
|
|
||||||
"yourselves": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
|
|
||||||
"themselves": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"}
|
|
||||||
|
|
||||||
},
|
|
||||||
|
|
||||||
"PRP$": {
|
|
||||||
"my": {"L": "-PRON-", "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"your": {"L": "-PRON-", "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"his": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"her": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"its": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"our": {"L": "-PRON-", "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
|
|
||||||
"their": {"L": "-PRON-", "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
|
|
||||||
},
|
|
||||||
|
|
||||||
"VBZ": {
|
|
||||||
"am": {"L": "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
|
|
||||||
"are": {"L": "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
|
|
||||||
"is": {"L": "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
|
|
||||||
},
|
|
||||||
"VBP": {
|
|
||||||
"are": {"L": "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
|
|
||||||
},
|
|
||||||
"VBD": {
|
|
||||||
"was": {"L": "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
|
|
||||||
"were": {"L": "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
|
|
||||||
}
|
|
||||||
}
|
|
|
@ -1,21 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
$
|
|
||||||
£
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
File diff suppressed because it is too large
|
@ -1,26 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-z0-9)\]"'%\)])\.
|
|
||||||
(?<=[0-9])km
|
|
|
@ -1,60 +0,0 @@
|
||||||
{
|
|
||||||
".": {"pos": "punct", "puncttype": "peri"},
|
|
||||||
",": {"pos": "punct", "puncttype": "comm"},
|
|
||||||
"-LRB-": {"pos": "punct", "puncttype": "brck", "punctside": "ini"},
|
|
||||||
"-RRB-": {"pos": "punct", "puncttype": "brck", "punctside": "fin"},
|
|
||||||
"``": {"pos": "punct", "puncttype": "quot", "punctside": "ini"},
|
|
||||||
"\"\"": {"pos": "punct", "puncttype": "quot", "punctside": "fin"},
|
|
||||||
"''": {"pos": "punct", "puncttype": "quot", "punctside": "fin"},
|
|
||||||
":": {"pos": "punct"},
|
|
||||||
"$": {"pos": "sym", "other": {"symtype": "currency"}},
|
|
||||||
"#": {"pos": "sym", "other": {"symtype": "numbersign"}},
|
|
||||||
"AFX": {"pos": "adj", "hyph": "hyph"},
|
|
||||||
"CC": {"pos": "conj", "conjtype": "coor"},
|
|
||||||
"CD": {"pos": "num", "numtype": "card"},
|
|
||||||
"DT": {"pos": "det"},
|
|
||||||
"EX": {"pos": "adv", "advtype": "ex"},
|
|
||||||
"FW": {"pos": "x", "foreign": "foreign"},
|
|
||||||
"HYPH": {"pos": "punct", "puncttype": "dash"},
|
|
||||||
"IN": {"pos": "adp"},
|
|
||||||
"JJ": {"pos": "adj", "degree": "pos"},
|
|
||||||
"JJR": {"pos": "adj", "degree": "comp"},
|
|
||||||
"JJS": {"pos": "adj", "degree": "sup"},
|
|
||||||
"LS": {"pos": "punct", "numtype": "ord"},
|
|
||||||
"MD": {"pos": "verb", "verbtype": "mod"},
|
|
||||||
"NIL": {"pos": ""},
|
|
||||||
"NN": {"pos": "noun", "number": "sing"},
|
|
||||||
"NNP": {"pos": "propn", "nountype": "prop", "number": "sing"},
|
|
||||||
"NNPS": {"pos": "propn", "nountype": "prop", "number": "plur"},
|
|
||||||
"NNS": {"pos": "noun", "number": "plur"},
|
|
||||||
"PDT": {"pos": "adj", "adjtype": "pdt", "prontype": "prn"},
|
|
||||||
"POS": {"pos": "part", "poss": "poss"},
|
|
||||||
"PRP": {"pos": "pron", "prontype": "prs"},
|
|
||||||
"PRP$": {"pos": "adj", "prontype": "prs", "poss": "poss"},
|
|
||||||
"RB": {"pos": "adv", "degree": "pos"},
|
|
||||||
"RBR": {"pos": "adv", "degree": "comp"},
|
|
||||||
"RBS": {"pos": "adv", "degree": "sup"},
|
|
||||||
"RP": {"pos": "part"},
|
|
||||||
"SYM": {"pos": "sym"},
|
|
||||||
"TO": {"pos": "part", "parttype": "inf", "verbform": "inf"},
|
|
||||||
"UH": {"pos": "intj"},
|
|
||||||
"VB": {"pos": "verb", "verbform": "inf"},
|
|
||||||
"VBD": {"pos": "verb", "verbform": "fin", "tense": "past"},
|
|
||||||
"VBG": {"pos": "verb", "verbform": "part", "tense": "pres", "aspect": "prog"},
|
|
||||||
"VBN": {"pos": "verb", "verbform": "part", "tense": "past", "aspect": "perf"},
|
|
||||||
"VBP": {"pos": "verb", "verbform": "fin", "tense": "pres"},
|
|
||||||
"VBZ": {"pos": "verb", "verbform": "fin", "tense": "pres", "number": "sing", "person": 3},
|
|
||||||
"WDT": {"pos": "adj", "prontype": "int|rel"},
|
|
||||||
"WP": {"pos": "noun", "prontype": "int|rel"},
|
|
||||||
"WP$": {"pos": "adj", "poss": "poss", "prontype": "int|rel"},
|
|
||||||
"WRB": {"pos": "adv", "prontype": "int|rel"},
|
|
||||||
"SP": {"pos": "space"},
|
|
||||||
"ADD": {"pos": "x"},
|
|
||||||
"NFP": {"pos": "punct"},
|
|
||||||
"GW": {"pos": "x"},
|
|
||||||
"AFX": {"pos": "x"},
|
|
||||||
"HYPH": {"pos": "punct"},
|
|
||||||
"XX": {"pos": "x"},
|
|
||||||
"BES": {"pos": "verb"},
|
|
||||||
"HVS": {"pos": "verb"}
|
|
||||||
}
|
|
|
@ -1,3 +0,0 @@
|
||||||
\.\.\.
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zA-Z])-(?=[a-zA-Z])
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,21 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
$
|
|
||||||
£
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
|
@ -1,3 +0,0 @@
|
||||||
Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.
|
|
||||||
|
|
||||||
Mit Biografie: Ein Spiel wandte sich Frisch von der Parabelform seiner Erfolgsstücke Biedermann und die Brandstifter und Andorra ab und postulierte eine „Dramaturgie der Permutation“. Darin sollte nicht, wie im klassischen Theater, Sinn und Schicksal im Mittelpunkt stehen, sondern die Zufälligkeit von Ereignissen und die Möglichkeit ihrer Variation. Dennoch handelt Biografie: Ein Spiel gerade von der Unmöglichkeit seines Protagonisten, seinen Lebenslauf grundlegend zu verändern. Frisch empfand die Wirkung des Stücks im Nachhinein als zu fatalistisch und die Umsetzung seiner theoretischen Absichten als nicht geglückt. Obwohl das Stück 1968 als unpolitisch und nicht zeitgemäß kritisiert wurde und auch später eine geteilte Rezeption erfuhr, gehört es an deutschsprachigen Bühnen zu den häufiger aufgeführten Stücken Frischs.
|
|
|
@ -1,149 +0,0 @@
|
||||||
{
|
|
||||||
"a.m.": [{"F": "a.m."}],
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
|
|
||||||
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
|
|
||||||
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
|
|
||||||
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
|
|
||||||
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
|
|
||||||
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
|
|
||||||
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
|
|
||||||
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
|
|
||||||
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
|
|
||||||
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
|
|
||||||
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
|
|
||||||
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
|
|
||||||
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
|
|
||||||
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
|
|
||||||
|
|
||||||
|
|
||||||
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
|
|
||||||
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
|
|
||||||
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
|
|
||||||
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
|
|
||||||
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
|
|
||||||
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
|
|
||||||
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
|
|
||||||
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
|
|
||||||
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
|
|
||||||
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
|
|
||||||
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
|
|
||||||
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
|
|
||||||
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
|
|
||||||
"Jan.": [{"F": "Jan.", "L": "Januar"}],
|
|
||||||
"Feb.": [{"F": "Feb.", "L": "Februar"}],
|
|
||||||
"Mär.": [{"F": "Mär.", "L": "März"}],
|
|
||||||
"Apr.": [{"F": "Apr.", "L": "April"}],
|
|
||||||
"Mai.": [{"F": "Mai.", "L": "Mai"}],
|
|
||||||
"Jun.": [{"F": "Jun.", "L": "Juni"}],
|
|
||||||
"Jul.": [{"F": "Jul.", "L": "Juli"}],
|
|
||||||
"Aug.": [{"F": "Aug.", "L": "August"}],
|
|
||||||
"Sep.": [{"F": "Sep.", "L": "September"}],
|
|
||||||
"Sept.": [{"F": "Sept.", "L": "September"}],
|
|
||||||
"Okt.": [{"F": "Okt.", "L": "Oktober"}],
|
|
||||||
"Nov.": [{"F": "Nov.", "L": "November"}],
|
|
||||||
"Dez.": [{"F": "Dez.", "L": "Dezember"}],
|
|
||||||
|
|
||||||
":)": [{"F": ":)"}],
|
|
||||||
"<3": [{"F": "<3"}],
|
|
||||||
";)": [{"F": ";)"}],
|
|
||||||
"(:": [{"F": "(:"}],
|
|
||||||
":(": [{"F": ":("}],
|
|
||||||
"-_-": [{"F": "-_-"}],
|
|
||||||
"=)": [{"F": "=)"}],
|
|
||||||
":/": [{"F": ":/"}],
|
|
||||||
":>": [{"F": ":>"}],
|
|
||||||
";-)": [{"F": ";-)"}],
|
|
||||||
":Y": [{"F": ":Y"}],
|
|
||||||
":P": [{"F": ":P"}],
|
|
||||||
":-P": [{"F": ":-P"}],
|
|
||||||
":3": [{"F": ":3"}],
|
|
||||||
"=3": [{"F": "=3"}],
|
|
||||||
"xD": [{"F": "xD"}],
|
|
||||||
"^_^": [{"F": "^_^"}],
|
|
||||||
"=]": [{"F": "=]"}],
|
|
||||||
"=D": [{"F": "=D"}],
|
|
||||||
"<333": [{"F": "<333"}],
|
|
||||||
":))": [{"F": ":))"}],
|
|
||||||
":0": [{"F": ":0"}],
|
|
||||||
"-__-": [{"F": "-__-"}],
|
|
||||||
"xDD": [{"F": "xDD"}],
|
|
||||||
"o_o": [{"F": "o_o"}],
|
|
||||||
"o_O": [{"F": "o_O"}],
|
|
||||||
"V_V": [{"F": "V_V"}],
|
|
||||||
"=[[": [{"F": "=[["}],
|
|
||||||
"<33": [{"F": "<33"}],
|
|
||||||
";p": [{"F": ";p"}],
|
|
||||||
";D": [{"F": ";D"}],
|
|
||||||
";-p": [{"F": ";-p"}],
|
|
||||||
";(": [{"F": ";("}],
|
|
||||||
":p": [{"F": ":p"}],
|
|
||||||
":]": [{"F": ":]"}],
|
|
||||||
":O": [{"F": ":O"}],
|
|
||||||
":-/": [{"F": ":-/"}],
|
|
||||||
":-)": [{"F": ":-)"}],
|
|
||||||
":(((": [{"F": ":((("}],
|
|
||||||
":((": [{"F": ":(("}],
|
|
||||||
":')": [{"F": ":')"}],
|
|
||||||
"(^_^)": [{"F": "(^_^)"}],
|
|
||||||
"(=": [{"F": "(="}],
|
|
||||||
"o.O": [{"F": "o.O"}],
|
|
||||||
"\")": [{"F": "\")"}],
|
|
||||||
"a.": [{"F": "a."}],
|
|
||||||
"b.": [{"F": "b."}],
|
|
||||||
"c.": [{"F": "c."}],
|
|
||||||
"d.": [{"F": "d."}],
|
|
||||||
"e.": [{"F": "e."}],
|
|
||||||
"f.": [{"F": "f."}],
|
|
||||||
"g.": [{"F": "g."}],
|
|
||||||
"h.": [{"F": "h."}],
|
|
||||||
"i.": [{"F": "i."}],
|
|
||||||
"j.": [{"F": "j."}],
|
|
||||||
"k.": [{"F": "k."}],
|
|
||||||
"l.": [{"F": "l."}],
|
|
||||||
"m.": [{"F": "m."}],
|
|
||||||
"n.": [{"F": "n."}],
|
|
||||||
"o.": [{"F": "o."}],
|
|
||||||
"p.": [{"F": "p."}],
|
|
||||||
"q.": [{"F": "q."}],
|
|
||||||
"s.": [{"F": "s."}],
|
|
||||||
"t.": [{"F": "t."}],
|
|
||||||
"u.": [{"F": "u."}],
|
|
||||||
"v.": [{"F": "v."}],
|
|
||||||
"w.": [{"F": "w."}],
|
|
||||||
"x.": [{"F": "x."}],
|
|
||||||
"y.": [{"F": "y."}],
|
|
||||||
"z.": [{"F": "z."}],
|
|
||||||
|
|
||||||
"z.b.": [{"F": "z.b."}],
|
|
||||||
"e.h.": [{"F": "e.h."}],
|
|
||||||
"o.ä.": [{"F": "o.ä."}],
|
|
||||||
"bzw.": [{"F": "bzw."}],
|
|
||||||
"usw.": [{"F": "usw."}],
|
|
||||||
"\n": [{"F": "\n", "pos": "SP"}],
|
|
||||||
"\t": [{"F": "\t", "pos": "SP"}],
|
|
||||||
" ": [{"F": " ", "pos": "SP"}]
|
|
||||||
}
|
|
|
@ -1,26 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-z0-9)\]"'%\)])\.
|
|
||||||
(?<=[0-9])km
|
|
|
@ -1,19 +0,0 @@
|
||||||
{
|
|
||||||
"NOUN": {"pos": "NOUN"},
|
|
||||||
"VERB": {"pos": "VERB"},
|
|
||||||
"PUNCT": {"pos": "PUNCT"},
|
|
||||||
"ADV": {"pos": "ADV"},
|
|
||||||
"ADJ": {"pos": "ADJ"},
|
|
||||||
"PRON": {"pos": "PRON"},
|
|
||||||
"PROPN": {"pos": "PROPN"},
|
|
||||||
"CONJ": {"pos": "CONJ"},
|
|
||||||
"NUM": {"pos": "NUM"},
|
|
||||||
"AUX": {"pos": "AUX"},
|
|
||||||
"SCONJ": {"pos": "SCONJ"},
|
|
||||||
"ADP": {"pos": "ADP"},
|
|
||||||
"SYM": {"pos": "SYM"},
|
|
||||||
"X": {"pos": "X"},
|
|
||||||
"INTJ": {"pos": "INTJ"},
|
|
||||||
"DET": {"pos": "DET"},
|
|
||||||
"PART": {"pos": "PART"}
|
|
||||||
}
|
|
|
@ -1,3 +0,0 @@
|
||||||
\.\.\.
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zA-Z])-(?=[a-zA-Z])
|
|
|
@ -1,21 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
$
|
|
||||||
£
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
|
@ -1,149 +0,0 @@
|
||||||
{
|
|
||||||
"a.m.": [{"F": "a.m."}],
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
|
|
||||||
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
|
|
||||||
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
|
|
||||||
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
|
|
||||||
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
|
|
||||||
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
|
|
||||||
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
|
|
||||||
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
|
|
||||||
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
|
|
||||||
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
|
|
||||||
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
|
|
||||||
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
|
|
||||||
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
|
|
||||||
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
|
|
||||||
|
|
||||||
|
|
||||||
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
|
|
||||||
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
|
|
||||||
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
|
|
||||||
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
|
|
||||||
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
|
|
||||||
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
|
|
||||||
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
|
|
||||||
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
|
|
||||||
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
|
|
||||||
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
|
|
||||||
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
|
|
||||||
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
|
|
||||||
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
|
|
||||||
"Jan.": [{"F": "Jan.", "L": "Januar"}],
|
|
||||||
"Feb.": [{"F": "Feb.", "L": "Februar"}],
|
|
||||||
"Mär.": [{"F": "Mär.", "L": "März"}],
|
|
||||||
"Apr.": [{"F": "Apr.", "L": "April"}],
|
|
||||||
"Mai.": [{"F": "Mai.", "L": "Mai"}],
|
|
||||||
"Jun.": [{"F": "Jun.", "L": "Juni"}],
|
|
||||||
"Jul.": [{"F": "Jul.", "L": "Juli"}],
|
|
||||||
"Aug.": [{"F": "Aug.", "L": "August"}],
|
|
||||||
"Sep.": [{"F": "Sep.", "L": "September"}],
|
|
||||||
"Sept.": [{"F": "Sept.", "L": "September"}],
|
|
||||||
"Okt.": [{"F": "Okt.", "L": "Oktober"}],
|
|
||||||
"Nov.": [{"F": "Nov.", "L": "November"}],
|
|
||||||
"Dez.": [{"F": "Dez.", "L": "Dezember"}],
|
|
||||||
|
|
||||||
":)": [{"F": ":)"}],
|
|
||||||
"<3": [{"F": "<3"}],
|
|
||||||
";)": [{"F": ";)"}],
|
|
||||||
"(:": [{"F": "(:"}],
|
|
||||||
":(": [{"F": ":("}],
|
|
||||||
"-_-": [{"F": "-_-"}],
|
|
||||||
"=)": [{"F": "=)"}],
|
|
||||||
":/": [{"F": ":/"}],
|
|
||||||
":>": [{"F": ":>"}],
|
|
||||||
";-)": [{"F": ";-)"}],
|
|
||||||
":Y": [{"F": ":Y"}],
|
|
||||||
":P": [{"F": ":P"}],
|
|
||||||
":-P": [{"F": ":-P"}],
|
|
||||||
":3": [{"F": ":3"}],
|
|
||||||
"=3": [{"F": "=3"}],
|
|
||||||
"xD": [{"F": "xD"}],
|
|
||||||
"^_^": [{"F": "^_^"}],
|
|
||||||
"=]": [{"F": "=]"}],
|
|
||||||
"=D": [{"F": "=D"}],
|
|
||||||
"<333": [{"F": "<333"}],
|
|
||||||
":))": [{"F": ":))"}],
|
|
||||||
":0": [{"F": ":0"}],
|
|
||||||
"-__-": [{"F": "-__-"}],
|
|
||||||
"xDD": [{"F": "xDD"}],
|
|
||||||
"o_o": [{"F": "o_o"}],
|
|
||||||
"o_O": [{"F": "o_O"}],
|
|
||||||
"V_V": [{"F": "V_V"}],
|
|
||||||
"=[[": [{"F": "=[["}],
|
|
||||||
"<33": [{"F": "<33"}],
|
|
||||||
";p": [{"F": ";p"}],
|
|
||||||
";D": [{"F": ";D"}],
|
|
||||||
";-p": [{"F": ";-p"}],
|
|
||||||
";(": [{"F": ";("}],
|
|
||||||
":p": [{"F": ":p"}],
|
|
||||||
":]": [{"F": ":]"}],
|
|
||||||
":O": [{"F": ":O"}],
|
|
||||||
":-/": [{"F": ":-/"}],
|
|
||||||
":-)": [{"F": ":-)"}],
|
|
||||||
":(((": [{"F": ":((("}],
|
|
||||||
":((": [{"F": ":(("}],
|
|
||||||
":')": [{"F": ":')"}],
|
|
||||||
"(^_^)": [{"F": "(^_^)"}],
|
|
||||||
"(=": [{"F": "(="}],
|
|
||||||
"o.O": [{"F": "o.O"}],
|
|
||||||
"\")": [{"F": "\")"}],
|
|
||||||
"a.": [{"F": "a."}],
|
|
||||||
"b.": [{"F": "b."}],
|
|
||||||
"c.": [{"F": "c."}],
|
|
||||||
"d.": [{"F": "d."}],
|
|
||||||
"e.": [{"F": "e."}],
|
|
||||||
"f.": [{"F": "f."}],
|
|
||||||
"g.": [{"F": "g."}],
|
|
||||||
"h.": [{"F": "h."}],
|
|
||||||
"i.": [{"F": "i."}],
|
|
||||||
"j.": [{"F": "j."}],
|
|
||||||
"k.": [{"F": "k."}],
|
|
||||||
"l.": [{"F": "l."}],
|
|
||||||
"m.": [{"F": "m."}],
|
|
||||||
"n.": [{"F": "n."}],
|
|
||||||
"o.": [{"F": "o."}],
|
|
||||||
"p.": [{"F": "p."}],
|
|
||||||
"q.": [{"F": "q."}],
|
|
||||||
"s.": [{"F": "s."}],
|
|
||||||
"t.": [{"F": "t."}],
|
|
||||||
"u.": [{"F": "u."}],
|
|
||||||
"v.": [{"F": "v."}],
|
|
||||||
"w.": [{"F": "w."}],
|
|
||||||
"x.": [{"F": "x."}],
|
|
||||||
"y.": [{"F": "y."}],
|
|
||||||
"z.": [{"F": "z."}],
|
|
||||||
|
|
||||||
"z.b.": [{"F": "z.b."}],
|
|
||||||
"e.h.": [{"F": "e.h."}],
|
|
||||||
"o.ä.": [{"F": "o.ä."}],
|
|
||||||
"bzw.": [{"F": "bzw."}],
|
|
||||||
"usw.": [{"F": "usw."}],
|
|
||||||
"\n": [{"F": "\n", "pos": "SP"}],
|
|
||||||
"\t": [{"F": "\t", "pos": "SP"}],
|
|
||||||
" ": [{"F": " ", "pos": "SP"}]
|
|
||||||
}
|
|
|
@ -1,26 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-z0-9)\]"'%\)])\.
|
|
||||||
(?<=[0-9])km
|
|
|
@ -1,44 +0,0 @@
|
||||||
{
|
|
||||||
"S": {"pos": "NOUN"},
|
|
||||||
"E": {"pos": "ADP"},
|
|
||||||
"RD": {"pos": "DET"},
|
|
||||||
"V": {"pos": "VERB"},
|
|
||||||
"_": {"pos": "NO_TAG"},
|
|
||||||
"A": {"pos": "ADJ"},
|
|
||||||
"SP": {"pos": "PROPN"},
|
|
||||||
"FF": {"pos": "PUNCT"},
|
|
||||||
"FS": {"pos": "PUNCT"},
|
|
||||||
"B": {"pos": "ADV"},
|
|
||||||
"CC": {"pos": "CONJ"},
|
|
||||||
"FB": {"pos": "PUNCT"},
|
|
||||||
"VA": {"pos": "AUX"},
|
|
||||||
"PC": {"pos": "PRON"},
|
|
||||||
"N": {"pos": "NUM"},
|
|
||||||
"RI": {"pos": "DET"},
|
|
||||||
"PR": {"pos": "PRON"},
|
|
||||||
"CS": {"pos": "SCONJ"},
|
|
||||||
"BN": {"pos": "ADV"},
|
|
||||||
"AP": {"pos": "DET"},
|
|
||||||
"VM": {"pos": "AUX"},
|
|
||||||
"DI": {"pos": "DET"},
|
|
||||||
"FC": {"pos": "PUNCT"},
|
|
||||||
"PI": {"pos": "PRON"},
|
|
||||||
"DD": {"pos": "DET"},
|
|
||||||
"DQ": {"pos": "DET"},
|
|
||||||
"PQ": {"pos": "PRON"},
|
|
||||||
"PD": {"pos": "PRON"},
|
|
||||||
"NO": {"pos": "ADJ"},
|
|
||||||
"PE": {"pos": "PRON"},
|
|
||||||
"T": {"pos": "DET"},
|
|
||||||
"X": {"pos": "SYM"},
|
|
||||||
"SW": {"pos": "X"},
|
|
||||||
"NO": {"pos": "PRON"},
|
|
||||||
"I": {"pos": "INTJ"},
|
|
||||||
"X": {"pos": "X"},
|
|
||||||
"DR": {"pos": "DET"},
|
|
||||||
"EA": {"pos": "ADP"},
|
|
||||||
"PP": {"pos": "PRON"},
|
|
||||||
"X": {"pos": "NUM"},
|
|
||||||
"DE": {"pos": "DET"},
|
|
||||||
"X": {"pos": "PART"}
|
|
||||||
}
|
|
|
@ -1,194 +0,0 @@
|
||||||
{
|
|
||||||
"Reddit": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "reddit"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"SeptemberElevenAttacks": [
|
|
||||||
"EVENT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"orth": "9/11"}
|
|
||||||
],
|
|
||||||
[
|
|
||||||
{"lower": "september"},
|
|
||||||
{"orth": "11"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Linux": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "linux"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Haskell": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "haskell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"HaskellCurry": [
|
|
||||||
"PERSON",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[
|
|
||||||
{"lower": "haskell"},
|
|
||||||
{"lower": "curry"}
|
|
||||||
]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Javascript": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "javascript"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"CSS": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "css"}],
|
|
||||||
[{"lower": "css3"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"displaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "displacy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"spaCy": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "spaCy"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
|
|
||||||
"HTML": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "html"}],
|
|
||||||
[{"lower": "html5"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Python": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Python"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ruby": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ruby"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Digg": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "digg"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"FoxNews": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Fox"}],
|
|
||||||
[{"orth": "News"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Google": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "google"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Mac": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "mac"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Wikipedia": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "wikipedia"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Windows": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Windows"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Dell": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "dell"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Facebook": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "facebook"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Blizzard": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Blizzard"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Ubuntu": [
|
|
||||||
"ORG",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Ubuntu"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"Youtube": [
|
|
||||||
"PRODUCT",
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"lower": "youtube"}]
|
|
||||||
]
|
|
||||||
],
|
|
||||||
"false_positives": [
|
|
||||||
null,
|
|
||||||
{},
|
|
||||||
[
|
|
||||||
[{"orth": "Shit"}],
|
|
||||||
[{"orth": "Weed"}],
|
|
||||||
[{"orth": "Cool"}],
|
|
||||||
[{"orth": "Btw"}],
|
|
||||||
[{"orth": "Bah"}],
|
|
||||||
[{"orth": "Bullshit"}],
|
|
||||||
[{"orth": "Lol"}],
|
|
||||||
[{"orth": "Yo"}, {"lower": "dawg"}],
|
|
||||||
[{"orth": "Yay"}],
|
|
||||||
[{"orth": "Ahh"}],
|
|
||||||
[{"orth": "Yea"}],
|
|
||||||
[{"orth": "Bah"}]
|
|
||||||
]
|
|
||||||
]
|
|
||||||
}
|
|
|
@ -1,6 +0,0 @@
|
||||||
\.\.\.
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zA-Z])-(?=[a-zA-z])
|
|
||||||
(?<=[a-zA-Z])--(?=[a-zA-z])
|
|
||||||
(?<=[0-9])-(?=[0-9])
|
|
||||||
(?<=[A-Za-z]),(?=[A-Za-z])
|
|
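Note: infix patterns like the ones removed above are typically joined into one alternation and used to find split points inside a token. The sketch below is only an illustration under assumptions: `INFIXES` is a subset of the removed patterns, and `split_infixes` is a hypothetical helper, not spaCy's tokenizer. It also shows why the spelling of the lookahead class matters: the range `A-z` covers the ASCII punctuation between `Z` and `a` (for example `_`), so `[a-zA-z]` accepts more than letters.

```python
import re

# Subset of the removed infix patterns, with the lookahead class spelled
# [a-zA-Z]; the removed file spells it [a-zA-z].
INFIXES = [
    r"\.\.\.",
    r"(?<=[a-z])\.(?=[A-Z])",
    r"(?<=[a-zA-Z])-(?=[a-zA-Z])",
    r"(?<=[0-9])-(?=[0-9])",
]
infix_re = re.compile("|".join(INFIXES))


def split_infixes(text):
    """Hypothetical helper: split text at every infix match, keeping the infix."""
    pieces, start = [], 0
    for match in infix_re.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


pieces = split_infixes("well-known")

# The A-z range also admits "_" (and a few other ASCII symbols).
loose_hit = re.search(r"(?<=[a-zA-Z])-(?=[a-zA-z])", "a-_") is not None
strict_hit = re.search(r"(?<=[a-zA-Z])-(?=[a-zA-Z])", "a-_") is not None
```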
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,21 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
$
|
|
||||||
£
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,26 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-z0-9)\]"'%\)])\.
|
|
||||||
(?<=[0-9])km
|
|
|
@ -1,43 +0,0 @@
|
||||||
{
|
|
||||||
"NR": {"pos": "PROPN"},
|
|
||||||
"AD": {"pos": "ADV"},
|
|
||||||
"NN": {"pos": "NOUN"},
|
|
||||||
"CD": {"pos": "NUM"},
|
|
||||||
"DEG": {"pos": "PART"},
|
|
||||||
"PN": {"pos": "PRON"},
|
|
||||||
"M": {"pos": "PART"},
|
|
||||||
"JJ": {"pos": "ADJ"},
|
|
||||||
"DEC": {"pos": "PART"},
|
|
||||||
"NT": {"pos": "NOUN"},
|
|
||||||
"DT": {"pos": "DET"},
|
|
||||||
"LC": {"pos": "PART"},
|
|
||||||
"CC": {"pos": "CONJ"},
|
|
||||||
"AS": {"pos": "PART"},
|
|
||||||
"SP": {"pos": "PART"},
|
|
||||||
"IJ": {"pos": "INTJ"},
|
|
||||||
"OD": {"pos": "NUM"},
|
|
||||||
"MSP": {"pos": "PART"},
|
|
||||||
"CS": {"pos": "SCONJ"},
|
|
||||||
"ETC": {"pos": "PART"},
|
|
||||||
"DEV": {"pos": "PART"},
|
|
||||||
"BA": {"pos": "AUX"},
|
|
||||||
"SB": {"pos": "AUX"},
|
|
||||||
"DER": {"pos": "PART"},
|
|
||||||
"LB": {"pos": "AUX"},
|
|
||||||
"P": {"pos": "ADP"},
|
|
||||||
"URL": {"pos": "SYM"},
|
|
||||||
"FRAG": {"pos": "X"},
|
|
||||||
"X": {"pos": "X"},
|
|
||||||
"ON": {"pos": "X"},
|
|
||||||
"FW": {"pos": "X"},
|
|
||||||
"VC": {"pos": "VERB"},
|
|
||||||
"VV": {"pos": "VERB"},
|
|
||||||
"VA": {"pos": "VERB"},
|
|
||||||
"VE": {"pos": "VERB"},
|
|
||||||
"PU": {"pos": "PUNCT"},
|
|
||||||
"SP": {"pos": "SPACE"},
|
|
||||||
"NP": {"pos": "X"},
|
|
||||||
"_": {"pos": "X"},
|
|
||||||
"VP": {"pos": "X"},
|
|
||||||
"CHAR": {"pos": "X"}
|
|
||||||
}
|
|
4
setup.py
|
@ -28,6 +28,9 @@ PACKAGES = [
|
||||||
'spacy.fr',
|
'spacy.fr',
|
||||||
'spacy.it',
|
'spacy.it',
|
||||||
'spacy.pt',
|
'spacy.pt',
|
||||||
|
'spacy.nl',
|
||||||
|
'spacy.sv',
|
||||||
|
'spacy.language_data',
|
||||||
'spacy.serialize',
|
'spacy.serialize',
|
||||||
'spacy.syntax',
|
'spacy.syntax',
|
||||||
'spacy.munge',
|
'spacy.munge',
|
||||||
|
@ -77,6 +80,7 @@ MOD_NAMES = [
|
||||||
'spacy.syntax.ner',
|
'spacy.syntax.ner',
|
||||||
'spacy.symbols',
|
'spacy.symbols',
|
||||||
'spacy.syntax.iterators']
|
'spacy.syntax.iterators']
|
||||||
|
# TODO: This is missing a lot of modules. Does it matter?
|
||||||
|
|
||||||
|
|
||||||
COMPILE_OPTIONS = {
|
COMPILE_OPTIONS = {
|
||||||
|
|
|
@ -10,6 +10,8 @@ from . import es
|
||||||
from . import it
|
from . import it
|
||||||
from . import fr
|
from . import fr
|
||||||
from . import pt
|
from . import pt
|
||||||
|
from . import nl
|
||||||
|
from . import sv
|
||||||
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
|
@ -25,23 +27,14 @@ set_lang_class(pt.Portuguese.lang, pt.Portuguese)
|
||||||
set_lang_class(fr.French.lang, fr.French)
|
set_lang_class(fr.French.lang, fr.French)
|
||||||
set_lang_class(it.Italian.lang, it.Italian)
|
set_lang_class(it.Italian.lang, it.Italian)
|
||||||
set_lang_class(zh.Chinese.lang, zh.Chinese)
|
set_lang_class(zh.Chinese.lang, zh.Chinese)
|
||||||
|
set_lang_class(nl.Dutch.lang, nl.Dutch)
|
||||||
|
set_lang_class(sv.Swedish.lang, sv.Swedish)
|
||||||
|
|
||||||
|
|
||||||
def load(name, **overrides):
|
def load(name, **overrides):
|
||||||
target_name, target_version = util.split_data_name(name)
|
target_name, target_version = util.split_data_name(name)
|
||||||
data_path = overrides.get('path', util.get_data_path())
|
data_path = overrides.get('path', util.get_data_path())
|
||||||
if target_name == 'en' and 'add_vectors' not in overrides:
|
|
||||||
if 'vectors' in overrides:
|
|
||||||
vec_path = util.match_best_version(overrides['vectors'], None, data_path)
|
|
||||||
if vec_path is None:
|
|
||||||
raise IOError(
|
|
||||||
'Could not load data pack %s from %s' % (overrides['vectors'], data_path))
|
|
||||||
|
|
||||||
else:
|
|
||||||
vec_path = util.match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
|
|
||||||
if vec_path is not None:
|
|
||||||
vec_path = vec_path / 'vocab' / 'vec.bin'
|
|
||||||
overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
|
|
||||||
path = util.match_best_version(target_name, target_version, data_path)
|
path = util.match_best_version(target_name, target_version, data_path)
|
||||||
cls = get_lang_class(target_name)
|
cls = get_lang_class(target_name)
|
||||||
return cls(path=path, **overrides)
|
overrides['path'] = path
|
||||||
|
return cls(**overrides)
|
||||||
|
|
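Note: one observable effect of writing `overrides['path'] = path` before calling `cls(**overrides)` is that a caller-supplied `path` override no longer collides with the explicit `path=` keyword. The sketch below is a stand-alone illustration under assumptions: `FakeLanguage`, `load_old`, and `load_new` are hypothetical stand-ins, not spaCy code, and the commit itself does not state this as its motivation.

```python
class FakeLanguage:
    """Hypothetical stand-in for a Language subclass; it only records kwargs."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def load_old(resolved, **overrides):
    # Mirrors `cls(path=path, **overrides)`: fails if overrides has 'path'.
    return FakeLanguage(path=resolved, **overrides)


def load_new(resolved, **overrides):
    # Mirrors the new form: the resolved path replaces any override.
    overrides["path"] = resolved
    return FakeLanguage(**overrides)


try:
    load_old("/resolved/model", path="/user/data")
    old_failed = False
except TypeError:
    # Python rejects the same keyword being passed twice.
    old_failed = True

new_kwargs = load_new("/resolved/model", path="/user/data").kwargs
```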
|
@ -4,7 +4,7 @@
|
||||||
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
||||||
|
|
||||||
__title__ = 'spacy'
|
__title__ = 'spacy'
|
||||||
__version__ = '1.2.0'
|
__version__ = '1.4.0'
|
||||||
__summary__ = 'Industrial-strength NLP'
|
__summary__ = 'Industrial-strength NLP'
|
||||||
__uri__ = 'https://spacy.io'
|
__uri__ = 'https://spacy.io'
|
||||||
__author__ = 'Matthew Honnibal'
|
__author__ = 'Matthew Honnibal'
|
||||||
|
|
|
@ -87,5 +87,3 @@ cpdef enum attr_id_t:
|
||||||
PROB
|
PROB
|
||||||
|
|
||||||
LANG
|
LANG
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -86,5 +86,59 @@ IDS = {
|
||||||
"LANG": LANG,
|
"LANG": LANG,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
# ATTR IDs, in order of the symbol
|
# ATTR IDs, in order of the symbol
|
||||||
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
|
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
|
||||||
|
|
||||||
|
|
||||||
|
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
||||||
|
'''Normalize a dictionary of attributes, converting them to ints.
|
||||||
|
|
||||||
|
Arguments:
|
||||||
|
stringy_attrs (dict):
|
||||||
|
Dictionary keyed by attribute string names. Values can be ints or strings.
|
||||||
|
|
||||||
|
strings_map (StringStore):
|
||||||
|
Defaults to None. If provided, encodes string values into ints.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
inty_attrs (dict):
|
||||||
|
Attributes dictionary with keys and optionally values converted to
|
||||||
|
ints.
|
||||||
|
'''
|
||||||
|
inty_attrs = {}
|
||||||
|
if _do_deprecated:
|
||||||
|
if 'F' in stringy_attrs:
|
||||||
|
stringy_attrs["ORTH"] = stringy_attrs.pop("F")
|
||||||
|
if 'L' in stringy_attrs:
|
||||||
|
stringy_attrs["LEMMA"] = stringy_attrs.pop("L")
|
||||||
|
if 'pos' in stringy_attrs:
|
||||||
|
stringy_attrs["TAG"] = stringy_attrs.pop("pos")
|
||||||
|
if 'morph' in stringy_attrs:
|
||||||
|
morphs = stringy_attrs.pop('morph')
|
||||||
|
if 'number' in stringy_attrs:
|
||||||
|
stringy_attrs.pop('number')
|
||||||
|
if 'tenspect' in stringy_attrs:
|
||||||
|
stringy_attrs.pop('tenspect')
|
||||||
|
morph_keys = [
|
||||||
|
'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
|
||||||
|
'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
|
||||||
|
'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
|
||||||
|
'Number', 'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
|
||||||
|
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case']
|
||||||
|
for key in morph_keys:
|
||||||
|
if key in stringy_attrs:
|
||||||
|
stringy_attrs.pop(key)
|
||||||
|
elif key.lower() in stringy_attrs:
|
||||||
|
stringy_attrs.pop(key.lower())
|
||||||
|
elif key.upper() in stringy_attrs:
|
||||||
|
stringy_attrs.pop(key.upper())
|
||||||
|
for name, value in stringy_attrs.items():
|
||||||
|
if isinstance(name, int):
|
||||||
|
int_key = name
|
||||||
|
else:
|
||||||
|
int_key = IDS[name.upper()]
|
||||||
|
if strings_map is not None and isinstance(value, basestring):
|
||||||
|
value = strings_map[value]
|
||||||
|
inty_attrs[int_key] = value
|
||||||
|
return inty_attrs
|
||||||
|
|
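Note: the core loop of `intify_attrs` can be tried outside the Cython module with a small stand-in. The sketch below rests on assumptions: `IDS` is a three-entry toy table with made-up integer values (not spaCy's real symbol IDs), `strings_map` is a plain dict standing in for a `StringStore`, and `str` replaces the Python 2 `basestring` check. The deprecated-key handling is left out.

```python
# Toy table (hypothetical values); the real IDS mapping lives in spacy/attrs.pyx.
IDS = {"ORTH": 65, "LEMMA": 73, "TAG": 75}


def intify_attrs_sketch(stringy_attrs, strings_map=None):
    """Convert attribute names, and optionally string values, to ints."""
    inty_attrs = {}
    for name, value in stringy_attrs.items():
        # Integer keys pass through; string keys are upper-cased and looked up.
        int_key = name if isinstance(name, int) else IDS[name.upper()]
        if strings_map is not None and isinstance(value, str):
            value = strings_map[value]
        inty_attrs[int_key] = value
    return inty_attrs


strings = {"Abb.": 1001, "Abbildung": 1002}
result = intify_attrs_sketch({"orth": "Abb.", "LEMMA": "Abbildung", 75: 7}, strings)
unencoded = intify_attrs_sketch({"orth": "Abb."})
```

Without a `strings_map`, string values are kept as they are and only the keys become ints.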
|
@ -1,10 +1,12 @@
|
||||||
|
# encoding: utf8
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
from os import path
|
from os import path
|
||||||
|
|
||||||
from ..language import Language
|
from ..language import Language
|
||||||
from ..attrs import LANG
|
from ..attrs import LANG
|
||||||
from . import language_data
|
|
||||||
|
from .language_data import *
|
||||||
|
|
||||||
|
|
||||||
class German(Language):
|
class German(Language):
|
||||||
|
@ -15,13 +17,6 @@ class German(Language):
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
lex_attr_getters[LANG] = lambda text: 'de'
|
lex_attr_getters[LANG] = lambda text: 'de'
|
||||||
|
|
||||||
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
|
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||||
|
tag_map = TAG_MAP
|
||||||
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
infixes = tuple(language_data.TOKENIZER_INFIXES)
|
|
||||||
|
|
||||||
tag_map = dict(language_data.TAG_MAP)
|
|
||||||
|
|
||||||
stop_words = set(language_data.STOP_WORDS)
|
|
||||||
|
|
||||||
|
|
|
@ -1,3 +0,0 @@
|
||||||
\.\.\.
|
|
||||||
(?<=[a-z])\.(?=[A-Z])
|
|
||||||
(?<=[a-zA-Z])-(?=[a-zA-z])
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,21 +0,0 @@
|
||||||
,
|
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
$
|
|
||||||
£
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,27 +0,0 @@
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
^\d+\.$
|
|
||||||
(?<=[a-z0-9)\]"'%\)])\.
|
|
||||||
(?<=[0-9])km
|
|
|
@ -1 +0,0 @@
|
||||||
{}
|
|
|
@ -1,31 +0,0 @@
|
||||||
{
|
|
||||||
"noun": [
|
|
||||||
["s", ""],
|
|
||||||
["ses", "s"],
|
|
||||||
["ves", "f"],
|
|
||||||
["xes", "x"],
|
|
||||||
["zes", "z"],
|
|
||||||
["ches", "ch"],
|
|
||||||
["shes", "sh"],
|
|
||||||
["men", "man"],
|
|
||||||
["ies", "y"]
|
|
||||||
],
|
|
||||||
|
|
||||||
"verb": [
|
|
||||||
["s", ""],
|
|
||||||
["ies", "y"],
|
|
||||||
["es", "e"],
|
|
||||||
["es", ""],
|
|
||||||
["ed", "e"],
|
|
||||||
["ed", ""],
|
|
||||||
["ing", "e"],
|
|
||||||
["ing", ""]
|
|
||||||
],
|
|
||||||
|
|
||||||
"adj": [
|
|
||||||
["er", ""],
|
|
||||||
["est", ""],
|
|
||||||
["er", "e"],
|
|
||||||
["est", "e"]
|
|
||||||
]
|
|
||||||
}
|
|
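Note: a suffix-rule table of the shape removed above (`[old_suffix, new_suffix]` pairs per part of speech) is usually applied by replacing each matching suffix and collecting the candidates. The sketch below is a minimal illustration under assumptions: `RULES` is a four-rule excerpt, `candidate_lemmas` is a hypothetical helper, and any filtering of candidates against a word list is deliberately left out.

```python
# Excerpt of the noun rules; each pair is [suffix to strip, replacement].
RULES = {"noun": [["s", ""], ["ses", "s"], ["ies", "y"], ["men", "man"]]}


def candidate_lemmas(word, pos, rules=RULES):
    """Hypothetical helper: apply every matching suffix rule, in table order."""
    forms = []
    for old, new in rules.get(pos, []):
        if word.endswith(old):
            forms.append(word[: len(word) - len(old)] + new)
    return forms


candidates = candidate_lemmas("ponies", "noun")
```

Because several rules can match the same word, the result is a list of candidates rather than a single lemma.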
Binary file not shown.
|
@ -1 +0,0 @@
|
||||||
-20.000000
|
|
1514125
spacy/de/data/vocab/strings.txt
File diff suppressed because it is too large
|
@ -1,57 +0,0 @@
|
||||||
{
|
|
||||||
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
|
|
||||||
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
|
|
||||||
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
|
|
||||||
"ADJA": {"pos": "ADJ"},
|
|
||||||
"ADJD": {"pos": "ADJ", "Variant": "Short"},
|
|
||||||
"ADV": {"pos": "ADV"},
|
|
||||||
"APPO": {"pos": "ADP", "AdpType": "Post"},
|
|
||||||
"APPR": {"pos": "ADP", "AdpType": "Prep"},
|
|
||||||
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
|
|
||||||
"APZR": {"pos": "ADP", "AdpType": "Circ"},
|
|
||||||
"ART": {"pos": "DET", "PronType": "Art"},
|
|
||||||
"CARD": {"pos": "NUM", "NumType": "Card"},
|
|
||||||
"FM": {"pos": "X", "Foreign": "Yes"},
|
|
||||||
"ITJ": {"pos": "INTJ"},
|
|
||||||
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
|
|
||||||
"KON": {"pos": "CONJ"},
|
|
||||||
"KOUI": {"pos": "SCONJ"},
|
|
||||||
"KOUS": {"pos": "SCONJ"},
|
|
||||||
"NE": {"pos": "PROPN"},
|
|
||||||
"NN": {"pos": "NOUN"},
|
|
||||||
"PAV": {"pos": "ADV", "PronType": "Dem"},
|
|
||||||
"PDAT": {"pos": "DET", "PronType": "Dem"},
|
|
||||||
"PDS": {"pos": "PRON", "PronType": "Dem"},
|
|
||||||
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PPER": {"pos": "PRON", "PronType": "Prs"},
|
|
||||||
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PRELAT": {"pos": "DET", "PronType": "Rel"},
|
|
||||||
"PRELS": {"pos": "PRON", "PronType": "Rel"},
|
|
||||||
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
|
|
||||||
"PTKA": {"pos": "PART"},
|
|
||||||
"PTKANT": {"pos": "PART", "PartType": "Res"},
|
|
||||||
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
|
|
||||||
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
|
|
||||||
"PTKZU": {"pos": "PART", "PartType": "Inf"},
|
|
||||||
"PWAT": {"pos": "DET", "PronType": "Int"},
|
|
||||||
"PWAV": {"pos": "ADV", "PronType": "Int"},
|
|
||||||
"PWS": {"pos": "PRON", "PronType": "Int"},
|
|
||||||
"TRUNC": {"pos": "X", "Hyph": "Yes"},
|
|
||||||
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
|
|
||||||
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
|
|
||||||
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
|
|
||||||
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
|
|
||||||
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"XY": {"pos": "X"},
|
|
||||||
"SP": {"pos": "SPACE"}
|
|
||||||
}
|
|
|
@ -4,9 +4,10 @@ from ..download import download
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
force=("Force overwrite", "flag", "f", bool),
|
force=("Force overwrite", "flag", "f", bool),
|
||||||
|
data_path=("Path to download model", "option", "d", str)
|
||||||
)
|
)
|
||||||
def main(data_size='all', force=False):
|
def main(data_size='all', force=False, data_path=None):
|
||||||
download('de', force)
|
download('de', force=force, data_path=data_path)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
|
|
File diff suppressed because it is too large
81
spacy/de/stop_words.py
Normal file
|
@ -0,0 +1,81 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
|
||||||
|
aller allerdings alles allgemeinen als also am an andere anderen andern anders
|
||||||
|
auch auf aus ausser außer ausserdem außerdem
|
||||||
|
|
||||||
|
bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin
|
||||||
|
bis bisher bist
|
||||||
|
|
||||||
|
da dabei dadurch dafür dagegen daher dahin dahinter damals damit danach daneben
|
||||||
|
dank dann daran darauf daraus darf darfst darin darüber darum darunter das
|
||||||
|
dasein daselbst dass daß dasselbe davon davor dazu dazwischen dein deine deinem
|
||||||
|
deiner dem dementsprechend demgegenüber demgemäss demgemäß demselben demzufolge
|
||||||
|
den denen denn denselben der deren derjenige derjenigen dermassen dermaßen
|
||||||
|
derselbe derselben des deshalb desselben dessen deswegen dich die diejenige
|
||||||
|
diejenigen dies diese dieselbe dieselben diesem diesen dieser dieses dir doch
|
||||||
|
dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
|
||||||
|
durfte durften
|
||||||
|
|
||||||
|
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
|
||||||
|
einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
|
||||||
|
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
|
||||||
|
|
||||||
|
früher fünf fünfte fünften fünfter fünftes für
|
||||||
|
|
||||||
|
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
|
||||||
|
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
|
||||||
|
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
|
||||||
|
großen grosser großer grosses großes gut gute guter gutes
|
||||||
|
|
||||||
|
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
|
||||||
|
hin hinter hoch
|
||||||
|
|
||||||
|
ich ihm ihn ihnen ihr ihre ihrem ihrer ihres im immer in indem infolgedessen
|
||||||
|
ins irgend ist
|
||||||
|
|
||||||
|
ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch
|
||||||
|
jemand jemandem jemanden jene jenem jenen jener jenes jetzt
|
||||||
|
|
||||||
|
kam kann kannst kaum kein keine keinem keinen keiner kleine kleinen kleiner
|
||||||
|
kleines kommen kommt können könnt konnte könnte konnten kurz
|
||||||
|
|
||||||
|
lang lange leicht leider lieber los
|
||||||
|
|
||||||
|
machen macht machte mag magst man manche manchem manchen mancher manches mehr
|
||||||
|
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
|
||||||
|
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
|
||||||
|
musste mussten
|
||||||
|
|
||||||
|
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
|
||||||
|
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
|
||||||
|
|
||||||
|
ob oben oder offen oft ohne
|
||||||
|
|
||||||
|
recht rechte rechten rechter rechtes richtig rund
|
||||||
|
|
||||||
|
sagt sagte sah satt schlecht schon sechs sechste sechsten sechster sechstes
|
||||||
|
sehr sei seid seien sein seine seinem seinen seiner seines seit seitdem selbst
|
||||||
|
selbst sich sie sieben siebente siebenten siebenter siebentes siebte siebten
|
||||||
|
siebter siebtes sind so solang solche solchem solchen solcher solches soll
|
||||||
|
sollen sollte sollten sondern sonst sowie später statt
|
||||||
|
|
||||||
|
tag tage tagen tat teil tel trotzdem tun
|
||||||
|
|
||||||
|
über überhaupt übrigens uhr um und uns unser unsere unserer unter
|
||||||
|
|
||||||
|
vergangene vergangenen viel viele vielem vielen vielleicht vier vierte vierten
|
||||||
|
vierter viertes vom von vor
|
||||||
|
|
||||||
|
wahr während währenddem währenddessen wann war wäre waren wart warum was wegen
|
||||||
|
weil weit weiter weitere weiteren weiteres welche welchem welchen welcher
|
||||||
|
welches wem wen wenig wenige weniger weniges wenigstens wenn wer werde werden
|
||||||
|
werdet wessen wie wieder will willst wir wird wirklich wirst wo wohl wollen
|
||||||
|
wollt wollte wollten worden wurde würde wurden würden
|
||||||
|
|
||||||
|
zehn zehnte zehnten zehnter zehntes zeit zu zuerst zugleich zum zunächst zur
|
||||||
|
zurück zusammen zwanzig zwar zwei zweite zweiten zweiter zweites zwischen
|
||||||
|
""".split())
|
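Note: the `set("""...""".split())` idiom used for `STOP_WORDS` splits on any whitespace, so line breaks and blank lines inside the literal are harmless and repeated words collapse into one entry. The sketch below uses a ten-word excerpt, not the full list, and the filtering step is only an assumed usage.

```python
# Short excerpt; "selbst" appears twice on purpose to show deduplication.
STOP_WORDS = set("""
aber auch auf aus

bei bin bis
selbst
selbst sich sie
""".split())

tokens = "Ich bin auch bei dir".lower().split()
content_tokens = [t for t in tokens if t not in STOP_WORDS]
```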
65
spacy/de/tag_map.py
Normal file
|
@ -0,0 +1,65 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..symbols import *
|
||||||
|
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
"$(": {POS: PUNCT, "PunctType": "brck"},
|
||||||
|
"$,": {POS: PUNCT, "PunctType": "comm"},
|
||||||
|
"$.": {POS: PUNCT, "PunctType": "peri"},
|
||||||
|
"ADJA": {POS: ADJ},
|
||||||
|
"ADJD": {POS: ADJ, "Variant": "short"},
|
||||||
|
"ADV": {POS: ADV},
|
||||||
|
"APPO": {POS: ADP, "AdpType": "post"},
|
||||||
|
"APPR": {POS: ADP, "AdpType": "prep"},
|
||||||
|
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
|
||||||
|
"APZR": {POS: ADP, "AdpType": "circ"},
|
||||||
|
"ART": {POS: DET, "PronType": "art"},
|
||||||
|
"CARD": {POS: NUM, "NumType": "card"},
|
||||||
|
"FM": {POS: X, "Foreign": "yes"},
|
||||||
|
"ITJ": {POS: INTJ},
|
||||||
|
"KOKOM": {POS: CONJ, "ConjType": "comp"},
|
||||||
|
"KON": {POS: CONJ},
|
||||||
|
"KOUI": {POS: SCONJ},
|
||||||
|
"KOUS": {POS: SCONJ},
|
||||||
|
"NE": {POS: PROPN},
|
||||||
|
"NNE": {POS: PROPN},
|
||||||
|
"NN": {POS: NOUN},
|
||||||
|
"PAV": {POS: ADV, "PronType": "dem"},
|
||||||
|
"PROAV": {POS: ADV, "PronType": "dem"},
|
||||||
|
"PDAT": {POS: DET, "PronType": "dem"},
|
||||||
|
"PDS": {POS: PRON, "PronType": "dem"},
|
||||||
|
"PIAT": {POS: DET, "PronType": "ind|neg|tot"},
|
||||||
|
"PIDAT": {POS: DET, "AdjType": "pdt", "PronType": "ind|neg|tot"},
|
||||||
|
"PIS": {POS: PRON, "PronType": "ind|neg|tot"},
|
||||||
|
"PPER": {POS: PRON, "PronType": "prs"},
|
||||||
|
"PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"},
|
||||||
|
"PPOSS": {POS: PRON, "Poss": "yes", "PronType": "prs"},
|
||||||
|
"PRELAT": {POS: DET, "PronType": "rel"},
|
||||||
|
"PRELS": {POS: PRON, "PronType": "rel"},
|
||||||
|
"PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"},
|
||||||
|
"PTKA": {POS: PART},
|
||||||
|
"PTKANT": {POS: PART, "PartType": "res"},
|
||||||
|
"PTKNEG": {POS: PART, "Negative": "yes"},
|
||||||
|
"PTKVZ": {POS: PART, "PartType": "vbp"},
|
||||||
|
"PTKZU": {POS: PART, "PartType": "inf"},
|
||||||
|
"PWAT": {POS: DET, "PronType": "int"},
|
||||||
|
"PWAV": {POS: ADV, "PronType": "int"},
|
||||||
|
"PWS": {POS: PRON, "PronType": "int"},
|
||||||
|
"TRUNC": {POS: X, "Hyph": "yes"},
|
||||||
|
"VAFIN": {POS: AUX, "Mood": "ind", "VerbForm": "fin"},
|
||||||
|
"VAIMP": {POS: AUX, "Mood": "imp", "VerbForm": "fin"},
|
||||||
|
"VAINF": {POS: AUX, "VerbForm": "inf"},
|
||||||
|
"VAPP": {POS: AUX, "Aspect": "perf", "VerbForm": "part"},
|
||||||
|
"VMFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin", "VerbType": "mod"},
|
||||||
|
"VMINF": {POS: VERB, "VerbForm": "inf", "VerbType": "mod"},
|
||||||
|
"VMPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part", "VerbType": "mod"},
|
||||||
|
"VVFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin"},
|
||||||
|
"VVIMP": {POS: VERB, "Mood": "imp", "VerbForm": "fin"},
|
||||||
|
"VVINF": {POS: VERB, "VerbForm": "inf"},
|
||||||
|
"VVIZU": {POS: VERB, "VerbForm": "inf"},
|
||||||
|
"VVPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part"},
|
||||||
|
"XY": {POS: X},
|
||||||
|
"SP": {POS: SPACE}
|
||||||
|
}
|
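Note: a tag map of this shape pairs each fine-grained tag with a coarse part of speech plus optional morphological features. The sketch below shows one way such a table could be queried. It is an illustration under assumptions: plain strings stand in for the `POS`, `DET`, `NOUN`, ... symbols imported from `..symbols`, and `coarse_pos` is a hypothetical helper.

```python
POS = "pos"  # stand-in for the POS symbol used as a dict key

# Three-entry excerpt with string values in place of spaCy symbols.
TAG_MAP = {
    "ART": {POS: "DET", "PronType": "art"},
    "NN": {POS: "NOUN"},
    "VVFIN": {POS: "VERB", "Mood": "ind", "VerbForm": "fin"},
}


def coarse_pos(tag, tag_map=TAG_MAP, default="X"):
    """Hypothetical helper: return the coarse POS for a fine-grained tag."""
    return tag_map.get(tag, {}).get(POS, default)


def features(tag, tag_map=TAG_MAP):
    """Return the morphological features of a tag, without the POS entry."""
    return {k: v for k, v in tag_map.get(tag, {}).items() if k != POS}


verb_pos = coarse_pos("VVFIN")
verb_feats = features("VVFIN")
```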
629
spacy/de/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,629 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..symbols import *
|
||||||
|
from ..language_data import PRON_LEMMA
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = {
|
||||||
|
"\\n": [
|
||||||
|
{ORTH: "\\n", LEMMA: "<nl>", TAG: "SP"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"\\t": [
|
||||||
|
{ORTH: "\\t", LEMMA: "<tab>", TAG: "SP"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"'S": [
|
||||||
|
{ORTH: "'S", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"'n": [
|
||||||
|
{ORTH: "'n", LEMMA: "ein"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"'ne": [
|
||||||
|
{ORTH: "'ne", LEMMA: "eine"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"'nen": [
|
||||||
|
{ORTH: "'nen", LEMMA: "einen"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"'s": [
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Abb.": [
|
||||||
|
{ORTH: "Abb.", LEMMA: "Abbildung"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Abk.": [
|
||||||
|
{ORTH: "Abk.", LEMMA: "Abkürzung"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Abt.": [
|
||||||
|
{ORTH: "Abt.", LEMMA: "Abteilung"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Apr.": [
|
||||||
|
{ORTH: "Apr.", LEMMA: "April"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Aug.": [
|
||||||
|
{ORTH: "Aug.", LEMMA: "August"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Bd.": [
|
||||||
|
{ORTH: "Bd.", LEMMA: "Band"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Betr.": [
|
||||||
|
{ORTH: "Betr.", LEMMA: "Betreff"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Bf.": [
|
||||||
|
{ORTH: "Bf.", LEMMA: "Bahnhof"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Bhf.": [
|
||||||
|
{ORTH: "Bhf.", LEMMA: "Bahnhof"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Bsp.": [
|
||||||
|
{ORTH: "Bsp.", LEMMA: "Beispiel"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Dez.": [
|
||||||
|
{ORTH: "Dez.", LEMMA: "Dezember"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Di.": [
|
||||||
|
{ORTH: "Di.", LEMMA: "Dienstag"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Do.": [
|
||||||
|
{ORTH: "Do.", LEMMA: "Donnerstag"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Fa.": [
|
||||||
|
{ORTH: "Fa.", LEMMA: "Firma"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Fam.": [
|
||||||
|
{ORTH: "Fam.", LEMMA: "Familie"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Feb.": [
|
||||||
|
{ORTH: "Feb.", LEMMA: "Februar"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Fr.": [
|
||||||
|
{ORTH: "Fr.", LEMMA: "Frau"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Frl.": [
|
||||||
|
{ORTH: "Frl.", LEMMA: "Fräulein"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Hbf.": [
|
||||||
|
{ORTH: "Hbf.", LEMMA: "Hauptbahnhof"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Hr.": [
|
||||||
|
{ORTH: "Hr.", LEMMA: "Herr"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Hrn.": [
|
||||||
|
{ORTH: "Hrn.", LEMMA: "Herr"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Jan.": [
|
||||||
|
{ORTH: "Jan.", LEMMA: "Januar"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Jh.": [
|
||||||
|
{ORTH: "Jh.", LEMMA: "Jahrhundert"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Jhd.": [
|
||||||
|
{ORTH: "Jhd.", LEMMA: "Jahrhundert"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Jul.": [
|
||||||
|
{ORTH: "Jul.", LEMMA: "Juli"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Jun.": [
|
||||||
|
{ORTH: "Jun.", LEMMA: "Juni"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mi.": [
|
||||||
|
{ORTH: "Mi.", LEMMA: "Mittwoch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mio.": [
|
||||||
|
{ORTH: "Mio.", LEMMA: "Million"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mo.": [
|
||||||
|
{ORTH: "Mo.", LEMMA: "Montag"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mrd.": [
|
||||||
|
{ORTH: "Mrd.", LEMMA: "Milliarde"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mrz.": [
|
||||||
|
{ORTH: "Mrz.", LEMMA: "März"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"MwSt.": [
|
||||||
|
{ORTH: "MwSt.", LEMMA: "Mehrwertsteuer"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Mär.": [
|
||||||
|
{ORTH: "Mär.", LEMMA: "März"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Nov.": [
|
||||||
|
{ORTH: "Nov.", LEMMA: "November"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Nr.": [
|
||||||
|
{ORTH: "Nr.", LEMMA: "Nummer"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Okt.": [
|
||||||
|
{ORTH: "Okt.", LEMMA: "Oktober"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Orig.": [
|
||||||
|
{ORTH: "Orig.", LEMMA: "Original"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Pkt.": [
|
||||||
|
{ORTH: "Pkt.", LEMMA: "Punkt"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Prof.": [
|
||||||
|
{ORTH: "Prof.", LEMMA: "Professor"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Red.": [
|
||||||
|
{ORTH: "Red.", LEMMA: "Redaktion"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"S'": [
|
||||||
|
{ORTH: "S'", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Sa.": [
|
||||||
|
{ORTH: "Sa.", LEMMA: "Samstag"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Sep.": [
|
||||||
|
{ORTH: "Sep.", LEMMA: "September"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Sept.": [
|
||||||
|
{ORTH: "Sept.", LEMMA: "September"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"So.": [
|
||||||
|
{ORTH: "So.", LEMMA: "Sonntag"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Std.": [
|
||||||
|
{ORTH: "Std.", LEMMA: "Stunde"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Str.": [
|
||||||
|
{ORTH: "Str.", LEMMA: "Straße"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Tel.": [
|
||||||
|
{ORTH: "Tel.", LEMMA: "Telefon"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Tsd.": [
|
||||||
|
{ORTH: "Tsd.", LEMMA: "Tausend"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"Univ.": [
|
||||||
|
{ORTH: "Univ.", LEMMA: "Universität"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"abzgl.": [
|
||||||
|
{ORTH: "abzgl.", LEMMA: "abzüglich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"allg.": [
|
||||||
|
{ORTH: "allg.", LEMMA: "allgemein"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"auf'm": [
|
||||||
|
{ORTH: "auf", LEMMA: "auf"},
|
||||||
|
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"bspw.": [
|
||||||
|
{ORTH: "bspw.", LEMMA: "beispielsweise"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"bzgl.": [
|
||||||
|
{ORTH: "bzgl.", LEMMA: "bezüglich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"bzw.": [
|
||||||
|
{ORTH: "bzw.", LEMMA: "beziehungsweise"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"d.h.": [
|
||||||
|
{ORTH: "d.h.", LEMMA: "das heißt"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"dgl.": [
|
||||||
|
{ORTH: "dgl.", LEMMA: "dergleichen"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"du's": [
|
||||||
|
{ORTH: "du", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ebd.": [
|
||||||
|
{ORTH: "ebd.", LEMMA: "ebenda"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"eigtl.": [
|
||||||
|
{ORTH: "eigtl.", LEMMA: "eigentlich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"engl.": [
|
||||||
|
{ORTH: "engl.", LEMMA: "englisch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"er's": [
|
||||||
|
{ORTH: "er", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"evtl.": [
|
||||||
|
{ORTH: "evtl.", LEMMA: "eventuell"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"frz.": [
|
||||||
|
{ORTH: "frz.", LEMMA: "französisch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"gegr.": [
|
||||||
|
{ORTH: "gegr.", LEMMA: "gegründet"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ggf.": [
|
||||||
|
{ORTH: "ggf.", LEMMA: "gegebenenfalls"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ggfs.": [
|
||||||
|
{ORTH: "ggfs.", LEMMA: "gegebenenfalls"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ggü.": [
|
||||||
|
{ORTH: "ggü.", LEMMA: "gegenüber"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"hinter'm": [
|
||||||
|
{ORTH: "hinter", LEMMA: "hinter"},
|
||||||
|
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"i.O.": [
|
||||||
|
{ORTH: "i.O.", LEMMA: "in Ordnung"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"i.d.R.": [
|
||||||
|
{ORTH: "i.d.R.", LEMMA: "in der Regel"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ich's": [
|
||||||
|
{ORTH: "ich", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ihr's": [
|
||||||
|
{ORTH: "ihr", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"incl.": [
|
||||||
|
{ORTH: "incl.", LEMMA: "inklusive"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"inkl.": [
|
||||||
|
{ORTH: "inkl.", LEMMA: "inklusive"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"insb.": [
|
||||||
|
{ORTH: "insb.", LEMMA: "insbesondere"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"kath.": [
|
||||||
|
{ORTH: "kath.", LEMMA: "katholisch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"lt.": [
|
||||||
|
{ORTH: "lt.", LEMMA: "laut"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"max.": [
|
||||||
|
{ORTH: "max.", LEMMA: "maximal"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"min.": [
|
||||||
|
{ORTH: "min.", LEMMA: "minimal"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"mind.": [
|
||||||
|
{ORTH: "mind.", LEMMA: "mindestens"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"mtl.": [
|
||||||
|
{ORTH: "mtl.", LEMMA: "monatlich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"n.Chr.": [
|
||||||
|
{ORTH: "n.Chr.", LEMMA: "nach Christus"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"orig.": [
|
||||||
|
{ORTH: "orig.", LEMMA: "original"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"röm.": [
|
||||||
|
{ORTH: "röm.", LEMMA: "römisch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"s'": [
|
||||||
|
{ORTH: "s'", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"s.o.": [
|
||||||
|
{ORTH: "s.o.", LEMMA: "siehe oben"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"sie's": [
|
||||||
|
{ORTH: "sie", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"sog.": [
|
||||||
|
{ORTH: "sog.", LEMMA: "so genannt"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"stellv.": [
|
||||||
|
{ORTH: "stellv.", LEMMA: "stellvertretend"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"tägl.": [
|
||||||
|
{ORTH: "tägl.", LEMMA: "täglich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"u.U.": [
|
||||||
|
{ORTH: "u.U.", LEMMA: "unter Umständen"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"u.s.w.": [
|
||||||
|
{ORTH: "u.s.w.", LEMMA: "und so weiter"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"u.v.m.": [
|
||||||
|
{ORTH: "u.v.m.", LEMMA: "und vieles mehr"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"unter'm": [
|
||||||
|
{ORTH: "unter", LEMMA: "unter"},
|
||||||
|
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"usf.": [
|
||||||
|
{ORTH: "usf.", LEMMA: "und so fort"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"usw.": [
|
||||||
|
{ORTH: "usw.", LEMMA: "und so weiter"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"uvm.": [
|
||||||
|
{ORTH: "uvm.", LEMMA: "und vieles mehr"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"v.Chr.": [
|
||||||
|
{ORTH: "v.Chr.", LEMMA: "vor Christus"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"v.a.": [
|
||||||
|
{ORTH: "v.a.", LEMMA: "vor allem"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"v.l.n.r.": [
|
||||||
|
{ORTH: "v.l.n.r.", LEMMA: "von links nach rechts"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"vgl.": [
|
||||||
|
{ORTH: "vgl.", LEMMA: "vergleiche"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"vllt.": [
|
||||||
|
{ORTH: "vllt.", LEMMA: "vielleicht"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"vlt.": [
|
||||||
|
{ORTH: "vlt.", LEMMA: "vielleicht"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"vor'm": [
|
||||||
|
{ORTH: "vor", LEMMA: "vor"},
|
||||||
|
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"wir's": [
|
||||||
|
{ORTH: "wir", LEMMA: PRON_LEMMA},
|
||||||
|
{ORTH: "'s", LEMMA: PRON_LEMMA}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.B.": [
|
||||||
|
{ORTH: "z.B.", LEMMA: "zum Beispiel"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.Bsp.": [
|
||||||
|
{ORTH: "z.Bsp.", LEMMA: "zum Beispiel"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.T.": [
|
||||||
|
{ORTH: "z.T.", LEMMA: "zum Teil"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.Z.": [
|
||||||
|
{ORTH: "z.Z.", LEMMA: "zur Zeit"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.Zt.": [
|
||||||
|
{ORTH: "z.Zt.", LEMMA: "zur Zeit"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"z.b.": [
|
||||||
|
{ORTH: "z.b.", LEMMA: "zum Beispiel"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"zzgl.": [
|
||||||
|
{ORTH: "zzgl.", LEMMA: "zuzüglich"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"österr.": [
|
||||||
|
{ORTH: "österr.", LEMMA: "österreichisch"}
|
||||||
|
],
|
||||||
|
|
||||||
|
"über'm": [
|
||||||
|
{ORTH: "über", LEMMA: "über"},
|
||||||
|
{ORTH: "'m", LEMMA: PRON_LEMMA}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
ORTH_ONLY = [
|
||||||
|
"'",
|
||||||
|
"\\\")",
|
||||||
|
"<space>",
|
||||||
|
"a.",
|
||||||
|
"ä.",
|
||||||
|
"A.C.",
|
||||||
|
"a.D.",
|
||||||
|
"A.D.",
|
||||||
|
"A.G.",
|
||||||
|
"a.M.",
|
||||||
|
"a.Z.",
|
||||||
|
"Abs.",
|
||||||
|
"adv.",
|
||||||
|
"al.",
|
||||||
|
"b.",
|
||||||
|
"B.A.",
|
||||||
|
"B.Sc.",
|
||||||
|
"betr.",
|
||||||
|
"biol.",
|
||||||
|
"Biol.",
|
||||||
|
"c.",
|
||||||
|
"ca.",
|
||||||
|
"Chr.",
|
||||||
|
"Cie.",
|
||||||
|
"co.",
|
||||||
|
"Co.",
|
||||||
|
"d.",
|
||||||
|
"D.C.",
|
||||||
|
"Dipl.-Ing.",
|
||||||
|
"Dipl.",
|
||||||
|
"Dr.",
|
||||||
|
"e.",
|
||||||
|
"e.g.",
|
||||||
|
"e.V.",
|
||||||
|
"ehem.",
|
||||||
|
"entspr.",
|
||||||
|
"erm.",
|
||||||
|
"etc.",
|
||||||
|
"ev.",
|
||||||
|
"f.",
|
||||||
|
"g.",
|
||||||
|
"G.m.b.H.",
|
||||||
|
"geb.",
|
||||||
|
"Gebr.",
|
||||||
|
"gem.",
|
||||||
|
"h.",
|
||||||
|
"h.c.",
|
||||||
|
"Hg.",
|
||||||
|
"hrsg.",
|
||||||
|
"Hrsg.",
|
||||||
|
"i.",
|
||||||
|
"i.A.",
|
||||||
|
"i.e.",
|
||||||
|
"i.G.",
|
||||||
|
"i.Tr.",
|
||||||
|
"i.V.",
|
||||||
|
"Ing.",
|
||||||
|
"j.",
|
||||||
|
"jr.",
|
||||||
|
"Jr.",
|
||||||
|
"jun.",
|
||||||
|
"jur.",
|
||||||
|
"k.",
|
||||||
|
"K.O.",
|
||||||
|
"l.",
|
||||||
|
"L.A.",
|
||||||
|
"lat.",
|
||||||
|
"m.",
|
||||||
|
"M.A.",
|
||||||
|
"m.E.",
|
||||||
|
"m.M.",
|
||||||
|
"M.Sc.",
|
||||||
|
"Mr.",
|
||||||
|
"n.",
|
||||||
|
"N.Y.",
|
||||||
|
"N.Y.C.",
|
||||||
|
"nat.",
|
||||||
|
"ö.",
|
||||||
|
"o.",
|
||||||
|
"o.a.",
|
||||||
|
"o.ä.",
|
||||||
|
"o.g.",
|
||||||
|
"o.k.",
|
||||||
|
"O.K.",
|
||||||
|
"p.",
|
||||||
|
"p.a.",
|
||||||
|
"p.s.",
|
||||||
|
"P.S.",
|
||||||
|
"pers.",
|
||||||
|
"phil.",
|
||||||
|
"q.",
|
||||||
|
"q.e.d.",
|
||||||
|
"r.",
|
||||||
|
"R.I.P.",
|
||||||
|
"rer.",
|
||||||
|
"s.",
|
||||||
|
"sen.",
|
||||||
|
"St.",
|
||||||
|
"std.",
|
||||||
|
"t.",
|
||||||
|
"u.",
|
||||||
|
"ü.",
|
||||||
|
"u.a.",
|
||||||
|
"U.S.",
|
||||||
|
"U.S.A.",
|
||||||
|
"U.S.S.",
|
||||||
|
"v.",
|
||||||
|
"Vol.",
|
||||||
|
"vs.",
|
||||||
|
"w.",
|
||||||
|
"wiss.",
|
||||||
|
"x.",
|
||||||
|
"y.",
|
||||||
|
"z.",
|
||||||
|
]
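The `ORTH_ONLY` list holds strings that should survive tokenization as single tokens without any extra attributes. As a minimal sketch, assuming a helper named `strings_to_exceptions` and a plain string standing in for spaCy's `ORTH` symbol (both are hypothetical and used only for illustration), such a list could be expanded into the same shape as the exception dictionary above:

```python
# Hypothetical stand-in for spaCy's ORTH symbol; real code would import it.
ORTH = "ORTH"


def strings_to_exceptions(orths):
    """Map each string to a one-token exception entry keyed by itself."""
    return {orth: [{ORTH: orth}] for orth in orths}


# A small sample, not the full list from the diff.
orth_only = ["a.D.", "Dr.", "ca."]
exceptions = strings_to_exceptions(orth_only)
```

Python silently concatenates adjacent string literals, so a missing comma in a long list like this merges two entries into one rather than raising an error.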
|
|
@ -10,10 +10,19 @@ from . import about
|
||||||
from . import util
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
def download(lang, force=False, fail_on_exist=True):
|
def download(lang, force=False, fail_on_exist=True, data_path=None):
|
||||||
|
if not data_path:
|
||||||
|
data_path = util.get_data_path()
|
||||||
|
|
||||||
|
# spaCy uses pathlib, and util.get_data_path returns a pathlib.Path object,
|
||||||
|
# but sputnik (which we're using below) doesn't use pathlib and requires
|
||||||
|
# its data_path parameters to be strings, so we coerce the data_path to a
|
||||||
|
# str here.
|
||||||
|
data_path = str(data_path)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
pkg = sputnik.package(about.__title__, about.__version__,
|
pkg = sputnik.package(about.__title__, about.__version__,
|
||||||
about.__models__.get(lang, lang))
|
about.__models__.get(lang, lang), data_path)
|
||||||
if force:
|
if force:
|
||||||
shutil.rmtree(pkg.path)
|
shutil.rmtree(pkg.path)
|
||||||
elif fail_on_exist:
|
elif fail_on_exist:
|
||||||
|
@ -24,15 +33,14 @@ def download(lang, force=False, fail_on_exist=True):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
package = sputnik.install(about.__title__, about.__version__,
|
package = sputnik.install(about.__title__, about.__version__,
|
||||||
about.__models__.get(lang, lang))
|
about.__models__.get(lang, lang), data_path)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
sputnik.package(about.__title__, about.__version__,
|
sputnik.package(about.__title__, about.__version__,
|
||||||
about.__models__.get(lang, lang))
|
about.__models__.get(lang, lang), data_path)
|
||||||
except (PackageNotFoundException, CompatiblePackageNotFoundException):
|
except (PackageNotFoundException, CompatiblePackageNotFoundException):
|
||||||
print("Model failed to install. Please run 'python -m "
|
print("Model failed to install. Please run 'python -m "
|
||||||
"spacy.%s.download --force'." % lang, file=sys.stderr)
|
"spacy.%s.download --force'." % lang, file=sys.stderr)
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
data_path = util.get_data_path()
|
|
||||||
print("Model successfully installed to %s" % data_path, file=sys.stderr)
|
print("Model successfully installed to %s" % data_path, file=sys.stderr)
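The change above falls back to a default data path and coerces it to `str` before passing it on. A minimal sketch of that pattern in isolation, where `resolve_data_path` and `default_dir` are hypothetical names and the default directory is an assumption:

```python
from pathlib import Path


def resolve_data_path(data_path=None, default_dir=Path("data")):
    """Return the target directory as a plain string.

    Falls back to default_dir when no path is given, then coerces
    pathlib.Path objects to str for libraries that expect strings.
    """
    if not data_path:
        data_path = default_dir
    return str(data_path)
```

Coercing once at the boundary keeps the rest of the code free to use `pathlib` while the string-only dependency still receives what it expects.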
|
||||||
|
|
|
@ -1,15 +1,18 @@
|
||||||
|
# encoding: utf8
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
from os import path
|
from os import path
|
||||||
|
|
||||||
|
from ..util import match_best_version
|
||||||
|
from ..util import get_data_path
|
||||||
from ..language import Language
|
from ..language import Language
|
||||||
from . import language_data
|
|
||||||
from .. import util
|
|
||||||
from ..lemmatizer import Lemmatizer
|
from ..lemmatizer import Lemmatizer
|
||||||
from ..vocab import Vocab
|
from ..vocab import Vocab
|
||||||
from ..tokenizer import Tokenizer
|
from ..tokenizer import Tokenizer
|
||||||
from ..attrs import LANG
|
from ..attrs import LANG
|
||||||
|
|
||||||
|
from .language_data import *
|
||||||
|
|
||||||
|
|
||||||
class English(Language):
|
class English(Language):
|
||||||
lang = 'en'
|
lang = 'en'
|
||||||
|
@ -18,14 +21,40 @@ class English(Language):
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
lex_attr_getters[LANG] = lambda text: 'en'
|
lex_attr_getters[LANG] = lambda text: 'en'
|
||||||
|
|
||||||
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||||
|
tag_map = TAG_MAP
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
lemma_rules = LEMMA_RULES
|
||||||
|
|
||||||
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
|
|
||||||
|
|
||||||
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
|
def __init__(self, **overrides):
|
||||||
|
# Make a special-case hack for loading the GloVe vectors, to support
|
||||||
|
# deprecated <1.0 stuff. Phase this out once the data is fixed.
|
||||||
|
overrides = _fix_deprecated_glove_vectors_loading(overrides)
|
||||||
|
Language.__init__(self, **overrides)
|
||||||
|
|
||||||
infixes = tuple(language_data.TOKENIZER_INFIXES)
|
|
||||||
|
|
||||||
tag_map = dict(language_data.TAG_MAP)
|
def _fix_deprecated_glove_vectors_loading(overrides):
|
||||||
|
if 'data_dir' in overrides and 'path' not in overrides:
|
||||||
stop_words = set(language_data.STOP_WORDS)
|
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
|
||||||
|
if overrides.get('path') is False:
|
||||||
|
return overrides
|
||||||
|
if overrides.get('path') in (None, True):
|
||||||
|
data_path = get_data_path()
|
||||||
|
else:
|
||||||
|
path = overrides['path']
|
||||||
|
data_path = path.parent
|
||||||
|
vec_path = None
|
||||||
|
if 'add_vectors' not in overrides:
|
||||||
|
if 'vectors' in overrides:
|
||||||
|
vec_path = match_best_version(overrides['vectors'], None, data_path)
|
||||||
|
if vec_path is None:
|
||||||
|
raise IOError(
|
||||||
|
'Could not load data pack %s from %s' % (overrides['vectors'], data_path))
|
||||||
|
else:
|
||||||
|
vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
|
||||||
|
if vec_path is not None:
|
||||||
|
vec_path = vec_path / 'vocab' / 'vec.bin'
|
||||||
|
if vec_path is not None:
|
||||||
|
overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
|
||||||
|
return overrides
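The branch logic that decides where to look for vectors can be sketched on its own. This is a simplified illustration, not the project's code: `resolve_data_dir` and `DEFAULT_DATA_PATH` are hypothetical, and the default location is an assumption.

```python
from pathlib import Path

# Assumed default location, standing in for the result of get_data_path().
DEFAULT_DATA_PATH = Path("/hypothetical/spacy/data")


def resolve_data_dir(overrides):
    """Pick the directory to search, following the same branches as above."""
    if "data_dir" in overrides and "path" not in overrides:
        raise ValueError("The argument 'data_dir' has been renamed to 'path'")
    path = overrides.get("path")
    if path is False:
        return None  # loading is disabled, so there is nothing to resolve
    if path is None or path is True:
        return DEFAULT_DATA_PATH
    # An explicit model path: look for vector packs next to it.
    return Path(path).parent
```

Wrapping the explicit path in `Path(...)` lets the sketch accept plain strings as well as `Path` objects.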
|
||||||
|
|
|
@ -7,17 +7,18 @@ from .. import about
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
force=("Force overwrite", "flag", "f", bool),
|
force=("Force overwrite", "flag", "f", bool),
|
||||||
|
data_path=("Path to download model", "option", "d", str)
|
||||||
)
|
)
|
||||||
def main(data_size='all', force=False):
|
def main(data_size='all', force=False, data_path=None):
|
||||||
if force:
|
if force:
|
||||||
sputnik.purge(about.__title__, about.__version__)
|
sputnik.purge(about.__title__, about.__version__)
|
||||||
|
|
||||||
if data_size in ('all', 'parser'):
|
if data_size in ('all', 'parser'):
|
||||||
print("Downloading parsing model")
|
print("Downloading parsing model")
|
||||||
download('en', False)
|
download('en', force=False, data_path=data_path)
|
||||||
if data_size in ('all', 'glove'):
|
if data_size in ('all', 'glove'):
|
||||||
print("Downloading GloVe vectors")
|
print("Downloading GloVe vectors")
|
||||||
download('en_glove_cc_300_1m_vectors', False)
|
download('en_glove_cc_300_1m_vectors', force=False, data_path=data_path)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
|
|
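The command-line change above adds a `data_path` option through `plac` annotations. For readers unfamiliar with `plac`, a roughly comparable interface can be sketched with the standard library's `argparse` instead. This is a different technique from the one in the diff, and `build_parser` is a hypothetical name:

```python
import argparse


def build_parser():
    """Build a parser with a positional size, a force flag and a path option."""
    parser = argparse.ArgumentParser(description="Download models (sketch)")
    parser.add_argument("data_size", nargs="?", default="all")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force overwrite")
    parser.add_argument("-d", "--data-path", default=None,
                        help="Path to download model")
    return parser


# Parse an explicit argument list rather than sys.argv for demonstration.
args = build_parser().parse_args(["parser", "-d", "/tmp/models"])
```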
File diff suppressed because it is too large
|
@ -1,4 +1,8 @@
|
||||||
{
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
LEMMA_RULES = {
|
||||||
"noun": [
|
"noun": [
|
||||||
["s", ""],
|
["s", ""],
|
||||||
["ses", "s"],
|
["ses", "s"],
|
67
spacy/en/morph_rules.py
Normal file
|
@ -0,0 +1,67 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..symbols import *
|
||||||
|
from ..language_data import PRON_LEMMA
|
||||||
|
|
||||||
|
|
||||||
|
MORPH_RULES = {
|
||||||
|
"PRP": {
|
||||||
|
"I": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
|
||||||
|
"me": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
|
||||||
|
"you": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two"},
|
||||||
|
"he": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
|
||||||
|
"him": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
|
||||||
|
"she": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
|
||||||
|
"her": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
|
||||||
|
"it": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
|
||||||
|
"we": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
|
||||||
|
"us": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
|
||||||
|
"they": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
|
||||||
|
"them": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
|
||||||
|
|
||||||
|
"mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"yours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"his": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"hers": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"its": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"ours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"yours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
"theirs": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
|
||||||
|
|
||||||
|
"myself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
|
||||||
|
"yourself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
|
||||||
|
"himself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Masc", "Reflex": "Yes"},
|
||||||
|
"herself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Fem", "Reflex": "Yes"},
|
||||||
|
"itself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Neut", "Reflex": "Yes"},
|
||||||
|
"themself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
|
||||||
|
"ourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"},
|
||||||
|
"yourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
|
||||||
|
"themselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"}
|
||||||
|
},
|
||||||
|
|
||||||
|
"PRP$": {
|
||||||
|
"my": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"your": {LEMMA: PRON_LEMMA, "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"his": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"her": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"its": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"our": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
|
||||||
|
"their": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
|
||||||
|
},
|
||||||
|
|
||||||
|
"VBZ": {
|
||||||
|
"am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
|
||||||
|
"are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
|
||||||
|
"is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
|
||||||
|
},
|
||||||
|
|
||||||
|
"VBP": {
|
||||||
|
"are": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
|
||||||
|
},
|
||||||
|
|
||||||
|
"VBD": {
|
||||||
|
"was": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
|
||||||
|
"were": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
|
||||||
|
}
|
||||||
|
}
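The rules are nested by fine-grained tag and then by word form. A minimal lookup sketch, assuming plain strings in place of spaCy's `LEMMA` symbol and `PRON_LEMMA` constant (the values here are stand-ins) and a hypothetical helper `lookup_morph`:

```python
# Stand-ins for spaCy's symbols; real code would import them.
LEMMA = "LEMMA"
PRON_LEMMA = "-PRON-"

# A two-entry excerpt with the same shape as the table above.
MORPH_RULES = {
    "PRP": {
        "me": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One",
               "Number": "Sing", "Case": "Acc"},
    },
    "VBD": {
        "was": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past",
                "Number": "Sing"},
    },
}


def lookup_morph(tag, word, rules=MORPH_RULES):
    """Return the feature dict for (tag, word), or {} if no rule applies."""
    return rules.get(tag, {}).get(word, {})
```

Because the inner mappings are ordinary dict literals, a repeated key within one tag keeps only its last value.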
|
|
@ -1,47 +0,0 @@
|
||||||
import re
|
|
||||||
|
|
||||||
|
|
||||||
_mw_prepositions = [
|
|
||||||
'close to',
|
|
||||||
'down by',
|
|
||||||
'on the way to',
|
|
||||||
'on my way to',
|
|
||||||
'on my way',
|
|
||||||
'on his way to',
|
|
||||||
'on his way',
|
|
||||||
'on her way to',
|
|
||||||
'on her way',
|
|
||||||
'on your way to',
|
|
||||||
'on your way',
|
|
||||||
'on our way to',
|
|
||||||
'on our way',
|
|
||||||
'on their way to',
|
|
||||||
'on their way',
|
|
||||||
'along the route from'
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
MW_PREPOSITIONS_RE = re.compile('|'.join(_mw_prepositions), flags=re.IGNORECASE)
|
|
||||||
|
|
||||||
|
|
||||||
TIME_RE = re.compile(
|
|
||||||
'{colon_digits}|{colon_digits} ?{am_pm}?|{one_two_digits} ?({am_pm})'.format(
|
|
||||||
colon_digits=r'[0-2]?[0-9]:[0-5][0-9](?::[0-5][0-9])?',
|
|
||||||
one_two_digits=r'[0-2]?[0-9]',
|
|
||||||
am_pm=r'[ap]\.?m\.?'))
|
|
||||||
|
|
||||||
DATE_RE = re.compile(
|
|
||||||
'(?:this|last|next|the) (?:week|weekend|{days})'.format(
|
|
||||||
days='Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday'
|
|
||||||
))
|
|
||||||
|
|
||||||
|
|
||||||
MONEY_RE = re.compile('\$\d+(?:\.\d+)?|\d+ dollars(?: \d+ cents)?')
|
|
||||||
|
|
||||||
|
|
||||||
DAYS_RE = re.compile('Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday')
|
|
||||||
|
|
||||||
|
|
||||||
REGEXES = [('IN', 'O', MW_PREPOSITIONS_RE), ('CD', 'TIME', TIME_RE),
|
|
||||||
('NNP', 'DATE', DATE_RE),
|
|
||||||
('NNP', 'DATE', DAYS_RE), ('CD', 'MONEY', MONEY_RE)]
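To see how two of the removed patterns behave, here is a small sketch. The patterns are copied from the lines above; the only change is the raw-string prefix on `MONEY_RE`, added here so that newer Python versions do not warn about the `\$` and `\d` escapes. The sample sentences are invented.

```python
import re

MONEY_RE = re.compile(r'\$\d+(?:\.\d+)?|\d+ dollars(?: \d+ cents)?')

DATE_RE = re.compile(
    '(?:this|last|next|the) (?:week|weekend|{days})'.format(
        days='Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday'))

# Collect every money expression found in a sample sentence.
money = [m.group() for m in
         MONEY_RE.finditer("It cost $3.50, or 5 dollars 20 cents.")]

# Find the first relative date expression.
date = DATE_RE.search("See you next Friday.")
```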
|
|
67
spacy/en/stop_words.py
Normal file
|
@ -0,0 +1,67 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
a about above across after afterwards again against all almost alone along
|
||||||
|
already also although always am among amongst amount an and another any anyhow
|
||||||
|
anyone anything anyway anywhere are around as at
|
||||||
|
|
||||||
|
back be became because become becomes becoming been before beforehand behind
|
||||||
|
being below beside besides between beyond both bottom but by
|
||||||
|
|
||||||
|
call can cannot ca could
|
||||||
|
|
||||||
|
did do does doing done down due during
|
||||||
|
|
||||||
|
each eight either eleven else elsewhere empty enough etc even ever every
|
||||||
|
everyone everything everywhere except
|
||||||
|
|
||||||
|
few fifteen fifty first five for former formerly forty four from front full
|
||||||
|
further
|
||||||
|
|
||||||
|
get give go
|
||||||
|
|
||||||
|
had has have he hence her here hereafter hereby herein hereupon hers herself
|
||||||
|
him himself his how however hundred
|
||||||
|
|
||||||
|
i if in inc indeed into is it its itself
|
||||||
|
|
||||||
|
keep
|
||||||
|
|
||||||
|
last latter latterly least less
|
||||||
|
|
||||||
|
just
|
||||||
|
|
||||||
|
made make many may me meanwhile might mine more moreover most mostly move much
|
||||||
|
must my myself
|
||||||
|
|
||||||
|
name namely neither never nevertheless next nine no nobody none noone nor not
|
||||||
|
nothing now nowhere
|
||||||
|
|
||||||
|
of off often on once one only onto or other others otherwise our ours ourselves
|
||||||
|
out over own
|
||||||
|
|
||||||
|
part per perhaps please put
|
||||||
|
|
||||||
|
quite
|
||||||
|
|
||||||
|
rather re really regarding
|
||||||
|
|
||||||
|
same say see seem seemed seeming seems serious several she should show side
|
||||||
|
since six sixty so some somehow someone something sometime sometimes somewhere
|
||||||
|
still such
|
||||||
|
|
||||||
|
take ten than that the their them themselves then thence there thereafter
|
||||||
|
thereby therefore therein thereupon these they third this those though three
|
||||||
|
through throughout thru thus to together too top toward towards twelve twenty
|
||||||
|
two
|
||||||
|
|
||||||
|
under until up unless upon us used using
|
||||||
|
|
||||||
|
various very via was we well were what whatever when whence whenever where
|
||||||
|
whereafter whereas whereby wherein whereupon wherever whether which while
|
||||||
|
whither who whoever whole whom whose why will with within without would
|
||||||
|
|
||||||
|
yet you your yours yourself yourselves
|
||||||
|
""".split())
|
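Building the set from a whitespace-separated string keeps the word list easy to edit. A minimal sketch with a shortened word list, where `remove_stop_words` is a hypothetical helper and not part of the diff:

```python
# A shortened sample; splitting on whitespace ignores the line breaks.
STOP_WORDS = set("""
a about above
the the via
""".split())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]
```

Since the result is a set, a word that appears twice in the source string is stored only once.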
64
spacy/en/tag_map.py
Normal file
|
@ -0,0 +1,64 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..symbols import *
|
||||||
|
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
".": {POS: PUNCT, "PunctType": "peri"},
|
||||||
|
",": {POS: PUNCT, "PunctType": "comm"},
|
||||||
|
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||||
|
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||||
|
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||||
|
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
":": {POS: PUNCT},
|
||||||
|
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||||
|
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||||
|
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||||
|
"CC": {POS: CONJ, "ConjType": "coor"},
|
||||||
|
"CD": {POS: NUM, "NumType": "card"},
|
||||||
|
"DT": {POS: DET},
|
||||||
|
"EX": {POS: ADV, "AdvType": "ex"},
|
||||||
|
"FW": {POS: X, "Foreign": "yes"},
|
||||||
|
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||||
|
"IN": {POS: ADP},
|
||||||
|
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||||
|
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||||
|
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||||
|
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||||
|
"MD": {POS: VERB, "VerbType": "mod"},
|
||||||
|
"NIL": {POS: ""},
|
||||||
|
"NN": {POS: NOUN, "Number": "sing"},
|
||||||
|
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||||
|
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||||
|
"NNS": {POS: NOUN, "Number": "plur"},
|
||||||
|
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||||
|
"POS": {POS: PART, "Poss": "yes"},
|
||||||
|
"PRP": {POS: PRON, "PronType": "prs"},
|
||||||
|
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||||
|
"RB": {POS: ADV, "Degree": "pos"},
|
||||||
|
"RBR": {POS: ADV, "Degree": "comp"},
|
||||||
|
"RBS": {POS: ADV, "Degree": "sup"},
|
||||||
|
"RP": {POS: PART},
|
||||||
|
"SYM": {POS: SYM},
|
||||||
|
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||||
|
"UH": {POS: INTJ},
|
||||||
|
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||||
|
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||||
|
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||||
|
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||||
|
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||||
|
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
|
||||||
|
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||||
|
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||||
|
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||||
|
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||||
|
"SP": {POS: SPACE},
|
||||||
|
"ADD": {POS: X},
|
||||||
|
"NFP": {POS: PUNCT},
|
||||||
|
"GW": {POS: X},
|
||||||
|
"XX": {POS: X},
|
||||||
|
"BES": {POS: VERB},
|
||||||
|
"HVS": {POS: VERB}
|
||||||
|
}
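The tag map translates fine-grained treebank tags into a coarse part of speech plus features. A minimal sketch of a lookup, assuming plain strings in place of spaCy's `POS` symbol and its part-of-speech constants, with `coarse_pos` as a hypothetical helper:

```python
# Stand-in for spaCy's POS symbol; real code would import it.
POS = "pos"

# A three-entry excerpt with the same shape as the table above.
TAG_MAP = {
    "NN": {POS: "NOUN", "Number": "sing"},
    "VBD": {POS: "VERB", "VerbForm": "fin", "Tense": "past"},
    "JJR": {POS: "ADJ", "Degree": "comp"},
}


def coarse_pos(tag, tag_map=TAG_MAP, default="X"):
    """Return the coarse part of speech for a tag, or a default if unknown."""
    return tag_map.get(tag, {}).get(POS, default)
```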
|
2065
spacy/en/tokenizer_exceptions.py
Normal file
File diff suppressed because it is too large
246
spacy/en/uget.py
|
@ -1,246 +0,0 @@
|
||||||
import os
|
|
||||||
import time
|
|
||||||
import io
|
|
||||||
import math
|
|
||||||
import re
|
|
||||||
|
|
||||||
try:
|
|
||||||
from urllib.parse import urlparse
|
|
||||||
from urllib.request import urlopen, Request
|
|
||||||
from urllib.error import HTTPError
|
|
||||||
except ImportError:
|
|
||||||
from urllib2 import urlopen, urlparse, Request, HTTPError
|
|
||||||
|
|
||||||
|
|
||||||
class UnknownContentLengthException(Exception): pass
|
|
||||||
class InvalidChecksumException(Exception): pass
|
|
||||||
class UnsupportedHTTPCodeException(Exception): pass
|
|
||||||
class InvalidOffsetException(Exception): pass
|
|
||||||
class MissingChecksumHeader(Exception): pass
|
|
||||||
|
|
||||||
|
|
||||||
CHUNK_SIZE = 16 * 1024
|
|
||||||
|
|
||||||
|
|
||||||
class RateSampler(object):
|
|
||||||
def __init__(self, period=1):
|
|
||||||
self.rate = None
|
|
||||||
self.reset = True
|
|
||||||
self.period = period
|
|
||||||
|
|
||||||
def __enter__(self):
|
|
||||||
if self.reset:
|
|
||||||
self.reset = False
|
|
||||||
self.start = time.time()
|
|
||||||
self.counter = 0
|
|
||||||
|
|
||||||
def __exit__(self, type, value, traceback):
|
|
||||||
elapsed = time.time() - self.start
|
|
||||||
if elapsed >= self.period:
|
|
||||||
self.reset = True
|
|
||||||
self.rate = float(self.counter) / elapsed
|
|
||||||
|
|
||||||
def update(self, value):
|
|
||||||
self.counter += value
|
|
||||||
|
|
||||||
def format(self, unit="MB"):
|
|
||||||
if self.rate is None:
|
|
||||||
return None
|
|
||||||
|
|
||||||
divisor = {'MB': 1048576, 'kB': 1024}
|
|
||||||
return "%0.2f%s/s" % (self.rate / divisor[unit], unit)
|
|
||||||
|
|
||||||
|
|
||||||
class TimeEstimator(object):
|
|
||||||
def __init__(self, cooldown=1):
|
|
||||||
self.cooldown = cooldown
|
|
||||||
self.start = time.time()
|
|
||||||
self.time_left = None
|
|
||||||
|
|
||||||
def update(self, bytes_read, total_size):
|
|
||||||
elapsed = time.time() - self.start
|
|
||||||
if elapsed > self.cooldown:
|
|
||||||
self.time_left = math.ceil(elapsed * total_size /
|
|
||||||
bytes_read - elapsed)
|
|
||||||
|
|
||||||
def format(self):
|
|
||||||
if self.time_left is None:
|
|
||||||
return None
|
|
||||||
|
|
||||||
res = "eta "
|
|
||||||
if self.time_left / 60 >= 1:
|
|
||||||
res += "%dm " % (self.time_left / 60)
|
|
||||||
return res + "%ds" % (self.time_left % 60)
|
|
||||||
|
|
||||||
|
|
||||||
def format_bytes_read(bytes_read, unit="MB"):
|
|
||||||
divisor = {'MB': 1048576, 'kB': 1024}
|
|
||||||
return "%0.2f%s" % (float(bytes_read) / divisor[unit], unit)
|
|
||||||
|
|
||||||
|
|
||||||
def format_percent(bytes_read, total_size):
|
|
||||||
percent = round(bytes_read * 100.0 / total_size, 2)
|
|
||||||
return "%0.2f%%" % percent
|
|
||||||
|
|
||||||
|
|
||||||
def get_content_range(response):
|
|
||||||
content_range = response.headers.get('Content-Range', "").strip()
|
|
||||||
if content_range:
|
|
||||||
m = re.match(r"bytes (\d+)-(\d+)/(\d+)", content_range)
|
|
||||||
if m:
|
|
||||||
return [int(v) for v in m.groups()]
|
|
||||||
|
|
||||||
|
|
||||||
def get_content_length(response):
|
|
||||||
if 'Content-Length' not in response.headers:
|
|
||||||
raise UnknownContentLengthException
|
|
||||||
return int(response.headers.get('Content-Length').strip())
|
|
||||||
|
|
||||||
|
|
||||||
def get_url_meta(url, checksum_header=None):
|
|
||||||
class HeadRequest(Request):
|
|
||||||
def get_method(self):
|
|
||||||
return "HEAD"
|
|
||||||
|
|
||||||
r = urlopen(HeadRequest(url))
|
|
||||||
res = {'size': get_content_length(r)}
|
|
||||||
|
|
||||||
if checksum_header:
|
|
||||||
value = r.headers.get(checksum_header)
|
|
||||||
if value:
|
|
||||||
res['checksum'] = value
|
|
||||||
|
|
||||||
r.close()
|
|
||||||
return res
|
|
||||||
|
|
||||||
|
|
||||||
def progress(console, bytes_read, total_size, transfer_rate, eta):
|
|
||||||
fields = [
|
|
||||||
format_bytes_read(bytes_read),
|
|
||||||
format_percent(bytes_read, total_size),
|
|
||||||
transfer_rate.format(),
|
|
||||||
eta.format(),
|
|
||||||
" " * 10,
|
|
||||||
]
|
|
||||||
console.write("Downloaded %s\r" % " ".join(filter(None, fields)))
|
|
||||||
console.flush()
|
|
||||||
|
|
||||||
|
|
||||||
def read_request(request, offset=0, console=None,
|
|
||||||
progress_func=None, write_func=None):
|
|
||||||
# support partial downloads
|
|
||||||
if offset > 0:
|
|
||||||
request.add_header('Range', "bytes=%s-" % offset)
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = urlopen(request)
|
|
||||||
except HTTPError as e:
|
|
||||||
if e.code == 416: # Requested Range Not Satisfiable
|
|
||||||
raise InvalidOffsetException
|
|
||||||
|
|
||||||
# TODO add http error handling here
|
|
||||||
raise UnsupportedHTTPCodeException(e.code)
|
|
||||||
|
|
||||||
total_size = get_content_length(response) + offset
|
|
||||||
bytes_read = offset
|
|
||||||
|
|
||||||
# sanity checks
|
|
||||||
if response.code == 200: # OK
|
|
||||||
assert offset == 0
|
|
||||||
elif response.code == 206: # Partial content
|
|
||||||
range_start, range_end, range_total = get_content_range(response)
|
|
||||||
assert range_start == offset
|
|
||||||
assert range_total == total_size
|
|
||||||
assert range_end + 1 - range_start == total_size - bytes_read
|
|
||||||
else:
|
|
||||||
raise UnsupportedHTTPCodeException(response.code)
|
|
||||||
|
|
||||||
eta = TimeEstimator()
|
|
||||||
transfer_rate = RateSampler()
|
|
||||||
|
|
||||||
if console:
|
|
||||||
if offset > 0:
|
|
||||||
console.write("Continue downloading...\n")
|
|
||||||
else:
|
|
||||||
console.write("Downloading...\n")
|
|
||||||
|
|
||||||
while True:
|
|
||||||
with transfer_rate:
|
|
||||||
chunk = response.read(CHUNK_SIZE)
|
|
||||||
if not chunk:
|
|
||||||
if progress_func and console:
|
|
||||||
console.write('\n')
|
|
||||||
break
|
|
||||||
|
|
||||||
bytes_read += len(chunk)
|
|
||||||
|
|
||||||
transfer_rate.update(len(chunk))
|
|
||||||
eta.update(bytes_read - offset, total_size - offset)
|
|
||||||
|
|
||||||
if progress_func and console:
|
|
||||||
progress_func(console, bytes_read, total_size, transfer_rate, eta)
|
|
||||||
|
|
||||||
if write_func:
|
|
||||||
write_func(chunk)
|
|
||||||
|
|
||||||
response.close()
|
|
||||||
assert bytes_read == total_size
|
|
||||||
return response
|
|
||||||
|
|
||||||
|
|
||||||
def download(url, path=".",
|
|
||||||
checksum=None, checksum_header=None,
|
|
||||||
headers=None, console=None):
|
|
||||||
|
|
||||||
if os.path.isdir(path):
|
|
||||||
path = os.path.join(path, url.rsplit('/', 1)[1])
|
|
||||||
path = os.path.abspath(path)
|
|
||||||
|
|
||||||
with io.open(path, "a+b") as f:
|
|
||||||
size = f.tell()
|
|
||||||
|
|
||||||
# update checksum of partially downloaded file
|
|
||||||
if checksum:
|
|
||||||
f.seek(0, os.SEEK_SET)
|
|
||||||
for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
|
|
||||||
checksum.update(chunk)
|
|
||||||
|
|
||||||
def write(chunk):
|
|
||||||
if checksum:
|
|
||||||
checksum.update(chunk)
|
|
||||||
f.write(chunk)
|
|
||||||
|
|
||||||
request = Request(url)
|
|
||||||
|
|
||||||
# request headers
|
|
||||||
if headers:
|
|
||||||
for key, value in headers.items():
|
|
||||||
request.add_header(key, value)
|
|
||||||
|
|
||||||
try:
|
|
||||||
response = read_request(request,
|
|
||||||
offset=size,
|
|
||||||
console=console,
|
|
||||||
progress_func=progress,
|
|
||||||
write_func=write)
|
|
||||||
except InvalidOffsetException:
|
|
||||||
response = None
|
|
||||||
|
|
||||||
if checksum:
|
|
||||||
if response:
|
|
||||||
origin_checksum = response.headers.get(checksum_header)
|
|
||||||
else:
|
|
||||||
# check whether file is already complete
|
|
||||||
meta = get_url_meta(url, checksum_header)
|
|
||||||
origin_checksum = meta.get('checksum')
|
|
||||||
|
|
||||||
if origin_checksum is None:
|
|
||||||
raise MissingChecksumHeader
|
|
||||||
|
|
||||||
if checksum.hexdigest() != origin_checksum:
|
|
||||||
raise InvalidChecksumException
|
|
||||||
|
|
||||||
if console:
|
|
||||||
console.write("checksum/sha256 OK\n")
|
|
||||||
|
|
||||||
return path
|
|
|
@ -1,26 +1,20 @@
|
||||||
|
# encoding: utf8
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
from os import path
|
from os import path
|
||||||
|
|
||||||
from ..language import Language
|
from ..language import Language
|
||||||
from ..attrs import LANG
|
from ..attrs import LANG
|
||||||
from . import language_data
|
|
||||||
|
from .language_data import *
|
||||||
|
|
||||||
|
|
||||||
class Spanish(Language):
|
class Spanish(Language):
|
||||||
lang = 'es'
|
lang = 'es'
|
||||||
|
|
||||||
class Defaults(Language.Defaults):
|
class Defaults(Language.Defaults):
|
||||||
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
|
|
||||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
lex_attr_getters[LANG] = lambda text: 'es'
|
lex_attr_getters[LANG] = lambda text: 'es'
|
||||||
|
|
||||||
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
|
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||||
|
stop_words = STOP_WORDS
|
||||||
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
|
|
||||||
|
|
||||||
infixes = tuple(language_data.TOKENIZER_INFIXES)
|
|
||||||
|
|
||||||
tag_map = dict(language_data.TAG_MAP)
|
|
||||||
|
|
||||||
stop_words = set(language_data.STOP_WORDS)
|
|
||||||
|
|
|
@ -1,356 +1,19 @@
|
||||||
# encoding: utf8
|
# encoding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
import re
|
|
||||||
|
from .. import language_data as base
|
||||||
|
from ..language_data import update_exc, strings_to_exc
|
||||||
|
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
|
||||||
|
|
||||||
|
|
||||||
STOP_WORDS = set()
|
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
|
||||||
|
STOP_WORDS = set(STOP_WORDS)
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_PREFIXES = map(re.escape, r'''
|
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
|
||||||
,
|
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
|
||||||
"
|
|
||||||
(
|
|
||||||
[
|
|
||||||
{
|
|
||||||
*
|
|
||||||
<
|
|
||||||
>
|
|
||||||
$
|
|
||||||
£
|
|
||||||
„
|
|
||||||
“
|
|
||||||
'
|
|
||||||
``
|
|
||||||
`
|
|
||||||
#
|
|
||||||
US$
|
|
||||||
C$
|
|
||||||
A$
|
|
||||||
a-
|
|
||||||
‘
|
|
||||||
....
|
|
||||||
...
|
|
||||||
‚
|
|
||||||
»
|
|
||||||
_
|
|
||||||
§
|
|
||||||
'''.strip().split('\n'))
|
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_SUFFIXES = r'''
|
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]
|
||||||
,
|
|
||||||
\"
|
|
||||||
\)
|
|
||||||
\]
|
|
||||||
\}
|
|
||||||
\*
|
|
||||||
\!
|
|
||||||
\?
|
|
||||||
%
|
|
||||||
\$
|
|
||||||
>
|
|
||||||
:
|
|
||||||
;
|
|
||||||
'
|
|
||||||
”
|
|
||||||
“
|
|
||||||
«
|
|
||||||
_
|
|
||||||
''
|
|
||||||
's
|
|
||||||
'S
|
|
||||||
’s
|
|
||||||
’S
|
|
||||||
’
|
|
||||||
‘
|
|
||||||
°
|
|
||||||
€
|
|
||||||
\.\.
|
|
||||||
\.\.\.
|
|
||||||
\.\.\.\.
|
|
||||||
(?<=[a-zäöüßÖÄÜ)\]"'´«‘’%\)²“”])\.
|
|
||||||
\-\-
|
|
||||||
´
|
|
||||||
(?<=[0-9])km²
|
|
||||||
(?<=[0-9])m²
|
|
||||||
(?<=[0-9])cm²
|
|
||||||
(?<=[0-9])mm²
|
|
||||||
(?<=[0-9])km³
|
|
||||||
(?<=[0-9])m³
|
|
||||||
(?<=[0-9])cm³
|
|
||||||
(?<=[0-9])mm³
|
|
||||||
(?<=[0-9])ha
|
|
||||||
(?<=[0-9])km
|
|
||||||
(?<=[0-9])m
|
|
||||||
(?<=[0-9])cm
|
|
||||||
(?<=[0-9])mm
|
|
||||||
(?<=[0-9])µm
|
|
||||||
(?<=[0-9])nm
|
|
||||||
(?<=[0-9])yd
|
|
||||||
(?<=[0-9])in
|
|
||||||
(?<=[0-9])ft
|
|
||||||
(?<=[0-9])kg
|
|
||||||
(?<=[0-9])g
|
|
||||||
(?<=[0-9])mg
|
|
||||||
(?<=[0-9])µg
|
|
||||||
(?<=[0-9])t
|
|
||||||
(?<=[0-9])lb
|
|
||||||
(?<=[0-9])oz
|
|
||||||
(?<=[0-9])m/s
|
|
||||||
(?<=[0-9])km/h
|
|
||||||
(?<=[0-9])mph
|
|
||||||
(?<=[0-9])°C
|
|
||||||
(?<=[0-9])°K
|
|
||||||
(?<=[0-9])°F
|
|
||||||
(?<=[0-9])hPa
|
|
||||||
(?<=[0-9])Pa
|
|
||||||
(?<=[0-9])mbar
|
|
||||||
(?<=[0-9])mb
|
|
||||||
(?<=[0-9])T
|
|
||||||
(?<=[0-9])G
|
|
||||||
(?<=[0-9])M
|
|
||||||
(?<=[0-9])K
|
|
||||||
(?<=[0-9])kb
|
|
||||||
'''.strip().split('\n')
|
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_INFIXES = (r'''\.\.\.+ (?<=[a-z])\.(?=[A-Z]) (?<=[a-zA-Z])-(?=[a-zA-z]) '''
|
|
||||||
r'''(?<=[a-zA-Z])--(?=[a-zA-z]) (?<=[0-9])-(?=[0-9]) '''
|
|
||||||
r'''(?<=[A-Za-z]),(?=[A-Za-z])''').split()
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_EXCEPTIONS = {
|
|
||||||
"vs.": [{"F": "vs."}],
|
|
||||||
|
|
||||||
"''": [{"F": "''"}],
|
|
||||||
"—": [{"F": "—", "L": "--", "pos": "$,"}],
|
|
||||||
|
|
||||||
"a.m.": [{"F": "a.m."}],
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
|
|
||||||
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
|
|
||||||
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
|
|
||||||
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
|
|
||||||
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
|
|
||||||
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
|
|
||||||
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
|
|
||||||
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
|
|
||||||
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
|
|
||||||
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
|
|
||||||
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
|
|
||||||
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
|
|
||||||
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
|
|
||||||
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
|
|
||||||
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
|
|
||||||
|
|
||||||
"p.m.": [{"F": "p.m."}],
|
|
||||||
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
|
|
||||||
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
|
|
||||||
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
|
|
||||||
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
|
|
||||||
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
|
|
||||||
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
|
|
||||||
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
|
|
||||||
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
|
|
||||||
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
|
|
||||||
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
|
|
||||||
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
|
|
||||||
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
|
|
||||||
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
|
|
||||||
|
|
||||||
"Ala.": [{"F": "Ala."}],
|
|
||||||
"Ariz.": [{"F": "Ariz."}],
|
|
||||||
"Ark.": [{"F": "Ark."}],
|
|
||||||
"Calif.": [{"F": "Calif."}],
|
|
||||||
"Colo.": [{"F": "Colo."}],
|
|
||||||
"Conn.": [{"F": "Conn."}],
|
|
||||||
"Del.": [{"F": "Del."}],
|
|
||||||
"D.C.": [{"F": "D.C."}],
|
|
||||||
"Fla.": [{"F": "Fla."}],
|
|
||||||
"Ga.": [{"F": "Ga."}],
|
|
||||||
"Ill.": [{"F": "Ill."}],
|
|
||||||
"Ind.": [{"F": "Ind."}],
|
|
||||||
"Kans.": [{"F": "Kans."}],
|
|
||||||
"Kan.": [{"F": "Kan."}],
|
|
||||||
"Ky.": [{"F": "Ky."}],
|
|
||||||
"La.": [{"F": "La."}],
|
|
||||||
"Md.": [{"F": "Md."}],
|
|
||||||
"Mass.": [{"F": "Mass."}],
|
|
||||||
"Mich.": [{"F": "Mich."}],
|
|
||||||
"Minn.": [{"F": "Minn."}],
|
|
||||||
"Miss.": [{"F": "Miss."}],
|
|
||||||
"Mo.": [{"F": "Mo."}],
|
|
||||||
"Mont.": [{"F": "Mont."}],
|
|
||||||
"Nebr.": [{"F": "Nebr."}],
|
|
||||||
"Neb.": [{"F": "Neb."}],
|
|
||||||
"Nev.": [{"F": "Nev."}],
|
|
||||||
"N.H.": [{"F": "N.H."}],
|
|
||||||
"N.J.": [{"F": "N.J."}],
|
|
||||||
"N.M.": [{"F": "N.M."}],
|
|
||||||
"N.Y.": [{"F": "N.Y."}],
|
|
||||||
"N.C.": [{"F": "N.C."}],
|
|
||||||
"N.D.": [{"F": "N.D."}],
|
|
||||||
"Okla.": [{"F": "Okla."}],
|
|
||||||
"Ore.": [{"F": "Ore."}],
|
|
||||||
"Pa.": [{"F": "Pa."}],
|
|
||||||
"Tenn.": [{"F": "Tenn."}],
|
|
||||||
"Va.": [{"F": "Va."}],
|
|
||||||
"Wash.": [{"F": "Wash."}],
|
|
||||||
"Wis.": [{"F": "Wis."}],
|
|
||||||
|
|
||||||
":)": [{"F": ":)"}],
|
|
||||||
"<3": [{"F": "<3"}],
|
|
||||||
";)": [{"F": ";)"}],
|
|
||||||
"(:": [{"F": "(:"}],
|
|
||||||
":(": [{"F": ":("}],
|
|
||||||
"-_-": [{"F": "-_-"}],
|
|
||||||
"=)": [{"F": "=)"}],
|
|
||||||
":/": [{"F": ":/"}],
|
|
||||||
":>": [{"F": ":>"}],
|
|
||||||
";-)": [{"F": ";-)"}],
|
|
||||||
":Y": [{"F": ":Y"}],
|
|
||||||
":P": [{"F": ":P"}],
|
|
||||||
":-P": [{"F": ":-P"}],
|
|
||||||
":3": [{"F": ":3"}],
|
|
||||||
"=3": [{"F": "=3"}],
|
|
||||||
"xD": [{"F": "xD"}],
|
|
||||||
"^_^": [{"F": "^_^"}],
|
|
||||||
"=]": [{"F": "=]"}],
|
|
||||||
"=D": [{"F": "=D"}],
|
|
||||||
"<333": [{"F": "<333"}],
|
|
||||||
":))": [{"F": ":))"}],
|
|
||||||
":0": [{"F": ":0"}],
|
|
||||||
"-__-": [{"F": "-__-"}],
|
|
||||||
"xDD": [{"F": "xDD"}],
|
|
||||||
"o_o": [{"F": "o_o"}],
|
|
||||||
"o_O": [{"F": "o_O"}],
|
|
||||||
"V_V": [{"F": "V_V"}],
|
|
||||||
"=[[": [{"F": "=[["}],
|
|
||||||
"<33": [{"F": "<33"}],
|
|
||||||
";p": [{"F": ";p"}],
|
|
||||||
";D": [{"F": ";D"}],
|
|
||||||
";-p": [{"F": ";-p"}],
|
|
||||||
";(": [{"F": ";("}],
|
|
||||||
":p": [{"F": ":p"}],
|
|
||||||
":]": [{"F": ":]"}],
|
|
||||||
":O": [{"F": ":O"}],
|
|
||||||
":-/": [{"F": ":-/"}],
|
|
||||||
":-)": [{"F": ":-)"}],
|
|
||||||
":(((": [{"F": ":((("}],
|
|
||||||
":((": [{"F": ":(("}],
|
|
||||||
":')": [{"F": ":')"}],
|
|
||||||
"(^_^)": [{"F": "(^_^)"}],
|
|
||||||
"(=": [{"F": "(="}],
|
|
||||||
"o.O": [{"F": "o.O"}],
|
|
||||||
"\")": [{"F": "\")"}],
|
|
||||||
|
|
||||||
"a.": [{"F": "a."}],
|
|
||||||
"b.": [{"F": "b."}],
|
|
||||||
"c.": [{"F": "c."}],
|
|
||||||
"d.": [{"F": "d."}],
|
|
||||||
"e.": [{"F": "e."}],
|
|
||||||
"f.": [{"F": "f."}],
|
|
||||||
"g.": [{"F": "g."}],
|
|
||||||
"h.": [{"F": "h."}],
|
|
||||||
"i.": [{"F": "i."}],
|
|
||||||
"j.": [{"F": "j."}],
|
|
||||||
"k.": [{"F": "k."}],
|
|
||||||
"l.": [{"F": "l."}],
|
|
||||||
"m.": [{"F": "m."}],
|
|
||||||
"n.": [{"F": "n."}],
|
|
||||||
"o.": [{"F": "o."}],
|
|
||||||
"p.": [{"F": "p."}],
|
|
||||||
"q.": [{"F": "q."}],
|
|
||||||
"r.": [{"F": "r."}],
|
|
||||||
"s.": [{"F": "s."}],
|
|
||||||
"t.": [{"F": "t."}],
|
|
||||||
"u.": [{"F": "u."}],
|
|
||||||
"v.": [{"F": "v."}],
|
|
||||||
"w.": [{"F": "w."}],
|
|
||||||
"x.": [{"F": "x."}],
|
|
||||||
"y.": [{"F": "y."}],
|
|
||||||
"z.": [{"F": "z."}],
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
TAG_MAP = {
|
|
||||||
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
|
|
||||||
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
|
|
||||||
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
|
|
||||||
"ADJA": {"pos": "ADJ"},
|
|
||||||
"ADJD": {"pos": "ADJ", "Variant": "Short"},
|
|
||||||
"ADV": {"pos": "ADV"},
|
|
||||||
"APPO": {"pos": "ADP", "AdpType": "Post"},
|
|
||||||
"APPR": {"pos": "ADP", "AdpType": "Prep"},
|
|
||||||
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
|
|
||||||
"APZR": {"pos": "ADP", "AdpType": "Circ"},
|
|
||||||
"ART": {"pos": "DET", "PronType": "Art"},
|
|
||||||
"CARD": {"pos": "NUM", "NumType": "Card"},
|
|
||||||
"FM": {"pos": "X", "Foreign": "Yes"},
|
|
||||||
"ITJ": {"pos": "INTJ"},
|
|
||||||
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
|
|
||||||
"KON": {"pos": "CONJ"},
|
|
||||||
"KOUI": {"pos": "SCONJ"},
|
|
||||||
"KOUS": {"pos": "SCONJ"},
|
|
||||||
"NE": {"pos": "PROPN"},
|
|
||||||
"NNE": {"pos": "PROPN"},
|
|
||||||
"NN": {"pos": "NOUN"},
|
|
||||||
"PAV": {"pos": "ADV", "PronType": "Dem"},
|
|
||||||
"PROAV": {"pos": "ADV", "PronType": "Dem"},
|
|
||||||
"PDAT": {"pos": "DET", "PronType": "Dem"},
|
|
||||||
"PDS": {"pos": "PRON", "PronType": "Dem"},
|
|
||||||
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
|
|
||||||
"PPER": {"pos": "PRON", "PronType": "Prs"},
|
|
||||||
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
|
|
||||||
"PRELAT": {"pos": "DET", "PronType": "Rel"},
|
|
||||||
"PRELS": {"pos": "PRON", "PronType": "Rel"},
|
|
||||||
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
|
|
||||||
"PTKA": {"pos": "PART"},
|
|
||||||
"PTKANT": {"pos": "PART", "PartType": "Res"},
|
|
||||||
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
|
|
||||||
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
|
|
||||||
"PTKZU": {"pos": "PART", "PartType": "Inf"},
|
|
||||||
"PWAT": {"pos": "DET", "PronType": "Int"},
|
|
||||||
"PWAV": {"pos": "ADV", "PronType": "Int"},
|
|
||||||
"PWS": {"pos": "PRON", "PronType": "Int"},
|
|
||||||
"TRUNC": {"pos": "X", "Hyph": "Yes"},
|
|
||||||
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
|
|
||||||
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
|
|
||||||
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
|
|
||||||
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
|
|
||||||
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
|
|
||||||
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
|
|
||||||
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
|
|
||||||
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
|
|
||||||
"XY": {"pos": "X"},
|
|
||||||
"SP": {"pos": "SPACE"}
|
|
||||||
}
|
|
||||||
|
|
84
spacy/es/stop_words.py
Normal file
|
@ -0,0 +1,84 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
|
||||||
|
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos
|
||||||
|
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas
|
||||||
|
aquello aquellos aqui aquél aquélla aquéllas aquéllos aquí arriba arribaabajo
|
||||||
|
aseguró asi así atras aun aunque ayer añadió aún
|
||||||
|
|
||||||
|
bajo bastante bien breve buen buena buenas bueno buenos
|
||||||
|
|
||||||
|
cada casi cerca cierta ciertas cierto ciertos cinco claro comentó como con
|
||||||
|
conmigo conocer conseguimos conseguir considera consideró consigo consigue
|
||||||
|
consiguen consigues contigo contra cosas creo cual cuales cualquier cuando
|
||||||
|
cuanta cuantas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas
|
||||||
|
cuánto cuántos cómo
|
||||||
|
|
||||||
|
da dado dan dar de debajo debe deben debido decir dejó del delante demasiado
|
||||||
|
demás dentro deprisa desde despacio despues después detras detrás dia dias dice
|
||||||
|
dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante día
|
||||||
|
días dónde
|
||||||
|
|
||||||
|
ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas
|
||||||
|
empleo en encima encuentra enfrente enseguida entonces entre era eramos eran
|
||||||
|
eras eres es esa esas ese eso esos esta estaba estaban estado estados estais
|
||||||
|
estamos estan estar estará estas este esto estos estoy estuvo está están ex
|
||||||
|
excepto existe existen explicó expresó él ésa ésas ése ésos ésta éstas éste
|
||||||
|
éstos
|
||||||
|
|
||||||
|
fin final fue fuera fueron fui fuimos
|
||||||
|
|
||||||
|
general gran grandes gueno
|
||||||
|
|
||||||
|
ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer
|
||||||
|
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
|
||||||
|
hizo horas hoy hubo
|
||||||
|
|
||||||
|
igual incluso indicó informo informó intenta intentais intentamos intentan
|
||||||
|
intentar intentas intento ir
|
||||||
|
|
||||||
|
junto
|
||||||
|
|
||||||
|
la lado largo las le lejos les llegó lleva llevar lo los luego lugar
|
||||||
|
|
||||||
|
mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi
|
||||||
|
mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha
|
||||||
|
muchas mucho muchos muy más mí mía mías mío míos
|
||||||
|
|
||||||
|
nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros
|
||||||
|
nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca
|
||||||
|
|
||||||
|
ocho os otra otras otro otros
|
||||||
|
|
||||||
|
pais para parece parte partir pasada pasado paìs peor pero pesar poca pocas
|
||||||
|
poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
|
||||||
|
podrán podría podrían poner por porque posible primer primera primero primeros
|
||||||
|
principalmente pronto propia propias propio propios proximo próximo próximos
|
||||||
|
pudo pueda puede pueden puedo pues
|
||||||
|
|
||||||
|
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién quiénes qué
|
||||||
|
|
||||||
|
raras realizado realizar realizó repente respecto
|
||||||
|
|
||||||
|
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
|
||||||
|
según seis ser sera será serán sería señaló si sido siempre siendo siete sigue
|
||||||
|
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy
|
||||||
|
soyos su supuesto sus suya suyas suyo sé sí sólo
|
||||||
|
|
||||||
|
tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis
|
||||||
|
tenemos tener tenga tengo tenido tenía tercera ti tiempo tiene tienen toda
|
||||||
|
todas todavia todavía todo todos total trabaja trabajais trabajamos trabajan
|
||||||
|
trabajar trabajas trabajo tras trata través tres tu tus tuvo tuya tuyas tuyo
|
||||||
|
tuyos tú
|
||||||
|
|
||||||
|
ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
|
||||||
|
última últimas último últimos
|
||||||
|
|
||||||
|
va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero
|
||||||
|
vez vosotras vosotros voy vuestra vuestras vuestro vuestros
|
||||||
|
|
||||||
|
ya yo
|
||||||
|
""".split())
|
318
spacy/es/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,318 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..symbols import *
|
||||||
|
from ..language_data import PRON_LEMMA
|
||||||
|
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = {
|
||||||
|
"accidentarse": [
|
||||||
|
{ORTH: "accidentar", LEMMA: "accidentar", POS: AUX},
|
||||||
|
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"aceptarlo": [
|
||||||
|
{ORTH: "aceptar", LEMMA: "aceptar", POS: AUX},
|
||||||
|
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"acompañarla": [
|
||||||
|
{ORTH: "acompañar", LEMMA: "acompañar", POS: AUX},
|
||||||
|
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"advertirle": [
|
||||||
|
{ORTH: "advertir", LEMMA: "advertir", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"al": [
|
||||||
|
{ORTH: "a", LEMMA: "a", POS: ADP},
|
||||||
|
{ORTH: "el", LEMMA: "el", POS: DET}
|
||||||
|
],
|
||||||
|
|
||||||
|
"anunciarnos": [
|
||||||
|
{ORTH: "anunciar", LEMMA: "anunciar", POS: AUX},
|
||||||
|
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"asegurándole": [
|
||||||
|
{ORTH: "asegurando", LEMMA: "asegurar", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"considerarle": [
|
||||||
|
{ORTH: "considerar", LEMMA: "considerar", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"decirle": [
|
||||||
|
{ORTH: "decir", LEMMA: "decir", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"decirles": [
|
||||||
|
{ORTH: "decir", LEMMA: "decir", POS: AUX},
|
||||||
|
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"decirte": [
|
||||||
|
{ORTH: "decir", LEMMA: "decir", POS: AUX},
|
||||||
|
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"dejarla": [
|
||||||
|
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
|
||||||
|
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"dejarnos": [
|
||||||
|
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
|
||||||
|
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"dejándole": [
|
||||||
|
{ORTH: "dejando", LEMMA: "dejar", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"del": [
|
||||||
|
{ORTH: "de", LEMMA: "de", POS: ADP},
|
||||||
|
{ORTH: "el", LEMMA: "el", POS: DET}
|
||||||
|
],
|
||||||
|
|
||||||
|
"demostrarles": [
|
||||||
|
{ORTH: "demostrar", LEMMA: "demostrar", POS: AUX},
|
||||||
|
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"diciéndole": [
|
||||||
|
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
|
||||||
|
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"diciéndoles": [
|
||||||
|
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
|
||||||
|
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"diferenciarse": [
|
||||||
|
{ORTH: "diferenciar", LEMMA: "diferenciar", POS: AUX},
|
||||||
|
{ORTH: "se", LEMMA: "él", POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"divirtiéndome": [
|
||||||
|
{ORTH: "divirtiendo", LEMMA: "divertir", POS: AUX},
|
||||||
|
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"ensanchándose": [
|
||||||
|
{ORTH: "ensanchando", LEMMA: "ensanchar", POS: AUX},
|
||||||
|
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"explicarles": [
|
||||||
|
{ORTH: "explicar", LEMMA: "explicar", POS: AUX},
|
||||||
|
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"haberla": [
|
||||||
|
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||||
|
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"haberlas": [
|
||||||
|
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||||
|
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"haberlo": [
|
||||||
|
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||||
|
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"haberlos": [
|
||||||
|
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||||
|
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
|
||||||
|
],
|
||||||
|
|
||||||
|
"haberme": [
|
||||||
|
{ORTH: "haber", LEMMA: "haber", POS: AUX},
|
||||||
|
        {ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "haberse": [
        {ORTH: "haber", LEMMA: "haber", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "hacerle": [
        {ORTH: "hacer", LEMMA: "hacer", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "hacerles": [
        {ORTH: "hacer", LEMMA: "hacer", POS: AUX},
        {ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "hallarse": [
        {ORTH: "hallar", LEMMA: "hallar", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "imaginaros": [
        {ORTH: "imaginar", LEMMA: "imaginar", POS: AUX},
        {ORTH: "os", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "insinuarle": [
        {ORTH: "insinuar", LEMMA: "insinuar", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "justificarla": [
        {ORTH: "justificar", LEMMA: "justificar", POS: AUX},
        {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "mantenerlas": [
        {ORTH: "mantener", LEMMA: "mantener", POS: AUX},
        {ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "mantenerlos": [
        {ORTH: "mantener", LEMMA: "mantener", POS: AUX},
        {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "mantenerme": [
        {ORTH: "mantener", LEMMA: "mantener", POS: AUX},
        {ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "pasarte": [
        {ORTH: "pasar", LEMMA: "pasar", POS: AUX},
        {ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "pedirle": [
        {ORTH: "pedir", LEMMA: "pedir", POS: AUX},
        {ORTH: "le", LEMMA: "él", POS: PRON}
    ],

    "pel": [
        {ORTH: "per", LEMMA: "per", POS: ADP},
        {ORTH: "el", LEMMA: "el", POS: DET}
    ],

    "pidiéndonos": [
        {ORTH: "pidiendo", LEMMA: "pedir", POS: AUX},
        {ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "poderle": [
        {ORTH: "poder", LEMMA: "poder", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "preguntarse": [
        {ORTH: "preguntar", LEMMA: "preguntar", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "preguntándose": [
        {ORTH: "preguntando", LEMMA: "preguntar", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "presentarla": [
        {ORTH: "presentar", LEMMA: "presentar", POS: AUX},
        {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "pudiéndolo": [
        {ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
        {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "pudiéndose": [
        {ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "quererle": [
        {ORTH: "querer", LEMMA: "querer", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "rasgarse": [
        {ORTH: "rasgar", LEMMA: "rasgar", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "repetirlo": [
        {ORTH: "repetir", LEMMA: "repetir", POS: AUX},
        {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "robarle": [
        {ORTH: "robar", LEMMA: "robar", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "seguirlos": [
        {ORTH: "seguir", LEMMA: "seguir", POS: AUX},
        {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "serle": [
        {ORTH: "ser", LEMMA: "ser", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "serlo": [
        {ORTH: "ser", LEMMA: "ser", POS: AUX},
        {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "señalándole": [
        {ORTH: "señalando", LEMMA: "señalar", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "suplicarle": [
        {ORTH: "suplicar", LEMMA: "suplicar", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "tenerlos": [
        {ORTH: "tener", LEMMA: "tener", POS: AUX},
        {ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "vengarse": [
        {ORTH: "vengar", LEMMA: "vengar", POS: AUX},
        {ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "verla": [
        {ORTH: "ver", LEMMA: "ver", POS: AUX},
        {ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "verle": [
        {ORTH: "ver", LEMMA: "ver", POS: AUX},
        {ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
    ],

    "volverlo": [
        {ORTH: "volver", LEMMA: "volver", POS: AUX},
        {ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
    ]
}


ORTH_ONLY = [

]

@@ -1,9 +0,0 @@
from __future__ import unicode_literals, print_function

from os import path

from ..language import Language


class Finnish(Language):
    pass

Some files were not shown because too many files have changed in this diff.