💫 Port master changes over to develop (#2979)

* Create aryaprabhudesai.md (#2681)

* Update _install.jade (#2688)

Typo fix: "models" -> "model"

* Add FAC to spacy.explain (resolves #2706)

* Remove docstrings for deprecated arguments (see #2703)

* When calling getoption() in conftest.py, pass a default option (#2709)

* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy
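
For reference, a minimal sketch of that pattern (the option name and hook are assumptions, not the exact conftest.py code):

```python
# conftest.py (sketch): pass a default to getoption() so pytest doesn't
# error out when spaCy's custom CLI options were never registered, e.g.
# when running `pytest --pyargs spacy` against an installed package.
import pytest

def pytest_runtest_setup(item):
    # Hypothetical --slow option; passing the default (False) is the key part.
    run_slow = item.config.getoption("--slow", False)
    if item.get_closest_marker("slow") and not run_slow:
        pytest.skip("need --slow option to run this test")
```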

* Add contributor agreement

* update bengali token rules for hyphen and digits (#2731)

* Less norm computations in token similarity (#2730)

* Less norm computations in token similarity
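
A sketch of what that optimization amounts to (illustrative only, not the actual spaCy internals): compute each vector's norm once and reuse it across similarity calls:

```python
import numpy as np

# Sketch: cache each token vector's L2 norm and reuse it, rather than
# recomputing both norms inside every similarity() call.
def similarity(u, v, u_norm=None, v_norm=None):
    u_norm = np.linalg.norm(u) if u_norm is None else u_norm
    v_norm = np.linalg.norm(v) if v_norm is None else v_norm
    if u_norm == 0.0 or v_norm == 0.0:
        return 0.0
    return float(np.dot(u, v) / (u_norm * v_norm))
```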

* Contributor agreement

* Remove ')' for clarity (#2737)

Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intentional, then please let me know.

* added contributor agreement for mbkupfer (#2738)

* Basic support for Telugu language (#2751)

* Lex _attrs for polish language (#2750)

* Signed spaCy contributor agreement

* Added polish version of english lex_attrs

* Introduces a bulk merge function, in order to solve issue #653 (#2696)

* Fix comment

* Introduce bulk merge to increase performance on many span merges
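
For context, the batched usage pattern looks roughly like this sketch, written against the later `doc.retokenize()` API rather than the exact bulk-merge function introduced here:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("New York and San Francisco are cities")
# Collect all merges and apply them in a single pass over the Doc,
# instead of paying the re-indexing cost after every individual merge.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])  # "New York"
    retokenizer.merge(doc[3:5])  # "San Francisco"
print([t.text for t in doc])
# ['New York', 'and', 'San Francisco', 'are', 'cities']
```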

* Sign contributor agreement

* Implement pull request suggestions

* Describe converters more explicitly (see #2643)

* Add multi-threading note to Language.pipe (resolves #2582) [ci skip]

* Fix formatting

* Fix dependency scheme docs (closes #2705) [ci skip]

* Don't set stop word in example (closes #2657) [ci skip]

* Add words to portuguese language _num_words (#2759)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Update Indonesian model (#2752)

* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines containing spaces, as it won't matter since we use the .split() method in the end; added new tokens to exceptions

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readability

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file

* Fixed spaCy+Keras example (#2763)

* bug fixes in keras example

* created contributor agreement

* Adding French hyphenated first name (#2786)

* Fix typo (closes #2784)

* Fix typo (#2795) [ci skip]

Fixed typo on line 6 "regcognizer --> recognizer"

* Adding basic support for Sinhala language. (#2788)

* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement

* Also include lowercase norm exceptions

* Fix error (#2802)

* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function
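
The quoted error comes from NumPy's in-place `ndarray.resize`, which refuses to resize a buffer that other arrays reference. A small reproduction:

```python
import numpy as np

a = np.arange(10)
b = a[2:5]              # a view referencing a's buffer
try:
    a.resize(20)        # in-place method: raises the ValueError above
except ValueError as err:
    print(err)
a = np.resize(a, 20)    # the resize *function* returns a copy instead
```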

* added spaCy Contributor Agreement

* Add charlax's contributor agreement (#2805)

* agreement of contributor, may I introduce a tiny pl language contribution (#2799)

* Contributors agreement

* Contributors agreement

* Contributors agreement

* Add jupyter=True to displacy.render in documentation (#2806)

* Revert "Also include lowercase norm exceptions"

This reverts commit 70f4e8adf3.

* Remove deprecated encoding argument to msgpack

* Set up dependency tree pattern matching skeleton (#2732)

* Fix bug when too many entity types. Fixes #2800

* Fix Python 2 test failure

* Require older msgpack-numpy

* Restore encoding arg on msgpack-numpy

* Try to fix version pin for msgpack-numpy

* Update Portuguese Language (#2790)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language

* Correct error in spacy universe docs concerning spacy-lookup (#2814)

* Update Keras Example for (Parikh et al, 2016) implementation  (#2803)

* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grievous error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround

* Fix typo (closes #2815) [ci skip]

* Update regex version dependency

* Set version to 2.0.13.dev3

* Skip seemingly problematic test

* Remove problematic test

* Try previous version of regex

* Revert "Remove problematic test"

This reverts commit bdebbef455.

* Unskip test

* Try older version of regex

* 💫 Update training examples and use minibatching (#2830)

## Description
Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experience slow and unsatisfying results.
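
The batching pattern the updated examples follow, per the linked docs (abridged sketch of a spaCy v2-style training loop):

```python
import random
from spacy.util import minibatch, compounding

def train(nlp, train_data, n_iter=10):
    optimizer = nlp.begin_training()
    for _ in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        # Batch size grows from 4 to 32, compounding by 1.001 per batch
        for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print("Losses", losses)
```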

### Types of change
enhancements

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Visual C++ link updated (#2842) (closes #2841) [ci skip]

* New landing page

* Add contribution agreement

* Correcting lang/ru/examples.py (#2845)

* Correct some grammatical inaccuracies in lang/ru/examples.py; filled in the Contributor Agreement

* Correct some grammatical inaccuracies in lang/ru/examples.py

* Move contributor agreement to separate file

* Set version to 2.0.13.dev4

* Add Persian(Farsi) language support (#2797)

* Also include lowercase norm exceptions

* Remove in favour of https://github.com/explosion/spaCy/graphs/contributors

* Rule-based French Lemmatizer (#2818)

## Description

Add a rule-based French Lemmatizer, following the English one and the excellent PR for [Greek language optimizations](https://github.com/explosion/spaCy/pull/2558), to adapt the Lemmatizer class.

### Types of change

- The lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html); I used the XML version.
- Add several files containing exhaustive lists of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e. PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check the lookup table as a last resort if the POS is not mentioned (see the sketch after this list)
- Modify the lemmatize function to check the lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py with respect to the regex used in tokenizer_exceptions.py
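
A minimal sketch of that last-resort fallback (hypothetical signature and data layout, not the exact spaCy code):

```python
def lemmatize(string, pos, index, exceptions, rules, lookup):
    # Sketch only: index/exceptions/rules are per-POS tables as in the
    # English lemmatizer; lookup is a plain string -> lemma dict.
    if string in exceptions.get(pos, {}):
        return list(exceptions[pos][string])
    forms = []
    for old, new in rules.get(pos, []):
        if string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form in index.get(pos, set()):
                forms.append(form)
    if not forms:
        # Last resort: fall back to the lookup table
        forms = [lookup.get(string, string)]
    return forms
```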

## Checklist
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Set version to 2.0.13

* Fix formatting and consistency

* Update docs for new version [ci skip]

* Increment version [ci skip]

* Add info on wheels [ci skip]

* Adding "This is a sentence" example to Sinhala (#2846)

* Add wheels badge

* Update badge [ci skip]

* Update README.rst [ci skip]

* Update murmurhash pin

* Increment version to 2.0.14.dev0

* Update GPU docs for v2.0.14

* Add wheel to setup_requires

* Import prefer_gpu and require_gpu functions from Thinc

* Add tests for prefer_gpu() and require_gpu()

* Update requirements and setup.py

* Workaround bug in thinc require_gpu

* Set version to v2.0.14

* Update push-tag script

* Unhack prefer_gpu

* Require thinc 6.10.6

* Update prefer_gpu and require_gpu docs [ci skip]

* Fix specifiers for GPU

* Set version to 2.0.14.dev1

* Set version to 2.0.14

* Update Thinc version pin

* Increment version

* Fix msgpack-numpy version pin

* Increment version

* Update version to 2.0.16

* Update version [ci skip]

* Redundant ')' in the Stop words' example (#2856)


* Documentation improvement regarding joblib and SO (#2867)

Some documentation improvements

## Description
1. Fixed the dead URL to joblib
2. Fixed Stack Overflow brand name (with space)

### Types of change
Documentation

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* raise error when setting overlapping entities as doc.ents (#2880)

* Fix out-of-bounds access in NER training

The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so they must be hunted down.
Hunting this one down took a long time... I printed out values across
training runs and diffed them, looking for points of divergence between
runs, when no randomness should be allowed.
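
A toy illustration of the bug class (hypothetical names): a negative "missing token" index used directly on an array silently wraps around instead of failing:

```python
tokens = ["The", "cat"]
b1 = -1                      # what state.B(1) returns when no token exists

print(tokens[b1])            # "cat": a wrap-around read, not an error

def safe_get(seq, i, default="<empty>"):
    # Bounds-checked access, analogous to state.safe_get()
    return seq[i] if 0 <= i < len(seq) else default

print(safe_get(tokens, b1))  # "<empty>"
```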

* Change PyThaiNLP Url (#2876)

* Fix missing comma

* Add example showing a fix-up rule for space entities

* Set version to 2.0.17.dev0

* Update regex version

* Revert "Update regex version"

This reverts commit 62358dd867.

* Try setting older regex version, to align with conda

* Set version to 2.0.17

* Add spacy-js to universe [ci-skip]

* Add spacy-raspberry to universe (closes #2889)

* Add script to validate universe json [ci skip]

* Removed space in docs + added contributor info (#2909)

* removed unneeded space in documentation

* added contributor info

* Allow input text of length up to max_length, inclusive (#2922)

* Include universe spec for spacy-wordnet component (#2919)

* feat: include universe spec for spacy-wordnet component

* chore: include spaCy contributor agreement

* Minor formatting changes [ci skip]

* Fix image [ci skip]

Twitter URL doesn't work on live site

* Check if the word is in one of the regular lists specific to each POS (#2886)

* 💫 Create random IDs for SVGs to prevent ID clashes (#2927)

Resolves #2924.

## Description
Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. A random ID prefix is now generated, so for consistency even identical parses won't receive the same IDs (even if the effect of an ID clash isn't noticeable here).
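
A sketch of the approach (hypothetical helper, not displaCy's exact code): generate one random prefix per rendered parse and bake it into every arc ID:

```python
import uuid

def arc_ids(n_arcs):
    # One random prefix per render, so two identical parses rendered in
    # the same notebook never share SVG element IDs.
    prefix = uuid.uuid4().hex[:8]
    return ["arrow-{}-{}".format(prefix, i) for i in range(n_arcs)]

print(arc_ids(3))  # e.g. ['arrow-3f2a1c9d-0', 'arrow-3f2a1c9d-1', ...]
```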

### Types of change
bug fix

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix typo [ci skip]

* fixes symbolic link on py3 and windows (#2949)

* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>

* Fix formatting

* Update universe [ci skip]

* Catalan Language Support (#2940)

* Catalan language support

* Adding Catalan to documentation

* Sort languages alphabetically [ci skip]

* Update tests for pytest 4.x (#2965)

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize)); see the example after this list
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but the tests were apparently never updated here)
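
For reference, the pytest 4.x-compatible pattern (a generic example, not an actual spaCy test):

```python
import pytest

@pytest.mark.parametrize(
    "text",
    [
        "ab",
        # The old style (a pytest.mark call inside the params list) is
        # removed in pytest 4.0; marks now attach via pytest.param:
        pytest.param("a-b", marks=pytest.mark.xfail(reason="hyphen split")),
    ],
)
def test_tokenizer_handles_text(text):
    assert isinstance(text, str)
```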

### Types of change

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix regex pin to harmonize with conda (#2964)

* Update README.rst

* Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)

Fixes #2976

* Fix typo

* Fix typo

* Remove duplicate file

* Require thinc 7.0.0.dev2

Fixes bug in gpu_ops that would use cupy instead of numpy on CPU

* Add missing import

* Fix error IDs

* Fix tests
Committed by: Ines Montani, 2018-11-29 16:30:29 +01:00 (via GitHub)
Commit: d33953037e (parent: 681258e29b)
159 changed files with 1,011,058 additions and 218,744 deletions

@@ -1,7 +1,7 @@
 <!--- Please provide a summary in the title and describe your issue here.
 Is this a bug or feature request? If a bug, include all the steps that led to the issue.
-If you're looking for help with your code, consider posting a question on StackOverflow instead:
+If you're looking for help with your code, consider posting a question on Stack Overflow instead:
 http://stackoverflow.com/questions/tagged/spacy -->

@@ -1,11 +1,11 @@
 ---
 name: "\U0001F4AC Anything else?"
 about: For general usage questions or help with your code, please consider
-posting on StackOverflow instead.
+posting on Stack Overflow instead.
 ---
-<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on StackOverflow instead: http://stackoverflow.com/questions/tagged/spacy -->
+<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on Stack Overflow instead: http://stackoverflow.com/questions/tagged/spacy -->
 ## Your Environment
 <!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->

.github/contributors/ALSchwalm.md (new file, +106 lines)
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Adam Schwalm |
| Company name (if applicable) | Star Lab |
| Title or role (if applicable) | Software Engineer |
| Date | 2018-11-28 |
| GitHub username | ALSchwalm |
| Website (optional) | https://alschwalm.com |

.github/contributors/BramVanroy.md (new file, +106 lines)

(SCA text identical to the copy above.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ----------------------|
| Name | Bram Vanroy |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | October 19, 2018 |
| GitHub username | BramVanroy |
| Website (optional) | https://bramvanroy.be |

.github/contributors/Cinnamy.md (new file, +106 lines)

(SCA text identical to the copy above.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marina Lysyuk |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13.10.2018 |
| GitHub username | Cinnamy |
| Website (optional) | |

.github/contributors/JKhakpour.md (new file, +106 lines)

(SCA text identical to the copy above.)
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ja'far Khakpour |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-09-24 |
| GitHub username | JKhakpour |
| Website (optional) | |

New contributor agreement file (+106 lines; file name not shown)

(SCA text identical to the copy above.)
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Aniruddha Adhikary |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-09-05 |
| GitHub username | aniruddha-adhikary |
| Website (optional) | https://adhikary.net |

.github/contributors/aongko.md (new file, +106 lines)

(SCA text identical to the copy above.)
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Andrew Ongko |
| Company name (if applicable) | Kurio |
| Title or role (if applicable) | Senior Data Science |
| Date | Sep 10, 2018 |
| GitHub username | aongko |
| Website (optional) | |

.github/contributors/aryaprabhudesai.md (new file, +54 lines)

(Plain-text variant of the same SCA as above; the individual statement is marked.)
## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Arya Prabhudesai |
| Company name (if applicable) | - |
| Title or role (if applicable) | - |
| Date | 2018-08-17 |
| GitHub username | aryaprabhudesai |
| Website (optional) | - |

106
.github/contributors/charlax.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Charles-Axel Dein |
| Company name (if applicable) | Skrib |
| Title or role (if applicable) | CEO |
| Date | 27/09/2018 |
| GitHub username | charlax |
| Website (optional) | www.dein.fr |

.github/contributors/cicorias.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Shawn Cicoria |
| Company name (if applicable) | Microsoft |
| Title or role (if applicable) | Principal Software Engineer |
| Date | November 20, 2018 |
| GitHub username | cicorias |
| Website (optional) | www.cicoria.com |

.github/contributors/darindf.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Darin DeForest |
| Company name (if applicable) | Ipro Tech |
| Title or role (if applicable) | Senior Software Engineer |
| Date | 2018-09-26 |
| GitHub username | darindf |
| Website (optional) | |

.github/contributors/filipecaixeta.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Filipe Caixeta |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09.12.2018 |
| GitHub username | filipecaixeta |
| Website (optional) | filipecaixeta.com.br |

.github/contributors/frascuchon.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Francisco Aranda |
| Company name (if applicable) | recognai |
| Title or role (if applicable) | |
| Date | |
| GitHub username | frascuchon |
| Website (optional) | https://recogn.ai |

.github/contributors/free-variation.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | John Stewart |
| Company name (if applicable) | Amplify |
| Title or role (if applicable) | SVP Research |
| Date | 14/09/2018 |
| GitHub username | free-variation |
| Website (optional) | |

.github/contributors/grivaz.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | C. Grivaz |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 08.22.2018 |
| GitHub username | grivaz |
| Website (optional) | |

.github/contributors/jacopofar.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jacopo Farina |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-10-12 |
| GitHub username | jacopofar |
| Website (optional) | jacopofarina.eu |

.github/contributors/keshan.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Keshan Sodimana |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Sep 21, 2018 |
| GitHub username | keshan |
| Website (optional) | |

.github/contributors/mbkupfer.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Maxim Kupfer |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Sep 6, 2018 |
| GitHub username | mbkupfer |
| Website (optional) | |

.github/contributors/mikelibg.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Michael Liberman |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-11-08 |
| GitHub username | mikelibg |
| Website (optional) | |

.github/contributors/mpuig.md (new file, 106 lines)
*(Same standard spaCy contributor agreement text as above. Section 7 selection:)*
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marc Puig |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-11-17 |
| GitHub username | mpuig |
| Website (optional) | |

.github/contributors/phojnacki.md (new file, 106 lines)
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------------------- |
| Name | Przemysław Hojnacki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12/09/2018 |
| GitHub username | phojnacki |
| Website (optional) | https://about.me/przemyslaw.hojnacki |

.github/contributors/pzelasko.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Piotr Żelasko |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 04-09-2018 |
| GitHub username | pzelasko |
| Website (optional) | |

.github/contributors/sainathadapa.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Sainath Adapa |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-09-06 |
| GitHub username | sainathadapa |
| Website (optional) | |

.github/contributors/tyburam.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mateusz Tybura |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 08.09.2018 |
| GitHub username | tyburam |
| Website (optional) | |


@ -26,7 +26,7 @@ also check the [troubleshooting guide](https://spacy.io/usage/#troubleshooting)
to see if your problem is already listed there.
If you're looking for help with your code, consider posting a question on
[StackOverflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,


@ -1,83 +0,0 @@
# 👥 Contributors
This is a list of everyone who has made significant contributions to spaCy, in alphabetical order. Thanks a lot for the great work!
* Adam Bittlingmayer, [@bittlingmayer](https://github.com/bittlingmayer)
* Alexey Kim, [@yuukos](https://github.com/yuukos)
* Alexis Eidelman, [@AlexisEidelman](https://github.com/AlexisEidelman)
* Ali Zarezade, [@azarezade](https://github.com/azarezade)
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary, [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
* Anto Binish Kaspar, [@binishkaspar](https://github.com/binishkaspar)
* Avadh Patel, [@avadhpatel](https://github.com/avadhpatel)
* Ben Eyal, [@beneyal](https://github.com/beneyal)
* Bhargav Srinivasa, [@bhargavvader](https://github.com/bhargavvader)
* Bruno P. Kinoshita, [@kinow](https://github.com/kinow)
* Canbey Bilgili, [@cbilgili](https://github.com/cbilgili)
* Chris DuBois, [@chrisdubois](https://github.com/chrisdubois)
* Christoph Schwienheer, [@chssch](https://github.com/chssch)
* Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk)
* Daniel Rapp, [@rappdw](https://github.com/rappdw)
* Daniel Vila Suero, [@dvsrepo](https://github.com/dvsrepo)
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
* Eric Zhao, [@ericzhao28](https://github.com/ericzhao28)
* Francisco Aranda, [@frascuchon](https://github.com/frascuchon)
* Greg Baker, [@solresol](https://github.com/solresol)
* Greg Dubbin, [@GregDubbin](https://github.com/GregDubbin)
* Grégory Howard, [@Gregory-Howard](https://github.com/Gregory-Howard)
* György Orosz, [@oroszgy](https://github.com/oroszgy)
* Henning Peters, [@henningpeters](https://github.com/henningpeters)
* Iddo Berger, [@iddoberger](https://github.com/iddoberger)
* Ines Montani, [@ines](https://github.com/ines)
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
* Janneke van der Zwaan, [@jvdzwaan](https://github.com/jvdzwaan)
* Jim Geovedi, [@geovedi](https://github.com/geovedi)
* Jim Regan, [@jimregan](https://github.com/jimregan)
* Jeffrey Gerard, [@IamJeffG](https://github.com/IamJeffG)
* Jordan Suchow, [@suchow](https://github.com/suchow)
* Josh Reeter, [@jreeter](https://github.com/jreeter)
* Juan Miguel Cejuela, [@juanmirocks](https://github.com/juanmirocks)
* Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
* Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
* Leif Uwe Vogelsang, [@luvogels](https://github.com/luvogels)
* Liling Tan, [@alvations](https://github.com/alvations)
* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
* Matthew Honnibal, [@honnibal](https://github.com/honnibal)
* Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
* Michael Wallin, [@wallinm1](https://github.com/wallinm1)
* Miguel Almeida, [@mamoit](https://github.com/mamoit)
* Motoki Wu, [@tokestermw](https://github.com/tokestermw)
* Ole Henrik Skogstrøm, [@ohenrik](https://github.com/ohenrik)
* Oleg Zd, [@olegzd](https://github.com/olegzd)
* Orhan Bilgin, [@melanuria](https://github.com/melanuria)
* Orion Montoya, [@mdcclv](https://github.com/mdcclv)
* Paul O'Leary McCann, [@polm](https://github.com/polm)
* Pokey Rule, [@pokey](https://github.com/pokey)
* Ramanan Balakrishnan, [@ramananbalakrishnan](https://github.com/ramananbalakrishnan)
* Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
* Roman Domrachev, [@ligser](https://github.com/ligser)
* Roman Inflianskas, [@rominf](https://github.com/rominf)
* Sam Bozek, [@sambozek](https://github.com/sambozek)
* Sasho Savkov, [@savkov](https://github.com/savkov)
* Shuvanon Razik, [@shuvanon](https://github.com/shuvanon)
* Søren Lind Kristiansen, [@sorenlind](https://github.com/sorenlind)
* Swier, [@swierh](https://github.com/swierh)
* Thomas Tanon, [@Tpt](https://github.com/Tpt)
* Thomas Opsomer, [@thomasopsomer](https://github.com/thomasopsomer)
* Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
* Vadim Mazaev, [@GreenRiverRUS](https://github.com/GreenRiverRUS)
* Vimos Tan, [@Vimos](https://github.com/Vimos)
* Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
* Wah Loon Keng, [@kengz](https://github.com/kengz)
* Wannaphong Phatthiyaphaibun, [@wannaphongcom](https://github.com/wannaphongcom)
* Willem van Hage, [@wrvhage](https://github.com/wrvhage)
* Wolfgang Seeker, [@wbwseeker](https://github.com/wbwseeker)
* Yam, [@hscspring](https://github.com/hscspring)
* Yanhao Yang, [@YanhaoYang](https://github.com/YanhaoYang)
* Yasuaki Uechi, [@uetchy](https://github.com/uetchy)
* Yu-chun Huang, [@galaxyh](https://github.com/galaxyh)
* Yubing Dong, [@tomtung](https://github.com/tomtung)
* Yuval Pinter, [@yuvalpinter](https://github.com/yuvalpinter)

README.rst

@ -0,0 +1,328 @@
spaCy: Industrial-strength NLP
******************************
spaCy is a library for advanced Natural Language Processing in Python and Cython.
It's built on the very latest research, and was designed from day one to be
used in real products. spaCy comes with
`pre-trained statistical models <https://spacy.io/models>`_ and word
vectors, and currently supports tokenization for **30+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition**, and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.
💫 **Version 2.0 out now!** `Check out the release notes here. <https://github.com/explosion/spaCy/releases>`_
.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis
:target: https://travis-ci.org/explosion/spaCy
:alt: Build Status
.. image:: https://img.shields.io/appveyor/ci/explosion/spaCy/master.svg?style=flat-square&logo=appveyor
:target: https://ci.appveyor.com/project/explosion/spaCy
:alt: Appveyor Build Status
.. image:: https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square
:target: https://github.com/explosion/spaCy/releases
:alt: Current Release Version
.. image:: https://img.shields.io/pypi/v/spacy.svg?style=flat-square
:target: https://pypi.python.org/pypi/spacy
:alt: pypi Version
.. image:: https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square
:target: https://anaconda.org/conda-forge/spacy
:alt: conda Version
.. image:: https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white
:target: https://github.com/explosion/wheelwright/releases
:alt: Python wheels
.. image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow
:target: https://twitter.com/spacy_io
:alt: spaCy on Twitter
📖 Documentation
================
=================== ===
`spaCy 101`_ New to spaCy? Here's everything you need to know!
`Usage Guides`_ How to use spaCy and its features.
`New in v2.0`_ New features, backwards incompatibilities and migration guide.
`API Reference`_ The detailed reference for spaCy's API.
`Models`_ Download statistical language models for spaCy.
`Universe`_ Libraries, extensions, demos, books and courses.
`Changelog`_ Changes and version history.
`Contribute`_ How to contribute to the spaCy project and code base.
=================== ===
.. _spaCy 101: https://spacy.io/usage/spacy-101
.. _New in v2.0: https://spacy.io/usage/v2#migrating
.. _Usage Guides: https://spacy.io/usage/
.. _API Reference: https://spacy.io/api/
.. _Models: https://spacy.io/models
.. _Universe: https://spacy.io/universe
.. _Changelog: https://spacy.io/usage/#changelog
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
💬 Where to ask questions
==========================
The spaCy project is maintained by `@honnibal <https://github.com/honnibal>`_
and `@ines <https://github.com/ines>`_. Please understand that we won't be able
to provide individual support via email. We also believe that help is much more
valuable if it's shared publicly, so that more people can benefit from it.
====================== ===
**Bug Reports** `GitHub Issue Tracker`_
**Usage Questions** `Stack Overflow`_, `Gitter Chat`_, `Reddit User Group`_
**General Discussion** `Gitter Chat`_, `Reddit User Group`_
====================== ===
.. _GitHub Issue Tracker: https://github.com/explosion/spaCy/issues
.. _Stack Overflow: http://stackoverflow.com/questions/tagged/spacy
.. _Gitter Chat: https://gitter.im/explosion/spaCy
.. _Reddit User Group: https://www.reddit.com/r/spacynlp
Features
========
* **Fastest syntactic parser** in the world
* **Named entity** recognition
* Non-destructive **tokenization**
* Support for **30+ languages**
* Pre-trained `statistical models <https://spacy.io/models>`_ and word vectors
* Easy **deep learning** integration
* Part-of-speech tagging
* Labelled dependency parsing
* Syntax-driven sentence segmentation
* Built-in **visualizers** for syntax and NER
* Convenient string-to-hash mapping
* Export to numpy data arrays
* Efficient binary serialization
* Easy **model packaging** and deployment
* State-of-the-art speed
* Robust, rigorously evaluated accuracy
📖 **For more details, see the** `facts, figures and benchmarks <https://spacy.io/usage/facts-figures>`_.
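For instance, two of the features above, the convenient string-to-hash mapping
and the export to numpy data arrays, take only a couple of lines. A minimal
sketch, assuming the ``en`` model is installed:

.. code:: python

    import spacy
    from spacy.attrs import LOWER, POS

    nlp = spacy.load('en')
    doc = nlp(u'This is a sentence.')

    # the StringStore maps strings to hash values and back again
    coffee_hash = nlp.vocab.strings[u'coffee']
    coffee_text = nlp.vocab.strings[coffee_hash]

    # export token attributes to a numpy array (one row per token)
    array = doc.to_array([LOWER, POS])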
Install spaCy
=============
For detailed installation instructions, see
the `documentation <https://spacy.io/usage>`_.
==================== ===
**Operating system** macOS / OS X, Linux, Windows (Cygwin, MinGW, Visual Studio)
**Python version** CPython 2.7, 3.4+. Only 64-bit.
**Package managers** `pip`_, `conda`_ (via ``conda-forge``)
==================== ===
.. _pip: https://pypi.python.org/pypi/spacy
.. _conda: https://anaconda.org/conda-forge/spacy
pip
---
Using pip, spaCy releases are available as source packages and binary wheels
(as of ``v2.0.13``).
.. code:: bash
pip install spacy
When using pip it is generally recommended to install packages in a virtual
environment to avoid modifying system state:
.. code:: bash
python -m venv .env
source .env/bin/activate
pip install spacy
conda
-----
Thanks to our great community, we've finally re-added conda support. You can now
install spaCy via ``conda-forge``:
.. code:: bash
  conda config --add channels conda-forge
  conda install spacy
For the feedstock including the build recipe and configuration,
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
Improvements and pull requests to the recipe and setup are always appreciated.
Updating spaCy
--------------
Some updates to spaCy may require downloading new statistical models. If you're
running spaCy v2.0 or higher, you can use the ``validate`` command to check if
your installed models are compatible and if not, print details on how to update
them:
.. code:: bash
pip install -U spacy
python -m spacy validate
If you've trained your own models, keep in mind that your training and runtime
inputs must match. After updating spaCy, we recommend **retraining your models**
with the new version.
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the**
`migration guide <https://spacy.io/usage/v2#migrating>`_.
Download models
===============
As of v1.7.0, models for spaCy can be installed as **Python packages**.
This means that they're a component of your application, just like any
other module. Models can be installed using spaCy's ``download`` command,
or manually by pointing pip to a path or URL.
======================= ===
`Available Models`_ Detailed model descriptions, accuracy figures and benchmarks.
`Models Documentation`_ Detailed usage instructions.
======================= ===
.. _Available Models: https://spacy.io/models
.. _Models Documentation: https://spacy.io/docs/usage/models
.. code:: bash
# out-of-the-box: download best-matching default model
python -m spacy download en
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_lg
# pip install .tar.gz archive from path or URL
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
Loading and using models
------------------------
To load a model, use ``spacy.load()`` with the model's shortcut link:
.. code:: python
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a sentence.')
If you've installed a model via pip, you can also ``import`` it directly and
then call its ``load()`` method:
.. code:: python
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp(u'This is a sentence.')
📖 **For more info and examples, check out the**
`models documentation <https://spacy.io/docs/usage/models>`_.
Support for older versions
--------------------------
If you're using an older version (``v1.6.0`` or below), you can still download
and install the old models from within spaCy using ``python -m spacy.en.download all``
or ``python -m spacy.de.download all``. The ``.tar.gz`` archives are also
`attached to the v1.6.0 release <https://github.com/explosion/spaCy/tree/v1.6.0>`_.
To download and install the models manually, unpack the archive, drop the
contained directory into ``spacy/data`` and load the model via ``spacy.load('en')``
or ``spacy.load('de')``.
Compile from source
===================
The other way to install spaCy is to clone its
`GitHub repository <https://github.com/explosion/spaCy>`_ and build it from
source. This is the common approach if you want to make changes to the code base.
You'll need to make sure that you have a development environment consisting of a
Python distribution including header files, a compiler,
`pip <https://pip.pypa.io/en/latest/installing/>`__, `virtualenv <https://virtualenv.pypa.io/>`_
and `git <https://git-scm.com>`_ installed. The compiler part is the trickiest.
How to do that depends on your system. See notes on Ubuntu, OS X and Windows for
details.
.. code:: bash
# make sure you are using the latest pip
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy
python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
Compared to a regular install via pip, `requirements.txt <requirements.txt>`_
additionally installs developer dependencies such as Cython. For more details
and instructions, see the documentation on
`compiling spaCy from source <https://spacy.io/usage/#source>`_ and the
`quickstart widget <https://spacy.io/usage/#section-quickstart>`_ to get
the right commands for your platform and Python version.
Instead of the above verbose commands, you can also use the following
`Fabric <http://www.fabfile.org/>`_ commands. All commands assume that your
virtual environment is located in a directory ``.env``. If you're using a
different directory, you can change it via the environment variable ``VENV_DIR``,
for example ``VENV_DIR=".custom-env" fab clean make``.
============= ===
``fab env`` Create virtual environment and delete previous one, if it exists.
``fab make`` Compile the source.
``fab clean`` Remove compiled objects, including the generated C++.
``fab test`` Run basic tests, aborting after first failure.
============= ===
Ubuntu
------
Install system-level dependencies via ``apt-get``:
.. code:: bash
sudo apt-get install build-essential python-dev git
macOS / OS X
------------
Install a recent version of `Xcode <https://developer.apple.com/xcode/>`_,
including the so-called "Command Line Tools". macOS and OS X ship with Python
and git preinstalled.
Windows
-------
Install a version of `Visual Studio Express <https://www.visualstudio.com/vs/visual-studio-express/>`_
or higher that matches the version that was used to compile your Python
interpreter. For official distributions these are VS 2008 (Python 2.7),
VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
Run tests
=========
spaCy comes with an `extensive test suite <spacy/tests>`_. In order to run the
tests, you'll usually want to clone the repository and build spaCy from source.
This will also install the required development dependencies and test utilities
defined in the ``requirements.txt``.
Alternatively, you can find out where spaCy is installed and run ``pytest`` on
that directory. Don't forget to also install the test utilities via spaCy's
``requirements.txt``:
.. code:: bash
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
pip install -r path/to/requirements.txt
python -m pytest <spacy-directory>
See `the documentation <https://spacy.io/usage/#tests>`_ for more details and
examples.


@ -7,6 +7,7 @@ git diff-index --quiet HEAD
git checkout $1
git pull origin $1
version=$(grep "__version__ = " spacy/about.py)
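# bash parameter expansion: strip the '__version__ = ' prefix, then the quote characters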
version=${version/__version__ = }
version=${version/\'/}


@ -92,11 +92,13 @@ def get_features(docs, max_length):
def train(train_texts, train_labels, dev_texts, dev_labels,
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
nb_epoch=5, by_sentence=True):
print("Loading spaCy")
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
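# the vectors model has no parser, so the sentencizer supplies the sentence
# boundaries used below when by_sentence=True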
embeddings = get_embeddings(nlp.vocab)
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
print("Parsing texts...")
train_docs = list(nlp.pipe(train_texts))
dev_docs = list(nlp.pipe(dev_texts))
@ -107,7 +109,7 @@ def train(train_texts, train_labels, dev_texts, dev_labels,
train_X = get_features(train_docs, lstm_shape['max_length'])
dev_X = get_features(dev_docs, lstm_shape['max_length'])
model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
nb_epoch=nb_epoch, batch_size=batch_size)
epochs=nb_epoch, batch_size=batch_size)
return model
@ -138,15 +140,9 @@ def get_embeddings(vocab):
def evaluate(model_dir, texts, labels, max_length=100):
def create_pipeline(nlp):
'''
This could be a lambda, but named functions are easier to read in Python.
'''
return [nlp.tagger, nlp.parser, SentimentAnalyser.load(model_dir, nlp,
max_length=max_length)]
nlp = spacy.load('en')
nlp.pipeline = create_pipeline(nlp)
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
correct = 0
i = 0
@ -186,7 +182,7 @@ def main(model_dir=None, train_dir=None, dev_dir=None,
is_runtime=False,
nr_hidden=64, max_length=100, # Shape
dropout=0.5, learn_rate=0.001, # General NN config
nb_epoch=5, batch_size=100, nr_examples=-1): # Training params
nb_epoch=5, batch_size=256, nr_examples=-1): # Training params
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if train_dir is None or dev_dir is None:
@ -219,7 +215,7 @@ def main(model_dir=None, train_dir=None, dev_dir=None,
if model_dir is not None:
with (model_dir / 'model').open('wb') as file_:
pickle.dump(weights[1:], file_)
with (model_dir / 'config.json').open('wb') as file_:
with (model_dir / 'config.json').open('w') as file_:
file_.write(lstm.to_json())


@ -2,11 +2,7 @@
# A decomposable attention model for Natural Language Inference
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
> ⚠️ **IMPORTANT NOTE:** This example is currently only compatible with spaCy
> v1.x. We're working on porting the example over to Keras v2.x and spaCy v2.x.
> See [#1445](https://github.com/explosion/spaCy/issues/1445) for details
> contributions welcome!
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
This directory contains an implementation of the entailment prediction model described
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
@ -21,19 +17,25 @@ hook is installed to customise the `.similarity()` method of spaCy's `Doc`
and `Span` objects:
```python
def demo(model_dir):
nlp = spacy.load('en', path=model_dir,
create_pipeline=create_similarity_pipeline)
doc1 = nlp(u'Worst fries ever! Greasy and horrible...')
doc2 = nlp(u'The milkshakes are good. The fries are bad.')
print(doc1.similarity(doc2))
sent1a, sent1b = doc1.sents
print(sent1a.similarity(sent1b))
print(sent1a.similarity(doc2))
print(sent1b.similarity(doc2))
def demo(shape):
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'The king of France is bald.')
doc2 = nlp(u'France has no king.')
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
```
This gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
the system has definite opinions about Bertrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
I'm working on a blog post to explain Parikh et al.'s model in more detail.
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
I think it is a very interesting example of the attention mechanism, which
I didn't understand very well before working through this paper. There are
lots of ways to extend the model.
@ -43,7 +45,7 @@ lots of ways to extend the model.
| File | Description |
| --- | --- |
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
| `spacy_hook.py` | Provides a class `SimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
| `keras_decomposable_attention.py` | Defines the neural network model. |
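Under the hood this relies on spaCy's `user_hooks`: registering a function under
the `'similarity'` key of `doc.user_hooks` replaces the default average-of-vectors
algorithm for that `Doc`. A minimal sketch of the mechanism (not the actual
`KerasSimilarityShim`, and assuming `en_vectors_web_lg` is installed):

```python
import spacy

def custom_similarity(doc1, doc2):
    # stand-in for your_model(doc1, doc2)
    return doc1.vector.dot(doc2.vector)

nlp = spacy.load('en_vectors_web_lg')
doc1 = nlp(u'The king of France is bald.')
doc2 = nlp(u'France has no king.')

# override .similarity() for doc1 only
doc1.user_hooks['similarity'] = custom_similarity
print(doc1.similarity(doc2))
```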
## Setting up
@ -52,17 +54,13 @@ First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spa
English models (about 1GB of data):
```bash
pip install https://github.com/fchollet/keras/archive/1.2.2.zip
pip install keras
pip install spacy
python -m spacy.en.download
python -m spacy download en_vectors_web_lg
```
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
the 1.2.2 release (and not via `pip install keras`). For more info on this, see
[#727](https://github.com/explosion/spaCy/issues/727).
You'll also want to get Keras working on your GPU. This will depend on your
set up, so you're mostly on your own for this step. If you're using AWS, try the
You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
This will depend on your setup, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
Once you've installed the dependencies, you can run a small preliminary test of
@ -80,22 +78,35 @@ Finally, download the [Stanford Natural Language Inference corpus](http://nlp.st
## Running the example
You can run the `keras_parikh_entailment/` directory as a script, which executes the file
[`keras_parikh_entailment/__main__.py`](__main__.py). The first thing you'll want to do is train the model:
[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments,
the usage is shown. Running it with `-h` explains the command-line arguments.
The first thing you'll want to do is train the model:
```bash
python keras_parikh_entailment/ train <train_directory> <dev_directory>
python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
```
Training takes about 300 epochs for full accuracy, and I haven't rerun the full
experiment since refactoring things to publish this example — please let me
know if I've broken something. You should get to at least 85% on the development data.
know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs.
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
by `.fit()` methods. I never really feel confident until I've run a new process that loads
the model and starts making predictions, without access to the gold labels. I've therefore
included an `evaluate` mode. Finally, there's also a little demo, which mostly exists to show
included an `evaluate` mode.
```bash
python keras_parikh_entailment/ evaluate -s <path to SNLI train JSON>
```
Finally, there's also a little demo, which mostly exists to show
you how run-time usage will eventually look.
```bash
python keras_parikh_entailment/ demo
```
## Getting updates
We should have the blog post explaining the model ready before the end of the week. To get


@ -1,82 +1,104 @@
from __future__ import division, unicode_literals, print_function
import spacy
import plac
from pathlib import Path
import numpy as np
import ujson as json
import numpy
from keras.utils.np_utils import to_categorical
from spacy_hook import get_embeddings, get_word_ids
from spacy_hook import create_similarity_pipeline
from keras.utils import to_categorical
import plac
import sys
from keras_decomposable_attention import build_model
from spacy_hook import get_embeddings, KerasSimilarityShim
try:
import cPickle as pickle
except ImportError:
import pickle
import spacy
# workaround for keras/tensorflow bug
# see https://github.com/tensorflow/tensorflow/issues/3388
import os
import importlib
from keras import backend as K
def set_keras_backend(backend):
if K.backend() != backend:
os.environ['KERAS_BACKEND'] = backend
importlib.reload(K)
assert K.backend() == backend
if backend == "tensorflow":
K.get_session().close()
cfg = K.tf.ConfigProto()
cfg.gpu_options.allow_growth = True
K.set_session(K.tf.Session(config=cfg))
K.clear_session()
set_keras_backend("tensorflow")
def train(train_loc, dev_loc, shape, settings):
train_texts1, train_texts2, train_labels = read_snli(train_loc)
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
print("Loading spaCy")
nlp = spacy.load('en')
nlp = spacy.load('en_vectors_web_lg')
assert nlp.path is not None
print("Processing texts...")
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
print("Compiling network")
model = build_model(get_embeddings(nlp.vocab), shape, settings)
print("Processing texts...")
Xs = []
for texts in (train_texts1, train_texts2, dev_texts1, dev_texts2):
Xs.append(get_word_ids(list(nlp.pipe(texts, n_threads=20, batch_size=20000)),
max_length=shape[0],
rnn_encode=settings['gru_encode'],
tree_truncate=settings['tree_truncate']))
train_X1, train_X2, dev_X1, dev_X2 = Xs
print(settings)
model.fit(
[train_X1, train_X2],
train_X,
train_labels,
validation_data=([dev_X1, dev_X2], dev_labels),
nb_epoch=settings['nr_epoch'],
batch_size=settings['batch_size'])
validation_data=(dev_X, dev_labels),
epochs=settings['nr_epoch'],
batch_size=settings['batch_size'])
if not (nlp.path / 'similarity').exists():
(nlp.path / 'similarity').mkdir()
print("Saving to", nlp.path / 'similarity')
weights = model.get_weights()
# remove the embedding matrix. We can reconstruct it.
del weights[1]
with (nlp.path / 'similarity' / 'model').open('wb') as file_:
pickle.dump(weights[1:], file_)
with (nlp.path / 'similarity' / 'config.json').open('wb') as file_:
pickle.dump(weights, file_)
with (nlp.path / 'similarity' / 'config.json').open('w') as file_:
file_.write(model.to_json())
def evaluate(dev_loc):
def evaluate(dev_loc, shape):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
nlp = spacy.load('en',
create_pipeline=create_similarity_pipeline)
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
total = 0.
correct = 0.
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
doc1 = nlp(text1)
doc2 = nlp(text2)
sim = doc1.similarity(doc2)
if sim.argmax() == label.argmax():
sim, _ = doc1.similarity(doc2)
if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
correct += 1
total += 1
return correct, total
def demo():
nlp = spacy.load('en',
create_pipeline=create_similarity_pipeline)
doc1 = nlp(u'What were the best crime fiction books in 2016?')
doc2 = nlp(
u'What should I read that was published last year? I like crime stories.')
print(doc1)
print(doc2)
print("Similarity", doc1.similarity(doc2))
def demo(shape):
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
doc1 = nlp(u'The king of France is bald.')
doc2 = nlp(u'France has no king.')
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
@ -84,56 +106,92 @@ def read_snli(path):
texts1 = []
texts2 = []
labels = []
with path.open() as file_:
with open(path, 'r') as file_:
for line in file_:
eg = json.loads(line)
label = eg['gold_label']
if label == '-':
if label == '-': # per Parikh, ignore - SNLI entries
continue
texts1.append(eg['sentence1'])
texts2.append(eg['sentence2'])
labels.append(LABELS[label])
return texts1, texts2, to_categorical(numpy.asarray(labels, dtype='int32'))
return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
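# encode each sentence as a fixed-length array of word ids: words with a vector
# use their rank (shifted past the num_unk OOV ids), others hash into an OOV id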
sents = texts + hypotheses
sents_as_ids = []
for sent in sents:
doc = nlp(sent)
word_ids = []
for i, token in enumerate(doc):
# skip odd spaces from tokenizer
if token.has_vector and token.vector_norm == 0:
continue
if i > max_length:
break
if token.has_vector:
word_ids.append(token.rank + num_unk + 1)
else:
# if we don't have a vector, pick an OOV entry
word_ids.append(token.rank % num_unk + 1)
# there must be a simpler way of generating padded arrays from lists...
word_id_vec = np.zeros((max_length), dtype='int')
clipped_len = min(max_length, len(word_ids))
word_id_vec[:clipped_len] = word_ids[:clipped_len]
sents_as_ids.append(word_id_vec)
return [np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])]
@plac.annotations(
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
train_loc=("Path to training data", "positional", None, Path),
dev_loc=("Path to development data", "positional", None, Path),
train_loc=("Path to training data", "option", "t", str),
dev_loc=("Path to development or test data", "option", "s", str),
max_length=("Length to truncate sentences", "option", "L", int),
nr_hidden=("Number of hidden units", "option", "H", int),
dropout=("Dropout level", "option", "d", float),
learn_rate=("Learning rate", "option", "e", float),
learn_rate=("Learning rate", "option", "r", float),
batch_size=("Batch size for neural network training", "option", "b", int),
nr_epoch=("Number of training epochs", "option", "i", int),
tree_truncate=("Truncate sentences by tree distance", "flag", "T", bool),
gru_encode=("Encode sentences with bidirectional GRU", "flag", "E", bool),
nr_epoch=("Number of training epochs", "option", "e", int),
entail_dir=("Direction of entailment", "option", "D", str, ["both", "left", "right"])
)
def main(mode, train_loc, dev_loc,
tree_truncate=False,
gru_encode=False,
max_length=100,
nr_hidden=100,
dropout=0.2,
learn_rate=0.001,
batch_size=100,
nr_epoch=5):
max_length=50,
nr_hidden=200,
dropout=0.2,
learn_rate=0.001,
batch_size=1024,
nr_epoch=10,
entail_dir="both"):
shape = (max_length, nr_hidden, 3)
settings = {
'lr': learn_rate,
'dropout': dropout,
'batch_size': batch_size,
'nr_epoch': nr_epoch,
'tree_truncate': tree_truncate,
'gru_encode': gru_encode
'entail_dir': entail_dir
}
if mode == 'train':
if train_loc is None or dev_loc is None:
print("Train mode requires paths to training and development data sets.")
sys.exit(1)
train(train_loc, dev_loc, shape, settings)
elif mode == 'evaluate':
correct, total = evaluate(dev_loc)
if dev_loc is None:
print("Evaluate mode requires a path to the test data set.")
sys.exit(1)
correct, total = evaluate(dev_loc, shape)
print(correct, '/', total, correct / total)
else:
demo()
demo(shape)
if __name__ == '__main__':
plac.call(main)


@@ -1,259 +1,137 @@
-# Semantic similarity with decomposable attention (using spaCy and Keras)
-# Practical state-of-the-art text similarity with spaCy and Keras
-import numpy
-from keras.layers import InputSpec, Layer, Input, Dense, merge
-from keras.layers import Lambda, Activation, Dropout, Embedding, TimeDistributed
-from keras.layers import Bidirectional, GRU, LSTM
-from keras.layers.noise import GaussianNoise
-from keras.layers.advanced_activations import ELU
-import keras.backend as K
-from keras.models import Sequential, Model, model_from_json
-from keras.regularizers import l2
-from keras.optimizers import Adam
-from keras.layers.normalization import BatchNormalization
-from keras.layers.pooling import GlobalAveragePooling1D, GlobalMaxPooling1D
-from keras.layers import Merge
+# Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
+# Practical state-of-the-art textual entailment with spaCy and Keras
+import numpy as np
+from keras import layers, Model, models, optimizers
+from keras import backend as K
 def build_model(vectors, shape, settings):
-    '''Compile the model.'''
     max_length, nr_hidden, nr_class = shape
-    # Declare inputs.
-    ids1 = Input(shape=(max_length,), dtype='int32', name='words1')
-    ids2 = Input(shape=(max_length,), dtype='int32', name='words2')
-    # Construct operations, which we'll chain together.
-    embed = _StaticEmbedding(vectors, max_length, nr_hidden, dropout=0.2, nr_tune=5000)
-    if settings['gru_encode']:
-        encode = _BiRNNEncoding(max_length, nr_hidden, dropout=settings['dropout'])
-    attend = _Attention(max_length, nr_hidden, dropout=settings['dropout'])
-    align = _SoftAlignment(max_length, nr_hidden)
-    compare = _Comparison(max_length, nr_hidden, dropout=settings['dropout'])
-    entail = _Entailment(nr_hidden, nr_class, dropout=settings['dropout'])
+    input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')
+    input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')
+    # embeddings (projected)
+    embed = create_embedding(vectors, max_length, nr_hidden)
+    a = embed(input1)
+    b = embed(input2)
+    # step 1: attend
+    F = create_feedforward(nr_hidden)
+    att_weights = layers.dot([F(a), F(b)], axes=-1)
+    G = create_feedforward(nr_hidden)
+    if settings['entail_dir'] == 'both':
+        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
+        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
+        alpha = layers.dot([norm_weights_a, a], axes=1)
+        beta = layers.dot([norm_weights_b, b], axes=1)
-    # Declare the model as a computational graph.
-    sent1 = embed(ids1)  # Shape: (i, n)
-    sent2 = embed(ids2)  # Shape: (j, n)
+        # step 2: compare
+        comp1 = layers.concatenate([a, beta])
+        comp2 = layers.concatenate([b, alpha])
+        v1 = layers.TimeDistributed(G)(comp1)
+        v2 = layers.TimeDistributed(G)(comp2)
-    if settings['gru_encode']:
-        sent1 = encode(sent1)
-        sent2 = encode(sent2)
+        # step 3: aggregate
+        v1_sum = layers.Lambda(sum_word)(v1)
+        v2_sum = layers.Lambda(sum_word)(v2)
+        concat = layers.concatenate([v1_sum, v2_sum])
-    attention = attend(sent1, sent2)  # Shape: (i, j)
+    elif settings['entail_dir'] == 'left':
+        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
+        alpha = layers.dot([norm_weights_a, a], axes=1)
+        comp2 = layers.concatenate([b, alpha])
+        v2 = layers.TimeDistributed(G)(comp2)
+        v2_sum = layers.Lambda(sum_word)(v2)
+        concat = v2_sum
-    align1 = align(sent2, attention)
-    align2 = align(sent1, attention, transpose=True)
-    feats1 = compare(sent1, align1)
-    feats2 = compare(sent2, align2)
-    scores = entail(feats1, feats2)
-    # Now that we have the input/output, we can construct the Model object...
-    model = Model(input=[ids1, ids2], output=[scores])
-    # ...Compile it...
+    else:
+        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
+        beta = layers.dot([norm_weights_b, b], axes=1)
+        comp1 = layers.concatenate([a, beta])
+        v1 = layers.TimeDistributed(G)(comp1)
+        v1_sum = layers.Lambda(sum_word)(v1)
+        concat = v1_sum
+    H = create_feedforward(nr_hidden)
+    out = H(concat)
+    out = layers.Dense(nr_class, activation='softmax')(out)
+    model = Model([input1, input2], out)
     model.compile(
-        optimizer=Adam(lr=settings['lr']),
+        optimizer=optimizers.Adam(lr=settings['lr']),
         loss='categorical_crossentropy',
         metrics=['accuracy'])
-    # ...And return it for training.
     return model
-class _StaticEmbedding(object):
-    def __init__(self, vectors, max_length, nr_out, nr_tune=1000, dropout=0.0):
-        self.nr_out = nr_out
-        self.max_length = max_length
-        self.embed = Embedding(
-            vectors.shape[0],
-            vectors.shape[1],
-            input_length=max_length,
-            weights=[vectors],
-            name='embed',
-            trainable=False)
-        self.tune = Embedding(
-            nr_tune,
-            nr_out,
-            input_length=max_length,
-            weights=None,
-            name='tune',
-            trainable=True,
-            dropout=dropout)
-        self.mod_ids = Lambda(lambda sent: sent % (nr_tune-1)+1,
-                              output_shape=(self.max_length,))
+def create_embedding(vectors, max_length, projected_dim):
+    return models.Sequential([
+        layers.Embedding(
+            vectors.shape[0],
+            vectors.shape[1],
+            input_length=max_length,
+            weights=[vectors],
+            trainable=False),
+        layers.TimeDistributed(
+            layers.Dense(projected_dim,
+                         activation=None,
+                         use_bias=False))
+    ])
-        self.project = TimeDistributed(
-            Dense(
-                nr_out,
-                activation=None,
-                bias=False,
-                name='project'))
-    def __call__(self, sentence):
-        def get_output_shape(shapes):
-            print(shapes)
-            return shapes[0]
-        mod_sent = self.mod_ids(sentence)
-        tuning = self.tune(mod_sent)
-        #tuning = merge([tuning, mod_sent],
-        #    mode=lambda AB: AB[0] * (K.clip(K.cast(AB[1], 'float32'), 0, 1)),
-        #    output_shape=(self.max_length, self.nr_out))
-        pretrained = self.project(self.embed(sentence))
-        vectors = merge([pretrained, tuning], mode='sum')
-        return vectors
+def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):
+    return models.Sequential([
+        layers.Dense(num_units, activation=activation),
+        layers.Dropout(dropout_rate),
+        layers.Dense(num_units, activation=activation),
+        layers.Dropout(dropout_rate)
+    ])
-class _BiRNNEncoding(object):
-    def __init__(self, max_length, nr_out, dropout=0.0):
-        self.model = Sequential()
-        self.model.add(Bidirectional(LSTM(nr_out, return_sequences=True,
-                                          dropout_W=dropout, dropout_U=dropout),
-                                     input_shape=(max_length, nr_out)))
-        self.model.add(TimeDistributed(Dense(nr_out, activation='relu', init='he_normal')))
-        self.model.add(TimeDistributed(Dropout(0.2)))
+def normalizer(axis):
+    def _normalize(att_weights):
+        exp_weights = K.exp(att_weights)
+        sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
+        return exp_weights/sum_weights
+    return _normalize
-    def __call__(self, sentence):
-        return self.model(sentence)
-class _Attention(object):
-    def __init__(self, max_length, nr_hidden, dropout=0.0, L2=0.0, activation='relu'):
-        self.max_length = max_length
-        self.model = Sequential()
-        self.model.add(Dropout(dropout, input_shape=(nr_hidden,)))
-        self.model.add(
-            Dense(nr_hidden, name='attend1',
-                  init='he_normal', W_regularizer=l2(L2),
-                  input_shape=(nr_hidden,), activation='relu'))
-        self.model.add(Dropout(dropout))
-        self.model.add(Dense(nr_hidden, name='attend2',
-                             init='he_normal', W_regularizer=l2(L2), activation='relu'))
-        self.model = TimeDistributed(self.model)
-    def __call__(self, sent1, sent2):
-        def _outer(AB):
-            att_ji = K.batch_dot(AB[1], K.permute_dimensions(AB[0], (0, 2, 1)))
-            return K.permute_dimensions(att_ji, (0, 2, 1))
-        return merge(
-            [self.model(sent1), self.model(sent2)],
-            mode=_outer,
-            output_shape=(self.max_length, self.max_length))
-class _SoftAlignment(object):
-    def __init__(self, max_length, nr_hidden):
-        self.max_length = max_length
-        self.nr_hidden = nr_hidden
-    def __call__(self, sentence, attention, transpose=False):
-        def _normalize_attention(attmat):
-            att = attmat[0]
-            mat = attmat[1]
-            if transpose:
-                att = K.permute_dimensions(att, (0, 2, 1))
-            # 3d softmax
-            e = K.exp(att - K.max(att, axis=-1, keepdims=True))
-            s = K.sum(e, axis=-1, keepdims=True)
-            sm_att = e / s
-            return K.batch_dot(sm_att, mat)
-        return merge([attention, sentence], mode=_normalize_attention,
-                     output_shape=(self.max_length, self.nr_hidden))  # Shape: (i, n)
-class _Comparison(object):
-    def __init__(self, words, nr_hidden, L2=0.0, dropout=0.0):
-        self.words = words
-        self.model = Sequential()
-        self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
-        self.model.add(Dense(nr_hidden, name='compare1',
-                             init='he_normal', W_regularizer=l2(L2)))
-        self.model.add(Activation('relu'))
-        self.model.add(Dropout(dropout))
-        self.model.add(Dense(nr_hidden, name='compare2',
-                             W_regularizer=l2(L2), init='he_normal'))
-        self.model.add(Activation('relu'))
-        self.model = TimeDistributed(self.model)
-    def __call__(self, sent, align, **kwargs):
-        result = self.model(merge([sent, align], mode='concat'))  # Shape: (i, n)
-        avged = GlobalAveragePooling1D()(result, mask=self.words)
-        maxed = GlobalMaxPooling1D()(result, mask=self.words)
-        merged = merge([avged, maxed])
-        result = BatchNormalization()(merged)
-        return result
-class _Entailment(object):
-    def __init__(self, nr_hidden, nr_out, dropout=0.0, L2=0.0):
-        self.model = Sequential()
-        self.model.add(Dropout(dropout, input_shape=(nr_hidden*2,)))
-        self.model.add(Dense(nr_hidden, name='entail1',
-                             init='he_normal', W_regularizer=l2(L2)))
-        self.model.add(Activation('relu'))
-        self.model.add(Dropout(dropout))
-        self.model.add(Dense(nr_hidden, name='entail2',
-                             init='he_normal', W_regularizer=l2(L2)))
-        self.model.add(Activation('relu'))
-        self.model.add(Dense(nr_out, name='entail_out', activation='softmax',
-                             W_regularizer=l2(L2), init='zero'))
-    def __call__(self, feats1, feats2):
-        features = merge([feats1, feats2], mode='concat')
-        return self.model(features)
-class _GlobalSumPooling1D(Layer):
-    '''Global sum pooling operation for temporal data.
-    # Input shape
-        3D tensor with shape: `(samples, steps, features)`.
-    # Output shape
-        2D tensor with shape: `(samples, features)`.
-    '''
-    def __init__(self, **kwargs):
-        super(_GlobalSumPooling1D, self).__init__(**kwargs)
-        self.input_spec = [InputSpec(ndim=3)]
-    def get_output_shape_for(self, input_shape):
-        return (input_shape[0], input_shape[2])
-    def call(self, x, mask=None):
-        if mask is not None:
-            return K.sum(x * K.clip(mask, 0, 1), axis=1)
-        else:
-            return K.sum(x, axis=1)
+def sum_word(x):
+    return K.sum(x, axis=1)
 def test_build_model():
-    vectors = numpy.ndarray((100, 8), dtype='float32')
+    vectors = np.ndarray((100, 8), dtype='float32')
     shape = (10, 16, 3)
-    settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
+    settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
     model = build_model(vectors, shape, settings)
 def test_fit_model():
     def _generate_X(nr_example, length, nr_vector):
-        X1 = numpy.ndarray((nr_example, length), dtype='int32')
+        X1 = np.ndarray((nr_example, length), dtype='int32')
         X1 *= X1 < nr_vector
         X1 *= 0 <= X1
-        X2 = numpy.ndarray((nr_example, length), dtype='int32')
+        X2 = np.ndarray((nr_example, length), dtype='int32')
         X2 *= X2 < nr_vector
         X2 *= 0 <= X2
         return [X1, X2]
     def _generate_Y(nr_example, nr_class):
-        ys = numpy.zeros((nr_example, nr_class), dtype='int32')
+        ys = np.zeros((nr_example, nr_class), dtype='int32')
         for i in range(nr_example):
             ys[i, i % nr_class] = 1
         return ys
-    vectors = numpy.ndarray((100, 8), dtype='float32')
+    vectors = np.ndarray((100, 8), dtype='float32')
     shape = (10, 16, 3)
-    settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True}
+    settings = {'lr': 0.001, 'dropout': 0.2, 'gru_encode':True, 'entail_dir':'both'}
     model = build_model(vectors, shape, settings)
     train_X = _generate_X(20, shape[0], vectors.shape[0])
@@ -261,8 +139,7 @@ def test_fit_model():
     dev_X = _generate_X(15, shape[0], vectors.shape[0])
     dev_Y = _generate_Y(15, shape[2])
-    model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), nb_epoch=5,
-              batch_size=4)
+    model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
 __all__ = [build_model]
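The rewrite above is essentially a Keras 1 to Keras 2 port: the deprecated merge/Merge layers become explicit functional ops (layers.dot, layers.concatenate, layers.add), init/W_regularizer/bias become kernel_initializer/kernel_regularizer/use_bias, and nb_epoch becomes epochs. A minimal sketch of the pattern, with toy shapes that are not part of the example:

import numpy as np
from keras import layers, Model

x1 = layers.Input(shape=(4,))
x2 = layers.Input(shape=(4,))
# Keras 1 wrote merge([x1, x2], mode='concat') or mode='sum'; Keras 2 is explicit:
concat = layers.concatenate([x1, x2])
summed = layers.add([x1, x2])
out = layers.Dense(1)(concat)
model = Model([x1, x2], out)
model.compile(optimizer='adam', loss='mse')
# Keras 1's nb_epoch keyword is now epochs:
model.fit([np.zeros((8, 4)), np.ones((8, 4))], np.zeros((8, 1)), epochs=1, batch_size=4)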

View File

@@ -1,8 +1,5 @@
+import numpy as np
 from keras.models import model_from_json
-import numpy
-import numpy.random
-import json
-from spacy.tokens.span import Span
 try:
     import cPickle as pickle
@@ -11,16 +8,23 @@ except ImportError:
 class KerasSimilarityShim(object):
+    entailment_types = ["entailment", "contradiction", "neutral"]
     @classmethod
-    def load(cls, path, nlp, get_features=None, max_length=100):
+    def load(cls, path, nlp, max_length=100, get_features=None):
         if get_features is None:
             get_features = get_word_ids
         with (path / 'config.json').open() as file_:
             model = model_from_json(file_.read())
         with (path / 'model').open('rb') as file_:
             weights = pickle.load(file_)
         embeddings = get_embeddings(nlp.vocab)
-        model.set_weights([embeddings] + weights)
+        weights.insert(1, embeddings)
+        model.set_weights(weights)
         return cls(model, get_features=get_features, max_length=max_length)
     def __init__(self, model, get_features=None, max_length=100):
@@ -32,58 +36,42 @@ class KerasSimilarityShim(object):
         doc.user_hooks['similarity'] = self.predict
         doc.user_span_hooks['similarity'] = self.predict
         return doc
     def predict(self, doc1, doc2):
-        x1 = self.get_features([doc1], max_length=self.max_length, tree_truncate=True)
-        x2 = self.get_features([doc2], max_length=self.max_length, tree_truncate=True)
+        x1 = self.get_features([doc1], max_length=self.max_length)
+        x2 = self.get_features([doc2], max_length=self.max_length)
         scores = self.model.predict([x1, x2])
-        return scores[0]
+        return self.entailment_types[scores.argmax()], scores.max()
 def get_embeddings(vocab, nr_unk=100):
-    nr_vector = max(lex.rank for lex in vocab) + 1
-    vectors = numpy.zeros((nr_vector+nr_unk+2, vocab.vectors_length), dtype='float32')
+    # the extra +1 is for a zero vector representing sentence-final padding
+    num_vectors = max(lex.rank for lex in vocab) + 2
+    # create random vectors for OOV tokens
+    oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
+    oov = oov / oov.sum(axis=1, keepdims=True)
+    vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype='float32')
+    vectors[1:(nr_unk + 1), ] = oov
     for lex in vocab:
-        if lex.has_vector:
-            vectors[lex.rank+1] = lex.vector / lex.vector_norm
+        if lex.has_vector and lex.vector_norm > 0:
+            vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
     return vectors
-def get_word_ids(docs, rnn_encode=False, tree_truncate=False, max_length=100, nr_unk=100):
-    Xs = numpy.zeros((len(docs), max_length), dtype='int32')
+def get_word_ids(docs, max_length=100, nr_unk=100):
+    Xs = np.zeros((len(docs), max_length), dtype='int32')
     for i, doc in enumerate(docs):
-        if tree_truncate:
-            if isinstance(doc, Span):
-                queue = [doc.root]
-            else:
-                queue = [sent.root for sent in doc.sents]
-        else:
-            queue = list(doc)
-        words = []
-        while len(words) <= max_length and queue:
-            word = queue.pop(0)
-            if rnn_encode or (not word.is_punct and not word.is_space):
-                words.append(word)
-            if tree_truncate:
-                queue.extend(list(word.lefts))
-                queue.extend(list(word.rights))
-        words.sort()
-        for j, token in enumerate(words):
-            if token.has_vector:
-                Xs[i, j] = token.rank+1
-            else:
-                Xs[i, j] = (token.shape % (nr_unk-1))+2
-            j += 1
-            if j >= max_length:
+        for j, token in enumerate(doc):
+            if j == max_length:
                 break
-        else:
-            Xs[i, len(words)] = 1
+            if token.has_vector:
+                Xs[i, j] = token.rank + nr_unk + 1
+            else:
+                Xs[i, j] = token.rank % nr_unk + 1
     return Xs
-def create_similarity_pipeline(nlp, max_length=100):
-    return [
-        nlp.tagger,
-        nlp.entity,
-        nlp.parser,
-        KerasSimilarityShim.load(nlp.path / 'similarity', nlp, max_length)
-    ]
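For orientation, a usage sketch of the updated shim; the model directory, pipeline name, and example sentences are illustrative assumptions, not part of the diff:

from pathlib import Path
import spacy

nlp = spacy.load('en_vectors_web_lg')
# hypothetical location of a model trained by this example
shim = KerasSimilarityShim.load(Path('models') / 'similarity', nlp, max_length=50)
nlp.add_pipe(shim, name='entailment', last=True)
doc1 = nlp(u'A little girl is playing in the park.')
doc2 = nlp(u'A child is outside.')
label, score = doc1.similarity(doc2)  # predict() now returns (entailment type, probability)
print(label, score)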

View File

@@ -0,0 +1,955 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural language inference using spaCy and Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constructing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_vectors_web_lg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import ujson as json\n",
"from keras.utils import to_categorical\n",
"\n",
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
"def read_snli(path):\n",
" texts1 = []\n",
" texts2 = []\n",
" labels = []\n",
" with open(path, 'r') as file_:\n",
" for line in file_:\n",
" eg = json.loads(line)\n",
" label = eg['gold_label']\n",
" if label == '-': # per Parikh, ignore - SNLI entries\n",
" continue\n",
" texts1.append(eg['sentence1'])\n",
" texts2.append(eg['sentence2'])\n",
" labels.append(LABELS[label])\n",
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
" sents = texts + hypotheses\n",
" \n",
" # the extra +1 is for a zero vector represting NULL for padding\n",
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
" \n",
" # create random vectors for OOV tokens\n",
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
" \n",
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
" vectors[num_vectors:, ] = oov\n",
" for lex in nlp.vocab:\n",
" if lex.has_vector and lex.vector_norm > 0:\n",
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n",
" \n",
" sents_as_ids = []\n",
" for sent in sents:\n",
" doc = nlp(sent)\n",
" word_ids = []\n",
" \n",
" for i, token in enumerate(doc):\n",
" # skip odd spaces from tokenizer\n",
" if token.has_vector and token.vector_norm == 0:\n",
" continue\n",
" \n",
" if i > max_length:\n",
" break\n",
" \n",
" if token.has_vector:\n",
" word_ids.append(token.rank + 1)\n",
" else:\n",
" # if we don't have a vector, pick an OOV entry\n",
" word_ids.append(token.rank % num_oov + num_vectors) \n",
" \n",
" # there must be a simpler way of generating padded arrays from lists...\n",
" word_id_vec = np.zeros((max_length), dtype='int')\n",
" clipped_len = min(max_length, len(word_ids))\n",
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
" sents_as_ids.append(word_id_vec)\n",
" \n",
" \n",
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
"\n",
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we will clip sentences to 50 words maximum."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from keras import layers, Model, models\n",
"from keras import backend as K"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def create_embedding(vectors, max_length, projected_dim):\n",
" return models.Sequential([\n",
" layers.Embedding(\n",
" vectors.shape[0],\n",
" vectors.shape[1],\n",
" input_length=max_length,\n",
" weights=[vectors],\n",
" trainable=False),\n",
" \n",
" layers.TimeDistributed(\n",
" layers.Dense(projected_dim,\n",
" activation=None,\n",
" use_bias=False))\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
" return models.Sequential([\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate),\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate)\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic idea of the (Parikh et al, 2016) model is to:\n",
"\n",
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
"\n",
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def normalizer(axis):\n",
" def _normalize(att_weights):\n",
" exp_weights = K.exp(att_weights)\n",
" sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
" return exp_weights/sum_weights\n",
" return _normalize\n",
"\n",
"def sum_word(x):\n",
" return K.sum(x, axis=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
" input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
" input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
" \n",
" # embeddings (projected)\n",
" embed = create_embedding(vectors, max_length, projected_dim)\n",
" \n",
" a = embed(input1)\n",
" b = embed(input2)\n",
" \n",
" # step 1: attend\n",
" F = create_feedforward(num_hidden)\n",
" att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
" \n",
" G = create_feedforward(num_hidden)\n",
" \n",
" if entail_dir == 'both':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
"\n",
" # step 2: compare\n",
" comp1 = layers.concatenate([a, beta])\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
"\n",
" # step 3: aggregate\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = layers.concatenate([v1_sum, v2_sum])\n",
" elif entail_dir == 'left':\n",
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
" comp2 = layers.concatenate([b, alpha])\n",
" v2 = layers.TimeDistributed(G)(comp2)\n",
" v2_sum = layers.Lambda(sum_word)(v2)\n",
" concat = v2_sum\n",
" else:\n",
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
" comp1 = layers.concatenate([a, beta])\n",
" v1 = layers.TimeDistributed(G)(comp1)\n",
" v1_sum = layers.Lambda(sum_word)(v1)\n",
" concat = v1_sum\n",
" \n",
" H = create_feedforward(num_hidden)\n",
" out = H(concat)\n",
" out = layers.Dense(num_classes, activation='softmax')(out)\n",
" \n",
" model = Model([input1, input2], out)\n",
" \n",
" model.compile(optimizer='adam',\n",
" loss='categorical_crossentropy',\n",
" metrics=['accuracy'])\n",
" return model\n",
" \n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
" sequential_2[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
" sequential_1[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
" dot_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
" dot_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
" lambda_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
"==================================================================================================\n",
"Total params: 321,703,403\n",
"Trainable params: 381,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"K.clear_session()\n",
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
"m.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5c9f49c438>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted by differences in `max_length` (here set at 50), in the training regime, and that here we use Keras' built-in validation splitting rather than the SNLI test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Experiment: the asymmetric model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association between the hypothesis to the text is all that is needed for classifying the entailment.\n",
"\n",
"The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the paramater count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
" sequential_5[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
" sequential_6[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
" sequential_5[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
" dot_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
"m1.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model performs the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from 64 down to 48 microseconds per step. \n",
"\n",
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
"\n",
"We'll just use 10 epochs for expediency."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
" sequential_14[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
" dot_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
"m2.summary()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 455226 samples, validate on 113807 samples\n",
"Epoch 1/10\n",
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
"Epoch 2/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
"Epoch 3/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
"Epoch 4/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
"Epoch 5/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
"Epoch 6/10\n",
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
"Epoch 7/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
"Epoch 8/10\n",
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
"Epoch 9/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
"Epoch 10/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7fa6850cf080>"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n",
"\n",
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
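A quick plain-numpy check of what the notebook's normalizer Lambda computes (a sketch with toy shapes):

import numpy as np

def normalize(att_weights, axis):
    # mirrors normalizer(axis) above: a softmax along one axis of the score matrix
    exp_weights = np.exp(att_weights)
    return exp_weights / exp_weights.sum(axis=axis, keepdims=True)

att = np.random.randn(1, 3, 4)   # (batch, words in sentence a, words in sentence b)
norm_a = normalize(att, axis=1)  # weights over a's words, used to build alpha
norm_b = normalize(att, axis=2)  # weights over b's words, used to build beta
print(norm_a.sum(axis=1))        # all ones: each column is a distribution
print(norm_b.sum(axis=2))        # all ones: each row is a distribution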

View File

@@ -0,0 +1,27 @@
'''Demonstrate adding a rule-based component that forces some tokens to not
be entities, before the NER tagger is applied. This is used to hotfix the issue
in https://github.com/explosion/spaCy/issues/2870 , present as of spaCy v2.0.16.
'''
import spacy
from spacy.attrs import ENT_IOB

def fix_space_tags(doc):
    ent_iobs = doc.to_array([ENT_IOB])
    for i, token in enumerate(doc):
        if token.is_space:
            # Sets 'O' tag (0 is None, so I is 1, O is 2)
            ent_iobs[i] = 2
    doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
    return doc

def main():
    nlp = spacy.load('en_core_web_sm')
    text = u'''This is some crazy test where I dont need an Apple Watch to make things bug'''
    doc = nlp(text)
    print('Before', doc.ents)
    nlp.add_pipe(fix_space_tags, name='fix-ner', before='ner')
    doc = nlp(text)
    print('After', doc.ents)

if __name__ == '__main__':
    main()
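A quick way to confirm the component landed in the right slot is to check nlp.pipe_names; assuming the default en_core_web_sm pipeline and the fix_space_tags function above, the result would be expected to look like this:

import spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(fix_space_tags, name='fix-ner', before='ner')
print(nlp.pipe_names)  # expected: ['tagger', 'parser', 'fix-ner', 'ner']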

View File

@@ -21,8 +21,9 @@ from __future__ import unicode_literals, print_function
 import plac
 import random
-import spacy
 from pathlib import Path
+import spacy
+from spacy.util import minibatch, compounding
 # training data: texts, heads and dependency labels
@@ -63,7 +64,7 @@ TRAIN_DATA = [
     model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
     output_dir=("Optional output directory", "option", "o", Path),
     n_iter=("Number of training iterations", "option", "n", int))
-def main(model=None, output_dir=None, n_iter=5):
+def main(model=None, output_dir=None, n_iter=15):
     """Load the model, set up the pipeline and train the parser."""
     if model is not None:
         nlp = spacy.load(model)  # load existing spaCy model
@@ -89,9 +90,12 @@ def main(model=None, output_dir=None, n_iter=5):
     for itn in range(n_iter):
         random.shuffle(TRAIN_DATA)
         losses = {}
-        for text, annotations in TRAIN_DATA:
-            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
-        print(losses)
+        # batch up the examples using spaCy's minibatch
+        batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
+        for batch in batches:
+            texts, annotations = zip(*batch)
+            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
+        print('Losses', losses)
     # test the trained model
     test_model(nlp)
@@ -135,7 +139,8 @@ if __name__ == '__main__':
     # [
     #     ('find', 'ROOT', 'find'),
     #     ('cheapest', 'QUALITY', 'gym'),
-    #     ('gym', 'PLACE', 'find')
+    #     ('gym', 'PLACE', 'find'),
+    #     ('near', 'ATTRIBUTE', 'gym'),
+    #     ('work', 'LOCATION', 'near')
     # ]
     # show me the best hotel in berlin
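The minibatch/compounding pattern introduced here recurs in the training examples below. compounding(4., 32., 1.001) is an infinite generator of batch sizes that starts at 4 and is multiplied by 1.001 on each draw, capped at 32; minibatch consumes it so batches grow slowly as training proceeds. A small sketch, with stand-in training pairs:

from spacy.util import minibatch, compounding

sizes = compounding(4., 32., 1.001)
print([next(sizes) for _ in range(3)])  # [4.0, 4.004, 4.008004]

train_data = [('text %d' % i, {}) for i in range(100)]  # stand-in (text, annotations) pairs
for batch in minibatch(train_data, size=compounding(4., 32., 1.001)):
    texts, annotations = zip(*batch)
    print(len(texts))  # nlp.update(texts, annotations, ...) would go here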

View File

@@ -15,6 +15,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.util import minibatch, compounding
 # training data
@@ -62,14 +63,17 @@ def main(model=None, output_dir=None, n_iter=100):
     for itn in range(n_iter):
         random.shuffle(TRAIN_DATA)
         losses = {}
-        for text, annotations in TRAIN_DATA:
+        # batch up the examples using spaCy's minibatch
+        batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
+        for batch in batches:
+            texts, annotations = zip(*batch)
             nlp.update(
-                [text],  # batch of texts
-                [annotations],  # batch of annotations
+                texts,  # batch of texts
+                annotations,  # batch of annotations
                 drop=0.5,  # dropout - make it harder to memorise data
                 sgd=optimizer,  # callable to update weights
                 losses=losses)
-        print(losses)
+        print('Losses', losses)
     # test the trained model
     for text, _ in TRAIN_DATA:
View File

@@ -31,6 +31,7 @@ import plac
 import random
 from pathlib import Path
 import spacy
+from spacy.util import minibatch, compounding
 # new entity label
@@ -73,7 +74,7 @@ TRAIN_DATA = [
     new_model_name=("New model name for model meta.", "option", "nm", str),
     output_dir=("Optional output directory", "option", "o", Path),
     n_iter=("Number of training iterations", "option", "n", int))
-def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
+def main(model=None, new_model_name='animal', output_dir=None, n_iter=10):
     """Set up the pipeline and entity recognizer, and train the new entity."""
     if model is not None:
         nlp = spacy.load(model)  # load existing spaCy model
@@ -104,10 +105,13 @@ def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
     for itn in range(n_iter):
         random.shuffle(TRAIN_DATA)
         losses = {}
-        for text, annotations in TRAIN_DATA:
-            nlp.update([text], [annotations], sgd=optimizer, drop=0.35,
+        # batch up the examples using spaCy's minibatch
+        batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
+        for batch in batches:
+            texts, annotations = zip(*batch)
+            nlp.update(texts, annotations, sgd=optimizer, drop=0.35,
                        losses=losses)
-        print(losses)
+        print('Losses', losses)
     # test the trained model
     test_text = 'Do you like horses?'

View File

@ -13,6 +13,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# training data
@ -62,9 +63,12 @@ def main(model=None, output_dir=None, n_iter=10):
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
print(losses)
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
print('Losses', losses)
# test the trained model
test_text = "I like securities."

View File

@ -16,6 +16,7 @@ import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# You need to define a mapping from your data's part-of-speech tag names to the
@ -63,9 +64,12 @@ def main(lang='en', output_dir=None, n_iter=25):
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
print(losses)
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4., 32., 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
print('Losses', losses)
# test the trained model
test_text = "I like blue eggs"

View File

@ -2,7 +2,7 @@ cython>=0.25
numpy>=1.15.0
cymem>=2.0.2,<2.1.0
preshed>=2.0.1,<2.1.0
thinc==7.0.0.dev1
thinc==7.0.0.dev2
blis>=0.2.2,<0.3.0
murmurhash>=0.28.0,<1.1.0
cytoolz>=0.9.0,<0.10.0

View File

@ -200,7 +200,7 @@ def setup_package():
"murmurhash>=0.28.0,<1.1.0",
"cymem>=2.0.2,<2.1.0",
"preshed>=2.0.1,<2.1.0",
"thinc==7.0.0.dev1",
"thinc==7.0.0.dev2",
"blis>=0.2.2,<0.3.0",
"plac<1.0.0,>=0.9.6",
"ujson>=1.35",

View File

@ -4,6 +4,9 @@ import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
# These are imported as part of the API
from thinc.neural.util import prefer_gpu, require_gpu
from .cli.info import info as cli_info
from .glossary import explain
from .about import __version__

View File

@ -14,7 +14,7 @@ from .. import about
@plac.annotations(
model=("model to download, shortcut or name)", "positional", None, str),
model=("model to download, shortcut or name", "positional", None, str),
direct=("force direct download. Needs model name with version and won't "
"perform compatibility check", "flag", "d", bool),
pip_args=("additional arguments to be passed to `pip install` when "

View File

@ -1,6 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
import os
import sys
import ujson
import itertools

View File

@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
import random
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html
@ -38,7 +40,10 @@ class DependencyRenderer(object):
minify (bool): Minify HTML markup.
RETURNS (unicode): Rendered SVG or HTML markup.
"""
rendered = [self.render_svg(i, p['words'], p['arcs'])
# Create a random ID prefix to make sure parses don't receive the
# same ID, even if they're identical
id_prefix = random.randint(0, 999)
rendered = [self.render_svg('{}-{}'.format(id_prefix, i), p['words'], p['arcs'])
for i, p in enumerate(parsed)]
if page:
content = ''.join([TPL_FIGURE.format(content=svg)

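Context for the change above: `render_svg` interpolates the ID into the generated SVG markup, so rendering two identical parses on one page could previously produce clashing element ids and mislinked arcs. A sketch of the case the random prefix fixes (assumes an English model such as en_core_web_sm is installed):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
docs = [nlp(u'This is a sentence'), nlp(u'This is a sentence')]
# Identical parses: before this change both SVGs shared the same element ids
html = displacy.render(docs, style='dep', page=True)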
View File

@ -270,7 +270,10 @@ class Errors(object):
"NBOR_RELOP.")
E101 = ("NODE_NAME should be a new node and NBOR_NAME should already "
"have been declared in previous edges.")
E102 = ("Can't merge non-disjoint spans. '{token}' is already part of tokens to merge")
E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A token"
" can only be part of one entity, so make sure the entities you're "
"setting don't overlap.")
@add_codes
class TempErrors(object):

View File

@ -286,6 +286,7 @@ GLOSSARY = {
'PERSON': 'People, including fictional',
'NORP': 'Nationalities or religious or political groups',
'FACILITY': 'Buildings, airports, highways, bridges, etc.',
'FAC': 'Buildings, airports, highways, bridges, etc.',
'ORG': 'Companies, agencies, institutions, etc.',
'GPE': 'Countries, cities, states',
'LOC': 'Non-GPE locations, mountain ranges, bodies of water',

View File

@ -20,12 +20,11 @@ _suffixes = (_list_punct + LIST_ELLIPSES + LIST_QUOTES + LIST_ICONS +
r'(?<=[{}(?:{})])\.'.format('|'.join([ALPHA_LOWER, r'%²\-\)\]\+', QUOTES]), _currency)])
_infixes = (LIST_ELLIPSES + LIST_ICONS +
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
[r'(?<=[0-9{zero}-{nine}])[+\-\*^=](?=[0-9{zero}-{nine}-])'.format(zero=u'', nine=u''),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=_quotes)])
r'(?<=[{a}])[{h}](?={ae})'.format(a=ALPHA, h=HYPHENS, ae=u''),
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
r'(?<=[{a}"])[:<>=/](?=[{a}])'.format(a=ALPHA)])
TOKENIZER_PREFIXES = _prefixes

spacy/lang/ca/__init__.py (new file, 64 lines)
View File

@ -0,0 +1,64 @@
# coding: utf8
from __future__ import unicode_literals

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS

# uncomment if files are available
# from .norm_exceptions import NORM_EXCEPTIONS
# from .tag_map import TAG_MAP
# from .morph_rules import MORPH_RULES

# uncomment if lookup-based lemmatizer is available
from .lemmatizer import LOOKUP

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups

# Create a Language subclass
# Documentation: https://spacy.io/docs/usage/adding-languages
# This file should be placed in spacy/lang/ca (ISO code of language).
# Before submitting a pull request, make sure to remove all comments from the
# language data files, and run at least the basic tokenizer tests. Simply add
# the language ID to the list of languages in spacy/tests/conftest.py to
# include it in the basic tokenizer sanity tests. You can optionally add a
# fixture for the language's tokenizer and add more specific tests. For more
# info, see the tests documentation:
# https://github.com/explosion/spaCy/tree/master/spacy/tests


class CatalanDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ca'  # ISO code

    # add more norm exception dictionaries here
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)

    # overwrite functions for lexical attributes
    lex_attr_getters.update(LEX_ATTRS)

    # add custom tokenizer exceptions to base exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)

    # add stop words
    stop_words = STOP_WORDS

    # if available: add tag map
    # tag_map = dict(TAG_MAP)

    # if available: add morph rules
    # morph_rules = dict(MORPH_RULES)

    lemma_lookup = LOOKUP


class Catalan(Language):
    lang = 'ca'  # ISO code
    Defaults = CatalanDefaults  # set Defaults to custom language defaults


# set default export: this allows the language class to be lazy-loaded
__all__ = ['Catalan']
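
Once the package is in place under spacy/lang/ca, the blank language can be loaded without a model; a minimal sketch:

import spacy

nlp = spacy.blank('ca')
doc = nlp(u'El gat menja peix')
print([token.text for token in doc])  # tokenizer exceptions, stop words and lex attrs apply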

spacy/lang/ca/examples.py (new file, 22 lines)
View File

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.es.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple està buscant comprar una startup del Regne Unit per mil milions de dòlars",
"Els cotxes autònoms deleguen la responsabilitat de l'assegurança als seus fabricants",
"San Francisco analitza prohibir els robots de repartiment",
"Londres és una gran ciutat del Regne Unit",
"El gat menja peix",
"Veig a l'home amb el telescopi",
"L'Aranya menja mosques",
"El pingüí incuba en el seu niu",
]

spacy/lang/ca/lemmatizer.py (new file, 591540 lines)

File diff suppressed because it is too large

View File

@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
# import the symbols for the attrs you want to overwrite
from ...attrs import LIKE_NUM


# Overwriting functions for lexical attributes
# Documentation: https://spacy.io/docs/usage/adding-languages#lex-attrs
# Most of these functions, like is_lower or like_url, should be language-
# independent. Others, like like_num (which includes both digits and number
# words), require customisation.


# Example: check if token resembles a number
_num_words = ['zero', 'un', 'dos', 'tres', 'quatre', 'cinc', 'sis', 'set',
              'vuit', 'nou', 'deu', 'onze', 'dotze', 'tretze', 'catorze',
              'quinze', 'setze', 'disset', 'divuit', 'dinou', 'vint',
              'trenta', 'quaranta', 'cinquanta', 'seixanta', 'setanta',
              'vuitanta', 'noranta', 'cent', 'mil', 'milió', 'bilió',
              'trilió', 'quatrilió', 'gazilió', 'bazilió']


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False


# Create dictionary of functions to overwrite. The default lex_attr_getters
# are updated with this one, so only the functions defined here are
# overwritten.
LEX_ATTRS = {
    LIKE_NUM: like_num
}
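
A few sanity checks for the function above (a sketch; assumes the module is importable as spacy.lang.ca.lex_attrs):

from spacy.lang.ca.lex_attrs import like_num

assert like_num('11')
assert like_num('onze')   # number word
assert like_num('3/4')    # simple fractions count too
assert not like_num('gat')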

View File

@ -0,0 +1,56 @@
# encoding: utf8
from __future__ import unicode_literals
# Stop words
STOP_WORDS = set("""
a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò
als altra altre altres amb ambdues ambdós anar ans apa aquell aquella aquelles aquells
aquest aquesta aquestes aquests aquí
baix bastant
cada cadascuna cadascunes cadascuns cadascú com consegueixo conseguim conseguir
consigueix consigueixen consigueixes contra
d'un d'una d'unes d'uns dalt de del dels des des de després dins dintre donat doncs durant
e eh el elles ells els em en encara ens entre era erem eren eres es esta estan estat
estava estaven estem esteu estic està estàvem estàveu et etc ets érem éreu és éssent
fa faig fan fas fem fer feu fi fins fora
gairebé
ha han has haver havia he hem heu hi ho
i igual iguals inclòs
ja jo
l'hi la les li li'n llarg llavors
m'he ma mal malgrat mateix mateixa mateixes mateixos me mentre meu meus meva
meves mode molt molta moltes molts mon mons més
n'he n'hi ne ni no nogensmenys només nosaltres nostra nostre nostres
o oh oi on
pas pel pels per per que perquè però poc poca pocs podem poden poder
podeu poques potser primer propi puc
qual quals quan quant que quelcom qui quin quina quines quins què
s'ha s'han sa sabem saben saber sabeu sap saps semblant semblants sense ser ses
seu seus seva seves si sobre sobretot soc solament sols som son sons sota sou sóc són
t'ha t'han t'he ta tal també tampoc tan tant tanta tantes te tene tenim tenir teniu
teu teus teva teves tinc ton tons tot tota totes tots
un una unes uns us últim ús
va vaig vam van vas veu vosaltres vostra vostre vostres
""".split())

spacy/lang/ca/tag_map.py (new file, 36 lines)
View File

@ -0,0 +1,36 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ
# Add a tag map
# Documentation: https://spacy.io/docs/usage/adding-languages#tag-map
# Universal Dependencies: http://universaldependencies.org/u/pos/all.html
# The keys of the tag map should be strings in your tag set. The dictionary must
# have an entry POS whose value is one of the Universal Dependencies tags.
# Optionally, you can also include morphological features or other attributes.
TAG_MAP = {
"ADV": {POS: ADV},
"NOUN": {POS: NOUN},
"ADP": {POS: ADP},
"PRON": {POS: PRON},
"SCONJ": {POS: SCONJ},
"PROPN": {POS: PROPN},
"DET": {POS: DET},
"SYM": {POS: SYM},
"INTJ": {POS: INTJ},
"PUNCT": {POS: PUNCT},
"NUM": {POS: NUM},
"AUX": {POS: AUX},
"X": {POS: X},
"CONJ": {POS: CONJ},
"CCONJ": {POS: CCONJ},
"ADJ": {POS: ADJ},
"VERB": {POS: VERB},
"PART": {POS: PART},
"SP": {POS: SPACE}
}

View File

@ -0,0 +1,51 @@
# coding: utf8
from __future__ import unicode_literals
# import symbols if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
_exc = {}


for exc_data in [
    {ORTH: "aprox.", LEMMA: "aproximadament"},
    {ORTH: "pàg.", LEMMA: "pàgina"},
    {ORTH: "p.ex.", LEMMA: "per exemple"},
    {ORTH: "gen.", LEMMA: "gener"},
    {ORTH: "feb.", LEMMA: "febrer"},
    {ORTH: "abr.", LEMMA: "abril"},
    {ORTH: "jul.", LEMMA: "juliol"},
    {ORTH: "set.", LEMMA: "setembre"},
    {ORTH: "oct.", LEMMA: "octubre"},
    {ORTH: "nov.", LEMMA: "novembre"},
    {ORTH: "dec.", LEMMA: "desembre"},
    {ORTH: "Dr.", LEMMA: "doctor"},
    {ORTH: "Sr.", LEMMA: "senyor"},
    {ORTH: "Sra.", LEMMA: "senyora"},
    {ORTH: "Srta.", LEMMA: "senyoreta"},
    {ORTH: "núm", LEMMA: "número"},
    {ORTH: "St.", LEMMA: "sant"},
    {ORTH: "Sta.", LEMMA: "santa"}]:
    _exc[exc_data[ORTH]] = [exc_data]


# Times
_exc["12m."] = [
    {ORTH: "12"},
    {ORTH: "m.", LEMMA: "p.m."}]


for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]


# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:
TOKENIZER_EXCEPTIONS = _exc
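
For illustration, each generated time entry splits the hour from the period marker and normalises the period; a quick check (a sketch, assuming the module is importable as spacy.lang.ca.tokenizer_exceptions):

from spacy.lang.ca.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.symbols import ORTH, LEMMA

# '3am' becomes two tokens, '3' and 'am', with 'am' normalised to 'a.m.'
assert TOKENIZER_EXCEPTIONS['3am'] == [{ORTH: '3'}, {ORTH: 'am', LEMMA: 'a.m.'}]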

View File

@ -16,6 +16,7 @@ _latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
_persian = r'[\p{L}&&\p{Arabic}]'
_russian_lower = r'[ёа-я]'
_russian_upper = r'[ЁА-Я]'
_sinhala = r'[\p{L}&&\p{Sinhala}]'
_tatar_lower = r'[әөүҗңһ]'
_tatar_upper = r'[ӘӨҮҖҢҺ]'
_greek_lower = r'[α-ωάέίόώήύ]'
@ -23,7 +24,7 @@ _greek_upper = r'[Α-ΩΆΈΊΌΏΉΎ]'
_upper = [_latin_upper, _russian_upper, _tatar_upper, _greek_upper]
_lower = [_latin_lower, _russian_lower, _tatar_lower, _greek_lower]
_uncased = [_bengali, _hebrew, _persian]
_uncased = [_bengali, _hebrew, _persian, _sinhala]
ALPHA = merge_char_classes(_upper + _lower + _uncased)
ALPHA_LOWER = merge_char_classes(_lower + _uncased)

View File

@ -14,4 +14,5 @@ _exc = {
NORM_EXCEPTIONS = {}
for string, norm in _exc.items():
NORM_EXCEPTIONS[string] = norm
NORM_EXCEPTIONS[string.title()] = norm

View File

@ -1,21 +1,29 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from ..norm_exceptions import BASE_NORMS
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map import TAG_MAP
from .punctuation import TOKENIZER_SUFFIXES
from .lemmatizer import LEMMA_RULES, LEMMA_INDEX, LEMMA_EXC
class PersianDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'fa'
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
lex_attr_getters[LANG] = lambda text: 'fa'
tokenizer_exceptions = update_exc(TOKENIZER_EXCEPTIONS)
lemma_rules = LEMMA_RULES
lemma_index = LEMMA_INDEX
lemma_exc = LEMMA_EXC
stop_words = STOP_WORDS
tag_map = TAG_MAP
suffixes = TOKENIZER_SUFFIXES
class Persian(Language):

View File

@ -0,0 +1,32 @@
# coding: utf8
from __future__ import unicode_literals
from ._adjectives import ADJECTIVES
from ._adjectives_exc import ADJECTIVES_EXC
from ._nouns import NOUNS
from ._nouns_exc import NOUNS_EXC
from ._verbs import VERBS
from ._verbs_exc import VERBS_EXC
from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES, PUNCT_RULES
LEMMA_INDEX = {
'adj': ADJECTIVES,
'noun': NOUNS,
'verb': VERBS
}
LEMMA_RULES = {
'adj': ADJECTIVE_RULES,
'noun': NOUN_RULES,
'verb': VERB_RULES,
'punct': PUNCT_RULES
}
LEMMA_EXC = {
'adj': ADJECTIVES_EXC,
'noun': NOUNS_EXC,
'verb': VERBS_EXC
}

File diff suppressed because it is too large

View File

@ -0,0 +1,53 @@
# coding: utf8
from __future__ import unicode_literals
# Adjectives extracted from Mojgan Seraji's Persian Universal Dependencies corpus.
# The adjectives below are exceptions to the current adjective lemmatization rules.
ADJECTIVES_EXC = {
"بهترین": ("بهتر",),
"بهتر": ("بهتر",),
"سنگین": ("سنگین",),
"بیشترین": ("بیشتر",),
"برتر": ("برتر",),
"بدبین": ("بدبین",),
"متین": ("متین",),
"شیرین": ("شیرین",),
"معین": ("معین",),
"دلنشین": ("دلنشین",),
"امین": ("امین",),
"متدین": ("متدین",),
"تیزبین": ("تیزبین",),
"بنیادین": ("بنیادین",),
"دروغین": ("دروغین",),
"واپسین": ("واپسین",),
"خونین": ("خونین",),
"مزین": ("مزین",),
"خوشبین": ("خوشبین",),
"عطرآگین": ("عطرآگین",),
"زرین": ("زرین",),
"فرجامین": ("فرجامین",),
"فقیرنشین": ("فقیرنشین",),
"مستتر": ("مستتر",),
"چوبین": ("چوبین",),
"آغازین": ("آغازین",),
"سخن‌چین": ("سخن‌چین",),
"مرمرین": ("مرمرین",),
"زنده‌تر": ("زنده‌تر",),
"صفر‌کیلومتر": ("صفر‌کیلومتر",),
"غمگین": ("غمگین",),
"نازنین": ("نازنین",),
"مثبت": ("مثبت",),
"شرمگین": ("شرمگین",),
"قرین": ("قرین",),
"سوتر": ("سوتر",),
"بی‌زین": ("بی‌زین",),
"سیمین": ("سیمین",),
"رنگین": ("رنگین",),
"روشن‌بین": ("روشن‌بین",),
"اندوهگین": ("اندوهگین",),
"فی‌مابین": ("فی‌مابین",),
"لاجوردین": ("لاجوردین",),
"برنجین": ("برنجین",),
"مشکل‌آفرین": ("مشکل‌آفرین",),
"خبرچین": ("خبرچین",),
}

View File

@ -0,0 +1,64 @@
# coding: utf8
from __future__ import unicode_literals
ADJECTIVE_RULES = [
["ین", ""],
["\u200cترین", ""],
["ترین", ""],
["\u200cتر", ""],
["تر", ""],
["\u200cای", ""],
# ["ایی", "ا"],
# ["ویی", "و"],
# ["ی", ""],
# ["مند", ""],
# ["گین", ""],
# ["مین", ""],
# ["ناک", ""],
# ["سار", ""],
# ["\u200cوار", ""],
# ["وار", ""]
]
NOUN_RULES = [
['ایان', 'ا'],
['ویان', 'و'],
['ایانی', 'ا'],
['ویانی', 'و'],
['گان', 'ه'],
['گانی', 'ه'],
['گان', ''],
['گانی', ''],
['ان', ''],
['انی', ''],
['ات', ''],
['ات', 'ه'],
['ات', 'ت'],
['اتی', ''],
['اتی', 'ه'],
['اتی', 'ت'],
# ['ین', ''],
# ['ینی', ''],
# ['ون', ''],
# ['ونی', ''],
['\u200cها', ''],
['ها', ''],
['\u200cهای', ''],
['های', ''],
['\u200cهایی', ''],
['هایی', ''],
]
VERB_RULES = [
]
PUNCT_RULES = [
["", "\""],
["", "\""],
["\u2018", "'"],
["\u2019", "'"]
]

File diff suppressed because it is too large

View File

@ -0,0 +1,781 @@
# coding: utf8
from __future__ import unicode_literals
NOUNS_EXC = {
"آثار": ("اثر",),
"آرا": ("رأی",),
"آراء": ("رأی",),
"آفات": ("آفت",),
"اباطیل": ("باطل",),
"ائمه": ("امام",),
"ابرار": ("بر",),
"ابعاد": ("بعد",),
"ابنیه": ("بنا",),
"ابواب": ("باب",),
"ابیات": ("بیت",),
"اجداد": ("جد",),
"اجساد": ("جسد",),
"اجناس": ("جنس",),
"اثمار": ("ثمر",),
"اجرام": ("جرم",),
"اجسام": ("جسم",),
"اجنه": ("جن",),
"احادیث": ("حدیث",),
"احجام": ("حجم",),
"احرار": ("حر",),
"احزاب": ("حزب",),
"احکام": ("حکم",),
"اخبار": ("خبر",),
"اخیار": ("خیر",),
"ادبا": ("ادیب",),
"ادعیه": ("دعا",),
"ادله": ("دلیل",),
"ادوار": ("دوره",),
"ادیان": ("دین",),
"اذهان": ("ذهن",),
"اذکار": ("ذکر",),
"اراضی": ("ارض",),
"ارزاق": ("رزق",),
"ارقام": ("رقم",),
"ارواح": ("روح",),
"ارکان": ("رکن",),
"ازمنه": ("زمان",),
"اساتید": ("استاد",),
"اساطیر": ("اسطوره",),
"اسامی": ("اسم",),
"اسرار": ("سر",),
"اسما": ("اسم",),
"اسناد": ("سند",),
"اسیله": ("سوال",),
"اشجار": ("شجره",),
"اشخاص": ("شخص",),
"اشرار": ("شر",),
"اشربه": ("شراب",),
"اشعار": ("شعر",),
"اشقیا": ("شقی",),
"اشیا": ("شی",),
"اشباح": ("شبح",),
"اصدقا": ("صدیق",),
"اصناف": ("صنف",),
"اصنام": ("صنم",),
"اصوات": ("صوت",),
"اصول": ("اصل",),
"اضداد": ("ضد",),
"اطبا": ("طبیب",),
"اطعمه": ("طعام",),
"اطفال": ("طفل",),
"الطاف": ("لطف",),
"اعدا": ("عدو",),
"اعزا": ("عزیز",),
"اعضا": ("عضو",),
"اعماق": ("عمق",),
"الفاظ": ("لفظ",),
"اعناب": ("عنب",),
"اغذیه": ("غذا",),
"اغراض": ("غرض",),
"افراد": ("فرد",),
"افعال": ("فعل",),
"افلاک": ("فلک",),
"افکار": ("فکر",),
"اقالیم": ("اقلیم",),
"اقربا": ("قریب",),
"اقسام": ("قسم",),
"اقشار": ("قشر",),
"اقفال": ("قفل",),
"اقلام": ("قلم",),
"اقوال": ("قول",),
"اقوام": ("قوم",),
"البسه": ("لباس",),
"الحام": ("لحم",),
"الحکام": ("الحاکم",),
"القاب": ("لقب",),
"الواح": ("لوح",),
"الکبار": ("الکبیر",),
"اماکن": ("مکان",),
"امثال": ("مثل",),
"امراض": ("مرض",),
"امم": ("امت",),
"امواج": ("موج",),
"اموال": ("مال",),
"امور": ("امر",),
"امیال": ("میل",),
"انبیا": ("نبی",),
"انجم": ("نجم",),
"انظار": ("نظر",),
"انفس": ("نفس",),
"انهار": ("نهر",),
"انواع": ("نوع",),
"اهالی": ("اهل",),
"اهداف": ("هدف",),
"اواخر": ("آخر",),
"اواسط": ("وسط",),
"اوایل": ("اول",),
"اوراد": ("ورد",),
"اوراق": ("ورق",),
"اوزان": ("وزن",),
"اوصاف": ("وصف",),
"اوضاع": ("وضع",),
"اوقات": ("وقت",),
"اولاد": ("ولد",),
"اولیا": ("ولی",),
"اولیاء": ("ولی",),
"اوهام": ("وهم",),
"اکاذیب": ("اکذوبه",),
"اکفان": ("کفن",),
"ایالات": ("ایالت",),
"ایام": ("یوم",),
"ایتام": ("یتیم",),
"بشایر": ("بشارت",),
"بصایر": ("بصیرت",),
"بطون": ("بطن",),
"بنادر": ("بندر",),
"بیوت": ("بیت",),
"تجار": ("تاجر",),
"تجارب": ("تجربه",),
"تدابیر": ("تدبیر",),
"تعاریف": ("تعریف",),
"تلامیذ": ("تلمیذ",),
"تهم": ("تهمت",),
"توابیت": ("تابوت",),
"تواریخ": ("تاریخ",),
"جبال": ("جبل",),
"جداول": ("جدول",),
"جدود": ("جد",),
"جراثیم": ("جرثوم",),
"جرایم": ("جرم",),
"جرائم": ("جرم",),
"جزئیات": ("جزء",),
"جزایر": ("جزیره",),
"جزییات": ("جزء",),
"جنایات": ("جنایت",),
"جهات": ("جهت",),
"جوامع": ("جامعه",),
"حدود": ("حد",),
"حروف": ("حرف",),
"حقایق": ("حقیقت",),
"حقوق": ("حق",),
"حوادث": ("حادثه",),
"حواشی": ("حاشیه",),
"حوایج": ("حاجت",),
"حوائج": ("حاجت",),
"حکما": ("حکیم",),
"خدمات": ("خدمت",),
"خدمه": ("خادم",),
"خدم": ("خادم",),
"خزاین": ("خزینه",),
"خصایص": ("خصیصه",),
"خطوط": ("خط",),
"دراهم": ("درهم",),
"دروس": ("درس",),
"دفاتر": ("دفتر",),
"دلایل": ("دلیل",),
"دلائل": ("دلیل",),
"ذخایر": ("ذخیره",),
"ذنوب": ("ذنب",),
"ربوع": ("ربع",),
"رجال": ("رجل",),
"رسایل": ("رسال",),
"رسوم": ("رسم",),
"روابط": ("رابطه",),
"روسا": ("رئیس",),
"رئوس": ("راس",),
"ریوس": ("راس",),
"زوار": ("زائر",),
"ساعات": ("ساعت",),
"سبل": ("سبیل",),
"سطوح": ("سطح",),
"سطور": ("سطر",),
"سعدا": ("سعید",),
"سفن": ("سفینه",),
"سقاط": ("ساقی",),
"سلاطین": ("سلطان",),
"سلایق": ("سلیقه",),
"سموم": ("سم",),
"سنن": ("سنت",),
"سنین": ("سن",),
"سهام": ("سهم",),
"سوابق": ("سابقه",),
"سواحل": ("ساحل",),
"سوانح": ("سانحه",),
"شباب": ("شاب",),
"شرایط": ("شرط",),
"شروط": ("شرط",),
"شرکا": ("شریک",),
"شعب": ("شعبه",),
"شعوب": ("شعب",),
"شموس": ("شمس",),
"شهدا": ("شهید",),
"شهور": ("شهر",),
"شواهد": ("شاهد",),
"شوون": ("شان",),
"شکات": ("شاکی",),
"شیاطین": ("شیطان",),
"صبیان": ("صبی",),
"صحف": ("صحیفه",),
"صغار": ("صغیر",),
"صفوف": ("صف",),
"صنادیق": ("صندوق",),
"ضعفا": ("ضعیف",),
"ضمایر": ("ضمیر",),
"ضوابط": ("ضابطه",),
"طرق": ("طریق",),
"طلاب": ("طلبه",),
"طواغیت": ("طاغوت",),
"طیور": ("طیر",),
"عادات": ("عادت",),
"عباد": ("عبد",),
"عبارات": ("عبارت",),
"عجایب": ("عجیب",),
"عزایم": ("عزیمت",),
"عشایر": ("عشیره",),
"عطور": ("عطر",),
"عظما": ("عظیم",),
"عقاید": ("عقیده",),
"عقائد": ("عقیده",),
"علائم": ("علامت",),
"علایم": ("علامت",),
"علما": ("عالم",),
"علوم": ("علم",),
"عمال": ("عمله",),
"عناصر": ("عنصر",),
"عناوین": ("عنوان",),
"عواطف": ("عاطفه",),
"عواقب": ("عاقبت",),
"عوالم": ("عالم",),
"عوامل": ("عامل",),
"عیوب": ("عیب",),
"عیون": ("عین",),
"غدد": ("غده",),
"غرف": ("غرفه",),
"غیوب": ("غیب",),
"غیوم": ("غیم",),
"فرایض": ("فریضه",),
"فضایل": ("فضیلت",),
"فضلا": ("فاضل",),
"فواصل": ("فاصله",),
"فواید": ("فایده",),
"قبایل": ("قبیله",),
"قرون": ("قرن",),
"قصص": ("قصه",),
"قضات": ("قاضی",),
"قضایا": ("قضیه",),
"قلل": ("قله",),
"قلوب": ("قلب",),
"قواعد": ("قاعده",),
"قوانین": ("قانون",),
"قیود": ("قید",),
"لطایف": ("لطیفه",),
"لیالی": ("لیل",),
"مباحث": ("مبحث",),
"مبالغ": ("مبلغ",),
"متون": ("متن",),
"مجالس": ("مجلس",),
"محاصیل": ("محصول",),
"محافل": ("محفل",),
"محاکم": ("محکمه",),
"مخارج": ("خرج",),
"مدارس": ("مدرسه",),
"مدارک": ("مدرک",),
"مداین": ("مدینه",),
"مدن": ("مدینه",),
"مراتب": ("مرتبه",),
"مراتع": ("مرتع",),
"مراجع": ("مرجع",),
"مراحل": ("مرحله",),
"مسائل": ("مسئله",),
"مساجد": ("مسجد",),
"مساعی": ("سعی",),
"مسالک": ("مسلک",),
"مساکین": ("مسکین",),
"مسایل": ("مسئله",),
"مشاعر": ("مشعر",),
"مشاغل": ("شغل",),
"مشایخ": ("شیخ",),
"مصادر": ("مصدر",),
"مصادق": ("مصداق",),
"مصادیق": ("مصداق",),
"مصاعب": ("مصعب",),
"مضار": ("ضرر",),
"مضامین": ("مضمون",),
"مطالب": ("مطلب",),
"مظالم": ("مظلمه",),
"مظاهر": ("مظهر",),
"اهرام": ("هرم",),
"معابد": ("معبد",),
"معابر": ("معبر",),
"معاجم": ("معجم",),
"معادن": ("معدن",),
"معاذیر": ("عذر",),
"معارج": ("معراج",),
"معاصی": ("معصیت",),
"معالم": ("معلم",),
"معایب": ("عیب",),
"مفاسد": ("مفسده",),
"مفاصل": ("مفصل",),
"مفاهیم": ("مفهوم",),
"مقابر": ("مقبره",),
"مقاتل": ("مقتل",),
"مقادیر": ("مقدار",),
"مقاصد": ("مقصد",),
"مقاطع": ("مقطع",),
"ملابس": ("ملبس",),
"ملوک": ("ملک",),
"ممالک": ("مملکت",),
"منابع": ("منبع",),
"منازل": ("منزل",),
"مناسبات": ("مناسبت",),
"مناسک": ("منسک",),
"مناطق": ("منطقه",),
"مناظر": ("منظره",),
"منافع": ("منفعت",),
"موارد": ("مورد",),
"مواضع": ("موضع",),
"مواضیع": ("موضوع",),
"مواطن": ("موطن",),
"مواقع": ("موقع",),
"موانع": ("مانع",),
"مکاتب": ("مکتب",),
"مکاتیب": ("مکتوب",),
"مکارم": ("مکرمه",),
"میادین": ("میدان",),
"نتایج": ("نتیجه",),
"نعم": ("نعمت",),
"نفوس": ("نفس",),
"نقاط": ("نقطه",),
"نواحی": ("ناحیه",),
"نوافذ": ("نافذه",),
"نواقص": ("نقص",),
"نوامیس": ("ناموس",),
"نکات": ("نکته",),
"نیات": ("نیت",),
"هدایا": ("هدیه",),
"واقعیات": ("واقعیت",),
"وجوه": ("وجه",),
"وحوش": ("وحش",),
"وزرا": ("وزیر",),
"وسایل": ("وسیله",),
"وصایا": ("وصیت",),
"وظایف": ("وظیفه",),
"وعاظ": ("واعظ",),
"وقایع": ("واقعه",),
"کتب": ("کتاب",),
"کسبه": ("کاسب",),
"کفار": ("کافر",),
"کواکب": ("کوکب",),
"تصاویر": ("تصویر",),
"صنوف": ("صنف",),
"اجزا": ("جزء",),
"اجزاء": ("جزء",),
"ذخائر": ("ذخیره",),
"خسارات": ("خسارت",),
"عشاق": ("عاشق",),
"تصانیف": ("تصنیف",),
"دﻻیل": ("دلیل",),
"قوا": ("قوه",),
"ملل": ("ملت",),
"جوایز": ("جایزه",),
"جوائز": ("جایزه",),
"ابعاض": ("بعض",),
"اتباع": ("تبعه",),
"اجلاس": ("جلسه",),
"احشام": ("حشم",),
"اخلاف": ("خلف",),
"ارامنه": ("ارمنی",),
"ازواج": ("زوج",),
"اسباط": ("سبط",),
"اعداد": ("عدد",),
"اعصار": ("عصر",),
"اعقاب": ("عقبه",),
"اعیاد": ("عید",),
"اعیان": ("عین",),
"اغیار": ("غیر",),
"اقارب": ("اقرب",),
"اقران": ("قرن",),
"اقساط": ("قسط",),
"امنای": ("امین",),
"امنا": ("امین",),
"اموات": ("میت",),
"اناجیل": ("انجیل",),
"انحا": ("نحو",),
"انساب": ("نسب",),
"انوار": ("نور",),
"اوامر": ("امر",),
"اوائل": ("اول",),
"اوصیا": ("وصی",),
"آحاد": ("احد",),
"براهین": ("برهان",),
"تعابیر": ("تعبیر",),
"تعالیم": ("تعلیم",),
"تفاسیر": ("تفسیر",),
"تکالیف": ("تکلیف",),
"تماثیل": ("تمثال",),
"جنود": ("جند",),
"جوانب": ("جانب",),
"حاجات": ("حاجت",),
"حرکات": ("حرکت",),
"حضرات": ("حضرت",),
"حکایات": ("حکایت",),
"حوالی": ("حول",),
"خصایل": ("خصلت",),
"خلایق": ("خلق",),
"خلفا": ("خلیفه",),
"دعاوی": ("دعوا",),
"دیون": ("دین",),
"ذراع": ("ذرع",),
"رعایا": ("رعیت",),
"روایات": ("روایت",),
"شعرا": ("شاعر",),
"شکایات": ("شکایت",),
"شهوات": ("شهوت",),
"شیوخ": ("شیخ",),
"شئون": ("شأن",),
"طبایع": ("طبع",),
"ظروف": ("ظرف",),
"ظواهر": ("ظاهر",),
"عبادات": ("عبادت",),
"عرایض": ("عریضه",),
"عرفا": ("عارف",),
"عروق": ("عرق",),
"عساکر": ("عسکر",),
"علماء": ("عالم",),
"فتاوا": ("فتوا",),
"فراعنه": ("فرعون",),
"فرامین": ("فرمان",),
"فروض": ("فرض",),
"فروع": ("فرع",),
"فصول": ("فصل",),
"فقها": ("فقیه",),
"قبور": ("قبر",),
"قبوض": ("قبض",),
"قدوم": ("قدم",),
"قرائات": ("قرائت",),
"قرائن": ("قرینه",),
"لغات": ("لغت",),
"مجامع": ("مجمع",),
"مخازن": ("مخزن",),
"مدارج": ("درجه",),
"مذاهب": ("مذهب",),
"مراکز": ("مرکز",),
"مصارف": ("مصرف",),
"مطامع": ("طمع",),
"معانی": ("معنی",),
"مناصب": ("منصب",),
"منافذ": ("منفذ",),
"مواریث": ("میراث",),
"موازین": ("میزان",),
"موالی": ("مولی",),
"مواهب": ("موهبت",),
"نسوان": ("نسا",),
"نصوص": ("نص",),
"نظایر": ("نظیر",),
"نقایص": ("نقص",),
"نقوش": ("نقش",),
"ولایات": ("ولایت",),
"هیئات": ("هیأت",),
"جماهیر": ("جمهوری",),
"خصائص": ("خصیصه",),
"دقایق": ("دقیقه",),
"رذایل": ("رذیلت",),
"طوایف": ("طایفه",),
"علامات": ("علامت",),
"علایق": ("علاقه",),
"علل": ("علت",),
"غرایز": ("غریزه",),
"غرائز": ("غریزه",),
"غنایم": ("غنیمت",),
"فرائض": ("فریضه",),
"فضائل": ("فضیلت",),
"فقرا": ("فقیر",),
"فلاسفه": ("فیلسوف",),
"فواحش": ("فاحشه",),
"قصائد": ("قصیده",),
"قصاید": ("قصیده",),
"قوائد": ("قائده",),
"مزارع": ("مزرعه",),
"مصائب": ("مصیبت",),
"معارف": ("معرفت",),
"نصایح": ("نصیحت",),
"وثایق": ("وثیقه",),
"وظائف": ("وظیفه",),
"توابین": ("تواب",),
"رفقا": ("رفیق",),
"رقبا": ("رقیب",),
"زحمات": ("زحمت",),
"زعما": ("زعیم",),
"زوایا": ("زاویه",),
"سماوات": ("سما",),
"علوفه": ("علف",),
"غایات": ("غایت",),
"فنون": ("فن",),
"لذات": ("لذت",),
"نعمات": ("نعمت",),
"امراء": ("امیر",),
"امرا": ("امیر",),
"دهاقین": ("دهقان",),
"سنوات": ("سنه",),
"عمارات": ("عمارت",),
"فتوح": ("فتح",),
"لذائذ": ("لذیذ",),
"لذایذ": ("لذیذ", "لذت",),
"تکایا": ("تکیه",),
"صفات": ("صفت",),
"خصوصیات": ("خصوصیت",),
"کیفیات": ("کیفیت",),
"حملات": ("حمله",),
"شایعات": ("شایعه",),
"صدمات": ("صدمه",),
"غلات": ("غله",),
"کلمات": ("کلمه",),
"مبارزات": ("مبارزه",),
"مراجعات": ("مراجعه",),
"مطالبات": ("مطالبه",),
"مکاتبات": ("مکاتبه",),
"نشریات": ("نشریه",),
"بحور": ("بحر",),
"تحقیقات": ("تحقیق",),
"مکالمات": ("مکالمه",),
"ریزمکالمات": ("ریزمکالمه",),
"تجربیات": ("تجربه",),
"جملات": ("جمله",),
"حالات": ("حالت",),
"حجاج": ("حاجی",),
"حسنات": ("حسنه",),
"حشرات": ("حشره",),
"خاطرات": ("خاطره",),
"درجات": ("درجه",),
"دفعات": ("دفعه",),
"سیارات": ("سیاره",),
"شبهات": ("شبهه",),
"ضایعات": ("ضایعه",),
"ضربات": ("ضربه",),
"طبقات": ("طبقه",),
"فرضیات": ("فرضیه",),
"قطرات": ("قطره",),
"قطعات": ("قطعه",),
"قلاع": ("قلعه",),
"کشیشان": ("کشیش",),
"مادیات": ("مادی",),
"مباحثات": ("مباحثه",),
"مجاهدات": ("مجاهدت",),
"محلات": ("محله",),
"مداخلات": ("مداخله",),
"مشقات": ("مشقت",),
"معادلات": ("معادله",),
"معوقات": ("معوقه",),
"منویات": ("منویه",),
"موقوفات": ("موقوفه",),
"موسسات": ("موسسه",),
"حلقات": ("حلقه",),
"ایات": ("ایه",),
"اصلح": ("صالح",),
"اظهر": ("ظاهر",),
"آیات": ("آیه",),
"برکات": ("برکت",),
"جزوات": ("جزوه",),
"خطابات": ("خطابه",),
"دوایر": ("دایره",),
"روحیات": ("روحیه",),
"متهمان": ("متهم",),
"مجاری": ("مجرا",),
"مشترکات": ("مشترک",),
"ورثه": ("وارث",),
"وکلا": ("وکیل",),
"نقبا": ("نقیب",),
"سفرا": ("سفیر",),
"مآخذ": ("مأخذ",),
"احوال": ("حال",),
"آلام": ("الم",),
"مزایا": ("مزیت",),
"عقلا": ("عاقل",),
"مشاهد": ("مشهد",),
"ظلمات": ("ظلمت",),
"خفایا": ("خفیه",),
"مشاهدات": ("مشاهده",),
"امامان": ("امام",),
"سگان": ("سگ",),
"نظریات": ("نظریه",),
"آفاق": ("افق",),
"آمال": ("امل",),
"دکاکین": ("دکان",),
"قصبات": ("قصبه",),
"مضرات": ("مضرت",),
"قبائل": ("قبیله",),
"مجانین": ("مجنون",),
"سيئات": ("سیئه",),
"صدقات": ("صدقه",),
"کثافات": ("کثافت",),
"کسورات": ("کسر",),
"معالجات": ("معالجه",),
"مقابلات": ("مقابله",),
"مناظرات": ("مناظره",),
"ناملايمات": ("ناملایمت",),
"وجوهات": ("وجه",),
"مصادرات": ("مصادره",),
"ملمعات": ("ملمع",),
"اولویات": ("اولویت",),
"جمرات": ("جمره",),
"زیارات": ("زیارت",),
"عقبات": ("عقبه",),
"کرامات": ("کرامت",),
"مراقبات": ("مراقبه",),
"نجاسات": ("نجاست",),
"هجویات": ("هجو",),
"تبدلات": ("تبدل",),
"روات": ("راوی",),
"فیوضات": ("فیض",),
"کفارات": ("کفاره",),
"نذورات": ("نذر",),
"حفریات": ("حفر",),
"عنایات": ("عنایت",),
"جراحات": ("جراحت",),
"ثمرات": ("ثمره",),
"حکام": ("حاکم",),
"مرسولات": ("مرسوله",),
"درایات": ("درایت",),
"سیئات": ("سیئه",),
"عدوات": ("عداوت",),
"عشرات": ("عشره",),
"عقوبات": ("عقوبه",),
"عقودات": ("عقود",),
"کثرات": ("کثرت",),
"مواجهات": ("مواجهه",),
"مواصلات": ("مواصله",),
"اجوبه": ("جواب",),
"اضلاع": ("ضلع",),
"السنه": ("لسان",),
"اشتات": ("شت",),
"دعوات": ("دعوت",),
"صعوبات": ("صعوبت",),
"عفونات": ("عفونت",),
"علوفات": ("علوفه",),
"غرامات": ("غرامت",),
"فارقات": ("فارقت",),
"لزوجات": ("لزوجت",),
"محللات": ("محلله",),
"مسافات": ("مسافت",),
"مسافحات": ("مسافحه",),
"مسامرات": ("مسامره",),
"مستلذات": ("مستلذ",),
"مسرات": ("مسرت",),
"مشافهات": ("مشافهه",),
"مشاهرات": ("مشاهره",),
"معروشات": ("معروشه",),
"مجادلات": ("مجادله",),
"ابغاض": ("بغض",),
"اجداث": ("جدث",),
"اجواز": ("جوز",),
"اجواد": ("جواد",),
"ازاهیر": ("ازهار",),
"عوائد": ("عائده",),
"احافیر": ("احفار",),
"احزان": ("حزن",),
"آنام": ("انام",),
"احباب": ("حبیب",),
"نوابغ": ("نابغه",),
"بینات": ("بینه",),
"حوالات": ("حواله",),
"حوالجات": ("حواله",),
"دستجات": ("دسته",),
"شمومات": ("شموم",),
"طاقات": ("طاقه",),
"علاقات": ("علاقه",),
"مراسلات": ("مراسله",),
"موجهات": ("موجه",),
"اقویا": ("قوی",),
"اغنیا": ("غنی",),
"بلایا": ("بلا",),
"خطایا": ("خطا",),
"ثنایا": ("ثنا",),
"لوایح": ("لایحه",),
"غزلیات": ("غزل",),
"اشارات": ("اشاره",),
"رکعات": ("رکعت",),
"امثالهم": ("مثل",),
"تشنجات": ("تشنج",),
"امانات": ("امانت",),
"بریات": ("بریت",),
"توست": ("تو",),
"حبست": ("حبس",),
"حیثیات": ("حیثیت",),
"شامات": ("شامه",),
"قبالات": ("قباله",),
"قرابات": ("قرابت",),
"مطلقات": ("مطلقه",),
"نزلات": ("نزله",),
"بکمان": ("بکیم",),
"روشان": ("روشن",),
"مسانید": ("مسند",),
"ناحیت": ("ناحیه",),
"رسوله": ("رسول",),
"دانشجویان": ("دانشجو",),
"روحانیون": ("روحانی",),
"قرون": ("قرن",),
"انقلابیون": ("انقلابی",),
"قوانین": ("قانون",),
"مجاهدین": ("مجاهد",),
"محققین": ("محقق",),
"متهمین": ("متهم",),
"مهندسین": ("مهندس",),
"مؤمنین": ("مؤمن",),
"مسئولین": ("مسئول",),
"مشرکین": ("مشرک",),
"مخاطبین": ("مخاطب",),
"مأمورین": ("مأمور",),
"سلاطین": ("سلطان",),
"مضامین": ("مضمون",),
"منتخبین": ("منتخب",),
"متحدین": ("متحد",),
"متخصصین": ("متخصص",),
"مسوولین": ("مسوول",),
"شیاطین": ("شیطان",),
"مباشرین": ("مباشر",),
"منتقدین": ("منتقد",),
"موسسین": ("موسس",),
"مسؤلین": ("مسؤل",),
"متحجرین": ("متحجر",),
"مهاجرین": ("مهاجر",),
"مترجمین": ("مترجم",),
"مدعوین": ("مدعو",),
"مشترکین": ("مشترک",),
"معصومین": ("معصوم",),
"مسابقات": ("مسابقه",),
"معانی": ("معنی",),
"مطالعات": ("مطالعه",),
"نکات": ("نکته",),
"خصوصیات": ("خصوصیت",),
"خدمات": ("خدمت",),
"نشریات": ("نشریه",),
"ساعات": ("ساعت",),
"بزرگان": ("بزرگ",),
"خسارات": ("خسارت",),
"شیعیان": ("شیعه",),
"واقعیات": ("واقعیت",),
"مذاکرات": ("مذاکره",),
"حشرات": ("حشره",),
"طبقات": ("طبقه",),
"شکایات": ("شکایت",),
"ابیات": ("بیت",),
"شایعات": ("شایعه",),
"ضربات": ("ضربه",),
"مقالات": ("مقاله",),
"اوقات": ("وقت",),
"عباراتی": ("عبارت",),
"سالیان": ("سال",),
"زحمات": ("زحمت",),
"عبارات": ("عبارت",),
"لغات": ("لغت",),
"نیات": ("نیت",),
"مطالبات": ("مطالبه",),
"مطالب": ("مطلب",),
"خلقیات": ("خلق",),
"نکات": ("نکته",),
"بزرگان": ("بزرگ",),
"ابیاتی": ("بیت",),
"محرمات": ("حرام",),
"اوزان": ("وزن",),
"اخلاقیات": ("اخلاق",),
"سبزیجات": ("سبزی",),
"اضافات": ("اضافه",),
"قضات": ("قاضی",),
}

View File

@ -0,0 +1,6 @@
# coding: utf8
from __future__ import unicode_literals
VERBS = set("""
""".split())

View File

@ -0,0 +1,647 @@
# coding: utf8
from __future__ import unicode_literals
verb_roots = """
#هست
آخت#آهنج
آراست#آرا
آراماند#آرامان
آرامید#آرام
آرمید#آرام
آزرد#آزار
آزمود#آزما
آسود#آسا
آشامید#آشام
آشفت#آشوب
آشوبید#آشوب
آغازید#آغاز
آغشت#آمیز
آفرید#آفرین
آلود#آلا
آمد
آمرزید#آمرز
آموخت#آموز
آموزاند#آموزان
آمیخت#آمیز
آورد#آر
آورد#آور
آویخت#آویز
آکند#آکن
آگاهانید#آگاهان
ارزید#ارز
افتاد#افت
افراخت#افراز
افراشت#افراز
افروخت#افروز
افروزید#افروز
افزود#افزا
افسرد#افسر
افشاند#افشان
افکند#افکن
افگند#افگن
انباشت#انبار
انجامید#انجام
انداخت#انداز
اندوخت#اندوز
اندود#اندا
اندیشید#اندیش
انگاشت#انگار
انگیخت#انگیز
انگیزاند#انگیزان
ایستاد#ایست
ایستاند#ایستان
باخت#باز
باراند#باران
بارگذاشت#بارگذار
بارید#بار
باز#بازخواه
بازآفرید#بازآفرین
بازآمد#بازآ
بازآموخت#بازآموز
بازآورد#بازآور
بازایستاد#بازایست
بازتابید#بازتاب
بازجست#بازجو
بازخواند#بازخوان
بازخوراند#بازخوران
بازداد#بازده
بازداشت#بازدار
بازرساند#بازرسان
بازرسانید#بازرسان
باززد#باززن
بازستاند#بازستان
بازشمارد#بازشمار
بازشمرد#بازشمار
بازشمرد#بازشمر
بازشناخت#بازشناس
بازشناساند#بازشناسان
بازفرستاد#بازفرست
بازماند#بازمان
بازنشست#بازنشین
بازنمایاند#بازنمایان
بازنهاد#بازنه
بازنگریست#بازنگر
بازپرسید#بازپرس
بازگذارد#بازگذار
بازگذاشت#بازگذار
بازگرداند#بازگردان
بازگردانید#بازگردان
بازگردید#بازگرد
بازگرفت#بازگیر
بازگشت#بازگرد
بازگشود#بازگشا
بازگفت#بازگو
بازیافت#بازیاب
بافت#باف
بالید#بال
باوراند#باوران
بایست#باید
بخشود#بخش
بخشود#بخشا
بخشید#بخش
بر#برخواه
برآشفت#برآشوب
برآمد#برآ
برآورد#برآور
برازید#براز
برافتاد#برافت
برافراخت#برافراز
برافراشت#برافراز
برافروخت#برافروز
برافشاند#برافشان
برافکند#برافکن
براند#بران
برانداخت#برانداز
برانگیخت#برانگیز
بربست#بربند
برتاباند#برتابان
برتابید#برتاب
برتافت#برتاب
برتنید#برتن
برجهید#برجه
برخاست#برخیز
برخورد#برخور
برد#بر
برداشت#بردار
بردمید#بردم
برزد#برزن
برشد#برشو
برشمارد#برشمار
برشمرد#برشمار
برشمرد#برشمر
برنشاند#برنشان
برنشانید#برنشان
برنشست#برنشین
برنهاد#برنه
برچید#برچین
برکرد#برکن
برکشید#برکش
برکند#برکن
برگذشت#برگذر
برگرداند#برگردان
برگردانید#برگردان
برگردید#برگرد
برگرفت#برگیر
برگزید#برگزین
برگشت#برگرد
برگشود#برگشا
برگمارد#برگمار
برگمارید#برگمار
برگماشت#برگمار
برید#بر
بست#بند
بلعید#بلع
بود#باش
بوسید#بوس
بویید#بو
بیخت#بیز
بیخت#بوز
تاباند#تابان
تابید#تاب
تاخت#تاز
تاراند#تاران
تازاند#تازان
تازید#تاز
تافت#تاب
ترادیسید#ترادیس
تراشاند#تراشان
تراشید#تراش
تراوید#تراو
ترساند#ترسان
ترسید#ترس
ترشاند#ترشان
ترشید#ترش
ترکاند#ترکان
ترکید#ترک
تفتید#تفت
تمرگید#تمرگ
تنید#تن
توانست#توان
توفید#توف
تپاند#تپان
تپید#تپ
تکاند#تکان
تکانید#تکان
جست#جه
جست#جو
جنباند#جنبان
جنبید#جنب
جنگید#جنگ
جهاند#جهان
جهید#جه
جوشاند#جوشان
جوشانید#جوشان
جوشید#جوش
جويد#جو
جوید#جو
خاراند#خاران
خارید#خار
خاست#خیز
خایید#خا
خراشاند#خراشان
خراشید#خراش
خرامید#خرام
خروشید#خروش
خرید#خر
خزید#خز
خسبید#خسب
خشکاند#خشکان
خشکید#خشک
خفت#خواب
خلید#خل
خماند#خمان
خمید#خم
خنداند#خندان
خندانید#خندان
خندید#خند
خواباند#خوابان
خوابانید#خوابان
خوابید#خواب
خواست#خواه
خواست#خیز
خواند#خوان
خوراند#خوران
خورد#خور
خیزاند#خیزان
خیساند#خیسان
داد#ده
داشت#دار
دانست#دان
در#درخواه
درآمد#درآ
درآمیخت#درآمیز
درآورد#درآور
درآویخت#درآویز
درافتاد#درافت
درافکند#درافکن
درانداخت#درانداز
درانید#دران
دربرد#دربر
دربرگرفت#دربرگیر
درخشاند#درخشان
درخشانید#درخشان
درخشید#درخش
درداد#درده
دررفت#دررو
درماند#درمان
درنمود#درنما
درنوردید#درنورد
درود#درو
دروید#درو
درکرد#درکن
درکشید#درکش
درگذشت#درگذر
درگرفت#درگیر
دریافت#دریاب
درید#در
دزدید#دزد
دمید#دم
دواند#دوان
دوخت#دوز
دوشید#دوش
دوید#دو
دید#بین
راند#ران
ربود#ربا
ربود#روب
رخشید#رخش
رساند#رسان
رسانید#رسان
رست#ره
رست#رو
رسید#رس
رشت#ریس
رفت#رو
رفت#روب
رقصاند#رقصان
رقصید#رقص
رماند#رمان
رمانید#رمان
رمید#رم
رنجاند#رنجان
رنجانید#رنجان
رنجید#رنج
رندید#رند
رهاند#رهان
رهانید#رهان
رهید#ره
روبید#روب
روفت#روب
رویاند#رویان
رویانید#رویان
رویید#رو
رویید#روی
ریخت#ریز
رید#رین
ریدن#رین
ریسید#ریس
زاد#زا
زارید#زار
زایاند#زایان
زایید#زا
زد#زن
زدود#زدا
زیست#زی
ساباند#سابان
سابید#ساب
ساخت#ساز
سایید#سا
ستاد#ستان
ستاند#ستان
سترد#ستر
ستود#ستا
ستیزید#ستیز
سراند#سران
سرایید#سرا
سرشت#سرش
سرود#سرا
سرکشید#سرکش
سرگرفت#سرگیر
سرید#سر
سزید#سز
سفت#سنب
سنجید#سنج
سوخت#سوز
سود#سا
سوزاند#سوزان
سپارد#سپار
سپرد#سپار
سپرد#سپر
سپوخت#سپوز
سگالید#سگال
شاشید#شاش
شایست#
شایست#شاید
شتاباند#شتابان
شتابید#شتاب
شتافت#شتاب
شد#شو
شست#شو
شست#شوی
شلید#شل
شمار#شمر
شمارد#شمار
شمرد#شمار
شمرد#شمر
شناخت#شناس
شناساند#شناسان
شنفت#شنو
شنید#شنو
شوتید#شوت
شوراند#شوران
شورید#شور
شکافت#شکاف
شکاند#شکان
شکاند#شکن
شکست#شکن
شکفت#شکف
طلبید#طلب
طپید#طپ
غراند#غران
غرید#غر
غلتاند#غلتان
غلتانید#غلتان
غلتید#غلت
غلطاند#غلطان
غلطانید#غلطان
غلطید#غلط
فرا#فراخواه
فراخواند#فراخوان
فراداشت#فرادار
فرارسید#فرارس
فرانمود#فرانما
فراگرفت#فراگیر
فرستاد#فرست
فرسود#فرسا
فرمود#فرما
فرهیخت#فرهیز
فرو#فروخواه
فروآمد#فروآ
فروآورد#فروآور
فروافتاد#فروافت
فروافکند#فروافکن
فروبرد#فروبر
فروبست#فروبند
فروخت#فروش
فروخفت#فروخواب
فروخورد#فروخور
فروداد#فروده
فرودوخت#فرودوز
فرورفت#فرورو
فروریخت#فروریز
فروشکست#فروشکن
فروفرستاد#فروفرست
فروماند#فرومان
فرونشاند#فرونشان
فرونشانید#فرونشان
فرونشست#فرونشین
فرونمود#فرونما
فرونهاد#فرونه
فروپاشاند#فروپاشان
فروپاشید#فروپاش
فروچکید#فروچک
فروکرد#فروکن
فروکشید#فروکش
فروکوبید#فروکوب
فروکوفت#فروکوب
فروگذارد#فروگذار
فروگذاشت#فروگذار
فروگرفت#فروگیر
فریفت#فریب
فشاند#فشان
فشرد#فشار
فشرد#فشر
فلسفید#فلسف
فهماند#فهمان
فهمید#فهم
قاپید#قاپ
قبولاند#قبول
قبولاند#قبولان
لاسید#لاس
لرزاند#لرزان
لرزید#لرز
لغزاند#لغزان
لغزید#لغز
لمباند#لمبان
لمید#لم
لنگید#لنگ
لولید#لول
لیسید#لیس
ماسید#ماس
مالاند#مالان
مالید#مال
ماند#مان
مانست#مان
مرد#میر
مویید#مو
مکید#مک
نازید#ناز
نالاند#نالان
نالید#نال
نامید#نام
نشاند#نشان
نشست#نشین
نمایاند#نما
نمایاند#نمایان
نمود#نما
نهاد#نه
نهفت#نهنب
نواخت#نواز
نوازید#نواز
نوردید#نورد
نوشاند#نوشان
نوشانید#نوشان
نوشت#نویس
نوشید#نوش
نکوهید#نکوه
نگاشت#نگار
نگرید#
نگریست#نگر
هراساند#هراسان
هراسانید#هراسان
هراسید#هراس
هشت#هل
وا#واخواه
واداشت#وادار
وارفت#وارو
وارهاند#وارهان
واماند#وامان
وانهاد#وانه
واکرد#واکن
واگذارد#واگذار
واگذاشت#واگذار
ور#ورخواه
ورآمد#ورآ
ورافتاد#ورافت
وررفت#وررو
ورزید#ورز
وزاند#وزان
وزید#وز
ویراست#ویرا
پاشاند#پاشان
پاشید#پاش
پالود#پالا
پایید#پا
پخت#پز
پذیراند#پذیران
پذیرفت#پذیر
پراند#پران
پراکند#پراکن
پرداخت#پرداز
پرستید#پرست
پرسید#پرس
پرهیخت#پرهیز
پرهیزید#پرهیز
پروراند#پروران
پرورد#پرور
پرید#پر
پسندید#پسند
پلاساند#پلاسان
پلاسید#پلاس
پلکید#پلک
پناهاند#پناهان
پناهید#پناه
پنداشت#پندار
پوساند#پوسان
پوسید#پوس
پوشاند#پوشان
پوشید#پوش
پویید#پو
پژمرد#پژمر
پژوهید#پژوه
پکید#پک
پیراست#پیرا
پیمود#پیما
پیوست#پیوند
پیچاند#پیچان
پیچانید#پیچان
پیچید#پیچ
چاپید#چاپ
چایید#چا
چراند#چران
چرانید#چران
چرباند#چربان
چربید#چرب
چرخاند#چرخان
چرخانید#چرخان
چرخید#چرخ
چروکید#چروک
چرید#چر
چزاند#چزان
چسباند#چسبان
چسبید#چسب
چسید#چس
چشاند#چشان
چشید#چش
چلاند#چلان
چلانید#چلان
چپاند#چپان
چپید#چپ
چکاند#چکان
چکید#چک
چید#چین
کاست#کاه
کاشت#کار
کاوید#کاو
کرد#کن
کشاند#کشان
کشانید#کشان
کشت#کار
کشت#کش
کشید#کش
کند#کن
کوباند#کوبان
کوبید#کوب
کوشید#کوش
کوفت#کوب
کوچانید#کوچان
کوچید#کوچ
گایید#گا
گداخت#گداز
گذارد#گذار
گذاشت#گذار
گذراند#گذران
گذشت#گذر
گرازید#گراز
گرانید#گران
گرایید#گرا
گرداند#گردان
گردانید#گردان
گردید#گرد
گرفت#گیر
گروید#گرو
گریاند#گریان
گریخت#گریز
گریزاند#گریزان
گریست#گر
گریست#گری
گزارد#گزار
گزاشت#گزار
گزید#گزین
گسارد#گسار
گستراند#گستران
گسترانید#گستران
گسترد#گستر
گسست#گسل
گسلاند#گسل
گسیخت#گسل
گشاد#گشا
گشت#گرد
گشود#گشا
گفت#گو
گمارد#گمار
گماشت#گمار
گنجاند#گنجان
گنجانید#گنجان
گنجید#گنج
گنداند#گندان
گندید#گند
گوارید#گوار
گوزید#گوز
گیراند#گیران
یازید#یاز
یافت#یاب
یونید#یون
""".strip().split()
# The code below is a modified version of the Hazm package's verb conjugator,
# with some extra verbs. (Is anything in Hazm missing here? A comparison is
# still needed.)
VERBS_EXC = {}

with_nots = lambda items: items + ['ن' + item for item in items]
simple_ends = ['م', 'ی', '', 'یم', 'ید', 'ند']
narrative_ends = ['ه‌ام', 'ه‌ای', 'ه', 'ه‌ایم', 'ه‌اید', 'ه‌اند']
present_ends = ['م', 'ی', 'د', 'یم', 'ید', 'ند']

# special case of '#هست':
VERBS_EXC.update({conj: 'هست' for conj in ['هست' + end for end in simple_ends]})
VERBS_EXC.update({conj: 'هست' for conj in ['نیست' + end for end in simple_ends]})

for verb_root in verb_roots:
    conjugations = []
    if '#' not in verb_root:
        continue
    past, present = verb_root.split('#')

    if past:
        past_simples = [past + end for end in simple_ends]
        past_imperfects = ['می‌' + item for item in past_simples]
        past_narratives = [past + end for end in narrative_ends]
        conjugations = with_nots(past_simples + past_imperfects + past_narratives)

    if present:
        imperatives = ['ب' + present, 'ن' + present]
        if present.endswith('ا') or present in ('آ', 'گو'):
            present = present + 'ی'
        present_simples = [present + end for end in present_ends]
        present_imperfects = ['می‌' + present + end for end in present_ends]
        present_subjunctives = ['ب' + present + end for end in present_ends]
        conjugations += with_nots(present_simples + present_imperfects) + \
            present_subjunctives + imperatives

    if past.startswith('آ'):
        conjugations = set(map(lambda item: item.replace('بآ', 'بیا').replace('نآ', 'نیا'),
                               conjugations))

    VERBS_EXC.update({conj: (past,) if past else present for conj in conjugations})
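
To make the expansion concrete: a root entry like 'خرید#خر' from the list above (past stem, present stem of "to buy") yields mappings from full conjugated forms to the past-stem lemma. A sketch of what VERBS_EXC then contains:

# past simple:        VERBS_EXC['خریدم']    == ('خرید',)
# negated past:       VERBS_EXC['نخریدم']   == ('خرید',)
# past imperfect:     VERBS_EXC['می‌خریدم'] == ('خرید',)
# past narrative:     VERBS_EXC['خریده‌ام'] == ('خرید',)
# present imperfect:  VERBS_EXC['می‌خرم']   == ('خرید',)
# subjunctive:        VERBS_EXC['بخرم']     == ('خرید',)
# imperative:         VERBS_EXC['بخر']      == ('خرید',)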

View File

@ -0,0 +1,92 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
MIM = 'م'
ZWNJ_O_MIM = '‌ام'
YE_NUN = 'ین'
_num_words = set("""
صفر
یک
دو
سه
چهار
پنج
شش
شیش
هفت
هشت
نه
ده
یازده
دوازده
سیزده
چهارده
پانزده
پونزده
شانزده
شونزده
هفده
هجده
هیجده
نوزده
بیست
سی
چهل
پنجاه
شصت
هفتاد
هشتاد
نود
صد
یکصد
یکصد
دویست
سیصد
چهارصد
پانصد
پونصد
ششصد
شیشصد
هفتصد
هفصد
هشتصد
نهصد
هزار
میلیون
میلیارد
بیلیون
بیلیارد
تریلیون
تریلیارد
کوادریلیون
کادریلیارد
کوینتیلیون
""".split())
_ordinal_words = set("""
اول
سوم
سیام""".split())
_ordinal_words.update({num + MIM for num in _num_words})
_ordinal_words.update({num + ZWNJ_O_MIM for num in _num_words})
_ordinal_words.update({num + YE_NUN for num in _ordinal_words})
def like_num(text):
    """
    Check if text resembles a number.
    """
    text = text.replace(',', '').replace('.', '').\
        replace('،', '').replace('٫', '').replace('/', '')
    if text.isdigit():
        return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


LEX_ATTRS = {
    LIKE_NUM: like_num
}
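
The MIM / ZWNJ_O_MIM / YE_NUN constants above derive ordinal forms from the cardinal list, so the function accepts those too. A small sanity sketch (assumes the module imports as spacy.lang.fa.lex_attrs, and Python 3's Unicode-aware str.isdigit for the Persian digits):

from spacy.lang.fa.lex_attrs import like_num

assert like_num('۱۲')      # Persian digits
assert like_num('پنج')     # five
assert like_num('پنجم')    # fifth: پنج + MIM
assert like_num('پنجمین')  # fifth (attributive): پنجم + YE_NUN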

View File

@ -0,0 +1,16 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
             [r'(?<=[0-9])\+',
              r'(?<=[0-9])%',  # 4% -> ["4", "%"]
              # Persian is written from Right-To-Left
              r'(?<=[0-9])(?:{})'.format(CURRENCY),
              r'(?<=[0-9])(?:{})'.format(UNITS),
              r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])

TOKENIZER_SUFFIXES = _suffixes
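
A rough check of the percent rule (assumes the blank 'fa' language class from this PR; note the rule uses ASCII [0-9], so it fires on Latin digits):

import spacy

nlp = spacy.blank('fa')
print([t.text for t in nlp(u'94%')])  # ['94', '%']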

View File

@ -1,105 +1,395 @@
# coding: utf8
from __future__ import unicode_literals
# Stop words from the Hazm package
STOP_WORDS = set("""
آباد آره آری آسانی آمد آمده آن آنان آنجا آنها آنها آنچه آنکه آورد آورده آیا آید
ات اثر از است استفاده اش اطلاعند الاسف البته الظاهر ام اما امروز امسال اند انکه او اول اکنون
اگر الواقع ای ایشان ایم این اینک اینکه
ب با بااین بار بارة باره بارها باز بازهم بازی باش باشد باشم باشند باشی باشید باشیم بالا بالاخره
بالاخص بالاست بالای بالطبع بالعکس باوجودی باورند باید بتدریج بتوان بتواند بتوانی بتوانیم بجز بخش بخشه بخشی بخصوص بخواه
بخواهد بخواهم بخواهند بخواهی بخواهید بخواهیم بخوبی بد بدان بدانجا بدانها بدون بدین بدینجا بر برآن برآنند برا برابر
براحتی براساس براستی برای برایت برایش برایشان برایم برایمان برخوردار برخوردارند برخی برداری برداشتن بردن برعکس برنامه
بروز بروشنی بزرگ بزودی بس بسا بسادگی بسته بسختی بسوی بسی بسیار بسیاری بشدت بطور بطوری بعد بعدا بعدازظهر بعدها بعری
بعضا بعضی بعضیهایشان بعضیها بعلاوه بعید بفهمی بلافاصله بله بلکه بلی بنابراین بندی به بهت بهتر بهترین بهش بود بودم بودن
بودند بوده بودی بودید بودیم بویژه بپا بکار بکن بکند بکنم بکنند بکنی بکنید بکنیم بگو بگوید بگویم بگویند بگویی بگویید
بگوییم بگیر بگیرد بگیرم بگیرند بگیری بگیرید بگیریم بی بیا بیاب بیابد بیابم بیابند بیابی بیابید بیابیم بیاور بیاورد
بیاورم بیاورند بیاوری بیاورید بیاوریم بیاید بیایم بیایند بیایی بیایید بیاییم بیرون بیست بیش بیشتر بیشتری بین بیگمان
پ پا پارسال پارسایانه پارهای پاعین پایین پدرانه پدیده پرسان پروردگارا پریروز پس پشت پشتوانه پشیمونی پنج پهن پی پیدا
پیداست پیرامون پیش پیشاپیش پیشتر پیوسته
ت تا تازه تازگی تان تاکنون تحت تحریم تدریج تر ترتیب تردید ترند ترین تصریحا تعدادی تعمدا تفاوتند تقریبا تک تلویحا تمام
تماما تمامشان تمامی تند تنها تو توؤما تواند توانست توانستم توانستن توانستند توانسته توانستی توانستیم توانم توانند توانی
توانید توانیم توسط تول توی
ث ثالثا ثانی ثانیا
ج جا جای جایی جدا جداگانه جدید جدیدا جریان جز جلو جلوگیری جلوی جمع جمعا جمعی جنابعالی جناح جنس جهت جور جوری
چ چاله چاپلوسانه چت چته چرا چشم چطور چقدر چنان چنانچه چنانکه چند چندان چنده چندین چنین چه چهار چو چون چکار چگونه چی چیز
چیزهاست چیزی چیزیست چیست چیه
ح حاشیه حاشیهای حاضر حاضرم حال حالا حاکیست حتما حتی حداقل حداکثر حدود حدودا حسابگرانه حسابی حضرتعالی حق حقیرانه حول
حکما
خ خارج خالصانه خب خداحافظ خداست خدمات خستهای خصوصا خلاصه خواست خواستم خواستن خواستند خواسته خواستی خواستید خواستیم
خواهد خواهم خواهند خواهی خواهید خواهیم خوب خوبی خود خودبه خودت خودتان خودتو خودش خودشان خودم خودمان خودمو خودی خوش
خوشبختانه خویش خویشتن خویشتنم خیاه خیر خیره خیلی
د دا داام دااما داخل داد دادم دادن دادند داده دادی دادید دادیم دار داراست دارد دارم دارند داری دارید داریم داشت داشتم
داشتن داشتند داشته داشتی داشتید داشتیم دامم دانست دانند دایم دایما در دراین درباره درحالی درحالیکه درست درسته درشتی
درصورتی درعین درمجموع درواقع درون درپی دریغ دریغا دسته دشمنیم دقیقا دلخواه دم دنبال ده دهد دهم دهند دهی دهید دهیم دو
دوباره دوم دیده دیر دیرت دیرم دیروز دیشب دیوی دیگر دیگران دیگری دیگه
ذ ذاتا ذلک ذیل
ر را راجع راحت راسا راست راستی راه رسما رسید رشته رغم رفت رفتارهاست رفته رنجند رهگشاست رو رواست روب روبروست روز روزانه
روزه روزهای روزهای روش روشنی روی رویش ریزی
ز زدن زده زشتکارانند زمان زمانی زمینه زنند زهی زود زودتر زودی زیاد زیاده زیر زیرا
س سابق ساختن ساخته ساده سادگی سازی سالانه سالته سالمتر ساله سالهاست سالها سالیانه سایر ست سخت سخته سر سراسر سرانجام
سراپا سرعت سری سریع سریعا سعی سمت سه سهوا سوم سوی سپس سیاه
ش شان شاهدند شاهدیم شاید شبهاست شخصا شد شدت شدم شدن شدند شده شدی شدید شدیدا شدیم شش شما شماری شماست شمایند شناسی شود
شوراست شوم شوند شونده شوی شوید شویم شیرین شیرینه
ص صددرصد صرفا صریحا صندوق صورت صورتی
ض ضد ضمن ضمنا
ط طبعا طبق طبیعتا طرف طریق طلبکارانه طور طوری طی
ظ ظاهرا
ع عاجزانه عاقبت عبارتند عجب عجولانه عدم عرفانی عقب علاوه علت علنا علی علیه عمدا عمدتا عمده عمل عملا عملی عموم عموما
عنقریب عنوان عینا
غ غالبا غیر غیرقانونی
ف فاقد فبها فر فردا فعلا فقط فلان فلذا فوق فکر فی فیالواقع
ق قاالند قابل قاطبه قاطعانه قاعدتا قانونا قبل قبلا قبلند قد قدر قدری قراردادن قصد قطعا
ک کارند کاش کاشکی کامل کاملا کتبا کجا کجاست کدام کرات کرد کردم کردن کردند کرده کردی کردید کردیم کس کسانی کسی کشیدن کل
کلا کلی کلیشه کلیه کم کمااینکه کماکان کمتر کمتره کمتری کمی کن کنار کنارش کنان کنایهای کند کنم کنند کننده کنون کنونی
کنی کنید کنیم که کو کی كي
گ گاه گاهی گذاری گذاشتن گذاشته گذشته گردد گرفت گرفتارند گرفتم گرفتن گرفتند گرفته گرفتی گرفتید گرفتیم گرمی گروهی گرچه
گفت گفتم گفتن گفتند گفته گفتی گفتید گفتیم گه گهگاه گو گونه گویا گویان گوید گویم گویند گویی گویید گوییم گیرد گیرم گیرند
گیری گیرید گیریم
ل لا لااقل لاجرم لب لذا لزوما لطفا لیکن لکن
م ما مادامی ماست مامان مان مانند مبادا متاسفانه متعاقبا متفاوتند مثل مثلا مجبورند مجددا مجموع مجموعا محتاجند محکم
محکمتر مخالفند مختلف مخصوصا مدام مدت مدتهاست مدتی مذهبی مرا مراتب مرتب مردانه مردم مرسی مستحضرید مستقیما مستند مسلما
مشت مشترکا مشغولند مطمانا مطمانم مطمینا مع معتقدم معتقدند معتقدیم معدود معذوریم معلومه معمولا معمولی مغرضانه مفیدند
مقابل مقدار مقصرند مقصری ممکن من منتهی منطقی مواجهند موارد موجودند مورد موقتا مکرر مکررا مگر می مي میان میزان میلیارد
میلیون میرسد میرود میشود میکنیم
ن ناامید ناخواسته ناراضی ناشی نام ناچار ناگاه ناگزیر ناگهان ناگهانی نباید نبش نبود نخست نخستین نخواهد نخواهم نخواهند
نخواهی نخواهید نخواهیم نخودی ندارد ندارم ندارند نداری ندارید نداریم نداشت نداشتم نداشتند نداشته نداشتی نداشتید نداشتیم
نزد نزدیک نسبتا نشان نشده نظیر نفرند نفهمی نماید نمی نمیشود نه نهایت نهایتا نوع نوعا نوعی نکرده نگاه نیازمندانه
نیازمندند نیز نیست نیمی
و وابسته واقع واقعا واقعی واقفند وای وجه وجود وحشت وسط وضع وضوح وقتی وقتیکه ولی وگرنه وگو وی ویا ویژه
ه ها هاست های هایی هبچ هدف هر هرحال هرچند هرچه هرکس هرگاه هرگز هزار هست هستم هستند هستی هستید هستیم هفت هق هم همان
همانند همانها همدیگر همزمان همه همهاش همواره همچنان همچنین همچون همچین همگان همگی همیشه همین هنوز هنگام هنگامی هوی هی
هیچ هیچکدام هیچکس هیچگاه هیچگونه هیچی
ی یا یابد یابم یابند یابی یابید یابیم یارب یافت یافتم یافتن یافته یافتی یافتید یافتیم یعنی یقینا یواش یک یکدیگر یکریز
یکسال یکی یکي
""".split())
و
در
به
از
که
این
را
با
است
برای
آن
یک
خود
تا
کرد
بر
هم
نیز
گفت
میشود
وی
شد
دارد
ما
اما
یا
شده
باید
هر
آنها
بود
او
دیگر
دو
مورد
میکند
شود
کند
وجود
بین
پیش
شدهاست
پس
نظر
اگر
همه
یکی
حال
هستند
من
کنند
نیست
باشد
چه
بی
می
بخش
میکنند
همین
افزود
هایی
دارند
راه
همچنین
روی
داد
بیشتر
بسیار
سه
داشت
چند
سوی
تنها
هیچ
میان
اینکه
شدن
بعد
جدید
ولی
حتی
کردن
برخی
کردند
میدهد
اول
نه
کردهاست
نسبت
بیش
شما
چنین
طور
افراد
تمام
درباره
بار
بسیاری
میتواند
کرده
چون
ندارد
دوم
بزرگ
طی
حدود
همان
بدون
البته
آنان
میگوید
دیگری
خواهدشد
کنیم
قابل
یعنی
رشد
میتوان
وارد
کل
ویژه
قبل
براساس
نیاز
گذاری
هنوز
لازم
سازی
بودهاست
چرا
میشوند
وقتی
گرفت
کم
جای
حالی
تغییر
پیدا
اکنون
تحت
باعث
مدت
فقط
زیادی
تعداد
آیا
بیان
رو
شدند
عدم
کردهاند
بودن
نوع
بلکه
جاری
دهد
برابر
مهم
بوده
اخیر
مربوط
امر
زیر
گیری
شاید
خصوص
آقای
اثر
کننده
بودند
فکر
کنار
اولین
سوم
سایر
کنید
ضمن
مانند
باز
میگیرد
ممکن
حل
دارای
پی
مثل
میرسد
اجرا
دور
منظور
کسی
موجب
طول
امکان
آنچه
تعیین
گفته
شوند
جمع
خیلی
علاوه
گونه
تاکنون
رسید
ساله
گرفته
شدهاند
علت
چهار
داشتهباشد
خواهدبود
طرف
تهیه
تبدیل
مناسب
زیرا
مشخص
میتوانند
نزدیک
جریان
روند
بنابراین
میدهند
یافت
نخستین
بالا
پنج
ریزی
عالی
چیزی
نخست
بیشتری
ترتیب
شدهبود
خاص
خوبی
خوب
شروع
فرد
کامل
غیر
میرود
دهند
آخرین
دادن
جدی
بهترین
شامل
گیرد
بخشی
باشند
تمامی
بهتر
دادهاست
حد
نبود
کسانی
میکرد
داریم
علیه
میباشد
دانست
ناشی
داشتند
دهه
میشد
ایشان
آنجا
گرفتهاست
دچار
میآید
لحاظ
آنکه
داده
بعضی
هستیم
اند
برداری
نباید
میکنیم
نشست
سهم
همیشه
آمد
اش
وگو
میکنم
حداقل
طبق
جا
خواهدکرد
نوعی
چگونه
رفت
هنگام
فوق
روش
ندارند
سعی
بندی
شمار
کلی
کافی
مواجه
همچنان
زیاد
سمت
کوچک
داشتهاست
چیز
پشت
آورد
حالا
روبه
سالهای
دادند
میکردند
عهده
نیمه
جایی
دیگران
سی
بروز
یکدیگر
آمدهاست
جز
کنم
سپس
کنندگان
خودش
همواره
یافته
شان
صرف
نمیشود
رسیدن
چهارم
یابد
متر
ساز
داشته
کردهبود
باره
نحوه
کردم
تو
شخصی
داشتهباشند
محسوب
پخش
کمی
متفاوت
سراسر
کاملا
داشتن
نظیر
آمده
گروهی
فردی
ع
همچون
خطر
خویش
کدام
دسته
سبب
عین
آوری
متاسفانه
بیرون
دار
ابتدا
شش
افرادی
میگویند
سالهای
درون
نیستند
یافتهاست
پر
خاطرنشان
گاه
جمعی
اغلب
دوباره
مییابد
لذا
زاده
گردد
اینجا""".split())

View File

@ -0,0 +1,43 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import NOUN, PROPN, PRON
def noun_chunks(obj):
    """
    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
    """
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj', 'dative', 'appos',
              'attr', 'ROOT']
    doc = obj.doc  # Ensure works on both Doc and Span.
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add('conj')
    np_label = doc.vocab.strings.add('NP')
    seen = set()
    for i, word in enumerate(obj):
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen:
            continue
        if word.dep in np_deps:
            if any(w.i in seen for w in word.subtree):
                continue
            seen.update(j for j in range(word.left_edge.i, word.i + 1))
            yield word.left_edge.i, word.i + 1, np_label
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(j for j in range(word.left_edge.i, word.i + 1))
                yield word.left_edge.i, word.i + 1, np_label


SYNTAX_ITERATORS = {
    'noun_chunks': noun_chunks
}
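
A usage sketch for the new iterator (assumes a French model such as fr_core_news_sm is installed; any parsed French Doc works, since the Language class wires in SYNTAX_ITERATORS):

import spacy

nlp = spacy.load('fr_core_news_sm')
doc = nlp(u'Les voitures autonomes déplacent la responsabilité vers les constructeurs')
for chunk in doc.noun_chunks:
    print(chunk.text, '|', chunk.root.dep_)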

spacy/lang/fa/tag_map.py (new file, 39 lines)
View File

@ -0,0 +1,39 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, PUNCT, SYM, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
TAG_MAP = {
"ADJ": {POS: ADJ },
"ADJ_CMPR": {POS: ADJ },
"ADJ_INO": {POS: ADJ},
"ADJ_SUP": {POS: ADJ},
"ADV": {POS: ADV},
"ADV_COMP": {POS: ADV},
"ADV_I": {POS: ADV},
"ADV_LOC": {POS: ADV},
"ADV_NEG": {POS: ADV},
"ADV_TIME": {POS: ADV},
"CLITIC": {POS: PART},
"CON": {POS: CONJ},
"CONJ": {POS: CONJ},
"DELM": {POS: PUNCT},
"DET": {POS: DET},
"FW": {POS: X},
"INT": {POS: INTJ},
"N_PL": {POS: NOUN},
"N_SING": {POS: NOUN},
"N_VOC": {POS: NOUN},
"NUM": {POS: NUM},
"P": {POS: ADP},
"PREV": {POS: ADP},
"PRO": {POS: PRON},
"V_AUX": {POS: AUX},
"V_IMP": {POS: VERB},
"V_PA": {POS: VERB},
"V_PP": {POS: VERB},
"V_PRS": {POS: VERB},
"V_SUB": {POS: VERB},
}

File diff suppressed because it is too large

View File

@ -6,7 +6,8 @@ from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import LOOKUP
from .lemmatizer import LEMMA_RULES, LEMMA_INDEX, LEMMA_EXC, LOOKUP
from .lemmatizer.lemmatizer import FrenchLemmatizer
from .syntax_iterators import SYNTAX_ITERATORS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
@ -28,7 +29,16 @@ class FrenchDefaults(Language.Defaults):
suffixes = TOKENIZER_SUFFIXES
token_match = TOKEN_MATCH
syntax_iterators = SYNTAX_ITERATORS
lemma_lookup = LOOKUP
@classmethod
def create_lemmatizer(cls, nlp=None):
lemma_rules = LEMMA_RULES
lemma_index = LEMMA_INDEX
lemma_exc = LEMMA_EXC
lemma_lookup = LOOKUP
return FrenchLemmatizer(index=lemma_index, exceptions=lemma_exc,
rules=lemma_rules, lookup=lemma_lookup)
class French(Language):

File diff suppressed because it is too large

View File

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from .lookup import LOOKUP
from ._adjectives import ADJECTIVES
from ._adjectives_irreg import ADJECTIVES_IRREG
from ._adverbs import ADVERBS
from ._nouns import NOUNS
from ._nouns_irreg import NOUNS_IRREG
from ._verbs import VERBS
from ._verbs_irreg import VERBS_IRREG
from ._dets_irreg import DETS_IRREG
from ._pronouns_irreg import PRONOUNS_IRREG
from ._auxiliary_verbs_irreg import AUXILIARY_VERBS_IRREG
from ._lemma_rules import ADJECTIVE_RULES, NOUN_RULES, VERB_RULES

LEMMA_INDEX = {'adj': ADJECTIVES, 'adv': ADVERBS, 'noun': NOUNS, 'verb': VERBS}

LEMMA_EXC = {'adj': ADJECTIVES_IRREG, 'noun': NOUNS_IRREG, 'verb': VERBS_IRREG,
             'det': DETS_IRREG, 'pron': PRONOUNS_IRREG, 'aux': AUXILIARY_VERBS_IRREG}

LEMMA_RULES = {'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES, 'verb': VERB_RULES}
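These three tables mirror how spaCy's rule-based lemmatizer consults its data: exceptions take precedence, suffix rules come next, and (for the French subclass) the plain lookup table is the last resort. A minimal sketch of inspecting them (import path per the file layout above):

# Minimal sketch: list which parts of speech each lemmatizer table covers.
from spacy.lang.fr.lemmatizer import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

print(sorted(LEMMA_INDEX))   # ['adj', 'adv', 'noun', 'verb']
print(sorted(LEMMA_EXC))     # ['adj', 'aux', 'det', 'noun', 'pron', 'verb']
print(sorted(LEMMA_RULES))   # ['adj', 'noun', 'verb']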

spacy/lang/fr/lemmatizer/_adjectives.py
@@ -0,0 +1,601 @@
# coding: utf8
from __future__ import unicode_literals
ADJECTIVES = set("""
abaissant abaissé abandonné abasourdi abasourdissant abattu abcédant aberrant
abject abjurant aboli abondant abonné abordé abouti aboutissant abouté
abricoté abrité abrouti abrupt abruti abrutissant abruzzain absent absolu
absorbé abstinent abstrait abyssin abâtardi abêtissant abîmant abîmé acarpellé
accablé accalminé accaparant accastillant accentué acceptant accepté accidenté
accolé accombant accommodant accommodé accompagné accompli accordé accorné
accoudé accouplé accoutumé accrescent accroché accru accréditant accrédité
accueillant accumulé accusé accéléré acescent achalandé acharné achevé
acidulé aciéré acotylé acquitté activé acuminé acutangulé acutifolié acutilobé
adapté additionné additivé adextré adhérent adimensionné adiré adjacent
adjoint adjugé adjuvant administré admirant adné adolescent adoptant adopté
adossé adouci adoucissant adressé adroit adscrit adsorbant adultérin adéquat
affaibli affaiblissant affairé affamé affectionné affecté affermi affidé
affilé affin affligeant affluent affolant affolé affranchi affriolant affronté
affété affûté afghan africain agaillardi agatin agatisé agaçant agglomérant
agglutinant agglutiné aggravant agissant agitant agité agminé agnat agonisant
agrafé agrandi agressé agrippant agrégé agréé aguichant ahanant ahuri
aigretté aigri aigrissant aiguilleté aiguisé ailé aimant aimanté aimé ajourné
ajusté alabastrin alambiqué alangui alanguissant alarmant alarmé albuginé
alcalescent alcalifiant alcalin alcalinisant alcoolisé aldin alexandrin alezan
aligoté alizé aliénant aliéné alkylant allaitant allant allemand allergisant
alliciant allié allongé allumant allumé alluré alléchant allégeant allégé
alphabloquant alphastimulant alphonsin alpin alternant alternifolié
altérant altéré alucité alvin alvéolé alésé amaigri amaigrissant amalgamant
amaril ambiant ambisexué ambivalent ambulant ami amiantacé amiantin amidé
aminé ammoniacé ammoniaqué ammonié amnistiant amnistié amnésiant amoindrissant
amorti amplifiant amplifié amplié ampoulé amputé amusant amusé amylacé
américain amérisant anabolisant analgésiant anamorphosé anarchisant anastigmat
anavirulent ancorné andin andorran anergisant anesthésiant angevin anglican
angoissant angoissé angustifolié angustipenné animé anisé ankylosant ankylosé
anobli anoblissant anodin anovulant ansé ansérin antenné anthropisé
antialcalin antiallemand antiamaril antiautoadjoint antibrouillé
anticipant anticipé anticoagulant anticontaminant anticonvulsivant
antidécapant antidéflagrant antidérapant antidétonant antifeutrant
antigivrant antiglissant antiliant antimonié antiméthémoglobinisant antinatal
antiodorant antioxydant antiperspirant antiquisant antirassissant
antiréfléchissant antirépublicain antirésonant antirésonnant antisymétrisé
antivieillissant antiémétisant antécédent anténatal antéposé antérieur
antérosupérieur anémiant anémié aoûté apaisant apeuré apicalisé aplati apocopé
apparent apparenté apparié appartenant appaumé appelant appelé appendiculé
appointé apposé apprivoisé approchant approché approfondi approprié approuvé
apprêté appuyé appétissant apérianthé aquarellé aquitain arabisant araucan
arborisé arboré arcelé archaïsant archiconnu archidiocésain architecturé
ardent ardoisé ardu argentin argenté argilacé arillé armoricain armé
arpégé arqué arrangeant arrivant arrivé arrogant arrondi arrosé arrêté arsénié
articulé arénacé aréolé arétin ascendant ascosporé asexué asin asphyxiant
aspirant aspiré assaillant assainissant assaisonné assassin assassinant
asservissant assidu assimilé assistant assisté assiégeant assiégé associé
assommant assonancé assonant assorti assoupi assoupissant assouplissant
assujetti assujettissant assuré asséchant astreignant astringent atloïdé
atonal atrophiant atrophié attachant attaquant attardé atteint attenant
attendu attentionné atterrant attesté attirant attitré attrayant attristant
atélectasié auriculé auscitain austral authentifiant autoadjoint autoagrippant
autoancré autobronzant autocentré autocohérent autocollant autocommandé
autocontraint autoconvergent autocopiant autoflagellant autofondant autoguidé
autolubrifiant autolustrant autolégitimant autolégitimé automodifiant
autonettoyant autoportant autoproduit autopropulsé autorepassant autorisé
autosuffisant autotrempant auvergnat avachi avalant avalé avancé avarié
aventuriné aventuré avenu averti aveuglant avianisé avili avilissant aviné
avivé avoisinant avoué avéré azimuté azoté azuré azéri aéronaval aéroporté
aéré aîné babillard badaud badgé badin bahaï bahreïni bai baillonné baissant
balafré balancé balbutiant baleiné ballant ballonisé ballonné ballottant
balzan bambochard banal banalisé bancal bandant bandé bangladeshi banlieusard
bantou baraqué barbant barbarisant barbelé barbichu barbifiant barbu bardé
baroquisant barré baryté basané basculant basculé basedowifiant basedowifié
bastillé bastionné bataillé batifolant battant battu bavard becqué bedonnant
bellifontain belligérant benoît benzolé benzoïné berçant beurré biacuminé
bicarbonaté bicarré bicomponent bicomposé biconstitué bicontinu bicornu
bidonnant bienfaisant bienséant bienveillant bigarré bigot bigourdan bigéminé
bilié billeté bilobé bimaculé binoclard biodégradant bioluminescent biorienté
biparti bipectiné bipinné bipolarisé bipédiculé biramé birman biréfringent
biscuité bisexué bismuthé bisontin bispiralé bissexué bisublimé bisérié
biterné bivalent bivitellin bivoltin blafard blanchissant blanchoyant blasé
blessé bleu bleuissant bleuté blindé blond blondin blondissant blondoyant
blousant blâmant blêmissant bodybuildé boisé boitillant bombé bonard
bondé bonifié bonnard borain bordant borin borné boré bossagé bossu bot
bouclé boudiné bouffant bouffi bouillant bouilli bouillonnant boulant bouleté
bouqueté bourdonnant bourdonné bourgeonnant bourrant bourrelé bourru bourré
boutonné bovin bracelé bradycardisant braillard branchu branché branlant
bressan bretessé bretonnant breveté briard bridgé bridé brillant brillanté
bringueballant brinquebalant brinqueballant briochin brioché brisant brisé
broché bromé bronzant bronzé brouillé broutant bruissant brun brunissant brut
brévistylé brûlant brûlé budgeté burelé buriné bursodépendant busqué busé
butyracé buté byzantin bâtard bâti bâté béant béat bédouin bégayant bénard
bénédictin béquetant béquillard bétonné bêlant bêtabloquant bêtifiant bômé
cabochard cabotin cabriolant cabré cacaoté cachectisant cachemiri caché
cadjin cadmié caducifolié cafard cagnard cagot cagoulé cahotant caillouté
calcicordé calcifié calculé calmant calotin calé camard cambrousard cambré
camisard campagnard camphré campé camé canaliculé canin cannelé canné cantalou
canulant cané caoutchouté capitolin capitulant capitulard capité capricant
capsulé captivant capuchonné caquetant carabiné caracolant caractérisé
carbonaté carboné carburant carburé cardiocutané cardé carencé caressant
carillonnant carillonné carié carminé carné carolin caronculé carpé carré
caréné casqué cassant cassé castelroussin castillan catalan catastrophé
catégorisé caudé caulescent causal causant cavalcadant celtisant cendré censé
centraméricain centré cerclé cerdagnol cerdan cerné certain certifié cervelé
chafouin chagrin chagrinant chagriné chaloupé chamoisé chamoniard chancelant
chantant chançard chapeauté chapé charançonné chargé charmant charnu charpenté
charrié chartrain chassant chasé chatoyant chaud chauffant chaussant chauvin
chenillé chenu chevalin chevauchant chevelu chevelé chevillé chevronné
chiant chicard chiffonné chiffré chimioluminescent chimiorésistant chiné
chié chlamydé chleuh chlorurant chloruré chloré chocolaté choisi choké
choral chronodépendant chryséléphantin chuintant chypré châtain chélatant
chômé ciblé cicatrisant cilié cinglant cinglé cintré circiné circonspect
circonvoisin circulant circumtempéré ciré cisalpin cisjuran cispadan citadin
citronné citérieur civil civilisé clabotant clair claironnant clairsemé
clandestin clapotant claquant clarifiant clariné classicisant claudicant
clavelé clignotant climatisé clinquant cliquetant clissé clivant cloisonné
cloqué clouté cloîtré clément clémentin coagulant coalescent coalisé coassocié
cocciné cocu codant codirigeant codominant codé codélirant codétenu coexistant
cogné cohérent coiffant coiffé coinché cokéfiant colicitant colitigant
collant collodionné collé colmatant colombin colonisé colorant coloré
combattant combinant combinard combiné comburant comité commandant commençant
commun communard communiant communicant communiqué communisant compact
comparé compassé compatissant compensé complaisant complexant compliqué
composant composé comprimé compromettant computérisé compétent comtadin conard
concertant concerté conciliant concluant concomitant concordant concourant
concupiscent concurrent concédant condamné condensant condensé condescendant
conditionné condupliqué confiant confident confiné confit confondant confédéré
congru congruent conjoint conjugant conjugué connaissant connard connivent
conné conquassant conquérant consacrant consacré consanguin conscient conscrit
conservé consistant consolant consolidé consommé consonant constant constellé
constipant constipé constituant constitué constringent consultant conséquent
containeurisé contaminant contemporain content contenu contestant continent
continu contondant contourné contractant contraignant contraint contraposé
contrarié contrastant contrasté contravariant contrecollé contredisant
contrefait contrevariant contrevenant contrit controuvé controversé contrôlé
convaincu convalescent conventionné convenu convergent converti convoluté
convulsivant convulsé conçu cooccupant cooccurrent coopérant coordiné
coordonné copartageant coparticipant coquillé coquin coraillé corallin
cordé cornard corniculé cornu corné corpulent correct correspondant corrigé
corrodant corrompu corrélé corticodépendant corticorésistant cortisoné
coréférent cossard cossu costaud costulé costumé cotisant couard couchant
coulant coulissant coulissé coupant couperosé couplé coupé courant courbatu
couronnant couronné court courtaud courtisan couru cousu couturé couvert
covalent covariant coïncident coûtant crachotant craché cramoisi cramponnant
craquelé cravachant crawlé crevant crevard crevé criant criard criblant criblé
crispant cristallin cristallisant cristallisé crochu croisetté croiseté
croissanté croisé crollé croquant crossé crotté croulant croupi croupissant
croyant cru crucifié cruenté crustacé cryodesséché cryoprécipité crémant
crépi crépitant crépu crétacé crétin crétinisant créé crêpelé crêté cubain
cuirassé cuisant cuisiné cuit cuivré culminant culotté culpabilisant cultivé
cuscuté cutané cyanosé câblé câlin cédant célébrant cérulé cérusé cévenol
damassé damné dandinant dansant demeuré demi dentelé denticulé dentu denté
dessalé dessiccant dessillant dessiné dessoudé desséchant deutéré diadémé
diamanté diapré diastasé diazoté dicarbonylé dichloré diffamant diffamé
diffractant diffringent diffusant différencié différent différé difluoré
diiodé dilatant dilaté diligent dilobé diluant dimensionné dimidié dimidé
diminué diocésain diphasé diplômant diplômé direct dirigeant dirigé dirimant
discipliné discontinu discord discordant discriminant discuté disert disgracié
disloqué disodé disparu dispersant dispersé disposé disproportionné disputé
dissimulé dissipé dissociant dissocié dissolu dissolvant dissonant disséminé
distant distinct distingué distrait distrayant distribué disubstitué disulfoné
divagant divaguant divalent divergent divertissant divin divorcé djaïn
dodu dogmatisant dolent domicilié dominant dominicain donjonné donnant donné
dormant dorsalisant doré douci doué drageonnant dragéifié drainant dramatisant
drapé dreyfusard drogué droit dru drupacé dual ductodépendant dulcifiant dur
duveté dynamisant dynamité dyspnéisant dystrophiant déaminé débarqué débauché
débilitant débloquant débordant débordé débouchant débourgeoisé déboussolé
débridé débrouillard débroussaillant débroussé débutant décadent décaféiné
décalant décalcifiant décalvant décapant décapité décarburant décati décavé
décevant déchagriné décharné déchaînant déchaîné déchevelé déchiqueté
déchiré déchloruré déchu décidu décidué décidé déclaré déclassé déclenchant
décoiffant décolleté décolorant décoloré décompensé décomplémenté décomplété
déconcertant déconditionné déconfit décongestionnant déconnant déconsidéré
décontractant décontracturant décontracté décortiqué décoré découplé découpé
décousu découvert décrispant décrochant décroissant décrépi décrépit décuman
décussé décérébré dédoré défaillant défait défanant défatigant défavorisé
déferlant déferlé défiant déficient défigé défilant défini déflagrant défleuri
défléchi défoliant défoncé déformant défranchi défraîchi défrisant défroqué
défâché défécant déférent dégagé dégingandé dégivrant déglutiné dégonflé
dégourdi dégouttant dégoûtant dégoûté dégradant dégradé dégraissant dégriffé
déguisé dégénérescent dégénéré déhanché déhiscent déjeté délabrant délabré
délassant délavé délayé délibérant délibéré délicat délinquant déliquescent
délitescent délié déloqué déluré délégué démagnétisant démaquillant démaqué
dément démerdard démesuré démixé démodé démontant démonté démoralisant
démotivant démotivé démystifiant démyélinisant démyélisant démêlant dénaturant
dénigrant dénitrant dénitrifiant dénommé dénudé dénutri dénué déodorant
dépapillé dépareillé dépassé dépaysant dépaysé dépeigné dépenaillé dépendant
dépeuplé déphasé dépité déplacé déplaisant déplaquetté déplasmatisé dépliant
déplumant déplumé déplété dépoitraillé dépolarisant dépoli dépolitisant
déponent déporté déposant déposé dépouillé dépourvu dépoussiérant dépravant
déprimant déprimé déprédé dépérissant dépétainisé déracinant déraciné
dérangé dérapant dérestauré dérivant dérivé dérobé dérogeant déroulant
déréalisant déréglé désabusé désaccordé désadapté désaffectivé désaffecté
désaisonnalisé désaligné désaliénant désaltérant désaluminisé désambiguïsé
désargenté désarmant désarçonnant désassorti désatomisé désaturant désaxé
désemparé désenchanté désensibilisant désert désespérant désespéré
désherbant déshonorant déshumanisant déshydratant déshydraté déshydrogénant
désiconisé désillusionnant désincarné désincrustant désinfectant
désintéressé désirant désobligeant désoblitérant désobéi désobéissant
désodorisant désodé désoeuvré désolant désolé désopilant désordonné
désorienté désossé désoxydant désoxygénant déstabilisant déstressant
désuni déséquilibrant déséquilibré détachant détaché détartrant détendu détenu
déterminant déterminé déterré détonant détonnant détourné détraqué détérioré
développé déverbalisant dévergondé déversé dévertébré déviant dévissé
dévoisé dévolu dévorant dévot dévoué dévoyé déwatté déçu effacé effarant
effarouché effaré effervescent efficient effiloché effilé efflanqué
effluent effondré effrangé effrayant effrayé effronté effréné efféminé
emballant embarrassant embarrassé embellissant embiellé embouché embouti
embrassant embrassé embrouillant embrouillé embroussaillé embruiné embryonné
embusqué embêtant emmerdant emmiellant emmiélant emmotté empaillé empanaché
empenné emperlé empesé empiétant emplumé empoignant empoisonnant emporté
empressé emprunté empâté empêché empêtré encaissant encaissé encalminé
encapsulant encapsulé encartouché encastré encerclant enchanté enchifrené
encloisonné encloqué encombrant encombré encorné encourageant encroué
encroûté enculé endenté endiablé endiamanté endimanché endogé endolori
endormi endurant endurci enfantin enfariné enflammé enflé enfoiré enfoncé
engageant engagé engainant englanté englobant engoulé engourdi engourdissant
engraissant engravé engrenant engrené engrêlé enguiché enhardé enivrant
enjambé enjoué enkikinant enkysté enlaidissant enlaçant enlevé enneigé ennemi
ennuyant ennuyé enquiquinant enracinant enrageant enragé enregistrant enrhumé
enrichissant enrobé enseignant enseigné ensellé ensoleillé ensommeillé
ensoutané ensuqué entartré entendu enterré enthousiasmant entouré entrant
entraînant entrecoupé entrecroisé entrelacé entrelardé entreprenant entresolé
entrouvert enturbanné enté entêtant entêté envahissant envapé enveloppant
envenimé enviné environnant envié envoyé envoûtant ergoté errant erroné
escarpé espacé espagnol espagnolisant esquintant esquinté esseulé essorant
estomaqué estompé estropié estudiantin euphorisant euphémisé eurafricain
exacerbé exact exagéré exalbuminé exaltant exalté exaspérant excellent
excepté excitant excité exclu excluant excommunié excru excédant exempt
exercé exerçant exfoliant exhalant exhilarant exigeant exilé exinscrit
exondé exorbitant exorbité exosporé exostosant expansé expatrié expectant
expert expirant exploitant exploité exposé expropriant exproprié expulsé
expérimenté extasié extemporané extradossé extrafort extraplat extrapériosté
extraverti extroverti exténuant extérieur exubérant exultant facilitant
faiblissant faignant failli faillé fainéant faisandé faisant fait falot falqué
fané faraud farci fardé farfelu farinacé fasciculé fascinant fascisant fascié
fassi fastigié fat fatal fatigant fatigué fauché favorisant façonné faïencé
feint fendant fendillé fendu fenestré fenian fenêtré fermant fermentant
ferritisant ferruginisé ferré fertilisant fervent fescennin fessu festal
festival feuillagé feuilleté feuillu feuillé feutrant feutré fiancé fibrillé
ficelé fichant fichu fieffé figulin figuré figé filant fileté filoguidé
filé fimbrié fin final finalisé finaud fini finissant fiérot flabellé flagellé
flagrant flamand flambant flamboyant flambé flamingant flammé flanchard
flanquant flapi flatulent flavescent flemmard fleurdelisé fleuri fleurissant
flippant florentin florissant flottant flottard flotté flou fluctuant fluent
fluidifié fluocompact fluorescent fluoré flushé fléchissant fléché flémard
flétrissant flûté foisonnant foliacé folié folliculé folâtrant foncé fondant
fondé forain foraminé forcené forcé forfait forgé formalisé formaté formicant
formé fort fortifiant fortrait fortuit fortuné fossilisé foudroyant fouettard
fouillé fouinard foulant fourbu fourcheté fourchu fourché fourmillant fourni
foutral foutu foxé fracassant fractal fractionné fragilisant fragrant
franchouillard francisant franciscain franciscanisant frangeant frappant
fratrisé frelaté fretté friand frigorifié fringant fringué friqué frisant
frisotté frissonnant frisé frit froid froissant froncé frondescent frottant
froussard fructifiant fruité frumentacé frustrant frustré frutescent
fréquent fréquenté frétillant fugué fulgurant fulminant fumant fumé furfuracé
furibond fusant fuselé futur futé fuyant fuyard fâché fébricitant fécond
féculent fédéré félin féminin féminisant férin férié féru fêlé gabalitain
gagé gai gaillard galant galbé gallican gallinacé galloisant galonné galopant
ganglionné gangrené gangué gantelé garant garanti gardé garni garnissant
gauchisant gazonnant gazonné gazouillant gazé gaël geignard gelé genouillé
germanisant germé gestant gesticulant gibelin gigotant gigotté gigoté girond
gironné gisant gitan givrant givré glabrescent glacial glacé glandouillant
glapissant glaçant glissant glissé globalisant glomérulé glottalisé
gloussant gloutonnant gluant glucosé glycosylé glycuroconjugué godillé
goguenard gommant gommé goménolé gondolant gonflant gonflé gouleyant goulu
gourmand gourmé goussaut gouvernant gouverné goûtu goûté gradué gradé graffité
grand grandiloquent grandissant granité granoclassé granulé graphitisant
grasseyant gratifiant gratiné gratuit gravant gravitant greffant grelottant
grenelé grenu grené griffu grignard grilleté grillé grimaçant grimpant
grinçant grippé grisant grisonnant grivelé grondant grossissant grouillant
grésillant gueulard guignard guilloché guillotiné guindé guivré guéri gâté
gélatinisant gélatiné gélifiant gélifié géminé gémissant géniculé généralisant
géométrisant gérant gênant gêné gîté habilitant habilité habillé habitué
hachuré haché hagard halbrené haletant halin hallucinant halluciné hanché
hanté harassant harassé harcelant harcelé hardi harpé hasté haut hautain
hennissant heptaperforé herbacé herborisé herbu herminé hernié hersé heurté
hibernant hilarant hindou hircin hispanisant historicisant historisant
hivernant hiérosolymitain holocristallin hominisé homogénéisé homoprothallé
homoxylé honorant honoré hordéacé hormonodéprivé horodaté horrifiant
hottentot hoyé huguenot huitard humain humectant humiliant humilié huppé
hutu hyalin hydratant hydrocarboné hydrochloré hydrocuté hydrogénant hydrogéné
hydrosalin hydrosodé hydroxylé hyperalcalin hypercalcifiant hypercalcémiant
hypercoagulant hypercommunicant hypercorrect hyperfin hyperfractionné
hyperisé hyperlordosé hypermotivé hyperphosphatémiant hyperplan hypersomnolent
hypertrophiant hypertrophié hypervascularisé hypnotisant hypoalgésiant
hypocalcémiant hypocarpogé hypocholestérolémiant hypocotylé hypoglycémiant
hypolipidémiant hypophosphatémiant hyposodé hypotendu hypotonisant
hypovirulent hypoxémiant hâlé hébraïsant hébété hélicosporé héliomarin
hélitransporté hémicordé hémicristallin hémiplégié hémodialysé hémopigmenté
hépatostrié hérissant hérissé hésitant hétéroprothallé hétérosporé hétérostylé
identifié idiot idiotifiant idéal ignifugeant ignorant ignorantin ignoré igné
illimité illuminé imaginant imaginé imagé imbriqué imbrûlé imbu imité immaculé
immergé immigrant immigré imminent immodéré immortalisant immotivé immun
immunocompétent immunodéprimant immunodéprimé immunostimulant immunosupprimé
immédiat immérité impair impaludé imparfait imparidigité imparipenné impatient
impayé impensé imperforé impermanent imperméabilisant impertinent implorant
important importun importé imposant imposé impotent impressionnant imprimant
impromptu impromulgué improuvé imprudent imprévoyant imprévu impudent
impuni impur impénitent impétiginisé inabordé inabouti inabrité inabrogé
inaccepté inaccompli inaccoutumé inachevé inactivé inadapté inadéquat
inaguerri inaliéné inaltéré inanalysé inanimé inanitié inapaisé inaperçu
inapparenté inappliqué inapprivoisé inapproprié inapprécié inapprêté
inarticulé inassimilé inassorti inassouvi inassujetti inattaqué inattendu
inavoué incandescent incapacitant incarnadin incarnat incarné incendié
incessant inchangé inchâtié incident incidenté incitant incivil inclassé
incliné inclément incohérent incombant incomitant incommodant incommuniqué
incompétent inconditionné inconfessé incongru incongruent inconnu inconquis
inconsidéré inconsistant inconsolé inconsommé inconstant inconséquent
incontesté incontinent incontrôlé inconvenant incoordonné incorporant
incorrect incorrigé incriminant incriminé incritiqué incroyant incrustant
incréé incubant inculpé incultivé incurvé indeviné indifférencié indifférent
indirect indirigé indiscipliné indiscriminé indiscuté indisposé indistinct
indolent indompté indou indu induit indulgent indupliqué induré
indébrouillé indécent indéchiffré indécidué indéfini indéfinisé indéfriché
indélibéré indélicat indémontré indémêlé indépassé indépendant indépensé
indéterminé ineffectué inefficient inemployé inentamé inentendu inespéré
inexaucé inexercé inexistant inexpert inexpié inexpliqué inexploité inexploré
inexprimé inexpérimenté inexécuté infamant infantilisant infarci infatué
infectant infecté infestant infesté infichu infiltrant infini inflammé
infléchi infondé informant informulé infortuné infoutu infréquenté infusé
inféodé inférieur inférovarié ingrat ingénu inhabité inhalant inhibant inhibé
inhérent inimité inintelligent ininterrompu inintéressant initié injecté
innervant innocent innominé innommé innomé innovant inné inobservé inoccupé
inondé inopiné inopportun inopérant inorganisé inoublié inouï inquiétant
insatisfait insaturé inscrit insensé insermenté insignifiant insinuant
insolent insondé insonorisant insonorisé insouciant insoupçonné inspirant
inspécifié installé instant instantané instructuré instruit insubordonné
insulinodépendant insulinorésistant insultant insulté insurgé insécurisant
intelligent intempérant intentionné interallié interaméricain intercepté
intercristallin intercurrent interdigité interdiocésain interdit
interfacé interfécond interférent interloqué intermittent intermédié interpolé
interprétant intersecté intersexué interstratifié interurbain intervenant
intestin intimidant intolérant intoxicant intoxiqué intramontagnard
intrigant introduit introjecté introverti intumescent intégrant intégrifolié
intéressé intérieur inusité inutilisé invaincu invalidant invariant invendu
inverti invertébré inviolé invitant involucré involuté invérifié invétéré
inéclairci inécouté inédit inégalé inélégant inéprouvé inépuisé inéquivalent
iodoformé ioduré iodylé iodé ionisant iridescent iridié irisé ironisant
irraisonné irrassasié irritant irrité irréalisé irréfléchi irréfuté irrémunéré
irrésolu irrévélé islamisant isohalin isolant isolé isosporé issant issu
itinérant ivoirin jacent jacobin jaillissant jamaïcain jamaïquain jambé
japonné jardiné jarreté jarré jaspé jauni jaunissant javelé jaïn jobard joint
joli joufflu jouissant jovial jubilant juché judaïsant jumelé juponné juré
juxtaposant juxtaposé kalmouk kanak kazakh kenyan kosovar kératinisé labié
lacinié lactant lactescent lactosé lacté lai laid lainé laité lambin lambrissé
lamifié laminé lampant lampassé lamé lancinant lancé lancéolé languissant
lapon laqué lardacé larmoyant larvé laryngé lassant latent latifolié latin
latté latéralisé lauré lauréat lavant lavé laïcisant lent lenticulé letton
lettré leucopéniant leucosporé leucostimulant levant levantin levretté levé
liant libertin libéré licencié lichénifié liftant lifté ligaturé lignifié
ligulé lilacé limacé limitant limougeaud limousin lionné lippu liquéfiant
lithiné lithié lité lié liégé lobulé lobé localisé loculé lointain lombard
lorrain loré losangé loti louchant loupé lourd lourdaud lubrifiant luisant
lunetté lunulé luné lusitain lustré luthé lutin lutéinisant lutéostimulant
lyophilisé lyré léché lénifiant léonard léonin léopardé lézardé maboul maclé
madré mafflu maghrébin magnésié magnétisant magrébin magyar mahométan maillant
majeur majorant majorquin maladroit malaisé malavisé malbâti malentendant
malformé malintentionné malnutri malodorant malotru malouin malpoli malsain
malséant maltraitant malté malveillant malvoyant maléficié mamelonné mamelu
manchot mandarin mandchou maniéré mannité manoeuvrant manquant manqué mansardé
mantouan manuscrit manuélin maori maraîchin marbré marcescent marchand
marial marin mariol marié marmottant marocain maronnant marquant marqueté
marquésan marrant marri martelé martyr marxisant masculin masculinisant
masqué massacrant massant massé massétérin mat matelassé mati matérialisé
maugrabin maugrebin meilleur melonné membrané membru menacé menant menaçant
mentholé menu merdoyant mesquin messin mesuré meublant mexicain micacé
microencapsulé microgrenu microplissé microéclaté miellé mignard migrant
militant millerandé millimétré millésimé mineur minidosé minorant minorquin
miraculé miraillé miraud mirobolant miroitant miroité miré mitigé mitré mité
mobiliérisé mochard modelant modifiant modulant modulé modélisant modéré
mogol moiré moisi moleté molletonné mollissant momentané momifié mondain mondé
monilié monobromé monochlamydé monochloré monocomposé monocontinu
monofluoré monogrammé monohalogéné monohydraté mononucléé monophasé
monopérianthé monoréfringent monosporé monotriphasé monovalent montagnard
montpelliérain monté monténégrin monumenté moralisant mordant mordicant
mordu morfal morfondu moribond moricaud mormon mort mortifiant morvandiot
mosellan motivant motivé mouchard moucheté mouflé mouillant mouillé moulant
moulé mourant moussant moussu moustachu moutonnant moutonné mouvant mouvementé
moyé mozambicain mucroné mugissant mulard multiarticulé multidigité
multilobé multinucléé multiperforé multiprogrammé multirésistant multisérié
multivalent multivarié multivitaminé multivoltin munificent murin muriqué
murrhin musard musclé musqué mussipontain musulman mutant mutilant mutin
myorelaxant myrrhé mystifiant mythifiant myélinisant myélinisé mâtiné méchant
méconnu mécontent mécréant médaillé médian médiat médicalisé médisant
méfiant mélangé mélanostimulant méningé méplat méprisant méritant mérulé
métallescent métallisé métamérisé métastasé méthoxylé méthyluré
métropolitain météorisant mêlé mûr mûrissant nabot nacré nageant nain naissant
nanti napolitain narcissisant nasard nasillard natal natté naturalisé naufragé
naval navigant navrant nazi nervin nervuré nervé nettoyant neumé neuralisant
neuroméningé neutralisant nickelé nictitant nidifiant nigaud nigérian
nippon nitescent nitrant nitrifiant nitrosé nitrurant nitré noir noiraud
nombrant nombré nominalisé nommé nonchalant normalisé normand normodosé
normotendu normé notarié nourri nourrissant noué noyé nu nuagé nuancé nucléolé
nullard numéroté nutant nué nébulé nécessitant nécrosant négligent négligé
néoformé néolatin néonatal névrosant névrosé obligeant obligé oblitérant
obscur observant obsolescent obstiné obstrué obsédant obsédé obséquent
obturé obéi obéissant obéré occitan occupant occupé occurrent ocellé ochracé
oculé odorant odoriférant oeillé oeuvé offensant offensé officiant offrant
olivacé oléacé oléfiant oléifiant oman ombellé ombiliqué ombragé ombré
omniprésent omniscient ondoyant ondulant ondulé ondé onglé onguiculé ongulé
opalescent opalin operculé opiacé opportun opposant oppositifolié opposé
oppressé opprimant opprimé opsonisant optimalisant optimisant opulent opérant
orant ordonné ordré oreillard oreillé orfévré organisé organochloré
organosilicié orientalisant orienté oropharyngé orphelin orthonormé ortié
osmié ossifiant ossifluent ossu ostial ostracé ostrogot ostrogoth ostréacé osé
ouaté ourlé oursin outillé outrageant outragé outrecuidant outrepassé outré
ouvragé ouvrant ouvré ovalisé ovillé ovin ovulant ové oxycarboné oxydant
oxygéné ozoné oïdié pacifiant padan padouan pahlavi paillard pailleté pair
palatin palermitain palissé pallotin palmatilobé palmatinervé palmatiséqué
palmiséqué palmé palpitant panaché panafricain panard paniculé paniquant panné
pantelant pantouflard pané papalin papelard papilionacé papillonnant
papou papyracé paraffiné paralysant paralysé paramédian parcheminé parent
parfumé paridigitidé paridigité parigot paripenné parlant parlé parmesan
parsi partagé partant parti participant partisan partousard partouzard
parvenu paré passant passepoilé passerillé passionnant passionné passé pataud
patelin patelinant patent patenté patient patoisant patriotard pattu patté
paumé pavé payant pectiné pehlevi peigné peinard peint pelliculant pelliculé
peluché pelé penaud penchant penché pendant pendu pennatilobé pennatinervé
penninervé penné pensant pensionné pentavalent pentu peptoné perchloraté
percutant percutané perdant perdu perfectionné perfolié perforant performant
perfusé perlant perluré perlé permanent permutant perphosporé perruqué persan
persistant personnalisé personnifié personé persuadé persulfuré persécuté
pertinent perturbant perverti perçant pesant pestiféré petiot petit peul
pharmocodépendant pharyngé phasé philippin philistin phophorylé phosphaté
phosphoré photoinduit photoluminescent photorésistant photosensibilisant
phénolé phénotypé piaffant piaillant piaillard picard picoté pigeonnant
pignonné pillard pilonnant pilosébacé pimpant pinaillé pinchard pincé pinné
pinçard pionçant piquant piqué pisan pistillé pitchoun pivotant piégé
placé plafonnant plaidant plaignant plain plaisant plan planant planté plané
plasmolysé plastifiant plat plein pleurant pleurard pleurnichard pliant
plissé plié plombé plongeant plumeté pluriarticulé plurihandicapé plurinucléé
plurivalent pochard poché poignant poilant poilu pointillé pointu pointé
poitevin poivré polarisant polarisé poli policé politicard polluant
polycarburant polychloré polycontaminé polycopié polycristallin polydésaturé
polyhandicapé polyinsaturé polylobé polynitré polynucléé polyparasité
polysubstitué polysyphilisé polytransfusé polytraumatisé polyvalent
polyvoltin pommelé pommeté pompant pompé ponctué pondéré pontifiant pontin
poplité poqué porcelainé porcin porracé portant portoricain possédant possédé
postillonné postnatal postnéonatal posté postérieur posé potelé potencé
poupin pourprin pourri pourrissant poursuivant pourtournant poussant poussé
pratiquant prenant prescient prescrit pressant pressionné pressé prieur primal
privilégié probant prochain procombant procubain profilé profitant profond
programmé prohibé projetant prolabé proliférant prolongé prompt promu
prononcé propané proportionné proratisé proscrit prostré protestant protonant
protubérant protéiné provenant provocant provoqué proéminent prudent pruiné
préalpin prébendé précipitant précipité précité précompact préconscient
précontraint préconçu précuit précédent prédesséché prédestiné prédiffusé
prédisposant prédominant prédécoupé préemballé préencollé préenregistré
préfabriqué préfixé préformant préfragmenté préférant préféré prégnant
prélatin prématuré prémuni prémédité prénasalisé prénatal prénommé préoblitéré
préoccupé préparant prépayé prépondérant prépositionné préprogrammé préroman
présalé présanctifié présent présignifié présumé présupposé prétendu
prétraité prévalant prévalent prévenant prévenu prévoyant prévu préémargé
préétabli prêt prêtant prêté psychiatrisé psychostimulant psychoénergisant
puant pubescent pudibond puissant pulsant pulsé pultacé pulvérulent puni pur
puritain purpuracé purpurin purulent pustulé putrescent putréfié puéril puîné
pyramidant pyramidé pyrazolé pyroxylé pâli pâlissant pédant pédantisant
pédiculosé pédiculé pédonculé pékiné pélorié pénalisant pénard pénicillé
pénétrant pénétré péquenaud pérennant périanthé périgourdin périmé périnatal
pérégrin pérégrinant péréqué pétant pétaradant pétillant pétiolé pétochard
pétrifiant pétrifié pétré pétulant pêchant qatari quadrifolié quadrigéminé
quadriparti quadrivalent quadruplété qualifiant qualifié quantifié quart
questionné quiescent quinaud quint quintessencié quintilobé quiné quérulent
rabattable rabattant rabattu rabougri raccourci racorni racé radiant radicant
radiodiffusé radiolipiodolé radiorésistant radiotransparent radiotélévisé
raffermissant raffiné rafraîchi rafraîchissant rageant ragot ragoûtant
raisonné rajeunissant rallié ramassé ramenard ramifié ramolli ramollissant
ramé ranci rangé rapatrié rapiat raplati rappelé rapporté rapproché rarescent
rasant rassasiant rassasié rassemblé rassurant rassuré rassérénant rasé
ratiocinant rationalisé raté ravageant ravagé ravalé ravi ravigotant ravissant
rayé rebattu rebondi rebondissant rebutant recalé recarburant recercelé
rechigné recombinant recommandé reconnaissant reconnu reconstituant recoqueté
recroiseté recroquevillé recru recrudescent recrutant rectifiant recueilli
redenté redondant redoublant redoublé refait refoulant refoulé refroidissant
regardant regrossi reinté relaxant relevé reluisant relâché relégué remarqué
rempli remuant renaissant renchéri rendu renfermé renflé renfoncé renforçant
rengagé renommé rentrant rentré renté renversant renversé repentant repenti
reporté reposant reposé repoussant repoussé repressé représentant repu
resarcelé rescapé rescindant rescié respirant resplendissant ressemblant
ressortissant ressurgi ressuscité restant restreint restringent resurgi
retardé retentissant retenu retiré retombant retrait retraité retrayant
retroussé revanchard revigorant revitalisant reviviscent reçu rhinopharyngé
rhodié rhumatisant rhumé rhénan rhônalpin riant ribaud riboulant ricain
riciné ridé rifain rigolard ringard risqué riverain roidi romagnol romain
romand romanisant rompu rond rondouillard ronflant rongeant rosacé rossard
rotacé roublard roucoulant rouergat rougeaud rougeoyant rougi rougissant
rouleauté roulotté roulé roumain rouquin rousseauisant routinisé roué rubané
rubicond rubéfiant rudenté rugissant ruiné ruisselant ruminant rupin rurbain
rusé rutilant rythmé râblé râlant râpé réadapté réalisant récalcitrant récent
réchauffant réchauffé récidivant récitant réclamant réclinant récliné
réconfortant récurant récurrent récurvé récusant réduit réentrant réflectorisé
réfléchissant réformé réfrigérant réfrigéré réfringent réfugié référencé
régissant réglant réglé régnant régressé régénérant régénéré réhabilité
réitéré réjoui réjouissant rémanent rémittent rémunéré rénitent répandu
réprouvé républicain répugnant réputé réservé résidant résident résigné
résiné résistant résolu résolvant résonant résonnant résorbant résorciné
résumé résupiné résurgent rétabli rétamé réticent réticulé rétrofléchi
rétroréfléchissant rétréci réuni réussi réverbérant révoltant révolté révolu
révulsant révulsé révélé révérend rééquilibrant rêvé rôti sabin saccadé
sacchariné sacrifié sacré safrané sagitté sahraoui saignant saignotant
sain saint saisi saisissant saladin salant salarié salicylé salin salissant
samaritain samoan sanctifiant sanglant sanglotant sanguin sanguinolent
sanskrit santalin saoul saoulard saponacé sarrasin satané satiné satisfaisant
saturant saturnin saturé saucissonné saucé saugrenu saumoné saumuré sautant
sautillé sauté sauvagin savant savoyard scalant scarifié scellé sciant
sclérosant sclérosé scolié scoriacé scorifiant scout script scrobiculé
second secrétant semelé semi-fini sempervirent semé sensibilisant sensé senti
serein serpentin serré servant servi seul sexdigité sexué sexvalent seyant
sibyllin sidérant sifflant sigillé siglé signalé signifiant silicié silicosé
simplifié simultané simulé sinapisé sinisant siphonné situé slavisant
snobinard socialisant sociologisant sodé soiffard soignant soigné solognot
somali sommeillant sommé somnolant somnolent sonnant sonné sorbonnard sortant
souahéli soudain soudant soudé soufflant soufflé souffrant soufi soulevé
sourd souriant soussigné soutenu souterrain souverain soûlant soûlard
spatulé spermagglutinant spermimmobilisant sphacélé spiralé spirant spiritain
splénectomisé spontané sporulé spumescent spécialisé stabilisant stagnant
staphylin stationné stibié stigmatisant stigmatisé stimulant stipendié stipité
stipulé stratifié stressant strict strident stridulant strié structurant
stupéfait stupéfiant stylé sténohalin sténosant stérilisant stérilisé
su suant subalpin subclaquant subconscient subintrant subit subjacent
sublimant subneutralisant subordonnant subordonné subrogé subsident subséquent
subulé suburbain subventionné subérifié succenturié succinct succulent
sucrant sucré sucé suffisant suffocant suffragant suicidé suintant suivant
sulfamidorésistant sulfamidé sulfaté sulfhydrylé sulfoné sulfurant sulfurisé
superfin superfini superflu supergéant superhydratant superordonné superovarié
suppliant supplicié suppléant supportant supposé suppurant suppuré
supradivergent suprahumain supérieur surabondant suractivé surajouté suranné
surbrillant surchargé surchauffé surclassé surcomposé surcomprimé surcouplé
surdéterminant surdéterminé surdéveloppé surencombré surexcitant surexcité
surfin surfondu surfrappé surgelé surgi surglacé surhaussé surhumain suri
surmenant surmené surmultiplié surmusclé surneigé suroxygéné surperformé
surplombant surplué surprenant surpressé surpuissant surréalisant sursalé
sursaturé sursilicé surveillé survitaminé survivant survolté surémancipé
susdit susdénommé susmentionné susnommé suspect suspendu susrelaté susurrant
suzerain suédé swahili swahéli swazi swingant swingué sylvain sympathisant
synanthéré synchronisé syncopé syndiqué synthétisant systématisé séant sébacé
séchant sécurisant sécurisé séduisant ségrégué ségrégé sélectionné sélénié
sémitisant sénescent séparé séquencé séquestrant sérigraphié séroconverti
sérotonicodépendant sétacé sévillan tabou tabouisé tacheté taché tadjik taillé
taloté taluté talé tamil tamisant tamisé tamoul tangent tannant tanné tapant
tapissant taponné tapé taqueté taquin tarabiscoté taraudant tarentin tari
tartré taré tassé tatar taupé taurin tavelé teint teintant teinté telluré
tempérant tempéré tenaillant tenant tendu tentant ternifolié terraqué
terrifiant terrorisant tessellé testacé texan texturant texturé thallosporé
thermisé thermocollant thermodurci thermofixé thermoformé thermohalin
thermoluminescent thermopropulsé thermorémanent thermorésistant thrombopéniant
thrombosé thymodépendant thébain théocentré théorbé tibétain tiercé tigré tigé
timbré timoré tintinnabulant tiqueté tirant tiré tisonné tissu titané titré
tocard toisonné tolérant tombal tombant tombé tonal tondant tondu tonifiant
tonnant tonsuré tonturé tophacé toquard toqué torché tordant tordu torsadé
tortu torturant toscan totalisant totipotent touchant touffu toulousain
tourelé tourmentant tourmenté tournant tournoyant tourné tracassant tracté
traitant tramaillé tranchant tranché tranquillisant transafricain transalpin
transandin transcendant transcutané transfini transfixiant transformant
transi transloqué transmutant transpadan transparent transperçant transpirant
transposé transtévérin transylvain trapu traumatisant traumatisé travaillant
traversant travesti traçant traînant traînard treillissé tremblant tremblotant
trempant trempé tressaillant triboluminescent tributant trichiné tricoté
tridenté trifoliolé trifolié trifurqué trigéminé trilobé trin trinervé
triparti triphasé triphosphaté trisubstitué tritié trituberculé triturant
trivialisé trompettant tronqué troublant trouillard trouvé troué truand
truffé truité trypsiné trébuchant tréflé trémulant trépassé trépidant
tuant tubard tubectomisé tuberculé tubulé tubéracé tubérifié tubérisé tufacé
tuilé tumescent tuméfié tuniqué turbiné turbocompressé turbulent turgescent
tutsi tué twisté typé tâtonnant téflonisé téléphoné télévisé ténorisant
térébrant tétraphasé tétrasubstitué tétravalent têtu tôlé ulcéré ultraciblé
ultracourt ultrafin ultramontain ultérieur uncinulé unciné uni unifiant
uniformisant unilobé uninucléé uniovulé unipotent uniramé uniréfringent
unistratifié unisérié unitegminé univalent univitellin univoltin urbain
urgent urticant usagé usant usité usé utriculé utérin utérosacré vacant
vacciné vachard vacillant vadrouillant vagabond vagabondant vaginé vagissant
vain vaincu vairé valdôtain valgisant validant vallonné valorisant valué
valvé vanadié vanilliné vanillé vanisé vanné vantard variolé varisant varié
varvé vasard vascularisé vasostimulant vasouillard vaudou veinard veiné
velu venaissin venant vendu ventripotent ventromédian ventru venté verdissant
vergeté verglacé verglaçant vergé verjuté vermicellé vermiculé vermoulant
verni vernissé verré versant versé vert verticillé vertébré vespertin vexant
vibrionnant vicariant vicelard vicié vieilli vieillissant vigil vigilant
vigorisant vil vilain violacé violent violoné vipérin virevoltant viril
virulent visigoth vitaminé vitellin vitré vivant viverrin vivifiant vivotant
vogoul voilé voisin voisé volant volanté volatil voletant voltigeant volvulé
vorticellé voulu voussé voyant voûté vrai vrillé vrombissant vu vulnérant
vulturin vécu végétant véhément vélin vélomotorisé vérolé vésicant vésiculé
vêtu wallingant watté wisigoth youpin zazou zend zigzagant zinzolin zoné
zoulou zélé zézayant âgé ânonnant ébahi ébaubi éberlué éblouissant ébouriffant
éburnin éburné écaillé écartelé écarté écervelé échancré échantillonné échappé
échauffant échauffé échevelé échiqueté échoguidé échu éclairant éclaircissant
éclatant éclaté éclipsant éclopé écoeurant écorché écoté écoutant écranté
écrasé écrit écru écrémé éculé écumant édenté édifiant édulcorant égaillé
égaré égayant égrillard égrisé égrotant égueulé éhanché éhonté élaboré élancé
électrisant électroconvulsivant électrofondu électroluminescent
élevé élingué élisabéthain élizabéthain éloigné éloquent élu élégant
émacié émanché émancipé émarginé émergent émergé émerillonné émerveillant
émigré éminent émollient émotionnant émoulu émoustillant émouvant ému
émulsionnant éméché émétisant énergisant énervant énervé épaississant épanoui
épargnant épatant épaté épeigné éperdu épeuré épicotylé épicutané épicé
épigé épinglé éploré éployé épointé époustouflant épouvanté éprouvant éprouvé
épuisé épuré équicontinu équidistant équilibrant équilibré équin équipollent
équipolé équipotent équipé équitant équivalent éraillé éreintant éreinté
érubescent érudit érythématopultacé établi étagé éteint étendu éthéré
étiolé étoffé étoilé étonnant étonné étouffant étouffé étourdi étourdissant
étriquant étriqué étroit étudiant étudié étymologisant évacuant évacué évadé
""".split())

File diff suppressed because it is too large

spacy/lang/fr/lemmatizer/_adverbs.py
@@ -0,0 +1,553 @@
# coding: utf8
from __future__ import unicode_literals
ADVERBS = set("""
abandonnément abjectement abominablement abondamment aboralement abouliquement
abruptement abrégément abréviativement absconsement absconsément absolument
abstraitement abstrusément absurdement abusivement académiquement
accelerando acceptablement accessoirement accidentellement accortement
acidement acoustiquement acrimonieusement acrobatiquement actiniquement
actuellement adagio additionnellement additivement adiabatiquement
adjectivement administrativement admirablement admirativement adorablement
adultérieurement adverbialement adversativement adéquatement affablement
affectionnément affectivement affectueusement affinement affirmativement
agilement agitato agnostiquement agogiquement agressivement agrestement
agrologiquement agronomiquement agréablement aguicheusement aidant aigrement
ailleurs aimablement ainsi aisément alchimiquement alcooliquement alentour
algorithmiquement algébriquement alias alimentairement allegretto allegro
allopathiquement allusivement allègrement allégoriquement allégrement allégro
alphabétiquement alternativement altimétriquement altièrement altruistement
amabile ambigument ambitieusement amiablement amicalement amiteusement
amoroso amoureusement amphibologiquement amphigouriquement amplement
amusément amènement amèrement anachroniquement anagogiquement
analogiquement analoguement analytiquement anaphoriquement anarchiquement
ancestralement anciennement andante andantino anecdotiquement angulairement
angéliquement anharmoniquement animalement animato annuellement anodinement
anormalement anthropocentriquement anthropologiquement anthropomorphiquement
anticipativement anticonstitutionnellement antidémocratiquement
antinomiquement antipathiquement antipatriotiquement antiquement
antisocialement antisportivement antisymétriquement antithétiquement
antiétatiquement antécédemment antérieurement anxieusement apathiquement
apicalement apocalyptiquement apodictiquement apologétiquement apostoliquement
appassionato approbativement approchant approximativement appréciablement
après-demain aquatiquement arbitrairement arbitralement archaïquement
architecturalement archéologiquement ardemment argotiquement aridement
arithmétiquement aromatiquement arrière arrogamment articulairement
artificiellement artificieusement artisanalement artistement artistiquement
aseptiquement asiatiquement assai assertivement assertoriquement assez
associativement assurément astrologiquement astrométriquement astronomiquement
astucieusement asymptotiquement asymétriquement ataviquement ataxiquement
atomiquement atrabilairement atrocement attenant attentionnément attentivement
attractivement atypiquement aucunement audacieusement audiblement
auditivement auguralement augustement aujourd'hui auparavant auprès
aussi aussitôt austèrement autant autarciquement authentiquement
autographiquement automatiquement autonomement autoritairement autrefois
auxiliairement avant avant-hier avantageusement avarement avaricieusement
aventurément aveuglément avidement avunculairement axialement axiologiquement
aériennement aérodynamiquement aérostatiquement babéliquement bachiquement
badaudement badinement balistiquement balourdement balsamiquement banalement
barbarement barométriquement baroquement bas bassement batailleusement
baveusement beau beaucoup bellement belliqueusement ben bene benoîtement
bestialement bibliographiquement bibliquement bien bienheureusement bientôt
bigotement bigrement bihebdomadairement bijectivement bijournellement
bileusement bilieusement bilinéairement bimensuellement bimestriellement
bioacoustiquement biochimiquement bioclimatiquement biodynamiquement
biogénétiquement biologiquement biomédicalement bioniquement biophysiquement
bioélectroniquement bioénergétiquement bipolairement biquotidiennement bis
biunivoquement bizarrement bizarroïdement blafardement blagueusement
blondement blâmablement bon bonassement bonnement bordéliquement botaniquement
boueusement bouffonnement bougonnement bougrement boulimiquement
bravachement bravement bredouilleusement bref brillamment brièvement
brumeusement brusquement brut brutalement bruyamment bucoliquement
bureaucratiquement burlesquement byzantinement béatement bégueulement
bénéfiquement bénévolement béotiennement bésef bézef bêtement cabalistiquement
cabotinement cachottièrement cacophoniquement caf cafardeusement
cajoleusement calamiteusement calligraphiquement calmement calmos
calorimétriquement caloriquement canaillement cancérologiquement candidement
cantabile capablement capillairement capitalement capitulairement
captieusement caractériellement caractérologiquement cardiographiquement
caricaturalement carrément cartographiquement cartésiennement casanièrement
casuellement catalytiquement catastrophiquement catholiquement
catégoriquement causalement causativement caustiquement cauteleusement
caverneusement cellulairement censitairement censément centièmement
cependant certainement certes cf chafouinement chagrinement chaleureusement
chaotiquement charismatiquement charitablement charnellement chastement
chattement chaud chaudement chauvinement chenuement cher chevaleresquement
chichement chichiteusement chimiquement chimériquement chinoisement
chiquement chirographairement chirographiquement chirurgicalement chiément
chouettement chromatiquement chroniquement chronologiquement
chrétiennement chèrement chétivement ci cinquantièmement cinquièmement
cinématographiquement cinétiquement circonspectement circonstanciellement
citadinement civilement civiquement clairement clandestinement classiquement
climatologiquement cliniquement cléricalement cocassement cochonnement
coextensivement collatéralement collectivement collusoirement collégialement
coléreusement colériquement combien combinatoirement comiquement comme
comment commercialement comminatoirement commodément communalement
communément comparablement comparativement compatiblement compendieusement
compensatoirement complaisamment complexement complètement complémentairement
compréhensivement comptablement comptant compulsivement conardement
conceptuellement concernant concevablement concisément concomitamment
concurremment concussionnairement condamnablement conditionnellement confer
confidemment confidentiellement conflictuellement conformationnellement
confortablement confraternellement confusément confédéralement congrument
congénitalement coniquement conjecturalement conjointement conjonctivement
conjugalement connardement connement connotativement consciemment
consensuellement conservatoirement considérablement considérément
constamment constitutionnellement constitutivement consubstantiellement
consécutivement conséquemment contagieusement contemplativement
contestablement contextuellement continuellement continûment contractuellement
contrairement contrapuntiquement contrastivement contre contributoirement
convenablement conventionnellement conventuellement convivialement
coopérativement copieusement coquettement coquinement cordialement coriacement
coronairement corporativement corporellement corpusculairement correct
correctionnellement corrosivement corrélativement cosmiquement
cosmographiquement cosmologiquement cossardement cotonneusement couardement
courageusement couramment court courtement courtoisement coutumièrement
craintivement crapuleusement crescendo criardement criminellement
critiquablement critiquement croyez-en crucialement cruellement
crânement crédiblement crédulement crépusculairement crétinement crûment
cuistrement culinairement cultuellement culturellement cumulativement
curativement curieusement cursivement curvilignement cybernétiquement
cylindriquement cyniquement cynégétiquement cytogénétiquement cytologiquement
célestement célibatairement cérébralement cérémoniellement cérémonieusement
d'abondance d'abord d'ailleurs d'après d'arrache-pied d'avance d'emblée d'ici
d'office d'urgence d'évidence dactylographiquement damnablement dangereusement
debout decimo decrescendo dedans dehors demain densément depuis derechef
dernièrement derrière descriptivement despotiquement deusio deuxièmement
devant dextrement dextrorse dextrorsum diablement diaboliquement
diacoustiquement diagonalement dialectalement dialectiquement
dialogiquement diamétralement diantrement diatoniquement dichotomiquement
didactiquement difficilement difficultueusement diffusément différemment
digitalement dignement dilatoirement diligemment dimanche dimensionnellement
dinguement diplomatiquement directement directo disciplinairement
discontinûment discourtoisement discriminatoirement discrètement
discursivement disertement disgracieusement disjonctivement disons-le
dispendieusement disproportionnellement disproportionnément dissemblablement
dissuasivement dissymétriquement distinctement distinctivement distraitement
distributivement dithyrambiquement dito diurnement diversement divinement
dixièmement diététiquement docilement docimologiquement doctement
doctrinairement doctrinalement documentairement dodécaphoniquement
dogmatiquement dolce dolcissimo dolemment dolentement dolosivement
dommageablement donc dont doriquement dorsalement dorénavant doublement
doucereusement doucettement douceâtrement douillettement douloureusement
doux douzièmement draconiennement dramatiquement drastiquement
droit droitement drolatiquement dru drument drôlement dubitativement dur
durement dynamiquement dynamogéniquement dysharmoniquement débilement
décadairement décemment décidément décimalement décisivement déclamatoirement
dédaigneusement déductivement défavorablement défectueusement défensivement
définitivement dégoûtamment dégressivement dégueulassement déictiquement déjà
délibérément délicatement délicieusement déloyalement délétèrement
démentiellement démesurément démocratiquement démographiquement démoniaquement
démonstrativement démotiquement déontologiquement départementalement
déplaisamment déplorablement dépressivement dépréciativement déraisonnablement
dérivationnellement dérogativement dérogatoirement désagréablement
désastreusement désavantageusement désespéramment désespérément déshonnêtement
désobligeamment désolamment désordonnément désormais déterminément
dévotement dévotieusement dûment ecclésiastiquement ecclésiologiquement
efficacement effrayamment effrontément effroyablement effrénément
elliptiquement emblématiquement embryologiquement embryonnairement
emphatiquement empiriquement encor encore encyclopédiquement
endémiquement enfantinement enfin enjôleusement ennuyeusement ensemble
ensuite enthousiastement entièrement entomologiquement enviablement
environ ergonomiquement erratiquement erronément eschatologiquement
espressivo essentiellement esthétiquement estimablement etc ethniquement
ethnolinguistiquement ethnologiquement eucharistiquement euphoniquement
euphémiquement euristiquement européennement eurythmiquement euréka exactement
exaspérément excellemment excentriquement exceptionnellement excepté
exclamativement exclusivement excédentairement exemplairement exhaustivement
existentiellement exorbitamment exotiquement expansivement expertement
explicativement explicitement explosivement explétivement exponentiellement
expressément exprès expéditivement expérimentalement exquisément extatiquement
extensionnellement extensivement extra-muros extrajudiciairement
extravagamment extrinsèquement extrêmement extérieurement exécrablement
exégétiquement fabuleusement facheusement facile facilement facticement
factitivement factuellement facultativement facétieusement fadassement
faiblardement faiblement fallacieusement falotement fameusement familialement
faméliquement fanatiquement fanfaronnement fangeusement fantaisistement
fantasmatiquement fantasquement fantastiquement fantomatiquement
faramineusement faraudement farouchement fascistement fashionablement
fastueusement fatalement fatalistement fatidiquement faussement fautivement
favorablement ferme fermement ferroviairement fertilement fervemment
fichtrement fichument fichûment fictivement fiduciairement fidèlement
figurativement figurément filandreusement filialement filmiquement fin
financièrement finaudement finement finiment fiscalement fissa fixement
fiévreusement flagorneusement flasquement flatteusement flegmatiquement
flexionnellement flexueusement flou fluidement flâneusement flémardement
foireusement folkloriquement follement folâtrement foncièrement
fondamentalement forcément forestièrement forfaitairement formellement
fort forte fortement fortissimo fortuitement fougueusement fourbement
foutument foutûment fragilement fragmentairement frais franc franchement
fraternellement frauduleusement fraîchement frigidement frigo frileusement
frisquet frivolement froidement frontalement froussardement fructueusement
frustement frénétiquement fréquemment frêlement fugacement fugitivement
fumeusement funestement funèbrement funérairement furibardement furibondement
furioso furtivement futilement futurement fâcheusement fébrilement fécondement
félinement félonnement fémininement féodalement férocement fétidement
gaffeusement gaiement gaillardement galamment gallicanement galvaniquement
gammathérapiquement ganglionnairement gargantualement gastronomiquement
gauloisement gaîment geignardement gentement gentiment gestuellement
giratoirement glacialement glaciologiquement glaireusement glandulairement
globalement glorieusement gloutonnement gnostiquement gnoséologiquement
goguenardement goinfrement goniométriquement gothiquement gouailleusement
goulûment gourdement gourmandement goutteusement gouvernementalement
gracilement gracioso graduellement grammaticalement grand grandement
graphiquement graphologiquement gras grassement gratis gratuitement grave
gravement grazioso grincheusement grivoisement grièvement grossement
grotesquement grégairement guillerettement gutturalement guère guères
gyroscopiquement gâteusement gélatineusement génialement génitalement
généralement généreusement génériquement génétiquement géodynamiquement
géographiquement géologiquement géométralement géométriquement géophysiquement
habilement habituellement hagardement hagiographiquement haineusement
hargneusement harmonieusement harmoniquement hasardeusement haut hautainement
haïssablement hebdomadairement heptagonalement herméneutiquement
heureusement heuristiquement hexagonalement hexaédriquement hideusement hier
hippiatriquement hippiquement hippologiquement histologiquement historiquement
hiératiquement hiéroglyphiquement homocentriquement homographiquement
homologiquement homothétiquement homéopathiquement homériquement honnêtement
honorifiquement honteusement horizontalement hormis hormonalement horriblement
hospitalièrement hostilement houleusement huileusement huitièmement
humanitairement humblement humidement humoristiquement humoureusement
hydrauliquement hydrodynamiquement hydrographiquement hydrologiquement
hydropneumatiquement hydrostatiquement hydrothérapiquement hygiéniquement
hypercorrectement hypnotiquement hypocondriaquement hypocoristiquement
hypodermiquement hypostatiquement hypothécairement hypothétiquement
hâtivement hébraïquement héliaquement hélicoïdalement héliographiquement
hémiédriquement hémodynamiquement hémostatiquement héraldiquement héroïquement
hérétiquement hétéroclitement hétérodoxement hétérogènement ibidem
ici ici-bas iconiquement iconographiquement id idem identiquement
idiosyncrasiquement idiosyncratiquement idiotement idoinement idolatriquement
idylliquement idéalement idéalistement idéellement idéographiquement
ignarement ignoblement ignominieusement ignoramment illicitement illico
illogiquement illusoirement illustrement illégalement illégitimement
imaginativement imbattablement imbécilement immaculément immanquablement
immensément imminemment immobilement immodestement immodérément immondement
immortellement immuablement immunitairement immunologiquement immédiatement
immémorialement impairement impalpablement imparablement impardonnablement
impartialement impassiblement impatiemment impavidement impayablement
impensablement imperceptiblement impersonnellement impertinemment
impitoyablement implacablement implicitement impoliment impolitiquement
importunément impossiblement imprescriptiblement impressivement improbablement
impromptu improprement imprudemment imprécisément imprévisiblement impudemment
impulsivement impunément impurement impénétrablement impérativement
impérieusement impérissablement impétueusement inacceptablement
inactivement inadmissiblement inadéquatement inaliénablement inaltérablement
inamoviblement inappréciablement inassouvissablement inattaquablement
inaudiblement inauguralement inauthentiquement inavouablement incalculablement
incertainement incessamment incestueusement incidemment incisivement
inciviquement inclusivement incoerciblement incognito incommensurablement
incommutablement incomparablement incomplètement incompréhensiblement
inconcevablement inconciliablement inconditionnellement inconfortablement
inconsciemment inconsidérément inconsolablement inconstamment
inconséquemment incontestablement incontinent incontournablement
inconvenablement incorporellement incorrectement incorrigiblement
increvablement incroyablement incrédulement incurablement indescriptiblement
indicativement indiciairement indiciblement indifféremment indigemment
indignement indirectement indiscernablement indiscontinûment indiscrètement
indispensablement indissociablement indissolublement indistinctement
indivisiblement indivisément indocilement indolemment indomptablement
inductivement indulgemment industriellement industrieusement indécelablement
indéchiffrablement indécrottablement indéfectiblement indéfendablement
indéfinissablement indélicatement indélébilement indémontablement
indépassablement indépendamment indéracinablement indésirablement
indéterminément indévotement indûment ineffablement ineffaçablement
ineptement inertement inespérément inesthétiquement inestimablement
inexcusablement inexorablement inexpertement inexpiablement inexplicablement
inexprimablement inexpugnablement inextinguiblement inextirpablement
infailliblement infantilement infatigablement infectement infernalement
infimement infiniment infinitésimalement inflexiblement informatiquement
infra infructueusement infâmement inférieurement inglorieusement ingratement
ingénieusement ingénument inhabilement inhabituellement inharmonieusement
inhospitalièrement inhumainement inhéremment inimaginablement inimitablement
inintelligiblement inintentionnellement iniquement initialement
injurieusement injustement injustifiablement inlassablement innocemment
inoffensivement inopinément inopportunément inoubliablement inoxydablement
inquisitorialement inquiètement insaisissablement insalubrement insanement
insciemment insensiblement insensément insidieusement insignement
insipidement insolemment insolitement insolublement insondablement
insoucieusement insoupçonnablement insoutenablement instablement instamment
instinctivement institutionnellement instructivement instrumentalement
insulairement insupportablement insurmontablement insurpassablement
inséparablement intangiblement intarissablement intellectuellement
intelligiblement intempestivement intemporellement intenablement
intensivement intensément intentionnellement intercalairement
interlinéairement interlopement interminablement intermusculairement
interplanétairement interprofessionnellement interprétativement
intersyndicalement intervocaliquement intimement intolérablement
intraitablement intramusculairement intransitivement intraveineusement
introspectivement intrépidement intuitivement intègrement intégralement
intérimairement inutilement invalidement invariablement inventivement
invinciblement inviolablement invisiblement involontairement
invulnérablement inébranlablement inégalablement inégalement inégalitairement
inélégamment inénarrablement inépuisablement inéquitablement inévitablement
ironiquement irraisonnablement irrationnellement irrattrapablement
irrespectueusement irrespirablement irresponsablement irréconciliablement
irrécusablement irréductiblement irréellement irréfragablement irréfutablement
irréligieusement irrémissiblement irrémédiablement irréparablement
irréprochablement irrépréhensiblement irrésistiblement irrésolument
irrévocablement irrévéremment irrévérencieusement isolément isothermiquement
isoédriquement item itou itérativement jacobinement jadis jalousement jamais
jaune jeudi jeunement jobardement jointivement joliment journalistiquement
jovialement joyeusement judaïquement judiciairement judicieusement
juste justement justifiablement juvénilement juxtalinéairement jésuitement
kaléidoscopiquement kilométriquement l'année l'après-midi l'avant-veille
labialement laborieusement labyrinthiquement laconiquement lactiquement
ladrement laidement laiteusement lamentablement langagièrement langoureusement
languissamment lapidairement large largement larghetto largo lascivement
latinement latéralement laxistement laïquement legato lentement lento lerch
lestement lexicalement lexicographiquement lexicologiquement libertinement
librement libéralement licencieusement licitement ligamentairement
limitativement limpidement linguistiquement linéairement linéalement
lisiblement lithographiquement lithologiquement litigieusement littérairement
liturgiquement lividement livresquement localement logarithmiquement
logistiquement logographiquement loin lointainement loisiblement long
longitudinalement longtemps longuement loquacement lors louablement
louchement loufoquement lourd lourdaudement lourdement loyalement lubriquement
lucrativement ludiquement lugubrement lumineusement lunairement lunatiquement
lustralement luxueusement luxurieusement lymphatiquement lyriquement là-bas
là-dessous là-dessus là-haut lâchement légalement légendairement léger
légitimement légèrement léthargiquement macabrement macache macaroniquement
machinalement macrobiotiquement macroscopiquement maestoso magiquement
magnanimement magnifiquement magnétiquement magnétohydrodynamiquement
maigrement maintenant majestueusement majoritairement mal maladivement
malaisément malcommodément malencontreusement malgracieusement malgré
malheureusement malhonnêtement malicieusement malignement malproprement
malveillamment maléfiquement maniaquement manifestement manuellement mardi
maritalement maritimement marmiteusement marotiquement marre martialement
masochistement massivement maternellement mathématiquement matin matinalement
matriarcalement matrilinéairement matrimonialement maturément matérialistement
maupiteusement mauresquement maussadement mauvais mauvaisement maxi
meilleur mensongèrement mensuellement mentalement menteusement menu
mercredi merdeusement merveilleusement mesquinement mesurément mezzo
micrographiquement micrométriquement microphysiquement microscopiquement
miette mieux mignardement mignonnement militairement millimétriquement
minablement mincement minimement ministériellement minoritairement
minutieusement minéralogiquement miraculeusement mirifiquement mirobolamment
misanthropiquement misogynement misérablement miséreusement
miteusement mièvrement mnémoniquement mnémotechniquement mobilièrement
modalement moderato modernement modestement modiquement modulairement
moelleusement moindrement moins mollassement mollement mollo molto
momentanément monacalement monarchiquement monastiquement mondainement
monocordement monographiquement monolithiquement monophoniquement monotonement
monumentalement monétairement moqueusement moralement moralistement
mordicus morganatiquement mornement morosement morphologiquement mortellement
morveusement moult moutonnièrement moyennant moyennement moyenâgeusement
multidisciplinairement multilatéralement multinationalement multiplement
multipolairement municipalement musculairement musculeusement musicalement
mutuellement mystiquement mystérieusement mythiquement mythologiquement
mécanographiquement méchamment médialement médiatement médiatiquement
médicinalement médiocrement méditativement mélancoliquement mélodieusement
mélodramatiquement mémorablement méphistophéliquement méprisablement
méritoirement métaboliquement métalinguistiquement métalliquement
métallurgiquement métalogiquement métamathématiquement métaphoriquement
méthodiquement méthodologiquement méticuleusement métonymiquement métriquement
météorologiquement même mêmement mûrement n'étant naguère narcissiquement
narrativement nasalement nasillardement natalement nationalement nativement
naturellement naïvement ne nenni nerveusement net nettement
neurolinguistiquement neurologiquement neurophysiologiquement
neutrement neuvièmement niaisement nib nigaudement noblement nocivement
noirement nomadement nombreusement nominalement nominativement nommément non
nonobstant noologiquement normalement normativement nostalgiquement
notamment notarialement notoirement nouménalement nouvellement noétiquement
nucléairement nuisiblement nuitamment nullement numismatiquement numériquement
nuptialement néanmoins nébuleusement nécessairement néfastement négatif
négligemment néologiquement névrotiquement nûment objectivement oblativement
oblige obligeamment obliquement obrepticement obscurément obscènement
obsessivement obstinément obséquieusement obtusément obèsement
occultement octogonalement oculairement océanographiquement odieusement
oenologiquement offensivement officiellement officieusement oiseusement
olfactivement oligarchiquement ombrageusement onc oncques onctueusement
oniriquement onomatopéiquement onques ontologiquement onzièmement onéreusement
ophtalmologiquement opiniâtrement opportunistement opportunément opposément
optimalement optimistement optionnellement optiquement opulemment
opératoirement orageusement oralement oratoirement orbiculairement
ordinairement ordurièrement ores organiquement organoleptiquement orgiaquement
orgueilleusement orientalement originairement originalement originellement
orographiquement orthodoxement orthogonalement orthographiquement
orthopédiquement osmotiquement ostensiblement ostentatoirement oublieusement
out outrageusement outrancièrement outre outre-atlantique outre-mer
outrecuidamment ouvertement ovalement oviparement ovoviviparement
pacifiquement paillardement pairement paisiblement palingénésiquement
paléobotaniquement paléographiquement paléontologiquement panoptiquement
pantagruéliquement papelardement paraboliquement paradigmatiquement
paralittérairement parallactiquement parallèlement paramilitairement
parasitairement parcellairement parcellement parcimonieusement pardon
paresseusement parfaitement parfois parisiennement paritairement
parodiquement paroxistiquement paroxystiquement partant parthénogénétiquement
particulièrement partiellement partout pas passablement passagèrement passim
passionnément passivement passé pastoralement pataudement patelinement
paternellement paternement pathologiquement pathétiquement patibulairement
patriarcalement patrilinéairement patrimonialement patriotiquement pauvrement
païennement peinardement peineusement penaudement pendablement pendant
pensivement pentatoniquement perceptiblement perceptivement perdurablement
permissivement pernicieusement perpendiculairement perplexement
persifleusement perso personnellement perspicacement persuasivement
pertinemment perversement pesamment pessimistement petit petitement peu
peut-être pharisaïquement pharmacologiquement philanthropement
philistinement philologiquement philosophiquement phobiquement phoniquement
phonologiquement phonématiquement phonémiquement phonétiquement
photographiquement photométriquement phrénologiquement phylogénétiquement
physiologiquement physionomiquement physiquement phénoménalement
phénoménologiquement pianissimo pianistiquement piano pictographiquement
pieusement pile pinailleusement pingrement piteusement pitoyablement
piètrement più placidement plaignardement plaintivement plaisamment
plantureusement planétairement plastiquement plat platement platoniquement
plein pleinement pleutrement pluriannuellement pluridisciplinairement
plurinationalement plus plutoniquement plutôt plébéiennement plénièrement
pléthoriquement pneumatiquement point pointilleusement pointu poisseusement
poliment polissonnement politiquement poltronnement polygonalement
polyédriquement polémiquement pomologiquement pompeusement ponctuellement
pontificalement populairement pornographiquement positionnellement
possessivement possessoirement possiblement posthumement posthumément
postérieurement posément potablement potentiellement pourquoi pourtant
poétiquement pragmatiquement pratiquement premièrement presque prestement
prestissimo presto primairement primesautièrement primitivement primo
principalement princièrement printanièrement prioritairement privativement
probablement probement problématiquement processionnellement prochainement
proconsulairement prodigalement prodigieusement prodiguement productivement
professionnellement professoralement profitablement profond profondément
progressivement projectivement proleptiquement prolifiquement prolixement
promptement pronominalement prophylactiquement prophétiquement propicement
proportionnellement proportionnément proprement prosaïquement prosodiquement
prospèrement protocolairement protohistoriquement prou proverbialement
provincialement provisionnellement provisoirement prudement prudemment
préalablement précairement précautionneusement précieusement précipitamment
précisément précocement précédemment préférablement préférentiellement
préjudiciablement préliminairement prélogiquement prématurément
prépositivement préscolairement présentement présomptivement présomptueusement
présumément prétendument prétentieusement préventivement prévisionnellement
psychanalytiquement psychiatriquement psychiquement psycholinguistiquement
psychométriquement psychopathologiquement psychophysiologiquement
psychosomatiquement psychothérapiquement puamment publicitairement
pudibondement pudiquement pugnacement puissamment pulmonairement
purement puritainement pusillanimement putainement putassièrement putativement
pyramidalement pyrométriquement pâlement pâteusement pécuniairement
pédamment pédantement pédantesquement pédestrement péjorativement pénalement
péniblement pénitentiairement pépèrement péremptoirement périlleusement
périphériquement périscolairement pétrochimiquement pétrographiquement
pêle-mêle quadrangulairement quadrimestriellement quadruplement
quand quantitativement quarantièmement quarto quasi quasiment quater
quatrièmement quellement quelque quelquefois question quinto quinzièmement
quiètement quoique quotidiennement racialement racoleusement radiairement
radicalement radieusement radinement radiographiquement radiologiquement
radiotélégraphiquement radioélectriquement rageusement raidement railleusement
rapacement rapidement rapido rapidos rarement rarissimement ras rasibus
recevablement reconventionnellement recta rectangulairement rectilignement
redoutablement regrettablement relativement religieusement remarquablement
reproductivement représentativement respectablement respectivement
restrictivement revêchement rhomboédriquement rhéologiquement rhétoriquement
richissimement ridiculement rieusement rigidement rigoureusement rinforzando
risiblement ritardando rituellement robustement rocailleusement
rogatoirement roguement roidement romainement romancièrement romanesquement
rond rondement rondouillardement rosement rossement rotativement roturièrement
rougement routinièrement royalement rubato rudement rudimentairement
ruralement rustaudement rustiquement rustrement rythmiquement
réactionnairement réalistement rébarbativement récemment réciproquement
rédhibitoirement réellement réflexivement réfractairement référendairement
régionalement réglementairement réglo régressivement régulièrement
répréhensiblement répulsivement répétitivement résidentiellement
résineusement résolument rétivement rétroactivement rétrospectivement
révocablement révolutionnairement révéremment révérencieusement rêveusement
sacerdotalement sacramentellement sacrilègement sacrément sadiquement
sagement sagittalement sainement saintement saisonnièrement salacement
salaudement salement salubrement salutairement samedi sanguinairement
saphiquement sarcastiquement sardoniquement sataniquement satiriquement
satyriquement sauf saumâtrement sauvagement savamment savoureusement
scalairement scandaleusement scatologiquement sceptiquement scherzando scherzo
schématiquement sciemment scientifiquement scolairement scolastiquement
sculpturalement scélératement scéniquement sec secondairement secondement
secrètement sectairement sectoriellement secundo seigneurialement seizièmement
semestriellement sempiternellement senestrorsum sensationnellement
sensuellement sensément sentencieusement sentimentalement septentrionalement
septièmement sereinement serré serviablement servilement seul seulement sexto
sforzando si sibyllinement sic sidéralement sidérurgiquement significativement
similairement simplement simultanément sincèrement singulièrement sinistrement
sinon sinueusement siouxement sirupeusement sitôt sixièmement smorzando
sobrement sociablement socialement sociolinguistiquement sociologiquement
socratiquement soigneusement soit soixantièmement solairement soldatesquement
solidairement solidement solitairement somatiquement sombrement sommairement
somptuairement somptueusement songeusement sonorement sophistiquement
sordidement sororalement sostenuto sottement soucieusement soudain
souhaitablement souplement soupçonneusement sourcilleusement sourdement
souterrainement souvent souventefois souverainement soyeusement spacieusement
spasmodiquement spatialement spectaculairement spectralement sphériquement
splendidement spontanément sporadiquement sportivement spécialement
spécifiquement spéculairement spéculativement spéléologiquement stablement
staliniennement stationnairement statiquement statistiquement statutairement
stoechiométriquement stoïquement stratigraphiquement stratégiquement
stridemment strophiquement structuralement structurellement studieusement
stylistiquement sténographiquement stérilement stéréographiquement
stéréophoniquement suavement subalternement subconsciemment subitement subito
sublimement subordinément subordonnément subrepticement subrogatoirement
substantiellement substantivement subséquemment subtilement subversivement
succinctement succulemment suffisamment suggestivement suicidairement
superbement superficiellement superfinement superfétatoirement superlativement
supplémentairement supplétivement supportablement supposément supra
suprêmement supérieurement surabondamment surhumainement surnaturellement
surréellement surtout surérogatoirement sus suspectement suspicieusement
syllabiquement syllogistiquement symbiotiquement symboliquement
symphoniquement symptomatiquement symétriquement synchroniquement
syndicalement synergiquement synonymiquement synoptiquement syntactiquement
syntaxiquement synthétiquement systématiquement systémiquement sèchement
séculièrement sédentairement séditieusement sélectivement sélénographiquement
sémiologiquement sémiotiquement sémiquement séméiotiquement sénilement
séquentiellement séraphiquement sériellement sérieusement sérologiquement
sûr sûrement tabulairement tacitement taciturnement tactilement tactiquement
talmudiquement tangentiellement tangiblement tant tantôt tapageusement
tard tardivement tatillonnement tauromachiquement tautologiquement
taxinomiquement taxonomiquement techniquement technocratiquement
tectoniquement teigneusement tellement temporairement temporellement
tenacement tendanciellement tendancieusement tendrement tennistiquement
tenuto ter terminologiquement ternement terrible terriblement territorialement
testimonialement texto textuellement thermiquement thermodynamiquement
thermonucléairement thermoélectriquement thématiquement théocratiquement
théologiquement théoriquement théosophiquement thérapeutiquement thétiquement
timidement titulairement tièdement tiédassement tocardement tolérablement
toniquement topiquement topographiquement topologiquement toponymiquement
torrentueusement torridement tortueusement torvement totalement
toujours touristiquement tout toute toutefois toxicologiquement
traditionnellement tragediante tragiquement tranquillement
transcendantalement transformationnellement transgressivement transitivement
transversalement traumatologiquement traînardement traîtreusement
trentièmement triangulairement tribalement tridimensionnellement
trihebdomadairement trimestriellement triomphalement triplement
tristement trivialement troisio troisièmement trompeusement trop
très ttc tumultuairement tumultueusement turpidement tutélairement typiquement
typologiquement tyranniquement télescopiquement téléautographiquement
téléinformatiquement télématiquement téléologiquement télépathiquement
télévisuellement témérairement ténébreusement tératologiquement tétaniquement
tétraédriquement tôt ultimement ultimo ultérieurement unanimement uniformément
unilinéairement uniment uninominalement unipolairement uniquement unitairement
universitairement univoquement unièmement urbainement urgemment urologiquement
usurairement utilement utilitairement utopiquement uvulairement vachardement
vaginalement vaguement vaillamment vainement valablement valeureusement
vaniteusement vantardement vaporeusement variablement vasculairement
vasouillardement vastement velléitairement vendredi venimeusement ventralement
verbeusement vernaculairement versatilement vertement verticalement
vertueusement verveusement vestimentairement veulement vicieusement
vieillottement vigesimo vigilamment vigoureusement vilain vilainement vilement
vingtièmement violemment virginalement virilement virtuellement virulemment
viscéralement visiblement visqueusement visuellement vitalement vite vitement
vivacement vivement viviparement vocalement vocaliquement voir voire
volcaniquement volcanologiquement volontairement volontiers volubilement
voluptueusement voracement voyez-vous vrai vraiment vraisemblablement
vulgo végétalement végétativement véhémentement vélairement vélocement
véniellement vénéneusement vénérablement véracement véridiquement
vésaniquement vétilleusement vétustement xylographiquement xérographiquement
zootechniquement âcrement âprement çà échocardiographiquement
échométriquement éclatamment éclectiquement écologiquement économement
économétriquement édéniquement également égalitairement égocentriquement
égrillardement éhontément élastiquement électivement électoralement
électrocardiographiquement électrochimiquement électrodynamiquement
électromagnétiquement électromécaniquement électroniquement
électropneumatiquement électrostatiquement électrotechniquement
éliminatoirement élitistement élogieusement éloquemment élégamment
élémentairement éminemment émotionnellement émotivement énergiquement
énigmatiquement énièmement énormément épais épaissement éparsement épatamment
éphémèrement épicuriennement épidermiquement épidémiologiquement
épigrammatiquement épigraphiquement épileptiquement épiquement épiscopalement
épistolairement épistémologiquement épouvantablement équitablement
équivoquement érotiquement éruditement éruptivement érémitiquement
étatiquement éternellement éthiquement éthologiquement étonnamment étourdiment
étroitement étymologiquement évangéliquement évasivement éventuellement
""".split())


@@ -0,0 +1,86 @@
# coding: utf8
from __future__ import unicode_literals
AUXILIARY_VERBS_IRREG = {
"suis": ("être",),
"es": ("être",),
"est": ("être",),
"sommes": ("être",),
"êtes": ("être",),
"sont": ("être",),
"étais": ("être",),
"étais": ("être",),
"était": ("être",),
"étions": ("être",),
"étiez": ("être",),
"étaient": ("être",),
"fus": ("être",),
"fut": ("être",),
"fûmes": ("être",),
"fûtes": ("être",),
"furent": ("être",),
"serai": ("être",),
"seras": ("être",),
"sera": ("être",),
"serons": ("être",),
"serez": ("être",),
"seront": ("être",),
"serais": ("être",),
"serait": ("être",),
"serions": ("être",),
"seriez": ("être",),
"seraient": ("être",),
"sois": ("être",),
"soit": ("être",),
"soyons": ("être",),
"soyez": ("être",),
"soient": ("être",),
"fusse": ("être",),
"fusses": ("être",),
"fût": ("être",),
"fussions": ("être",),
"fussiez": ("être",),
"fussent": ("être",),
"étant": ("être",),
"ai": ("avoir",),
"as": ("avoir",),
"a": ("avoir",),
"avons": ("avoir",),
"avez": ("avoir",),
"ont": ("avoir",),
"avais": ("avoir",),
"avait": ("avoir",),
"avions": ("avoir",),
"aviez": ("avoir",),
"avaient": ("avoir",),
"eus": ("avoir",),
"eut": ("avoir",),
"eûmes": ("avoir",),
"eûtes": ("avoir",),
"eurent": ("avoir",),
"aurai": ("avoir",),
"auras": ("avoir",),
"aura": ("avoir",),
"aurons": ("avoir",),
"aurez": ("avoir",),
"auront": ("avoir",),
"aurais": ("avoir",),
"aurait": ("avoir",),
"aurions": ("avoir",),
"auriez": ("avoir",),
"auraient": ("avoir",),
"aie": ("avoir",),
"aies": ("avoir",),
"ait": ("avoir",),
"ayons": ("avoir",),
"ayez": ("avoir",),
"aient": ("avoir",),
"eusse": ("avoir",),
"eusses": ("avoir",),
"eût": ("avoir",),
"eussions": ("avoir",),
"eussiez": ("avoir",),
"eussent": ("avoir",),
"ayant": ("avoir",)
}
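
Each entry above maps an inflected surface form to a tuple of lemmas; a tuple rather than a bare string leaves room for forms that could in principle belong to more than one lemma. A quick illustration of how the table is meant to be consulted (plain dict lookups, using only names from this diff):

# Irregular forms are resolved by direct lookup, bypassing the suffix rules.
print(AUXILIARY_VERBS_IRREG["sommes"])       # ('être',)
print(AUXILIARY_VERBS_IRREG["ayant"])        # ('avoir',)
# Forms missing from the table fall through to the rule-based path
# (see VERB_RULES and the lemmatize() helper later in this diff).
print(AUXILIARY_VERBS_IRREG.get("parlons"))  # None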


@@ -0,0 +1,49 @@
# coding: utf8
from __future__ import unicode_literals
DETS_IRREG = {
"aucune": ("aucun",),
"ces": ("ce",),
"cet": ("ce",),
"cette": ("ce",),
"cents": ("cent",),
"certaines": ("certains",),
"différentes": ("différents",),
"diverses": ("divers",),
"la": ("le",),
"les": ("le",),
"l'": ("le",),
"laquelle": ("lequel",),
"lesquelles": ("lequel",),
"lesquels": ("lequel",),
"leurs": ("leur",),
"mainte": ("maint",),
"maintes": ("maint",),
"maints": ("maint",),
"ma": ("mon",),
"mes": ("mon",),
"nos": ("notre",),
"nulle": ("nul",),
"nulles": ("nul",),
"nuls": ("nul",),
"quelle": ("quel",),
"quelles": ("quel",),
"quels": ("quel",),
"quelqu'": ("quelque",),
"quelques": ("quelque",),
"sa": ("son",),
"ses": ("son",),
"telle": ("tel",),
"telles": ("tel",),
"tels": ("tel",),
"ta": ("ton",),
"tes": ("ton",),
"tous": ("tout",),
"toute": ("tout",),
"toutes": ("tout",),
"des": ("un",),
"une": ("un",),
"vingts": ("vingt",),
"vos": ("votre",)
}


@@ -0,0 +1,56 @@
# coding: utf8
from __future__ import unicode_literals
ADJECTIVE_RULES = [
["s", ""],
["e", ""],
["es", ""]
]
NOUN_RULES = [
["s", ""]
]
VERB_RULES = [
["é", "er"],
["és", "er"],
["ée", "er"],
["ées", "er"],
["é", "er"],
["es", "er"],
["ons", "er"],
["ez", "er"],
["ent", "er"],
["ais", "er"],
["ait", "er"],
["ions", "er"],
["iez", "er"],
["aient", "er"],
["ai", "er"],
["as", "er"],
["a", "er"],
["âmes", "er"],
["âtes", "er"],
["èrent", "er"],
["erai", "er"],
["eras", "er"],
["era", "er"],
["erons", "er"],
["erez", "er"],
["eront", "er"],
["erais", "er"],
["erait", "er"],
["erions", "er"],
["eriez", "er"],
["eraient", "er"],
["asse", "er"],
["asses", "er"],
["ât", "er"],
["assions", "er"],
["assiez", "er"],
["assent", "er"],
["ant", "er"]
]
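
Each rule above is a [suffix, replacement] pair, and several rules can match the same token, so the caller has to validate whatever candidates the rewriting produces. A minimal sketch of that step, in the spirit of the lemmatize() helper later in this diff (apply_rules is an illustrative name, not part of the PR):

def apply_rules(string, rules):
    # Strip the matched ending and substitute the replacement.
    return [string[:len(string) - len(old)] + new
            for old, new in rules if string.endswith(old)]

# "parlions" matches both ["ons", "er"] and ["ions", "er"]:
print(apply_rules("parlions", VERB_RULES))  # ['parlier', 'parler']
# With a populated index, lemmatize() keeps only candidates it can find
# there, which discards the bogus 'parlier' and keeps 'parler'.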

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -0,0 +1,40 @@
# coding: utf8
from __future__ import unicode_literals
PRONOUNS_IRREG = {
"aucune": ("aucun",),
"celle-ci": ("celui-ci",),
"celles-ci": ("celui-ci",),
"ceux-ci": ("celui-ci",),
"celle-là": ("celui-là",),
"celles-là": ("celui-là",),
"ceux-là": ("celui-là",),
"celle": ("celui",),
"celles": ("celui",),
"ceux": ("celui",),
"certaines": ("certains",),
"chacune": ("chacun",),
"icelle": ("icelui",),
"icelles": ("icelui",),
"iceux": ("icelui",),
"la": ("le",),
"les": ("le",),
"laquelle": ("lequel",),
"lesquelles": ("lequel",),
"lesquels": ("lequel",),
"elle-même": ("lui-même",),
"elles-mêmes": ("lui-même",),
"eux-mêmes": ("lui-même",),
"quelle": ("quel",),
"quelles": ("quel",),
"quels": ("quel",),
"quelques-unes": ("quelqu'un",),
"quelques-uns": ("quelqu'un",),
"quelque-une": ("quelqu'un",),
"qu": ("que",),
"telle": ("tel",),
"telles": ("tel",),
"tels": ("tel",),
"toutes": ("tous",),
}

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -0,0 +1,131 @@
# coding: utf8
from __future__ import unicode_literals
from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT
from ....symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
from .lookup import LOOKUP
'''
The French language lemmatizer applies the default rule-based lemmatization
procedure, with some modifications for better French language support.
The parts of speech 'ADV', 'PRON', 'DET' and 'AUX' are added to use the
rule-based lemmatization. As a last resort, the lemmatizer checks the
lookup table.
'''


class FrenchLemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None, lookup=None):
        return cls(index, exc, rules, lookup)

    def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
        self.index = index
        self.exc = exceptions
        self.rules = rules
        self.lookup_table = lookup if lookup is not None else {}

    def __call__(self, string, univ_pos, morphology=None):
        if not self.rules:
            return [self.lookup_table.get(string, string)]
        if univ_pos in (NOUN, 'NOUN', 'noun'):
            univ_pos = 'noun'
        elif univ_pos in (VERB, 'VERB', 'verb'):
            univ_pos = 'verb'
        elif univ_pos in (ADJ, 'ADJ', 'adj'):
            univ_pos = 'adj'
        elif univ_pos in (ADV, 'ADV', 'adv'):
            univ_pos = 'adv'
        elif univ_pos in (PRON, 'PRON', 'pron'):
            univ_pos = 'pron'
        elif univ_pos in (DET, 'DET', 'det'):
            univ_pos = 'det'
        elif univ_pos in (AUX, 'AUX', 'aux'):
            univ_pos = 'aux'
        elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
            univ_pos = 'punct'
        else:
            return [self.lookup(string)]
        # See Issue #435 for example of where this logic is required.
        if self.is_base_form(univ_pos, morphology):
            return list(set([string.lower()]))
        lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                           self.exc.get(univ_pos, {}),
                           self.rules.get(univ_pos, []))
        return lemmas

    def is_base_form(self, univ_pos, morphology=None):
        """
        Check whether we're dealing with an uninflected paradigm, so we can
        avoid lemmatization entirely.
        """
        morphology = {} if morphology is None else morphology
        others = [key for key in morphology
                  if key not in (POS, 'Number', 'POS', 'VerbForm', 'Tense')]
        if univ_pos == 'noun' and morphology.get('Number') == 'sing':
            return True
        elif univ_pos == 'verb' and morphology.get('VerbForm') == 'inf':
            return True
        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
        # morphology
        elif univ_pos == 'verb' and (morphology.get('VerbForm') == 'fin' and
                                     morphology.get('Tense') == 'pres' and
                                     morphology.get('Number') is None and
                                     not others):
            return True
        elif univ_pos == 'adj' and morphology.get('Degree') == 'pos':
            return True
        elif VerbForm_inf in morphology:
            return True
        elif VerbForm_none in morphology:
            return True
        elif Number_sing in morphology:
            return True
        elif Degree_pos in morphology:
            return True
        else:
            return False

    def noun(self, string, morphology=None):
        return self(string, 'noun', morphology)

    def verb(self, string, morphology=None):
        return self(string, 'verb', morphology)

    def adj(self, string, morphology=None):
        return self(string, 'adj', morphology)

    def punct(self, string, morphology=None):
        return self(string, 'punct', morphology)

    def lookup(self, string):
        if string in self.lookup_table:
            return self.lookup_table[string]
        return string


def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    if string in index:
        forms.append(string)
        return forms
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    if not forms:
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms and string in LOOKUP:
        forms.append(LOOKUP[string])
    if not forms:
        forms.append(string)
    return list(set(forms))
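
A short usage sketch tying the pieces above together. The class and table names are the ones introduced in this diff; the index contents are illustrative stand-ins, since the real index and lookup data files are among the suppressed diffs:

# Hypothetical wiring of the lemmatizer from the tables in this PR.
index = {'verb': {'parler'}}
exc = {'verb': AUXILIARY_VERBS_IRREG, 'det': DETS_IRREG}
rules = {'verb': VERB_RULES, 'adj': ADJECTIVE_RULES, 'noun': NOUN_RULES}

lemmatizer = FrenchLemmatizer(index=index, exceptions=exc, rules=rules)

print(lemmatizer('étions', 'verb'))   # ['être']   -- irregular-form table hit
print(lemmatizer('parlons', 'verb'))  # ['parler'] -- suffix rule, validated by the index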

File diff suppressed because it is too large.


@@ -55,6 +55,6 @@ def like_num(text):
LEX_ATTRS = {
    NORM: norm,
    LIKE_NUM: like_num
}


@@ -10,15 +10,18 @@ from .lex_attrs import LEX_ATTRS
from .syntax_iterators import SYNTAX_ITERATORS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
from .norm_exceptions import NORM_EXCEPTIONS


class IndonesianDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'id'
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                         BASE_NORMS, NORM_EXCEPTIONS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    prefixes = TOKENIZER_PREFIXES
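
For readers unfamiliar with add_lookups: it layers lookup tables over the default NORM getter, and the first table that contains the string wins. A behavioural sketch (add_lookups_sketch is an illustrative name, not spaCy's exact implementation):

def add_lookups_sketch(default, *lookups):
    def get_norm(string):
        for table in lookups:        # tables are consulted in order
            if string in table:
                return table[string]
        return default(string)       # fall back to the stock getter
    return get_norm

norm = add_lookups_sketch(lambda s: s.lower(), BASE_NORMS, NORM_EXCEPTIONS)
print(norm("silahkan"))  # 'silakan', via the NORM_EXCEPTIONS table later in this diff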


@@ -24,7 +24,7 @@ aci-acinya
aco-acoan
ad-blocker
ad-interim
ada-ada saja
ada-ada
ada-adanya
ada-adanyakah
adang-adang
@@ -243,7 +243,6 @@ bari-bari
barik-barik
baris-berbaris
baru-baru
baru-baru ini
baru-batu
barung-barung
basa-basi
@@ -1059,7 +1058,6 @@ box-to-box
boyo-boyo
buah-buahan
buang-buang
buang-buang air
buat-buatan
buaya-buaya
bubun-bubun
@@ -1226,7 +1224,6 @@ deg-degan
degap-degap
dekak-dekak
dekat-dekat
dengan -
dengar-dengaran
dengking-mendengking
departemen-departemen
@@ -1246,6 +1243,7 @@ dibayang-bayangi
dibuat-buat
diiming-imingi
dilebih-lebihkan
dimana-mana
dimata-matai
dinas-dinas
dinul-Islam
@@ -1278,6 +1276,57 @@ dulang-dulang
duri-duri
duta-duta
dwi-kewarganegaraan
e-arena
e-billing
e-budgeting
e-cctv
e-class
e-commerce
e-counting
e-elektronik
e-entertainment
e-evolution
e-faktur
e-filing
e-fin
e-form
e-government
e-govt
e-hakcipta
e-id
e-info
e-katalog
e-ktp
e-leadership
e-lhkpn
e-library
e-loket
e-m1
e-money
e-news
e-nisn
e-npwp
e-paspor
e-paten
e-pay
e-perda
e-perizinan
e-planning
e-polisi
e-power
e-punten
e-retribusi
e-samsat
e-sport
e-store
e-tax
e-ticketing
e-tilang
e-toll
e-visa
e-voting
e-wallet
e-warong
ecek-ecek
eco-friendly
eco-park
@@ -1440,7 +1489,25 @@ ginang-ginang
girap-girap
girik-girik
giring-giring
go-auto
go-bills
go-bluebird
go-box
go-car
go-clean
go-food
go-glam
go-jek
go-kart
go-mart
go-massage
go-med
go-points
go-pulsa
go-ride
go-send
go-shop
go-tix
go-to-market
goak-goak
goal-line
@@ -1488,7 +1555,6 @@ hang-out
hantu-hantu
happy-happy
harap-harap
harap-harap cemas
harap-harapan
hard-disk
harga-harga
@@ -1633,7 +1699,7 @@ jor-joran
jotos-jotosan
juak-juak
jual-beli
juang-juang !!? lenjuang
juang-juang
julo-julo
julung-julung
julur-julur
@@ -1787,6 +1853,7 @@ kemarah-marahan
kemasam-masaman
kemati-matian
kembang-kembang
kemenpan-rb
kementerian-kementerian
kemerah-merahan
kempang-kempis
@@ -1827,7 +1894,6 @@ keras-mengerasi
kercap-kercip
kercap-kercup
keriang-keriut
kering-kering air
kerja-kerja
kernyat-kernyut
kerobak-kerabit
@@ -1952,7 +2018,7 @@ kuda-kudaan
kudap-kudap
kue-kue
kulah-kulah
kulak-kulak tangan
kulak-kulak
kulik-kulik
kulum-kulum
kumat-kamit
@@ -2086,7 +2152,6 @@ lumba-lumba
lumi-lumi
luntang-lantung
lupa-lupa
lupa-lupa ingat
lupa-lupaan
lurah-camat
maaf-memaafkan
@@ -2097,6 +2162,7 @@ macan-macanan
machine-to-machine
mafia-mafia
mahasiswa-mahasiswi
mahasiswa/i
mahi-mahi
main-main
main-mainan
@@ -2185,14 +2251,14 @@ memandai-mandai
memanggil-manggil
memanis-manis
memanjut-manjut
memantas-mantas diri
memantas-mantas
memasak-masak
memata-matai
mematah-matah
mematuk-matuk
mematut-matut
memau-mau
memayah-mayahkan (diri)
memayah-mayahkan
membaca-baca
membacah-bacah
membagi-bagikan
@@ -2576,6 +2642,7 @@ meraung-raungkan
merayau-rayau
merayu-rayu
mercak-mercik
mercedes-benz
merek-merek
mereka-mereka
mereka-reka
@@ -2627,9 +2694,9 @@ morat-marit
move-on
muda-muda
muda-mudi
muda/i
mudah-mudahan
muka-muka
muka-muka (dengan -)
mula-mula
multiple-output
muluk-muluk
@@ -2791,6 +2858,7 @@ paus-paus
paut-memaut
pay-per-click
paya-paya
pdi-p
pecah-pecah
pecat-pecatan
peer-to-peer
@@ -2951,6 +3019,7 @@ putih-hitam
putih-putih
putra-putra
putra-putri
putra/i
putri-putri
putus-putus
putusan-putusan
@@ -3069,6 +3138,7 @@ sambung-bersambung
sambung-menyambung
sambut-menyambut
samo-samo
sampah-sampah
sampai-sampai
samping-menyamping
sana-sini
@@ -3204,7 +3274,7 @@ seolah-olah
sepala-pala
sepandai-pandai
sepetang-petangan
sepoi-sepoi (basa)
sepoi-sepoi
sepraktis-praktisnya
sepuas-puasnya
serak-serak
@@ -3278,6 +3348,7 @@ sisa-sisa
sisi-sisi
siswa-siswa
siswa-siswi
siswa/i
siswi-siswi
situ-situ
situs-situs
@@ -3380,6 +3451,7 @@ tanggul-tanggul
tanggung-menanggung
tanggung-tanggung
tank-tank
tante-tante
tanya-jawab
tapa-tapa
tapak-tapak
@@ -3424,7 +3496,6 @@ teralang-alang
terambang-ambang
terambung-ambung
terang-terang
terang-terang laras
terang-terangan
teranggar-anggar
terangguk-angguk
@@ -3438,7 +3509,6 @@ terayap-rayap
terbada-bada
terbahak-bahak
terbang-terbang
terbang-terbang hinggap
terbata-bata
terbatuk-batuk
terbayang-bayang


@@ -18199,7 +18199,6 @@ LOOKUP = {
'sekelap': 'kelap',
'kelap-kelip': 'terkelap',
'mengelapkan': 'lap',
'sekelap': 'terkelap',
'berlapar': 'lapar',
'kelaparan': 'lapar',
'kelaparannya': 'lapar',
@@ -30179,7 +30178,6 @@ LOOKUP = {
'terperonyok': 'peronyok',
'terperosok': 'perosok',
'terperosoknya': 'perosok',
'merosot': 'perosot',
'memerosot': 'perosot',
'memerosotkan': 'perosot',
'kepustakaan': 'pustaka',


@@ -1,7 +1,10 @@
# coding: utf8
from __future__ import unicode_literals

import unicodedata

from .punctuation import LIST_CURRENCY
from ...attrs import IS_CURRENCY, LIKE_NUM
_num_words = ['nol', 'satu', 'dua', 'tiga', 'empat', 'lima', 'enam', 'tujuh',
@@ -29,6 +32,17 @@ def like_num(text):
    return False


def is_currency(text):
    if text in LIST_CURRENCY:
        return True
    for char in text:
        if unicodedata.category(char) != 'Sc':
            return False
    return True


LEX_ATTRS = {
    IS_CURRENCY: is_currency,
    LIKE_NUM: like_num
}
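
The fallback branch of is_currency leans on Unicode general categories: every character of the token must be in category 'Sc' (Symbol, currency). A quick check of that logic with the standard library:

import unicodedata

print(unicodedata.category('$'))  # 'Sc' -- a string like '$$$' passes the loop
print(unicodedata.category('R'))  # 'Lu' -- so 'Rp' can only pass via LIST_CURRENCY
# Edge case worth noting: an empty string never enters the loop, so as
# written is_currency('') returns True.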


@@ -1,7 +1,535 @@
"""
Slang and abbreviations
Daftar kosakata yang sering salah dieja
https://id.wikipedia.org/wiki/Wikipedia:Daftar_kosakata_bahasa_Indonesia_yang_sering_salah_dieja
"""
# coding: utf8
from __future__ import unicode_literals
_exc = {
# Slang and abbreviations
"silahkan": "silakan",
"yg": "yang",
"kalo": "kalau",
"cawu": "caturwulan",
"ok": "oke",
"gak": "tidak",
"enggak": "tidak",
"nggak": "tidak",
"ndak": "tidak",
"ngga": "tidak",
"dgn": "dengan",
"tdk": "tidak",
"jg": "juga",
"klo": "kalau",
"denger": "dengar",
"pinter": "pintar",
"krn": "karena",
"nemuin": "menemukan",
"jgn": "jangan",
"udah": "sudah",
"sy": "saya",
"udh": "sudah",
"dapetin": "mendapatkan",
"ngelakuin": "melakukan",
"ngebuat": "membuat",
"membikin": "membuat",
"bikin": "buat",
# Frequently misspelled words (daftar kosakata yang sering salah dieja)
"malpraktik": "malapraktik",
"malfungsi": "malafungsi",
"malserap": "malaserap",
"maladaptasi": "malaadaptasi",
"malsuai": "malasuai",
"maldistribusi": "maladistribusi",
"malgizi": "malagizi",
"malsikap": "malasikap",
"memperhatikan": "memerhatikan",
"akte": "akta",
"cemilan": "camilan",
"esei": "esai",
"frase": "frasa",
"kafeteria": "kafetaria",
"ketapel": "katapel",
"kenderaan": "kendaraan",
"menejemen": "manajemen",
"menejer": "manajer",
"mesjid": "masjid",
"rebo": "rabu",
"seksama": "saksama",
"senggama": "sanggama",
"sekedar": "sekadar",
"seprei": "seprai",
"semedi": "semadi",
"samadi": "semadi",
"amandemen": "amendemen",
"algoritma": "algoritme",
"aritmatika": "aritmetika",
"metoda": "metode",
"materai": "meterai",
"meterei": "meterai",
"kalendar": "kalender",
"kadaluwarsa": "kedaluwarsa",
"katagori": "kategori",
"parlamen": "parlemen",
"sekular": "sekuler",
"selular": "seluler",
"sirkular": "sirkuler",
"survai": "survei",
"survey": "survei",
"aktuil": "aktual",
"formil": "formal",
"trotoir": "trotoar",
"komersiil": "komersial",
"komersil": "komersial",
"tradisionil": "tradisionial",
"orisinil": "orisinal",
"orijinil": "orisinal",
"afdol": "afdal",
"antri": "antre",
"apotik": "apotek",
"atlit": "atlet",
"atmosfir": "atmosfer",
"cidera": "cedera",
"cendikiawan": "cendekiawan",
"cepet": "cepat",
"cinderamata": "cenderamata",
"debet": "debit",
"difinisi": "definisi",
"dekrit": "dekret",
"disain": "desain",
"diskripsi": "deskripsi",
"diskotik": "diskotek",
"eksim": "eksem",
"exim": "eksem",
"faidah": "faedah",
"ekstrim": "ekstrem",
"ekstrimis": "ekstremis",
"komplit": "komplet",
"konkrit": "konkret",
"kongkrit": "konkret",
"kongkret": "konkret",
"kridit": "kredit",
"musium": "museum",
"pinalti": "penalti",
"piranti": "peranti",
"pinsil": "pensil",
"personil": "personel",
"sistim": "sistem",
"teoritis": "teoretis",
"vidio": "video",
"cengkeh": "cengkih",
"desertasi": "disertasi",
"hakekat": "hakikat",
"intelejen": "intelijen",
"kaedah": "kaidah",
"kempes": "kempis",
"kementrian": "kementerian",
"ledeng": "leding",
"nasehat": "nasihat",
"penasehat": "penasihat",
"praktek": "praktik",
"praktekum": "praktikum",
"resiko": "risiko",
"retsleting": "ritsleting",
"senen": "senin",
"amuba": "ameba",
"punggawa": "penggawa",
"surban": "serban",
"nomer": "nomor",
"sorban": "serban",
"bis": "bus",
"agribisnis": "agrobisnis",
"kantung": "kantong",
"khutbah": "khotbah",
"mandur": "mandor",
"rubuh": "roboh",
"pastur": "pastor",
"supir": "sopir",
"goncang": "guncang",
"goa": "gua",
"kaos": "kaus",
"kokoh": "kukuh",
"komulatif": "kumulatif",
"kolomnis": "kolumnis",
"korma": "kurma",
"lobang": "lubang",
"limo": "limusin",
"limosin": "limusin",
"mangkok": "mangkuk",
"saos": "saus",
"sop": "sup",
"sorga": "surga",
"tegor": "tegur",
"telor": "telur",
"obrak-abrik": "ubrak-abrik",
"ekwivalen": "ekuivalen",
"frekwensi": "frekuensi",
"konsekwensi": "konsekuensi",
"kwadran": "kuadran",
"kwadrat": "kuadrat",
"kwalifikasi": "kualifikasi",
"kwalitas": "kualitas",
"kwalitet": "kualitas",
"kwalitatif": "kualitatif",
"kwantitas": "kuantitas",
"kwantitatif": "kuantitatif",
"kwantum": "kuantum",
"kwartal": "kuartal",
"kwintal": "kuintal",
"kwitansi": "kuitansi",
"kwatir": "khawatir",
"kuatir": "khawatir",
"jadual": "jadwal",
"hirarki": "hierarki",
"karir": "karier",
"aktip": "aktif",
"daptar": "daftar",
"efektip": "efektif",
"epektif": "efektif",
"epektip": "efektif",
"Pebruari": "Februari",
"pisik": "fisik",
"pondasi": "fondasi",
"photo": "foto",
"photokopi": "fotokopi",
"hapal": "hafal",
"insap": "insaf",
"insyaf": "insaf",
"konperensi": "konferensi",
"kreatip": "kreatif",
"kreativ": "kreatif",
"maap": "maaf",
"napsu": "nafsu",
"negatip": "negatif",
"negativ": "negatif",
"objektip": "objektif",
"obyektip": "objektif",
"obyektif": "objektif",
"pasip": "pasif",
"pasiv": "pasif",
"positip": "positif",
"positiv": "positif",
"produktip": "produktif",
"produktiv": "produktif",
"sarap": "saraf",
"sertipikat": "sertifikat",
"subjektip": "subjektif",
"subyektip": "subjektif",
"subyektif": "subjektif",
"tarip": "tarif",
"transitip": "transitif",
"transitiv": "transitif",
"faham": "paham",
"fikir": "pikir",
"berfikir": "berpikir",
"telefon": "telepon",
"telfon": "telepon",
"telpon": "telepon",
"tilpon": "telepon",
"nafas": "napas",
"bernafas": "bernapas",
"pernafasan": "pernapasan",
"vermak": "permak",
"vulpen": "pulpen",
"aktifis": "aktivis",
"konfeksi": "konveksi",
"motifasi": "motivasi",
"Nopember": "November",
"propinsi": "provinsi",
"babtis": "baptis",
"jerembab": "jerembap",
"lembab": "lembap",
"sembab": "sembap",
"saptu": "sabtu",
"tekat": "tekad",
"bejad": "bejat",
"nekad": "nekat",
"otoped": "otopet",
"skuad": "skuat",
"jenius": "genius",
"marjin": "margin",
"marjinal": "marginal",
"obyek": "objek",
"subyek": "subjek",
"projek": "proyek",
"azas": "asas",
"ijasah": "ijazah",
"jenasah": "jenazah",
"plasa": "plaza",
"bathin": "batin",
"Katholik": "Katolik",
"orthografi": "ortografi",
"pathogen": "patogen",
"theologi": "teologi",
"ijin": "izin",
"rejeki": "rezeki",
"rejim": "rezim",
"jaman": "zaman",
"jamrud": "zamrud",
"jinah": "zina",
"perjinahan": "perzinaan",
"anugrah": "anugerah",
"cendrawasih": "cenderawasih",
"jendral": "jenderal",
"kripik": "keripik",
"krupuk": "kerupuk",
"ksatria": "kesatria",
"mentri": "menteri",
"negri": "negeri",
"Prancis": "Perancis",
"sebrang": "seberang",
"menyebrang": "menyeberang",
"Sumatra": "Sumatera",
"trampil": "terampil",
"isteri": "istri",
"justeru": "justru",
"perajurit": "prajurit",
"putera": "putra",
"puteri": "putri",
"samudera": "samudra",
"sastera": "sastra",
"sutera": "sutra",
"terompet": "trompet",
"iklas": "ikhlas",
"iktisar": "ikhtisar",
"kafilah": "khafilah",
"kawatir": "khawatir",
"kotbah": "khotbah",
"kusyuk": "khusyuk",
"makluk": "makhluk",
"mahluk": "makhluk",
"mahkluk": "makhluk",
"nahkoda": "nakhoda",
"nakoda": "nakhoda",
"tahta": "takhta",
"takhyul": "takhayul",
"tahyul": "takhayul",
"tahayul": "takhayul",
"akhli": "ahli",
"anarkhi": "anarki",
"kharisma": "karisma",
"kharismatik": "karismatik",
"mahsud": "maksud",
"makhsud": "maksud",
"rakhmat": "rahmat",
"tekhnik": "teknik",
"tehnik": "teknik",
"tehnologi": "teknologi",
"ikhwal": "ihwal",
"expor": "ekspor",
"extra": "ekstra",
"komplex": "komplek",
"sex": "seks",
"taxi": "taksi",
"extasi": "ekstasi",
"syaraf": "saraf",
"syurga": "surga",
"mashur": "masyhur",
"masyur": "masyhur",
"mahsyur": "masyhur",
"mashyur": "masyhur",
"muadzin": "muazin",
"adzan": "azan",
"ustadz": "ustaz",
"ustad": "ustaz",
"ustadzah": "ustaz",
"dzikir": "zikir",
"dzuhur": "zuhur",
"dhuhur": "zuhur",
"zhuhur": "zuhur",
"analisa": "analisis",
"diagnosa": "diagnosis",
"hipotesa": "hipotesis",
"sintesa": "sintesis",
"aktiviti": "aktivitas",
"aktifitas": "aktivitas",
"efektifitas": "efektivitas",
"komuniti": "komunitas",
"kreatifitas": "kreativitas",
"produktifitas": "produktivitas",
"realiti": "realitas",
"realita": "realitas",
"selebriti": "selebritas",
"spotifitas": "sportivitas",
"universiti": "universitas",
"utiliti": "utilitas",
"validiti": "validitas",
"dilokalisir": "dilokalisasi",
"didramatisir": "didramatisasi",
"dipolitisir": "dipolitisasi",
"dinetralisir": "dinetralisasi",
"dikonfrontir": "dikonfrontasi",
"mendominir": "mendominasi",
"koordinir": "koordinasi",
"proklamir": "proklamasi",
"terorganisir": "terorganisasi",
"terealisir": "terealisasi",
"robah": "ubah",
"dirubah": "diubah",
"merubah": "mengubah",
"terlanjur": "telanjur",
"terlantar": "telantar",
"penglepasan": "pelepasan",
"pelihatan": "penglihatan",
"pemukiman": "permukiman",
"pengrumahan": "perumahan",
"penyewaan": "persewaan",
"menyintai": "mencintai",
"menyolok": "mencolok",
"contek": "sontek",
"mencontek": "menyontek",
"pungkir": "mungkir",
"dipungkiri": "dimungkiri",
"kupungkiri": "kumungkiri",
"kaupungkiri": "kaumungkiri",
"nampak": "tampak",
"nampaknya": "tampaknya",
"nongkrong": "tongkrong",
"berternak": "beternak",
"berterbangan": "beterbangan",
"berserta": "beserta",
"berperkara": "beperkara",
"berpergian": "bepergian",
"berkerja": "bekerja",
"berberapa": "beberapa",
"terbersit": "tebersit",
"terpercaya": "tepercaya",
"terperdaya": "teperdaya",
"terpercik": "tepercik",
"terpergok": "tepergok",
"aksesoris": "aksesori",
"handal": "andal",
"hantar": "antar",
"panutan": "anutan",
"atsiri": "asiri",
"bhakti": "bakti",
"china": "cina",
"dharma": "darma",
"diktaktor": "diktator",
"eksport": "ekspor",
"hembus": "embus",
"hadits": "hadis",
"hadist": "hadits",
"harafiah": "harfiah",
"himbau": "imbau",
"import": "impor",
"inget": "ingat",
"hisap": "isap",
"interprestasi": "interpretasi",
"kangker": "kanker",
"konggres": "kongres",
"lansekap": "lanskap",
"maghrib": "magrib",
"emak": "mak",
"moderen": "modern",
"pasport": "paspor",
"perduli": "peduli",
"ramadhan": "ramadan",
"rapih": "rapi",
"Sansekerta": "Sanskerta",
"shalat": "salat",
"sholat": "salat",
"silahkan": "silakan",
"standard": "standar",
"hutang": "utang",
"zinah": "zina",
"ambulan": "ambulans",
"antartika": "sntarktika",
"arteri": "arteria",
"asik": "asyik",
"australi": "australia",
"denga": "dengan",
"depo": "depot",
"detil": "detail",
"ensiklopedi": "ensiklopedia",
"elit": "elite",
"frustasi": "frustrasi",
"gladi": "geladi",
"greget": "gereget",
"itali": "italia",
"karna": "karena",
"klenteng": "kelenteng",
"erling": "kerling",
"kontruksi": "konstruksi",
"masal": "massal",
"merk": "merek",
"respon": "respons",
"diresponi": "direspons",
"skak": "sekak",
"stir": "setir",
"singapur": "singapura",
"standarisasi": "standardisasi",
"varitas": "varietas",
"amphibi": "amfibi",
"anjlog": "anjlok",
"alpukat": "avokad",
"alpokat": "avokad",
"bolpen": "pulpen",
"cabe": "cabai",
"cabay": "cabai",
"ceret": "cerek",
"differensial": "diferensial",
"duren": "durian",
"faksimili": "faksimile",
"faksimil": "faksimile",
"graha": "gerha",
"goblog": "goblok",
"gombrong": "gombroh",
"horden": "gorden",
"korden": "gorden",
"gubug": "gubuk",
"imaginasi": "imajinasi",
"jerigen": "jeriken",
"jirigen": "jeriken",
"carut-marut": "karut-marut",
"kwota": "kuota",
"mahzab": "mazhab",
"mempesona": "memesona",
"milyar": "miliar",
"missi": "misi",
"nenas": "nanas",
"negoisasi": "negosiasi",
"automotif": "otomotif",
"pararel": "paralel",
"paska": "pasca",
"prosen": "persen",
"pete": "petai",
"petay": "petai",
"proffesor": "profesor",
"rame": "ramai",
"rapot": "rapor",
"rileks": "relaks",
"rileksasi": "relaksasi",
"renumerasi": "remunerasi",
"seketaris": "sekretaris",
"sekertaris": "sekretaris",
"sensorik": "sensoris",
"sentausa": "sentosa",
"strawberi": "stroberi",
"strawbery": "stroberi",
"taqwa": "takwa",
"tauco": "taoco",
"tauge": "taoge",
"toge": "taoge",
"tauladan": "teladan",
"taubat": "tobat",
"trilyun": "triliun",
"vissi": "visi",
"coklat": "cokelat",
"narkotika": "narkotik",
"oase": "oasis",
"politisi": "politikus",
"terong": "terung",
"wool": "wol",
"himpit": "impit",
"mujizat": "mukjizat",
"mujijat": "mukjizat",
"yag": "yang",
}
NORM_EXCEPTIONS = {}

# Build the exceptions table from _exc, also covering Title-case variants.
for string, norm in _exc.items():
    NORM_EXCEPTIONS[string] = norm
    NORM_EXCEPTIONS[string.title()] = norm


@@ -4,7 +4,7 @@ from __future__ import unicode_literals
from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from ..char_classes import merge_chars, split_chars, _currency, _units
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
_units = (_units + 's bit Gbps Mbps mbps Kbps kbps ƒ ppi px '
          'Hz kHz MHz GHz mAh '
@@ -25,7 +25,7 @@ HTML_SUFFIX = r'</(b|strong|i|em|p|span|div|a)>'
MONTHS = merge_chars(_months)
LIST_CURRENCY = split_chars(_currency)
TOKENIZER_PREFIXES.remove('#')  # hashtag
_prefixes = TOKENIZER_PREFIXES + LIST_CURRENCY + [HTML_PREFIX] + ['/', '']
_suffixes = TOKENIZER_SUFFIXES + [r'\-[Nn]ya', '-[KkMm]u', '[—-]'] + [


@@ -1,763 +1,122 @@
"""
List of stop words in Bahasa Indonesia.
"""
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
ada
adalah
adanya
adapun
agak
agaknya
agar
akan
akankah
akhir
akhiri
akhirnya
aku
akulah
amat
amatlah
anda
andalah
antar
antara
antaranya
apa
apaan
apabila
apakah
apalagi
apatah
artinya
asal
asalkan
atas
atau
ataukah
ataupun
awal
ada adalah adanya adapun agak agaknya agar akan akankah akhir akhiri akhirnya
aku akulah amat amatlah anda andalah antar antara antaranya apa apaan apabila
apakah apalagi apatah artinya asal asalkan atas atau ataukah ataupun awal
awalnya
bagai
bagaikan
bagaimana
bagaimanakah
bagaimanapun
bagi
bagian
bahkan
bahwa
bahwasanya
baik
bakal
bakalan
balik
banyak
bapak
baru
bawah
beberapa
begini
beginian
beginikah
beginilah
begitu
begitukah
begitulah
begitupun
bekerja
belakang
belakangan
belum
belumlah
benar
benarkah
benarlah
berada
berakhir
berakhirlah
berakhirnya
berapa
berapakah
berapalah
berapapun
berarti
berawal
berbagai
berdatangan
beri
berikan
berikut
berikutnya
berjumlah
berkali-kali
berkata
berkehendak
berkeinginan
berkenaan
berlainan
berlalu
berlangsung
berlebihan
bermacam
bermacam-macam
bermaksud
bermula
bersama
bersama-sama
bersiap
bersiap-siap
bertanya
bertanya-tanya
berturut
berturut-turut
bertutur
berujar
berupa
besar
betul
betulkah
biasa
biasanya
bila
bilakah
bisa
bisakah
boleh
bolehkah
bolehlah
buat
bukan
bukankah
bukanlah
bukannya
bulan
bung
cara
caranya
cukup
cukupkah
cukuplah
cuma
dahulu
dalam
dan
dapat
dari
daripada
datang
dekat
demi
demikian
demikianlah
dengan
depan
di
dia
diakhiri
diakhirinya
dialah
diantara
diantaranya
diberi
diberikan
diberikannya
dibuat
dibuatnya
didapat
didatangkan
digunakan
diibaratkan
diibaratkannya
diingat
diingatkan
diinginkan
dijawab
dijelaskan
dijelaskannya
dikarenakan
dikatakan
dikatakannya
dikerjakan
diketahui
diketahuinya
dikira
dilakukan
dilalui
dilihat
dimaksud
dimaksudkan
dimaksudkannya
dimaksudnya
diminta
dimintai
dimisalkan
dimulai
dimulailah
dimulainya
dimungkinkan
dini
dipastikan
diperbuat
diperbuatnya
dipergunakan
diperkirakan
diperlihatkan
diperlukan
diperlukannya
dipersoalkan
dipertanyakan
dipunyai
diri
dirinya
disampaikan
disebut
disebutkan
disebutkannya
disini
disinilah
ditambahkan
ditandaskan
ditanya
ditanyai
ditanyakan
ditegaskan
ditujukan
ditunjuk
ditunjuki
ditunjukkan
ditunjukkannya
ditunjuknya
dituturkan
dituturkannya
diucapkan
diucapkannya
diungkapkan
dong
dua
dulu
empat
enggak
enggaknya
entah
entahlah
guna
gunakan
hal
hampir
hanya
hanyalah
hari
harus
haruslah
harusnya
hendak
hendaklah
hendaknya
hingga
ia
ialah
ibarat
ibaratkan
ibaratnya
ibu
ikut
ingat
ingat-ingat
ingin
inginkah
inginkan
ini
inikah
inilah
itu
itukah
itulah
jadi
jadilah
jadinya
jangan
jangankan
janganlah
jauh
jawab
jawaban
jawabnya
jelas
jelaskan
jelaslah
jelasnya
jika
jikalau
juga
jumlah
jumlahnya
justru
kala
kalau
kalaulah
kalaupun
kalian
kami
kamilah
kamu
kamulah
kan
kapan
kapankah
kapanpun
karena
karenanya
kasus
kata
katakan
katakanlah
katanya
ke
keadaan
kebetulan
kecil
kedua
keduanya
keinginan
kelamaan
kelihatan
kelihatannya
kelima
keluar
kembali
kemudian
kemungkinan
kemungkinannya
kenapa
kepada
kepadanya
kesampaian
keseluruhan
keseluruhannya
keterlaluan
ketika
khususnya
kini
kinilah
kira
kira-kira
kiranya
kita
kitalah
kok
kurang
lagi
lagian
lah
lain
lainnya
lalu
lama
lamanya
lanjut
lanjutnya
lebih
lewat
lima
luar
macam
maka
makanya
makin
malah
malahan
mampu
mampukah
mana
manakala
manalagi
masa
masalah
masalahnya
masih
masihkah
masing
masing-masing
mau
maupun
melainkan
melakukan
melalui
melihat
melihatnya
memang
memastikan
memberi
memberikan
membuat
memerlukan
memihak
meminta
memintakan
memisalkan
memperbuat
mempergunakan
memperkirakan
memperlihatkan
mempersiapkan
mempersoalkan
mempertanyakan
mempunyai
memulai
memungkinkan
menaiki
menambahkan
menandaskan
menanti
menanti-nanti
menantikan
menanya
menanyai
menanyakan
mendapat
mendapatkan
mendatang
mendatangi
mendatangkan
menegaskan
mengakhiri
mengapa
mengatakan
mengatakannya
mengenai
mengerjakan
mengetahui
menggunakan
menghendaki
mengibaratkan
mengibaratkannya
mengingat
mengingatkan
menginginkan
mengira
mengucapkan
mengucapkannya
mengungkapkan
menjadi
menjawab
menjelaskan
menuju
menunjuk
menunjuki
menunjukkan
menunjuknya
menurut
menuturkan
menyampaikan
menyangkut
menyatakan
menyebutkan
menyeluruh
menyiapkan
merasa
mereka
merekalah
merupakan
meski
meskipun
meyakini
meyakinkan
minta
mirip
misal
misalkan
misalnya
mula
mulai
mulailah
mulanya
mungkin
mungkinkah
nah
naik
namun
nanti
nantinya
nyaris
nyatanya
oleh
olehnya
pada
padahal
padanya
pak
paling
panjang
pantas
para
pasti
pastilah
penting
pentingnya
per
percuma
perlu
perlukah
perlunya
pernah
persoalan
pertama
pertama-tama
pertanyaan
pertanyakan
pihak
pihaknya
pukul
pula
pun
punya
rasa
rasanya
rata
rupanya
saat
saatnya
saja
sajalah
saling
sama
sama-sama
sambil
sampai
sampai-sampai
sampaikan
sana
sangat
sangatlah
satu
saya
sayalah
se
sebab
sebabnya
sebagai
sebagaimana
sebagainya
sebagian
sebaik
sebaik-baiknya
sebaiknya
sebaliknya
sebanyak
sebegini
sebegitu
sebelum
sebelumnya
sebenarnya
seberapa
sebesar
sebetulnya
sebisanya
sebuah
sebut
sebutlah
sebutnya
secara
secukupnya
sedang
sedangkan
sedemikian
sedikit
sedikitnya
seenaknya
segala
segalanya
segera
seharusnya
sehingga
seingat
sejak
sejauh
sejenak
sejumlah
sekadar
sekadarnya
sekali
sekali-kali
sekalian
sekaligus
sekalipun
sekarang
sekecil
seketika
sekiranya
sekitar
sekitarnya
sekurang-kurangnya
sekurangnya
sela
selain
selaku
selalu
selama
selama-lamanya
selamanya
selanjutnya
seluruh
seluruhnya
semacam
semakin
semampu
semampunya
semasa
semasih
semata
semata-mata
semaunya
sementara
semisal
semisalnya
sempat
semua
semuanya
semula
sendiri
sendirian
sendirinya
seolah
seolah-olah
seorang
sepanjang
sepantasnya
sepantasnyalah
seperlunya
seperti
sepertinya
sepihak
sering
seringnya
serta
serupa
sesaat
sesama
sesampai
sesegera
sesekali
seseorang
sesuatu
sesuatunya
sesudah
sesudahnya
setelah
setempat
setengah
seterusnya
setiap
setiba
setibanya
setidak-tidaknya
setidaknya
setinggi
seusai
sewaktu
siap
siapa
siapakah
siapapun
sini
sinilah
soal
soalnya
suatu
sudah
sudahkah
sudahlah
supaya
tadi
tadinya
tahu
tahun
tak
tambah
tambahnya
tampak
tampaknya
tandas
tandasnya
tanpa
tanya
tanyakan
tanyanya
tapi
tegas
tegasnya
telah
tempat
tengah
tentang
tentu
tentulah
tentunya
tepat
terakhir
terasa
terbanyak
terdahulu
terdapat
terdiri
terhadap
terhadapnya
teringat
teringat-ingat
terjadi
terjadilah
terjadinya
terkira
terlalu
terlebih
terlihat
termasuk
ternyata
tersampaikan
tersebut
tersebutlah
tertentu
tertuju
terus
terutama
tetap
tetapi
tiap
tiba
tiba-tiba
tidak
tidakkah
tidaklah
tiga
tinggi
toh
tunjuk
turut
tutur
tuturnya
ucap
ucapnya
ujar
ujarnya
umum
umumnya
ungkap
ungkapnya
untuk
usah
usai
waduh
wah
wahai
waktu
waktunya
walau
walaupun
wong
yaitu
yakin
yakni
yang
""".split())
bagai bagaikan bagaimana bagaimanakah bagaimanapun bagi bagian bahkan bahwa
bahwasanya baik bakal bakalan balik banyak bapak baru bawah beberapa begini
beginian beginikah beginilah begitu begitukah begitulah begitupun bekerja
belakang belakangan belum belumlah benar benarkah benarlah berada berakhir
berakhirlah berakhirnya berapa berapakah berapalah berapapun berarti berawal
berbagai berdatangan beri berikan berikut berikutnya berjumlah berkali-kali
berkata berkehendak berkeinginan berkenaan berlainan berlalu berlangsung
berlebihan bermacam bermacam-macam bermaksud bermula bersama bersama-sama
bersiap bersiap-siap bertanya bertanya-tanya berturut berturut-turut bertutur
berujar berupa besar betul betulkah biasa biasanya bila bilakah bisa bisakah
boleh bolehkah bolehlah buat bukan bukankah bukanlah bukannya bulan bung
cara caranya cukup cukupkah cukuplah cuma
dahulu dalam dan dapat dari daripada datang dekat demi demikian demikianlah
dengan depan di dia diakhiri diakhirinya dialah diantara diantaranya diberi
diberikan diberikannya dibuat dibuatnya didapat didatangkan digunakan
diibaratkan diibaratkannya diingat diingatkan diinginkan dijawab dijelaskan
dijelaskannya dikarenakan dikatakan dikatakannya dikerjakan diketahui
diketahuinya dikira dilakukan dilalui dilihat dimaksud dimaksudkan
dimaksudkannya dimaksudnya diminta dimintai dimisalkan dimulai dimulailah
dimulainya dimungkinkan dini dipastikan diperbuat diperbuatnya dipergunakan
diperkirakan diperlihatkan diperlukan diperlukannya dipersoalkan dipertanyakan
dipunyai diri dirinya disampaikan disebut disebutkan disebutkannya disini
disinilah ditambahkan ditandaskan ditanya ditanyai ditanyakan ditegaskan
ditujukan ditunjuk ditunjuki ditunjukkan ditunjukkannya ditunjuknya dituturkan
dituturkannya diucapkan diucapkannya diungkapkan dong dua dulu
empat enggak enggaknya entah entahlah
guna gunakan
hal hampir hanya hanyalah hari harus haruslah harusnya hendak hendaklah
hendaknya hingga
ia ialah ibarat ibaratkan ibaratnya ibu ikut ingat ingat-ingat ingin inginkah
inginkan ini inikah inilah itu itukah itulah
jadi jadilah jadinya jangan jangankan janganlah jauh jawab jawaban jawabnya
jelas jelaskan jelaslah jelasnya jika jikalau juga jumlah jumlahnya justru
kala kalau kalaulah kalaupun kalian kami kamilah kamu kamulah kan kapan
kapankah kapanpun karena karenanya kasus kata katakan katakanlah katanya ke
keadaan kebetulan kecil kedua keduanya keinginan kelamaan kelihatan
kelihatannya kelima keluar kembali kemudian kemungkinan kemungkinannya kenapa
kepada kepadanya kesampaian keseluruhan keseluruhannya keterlaluan ketika
khususnya kini kinilah kira kira-kira kiranya kita kitalah kok kurang
lagi lagian lah lain lainnya lalu lama lamanya lanjut lanjutnya lebih lewat
lima luar
macam maka makanya makin malah malahan mampu mampukah mana manakala manalagi
masa masalah masalahnya masih masihkah masing masing-masing mau maupun
melainkan melakukan melalui melihat melihatnya memang memastikan memberi
memberikan membuat memerlukan memihak meminta memintakan memisalkan memperbuat
mempergunakan memperkirakan memperlihatkan mempersiapkan mempersoalkan
mempertanyakan mempunyai memulai memungkinkan menaiki menambahkan menandaskan
menanti menanti-nanti menantikan menanya menanyai menanyakan mendapat
mendapatkan mendatang mendatangi mendatangkan menegaskan mengakhiri mengapa
mengatakan mengatakannya mengenai mengerjakan mengetahui menggunakan
menghendaki mengibaratkan mengibaratkannya mengingat mengingatkan menginginkan
mengira mengucapkan mengucapkannya mengungkapkan menjadi menjawab menjelaskan
menuju menunjuk menunjuki menunjukkan menunjuknya menurut menuturkan
menyampaikan menyangkut menyatakan menyebutkan menyeluruh menyiapkan merasa
mereka merekalah merupakan meski meskipun meyakini meyakinkan minta mirip
misal misalkan misalnya mula mulai mulailah mulanya mungkin mungkinkah
nah naik namun nanti nantinya nyaris nyatanya
oleh olehnya
pada padahal padanya pak paling panjang pantas para pasti pastilah penting
pentingnya per percuma perlu perlukah perlunya pernah persoalan pertama
pertama-tama pertanyaan pertanyakan pihak pihaknya pukul pula pun punya
rasa rasanya rata rupanya
saat saatnya saja sajalah saling sama sama-sama sambil sampai sampai-sampai
sampaikan sana sangat sangatlah satu saya sayalah se sebab sebabnya sebagai
sebagaimana sebagainya sebagian sebaik sebaik-baiknya sebaiknya sebaliknya
sebanyak sebegini sebegitu sebelum sebelumnya sebenarnya seberapa sebesar
sebetulnya sebisanya sebuah sebut sebutlah sebutnya secara secukupnya sedang
sedangkan sedemikian sedikit sedikitnya seenaknya segala segalanya segera
seharusnya sehingga seingat sejak sejauh sejenak sejumlah sekadar sekadarnya
sekali sekali-kali sekalian sekaligus sekalipun sekarang sekecil
seketika sekiranya sekitar sekitarnya sekurang-kurangnya sekurangnya sela
selain selaku selalu selama selama-lamanya selamanya selanjutnya seluruh
seluruhnya semacam semakin semampu semampunya semasa semasih semata semata-mata
semaunya sementara semisal semisalnya sempat semua semuanya semula sendiri
sendirian sendirinya seolah seolah-olah seorang sepanjang sepantasnya
sepantasnyalah seperlunya seperti sepertinya sepihak sering seringnya serta
serupa sesaat sesama sesampai sesegera sesekali seseorang sesuatu sesuatunya
sesudah sesudahnya setelah setempat setengah seterusnya setiap setiba setibanya
setidak-tidaknya setidaknya setinggi seusai sewaktu siap siapa siapakah
siapapun sini sinilah soal soalnya suatu sudah sudahkah sudahlah supaya
tadi tadinya tahu tahun tak tambah tambahnya tampak tampaknya tandas tandasnya
tanpa tanya tanyakan tanyanya tapi tegas tegasnya telah tempat tengah tentang
tentu tentulah tentunya tepat terakhir terasa terbanyak terdahulu terdapat
terdiri terhadap terhadapnya teringat teringat-ingat terjadi terjadilah
terjadinya terkira terlalu terlebih terlihat termasuk ternyata tersampaikan
tersebut tersebutlah tertentu tertuju terus terutama tetap tetapi tiap tiba
tiba-tiba tidak tidakkah tidaklah tiga tinggi toh tunjuk turut tutur tuturnya
ucap ucapnya ujar ujarnya umum umumnya ungkap ungkapnya untuk usah usai
waduh wah wahai waktu waktunya walau walaupun wong
yaitu yakin yakni yang
""".split())

View File: spacy/lang/id/tokenizer_exceptions.py

@@ -1,10 +1,11 @@
"""
List of abbreviations and acronyms from:
https://id.wiktionary.org/wiki/Wiktionary:Daftar_singkatan_dan_akronim_bahasa_Indonesia#A
"""
# coding: utf8
from __future__ import unicode_literals
import regex as re
from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS
from ..tokenizer_exceptions import URL_PATTERN
from ...symbols import ORTH, LEMMA, NORM
@@ -22,6 +23,9 @@ for orth in ID_BASE_EXCEPTIONS:
    orth_lower = orth.lower()
    _exc[orth_lower] = [{ORTH: orth_lower}]
    orth_first_upper = orth[0].upper() + orth[1:]
    _exc[orth_first_upper] = [{ORTH: orth_first_upper}]
    if '-' in orth:
        orth_title = '-'.join([part.title() for part in orth.split('-')])
        _exc[orth_title] = [{ORTH: orth_title}]
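The added `orth_first_upper` lines register a capitalize-first-letter spelling of every base exception, alongside the existing lowercase, hyphen-aware title-case, and all-caps variants. An illustrative trace for a hypothetical base form `"e-ktp"` (chosen for illustration, not quoted from the exception list):

```python
orth = "e-ktp"  # hypothetical base form, for illustration only
print(orth.lower())                                  # e-ktp
print(orth[0].upper() + orth[1:])                    # E-ktp (the new variant)
print('-'.join(p.title() for p in orth.split('-')))  # E-Ktp
print(orth.upper())                                  # E-KTP (next hunk)
```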
@@ -30,28 +34,6 @@ for orth in ID_BASE_EXCEPTIONS:
    _exc[orth_caps] = [{ORTH: orth_caps}]
for exc_data in [
{ORTH: "CKG", LEMMA: "Cakung", NORM: "Cakung"},
{ORTH: "CGP", LEMMA: "Grogol Petamburan", NORM: "Grogol Petamburan"},
{ORTH: "KSU", LEMMA: "Kepulauan Seribu Utara", NORM: "Kepulauan Seribu Utara"},
{ORTH: "KYB", LEMMA: "Kebayoran Baru", NORM: "Kebayoran Baru"},
{ORTH: "TJP", LEMMA: "Tanjungpriok", NORM: "Tanjungpriok"},
{ORTH: "TNA", LEMMA: "Tanah Abang", NORM: "Tanah Abang"},
{ORTH: "BEK", LEMMA: "Bengkayang", NORM: "Bengkayang"},
{ORTH: "KTP", LEMMA: "Ketapang", NORM: "Ketapang"},
{ORTH: "MPW", LEMMA: "Mempawah", NORM: "Mempawah"},
{ORTH: "NGP", LEMMA: "Nanga Pinoh", NORM: "Nanga Pinoh"},
{ORTH: "NBA", LEMMA: "Ngabang", NORM: "Ngabang"},
{ORTH: "PTK", LEMMA: "Pontianak", NORM: "Pontianak"},
{ORTH: "PTS", LEMMA: "Putussibau", NORM: "Putussibau"},
{ORTH: "SBS", LEMMA: "Sambas", NORM: "Sambas"},
{ORTH: "SAG", LEMMA: "Sanggau", NORM: "Sanggau"},
{ORTH: "SED", LEMMA: "Sekadau", NORM: "Sekadau"},
{ORTH: "SKW", LEMMA: "Singkawang", NORM: "Singkawang"},
{ORTH: "STG", LEMMA: "Sintang", NORM: "Sintang"},
{ORTH: "SKD", LEMMA: "Sukadane", NORM: "Sukadane"},
{ORTH: "SRY", LEMMA: "Sungai Raya", NORM: "Sungai Raya"},
{ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"},
{ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"},
{ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"},
@@ -66,25 +48,43 @@ for exc_data in [
{ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}]:
_exc[exc_data[ORTH]] = [exc_data]
_other_exc = {
"do'a": [{ORTH: "do'a", LEMMA: "doa", NORM: "doa"}],
"jum'at": [{ORTH: "jum'at", LEMMA: "Jumat", NORM: "Jumat"}],
"Jum'at": [{ORTH: "Jum'at", LEMMA: "Jumat", NORM: "Jumat"}],
"la'nat": [{ORTH: "la'nat", LEMMA: "laknat", NORM: "laknat"}],
"ma'af": [{ORTH: "ma'af", LEMMA: "maaf", NORM: "maaf"}],
"mu'jizat": [{ORTH: "mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}],
"Mu'jizat": [{ORTH: "Mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}],
"ni'mat": [{ORTH: "ni'mat", LEMMA: "nikmat", NORM: "nikmat"}],
"raka'at": [{ORTH: "raka'at", LEMMA: "rakaat", NORM: "rakaat"}],
"ta'at": [{ORTH: "ta'at", LEMMA: "taat", NORM: "taat"}],
}
_exc.update(_other_exc)
for orth in [
"A.AB.", "A.Ma.", "A.Md.", "A.Md.Keb.", "A.Md.Kep.", "A.P.",
"B.A.", "B.Ch.E.", "B.Sc.", "Dr.", "Dra.", "Drs.", "Hj.", "Ka.", "Kp.",
"M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.", "M.Kes,",
"M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.", "M.SArl",
"M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
"M.AB", "M.Ag.", "M.AP", "M.Arl", "M.A.R.S", "M.Hum.", "M.I.Kom.",
"M.Kes,", "M.Kom.", "M.M.", "M.P.", "M.Pd.", "M.Psi.", "M.Psi.T.", "M.Sc.",
"M.SArl", "M.Si.", "M.Sn.", "M.T.", "M.Th.", "No.", "Pjs.", "Plt.", "R.A.",
"S.AB", "S.AP", "S.Adm", "S.Ag.", "S.Agr", "S.Ant", "S.Arl", "S.Ars",
"S.A.R.S", "S.Ds", "S.E.", "S.E.I.", "S.Farm", "S.Gz.", "S.H.", "S.Han",
"S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P", "S.IP",
"S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep", "S.KG", "S.KH",
"S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM", "S.Mb", "S.Mat",
"S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.", "S.Psi.", "S.S.", "S.SArl.",
"S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.", "S.ST.", "S.ST.Han", "S.STP", "S.Sos.",
"S.Sy.", "S.T.", "S.T.Han", "S.Th.", "S.Th.I" "S.TI.", "S.T.P.", "S.TrK",
"S.Tekp.", "S.Th.",
"a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.", "dll.",
"dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.", "i/o",
"n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p.",
]:
"S.H.Int", "S.Hum", "S.Hut.", "S.In.", "S.IK.", "S.I.Kom.", "S.I.P",
"S.IP", "S.P.", "S.Pt", "S.Psi", "S.Ptk", "S.Keb", "S.Ked", "S.Kep",
"S.KG", "S.KH", "S.Kel", "S.K.M.", "S.Kedg.", "S.Kedh.", "S.Kom.", "S.KPM",
"S.Mb", "S.Mat", "S.Par", "S.Pd.", "S.Pd.I.", "S.Pd.SD", "S.Pol.",
"S.Psi.", "S.S.", "S.SArl.", "S.Sn", "S.Si.", "S.Si.Teol.", "S.SI.",
"S.ST.", "S.ST.Han", "S.STP", "S.Sos.", "S.Sy.", "S.T.", "S.T.Han",
"S.Th.", "S.Th.I" "S.TI.", "S.T.P.", "S.TrK", "S.Tekp.", "S.Th.",
"Prof.", "drg.", "KH.", "Ust.", "Lc", "Pdt.", "S.H.H.", "Rm.", "Ps.",
"St.", "M.A.", "M.B.A", "M.Eng.", "M.Eng.Sc.", "M.Pharm.", "Dr. med",
"Dr.-Ing", "Dr. rer. nat.", "Dr. phil.", "Dr. iur.", "Dr. rer. oec",
"Dr. rer. pol.", "R.Ng.", "R.", "R.M.", "R.B.", "R.P.", "R.Ay.", "Rr.",
"R.Ngt.", "a.l.", "a.n.", "a.s.", "b.d.", "d.a.", "d.l.", "d/h", "dkk.",
"dll.", "dr.", "drh.", "ds.", "dsb.", "dst.", "faks.", "fax.", "hlm.",
"i/o", "n.b.", "p.p." "pjs.", "s.d.", "tel.", "u.p."]:
    _exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = _exc
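Each exception maps a surface string to the attributes its token should carry, so apostrophe forms such as "ma'af" survive as single tokens with normalized `LEMMA`/`NORM` values instead of being split at the apostrophe. A small usage sketch, assuming a blank Indonesian pipeline built from this language data:

```python
import spacy

nlp = spacy.blank("id")  # loads the Indonesian language defaults
doc = nlp("ma'af")
# The exception keeps the apostrophe form as one token and attaches
# the normalized form defined above.
print([(t.text, t.norm_) for t in doc])  # expected: [("ma'af", "maaf")]
```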

View File: spacy/lang/pl/lex_attrs.py

@@ -0,0 +1,32 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = ['zero', 'jeden', 'dwa', 'trzy', 'cztery', 'pięć', 'sześć',
              'siedem', 'osiem', 'dziewięć', 'dziesięć', 'jedenaście',
              'dwanaście', 'trzynaście', 'czternaście',
              'piętnaście', 'szesnaście', 'siedemnaście', 'osiemnaście',
              'dziewiętnaście', 'dwadzieścia', 'trzydzieści', 'czterdzieści',
              'pięćdziesiąt', 'sześćdziesiąt', 'siedemdziesiąt',
              'osiemdziesiąt', 'dziewięćdziesiąt', 'sto', 'tysiąc', 'milion',
              'miliard', 'bilion', 'trylion']
def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False
LEX_ATTRS = {
    LIKE_NUM: like_num
}
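`like_num` first strips `,` and `.` (so formatted numbers like `1.000.000` reduce to digits), then accepts plain digit strings, simple `a/b` fractions, and the spelled-out numerals listed above. A few illustrative calls:

```python
print(like_num("15"))         # True: plain digits
print(like_num("1.000.000"))  # True: separators stripped first
print(like_num("3/4"))        # True: numerator and denominator are digits
print(like_num("dwa"))        # True: listed Polish numeral
print(like_num("kot"))        # False: not a numeral
```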

Some files were not shown because too many files have changed in this diff.