Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-27 10:26:35 +03:00
Merge branch 'develop' into feature/refactor-parser

Commit dc1a479fbd
.github/ISSUE_TEMPLATE/01_bugs.md (vendored, new file, 15 lines)
@@ -0,0 +1,15 @@
---
name: "\U0001F6A8 Bug Report"
about: Did you come across a bug or unexpected behaviour differing from the docs?

---

## How to reproduce the behaviour

<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->

## Your Environment

<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here. -->

* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:
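The environment details requested above can also be collected programmatically. A minimal sketch (not part of the template itself) using only the standard library, with the spaCy version lookup guarded since the package may not be installed:

```python
import platform
import sys


def environment_info():
    """Collect the fields the issue template asks for."""
    try:
        import spacy  # optional: may not be installed in this environment
        spacy_version = spacy.__version__
    except ImportError:
        spacy_version = "not installed"
    return {
        "Operating System": platform.platform(),
        "Python Version Used": sys.version.split()[0],
        "spaCy Version Used": spacy_version,
    }


# Print in the bullet-list shape the template expects
for field, value in environment_info().items():
    print(f"* {field}: {value}")
```

For spaCy 1.7+ the built-in `python -m spacy info --markdown` command mentioned in the template produces a richer report and is the preferred option.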
.github/ISSUE_TEMPLATE/02_install.md (vendored, new file, 21 lines)
@@ -0,0 +1,21 @@
---
name: "\U000023F3 Installation Problem"
about: Do you have problems installing spaCy, and none of the suggestions in the docs
  and other issues helped?

---

<!-- Before submitting an issue, make sure to check the docs and closed issues to see if any of the solutions work for you. Installation problems can often be related to Python environment issues and problems with compilation. -->

## How to reproduce the problem

<!-- Include the details of how the problem occurred. Which command did you run to install spaCy? Did you come across an error? What else did you try? -->

```bash
# copy-paste the error message here
```

## Your Environment

<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here. -->

* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:
.github/ISSUE_TEMPLATE/03_request.md (vendored, new file, 11 lines)
@@ -0,0 +1,11 @@
---
name: "\U0001F381 Feature Request"
about: Do you have an idea for an improvement, a new feature or a plugin?

---

## Feature description

<!-- Please describe the feature: Which area of the library is it related to? What specific solution would you like? -->

## Could the feature be a [custom component](https://spacy.io/usage/processing-pipelines#custom-components) or [spaCy plugin](https://spacy.io/universe)?

If so, we will tag it as [`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) so other users can take it on.
.github/ISSUE_TEMPLATE/04_docs.md (vendored, new file, 10 lines)
@@ -0,0 +1,10 @@
---
name: "\U0001F4DA Documentation"
about: Did you spot a mistake in the docs, is anything unclear or do you have a
  suggestion?

---

<!-- Describe the problem or suggestion here. If you've found a mistake and you know the answer, feel free to submit a pull request straight away: https://github.com/explosion/spaCy/pulls -->

## Which page or section is this issue related to?

<!-- Please include the URL and/or source. -->
.github/ISSUE_TEMPLATE/05_other.md (vendored, new file, 15 lines)
@@ -0,0 +1,15 @@
---
name: "\U0001F4AC Anything else?"
about: For general usage questions or help with your code, please consider
  posting on StackOverflow instead.

---

<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on StackOverflow instead: http://stackoverflow.com/questions/tagged/spacy -->

## Your Environment

<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here. -->

* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:
.github/contributors/LRAbbade.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Lucas Riêra Abbade   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-05-08           |
| GitHub username                | LRAbbade             |
| Website (optional)             |                      |
.github/contributors/alexvy86.md (vendored, new file, 87 lines)
@@ -0,0 +1,87 @@
## Contributor Agreement

(Agreement sections 1–7 identical to the spaCy contributor agreement above.)

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Alejandro Villarreal |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-05-01           |
| GitHub username                | alexvy86             |
| Website (optional)             |                      |
.github/contributors/bellabie (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the above.)

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | B Cavello            |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2018-05-06           |
| GitHub username                | bellabie             |
| Website (optional)             | bcavello.com         |
.github/contributors/janimo.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the above.)

* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Jani Monoses         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 5/10/2018            |
| GitHub username                | janimo               |
| Website (optional)             |                      |
.github/contributors/knoxdw.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

(Agreement text identical to the above.)
|
||||||
|
U.S. Federal law. Any choice of law rules will not apply.
|
||||||
|
|
||||||
|
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||||
|
mark both statements:
|
||||||
|
|
||||||
|
* [x] I am signing on behalf of myself as an individual and no other person
|
||||||
|
or entity, including my employer, has or will have rights with respect to my
|
||||||
|
contributions.
|
||||||
|
|
||||||
|
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||||
|
actual authority to contractually bind that entity.
|
||||||
|
|
||||||
|
## Contributor Details
|
||||||
|
|
||||||
|
| Field | Entry |
|
||||||
|
|------------------------------- | -------------------- |
|
||||||
|
| Name | Douglas Knox |
|
||||||
|
| Company name (if applicable) | |
|
||||||
|
| Title or role (if applicable) | |
|
||||||
|
| Date | 2018-04-27 |
|
||||||
|
| GitHub username | knoxdw |
|
||||||
|
| Website (optional) | |
|
106  .github/contributors/mauryaland.md  vendored  Normal file
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Amaury Fouret |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 05/08/2018 |
| GitHub username | mauryaland |
| Website (optional) | |
106  .github/contributors/mn3mos.md  vendored  Normal file
@@ -0,0 +1,106 @@

# spaCy contributor agreement

*(The agreement text is identical to the copy reproduced above and is omitted
here.)*

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Gaëtan PRUVOST |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13/04/2018 |
| GitHub username | mn3mos |
| Website (optional) | |
106  .github/contributors/tzano.md  vendored  Normal file
@@ -0,0 +1,106 @@

# spaCy contributor agreement

*(The agreement text is identical to the copy reproduced above and is omitted
here.)*

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tahar Zanouda |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09-05-2018 |
| GitHub username | tzano |
| Website (optional) | |
106  .github/contributors/vishnumenon.md  vendored  Normal file
@@ -0,0 +1,106 @@

# spaCy contributor agreement

*(The agreement text is identical to the copy reproduced above and is omitted
here.)*

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Vishnu Menon |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12 May 2018 |
| GitHub username | vishnumenon |
| Website (optional) | |
19  .github/lock.yml  vendored  Normal file
@@ -0,0 +1,19 @@

# Configuration for lock-threads - https://github.com/dessant/lock-threads

# Number of days of inactivity before a closed issue or pull request is locked
daysUntilLock: 30

# Issues and pull requests with these labels will not be locked. Set to `[]` to disable
exemptLabels: []

# Label to add before locking, such as `outdated`. Set to `false` to disable
lockLabel: false

# Comment to post before locking. Set to `false` to disable
lockComment: >
  This thread has been automatically locked since there has not been
  any recent activity after it was closed. Please open a new issue for
  related bugs.

# Limit to only `issues` or `pulls`
only: issues
13  .github/no-response.yml  vendored  Normal file
@@ -0,0 +1,13 @@

# Configuration for probot-no-response - https://github.com/probot/no-response

# Number of days of inactivity before an Issue is closed for lack of response
daysUntilClose: 14
# Label requiring a response
responseRequiredLabel: more-info-needed
# Comment to post when closing an Issue for lack of response. Set to `false` to disable
closeComment: >
  This issue has been automatically closed because there has been no response
  to a request for more information from the original author. With only the
  information that is currently in the issue, there's not enough information
  to take action. If you're the original author, feel free to reopen the issue
  if you have or find the answers needed to investigate further.
@@ -199,6 +199,11 @@ or manually by pointing pip to a path or URL.
     # pip install .tar.gz archive from path or URL
     pip install /Users/you/en_core_web_sm-2.0.0.tar.gz

+If you have SSL certificate problems, SSL customization options are described in the help:
+
+    # help for the download command
+    python -m spacy download --help
+
 Loading and using models
 ------------------------
@@ -68,9 +68,9 @@ class RESTCountriesComponent(object):
         # the matches, so we're only setting a default value, not a getter.
         # If no default value is set, it defaults to None.
         Token.set_extension('is_country', default=False)
-        Token.set_extension('country_capital')
-        Token.set_extension('country_latlng')
-        Token.set_extension('country_flag')
+        Token.set_extension('country_capital', default=False)
+        Token.set_extension('country_latlng', default=False)
+        Token.set_extension('country_flag', default=False)

         # Register attributes on Doc and Span via a getter that checks if one of
         # the contained tokens is set to is_country == True.
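The change above replaces the implicit `None` default with an explicit `default=False` on each extension. A minimal sketch of that behaviour, using a hypothetical stand-in registry (`TokenExtensions` is an illustration, not spaCy's API):

```python
class TokenExtensions:
    """Hypothetical miniature model of an extension registry: each
    registered name maps to its default value, and an unspecified
    default falls back to None (as the diff's comment describes)."""

    _extensions = {}

    @classmethod
    def set_extension(cls, name, default=None):
        # Store the default that every token would start out with
        cls._extensions[name] = default

    @classmethod
    def get_default(cls, name):
        return cls._extensions[name]


# Explicit defaults, as in the "+" side of the diff
TokenExtensions.set_extension("is_country", default=False)
TokenExtensions.set_extension("country_capital", default=False)

# Omitting the default, as in the "-" side, yields None instead
TokenExtensions.set_extension("country_flag")
```

The point of the patch is simply that a boolean-flavoured attribute reads more naturally when its unset value is `False` rather than `None`.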
@@ -17,19 +17,39 @@ from .. import about
 @plac.annotations(
     model=("model to download, shortcut or name)", "positional", None, str),
     direct=("force direct download. Needs model name with version and won't "
-            "perform compatibility check", "flag", "d", bool))
-def download(model, direct=False):
+            "perform compatibility check", "flag", "d", bool),
+    insecure=("insecure mode - disables the verification of certificates",
+              "flag", "i", bool),
+    ca_file=("specify a certificate authority file to use for certificates "
+             "validation. Ignored if --insecure is used", "option", "c"))
+def download(model, direct=False, insecure=False, ca_file=None):
     """
     Download compatible model from default download path using pip. Model
     can be shortcut, model name or, if --direct flag is set, full model name
     with version.
+
+    The --insecure flag can be used to disable SSL verification.
+    The --ca-file option can be used to provide a local CA file
+    used for certificate verification.
     """
+    # ssl_verify is the argument handed to the 'verify' parameter
+    # of the requests package. It must be either None, a boolean,
+    # or a string containing the path to a CA file.
+    ssl_verify = None
+    if insecure:
+        ca_file = None
+        ssl_verify = False
+    else:
+        if ca_file is not None:
+            ssl_verify = ca_file
+
+    # Download the model
     if direct:
         dl = download_model('{m}/{m}.tar.gz'.format(m=model))
     else:
-        shortcuts = get_json(about.__shortcuts__, "available shortcuts")
+        shortcuts = get_json(about.__shortcuts__, "available shortcuts", ssl_verify)
         model_name = shortcuts.get(model, model)
-        compatibility = get_compatibility()
+        compatibility = get_compatibility(ssl_verify)
         version = get_version(model_name, compatibility)
         dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                             v=version))
@@ -41,8 +61,7 @@ def download(model, direct=False):
             # package, which fails if model was just installed via
             # subprocess
             package_path = get_package_path(model_name)
-            link(model_name, model, force=True,
-                 model_path=package_path)
+            link(model_name, model, force=True, model_path=package_path)
         except:
             # Dirty, but since spacy.download and the auto-linking is
             # mostly a convenience wrapper, it's best to show a success
@@ -50,19 +69,19 @@ def download(model, direct=False):
             prints(Messages.M001.format(name=model_name), title=Messages.M002)


-def get_json(url, desc):
+def get_json(url, desc, ssl_verify):
     try:
-        data = url_read(url)
+        data = url_read(url, verify=ssl_verify)
     except HTTPError as e:
         prints(Messages.M004.format(desc, about.__version__),
                title=Messages.M003.format(e.code, e.reason), exits=1)
     return ujson.loads(data)


-def get_compatibility():
+def get_compatibility(ssl_verify):
     version = about.__version__
     version = version.rsplit('.dev', 1)[0]
-    comp_table = get_json(about.__compatibility__, "compatibility table")
+    comp_table = get_json(about.__compatibility__, "compatibility table", ssl_verify)
     comp = comp_table['spacy']
     if version not in comp:
         prints(Messages.M006.format(version=version), title=Messages.M005,
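The `insecure`/`ca_file` handling in the hunk above reduces to a small mapping onto the value that `requests` accepts for its `verify` parameter. A minimal standalone sketch of that mapping (the helper name is invented for illustration):

```python
def make_ssl_verify(insecure=False, ca_file=None):
    """Map the --insecure flag and --ca-file option to the value used
    for the requests 'verify' parameter: None (library default),
    False (verification disabled), or a CA bundle path (str)."""
    if insecure:
        # --insecure takes precedence; any CA file is ignored
        return False
    if ca_file is not None:
        return ca_file
    return None


default_verify = make_ssl_verify()
insecure_verify = make_ssl_verify(insecure=True, ca_file='ca.pem')
ca_verify = make_ssl_verify(ca_file='ca.pem')
```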
@@ -124,13 +124,16 @@ def read_conllu(file_):
     return docs


-def _make_gold(nlp, text, sent_annots):
+def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
     # Flatten the conll annotations, and adjust the head indices
     flat = defaultdict(list)
+    sent_starts = []
     for sent in sent_annots:
         flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
         for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
             flat[field].extend(sent[field])
+        sent_starts.append(True)
+        sent_starts.extend([False] * (len(sent['words'])-1))
     # Construct text if necessary
     assert len(flat['words']) == len(flat['spaces'])
     if text is None:
@@ -138,6 +141,12 @@ def _make_gold(nlp, text, sent_annots):
     doc = nlp.make_doc(text)
     flat.pop('spaces')
     gold = GoldParse(doc, **flat)
+    gold.sent_starts = sent_starts
+    for i in range(len(gold.heads)):
+        if random.random() < drop_deps:
+            gold.heads[i] = None
+            gold.labels[i] = None
+
     return doc, gold

 #############################
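The `sent_starts` bookkeeping added above marks the first token of each sentence `True` and the rest `False`, while sentence-local head indices are shifted by the number of tokens already flattened. A self-contained sketch of that step (`flatten_sents` is an invented helper, not part of the script):

```python
from collections import defaultdict


def flatten_sents(sent_annots):
    """Flatten per-sentence annotations into one document-level dict,
    offsetting sentence-local head indices and recording which token
    begins each sentence."""
    flat = defaultdict(list)
    sent_starts = []
    for sent in sent_annots:
        # heads are sentence-local; shift by tokens flattened so far
        flat['heads'].extend(len(flat['words']) + head for head in sent['heads'])
        flat['words'].extend(sent['words'])
        sent_starts.append(True)
        sent_starts.extend([False] * (len(sent['words']) - 1))
    return dict(flat), sent_starts


annots = [
    {'words': ['That', 'works', '.'], 'heads': [1, 1, 1]},
    {'words': ['Good', '.'], 'heads': [0, 0]},
]
flat, starts = flatten_sents(annots)
```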
@@ -545,10 +545,21 @@ cdef class GoldParse:
         """
         return not nonproj.is_nonproj_tree(self.heads)

-    @property
-    def sent_starts(self):
-        return [self.c.sent_start[i] for i in range(self.length)]
+    property sent_starts:
+        def __get__(self):
+            return [self.c.sent_start[i] for i in range(self.length)]
+
+        def __set__(self, sent_starts):
+            for gold_i, is_sent_start in enumerate(sent_starts):
+                i = self.gold_to_cand[gold_i]
+                if i is not None:
+                    if is_sent_start in (1, True):
+                        self.c.sent_start[i] = 1
+                    elif is_sent_start in (-1, False):
+                        self.c.sent_start[i] = -1
+                    else:
+                        self.c.sent_start[i] = 0


 def biluo_tags_from_offsets(doc, entities, missing='O'):
     """Encode labelled spans into per-token tags, using the
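The new `__set__` above encodes each flag into a tri-state `sent_start` field. A standalone sketch of just that encoding (mirroring the branch logic; note that because `0 == False` in Python, an input of `0` takes the `-1` branch, same as in the Cython setter):

```python
def encode_sent_start(value):
    """Tri-state sentence-start encoding: 1 = token starts a sentence,
    -1 = it does not, 0 = unknown/unspecified."""
    if value in (1, True):
        return 1
    elif value in (-1, False):
        # note: 0 == False in Python, so an input of 0 also lands here
        return -1
    else:
        return 0
```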
31
spacy/lang/ar/__init__.py
Normal file

@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals

from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups


class ArabicDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: 'ar'
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
    suffixes = TOKENIZER_SUFFIXES


class Arabic(Language):
    lang = 'ar'
    Defaults = ArabicDefaults


__all__ = ['Arabic']
20
spacy/lang/ar/examples.py
Normal file

@@ -0,0 +1,20 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ar.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
    "أين تقع دمشق ؟",
    "كيف حالك ؟",
    "هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
    "ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
    "هل بالإمكان أن نلتقي غدا؟",
    "هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
    "كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
]
95
spacy/lang/ar/lex_attrs.py
Normal file

@@ -0,0 +1,95 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM

_num_words = set("""
صفر
واحد
إثنان
اثنان
ثلاثة
ثلاثه
أربعة
أربعه
خمسة
خمسه
ستة
سته
سبعة
سبعه
ثمانية
ثمانيه
تسعة
تسعه
عشرة
عشره
عشرون
عشرين
ثلاثون
ثلاثين
اربعون
اربعين
أربعون
أربعين
خمسون
خمسين
ستون
ستين
سبعون
سبعين
ثمانون
ثمانين
تسعون
تسعين
مائتين
مائتان
ثلاثمائة
خمسمائة
سبعمائة
الف
آلاف
ملايين
مليون
مليار
مليارات
""".split())

_ordinal_words = set("""
اول
أول
حاد
واحد
ثان
ثاني
ثالث
رابع
خامس
سادس
سابع
ثامن
تاسع
عاشر
""".split())


def like_num(text):
    """
    Check if text resembles a number.
    """
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    if text in _ordinal_words:
        return True
    return False


LEX_ATTRS = {
    LIKE_NUM: like_num
}
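The `like_num` heuristic above checks comma/period-stripped digits, simple fractions, and membership in the number and ordinal word lists. A standalone sketch of the same logic (the two tiny word sets here are illustrative stand-ins for the full lists in the file):

```python
_num_words = {'واحد', 'اثنان', 'مليون'}   # illustrative subset
_ordinal_words = {'أول', 'ثاني'}          # illustrative subset


def like_num(text):
    """Check whether text resembles a number: digits, a simple
    numerator/denominator fraction, or a known number/ordinal word."""
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words or text in _ordinal_words
```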
15
spacy/lang/ar/punctuation.py
Normal file

@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals

from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER

_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
             [r'(?<=[0-9])\+',
              # Arabic is written from Right-To-Left
              r'(?<=[0-9])(?:{})'.format(CURRENCY),
              r'(?<=[0-9])(?:{})'.format(UNITS),
              r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])

TOKENIZER_SUFFIXES = _suffixes
229
spacy/lang/ar/stop_words.py
Normal file

@@ -0,0 +1,229 @@
# coding: utf8
from __future__ import unicode_literals

STOP_WORDS = set("""
من
نحو
لعل
بما
بين
وبين
ايضا
وبينما
تحت
مثلا
لدي
عنه
مع
هي
وهذا
واذا
هذان
انه
بينما
أمسى
وسوف
ولم
لذلك
إلى
منه
منها
كما
ظل
هنا
به
كذلك
اما
هما
بعد
بينهم
التي
أبو
اذا
بدلا
لها
أمام
يلي
حين
ضد
الذي
قد
صار
إذا
مابرح
قبل
كل
وليست
الذين
لهذا
وثي
انهم
باللتي
مافتئ
ولا
بهذه
بحيث
كيف
وله
علي
بات
لاسيما
حتى
وقد
و
أما
فيها
بهذا
لذا
حيث
لقد
إن
فإن
اول
ليت
فاللتي
ولقد
لسوف
هذه
ولماذا
معه
الحالي
بإن
حول
في
عليه
مايزال
ولعل
أنه
أضحى
اي
ستكون
لن
أن
ضمن
وعلى
امسى
الي
ذات
ولايزال
ذلك
فقد
هم
أي
عند
ابن
أو
فهو
فانه
سوف
ما
آل
كلا
عنها
وكذلك
ليست
لم
وأن
ماذا
لو
وهل
اللتي
ولذا
يمكن
فيه
الا
عليها
وبينهم
يوم
وبما
لما
فكان
اضحى
اصبح
لهم
بها
او
الذى
الى
إلي
قال
والتي
لازال
أصبح
ولهذا
مثل
وكانت
لكنه
بذلك
هذا
لماذا
قالت
فقط
لكن
مما
وكل
وان
وأبو
ومن
كان
مازال
هل
بينهن
هو
وما
على
وهو
لأن
واللتي
والذي
دون
عن
وايضا
هناك
بلا
جدا
ثم
منذ
اللذين
لايزال
بعض
مساء
تكون
فلا
بيننا
لا
ولكن
إذ
وأثناء
ليس
ومع
فيهم
ولسوف
بل
تلك
أحد
وهي
وكان
ومنها
وفي
ماانفك
اليوم
وماذا
هؤلاء
وليس
له
أثناء
بد
اليه
كأن
اليها
بتلك
يكون
ولما
هن
والى
كانت
وقبل
ان
لدى
""".split())
47
spacy/lang/ar/tokenizer_exceptions.py
Normal file

@@ -0,0 +1,47 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
import re

_exc = {}

# time
for exc_data in [
    {LEMMA: "قبل الميلاد", ORTH: "ق.م"},
    {LEMMA: "بعد الميلاد", ORTH: "ب. م"},
    {LEMMA: "ميلادي", ORTH: ".م"},
    {LEMMA: "هجري", ORTH: ".هـ"},
    {LEMMA: "توفي", ORTH: ".ت"}]:
    _exc[exc_data[ORTH]] = [exc_data]

# scientific abv.
for exc_data in [
    {LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
    {LEMMA: "الشارح", ORTH: "الشـ"},
    {LEMMA: "الظاهر", ORTH: "الظـ"},
    {LEMMA: "أيضًا", ORTH: "أيضـ"},
    {LEMMA: "إلى آخره", ORTH: "إلخ"},
    {LEMMA: "انتهى", ORTH: "اهـ"},
    {LEMMA: "حدّثنا", ORTH: "ثنا"},
    {LEMMA: "حدثني", ORTH: "ثنى"},
    {LEMMA: "أنبأنا", ORTH: "أنا"},
    {LEMMA: "أخبرنا", ORTH: "نا"},
    {LEMMA: "مصدر سابق", ORTH: "م. س"},
    {LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
    _exc[exc_data[ORTH]] = [exc_data]

# other abv.
for exc_data in [
    {LEMMA: "دكتور", ORTH: "د."},
    {LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
    {LEMMA: "أستاذ", ORTH: "أ."},
    {LEMMA: "بروفيسور", ORTH: "ب."}]:
    _exc[exc_data[ORTH]] = [exc_data]

for exc_data in [
    {LEMMA: "تلفون", ORTH: "ت."},
    {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
    _exc[exc_data[ORTH]] = [exc_data]

TOKENIZER_EXCEPTIONS = _exc
@@ -3,13 +3,11 @@ from __future__ import unicode_literals

 import regex as re

 re.DEFAULT_VERSION = re.VERSION1
 merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
 split_chars = lambda char: list(char.strip().split(' '))
 merge_chars = lambda char: char.strip().replace(' ', '|')


 _bengali = r'[\p{L}&&\p{Bengali}]'
 _hebrew = r'[\p{L}&&\p{Hebrew}]'
 _latin_lower = r'[\p{Ll}&&\p{Latin}]'
@@ -27,11 +25,11 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
 ALPHA_LOWER = merge_char_classes(_lower + _uncased)
 ALPHA_UPPER = merge_char_classes(_upper + _uncased)


 _units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
           'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
           'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
-          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
+          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб '
+          'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')
 _currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'

 # These expressions contain various unicode variations, including characters
@@ -45,7 +43,6 @@ _hyphens = '- – — -- --- —— ~'
 # Details: https://www.compart.com/en/unicode/category/So
 _other_symbols = r'[\p{So}]'

 UNITS = merge_chars(_units)
 CURRENCY = merge_chars(_currency)
 QUOTES = merge_chars(_quotes)
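The Arabic units added above flow through `merge_chars`, which turns the space-separated string into a regex alternation used by the suffix rules. A quick standalone sketch of that mechanism (stdlib `re` suffices here since no `\p{...}` property classes are involved):

```python
import re

# same idea as the merge_chars helper in char_classes.py
merge_chars = lambda char: char.strip().replace(' ', '|')

units = merge_chars('km cm مم جم')
# embed the alternation in a digit-lookbehind suffix rule
suffix_re = re.compile(r'(?<=[0-9])(?:{})'.format(units))
match = suffix_re.search('10مم')
```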
@@ -11,14 +11,14 @@ avais avait avant avec avoir avons ayant

 bah bas basee bat beau beaucoup bien bigre boum bravo brrr

-ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
+c' c’ ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
 celui-ci celui-là cent cependant certain certaine certaines certains certes ces
 cet cette ceux ceux-ci ceux-là chacun chacune chaque cher chers chez chiche
 chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac
 clic combien comme comment comparable comparables compris concernant contre
 couic crac

-da dans de debout dedans dehors deja delà depuis dernier derniere derriere
+d' d’ da dans de debout dedans dehors deja delà depuis dernier derniere derriere
 derrière des desormais desquelles desquels dessous dessus deux deuxième
 deuxièmement devant devers devra different differentes differents différent
 différente différentes différents dire directe directement dit dite dits divers
@@ -37,16 +37,16 @@ gens
 ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum
 hurrah hé hélas i il ils importe

-je jusqu jusque juste
+j' j’ je jusqu jusque juste

-la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
+l' l’ la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
 lors lorsque lui lui-meme lui-même là lès

-ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
+m' m’ ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
 mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins
 mon moyennant multiple multiples même mêmes

-na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
+n' n’ na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
 neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau
 nul néanmoins nôtre nôtres

@@ -60,21 +60,21 @@ plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
 pourrais pourrait pouvait prealable precisement premier première premièrement
 pres probable probante procedant proche près psitt pu puis puisque pur pure

-qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
+qu' qu’ quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
 quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
 quelques quels qui quiconque quinze quoi quoique

 rare rarement rares relative relativement remarquable rend rendre restant reste
 restent restrictif retour revoici revoilà rien

-sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
+s' s’ sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
 semble semblent sent sept septième sera seraient serait seront ses seul seule
 seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
 soixante son sont sous souvent specifique specifiques speculatif stop
 strictement subtiles suffisant suffisante suffit suis suit suivant suivante
 suivantes suivants suivre superpose sur surtout

-ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
+t' t’ ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
 tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous
 tout toute toutefois toutes treize trente tres trois troisième troisièmement
 trop très tsoin tsouin tu té
@ -3,23 +3,87 @@ from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
from ...language import Language
|
from ...language import Language
|
||||||
from ...attrs import LANG
|
from ...attrs import LANG
|
||||||
from ...tokens import Doc
|
from ...tokens import Doc, Token
|
||||||
from ...tokenizer import Tokenizer
|
from ...tokenizer import Tokenizer
|
||||||
|
from .tag_map import TAG_MAP
|
||||||
|
|
||||||
|
import re
|
||||||
|
from collections import namedtuple
|
||||||
|
|
||||||
|
ShortUnitWord = namedtuple('ShortUnitWord', ['surface', 'lemma', 'pos'])
|
||||||
|
|
||||||
|
# XXX Is this the right place for this?
|
||||||
|
Token.set_extension('mecab_tag', default=None)
|
||||||
|
|
||||||
|
def try_mecab_import():
|
||||||
|
"""Mecab is required for Japanese support, so check for it.
|
||||||
|
|
||||||
|
It it's not available blow up and explain how to fix it."""
|
||||||
|
try:
|
||||||
|
import MeCab
|
||||||
|
return MeCab
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("Japanese support requires MeCab: "
|
||||||
|
"https://github.com/SamuraiT/mecab-python3")
|
||||||
|
|
||||||
|
def resolve_pos(token):
|
||||||
|
"""If necessary, add a field to the POS tag for UD mapping.
|
||||||
|
|
||||||
|
Under Universal Dependencies, sometimes the same Unidic POS tag can
|
||||||
|
be mapped differently depending on the literal token or its context
|
||||||
|
in the sentence. This function adds information to the POS tag to
|
||||||
|
resolve ambiguous mappings.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# NOTE: This is a first take. The rules here are crude approximations.
|
||||||
|
# For many of these, full dependencies are needed to properly resolve
|
||||||
|
# PoS mappings.
|
||||||
|
|
||||||
|
if token.pos == '連体詞,*,*,*':
|
||||||
|
if re.match('^[こそあど此其彼]の', token.surface):
|
||||||
|
return token.pos + ',DET'
|
||||||
|
if re.match('^[こそあど此其彼]', token.surface):
|
||||||
|
return token.pos + ',PRON'
|
||||||
|
else:
|
||||||
|
return token.pos + ',ADJ'
|
||||||
|
return token.pos
|
||||||
|
|
||||||
|
def detailed_tokens(tokenizer, text):
|
||||||
|
"""Format Mecab output into a nice data structure, based on Janome."""
|
||||||
|
|
||||||
|
node = tokenizer.parseToNode(text)
|
||||||
|
node = node.next # first node is beginning of sentence and empty, skip it
|
||||||
|
words = []
|
||||||
|
while node.posid != 0:
|
||||||
|
surface = node.surface
|
||||||
|
base = surface # a default value. Updated if available later.
|
||||||
|
parts = node.feature.split(',')
|
||||||
|
pos = ','.join(parts[0:4])
|
||||||
|
|
||||||
|
if len(parts) > 6:
|
||||||
|
# this information is only available for words in the tokenizer dictionary
|
||||||
|
reading = parts[6]
|
||||||
|
base = parts[7]
|
||||||
|
|
||||||
|
words.append( ShortUnitWord(surface, base, pos) )
|
||||||
|
node = node.next
|
||||||
|
return words
|
||||||
|
|
||||||
class JapaneseTokenizer(object):
|
class JapaneseTokenizer(object):
|
||||||
def __init__(self, cls, nlp=None):
|
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        try:
-            from janome.tokenizer import Tokenizer
-        except ImportError:
-            raise ImportError("The Japanese tokenizer requires the Janome "
-                              "library: https://github.com/mocobeta/janome")
-        self.tokenizer = Tokenizer()
+        MeCab = try_mecab_import()
+        self.tokenizer = MeCab.Tagger()

     def __call__(self, text):
-        words = [x.surface for x in self.tokenizer.tokenize(text)]
-        return Doc(self.vocab, words=words, spaces=[False]*len(words))
+        dtokens = detailed_tokens(self.tokenizer, text)
+        words = [x.surface for x in dtokens]
+        doc = Doc(self.vocab, words=words, spaces=[False]*len(words))
+        for token, dtoken in zip(doc, dtokens):
+            token._.mecab_tag = dtoken.pos
+            token.tag_ = resolve_pos(dtoken)
+        return doc

     # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
     # allow serialization (see #1557)
@@ -53,6 +117,7 @@ class JapaneseCharacterSegmenter(object):
 class JapaneseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: 'ja'
+    tag_map = TAG_MAP
     use_janome = True

     @classmethod
@@ -62,13 +127,12 @@ class JapaneseDefaults(Language.Defaults):
         else:
             return JapaneseCharacterSegmenter(cls, nlp.vocab)


 class Japanese(Language):
     lang = 'ja'
     Defaults = JapaneseDefaults
+    Tokenizer = JapaneseTokenizer

     def make_doc(self, text):
         return self.tokenizer(text)


 __all__ = ['Japanese']
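The refactored `__call__` above builds the `Doc` in two passes: first collect the surface forms, then attach MeCab's part-of-speech analysis to each token. A minimal pure-Python sketch of that flow, with a hand-mocked MeCab analysis (the `DetailedToken` tuple and the sample output are illustrative assumptions, not MeCab's real API):

```python
from collections import namedtuple

# Hypothetical stand-in for the parsed MeCab output: each token carries its
# surface form and a comma-joined Unidic part-of-speech string.
DetailedToken = namedtuple('DetailedToken', ['surface', 'pos'])

def detailed_tokens_mock(text):
    # A mocked analysis of "日本語だよ" -- real output would come from
    # MeCab.Tagger().parse(), which is not reproduced here.
    return [
        DetailedToken('日本', '名詞,固有名詞,地名,国'),
        DetailedToken('語', '名詞,普通名詞,一般'),
        DetailedToken('だ', '助動詞,*,*,*'),
        DetailedToken('よ', '助詞,終助詞,*,*'),
    ]

def build_doc(text):
    # Mirrors the new __call__: one pass collects surfaces, a second pass
    # pairs each token with its MeCab tag.
    dtokens = detailed_tokens_mock(text)
    words = [t.surface for t in dtokens]
    return [(word, dtoken.pos) for word, dtoken in zip(words, dtokens)]

print(build_doc('日本語だよ'))
```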
|
88
spacy/lang/ja/tag_map.py
Normal file
88
spacy/lang/ja/tag_map.py
Normal file
|
@ -0,0 +1,88 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import *
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
# Explanation of Unidic tags:
|
||||||
|
# https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
|
||||||
|
|
||||||
|
# Universal Dependencies Mapping:
|
||||||
|
# http://universaldependencies.org/ja/overview/morphology.html
|
||||||
|
# http://universaldependencies.org/ja/pos/all.html
|
||||||
|
|
||||||
|
"記号,一般,*,*":{POS: PUNCT}, # this includes characters used to represent sounds like ドレミ
|
||||||
|
"記号,文字,*,*":{POS: PUNCT}, # this is for Greek and Latin characters used as sumbols, as in math
|
||||||
|
|
||||||
|
"感動詞,フィラー,*,*": {POS: INTJ},
|
||||||
|
"感動詞,一般,*,*": {POS: INTJ},
|
||||||
|
|
||||||
|
# this is specifically for unicode full-width space
|
||||||
|
"空白,*,*,*": {POS: X},
|
||||||
|
|
||||||
|
"形状詞,一般,*,*":{POS: ADJ},
|
||||||
|
"形状詞,タリ,*,*":{POS: ADJ},
|
||||||
|
"形状詞,助動詞語幹,*,*":{POS: ADJ},
|
||||||
|
"形容詞,一般,*,*":{POS: ADJ},
|
||||||
|
"形容詞,非自立可能,*,*":{POS: AUX}, # XXX ADJ if alone, AUX otherwise
|
||||||
|
|
||||||
|
"助詞,格助詞,*,*":{POS: ADP},
|
||||||
|
"助詞,係助詞,*,*":{POS: ADP},
|
||||||
|
"助詞,終助詞,*,*":{POS: PART},
|
||||||
|
"助詞,準体助詞,*,*":{POS: SCONJ}, # の as in 走るのが速い
|
||||||
|
"助詞,接続助詞,*,*":{POS: SCONJ}, # verb ending て
|
||||||
|
"助詞,副助詞,*,*":{POS: PART}, # ばかり, つつ after a verb
|
||||||
|
"助動詞,*,*,*":{POS: AUX},
|
||||||
|
"接続詞,*,*,*":{POS: SCONJ}, # XXX: might need refinement
|
||||||
|
|
||||||
|
"接頭辞,*,*,*":{POS: NOUN},
|
||||||
|
"接尾辞,形状詞的,*,*":{POS: ADJ}, # がち, チック
|
||||||
|
"接尾辞,形容詞的,*,*":{POS: ADJ}, # -らしい
|
||||||
|
"接尾辞,動詞的,*,*":{POS: NOUN}, # -じみ
|
||||||
|
"接尾辞,名詞的,サ変可能,*":{POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
|
||||||
|
"接尾辞,名詞的,一般,*":{POS: NOUN},
|
||||||
|
"接尾辞,名詞的,助数詞,*":{POS: NOUN},
|
||||||
|
"接尾辞,名詞的,副詞可能,*":{POS: NOUN}, # -後, -過ぎ
|
||||||
|
|
||||||
|
"代名詞,*,*,*":{POS: PRON},
|
||||||
|
"動詞,一般,*,*":{POS: VERB},
|
||||||
|
"動詞,非自立可能,*,*":{POS: VERB}, # XXX VERB if alone, AUX otherwise
|
||||||
|
"動詞,非自立可能,*,*,AUX":{POS: AUX},
|
||||||
|
"動詞,非自立可能,*,*,VERB":{POS: VERB},
|
||||||
|
"副詞,*,*,*":{POS: ADV},
|
||||||
|
|
||||||
|
"補助記号,AA,一般,*":{POS: SYM}, # text art
|
||||||
|
"補助記号,AA,顔文字,*":{POS: SYM}, # kaomoji
|
||||||
|
"補助記号,一般,*,*":{POS: SYM},
|
||||||
|
"補助記号,括弧開,*,*":{POS: PUNCT}, # open bracket
|
||||||
|
"補助記号,括弧閉,*,*":{POS: PUNCT}, # close bracket
|
||||||
|
"補助記号,句点,*,*":{POS: PUNCT}, # period or other EOS marker
|
||||||
|
"補助記号,読点,*,*":{POS: PUNCT}, # comma
|
||||||
|
|
||||||
|
"名詞,固有名詞,一般,*":{POS: PROPN}, # general proper noun
|
||||||
|
"名詞,固有名詞,人名,一般":{POS: PROPN}, # person's name
|
||||||
|
"名詞,固有名詞,人名,姓":{POS: PROPN}, # surname
|
||||||
|
"名詞,固有名詞,人名,名":{POS: PROPN}, # first name
|
||||||
|
"名詞,固有名詞,地名,一般":{POS: PROPN}, # place name
|
||||||
|
"名詞,固有名詞,地名,国":{POS: PROPN}, # country name
|
||||||
|
|
||||||
|
"名詞,助動詞語幹,*,*":{POS: AUX},
|
||||||
|
"名詞,数詞,*,*":{POS: NUM}, # includes Chinese numerals
|
||||||
|
|
||||||
|
"名詞,普通名詞,サ変可能,*":{POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
|
||||||
|
"名詞,普通名詞,サ変可能,*,NOUN":{POS: NOUN},
|
||||||
|
"名詞,普通名詞,サ変可能,*,VERB":{POS: VERB},
|
||||||
|
|
||||||
|
"名詞,普通名詞,サ変形状詞可能,*":{POS: NOUN}, # ex: 下手
|
||||||
|
"名詞,普通名詞,一般,*":{POS: NOUN},
|
||||||
|
"名詞,普通名詞,形状詞可能,*":{POS: NOUN}, # XXX: sometimes ADJ in UDv2
|
||||||
|
"名詞,普通名詞,形状詞可能,*,NOUN":{POS: NOUN},
|
||||||
|
"名詞,普通名詞,形状詞可能,*,ADJ":{POS: ADJ},
|
||||||
|
"名詞,普通名詞,助数詞可能,*":{POS: NOUN}, # counter / unit
|
||||||
|
"名詞,普通名詞,副詞可能,*":{POS: NOUN},
|
||||||
|
|
||||||
|
"連体詞,*,*,*":{POS: ADJ}, # XXX this has exceptions based on literal token
|
||||||
|
"連体詞,*,*,*,ADJ":{POS: ADJ},
|
||||||
|
"連体詞,*,*,*,PRON":{POS: PRON},
|
||||||
|
"連体詞,*,*,*,DET":{POS: DET},
|
||||||
|
}
|
|
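The five-field keys in the tag map (e.g. `動詞,非自立可能,*,*,VERB`) exist so that ambiguous Unidic tags can be refined by context before falling back to the plain four-field entry, as the XXX comments note. A small sketch of that lookup order, using plain strings in place of the `spacy.symbols` constants (`TAG_MAP_SKETCH` and `resolve_pos_sketch` are hypothetical names, not spaCy's implementation):

```python
# Entries that disambiguate by context carry an extra trailing field.
TAG_MAP_SKETCH = {
    "動詞,非自立可能,*,*": "VERB",      # default reading
    "動詞,非自立可能,*,*,AUX": "AUX",   # context says auxiliary
    "助動詞,*,*,*": "AUX",
}

def resolve_pos_sketch(unidic_tag, context_hint=None):
    # Try the context-refined five-field key first, then the plain tag.
    if context_hint:
        refined = TAG_MAP_SKETCH.get(unidic_tag + ',' + context_hint)
        if refined:
            return refined
    return TAG_MAP_SKETCH.get(unidic_tag, "X")

print(resolve_pos_sketch("動詞,非自立可能,*,*"))          # default mapping
print(resolve_pos_sketch("動詞,非自立可能,*,*", "AUX"))   # refined mapping
```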
@@ -6,10 +6,10 @@ from ...attrs import LIKE_NUM

 _num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
               'oito', 'nove', 'dez', 'onze', 'doze', 'treze', 'catorze',
-              'quinze', 'dezasseis', 'dezassete', 'dezoito', 'dezanove', 'vinte',
+              'quinze', 'dezesseis', 'dezasseis', 'dezessete', 'dezassete', 'dezoito', 'dezenove', 'dezanove', 'vinte',
               'trinta', 'quarenta', 'cinquenta', 'sessenta', 'setenta',
-              'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião',
-              'quadrilião']
+              'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilhão', 'bilião', 'trilhão', 'trilião',
+              'quatrilhão']

 _ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
                   'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',
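The Portuguese change above adds Brazilian spellings ('dezesseis', 'bilhão') alongside the European ones ('dezasseis', 'bilião'). A sketch of how a `LIKE_NUM` getter typically consults such a list (the `like_num` helper here is illustrative, not spaCy's exact implementation):

```python
# A trimmed word list: both regional spellings must be present for the
# membership test below to accept either variant.
_num_words = ['dezesseis', 'dezasseis', 'dezessete', 'dezassete',
              'bilhão', 'bilião', 'trilhão', 'trilião']

def like_num(text):
    # Strip digit separators, then check digits or the known number words.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    return text.lower() in _num_words

print(like_num('dezesseis'), like_num('dezasseis'), like_num('1.000'))
```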
@@ -3,6 +3,7 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lemmatizer import LOOKUP

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -17,6 +18,7 @@ class RomanianDefaults(Language.Defaults):
     lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    lemma_lookup = LOOKUP


 class Romanian(Language):
23 spacy/lang/ro/examples.py Normal file
@@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.ro import Romanian
>>> from spacy.lang.ro.examples import sentences
>>> nlp = Romanian()
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Apple plănuiește să cumpere o companie britanică pentru un miliard de dolari",
    "Municipalitatea din San Francisco ia în calcul interzicerea roboților curieri pe trotuar",
    "Londra este un oraș mare în Regatul Unit",
    "Unde ești?",
    "Cine este președintele Franței?",
    "Care este capitala Statelor Unite?",
    "Când s-a născut Barack Obama?"
]
314816 spacy/lang/ro/lemmatizer.py Normal file
File diff suppressed because it is too large. Load Diff
@@ -28,6 +28,8 @@ acestia
 acestui
 aceşti
 aceştia
+acești
+aceștia
 acolo
 acord
 acum
@@ -51,6 +53,7 @@ altfel
 alti
 altii
 altul
+alături
 am
 anume
 apoi
@@ -80,11 +83,15 @@ au
 avea
 avem
 aveţi
+aveți
 avut
 azi
 aş
 aşadar
 aţi
+aș
+așadar
+ați
 b
 ba
 bine
@@ -136,11 +143,13 @@ cât
 câte
 câtva
 câţi
+câți
 cînd
 cît
 cîte
 cîtva
 cîţi
+cîți
 că
 căci
 cărei
@@ -167,6 +176,7 @@ departe
 desi
 despre
 deşi
+deși
 din
 dinaintea
 dintr
@@ -191,6 +201,7 @@ este
 eu
 exact
 eşti
+ești
 f
 face
 fara
@@ -203,6 +214,7 @@ fii
 fim
 fiu
 fiţi
+fiți
 foarte
 fost
 frumos
@@ -210,6 +222,7 @@ fără
 g
 geaba
 graţie
+grație
 h
 halbă
 i
@@ -259,6 +272,8 @@ multi
 multă
 mulţi
 mulţumesc
+mulți
+mulțumesc
 mâine
 mîine
 mă
@@ -274,6 +289,7 @@ nimeri
 nimic
 niste
 nişte
+niște
 noastre
 noastră
 noi
@@ -284,6 +300,7 @@ nou
 noua
 nouă
 noştri
+noștri
 nu
 numai
 o
@@ -322,6 +339,9 @@ putini
 puţin
 puţina
 puţină
+puțin
+puțina
+puțină
 până
 pînă
 r
@@ -343,11 +363,13 @@ sub
 sunt
 suntem
 sunteţi
+sunteți
 sus
 sută
 sînt
 sîntem
 sînteţi
+sînteți
 să
 săi
 său
@@ -367,7 +389,9 @@ toti
 totul
 totusi
 totuşi
+totuși
 toţi
+toți
 trei
 treia
 treilea
@@ -404,6 +428,7 @@ vor
 vostru
 vouă
 voştri
+voștri
 vreme
 vreo
 vreun
@@ -428,15 +453,23 @@ zice
 întrucât
 întrucît
 îţi
+îți
 ăla
 ălea
 ăsta
 ăstea
 ăştia
+ăștia
 şapte
 şase
 şi
 ştiu
 ţi
 ţie
+șapte
+șase
+și
+știu
+ți
+ție
 """.split())
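The Romanian stop-word additions look like duplicates but are not: the list previously used only the cedilla forms (ş, ţ), while correctly encoded text uses the comma-below forms (ș, ț), which are distinct Unicode codepoints. A quick check of the two spellings of "și":

```python
import unicodedata

cedilla, comma_below = 'ş', 'ș'
# Two different codepoints, so string comparison treats them as different.
print(hex(ord(cedilla)), unicodedata.name(cedilla))
print(hex(ord(comma_below)), unicodedata.name(comma_below))

# A stop-word set therefore needs both spellings to match real-world text.
stop_words = {'şi', 'și'}
print('şi' != 'și', 'şi' in stop_words, 'și' in stop_words)
```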
@@ -58,9 +58,9 @@ cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) nogil:
     cdef int i, S_i
     for i in range(stcls.stack_depth()):
         S_i = stcls.S(i)
-        if gold.heads[target] == S_i:
+        if gold.has_dep[target] and gold.heads[target] == S_i:
             cost += 1
-        if gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
+        if gold.has_dep[S_i] and gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
             cost += 1
     if BINARY_COSTS and cost >= 1:
         return cost
@@ -73,10 +73,12 @@ cdef weight_t pop_cost(StateClass stcls, const GoldParseC* gold, int target) nogil:
     cdef int i, B_i
     for i in range(stcls.buffer_length()):
         B_i = stcls.B(i)
+        if gold.has_dep[B_i]:
             cost += gold.heads[B_i] == target
-        cost += gold.heads[target] == B_i
         if gold.heads[B_i] == B_i or gold.heads[B_i] < target:
             break
+        if gold.has_dep[target]:
+            cost += gold.heads[target] == B_i
     if BINARY_COSTS and cost >= 1:
         return cost
     if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0:
@@ -107,6 +109,9 @@ cdef bint arc_is_gold(const GoldParseC* gold, int head, int child) nogil:

 cdef bint label_is_gold(const GoldParseC* gold, int head, int child, attr_t label) nogil:
     if not gold.has_dep[child]:
+        if label == SUBTOK_LABEL:
+            return False
+        else:
             return True
 elif label == 0:
         return True
@@ -167,7 +172,7 @@ cdef class Reduce:
         # Decrement cost for the arcs we save
         for i in range(1, st.stack_depth()):
             S_i = st.S(i)
-            if gold.heads[st.S(0)] == S_i:
+            if gold.has_dep[st.S(0)] and gold.heads[st.S(0)] == S_i:
                 cost -= 1
             if gold.heads[S_i] == st.S(0):
                 cost -= 1
@@ -208,7 +213,9 @@ cdef class LeftArc:
         # Account for deps we might lose between S0 and stack
         if not s.has_head(s.S(0)):
             for i in range(1, s.stack_depth()):
+                if gold.has_dep[s.S(i)]:
                     cost += gold.heads[s.S(i)] == s.S(0)
+                if gold.has_dep[s.S(0)]:
                     cost += gold.heads[s.S(0)] == s.S(i)
         return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0))

@@ -284,18 +291,20 @@ cdef class Break:
             S_i = s.S(i)
             for j in range(s.buffer_length()):
                 B_i = s.B(j)
+                if gold.has_dep[S_i]:
                     cost += gold.heads[S_i] == B_i
+                if gold.has_dep[B_i]:
                     cost += gold.heads[B_i] == S_i
                 if cost != 0:
                     return cost
         # Check for sentence boundary --- if it's here, we can't have any deps
         # between stack and buffer, so rest of action is irrelevant.
-        s0_root = _get_root(s.S(0), gold)
-        b0_root = _get_root(s.B(0), gold)
-        if s0_root != b0_root or s0_root == -1 or b0_root == -1:
+        if not gold.has_dep[s.S(0)] or not gold.has_dep[s.B(0)]:
             return cost
+        if gold.sent_start[s.B_(0).l_edge] == -1:
+            return cost+1
         else:
-            return cost + 1
+            return cost

     @staticmethod
     cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil:
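The recurring pattern in the arc-eager hunks above is guarding every `gold.heads[...]` comparison with `gold.has_dep[...]`, so tokens without a gold dependency annotation no longer contribute spurious cost. A pure-Python sketch of the guarded `pop_cost` idea (a simplified illustration, omitting the early-`break` and `BINARY_COSTS` details of the Cython code):

```python
def pop_cost_sketch(buffer_ids, heads, has_dep, target):
    # Count gold arcs between `target` and the buffer that popping `target`
    # would make unreachable -- but only for tokens with a gold dependency.
    cost = 0
    for b_i in buffer_ids:
        if has_dep[b_i]:
            cost += heads[b_i] == target
        if has_dep[target]:
            cost += heads[target] == b_i
    return cost

heads = {0: 1, 1: 0, 2: 0}               # token 2's head is a placeholder...
has_dep = {0: True, 1: True, 2: False}   # ...because it has no gold dep
# Without the has_dep guard, token 2's placeholder head would add cost 1.
print(pop_cost_sketch([1, 2], heads, has_dep, target=0))
```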
@@ -15,7 +15,8 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'xx']
+              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']

 _models = {'en': ['en_core_web_sm'],
            'de': ['de_core_news_md'],
            'fr': ['fr_core_news_sm'],
@@ -50,8 +51,8 @@ def RU(request):

 #@pytest.fixture(params=_languages)
 #def tokenizer(request):
 #    lang = util.get_lang_class(request.param)
 #    return lang.Defaults.create_tokenizer()


 @pytest.fixture
@@ -100,6 +101,11 @@ def fi_tokenizer():
     return util.get_lang_class('fi').Defaults.create_tokenizer()


+@pytest.fixture
+def ro_tokenizer():
+    return util.get_lang_class('ro').Defaults.create_tokenizer()
+
+
 @pytest.fixture
 def id_tokenizer():
     return util.get_lang_class('id').Defaults.create_tokenizer()
@@ -135,10 +141,9 @@ def da_tokenizer():

 @pytest.fixture
 def ja_tokenizer():
-    janome = pytest.importorskip("janome")
+    janome = pytest.importorskip("MeCab")
     return util.get_lang_class('ja').Defaults.create_tokenizer()


 @pytest.fixture
 def th_tokenizer():
     pythainlp = pytest.importorskip("pythainlp")
@@ -148,6 +153,9 @@ def th_tokenizer():
 def tr_tokenizer():
     return util.get_lang_class('tr').Defaults.create_tokenizer()

+@pytest.fixture
+def ar_tokenizer():
+    return util.get_lang_class('ar').Defaults.create_tokenizer()

 @pytest.fixture
 def ru_tokenizer():
0 spacy/tests/lang/ar/__init__.py Normal file
26 spacy/tests/lang/ar/test_exceptions.py Normal file
@@ -0,0 +1,26 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text',
                         ["ق.م", "إلخ", "ص.ب", "ت."])
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
    tokens = ar_tokenizer(text)
    assert len(tokens) == 1


def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
    tokens = ar_tokenizer(text)
    assert len(tokens) == 7
    assert tokens[6].text == "ق.م"
    assert tokens[6].lemma_ == "قبل الميلاد"


def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
    text = u"يبلغ طول مضيق طارق 14كم "
    tokens = ar_tokenizer(text)
    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
    assert len(tokens) == 6
13 spacy/tests/lang/ar/test_text.py Normal file
@@ -0,0 +1,13 @@
# coding: utf8
from __future__ import unicode_literals


def test_tokenizer_handles_long_text(ar_tokenizer):
    text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
و قد نجح في الحصول على جائزة نوبل للآداب، ليكون بذلك العربي الوحيد الذي فاز بها."""

    tokens = ar_tokenizer(text)
    assert tokens[3].is_stop == True
    assert len(tokens) == 77
@@ -5,15 +5,41 @@ import pytest


 TOKENIZER_TESTS = [
-    ("日本語だよ", ['日本語', 'だ', 'よ']),
+    ("日本語だよ", ['日本', '語', 'だ', 'よ']),
     ("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
     ("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
-    ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お仕置き', 'よ', '!']),
+    ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
     ("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
 ]

+TAG_TESTS = [
+    ("日本語だよ", ['名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
+    ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
+    ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
+    ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点']),
+    ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
+]

+POS_TESTS = [
+    ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
+    ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
+    ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
+    ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
+    ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
+]

 @pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
 def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens):
     tokens = [token.text for token in ja_tokenizer(text)]
     assert tokens == expected_tokens

+@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
+def test_japanese_tokenizer(ja_tokenizer, text, expected_tags):
+    tags = [token.tag_ for token in ja_tokenizer(text)]
+    assert tags == expected_tags

+@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
+def test_japanese_tokenizer(ja_tokenizer, text, expected_pos):
+    pos = [token.pos_ for token in ja_tokenizer(text)]
+    assert pos == expected_pos
0 spacy/tests/lang/ro/__init__.py Normal file
13 spacy/tests/lang/ro/test_lemmatizer.py Normal file
@@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('string,lemma', [('câini', 'câine'),
                                          ('expedițiilor', 'expediție'),
                                          ('pensete', 'pensetă'),
                                          ('erau', 'fi')])
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
    tokens = ro_tokenizer(string)
    assert tokens[0].lemma_ == lemma
18 spacy/tests/regression/test_issue2219.py Normal file
|
@ -0,0 +1,18 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
from ..util import add_vecs_to_vocab, get_doc
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def vectors():
|
||||||
|
return [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def vocab(en_vocab, vectors):
|
||||||
|
add_vecs_to_vocab(en_vocab, vectors)
|
||||||
|
return en_vocab
|
||||||
|
|
||||||
|
def test_issue2219(vocab, vectors):
|
||||||
|
[(word1, vec1), (word2, vec2)] = vectors
|
||||||
|
doc = get_doc(vocab, words=[word1, word2])
|
||||||
|
assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])
|
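Issue #2219 concerned `similarity` giving asymmetric results. Cosine similarity over the vectors is symmetric by construction, which is exactly the property the regression test above pins down; a standalone illustration with the same sample vectors:

```python
import math

def cosine(u, v):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vec1, vec2 = [1, 2, 3], [4, 5, 6]
# Swapping the arguments cannot change the result.
assert cosine(vec1, vec2) == cosine(vec2, vec1)
print(round(cosine(vec1, vec2), 4))
```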
|
@ -155,7 +155,7 @@ cdef class Token:
|
||||||
"""
|
"""
|
||||||
if 'similarity' in self.doc.user_token_hooks:
|
if 'similarity' in self.doc.user_token_hooks:
|
||||||
return self.doc.user_token_hooks['similarity'](self)
|
return self.doc.user_token_hooks['similarity'](self)
|
||||||
if hasattr(other, '__len__') and len(other) == 1:
|
if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, "__getitem__"):
|
||||||
if self.c.lex.orth == getattr(other[0], 'orth', None):
|
if self.c.lex.orth == getattr(other[0], 'orth', None):
|
||||||
return 1.0
|
return 1.0
|
||||||
elif hasattr(other, 'orth'):
|
elif hasattr(other, 'orth'):
|
||||||
|
|
|
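The extra `hasattr(other, "__getitem__")` check matters because an object can be sized without being indexable, e.g. a one-element set, in which case `other[0]` on the old code path would raise `TypeError`. A sketch of the guard in plain Python (`similarity_guard` is an illustrative stand-in, not the actual `Token` method):

```python
def similarity_guard(other):
    # Only treat `other` as a one-element container if it can be indexed.
    if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, '__getitem__'):
        return other[0]   # safe: the object supports other[0]
    return None           # fall through to the other comparison branches

assert similarity_guard(['token']) == 'token'
assert similarity_guard({'token'}) is None   # set: len() == 1 but no [0]
print('guard ok')
```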
@@ -27,8 +27,6 @@ The docs can always use another example or more detail, and they should always be
 While all page content lives in the `.jade` files, article meta (page titles, sidebars etc.) is stored as JSON. Each folder contains a `_data.json` with all required meta for its files.

-For simplicity, all sites linked in the [tutorials](https://spacy.io/docs/usage/tutorials) and [showcase](https://spacy.io/docs/usage/showcase) are also stored as JSON. So in order to edit those pages, there's no need to dig into the Jade files – simply edit the [`_data.json`](docs/usage/_data.json).
-
 ### Markup language and conventions

 Jade/Pug is a whitespace-sensitive markup language that compiles to HTML. Indentation is used to nest elements, and for template logic, like `if`/`else` or `for`, mainly used to iterate over objects and arrays in the meta data. It also allows inline JavaScript expressions.
@@ -12,8 +12,6 @@
     "COMPANY_URL": "https://explosion.ai",
     "DEMOS_URL": "https://explosion.ai/demos",
     "MODELS_REPO": "explosion/spacy-models",
-    "KERNEL_BINDER": "ines/spacy-binder",
-    "KERNEL_PYTHON": "python3",

     "SPACY_VERSION": "2.0",
     "BINDER_VERSION": "2.0.11",
@@ -87,7 +85,7 @@
     ],

     "V_CSS": "2.1.3",
-    "V_JS": "2.1.1",
+    "V_JS": "2.1.2",
     "DEFAULT_SYNTAX": "python",
     "ANALYTICS": "UA-58931649-1",
     "MAILCHIMP": {
@@ -15,7 +15,7 @@ p
         +cell Nationalities or religious or political groups.

     +row
-        +cell #[code FACILITY]
+        +cell #[code FAC]
         +cell Buildings, airports, highways, bridges, etc.

     +row
@@ -149,7 +149,7 @@ p

     +aside-code("Example").
         from spacy.tokens import Doc
-        city_getter = lambda doc: doc.text in ('New York', 'Paris', 'Berlin')
+        city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
         Doc.set_extension('has_city', getter=city_getter)
         doc = nlp(u'I like New York')
         assert doc._.has_city
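The doc-example fix above changes the getter from whole-text membership in the tuple to a substring check, which is what makes the `assert doc._.has_city` line actually hold for 'I like New York'. The difference in plain Python, with the getters operating on a bare string instead of a `Doc`:

```python
# Old form: tests whether the entire text equals one of the tuple members.
old_getter = lambda text: text in ('New York', 'Paris', 'Berlin')
# New form: tests whether any city name occurs inside the text.
new_getter = lambda text: any(city in text for city in ('New York', 'Paris', 'Berlin'))

text = 'I like New York'
print(old_getter(text), new_getter(text))  # False True
```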
@@ -127,7 +127,7 @@ p

     +aside-code("Example").
         from spacy.tokens import Span
-        city_getter = lambda span: span.text in ('New York', 'Paris', 'Berlin')
+        city_getter = lambda span: any(city in span.text for city in ('New York', 'Paris', 'Berlin'))
         Span.set_extension('has_city', getter=city_getter)
         doc = nlp(u'I like New York in Autumn')
         assert doc[1:4]._.has_city
@@ -47,7 +47,7 @@ import initUniverse from './universe.vue.js';
 */
 {
     if (window.Juniper) {
-        new Juniper({ repo: 'ines/spacy-binder' });
+        new Juniper({ repo: 'ines/spacy-io-binder' });
     }
 }

4 website/assets/js/vendor/juniper.min.js vendored
File diff suppressed because one or more lines are too long
@@ -445,6 +445,29 @@
         },
         "category": ["visualizers"]
     },
+    {
+        "id": "scattertext",
+        "slogan": "Beautiful visualizations of how language differs among document types",
+        "description": "A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.",
+        "github": "JasonKessler/scattertext",
+        "image": "https://jasonkessler.github.io/2012conventions0.0.2.2.png",
+        "code_example": [
+            "import spacy",
+            "import scattertext as st",
+            "",
+            "nlp = spacy.load('en')",
+            "corpus = st.CorpusFromPandas(convention_df,",
+            "                             category_col='party',",
+            "                             text_col='text',",
+            "                             nlp=nlp).build()"
+        ],
+        "author": "Jason Kessler",
+        "author_links": {
+            "github": "JasonKessler",
+            "twitter": "jasonkessler"
+        },
+        "category": ["visualizers"]
+    },
     {
         "id": "rasa",
         "title": "Rasa NLU",
@@ -4,7 +4,7 @@ p
|
||||||
| The individual components #[strong expose variables] that can be imported
|
| The individual components #[strong expose variables] that can be imported
|
||||||
| within a language module, and added to the language's #[code Defaults].
|
| within a language module, and added to the language's #[code Defaults].
|
||||||
| Some components, like the punctuation rules, usually don't need much
|
| Some components, like the punctuation rules, usually don't need much
|
||||||
| customisation and can simply be imported from the global rules. Others,
|
| customisation and can be imported from the global rules. Others,
|
||||||
| like the tokenizer and norm exceptions, are very specific and will make
|
| like the tokenizer and norm exceptions, are very specific and will make
|
||||||
| a big difference to spaCy's performance on the particular language and
|
| a big difference to spaCy's performance on the particular language and
|
||||||
| training a language model.
|
| training a language model.
|
||||||
|
|
|
@@ -92,6 +92,7 @@
|
||||||
"Dependency Parse": "dependency-parse",
|
"Dependency Parse": "dependency-parse",
|
||||||
"Named Entities": "named-entities",
|
"Named Entities": "named-entities",
|
||||||
"Tokenization": "tokenization",
|
"Tokenization": "tokenization",
|
||||||
|
"Sentence Segmentation": "sbd",
|
||||||
"Rule-based Matching": "rule-based-matching"
|
"Rule-based Matching": "rule-based-matching"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
|
@@ -39,7 +39,7 @@ p
|
||||||
| this. The above error mostly occurs when doing a system-wide installation,
|
| this. The above error mostly occurs when doing a system-wide installation,
|
||||||
| which will create the symlinks in a system directory. Run the
|
| which will create the symlinks in a system directory. Run the
|
||||||
| #[code download] or #[code link] command as administrator (on Windows,
|
| #[code download] or #[code link] command as administrator (on Windows,
|
||||||
| simply right-click on your terminal or shell and select "Run as
|
| you can either right-click on your terminal or shell and select "Run as
|
||||||
| Administrator"), or use a #[code virtualenv] to install spaCy in a user
|
| Administrator"), or use a #[code virtualenv] to install spaCy in a user
|
||||||
| directory, instead of doing a system-wide installation.
|
| directory, instead of doing a system-wide installation.
|
||||||
|
|
||||||
|
|
|
@@ -220,8 +220,8 @@ p
|
||||||
|
|
||||||
p
|
p
|
||||||
| The best way to understand spaCy's dependency parser is interactively.
|
| The best way to understand spaCy's dependency parser is interactively.
|
||||||
| To make this easier, spaCy v2.0+ comes with a visualization module. Simply
|
| To make this easier, spaCy v2.0+ comes with a visualization module. You
|
||||||
| pass a #[code Doc] or a list of #[code Doc] objects to
|
| can pass a #[code Doc] or a list of #[code Doc] objects to
|
||||||
| displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
|
| displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
|
||||||
| run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
|
| run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
|
||||||
| to generate the raw markup. If you want to know how to write rules that
|
| to generate the raw markup. If you want to know how to write rules that
|
||||||
|
|
|
@@ -195,7 +195,7 @@ p
|
||||||
| lets you explore an entity recognition model's behaviour interactively.
|
| lets you explore an entity recognition model's behaviour interactively.
|
||||||
| If you're training a model, it's very useful to run the visualization
|
| If you're training a model, it's very useful to run the visualization
|
||||||
| yourself. To help you do that, spaCy v2.0+ comes with a visualization
|
| yourself. To help you do that, spaCy v2.0+ comes with a visualization
|
||||||
| module. Simply pass a #[code Doc] or a list of #[code Doc] objects to
|
| module. You can pass a #[code Doc] or a list of #[code Doc] objects to
|
||||||
| displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
|
| displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
|
||||||
| run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
|
| run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
|
||||||
| to generate the raw markup.
|
| to generate the raw markup.
|
||||||
|
|
129
website/usage/_linguistic-features/_sentence-segmentation.jade
Normal file
|
@@ -0,0 +1,129 @@
|
||||||
|
//- 💫 DOCS > USAGE > LINGUISTIC FEATURES > SENTENCE SEGMENTATION
|
||||||
|
|
||||||
|
p
|
||||||
|
| A #[+api("doc") #[code Doc]] object's sentences are available via the
|
||||||
|
| #[code Doc.sents] property. Unlike other libraries, spaCy uses the
|
||||||
|
| dependency parse to determine sentence boundaries. This is usually more
|
||||||
|
| accurate than a rule-based approach, but it also means you'll need a
|
||||||
|
| #[strong statistical model] and accurate predictions. If your
|
||||||
|
| texts are closer to general-purpose news or web text, this should work
|
||||||
|
| well out-of-the-box. For social media or conversational text that
|
||||||
|
| doesn't follow the same rules, your application may benefit from a custom
|
||||||
|
| rule-based implementation. You can either plug a rule-based component
|
||||||
|
| into your #[+a("/usage/processing-pipelines") processing pipeline] or use
|
||||||
|
| the #[code SentenceSegmenter] component with a custom stategy.
|
||||||
|
|
||||||
|
+h(3, "sbd-parser") Default: Using the dependency parse
|
||||||
|
+tag-model("dependency parser")
|
||||||
|
|
||||||
|
p
|
||||||
|
| To view a #[code Doc]'s sentences, you can iterate over
|
||||||
|
| #[code Doc.sents], a generator that yields
|
||||||
|
| #[+api("span") #[code Span]] objects.
|
||||||
|
|
||||||
|
+code-exec.
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
nlp = spacy.load('en_core_web_sm')
|
||||||
|
doc = nlp(u"This is a sentence. This is another sentence.")
|
||||||
|
for sent in doc.sents:
|
||||||
|
print(sent.text)
|
||||||
|
|
||||||
|
+h(3, "sbd-manual") Setting boundaries manually
|
||||||
|
|
||||||
|
p
|
||||||
|
| spaCy's dependency parser respects already set boundaries, so you can
|
||||||
|
| preprocess your #[code Doc] using custom rules #[em before] it's
|
||||||
|
| parsed. This can be done by adding a
|
||||||
|
| #[+a("/usage/processing-pipelines") custom pipeline component]. Depending
|
||||||
|
| on your text, this may also improve accuracy, since the parser is
|
||||||
|
| constrained to predict parses consistent with the sentence boundaries.
|
||||||
|
|
||||||
|
+infobox("Important note", "⚠️")
|
||||||
|
| To prevent inconsistent state, you can only set boundaries #[em before] a
|
||||||
|
| document is parsed (and #[code Doc.is_parsed] is #[code False]). To
|
||||||
|
| ensure that your component is added in the right place, you can set
|
||||||
|
| #[code before='parser'] or #[code first=True] when adding it to the
|
||||||
|
| pipeline using #[+api("language#add_pipe") #[code nlp.add_pipe]].
|
||||||
|
|
||||||
|
p
|
||||||
|
| Here's an example of a component that implements a pre-processing rule
|
||||||
|
| for splitting on #[code '...'] tokens. The component is added before
|
||||||
|
| the parser, which is then used to further segment the text. This
|
||||||
|
| approach can be useful if you want to implement #[em additional] rules
|
||||||
|
| specific to your data, while still being able to take advantage of
|
||||||
|
| dependency-based sentence segmentation.
|
||||||
|
|
||||||
|
+code-exec.
|
||||||
|
import spacy
|
||||||
|
|
||||||
|
text = u"this is a sentence...hello...and another sentence."
|
||||||
|
|
||||||
|
nlp = spacy.load('en_core_web_sm')
|
||||||
|
doc = nlp(text)
|
||||||
|
print('Before:', [sent.text for sent in doc.sents])
|
||||||
|
|
||||||
|
def set_custom_boundaries(doc):
|
||||||
|
for token in doc[:-1]:
|
||||||
|
if token.text == '...':
|
||||||
|
doc[token.i+1].is_sent_start = True
|
||||||
|
return doc
|
||||||
|
|
||||||
|
nlp.add_pipe(set_custom_boundaries, before='parser')
|
||||||
|
doc = nlp(text)
|
||||||
|
print('After:', [sent.text for sent in doc.sents])
|
||||||
|
|
||||||
|
+h(3, "sbd-component") Rule-based pipeline component
|
||||||
|
|
||||||
|
p
|
||||||
|
| The #[code sentencizer] component is a
|
||||||
|
| #[+a("/usage/processing-pipelines") pipeline component] that splits
|
||||||
|
| sentences on punctuation like #[code .], #[code !] or #[code ?].
|
||||||
|
| You can plug it into your pipeline if you only need sentence boundaries
|
||||||
|
| without the dependency parse. Note that #[code Doc.sents] will
|
||||||
|
| #[strong raise an error] if no sentence boundaries are set.
|
||||||
|
|
||||||
|
+code-exec.
|
||||||
|
import spacy
|
||||||
|
from spacy.lang.en import English
|
||||||
|
|
||||||
|
nlp = English() # just the language with no model
|
||||||
|
sbd = nlp.create_pipe('sentencizer') # or: nlp.create_pipe('sbd')
|
||||||
|
nlp.add_pipe(sbd)
|
||||||
|
doc = nlp(u"This is a sentence. This is another sentence.")
|
||||||
|
for sent in doc.sents:
|
||||||
|
print(sent.text)
|
||||||
|
|
||||||
|
+h(3, "sbd-custom") Custom rule-based strategy
|
||||||
|
|
||||||
|
p
|
||||||
|
| If you want to implement your own strategy that differs from the default
|
||||||
|
| rule-based approach of splitting on punctuation, you can also instantiate
|
||||||
|
| the #[code SentenceSegmenter] directly and pass in your own strategy.
|
||||||
|
| The strategy should be a function that takes a #[code Doc] object and
|
||||||
|
| yields a #[code Span] for each sentence. Here's an example of a custom
|
||||||
|
| segmentation strategy for splitting on newlines only:
|
||||||
|
|
||||||
|
+code-exec.
|
||||||
|
from spacy.lang.en import English
|
||||||
|
from spacy.pipeline import SentenceSegmenter
|
||||||
|
|
||||||
|
def split_on_newlines(doc):
|
||||||
|
start = 0
|
||||||
|
seen_newline = False
|
||||||
|
for word in doc:
|
||||||
|
if seen_newline and not word.is_space:
|
||||||
|
yield doc[start:word.i]
|
||||||
|
start = word.i
|
||||||
|
seen_newline = False
|
||||||
|
elif word.text == '\n':
|
||||||
|
seen_newline = True
|
||||||
|
if start < len(doc):
|
||||||
|
yield doc[start:len(doc)]
|
||||||
|
|
||||||
|
nlp = English() # just the language with no model
|
||||||
|
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
|
||||||
|
nlp.add_pipe(sbd)
|
||||||
|
doc = nlp(u"This is a sentence\n\nThis is another sentence\nAnd more")
|
||||||
|
for sent in doc.sents:
|
||||||
|
print([token.text for token in sent])
|
|
@@ -274,7 +274,7 @@ p
|
||||||
| In spaCy v1.x, you had to add a custom tokenizer by passing it to the
|
| In spaCy v1.x, you had to add a custom tokenizer by passing it to the
|
||||||
| #[code make_doc] keyword argument, or by passing a tokenizer "factory"
|
| #[code make_doc] keyword argument, or by passing a tokenizer "factory"
|
||||||
| to #[code create_make_doc]. This was unnecessarily complicated. Since
|
| to #[code create_make_doc]. This was unnecessarily complicated. Since
|
||||||
| spaCy v2.0, you can simply write to #[code nlp.tokenizer]. If your
|
| spaCy v2.0, you can write to #[code nlp.tokenizer] instead. If your
|
||||||
| tokenizer needs the vocab, you can write a function and use
|
| tokenizer needs the vocab, you can write a function and use
|
||||||
| #[code nlp.vocab].
|
| #[code nlp.vocab].
|
||||||
|
|
||||||
|
|
|
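The hunk above notes that since spaCy v2.0 you can write to `nlp.tokenizer` directly. A minimal sketch of that pattern, assuming spaCy is installed (the whitespace-only `WhitespaceTokenizer` is an illustrative toy, not a recommended implementation):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Toy custom tokenizer: splits the text on single spaces only."""
    def __init__(self, vocab):
        # The tokenizer needs the vocab to construct Doc objects
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        return Doc(self.vocab, words=words)

nlp = English()  # just the language with no model
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)  # overwrite the default tokenizer
doc = nlp(u"What's happened to me? he thought.")
print([token.text for token in doc])
```

Because the tokenizer receives `nlp.vocab`, the resulting `Doc` shares the pipeline's vocabulary, so downstream components keep working.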
@@ -20,14 +20,14 @@ include _install-basics
|
||||||
|
|
||||||
p
|
p
|
||||||
| To download a model directly using #[+a("https://pypi.python.org/pypi/pip") pip],
|
| To download a model directly using #[+a("https://pypi.python.org/pypi/pip") pip],
|
||||||
| simply point #[code pip install] to the URL or local path of the archive
|
| point #[code pip install] to the URL or local path of the archive
|
||||||
| file. To find the direct link to a model, head over to the
|
| file. To find the direct link to a model, head over to the
|
||||||
| #[+a(gh("spacy-models") + "/releases") model releases], right click on the archive
|
| #[+a(gh("spacy-models") + "/releases") model releases], right click on the archive
|
||||||
| link and copy it to your clipboard.
|
| link and copy it to your clipboard.
|
||||||
|
|
||||||
+code(false, "bash").
|
+code(false, "bash").
|
||||||
# with external URL
|
# with external URL
|
||||||
pip install #{gh("spacy-models")}/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
|
pip install #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
|
||||||
|
|
||||||
# with local file
|
# with local file
|
||||||
pip install /Users/you/en_core_web_md-1.2.0.tar.gz
|
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
|
||||||
|
@@ -69,7 +69,7 @@ p
|
||||||
|
|
||||||
p
|
p
|
||||||
| You can place the #[strong model package directory] anywhere on your
|
| You can place the #[strong model package directory] anywhere on your
|
||||||
| local file system. To use it with spaCy, simply assign it a name by
|
| local file system. To use it with spaCy, assign it a name by
|
||||||
| creating a #[+a("#usage") shortcut link] for the data directory.
|
| creating a #[+a("#usage") shortcut link] for the data directory.
|
||||||
|
|
||||||
+h(3, "usage") Using models with spaCy
|
+h(3, "usage") Using models with spaCy
|
||||||
|
|
|
@@ -26,7 +26,7 @@ p
|
||||||
p
|
p
|
||||||
| Because all models are valid Python packages, you can add them to your
|
| Because all models are valid Python packages, you can add them to your
|
||||||
| application's #[code requirements.txt]. If you're running your own
|
| application's #[code requirements.txt]. If you're running your own
|
||||||
| internal PyPi installation, you can simply upload the models there. pip's
|
| internal PyPi installation, you can upload the models there. pip's
|
||||||
| #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
|
| #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
|
||||||
| supports both package names to download via a PyPi server, as well as direct
|
| supports both package names to download via a PyPi server, as well as direct
|
||||||
| URLs.
|
| URLs.
|
||||||
|
|
|
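The hunk above describes adding models to `requirements.txt` as either package names or direct URLs. A hypothetical requirements file combining both (version numbers are illustrative, and the `#egg=` fragment naming the package for pip is an assumption about your setup):

```text
spacy>=2.0.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm
```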
@@ -5,7 +5,7 @@ p
|
||||||
| segments it into words, punctuation and so on. This is done by applying
|
| segments it into words, punctuation and so on. This is done by applying
|
||||||
| rules specific to each language. For example, punctuation at the end of a
|
| rules specific to each language. For example, punctuation at the end of a
|
||||||
| sentence should be split off – whereas "U.K." should remain one token.
|
| sentence should be split off – whereas "U.K." should remain one token.
|
||||||
| Each #[code Doc] consists of individual tokens, and we can simply iterate
|
| Each #[code Doc] consists of individual tokens, and we can iterate
|
||||||
| over them:
|
| over them:
|
||||||
|
|
||||||
+code-exec.
|
+code-exec.
|
||||||
|
|
|
@@ -72,10 +72,11 @@ p
|
||||||
| you want to visualize output from other libraries, like
|
| you want to visualize output from other libraries, like
|
||||||
| #[+a("http://www.nltk.org") NLTK] or
|
| #[+a("http://www.nltk.org") NLTK] or
|
||||||
| #[+a("https://github.com/tensorflow/models/tree/master/research/syntaxnet") SyntaxNet].
|
| #[+a("https://github.com/tensorflow/models/tree/master/research/syntaxnet") SyntaxNet].
|
||||||
| Simply convert the dependency parse or recognised entities to displaCy's
|
| If you set #[code manual=True] on either #[code render()] or
|
||||||
| format and set #[code manual=True] on either #[code render()] or
|
| #[code serve()], you can pass in data in displaCy's format (instead of
|
||||||
| #[code serve()]. When setting #[code ents] manually, make sure to supply
|
| #[code Doc] objects). When setting #[code ents] manually, make sure to
|
||||||
| them in the right order, i.e. starting with the lowest start position.
|
| supply them in the right order, i.e. starting with the lowest start
|
||||||
|
| position.
|
||||||
|
|
||||||
+aside-code("Example").
|
+aside-code("Example").
|
||||||
ex = [{'text': 'But Google is starting from behind.',
|
ex = [{'text': 'But Google is starting from behind.',
|
||||||
|
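The rewritten paragraph above explains that with `manual=True` you pass data in displaCy's format instead of `Doc` objects, with `ents` sorted by start position. A minimal sketch of that data shape, using the example sentence from the docs (the character offsets below are a worked assumption, verified against the raw text):

```python
# Entity data in displaCy's "manual" format: raw text plus pre-computed
# entity spans, supplied in order of their start position.
ex = [{'text': 'But Google is starting from behind.',
       'ents': [{'start': 4, 'end': 10, 'label': 'ORG'}],
       'title': None}]

# Sanity-check the ordering requirement before rendering
starts = [ent['start'] for ent in ex[0]['ents']]
assert starts == sorted(starts)

# With spaCy installed, this would render the entity markup:
# from spacy import displacy
# html = displacy.render(ex, style='ent', manual=True)
```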
@@ -109,7 +110,7 @@ p
|
||||||
| If you want to use the visualizers as part of a web application, for
|
| If you want to use the visualizers as part of a web application, for
|
||||||
| example to create something like our
|
| example to create something like our
|
||||||
| #[+a(DEMOS_URL + "/displacy") online demo], it's not recommended to
|
| #[+a(DEMOS_URL + "/displacy") online demo], it's not recommended to
|
||||||
| simply wrap and serve the displaCy renderer. Instead, you should only
|
| only wrap and serve the displaCy renderer. Instead, you should only
|
||||||
| rely on the server to perform spaCy's processing capabilities, and use
|
| rely on the server to perform spaCy's processing capabilities, and use
|
||||||
| #[+a(gh("displacy")) displaCy.js] to render the JSON-formatted output.
|
| #[+a(gh("displacy")) displaCy.js] to render the JSON-formatted output.
|
||||||
|
|
||||||
|
|
|
@@ -33,6 +33,10 @@ p
|
||||||
+h(2, "tokenization") Tokenization
|
+h(2, "tokenization") Tokenization
|
||||||
include _linguistic-features/_tokenization
|
include _linguistic-features/_tokenization
|
||||||
|
|
||||||
|
+section("sbd")
|
||||||
|
+h(2, "sbd") Sentence Segmentation
|
||||||
|
include _linguistic-features/_sentence-segmentation
|
||||||
|
|
||||||
+section("rule-based-matching")
|
+section("rule-based-matching")
|
||||||
+h(2, "rule-based-matching") Rule-based matching
|
+h(2, "rule-based-matching") Rule-based matching
|
||||||
include _linguistic-features/_rule-based-matching
|
include _linguistic-features/_rule-based-matching
|
||||||
|
|