Merge branch 'develop' into feature/refactor-parser

Matthew Honnibal 2018-05-15 18:39:21 +02:00
commit dc1a479fbd
67 changed files with 316912 additions and 102 deletions

.github/ISSUE_TEMPLATE/01_bugs.md

@@ -0,0 +1,15 @@
---
name: "\U0001F6A8 Bug Report"
about: Did you come across a bug or unexpected behaviour differing from the docs?
---
## How to reproduce the behaviour
<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:
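As the template's comment notes, spaCy v1.7+ can print these details for you via `python -m spacy info --markdown`. When that command isn't available (e.g. spaCy failed to install), a minimal stdlib-only sketch can still fill in the first two fields by hand; the field labels below simply mirror the template above:

```python
import platform
import sys

# Fallback for filling in the "Your Environment" section manually.
# `python -m spacy info --markdown` (spaCy v1.7+) prints the same details
# plus the spaCy version and installed models.
print(f"* Operating System: {platform.system()} {platform.release()}")
print(f"* Python Version Used: {sys.version.split()[0]}")
```
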

.github/ISSUE_TEMPLATE/02_install.md

@@ -0,0 +1,21 @@
---
name: "\U000023F3 Installation Problem"
about: Do you have problems installing spaCy, and none of the suggestions in the docs
and other issues helped?
---
<!-- Before submitting an issue, make sure to check the docs and closed issues to see if any of the solutions work for you. Installation problems can often be related to Python environment issues and problems with compilation. -->
## How to reproduce the problem
<!-- Include the details of how the problem occurred. Which command did you run to install spaCy? Did you come across an error? What else did you try? -->
```bash
# copy-paste the error message here
```
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:

.github/ISSUE_TEMPLATE/03_request.md

@@ -0,0 +1,11 @@
---
name: "\U0001F381 Feature Request"
about: Do you have an idea for an improvement, a new feature or a plugin?
---
## Feature description
<!-- Please describe the feature: Which area of the library is it related to? What specific solution would you like? -->
## Could the feature be a [custom component](https://spacy.io/usage/processing-pipelines#custom-components) or [spaCy plugin](https://spacy.io/universe)?
If so, we will tag it as [`project idea`](https://github.com/explosion/spaCy/labels/project%20idea) so other users can take it on.

.github/ISSUE_TEMPLATE/04_docs.md

@@ -0,0 +1,10 @@
---
name: "\U0001F4DA Documentation"
about: Did you spot a mistake in the docs, is anything unclear or do you have a
suggestion?
---
<!-- Describe the problem or suggestion here. If you've found a mistake and you know the answer, feel free to submit a pull request straight away: https://github.com/explosion/spaCy/pulls -->
## Which page or section is this issue related to?
<!-- Please include the URL and/or source. -->

.github/ISSUE_TEMPLATE/05_other.md

@@ -0,0 +1,15 @@
---
name: "\U0001F4AC Anything else?"
about: For general usage questions or help with your code, please consider
posting on StackOverflow instead.
---
<!-- Describe your issue here. Please keep in mind that the GitHub issue tracker is mostly intended for reports related to the spaCy code base and source, and for bugs and feature requests. If you're looking for help with your code, consider posting a question on StackOverflow instead: http://stackoverflow.com/questions/tagged/spacy -->
## Your Environment
<!-- Include details of your environment. If you're using spaCy 1.7+, you can also type `python -m spacy info --markdown` and copy-paste the result here.-->
* Operating System:
* Python Version Used:
* spaCy Version Used:
* Environment Information:

.github/contributors/LRAbbade.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Lucas Riêra Abbade |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-05-08 |
| GitHub username | LRAbbade |
| Website (optional) | |

.github/contributors/alexvy86.md

@@ -0,0 +1,87 @@
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Alejandro Villarreal |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-05-01 |
| GitHub username | alexvy86 |
| Website (optional) | |

.github/contributors/bellabie

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | B Cavello |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-05-06 |
| GitHub username | bellabie |
| Website (optional) | bcavello.com |

.github/contributors/janimo.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jani Monoses |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date                          | 2018-05-10           |
| GitHub username | janimo |
| Website (optional) | |

.github/contributors/knoxdw.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Douglas Knox |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2018-04-27 |
| GitHub username | knoxdw |
| Website (optional) | |

.github/contributors/mauryaland.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Amaury Fouret |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 05/08/2018 |
| GitHub username | mauryaland |
| Website (optional) | |

.github/contributors/mn3mos.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Gaëtan PRUVOST |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 13/04/2018 |
| GitHub username | mn3mos |
| Website (optional) | |

.github/contributors/tzano.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tahar Zanouda |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 09-05-2018 |
| GitHub username | tzano |
| Website (optional) | |

.github/contributors/vishnumenon.md vendored Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Vishnu Menon |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 12 May 2018 |
| GitHub username | vishnumenon |
| Website (optional) | |

.github/lock.yml vendored Normal file
@@ -0,0 +1,19 @@
# Configuration for lock-threads - https://github.com/dessant/lock-threads
# Number of days of inactivity before a closed issue or pull request is locked
daysUntilLock: 30
# Issues and pull requests with these labels will not be locked. Set to `[]` to disable
exemptLabels: []
# Label to add before locking, such as `outdated`. Set to `false` to disable
lockLabel: false
# Comment to post before locking. Set to `false` to disable
lockComment: >
This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.
# Limit to only `issues` or `pulls`
only: issues

.github/no-response.yml vendored Normal file
@@ -0,0 +1,13 @@
# Configuration for probot-no-response - https://github.com/probot/no-response
# Number of days of inactivity before an Issue is closed for lack of response
daysUntilClose: 14
# Label requiring a response
responseRequiredLabel: more-info-needed
# Comment to post when closing an Issue for lack of response. Set to `false` to disable
closeComment: >
This issue has been automatically closed because there has been no response
to a request for more information from the original author. With only the
information that is currently in the issue, there's not enough information
to take action. If you're the original author, feel free to reopen the issue
if you have or find the answers needed to investigate further.

@@ -199,6 +199,11 @@ or manually by pointing pip to a path or URL.
     # pip install .tar.gz archive from path or URL
     pip install /Users/you/en_core_web_sm-2.0.0.tar.gz

+If you have SSL certification problems, SSL customization options are described in the help:
+
+    # help for the download command
+    python -m spacy download --help

 Loading and using models
 ------------------------

@@ -68,9 +68,9 @@ class RESTCountriesComponent(object):
         # the matches, so we're only setting a default value, not a getter.
         # If no default value is set, it defaults to None.
         Token.set_extension('is_country', default=False)
-        Token.set_extension('country_capital')
-        Token.set_extension('country_latlng')
-        Token.set_extension('country_flag')
+        Token.set_extension('country_capital', default=False)
+        Token.set_extension('country_latlng', default=False)
+        Token.set_extension('country_flag', default=False)
         # Register attributes on Doc and Span via a getter that checks if one of
         # the contained tokens is set to is_country == True.

@@ -17,19 +17,39 @@ from .. import about

 @plac.annotations(
     model=("model to download, shortcut or name)", "positional", None, str),
     direct=("force direct download. Needs model name with version and won't "
-            "perform compatibility check", "flag", "d", bool))
-def download(model, direct=False):
+            "perform compatibility check", "flag", "d", bool),
+    insecure=("insecure mode - disables the verification of certificates",
+              "flag", "i", bool),
+    ca_file=("specify a certificate authority file to use for certificates "
+             "validation. Ignored if --insecure is used", "option", "c"))
+def download(model, direct=False, insecure=False, ca_file=None):
     """
     Download compatible model from default download path using pip. Model
     can be shortcut, model name or, if --direct flag is set, full model name
     with version.
+    The --insecure optional flag can be used to disable ssl verification
+    The --ca-file option can be used to provide a local CA file
+    used for certificate verification.
     """
+    # ssl_verify is the argument handled to the 'verify' parameter
+    # of requests package. It must be either None, a boolean,
+    # or a string containing the path to CA file
+    ssl_verify = None
+    if insecure:
+        ca_file = None
+        ssl_verify = False
+    else:
+        if ca_file is not None:
+            ssl_verify = ca_file
+    # Download the model
     if direct:
         dl = download_model('{m}/{m}.tar.gz'.format(m=model))
     else:
-        shortcuts = get_json(about.__shortcuts__, "available shortcuts")
+        shortcuts = get_json(about.__shortcuts__, "available shortcuts", ssl_verify)
         model_name = shortcuts.get(model, model)
-        compatibility = get_compatibility()
+        compatibility = get_compatibility(ssl_verify)
         version = get_version(model_name, compatibility)
         dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
                                                             v=version))
@@ -41,8 +61,7 @@ def download(model, direct=False):
             # package, which fails if model was just installed via
             # subprocess
             package_path = get_package_path(model_name)
-            link(model_name, model, force=True,
-                 model_path=package_path)
+            link(model_name, model, force=True, model_path=package_path)
         except:
             # Dirty, but since spacy.download and the auto-linking is
             # mostly a convenience wrapper, it's best to show a success
@@ -50,19 +69,19 @@ def download(model, direct=False):
     prints(Messages.M001.format(name=model_name), title=Messages.M002)


-def get_json(url, desc):
+def get_json(url, desc, ssl_verify):
     try:
-        data = url_read(url)
+        data = url_read(url, verify=ssl_verify)
     except HTTPError as e:
         prints(Messages.M004.format(desc, about.__version__),
                title=Messages.M003.format(e.code, e.reason), exits=1)
     return ujson.loads(data)


-def get_compatibility():
+def get_compatibility(ssl_verify):
     version = about.__version__
     version = version.rsplit('.dev', 1)[0]
-    comp_table = get_json(about.__compatibility__, "compatibility table")
+    comp_table = get_json(about.__compatibility__, "compatibility table", ssl_verify)
     comp = comp_table['spacy']
     if version not in comp:
         prints(Messages.M006.format(version=version), title=Messages.M005,
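The precedence between the two new flags (`--insecure` wins over `--ca-file`) can be sketched in isolation; `resolve_ssl_verify` is an illustrative helper name, not part of spaCy:

```python
def resolve_ssl_verify(insecure=False, ca_file=None):
    # Mirrors the hunk above: the value handed to requests' `verify`
    # parameter is False (verification disabled), a CA-file path,
    # or None (library default). --insecure overrides --ca-file.
    if insecure:
        return False
    if ca_file is not None:
        return ca_file
    return None
```

Passing the result straight through as `verify=ssl_verify` works because `requests` accepts exactly these three shapes (bool, path string, or its default).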

@@ -124,13 +124,16 @@ def read_conllu(file_):
     return docs


-def _make_gold(nlp, text, sent_annots):
+def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
     # Flatten the conll annotations, and adjust the head indices
     flat = defaultdict(list)
+    sent_starts = []
     for sent in sent_annots:
         flat['heads'].extend(len(flat['words'])+head for head in sent['heads'])
         for field in ['words', 'tags', 'deps', 'entities', 'spaces']:
             flat[field].extend(sent[field])
+        sent_starts.append(True)
+        sent_starts.extend([False] * (len(sent['words'])-1))
     # Construct text if necessary
     assert len(flat['words']) == len(flat['spaces'])
     if text is None:
@@ -138,6 +141,12 @@ def _make_gold(nlp, text, sent_annots):
     doc = nlp.make_doc(text)
     flat.pop('spaces')
     gold = GoldParse(doc, **flat)
+    gold.sent_starts = sent_starts
+    for i in range(len(gold.heads)):
+        if random.random() < drop_deps:
+            gold.heads[i] = None
+            gold.labels[i] = None
     return doc, gold


 #############################
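The dependency-dropout step added to `_make_gold` can be illustrated standalone; `drop_gold_deps` is a hypothetical stand-in that operates on plain lists rather than a `GoldParse`:

```python
import random

def drop_gold_deps(heads, labels, drop_deps=0.0):
    # Each arc is independently blanked out (head and label set to
    # None) with probability drop_deps, as in the loop added above.
    heads, labels = list(heads), list(labels)
    for i in range(len(heads)):
        if random.random() < drop_deps:
            heads[i] = None
            labels[i] = None
    return heads, labels
```

Setting an annotation to `None` marks it as missing, so those arcs simply contribute no gradient during training.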

@@ -545,10 +545,21 @@ cdef class GoldParse:
         """
         return not nonproj.is_nonproj_tree(self.heads)

-    @property
-    def sent_starts(self):
-        return [self.c.sent_start[i] for i in range(self.length)]
+    property sent_starts:
+        def __get__(self):
+            return [self.c.sent_start[i] for i in range(self.length)]
+
+        def __set__(self, sent_starts):
+            for gold_i, is_sent_start in enumerate(sent_starts):
+                i = self.gold_to_cand[gold_i]
+                if i is not None:
+                    if is_sent_start in (1, True):
+                        self.c.sent_start[i] = 1
+                    elif is_sent_start in (-1, False):
+                        self.c.sent_start[i] = -1
+                    else:
+                        self.c.sent_start[i] = 0


 def biluo_tags_from_offsets(doc, entities, missing='O'):
     """Encode labelled spans into per-token tags, using the
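The new setter maps Python truthy values onto the tri-state C-level `sent_start` field. A standalone sketch of that encoding (the function name is illustrative):

```python
def encode_sent_start(value):
    # 1 = token starts a sentence, -1 = it does not, 0 = unknown
    # or missing, matching the __set__ branches in the hunk above.
    if value in (1, True):
        return 1
    elif value in (-1, False):
        return -1
    else:
        return 0
```

The `0` case lets gold annotations leave sentence boundaries unspecified for tokens the alignment could not map.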

spacy/lang/ar/__init__.py Normal file
@@ -0,0 +1,31 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .punctuation import TOKENIZER_SUFFIXES
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
from ...language import Language
from ...attrs import LANG, NORM
from ...util import update_exc, add_lookups
class ArabicDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: 'ar'
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
class Arabic(Language):
lang = 'ar'
Defaults = ArabicDefaults
__all__ = ['Arabic']

spacy/lang/ar/examples.py Normal file
@@ -0,0 +1,20 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ar.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"نال الكاتب خالد توفيق جائزة الرواية العربية في معرض الشارقة الدولي للكتاب",
"أين تقع دمشق ؟",
"كيف حالك ؟",
"هل يمكن ان نلتقي على الساعة الثانية عشرة ظهرا ؟",
"ماهي أبرز التطورات السياسية، الأمنية والاجتماعية في العالم ؟",
"هل بالإمكان أن نلتقي غدا؟",
"هناك نحو 382 مليون شخص مصاب بداء السكَّري في العالم",
"كشفت دراسة حديثة أن الخيل تقرأ تعبيرات الوجه وتستطيع أن تتذكر مشاعر الناس وعواطفهم"
]

@@ -0,0 +1,95 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = set("""
صفر
واحد
إثنان
اثنان
ثلاثة
ثلاثه
أربعة
أربعه
خمسة
خمسه
ستة
سته
سبعة
سبعه
ثمانية
ثمانيه
تسعة
تسعه
ﻋﺸﺮﺓ
ﻋﺸﺮه
عشرون
عشرين
ثلاثون
ثلاثين
اربعون
اربعين
أربعون
أربعين
خمسون
خمسين
ستون
ستين
سبعون
سبعين
ثمانون
ثمانين
تسعون
تسعين
مائتين
مائتان
ثلاثمائة
خمسمائة
سبعمائة
الف
آلاف
ملايين
مليون
مليار
مليارات
""".split())
_ordinal_words = set("""
اول
أول
حاد
واحد
ثان
ثاني
ثالث
رابع
خامس
سادس
سابع
ثامن
تاسع
عاشر
""".split())
def like_num(text):
"""
check if text resembles a number
"""
text = text.replace(',', '').replace('.', '')
if text.isdigit():
return True
if text.count('/') == 1:
num, denom = text.split('/')
if num.isdigit() and denom.isdigit():
return True
if text in _num_words:
return True
if text in _ordinal_words:
return True
return False
LEX_ATTRS = {
LIKE_NUM: like_num
}
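The `like_num` logic above can be exercised standalone; the sketch below uses a tiny illustrative subset of `_num_words` and shows the four kinds of input it accepts — plain digit runs, Arabic-Indic digits, simple fractions, and number words:

```python
_num_words = {"واحد", "اثنان", "ثلاثة"}  # illustrative subset

def like_num(text):
    # Same checks as above: strip separators, then accept digit runs,
    # num/denom fractions, and known Arabic number words.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words
```

Arabic-Indic digits such as `٤٢` pass the `isdigit()` check, since Python's `str.isdigit` covers all Unicode decimal-digit characters.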

@@ -0,0 +1,15 @@
# coding: utf8
from __future__ import unicode_literals
from ..punctuation import TOKENIZER_INFIXES
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY
from ..char_classes import QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER
_suffixes = (LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES +
[r'(?<=[0-9])\+',
# Arabic is written from Right-To-Left
r'(?<=[0-9])(?:{})'.format(CURRENCY),
r'(?<=[0-9])(?:{})'.format(UNITS),
r'(?<=[{au}][{au}])\.'.format(au=ALPHA_UPPER)])
TOKENIZER_SUFFIXES = _suffixes

spacy/lang/ar/stop_words.py Normal file
@@ -0,0 +1,229 @@
# coding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
من
نحو
لعل
بما
بين
وبين
ايضا
وبينما
تحت
مثلا
لدي
عنه
مع
هي
وهذا
واذا
هذان
انه
بينما
أمسى
وسوف
ولم
لذلك
إلى
منه
منها
كما
ظل
هنا
به
كذلك
اما
هما
بعد
بينهم
التي
أبو
اذا
بدلا
لها
أمام
يلي
حين
ضد
الذي
قد
صار
إذا
مابرح
قبل
كل
وليست
الذين
لهذا
وثي
انهم
باللتي
مافتئ
ولا
بهذه
بحيث
كيف
وله
علي
بات
لاسيما
حتى
وقد
و
أما
فيها
بهذا
لذا
حيث
لقد
إن
فإن
اول
ليت
فاللتي
ولقد
لسوف
هذه
ولماذا
معه
الحالي
بإن
حول
في
عليه
مايزال
ولعل
أنه
أضحى
اي
ستكون
لن
أن
ضمن
وعلى
امسى
الي
ذات
ولايزال
ذلك
فقد
هم
أي
عند
ابن
أو
فهو
فانه
سوف
ما
آل
كلا
عنها
وكذلك
ليست
لم
وأن
ماذا
لو
وهل
اللتي
ولذا
يمكن
فيه
الا
عليها
وبينهم
يوم
وبما
لما
فكان
اضحى
اصبح
لهم
بها
او
الذى
الى
إلي
قال
والتي
لازال
أصبح
ولهذا
مثل
وكانت
لكنه
بذلك
هذا
لماذا
قالت
فقط
لكن
مما
وكل
وان
وأبو
ومن
كان
مازال
هل
بينهن
هو
وما
على
وهو
لأن
واللتي
والذي
دون
عن
وايضا
هناك
بلا
جدا
ثم
منذ
اللذين
لايزال
بعض
مساء
تكون
فلا
بيننا
لا
ولكن
إذ
وأثناء
ليس
ومع
فيهم
ولسوف
بل
تلك
أحد
وهي
وكان
ومنها
وفي
ماانفك
اليوم
وماذا
هؤلاء
وليس
له
أثناء
بد
اليه
كأن
اليها
بتلك
يكون
ولما
هن
والى
كانت
وقبل
ان
لدى
""".split())

@@ -0,0 +1,47 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA
import re
_exc = {}
# time
for exc_data in [
{LEMMA: "قبل الميلاد", ORTH: "ق.م"},
{LEMMA: "بعد الميلاد", ORTH: "ب. م"},
{LEMMA: "ميلادي", ORTH: ""},
{LEMMA: "هجري", ORTH: ".هـ"},
{LEMMA: "توفي", ORTH: ""}]:
_exc[exc_data[ORTH]] = [exc_data]
# scientific abv.
for exc_data in [
{LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"},
{LEMMA: "الشارح", ORTH: "الشـ"},
{LEMMA: "الظاهر", ORTH: "الظـ"},
{LEMMA: "أيضًا", ORTH: "أيضـ"},
{LEMMA: "إلى آخره", ORTH: "إلخ"},
{LEMMA: "انتهى", ORTH: "اهـ"},
{LEMMA: "حدّثنا", ORTH: "ثنا"},
{LEMMA: "حدثني", ORTH: "ثنى"},
{LEMMA: "أنبأنا", ORTH: "أنا"},
{LEMMA: "أخبرنا", ORTH: "نا"},
{LEMMA: "مصدر سابق", ORTH: "م. س"},
{LEMMA: "مصدر نفسه", ORTH: "م. ن"}]:
_exc[exc_data[ORTH]] = [exc_data]
# other abv.
for exc_data in [
{LEMMA: "دكتور", ORTH: "د."},
{LEMMA: "أستاذ دكتور", ORTH: "أ.د"},
{LEMMA: "أستاذ", ORTH: "أ."},
{LEMMA: "بروفيسور", ORTH: "ب."}]:
_exc[exc_data[ORTH]] = [exc_data]
for exc_data in [
{LEMMA: "تلفون", ORTH: "ت."},
{LEMMA: "صندوق بريد", ORTH: "ص.ب"}]:
_exc[exc_data[ORTH]] = [exc_data]
TOKENIZER_EXCEPTIONS = _exc

@@ -3,13 +3,11 @@ from __future__ import unicode_literals
 import regex as re

 re.DEFAULT_VERSION = re.VERSION1

 merge_char_classes = lambda classes: '[{}]'.format('||'.join(classes))
 split_chars = lambda char: list(char.strip().split(' '))
 merge_chars = lambda char: char.strip().replace(' ', '|')

 _bengali = r'[\p{L}&&\p{Bengali}]'
 _hebrew = r'[\p{L}&&\p{Hebrew}]'
 _latin_lower = r'[\p{Ll}&&\p{Latin}]'
@@ -27,11 +25,11 @@ ALPHA = merge_char_classes(_upper + _lower + _uncased)
 ALPHA_LOWER = merge_char_classes(_lower + _uncased)
 ALPHA_UPPER = merge_char_classes(_upper + _uncased)

 _units = ('km km² km³ m m² m³ dm dm² dm³ cm cm² cm³ mm mm² mm³ ha µm nm yd in ft '
           'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
           'TB T G M K % км км² км³ м м² м³ дм дм² дм³ см см² см³ мм мм² мм³ нм '
-          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб')
+          'кг г мг м/с км/ч кПа Па мбар Кб КБ кб Мб МБ мб Гб ГБ гб Тб ТБ тб'
+          'كم كم² كم³ م م² م³ سم سم² سم³ مم مم² مم³ كم غرام جرام جم كغ ملغ كوب اكواب')

 _currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$ ₽ ﷼'

 # These expressions contain various unicode variations, including characters
@@ -45,7 +43,6 @@ _hyphens = '- — -- --- —— ~'
 # Details: https://www.compart.com/en/unicode/category/So
 _other_symbols = r'[\p{So}]'

 UNITS = merge_chars(_units)
 CURRENCY = merge_chars(_currency)
 QUOTES = merge_chars(_quotes)


@@ -11,14 +11,14 @@ avais avait avant avec avoir avons ayant
bah bas basee bat beau beaucoup bien bigre boum bravo brrr
-ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
+c' c ça car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui
celui-ci celui-là cent cependant certain certaine certaines certains certes ces
cet cette ceux ceux-ci ceux-là chacun chacune chaque cher chers chez chiche
chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac
clic combien comme comment comparable comparables compris concernant contre
couic crac
-da dans de debout dedans dehors deja delà depuis dernier derniere derriere
+d' d da dans de debout dedans dehors deja delà depuis dernier derniere derriere
derrière des desormais desquelles desquels dessous dessus deux deuxième
deuxièmement devant devers devra different differentes differents différent
différente différentes différents dire directe directement dit dite dits divers
@@ -37,16 +37,16 @@ gens
ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum
hurrah hélas i il ils importe
-je jusqu jusque juste
+j' j je jusqu jusque juste
-la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
+l' l la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
lors lorsque lui lui-meme lui-même lès
-ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
+m' m ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins
mon moyennant multiple multiples même mêmes
-na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
+n' n na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau
nul néanmoins nôtre nôtres
@@ -60,21 +60,21 @@ plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
pourrais pourrait pouvait prealable precisement premier première premièrement
pres probable probante procedant proche près psitt pu puis puisque pur pure
-qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
+qu' qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
quelques quels qui quiconque quinze quoi quoique
rare rarement rares relative relativement remarquable rend rendre restant reste
restent restrictif retour revoici revoilà rien
-sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
+s' s sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
semble semblent sent sept septième sera seraient serait seront ses seul seule
seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
soixante son sont sous souvent specifique specifiques speculatif stop
strictement subtiles suffisant suffisante suffit suis suit suivant suivante
suivantes suivants suivre superpose sur surtout
-ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
+t' t ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous
tout toute toutefois toutes treize trente tres trois troisième troisièmement
trop très tsoin tsouin tu


@@ -3,23 +3,87 @@ from __future__ import unicode_literals, print_function
from ...language import Language
from ...attrs import LANG
-from ...tokens import Doc
+from ...tokens import Doc, Token
from ...tokenizer import Tokenizer
+from .tag_map import TAG_MAP
+
+import re
+from collections import namedtuple
+
+ShortUnitWord = namedtuple('ShortUnitWord', ['surface', 'lemma', 'pos'])
+
+# XXX Is this the right place for this?
+Token.set_extension('mecab_tag', default=None)
+
+
+def try_mecab_import():
+    """MeCab is required for Japanese support, so check for it.
+    If it's not available, blow up and explain how to fix it."""
+    try:
+        import MeCab
+        return MeCab
+    except ImportError:
+        raise ImportError("Japanese support requires MeCab: "
+                          "https://github.com/SamuraiT/mecab-python3")
+
+
+def resolve_pos(token):
+    """If necessary, add a field to the POS tag for UD mapping.
+
+    Under Universal Dependencies, sometimes the same Unidic POS tag can
+    be mapped differently depending on the literal token or its context
+    in the sentence. This function adds information to the POS tag to
+    resolve ambiguous mappings.
+    """
+    # NOTE: This is a first take. The rules here are crude approximations.
+    # For many of these, full dependencies are needed to properly resolve
+    # PoS mappings.
+    if token.pos == '連体詞,*,*,*':
+        if re.match('^[こそあど此其彼]の', token.surface):
+            return token.pos + ',DET'
+        if re.match('^[こそあど此其彼]', token.surface):
+            return token.pos + ',PRON'
+        else:
+            return token.pos + ',ADJ'
+    return token.pos
+
+
+def detailed_tokens(tokenizer, text):
+    """Format MeCab output into a nice data structure, based on Janome."""
+    node = tokenizer.parseToNode(text)
+    node = node.next  # first node is beginning of sentence and empty, skip it
+    words = []
+    while node.posid != 0:
+        surface = node.surface
+        base = surface  # a default value; updated below if available
+        parts = node.feature.split(',')
+        pos = ','.join(parts[0:4])
+        if len(parts) > 6:
+            # this information is only available for words in the
+            # tokenizer dictionary
+            reading = parts[6]
+            base = parts[7]
+        words.append(ShortUnitWord(surface, base, pos))
+        node = node.next
+    return words
+
+
class JapaneseTokenizer(object):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        try:
-            from janome.tokenizer import Tokenizer
-        except ImportError:
-            raise ImportError("The Japanese tokenizer requires the Janome "
-                              "library: https://github.com/mocobeta/janome")
-        self.tokenizer = Tokenizer()
+        MeCab = try_mecab_import()
+        self.tokenizer = MeCab.Tagger()

    def __call__(self, text):
-        words = [x.surface for x in self.tokenizer.tokenize(text)]
-        return Doc(self.vocab, words=words, spaces=[False]*len(words))
+        dtokens = detailed_tokens(self.tokenizer, text)
+        words = [x.surface for x in dtokens]
+        doc = Doc(self.vocab, words=words, spaces=[False]*len(words))
+        for token, dtoken in zip(doc, dtokens):
+            token._.mecab_tag = dtoken.pos
+            token.tag_ = resolve_pos(dtoken)
+        return doc

    # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
    # allow serialization (see #1557)
@@ -53,6 +117,7 @@ class JapaneseCharacterSegmenter(object):
class JapaneseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: 'ja'
+    tag_map = TAG_MAP
    use_janome = True

    @classmethod
@@ -62,13 +127,12 @@ class JapaneseDefaults(Language.Defaults):
        else:
            return JapaneseCharacterSegmenter(cls, nlp.vocab)


class Japanese(Language):
    lang = 'ja'
    Defaults = JapaneseDefaults
-    Tokenizer = JapaneseTokenizer

    def make_doc(self, text):
        return self.tokenizer(text)


__all__ = ['Japanese']
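The `resolve_pos` rules in this hunk are self-contained enough to exercise without MeCab installed. Restated as a standalone sketch, with `ShortUnitWord` mirroring the namedtuple from the diff (the sample tokens in the assertions are illustrative):

```python
import re
from collections import namedtuple

# Mirrors the namedtuple used by detailed_tokens() in the diff above.
ShortUnitWord = namedtuple('ShortUnitWord', ['surface', 'lemma', 'pos'])

def resolve_pos(token):
    """Append a UD hint to the ambiguous Unidic adnominal (連体詞) tag:
    demonstrative + の -> DET, bare demonstrative -> PRON, else ADJ.
    Other tags pass through unchanged."""
    if token.pos == '連体詞,*,*,*':
        if re.match('^[こそあど此其彼]の', token.surface):
            return token.pos + ',DET'
        if re.match('^[こそあど此其彼]', token.surface):
            return token.pos + ',PRON'
        return token.pos + ',ADJ'
    return token.pos

assert resolve_pos(ShortUnitWord('この', '', '連体詞,*,*,*')).endswith(',DET')
assert resolve_pos(ShortUnitWord('あれ', '', '連体詞,*,*,*')).endswith(',PRON')
assert resolve_pos(ShortUnitWord('大きな', '', '連体詞,*,*,*')).endswith(',ADJ')
```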

spacy/lang/ja/tag_map.py (new file)

@@ -0,0 +1,88 @@
# encoding: utf8
from __future__ import unicode_literals
from ...symbols import *
TAG_MAP = {
# Explanation of Unidic tags:
# https://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/nlp+slp/UNIDIC_manual.pdf
# Universal Dependencies Mapping:
# http://universaldependencies.org/ja/overview/morphology.html
# http://universaldependencies.org/ja/pos/all.html
"記号,一般,*,*":{POS: PUNCT}, # this includes characters used to represent sounds like ドレミ
"記号,文字,*,*":{POS: PUNCT}, # this is for Greek and Latin characters used as symbols, as in math
"感動詞,フィラー,*,*": {POS: INTJ},
"感動詞,一般,*,*": {POS: INTJ},
# this is specifically for unicode full-width space
"空白,*,*,*": {POS: X},
"形状詞,一般,*,*":{POS: ADJ},
"形状詞,タリ,*,*":{POS: ADJ},
"形状詞,助動詞語幹,*,*":{POS: ADJ},
"形容詞,一般,*,*":{POS: ADJ},
"形容詞,非自立可能,*,*":{POS: AUX}, # XXX ADJ if alone, AUX otherwise
"助詞,格助詞,*,*":{POS: ADP},
"助詞,係助詞,*,*":{POS: ADP},
"助詞,終助詞,*,*":{POS: PART},
"助詞,準体助詞,*,*":{POS: SCONJ}, # の as in 走るのが速い
"助詞,接続助詞,*,*":{POS: SCONJ}, # verb ending て
"助詞,副助詞,*,*":{POS: PART}, # ばかり, つつ after a verb
"助動詞,*,*,*":{POS: AUX},
"接続詞,*,*,*":{POS: SCONJ}, # XXX: might need refinement
"接頭辞,*,*,*":{POS: NOUN},
"接尾辞,形状詞的,*,*":{POS: ADJ}, # がち, チック
"接尾辞,形容詞的,*,*":{POS: ADJ}, # -らしい
"接尾辞,動詞的,*,*":{POS: NOUN}, # -じみ
"接尾辞,名詞的,サ変可能,*":{POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,*
"接尾辞,名詞的,一般,*":{POS: NOUN},
"接尾辞,名詞的,助数詞,*":{POS: NOUN},
"接尾辞,名詞的,副詞可能,*":{POS: NOUN}, # -後, -過ぎ
"代名詞,*,*,*":{POS: PRON},
"動詞,一般,*,*":{POS: VERB},
"動詞,非自立可能,*,*":{POS: VERB}, # XXX VERB if alone, AUX otherwise
"動詞,非自立可能,*,*,AUX":{POS: AUX},
"動詞,非自立可能,*,*,VERB":{POS: VERB},
"副詞,*,*,*":{POS: ADV},
"補助記号,,一般,*":{POS: SYM}, # text art
"補助記号,,顔文字,*":{POS: SYM}, # kaomoji
"補助記号,一般,*,*":{POS: SYM},
"補助記号,括弧開,*,*":{POS: PUNCT}, # open bracket
"補助記号,括弧閉,*,*":{POS: PUNCT}, # close bracket
"補助記号,句点,*,*":{POS: PUNCT}, # period or other EOS marker
"補助記号,読点,*,*":{POS: PUNCT}, # comma
"名詞,固有名詞,一般,*":{POS: PROPN}, # general proper noun
"名詞,固有名詞,人名,一般":{POS: PROPN}, # person's name
"名詞,固有名詞,人名,姓":{POS: PROPN}, # surname
"名詞,固有名詞,人名,名":{POS: PROPN}, # first name
"名詞,固有名詞,地名,一般":{POS: PROPN}, # place name
"名詞,固有名詞,地名,国":{POS: PROPN}, # country name
"名詞,助動詞語幹,*,*":{POS: AUX},
"名詞,数詞,*,*":{POS: NUM}, # includes Chinese numerals
"名詞,普通名詞,サ変可能,*":{POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun
"名詞,普通名詞,サ変可能,*,NOUN":{POS: NOUN},
"名詞,普通名詞,サ変可能,*,VERB":{POS: VERB},
"名詞,普通名詞,サ変形状詞可能,*":{POS: NOUN}, # ex: 下手
"名詞,普通名詞,一般,*":{POS: NOUN},
"名詞,普通名詞,形状詞可能,*":{POS: NOUN}, # XXX: sometimes ADJ in UDv2
"名詞,普通名詞,形状詞可能,*,NOUN":{POS: NOUN},
"名詞,普通名詞,形状詞可能,*,ADJ":{POS: ADJ},
"名詞,普通名詞,助数詞可能,*":{POS: NOUN}, # counter / unit
"名詞,普通名詞,副詞可能,*":{POS: NOUN},
"連体詞,*,*,*":{POS: ADJ}, # XXX this has exceptions based on literal token
"連体詞,*,*,*,ADJ":{POS: ADJ},
"連体詞,*,*,*,PRON":{POS: PRON},
"連体詞,*,*,*,DET":{POS: DET},
}
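Because `resolve_pos` appends a context hint like `,DET` to the raw Unidic tag, a lookup against this map needs the refined five-field key first, with the bare four-field tag as a fallback. A hedged sketch of that resolution order, using a tiny illustrative subset of the map (the `lookup_pos` helper is not spaCy's actual code):

```python
# Tiny illustrative subset of the TAG_MAP above, with plain strings
# standing in for spaCy's POS symbols.
TAG_MAP_DEMO = {
    "連体詞,*,*,*": "ADJ",
    "連体詞,*,*,*,DET": "DET",
    "連体詞,*,*,*,PRON": "PRON",
}

def lookup_pos(tag):
    """Try the context-refined tag first, then the bare Unidic tag."""
    if tag in TAG_MAP_DEMO:
        return TAG_MAP_DEMO[tag]
    # strip the context suffix added by resolve_pos and retry
    base = ",".join(tag.split(",")[:4])
    return TAG_MAP_DEMO.get(base, "X")

assert lookup_pos("連体詞,*,*,*,DET") == "DET"
assert lookup_pos("連体詞,*,*,*") == "ADJ"
```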


@@ -6,10 +6,10 @@ from ...attrs import LIKE_NUM
_num_words = ['zero', 'um', 'dois', 'três', 'quatro', 'cinco', 'seis', 'sete',
              'oito', 'nove', 'dez', 'onze', 'doze', 'treze', 'catorze',
-             'quinze', 'dezasseis', 'dezassete', 'dezoito', 'dezanove', 'vinte',
+             'quinze', 'dezesseis', 'dezasseis', 'dezessete', 'dezassete', 'dezoito', 'dezenove', 'dezanove', 'vinte',
              'trinta', 'quarenta', 'cinquenta', 'sessenta', 'setenta',
-             'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilião', 'trilião',
-             'quadrilião']
+             'oitenta', 'noventa', 'cem', 'mil', 'milhão', 'bilhão', 'bilião', 'trilhão', 'trilião',
+             'quatrilhão']
_ordinal_words = ['primeiro', 'segundo', 'terceiro', 'quarto', 'quinto', 'sexto',
                  'sétimo', 'oitavo', 'nono', 'décimo', 'vigésimo', 'trigésimo',


@@ -3,6 +3,7 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
+from .lemmatizer import LOOKUP
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@@ -17,6 +18,7 @@ class RomanianDefaults(Language.Defaults):
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS
+    lemma_lookup = LOOKUP

class Romanian(Language):
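`lemma_lookup` here is a plain lookup table: a dict from inflected form to lemma, as generated in the large `lemmatizer.py` file added by this commit. A minimal sketch of how such a table behaves, with a few sample pairs taken from the Romanian tests further down (the `lemmatize` helper and `LOOKUP_DEMO` are illustrative, not spaCy internals):

```python
# Tiny illustrative sample of a lookup lemmatizer table.
LOOKUP_DEMO = {
    "câini": "câine",
    "expedițiilor": "expediție",
    "erau": "fi",
}

def lemmatize(word, lookup):
    # fall back to the surface form when the word is not in the table
    return lookup.get(word, word)

assert lemmatize("câini", LOOKUP_DEMO) == "câine"
assert lemmatize("spacy", LOOKUP_DEMO) == "spacy"
```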

spacy/lang/ro/examples.py (new file)

@@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ro import Romanian
>>> from spacy.lang.ro.examples import sentences
>>> nlp = Romanian()
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple plănuiește să cumpere o companie britanică pentru un miliard de dolari",
"Municipalitatea din San Francisco ia în calcul interzicerea roboților curieri pe trotuar",
"Londra este un oraș mare în Regatul Unit",
"Unde ești?",
"Cine este președintele Franței?",
"Care este capitala Statelor Unite?",
"Când s-a născut Barack Obama?"
]

spacy/lang/ro/lemmatizer.py (new file, 314,816 lines; diff suppressed because it is too large)

@@ -28,6 +28,8 @@ acestia
acestui
aceşti
aceştia
+acești
+aceștia
acolo
acord
acum
@@ -51,6 +53,7 @@ altfel
alti
altii
altul
+alături
am
anume
apoi
@@ -80,11 +83,15 @@ au
avea
avem
aveţi
+aveți
avut
azi
aşadar
aţi
+așadar
+ați
b
ba
bine
@@ -136,11 +143,13 @@ cât
câte
câtva
câţi
+câți
cînd
cît
cîte
cîtva
cîţi
+cîți
căci
cărei
@@ -167,6 +176,7 @@ departe
desi
despre
deşi
+deși
din
dinaintea
dintr
@@ -191,6 +201,7 @@ este
eu
exact
eşti
+ești
f
face
fara
@@ -203,6 +214,7 @@ fii
fim
fiu
fiţi
+fiți
foarte
fost
frumos
@@ -210,6 +222,7 @@ fără
g
geaba
graţie
+grație
h
halbă
i
@@ -259,6 +272,8 @@ multi
multă
mulţi
mulţumesc
+mulți
+mulțumesc
mâine
mîine
@@ -274,6 +289,7 @@ nimeri
nimic
niste
nişte
+niște
noastre
noastră
noi
@@ -284,6 +300,7 @@ nou
noua
nouă
noştri
+noștri
nu
numai
o
@@ -322,6 +339,9 @@ putini
puţin
puţina
puţină
+puțin
+puțina
+puțină
până
pînă
r
@@ -343,11 +363,13 @@ sub
sunt
suntem
sunteţi
+sunteți
sus
sută
sînt
sîntem
sînteţi
+sînteți
săi
său
@@ -367,7 +389,9 @@ toti
totul
totusi
totuşi
+totuși
toţi
+toți
trei
treia
treilea
@@ -404,6 +428,7 @@ vor
vostru
vouă
voştri
+voștri
vreme
vreo
vreun
@@ -428,15 +453,23 @@ zice
întrucât
întrucît
îţi
+îți
ăla
ălea
ăsta
ăstea
ăştia
+ăștia
şapte
şase
şi
ştiu
ţi
ţie
+șapte
+șase
+și
+știu
+ți
+ție
""".split())


@@ -58,9 +58,9 @@ cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) nogil:
    cdef int i, S_i
    for i in range(stcls.stack_depth()):
        S_i = stcls.S(i)
-        if gold.heads[target] == S_i:
+        if gold.has_dep[target] and gold.heads[target] == S_i:
            cost += 1
-        if gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
+        if gold.has_dep[S_i] and gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)):
            cost += 1
    if BINARY_COSTS and cost >= 1:
        return cost
@@ -73,10 +73,12 @@ cdef weight_t pop_cost(StateClass stcls, const GoldParseC* gold, int target) nogil:
    cdef int i, B_i
    for i in range(stcls.buffer_length()):
        B_i = stcls.B(i)
-        cost += gold.heads[B_i] == target
-        cost += gold.heads[target] == B_i
+        if gold.has_dep[B_i]:
+            cost += gold.heads[B_i] == target
        if gold.heads[B_i] == B_i or gold.heads[B_i] < target:
            break
+        if gold.has_dep[target]:
+            cost += gold.heads[target] == B_i
    if BINARY_COSTS and cost >= 1:
        return cost
    if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0:
@@ -107,6 +109,9 @@ cdef bint arc_is_gold(const GoldParseC* gold, int head, int child) nogil:
cdef bint label_is_gold(const GoldParseC* gold, int head, int child, attr_t label) nogil:
    if not gold.has_dep[child]:
-        return True
+        if label == SUBTOK_LABEL:
+            return False
+        else:
+            return True
    elif label == 0:
        return True
@@ -167,7 +172,7 @@ cdef class Reduce:
        # Decrement cost for the arcs we save
        for i in range(1, st.stack_depth()):
            S_i = st.S(i)
-            if gold.heads[st.S(0)] == S_i:
+            if gold.has_dep[st.S(0)] and gold.heads[st.S(0)] == S_i:
                cost -= 1
            if gold.heads[S_i] == st.S(0):
                cost -= 1
@@ -208,7 +213,9 @@ cdef class LeftArc:
        # Account for deps we might lose between S0 and stack
        if not s.has_head(s.S(0)):
            for i in range(1, s.stack_depth()):
-                cost += gold.heads[s.S(i)] == s.S(0)
-                cost += gold.heads[s.S(0)] == s.S(i)
+                if gold.has_dep[s.S(i)]:
+                    cost += gold.heads[s.S(i)] == s.S(0)
+                if gold.has_dep[s.S(0)]:
+                    cost += gold.heads[s.S(0)] == s.S(i)
        return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0))
@@ -284,18 +291,20 @@ cdef class Break:
            S_i = s.S(i)
            for j in range(s.buffer_length()):
                B_i = s.B(j)
-                cost += gold.heads[S_i] == B_i
-                cost += gold.heads[B_i] == S_i
+                if gold.has_dep[S_i]:
+                    cost += gold.heads[S_i] == B_i
+                if gold.has_dep[B_i]:
+                    cost += gold.heads[B_i] == S_i
                if cost != 0:
                    return cost
        # Check for sentence boundary --- if it's here, we can't have any deps
        # between stack and buffer, so rest of action is irrelevant.
-        s0_root = _get_root(s.S(0), gold)
-        b0_root = _get_root(s.B(0), gold)
-        if s0_root != b0_root or s0_root == -1 or b0_root == -1:
+        if not gold.has_dep[s.S(0)] or not gold.has_dep[s.B(0)]:
            return cost
+        if gold.sent_start[s.B_(0).l_edge] == -1:
+            return cost + 1
        else:
-            return cost + 1
+            return cost

    @staticmethod
    cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil:
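The recurring change in this hunk is guarding every `gold.heads[...]` comparison with `gold.has_dep[...]`, so tokens without an annotated gold head no longer contribute to transition costs. A pure-Python sketch of the simplified pattern, using dicts and omitting the early-break logic of the real Cython `pop_cost` (all names here are illustrative, not the parser internals):

```python
def pop_cost(heads, has_dep, buffer_ids, target):
    """Count arcs between `target` and the buffer, but only for tokens
    whose gold head is actually annotated."""
    cost = 0
    for b_i in buffer_ids:
        if has_dep[b_i]:
            cost += heads[b_i] == target
        if has_dep[target]:
            cost += heads[target] == b_i
    return cost

heads = {0: 2, 1: 2, 2: 2}
has_dep = {0: True, 1: False, 2: True}
# token 1 has no annotated head, so its entry is ignored;
# only token 0's arc to the target contributes
assert pop_cost(heads, has_dep, [0, 1], 2) == 1
```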


@@ -15,7 +15,8 @@ from .. import util
# here if it's using spaCy's tokenizer (not a different library)
# TODO: re-implement generic tokenizer tests
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'xx']
+              'it', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'ar', 'xx']
_models = {'en': ['en_core_web_sm'],
           'de': ['de_core_news_md'],
           'fr': ['fr_core_news_sm'],
@@ -50,8 +51,8 @@ def RU(request):
#@pytest.fixture(params=_languages)
#def tokenizer(request):
#    lang = util.get_lang_class(request.param)
#    return lang.Defaults.create_tokenizer()

@pytest.fixture
@@ -100,6 +101,11 @@ def fi_tokenizer():
    return util.get_lang_class('fi').Defaults.create_tokenizer()

+@pytest.fixture
+def ro_tokenizer():
+    return util.get_lang_class('ro').Defaults.create_tokenizer()
+
@pytest.fixture
def id_tokenizer():
    return util.get_lang_class('id').Defaults.create_tokenizer()
@@ -135,10 +141,9 @@ def da_tokenizer():
@pytest.fixture
def ja_tokenizer():
-    janome = pytest.importorskip("janome")
+    janome = pytest.importorskip("MeCab")
    return util.get_lang_class('ja').Defaults.create_tokenizer()

@pytest.fixture
def th_tokenizer():
    pythainlp = pytest.importorskip("pythainlp")
@@ -148,6 +153,9 @@ def th_tokenizer():
def tr_tokenizer():
    return util.get_lang_class('tr').Defaults.create_tokenizer()

+@pytest.fixture
+def ar_tokenizer():
+    return util.get_lang_class('ar').Defaults.create_tokenizer()
+
@pytest.fixture
def ru_tokenizer():


@@ -0,0 +1,26 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
    tokens = ar_tokenizer(text)
    assert len(tokens) == 1


def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
    tokens = ar_tokenizer(text)
    assert len(tokens) == 7
    assert tokens[6].text == "ق.م"
    assert tokens[6].lemma_ == "قبل الميلاد"


def test_ar_tokenizer_handles_unit_suffix(ar_tokenizer):
    text = u"يبلغ طول مضيق طارق 14كم "
    tokens = ar_tokenizer(text)
    assert len(tokens) == 6


@@ -0,0 +1,13 @@
# coding: utf8
from __future__ import unicode_literals


def test_tokenizer_handles_long_text(ar_tokenizer):
    text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
و قد نجح في الحصول على جائزة نوبل للآداب، ليكون بذلك العربي الوحيد الذي فاز بها."""
    tokens = ar_tokenizer(text)
    assert tokens[3].is_stop == True
    assert len(tokens) == 77


@@ -5,15 +5,41 @@ import pytest
TOKENIZER_TESTS = [
-    ("日本語だよ", ['日本', '', '']),
+    ("日本語だよ", ['日本', '', '', '']),
    ("東京タワーの近くに住んでいます。", ['東京', 'タワー', '', '近く', '', '住ん', '', '', 'ます', '']),
    ("吾輩は猫である。", ['吾輩', '', '', '', 'ある', '']),
-    ("月に代わって、お仕置きよ!", ['', '', '代わっ', '', '', '仕置き', '', '!']),
+    ("月に代わって、お仕置きよ!", ['', '', '代わっ', '', '', '', '仕置き', '', '!']),
    ("すもももももももものうち", ['すもも', '', 'もも', '', 'もも', '', 'うち'])
]

+TAG_TESTS = [
+    ("日本語だよ", ['名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
+    ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
+    ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
+    ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点']),
+    ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
+]
+
+POS_TESTS = [
+    ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
+    ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
+    ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
+    ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
+    ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
+]
+
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens):
    tokens = [token.text for token in ja_tokenizer(text)]
    assert tokens == expected_tokens

+@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
+def test_japanese_tags(ja_tokenizer, text, expected_tags):
+    tags = [token.tag_ for token in ja_tokenizer(text)]
+    assert tags == expected_tags
+
+@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
+def test_japanese_pos(ja_tokenizer, text, expected_pos):
+    pos = [token.pos_ for token in ja_tokenizer(text)]
+    assert pos == expected_pos


@@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('string,lemma', [('câini', 'câine'),
                                          ('expedițiilor', 'expediție'),
                                          ('pensete', 'pensetă'),
                                          ('erau', 'fi')])
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
    tokens = ro_tokenizer(string)
    assert tokens[0].lemma_ == lemma


@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals

from ..util import add_vecs_to_vocab, get_doc

import pytest


@pytest.fixture
def vectors():
    return [("a", [1, 2, 3]), ("letter", [4, 5, 6])]


@pytest.fixture
def vocab(en_vocab, vectors):
    add_vecs_to_vocab(en_vocab, vectors)
    return en_vocab


def test_issue2219(vocab, vectors):
    [(word1, vec1), (word2, vec2)] = vectors
    doc = get_doc(vocab, words=[word1, word2])
    assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])


@@ -155,7 +155,7 @@ cdef class Token:
        """
        if 'similarity' in self.doc.user_token_hooks:
            return self.doc.user_token_hooks['similarity'](self)
-        if hasattr(other, '__len__') and len(other) == 1:
+        if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, "__getitem__"):
            if self.c.lex.orth == getattr(other[0], 'orth', None):
                return 1.0
        elif hasattr(other, 'orth'):
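The added `hasattr(other, "__getitem__")` check matters because the old guard only tested `__len__`: a one-element set passes the length test but raises on `other[0]`. A small sketch of the predicate (the helper name is illustrative):

```python
def is_single_indexable(other):
    """True only for objects that can safely be treated as a
    one-element sequence and indexed with other[0]."""
    return (hasattr(other, "__len__") and len(other) == 1
            and hasattr(other, "__getitem__"))

assert is_single_indexable(["token"]) is True
assert is_single_indexable("a") is True       # strings index fine
assert is_single_indexable({"token"}) is False  # sets have no __getitem__
assert is_single_indexable([1, 2]) is False
```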


@@ -27,8 +27,6 @@ The docs can always use another example or more detail, and they should always be
While all page content lives in the `.jade` files, article meta (page titles, sidebars etc.) is stored as JSON. Each folder contains a `_data.json` with all required meta for its files.

-For simplicity, all sites linked in the [tutorials](https://spacy.io/docs/usage/tutorials) and [showcase](https://spacy.io/docs/usage/showcase) are also stored as JSON. So in order to edit those pages, there's no need to dig into the Jade files, simply edit the [`_data.json`](docs/usage/_data.json).

### Markup language and conventions

Jade/Pug is a whitespace-sensitive markup language that compiles to HTML. Indentation is used to nest elements, and for template logic, like `if`/`else` or `for`, mainly used to iterate over objects and arrays in the meta data. It also allows inline JavaScript expressions.


@@ -12,8 +12,6 @@
    "COMPANY_URL": "https://explosion.ai",
    "DEMOS_URL": "https://explosion.ai/demos",
    "MODELS_REPO": "explosion/spacy-models",
-    "KERNEL_BINDER": "ines/spacy-binder",
-    "KERNEL_PYTHON": "python3",
    "SPACY_VERSION": "2.0",
    "BINDER_VERSION": "2.0.11",
@@ -87,7 +85,7 @@
    ],
    "V_CSS": "2.1.3",
-    "V_JS": "2.1.1",
+    "V_JS": "2.1.2",
    "DEFAULT_SYNTAX": "python",
    "ANALYTICS": "UA-58931649-1",
    "MAILCHIMP": {


@@ -15,7 +15,7 @@ p
     +cell Nationalities or religious or political groups.
     +row
-        +cell #[code FACILITY]
+        +cell #[code FAC]
         +cell Buildings, airports, highways, bridges, etc.
     +row


@@ -149,7 +149,7 @@ p
 +aside-code("Example").
     from spacy.tokens import Doc
-    city_getter = lambda doc: doc.text in ('New York', 'Paris', 'Berlin')
+    city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
     Doc.set_extension('has_city', getter=city_getter)
     doc = nlp(u'I like New York')
     assert doc._.has_city


@@ -127,7 +127,7 @@ p
 +aside-code("Example").
     from spacy.tokens import Span
-    city_getter = lambda span: span.text in ('New York', 'Paris', 'Berlin')
+    city_getter = lambda span: any(city in span.text for city in ('New York', 'Paris', 'Berlin'))
     Span.set_extension('has_city', getter=city_getter)
     doc = nlp(u'I like New York in Autumn')
     assert doc[1:4]._.has_city


@@ -47,7 +47,7 @@ import initUniverse from './universe.vue.js';
 */
 {
     if (window.Juniper) {
-        new Juniper({ repo: 'ines/spacy-binder' });
+        new Juniper({ repo: 'ines/spacy-io-binder' });
     }
 }

File diff suppressed because one or more lines are too long


@@ -445,6 +445,29 @@
     },
     "category": ["visualizers"]
 },
+{
+    "id": "scattertext",
+    "slogan": "Beautiful visualizations of how language differs among document types",
+    "description": "A tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in a sexy, interactive scatter plot with non-overlapping term labels. Exploratory data analysis just got more fun.",
+    "github": "JasonKessler/scattertext",
+    "image": "https://jasonkessler.github.io/2012conventions0.0.2.2.png",
+    "code_example": [
+        "import spacy",
+        "import scattertext as st",
+        "",
+        "nlp = spacy.load('en')",
+        "corpus = st.CorpusFromPandas(convention_df,",
+        "                             category_col='party',",
+        "                             text_col='text',",
+        "                             nlp=nlp).build()"
+    ],
+    "author": "Jason Kessler",
+    "author_links": {
+        "github": "JasonKessler",
+        "twitter": "jasonkessler"
+    },
+    "category": ["visualizers"]
+},
 {
     "id": "rasa",
     "title": "Rasa NLU",


@@ -4,7 +4,7 @@ p
 | The individual components #[strong expose variables] that can be imported
 | within a language module, and added to the language's #[code Defaults].
 | Some components, like the punctuation rules, usually don't need much
-| customisation and can simply be imported from the global rules. Others,
+| customisation and can be imported from the global rules. Others,
 | like the tokenizer and norm exceptions, are very specific and will make
 | a big difference to spaCy's performance on the particular language and
 | training a language model.


@@ -92,6 +92,7 @@
     "Dependency Parse": "dependency-parse",
     "Named Entities": "named-entities",
     "Tokenization": "tokenization",
+    "Sentence Segmentation": "sbd",
     "Rule-based Matching": "rule-based-matching"
     }
 },


@@ -39,7 +39,7 @@ p
 | this. The above error mostly occurs when doing a system-wide installation,
 | which will create the symlinks in a system directory. Run the
 | #[code download] or #[code link] command as administrator (on Windows,
-| simply right-click on your terminal or shell ans select "Run as
+| you can either right-click on your terminal or shell and select "Run as
 | Administrator"), or use a #[code virtualenv] to install spaCy in a user
 | directory, instead of doing a system-wide installation.


@@ -220,8 +220,8 @@ p
 p
 | The best way to understand spaCy's dependency parser is interactively.
-| To make this easier, spaCy v2.0+ comes with a visualization module. Simply
-| pass a #[code Doc] or a list of #[code Doc] objects to
+| To make this easier, spaCy v2.0+ comes with a visualization module. You
+| can pass a #[code Doc] or a list of #[code Doc] objects to
 | displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
 | run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
 | to generate the raw markup. If you want to know how to write rules that


@@ -195,7 +195,7 @@ p
 | lets you explore an entity recognition model's behaviour interactively.
 | If you're training a model, it's very useful to run the visualization
 | yourself. To help you do that, spaCy v2.0+ comes with a visualization
-| module. Simply pass a #[code Doc] or a list of #[code Doc] objects to
+| module. You can pass a #[code Doc] or a list of #[code Doc] objects to
 | displaCy and run #[+api("top-level#displacy.serve") #[code displacy.serve]] to
 | run the web server, or #[+api("top-level#displacy.render") #[code displacy.render]]
 | to generate the raw markup.


@@ -0,0 +1,129 @@
//- 💫 DOCS > USAGE > LINGUISTIC FEATURES > SENTENCE SEGMENTATION
p
| A #[+api("doc") #[code Doc]] object's sentences are available via the
| #[code Doc.sents] property. Unlike other libraries, spaCy uses the
| dependency parse to determine sentence boundaries. This is usually more
| accurate than a rule-based approach, but it also means you'll need a
| #[strong statistical model] and accurate predictions. If your
| texts are closer to general-purpose news or web text, this should work
| well out-of-the-box. For social media or conversational text that
| doesn't follow the same rules, your application may benefit from a custom
| rule-based implementation. You can either plug a rule-based component
| into your #[+a("/usage/processing-pipelines") processing pipeline] or use
| the #[code SentenceSegmenter] component with a custom strategy.
+h(3, "sbd-parser") Default: Using the dependency parse
+tag-model("dependency parser")
p
| To view a #[code Doc]'s sentences, you can iterate over the
| #[code Doc.sents], a generator that yields
| #[+api("span") #[code Span]] objects.
+code-exec.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
print(sent.text)
+h(3, "sbd-manual") Setting boundaries manually
p
| spaCy's dependency parser respects already set boundaries, so you can
| preprocess your #[code Doc] using custom rules #[em before] it's
| parsed. This can be done by adding a
| #[+a("/usage/processing-pipelines") custom pipeline component]. Depending
| on your text, this may also improve accuracy, since the parser is
| constrained to predict parses consistent with the sentence boundaries.
+infobox("Important note", "⚠️")
| To prevent inconsistent state, you can only set boundaries #[em before] a
| document is parsed (and #[code Doc.is_parsed] is #[code False]). To
| ensure that your component is added in the right place, you can set
| #[code before='parser'] or #[code first=True] when adding it to the
| pipeline using #[+api("language#add_pipe") #[code nlp.add_pipe]].
p
| Here's an example of a component that implements a pre-processing rule
| for splitting on #[code '...'] tokens. The component is added before
| the parser, which is then used to further segment the text. This
| approach can be useful if you want to implement #[em additional] rules
| specific to your data, while still being able to take advantage of
| dependency-based sentence segmentation.
+code-exec.
import spacy
text = u"this is a sentence...hello...and another sentence."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print('Before:', [sent.text for sent in doc.sents])
def set_custom_boundaries(doc):
for token in doc[:-1]:
if token.text == '...':
doc[token.i+1].is_sent_start = True
return doc
nlp.add_pipe(set_custom_boundaries, before='parser')
doc = nlp(text)
print('After:', [sent.text for sent in doc.sents])
+h(3, "sbd-component") Rule-based pipeline component
p
| The #[code sentencizer] component is a
| #[+a("/usage/processing-pipelines") pipeline component] that splits
| sentences on punctuation like #[code &period;], #[code !] or #[code ?].
| You can plug it into your pipeline if you only need sentence boundaries
| without the dependency parse. Note that #[code Doc.sents] will
| #[strong raise an error] if no sentence boundaries are set.
+code-exec.
import spacy
from spacy.lang.en import English
nlp = English() # just the language with no model
sbd = nlp.create_pipe('sentencizer') # or: nlp.create_pipe('sbd')
nlp.add_pipe(sbd)
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
print(sent.text)
+h(3, "sbd-custom") Custom rule-based strategy
p
| If you want to implement your own strategy that differs from the default
| rule-based approach of splitting on punctuation, you can also instantiate
| the #[code SentenceSegmenter] directly and pass in your own strategy.
| The strategy should be a function that takes a #[code Doc] object and
| yields a #[code Span] for each sentence. Here's an example of a custom
| segmentation strategy for splitting on newlines only:
+code-exec.
from spacy.lang.en import English
from spacy.pipeline import SentenceSegmenter
def split_on_newlines(doc):
start = 0
seen_newline = False
for word in doc:
if seen_newline and not word.is_space:
yield doc[start:word.i]
start = word.i
seen_newline = False
elif word.text == '\n':
seen_newline = True
if start < len(doc):
yield doc[start:len(doc)]
nlp = English() # just the language with no model
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
doc = nlp(u"This is a sentence\n\nThis is another sentence\nAnd more")
for sent in doc.sents:
print([token.text for token in sent])


@@ -274,7 +274,7 @@ p
 | In spaCy v1.x, you had to add a custom tokenizer by passing it to the
 | #[code make_doc] keyword argument, or by passing a tokenizer "factory"
 | to #[code create_make_doc]. This was unnecessarily complicated. Since
-| spaCy v2.0, you can simply write to #[code nlp.tokenizer]. If your
+| spaCy v2.0, you can write to #[code nlp.tokenizer] instead. If your
 | tokenizer needs the vocab, you can write a function and use
 | #[code nlp.vocab].
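As a sketch of the v2.0 approach described in this hunk, any callable that takes a text and returns a `Doc` can be assigned to `nlp.tokenizer`. The whitespace-only tokenizer below is purely illustrative, not spaCy's built-in behaviour:

```python
from spacy.lang.en import English
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Illustrative tokenizer that splits on single spaces only."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # Build a Doc directly from the pre-split words
        return Doc(self.vocab, words=words)

nlp = English()  # blank pipeline; no statistical model needed
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"What's happened to me? he thought.")
print([token.text for token in doc])
```

Because the custom tokenizer never splits punctuation, `"me?"` and `"thought."` stay single tokens, unlike with the default tokenizer.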


@@ -20,14 +20,14 @@ include _install-basics
 p
 | To download a model directly using #[+a("https://pypi.python.org/pypi/pip") pip],
-| simply point #[code pip install] to the URL or local path of the archive
+| point #[code pip install] to the URL or local path of the archive
 | file. To find the direct link to a model, head over to the
 | #[+a(gh("spacy-models") + "/releases") model releases], right click on the archive
 | link and copy it to your clipboard.
 +code(false, "bash").
     # with external URL
-    pip install #{gh("spacy-models")}/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
+    pip install #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
     # with local file
     pip install /Users/you/en_core_web_md-1.2.0.tar.gz
@@ -69,7 +69,7 @@
 p
 | You can place the #[strong model package directory] anywhere on your
-| local file system. To use it with spaCy, simply assign it a name by
+| local file system. To use it with spaCy, assign it a name by
 | creating a #[+a("#usage") shortcut link] for the data directory.
 +h(3, "usage") Using models with spaCy
+h(3, "usage") Using models with spaCy +h(3, "usage") Using models with spaCy


@@ -26,7 +26,7 @@ p
 p
 | Because all models are valid Python packages, you can add them to your
 | application's #[code requirements.txt]. If you're running your own
-| internal PyPi installation, you can simply upload the models there. pip's
+| internal PyPi installation, you can upload the models there. pip's
 | #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
 | supports both package names to download via a PyPi server, as well as direct
 | URLs.
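For instance, a requirements file mixing a plain package name and a direct model URL might look like this (package versions here are illustrative):

```text
# requirements.txt — spaCy from PyPI plus a model by direct URL
spacy>=2.0.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
```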


@@ -5,7 +5,7 @@ p
 | segments it into words, punctuation and so on. This is done by applying
 | rules specific to each language. For example, punctuation at the end of a
 | sentence should be split off whereas "U.K." should remain one token.
-| Each #[code Doc] consists of individual tokens, and we can simply iterate
+| Each #[code Doc] consists of individual tokens, and we can iterate
 | over them:
 +code-exec.
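As a minimal runnable sketch of the tokenization described above: splitting punctuation off does not require a statistical model, so the blank `English` class is enough here (the docs' own examples typically load `en_core_web_sm` instead):

```python
from spacy.lang.en import English

nlp = English()  # tokenizer and language data only; no model download needed
doc = nlp(u"Hello, world! This is spaCy.")
# Trailing punctuation is split into its own tokens
print([token.text for token in doc])
```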


@@ -72,10 +72,11 @@ p
 | you want to visualize output from other libraries, like
 | #[+a("http://www.nltk.org") NLTK] or
 | #[+a("https://github.com/tensorflow/models/tree/master/research/syntaxnet") SyntaxNet].
-| Simply convert the dependency parse or recognised entities to displaCy's
-| format and set #[code manual=True] on either #[code render()] or
-| #[code serve()]. When setting #[code ents] manually, make sure to supply
-| them in the right order, i.e. starting with the lowest start position.
+| If you set #[code manual=True] on either #[code render()] or
+| #[code serve()], you can pass in data in displaCy's format (instead of
+| #[code Doc] objects). When setting #[code ents] manually, make sure to
+| supply them in the right order, i.e. starting with the lowest start
+| position.
 +aside-code("Example").
     ex = [{'text': 'But Google is starting from behind.',
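A runnable sketch of the manual entity format: each dict carries the raw text plus character-offset entity spans (the offsets below mark "Google" and are illustrative), and `manual=True` tells displaCy the input is pre-formatted data rather than `Doc` objects:

```python
from spacy import displacy

ex = [{'text': 'But Google is starting from behind.',
       'ents': [{'start': 4, 'end': 10, 'label': 'ORG'}],
       'title': None}]
# render() returns the raw HTML markup as a string
html = displacy.render(ex, style='ent', manual=True)
print(html)
```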
@@ -109,7 +110,7 @@ p
 | If you want to use the visualizers as part of a web application, for
 | example to create something like our
 | #[+a(DEMOS_URL + "/displacy") online demo], it's not recommended to
-| simply wrap and serve the displaCy renderer. Instead, you should only
+| only wrap and serve the displaCy renderer. Instead, you should only
 | rely on the server to perform spaCy's processing capabilities, and use
 | #[+a(gh("displacy")) displaCy.js] to render the JSON-formatted output.


@@ -33,6 +33,10 @@ p
 +h(2, "tokenization") Tokenization
 include _linguistic-features/_tokenization
++section("sbd")
+    +h(2, "sbd") Sentence Segmentation
+    include _linguistic-features/_sentence-segmentation
 +section("rule-based-matching")
 +h(2, "rule-based-matching") Rule-based matching
 include _linguistic-features/_rule-based-matching