Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-27 01:34:30 +03:00)
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
commit ec41ceb383
106
.github/contributors/aliiae.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Aliia Erofeeva |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 13 June 2018 |
+| GitHub username | aliiae |
+| Website (optional) | |
106
.github/contributors/btrungchi.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Bui Trung Chi |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2018-06-30 |
+| GitHub username | btrungchi |
+| Website (optional) | |
106
.github/contributors/coryhurst.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -----------------------------|
+| Name | Cory Hurst |
+| Company name (if applicable) | Samtec Smart Platform Group |
+| Title or role (if applicable) | SoftwareDeveloper |
+| Date | 2017-11-13 |
+| GitHub username | cjhurst |
+| Website (optional) | https://blog.spg.ai/ |
106
.github/contributors/mirfan899.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | ------------------------ |
+| Name | Muhammad Irfan |
+| Company name (if applicable) | |
+| Title or role (if applicable) | AI & ML Developer |
+| Date | 2018-09-06 |
+| GitHub username | mirfan899 |
+| Website (optional) | |
@@ -11,5 +11,5 @@ ujson>=1.35
 dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
-pytest>=3.0.6,<4.0.0
+pytest>=3.6.0,<4.0.0
 mock>=2.0.0,<3.0.0
1
setup.py
@@ -222,6 +222,7 @@ def setup_package():
                 'Programming Language :: Python :: 3.4',
                 'Programming Language :: Python :: 3.5',
                 'Programming Language :: Python :: 3.6',
+                'Programming Language :: Python :: 3.7',
                 'Topic :: Scientific/Engineering'],
             cmdclass = {
                 'build_ext': build_ext_subclass},
@@ -20,5 +20,5 @@ def blank(name, **kwargs):
     return LangClass(**kwargs)
 
 
-def info(model=None, markdown=False):
-    return cli_info(model, markdown)
+def info(model=None, markdown=False, silent=False):
+    return cli_info(model, markdown, silent)
@@ -2,7 +2,7 @@
 from __future__ import unicode_literals
 
 from .render import DependencyRenderer, EntityRenderer
-from ..tokens import Doc
+from ..tokens import Doc, Span
 from ..compat import b_to_str
 from ..errors import Errors, Warnings, user_warning
 from ..util import prints, is_in_jupyter
@@ -29,8 +29,11 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
                  'ent': (EntityRenderer, parse_ents)}
     if style not in factories:
         raise ValueError(Errors.E087.format(style=style))
-    if isinstance(docs, Doc) or isinstance(docs, dict):
+    if isinstance(docs, (Doc, Span, dict)):
         docs = [docs]
+    docs = [obj if not isinstance(obj, Span) else obj.as_doc() for obj in docs]
+    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
+        raise ValueError(Errors.E096)
     renderer, converter = factories[style]
     renderer = renderer(options=options)
     parsed = [converter(doc, options) for doc in docs] if not manual else docs
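The hunk above lets `render` accept a `Span` as well as a `Doc` or dict. A minimal sketch of that normalization step follows, using hypothetical stand-in classes rather than spaCy's real `Doc` and `Span`; the helper name `normalize_docs` and the inline error text are assumptions for illustration only.

```python
class Doc(object):
    """Hypothetical stand-in for spacy.tokens.Doc."""


class Span(object):
    """Hypothetical stand-in for spacy.tokens.Span."""

    def as_doc(self):
        # The real Span.as_doc() copies the span's data into a new Doc.
        return Doc()


def normalize_docs(docs):
    # Wrap a single object in a list so the rest of the code sees a sequence.
    if isinstance(docs, (Doc, Span, dict)):
        docs = [docs]
    # Convert every Span to a Doc; leave everything else untouched.
    docs = [obj.as_doc() if isinstance(obj, Span) else obj for obj in docs]
    # Reject anything that is still not a supported type.
    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
        raise ValueError("Invalid object passed to displaCy")
    return docs
```

Under these assumptions, `normalize_docs(Span())` yields a one-element list holding a `Doc`, while `normalize_docs(["plain text"])` raises `ValueError`.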
@@ -136,7 +136,7 @@ class DependencyRenderer(object):
         end (int): X-coordinate of arrow end point.
         RETURNS (unicode): Definition of the arrow head path ('d' attribute).
         """
-        if direction is 'left':
+        if direction == 'left':
             pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
         else:
             pos1, pos2, pos3 = (end, end+self.arrow_width-2,
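The change from `is` to `==` matters because `is` tests object identity, not string equality. A small sketch of the difference follows; the helper `is_left` is hypothetical, and since whether two equal strings share identity depends on the interpreter, only equality is relied on here.

```python
def is_left(direction):
    # Compare by value: this holds for any string equal to 'left',
    # however it was constructed.
    return direction == 'left'


# Built at runtime, so it is not guaranteed to be the same object
# as the literal 'left', even though the two are equal.
runtime_value = ''.join(['le', 'ft'])
print(is_left(runtime_value))
```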
@@ -257,6 +257,8 @@ class Errors(object):
     E094 = ("Error reading line {line_num} in vectors file {loc}.")
     E095 = ("Can't write to frozen dictionary. This is likely an internal "
             "error. Are you writing to a default function argument?")
+    E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
+            "Span objects, or dicts if set to manual=True.")
 
 
 @add_codes
@@ -16,9 +16,11 @@ _latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
 _persian = r'[\p{L}&&\p{Arabic}]'
 _russian_lower = r'[ёа-я]'
 _russian_upper = r'[ЁА-Я]'
+_tatar_lower = r'[әөүҗңһ]'
+_tatar_upper = r'[ӘӨҮҖҢҺ]'
 
-_upper = [_latin_upper, _russian_upper]
-_lower = [_latin_lower, _russian_lower]
+_upper = [_latin_upper, _russian_upper, _tatar_upper]
+_lower = [_latin_lower, _russian_lower, _tatar_lower]
 _uncased = [_bengali, _hebrew, _persian]
 
 ALPHA = merge_char_classes(_upper + _lower + _uncased)
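The new `_tatar_lower` class adds the Tatar-specific Cyrillic letters that fall outside `[ёа-я]`. A rough illustration with the standard-library `re` module follows. The hunk combines these classes through `merge_char_classes`; the `LOWER` pattern, `RUSSIAN_ONLY` pattern and `is_lower_alpha` helper below are assumptions for demonstration only.

```python
import re

_russian_lower = r'[ёа-я]'
_tatar_lower = r'[әөүҗңһ]'

# Plain alternation of the two classes; an assumption standing in for
# merge_char_classes, which is not reproduced here.
LOWER = re.compile('(?:{})'.format('|'.join([_russian_lower, _tatar_lower])))
RUSSIAN_ONLY = re.compile(_russian_lower)


def is_lower_alpha(word, pattern=LOWER):
    # True if every character of the word matches the given class.
    return all(pattern.fullmatch(char) for char in word)


print(is_lower_alpha('мең'))                # the Tatar letter 'ң' is covered
print(is_lower_alpha('мең', RUSSIAN_ONLY))  # 'ң' lies outside [ёа-я]
```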
@@ -60,9 +60,8 @@ def detailed_tokens(tokenizer, text):
         parts = node.feature.split(',')
         pos = ','.join(parts[0:4])
 
-        if len(parts) > 6:
+        if len(parts) > 7:
             # this information is only available for words in the tokenizer dictionary
-            reading = parts[6]
             base = parts[7]
 
         words.append( ShortUnitWord(surface, base, pos) )
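The guard changes from `> 6` to `> 7` because the block now only reads `parts[7]`, which exists only when the split yields at least eight fields. A sketch of that bounds check on made-up feature strings follows; the helper `base_form` and the fallback to the surface form are assumptions, not shown in the hunk.

```python
def base_form(surface, feature):
    parts = feature.split(',')
    # Assumed default: fall back to the surface form.
    base = surface
    if len(parts) > 7:
        # Index 7 is only valid when there are at least 8 fields.
        base = parts[7]
    return base


print(base_form('x', 'a,b,c,d,e,f,g'))    # 7 fields: falls back to the surface
print(base_form('x', 'a,b,c,d,e,f,g,h'))  # 8 fields: uses parts[7]
```

With the old `> 6` guard, a seven-field feature string would pass the check and then fail with `IndexError` on `parts[7]`.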
31
spacy/lang/tt/__init__.py
Normal file
@@ -0,0 +1,31 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .lex_attrs import LEX_ATTRS
+from .punctuation import TOKENIZER_INFIXES
+from .stop_words import STOP_WORDS
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ...attrs import LANG
+from ...language import Language
+from ...util import update_exc
+
+
+class TatarDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters[LANG] = lambda text: 'tt'
+
+    lex_attr_getters.update(LEX_ATTRS)
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    infixes = tuple(TOKENIZER_INFIXES)
+
+    stop_words = STOP_WORDS
+
+
+class Tatar(Language):
+    lang = 'tt'
+    Defaults = TatarDefaults
+
+
+__all__ = ['Tatar']
19
spacy/lang/tt/examples.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
>>> from spacy.lang.tt.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"Apple Бөекбритания стартабын $1 миллиард өчен сатып алыун исәпли.",
|
||||||
|
"Автоном автомобильләр иминият җаваплылыкны җитештерүчеләргә күчерә.",
|
||||||
|
"Сан-Франциско тротуар буенча йөри торган робот-курьерларны тыю мөмкинлеген карый.",
|
||||||
|
"Лондон - Бөекбританиядә урнашкан зур шәһәр.",
|
||||||
|
"Син кайда?",
|
||||||
|
"Францияда кем президент?",
|
||||||
|
"Америка Кушма Штатларының башкаласы нинди шәһәр?",
|
||||||
|
"Барак Обама кайчан туган?"
|
||||||
|
]
|
29
spacy/lang/tt/lex_attrs.py
Normal file
|
@ -0,0 +1,29 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
_num_words = ['нуль', 'ноль', 'бер', 'ике', 'өч', 'дүрт', 'биш', 'алты', 'җиде',
|
||||||
|
'сигез', 'тугыз', 'ун', 'унбер', 'унике', 'унөч', 'ундүрт',
|
||||||
|
'унбиш', 'уналты', 'унҗиде', 'унсигез', 'унтугыз', 'егерме',
|
||||||
|
'утыз', 'кырык', 'илле', 'алтмыш', 'җитмеш', 'сиксән', 'туксан',
|
||||||
|
'йөз', 'мең', 'төмән', 'миллион', 'миллиард', 'триллион',
|
||||||
|
'триллиард']
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
text = text.replace(',', '').replace('.', '')
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count('/') == 1:
|
||||||
|
num, denom = text.split('/')
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text in _num_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
LEX_ATTRS = {
|
||||||
|
LIKE_NUM: like_num
|
||||||
|
}
|
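The `like_num` getter above can be exercised without spaCy. The sketch below copies its logic and uses a small subset of `_num_words`; the name `NUM_WORDS` and the sample inputs are illustrative assumptions. Note that, as written in the diff, the lookup is case-sensitive, so a capitalised number word is not recognised.

```python
# -*- coding: utf-8 -*-
# Small illustrative subset of the Tatar number words from the diff.
NUM_WORDS = {'бер', 'ике', 'өч', 'биш', 'ун', 'йөз', 'мең'}


def like_num(text):
    """Mirror of the diff's like_num: digits, simple fractions, number words."""
    # Thousands separators and decimal points are dropped before the digit check.
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # Exactly one slash: treat as a fraction if both sides are digits.
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS


for sample in ['1,000', '10.5', '3/4', 'биш', 'Биш', 'китап']:
    print(sample, like_num(sample))
```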
19
spacy/lang/tt/punctuation.py
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, QUOTES, HYPHENS
|
||||||
|
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||||
|
|
||||||
|
_hyphens_no_dash = HYPHENS.replace('-', '').strip('|').replace('||', '')
|
||||||
|
_infixes = (LIST_ELLIPSES + LIST_ICONS +
|
||||||
|
[r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
|
||||||
|
r'(?<=[{a}])[,!?/\(\)]+(?=[{a}])'.format(a=ALPHA),
|
||||||
|
r'(?<=[{a}{q}])[:<>=](?=[{a}])'.format(a=ALPHA, q=QUOTES),
|
||||||
|
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
|
||||||
|
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
|
||||||
|
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=QUOTES),
|
||||||
|
r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA,
|
||||||
|
h=_hyphens_no_dash),
|
||||||
|
r'(?<=[0-9])-(?=[0-9])'])
|
||||||
|
|
||||||
|
TOKENIZER_INFIXES = _infixes
|
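To see how lookbehind/lookahead infix patterns like the ones above behave, here is a rough standalone sketch using only the standard `re` module. The `ALPHA` character class is a simplified stand-in (ASCII letters plus the Cyrillic block), not spaCy's actual `ALPHA` from `char_classes`, and `find_infixes` is a hypothetical helper. Only two of the patterns from the diff are reproduced.

```python
# -*- coding: utf-8 -*-
import re

# Simplified stand-in for spaCy's ALPHA class (assumption for this sketch):
# ASCII letters plus the Cyrillic Unicode block, which covers Tatar letters.
ALPHA = "a-zA-Z\u0400-\u04FF"

infix_patterns = [
    r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),  # comma between two letters
    r'(?<=[0-9])-(?=[0-9])',                  # hyphen between two digits
]
infix_re = re.compile('|'.join(infix_patterns))


def find_infixes(text):
    """Return (offset, matched_text) for every infix match in `text`."""
    return [(m.start(), m.group()) for m in infix_re.finditer(text)]


print(find_infixes("2017-2018"))
print(find_infixes("алма,груша"))
print(find_infixes("Сан-Франциско"))
```

The lookarounds match only the separator itself, so the surrounding characters stay in their tokens. A hyphen between letters is not matched by these two patterns, so a word such as "Сан-Франциско" yields no infix here.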
174
spacy/lang/tt/stop_words.py
Normal file
|
@ -0,0 +1,174 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
# Tatar stopwords are from https://github.com/aliiae/stopwords-tt
|
||||||
|
|
||||||
|
STOP_WORDS = set("""алай алайса алар аларга аларда алардан аларны аларның аларча
|
||||||
|
алары аларын аларынга аларында аларыннан аларының алтмыш алтмышынчы алтмышынчыга
|
||||||
|
алтмышынчыда алтмышынчыдан алтмышынчылар алтмышынчыларга алтмышынчыларда
|
||||||
|
алтмышынчылардан алтмышынчыларны алтмышынчыларның алтмышынчыны алтмышынчының
|
||||||
|
алты алтылап алтынчы алтынчыга алтынчыда алтынчыдан алтынчылар алтынчыларга
|
||||||
|
алтынчыларда алтынчылардан алтынчыларны алтынчыларның алтынчыны алтынчының
|
||||||
|
алтышар анда андагы андай андый андыйга андыйда андыйдан андыйны андыйның аннан
|
||||||
|
ансы анча аны аныкы аныкын аныкынга аныкында аныкыннан аныкының анысы анысын
|
||||||
|
анысынга анысында анысыннан анысының аның аныңча аркылы ары аша аңа аңар аңарга
|
||||||
|
аңарда аңардагы аңардан
|
||||||
|
|
||||||
|
бар бара барлык барча барчасы барчасын барчасына барчасында барчасыннан
|
||||||
|
барчасының бары башка башкача белән без безгә бездә бездән безне безнең безнеңчә
|
||||||
|
белдерүенчә белән бер бергә беренче беренчегә беренчедә беренчедән беренчеләр
|
||||||
|
беренчеләргә беренчеләрдә беренчеләрдән беренчеләрне беренчеләрнең беренчене
|
||||||
|
беренченең беркайда беркайсы беркая беркаян беркем беркемгә беркемдә беркемне
|
||||||
|
беркемнең беркемнән берлән берни бернигә бернидә бернидән бернинди бернине
|
||||||
|
бернинең берничек берничә бернәрсә бернәрсәгә бернәрсәдә бернәрсәдән бернәрсәне
|
||||||
|
бернәрсәнең беррәттән берсе берсен берсенгә берсендә берсенең берсеннән берәр
|
||||||
|
берәрсе берәрсен берәрсендә берәрсенең берәрсеннән берәрсенә берәү бигрәк бик
|
||||||
|
бирле бит биш бишенче бишенчегә бишенчедә бишенчедән бишенчеләр бишенчеләргә
|
||||||
|
бишенчеләрдә бишенчеләрдән бишенчеләрне бишенчеләрнең бишенчене бишенченең
|
||||||
|
бишләп болай болар боларга боларда болардан боларны боларның болары боларын
|
||||||
|
боларынга боларында боларыннан боларының бу буе буена буенда буенча буйлап
|
||||||
|
буларак булачак булды булмый булса булып булыр булырга бусы бүтән бәлки бән
|
||||||
|
бәрабәренә бөтен бөтенесе бөтенесен бөтенесендә бөтенесенең бөтенесеннән
|
||||||
|
бөтенесенә
|
||||||
|
|
||||||
|
вә
|
||||||
|
|
||||||
|
гел генә гына гүя гүяки гәрчә
|
||||||
|
|
||||||
|
да ди дигән диде дип дистәләгән дистәләрчә дүрт дүртенче дүртенчегә дүртенчедә
|
||||||
|
дүртенчедән дүртенчеләр дүртенчеләргә дүртенчеләрдә дүртенчеләрдән дүртенчеләрне
|
||||||
|
дүртенчеләрнең дүртенчене дүртенченең дүртләп дә
|
||||||
|
|
||||||
|
егерме егерменче егерменчегә егерменчедә егерменчедән егерменчеләр
|
||||||
|
егерменчеләргә егерменчеләрдә егерменчеләрдән егерменчеләрне егерменчеләрнең
|
||||||
|
егерменчене егерменченең ел елда
|
||||||
|
|
||||||
|
иде идек идем ике икенче икенчегә икенчедә икенчедән икенчеләр икенчеләргә
|
||||||
|
икенчеләрдә икенчеләрдән икенчеләрне икенчеләрнең икенчене икенченең икешәр икән
|
||||||
|
илле илленче илленчегә илленчедә илленчедән илленчеләр илленчеләргә
|
||||||
|
илленчеләрдә илленчеләрдән илленчеләрне илленчеләрнең илленчене илленченең илә
|
||||||
|
илән инде исә итеп иткән итте итү итә итәргә иң
|
||||||
|
|
||||||
|
йөз йөзенче йөзенчегә йөзенчедә йөзенчедән йөзенчеләр йөзенчеләргә йөзенчеләрдә
|
||||||
|
йөзенчеләрдән йөзенчеләрне йөзенчеләрнең йөзенчене йөзенченең йөзләгән йөзләрчә
|
||||||
|
йөзәрләгән
|
||||||
|
|
||||||
|
кадәр кай кайбер кайберләре кайберсе кайберәү кайберәүгә кайберәүдә кайберәүдән
|
||||||
|
кайберәүне кайберәүнең кайдагы кайсы кайсыбер кайсын кайсына кайсында кайсыннан
|
||||||
|
кайсының кайчангы кайчандагы кайчаннан караганда карамастан карамый карата каршы
|
||||||
|
каршына каршында каршындагы кебек кем кемгә кемдә кемне кемнең кемнән кенә ки
|
||||||
|
килеп килә кирәк кына кырыгынчы кырыгынчыга кырыгынчыда кырыгынчыдан
|
||||||
|
кырыгынчылар кырыгынчыларга кырыгынчыларда кырыгынчылардан кырыгынчыларны
|
||||||
|
кырыгынчыларның кырыгынчыны кырыгынчының кырык күк күпләгән күпме күпмеләп
|
||||||
|
күпмешәр күпмешәрләп күптән күрә
|
||||||
|
|
||||||
|
ләкин
|
||||||
|
|
||||||
|
максатында менә мең меңенче меңенчегә меңенчедә меңенчедән меңенчеләр
|
||||||
|
меңенчеләргә меңенчеләрдә меңенчеләрдән меңенчеләрне меңенчеләрнең меңенчене
|
||||||
|
меңенченең меңләгән меңләп меңнәрчә меңәрләгән меңәрләп миллиард миллиардлаган
|
||||||
|
миллиардларча миллион миллионлаган миллионнарча миллионынчы миллионынчыга
|
||||||
|
миллионынчыда миллионынчыдан миллионынчылар миллионынчыларга миллионынчыларда
|
||||||
|
миллионынчылардан миллионынчыларны миллионынчыларның миллионынчыны
|
||||||
|
миллионынчының мин миндә мине минем минемчә миннән миңа монда мондагы мондые
|
||||||
|
мондыен мондыенгә мондыендә мондыеннән мондыеның мондый мондыйга мондыйда
|
||||||
|
мондыйдан мондыйлар мондыйларга мондыйларда мондыйлардан мондыйларны
|
||||||
|
мондыйларның мондыйлары мондыйларын мондыйларынга мондыйларында мондыйларыннан
|
||||||
|
мондыйларының мондыйны мондыйның моннан монсыз монча моны моныкы моныкын
|
||||||
|
моныкынга моныкында моныкыннан моныкының монысы монысын монысынга монысында
|
||||||
|
монысыннан монысының моның моңа моңар моңарга мәгълүматынча мәгәр мән мөмкин
|
||||||
|
|
||||||
|
ни нибарысы никадәре нинди ниндие ниндиен ниндиенгә ниндиендә ниндиенең
|
||||||
|
ниндиеннән ниндиләр ниндиләргә ниндиләрдә ниндиләрдән ниндиләрен ниндиләренн
|
||||||
|
ниндиләреннгә ниндиләренндә ниндиләреннең ниндиләренннән ниндиләрне ниндиләрнең
|
||||||
|
ниндирәк нихәтле ничаклы ничек ничәшәр ничәшәрләп нуль нче нчы нәрсә нәрсәгә
|
||||||
|
нәрсәдә нәрсәдән нәрсәне нәрсәнең
|
||||||
|
|
||||||
|
саен сез сезгә сездә сездән сезне сезнең сезнеңчә сигез сигезенче сигезенчегә
|
||||||
|
сигезенчедә сигезенчедән сигезенчеләр сигезенчеләргә сигезенчеләрдә
|
||||||
|
сигезенчеләрдән сигезенчеләрне сигезенчеләрнең сигезенчене сигезенченең
|
||||||
|
сиксән син синдә сине синең синеңчә синнән сиңа соң сыман сүзенчә сүзләренчә
|
||||||
|
|
||||||
|
та таба теге тегеләй тегеләр тегеләргә тегеләрдә тегеләрдән тегеләре тегеләрен
|
||||||
|
тегеләренгә тегеләрендә тегеләренең тегеләреннән тегеләрне тегеләрнең тегенди
|
||||||
|
тегендигә тегендидә тегендидән тегендине тегендинең тегендә тегендәге тегене
|
||||||
|
тегенеке тегенекен тегенекенгә тегенекендә тегенекенең тегенекеннән тегенең
|
||||||
|
тегеннән тегесе тегесен тегесенгә тегесендә тегесенең тегесеннән тегеңә тиеш тик
|
||||||
|
тикле тора триллиард триллион тугыз тугызлап тугызлашып тугызынчы тугызынчыга
|
||||||
|
тугызынчыда тугызынчыдан тугызынчылар тугызынчыларга тугызынчыларда
|
||||||
|
тугызынчылардан тугызынчыларны тугызынчыларның тугызынчыны тугызынчының туксан
|
||||||
|
туксанынчы туксанынчыга туксанынчыда туксанынчыдан туксанынчылар туксанынчыларга
|
||||||
|
туксанынчыларда туксанынчылардан туксанынчыларны туксанынчыларның туксанынчыны
|
||||||
|
туксанынчының турында тыш түгел тә тәгаенләнгән төмән
|
||||||
|
|
||||||
|
уенча уйлавынча ук ул ун уналты уналтынчы уналтынчыга уналтынчыда уналтынчыдан
|
||||||
|
уналтынчылар уналтынчыларга уналтынчыларда уналтынчылардан уналтынчыларны
|
||||||
|
уналтынчыларның уналтынчыны уналтынчының унарлаган унарлап унаула унаулап унбер
|
||||||
|
унберенче унберенчегә унберенчедә унберенчедән унберенчеләр унберенчеләргә
|
||||||
|
унберенчеләрдә унберенчеләрдән унберенчеләрне унберенчеләрнең унберенчене
|
||||||
|
унберенченең унбиш унбишенче унбишенчегә унбишенчедә унбишенчедән унбишенчеләр
|
||||||
|
унбишенчеләргә унбишенчеләрдә унбишенчеләрдән унбишенчеләрне унбишенчеләрнең
|
||||||
|
унбишенчене унбишенченең ундүрт ундүртенче ундүртенчегә ундүртенчедә
|
||||||
|
ундүртенчедән ундүртенчеләр ундүртенчеләргә ундүртенчеләрдә ундүртенчеләрдән
|
||||||
|
ундүртенчеләрне ундүртенчеләрнең ундүртенчене ундүртенченең унике уникенче
|
||||||
|
уникенчегә уникенчедә уникенчедән уникенчеләр уникенчеләргә уникенчеләрдә
|
||||||
|
уникенчеләрдән уникенчеләрне уникенчеләрнең уникенчене уникенченең унлаган
|
||||||
|
унлап уннарча унсигез унсигезенче унсигезенчегә унсигезенчедә унсигезенчедән
|
||||||
|
унсигезенчеләр унсигезенчеләргә унсигезенчеләрдә унсигезенчеләрдән
|
||||||
|
унсигезенчеләрне унсигезенчеләрнең унсигезенчене унсигезенченең унтугыз
|
||||||
|
унтугызынчы унтугызынчыга унтугызынчыда унтугызынчыдан унтугызынчылар
|
||||||
|
унтугызынчыларга унтугызынчыларда унтугызынчылардан унтугызынчыларны
|
||||||
|
унтугызынчыларның унтугызынчыны унтугызынчының унынчы унынчыга унынчыда
|
||||||
|
унынчыдан унынчылар унынчыларга унынчыларда унынчылардан унынчыларны
|
||||||
|
унынчыларның унынчыны унынчының унҗиде унҗиденче унҗиденчегә унҗиденчедә
|
||||||
|
унҗиденчедән унҗиденчеләр унҗиденчеләргә унҗиденчеләрдә унҗиденчеләрдән
|
||||||
|
унҗиденчеләрне унҗиденчеләрнең унҗиденчене унҗиденченең унөч унөченче унөченчегә
|
||||||
|
унөченчедә унөченчедән унөченчеләр унөченчеләргә унөченчеләрдә унөченчеләрдән
|
||||||
|
унөченчеләрне унөченчеләрнең унөченчене унөченченең утыз утызынчы утызынчыга
|
||||||
|
утызынчыда утызынчыдан утызынчылар утызынчыларга утызынчыларда утызынчылардан
|
||||||
|
утызынчыларны утызынчыларның утызынчыны утызынчының
|
||||||
|
|
||||||
|
фикеренчә фәкать
|
||||||
|
|
||||||
|
хакында хәбәр хәлбуки хәтле хәтта
|
||||||
|
|
||||||
|
чаклы чакта чөнки
|
||||||
|
|
||||||
|
шикелле шул шулай шулар шуларга шуларда шулардан шуларны шуларның шулары шуларын
|
||||||
|
шуларынга шуларында шуларыннан шуларының шулкадәр шултикле шултиклем шулхәтле
|
||||||
|
шулчаклы шунда шундагы шундый шундыйга шундыйда шундыйдан шундыйны шундыйның
|
||||||
|
шунлыктан шуннан шунсы шунча шуны шуныкы шуныкын шуныкынга шуныкында шуныкыннан
|
||||||
|
шуныкының шунысы шунысын шунысынга шунысында шунысыннан шунысының шуның шушы
|
||||||
|
шушында шушыннан шушыны шушының шушыңа шуңа шуңар шуңарга
|
||||||
|
|
||||||
|
элек
|
||||||
|
|
||||||
|
югыйсә юк юкса
|
||||||
|
|
||||||
|
я ягъни язуынча яисә яки яктан якын ярашлы яхут яшь яшьлек
|
||||||
|
|
||||||
|
җиде җиделәп җиденче җиденчегә җиденчедә җиденчедән җиденчеләр җиденчеләргә
|
||||||
|
җиденчеләрдә җиденчеләрдән җиденчеләрне җиденчеләрнең җиденчене җиденченең
|
||||||
|
җидешәр җитмеш җитмешенче җитмешенчегә җитмешенчедә җитмешенчедән җитмешенчеләр
|
||||||
|
җитмешенчеләргә җитмешенчеләрдә җитмешенчеләрдән җитмешенчеләрне
|
||||||
|
җитмешенчеләрнең җитмешенчене җитмешенченең җыенысы
|
||||||
|
|
||||||
|
үз үзе үзем үземдә үземне үземнең үземнән үземә үзен үзендә үзенең үзеннән үзенә
|
||||||
|
үк
|
||||||
|
|
||||||
|
һичбер һичбере һичберен һичберендә һичберенең һичбереннән һичберенә һичберсе
|
||||||
|
һичберсен һичберсендә һичберсенең һичберсеннән һичберсенә һичберәү һичберәүгә
|
||||||
|
һичберәүдә һичберәүдән һичберәүне һичберәүнең һичкайсы һичкайсыга һичкайсыда
|
||||||
|
һичкайсыдан һичкайсыны һичкайсының һичкем һичкемгә һичкемдә һичкемне һичкемнең
|
||||||
|
һичкемнән һични һичнигә һичнидә һичнидән һичнинди һичнине һичнинең һичнәрсә
|
||||||
|
һичнәрсәгә һичнәрсәдә һичнәрсәдән һичнәрсәне һичнәрсәнең һәм һәммә һәммәсе
|
||||||
|
һәммәсен һәммәсендә һәммәсенең һәммәсеннән һәммәсенә һәр һәрбер һәрбере һәрберсе
|
||||||
|
һәркайсы һәркайсыга һәркайсыда һәркайсыдан һәркайсыны һәркайсының һәркем
|
||||||
|
һәркемгә һәркемдә һәркемне һәркемнең һәркемнән һәрни һәрнәрсә һәрнәрсәгә
|
||||||
|
һәрнәрсәдә һәрнәрсәдән һәрнәрсәне һәрнәрсәнең һәртөрле
|
||||||
|
|
||||||
|
ә әгәр әйтүенчә әйтүләренчә әлбәттә әле әлеге әллә әмма әнә
|
||||||
|
|
||||||
|
өстәп өч өчен өченче өченчегә өченчедә өченчедән өченчеләр өченчеләргә
|
||||||
|
өченчеләрдә өченчеләрдән өченчеләрне өченчеләрнең өченчене өченченең өчләп
|
||||||
|
өчәрләп""".split())
|
52
spacy/lang/tt/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,52 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import ORTH, LEMMA, NORM
|
||||||
|
|
||||||
|
_exc = {}
|
||||||
|
|
||||||
|
_abbrev_exc = [
|
||||||
|
# Weekday abbreviations
|
||||||
|
{ORTH: "дш", LEMMA: "дүшәмбе"},
|
||||||
|
{ORTH: "сш", LEMMA: "сишәмбе"},
|
||||||
|
{ORTH: "чш", LEMMA: "чәршәмбе"},
|
||||||
|
{ORTH: "пш", LEMMA: "пәнҗешәмбе"},
|
||||||
|
{ORTH: "җм", LEMMA: "җомга"},
|
||||||
|
{ORTH: "шб", LEMMA: "шимбә"},
|
||||||
|
{ORTH: "яш", LEMMA: "якшәмбе"},
|
||||||
|
|
||||||
|
# Month abbreviations
|
||||||
|
{ORTH: "гый", LEMMA: "гыйнвар"},
|
||||||
|
{ORTH: "фев", LEMMA: "февраль"},
|
||||||
|
{ORTH: "мар", LEMMA: "март"},
|
||||||
|
{ORTH: "мар", LEMMA: "март"},
|
||||||
|
{ORTH: "апр", LEMMA: "апрель"},
|
||||||
|
{ORTH: "июн", LEMMA: "июнь"},
|
||||||
|
{ORTH: "июл", LEMMA: "июль"},
|
||||||
|
{ORTH: "авг", LEMMA: "август"},
|
||||||
|
{ORTH: "сен", LEMMA: "сентябрь"},
|
||||||
|
{ORTH: "окт", LEMMA: "октябрь"},
|
||||||
|
{ORTH: "ноя", LEMMA: "ноябрь"},
|
||||||
|
{ORTH: "дек", LEMMA: "декабрь"},
|
||||||
|
|
||||||
|
# Number abbreviations
|
||||||
|
{ORTH: "млрд", LEMMA: "миллиард"},
|
||||||
|
{ORTH: "млн", LEMMA: "миллион"},
|
||||||
|
]
|
||||||
|
|
||||||
|
for abbr in _abbrev_exc:
|
||||||
|
for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
|
||||||
|
_exc[orth] = [{ORTH: orth, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}]
|
||||||
|
_exc[orth + "."] = [
|
||||||
|
{ORTH: orth + ".", LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}
|
||||||
|
]
|
||||||
|
|
||||||
|
for exc_data in [ # "etc." abbreviations
|
||||||
|
{ORTH: "һ.б.ш.", NORM: "һәм башка шундыйлар"},
|
||||||
|
{ORTH: "һ.б.", NORM: "һәм башка"},
|
||||||
|
{ORTH: "б.э.к.", NORM: "безнең эрага кадәр"},
|
||||||
|
{ORTH: "б.э.", NORM: "безнең эра"}]:
|
||||||
|
exc_data[LEMMA] = exc_data[NORM]
|
||||||
|
_exc[exc_data[ORTH]] = [exc_data]
|
||||||
|
|
||||||
|
TOKENIZER_EXCEPTIONS = _exc
|
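The loop over `_abbrev_exc` above turns each abbreviation into six exception keys: lower-case, capitalised and upper-case, each with and without a trailing period. A minimal sketch of that expansion, runnable without spaCy, is shown below. The string constants standing in for `ORTH`, `LEMMA` and `NORM`, and the helper name `expand_abbreviations`, are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
# Plain-string stand-ins for the spacy.symbols constants (assumption).
ORTH, LEMMA, NORM = "orth", "lemma", "norm"


def expand_abbreviations(abbrevs):
    """Build a tokenizer-exception dict the same way the diff's loop does."""
    exc = {}
    for abbr in abbrevs:
        for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
            # Form without the trailing period.
            exc[orth] = [{ORTH: orth, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}]
            # Form with the trailing period, kept as a single token.
            exc[orth + "."] = [
                {ORTH: orth + ".", LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}
            ]
    return exc


exceptions = expand_abbreviations([{ORTH: "млн", LEMMA: "миллион"}])
for key in sorted(exceptions):
    print(key, exceptions[key][0][NORM])
```

Because the result is keyed by the surface form, a repeated entry in the input list simply overwrites its earlier copy.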
30
spacy/lang/ur/__init__.py
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||||
|
from .stop_words import STOP_WORDS
|
||||||
|
from .lex_attrs import LEX_ATTRS
|
||||||
|
from ..tag_map import TAG_MAP
|
||||||
|
|
||||||
|
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||||
|
from ...language import Language
|
||||||
|
from ...attrs import LANG, NORM
|
||||||
|
from ...util import update_exc
|
||||||
|
|
||||||
|
|
||||||
|
class UrduDefaults(Language.Defaults):
|
||||||
|
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||||
|
lex_attr_getters.update(LEX_ATTRS)
|
||||||
|
lex_attr_getters[LANG] = lambda text: 'ur'
|
||||||
|
|
||||||
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
|
tag_map = TAG_MAP
|
||||||
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
|
|
||||||
|
class Urdu(Language):
|
||||||
|
lang = 'ur'
|
||||||
|
Defaults = UrduDefaults
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ['Urdu']
|
16
spacy/lang/ur/examples.py
Normal file
|
@ -0,0 +1,16 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Example sentences to test spaCy and its language models.
|
||||||
|
|
||||||
|
>>> from spacy.lang.ur.examples import sentences
|
||||||
|
>>> docs = nlp.pipe(sentences)
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
sentences = [
|
||||||
|
"اردو ہے جس کا نام ہم جانتے ہیں داغ",
|
||||||
|
"سارے جہاں میں دھوم ہماری زباں کی ہے",
|
||||||
|
]
|
29113
spacy/lang/ur/lemmatizer.py
Normal file
File diff suppressed because it is too large
Load Diff
47
spacy/lang/ur/lex_attrs.py
Normal file
|
@ -0,0 +1,47 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...attrs import LIKE_NUM
|
||||||
|
|
||||||
|
# Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/
|
||||||
|
# http://www.urduword.com/lessons.php?lesson=numbers
|
||||||
|
# https://en.wikibooks.org/wiki/Urdu/Vocabulary/Numbers
|
||||||
|
# https://www.urdu-english.com/lessons/beginner/numbers
|
||||||
|
|
||||||
|
_num_words = """ایک دو تین چار پانچ چھ سات آٹھ نو دس گیارہ بارہ تیرہ چودہ پندرہ سولہ سترہ
|
||||||
|
اٹهارا انیس بیس اکیس بائیس تئیس چوبیس پچیس چھببیس
|
||||||
|
ستایس اٹھائس انتيس تیس اکتیس بتیس تینتیس چونتیس پینتیس
|
||||||
|
چھتیس سینتیس ارتیس انتالیس چالیس اکتالیس بیالیس تیتالیس
|
||||||
|
چوالیس پیتالیس چھیالیس سینتالیس اڑتالیس انچالیس پچاس اکاون باون
|
||||||
|
تریپن چون پچپن چھپن ستاون اٹھاون انسٹھ ساثھ
|
||||||
|
اکسٹھ باسٹھ تریسٹھ چوسٹھ پیسٹھ چھیاسٹھ سڑسٹھ اڑسٹھ
|
||||||
|
انھتر ستر اکھتر بھتتر تیھتر چوھتر تچھتر چھیتر ستتر
|
||||||
|
اٹھتر انیاسی اسی اکیاسی بیاسی تیراسی چوراسی پچیاسی چھیاسی
|
||||||
|
سٹیاسی اٹھیاسی نواسی نوے اکانوے بانوے ترانوے
|
||||||
|
چورانوے پچانوے چھیانوے ستانوے اٹھانوے ننانوے سو
|
||||||
|
""".split()
|
||||||
|
|
||||||
|
# source https://www.google.com/intl/ur/inputtools/try/
|
||||||
|
|
||||||
|
_ordinal_words = """پہلا دوسرا تیسرا چوتھا پانچواں چھٹا ساتواں آٹھواں نواں دسواں گیارہواں بارہواں تیرھواں چودھواں
|
||||||
|
پندرھواں سولہواں سترھواں اٹھارواں انیسواں بسیواں
|
||||||
|
""".split()
|
||||||
|
|
||||||
|
|
||||||
|
def like_num(text):
|
||||||
|
text = text.replace(',', '').replace('.', '')
|
||||||
|
if text.isdigit():
|
||||||
|
return True
|
||||||
|
if text.count('/') == 1:
|
||||||
|
num, denom = text.split('/')
|
||||||
|
if num.isdigit() and denom.isdigit():
|
||||||
|
return True
|
||||||
|
if text in _num_words:
|
||||||
|
return True
|
||||||
|
if text in _ordinal_words:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
LEX_ATTRS = {
|
||||||
|
LIKE_NUM: like_num
|
||||||
|
}
|
515
spacy/lang/ur/stop_words.py
Normal file
|
@ -0,0 +1,515 @@
|
||||||
|
# encoding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
# Source: collected from various resources on the internet
|
||||||
|
|
||||||
|
STOP_WORDS = set("""
|
||||||
|
ثھی
|
||||||
|
خو
|
||||||
|
گی
|
||||||
|
اپٌے
|
||||||
|
گئے
|
||||||
|
ثہت
|
||||||
|
طرف
|
||||||
|
ہوبری
|
||||||
|
پبئے
|
||||||
|
اپٌب
|
||||||
|
دوضری
|
||||||
|
گیب
|
||||||
|
کت
|
||||||
|
گب
|
||||||
|
ثھی
|
||||||
|
ضے
|
||||||
|
ہر
|
||||||
|
پر
|
||||||
|
اش
|
||||||
|
دی
|
||||||
|
گے
|
||||||
|
لگیں
|
||||||
|
ہے
|
||||||
|
ثعذ
|
||||||
|
ضکتے
|
||||||
|
تھی
|
||||||
|
اى
|
||||||
|
دیب
|
||||||
|
لئے
|
||||||
|
والے
|
||||||
|
یہ
|
||||||
|
ثدبئے
|
||||||
|
ضکتی
|
||||||
|
تھب
|
||||||
|
اًذر
|
||||||
|
رریعے
|
||||||
|
لگی
|
||||||
|
ہوبرا
|
||||||
|
ہوًے
|
||||||
|
ثبہر
|
||||||
|
ضکتب
|
||||||
|
ًہیں
|
||||||
|
تو
|
||||||
|
اور
|
||||||
|
رہب
|
||||||
|
لگے
|
||||||
|
ہوضکتب
|
||||||
|
ہوں
|
||||||
|
کب
|
||||||
|
ہوبرے
|
||||||
|
توبم
|
||||||
|
کیب
|
||||||
|
ایطے
|
||||||
|
رہی
|
||||||
|
هگر
|
||||||
|
ہوضکتی
|
||||||
|
ہیں
|
||||||
|
کریں
|
||||||
|
ہو
|
||||||
|
تک
|
||||||
|
کی
|
||||||
|
ایک
|
||||||
|
رہے
|
||||||
|
هیں
|
||||||
|
ہوضکتے
|
||||||
|
کیطے
|
||||||
|
ہوًب
|
||||||
|
تت
|
||||||
|
کہ
|
||||||
|
ہوا
|
||||||
|
آئے
|
||||||
|
ضبت
|
||||||
|
تھے
|
||||||
|
کیوں
|
||||||
|
ہو
|
||||||
|
تب
|
||||||
|
کے
|
||||||
|
پھر
|
||||||
|
ثغیر
|
||||||
|
خبر
|
||||||
|
ہے
|
||||||
|
رکھ
|
||||||
|
کی
|
||||||
|
طب
|
||||||
|
کوئی
|
||||||
|
رریعے
|
||||||
|
ثبرے
|
||||||
|
خب
|
||||||
|
اضطرذ
|
||||||
|
ثلکہ
|
||||||
|
خجکہ
|
||||||
|
رکھ
|
||||||
|
تب
|
||||||
|
کی
|
||||||
|
طرف
|
||||||
|
ثراں
|
||||||
|
خبر
|
||||||
|
رریعہ
|
||||||
|
اضکب
|
||||||
|
ثٌذ
|
||||||
|
خص
|
||||||
|
کی
|
||||||
|
لئے
|
||||||
|
توہیں
|
||||||
|
دوضرے
|
||||||
|
کررہی
|
||||||
|
اضکی
|
||||||
|
ثیچ
|
||||||
|
خوکہ
|
||||||
|
رکھتی
|
||||||
|
کیوًکہ
|
||||||
|
دوًوں
|
||||||
|
کر
|
||||||
|
رہے
|
||||||
|
خبر
|
||||||
|
ہی
|
||||||
|
ثرآں
|
||||||
|
اضکے
|
||||||
|
پچھلا
|
||||||
|
خیطب
|
||||||
|
رکھتے
|
||||||
|
کے
|
||||||
|
ثعذ
|
||||||
|
تو
|
||||||
|
ہی
|
||||||
|
دورى
|
||||||
|
کر
|
||||||
|
یہبں
|
||||||
|
آش
|
||||||
|
تھوڑا
|
||||||
|
چکے
|
||||||
|
زکویہ
|
||||||
|
دوضروں
|
||||||
|
ضکب
|
||||||
|
اوًچب
|
||||||
|
ثٌب
|
||||||
|
پل
|
||||||
|
تھوڑی
|
||||||
|
چلا
|
||||||
|
خبهوظ
|
||||||
|
دیتب
|
||||||
|
ضکٌب
|
||||||
|
اخبزت
|
||||||
|
اوًچبئی
|
||||||
|
ثٌبرہب
|
||||||
|
پوچھب
|
||||||
|
تھوڑے
|
||||||
|
چلو
|
||||||
|
ختن
|
||||||
|
دیتی
|
||||||
|
ضکی
|
||||||
|
اچھب
|
||||||
|
اوًچی
|
||||||
|
ثٌبرہی
|
||||||
|
پوچھتب
|
||||||
|
تیي
|
||||||
|
چلیں
|
||||||
|
در
|
||||||
|
دیتے
|
||||||
|
ضکے
|
||||||
|
اچھی
|
||||||
|
اوًچے
|
||||||
|
ثٌبرہے
|
||||||
|
پوچھتی
|
||||||
|
خبًب
|
||||||
|
چلے
|
||||||
|
درخبت
|
||||||
|
دیر
|
||||||
|
ضلطلہ
|
||||||
|
اچھے
|
||||||
|
اٹھبًب
|
||||||
|
ثٌبًب
|
||||||
|
پوچھتے
|
||||||
|
خبًتب
|
||||||
|
چھوٹب
|
||||||
|
درخہ
|
||||||
|
دیکھٌب
|
||||||
|
ضوچ
|
||||||
|
اختتبم
|
||||||
|
اہن
|
||||||
|
ثٌذ
|
||||||
|
پوچھٌب
|
||||||
|
خبًتی
|
||||||
|
چھوٹوں
|
||||||
|
درخے
|
||||||
|
دیکھو
|
||||||
|
ضوچب
|
||||||
|
ادھر
|
||||||
|
آئی
|
||||||
|
ثٌذکرًب
|
||||||
|
پوچھو
|
||||||
|
خبًتے
|
||||||
|
چھوٹی
|
||||||
|
درزقیقت
|
||||||
|
دیکھی
|
||||||
|
ضوچتب
|
||||||
|
ارد
|
||||||
|
آئے
|
||||||
|
ثٌذکرو
|
||||||
|
پوچھوں
|
||||||
|
خبًٌب
|
||||||
|
چھوٹے
|
||||||
|
درضت
|
||||||
|
دیکھیں
|
||||||
|
ضوچتی
|
||||||
|
اردگرد
|
||||||
|
آج
|
||||||
|
ثٌذی
|
||||||
|
پوچھیں
|
||||||
|
خططرذ
|
||||||
|
چھہ
|
||||||
|
دش
|
||||||
|
دیٌب
|
||||||
|
ضوچتے
|
||||||
|
ارکبى
|
||||||
|
آخر
|
||||||
|
ثڑا
|
||||||
|
پورا
|
||||||
|
خگہ
|
||||||
|
چیسیں
|
||||||
|
دفعہ
|
||||||
|
دے
|
||||||
|
ضوچٌب
|
||||||
|
اضتعوبل
|
||||||
|
آخر
|
||||||
|
پہلا
|
||||||
|
خگہوں
|
||||||
|
زبصل
|
||||||
|
دکھبئیں
|
||||||
|
راضتوں
|
||||||
|
ضوچو
|
||||||
|
اضتعوبلات
|
||||||
|
آدهی
|
||||||
|
ثڑی
|
||||||
|
پہلی
|
||||||
|
خگہیں
|
||||||
|
زبضر
|
||||||
|
دکھبتب
|
||||||
|
راضتہ
|
||||||
|
ضوچی
|
||||||
|
اغیب
|
||||||
|
آًب
|
||||||
|
ثڑے
|
||||||
|
پہلےضی
|
||||||
|
خلذی
|
||||||
|
زبل
|
||||||
|
دکھبتی
|
||||||
|
راضتے
|
||||||
|
ضوچیں
|
||||||
|
اطراف
|
||||||
|
آٹھ
|
||||||
|
ثھر
|
||||||
|
خٌبة
|
||||||
|
زبل
|
||||||
|
دکھبتے
|
||||||
|
رکي
|
||||||
|
ضیذھب
|
||||||
|
افراد
|
||||||
|
آیب
|
||||||
|
ثھرا
|
||||||
|
پہلے
|
||||||
|
خواى
|
||||||
|
زبلات
|
||||||
|
دکھبًب
|
||||||
|
رکھب
|
||||||
|
ضیذھی
|
||||||
|
اکثر
|
||||||
|
ثب
|
||||||
|
ہوا
|
||||||
|
پیع
|
||||||
|
خوًہی
|
||||||
|
زبلیہ
|
||||||
|
دکھبو
|
||||||
|
رکھی
|
||||||
|
ضیذھے
|
||||||
|
اکٹھب
|
||||||
|
ثھرپور
|
||||||
|
تبزٍ
|
||||||
|
خیطبکہ
|
||||||
|
زصوں
|
||||||
|
رکھے
|
||||||
|
ضیکٌڈ
|
||||||
|
اکٹھی
|
||||||
|
ثبری
|
||||||
|
ثہتر
|
||||||
|
تر
|
||||||
|
چبر
|
||||||
|
زصہ
|
||||||
|
دلچطپ
|
||||||
|
زیبدٍ
|
||||||
|
غبیذ
|
||||||
|
اکٹھے
|
||||||
|
ثبلا
|
||||||
|
ثہتری
|
||||||
|
ترتیت
|
||||||
|
چبہب
|
||||||
|
زصے
|
||||||
|
دلچطپی
|
||||||
|
ضبت
|
||||||
|
غخص
|
||||||
|
اکیلا
|
||||||
|
ثبلترتیت
|
||||||
|
ثہتریي
|
||||||
|
تریي
|
||||||
|
چبہٌب
|
||||||
|
زقبئق
|
||||||
|
دلچطپیبں
|
||||||
|
ضبدٍ
|
||||||
|
غذ
|
||||||
|
اکیلی
|
||||||
|
ثرش
|
||||||
|
پبش
|
||||||
|
تعذاد
|
||||||
|
چبہے
|
||||||
|
زقیتیں
|
||||||
|
هٌبضت
|
||||||
|
ضبرا
|
||||||
|
غروع
|
||||||
|
اکیلے
|
||||||
|
ثغیر
|
||||||
|
پبًب
|
||||||
|
چکب
|
||||||
|
زقیقت
|
||||||
|
دو
|
||||||
|
ضبرے
|
||||||
|
غروعبت
|
||||||
|
اگرچہ
|
||||||
|
ثلٌذ
|
||||||
|
پبًچ
|
||||||
|
تن
|
||||||
|
چکی
|
||||||
|
زکن
|
||||||
|
دور
|
||||||
|
ضبل
|
||||||
|
غے
|
||||||
|
الگ
|
||||||
|
پراًب
|
||||||
|
تٌہب
|
||||||
|
چکیں
|
||||||
|
دوضرا
|
||||||
|
ضبلوں
|
||||||
|
صبف
|
||||||
|
صسیر
|
||||||
|
قجیلہ
|
||||||
|
کوًطے
|
||||||
|
لازهی
|
||||||
|
هطئلے
|
||||||
|
ًیب
|
||||||
|
طریق
|
||||||
|
کرتی
|
||||||
|
کہتے
|
||||||
|
صفر
|
||||||
|
قطن
|
||||||
|
کھولا
|
||||||
|
لگتب
|
||||||
|
هطبئل
|
||||||
|
وار
|
||||||
|
طریقوں
|
||||||
|
کرتے
|
||||||
|
کہٌب
|
||||||
|
صورت
|
||||||
|
کئی
|
||||||
|
کھولٌب
|
||||||
|
لگتی
|
||||||
|
هطتعول
|
||||||
|
وار
|
||||||
|
طریقہ
|
||||||
|
کرتے
|
||||||
|
ہو
|
||||||
|
کہٌب
|
||||||
|
صورتسبل
|
||||||
|
کئے
|
||||||
|
کھولو
|
||||||
|
لگتے
|
||||||
|
هػتول
|
||||||
|
ٹھیک
|
||||||
|
طریقے
|
||||||
|
کرًب
|
||||||
|
کہو
|
||||||
|
صورتوں
|
||||||
|
کبفی
|
||||||
|
هطلق
|
||||||
|
ڈھوًڈا
|
||||||
|
طور
|
||||||
|
کرو
|
||||||
|
کہوں
|
||||||
|
صورتیں
|
||||||
|
کبم
|
||||||
|
کھولیں
|
||||||
|
لگی
|
||||||
|
هعلوم
|
||||||
|
ڈھوًڈلیب
|
||||||
|
طورپر
|
||||||
|
کریں
|
||||||
|
کہی
|
||||||
|
ضرور
|
||||||
|
کجھی
|
||||||
|
کھولے
|
||||||
|
لگے
|
||||||
|
هکول
|
||||||
|
ڈھوًڈًب
|
||||||
|
ظبہر
|
||||||
|
کرے
|
||||||
|
کہیں
|
||||||
|
ضرورت
|
||||||
|
کرا
|
||||||
|
کہب
|
||||||
|
لوجب
|
||||||
|
هلا
|
||||||
|
ڈھوًڈو
|
||||||
|
عذد
|
||||||
|
کل
|
||||||
|
کہیں
|
||||||
|
کرتب
|
||||||
|
کہتب
|
||||||
|
لوجی
|
||||||
|
هوکي
|
||||||
|
ڈھوًڈی
|
||||||
|
عظین
|
||||||
|
کن
|
||||||
|
کہے
|
||||||
|
ضروری
|
||||||
|
کرتبہوں
|
||||||
|
کہتی
|
||||||
|
لوجے
|
||||||
|
هوکٌبت
|
||||||
|
ڈھوًڈیں
|
||||||
|
علاقوں
|
||||||
|
کوتر
|
||||||
|
کیے
|
||||||
|
لوسبت
|
||||||
|
هوکٌہ
|
||||||
|
ہن
|
||||||
|
لے
|
||||||
|
ًبپطٌذ
|
||||||
|
ہورہے
|
||||||
|
علاقہ
|
||||||
|
کورا
|
||||||
|
کے
|
||||||
|
رریعے
|
||||||
|
لوسہ
|
||||||
|
هڑا
|
||||||
|
ہوئی
|
||||||
|
هتعلق
|
||||||
|
ًبگسیر
|
||||||
|
ہوگئی
|
||||||
|
علاقے
|
||||||
|
کوروں
|
||||||
|
گئی
|
||||||
|
لو
|
||||||
|
هڑًب
|
||||||
|
ہوئے
|
||||||
|
هسترم
|
||||||
|
ًطجت
|
||||||
|
ہو
|
||||||
|
گئے
|
||||||
|
علاوٍ
|
||||||
|
کورٍ
|
||||||
|
گرد
|
||||||
|
لوگ
|
||||||
|
هڑے
|
||||||
|
ہوتی
|
||||||
|
هسترهہ
|
||||||
|
ًقطہ
|
||||||
|
ہوگیب
|
||||||
|
کورے
|
||||||
|
گروپ
|
||||||
|
لوگوں
|
||||||
|
هہرثبى
|
||||||
|
ہوتے
|
||||||
|
هسطوش
|
||||||
|
ًکبلٌب
|
||||||
|
ہوًی
|
||||||
|
عووهی
|
||||||
|
کوطي
|
||||||
|
گروٍ
|
||||||
|
لڑکپي
|
||||||
|
هیرا
|
||||||
|
ہوچکب
|
||||||
|
هختلف
|
||||||
|
ًکتہ
|
||||||
|
ہی
|
||||||
|
فرد
|
||||||
|
کوى
|
||||||
|
گروہوں
|
||||||
|
لی
|
||||||
|
هیری
|
||||||
|
ہوچکی
|
||||||
|
هسیذ
|
||||||
|
فی
|
||||||
|
کوًطب
|
||||||
|
گٌتی
|
||||||
|
لیب
|
||||||
|
هیرے
|
||||||
|
ہوچکے
|
||||||
|
هطئلہ
|
||||||
|
ًوخواى
|
||||||
|
یقیٌی
|
||||||
|
قجل
|
||||||
|
کوًطی
|
||||||
|
لیٌب
|
||||||
|
ًئی
|
||||||
|
ہورہب
|
||||||
|
لیں
|
||||||
|
ًئے
|
||||||
|
ہورہی
|
||||||
|
ثبعث
|
||||||
|
ضت
|
||||||
|
""".split())
|
65
spacy/lang/ur/tag_map.py
Normal file
|
@ -0,0 +1,65 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||||
|
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||||
|
|
||||||
|
TAG_MAP = {
|
||||||
|
".": {POS: PUNCT, "PunctType": "peri"},
|
||||||
|
",": {POS: PUNCT, "PunctType": "comm"},
|
||||||
|
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||||
|
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||||
|
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||||
|
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||||
|
":": {POS: PUNCT},
|
||||||
|
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||||
|
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||||
|
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||||
|
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||||
|
"CD": {POS: NUM, "NumType": "card"},
|
||||||
|
"DT": {POS: DET},
|
||||||
|
"EX": {POS: ADV, "AdvType": "ex"},
|
||||||
|
"FW": {POS: X, "Foreign": "yes"},
|
||||||
|
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||||
|
"IN": {POS: ADP},
|
||||||
|
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||||
|
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||||
|
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||||
|
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||||
|
"MD": {POS: VERB, "VerbType": "mod"},
|
||||||
|
"NIL": {POS: ""},
|
||||||
|
"NN": {POS: NOUN, "Number": "sing"},
|
||||||
|
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||||
|
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||||
|
"NNS": {POS: NOUN, "Number": "plur"},
|
||||||
|
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||||
|
"POS": {POS: PART, "Poss": "yes"},
|
||||||
|
"PRP": {POS: PRON, "PronType": "prs"},
|
||||||
|
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||||
|
"RB": {POS: ADV, "Degree": "pos"},
|
||||||
|
"RBR": {POS: ADV, "Degree": "comp"},
|
||||||
|
"RBS": {POS: ADV, "Degree": "sup"},
|
||||||
|
"RP": {POS: PART},
|
||||||
|
"SP": {POS: SPACE},
|
||||||
|
"SYM": {POS: SYM},
|
||||||
|
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||||
|
"UH": {POS: INTJ},
|
||||||
|
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||||
|
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||||
|
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||||
|
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||||
|
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||||
|
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
|
||||||
|
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||||
|
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||||
|
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||||
|
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||||
|
"ADD": {POS: X},
|
||||||
|
"NFP": {POS: PUNCT},
|
||||||
|
"GW": {POS: X},
|
||||||
|
"XX": {POS: X},
|
||||||
|
"BES": {POS: VERB},
|
||||||
|
"HVS": {POS: VERB},
|
||||||
|
"_SP": {POS: SPACE},
|
||||||
|
}
|
22
spacy/lang/ur/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,22 @
# coding: utf8
from __future__ import unicode_literals

# import symbols – if you need to use more, add them here
from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET

# Add tokenizer exceptions
# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
# Feel free to use custom logic to generate repetitive exceptions more efficiently.
# If an exception is split into more than one token, the ORTH values combined always
# need to match the original string.

# Exceptions should be added in the following format:

_exc = {

}

# To keep things clean and readable, it's recommended to only declare the
# TOKENIZER_EXCEPTIONS at the bottom:

TOKENIZER_EXCEPTIONS = _exc
|
|
@ -15,7 +15,8 @@ from .. import util
|
||||||
# here if it's using spaCy's tokenizer (not a different library)
|
# here if it's using spaCy's tokenizer (not a different library)
|
||||||
# TODO: re-implement generic tokenizer tests
|
# TODO: re-implement generic tokenizer tests
|
||||||
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
_languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||||
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
|
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ur', 'tt',
|
||||||
|
'xx']
|
||||||
|
|
||||||
_models = {'en': ['en_core_web_sm'],
|
_models = {'en': ['en_core_web_sm'],
|
||||||
'de': ['de_core_news_sm'],
|
'de': ['de_core_news_sm'],
|
||||||
|
@ -153,10 +154,18 @@ def th_tokenizer():
|
||||||
def tr_tokenizer():
|
def tr_tokenizer():
|
||||||
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def tt_tokenizer():
|
||||||
|
return util.get_lang_class('tt').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def ar_tokenizer():
|
def ar_tokenizer():
|
||||||
return util.get_lang_class('ar').Defaults.create_tokenizer()
|
return util.get_lang_class('ar').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def ur_tokenizer():
|
||||||
|
return util.get_lang_class('ur').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def ru_tokenizer():
|
def ru_tokenizer():
|
||||||
pymorphy = pytest.importorskip('pymorphy2')
|
pymorphy = pytest.importorskip('pymorphy2')
|
||||||
|
|
|
@ -4,30 +4,25 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('fr')
|
def test_lemmatizer_verb(fr_tokenizer):
|
||||||
def test_lemmatizer_verb(FR):
|
tokens = fr_tokenizer("Qu'est-ce que tu fais?")
|
||||||
tokens = FR("Qu'est-ce que tu fais?")
|
|
||||||
assert tokens[0].lemma_ == "que"
|
assert tokens[0].lemma_ == "que"
|
||||||
assert tokens[1].lemma_ == "être"
|
assert tokens[1].lemma_ == "être"
|
||||||
assert tokens[5].lemma_ == "faire"
|
assert tokens[5].lemma_ == "faire"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('fr')
|
def test_lemmatizer_noun_verb_2(fr_tokenizer):
|
||||||
@pytest.mark.xfail(reason="sont tagged as AUX")
|
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||||
def test_lemmatizer_noun_verb_2(FR):
|
|
||||||
tokens = FR("Les abaissements de température sont gênants.")
|
|
||||||
assert tokens[4].lemma_ == "être"
|
assert tokens[4].lemma_ == "être"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('fr')
|
|
||||||
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
|
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
|
||||||
def test_lemmatizer_noun(FR):
|
def test_lemmatizer_noun(fr_tokenizer):
|
||||||
tokens = FR("il y a des Costaricienne.")
|
tokens = fr_tokenizer("il y a des Costaricienne.")
|
||||||
assert tokens[4].lemma_ == "Costaricain"
|
assert tokens[4].lemma_ == "Costaricain"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('fr')
|
def test_lemmatizer_noun_2(fr_tokenizer):
|
||||||
def test_lemmatizer_noun_2(FR):
|
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||||
tokens = FR("Les abaissements de température sont gênants.")
|
|
||||||
assert tokens[1].lemma_ == "abaissement"
|
assert tokens[1].lemma_ == "abaissement"
|
||||||
assert tokens[5].lemma_ == "gênant"
|
assert tokens[5].lemma_ == "gênant"
|
||||||
|
|
0
spacy/tests/lang/tt/__init__.py
Normal file
75
spacy/tests/lang/tt/test_tokenizer.py
Normal file
|
@ -0,0 +1,75 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
INFIX_HYPHEN_TESTS = [
|
||||||
|
("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
|
||||||
|
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
|
||||||
|
]
|
||||||
|
|
||||||
|
PUNC_INSIDE_WORDS_TESTS = [
|
||||||
|
("Пассаҗир саны - 2,13 млн — кеше/көндә (2010), 783,9 млн. кеше/елда.",
|
||||||
|
"Пассаҗир саны - 2,13 млн — кеше / көндә ( 2010 ) ,"
|
||||||
|
" 783,9 млн. кеше / елда .".split()),
|
||||||
|
("Ту\"кай", "Ту \" кай".split())
|
||||||
|
]
|
||||||
|
|
||||||
|
MIXED_ORDINAL_NUMS_TESTS = [
|
||||||
|
("Иртәгә 22нче гыйнвар...", "Иртәгә 22нче гыйнвар ...".split())
|
||||||
|
]
|
||||||
|
|
||||||
|
ABBREV_TESTS = [
|
||||||
|
("«3 елда (б.э.к.) туган", "« 3 елда ( б.э.к. ) туган".split()),
|
||||||
|
("тукымадан һ.б.ш. тегелгән.", "тукымадан һ.б.ш. тегелгән .".split())
|
||||||
|
]
|
||||||
|
|
||||||
|
NAME_ABBREV_TESTS = [
|
||||||
|
("Ә.Тукай", "Ә.Тукай".split()),
|
||||||
|
("Ә.тукай", "Ә.тукай".split()),
|
||||||
|
("ә.Тукай", "ә . Тукай".split()),
|
||||||
|
("Миләүшә.", "Миләүшә .".split())
|
||||||
|
]
|
||||||
|
|
||||||
|
TYPOS_IN_PUNC_TESTS = [
|
||||||
|
("«3 елда , туган", "« 3 елда , туган".split()),
|
||||||
|
("«3 елда,туган", "« 3 елда , туган".split()),
|
||||||
|
("«3 елда,туган.", "« 3 елда , туган .".split()),
|
||||||
|
("Ул эшли(кайчан?)", "Ул эшли ( кайчан ? )".split()),
|
||||||
|
("Ул (кайчан?)эшли", "Ул ( кайчан ?) эшли".split()) # "?)" => "?)" or "? )"
|
||||||
|
]
|
||||||
|
|
||||||
|
LONG_TEXTS_TESTS = [
|
||||||
|
("Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
|
||||||
|
"якларда яшәгәннәр, шуңа күрә аларга кием кирәк булмаган.Йөз"
|
||||||
|
"меңнәрчә еллар үткән, борынгы кешеләр акрынлап Европа һәм Азиянең"
|
||||||
|
"салкын илләрендә дә яши башлаганнар. Алар кырыс һәм салкын"
|
||||||
|
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк.",
|
||||||
|
"Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы"
|
||||||
|
"якларда яшәгәннәр , шуңа күрә аларга кием кирәк булмаган . Йөз"
|
||||||
|
"меңнәрчә еллар үткән , борынгы кешеләр акрынлап Европа һәм Азиянең"
|
||||||
|
"салкын илләрендә дә яши башлаганнар . Алар кырыс һәм салкын"
|
||||||
|
"кышлардан саклану өчен кием-салым уйлап тапканнар - итәк .".split()
|
||||||
|
)
|
||||||
|
]
|
||||||
|
|
||||||
|
TESTCASES = (INFIX_HYPHEN_TESTS + PUNC_INSIDE_WORDS_TESTS +
|
||||||
|
MIXED_ORDINAL_NUMS_TESTS + ABBREV_TESTS + NAME_ABBREV_TESTS +
|
||||||
|
LONG_TEXTS_TESTS + TYPOS_IN_PUNC_TESTS)
|
||||||
|
|
||||||
|
NORM_TESTCASES = [
|
||||||
|
("тукымадан һ.б.ш. тегелгән.",
|
||||||
|
["тукымадан", "һәм башка шундыйлар", "тегелгән", "."])
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
|
||||||
|
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
|
||||||
|
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
|
||||||
|
assert expected_tokens == tokens
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
|
||||||
|
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
|
||||||
|
tokens = tt_tokenizer(text)
|
||||||
|
assert [token.norm_ for token in tokens] == norms
|
0
spacy/tests/lang/ur/__init__.py
Normal file
26
spacy/tests/lang/ur/test_text.py
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
# coding: utf-8
|
||||||
|
|
||||||
|
"""Test that longer and mixed texts are tokenized correctly."""
|
||||||
|
|
||||||
|
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def test_tokenizer_handles_long_text(ur_tokenizer):
|
||||||
|
text = """اصل میں رسوا ہونے کی ہمیں
|
||||||
|
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
|
||||||
|
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
|
||||||
|
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
|
||||||
|
|
||||||
|
tokens = ur_tokenizer(text)
|
||||||
|
assert len(tokens) == 77
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,length', [
|
||||||
|
("تحریر باسط حبیب", 3),
|
||||||
|
("میرا پاکستان", 2)])
|
||||||
|
def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
|
||||||
|
tokens = ur_tokenizer(text)
|
||||||
|
assert len(tokens) == length
|
|
@ -3,7 +3,7 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import ensure_path
|
from ..util import ensure_path
|
||||||
from .. import util
|
from .. import util
|
||||||
from ..displacy import parse_deps, parse_ents
|
from .. import displacy
|
||||||
from ..tokens import Span
|
from ..tokens import Span
|
||||||
from .util import get_doc
|
from .util import get_doc
|
||||||
from .._ml import PrecomputableAffine
|
from .._ml import PrecomputableAffine
|
||||||
|
@ -34,18 +34,16 @@ def test_util_get_package_path(package):
|
||||||
assert isinstance(path, Path)
|
assert isinstance(path, Path)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
def test_displacy_parse_ents(en_vocab):
|
def test_displacy_parse_ents(en_vocab):
|
||||||
"""Test that named entities on a Doc are converted into displaCy's format."""
|
"""Test that named entities on a Doc are converted into displaCy's format."""
|
||||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
|
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
|
||||||
ents = parse_ents(doc)
|
ents = displacy.parse_ents(doc)
|
||||||
assert isinstance(ents, dict)
|
assert isinstance(ents, dict)
|
||||||
assert ents['text'] == 'But Google is starting from behind '
|
assert ents['text'] == 'But Google is starting from behind '
|
||||||
assert ents['ents'] == [{'start': 4, 'end': 10, 'label': 'ORG'}]
|
assert ents['ents'] == [{'start': 4, 'end': 10, 'label': 'ORG'}]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
def test_displacy_parse_deps(en_vocab):
|
def test_displacy_parse_deps(en_vocab):
|
||||||
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
||||||
words = ["This", "is", "a", "sentence"]
|
words = ["This", "is", "a", "sentence"]
|
||||||
|
@ -55,7 +53,7 @@ def test_displacy_parse_deps(en_vocab):
|
||||||
deps = ['nsubj', 'ROOT', 'det', 'attr']
|
deps = ['nsubj', 'ROOT', 'det', 'attr']
|
||||||
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags,
|
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags,
|
||||||
deps=deps)
|
deps=deps)
|
||||||
deps = parse_deps(doc)
|
deps = displacy.parse_deps(doc)
|
||||||
assert isinstance(deps, dict)
|
assert isinstance(deps, dict)
|
||||||
assert deps['words'] == [{'text': 'This', 'tag': 'DET'},
|
assert deps['words'] == [{'text': 'This', 'tag': 'DET'},
|
||||||
{'text': 'is', 'tag': 'VERB'},
|
{'text': 'is', 'tag': 'VERB'},
|
||||||
|
@ -66,7 +64,19 @@ def test_displacy_parse_deps(en_vocab):
|
||||||
{'start': 1, 'end': 3, 'label': 'attr', 'dir': 'right'}]
|
{'start': 1, 'end': 3, 'label': 'attr', 'dir': 'right'}]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
def test_displacy_spans(en_vocab):
|
||||||
|
"""Test that displaCy can render Spans."""
|
||||||
|
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||||
|
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
|
||||||
|
html = displacy.render(doc[1:4], style='ent')
|
||||||
|
assert html.startswith('<div')
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_raises_for_wrong_type(en_vocab):
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
html = displacy.render('hello world')
|
||||||
|
|
||||||
|
|
||||||
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
||||||
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
|
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
|
||||||
assert model.W.shape == (nF, nO, nP, nI)
|
assert model.W.shape == (nF, nO, nP, nI)
|
||||||
|
|
|
@ -394,7 +394,7 @@ cdef class Tokenizer:
|
||||||
data = OrderedDict()
|
data = OrderedDict()
|
||||||
deserializers = OrderedDict((
|
deserializers = OrderedDict((
|
||||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||||
('prefix_search', lambda b: data.setdefault('prefix', b)),
|
('prefix_search', lambda b: data.setdefault('prefix_search', b)),
|
||||||
('suffix_search', lambda b: data.setdefault('suffix_search', b)),
|
('suffix_search', lambda b: data.setdefault('suffix_search', b)),
|
||||||
('infix_finditer', lambda b: data.setdefault('infix_finditer', b)),
|
('infix_finditer', lambda b: data.setdefault('infix_finditer', b)),
|
||||||
('token_match', lambda b: data.setdefault('token_match', b)),
|
('token_match', lambda b: data.setdefault('token_match', b)),
|
||||||
|
|
|
@ -84,8 +84,8 @@
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
|
||||||
"V_CSS": "2.1.3",
|
"V_CSS": "2.2.1",
|
||||||
"V_JS": "2.1.2",
|
"V_JS": "2.2.2",
|
||||||
"DEFAULT_SYNTAX": "python",
|
"DEFAULT_SYNTAX": "python",
|
||||||
"ANALYTICS": "UA-58931649-1",
|
"ANALYTICS": "UA-58931649-1",
|
||||||
"MAILCHIMP": {
|
"MAILCHIMP": {
|
||||||
|
|
|
@ -124,6 +124,12 @@ mixin help(tooltip, icon_size)
|
||||||
+icon("help_o", icon_size || 16).o-icon--inline
|
+icon("help_o", icon_size || 16).o-icon--inline
|
||||||
|
|
||||||
|
|
||||||
|
//- Abbreviation
|
||||||
|
|
||||||
|
mixin abbr(title)
|
||||||
|
abbr.o-abbr(data-tooltip=title data-tooltip-style="code" aria-label=title)&attributes(attributes)
|
||||||
|
block
|
||||||
|
|
||||||
//- Aside wrapper
|
//- Aside wrapper
|
||||||
label - [string] aside label
|
label - [string] aside label
|
||||||
|
|
||||||
|
|
|
@ -9,7 +9,7 @@ menu.c-sidebar.js-sidebar.u-text
|
||||||
each url, item in items
|
each url, item in items
|
||||||
- var is_current = CURRENT == url || (CURRENT == "index" && url == "./")
|
- var is_current = CURRENT == url || (CURRENT == "index" && url == "./")
|
||||||
li.c-sidebar__item
|
li.c-sidebar__item
|
||||||
+a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null)=item
|
+a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null data-sidebar-active=is_current ? "" : null)=item
|
||||||
|
|
||||||
if is_current
|
if is_current
|
||||||
if IS_MODELS && CURRENT_MODELS.length
|
if IS_MODELS && CURRENT_MODELS.length
|
||||||
|
|
|
@ -1,115 +0,0 @@
|
||||||
//- 💫 DOCS > API > ARCHITECTURE > CYTHON
|
|
||||||
|
|
||||||
+aside("What's Cython?")
|
|
||||||
| #[+a("http://cython.org/") Cython] is a language for writing
|
|
||||||
| C extensions for Python. Most Python code is also valid Cython, but
|
|
||||||
| you can add type declarations to get efficient memory-managed code
|
|
||||||
| just like C or C++.
|
|
||||||
|
|
||||||
p
|
|
||||||
| spaCy's core data structures are implemented as
|
|
||||||
| #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
|
|
||||||
| managed through the #[+a(gh("cymem")) #[code cymem]]
|
|
||||||
| #[code cymem.Pool] class, which allows you
|
|
||||||
| to allocate memory which will be freed when the #[code Pool] object
|
|
||||||
| is garbage collected. This means you usually don't have to worry
|
|
||||||
| about freeing memory. You just have to decide which Python object
|
|
||||||
| owns the memory, and make it own the #[code Pool]. When that object
|
|
||||||
| goes out of scope, the memory will be freed. You do have to take
|
|
||||||
| care that no pointers outlive the object that owns them — but this
|
|
||||||
| is generally quite easy.
|
|
||||||
|
|
||||||
p
|
|
||||||
| All Cython modules should have the #[code # cython: infer_types=True]
|
|
||||||
| compiler directive at the top of the file. This makes the code much
|
|
||||||
| cleaner, as it avoids the need for many type declarations. If
|
|
||||||
| possible, you should prefer to declare your functions #[code nogil],
|
|
||||||
| even if you don't especially care about multi-threading. The reason
|
|
||||||
| is that #[code nogil] functions help the Cython compiler reason about
|
|
||||||
| your code quite a lot — you're telling the compiler that no Python
|
|
||||||
| dynamics are possible. This lets the compiler raise many more errors, and ensures
|
|
||||||
| your function will run at C speed.
|
|
||||||
|
|
||||||
|
|
||||||
p
|
|
||||||
| Cython gives you many choices of sequences: you could have a Python
|
|
||||||
| list, a numpy array, a memory view, a C++ vector, or a pointer.
|
|
||||||
| Pointers are preferred, because they are fastest, have the most
|
|
||||||
| explicit semantics, and let the compiler check your code more
|
|
||||||
| strictly. C++ vectors are also great — but you should only use them
|
|
||||||
| internally in functions. It's less friendly to accept a vector as an
|
|
||||||
| argument, because that asks the user to do much more work. Here's
|
|
||||||
| how to get a pointer from a numpy array, memory view or vector:
|
|
||||||
|
|
||||||
+code.
    cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
        pointer1 = <int*>numpy_array.data
        pointer2 = cpp_vector.data()
        pointer3 = &memory_view[0]
|
|
||||||
|
|
||||||
p
|
|
||||||
| Both C arrays and C++ vectors reassure the compiler that no Python
|
|
||||||
| operations are possible on your variable. This is a big advantage:
|
|
||||||
| it lets the Cython compiler raise many more errors for you.
|
|
||||||
|
|
||||||
p
|
|
||||||
| When getting a pointer from a numpy array or memoryview, take care
|
|
||||||
| that the data is actually stored in C-contiguous order — otherwise
|
|
||||||
| you'll get a pointer to nonsense. The type-declarations in the code
|
|
||||||
| above should generate runtime errors if buffers with incorrect
|
|
||||||
| memory layouts are passed in. To iterate over the array, the
|
|
||||||
| following style is preferred:
|
|
||||||
|
|
||||||
+code.
    cdef int c_total(const int* int_array, int length) nogil:
        total = 0
        for item in int_array[:length]:
            total += item
        return total
|
|
||||||
|
|
||||||
p
|
|
||||||
| If this is confusing, consider that the compiler couldn't deal with
|
|
||||||
| #[code for item in int_array:] — there's no length attached to a raw
|
|
||||||
| pointer, so how could we figure out where to stop? The length is
|
|
||||||
| provided in the slice notation as a solution to this. Note that we
|
|
||||||
| don't have to declare the type of #[code item] in the code above —
|
|
||||||
| the compiler can easily infer it. This gives us tidy code that looks
|
|
||||||
| quite like Python, but is exactly as fast as C — because we've made
|
|
||||||
| sure the compilation to C is trivial.
|
|
||||||
|
|
||||||
p
|
|
||||||
| Your functions cannot be declared #[code nogil] if they need to
|
|
||||||
| create Python objects or call Python functions. This is perfectly
|
|
||||||
| okay — you shouldn't torture your code just to get #[code nogil]
|
|
||||||
| functions. However, if your function isn't #[code nogil], you should
|
|
||||||
| compile your module with #[code cython -a --cplus my_module.pyx] and
|
|
||||||
| open the resulting #[code my_module.html] file in a browser. This
|
|
||||||
| will let you see how Cython is compiling your code. Calls into the
|
|
||||||
| Python run-time will be in bright yellow. This lets you easily see
|
|
||||||
| whether Cython is able to correctly type your code, or whether there
|
|
||||||
| are unexpected problems.
|
|
||||||
|
|
||||||
p
|
|
||||||
| Working in Cython is very rewarding once you're over the initial
|
|
||||||
| learning curve. As with C and C++, the first way you write something
|
|
||||||
| in Cython will often be the performance-optimal approach. In
|
|
||||||
| contrast, Python optimisation generally requires a lot of
|
|
||||||
| experimentation. Is it faster to have an #[code if item in my_dict]
|
|
||||||
| check, or to use #[code .get()]? What about
|
|
||||||
| #[code try]/#[code except]? Does this numpy operation create a copy?
|
|
||||||
| There's no way to guess the answers to these questions, and you'll
|
|
||||||
| usually be dissatisfied with your results — so there's no way to
|
|
||||||
| know when to stop this process. In the worst case, you'll make a
|
|
||||||
| mess that invites the next reader to try their luck too. This is
|
|
||||||
| like one of those
|
|
||||||
| #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
|
|
||||||
| where the rescuers keep passing out from low oxygen, causing
|
|
||||||
| another rescuer to follow — only to succumb themselves. In short,
|
|
||||||
| just say no to optimizing your Python. If it's not fast enough the
|
|
||||||
| first time, just switch to Cython.
|
|
||||||
|
|
||||||
+infobox("Resources")
|
|
||||||
+list.o-no-block
|
|
||||||
+item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
|
|
||||||
+item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
|
|
||||||
+item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
|
|
|
@ -1,149 +0,0 @@
|
||||||
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
|
|
||||||
|
|
||||||
p
|
|
||||||
| spaCy's statistical models have been custom-designed to give a
|
|
||||||
| high-performance mix of speed and accuracy. The current architecture
|
|
||||||
| hasn't been published yet, but in the meantime we prepared a video that
|
|
||||||
| explains how the models work, with particular focus on NER.
|
|
||||||
|
|
||||||
+youtube("sqDHBH9IjRU")
|
|
||||||
|
|
||||||
p
|
|
||||||
| The parsing model is a blend of recent results. The two recent
|
|
||||||
| inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at
|
|
||||||
| Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
|
|
||||||
| the parser is still based on the work of Joakim Nivre#[+fn(2)], who
|
|
||||||
| introduced the transition-based framework#[+fn(3)], the arc-eager
|
|
||||||
| transition system, and the imitation learning objective. The model is
|
|
||||||
| implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
|
|
||||||
| library. We first predict context-sensitive vectors for each word in the
|
|
||||||
| input:
|
|
||||||
|
|
||||||
+code.
    (embed_lower | embed_prefix | embed_suffix | embed_shape)
        >> Maxout(token_width)
        >> convolution ** 4
|
|
||||||
|
|
||||||
p
|
|
||||||
| This convolutional layer is shared between the tagger, parser and NER,
|
|
||||||
| and will also be shared by the future neural lemmatizer. Because the
|
|
||||||
| parser shares these layers with the tagger, the parser does not require
|
|
||||||
| tag features. I got this trick from David Weiss's "Stack-propagation"
|
|
||||||
| paper#[+fn(4)].
|
|
||||||
|
|
||||||
p
|
|
||||||
| To boost the representation, the tagger actually predicts a "super tag"
|
|
||||||
| with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
|
|
||||||
| these supertags by adding a softmax layer onto the convolutional layer –
|
|
||||||
| so, we're teaching the convolutional layer to give us a representation
|
|
||||||
| that's one affine transform from this informative lexical information.
|
|
||||||
| This is obviously good for the parser (which backprops to the
|
|
||||||
| convolutions too). The parser model makes a state vector by concatenating
|
|
||||||
| the vector representations for its context tokens. The current context
|
|
||||||
| tokens:
|
|
||||||
|
|
||||||
+table
|
|
||||||
+row
|
|
||||||
+cell #[code S0], #[code S1], #[code S2]
|
|
||||||
+cell Top three words on the stack.
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell #[code B0], #[code B1]
|
|
||||||
+cell First two words of the buffer.
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell
|
|
||||||
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
|
|
||||||
| #[code B1L1]#[br]
|
|
||||||
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
|
|
||||||
| #[code B1L2]
|
|
||||||
+cell
|
|
||||||
| Leftmost and second leftmost children of #[code S0], #[code S1],
|
|
||||||
| #[code S2], #[code B0] and #[code B1].
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell
|
|
||||||
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
|
|
||||||
| #[code B1R1]#[br]
|
|
||||||
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
|
|
||||||
| #[code B1R2]
|
|
||||||
+cell
|
|
||||||
| Rightmost and second rightmost children of #[code S0], #[code S1],
|
|
||||||
| #[code S2], #[code B0] and #[code B1].
|
|
||||||
|
|
||||||
p
|
|
||||||
| This makes the state vector quite long: #[code 13*T], where #[code T] is
|
|
||||||
| the token vector width (128 is working well). Fortunately, there's a way
|
|
||||||
| to structure the computation to save some expense (and make it more
|
|
||||||
| GPU-friendly).
|
|
||||||
|
|
||||||
p
|
|
||||||
| The parser typically visits #[code 2*N] states for a sentence of length
|
|
||||||
| #[code N] (although it may visit more, if it back-tracks with a
|
|
||||||
| non-monotonic transition#[+fn(6)]). A naive implementation would require
|
|
||||||
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
|
|
||||||
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
|
|
||||||
| multiplication, to pre-compute the hidden weights for each positional
|
|
||||||
| feature with respect to the words in the batch. (Note that our token
|
|
||||||
| vectors come from the CNN — so we can't play this trick over the
|
|
||||||
| vocabulary. That's how Stanford's NN parser#[+fn(7)] works, and why its
|
|
||||||
| model is so big.)
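The pre-computation described above can be checked on a toy example. The sketch below is not spaCy's or Thinc's implementation. It is a small pure-Python illustration, with made-up sizes and integer weights, of why multiplying a concatenated state vector of `13*T` values by a `(13*T, H)` matrix gives the same result as summing per-slot products that were computed once for every token. The names `matvec`, `hidden_naive` and `hidden_precomputed` are hypothetical.

```python
import random


def matvec(vec, mat):
    """Multiply a length-n vector by an n x m matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


random.seed(0)
T, H, F, N = 4, 3, 13, 6  # token width, hidden width, feature slots, sentence length

# Toy token vectors; in the text these come from the CNN.
tokens = [[random.randint(-3, 3) for _ in range(T)] for _ in range(N)]
# One (T, H) weight block per feature slot; stacked, they form the (F*T, H) matrix.
W = [[[random.randint(-3, 3) for _ in range(H)] for _ in range(T)] for _ in range(F)]

# Computed once per sentence: the hidden contribution of every token in every slot.
precomputed = [[matvec(tok, W[f]) for f in range(F)] for tok in tokens]


def hidden_naive(context):
    """Concatenate the F context token vectors and multiply by the stacked weights."""
    state = [x for i in context for x in tokens[i]]  # length F*T
    stacked = [row for block in W for row in block]  # (F*T, H)
    return matvec(state, stacked)


def hidden_precomputed(context):
    """Sum the pre-computed rows instead of doing a fresh matrix multiplication."""
    out = [0] * H
    for f, i in enumerate(context):
        for j in range(H):
            out[j] += precomputed[i][f][j]
    return out


# A parser state is described here by which token fills each of the F slots.
context = [random.randrange(N) for _ in range(F)]
print(hidden_naive(context) == hidden_precomputed(context))
```

With the table in hand, each of the roughly `2*N` states only needs `F` row lookups and additions, which is cheap enough to do on the CPU.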
|
|
||||||
|
|
||||||
p
|
|
||||||
| This pre-computation strategy allows a nice compromise between
|
|
||||||
| GPU-friendliness and implementation simplicity. The CNN and the wide
|
|
||||||
| lower layer are computed on the GPU, and then the precomputed hidden
|
|
||||||
| weights are moved to the CPU, before we start the transition-based
|
|
||||||
| parsing process. This makes a lot of things much easier. We don't have to
|
|
||||||
| worry about variable-length batch sizes, and we don't have to implement
|
|
||||||
| the dynamic oracle in CUDA to train.
|
|
||||||
|
|
||||||
p
|
|
||||||
| Currently the parser's loss function is multilabel log loss#[+fn(8)], as
|
|
||||||
| the dynamic oracle allows multiple states to be 0 cost. This is defined
|
|
||||||
| as follows, where #[code Z] sums #[code exp(score)] over all classes and #[code gZ] sums it over the gold
|
|
||||||
| classes:
|
|
||||||
|
|
||||||
+code.
|
|
||||||
(exp(score) / Z) - (exp(score) / gZ)
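As a worked illustration of the expression above, the following sketch computes a per-class gradient for one parser state. It rests on assumptions that the text does not spell out: `Z` is taken to be the sum of `exp(score)` over all classes, `gZ` the same sum restricted to the zero-cost (gold) classes, and the second term is applied only to gold classes. The function name is hypothetical, and subtracting the maximum score is a standard numerical-stability step that leaves both ratios unchanged.

```python
import math


def multilabel_log_loss_gradient(scores, is_gold):
    """Gradient of the multilabel log loss with respect to each class score.

    Assumes at least one class is marked gold.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    Z = sum(exps)  # normaliser over all classes
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)  # normaliser over gold classes
    return [e / Z - (e / gZ if gold else 0.0) for e, gold in zip(exps, is_gold)]


# Four candidate transitions, two of which are zero-cost under the oracle.
gradient = multilabel_log_loss_gradient([2.0, 0.5, -1.0, 1.0], [True, False, False, True])
print(gradient)
```

Under these assumptions the gradient sums to zero: probability mass is pushed away from the non-gold classes and shared among the gold ones in proportion to their current scores.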
|
|
||||||
|
|
||||||
+bibliography
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
|
|
||||||
br
|
|
||||||
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
|
|
||||||
br
|
|
||||||
| Yoav Goldberg, Joakim Nivre (2012)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
|
|
||||||
br
|
|
||||||
| Matthew Honnibal (2013)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
|
|
||||||
br
|
|
||||||
| Yuan Zhang, David Weiss (2016)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
|
|
||||||
br
|
|
||||||
| Anders Søgaard, Yoav Goldberg (2016)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
|
|
||||||
br
|
|
||||||
| Matthew Honnibal, Mark Johnson (2015)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
|
|
||||||
br
|
|
||||||
| Danqi Chen, Christopher D. Manning (2014)
|
|
||||||
|
|
||||||
+item
|
|
||||||
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
|
|
||||||
br
|
|
||||||
| Stefan Riezler et al. (2002)
|
|
71
website/api/_cython/_doc.jade
Normal file
|
@ -0,0 +1,71 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > DOC
|
||||||
|
|
||||||
|
p
|
||||||
|
| The #[code Doc] object holds an array of
|
||||||
|
| #[+api("cython-structs#tokenc") #[code TokenC]] structs.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("doc") #[code Doc]].
|
||||||
|
|
||||||
|
+h(3, "doc_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code mem]
|
||||||
|
+cell #[code cymem.Pool]
|
||||||
|
+cell
|
||||||
|
| A memory pool. Allocated memory will be freed once the
|
||||||
|
| #[code Doc] object is garbage collected.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code vocab]
|
||||||
|
+cell #[code Vocab]
|
||||||
|
+cell A reference to the shared #[code Vocab] object.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code c]
|
||||||
|
+cell #[code TokenC*]
|
||||||
|
+cell
|
||||||
|
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
|
||||||
|
| struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The number of tokens in the document.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code max_length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The underlying size of the #[code Doc.c] array.
|
||||||
|
|
||||||
|
+h(3, "doc_push_back") Doc.push_back
|
||||||
|
+tag method
|
||||||
|
|
||||||
|
p
|
||||||
|
| Append a token to the #[code Doc]. The token can be provided as a
|
||||||
|
| #[+api("cython-structs#lexemec") #[code LexemeC]] or
|
||||||
|
| #[+api("cython-structs#tokenc") #[code TokenC]] pointer, using Cython's
|
||||||
|
| #[+a("http://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html") fused types].
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.tokens cimport Doc
|
||||||
|
from spacy.vocab cimport Vocab
|
||||||
|
|
||||||
|
doc = Doc(Vocab())
|
||||||
|
lexeme = doc.vocab.get(doc.vocab.mem, u'hello')
|
||||||
|
doc.push_back(lexeme, True)
|
||||||
|
assert doc.text == u'hello '
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lex_or_tok]
|
||||||
|
+cell #[code LexemeOrToken]
|
||||||
|
+cell The word to append to the #[code Doc].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code has_space]
|
||||||
|
+cell #[code bint]
|
||||||
|
+cell Whether the word has trailing whitespace.
|
30
website/api/_cython/_lexeme.jade
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > LEXEME
|
||||||
|
|
||||||
|
p
|
||||||
|
| A Cython class providing access and methods for an entry in the
|
||||||
|
| vocabulary.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("lexeme") #[code Lexeme]].
|
||||||
|
|
||||||
|
+h(3, "lexeme_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code c]
|
||||||
|
+cell #[code LexemeC*]
|
||||||
|
+cell
|
||||||
|
| A pointer to a #[+api("cython-structs#lexemec") #[code LexemeC]]
|
||||||
|
| struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code vocab]
|
||||||
|
+cell #[code Vocab]
|
||||||
|
+cell A reference to the shared #[code Vocab] object.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code orth]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell ID of the verbatim text content.
|
200
website/api/_cython/_lexemec.jade
Normal file
|
@ -0,0 +1,200 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > STRUCTS > LEXEMEC
|
||||||
|
|
||||||
|
p
|
||||||
|
| Struct holding information about a lexical type. #[code LexemeC]
|
||||||
|
| structs are usually owned by the #[code Vocab], and accessed through a
|
||||||
|
| read-only pointer on the #[code TokenC] struct.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
lex = doc.c[3].lex
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code flags]
|
||||||
|
+cell #[+abbr("uint64_t") #[code flags_t]]
|
||||||
|
+cell Bit-field for binary lexical flag values.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code id]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell
|
||||||
|
| Usually used to map lexemes to rows in a matrix, e.g. for word
|
||||||
|
| vectors. Does not need to be unique, so currently misnamed.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Number of unicode characters in the lexeme.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code orth]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell ID of the verbatim text content.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code lower]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell ID of the lowercase form of the lexeme.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code norm]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell ID of the lexeme's norm, i.e. a normalised form of the text.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code shape]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Transform of the lexeme's string, to show orthographic features.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code prefix]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell
|
||||||
|
| Length-N substring from the start of the lexeme. Defaults to
|
||||||
|
| #[code N=1].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code suffix]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell
|
||||||
|
| Length-N substring from the end of the lexeme. Defaults to
|
||||||
|
| #[code N=3].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code cluster]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Brown cluster ID.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code prob]
|
||||||
|
+cell #[code float]
|
||||||
|
+cell Smoothed log probability estimate of the lexeme's type.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code sentiment]
|
||||||
|
+cell #[code float]
|
||||||
|
+cell A scalar value indicating positivity or negativity.
|
||||||
|
|
||||||
|
+h(3, "lexeme_get_struct_attr", "spacy/lexeme.pxd") Lexeme.get_struct_attr
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Get the value of an attribute from the #[code LexemeC] struct by attribute ID.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport IS_ALPHA
|
||||||
|
from spacy.lexeme cimport Lexeme
|
||||||
|
|
||||||
|
lexeme = doc.c[3].lex
|
||||||
|
is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lex]
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell A pointer to a #[code LexemeC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code feat_name]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the attribute to look up. The attributes are
|
||||||
|
| enumerated in #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell The value of the attribute.
|
||||||
|
|
||||||
|
+h(3, "lexeme_set_struct_attr", "spacy/lexeme.pxd") Lexeme.set_struct_attr
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Set the value of an attribute of the #[code LexemeC] struct by attribute ID.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport NORM
|
||||||
|
from spacy.lexeme cimport Lexeme
|
||||||
|
|
||||||
|
lexeme = doc.c[3].lex
|
||||||
|
Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lex]
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell A pointer to a #[code LexemeC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code feat_name]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the attribute to set. The attributes are
|
||||||
|
| enumerated in #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code value]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell The value to set.
|
||||||
|
|
||||||
|
+h(3, "lexeme_c_check_flag", "spacy/lexeme.pxd") Lexeme.c_check_flag
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Check the value of a binary flag attribute.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport IS_STOP
|
||||||
|
from spacy.lexeme cimport Lexeme
|
||||||
|
|
||||||
|
lexeme = doc.c[3].lex
|
||||||
|
is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lexeme]
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell A pointer to a #[code LexemeC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code flag_id]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the flag to look up. The flag IDs are enumerated in
|
||||||
|
| #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code bint]
|
||||||
|
+cell The boolean value of the flag.
|
||||||
|
|
||||||
|
+h(3, "lexeme_c_set_flag", "spacy/lexeme.pxd") Lexeme.c_set_flag
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Set the value of a binary flag attribute.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport IS_STOP
|
||||||
|
from spacy.lexeme cimport Lexeme
|
||||||
|
|
||||||
|
lexeme = doc.c[3].lex
|
||||||
|
Lexeme.c_set_flag(lexeme, IS_STOP, 0)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lexeme]
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell A pointer to a #[code LexemeC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code flag_id]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the flag to set. The flag IDs are enumerated in
|
||||||
|
| #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code value]
|
||||||
|
+cell #[code bint]
|
||||||
|
+cell The value to set.
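The flag helpers above operate on the #[code flags] bit-field of a #[code LexemeC] struct. As a rough illustration of how a 64-bit bit-field of this kind can be read and written, here is a minimal pure-Python sketch. It is a model of the idea only, not spaCy's implementation: the function names and the two flag IDs are hypothetical, and the real IDs are the ones cimported from #[code spacy.attrs].

```python
# Pure-Python model of a 64-bit flag bit-field (illustrative only).
MASK_64 = (1 << 64) - 1

# Hypothetical flag IDs chosen for this sketch; they are NOT spaCy's values.
EXAMPLE_IS_ALPHA = 1
EXAMPLE_IS_STOP = 2


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if the bit at position `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


def set_flag(flags: int, flag_id: int, value: bool) -> int:
    """Return a copy of `flags` with the bit at `flag_id` set or cleared."""
    bit = 1 << flag_id
    if value:
        return (flags | bit) & MASK_64
    return flags & ~bit & MASK_64


flags = 0
flags = set_flag(flags, EXAMPLE_IS_STOP, True)
flags = set_flag(flags, EXAMPLE_IS_ALPHA, True)
flags = set_flag(flags, EXAMPLE_IS_STOP, False)  # clear the stop-word bit again
```

Because every flag occupies a single bit, checking or updating a flag is one mask operation and needs no Python objects, which is consistent with these methods being tagged #[code nogil].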
|
43
website/api/_cython/_span.jade
Normal file
|
@ -0,0 +1,43 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > SPAN
|
||||||
|
|
||||||
|
p
|
||||||
|
| A Cython class providing access and methods for a slice of a #[code Doc]
|
||||||
|
| object.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("span") #[code Span]].
|
||||||
|
|
||||||
|
+h(3, "span_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code doc]
|
||||||
|
+cell #[code Doc]
|
||||||
|
+cell The parent document.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code start]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the first token of the span.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code end]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the first token after the span.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code start_char]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the first character of the span.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code end_char]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the last character of the span.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code label]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell A label to attach to the span, e.g. for named entities.
|
23
website/api/_cython/_stringstore.jade
Normal file
|
@ -0,0 +1,23 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > STRINGSTORE
|
||||||
|
|
||||||
|
p A lookup table to retrieve strings by 64-bit hashes.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("stringstore") #[code StringStore]].
|
||||||
|
|
||||||
|
+h(3, "stringstore_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code mem]
|
||||||
|
+cell #[code cymem.Pool]
|
||||||
|
+cell
|
||||||
|
| A memory pool. Allocated memory will be freed once the
|
||||||
|
| #[code StringStore] object is garbage collected.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code keys]
|
||||||
|
+cell #[+abbr("vector[uint64_t]") #[code vector[hash_t]]]
|
||||||
|
+cell A list of hash values in the #[code StringStore].
|
73
website/api/_cython/_token.jade
Normal file
|
@ -0,0 +1,73 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > TOKEN
|
||||||
|
|
||||||
|
p
|
||||||
|
| A Cython class providing access and methods for a
|
||||||
|
| #[+api("cython-structs#tokenc") #[code TokenC]] struct. Note that the
|
||||||
|
| #[code Token] object does not own the struct. It only receives a pointer
|
||||||
|
| to it.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("token") #[code Token]].
|
||||||
|
|
||||||
|
+h(3, "token_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code vocab]
|
||||||
|
+cell #[code Vocab]
|
||||||
|
+cell A reference to the shared #[code Vocab] object.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code c]
|
||||||
|
+cell #[code TokenC*]
|
||||||
|
+cell
|
||||||
|
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
|
||||||
|
| struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code i]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The offset of the token within the document.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code doc]
|
||||||
|
+cell #[code Doc]
|
||||||
|
+cell The parent document.
|
||||||
|
|
||||||
|
+h(3, "token_cinit") Token.cinit
|
||||||
|
+tag method
|
||||||
|
|
||||||
|
p Create a #[code Token] object from a #[code TokenC*] pointer.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code vocab]
|
||||||
|
+cell #[code Vocab]
|
||||||
|
+cell A reference to the shared #[code Vocab].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code c]
|
||||||
|
+cell #[code TokenC*]
|
||||||
|
+cell
|
||||||
|
| A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
|
||||||
|
| struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code offset]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The offset of the token within the document.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code doc]
|
||||||
|
+cell #[code Doc]
|
||||||
|
+cell The parent document.
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code Token]
|
||||||
|
+cell The newly constructed object.
|
270
website/api/_cython/_tokenc.jade
Normal file
|
@ -0,0 +1,270 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > STRUCTS > TOKENC
|
||||||
|
|
||||||
|
p
|
||||||
|
| Cython data container for the #[code Token] object.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
token = &doc.c[3]
|
||||||
|
token_ptr = &doc.c[3]
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code lex]
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell A pointer to the lexeme for the token.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code morph]
|
||||||
|
+cell #[code uint64_t]
|
||||||
|
+cell An ID allowing lookup of morphological attributes.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code pos]
|
||||||
|
+cell #[code univ_pos_t]
|
||||||
|
+cell Coarse-grained part-of-speech tag.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code spacy]
|
||||||
|
+cell #[code bint]
|
||||||
|
+cell A binary value indicating whether the token has trailing whitespace.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code tag]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Fine-grained part-of-speech tag.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code idx]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The character offset of the token within the parent document.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code lemma]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Base form of the token, with no inflectional suffixes.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code sense]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Space for storing a word sense ID, currently unused.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code head]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell Offset of the syntactic parent relative to the token.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code dep]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Syntactic dependency relation.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code l_kids]
|
||||||
|
+cell #[code uint32_t]
|
||||||
|
+cell Number of left children.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code r_kids]
|
||||||
|
+cell #[code uint32_t]
|
||||||
|
+cell Number of right children.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code l_edge]
|
||||||
|
+cell #[code uint32_t]
|
||||||
|
+cell Offset of the leftmost token of this token's syntactic descendants.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code r_edge]
|
||||||
|
+cell #[code uint32_t]
|
||||||
|
+cell Offset of the rightmost token of this token's syntactic descendants.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code sent_start]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell
|
||||||
|
| Ternary value indicating whether the token is the first word of
|
||||||
|
| a sentence. #[code 0] indicates a missing value, #[code -1]
|
||||||
|
| indicates #[code False] and #[code 1] indicates #[code True]. The default value, 0,
|
||||||
|
| is interpreted as no sentence break. Sentence boundary detectors will usually
|
||||||
|
| set 0 for all tokens except tokens that follow a sentence boundary.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code ent_iob]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell
|
||||||
|
| IOB code of named entity tag. #[code 0] indicates a missing
|
||||||
|
| value, #[code 1] indicates #[code I], #[code 2] indicates
|
||||||
|
| #[code O] and #[code 3] indicates #[code B].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code ent_type]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell Named entity type.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code ent_id]
|
||||||
|
+cell #[+abbr("uint64_t") #[code hash_t]]
|
||||||
|
+cell
|
||||||
|
| ID of the entity the token is an instance of, if any. Currently
|
||||||
|
| not used, but potentially for coreference resolution.
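The integer codes stored in #[code sent_start] and #[code ent_iob] are easy to mix up, so a small decoding sketch may help. The tables below restate the values from the field descriptions, using the letter #[code O] for "outside an entity". The helper names are hypothetical and exist only for this illustration; they are not part of spaCy.

```python
# Code tables for two TokenC fields (illustrative, pure Python).
# ent_iob: 0 = missing value, 1 = I (inside), 2 = O (outside), 3 = B (begin)
ENT_IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}

# sent_start: 0 = missing value, -1 = False, 1 = True
SENT_START_CODES = {0: None, -1: False, 1: True}


def decode_ent_iob(code: int) -> str:
    """Map an ent_iob integer to its IOB letter ('' if missing)."""
    return ENT_IOB_CODES[code]


def decode_sent_start(code: int):
    """Map a sent_start integer to True, False or None (missing)."""
    return SENT_START_CODES[code]


def starts_sentence(code: int) -> bool:
    """Treat the missing value (0) as 'no sentence break', as described above."""
    return code == 1
```

Keeping the "missing" case distinct from an explicit #[code False] is what makes the value ternary rather than boolean.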
|
||||||
|
|
||||||
|
+h(3, "token_get_struct_attr", "spacy/tokens/token.pxd") Token.get_struct_attr
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Get the value of an attribute from the #[code TokenC] struct by attribute ID.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport IS_ALPHA
|
||||||
|
from spacy.tokens cimport Token
|
||||||
|
|
||||||
|
is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code token]
|
||||||
|
+cell #[code const TokenC*]
|
||||||
|
+cell A pointer to a #[code TokenC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code feat_name]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the attribute to look up. The attributes are
|
||||||
|
| enumerated in #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell The value of the attribute.
|
||||||
|
|
||||||
|
+h(3, "token_set_struct_attr", "spacy/tokens/token.pxd") Token.set_struct_attr
|
||||||
|
+tag staticmethod
|
||||||
|
+tag nogil
|
||||||
|
|
||||||
|
p Set the value of an attribute of the #[code TokenC] struct by attribute ID.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.attrs cimport TAG
|
||||||
|
from spacy.tokens cimport Token
|
||||||
|
|
||||||
|
token = &doc.c[3]
|
||||||
|
Token.set_struct_attr(token, TAG, 0)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code token]
|
||||||
|
+cell #[code const TokenC*]
|
||||||
|
+cell A pointer to a #[code TokenC] struct.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code feat_name]
|
||||||
|
+cell #[code attr_id_t]
|
||||||
|
+cell
|
||||||
|
| The ID of the attribute to set. The attributes are
|
||||||
|
| enumerated in #[code spacy.typedefs].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code value]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell The value to set.
|
||||||
|
|
||||||
|
+h(3, "token_by_start", "spacy/tokens/doc.pxd") token_by_start
|
||||||
|
+tag function
|
||||||
|
|
||||||
|
p Find a token in a #[code TokenC*] array by the offset of its first character.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.tokens.doc cimport Doc, token_by_start
|
||||||
|
from spacy.vocab cimport Vocab
|
||||||
|
|
||||||
|
doc = Doc(Vocab(), words=[u'hello', u'world'])
|
||||||
|
assert token_by_start(doc.c, doc.length, 6) == 1
|
||||||
|
assert token_by_start(doc.c, doc.length, 4) == -1
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code tokens]
|
||||||
|
+cell #[code const TokenC*]
|
||||||
|
+cell A #[code TokenC*] array.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The number of tokens in the array.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code start_char]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The start index to search for.
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the token in the array or #[code -1] if not found.
|
||||||
|
|
||||||
|
+h(3, "token_by_end", "spacy/tokens/doc.pxd") token_by_end
|
||||||
|
+tag function
|
||||||
|
|
||||||
|
p Find a token in a #[code TokenC*] array by the offset of its final character.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.tokens.doc cimport Doc, token_by_end
|
||||||
|
from spacy.vocab cimport Vocab
|
||||||
|
|
||||||
|
doc = Doc(Vocab(), words=[u'hello', u'world'])
|
||||||
|
assert token_by_end(doc.c, doc.length, 5) == 0
|
||||||
|
assert token_by_end(doc.c, doc.length, 1) == -1
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code tokens]
|
||||||
|
+cell #[code const TokenC*]
|
||||||
|
+cell A #[code TokenC*] array.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The number of tokens in the array.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code end_char]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The end index to search for.
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The index of the token in the array or #[code -1] if not found.
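To make the return values of #[code token_by_start] and #[code token_by_end] concrete, here is a pure-Python sketch of the lookup logic under simplifying assumptions. The #[code Tok] tuple is a hypothetical stand-in for #[code TokenC] that stores only a start offset and a length, and a Python list replaces the #[code TokenC*] array plus its #[code length] argument. A linear scan is used for clarity; this is not a claim about how the C functions search.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for TokenC: start offset and character length."""
    idx: int
    length: int


def token_by_start(tokens: List[Tok], start_char: int) -> int:
    """Index of the token that starts at `start_char`, or -1 if none does."""
    for i, tok in enumerate(tokens):
        if tok.idx == start_char:
            return i
    return -1


def token_by_end(tokens: List[Tok], end_char: int) -> int:
    """Index of the token that ends at `end_char` (exclusive), or -1."""
    for i, tok in enumerate(tokens):
        if tok.idx + tok.length == end_char:
            return i
    return -1


# "hello world": 'hello' covers characters 0-4, 'world' covers 6-10.
tokens = [Tok(idx=0, length=5), Tok(idx=6, length=5)]
```

With these two tokens the sketch reproduces the results asserted in the examples above: offset 6 is the start of token 1, offset 5 is the end of token 0, and offsets that fall inside a token yield #[code -1].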
|
||||||
|
|
||||||
|
+h(3, "set_children_from_heads", "spacy/tokens/doc.pxd") set_children_from_heads
|
||||||
|
+tag function
|
||||||
|
|
||||||
|
p
|
||||||
|
| Set attributes that allow lookup of syntactic children on a
|
||||||
|
| #[code TokenC*] array. This function must be called after making changes
|
||||||
|
| to the #[code TokenC.head] attribute, in order to make the parse tree
|
||||||
|
| navigation consistent.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
from spacy.tokens.doc cimport Doc, set_children_from_heads
|
||||||
|
from spacy.vocab cimport Vocab
|
||||||
|
|
||||||
|
doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe'])
|
||||||
|
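doc.c[0].head = 0   # root: offset 0 points at itself
|
||||||
|
doc.c[1].head = -1  # 'from' attaches to 'Baileys', one token to the left
|
||||||
|
doc.c[2].head = 1   # 'a' attaches to 'shoe', one token to the right
|
||||||
|
doc.c[3].head = -2  # 'shoe' attaches to 'from', two tokens to the left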
|
||||||
|
set_children_from_heads(doc.c, doc.length)
|
||||||
|
assert doc.c[3].l_kids == 1
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code tokens]
|
||||||
|
+cell #[code const TokenC*]
|
||||||
|
+cell A #[code TokenC*] array.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The number of tokens in the array.
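Since #[code TokenC.head] is an offset relative to the token rather than an absolute index, it can help to see how child counts follow from it. The sketch below is a pure-Python model under stated assumptions: #[code count_children] is a hypothetical helper, it derives only the #[code l_kids] and #[code r_kids] counts, and it does not attempt to model the edge attributes that the real function also maintains.

```python
from typing import List, Tuple


def count_children(heads: List[int]) -> Tuple[List[int], List[int]]:
    """Count left and right children per token from relative head offsets.

    heads[i] is the offset from token i to its syntactic parent;
    an offset of 0 marks a root.
    """
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    for child, offset in enumerate(heads):
        parent = child + offset
        if not 0 <= parent < n:
            raise ValueError(f"head of token {child} points outside the array")
        if parent > child:
            l_kids[parent] += 1  # the child sits to the left of its parent
        elif parent < child:
            r_kids[parent] += 1  # the child sits to the right of its parent
    return l_kids, r_kids


words = ["Baileys", "from", "a", "shoe"]
heads = [0, -1, 1, -2]  # relative offsets, one per token
l_kids, r_kids = count_children(heads)
```

The bounds check is included because the C-level API performs none: a head offset that points outside the array would be read as-is, which is why the counts should be recomputed after any change to #[code head].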
|
88
website/api/_cython/_vocab.jade
Normal file
|
@ -0,0 +1,88 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES > VOCAB
|
||||||
|
|
||||||
|
p
|
||||||
|
| A Cython class providing access and methods for a vocabulary and other
|
||||||
|
| data shared across a language.
|
||||||
|
|
||||||
|
+infobox
|
||||||
|
| This section documents the extra C-level attributes and methods that
|
||||||
|
| can't be accessed from Python. For the Python documentation, see
|
||||||
|
| #[+api("vocab") #[code Vocab]].
|
||||||
|
|
||||||
|
+h(3, "vocab_attributes") Attributes
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code mem]
|
||||||
|
+cell #[code cymem.Pool]
|
||||||
|
+cell
|
||||||
|
| A memory pool. Allocated memory will be freed once the
|
||||||
|
| #[code Vocab] object is garbage collected.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code strings]
|
||||||
|
+cell #[code StringStore]
|
||||||
|
+cell
|
||||||
|
| A #[code StringStore] that maps strings to hash values and vice
|
||||||
|
| versa.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code length]
|
||||||
|
+cell #[code int]
|
||||||
|
+cell The number of entries in the vocabulary.
|
||||||
|
|
||||||
|
+h(3, "vocab_get") Vocab.get
|
||||||
|
+tag method
|
||||||
|
|
||||||
|
p
|
||||||
|
| Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
|
||||||
|
| from the vocabulary.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
lexeme = vocab.get(vocab.mem, u'hello')
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code mem]
|
||||||
|
+cell #[code cymem.Pool]
|
||||||
|
+cell
|
||||||
|
| A memory pool. Allocated memory will be freed once the
|
||||||
|
| #[code Vocab] object is garbage collected.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code string]
|
||||||
|
+cell #[code unicode]
|
||||||
|
+cell The string of the word to look up.
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell The lexeme in the vocabulary.
|
||||||
|
|
||||||
|
+h(3, "vocab_get_by_orth") Vocab.get_by_orth
|
||||||
|
+tag method
|
||||||
|
|
||||||
|
p
|
||||||
|
| Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
|
||||||
|
| from the vocabulary.
|
||||||
|
|
||||||
|
+aside-code("Example").
|
||||||
|
lexeme = vocab.get_by_orth(vocab.mem, doc.c[0].lex.norm)
|
||||||
|
|
||||||
|
+table(["Name", "Type", "Description"])
|
||||||
|
+row
|
||||||
|
+cell #[code mem]
|
||||||
|
+cell #[code cymem.Pool]
|
||||||
|
+cell
|
||||||
|
| A memory pool. Allocated memory will be freed once the
|
||||||
|
| #[code Vocab] object is garbage collected.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code orth]
|
||||||
|
+cell #[+abbr("uint64_t") #[code attr_t]]
|
||||||
|
+cell ID of the verbatim text content.
|
||||||
|
|
||||||
|
+row("foot")
|
||||||
|
+cell returns
|
||||||
|
+cell #[code const LexemeC*]
|
||||||
|
+cell The lexeme in the vocabulary.
|
|
@ -33,6 +33,12 @@
|
||||||
"Vectors": "vectors",
|
"Vectors": "vectors",
|
||||||
"GoldParse": "goldparse",
|
"GoldParse": "goldparse",
|
||||||
"GoldCorpus": "goldcorpus"
|
"GoldCorpus": "goldcorpus"
|
||||||
|
},
|
||||||
|
|
||||||
|
"Cython": {
|
||||||
|
"Architecture": "cython",
|
||||||
|
"Structs": "cython-structs",
|
||||||
|
"Classes": "cython-classes"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
@ -41,8 +47,7 @@
|
||||||
"next": "annotation",
|
"next": "annotation",
|
||||||
"menu": {
|
"menu": {
|
||||||
"Basics": "basics",
|
"Basics": "basics",
|
||||||
"Neural Network Model": "nn-model",
|
"Neural Network Model": "nn-model"
|
||||||
"Cython Conventions": "cython"
|
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
|
||||||
|
@ -211,5 +216,36 @@
|
||||||
"Named Entities": "named-entities",
|
"Named Entities": "named-entities",
|
||||||
"Models & Training": "training"
|
"Models & Training": "training"
|
||||||
}
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
"cython": {
|
||||||
|
"title": "Cython Architecture",
|
||||||
|
"next": "cython-structs",
|
||||||
|
"menu": {
|
||||||
|
"Overview": "overview",
|
||||||
|
"Conventions": "conventions"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
"cython-structs": {
|
||||||
|
"title": "Cython Structs",
|
||||||
|
"teaser": "C-language objects that let you group variables together in a single contiguous block.",
|
||||||
|
"next": "cython-classes",
|
||||||
|
"menu": {
|
||||||
|
"TokenC": "tokenc",
|
||||||
|
"LexemeC": "lexemec"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
|
||||||
|
"cython-classes": {
|
||||||
|
"title": "Cython Classes",
|
||||||
|
"menu": {
|
||||||
|
"Doc": "doc",
|
||||||
|
"Token": "token",
|
||||||
|
"Span": "span",
|
||||||
|
"Lexeme": "lexeme",
|
||||||
|
"Vocab": "vocab",
|
||||||
|
"StringStore": "stringstore"
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
@ -280,7 +280,7 @@ p
|
||||||
+row
|
+row
|
||||||
+cell #[code --n-iter], #[code -n]
|
+cell #[code --n-iter], #[code -n]
|
||||||
+cell option
|
+cell option
|
||||||
+cell Number of iterations (default: #[code 20]).
|
+cell Number of iterations (default: #[code 30]).
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code --n-sents], #[code -ns]
|
+cell #[code --n-sents], #[code -ns]
|
||||||
|
|
39
website/api/cython-classes.jade
Normal file
|
@ -0,0 +1,39 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > CLASSES
|
||||||
|
|
||||||
|
include ../_includes/_mixins
|
||||||
|
|
||||||
|
+section("doc")
|
||||||
|
+h(2, "doc", "spacy/tokens/doc.pxd") Doc
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_doc
|
||||||
|
|
||||||
|
+section("token")
|
||||||
|
+h(2, "token", "spacy/tokens/token.pxd") Token
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_token
|
||||||
|
|
||||||
|
+section("span")
|
||||||
|
+h(2, "span", "spacy/tokens/span.pxd") Span
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_span
|
||||||
|
|
||||||
|
+section("lexeme")
|
||||||
|
+h(2, "lexeme", "spacy/lexeme.pxd") Lexeme
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_lexeme
|
||||||
|
|
||||||
|
+section("vocab")
|
||||||
|
+h(2, "vocab", "spacy/vocab.pxd") Vocab
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_vocab
|
||||||
|
|
||||||
|
+section("stringstore")
|
||||||
|
+h(2, "stringstore", "spacy/strings.pxd") StringStore
|
||||||
|
+tag cdef class
|
||||||
|
|
||||||
|
include _cython/_stringstore
|
15
website/api/cython-structs.jade
Normal file
|
@ -0,0 +1,15 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > STRUCTS
|
||||||
|
|
||||||
|
include ../_includes/_mixins
|
||||||
|
|
||||||
|
+section("tokenc")
|
||||||
|
+h(2, "tokenc", "spacy/structs.pxd") TokenC
|
||||||
|
+tag C struct
|
||||||
|
|
||||||
|
include _cython/_tokenc
|
||||||
|
|
||||||
|
+section("lexemec")
|
||||||
|
+h(2, "lexemec", "spacy/structs.pxd") LexemeC
|
||||||
|
+tag C struct
|
||||||
|
|
||||||
|
include _cython/_lexemec
|
176
website/api/cython.jade
Normal file
|
@ -0,0 +1,176 @@
|
||||||
|
//- 💫 DOCS > API > CYTHON > ARCHITECTURE
|
||||||
|
|
||||||
|
include ../_includes/_mixins
|
||||||
|
|
||||||
|
+section("overview")
|
||||||
|
+aside("What's Cython?")
|
||||||
|
| #[+a("http://cython.org/") Cython] is a language for writing
|
||||||
|
| C extensions for Python. Most Python code is also valid Cython, but
|
||||||
|
| you can add type declarations to get efficient memory-managed code
|
||||||
|
| just like C or C++.
|
||||||
|
|
||||||
|
p
|
||||||
|
| This section documents spaCy's C-level data structures and
|
||||||
|
| interfaces, intended for use from Cython. Some of the attributes are
|
||||||
|
| primarily for internal use, and all C-level functions and methods are
|
||||||
|
| designed for speed over safety – if you make a mistake and access an
|
||||||
|
| array out-of-bounds, the program may crash abruptly.
|
||||||
|
|
||||||
|
p
|
||||||
|
| With Cython there are four ways of declaring complex data types.
|
||||||
|
| Unfortunately we use all four in different places, as they all have
|
||||||
|
| different utility:
|
||||||
|
|
||||||
|
+table(["Declaration", "Description", "Example"])
|
||||||
|
+row
|
||||||
|
+cell #[code class]
|
||||||
|
+cell A normal Python class.
|
||||||
|
+cell #[+api("language") #[code Language]]
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code cdef class]
|
||||||
|
+cell
|
||||||
|
| A Python extension type. Differs from a normal Python class
|
||||||
|
| in that its attributes can be defined on the underlying
|
||||||
|
| struct. Can have C-level objects as attributes (notably
|
||||||
|
| structs and pointers), and can have methods which have
|
||||||
|
| C-level objects as arguments or return types.
|
||||||
|
+cell #[+api("cython-classes#lexeme") #[code Lexeme]]
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code cdef struct]
|
||||||
|
+cell
|
||||||
|
| A struct is just a collection of variables, sort of like a
|
||||||
|
| named tuple, except the memory is contiguous. Structs can't
|
||||||
|
| have methods, only attributes.
|
||||||
|
+cell #[+api("cython-structs#lexemec") #[code LexemeC]]
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code cdef cppclass]
|
||||||
|
+cell
|
||||||
|
| A C++ class. Like a struct, this can be allocated on the
|
||||||
|
| stack, but can have methods, a constructor and a destructor.
|
||||||
|
| Differs from #[code cdef class] in that it can be created and
|
||||||
|
| destroyed without acquiring the Python global interpreter
|
||||||
|
| lock. This style is the most obscure.
|
||||||
|
+cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]
|
||||||
|
|
||||||
|
p
|
||||||
|
| The most important classes in spaCy are defined as #[code cdef class]
|
||||||
|
| objects. The underlying data for these objects is usually gathered
|
||||||
|
| into a struct, which is usually named #[code c]. For instance, the
|
||||||
|
| #[+api("cython-classes#lexeme") #[code Lexeme]] class holds a
|
||||||
|
| #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
|
||||||
|
| #[code Lexeme.c]. This lets you shed the Python container, and pass
|
||||||
|
| a pointer to the underlying data into C-level functions.
|
||||||
|
|
||||||
|
+section("conventions")
|
||||||
|
+h(2, "conventions") Conventions
|
||||||
|
|
||||||
|
p
|
||||||
|
| spaCy's core data structures are implemented as
|
||||||
|
| #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
|
||||||
|
| managed through the #[+a(gh("cymem")) #[code cymem]] library's
|
||||||
|
| #[code cymem.Pool] class, which allows you
|
||||||
|
| to allocate memory which will be freed when the #[code Pool] object
|
||||||
|
| is garbage collected. This means you usually don't have to worry
|
||||||
|
| about freeing memory. You just have to decide which Python object
|
||||||
|
| owns the memory, and make it own the #[code Pool]. When that object
|
||||||
|
| goes out of scope, the memory will be freed. You do have to take
|
||||||
|
| care that no pointers outlive the object that owns them — but this
|
||||||
|
| is generally quite easy.
|
||||||
|
|
||||||
|
p
|
||||||
|
| All Cython modules should have the #[code # cython: infer_types=True]
|
||||||
|
| compiler directive at the top of the file. This makes the code much
|
||||||
|
| cleaner, as it avoids the need for many type declarations. If
|
||||||
|
| possible, you should prefer to declare your functions #[code nogil],
|
||||||
|
| even if you don't especially care about multi-threading. The reason
|
||||||
|
| is that #[code nogil] functions help the Cython compiler reason about
|
||||||
|
| your code quite a lot — you're telling the compiler that no Python
|
||||||
|
| dynamics are possible. This lets the compiler raise many more errors, and ensures
|
||||||
|
| your function will run at C speed.
|
||||||
|
|
||||||
|
|
||||||
|
p
|
||||||
|
| Cython gives you many choices of sequences: you could have a Python
|
||||||
|
| list, a numpy array, a memory view, a C++ vector, or a pointer.
|
||||||
|
| Pointers are preferred, because they are fastest, have the most
|
||||||
|
| explicit semantics, and let the compiler check your code more
|
||||||
|
| strictly. C++ vectors are also great — but you should only use them
|
||||||
|
| internally in functions. It's less friendly to accept a vector as an
|
||||||
|
| argument, because that asks the user to do much more work. Here's
|
||||||
|
| how to get a pointer from a numpy array, memory view or vector:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
|
||||||
|
pointer1 = <int*>numpy_array.data
|
||||||
|
pointer2 = cpp_vector.data()
|
||||||
|
pointer3 = &memory_view[0]
|
||||||
|
|
||||||
|
p
|
||||||
|
| Both C arrays and C++ vectors reassure the compiler that no Python
|
||||||
|
| operations are possible on your variable. This is a big advantage:
|
||||||
|
| it lets the Cython compiler raise many more errors for you.
|
||||||
|
|
||||||
|
p
|
||||||
|
| When getting a pointer from a numpy array or memoryview, take care
|
||||||
|
| that the data is actually stored in C-contiguous order — otherwise
|
||||||
|
| you'll get a pointer to nonsense. The type-declarations in the code
|
||||||
|
| above should generate runtime errors if buffers with incorrect
|
||||||
|
| memory layouts are passed in. To iterate over the array, the
|
||||||
|
| following style is preferred:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
cdef int c_total(const int* int_array, int length) nogil:
|
||||||
|
total = 0
|
||||||
|
for item in int_array[:length]:
|
||||||
|
total += item
|
||||||
|
return total
|
||||||
|
|
||||||
|
p
|
||||||
|
| If this is confusing, consider that the compiler couldn't deal with
|
||||||
|
| #[code for item in int_array:] — there's no length attached to a raw
|
||||||
|
| pointer, so how could we figure out where to stop? The length is
|
||||||
|
| provided in the slice notation as a solution to this. Note that we
|
||||||
|
| don't have to declare the type of #[code item] in the code above —
|
||||||
|
| the compiler can easily infer it. This gives us tidy code that looks
|
||||||
|
| quite like Python, but is exactly as fast as C — because we've made
|
||||||
|
| sure the compilation to C is trivial.
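
The earlier caveat about C-contiguous order can be checked from plain Python before a buffer is handed to Cython. The sketch below is only an illustration using the standard library's memoryview and array types, not spaCy code; the variable names are made up for the example. A strided view skips elements, so a raw pointer to its first item, walked one element at a time, would read the wrong values.

```python
from array import array

# Six C ints stored back to back in one block of memory.
values = array("i", [1, 2, 3, 4, 5, 6])

dense = memoryview(values)          # every element: C-contiguous
strided = memoryview(values)[::2]   # every second element: not contiguous

print(dense.c_contiguous)     # True
print(strided.c_contiguous)   # False

# What the strided view means logically...
print(strided.tolist())       # [1, 3, 5]
# ...versus what a raw pointer walk over the same memory would see.
print(dense.tolist()[:3])     # [1, 2, 3]
```

Declaring the argument as a contiguous buffer, as the type declarations above do, is what turns this silent mismatch into an error.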
|
||||||
|
|
||||||
|
p
|
||||||
|
| Your functions cannot be declared #[code nogil] if they need to
|
||||||
|
| create Python objects or call Python functions. This is perfectly
|
||||||
|
| okay — you shouldn't torture your code just to get #[code nogil]
|
||||||
|
| functions. However, if your function isn't #[code nogil], you should
|
||||||
|
| compile your module with #[code cython -a --cplus my_module.pyx] and
|
||||||
|
| open the resulting #[code my_module.html] file in a browser. This
|
||||||
|
| will let you see how Cython is compiling your code. Calls into the
|
||||||
|
| Python run-time will be in bright yellow. This lets you easily see
|
||||||
|
| whether Cython is able to correctly type your code, or whether there
|
||||||
|
| are unexpected problems.
|
||||||
|
|
||||||
|
p
|
||||||
|
| Working in Cython is very rewarding once you're over the initial
|
||||||
|
| learning curve. As with C and C++, the first way you write something
|
||||||
|
| in Cython will often be the performance-optimal approach. In
|
||||||
|
| contrast, Python optimisation generally requires a lot of
|
||||||
|
| experimentation. Is it faster to have an #[code if item in my_dict]
|
||||||
|
| check, or to use #[code .get()]? What about
|
||||||
|
| #[code try]/#[code except]? Does this numpy operation create a copy?
|
||||||
|
| There's no way to guess the answers to these questions, and you'll
|
||||||
|
| usually be dissatisfied with your results — so there's no way to
|
||||||
|
| know when to stop this process. In the worst case, you'll make a
|
||||||
|
| mess that invites the next reader to try their luck too. This is
|
||||||
|
| like one of those
|
||||||
|
| #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
|
||||||
|
| where the rescuers keep passing out from low oxygen, causing
|
||||||
|
| another rescuer to follow — only to succumb themselves. In short,
|
||||||
|
| just say no to optimizing your Python. If it's not fast enough the
|
||||||
|
| first time, just switch to Cython.
|
||||||
|
|
||||||
|
+infobox("Resources")
|
||||||
|
+list.o-no-block
|
||||||
|
+item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
|
||||||
|
+item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
|
||||||
|
+item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
|
|
@ -7,8 +7,151 @@ include ../_includes/_mixins
|
||||||
|
|
||||||
+section("nn-model")
|
+section("nn-model")
|
||||||
+h(2, "nn-model") Neural network model architecture
|
+h(2, "nn-model") Neural network model architecture
|
||||||
include _architecture/_nn-model
|
|
||||||
|
|
||||||
+section("cython")
|
p
|
||||||
+h(2, "cython") Cython conventions
|
| spaCy's statistical models have been custom-designed to give a
|
||||||
include _architecture/_cython
|
| high-performance mix of speed and accuracy. The current architecture
|
||||||
|
| hasn't been published yet, but in the meantime we prepared a video that
|
||||||
|
| explains how the models work, with particular focus on NER.
|
||||||
|
|
||||||
|
+youtube("sqDHBH9IjRU")
|
||||||
|
|
||||||
|
p
|
||||||
|
| The parsing model is a blend of recent results. The two recent
|
||||||
|
| inspirations have been the work of Eli Kiperwasser and Yoav Goldberg at
|
||||||
|
| Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
|
||||||
|
| the parser is still based on the work of Joakim Nivre#[+fn(2)], who
|
||||||
|
| introduced the transition-based framework#[+fn(3)], the arc-eager
|
||||||
|
| transition system, and the imitation learning objective. The model is
|
||||||
|
| implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
|
||||||
|
| library. We first predict context-sensitive vectors for each word in the
|
||||||
|
| input:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
(embed_lower | embed_prefix | embed_suffix | embed_shape)
|
||||||
|
>> Maxout(token_width)
|
||||||
|
>> convolution ** 4
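
To make the operator notation above easier to read, here is a toy sketch of how such operators could be overloaded in plain Python. It is not Thinc's API: the Layer class and the example layers are hypothetical, and real layers would hold trainable weights. It assumes that | concatenates the outputs of two layers, >> feeds one layer into the next, and ** repeats a layer.

```python
from functools import reduce


class Layer:
    """Hypothetical wrapper around a function from a list of floats to a list of floats."""

    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

    def __or__(self, other):
        # a | b: run both layers on the same input and concatenate the outputs.
        return Layer(lambda x: self(x) + other(x))

    def __rshift__(self, other):
        # a >> b: feed the output of a into b.
        return Layer(lambda x: other(self(x)))

    def __pow__(self, n):
        # a ** n: chain n copies of a (this toy version reuses the same function).
        return reduce(lambda left, right: left >> right, [self] * n)


double = Layer(lambda x: [2 * v for v in x])
negate = Layer(lambda x: [-v for v in x])
total = Layer(lambda x: [sum(x)])
add_one = Layer(lambda x: [v + 1 for v in x])

# Same shape as the expression above: concatenate, transform, then repeat a layer 4 times.
model = (double | negate) >> total >> add_one ** 4
print(model([1.0, 2.0]))  # [7.0]
```

Python's precedence rules explain the parentheses: ** binds tighter than >>, which in turn binds tighter than |, so the concatenated embeddings have to be grouped explicitly.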
|
||||||
|
|
||||||
|
p
|
||||||
|
| This convolutional layer is shared between the tagger, parser and NER,
|
||||||
|
| and will also be shared by the future neural lemmatizer. Because the
|
||||||
|
| parser shares these layers with the tagger, the parser does not require
|
||||||
|
| tag features. I got this trick from David Weiss's "Stack-propagation"
|
||||||
|
| paper#[+fn(4)].
|
||||||
|
|
||||||
|
p
|
||||||
|
| To boost the representation, the tagger actually predicts a "super tag"
|
||||||
|
| with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
|
||||||
|
| these supertags by adding a softmax layer onto the convolutional layer –
|
||||||
|
| so, we're teaching the convolutional layer to give us a representation
|
||||||
|
| that's one affine transform from this informative lexical information.
|
||||||
|
| This is obviously good for the parser (which backprops to the
|
||||||
|
| convolutions too). The parser model makes a state vector by concatenating
|
||||||
|
| the vector representations for its context tokens. The current context
|
||||||
|
| tokens:
|
||||||
|
|
||||||
|
+table
|
||||||
|
+row
|
||||||
|
+cell #[code S0], #[code S1], #[code S2]
|
||||||
|
+cell Top three words on the stack.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code B0], #[code B1]
|
||||||
|
+cell First two words of the buffer.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell
|
||||||
|
| #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
|
||||||
|
| #[code B1L1]#[br]
|
||||||
|
| #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
|
||||||
|
| #[code B1L2]
|
||||||
|
+cell
|
||||||
|
| Leftmost and second leftmost children of #[code S0], #[code S1],
|
||||||
|
| #[code S2], #[code B0] and #[code B1].
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell
|
||||||
|
| #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
|
||||||
|
| #[code B1R1]#[br]
|
||||||
|
| #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
|
||||||
|
| #[code B1R2]
|
||||||
|
+cell
|
||||||
|
| Rightmost and second rightmost children of #[code S0], #[code S1],
|
||||||
|
| #[code S2], #[code B0] and #[code B1].
|
||||||
|
|
||||||
|
p
|
||||||
|
| This makes the state vector quite long: #[code 13*T], where #[code T] is
|
||||||
|
| the token vector width (128 is working well). Fortunately, there's a way
|
||||||
|
| to structure the computation to save some expense (and make it more
|
||||||
|
| GPU-friendly).
|
||||||
|
|
||||||
|
p
|
||||||
|
| The parser typically visits #[code 2*N] states for a sentence of length
|
||||||
|
| #[code N] (although it may visit more, if it back-tracks with a
|
||||||
|
| non-monotonic transition#[+fn(6)]). A naive implementation would require
|
||||||
|
| #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
|
||||||
|
| size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
|
||||||
|
| multiplication, to pre-compute the hidden weights for each positional
|
||||||
|
| feature with respect to the words in the batch. (Note that our token
|
||||||
|
| vectors come from the CNN — so we can't play this trick over the
|
||||||
|
| vocabulary. That's how Stanford's NN parser#[+fn(7)] works, and why its
|
||||||
|
| model is so big.)
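
The pre-computation described above can be checked with a small pure-Python sketch. This is an illustration under stated assumptions rather than spaCy's implementation: it uses a single sentence, tiny made-up sizes, random weights and a random choice of context tokens, and it ignores padding for empty slots. The hidden-layer weights are kept as one (T, H) block per context slot; stacking the blocks vertically gives the (13*T, H) matrix, and placing them side by side gives the (T, 13*H) matrix from the text.

```python
import random

random.seed(0)
T, H, F = 4, 3, 13   # token width, hidden width, number of context slots
N = 6                # tokens in the sentence


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


tokens = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(N)]  # (N, T)
blocks = [[[random.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
          for _ in range(F)]                                            # F x (T, H)

# One-off pre-computation: every token against every slot's weight block.
precomputed = [matmul(tokens, block) for block in blocks]               # F x (N, H)


def hidden_naive(context):
    """Concatenate the context vectors and multiply by the stacked weights."""
    state = [v for i in context for v in tokens[i]]        # length 13*T
    stacked = [row for block in blocks for row in block]   # (13*T, H)
    return matmul([state], stacked)[0]


def hidden_fast(context):
    """Sum pre-computed rows instead of doing a fresh multiplication."""
    out = [0.0] * H
    for slot, i in enumerate(context):
        for h in range(H):
            out[h] += precomputed[slot][i][h]
    return out


# Hypothetical parser state: one token index per context slot.
context = [random.randrange(N) for _ in range(F)]
naive, fast = hidden_naive(context), hidden_fast(context)
print(all(abs(a - b) < 1e-9 for a, b in zip(naive, fast)))  # True
```

Each visited state then costs 13 row lookups and additions instead of a full matrix multiplication, which is why the expensive part can be done once per batch.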
|
||||||
|
|
||||||
|
p
|
||||||
|
| This pre-computation strategy allows a nice compromise between
|
||||||
|
| GPU-friendliness and implementation simplicity. The CNN and the wide
|
||||||
|
| lower layer are computed on the GPU, and then the precomputed hidden
|
||||||
|
| weights are moved to the CPU, before we start the transition-based
|
||||||
|
| parsing process. This makes a lot of things much easier. We don't have to
|
||||||
|
| worry about variable-length batch sizes, and we don't have to implement
|
||||||
|
| the dynamic oracle in CUDA to train.
|
||||||
|
|
||||||
|
p
|
||||||
|
| Currently the parser's loss function is multilabel log loss#[+fn(8)], as
|
||||||
|
| the dynamic oracle allows multiple states to be 0 cost. This is defined
|
||||||
|
| as follows, where #[code Z] is the sum of the exponentiated scores over all classes and #[code gZ] is the same sum restricted to gold
|
||||||
|
| classes:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
(exp(score) / Z) - (exp(score) / gZ)
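
As a worked example of the expression above, the sketch below computes it for a handful of scores. It rests on assumptions that the text does not spell out: Z is taken to be the sum of exp(score) over all classes, gZ the same sum over the gold (zero-cost) classes only, and the second term is applied only to gold classes. The function name and the inputs are hypothetical.

```python
import math


def multilabel_log_loss_grad(scores, is_gold):
    """Per-class value of exp(score)/Z - exp(score)/gZ, with the second term gold-only.

    Assumes at least one class is marked as gold.
    """
    # Subtracting the maximum keeps exp() stable; it cancels in both ratios.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)
    return [e / Z - (e / gZ if gold else 0.0) for e, gold in zip(exps, is_gold)]


scores = [2.0, 1.0, 0.5]
is_gold = [True, False, True]   # two zero-cost actions in this state
grad = multilabel_log_loss_grad(scores, is_gold)

print(abs(sum(grad)) < 1e-12)            # True: the two distributions both sum to 1
print(grad[0] < 0, grad[1] > 0, grad[2] < 0)
```

Under these assumptions the values are negative for gold classes and positive for the rest, so a gradient step raises the scores of all zero-cost classes together rather than picking one of them.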
|
||||||
|
|
||||||
|
+bibliography
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
|
||||||
|
br
|
||||||
|
| Eliyahu Kiperwasser, Yoav Goldberg. (2016)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
|
||||||
|
br
|
||||||
|
| Yoav Goldberg, Joakim Nivre (2012)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
|
||||||
|
br
|
||||||
|
| Matthew Honnibal (2013)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
|
||||||
|
br
|
||||||
|
| Yuan Zhang, David Weiss (2016)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
|
||||||
|
br
|
||||||
|
| Anders Søgaard, Yoav Goldberg (2016)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
|
||||||
|
br
|
||||||
|
| Matthew Honnibal, Mark Johnson (2015)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
|
||||||
|
br
|
||||||
|
| Danqi Chen, Christopher D. Manning (2014)
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
|
||||||
|
br
|
||||||
|
| Stefan Riezler et al. (2002)
|
||||||
|
|
|
@ -573,15 +573,15 @@ p The L2 norm of the token's vector representation.
|
||||||
+cell #[code ent_id]
|
+cell #[code ent_id]
|
||||||
+cell int
|
+cell int
|
||||||
+cell
|
+cell
|
||||||
| ID of the entity the token is an instance of, if any. Usually
|
| ID of the entity the token is an instance of, if any. Currently
|
||||||
| assigned by patterns in the Matcher.
|
| not used, but potentially for coreference resolution.
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code ent_id_]
|
+cell #[code ent_id_]
|
||||||
+cell unicode
|
+cell unicode
|
||||||
+cell
|
+cell
|
||||||
| ID of the entity the token is an instance of, if any. Usually
|
| ID of the entity the token is an instance of, if any. Currently
|
||||||
| assigned by patterns in the Matcher.
|
| not used, but potentially for coreference resolution.
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code lemma]
|
+cell #[code lemma]
|
||||||
|
|
|
@ -231,3 +231,19 @@
|
||||||
border: none
|
border: none
|
||||||
text-align-last: center
|
text-align-last: center
|
||||||
width: 100%
|
width: 100%
|
||||||
|
|
||||||
|
//- Abbreviations
|
||||||
|
|
||||||
|
.o-abbr
|
||||||
|
+breakpoint(min, md)
|
||||||
|
cursor: help
|
||||||
|
border-bottom: 2px dotted $color-theme
|
||||||
|
padding-bottom: 3px
|
||||||
|
|
||||||
|
+breakpoint(max, sm)
|
||||||
|
&[data-tooltip]:before
|
||||||
|
content: none
|
||||||
|
|
||||||
|
&:after
|
||||||
|
content: " (" attr(aria-label) ")"
|
||||||
|
color: $color-subtle-dark
|
||||||
|
|
|
@ -47,7 +47,10 @@ import initUniverse from './universe.vue.js';
|
||||||
*/
|
*/
|
||||||
{
|
{
|
||||||
if (window.Juniper) {
|
if (window.Juniper) {
|
||||||
new Juniper({ repo: 'ines/spacy-io-binder' });
|
new Juniper({
|
||||||
|
repo: 'ines/spacy-io-binder',
|
||||||
|
storageExpire: 60
|
||||||
|
});
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -58,8 +61,13 @@ import initUniverse from './universe.vue.js';
|
||||||
const sectionAttr = 'data-section';
|
const sectionAttr = 'data-section';
|
||||||
const navAttr = 'data-nav';
|
const navAttr = 'data-nav';
|
||||||
const activeClass = 'is-active';
|
const activeClass = 'is-active';
|
||||||
|
const sidebarAttr = 'data-sidebar-active';
|
||||||
const sections = [...document.querySelectorAll(`[${navAttr}]`)];
|
const sections = [...document.querySelectorAll(`[${navAttr}]`)];
|
||||||
|
const currentItem = document.querySelector(`[${sidebarAttr}]`);
|
||||||
if (window.inView) {
|
if (window.inView) {
|
||||||
|
if (currentItem && Element.prototype.scrollIntoView && !inView.is(currentItem)) {
|
||||||
|
currentItem.scrollIntoView();
|
||||||
|
}
|
||||||
if (sections.length) { // highlight first item regardless
|
if (sections.length) { // highlight first item regardless
|
||||||
sections[0].classList.add(activeClass);
|
sections[0].classList.add(activeClass);
|
||||||
}
|
}
|
||||||
|
@ -69,6 +77,9 @@ import initUniverse from './universe.vue.js';
|
||||||
if (el) {
|
if (el) {
|
||||||
sections.forEach(el => el.classList.remove(activeClass));
|
sections.forEach(el => el.classList.remove(activeClass));
|
||||||
el.classList.add(activeClass);
|
el.classList.add(activeClass);
|
||||||
|
if (Element.prototype.scrollIntoView && !inView.is(el)) {
|
||||||
|
el.scrollIntoView();
|
||||||
|
}
|
||||||
}
|
}
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
4
website/assets/js/vendor/juniper.min.js
vendored
File diff suppressed because one or more lines are too long
2
website/assets/js/vendor/prism.min.js
vendored
|
@ -16,7 +16,7 @@ Prism.languages.json={property:/".*?"(?=\s*:)/gi,string:/"(?!:)(\\?[^"])*?"(?!:)
|
||||||
!function(a){var e=/\\([^a-z()[\]]|[a-z\*]+)/i,n={"equation-command":{pattern:e,alias:"regex"}};a.languages.latex={comment:/%.*/m,cdata:{pattern:/(\\begin\{((?:verbatim|lstlisting)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0},equation:[{pattern:/\$(?:\\?[\w\W])*?\$|\\\((?:\\?[\w\W])*?\\\)|\\\[(?:\\?[\w\W])*?\\\]/,inside:n,alias:"string"},{pattern:/(\\begin\{((?:equation|math|eqnarray|align|multline|gather)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0,inside:n,alias:"string"}],keyword:{pattern:/(\\(?:begin|end|ref|cite|label|usepackage|documentclass)(?:\[[^\]]+\])?\{)[^}]+(?=\})/,lookbehind:!0},url:{pattern:/(\\url\{)[^}]+(?=\})/,lookbehind:!0},headline:{pattern:/(\\(?:part|chapter|section|subsection|frametitle|subsubsection|paragraph|subparagraph|subsubparagraph|subsubsubparagraph)\*?(?:\[[^\]]+\])?\{)[^}]+(?=\}(?:\[[^\]]+\])?)/,lookbehind:!0,alias:"class-name"},"function":{pattern:e,alias:"selector"},punctuation:/[[\]{}&]/}}(Prism);
|
!function(a){var e=/\\([^a-z()[\]]|[a-z\*]+)/i,n={"equation-command":{pattern:e,alias:"regex"}};a.languages.latex={comment:/%.*/m,cdata:{pattern:/(\\begin\{((?:verbatim|lstlisting)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0},equation:[{pattern:/\$(?:\\?[\w\W])*?\$|\\\((?:\\?[\w\W])*?\\\)|\\\[(?:\\?[\w\W])*?\\\]/,inside:n,alias:"string"},{pattern:/(\\begin\{((?:equation|math|eqnarray|align|multline|gather)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0,inside:n,alias:"string"}],keyword:{pattern:/(\\(?:begin|end|ref|cite|label|usepackage|documentclass)(?:\[[^\]]+\])?\{)[^}]+(?=\})/,lookbehind:!0},url:{pattern:/(\\url\{)[^}]+(?=\})/,lookbehind:!0},headline:{pattern:/(\\(?:part|chapter|section|subsection|frametitle|subsubsection|paragraph|subparagraph|subsubparagraph|subsubsubparagraph)\*?(?:\[[^\]]+\])?\{)[^}]+(?=\}(?:\[[^\]]+\])?)/,lookbehind:!0,alias:"class-name"},"function":{pattern:e,alias:"selector"},punctuation:/[[\]{}&]/}}(Prism);
|
||||||
Prism.languages.makefile={comment:{pattern:/(^|[^\\])#(?:\\(?:\r\n|[\s\S])|.)*/,lookbehind:!0},string:/(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1/,builtin:/\.[A-Z][^:#=\s]+(?=\s*:(?!=))/,symbol:{pattern:/^[^:=\r\n]+(?=\s*:(?!=))/m,inside:{variable:/\$+(?:[^(){}:#=\s]+|(?=[({]))/}},variable:/\$+(?:[^(){}:#=\s]+|\([@*%<^+?][DF]\)|(?=[({]))/,keyword:[/-include\b|\b(?:define|else|endef|endif|export|ifn?def|ifn?eq|include|override|private|sinclude|undefine|unexport|vpath)\b/,{pattern:/(\()(?:addsuffix|abspath|and|basename|call|dir|error|eval|file|filter(?:-out)?|findstring|firstword|flavor|foreach|guile|if|info|join|lastword|load|notdir|or|origin|patsubst|realpath|shell|sort|strip|subst|suffix|value|warning|wildcard|word(?:s|list)?)(?=[ \t])/,lookbehind:!0}],operator:/(?:::|[?:+!])?=|[|@]/,punctuation:/[:;(){}]/};
|
Prism.languages.makefile={comment:{pattern:/(^|[^\\])#(?:\\(?:\r\n|[\s\S])|.)*/,lookbehind:!0},string:/(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1/,builtin:/\.[A-Z][^:#=\s]+(?=\s*:(?!=))/,symbol:{pattern:/^[^:=\r\n]+(?=\s*:(?!=))/m,inside:{variable:/\$+(?:[^(){}:#=\s]+|(?=[({]))/}},variable:/\$+(?:[^(){}:#=\s]+|\([@*%<^+?][DF]\)|(?=[({]))/,keyword:[/-include\b|\b(?:define|else|endef|endif|export|ifn?def|ifn?eq|include|override|private|sinclude|undefine|unexport|vpath)\b/,{pattern:/(\()(?:addsuffix|abspath|and|basename|call|dir|error|eval|file|filter(?:-out)?|findstring|firstword|flavor|foreach|guile|if|info|join|lastword|load|notdir|or|origin|patsubst|realpath|shell|sort|strip|subst|suffix|value|warning|wildcard|word(?:s|list)?)(?=[ \t])/,lookbehind:!0}],operator:/(?:::|[?:+!])?=|[|@]/,punctuation:/[:;(){}]/};
|
||||||
Prism.languages.markdown=Prism.languages.extend("markup",{}),Prism.languages.insertBefore("markdown","prolog",{blockquote:{pattern:/^>(?:[\t ]*>)*/m,alias:"punctuation"},code:[{pattern:/^(?: {4}|\t).+/m,alias:"keyword"},{pattern:/``.+?``|`[^`\n]+`/,alias:"keyword"}],title:[{pattern:/\w+.*(?:\r?\n|\r)(?:==+|--+)/,alias:"important",inside:{punctuation:/==+$|--+$/}},{pattern:/(^\s*)#+.+/m,lookbehind:!0,alias:"important",inside:{punctuation:/^#+|#+$/}}],hr:{pattern:/(^\s*)([*-])([\t ]*\2){2,}(?=\s*$)/m,lookbehind:!0,alias:"punctuation"},list:{pattern:/(^\s*)(?:[*+-]|\d+\.)(?=[\t ].)/m,lookbehind:!0,alias:"punctuation"},"url-reference":{pattern:/!?\[[^\]]+\]:[\t ]+(?:\S+|<(?:\\.|[^>\\])+>)(?:[\t ]+(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\)))?/,inside:{variable:{pattern:/^(!?\[)[^\]]+/,lookbehind:!0},string:/(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\))$/,punctuation:/^[\[\]!:]|[<>]/},alias:"url"},bold:{pattern:/(^|[^\\])(\*\*|__)(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^\*\*|^__|\*\*$|__$/}},italic:{pattern:/(^|[^\\])([*_])(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^[*_]|[*_]$/}},url:{pattern:/!?\[[^\]]+\](?:\([^\s)]+(?:[\t ]+"(?:\\.|[^"\\])*")?\)| ?\[[^\]\n]*\])/,inside:{variable:{pattern:/(!?\[)[^\]]+(?=\]$)/,lookbehind:!0},string:{pattern:/"(?:\\.|[^"\\])*"(?=\)$)/}}}}),Prism.languages.markdown.bold.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.italic.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.bold.inside.italic=Prism.util.clone(Prism.languages.markdown.italic),Prism.languages.markdown.italic.inside.bold=Prism.util.clone(Prism.languages.markdown.bold);
|
Prism.languages.markdown=Prism.languages.extend("markup",{}),Prism.languages.insertBefore("markdown","prolog",{blockquote:{pattern:/^>(?:[\t ]*>)*/m,alias:"punctuation"},code:[{pattern:/^(?: {4}|\t).+/m,alias:"keyword"},{pattern:/``.+?``|`[^`\n]+`/,alias:"keyword"}],title:[{pattern:/\w+.*(?:\r?\n|\r)(?:==+|--+)/,alias:"important",inside:{punctuation:/==+$|--+$/}},{pattern:/(^\s*)#+.+/m,lookbehind:!0,alias:"important",inside:{punctuation:/^#+|#+$/}}],hr:{pattern:/(^\s*)([*-])([\t ]*\2){2,}(?=\s*$)/m,lookbehind:!0,alias:"punctuation"},list:{pattern:/(^\s*)(?:[*+-]|\d+\.)(?=[\t ].)/m,lookbehind:!0,alias:"punctuation"},"url-reference":{pattern:/!?\[[^\]]+\]:[\t ]+(?:\S+|<(?:\\.|[^>\\])+>)(?:[\t ]+(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\)))?/,inside:{variable:{pattern:/^(!?\[)[^\]]+/,lookbehind:!0},string:/(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\))$/,punctuation:/^[\[\]!:]|[<>]/},alias:"url"},bold:{pattern:/(^|[^\\])(\*\*|__)(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^\*\*|^__|\*\*$|__$/}},italic:{pattern:/(^|[^\\])([*_])(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^[*_]|[*_]$/}},url:{pattern:/!?\[[^\]]+\](?:\([^\s)]+(?:[\t ]+"(?:\\.|[^"\\])*")?\)| ?\[[^\]\n]*\])/,inside:{variable:{pattern:/(!?\[)[^\]]+(?=\]$)/,lookbehind:!0},string:{pattern:/"(?:\\.|[^"\\])*"(?=\)$)/}}}}),Prism.languages.markdown.bold.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.italic.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.bold.inside.italic=Prism.util.clone(Prism.languages.markdown.italic),Prism.languages.markdown.italic.inside.bold=Prism.util.clone(Prism.languages.markdown.bold);
|
||||||
Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
|
Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield|cimport)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
|
||||||
Prism.languages.rest={table:[{pattern:/(\s*)(?:\+[=-]+)+\+(?:\r?\n|\r)(?:\1(?:[+|].+)+[+|](?:\r?\n|\r))+\1(?:\+[=-]+)+\+/,lookbehind:!0,inside:{punctuation:/\||(?:\+[=-]+)+\+/}},{pattern:/(\s*)(?:=+ +)+=+((?:\r?\n|\r)\1.+)+(?:\r?\n|\r)\1(?:=+ +)+=+(?=(?:\r?\n|\r){2}|\s*$)/,lookbehind:!0,inside:{punctuation:/[=-]+/}}],"substitution-def":{pattern:/(^\s*\.\. )\|(?:[^|\s](?:[^|]*[^|\s])?)\| [^:]+::/m,lookbehind:!0,inside:{substitution:{pattern:/^\|(?:[^|\s]|[^|\s][^|]*[^|\s])\|/,alias:"attr-value",inside:{punctuation:/^\||\|$/}},directive:{pattern:/( +)[^:]+::/,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}}}},"link-target":[{pattern:/(^\s*\.\. )\[[^\]]+\]/m,lookbehind:!0,alias:"string",inside:{punctuation:/^\[|\]$/}},{pattern:/(^\s*\.\. )_(?:`[^`]+`|(?:[^:\\]|\\.)+):/m,lookbehind:!0,alias:"string",inside:{punctuation:/^_|:$/}}],directive:{pattern:/(^\s*\.\. )[^:]+::/m,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}},comment:{pattern:/(^\s*\.\.)(?:(?: .+)?(?:(?:\r?\n|\r).+)+| .+)(?=(?:\r?\n|\r){2}|$)/m,lookbehind:!0},title:[{pattern:/^(([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+)(?:\r?\n|\r).+(?:\r?\n|\r)\1$/m,inside:{punctuation:/^[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+|[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}},{pattern:/(^|(?:\r?\n|\r){2}).+(?:\r?\n|\r)([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+(?=\r?\n|\r|$)/,lookbehind:!0,inside:{punctuation:/[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}}],hr:{pattern:/((?:\r?\n|\r){2})([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2{3,}(?=(?:\r?\n|\r){2})/,lookbehind:!0,alias:"punctuation"},field:{pattern:/(^\s*):[^:\r\n]+:(?= )/m,lookbehind:!0,alias:"attr-name"},"command-line-option":{pattern:/(^\s*)(?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?(?:, (?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?)*(?=(?:\r?\n|\r)? 
{2,}\S)/im,lookbehind:!0,alias:"symbol"},"literal-block":{pattern:/::(?:\r?\n|\r){2}([ \t]+).+(?:(?:\r?\n|\r)\1.+)*/,inside:{"literal-block-punctuation":{pattern:/^::/,alias:"punctuation"}}},"quoted-literal-block":{pattern:/::(?:\r?\n|\r){2}([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]).*(?:(?:\r?\n|\r)\1.*)*/,inside:{"literal-block-punctuation":{pattern:/^(?:::|([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\1*)/m,alias:"punctuation"}}},"list-bullet":{pattern:/(^\s*)(?:[*+\-•‣⁃]|\(?(?:\d+|[a-z]|[ivxdclm]+)\)|(?:\d+|[a-z]|[ivxdclm]+)\.)(?= )/im,lookbehind:!0,alias:"punctuation"},"doctest-block":{pattern:/(^\s*)>>> .+(?:(?:\r?\n|\r).+)*/m,lookbehind:!0,inside:{punctuation:/^>>>/}},inline:[{pattern:/(^|[\s\-:\/'"<(\[{])(?::[^:]+:`.*?`|`.*?`:[^:]+:|(\*\*?|``?|\|)(?!\s).*?[^\s]\2(?=[\s\-.,:;!?\\\/'")\]}]|$))/m,lookbehind:!0,inside:{bold:{pattern:/(^\*\*).+(?=\*\*$)/,lookbehind:!0},italic:{pattern:/(^\*).+(?=\*$)/,lookbehind:!0},"inline-literal":{pattern:/(^``).+(?=``$)/,lookbehind:!0,alias:"symbol"},role:{pattern:/^:[^:]+:|:[^:]+:$/,alias:"function",inside:{punctuation:/^:|:$/}},"interpreted-text":{pattern:/(^`).+(?=`$)/,lookbehind:!0,alias:"attr-value"},substitution:{pattern:/(^\|).+(?=\|$)/,lookbehind:!0,alias:"attr-value"},punctuation:/\*\*?|``?|\|/}}],link:[{pattern:/\[[^\]]+\]_(?=[\s\-.,:;!?\\\/'")\]}]|$)/,alias:"string",inside:{punctuation:/^\[|\]_$/}},{pattern:/(?:\b[a-z\d](?:[_.:+]?[a-z\d]+)*_?_|`[^`]+`_?_|_`[^`]+`)(?=[\s\-.,:;!?\\\/'")\]}]|$)/i,alias:"string",inside:{punctuation:/^_?`|`$|`?_?_$/}}],punctuation:{pattern:/(^\s*)(?:\|(?= |$)|(?:---?|—|\.\.|__)(?= )|\.\.$)/m,lookbehind:!0}};
|
Prism.languages.rest={table:[{pattern:/(\s*)(?:\+[=-]+)+\+(?:\r?\n|\r)(?:\1(?:[+|].+)+[+|](?:\r?\n|\r))+\1(?:\+[=-]+)+\+/,lookbehind:!0,inside:{punctuation:/\||(?:\+[=-]+)+\+/}},{pattern:/(\s*)(?:=+ +)+=+((?:\r?\n|\r)\1.+)+(?:\r?\n|\r)\1(?:=+ +)+=+(?=(?:\r?\n|\r){2}|\s*$)/,lookbehind:!0,inside:{punctuation:/[=-]+/}}],"substitution-def":{pattern:/(^\s*\.\. )\|(?:[^|\s](?:[^|]*[^|\s])?)\| [^:]+::/m,lookbehind:!0,inside:{substitution:{pattern:/^\|(?:[^|\s]|[^|\s][^|]*[^|\s])\|/,alias:"attr-value",inside:{punctuation:/^\||\|$/}},directive:{pattern:/( +)[^:]+::/,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}}}},"link-target":[{pattern:/(^\s*\.\. )\[[^\]]+\]/m,lookbehind:!0,alias:"string",inside:{punctuation:/^\[|\]$/}},{pattern:/(^\s*\.\. )_(?:`[^`]+`|(?:[^:\\]|\\.)+):/m,lookbehind:!0,alias:"string",inside:{punctuation:/^_|:$/}}],directive:{pattern:/(^\s*\.\. )[^:]+::/m,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}},comment:{pattern:/(^\s*\.\.)(?:(?: .+)?(?:(?:\r?\n|\r).+)+| .+)(?=(?:\r?\n|\r){2}|$)/m,lookbehind:!0},title:[{pattern:/^(([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+)(?:\r?\n|\r).+(?:\r?\n|\r)\1$/m,inside:{punctuation:/^[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+|[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}},{pattern:/(^|(?:\r?\n|\r){2}).+(?:\r?\n|\r)([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+(?=\r?\n|\r|$)/,lookbehind:!0,inside:{punctuation:/[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}}],hr:{pattern:/((?:\r?\n|\r){2})([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2{3,}(?=(?:\r?\n|\r){2})/,lookbehind:!0,alias:"punctuation"},field:{pattern:/(^\s*):[^:\r\n]+:(?= )/m,lookbehind:!0,alias:"attr-name"},"command-line-option":{pattern:/(^\s*)(?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?(?:, (?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?)*(?=(?:\r?\n|\r)? 
{2,}\S)/im,lookbehind:!0,alias:"symbol"},"literal-block":{pattern:/::(?:\r?\n|\r){2}([ \t]+).+(?:(?:\r?\n|\r)\1.+)*/,inside:{"literal-block-punctuation":{pattern:/^::/,alias:"punctuation"}}},"quoted-literal-block":{pattern:/::(?:\r?\n|\r){2}([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]).*(?:(?:\r?\n|\r)\1.*)*/,inside:{"literal-block-punctuation":{pattern:/^(?:::|([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\1*)/m,alias:"punctuation"}}},"list-bullet":{pattern:/(^\s*)(?:[*+\-•‣⁃]|\(?(?:\d+|[a-z]|[ivxdclm]+)\)|(?:\d+|[a-z]|[ivxdclm]+)\.)(?= )/im,lookbehind:!0,alias:"punctuation"},"doctest-block":{pattern:/(^\s*)>>> .+(?:(?:\r?\n|\r).+)*/m,lookbehind:!0,inside:{punctuation:/^>>>/}},inline:[{pattern:/(^|[\s\-:\/'"<(\[{])(?::[^:]+:`.*?`|`.*?`:[^:]+:|(\*\*?|``?|\|)(?!\s).*?[^\s]\2(?=[\s\-.,:;!?\\\/'")\]}]|$))/m,lookbehind:!0,inside:{bold:{pattern:/(^\*\*).+(?=\*\*$)/,lookbehind:!0},italic:{pattern:/(^\*).+(?=\*$)/,lookbehind:!0},"inline-literal":{pattern:/(^``).+(?=``$)/,lookbehind:!0,alias:"symbol"},role:{pattern:/^:[^:]+:|:[^:]+:$/,alias:"function",inside:{punctuation:/^:|:$/}},"interpreted-text":{pattern:/(^`).+(?=`$)/,lookbehind:!0,alias:"attr-value"},substitution:{pattern:/(^\|).+(?=\|$)/,lookbehind:!0,alias:"attr-value"},punctuation:/\*\*?|``?|\|/}}],link:[{pattern:/\[[^\]]+\]_(?=[\s\-.,:;!?\\\/'")\]}]|$)/,alias:"string",inside:{punctuation:/^\[|\]_$/}},{pattern:/(?:\b[a-z\d](?:[_.:+]?[a-z\d]+)*_?_|`[^`]+`_?_|_`[^`]+`)(?=[\s\-.,:;!?\\\/'")\]}]|$)/i,alias:"string",inside:{punctuation:/^_?`|`$|`?_?_$/}}],punctuation:{pattern:/(^\s*)(?:\|(?= |$)|(?:---?|—|\.\.|__)(?= )|\.\.$)/m,lookbehind:!0}};
|
||||||
!function(e){e.languages.sass=e.languages.extend("css",{comment:{pattern:/^([ \t]*)\/[\/*].*(?:(?:\r?\n|\r)\1[ \t]+.+)*/m,lookbehind:!0}}),e.languages.insertBefore("sass","atrule",{"atrule-line":{pattern:/^(?:[ \t]*)[@+=].+/m,inside:{atrule:/(?:@[\w-]+|[+=])/m}}}),delete e.languages.sass.atrule;var a=/((\$[-_\w]+)|(#\{\$[-_\w]+\}))/i,t=[/[+*\/%]|[=!]=|<=?|>=?|\b(?:and|or|not)\b/,{pattern:/(\s+)-(?=\s)/,lookbehind:!0}];e.languages.insertBefore("sass","property",{"variable-line":{pattern:/^[ \t]*\$.+/m,inside:{punctuation:/:/,variable:a,operator:t}},"property-line":{pattern:/^[ \t]*(?:[^:\s]+ *:.*|:[^:\s]+.*)/m,inside:{property:[/[^:\s]+(?=\s*:)/,{pattern:/(:)[^:\s]+/,lookbehind:!0}],punctuation:/:/,variable:a,operator:t,important:e.languages.sass.important}}}),delete e.languages.sass.property,delete e.languages.sass.important,delete e.languages.sass.selector,e.languages.insertBefore("sass","punctuation",{selector:{pattern:/([ \t]*)\S(?:,?[^,\r\n]+)*(?:,(?:\r?\n|\r)\1[ \t]+\S(?:,?[^,\r\n]+)*)*/,lookbehind:!0}})}(Prism);
|
!function(e){e.languages.sass=e.languages.extend("css",{comment:{pattern:/^([ \t]*)\/[\/*].*(?:(?:\r?\n|\r)\1[ \t]+.+)*/m,lookbehind:!0}}),e.languages.insertBefore("sass","atrule",{"atrule-line":{pattern:/^(?:[ \t]*)[@+=].+/m,inside:{atrule:/(?:@[\w-]+|[+=])/m}}}),delete e.languages.sass.atrule;var a=/((\$[-_\w]+)|(#\{\$[-_\w]+\}))/i,t=[/[+*\/%]|[=!]=|<=?|>=?|\b(?:and|or|not)\b/,{pattern:/(\s+)-(?=\s)/,lookbehind:!0}];e.languages.insertBefore("sass","property",{"variable-line":{pattern:/^[ \t]*\$.+/m,inside:{punctuation:/:/,variable:a,operator:t}},"property-line":{pattern:/^[ \t]*(?:[^:\s]+ *:.*|:[^:\s]+.*)/m,inside:{property:[/[^:\s]+(?=\s*:)/,{pattern:/(:)[^:\s]+/,lookbehind:!0}],punctuation:/:/,variable:a,operator:t,important:e.languages.sass.important}}}),delete e.languages.sass.property,delete e.languages.sass.important,delete e.languages.sass.selector,e.languages.insertBefore("sass","punctuation",{selector:{pattern:/([ \t]*)\S(?:,?[^,\r\n]+)*(?:,(?:\r?\n|\r)\1[ \t]+\S(?:,?[^,\r\n]+)*)*/,lookbehind:!0}})}(Prism);
|
||||||
Prism.languages.scss=Prism.languages.extend("css",{comment:{pattern:/(^|[^\\])(?:\/\*[\w\W]*?\*\/|\/\/.*)/,lookbehind:!0},atrule:{pattern:/@[\w-]+(?:\([^()]+\)|[^(])*?(?=\s+[{;])/,inside:{rule:/@[\w-]+/}},url:/(?:[-a-z]+-)*url(?=\()/i,selector:{pattern:/(?=\S)[^@;\{\}\(\)]?([^@;\{\}\(\)]|&|#\{\$[-_\w]+\})+(?=\s*\{(\}|\s|[^\}]+(:|\{)[^\}]+))/m,inside:{placeholder:/%[-_\w]+/}}}),Prism.languages.insertBefore("scss","atrule",{keyword:[/@(?:if|else(?: if)?|for|each|while|import|extend|debug|warn|mixin|include|function|return|content)/i,{pattern:/( +)(?:from|through)(?= )/,lookbehind:!0}]}),Prism.languages.insertBefore("scss","property",{variable:/\$[-_\w]+|#\{\$[-_\w]+\}/}),Prism.languages.insertBefore("scss","function",{placeholder:{pattern:/%[-_\w]+/,alias:"selector"},statement:/\B!(?:default|optional)\b/i,"boolean":/\b(?:true|false)\b/,"null":/\bnull\b/,operator:{pattern:/(\s)(?:[-+*\/%]|[=!]=|<=?|>=?|and|or|not)(?=\s)/,lookbehind:!0}}),Prism.languages.scss.atrule.inside.rest=Prism.util.clone(Prism.languages.scss);
|
Prism.languages.scss=Prism.languages.extend("css",{comment:{pattern:/(^|[^\\])(?:\/\*[\w\W]*?\*\/|\/\/.*)/,lookbehind:!0},atrule:{pattern:/@[\w-]+(?:\([^()]+\)|[^(])*?(?=\s+[{;])/,inside:{rule:/@[\w-]+/}},url:/(?:[-a-z]+-)*url(?=\()/i,selector:{pattern:/(?=\S)[^@;\{\}\(\)]?([^@;\{\}\(\)]|&|#\{\$[-_\w]+\})+(?=\s*\{(\}|\s|[^\}]+(:|\{)[^\}]+))/m,inside:{placeholder:/%[-_\w]+/}}}),Prism.languages.insertBefore("scss","atrule",{keyword:[/@(?:if|else(?: if)?|for|each|while|import|extend|debug|warn|mixin|include|function|return|content)/i,{pattern:/( +)(?:from|through)(?= )/,lookbehind:!0}]}),Prism.languages.insertBefore("scss","property",{variable:/\$[-_\w]+|#\{\$[-_\w]+\}/}),Prism.languages.insertBefore("scss","function",{placeholder:{pattern:/%[-_\w]+/,alias:"selector"},statement:/\B!(?:default|optional)\b/i,"boolean":/\b(?:true|false)\b/,"null":/\bnull\b/,operator:{pattern:/(\s)(?:[-+*\/%]|[=!]=|<=?|>=?|and|or|not)(?=\s)/,lookbehind:!0}}),Prism.languages.scss.atrule.inside.rest=Prism.util.clone(Prism.languages.scss);
|
||||||
|
|
|
@ -76,6 +76,7 @@
|
||||||
},
|
},
|
||||||
|
|
||||||
"MODEL_LICENSES": {
|
"MODEL_LICENSES": {
|
||||||
|
"MIT": "https://opensource.org/licenses/MIT",
|
||||||
"CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
|
"CC BY 4.0": "https://creativecommons.org/licenses/by/4.0/",
|
||||||
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
|
"CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/",
|
||||||
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
|
"CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
|
||||||
|
@ -118,6 +119,8 @@
|
||||||
"he": "Hebrew",
|
"he": "Hebrew",
|
||||||
"ar": "Arabic",
|
"ar": "Arabic",
|
||||||
"fa": "Persian",
|
"fa": "Persian",
|
||||||
|
"ur": "Urdu",
|
||||||
|
"tt": "Tatar",
|
||||||
"ga": "Irish",
|
"ga": "Irish",
|
||||||
"bn": "Bengali",
|
"bn": "Bengali",
|
||||||
"hi": "Hindi",
|
"hi": "Hindi",
|
||||||
|
|
|
@ -157,7 +157,13 @@ p
|
||||||
|
|
||||||
+infobox("Important note", "⚠️")
|
+infobox("Important note", "⚠️")
|
||||||
| This evaluation was conducted in 2015. We're working on benchmarks on
|
| This evaluation was conducted in 2015. We're working on benchmarks on
|
||||||
| current CPU and GPU hardware.
|
| current CPU and GPU hardware. In the meantime, we're grateful to the
|
||||||
|
| Stanford folks for drawing our attention to what seems
|
||||||
|
| to be #[+a("https://nlp.stanford.edu/software/tokenizer.html#Speed") a long-standing error]
|
||||||
|
| in our CoreNLP benchmarks, especially for their
|
||||||
|
| tokenizer. Until we run corrected experiments, we have updated the table
|
||||||
|
| using their figures.
|
||||||
|
|
||||||
|
|
||||||
+aside("Methodology")
|
+aside("Methodology")
|
||||||
| #[strong Set up:] 100,000 plain-text documents were streamed from an
|
| #[strong Set up:] 100,000 plain-text documents were streamed from an
|
||||||
|
@ -183,14 +189,14 @@ p
|
||||||
+row
|
+row
|
||||||
+cell #[strong spaCy]
|
+cell #[strong spaCy]
|
||||||
each data in [ "0.2ms", "1ms", "19ms"]
|
each data in [ "0.2ms", "1ms", "19ms"]
|
||||||
+cell("num") #[strong=data]
|
+cell("num")=data
|
||||||
|
|
||||||
each data in ["1x", "1x", "1x"]
|
each data in ["1x", "1x", "1x"]
|
||||||
+cell("num")=data
|
+cell("num")=data
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell CoreNLP
|
+cell CoreNLP
|
||||||
each data in ["2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
|
each data in ["0.18ms", "10ms", "49ms", "0.9x", "10x", "2.6x"]
|
||||||
+cell("num")=data
|
+cell("num")=data
|
||||||
+row
|
+row
|
||||||
+cell ZPar
|
+cell ZPar
|
||||||
|
|
|
@ -354,7 +354,7 @@ p
|
||||||
string = ''.join(output)
|
string = ''.join(output)
|
||||||
string = string.replace('\n', '')
|
string = string.replace('\n', '')
|
||||||
string = string.replace('\t', ' ')
|
string = string.replace('\t', ' ')
|
||||||
return '<pre>{}</pre>.format(string)
|
return '<pre>{}</pre>'.format(string)
|
||||||
|
|
||||||
nlp = spacy.load('en_core_web_sm')
|
nlp = spacy.load('en_core_web_sm')
|
||||||
doc = nlp(u"This is a test.\n\nHello world.")
|
doc = nlp(u"This is a test.\n\nHello world.")
|
||||||
|
|