Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-25 08:03:08 +03:00)

Commit e64b241f9c: Merge branch 'master' of https://github.com/explosion/spaCy
.github/contributors/HiromuHota.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [x] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry                 |
| ----------------------------- | --------------------- |
| Name                          | Hiromu Hota           |
| Company name (if applicable)  | Hitachi America, Ltd. |
| Title or role (if applicable) | Researcher            |
| Date                          | 2019-03-25            |
| GitHub username               | HiromuHota            |
| Website (optional)            |                       |
.github/contributors/SamuelLKane.md (new file, 104 lines)
@@ -0,0 +1,104 @@
# spaCy contributor agreement

(The agreement text in this file is identical to the copy in
`.github/contributors/HiromuHota.md` above; only the signed statement in
section 7 and the contributor details differ.)
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field              | Entry           |
| ------------------ | --------------- |
| Name               | Samuel Kane     |
| Date               | 3/20/19         |
| GitHub username    | SamuelLKane     |
| Website (optional) | samuel-kane.com |
.github/contributors/graus.md (new file, 107 lines)
@@ -0,0 +1,107 @@
# spaCy contributor agreement

(The agreement text in this file is identical to the copy in
`.github/contributors/HiromuHota.md` above; only the signed statement in
section 7 and the contributor details differ.)
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry       |
| ----------------------------- | ----------- |
| Name                          | David Graus |
| Company name (if applicable)  |             |
| Title or role (if applicable) |             |
| Date                          | 28.03.2019  |
| GitHub username               | graus       |
| Website (optional)            | graus.nu    |
.github/contributors/wannaphongcom.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

(The agreement text in this file matches the copy in
`.github/contributors/HiromuHota.md` above, except that this copy names
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal), an older
form of the company name, as the **"us"** party. The signed statement and
contributor details follow.)
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry                       |
| ----------------------------- | --------------------------- |
| Name                          | Wannaphong Phatthiyaphaibun |
| Company name (if applicable)  |                             |
| Title or role (if applicable) |                             |
| Date                          | 25-3-2019                   |
| GitHub username               | wannaphongcom               |
| Website (optional)            |                             |
examples/pipeline/dummy_entity_linking.py (new file, 71 lines)
@@ -0,0 +1,71 @@
# coding: utf-8
from __future__ import unicode_literals

"""Demonstrate how to build a simple knowledge base and run an Entity Linking algorithm.

Currently still a bit of a dummy algorithm: simply taking the entity with the
highest prior probability for a given alias.
"""
import spacy
from spacy.kb import KnowledgeBase


def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab)

    # adding entities
    entity_0 = "Q1004791_Douglas"
    print("adding entity", entity_0)
    kb.add_entity(entity=entity_0, prob=0.5)

    entity_1 = "Q42_Douglas_Adams"
    print("adding entity", entity_1)
    kb.add_entity(entity=entity_1, prob=0.5)

    entity_2 = "Q5301561_Douglas_Haig"
    print("adding entity", entity_2)
    kb.add_entity(entity=entity_2, prob=0.5)

    # adding aliases
    print()
    alias_0 = "Douglas"
    print("adding alias", alias_0)
    kb.add_alias(alias=alias_0, entities=[entity_0, entity_1, entity_2], probabilities=[0.1, 0.6, 0.2])

    alias_1 = "Douglas Adams"
    print("adding alias", alias_1)
    kb.add_alias(alias=alias_1, entities=[entity_1], probabilities=[0.9])

    print()
    print("kb size:", len(kb), kb.get_size_entities(), kb.get_size_aliases())

    return kb


def add_el(kb, nlp):
    el_pipe = nlp.create_pipe(name='entity_linker', config={"kb": kb})
    nlp.add_pipe(el_pipe, last=True)

    for alias in ["Douglas Adams", "Douglas"]:
        candidates = nlp.linker.kb.get_candidates(alias)
        print()
        print(len(candidates), "candidate(s) for", alias, ":")
        for c in candidates:
            print(" ", c.entity_, c.prior_prob)

    text = "In The Hitchhiker's Guide to the Galaxy, written by Douglas Adams, " \
           "Douglas reminds us to always bring our towel. " \
           "The main character in Doug's novel is called Arthur Dent."
    doc = nlp(text)

    print()
    for token in doc:
        print("token", token.text, token.ent_type_, token.ent_kb_id_)

    print()
    for ent in doc.ents:
        print("ent", ent.text, ent.label_, ent.kb_id_)


if __name__ == "__main__":
    nlp = spacy.load('en_core_web_sm')
    my_kb = create_kb(nlp.vocab)
    add_el(my_kb, nlp)
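To try the script end to end, the small English model has to be available first, since the `__main__` block calls `spacy.load('en_core_web_sm')`; on a checkout where the new `spacy.kb` extension has been compiled, `python -m spacy download en_core_web_sm` followed by `python examples/pipeline/dummy_entity_linking.py` should do it.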
setup.py (1 line added)
@@ -40,6 +40,7 @@ MOD_NAMES = [
     "spacy.lexeme",
     "spacy.vocab",
     "spacy.attrs",
+    "spacy.kb",
     "spacy.morphology",
     "spacy.pipeline.pipes",
     "spacy.syntax.stateclass",
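Since `spacy.kb` is a new Cython module in `MOD_NAMES`, a source install needs its extension rebuilt (for example with `python setup.py build_ext --inplace`) before `from spacy.kb import KnowledgeBase` will resolve.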
spacy/errors.py
@@ -80,6 +80,8 @@ class Warnings(object):
             "the v2.x models cannot release the global interpreter lock. "
             "Future versions may introduce a `n_process` argument for "
             "parallel inference via multiprocessing.")
+    W017 = ("Alias '{alias}' already exists in the Knowledge base.")
+    W018 = ("Entity '{entity}' already exists in the Knowledge base.")


 @add_codes
@@ -371,6 +373,16 @@ class Errors(object):
             "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide "
             "unicode build instead. You can also rebuild Python and set the "
             "--enable-unicode=ucs4 flag.")
+    E131 = ("Cannot write the kb_id of an existing Span object because a Span "
+            "is a read-only view of the underlying Token objects stored in the Doc. "
+            "Instead, create a new Span object and specify the `kb_id` keyword argument, "
+            "for example:\nfrom spacy.tokens import Span\n"
+            "span = Span(doc, start={start}, end={end}, label='{label}', kb_id='{kb_id}')")
+    E132 = ("The vectors for entities and probabilities for alias '{alias}' should have equal length, "
+            "but found {entities_length} and {probabilities_length} respectively.")
+    E133 = ("The sum of prior probabilities for alias '{alias}' should not exceed 1, "
+            "but found {sum}.")
+    E134 = ("Alias '{alias}' defined for unknown entity '{entity}'.")


 @add_codes
spacy/kb.pxd (new file, 148 lines)
@@ -0,0 +1,148 @@
"""Knowledge-base for entity or concept linking."""
from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap
from libcpp.vector cimport vector
from libc.stdint cimport int32_t, int64_t

from spacy.vocab cimport Vocab
from .typedefs cimport hash_t


# Internal struct, for storage and disambiguation. This isn't what we return
# to the user as the answer to "here's your entity". It's the minimum number
# of bits we need to keep track of the answers.
cdef struct _EntryC:

    # The hash of this entry's unique ID and name in the KB
    hash_t entity_hash

    # Allows retrieval of one or more vectors.
    # Each element of vector_rows should be an index into a vectors table.
    # Every entry should have the same number of vectors, so we can avoid storing
    # the number of vectors in each knowledge-base struct
    int32_t* vector_rows

    # Allows retrieval of a struct of non-vector features. We could make this a
    # pointer, but we have 32 bits left over in the struct after prob, so we'd
    # like this to only be 32 bits. We can also set this to -1, for the common
    # case where there are no features.
    int32_t feats_row

    # log probability of entity, based on corpus frequency
    float prob


# Each alias struct stores a list of Entry pointers with their prior probabilities
# for this specific mention/alias.
cdef struct _AliasC:

    # All entry candidates for this alias
    vector[int64_t] entry_indices

    # Prior probability P(entity|alias) - should sum up to (at most) 1.
    vector[float] probs


# Object used by the Entity Linker that summarizes one entity-alias candidate combination.
cdef class Candidate:

    cdef readonly KnowledgeBase kb
    cdef hash_t entity_hash
    cdef hash_t alias_hash
    cdef float prior_prob


cdef class KnowledgeBase:
    cdef Pool mem
    cpdef readonly Vocab vocab

    # This maps 64bit keys (hash of unique entity string)
    # to 64bit values (position of the _EntryC struct in the _entries vector).
    # The PreshMap is pretty space efficient, as it uses open addressing. So
    # the only overhead is the vacancy rate, which is approximately 30%.
    cdef PreshMap _entry_index

    # Each entry takes 128 bits, and again we'll have a 30% or so overhead for
    # over allocation.
    # In total we end up with (N*128*1.3)+(N*128*1.3) bits for N entries.
    # Storing 1m entries would take 41.6mb under this scheme.
    cdef vector[_EntryC] _entries

    # This maps 64bit keys (hash of unique alias string)
    # to 64bit values (position of the _AliasC struct in the _aliases_table vector).
    cdef PreshMap _alias_index

    # This should map mention hashes to (entry_id, prob) tuples. The probability
    # should be P(entity | mention), which is pretty important to know.
    # We can pack both pieces of information into a 64-bit value, to keep things
    # efficient.
    cdef vector[_AliasC] _aliases_table

    # This is the part which might take more space: storing various
    # categorical features for the entries, and storing vectors for disambiguation
    # and possibly usage.
    # If each entry gets a 300-dimensional vector, for 1m entries we would need
    # 1.2gb. That gets expensive fast. What might be better is to avoid learning
    # a unique vector for every entity. We could instead have a compositional
    # model, that embeds different features of the entities into vectors. We'll
    # still want some per-entity features, like the Wikipedia text or entity
    # co-occurrence. Hopefully those vectors can be narrow, e.g. 64 dimensions.
    cdef object _vectors_table

    # It's very useful to track categorical features, at least for output, even
    # if they're not useful in the model itself. For instance, we should be
    # able to track stuff like a person's date of birth or whatever. This can
    # easily make the KB bigger, but if this isn't needed by the model, and it's
    # optional data, we can let users configure a DB as the backend for this.
    cdef object _features_table

    cdef inline int64_t c_add_entity(self, hash_t entity_hash, float prob,
                                     int32_t* vector_rows, int feats_row):
        """Add an entry to the knowledge base."""
        # This is what we'll map the hash key to. It's where the entry will sit
        # in the vector of entries, so we can get it later.
        cdef int64_t new_index = self._entries.size()
        self._entries.push_back(
            _EntryC(
                entity_hash=entity_hash,
                vector_rows=vector_rows,
                feats_row=feats_row,
                prob=prob
            ))
        self._entry_index[entity_hash] = new_index
        return new_index

    cdef inline int64_t c_add_aliases(self, hash_t alias_hash, vector[int64_t] entry_indices, vector[float] probs):
        """Connect a mention to a list of potential entities with their prior probabilities."""
        cdef int64_t new_index = self._aliases_table.size()

        self._aliases_table.push_back(
            _AliasC(
                entry_indices=entry_indices,
                probs=probs
            ))
        self._alias_index[alias_hash] = new_index
        return new_index

    cdef inline _create_empty_vectors(self):
        """
        Make sure the first element of each vector is a dummy,
        because the PreshMaps pointing to indices in these vectors cannot contain 0 as a value,
        cf. https://github.com/explosion/preshed/issues/17
        """
        cdef int32_t dummy_value = 0
        self.vocab.strings.add("")
        self._entries.push_back(
            _EntryC(
                entity_hash=self.vocab.strings[""],
                vector_rows=&dummy_value,
                feats_row=dummy_value,
                prob=dummy_value
            ))
        self._aliases_table.push_back(
            _AliasC(
                entry_indices=[dummy_value],
                probs=[dummy_value]
            ))
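As a quick sanity check on the size estimate in those comments (not part of the commit, just back-of-the-envelope arithmetic): each entry costs 128 bits with roughly 1.3x over-allocation, counted once for the entries vector and once more for the index overhead, which reproduces the quoted 41.6mb figure.

```python
# Back-of-the-envelope check of the memory estimate quoted in kb.pxd.
n_entries = 1000000
bits = (n_entries * 128 * 1.3) + (n_entries * 128 * 1.3)  # entries + index overhead
print(bits / 8 / 1e6)  # 41.6 megabytes, matching "41.6mb" in the comment
```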
spacy/kb.pyx (new file, 131 lines)
@@ -0,0 +1,131 @@
# cython: profile=True
# coding: utf8
from spacy.errors import Errors, Warnings, user_warning


cdef class Candidate:

    def __init__(self, KnowledgeBase kb, entity_hash, alias_hash, prior_prob):
        self.kb = kb
        self.entity_hash = entity_hash
        self.alias_hash = alias_hash
        self.prior_prob = prior_prob

    @property
    def entity(self):
        """RETURNS (uint64): hash of the entity's KB ID/name"""
        return self.entity_hash

    @property
    def entity_(self):
        """RETURNS (unicode): ID/name of this entity in the KB"""
        return self.kb.vocab.strings[self.entity]

    @property
    def alias(self):
        """RETURNS (uint64): hash of the alias"""
        return self.alias_hash

    @property
    def alias_(self):
        """RETURNS (unicode): ID of the original alias"""
        return self.kb.vocab.strings[self.alias]

    @property
    def prior_prob(self):
        return self.prior_prob


cdef class KnowledgeBase:

    def __init__(self, Vocab vocab):
        self.vocab = vocab
        self._entry_index = PreshMap()
        self._alias_index = PreshMap()
        self.mem = Pool()
        self._create_empty_vectors()

    def __len__(self):
        return self.get_size_entities()

    def get_size_entities(self):
        return self._entries.size() - 1  # not counting dummy element on index 0

    def get_size_aliases(self):
        return self._aliases_table.size() - 1  # not counting dummy element on index 0

    def add_entity(self, unicode entity, float prob=0.5, vectors=None, features=None):
        """
        Add an entity to the KB.
        Return the hash of the entity ID at the end.
        """
        cdef hash_t entity_hash = self.vocab.strings.add(entity)

        # Return if this entity was added before
        if entity_hash in self._entry_index:
            user_warning(Warnings.W018.format(entity=entity))
            return

        cdef int32_t dummy_value = 342
        self.c_add_entity(entity_hash=entity_hash, prob=prob,
                          vector_rows=&dummy_value, feats_row=dummy_value)
        # TODO self._vectors_table.get_pointer(vectors),
        # self._features_table.get(features))

        return entity_hash

    def add_alias(self, unicode alias, entities, probabilities):
        """
        For a given alias, add its potential entities and prior probabilities to the KB.
        Return the alias_hash at the end.
        """
        # Throw an error if the lengths of entities and probabilities are not the same
        if not len(entities) == len(probabilities):
            raise ValueError(Errors.E132.format(alias=alias,
                                                entities_length=len(entities),
                                                probabilities_length=len(probabilities)))

        # Throw an error if the probabilities sum up to more than 1
        prob_sum = sum(probabilities)
        if prob_sum > 1:
            raise ValueError(Errors.E133.format(alias=alias, sum=prob_sum))

        cdef hash_t alias_hash = self.vocab.strings.add(alias)

        # Return if this alias was added before
        if alias_hash in self._alias_index:
            user_warning(Warnings.W017.format(alias=alias))
            return

        cdef hash_t entity_hash

        cdef vector[int64_t] entry_indices
        cdef vector[float] probs

        for entity, prob in zip(entities, probabilities):
            entity_hash = self.vocab.strings[entity]
            if not entity_hash in self._entry_index:
                raise ValueError(Errors.E134.format(alias=alias, entity=entity))

            entry_index = <int64_t>self._entry_index.get(entity_hash)
            entry_indices.push_back(int(entry_index))
            probs.push_back(float(prob))

        self.c_add_aliases(alias_hash=alias_hash, entry_indices=entry_indices, probs=probs)

        return alias_hash

    def get_candidates(self, unicode alias):
        """TODO: where to put this functionality?"""
        cdef hash_t alias_hash = self.vocab.strings[alias]
        alias_index = <int64_t>self._alias_index.get(alias_hash)
        alias_entry = self._aliases_table[alias_index]

        return [Candidate(kb=self,
                          entity_hash=self._entries[entry_index].entity_hash,
                          alias_hash=alias_hash,
                          prior_prob=prob)
                for (entry_index, prob) in zip(alias_entry.entry_indices, alias_entry.probs)
                if entry_index != 0]
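Putting `kb.pyx` together with the example script above, the public surface is a handful of calls. A minimal sketch (assuming a build of this branch, since `spacy.kb` does not exist in released spaCy at this point):

```python
from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb = KnowledgeBase(nlp.vocab)
kb.add_entity(entity=u"Q42", prob=0.9)  # register an entity with a corpus prior
kb.add_alias(alias=u"Douglas", entities=[u"Q42"], probabilities=[0.9])
for c in kb.get_candidates(u"Douglas"):
    print(c.entity_, c.alias_, c.prior_prob)  # Q42 Douglas 0.9 (approximately)
```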
spacy/lang/en/lemmatizer/_adverbs_irreg.py
@@ -5,9 +5,27 @@ from __future__ import unicode_literals
 ADVERBS_IRREG = {
     "best": ("well",),
     "better": ("well",),
+    "closer": ("close",),
+    "closest": ("close",),
     "deeper": ("deeply",),
+    "earlier": ("early",),
+    "earliest": ("early",),
     "farther": ("far",),
     "further": ("far",),
+    "faster": ("fast",),
+    "fastest": ("fast",),
     "harder": ("hard",),
     "hardest": ("hard",),
+    "longer": ("long",),
+    "longest": ("long",),
+    "nearer": ("near",),
+    "nearest": ("near",),
+    "nigher": ("nigh",),
+    "nighest": ("nigh",),
+    "quicker": ("quick",),
+    "quickest": ("quick",),
+    "slower": ("slow",),
+    "slowest": ("slow",),
+    "sooner": ("soon",),
+    "soonest": ("soon",)
 }
spacy/lang/th/tag_map.py
@@ -1,7 +1,7 @@
 # encoding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX
+from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, DET, NUM, AUX, VERB
 from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ

 # Source: Korakot Chaovavanich
@@ -16,6 +16,9 @@ TAG_MAP = {
     "CMTR": {POS: NOUN},
     "CFQC": {POS: NOUN},
     "CVBL": {POS: NOUN},
+    # VERB
+    "VACT": {POS: VERB},
+    "VSTA": {POS: VERB},
     # PRON
     "PRON": {POS: PRON},
     "NPRP": {POS: PRON},
@@ -78,6 +81,7 @@ TAG_MAP = {
     "EAFF": {POS: PART},
     "AITT": {POS: PART},
     "NEG": {POS: PART},
+    "EITT": {POS: PART},
     # PUNCT
     "PUNCT": {POS: PUNCT},
     "PUNC": {POS: PUNCT},
spacy/language.py
@@ -14,7 +14,7 @@ import srsly
 from .tokenizer import Tokenizer
 from .vocab import Vocab
 from .lemmatizer import Lemmatizer
-from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer
+from .pipeline import DependencyParser, Tensorizer, Tagger, EntityRecognizer, EntityLinker
 from .pipeline import SimilarityHook, TextCategorizer, Sentencizer
 from .pipeline import merge_noun_chunks, merge_entities, merge_subtokens
 from .pipeline import EntityRuler
@@ -117,6 +117,7 @@ class Language(object):
     "tagger": lambda nlp, **cfg: Tagger(nlp.vocab, **cfg),
     "parser": lambda nlp, **cfg: DependencyParser(nlp.vocab, **cfg),
     "ner": lambda nlp, **cfg: EntityRecognizer(nlp.vocab, **cfg),
+    "entity_linker": lambda nlp, **cfg: EntityLinker(nlp.vocab, **cfg),
     "similarity": lambda nlp, **cfg: SimilarityHook(nlp.vocab, **cfg),
     "textcat": lambda nlp, **cfg: TextCategorizer(nlp.vocab, **cfg),
     "sentencizer": lambda nlp, **cfg: Sentencizer(**cfg),
@@ -212,6 +213,10 @@ class Language(object):
     def entity(self):
         return self.get_pipe("ner")

+    @property
+    def linker(self):
+        return self.get_pipe("entity_linker")
+
     @property
     def matcher(self):
         return self.get_pipe("matcher")
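With the factory registered, the linker behaves like any other built-in component. A minimal sketch (`kb` is assumed to be a `KnowledgeBase` built as in the example script above):

```python
el_pipe = nlp.create_pipe("entity_linker", config={"kb": kb})  # uses the new factory
nlp.add_pipe(el_pipe, last=True)
assert nlp.linker is el_pipe  # the new Language.linker convenience property
```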
spacy/pipeline/__init__.py
@@ -1,7 +1,7 @@
 # coding: utf8
 from __future__ import unicode_literals

-from .pipes import Tagger, DependencyParser, EntityRecognizer
+from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
 from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer
 from .entityruler import EntityRuler
 from .hooks import SentenceSegmenter, SimilarityHook
@@ -11,6 +11,7 @@ __all__ = [
     "Tagger",
     "DependencyParser",
     "EntityRecognizer",
+    "EntityLinker",
     "TextCategorizer",
     "Tensorizer",
     "Pipe",
spacy/pipeline/pipes.pyx
@@ -1061,6 +1061,55 @@ cdef class EntityRecognizer(Parser):
             if move[0] in ("B", "I", "L", "U")))


+class EntityLinker(Pipe):
+    name = 'entity_linker'
+
+    @classmethod
+    def Model(cls, nr_class=1, **cfg):
+        # TODO: non-dummy EL implementation
+        return None
+
+    def __init__(self, model=True, **cfg):
+        self.model = False
+        self.cfg = dict(cfg)
+        self.kb = self.cfg["kb"]
+
+    def __call__(self, doc):
+        self.set_annotations([doc], scores=None, tensors=None)
+        return doc
+
+    def pipe(self, stream, batch_size=128, n_threads=-1):
+        """Apply the pipe to a stream of documents.
+        Both __call__ and pipe should delegate to the `predict()`
+        and `set_annotations()` methods.
+        """
+        for docs in util.minibatch(stream, size=batch_size):
+            docs = list(docs)
+            self.set_annotations(docs, scores=None, tensors=None)
+            yield from docs
+
+    def set_annotations(self, docs, scores, tensors=None):
+        """
+        Currently implemented as taking the KB entry with the highest prior
+        probability for each named entity.
+        TODO: actually use context etc.
+        """
+        for i, doc in enumerate(docs):
+            for ent in doc.ents:
+                candidates = self.kb.get_candidates(ent.text)
+                if candidates:
+                    best_candidate = max(candidates, key=lambda c: c.prior_prob)
+                    for token in ent:
+                        token.ent_kb_id_ = best_candidate.entity_
+
+    def get_loss(self, docs, golds, scores):
+        # TODO
+        pass
+
+    def add_label(self, label):
+        # TODO
+        pass
+
+
 class Sentencizer(object):
     """Segment the Doc into sentences using a rule-based strategy.

@@ -1147,4 +1196,4 @@ class Sentencizer(object):
         return self


-__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "Sentencizer"]
+__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"]
spacy/structs.pxd
@@ -70,4 +70,5 @@ cdef struct TokenC:
     int sent_start
     int ent_iob
     attr_t ent_type  # TODO: Is there a better way to do this? Multiple sources of truth..
+    attr_t ent_kb_id
     hash_t ent_id
spacy/tests/doc/test_span.py
@@ -172,10 +172,12 @@ def test_span_as_doc(doc):
     assert span_doc[0].idx == 0


-def test_span_string_label(doc):
-    span = Span(doc, 0, 1, label="hello")
+def test_span_string_label_kb_id(doc):
+    span = Span(doc, 0, 1, label="hello", kb_id="Q342")
     assert span.label_ == "hello"
     assert span.label == doc.vocab.strings["hello"]
+    assert span.kb_id_ == "Q342"
+    assert span.kb_id == doc.vocab.strings["Q342"]


 def test_span_label_readonly(doc):
@@ -184,6 +186,12 @@ def test_span_label_readonly(doc):
     span.label_ = "hello"


+def test_span_kb_id_readonly(doc):
+    span = Span(doc, 0, 1)
+    with pytest.raises(NotImplementedError):
+        span.kb_id_ = "Q342"
+
+
 def test_span_ents_property(doc):
     """Test span.ents for the """
     doc.ents = [
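The updated test shows the intended usage of the new `kb_id` argument: it mirrors `label`, accepting a string and exposing both a hash (`span.kb_id`) and a string view (`span.kb_id_`). A standalone sketch (the entity ID "Q342" is just the illustrative value from the test):

```python
from spacy.lang.en import English
from spacy.tokens import Span

doc = English()("Douglas Adams wrote it")
span = Span(doc, 0, 2, label="PERSON", kb_id="Q342")
assert span.kb_id_ == "Q342"
assert span.kb_id == doc.vocab.strings["Q342"]
```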
@@ -124,3 +124,9 @@ def test_en_tokenizer_norm_exceptions(en_tokenizer, text, norms):
 def test_en_lex_attrs_norm_exceptions(en_tokenizer, text, norm):
     tokens = en_tokenizer(text)
     assert tokens[0].norm_ == norm
+
+
+@pytest.mark.parametrize("text", ["faster", "fastest", "better", "best"])
+def test_en_lemmatizer_handles_irreg_adverbs(en_tokenizer, text):
+    tokens = en_tokenizer(text)
+    assert tokens[0].lemma_ in ["fast", "well"]
@@ -14,11 +14,11 @@ TOKENIZER_TESTS = [
 ]

 TAG_TESTS = [
-    ("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
-    ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
-    ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
-    ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点']),
-    ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
+    ("日本語だよ", ['名詞,固有名詞,地名,国', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '助詞,終助詞,*,*']),
+    ("東京タワーの近くに住んでいます。", ['名詞,固有名詞,地名,一般', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '動詞,非自立可能,*,*', '助動詞,*,*,*', '補助記号,句点,*,*']),
+    ("吾輩は猫である。", ['代名詞,*,*,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助動詞,*,*,*', '動詞,非自立可能,*,*', '補助記号,句点,*,*']),
+    ("月に代わって、お仕置きよ!", ['名詞,普通名詞,助数詞可能,*', '助詞,格助詞,*,*', '動詞,一般,*,*', '助詞,接続助詞,*,*', '補助記号,読点,*,*', '接頭辞,*,*,*', '名詞,普通名詞,一般,*', '助詞,終助詞,*,*', '補助記号,句点,*,*']),
+    ("すもももももももものうち", ['名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,係助詞,*,*', '名詞,普通名詞,一般,*', '助詞,格助詞,*,*', '名詞,普通名詞,副詞可能,*'])
 ]

 POS_TESTS = [
spacy/tests/pipeline/test_el.py (new file, 91 lines)
@@ -0,0 +1,91 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from spacy.kb import KnowledgeBase
|
||||||
|
from spacy.lang.en import English
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def nlp():
|
||||||
|
return English()
|
||||||
|
|
||||||
|
|
||||||
|
def test_kb_valid_entities(nlp):
|
||||||
|
"""Test the valid construction of a KB with 3 entities and two aliases"""
|
||||||
|
mykb = KnowledgeBase(nlp.vocab)
|
||||||
|
|
||||||
|
# adding entities
|
||||||
|
mykb.add_entity(entity=u'Q1', prob=0.9)
|
||||||
|
mykb.add_entity(entity=u'Q2')
|
||||||
|
mykb.add_entity(entity=u'Q3', prob=0.5)
|
||||||
|
|
||||||
|
# adding aliases
|
||||||
|
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
|
||||||
|
mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
|
||||||
|
|
||||||
|
# test the size of the corresponding KB
|
||||||
|
assert(mykb.get_size_entities() == 3)
|
||||||
|
assert(mykb.get_size_aliases() == 2)
|
||||||
|
|
||||||
|
|
||||||
|
def test_kb_invalid_entities(nlp):
|
||||||
|
"""Test the invalid construction of a KB with an alias linked to a non-existing entity"""
|
||||||
|
mykb = KnowledgeBase(nlp.vocab)
|
||||||
|
|
||||||
|
# adding entities
|
||||||
|
mykb.add_entity(entity=u'Q1', prob=0.9)
|
||||||
|
mykb.add_entity(entity=u'Q2', prob=0.2)
|
||||||
|
mykb.add_entity(entity=u'Q3', prob=0.5)
|
||||||
|
|
||||||
|
# adding aliases - should fail because one of the given IDs is not valid
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q342'], probabilities=[0.8, 0.2])
|
||||||
|
|
||||||
|
|
||||||
|
def test_kb_invalid_probabilities(nlp):
|
||||||
|
"""Test the invalid construction of a KB with wrong prior probabilities"""
|
||||||
|
mykb = KnowledgeBase(nlp.vocab)
|
||||||
|
|
||||||
|
# adding entities
|
||||||
|
mykb.add_entity(entity=u'Q1', prob=0.9)
|
||||||
|
mykb.add_entity(entity=u'Q2', prob=0.2)
|
||||||
|
mykb.add_entity(entity=u'Q3', prob=0.5)
|
||||||
|
|
||||||
|
# adding aliases - should fail because the sum of the probabilities exceeds 1
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.4])
|
||||||
|
|
||||||
|
|
||||||
|
def test_kb_invalid_combination(nlp):
|
||||||
|
"""Test the invalid construction of a KB with non-matching entity and probability lists"""
|
||||||
|
mykb = KnowledgeBase(nlp.vocab)
|
||||||
|
|
||||||
|
# adding entities
|
||||||
|
mykb.add_entity(entity=u'Q1', prob=0.9)
|
||||||
|
mykb.add_entity(entity=u'Q2', prob=0.2)
|
||||||
|
mykb.add_entity(entity=u'Q3', prob=0.5)
|
||||||
|
|
||||||
|
# adding aliases - should fail because the entities and probabilities vectors are not of equal length
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.3, 0.4, 0.1])
|
||||||
|
|
||||||
|
|
||||||
|
def test_candidate_generation(nlp):
|
||||||
|
"""Test correct candidate generation"""
|
||||||
|
mykb = KnowledgeBase(nlp.vocab)
|
||||||
|
|
||||||
|
# adding entities
|
||||||
|
mykb.add_entity(entity=u'Q1', prob=0.9)
|
||||||
|
mykb.add_entity(entity=u'Q2', prob=0.2)
|
||||||
|
mykb.add_entity(entity=u'Q3', prob=0.5)
|
||||||
|
|
||||||
|
# adding aliases
|
||||||
|
mykb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])
|
||||||
|
mykb.add_alias(alias=u'adam', entities=[u'Q2'], probabilities=[0.9])
|
||||||
|
|
||||||
|
# test the size of the relevant candidates
|
||||||
|
assert(len(mykb.get_candidates(u'douglas')) == 2)
|
||||||
|
assert(len(mykb.get_candidates(u'adam')) == 1)
|
||||||
|
assert(len(mykb.get_candidates(u'shrubbery')) == 0)
|
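The tests above double as a reference for the new `KnowledgeBase` API. As a quick orientation, here is a minimal sketch of the same calls outside pytest; the entity IDs, alias and probabilities are illustrative values mirroring the tests, not recommendations:

```python
from spacy.kb import KnowledgeBase
from spacy.lang.en import English

nlp = English()
kb = KnowledgeBase(nlp.vocab)

# register entities with prior probabilities, then map an alias onto a subset of them
kb.add_entity(entity=u'Q2', prob=0.2)
kb.add_entity(entity=u'Q3', prob=0.5)
kb.add_alias(alias=u'douglas', entities=[u'Q2', u'Q3'], probabilities=[0.8, 0.2])

# candidate generation returns every entity the alias may refer to
print(len(kb.get_candidates(u'douglas')))  # 2
print(len(kb.get_candidates(u'unknown')))  # 0
```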
10
spacy/tests/regression/test_issue3447.py
Normal file
@@ -0,0 +1,10 @@
+from spacy.util import decaying
+
+
+def test_decaying():
+    sizes = decaying(10., 1., .5)
+    size = next(sizes)
+    assert size == 10.
+    size = next(sizes)
+    assert size == 10. - 0.5
+    size = next(sizes)
+    assert size == 10. - 0.5 - 0.5
@@ -131,6 +131,7 @@ cdef class Tokenizer:

         texts: A sequence of unicode texts.
         batch_size (int): Number of texts to accumulate in an internal buffer.
+            Defaults to 1000.
         YIELDS (Doc): A sequence of Doc objects, in order.

         DOCS: https://spacy.io/api/tokenizer#pipe
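The added docstring line documents the existing default. A small sketch of the documented behaviour, with illustrative texts: `Tokenizer.pipe` buffers up to `batch_size` texts internally and yields `Doc` objects in input order.

```python
from spacy.lang.en import English

nlp = English()
texts = ["First text.", "Second text."]

# tokenize lazily in batches; the yielded Docs keep the input order
for doc in nlp.tokenizer.pipe(texts, batch_size=1000):
    print([token.text for token in doc])
```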
@@ -326,7 +326,7 @@ cdef class Doc:
     def doc(self):
         return self

-    def char_span(self, int start_idx, int end_idx, label=0, vector=None):
+    def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None):
         """Create a `Span` object from the slice `doc.text[start : end]`.

         doc (Doc): The parent document.
@@ -334,6 +334,7 @@ cdef class Doc:
         end (int): The index of the first character after the span.
         label (uint64 or string): A label to attach to the Span, e.g. for
             named entities.
+        kb_id (uint64 or string): An ID from a KB to capture the meaning of a named entity.
         vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
             the span.
         RETURNS (Span): The newly constructed object.
@@ -342,6 +343,8 @@ cdef class Doc:
         """
         if not isinstance(label, int):
             label = self.vocab.strings.add(label)
+        if not isinstance(kb_id, int):
+            kb_id = self.vocab.strings.add(kb_id)
         cdef int start = token_by_start(self.c, self.length, start_idx)
         if start == -1:
             return None
@@ -350,7 +353,7 @@ cdef class Doc:
             return None
         # Currently we have the token index, we want the range-end index
         end += 1
-        cdef Span span = Span(self, start, end, label=label, vector=vector)
+        cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector)
         return span

     def similarity(self, other):
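A usage sketch of the extended `char_span` signature; the text, label and KB ID below are illustrative and not taken from the diff:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Google rebranded itself in 2015.")

# attach both an entity label and a knowledge-base ID to the character slice
span = doc.char_span(0, 6, label=u"ORG", kb_id=u"Q95")
print(span.text, span.label_, span.kb_id_)  # Google ORG Q95
```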
@@ -484,6 +487,7 @@ cdef class Doc:
         cdef const TokenC* token
         cdef int start = -1
         cdef attr_t label = 0
+        cdef attr_t kb_id = 0
         output = []
         for i in range(self.length):
             token = &self.c[i]
@@ -493,16 +497,18 @@ cdef class Doc:
                 raise ValueError(Errors.E093.format(seq=" ".join(seq)))
             elif token.ent_iob == 2 or token.ent_iob == 0:
                 if start != -1:
-                    output.append(Span(self, start, i, label=label))
+                    output.append(Span(self, start, i, label=label, kb_id=kb_id))
                 start = -1
                 label = 0
+                kb_id = 0
             elif token.ent_iob == 3:
                 if start != -1:
-                    output.append(Span(self, start, i, label=label))
+                    output.append(Span(self, start, i, label=label, kb_id=kb_id))
                 start = i
                 label = token.ent_type
+                kb_id = token.ent_kb_id
         if start != -1:
-            output.append(Span(self, start, self.length, label=label))
+            output.append(Span(self, start, self.length, label=label, kb_id=kb_id))
         return tuple(output)

     def __set__(self, ents):
@@ -11,6 +11,7 @@ cdef class Span:
     cdef readonly int start_char
     cdef readonly int end_char
     cdef readonly attr_t label
+    cdef readonly attr_t kb_id

     cdef public _vector
     cdef public _vector_norm
@@ -85,13 +85,14 @@ cdef class Span:
         return Underscore.span_extensions.pop(name)

     def __cinit__(self, Doc doc, int start, int end, label=0, vector=None,
-                  vector_norm=None):
+                  vector_norm=None, kb_id=0):
         """Create a `Span` object from the slice `doc[start : end]`.

         doc (Doc): The parent document.
         start (int): The index of the first token of the span.
         end (int): The index of the first token after the span.
         label (uint64): A label to attach to the Span, e.g. for named entities.
+        kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity.
         vector (ndarray[ndim=1, dtype='float32']): A meaning representation
             of the span.
         RETURNS (Span): The newly constructed object.
@@ -110,11 +111,14 @@ cdef class Span:
             self.end_char = 0
         if isinstance(label, basestring_):
             label = doc.vocab.strings.add(label)
+        if isinstance(kb_id, basestring_):
+            kb_id = doc.vocab.strings.add(kb_id)
         if label not in doc.vocab.strings:
             raise ValueError(Errors.E084.format(label=label))
         self.label = label
         self._vector = vector
         self._vector_norm = vector_norm
+        self.kb_id = kb_id

     def __richcmp__(self, Span other, int op):
         if other is None:
@@ -655,6 +659,20 @@ cdef class Span:
                 label_ = ''
             raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))

+    property kb_id_:
+        """RETURNS (unicode): The named entity's KB ID."""
+        def __get__(self):
+            return self.doc.vocab.strings[self.kb_id]
+
+        def __set__(self, unicode kb_id_):
+            if not kb_id_:
+                kb_id_ = ''
+            current_label = self.label_
+            if not current_label:
+                current_label = ''
+            raise NotImplementedError(Errors.E131.format(start=self.start, end=self.end,
+                                                         label=current_label, kb_id=kb_id_))
+

 cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
     # Don't allow spaces to be the root, if there are
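Read together with the `__cinit__` change above: a KB ID can be supplied when the span is constructed, while assigning to `kb_id_` afterwards deliberately raises `NotImplementedError` (E131). A minimal sketch with illustrative values:

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Douglas Adams wrote books.")

# kb_id is accepted at construction and exposed as a readable string
span = Span(doc, 0, 2, label=u"PERSON", kb_id=u"Q42")
print(span.kb_id_)  # Q42
```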
@@ -770,6 +770,22 @@ cdef class Token:
         def __set__(self, name):
             self.c.ent_id = self.vocab.strings.add(name)

+    property ent_kb_id:
+        """RETURNS (uint64): Named entity KB ID."""
+        def __get__(self):
+            return self.c.ent_kb_id
+
+        def __set__(self, attr_t ent_kb_id):
+            self.c.ent_kb_id = ent_kb_id
+
+    property ent_kb_id_:
+        """RETURNS (unicode): Named entity KB ID."""
+        def __get__(self):
+            return self.vocab.strings[self.c.ent_kb_id]
+
+        def __set__(self, ent_kb_id):
+            self.c.ent_kb_id = self.vocab.strings.add(ent_kb_id)
+
     @property
     def whitespace_(self):
         """RETURNS (unicode): The trailing whitespace character, if present."""
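Like the other hash-backed token attributes, `ent_kb_id` stores the integer hash while `ent_kb_id_` exposes the string. A short sketch under that assumption; the ID is illustrative:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Douglas Adams wrote books.")
token = doc[0]

# setting the string form stores its hash; both views stay in sync
token.ent_kb_id_ = u"Q42"
print(token.ent_kb_id_)  # Q42
print(token.ent_kb_id)   # the integer hash of "Q42"
```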
@@ -507,13 +507,10 @@ def stepping(start, stop, steps):
 def decaying(start, stop, decay):
     """Yield an infinite series of linearly decaying values."""
-
-    def clip(value):
-        return max(value, stop) if (start > stop) else min(value, stop)
-
-    nr_upd = 1.0
+    curr = float(start)
     while True:
-        yield clip(start * 1.0 / (1.0 + decay * nr_upd))
-        nr_upd += 1
+        yield max(curr, stop)
+        curr -= (decay)


 def minibatch_by_words(items, size, tuples=True, count_words=len):
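In other words, `decaying` now decays linearly from `start` towards `stop` by `decay` per step and clamps at `stop`, replacing the old hyperbolic schedule that did not match its docstring. A quick illustration of the new behaviour:

```python
from spacy.util import decaying

sizes = decaying(10., 1., 0.5)
print(next(sizes))  # 10.0
print(next(sizes))  # 9.5
print(next(sizes))  # 9.0
```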
@@ -64,7 +64,7 @@ Tokenize a stream of texts.
 | Name         | Type  | Description                                              |
 | ------------ | ----- | -------------------------------------------------------- |
 | `texts`      | -     | A sequence of unicode texts.                             |
-| `batch_size` | int   | The number of texts to accumulate in an internal buffer. |
+| `batch_size` | int   | The number of texts to accumulate in an internal buffer. Defaults to `1000`. |
 | **YIELDS**   | `Doc` | A sequence of Doc objects, in order.                     |

 ## Tokenizer.find_infix {#find_infix tag="method"}
@@ -622,10 +622,10 @@ Yield an infinite series of linearly decaying values.
 > #### Example
 >
 > ```python
-> sizes = decaying(1., 10., 0.001)
-> assert next(sizes) == 1.
-> assert next(sizes) == 1. - 0.001
-> assert next(sizes) == 0.999 - 0.001
+> sizes = decaying(10., 1., 0.001)
+> assert next(sizes) == 10.
+> assert next(sizes) == 10. - 0.001
+> assert next(sizes) == 9.999 - 0.001
 > ```

 | Name | Type | Description |
@@ -40,6 +40,7 @@ exports.createPages = ({ graphql, actions }) => {
             resources {
                 id
                 title
+                slogan
             }
             categories {
                 label
@@ -178,6 +179,7 @@ exports.createPages = ({ graphql, actions }) => {
                         slug: slug,
                         isIndex: false,
                         title: page.title || page.id,
+                        teaser: page.slogan,
                         data: { ...page, isProject: true },
                         ...universeContext,
                     },
@@ -117,6 +117,7 @@
     { "code": "sk", "name": "Slovak" },
     { "code": "sl", "name": "Slovenian" },
     { "code": "sq", "name": "Albanian" },
+    { "code": "et", "name": "Estonian" },
     {
         "code": "th",
         "name": "Thai",
@@ -29,7 +29,7 @@
     "spacyVersion": "2.1",
     "binderUrl": "ines/spacy-io-binder",
     "binderBranch": "live",
-    "binderVersion": "2.1.2",
+    "binderVersion": "2.1.3",
     "sections": [
         { "id": "usage", "title": "Usage Documentation", "theme": "blue" },
         { "id": "models", "title": "Models Documentation", "theme": "blue" },
@@ -554,6 +554,36 @@
         },
         "category": ["standalone"]
     },
+    {
+        "id": "textpipe",
+        "slogan": "clean and extract metadata from text",
+        "description": "`textpipe` is a Python package for converting raw text into clean, readable text and extracting metadata from that text. Its functionalities include transforming raw text into readable text by removing HTML tags and extracting metadata such as the number of words and named entities from the text.",
+        "github": "textpipe/textpipe",
+        "pip": "textpipe",
+        "author": "Textpipe Contributors",
+        "author_links": {
+            "github": "textpipe",
+            "website": "https://github.com/textpipe/textpipe/blob/master/CONTRIBUTORS.md"
+        },
+        "category": ["standalone"],
+        "tags": ["text-processing", "named-entity-recognition"],
+        "thumb": "https://avatars0.githubusercontent.com/u/40492530",
+        "code_example": [
+            "from textpipe import doc, pipeline",
+            "sample_text = 'Sample text! <!DOCTYPE>'",
+            "document = doc.Doc(sample_text)",
+            "print(document.clean)",
+            "# 'Sample text!'",
+            "print(document.language)",
+            "# 'en'",
+            "print(document.nwords)",
+            "# 2",
+            "",
+            "pipe = pipeline.Pipeline(['CleanText', 'NWords'])",
+            "print(pipe(sample_text))",
+            "# {'CleanText': 'Sample text!', 'NWords': 2}"
+        ]
+    },
     {
         "id": "mordecai",
         "slogan": "Full text geoparsing using spaCy, Geonames and Keras",
@@ -75,14 +75,28 @@ export const LandingBannerGrid = ({ children }) => (
     </Grid>
 )

-export const LandingBanner = ({ title, label, to, button, small, background, color, children }) => {
+export const LandingBanner = ({
+    title,
+    label,
+    to,
+    button,
+    small,
+    background,
+    backgroundImage,
+    color,
+    children,
+}) => {
     const contentClassNames = classNames(classes.bannerContent, {
         [classes.bannerContentSmall]: small,
     })
     const textClassNames = classNames(classes.bannerText, {
         [classes.bannerTextSmall]: small,
     })
-    const style = { '--color-theme': background, '--color-back': color }
+    const style = {
+        '--color-theme': background,
+        '--color-back': color,
+        backgroundImage: backgroundImage ? `url(${backgroundImage})` : null,
+    }
     const Heading = small ? H2 : H1
     return (
         <div className={classes.banner} style={style}>
@@ -113,7 +127,7 @@ export const LandingBanner = ({ title, label, to, button, small, background, col

 export const LandingBannerButton = ({ to, small, children }) => (
     <div className={classes.bannerButton}>
-        <Button to={to} variant="tertiary" large={!small}>
+        <Button to={to} variant="tertiary" large={!small} className={classes.bannerButtonElement}>
             {children}
         </Button>
     </div>
Binary file not shown.
Before Width: | Height: | Size: 30 KiB | After Width: | Height: | Size: 24 KiB
BIN
website/src/images/spacy-irl.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 75 KiB
@@ -73,6 +73,7 @@
     color: var(--color-back)
     padding: 5rem
     margin-bottom: var(--spacing-md)
+    background-size: cover

 .banner-content
     margin-bottom: 0
@@ -100,7 +101,7 @@
 .banner-text-small p
     font-size: 1.35rem
-    margin-bottom: 1rem
+    margin-bottom: 1.5rem

 @include breakpoint(min, md)
     .banner-content
@@ -134,6 +135,9 @@
     margin-bottom: var(--spacing-sm)
     text-align: right

+.banner-button-element
+    background: var(--color-theme)
+
 .logos
     text-align: center
     padding-bottom: 1rem
@@ -19,6 +19,7 @@ import { H2 } from '../components/typography'
 import { Ul, Li } from '../components/list'
 import Button from '../components/button'
 import Link from '../components/link'
+import irlBackground from '../images/spacy-irl.jpg'

 import BenchmarksChoi from 'usage/_benchmarks-choi.md'

@@ -151,19 +152,21 @@ const Landing = ({ data }) => {

             <LandingBannerGrid>
                 <LandingBanner
-                    title="BERT-style language model pretraining and more"
-                    label="New in v2.1"
-                    to="/usage/v2-1"
-                    button="Read more"
+                    title="spaCy IRL 2019: Two days of NLP"
+                    label="Join us in Berlin"
+                    to="https://irl.spacy.io/2019"
+                    button="Get tickets"
+                    background="#ffc194"
+                    backgroundImage={irlBackground}
+                    color="#1a1e23"
                     small
                 >
-                    Learn more from small training corpora by initializing your models with{' '}
-                    <strong>knowledge from raw text</strong>. The new pretrain command teaches
-                    spaCy's CNN model to predict words based on their context, producing
-                    representations of words in contexts. If you've seen Google's BERT system or
-                    fast.ai's ULMFiT, spaCy's pretraining is similar – but much more efficient. It's
-                    still experimental, but users are already reporting good results, so give it a
-                    try!
+                    We're pleased to invite the spaCy community and other folks working on Natural
+                    Language Processing to Berlin this summer for a small and intimate event{' '}
+                    <strong>July 5-6, 2019</strong>. The event includes a hands-on training day for
+                    teams using spaCy in production, followed by a one-track conference. We booked a
+                    beautiful venue, hand-picked an awesome lineup of speakers and scheduled plenty
+                    of social time to get to know each other and exchange ideas.
                 </LandingBanner>

                 <LandingBanner
@@ -191,23 +194,17 @@ const Landing = ({ data }) => {
                 <LandingLogos title="Featured on" logos={data.logosPublications} />

                 <LandingBanner
-                    title="Convolutional neural network models"
-                    label="New in v2.0"
-                    button="Download models"
-                    to="/models"
+                    title="BERT-style language model pretraining"
+                    label="New in v2.1"
+                    to="/usage/v2-1"
+                    button="Read more"
                 >
-                    spaCy v2.0 features new neural models for <strong>tagging</strong>,{' '}
-                    <strong>parsing</strong> and <strong>entity recognition</strong>. The models have
-                    been designed and implemented from scratch specifically for spaCy, to give you an
-                    unmatched balance of speed, size and accuracy. A novel bloom embedding strategy with
-                    subword features is used to support huge vocabularies in tiny tables. Convolutional
-                    layers with residual connections, layer normalization and maxout non-linearity are
-                    used, giving much better efficiency than the standard BiLSTM solution. Finally, the
-                    parser and NER use an imitation learning objective to deliver accuracy in-line with
-                    the latest research systems, even when evaluated from raw text. With these
-                    innovations, spaCy v2.0's models are <strong>10× smaller</strong>,{' '}
-                    <strong>20% more accurate</strong>, and
-                    <strong>even cheaper to run</strong> than the previous generation.
+                    Learn more from small training corpora by initializing your models with{' '}
+                    <strong>knowledge from raw text</strong>. The new pretrain command teaches spaCy's
+                    CNN model to predict words based on their context, producing representations of
+                    words in contexts. If you've seen Google's BERT system or fast.ai's ULMFiT, spaCy's
+                    pretraining is similar – but much more efficient. It's still experimental, but users
+                    are already reporting good results, so give it a try!
                 </LandingBanner>

                 <LandingGrid cols={2}>