Merge branch 'master' into spacy.io

commit 1d1df7b5f9
Ines Montani, 2019-05-11 17:49:28 +02:00
30 changed files with 913 additions and 68 deletions

.github/contributors/F0rge1cE.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Icarus Xu |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 05/06/2019 |
| GitHub username | F0rge1cE |
| Website (optional) | |

.github/contributors/amitness.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Amit Chaudhary |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | April 29, 2019 |
| GitHub username | amitness |
| Website (optional) | https://amitness.com |

.github/contributors/henry860916.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------ |
| Name | Henry Zhang |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-04-30 |
| GitHub username | henry860916 |
| Website (optional) | |

.github/contributors/ldorigo.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Luca Dorigo |
| Company name (if applicable) | / |
| Title or role (if applicable) | / |
| Date | 08.05.2019 |
| GitHub username | ldorigo |
| Website (optional) | / |

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Richard Paul Hudson |
| Company name (if applicable) | msg systems ag |
| Title or role (if applicable) | Principal IT Consultant |
| Date | 06. May 2019 |
| GitHub username | richardpaulhudson |
| Website (optional) | |

.github/contributors/yaph.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ramiro Gómez |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2019-04-29 |
| GitHub username | yaph |
| Website (optional) | http://ramiro.org/ |

@@ -447,17 +447,7 @@ use the `get_doc()` utility function to construct it manually.

 ## Updating the website

-Our [website and docs](https://spacy.io) are implemented in
-[Jade/Pug](https://www.jade-lang.org), and built or served by
-[Harp](https://harpjs.com). Jade/Pug is an extensible templating language with a
-readable syntax, that compiles to HTML. Here's how to view the site locally:
-
-```bash
-sudo npm install --global harp
-git clone https://github.com/explosion/spaCy
-cd spaCy/website
-harp server
-```
+For instructions on how to build and run the [website](https://spacy.io) locally see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the *website* directory's README.

 The docs can always use another example or more detail, and they should always
 be up to date and not misleading. To quickly find the correct file to edit,

@@ -36,11 +36,27 @@ def main(model="en_core_web_sm"):
         print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))


+def filter_spans(spans):
+    # Filter a sequence of spans so they don't contain overlaps
+    get_sort_key = lambda span: (span.end - span.start, span.start)
+    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
+    result = []
+    seen_tokens = set()
+    for span in sorted_spans:
+        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
+            result.append(span)
+            seen_tokens.update(range(span.start, span.end))
+    return result
+
+
 def extract_currency_relations(doc):
-    # merge entities and noun chunks into one token
-    seen_tokens = set()
+    # Merge entities and noun chunks into one token
     spans = list(doc.ents) + list(doc.noun_chunks)
-    for span in spans:
-        span.merge()
+    spans = filter_spans(spans)
+    with doc.retokenize() as retokenizer:
+        for span in spans:
+            retokenizer.merge(span)

     relations = []
     for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
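
The updated example deduplicates overlapping spans before merging and switches from the deprecated `span.merge()` to the `Doc.retokenize` context manager. A minimal, self-contained sketch of the same pattern (not part of the diff; assumes the `en_core_web_sm` model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Net income was $9.4 million compared to the prior year of $2.7 million.")

# Entities and noun chunks can overlap; keep only the longest non-overlapping spans
spans = list(doc.ents) + list(doc.noun_chunks)
spans = sorted(spans, key=lambda s: (s.end - s.start, s.start), reverse=True)
seen_tokens = set()
filtered = []
for span in spans:
    if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
        filtered.append(span)
        seen_tokens.update(range(span.start, span.end))

# Merge each surviving span into a single token
with doc.retokenize() as retokenizer:
    for span in filtered:
        retokenizer.merge(span)

print([token.text for token in doc])
```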

@@ -9,7 +9,7 @@ srsly>=0.0.5,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
 requests>=2.13.0,<3.0.0
-jsonschema>=2.6.0,<3.0.0
+jsonschema>=2.6.0,<3.1.0
 plac<1.0.0,>=0.9.6
 pathlib==1.0.1; python_version < "3.4"
 # Development dependencies

@@ -232,7 +232,7 @@ def setup_package():
         "blis>=0.2.2,<0.3.0",
         "plac<1.0.0,>=0.9.6",
         "requests>=2.13.0,<3.0.0",
-        "jsonschema>=2.6.0,<3.0.0",
+        "jsonschema>=2.6.0,<3.1.0",
         "wasabi>=0.2.0,<1.1.0",
         "srsly>=0.0.5,<1.1.0",
         'pathlib==1.0.1; python_version < "3.4"',

@@ -181,7 +181,7 @@ def read_vectors(vectors_loc):
     vectors_keys = []
     for i, line in enumerate(tqdm(f)):
         line = line.rstrip()
-        pieces = line.rsplit(" ", vectors_data.shape[1] + 1)
+        pieces = line.rsplit(" ", vectors_data.shape[1])
         word = pieces.pop(0)
         if len(pieces) != vectors_data.shape[1]:
             msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1)
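
The off-by-one fix matters for vector files whose first field can itself contain spaces: splitting from the right on exactly `dim` separators leaves the whole word in the first piece. A standalone illustration (example values only):

```python
# A word2vec-style text line: "<word> <dim floats>", where the word may contain spaces
line = "New York 0.1 0.2 0.3"
dim = 3

pieces = line.rsplit(" ", dim)  # split only on the last `dim` spaces
word = pieces.pop(0)

assert word == "New York"
assert len(pieces) == dim       # exactly `dim` vector components remain
```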

@@ -181,10 +181,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):

 def make_docs(nlp, batch, min_length, max_length):
     docs = []
     for record in batch:
-        text = record["text"]
         if "tokens" in record:
             doc = Doc(nlp.vocab, words=record["tokens"])
         else:
+            text = record["text"]
             doc = nlp.make_doc(text)
         if "heads" in record:
             heads = record["heads"]
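
With this change, a pretraining record only needs a `"text"` key when no pre-tokenized input is given. A rough sketch of the two record shapes the loop now accepts (illustrative values, not from the repo):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

records = [
    {"text": "Can I ask where you work now?"},                            # raw text
    {"tokens": ["Can", "I", "ask", "where", "you", "work", "now", "?"]},  # pre-tokenized, no "text" key
]

for record in records:
    if "tokens" in record:
        doc = Doc(nlp.vocab, words=record["tokens"])
    else:
        doc = nlp.make_doc(record["text"])
    print(len(doc), [token.text for token in doc])
```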

@@ -16,6 +16,7 @@ import random
 from .._ml import create_default_optimizer
 from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
+from ..compat import path2str
 from .. import util
 from .. import about
@@ -423,10 +424,12 @@ def _collate_best_model(meta, output_path, components):
     for component in components:
         bests[component] = _find_best(output_path, component)
     best_dest = output_path / "model-best"
-    shutil.copytree(output_path / "model-final", best_dest)
+    shutil.copytree(path2str(output_path / "model-final"), path2str(best_dest))
     for component, best_component_src in bests.items():
-        shutil.rmtree(best_dest / component)
-        shutil.copytree(best_component_src / component, best_dest / component)
+        shutil.rmtree(path2str(best_dest / component))
+        shutil.copytree(
+            path2str(best_component_src / component), path2str(best_dest / component)
+        )
         accs = srsly.read_json(best_component_src / "accuracy.json")
         for metric in _get_metrics(component):
             meta["accuracy"][metric] = accs[metric]

@@ -168,6 +168,7 @@ GLOSSARY = {
     # Dependency Labels (English)
     # ClearNLP / Universal Dependencies
     # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
+    "acl": "clausal modifier of noun (adjectival clause)",
     "acomp": "adjectival complement",
     "advcl": "adverbial clause modifier",
     "advmod": "adverbial modifier",
@@ -177,22 +178,32 @@ GLOSSARY = {
     "attr": "attribute",
     "aux": "auxiliary",
     "auxpass": "auxiliary (passive)",
+    "case": "case marking",
     "cc": "coordinating conjunction",
     "ccomp": "clausal complement",
+    "clf": "classifier",
     "complm": "complementizer",
+    "compound": "compound",
     "conj": "conjunct",
     "cop": "copula",
     "csubj": "clausal subject",
     "csubjpass": "clausal subject (passive)",
+    "dative": "dative",
     "dep": "unclassified dependent",
     "det": "determiner",
+    "discourse": "discourse element",
+    "dislocated": "dislocated elements",
     "dobj": "direct object",
     "expl": "expletive",
+    "fixed": "fixed multiword expression",
+    "flat": "flat multiword expression",
+    "goeswith": "goes with",
     "hmod": "modifier in hyphenation",
     "hyph": "hyphen",
     "infmod": "infinitival modifier",
     "intj": "interjection",
     "iobj": "indirect object",
+    "list": "list",
     "mark": "marker",
     "meta": "meta modifier",
     "neg": "negation modifier",
@@ -201,11 +212,15 @@ GLOSSARY = {
     "npadvmod": "noun phrase as adverbial modifier",
     "nsubj": "nominal subject",
     "nsubjpass": "nominal subject (passive)",
+    "nounmod": "modifier of nominal",
+    "npmod": "noun phrase as adverbial modifier",
     "num": "number modifier",
     "number": "number compound modifier",
+    "nummod": "numeric modifier",
     "oprd": "object predicate",
     "obj": "object",
     "obl": "oblique nominal",
+    "orphan": "orphan",
     "parataxis": "parataxis",
     "partmod": "participal modifier",
     "pcomp": "complement of preposition",
@@ -218,7 +233,10 @@ GLOSSARY = {
     "punct": "punctuation",
     "quantmod": "modifier of quantifier",
     "rcmod": "relative clause modifier",
+    "relcl": "relative clause modifier",
+    "reparandum": "overridden disfluency",
     "root": "root",
+    "vocative": "vocative",
     "xcomp": "open clausal complement",
     # Dependency labels (German)
     # TIGER Treebank
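
These glossary entries back `spacy.explain()`, which maps tags, dependency labels and entity types to short descriptions. A quick usage sketch (output depends on the installed spaCy version):

```python
import spacy

# Newly added Universal Dependencies labels now resolve to a description
for label in ["acl", "relcl", "nummod", "vocative"]:
    print(label, "->", spacy.explain(label))

# Unknown strings simply return None
print(spacy.explain("not-a-real-label"))
```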

@@ -5,8 +5,8 @@ from __future__ import unicode_literals
 STOP_WORDS = set(
     """
 á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
-aller allerdings alles allgemeinen als also am an andere anderen andern anders
-auch auf aus ausser außer ausserdem außerdem
+aller allerdings alles allgemeinen als also am an andere anderen anderem andern
+anders auch auf aus ausser außer ausserdem außerdem
 bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin
 bis bisher bist
@@ -35,8 +35,8 @@ großen grosser großer grosses großes gut gute guter gutes
 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
 hin hinter hoch
-ich ihm ihn ihnen ihr ihre ihrem ihrer ihres im immer in indem infolgedessen
-ins irgend ist
+ich ihm ihn ihnen ihr ihre ihrem ihren ihrer ihres im immer in indem
+infolgedessen ins irgend ist
 ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch
 jemand jemandem jemanden jene jenem jenen jener jenes jetzt

@@ -11,9 +11,9 @@ Example sentences to test spaCy and its language models.

 sentences = [
-    "Apple cherche a acheter une startup anglaise pour 1 milliard de dollard",
-    "Les voitures autonomes voient leur assurances décalées vers les constructeurs",
-    "San Francisco envisage d'interdire les robots coursiers",
+    "Apple cherche à acheter une startup anglaise pour 1 milliard de dollars",
+    "Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
+    "San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
     "Londres est une grande ville du Royaume-Uni",
     "L'Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d'Europe",
     "Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",

@@ -5,6 +5,7 @@ from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .norm_exceptions import NORM_EXCEPTIONS
+from .lex_attrs import LEX_ATTRS
 from ..norm_exceptions import BASE_NORMS
 from ...attrs import LANG, NORM
@@ -27,13 +28,14 @@ class ThaiTokenizer(DummyTokenizer):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)

     def __call__(self, text):
-        words = list(self.word_tokenize(text, "newmm"))
+        words = list(self.word_tokenize(text))
         spaces = [False] * len(words)
         return Doc(self.vocab, words=words, spaces=spaces)


 class ThaiDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda _text: "th"
     lex_attr_getters[NORM] = add_lookups(
         Language.Defaults.lex_attr_getters[NORM], BASE_NORMS, NORM_EXCEPTIONS
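
The Thai tokenizer delegates segmentation to the third-party `pythainlp` package; dropping the explicit `"newmm"` argument presumably keeps the call compatible with newer `pythainlp` releases where that engine is already the default (an assumption about the motivation, not stated in the diff). A rough usage sketch, assuming `pythainlp` is installed and using an illustrative sample sentence:

```python
from spacy.lang.th import Thai

nlp = Thai()  # raises an error if the pythainlp package is not installed
doc = nlp("ผมอยู่ที่กรุงเทพ")  # segmentation is handled by pythainlp
print([token.text for token in doc])
print([token.like_num for token in doc])  # lexical attribute wired up via LEX_ATTRS
```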

@@ -0,0 +1,62 @@
# coding: utf8
from __future__ import unicode_literals
from ...attrs import LIKE_NUM
_num_words = [
"ศูนย์",
"หนึ่ง",
"สอง",
"สาม",
"สี่",
"ห้า",
"หก",
"เจ็ด",
"แปด",
"เก้า",
"สิบ",
"สิบเอ็ด",
"ยี่สิบ",
"ยี่สิบเอ็ด",
"สามสิบ",
"สามสิบเอ็ด",
"สี่สิบ",
"สี่สิบเอ็ด",
"ห้าสิบ",
"ห้าสิบเอ็ด",
"หกสิบเอ็ด",
"เจ็ดสิบ",
"เจ็ดสิบเอ็ด",
"แปดสิบ",
"แปดสิบเอ็ด",
"เก้าสิบ",
"เก้าสิบเอ็ด",
"ร้อย",
"พัน",
"ล้าน",
"พันล้าน",
"หมื่นล้าน",
"แสนล้าน",
"ล้านล้าน",
"ล้านล้านล้าน",
"ล้านล้านล้านล้าน",
]

def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text in _num_words:
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}
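
`like_num` is the only lexical attribute the new module defines; it checks digits, simple fractions and the Thai number words listed above. A few direct checks (standalone, using the module added here; the sample strings are illustrative):

```python
from spacy.lang.th.lex_attrs import like_num

print(like_num("สิบเอ็ด"))   # "eleven" is in _num_words -> True
print(like_num("1,234.5"))  # separators are stripped before the digit check -> True
print(like_num("3/4"))      # simple fractions count as number-like -> True
print(like_num("-50"))      # leading sign characters are ignored -> True
print(like_num("แมว"))      # "cat" -> False
```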

@@ -111,4 +111,3 @@ NORM_EXCEPTIONS = {}

 for string, norm in _exc.items():
     NORM_EXCEPTIONS[string] = norm
     NORM_EXCEPTIONS[string.title()] = norm
-

@@ -1,5 +1,6 @@
 # coding: utf8
 from __future__ import unicode_literals
+from collections import OrderedDict

 from .symbols import POS, NOUN, VERB, ADJ, PUNCT, PROPN
 from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
@@ -118,8 +119,8 @@ def lemmatize(string, index, exceptions, rules):
             forms.append(form)
         else:
             oov_forms.append(form)
-    # Remove duplicates, and sort forms generated by rules alphabetically.
-    forms = list(set(forms))
+    # Remove duplicates but preserve the ordering of applied "rules"
+    forms = list(OrderedDict.fromkeys(forms))
     # Put exceptions at the front of the list, so they get priority.
     # This is a dodgy heuristic -- but it's the best we can do until we get
     # frequencies on this. We can at least prune out problematic exceptions,
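
Switching from `set()` to `OrderedDict.fromkeys()` deduplicates while keeping the first occurrence of each form, so candidates keep the order in which the lemmatizer rules produced them. A standalone illustration with hypothetical rule outputs:

```python
from collections import OrderedDict

forms = ["studying", "study", "studi", "study"]  # hypothetical rule outputs

unordered = list(set(forms))                 # order depends on hashing
ordered = list(OrderedDict.fromkeys(forms))  # first-seen order is preserved

print(ordered)  # ['studying', 'study', 'studi']
```

On Python 3.7+ a plain `dict.fromkeys()` behaves the same way; `OrderedDict` also covers Python 2, which spaCy still supported at the time.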

@@ -6,6 +6,7 @@ from spacy.attrs import ORTH, LENGTH
 from spacy.tokens import Doc, Span
 from spacy.vocab import Vocab
 from spacy.errors import ModelsWarning
+from spacy.util import filter_spans

 from ..util import get_doc

@@ -219,3 +220,21 @@ def test_span_ents_property(doc):
     assert sentences[2].ents[0].label_ == "PRODUCT"
     assert sentences[2].ents[0].start == 11
     assert sentences[2].ents[0].end == 14
+
+
+def test_filter_spans(doc):
+    # Test filtering duplicates
+    spans = [doc[1:4], doc[6:8], doc[1:4], doc[10:14]]
+    filtered = filter_spans(spans)
+    assert len(filtered) == 3
+    assert filtered[0].start == 1 and filtered[0].end == 4
+    assert filtered[1].start == 6 and filtered[1].end == 8
+    assert filtered[2].start == 10 and filtered[2].end == 14
+    # Test filtering overlaps with longest preference
+    spans = [doc[1:4], doc[1:3], doc[5:10], doc[7:9], doc[1:4]]
+    filtered = filter_spans(spans)
+    assert len(filtered) == 2
+    assert len(filtered[0]) == 3
+    assert len(filtered[1]) == 5
+    assert filtered[0].start == 1 and filtered[0].end == 4
+    assert filtered[1].start == 5 and filtered[1].end == 10

@@ -510,7 +510,7 @@ def decaying(start, stop, decay):
     curr = float(start)
     while True:
         yield max(curr, stop)
-        curr -= (decay)
+        curr -= decay


 def minibatch_by_words(items, size, tuples=True, count_words=len):
@@ -571,6 +571,28 @@ def itershuffle(iterable, bufsize=1000):
         raise StopIteration


+def filter_spans(spans):
+    """Filter a sequence of spans and remove duplicates or overlaps. Useful for
+    creating named entities (where one token can only be part of one entity) or
+    when merging spans with `Retokenizer.merge`. When spans overlap, the (first)
+    longest span is preferred over shorter spans.
+
+    spans (iterable): The spans to filter.
+    RETURNS (list): The filtered spans.
+    """
+    get_sort_key = lambda span: (span.end - span.start, span.start)
+    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
+    result = []
+    seen_tokens = set()
+    for span in sorted_spans:
+        # Check for end - 1 here because boundaries are inclusive
+        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
+            result.append(span)
+            seen_tokens.update(range(span.start, span.end))
+    result = sorted(result, key=lambda span: span.start)
+    return result
+
+
 def to_bytes(getters, exclude):
     serialized = OrderedDict()
     for key, getter in getters.items():
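
For context, `decaying` yields a value that drops linearly by `decay` per step and is clamped at `stop`; the change only removes the stray parentheses. A quick way to inspect it:

```python
from itertools import islice

from spacy.util import decaying

dropout = decaying(0.6, 0.2, 0.1)
print([round(d, 2) for d in islice(dropout, 6)])  # [0.6, 0.5, 0.4, 0.3, 0.2, 0.2]
```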

@@ -457,7 +457,7 @@ sit amet dignissim justo congue.

 ## Setup and installation {#setup}

 Before running the setup, make sure your versions of
-[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date.
+[Node](https://nodejs.org/en/) and [npm](https://www.npmjs.com/) are up to date. Node v10.15 or later is required.

 ```bash
 # Clone the repository

View File

@ -198,7 +198,7 @@ will only train the tagger and parser.
```bash ```bash
$ python -m spacy train [lang] [output_path] [train_path] [dev_path] $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-examples] [--use-gpu] [--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] [--n-examples] [--use-gpu]
[--version] [--meta-path] [--init-tok2vec] [--parser-multitasks] [--version] [--meta-path] [--init-tok2vec] [--parser-multitasks]
[--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens] [--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens]
[--verbose] [--verbose]
@ -214,6 +214,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | | `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--vectors`, `-v` | option | Model to load vectors from. | | `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). | | `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). | | `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). |
| `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. | | `--use-gpu`, `-g` | option | Whether to use GPU. Can be either `0`, `1` or `-1`. |
| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. | | `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. |
@ -285,10 +286,11 @@ improvement.
```bash ```bash
$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
[--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors] [--depth] [--embed-rows] [--dropout] [--seed] [--n-iter] [--use-vectors]
[--n-save_every]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
| ---------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- | | ----------------------- | ---------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"`. [See here](#pretrain-jsonl) for details. | | `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"`. [See here](#pretrain-jsonl) for details. |
| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. | | `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. |
| `output_dir` | positional | Directory to write models to on each epoch. | | `output_dir` | positional | Directory to write models to on each epoch. |
@ -302,6 +304,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width]
| `--seed`, `-s` | option | Seed for random number generators. | | `--seed`, `-s` | option | Seed for random number generators. |
| `--n-iter`, `-i` | option | Number of iterations to pretrain. | | `--n-iter`, `-i` | option | Number of iterations to pretrain. |
| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. | | `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. |
| `--n-save_every`, `-se` | option | Save model every X batches. |
| **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. | | **CREATES** | weights | The pre-trained weights that can be used to initialize `spacy train`. |
### JSONL format for raw text {#pretrain-jsonl} ### JSONL format for raw text {#pretrain-jsonl}
@ -324,7 +327,7 @@ tokenization can be provided.
| Key | Type | Description | | Key | Type | Description |
| -------- | ------- | -------------------------------------------- | | -------- | ------- | -------------------------------------------- |
| `text` | unicode | The raw input text. | | `text` | unicode | The raw input text. Is not required if `tokens` available. |
| `tokens` | list | Optional tokenization, one string per token. | | `tokens` | list | Optional tokenization, one string per token. |
```json ```json
@ -332,6 +335,7 @@ tokenization can be provided.
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"} {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."} {"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}
{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]}
``` ```
## Init Model {#init-model new="2"} ## Init Model {#init-model new="2"}
@ -375,7 +379,7 @@ pipeline.
```bash ```bash
$ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-limit] $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-limit]
[--gpu-id] [--gold-preproc] [--gpu-id] [--gold-preproc] [--return-scores]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
@ -386,6 +390,7 @@ $ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-lim
| `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. | | `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. |
| `--gpu-id`, `-g` | option | GPU to use, if any. Defaults to `-1` for CPU. | | `--gpu-id`, `-g` | option | GPU to use, if any. Defaults to `-1` for CPU. |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. | | `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
| `--return-scores`, `-R` | flag | Return dict containing model scores. |
| **CREATES** | `stdout`, HTML | Training results and optional displaCy visualizations. | | **CREATES** | `stdout`, HTML | Training results and optional displaCy visualizations. |
## Package {#package}

View File

@@ -212,12 +212,12 @@ Render a dependency parse tree or named entity visualization.
> ```
| Name | Type | Description | Default |
| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
| `options` | dict | [Visualizer-specific options](#options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | unicode | Rendered HTML markup. |
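For reference, a short example exercising the arguments above, assuming the `en_core_web_sm` model is installed:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")
# Render a standalone HTML page with a compact dependency visualization.
# Outside a notebook, `jupyter` can be left as None and is detected automatically.
html = displacy.render(doc, style="dep", page=True, options={"compact": True})
```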
@@ -654,6 +654,27 @@ for batching. Larger `buffsize` means less bias.
| `buffsize` | int | Items to hold back. |
| **YIELDS** | iterable | The shuffled iterator. |
### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"}
Filter a sequence of [`Span`](/api/span) objects and remove duplicates or
overlaps. Useful for creating named entities (where one token can only be part
of one entity) or when merging spans with
[`Retokenizer.merge`](/api/doc#retokenizer.merge). When spans overlap, the
(first) longest span is preferred over shorter spans.
> #### Example
>
> ```python
> from spacy.util import filter_spans
>
> doc = nlp("This is a sentence.")
> spans = [doc[0:2], doc[0:2], doc[0:4]]
> filtered = filter_spans(spans)
> ```
| Name | Type | Description |
| ----------- | -------- | -------------------- |
| `spans` | iterable | The spans to filter. |
| **RETURNS** | list | The filtered spans. |
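A short sketch of the merging use case mentioned above, assuming the `en_core_web_sm` model is installed and using made-up overlapping candidate spans:

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"German Chancellor Angela Merkel visited Berlin.")
# Overlapping candidates: filter_spans keeps the (first) longest span
spans = [doc[0:2], doc[1:3], doc[0:3]]
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):
        retokenizer.merge(span)
print([token.text for token in doc])
```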
## Compatibility functions {#compat source="spacy/compat.py"}
All Python code is written in an **intersection of Python 2 and Python 3**. This

View File

@@ -4,7 +4,7 @@ example, everything that's in your `nlp` object. This means you'll have to
translate its contents and structure into a format that can be saved, like a
file or a byte string. This process is called serialization. spaCy comes with
**built-in serialization methods** and supports the
[Pickle protocol](https://www.diveinto.org/python3/serializing.html#dump).
> #### What's pickle?
>
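As a quick illustration of both routes, a minimal sketch, assuming the `en_core_web_sm` model is installed:

```python
import pickle

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence.")

# Built-in serialization: dump the Doc to a byte string and restore it
doc_bytes = doc.to_bytes()
new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)

# The pickle protocol works as well
pickled = pickle.dumps(doc)
restored = pickle.loads(pickled)
```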

View File

@@ -260,7 +260,7 @@ def my_component(doc):
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(my_component, name="print_info", last=True)
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner', 'print_info']
doc = nlp(u"This is a sentence.")
```
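The hunk above only shows how the component is added. For context, a minimal sketch of what such a component might look like; the printed message is just an illustration:

```python
def my_component(doc):
    # A pipeline component is any callable that receives and returns the Doc
    print("This doc has {} tokens.".format(len(doc)))
    return doc
```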

View File

@@ -713,9 +713,9 @@ from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
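The excerpt above stops before the matches are consumed. A self-contained sketch of the full flow, assuming the `en_core_web_sm` model is installed:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = [u"Barack Obama", u"Angela Merkel"]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama met.")
for match_id, start, end in matcher(doc):
    # Each match is a (match_id, start, end) triple over the Doc's tokens
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```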

View File

@@ -102,7 +102,7 @@ systems, or to pre-process text for **deep learning**.
integrated and opinionated. spaCy tries to avoid asking the user to choose
between multiple algorithms that deliver equivalent functionality. Keeping the
menu small lets spaCy deliver generally better performance and developer
experience.
- **spaCy is not a company**. It's an open-source library. Our company
publishing spaCy and other software is called

View File

@@ -980,6 +980,22 @@
},
"category": ["podcasts"]
},
{
"type": "education",
"id": "twimlai-podcast",
"title": "TWiML & AI: Practical NLP with spaCy and Prodigy",
"slogan": "May 2019",
"description": "\"Ines and I caught up to discuss her various projects, including the aforementioned SpaCy, an open-source NLP library built with a focus on industry and production use cases. In our conversation, Ines gives us an overview of the SpaCy Library, a look at some of the use cases that excite her, and the Spacy community and contributors. We also discuss her work with Prodigy, an annotation service tool that uses continuous active learning to train models, and finally, what other exciting projects she is working on.\"",
"thumb": "https://i.imgur.com/ng2F5gK.png",
"url": "https://twimlai.com/twiml-talk-262-practical-natural-language-processing-with-spacy-and-prodigy-w-ines-montani",
"iframe": "https://html5-player.libsyn.com/embed/episode/id/9691514/height/90/theme/custom/thumbnail/no/preload/no/direction/backward/render-playlist/no/custom-color/3e85b1/",
"iframe_height": 90,
"author": "Sam Charrington",
"author_links": {
"website": "https://twimlai.com"
},
"category": ["podcasts"]
},
{
"id": "adam_qas",
"title": "ADAM: Question Answering System",
@@ -1338,8 +1354,43 @@
},
"category": ["pipeline"],
"tags": ["inflection"]
},
{
"id": "NGym",
"title": "NeuralGym",
"slogan": "A little Windows GUI for training models with spaCy",
"description": "NeuralGym is a Python application for Windows with a graphical user interface to train models with spaCy. Run the application, select an output folder, a training data file in spaCy's data format, a spaCy model or blank model and press 'Start'.",
"github": "d5555/NeuralGym",
"url": "https://github.com/d5555/NeuralGym",
"image": "https://github.com/d5555/NeuralGym/raw/master/NGym.png",
"thumb": "https://github.com/d5555/NeuralGym/raw/master/NGym/web.png",
"author": "d5555",
"category": ["training"],
"tags": ["windows"]
},
{
"id": "holmes",
"title": "Holmes",
"slogan": "Information extraction from English and German texts based on predicate logic",
"github": "msg-systems/holmes-extractor",
"url": "https://github.com/msg-systems/holmes-extractor",
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
"pip": "holmes-extractor",
"category": ["conversational", "research", "standalone"],
"tags": ["chatbots", "text-processing"],
"code_example": [
"import holmes_extractor as holmes",
"holmes_manager = holmes.Manager(model='en_coref_lg')",
"holmes_manager.register_search_phrase('A big dog chases a cat')",
"holmes_manager.start_chatbot_mode_console()"
],
"author": "Richard Paul Hudson",
"author_links": {
"github": "richardpaulhudson"
}
}
],
"categories": [
{
"label": "Projects",