Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-11 20:28:20 +03:00)

Commit 50b117c072 — Merge branch 'master' into spacy.io

.github/contributors/isaric.md (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry       |
|------------------------------- | ----------- |
| Name                           | Ivan Šarić  |
| Company name (if applicable)   |             |
| Title or role (if applicable)  |             |
| Date                           | 18.08.2019. |
| GitHub username                | isaric      |
| Website (optional)             |             |
.github/contributors/yanaiela.md (vendored, new file)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

The agreement text is identical to the SCA reproduced in
`.github/contributors/isaric.md` above; only the signature section and the
contributor details differ:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                      |
|------------------------------- | -------------------------- |
| Name                           | Yanai Elazar               |
| Company name (if applicable)   |                            |
| Title or role (if applicable)  |                            |
| Date                           | 14/8/2019                  |
| GitHub username                | yanaiela                   |
| Website (optional)             | https://yanaiela.github.io |
.gitignore (vendored, 1 line changed)
@@ -3,6 +3,7 @@ spacy/data/
 corpora/
 /models/
 keys/
+*.json.gz

 # Website
 website/.cache/
spacy/cli/debug_data.py:

@@ -8,7 +8,7 @@ import sys
 import srsly
 from wasabi import Printer, MESSAGES

-from ..gold import GoldCorpus, read_json_object
+from ..gold import GoldCorpus
 from ..syntax import nonproj
 from ..util import load_model, get_lang_class

@@ -95,13 +95,19 @@ def debug_data(
     corpus = GoldCorpus(train_path, dev_path)
     try:
         train_docs = list(corpus.train_docs(nlp))
-        train_docs_unpreprocessed = list(corpus.train_docs_without_preprocessing(nlp))
+        train_docs_unpreprocessed = list(
+            corpus.train_docs_without_preprocessing(nlp)
+        )
     except ValueError as e:
-        loading_train_error_message = "Training data cannot be loaded: {}".format(str(e))
+        loading_train_error_message = "Training data cannot be loaded: {}".format(
+            str(e)
+        )
     try:
         dev_docs = list(corpus.dev_docs(nlp))
     except ValueError as e:
-        loading_dev_error_message = "Development data cannot be loaded: {}".format(str(e))
+        loading_dev_error_message = "Development data cannot be loaded: {}".format(
+            str(e)
+        )
     if loading_train_error_message or loading_dev_error_message:
         if loading_train_error_message:
             msg.fail(loading_train_error_message)

@@ -158,11 +164,15 @@ def debug_data(
     )
     if gold_train_data["n_misaligned_words"] > 0:
         msg.warn(
-            "{} misaligned tokens in the training data".format(gold_train_data["n_misaligned_words"])
+            "{} misaligned tokens in the training data".format(
+                gold_train_data["n_misaligned_words"]
+            )
         )
     if gold_dev_data["n_misaligned_words"] > 0:
         msg.warn(
-            "{} misaligned tokens in the dev data".format(gold_dev_data["n_misaligned_words"])
+            "{} misaligned tokens in the dev data".format(
+                gold_dev_data["n_misaligned_words"]
+            )
         )
     most_common_words = gold_train_data["words"].most_common(10)
     msg.text(

@@ -184,7 +194,9 @@ def debug_data(

     if "ner" in pipeline:
         # Get all unique NER labels present in the data
-        labels = set(label for label in gold_train_data["ner"] if label not in ("O", "-"))
+        labels = set(
+            label for label in gold_train_data["ner"] if label not in ("O", "-")
+        )
         label_counts = gold_train_data["ner"]
         model_labels = _get_labels_from_model(nlp, "ner")
         new_labels = [l for l in labels if l not in model_labels]

@@ -222,7 +234,9 @@ def debug_data(
         )

         if gold_train_data["ws_ents"]:
-            msg.fail("{} invalid whitespace entity spans".format(gold_train_data["ws_ents"]))
+            msg.fail(
+                "{} invalid whitespace entity spans".format(gold_train_data["ws_ents"])
+            )
             has_ws_ents_error = True

         for label in new_labels:

@@ -323,33 +337,36 @@ def debug_data(
             "Found {} sentence{} with an average length of {:.1f} words.".format(
                 gold_train_data["n_sents"],
                 "s" if len(train_docs) > 1 else "",
-                gold_train_data["n_words"] / gold_train_data["n_sents"]
+                gold_train_data["n_words"] / gold_train_data["n_sents"],
             )
         )

         # profile labels
         labels_train = [label for label in gold_train_data["deps"]]
-        labels_train_unpreprocessed = [label for label in gold_train_unpreprocessed_data["deps"]]
+        labels_train_unpreprocessed = [
+            label for label in gold_train_unpreprocessed_data["deps"]
+        ]
         labels_dev = [label for label in gold_dev_data["deps"]]

         if gold_train_unpreprocessed_data["n_nonproj"] > 0:
             msg.info(
                 "Found {} nonprojective train sentence{}".format(
                     gold_train_unpreprocessed_data["n_nonproj"],
-                    "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else ""
+                    "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "",
                 )
             )
         if gold_dev_data["n_nonproj"] > 0:
             msg.info(
                 "Found {} nonprojective dev sentence{}".format(
                     gold_dev_data["n_nonproj"],
-                    "s" if gold_dev_data["n_nonproj"] > 1 else ""
+                    "s" if gold_dev_data["n_nonproj"] > 1 else "",
                 )
             )

         msg.info(
             "{} {} in train data".format(
-                len(labels_train_unpreprocessed), "label" if len(labels_train) == 1 else "labels"
+                len(labels_train_unpreprocessed),
+                "label" if len(labels_train) == 1 else "labels",
             )
         )
         msg.info(

@@ -373,12 +390,13 @@ def debug_data(
                 )
                 has_low_data_warning = True

         # rare labels in projectivized train
         rare_projectivized_labels = []
         for label in gold_train_data["deps"]:
             if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label:
-                rare_projectivized_labels.append("{}: {}".format(label, str(gold_train_data["deps"][label])))
+                rare_projectivized_labels.append(
+                    "{}: {}".format(label, str(gold_train_data["deps"][label]))
+                )

         if len(rare_projectivized_labels) > 0:
             msg.warn(

@@ -387,12 +405,13 @@ def debug_data(
                 "want to projectivize labels such as punct before "
                 "training in order to improve parser performance.".format(
                     len(rare_projectivized_labels),
-                    "s" if len(rare_projectivized_labels) > 1 else "")
+                    "s" if len(rare_projectivized_labels) > 1 else "",
+                )
             )
             msg.warn(
                 "Projectivized labels with low numbers of examples: "
                 "{}".format("\n".join(rare_projectivized_labels)),
-                show=verbose
+                show=verbose,
             )
             has_low_data_warning = True

@@ -401,15 +420,15 @@ def debug_data(
             msg.warn(
                 "The following labels were found only in the train data: "
                 "{}".format(", ".join(set(labels_train) - set(labels_dev))),
-                show=verbose
+                show=verbose,
             )

         # labels only in dev
         if set(labels_dev) - set(labels_train):
             msg.warn(
-                "The following labels were found only in the dev data: " +
-                ", ".join(set(labels_dev) - set(labels_train)),
+                "The following labels were found only in the dev data: "
+                + ", ".join(set(labels_dev) - set(labels_train)),
                 show=verbose,
             )

         if has_low_data_warning:

@@ -422,8 +441,10 @@ def debug_data(
         # multiple root labels
         if len(gold_train_unpreprocessed_data["roots"]) > 1:
             msg.warn(
-                "Multiple root labels ({}) ".format(", ".join(gold_train_unpreprocessed_data["roots"])) +
-                "found in training data. spaCy's parser uses a single root "
+                "Multiple root labels ({}) ".format(
+                    ", ".join(gold_train_unpreprocessed_data["roots"])
+                )
+                + "found in training data. spaCy's parser uses a single root "
                 "label ROOT so this distinction will not be available."
             )

@@ -432,14 +453,14 @@ def debug_data(
             msg.fail(
                 "Found {} nonprojective projectivized train sentence{}".format(
                     gold_train_data["n_nonproj"],
-                    "s" if gold_train_data["n_nonproj"] > 1 else ""
+                    "s" if gold_train_data["n_nonproj"] > 1 else "",
                 )
             )
         if gold_train_data["n_cycles"] > 0:
             msg.fail(
                 "Found {} projectivized train sentence{} with cycles".format(
                     gold_train_data["n_cycles"],
-                    "s" if gold_train_data["n_cycles"] > 1 else ""
+                    "s" if gold_train_data["n_cycles"] > 1 else "",
                 )
             )
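The hunks above are formatting-only (wrapping long calls and adding trailing commas); the behavior of the data debugging command is unchanged. For context, here is a minimal sketch of driving these checks from Python. The `spacy.cli` import path and the keyword arguments are assumptions based on spaCy v2.1's CLI layout, not part of this diff:

```python
from pathlib import Path

# Assumption: spacy.cli re-exports debug_data, and train/dev are
# spaCy-JSON corpora as loaded by GoldCorpus in the hunks above.
from spacy.cli import debug_data

debug_data("en", Path("train.json"), Path("dev.json"), pipeline="tagger,parser,ner")
```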
spacy/cli/evaluate.py:

@@ -84,12 +84,12 @@ def evaluate(
 def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True):
     docs[0].user_data["title"] = model_name
     if ents:
-        with (output_path / "entities.html").open("w") as file_:
         html = displacy.render(docs[:limit], style="ent", page=True)
+        with (output_path / "entities.html").open("w", encoding="utf8") as file_:
             file_.write(html)
     if deps:
-        with (output_path / "parses.html").open("w") as file_:
         html = displacy.render(
             docs[:limit], style="dep", page=True, options={"compact": True}
         )
+        with (output_path / "parses.html").open("w", encoding="utf8") as file_:
             file_.write(html)
spacy/cli/init_model.py:

@@ -114,7 +114,7 @@ def read_attrs_from_deprecated(freqs_loc, clusters_loc):
         probs, _ = read_freqs(freqs_loc)
         msg.good("Counted frequencies")
     else:
-        probs, _ = ({}, DEFAULT_OOV_PROB)
+        probs, _ = ({}, DEFAULT_OOV_PROB)  # noqa: F841
     if clusters_loc:
         with msg.loading("Reading clusters..."):
             clusters = read_clusters(clusters_loc)
spacy/displacy/render.py:

@@ -247,6 +247,15 @@ class EntityRenderer(object):
         self.direction = DEFAULT_DIR
         self.lang = DEFAULT_LANG

+        template = options.get("template")
+        if template:
+            self.ent_template = template
+        else:
+            if self.direction == "rtl":
+                self.ent_template = TPL_ENT_RTL
+            else:
+                self.ent_template = TPL_ENT
+
     def render(self, parsed, page=False, minify=False):
         """Render complete markup.
@@ -284,6 +293,7 @@ class EntityRenderer(object):
             label = span["label"]
             start = span["start"]
             end = span["end"]
+            additional_params = span.get("params", {})
             entity = escape_html(text[start:end])
             fragments = text[offset:start].split("\n")
             for i, fragment in enumerate(fragments):
@@ -293,10 +303,8 @@ class EntityRenderer(object):
             if self.ents is None or label.upper() in self.ents:
                 color = self.colors.get(label.upper(), self.default_color)
                 ent_settings = {"label": label, "text": entity, "bg": color}
-                if self.direction == "rtl":
-                    markup += TPL_ENT_RTL.format(**ent_settings)
-                else:
-                    markup += TPL_ENT.format(**ent_settings)
+                ent_settings.update(additional_params)
+                markup += self.ent_template.format(**ent_settings)
             else:
                 markup += entity
             offset = end
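Taken together, these three hunks let displaCy use a caller-supplied entity template and merge each span's `"params"` into the variables available to that template. A minimal sketch of the new hook using manual rendering; the `{kb_url}` placeholder and the example values are illustrative, not from this commit:

```python
from spacy import displacy

# One manually specified entity carrying an extra template parameter.
ex = [{
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG",
              "params": {"kb_url": "https://en.wikipedia.org/wiki/Google"}}],
    "title": None,
}]
# {text}, {label} and {bg} are filled in by the renderer; {kb_url} comes
# from the span's "params" via ent_settings.update(additional_params).
template = '<mark style="background: {bg}"><a href="{kb_url}">{text}</a> {label}</mark>'
html = displacy.render(ex, style="ent", manual=True, options={"template": template})
```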
spacy/errors.py:

@@ -429,6 +429,7 @@ class Errors(object):
     E155 = ("The `nlp` object should have access to pre-trained word vectors, cf. "
             "https://spacy.io/usage/models#languages.")


 @add_codes
 class TempErrors(object):
     T003 = ("Resizing pre-trained Tagger models is not currently supported.")
spacy/lang/hr/examples.py (new file)
@@ -0,0 +1,18 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.hr.examples import sentences
>>> docs = nlp.pipe(sentences)
"""


sentences = [
    "Ovo je rečenica.",
    "Kako se popravlja auto?",
    "Zagreb je udaljen od Ljubljane svega 150 km.",
    "Nećete vjerovati što se dogodilo na ovogodišnjem festivalu!",
    "Budućnost Apple je upitna nakon dugotrajnog pada vrijednosti dionica firme.",
    "Trgovina oružjem predstavlja prijetnju za globalni mir.",
]
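The module docstring above assumes an existing `nlp` object. A short sketch of exercising the new examples; since no pretrained Croatian model existed at this point, a blank `hr` pipeline (tokenizer only) is assumed:

```python
import spacy
from spacy.lang.hr.examples import sentences

nlp = spacy.blank("hr")  # blank tokenizer-only pipeline; no Croatian model assumed
for doc in nlp.pipe(sentences):
    print([token.text for token in doc])
```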
@@ -1,10 +1,8 @@
 # encoding: utf8
 from __future__ import unicode_literals, print_function

-import re
 import sys

-
 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP
 from ...attrs import LANG
@@ -109,7 +109,7 @@ for orth in [


 emoticons = set(
-    """
+    r"""
 :)
 :-)
 :))
spacy/lang/zh/__init__.py:

@@ -8,6 +8,7 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from .stop_words import STOP_WORDS
 from .tag_map import TAG_MAP

+
 class ChineseDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters[LANG] = lambda text: "zh"
spacy/lang/zh/tag_map.py:

@@ -1,8 +1,8 @@
 # coding: utf8
 from __future__ import unicode_literals

-from ...symbols import POS, PUNCT, SYM, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
-from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX
+from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+from ...symbols import NOUN, PART, INTJ, PRON

 # The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
 # We also map the tags to the simpler Google Universal POS tag set.

@@ -43,5 +43,5 @@ TAG_MAP = {
     "JJ": {POS: ADJ},
     "P": {POS: ADP},
     "PN": {POS: PRON},
-    "PU": {POS: PUNCT}
+    "PU": {POS: PUNCT},
 }
spacy/scorer.py:

@@ -160,14 +160,15 @@ class Scorer(object):
             cand_deps.add((gold_i, gold_head, token.dep_.lower()))
         if "-" not in [token[-1] for token in gold.orig_annot]:
             # Find all NER labels in gold and doc
-            ent_labels = set([x[0] for x in gold_ents]
-                             + [k.label_ for k in doc.ents])
+            ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])
             # Set up all labels for per type scoring and prepare gold per type
             gold_per_ents = {ent_label: set() for ent_label in ent_labels}
             for ent_label in ent_labels:
                 if ent_label not in self.ner_per_ents:
                     self.ner_per_ents[ent_label] = PRFScore()
-                gold_per_ents[ent_label].update([x for x in gold_ents if x[0] == ent_label])
+                gold_per_ents[ent_label].update(
+                    [x for x in gold_ents if x[0] == ent_label]
+                )
             # Find all candidate labels, for all and per type
             cand_ents = set()
             cand_per_ents = {ent_label: set() for ent_label in ent_labels}
@@ -1,7 +1,6 @@
 # coding: utf8
 from __future__ import unicode_literals

-import pytest
 from spacy.matcher import PhraseMatcher
 from spacy.tokens import Doc
@@ -3,12 +3,13 @@ from __future__ import unicode_literals

 from ..util import get_doc


 def test_issue4104(en_vocab):
     """Test that English lookup lemmatization of spun & dry are correct
     expected mapping = {'dry': 'dry', 'spun': 'spin', 'spun-dry': 'spin-dry'}
     """
-    text = 'dry spun spun-dry'
+    text = "dry spun spun-dry"
     doc = get_doc(en_vocab, [t for t in text.split(" ")])
     # using a simple list to preserve order
-    expected = ['dry', 'spin', 'spin-dry']
+    expected = ["dry", "spin", "spin-dry"]
     assert [token.lemma_ for token in doc] == expected
spacy/tests/test_gold.py:

@@ -6,6 +6,7 @@ from spacy.gold import spans_from_biluo_tags, GoldParse
 from spacy.tokens import Doc
 import pytest


 def test_gold_biluo_U(en_vocab):
     words = ["I", "flew", "to", "London", "."]
     spaces = [True, True, True, False, True]

@@ -32,14 +33,18 @@ def test_gold_biluo_BIL(en_vocab):
     tags = biluo_tags_from_offsets(doc, entities)
     assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"]


 def test_gold_biluo_overlap(en_vocab):
     words = ["I", "flew", "to", "San", "Francisco", "Valley", "."]
     spaces = [True, True, True, True, True, False, True]
     doc = Doc(en_vocab, words=words, spaces=spaces)
-    entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
-                (len("I flew to "), len("I flew to San Francisco"), "LOC")]
+    entities = [
+        (len("I flew to "), len("I flew to San Francisco Valley"), "LOC"),
+        (len("I flew to "), len("I flew to San Francisco"), "LOC"),
+    ]
     with pytest.raises(ValueError):
-        tags = biluo_tags_from_offsets(doc, entities)
+        biluo_tags_from_offsets(doc, entities)


 def test_gold_biluo_misalign(en_vocab):
     words = ["I", "flew", "to", "San", "Francisco", "Valley."]
spacy/tests/test_scorer.py:

@@ -7,67 +7,62 @@ from spacy.scorer import Scorer
 from .util import get_doc

 test_ner_cardinal = [
-    [
-        "100 - 200",
-        {
-            "entities": [
-                [0, 3, "CARDINAL"],
-                [6, 9, "CARDINAL"]
-            ]
-        }
-    ]
+    ["100 - 200", {"entities": [[0, 3, "CARDINAL"], [6, 9, "CARDINAL"]]}]
 ]

 test_ner_apple = [
     [
         "Apple is looking at buying U.K. startup for $1 billion",
-        {
-            "entities": [
-                (0, 5, "ORG"),
-                (27, 31, "GPE"),
-                (44, 54, "MONEY"),
-            ]
-        }
+        {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]},
     ]
 ]


 def test_ner_per_type(en_vocab):
     # Gold and Doc are identical
     scorer = Scorer()
     for input_, annot in test_ner_cardinal:
-        doc = get_doc(en_vocab, words = input_.split(' '), ents = [[0, 1, 'CARDINAL'], [2, 3, 'CARDINAL']])
-        gold = GoldParse(doc, entities = annot['entities'])
+        doc = get_doc(
+            en_vocab,
+            words=input_.split(" "),
+            ents=[[0, 1, "CARDINAL"], [2, 3, "CARDINAL"]],
+        )
+        gold = GoldParse(doc, entities=annot["entities"])
         scorer.score(doc, gold)
     results = scorer.scores

-    assert results['ents_p'] == 100
-    assert results['ents_f'] == 100
-    assert results['ents_r'] == 100
-    assert results['ents_per_type']['CARDINAL']['p'] == 100
-    assert results['ents_per_type']['CARDINAL']['f'] == 100
-    assert results['ents_per_type']['CARDINAL']['r'] == 100
+    assert results["ents_p"] == 100
+    assert results["ents_f"] == 100
+    assert results["ents_r"] == 100
+    assert results["ents_per_type"]["CARDINAL"]["p"] == 100
+    assert results["ents_per_type"]["CARDINAL"]["f"] == 100
+    assert results["ents_per_type"]["CARDINAL"]["r"] == 100

     # Doc has one missing and one extra entity
     # Entity type MONEY is not present in Doc
     scorer = Scorer()
     for input_, annot in test_ner_apple:
-        doc = get_doc(en_vocab, words = input_.split(' '), ents = [[0, 1, 'ORG'], [5, 6, 'GPE'], [6, 7, 'ORG']])
-        gold = GoldParse(doc, entities = annot['entities'])
+        doc = get_doc(
+            en_vocab,
+            words=input_.split(" "),
+            ents=[[0, 1, "ORG"], [5, 6, "GPE"], [6, 7, "ORG"]],
+        )
+        gold = GoldParse(doc, entities=annot["entities"])
         scorer.score(doc, gold)
     results = scorer.scores

-    assert results['ents_p'] == approx(66.66666)
-    assert results['ents_r'] == approx(66.66666)
-    assert results['ents_f'] == approx(66.66666)
-    assert 'GPE' in results['ents_per_type']
-    assert 'MONEY' in results['ents_per_type']
-    assert 'ORG' in results['ents_per_type']
-    assert results['ents_per_type']['GPE']['p'] == 100
-    assert results['ents_per_type']['GPE']['r'] == 100
-    assert results['ents_per_type']['GPE']['f'] == 100
-    assert results['ents_per_type']['MONEY']['p'] == 0
-    assert results['ents_per_type']['MONEY']['r'] == 0
-    assert results['ents_per_type']['MONEY']['f'] == 0
-    assert results['ents_per_type']['ORG']['p'] == 50
-    assert results['ents_per_type']['ORG']['r'] == 100
-    assert results['ents_per_type']['ORG']['f'] == approx(66.66666)
+    assert results["ents_p"] == approx(66.66666)
+    assert results["ents_r"] == approx(66.66666)
+    assert results["ents_f"] == approx(66.66666)
+    assert "GPE" in results["ents_per_type"]
+    assert "MONEY" in results["ents_per_type"]
+    assert "ORG" in results["ents_per_type"]
+    assert results["ents_per_type"]["GPE"]["p"] == 100
+    assert results["ents_per_type"]["GPE"]["r"] == 100
+    assert results["ents_per_type"]["GPE"]["f"] == 100
+    assert results["ents_per_type"]["MONEY"]["p"] == 0
+    assert results["ents_per_type"]["MONEY"]["r"] == 0
+    assert results["ents_per_type"]["MONEY"]["f"] == 0
+    assert results["ents_per_type"]["ORG"]["p"] == 50
+    assert results["ents_per_type"]["ORG"]["r"] == 100
+    assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666)
website/docs/images/displacy-ent.html (deleted)
@@ -1,18 +0,0 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">But
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Google
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>is starting from behind. The company made a late push into hardware,
and
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Apple
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Siri
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>, available on
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">iPhones
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>, and
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Amazon
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span></mark>’s
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Alexa
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>software, which runs on its
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Echo
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>and
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">Dot
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span></mark>devices, have clear leads in consumer adoption.</div>
website/docs/images/displacy-ent1.html (new file)
@@ -0,0 +1,16 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 16px">
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
Apple
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
</mark>
is looking at buying
<mark class="entity" style="background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
U.K.
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">GPE</span>
</mark>
startup for
<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
$1 billion
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">MONEY</span>
</mark>
</div>
website/docs/images/displacy-ent2.html (new file)
@@ -0,0 +1,18 @@
<div class="entities" style="line-height: 2.5; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; font-size: 18px">
When
<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
Sebastian Thrun
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
</mark>
started working on self-driving cars at
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
Google
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
</mark>
in
<mark class="entity" style="background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
2007
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">DATE</span>
</mark>
, few people outside of the company took him seriously.
</div>
@@ -32,7 +32,7 @@ for ent in doc.ents:
 Using spaCy's built-in [displaCy visualizer](/usage/visualizers), here's what
 our example sentence and its named entities look like:

-import DisplaCyEntHtml from 'images/displacy-ent.html'; import { Iframe } from
+import DisplaCyEntHtml from 'images/displacy-ent1.html'; import { Iframe } from
 'components/embed'

-<Iframe title="displaCy visualization of entities" html={DisplaCyEntHtml} height={450} />
+<Iframe title="displaCy visualization of entities" html={DisplaCyEntHtml} height={100} />
@@ -564,19 +564,16 @@ For more details and examples, see the
 import spacy
 from spacy import displacy

-text = """But Google is starting from behind. The company made a late push
-into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
-software, which runs on its Echo and Dot devices, have clear leads in
-consumer adoption."""
+text = u"When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

-nlp = spacy.load("custom_ner_model")
+nlp = spacy.load("en_core_web_sm")
 doc = nlp(text)
 displacy.serve(doc, style="ent")
 ```

-import DisplacyEntHtml from 'images/displacy-ent.html'
+import DisplacyEntHtml from 'images/displacy-ent2.html'

-<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={275} />
+<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} />

 ## Tokenization {#tokenization}
@@ -117,19 +117,16 @@ text.
 import spacy
 from spacy import displacy

-text = """But Google is starting from behind. The company made a late push
-into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
-software, which runs on its Echo and Dot devices, have clear leads in
-consumer adoption."""
+text = u"When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

-nlp = spacy.load("custom_ner_model")
+nlp = spacy.load("en_core_web_sm")
 doc = nlp(text)
 displacy.serve(doc, style="ent")
 ```

-import DisplacyEntHtml from 'images/displacy-ent.html'
+import DisplacyEntHtml from 'images/displacy-ent2.html'

-<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={275} />
+<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} />

 The entity visualizer lets you customize the following `options`:
|
||||||
displacy.render(doc2, style="ent")
|
displacy.render(doc2, style="ent")
|
||||||
```
|
```
|
||||||
|
|
||||||
> #### Enabling or disabling Jupyter mode
|
<Infobox variant="warning" title="Important note">
|
||||||
>
|
|
||||||
> To explicitly enable or disable "Jupyter mode", you can use the jupyter`
|
To explicitly enable or disable "Jupyter mode", you can use the `jupyter`
|
||||||
> keyword argument – e.g. to return raw HTML in a notebook, or to force Jupyter
|
keyword argument – e.g. to return raw HTML in a notebook, or to force Jupyter
|
||||||
> rendering if auto-detection fails.
|
rendering if auto-detection fails.
|
||||||
|
|
||||||
|
</Infobox>
|
||||||
|
|
||||||
|
|
||||||
![displaCy visualizer in a Jupyter notebook](../images/displacy_jupyter.jpg)
|
![displaCy visualizer in a Jupyter notebook](../images/displacy_jupyter.jpg)
|
||||||
|
|
||||||
|
@@ -284,7 +284,7 @@ nlp = spacy.load("en_core_web_sm")
 sentences = [u"This is an example.", u"This is another one."]
 for sent in sentences:
     doc = nlp(sent)
-    svg = displacy.render(doc, style="dep")
+    svg = displacy.render(doc, style="dep", jupyter=False)
     file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
     output_path = Path("/images/" + file_name)
     output_path.open("w", encoding="utf-8").write(svg)