Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-13 13:17:06 +03:00)

Merge branch 'master' into spacy.io
Commit: 89967f3701
.github/contributors/Jan-711.md (new file, 106 lines)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term **"you"** shall mean
the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
   object code, patch, tool, sample, graphic, specification, manual,
   documentation, or any other material posted or submitted by you to the
   project.

2. With respect to any worldwide copyrights, or copyright applications and
   registrations, in your contribution:

   * you hereby assign to us joint ownership, and to the extent that such
     assignment is or becomes invalid, ineffective or unenforceable, you hereby
     grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
     royalty-free, unrestricted license to exercise all rights under those
     copyrights. This includes, at our option, the right to sublicense these
     same rights to third parties through multiple levels of sublicensees or
     other licensing arrangements;

   * you agree that each of us can do all things in relation to your
     contribution as if each of us were the sole owners, and if one of us makes
     a derivative work of your contribution, the one who makes the derivative
     work (or has it made) will be the sole owner of that derivative work;

   * you agree that you will not assert any moral rights in your contribution
     against us, our licensees or transferees;

   * you agree that we may register a copyright in your contribution and
     exercise all ownership rights associated with it; and

   * you agree that neither of us has any duty to consult with, obtain the
     consent of, pay or render an accounting to the other for any use or
     distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
   to any third party, you hereby grant to us a perpetual, irrevocable,
   non-exclusive, worldwide, no-charge, royalty-free license to:

   * make, have made, use, sell, offer to sell, import, and otherwise transfer
     your contribution in whole or in part, alone or in combination with or
     included in any product, work or materials arising out of the project to
     which your contribution was submitted, and

   * at our option, to sublicense these same rights to third parties through
     multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
   contribution. The rights that you grant to us under these terms are
   effective on the date you first submitted a contribution to us, even if your
   submission took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

   * each contribution that you submit is and shall be an original work of
     authorship and you can legally grant the rights set out in this SCA;

   * to the best of your knowledge, each contribution will not violate any
     third party's copyrights, trademarks, patents, or other intellectual
     property rights; and

   * each contribution shall be in compliance with U.S. export control laws and
     other applicable export and import laws. You agree to notify us if you
     become aware of any circumstance which would make any of the foregoing
     representations inaccurate in any respect. We may publicly disclose your
     participation in the project, including the fact that you have signed the
     SCA.

6. This SCA is governed by the laws of the State of California and applicable
   U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
   mark both statements:

   * [x] I am signing on behalf of myself as an individual and no other person
     or entity, including my employer, has or will have rights with respect to
     my contributions.

   * [ ] I am signing on behalf of my employer or a legal entity and I have the
     actual authority to contractually bind that entity.

## Contributor Details

| Field                         | Entry           |
| ----------------------------- | --------------- |
| Name                          | Jan Jessewitsch |
| Company name (if applicable)  |                 |
| Title or role (if applicable) |                 |
| Date                          | 16.02.2020      |
| GitHub username               | Jan-711         |
| Website (optional)            |                 |

.github/contributors/MisterKeefe.md (new file, 106 lines)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

[Same agreement text as in the file above; neither signing statement is marked
in this copy. The contributor details differ:]

## Contributor Details

| Field                         | Entry            |
| ----------------------------- | ---------------- |
| Name                          | Tom Keefe        |
| Company name (if applicable)  | /                |
| Title or role (if applicable) | /                |
| Date                          | 18 February 2020 |
| GitHub username               | MisterKeefe      |
| Website (optional)            | /                |

@@ -1,5 +1,5 @@
 recursive-include include *.h
-recursive-include spacy *.txt
+recursive-include spacy *.txt *.pyx *.pxd
 include LICENSE
 include README.md
 include bin/spacy

@@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
 HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def load_model(name):
     return spacy.load(name)


-@st.cache(ignore_hash=True)
+@st.cache(allow_output_mutation=True)
 def process_text(model_name, text):
     nlp = load_model(model_name)
     return nlp(text)

@@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
     st.header("Named Entities")
     st.sidebar.header("Named Entities")
     label_set = nlp.get_pipe("ner").labels
-    labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
+    labels = st.sidebar.multiselect(
+        "Entity labels", options=label_set, default=list(label_set)
+    )
     html = displacy.render(doc, style="ent", options={"ents": labels})
     # Newlines seem to mess with the rendering
     html = html.replace("\n", " ")

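The cache decorator change tracks a rename in Streamlit's API: `ignore_hash` was replaced by `allow_output_mutation` in newer Streamlit releases. A minimal sketch of the updated loader pattern, assuming the example model is installed:

```python
import spacy
import streamlit as st


@st.cache(allow_output_mutation=True)
def load_model(name):
    # Tell Streamlit not to hash or warn about mutations of the returned
    # object: spaCy pipelines mutate internal state while they run.
    return spacy.load(name)


nlp = load_model("en_core_web_sm")  # example model name, assumed installed
doc = nlp("Mark Zuckerberg is the CEO of Facebook.")
st.write([(ent.text, ent.label_) for ent in doc.ents])
```
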
@@ -92,3 +92,5 @@ cdef enum attr_id_t:
     LANG
     ENT_KB_ID = symbols.ENT_KB_ID
     ENT_ID = symbols.ENT_ID
+
+    IDX

@@ -91,6 +91,7 @@ IDS = {
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
+    "IDX": IDX
 }

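With `IDX` registered as a retrievable attribute ID, token character offsets can be exported through `Doc.to_array`. A small sketch, assuming a build that includes this change:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("An example sentence")
# "IDX" resolves to the new attribute ID, so to_array returns each token's
# character offset into the original text.
offsets = doc.to_array("IDX")
print(list(offsets))  # [0, 3, 11]
```
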
@@ -34,7 +34,7 @@ from .train import _load_pretrained_tok2vec
     vectors_model=("Name or path to spaCy model with vectors to learn from"),
     output_dir=("Directory to write models to on each epoch", "positional", None, str),
     width=("Width of CNN layers", "option", "cw", int),
-    depth=("Depth of CNN layers", "option", "cd", int),
+    conv_depth=("Depth of CNN layers", "option", "cd", int),
     cnn_window=("Window size for CNN layers", "option", "cW", int),
     cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
     use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
@@ -84,7 +84,7 @@ def pretrain(
     vectors_model,
     output_dir,
     width=96,
-    depth=4,
+    conv_depth=4,
     bilstm_depth=0,
     cnn_pieces=3,
     sa_depth=0,
@@ -132,9 +132,15 @@ def pretrain(
     msg.info("Using GPU" if has_gpu else "Not using GPU")

     output_dir = Path(output_dir)
+    if output_dir.exists() and [p for p in output_dir.iterdir()]:
+        msg.warn(
+            "Output directory is not empty",
+            "It is better to use an empty directory or refer to a new output path, "
+            "then the new directory will be created for you.",
+        )
     if not output_dir.exists():
         output_dir.mkdir()
-        msg.good("Created output directory")
+        msg.good("Created output directory: {}".format(output_dir))
     srsly.write_json(output_dir / "config.json", config)
     msg.good("Saved settings to config.json")

@@ -162,7 +168,7 @@ def pretrain(
         Tok2Vec(
             width,
             embed_rows,
-            conv_depth=depth,
+            conv_depth=conv_depth,
             pretrained_vectors=pretrained_vectors,
             bilstm_depth=bilstm_depth,  # Requires PyTorch. Experimental.
             subword_features=not use_chars,  # Set to False for Chinese etc

@@ -14,6 +14,7 @@ import contextlib
 import random

 from .._ml import create_default_optimizer
+from ..util import use_gpu as set_gpu
 from ..attrs import PROB, IS_OOV, CLUSTER, LANG
 from ..gold import GoldCorpus
 from ..compat import path2str
@@ -32,6 +33,13 @@ from .. import about
     pipeline=("Comma-separated names of pipeline components", "option", "p", str),
     replace_components=("Replace components from base model", "flag", "R", bool),
     vectors=("Model to load vectors from", "option", "v", str),
+    width=("Width of CNN layers of Tok2Vec component", "option", "cw", int),
+    conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int),
+    cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int),
+    cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int),
+    use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool),
+    bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int),
+    embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int),
     n_iter=("Number of iterations", "option", "n", int),
     n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
     n_examples=("Number of examples", "option", "ns", int),
@@ -63,6 +71,13 @@ def train(
     pipeline="tagger,parser,ner",
     replace_components=False,
     vectors=None,
+    width=96,
+    conv_depth=4,
+    cnn_window=1,
+    cnn_pieces=3,
+    use_chars=False,
+    bilstm_depth=0,
+    embed_rows=2000,
     n_iter=30,
     n_early_stopping=None,
     n_examples=0,
@@ -115,6 +130,7 @@ def train(
         )
     if not output_path.exists():
         output_path.mkdir()
+        msg.good("Created output directory: {}".format(output_path))

     # Take dropout and batch size as generators of values -- dropout
     # starts high and decays sharply, to force the optimizer to explore.
@@ -147,6 +163,18 @@ def train(
     disabled_pipes = None
     pipes_added = False
     msg.text("Training pipeline: {}".format(pipeline))
+    if use_gpu >= 0:
+        activated_gpu = None
+        try:
+            activated_gpu = set_gpu(use_gpu)
+        except Exception as e:
+            msg.warn("Exception: {}".format(e))
+        if activated_gpu is not None:
+            msg.text("Using GPU: {}".format(use_gpu))
+        else:
+            msg.warn("Unable to activate GPU: {}".format(use_gpu))
+            msg.text("Using CPU only")
+            use_gpu = -1
     if base_model:
         msg.text("Starting with base model '{}'".format(base_model))
         nlp = util.load_model(base_model)
@@ -237,7 +265,15 @@ def train(
         optimizer = create_default_optimizer(Model.ops)
     else:
         # Start with a blank model, call begin_training
-        optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
+        cfg = {"device": use_gpu}
+        cfg["conv_depth"] = conv_depth
+        cfg["token_vector_width"] = width
+        cfg["bilstm_depth"] = bilstm_depth
+        cfg["cnn_maxout_pieces"] = cnn_pieces
+        cfg["embed_size"] = embed_rows
+        cfg["conv_window"] = cnn_window
+        cfg["subword_features"] = not use_chars
+        optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)

     nlp._optimizer = None

@@ -362,6 +398,7 @@ def train(
                 if not batch:
                     continue
                 docs, golds = zip(*batch)
+                try:
                     nlp.update(
                         docs,
                         golds,
@@ -369,6 +406,11 @@ def train(
                         drop=next(dropout_rates),
                         losses=losses,
                     )
+                except ValueError as e:
+                    msg.warn("Error during training")
+                    if init_tok2vec:
+                        msg.warn("Did you provide the same parameters during 'train' as during 'pretrain'?")
+                    msg.fail("Original error message: {}".format(e), exits=1)
                 if raw_text:
                     # If raw text is available, perform 'rehearsal' updates,
                     # which use unlabelled data to reduce overfitting.
@@ -495,6 +537,8 @@ def train(
                     "score = {}".format(best_score, current_score)
                 )
                 break
+    except Exception as e:
+        msg.warn("Aborting and saving the final best model. Encountered exception: {}".format(e))
     finally:
         best_pipes = nlp.pipe_names
         if disabled_pipes:

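The new `cfg` plumbing exists so that training can reuse weights produced by `spacy pretrain`: the Tok2Vec hyperparameters passed to `begin_training` have to match the ones used during pretraining, otherwise loading the pretrained tok2vec weights fails with the `ValueError` handled above. A rough sketch of the same idea via the Python API (the parameter values are only illustrative and must mirror whatever was used for pretraining):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("tagger"))

# These keys mirror the cfg dict built in the diff above.
tok2vec_cfg = {
    "token_vector_width": 96,   # width used for pretraining
    "conv_depth": 4,            # the renamed conv_depth / -cd option
    "conv_window": 1,
    "cnn_maxout_pieces": 3,
    "embed_size": 2000,
    "bilstm_depth": 0,
    "subword_features": True,   # i.e. use_chars=False
}
optimizer = nlp.begin_training(**tok2vec_cfg)
```
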
@@ -144,10 +144,12 @@ def parse_deps(orig_doc, options={}):
         for span, tag, lemma, ent_type in spans:
             attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
             retokenizer.merge(span, attrs=attrs)
-    if options.get("fine_grained"):
-        words = [{"text": w.text, "tag": w.tag_} for w in doc]
-    else:
-        words = [{"text": w.text, "tag": w.pos_} for w in doc]
+    fine_grained = options.get("fine_grained")
+    add_lemma = options.get("add_lemma")
+    words = [{"text": w.text,
+              "tag": w.tag_ if fine_grained else w.pos_,
+              "lemma": w.lemma_ if add_lemma else None} for w in doc]

     arcs = []
     for word in doc:
         if word.i < word.head.i:

@@ -3,7 +3,7 @@ from __future__ import unicode_literals

 import uuid

-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
+from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS, TPL_ENTS
 from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
 from ..util import minify_html, escape_html, registry
 from ..errors import Errors
@@ -83,7 +83,7 @@ class DependencyRenderer(object):
         self.width = self.offset_x + len(words) * self.distance
         self.height = self.offset_y + 3 * self.word_spacing
         self.id = render_id
-        words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
+        words = [self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words)]
         arcs = [
             self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
             for i, a in enumerate(arcs)
@@ -101,7 +101,7 @@ class DependencyRenderer(object):
             lang=self.lang,
         )

-    def render_word(self, text, tag, i):
+    def render_word(self, text, tag, lemma, i,):
         """Render individual word.

         text (unicode): Word text.
@@ -114,6 +114,8 @@ class DependencyRenderer(object):
         if self.direction == "rtl":
             x = self.width - x
         html_text = escape_html(text)
+        if lemma is not None:
+            return TPL_DEP_WORDS_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x, y=y)
         return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)

     def render_arrow(self, label, start, end, direction, i):

@@ -18,6 +18,15 @@ TPL_DEP_WORDS = """
 """


+TPL_DEP_WORDS_LEMMA = """
+<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
+    <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
+    <tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
+    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
+</text>
+"""
+
+
 TPL_DEP_ARCS = """
 <g class="displacy-arrow">
     <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>

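Together these changes let the dependency visualizer show a lemma row under each token. A small usage sketch (the model name is illustrative; `add_lemma` and `fine_grained` are the option keys read by `parse_deps` above):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with a parser; name assumed
doc = nlp("This is a sentence.")
# With add_lemma=True each word is rendered via TPL_DEP_WORDS_LEMMA, inserting
# a lemma line between the token text and its tag; fine_grained=True switches
# the tag row from the coarse token.pos_ to the fine-grained token.tag_.
html = displacy.render(
    doc, style="dep", options={"add_lemma": True, "fine_grained": True}
)
```
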
@@ -29,7 +29,7 @@ früher fünf fünfte fünften fünfter fünftes für

 gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
 geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
-gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
+gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
 großen grosser großer grosses großes gut gute guter gutes

 habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@@ -47,9 +47,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
 lang lange leicht leider lieber los

 machen macht machte mag magst man manche manchem manchen mancher manches mehr
-mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
-mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
-musste mussten
+mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
+mögen möglich mögt morgen muss muß müssen musst müsst musste mussten

 na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
 neuntes nicht nichts nie niemand niemandem niemanden noch nun nur

@@ -3,7 +3,7 @@
 from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
-from .tag_map_general import TAG_MAP
+from ..tag_map import TAG_MAP
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from .lemmatizer import GreekLemmatizer

@@ -1,27 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
-from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
-
-
-TAG_MAP = {
-    "ADJ": {POS: ADJ},
-    "ADV": {POS: ADV},
-    "INTJ": {POS: INTJ},
-    "NOUN": {POS: NOUN},
-    "PROPN": {POS: PROPN},
-    "VERB": {POS: VERB},
-    "ADP": {POS: ADP},
-    "CCONJ": {POS: CCONJ},
-    "SCONJ": {POS: SCONJ},
-    "PART": {POS: PART},
-    "PUNCT": {POS: PUNCT},
-    "SYM": {POS: SYM},
-    "NUM": {POS: NUM},
-    "PRON": {POS: PRON},
-    "AUX": {POS: AUX},
-    "SPACE": {POS: SPACE},
-    "DET": {POS: DET},
-    "X": {POS: X},
-}

@@ -14,6 +14,7 @@ for exc_data in [
     {ORTH: "alv.", LEMMA: "arvonlisävero"},
     {ORTH: "ark.", LEMMA: "arkisin"},
     {ORTH: "as.", LEMMA: "asunto"},
+    {ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
     {ORTH: "ed.", LEMMA: "edellinen"},
     {ORTH: "esim.", LEMMA: "esimerkki"},
     {ORTH: "huom.", LEMMA: "huomautus"},
@@ -27,6 +28,7 @@ for exc_data in [
     {ORTH: "läh.", LEMMA: "lähettäjä"},
     {ORTH: "miel.", LEMMA: "mieluummin"},
     {ORTH: "milj.", LEMMA: "miljoona"},
+    {ORTH: "Mm.", LEMMA: "muun muassa"},
     {ORTH: "mm.", LEMMA: "muun muassa"},
     {ORTH: "myöh.", LEMMA: "myöhempi"},
     {ORTH: "n.", LEMMA: "noin"},

@@ -3,6 +3,8 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
+from .punctuation import TOKENIZER_SUFFIXES

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -24,6 +26,9 @@ class RomanianDefaults(Language.Defaults):
     )
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
     stop_words = STOP_WORDS
+    prefixes = TOKENIZER_PREFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    infixes = TOKENIZER_INFIXES
     tag_map = TAG_MAP

spacy/lang/ro/punctuation.py (new file, 164 lines)
@@ -0,0 +1,164 @@

# coding: utf8
from __future__ import unicode_literals

import itertools

from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from ..char_classes import LIST_ICONS, CURRENCY
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT


_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]


_ro_variants = {
    "Ă": ["Ă", "A"],
    "Â": ["Â", "A"],
    "Î": ["Î", "I"],
    "Ș": ["Ș", "Ş", "S"],
    "Ț": ["Ț", "Ţ", "T"],
}


def _make_ro_variants(tokens):
    variants = []
    for token in tokens:
        upper_token = token.upper()
        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
        for variant in upper_variants:
            variants.extend([variant, variant.lower(), variant.title()])
    return sorted(list(set(variants)))


# UD_Romanian-RRT closed class prefixes
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
_ud_rrt_prefixes = [
    "a-", "c-", "ce-", "cu-", "d-", "de-", "dintr-", "e-", "făr-", "i-",
    "l-", "le-", "m-", "mi-", "n-", "ne-", "p-", "pe-", "prim-", "printr-",
    "s-", "se-", "te-", "v-", "într-", "ș-", "și-", "ți-",
]
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)


# UD_Romanian-RRT closed class suffixes without NUM
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
_ud_rrt_suffixes = [
    "-a", "-aceasta", "-ai", "-al", "-ale", "-alta", "-am", "-ar", "-astea",
    "-atâta", "-au", "-aș", "-ați", "-i", "-ilor", "-l", "-le", "-lea",
    "-mea", "-meu", "-mi", "-mă", "-n", "-ndărătul", "-ne", "-o", "-oi",
    "-or", "-s", "-se", "-si", "-te", "-ul", "-ului", "-un", "-uri",
    "-urile", "-urilor", "-veți", "-vă", "-ăștia", "-și", "-ți",
]
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)


_prefixes = (
    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
    + _ud_rrt_prefix_variants
    + LIST_PUNCT
    + LIST_ELLIPSES
    + LIST_QUOTES
    + LIST_CURRENCY
    + LIST_ICONS
)


_suffixes = (
    _ud_rrt_suffix_variants
    + LIST_PUNCT
    + LIST_ELLIPSES
    + LIST_QUOTES
    + _list_icons
    + ["—", "–"]
    + [
        r"(?<=[0-9])\+",
        r"(?<=°[FfCcKk])\.",
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
        ),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)


_infixes = (
    LIST_ELLIPSES
    + _list_icons
    + [
        r"(?<=[0-9])[+\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)


TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

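`_make_ro_variants` is what turns one canonical spelling into every diacritic and case variant the tokenizer should accept (comma-below vs. cedilla vs. plain ASCII). A quick illustration of its output for a single prefix, assuming a spaCy install that includes this new module:

```python
from spacy.lang.ro.punctuation import _make_ro_variants

# "și-" uppercases to "ȘI-"; "Ș" expands to ["Ș", "Ş", "S"], so the cartesian
# product gives "ȘI-", "ŞI-" and "SI-", and each is added in upper, lower and
# title case -- nine distinct variants in total.
variants = _make_ro_variants(["și-"])
assert len(variants) == 9
assert "și-" in variants and "şi-" in variants and "si-" in variants
```
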
@@ -2,6 +2,7 @@
 from __future__ import unicode_literals

 from ...symbols import ORTH
+from .punctuation import _make_ro_variants


 _exc = {}
@@ -45,8 +46,52 @@ for orth in [
     "dpdv",
     "șamd.",
     "ș.a.m.d.",
+    # below: from UD_Romanian-RRT:
+    "A.c.", "A.f.", "A.r.", "Al.", "Art.", "Aug.", "Bd.", "Dem.", "Dr.",
+    "Fig.", "Fr.", "Gh.", "Gr.", "Lt.", "Nr.", "Obs.", "Prof.", "Sf.",
+    "a.m.", "a.r.", "alin.", "art.", "d-l", "d-lui", "d-nei", "ex.", "fig.",
+    "ian.", "lit.", "lt.", "p.a.", "p.m.", "pct.", "prep.", "sf.", "tel.",
+    "univ.", "îngr.", "într-adevăr", "Șt.", "ș.a.",
 ]:
-    _exc[orth] = [{ORTH: orth}]
+    # note: does not distinguish capitalized-only exceptions from others
+    for variant in _make_ro_variants([orth]):
+        _exc[variant] = [{ORTH: variant}]


 TOKENIZER_EXCEPTIONS = _exc

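The effect is that every casing and diacritic variant of an exception is registered, so the tokenizer should keep abbreviations like these intact however they are typed. A quick check (output printed rather than asserted, since it depends on having a build with this change installed):

```python
import spacy

nlp = spacy.blank("ro")
# Comma-below, cedilla and plain-ASCII spellings are all among the generated
# variants for "ș.a.m.d.".
for text in ["ș.a.m.d.", "ş.a.m.d.", "s.a.m.d.", "S.A.M.D."]:
    print(text, "->", [t.text for t in nlp(text)])
```
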
@@ -608,6 +608,7 @@ class Language(object):
         link_vectors_to_models(self.vocab)
         if self.vocab.vectors.data.shape[1]:
             cfg["pretrained_vectors"] = self.vocab.vectors.name
+            cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
         if sgd is None:
             sgd = create_default_optimizer(Model.ops)
         self._optimizer = sgd

@@ -8,7 +8,7 @@ from ..language import component
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk
-from ..tokens import Span
+from ..tokens import Doc, Span
 from ..matcher import Matcher, PhraseMatcher

 DEFAULT_ENT_ID_SEP = "||"
@@ -162,6 +162,7 @@ class EntityRuler(object):
     @property
     def patterns(self):
         """Get all patterns that were added to the entity ruler.

         RETURNS (list): The original patterns, one dictionary per pattern.

         DOCS: https://spacy.io/api/entityruler#patterns
@@ -194,6 +195,7 @@ class EntityRuler(object):

         DOCS: https://spacy.io/api/entityruler#add_patterns
         """

         # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
         try:
             current_index = self.nlp.pipe_names.index(self.name)
@@ -203,7 +205,33 @@ class EntityRuler(object):
         except ValueError:
             subsequent_pipes = []
         with self.nlp.disable_pipes(subsequent_pipes):
+            token_patterns = []
+            phrase_pattern_labels = []
+            phrase_pattern_texts = []
+            phrase_pattern_ids = []
+
             for entry in patterns:
+                if isinstance(entry["pattern"], basestring_):
+                    phrase_pattern_labels.append(entry["label"])
+                    phrase_pattern_texts.append(entry["pattern"])
+                    phrase_pattern_ids.append(entry.get("id"))
+                elif isinstance(entry["pattern"], list):
+                    token_patterns.append(entry)
+
+            phrase_patterns = []
+            for label, pattern, ent_id in zip(
+                phrase_pattern_labels,
+                self.nlp.pipe(phrase_pattern_texts),
+                phrase_pattern_ids
+            ):
+                phrase_pattern = {
+                    "label": label, "pattern": pattern, "id": ent_id
+                }
+                if ent_id:
+                    phrase_pattern["id"] = ent_id
+                phrase_patterns.append(phrase_pattern)
+
+            for entry in token_patterns + phrase_patterns:
                 label = entry["label"]
                 if "id" in entry:
                     ent_label = label
@@ -212,8 +240,8 @@ class EntityRuler(object):
                     self._ent_ids[key] = (ent_label, entry["id"])

                 pattern = entry["pattern"]
-                if isinstance(pattern, basestring_):
-                    self.phrase_patterns[label].append(self.nlp(pattern))
+                if isinstance(pattern, Doc):
+                    self.phrase_patterns[label].append(pattern)
                 elif isinstance(pattern, list):
                     self.token_patterns[label].append(pattern)
                 else:
@@ -226,6 +254,8 @@ class EntityRuler(object):
     def _split_label(self, label):
         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
+
+        label (str): The value of label in a pattern entry

         RETURNS (tuple): ent_label, ent_id
         """
         if self.ent_id_sep in label:
@@ -239,6 +269,9 @@ class EntityRuler(object):
     def _create_label(self, label, ent_id):
         """Join Entity label with ent_id if the pattern has an `id` attribute
+
+        label (str): The label to set for ent.label_
+        ent_id (str): The label

         RETURNS (str): The ent_label joined with configured `ent_id_sep`
         """
         if isinstance(ent_id, basestring_):
@@ -250,6 +283,7 @@ class EntityRuler(object):

         patterns_bytes (bytes): The bytestring to load.
         **kwargs: Other config paramters, mostly for consistency.
+
         RETURNS (EntityRuler): The loaded entity ruler.

         DOCS: https://spacy.io/api/entityruler#from_bytes
@@ -292,6 +326,7 @@ class EntityRuler(object):

         path (unicode / Path): The JSONL file to load.
         **kwargs: Other config paramters, mostly for consistency.
+
         RETURNS (EntityRuler): The loaded entity ruler.

         DOCS: https://spacy.io/api/entityruler#from_disk

|
||||||
self.add_label(cat)
|
self.add_label(cat)
|
||||||
if self.model is True:
|
if self.model is True:
|
||||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||||
|
self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims")
|
||||||
self.require_labels()
|
self.require_labels()
|
||||||
self.model = self.Model(len(self.labels), **self.cfg)
|
self.model = self.Model(len(self.labels), **self.cfg)
|
||||||
link_vectors_to_models(self.vocab)
|
link_vectors_to_models(self.vocab)
|
||||||
|
|
|
@@ -463,3 +463,5 @@ cdef enum symbol_t:

     ENT_KB_ID
     ENT_ID
+
+    IDX

@@ -93,6 +93,7 @@ IDS = {
     "SPACY": SPACY,
     "PROB": PROB,
     "LANG": LANG,
+    "IDX": IDX,

     "ADJ": ADJ,
     "ADP": ADP,

@@ -66,3 +66,14 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
     words = ["An", "example", "sentence"]
     doc = Doc(en_vocab, words=words)
     Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
+
+
+def test_doc_array_idx(en_vocab):
+    """Test that Doc.to_array can retrieve token start indices"""
+    words = ["An", "example", "sentence"]
+    doc = Doc(en_vocab, words=words)
+    offsets = Doc(en_vocab, words=words).to_array("IDX")
+
+    assert offsets[0] == 0
+    assert offsets[1] == 3
+    assert offsets[2] == 11

@@ -7,7 +7,7 @@ import numpy
 from spacy.tokens import Doc, Span
 from spacy.vocab import Vocab
 from spacy.errors import ModelsWarning
-from spacy.attrs import ENT_TYPE, ENT_IOB
+from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP

 from ..util import get_doc

@@ -274,6 +274,39 @@ def test_doc_is_nered(en_vocab):
     assert new_doc.is_nered


+def test_doc_from_array_sent_starts(en_vocab):
+    words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
+    heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
+    deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
+    doc = Doc(en_vocab, words=words)
+    for i, (dep, head) in enumerate(zip(deps, heads)):
+        doc[i].dep_ = dep
+        doc[i].head = doc[head]
+        if head == i:
+            doc[i].is_sent_start = True
+    doc.is_parsed
+
+    attrs = [SENT_START, HEAD]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    with pytest.raises(ValueError):
+        new_doc.from_array(attrs, arr)
+
+    attrs = [SENT_START, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert not new_doc.is_parsed
+
+    attrs = [HEAD, DEP]
+    arr = doc.to_array(attrs)
+    new_doc = Doc(en_vocab, words=words)
+    new_doc.from_array(attrs, arr)
+    assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
+    assert new_doc.is_parsed
+
+
 def test_doc_lang(en_vocab):
     doc = Doc(en_vocab, words=["Hello", "world"])
     assert doc.lang_ == "en"

@@ -279,3 +279,12 @@ def test_filter_spans(doc):
     assert len(filtered[1]) == 5
     assert filtered[0].start == 1 and filtered[0].end == 4
     assert filtered[1].start == 5 and filtered[1].end == 10
+
+
+def test_span_eq_hash(doc, doc_not_parsed):
+    assert doc[0:2] == doc[0:2]
+    assert doc[0:2] != doc[1:3]
+    assert doc[0:2] != doc_not_parsed[0:2]
+    assert hash(doc[0:2]) == hash(doc[0:2])
+    assert hash(doc[0:2]) != hash(doc[1:3])
+    assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])

@@ -10,28 +10,33 @@ ABBREVIATION_TESTS = [
         ["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
     ),
     ("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
+    (
+        "Vuonna 1 eaa. tapahtui kauheita.",
+        ["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
+    ),
 ]

 HYPHENATED_TESTS = [
     (
-        "1700-luvulle sijoittuva taide-elokuva",
-        ["1700-luvulle", "sijoittuva", "taide-elokuva"],
+        "1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
+        [
+            "1700-luvulle",
+            "sijoittuva",
+            "taide-elokuva",
+            "Wikimedia-säätiön",
+            "Varsinais-Suomen",
+        ],
     )
 ]

 ABBREVIATION_INFLECTION_TESTS = [
     (
         "VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
-        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
+        ["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"],
     ),
-    (
-        "ALV:n osuus on 24 %.",
-        ["ALV:n", "osuus", "on", "24", "%", "."]
-    ),
-    (
-        "Hiihtäjä oli kilpailun 14:s.",
-        ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
-    )
+    ("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
+    ("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
+    ("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
 ]

@@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
     deps = displacy.parse_deps(doc)
     assert isinstance(deps, dict)
     assert deps["words"] == [
-        {"text": "This", "tag": "DET"},
-        {"text": "is", "tag": "AUX"},
-        {"text": "a", "tag": "DET"},
-        {"text": "sentence", "tag": "NOUN"},
+        {"lemma": None, "text": "This", "tag": "DET"},
+        {"lemma": None, "text": "is", "tag": "AUX"},
+        {"lemma": None, "text": "a", "tag": "DET"},
+        {"lemma": None, "text": "sentence", "tag": "NOUN"},
     ]
     assert deps["arcs"] == [
         {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},

@@ -95,7 +95,11 @@ def assert_docs_equal(doc1, doc2):

     assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
     assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
-    assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
+    for ent1, ent2 in zip(doc1.ents, doc2.ents):
+        assert ent1.start == ent2.start
+        assert ent1.end == ent2.end
+        assert ent1.label == ent2.label
+        assert ent1.kb_id == ent2.kb_id


 def assert_packed_msg_equal(b1, b2):

@ -23,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||||
from ..typedefs cimport attr_t, flags_t
|
from ..typedefs cimport attr_t, flags_t
|
||||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
|
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||||
|
|
||||||
from ..attrs import intify_attrs, IDS
|
from ..attrs import intify_attrs, IDS
|
||||||
|
@@ -73,6 +73,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
         return token.ent_id
     elif feat_name == ENT_KB_ID:
         return token.ent_kb_id
+    elif feat_name == IDX:
+        return token.idx
     else:
        return Lexeme.get_struct_attr(token.lex, feat_name)

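A quick sketch of what the `IDX` support above enables (the pipeline and sentence below are illustrative assumptions, not part of the diff): character offsets can now be exported through `Doc.to_array` alongside other attributes.

```python
import spacy
from spacy.attrs import IDX, ORTH

# A blank pipeline is enough, since token.idx is set by the tokenizer.
nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

# Each row holds (orth hash, character offset) for one token.
arr = doc.to_array([ORTH, IDX])
print(arr[:, 1])  # character offsets: [0, 5, 8, 12, 14, 17, 24]
```
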
@@ -813,7 +815,7 @@ cdef class Doc:
                 if attr_ids[j] != TAG:
                     Token.set_struct_attr(token, attr_ids[j], array[i, j])
         # Set flags
-        self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
+        self.is_parsed = bool(self.is_parsed or HEAD in attrs)
         self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
         # If document is parsed, set children
         if self.is_parsed:

@@ -127,22 +127,27 @@ cdef class Span:
                 return False
             else:
                 return True
-        # Eq
+        # <
         if op == 0:
             return self.start_char < other.start_char
+        # <=
         elif op == 1:
             return self.start_char <= other.start_char
+        # ==
         elif op == 2:
-            return self.start_char == other.start_char and self.end_char == other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # !=
         elif op == 3:
-            return self.start_char != other.start_char or self.end_char != other.end_char
+            return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
+        # >
         elif op == 4:
             return self.start_char > other.start_char
+        # >=
         elif op == 5:
             return self.start_char >= other.start_char

     def __hash__(self):
-        return hash((self.doc, self.label, self.start_char, self.end_char))
+        return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))

     def __len__(self):
         """Get the number of tokens in the span.

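A minimal sketch of the comparison change above (the blank pipeline and labels are assumptions for illustration): spans over the same offsets but with different labels or KB IDs no longer compare equal, and `__hash__` now uses the same tuple.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening a new office")

# Same token offsets, different labels.
span_org = Span(doc, 0, 1, label="ORG")
span_person = Span(doc, 0, 1, label="PERSON")

# label and kb_id are now part of the comparison, so these are not equal.
assert span_org != span_person
```
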
@@ -283,7 +283,11 @@ cdef class Vectors:

         DOCS: https://spacy.io/api/vectors#add
         """
-        key = get_string_id(key)
+        # use int for all keys and rows in key2row for more efficient access
+        # and serialization
+        key = int(get_string_id(key))
+        if row is not None:
+            row = int(row)
         if row is None and key in self.key2row:
             row = self.key2row[key]
         elif row is None:

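A small usage sketch of `Vectors.add` under the int-key handling above (the table shape, keys, and values are illustrative assumptions):

```python
import numpy
from spacy.vectors import Vectors

# An illustrative table: 2 rows of width 4.
vectors = Vectors(shape=(2, 4))

# String keys are hashed and stored as plain Python ints in key2row.
vectors.add("cat", vector=numpy.asarray([0.1, 0.2, 0.3, 0.4], dtype="f"))
vectors.add("dog", row=1)

print(vectors.find(key="cat"))  # the row assigned to "cat"
```
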
@@ -239,6 +239,7 @@ If a setting is not present in the options, the default value will be used.
 | Name               | Type | Description                                                                                                      | Default |
 | ------------------ | ---- | ---------------------------------------------------------------------------------------------------------------- | ------- |
 | `fine_grained`     | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`).                | `False` |
+| `add_lemma`        | bool | Print the lemmas in a separate row below the token texts in the `dep` visualisation.                              | `False` |
 | `collapse_punct`   | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation.   | `True`  |
 | `collapse_phrases` | bool | Merge noun phrases into one token.                                                                                | `False` |
 | `compact`          | bool | "Compact mode" with square arrows that takes up less space.                                                       | `False` |

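For context, a short sketch of how the new `add_lemma` option might be used (the loaded model is an assumption; any pipeline that sets lemmas works):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # assumption: a model with a lemmatizer
doc = nlp("This is a sentence.")

# add_lemma prints the lemmas in an extra row below the token texts.
displacy.render(doc, style="dep", options={"add_lemma": True, "compact": True})
```
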
@@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
 be restored and deserialized – including the entity ruler. This lets you ship
 powerful model packages with binary weights _and_ rules included!

+### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
+
+When using a large number of **phrase patterns** (roughly > 10000), it's useful to understand how the `add_patterns` function of the EntityRuler works. For each **phrase pattern**, the EntityRuler calls the `nlp` object to construct a `Doc` object. This is necessary if, for example, you add the EntityRuler at the end of an existing pipeline with a POS tagger and want to extract matches based on the pattern's POS signature.
+
+In this case you would pass a config value of `phrase_matcher_attr="POS"` for the EntityRuler.
+
+Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time on a large number of phrase patterns.
+
+As of spaCy 2.2.4, the `add_patterns` function has been refactored to use `nlp.pipe` on all phrase patterns, resulting in about a 10x-20x speedup with 5,000-100,000 phrase patterns respectively.
+
+Even with this speedup (but especially if you're using an older version), the `add_patterns` function can still take a long time.
+
+An easy workaround to make this function run faster is disabling the other language pipes while adding the phrase patterns:
+
+```python
+entityruler = EntityRuler(nlp)
+patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
+
+other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
+with nlp.disable_pipes(*other_pipes):
+    entityruler.add_patterns(patterns)
+```
+
 ## Combining models and rules {#models-rules}

 You can combine statistical and rule-based components in a variety of ways.

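And a brief sketch of the `phrase_matcher_attr` configuration mentioned in the section above (the loaded model and the pattern are illustrative assumptions; any pipeline with a tagger works):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")  # assumption: a pipeline that includes a tagger

# Match phrase patterns on POS tags instead of the verbatim token text.
ruler = EntityRuler(nlp, phrase_matcher_attr="POS")
ruler.add_patterns([{"label": "ANIMAL", "pattern": "golden retriever"}])  # illustrative pattern
nlp.add_pipe(ruler)
```
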
@@ -999,6 +999,17 @@
             "author": "Graphbrain",
             "category": ["standalone"]
         },
+        {
+            "type": "education",
+            "id": "nostarch-nlp-python",
+            "title": "Natural Language Processing Using Python",
+            "slogan": "No Starch Press, 2020",
+            "description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
+            "cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
+            "url": "https://nostarch.com/NLPPython",
+            "author": "Yuli Vasiliev",
+            "category": ["books"]
+        },
         {
             "type": "education",
             "id": "oreilly-python-ds",