Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-03 22:06:37 +03:00
Commit 89967f3701: Merge branch 'master' into spacy.io
106
.github/contributors/Jan-711.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Jan Jessewitsch |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 16.02.2020 |
|
||||
| GitHub username | Jan-711 |
|
||||
| Website (optional) | |
|
106
.github/contributors/MisterKeefe.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Tom Keefe |
|
||||
| Company name (if applicable) | / |
|
||||
| Title or role (if applicable) | / |
|
||||
| Date | 18 February 2020 |
|
||||
| GitHub username | MisterKeefe |
|
||||
| Website (optional) | / |
|
|
@ -1,5 +1,5 @@
|
|||
recursive-include include *.h
|
||||
recursive-include spacy *.txt
|
||||
recursive-include spacy *.txt *.pyx *.pxd
|
||||
include LICENSE
|
||||
include README.md
|
||||
include bin/spacy
|
||||
|
|
|
@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
|||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def load_model(name):
|
||||
return spacy.load(name)
|
||||
|
||||
|
||||
@st.cache(ignore_hash=True)
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def process_text(model_name, text):
|
||||
nlp = load_model(model_name)
|
||||
return nlp(text)
|
||||
|
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
|
|||
st.header("Named Entities")
|
||||
st.sidebar.header("Named Entities")
|
||||
label_set = nlp.get_pipe("ner").labels
|
||||
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
|
||||
labels = st.sidebar.multiselect(
|
||||
"Entity labels", options=label_set, default=list(label_set)
|
||||
)
|
||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||
# Newlines seem to mess with the rendering
|
||||
html = html.replace("\n", " ")
|
||||
|
|
|
@ -92,3 +92,5 @@ cdef enum attr_id_t:
|
|||
LANG
|
||||
ENT_KB_ID = symbols.ENT_KB_ID
|
||||
ENT_ID = symbols.ENT_ID
|
||||
|
||||
IDX
|
||||
|
|
|
@ -91,6 +91,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"IDX": IDX
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -34,7 +34,7 @@ from .train import _load_pretrained_tok2vec
|
|||
vectors_model=("Name or path to spaCy model with vectors to learn from"),
|
||||
output_dir=("Directory to write models to on each epoch", "positional", None, str),
|
||||
width=("Width of CNN layers", "option", "cw", int),
|
||||
depth=("Depth of CNN layers", "option", "cd", int),
|
||||
conv_depth=("Depth of CNN layers", "option", "cd", int),
|
||||
cnn_window=("Window size for CNN layers", "option", "cW", int),
|
||||
cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
|
||||
use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
|
||||
|
@ -84,7 +84,7 @@ def pretrain(
|
|||
vectors_model,
|
||||
output_dir,
|
||||
width=96,
|
||||
depth=4,
|
||||
conv_depth=4,
|
||||
bilstm_depth=0,
|
||||
cnn_pieces=3,
|
||||
sa_depth=0,
|
||||
|
@ -132,9 +132,15 @@ def pretrain(
|
|||
msg.info("Using GPU" if has_gpu else "Not using GPU")
|
||||
|
||||
output_dir = Path(output_dir)
|
||||
if output_dir.exists() and [p for p in output_dir.iterdir()]:
|
||||
msg.warn(
|
||||
"Output directory is not empty",
|
||||
"It is better to use an empty directory or refer to a new output path, "
|
||||
"then the new directory will be created for you.",
|
||||
)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
msg.good("Created output directory")
|
||||
msg.good("Created output directory: {}".format(output_dir))
|
||||
srsly.write_json(output_dir / "config.json", config)
|
||||
msg.good("Saved settings to config.json")
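The output-directory handling in this hunk can be tried in isolation. The following is a minimal sketch, not the code from `pretrain`: `prepare_output_dir` is a hypothetical helper that returns its messages as tuples where the command itself would call `msg.warn` and `msg.good`.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(output_dir):
    """Create output_dir if needed and report what happened.

    Hypothetical helper: returns a list of (level, text) tuples where the
    real command would print via msg.warn / msg.good.
    """
    output_dir = Path(output_dir)
    messages = []
    # any() stops at the first entry, so the directory is not fully listed
    if output_dir.exists() and any(output_dir.iterdir()):
        messages.append(("warn", "Output directory is not empty"))
    if not output_dir.exists():
        output_dir.mkdir()
        messages.append(("good", "Created output directory: {}".format(output_dir)))
    return messages


with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "run1"
    first = prepare_output_dir(fresh)   # directory does not exist yet
    (fresh / "config.json").write_text("{}")
    second = prepare_output_dir(fresh)  # directory now has content
```

The first call creates the directory and reports it; the second call only warns, because the directory already exists and is no longer empty.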
|
||||
|
||||
|
@ -162,7 +168,7 @@ def pretrain(
|
|||
Tok2Vec(
|
||||
width,
|
||||
embed_rows,
|
||||
conv_depth=depth,
|
||||
conv_depth=conv_depth,
|
||||
pretrained_vectors=pretrained_vectors,
|
||||
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
|
||||
subword_features=not use_chars, # Set to False for Chinese etc
|
||||
|
|
|
@ -14,6 +14,7 @@ import contextlib
|
|||
import random
|
||||
|
||||
from .._ml import create_default_optimizer
|
||||
from ..util import use_gpu as set_gpu
|
||||
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
|
||||
from ..gold import GoldCorpus
|
||||
from ..compat import path2str
|
||||
|
@ -32,6 +33,13 @@ from .. import about
|
|||
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
|
||||
replace_components=("Replace components from base model", "flag", "R", bool),
|
||||
vectors=("Model to load vectors from", "option", "v", str),
|
||||
width=("Width of CNN layers of Tok2Vec component", "option", "cw", int),
|
||||
conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int),
|
||||
cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int),
|
||||
cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int),
|
||||
use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool),
|
||||
bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int),
|
||||
embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int),
|
||||
n_iter=("Number of iterations", "option", "n", int),
|
||||
n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
|
||||
n_examples=("Number of examples", "option", "ns", int),
|
||||
|
@ -63,6 +71,13 @@ def train(
|
|||
pipeline="tagger,parser,ner",
|
||||
replace_components=False,
|
||||
vectors=None,
|
||||
width=96,
|
||||
conv_depth=4,
|
||||
cnn_window=1,
|
||||
cnn_pieces=3,
|
||||
use_chars=False,
|
||||
bilstm_depth=0,
|
||||
embed_rows=2000,
|
||||
n_iter=30,
|
||||
n_early_stopping=None,
|
||||
n_examples=0,
|
||||
|
@ -115,6 +130,7 @@ def train(
|
|||
)
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
msg.good("Created output directory: {}".format(output_path))
|
||||
|
||||
# Take dropout and batch size as generators of values -- dropout
|
||||
# starts high and decays sharply, to force the optimizer to explore.
|
||||
|
@ -147,6 +163,18 @@ def train(
|
|||
    disabled_pipes = None
    pipes_added = False
    msg.text("Training pipeline: {}".format(pipeline))
    if use_gpu >= 0:
        activated_gpu = None
        try:
            activated_gpu = set_gpu(use_gpu)
        except Exception as e:
            msg.warn("Exception: {}".format(e))
        if activated_gpu is not None:
            msg.text("Using GPU: {}".format(use_gpu))
        else:
            msg.warn("Unable to activate GPU: {}".format(use_gpu))
            msg.text("Using CPU only")
            use_gpu = -1
    if base_model:
        msg.text("Starting with base model '{}'".format(base_model))
        nlp = util.load_model(base_model)
|
||||
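The GPU activation added to `train` follows a try-then-fall-back pattern. Below is a minimal sketch of that control flow, assuming a hypothetical `activate` callable in place of `set_gpu`; `resolve_device` and `broken_activate` are illustrative names and not part of spaCy.

```python
def resolve_device(use_gpu, activate):
    """Return the device id to train on, or -1 for CPU.

    `activate` is a hypothetical stand-in for set_gpu: it takes a device id
    and either returns a device object or raises.
    """
    if use_gpu < 0:
        return -1
    activated = None
    try:
        activated = activate(use_gpu)
    except Exception as e:
        print("Exception: {}".format(e))
    if activated is not None:
        return use_gpu
    # Activation raised or returned None: fall back to CPU
    return -1


def broken_activate(gpu_id):
    raise RuntimeError("no CUDA device")


cpu_result = resolve_device(0, broken_activate)
gpu_result = resolve_device(0, lambda gpu_id: "device-{}".format(gpu_id))
```

Resetting the device id to -1 on failure matters because the same value is later passed on as the `device` setting, so a failed activation should not leave a GPU id behind.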
|
@ -237,7 +265,15 @@ def train(
|
|||
optimizer = create_default_optimizer(Model.ops)
|
||||
else:
|
||||
# Start with a blank model, call begin_training
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
||||
cfg = {"device": use_gpu}
|
||||
cfg["conv_depth"] = conv_depth
|
||||
cfg["token_vector_width"] = width
|
||||
cfg["bilstm_depth"] = bilstm_depth
|
||||
cfg["cnn_maxout_pieces"] = cnn_pieces
|
||||
cfg["embed_size"] = embed_rows
|
||||
cfg["conv_window"] = cnn_window
|
||||
cfg["subword_features"] = not use_chars
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
|
||||
|
||||
nlp._optimizer = None
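The hunk above maps the new `train` options onto the `cfg` keys passed to `nlp.begin_training`. As a rough sketch of that mapping, with defaults copied from the `train` signature in this diff, the hypothetical helper `build_tok2vec_cfg` below collects it in one place; it does not exist in spaCy.

```python
def build_tok2vec_cfg(use_gpu=-1, width=96, conv_depth=4, cnn_window=1,
                      cnn_pieces=3, use_chars=False, bilstm_depth=0,
                      embed_rows=2000):
    """Map the CLI option names onto the cfg keys used in the hunk above."""
    return {
        "device": use_gpu,
        "conv_depth": conv_depth,
        "token_vector_width": width,
        "bilstm_depth": bilstm_depth,
        "cnn_maxout_pieces": cnn_pieces,
        "embed_size": embed_rows,
        "conv_window": cnn_window,
        # use_chars is inverted: character-based embedding turns
        # subword features off
        "subword_features": not use_chars,
    }


cfg = build_tok2vec_cfg(width=128, use_chars=True)
```

Note that several option names differ from their `cfg` keys (`width` becomes `token_vector_width`, `embed_rows` becomes `embed_size`), and `use_chars` is inverted into `subword_features`.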
|
||||
|
||||
|
@ -362,13 +398,19 @@ def train(
|
|||
if not batch:
|
||||
continue
|
||||
docs, golds = zip(*batch)
|
||||
nlp.update(
|
||||
docs,
|
||||
golds,
|
||||
sgd=optimizer,
|
||||
drop=next(dropout_rates),
|
||||
losses=losses,
|
||||
)
|
||||
try:
|
||||
nlp.update(
|
||||
docs,
|
||||
golds,
|
||||
sgd=optimizer,
|
||||
drop=next(dropout_rates),
|
||||
losses=losses,
|
||||
)
|
||||
except ValueError as e:
|
||||
msg.warn("Error during training")
|
||||
if init_tok2vec:
|
||||
msg.warn("Did you provide the same parameters during 'train' as during 'pretrain'?")
|
||||
msg.fail("Original error message: {}".format(e), exits=1)
|
||||
if raw_text:
|
||||
# If raw text is available, perform 'rehearsal' updates,
|
||||
# which use unlabelled data to reduce overfitting.
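The new `try`/`except` around `nlp.update` turns a `ValueError` into a message that hints at mismatched `pretrain` and `train` settings. A minimal sketch of the same idea follows, assuming a hypothetical zero-argument `update` callable and using `SystemExit` where the hunk calls `msg.fail(..., exits=1)`.

```python
def update_with_hint(update, init_tok2vec=None):
    """Run one update step; turn a ValueError into a readable exit message.

    `update` is a hypothetical zero-argument callable standing in for
    nlp.update(...).
    """
    try:
        update()
    except ValueError as e:
        lines = ["Error during training"]
        if init_tok2vec:
            lines.append(
                "Did you provide the same parameters during 'train' "
                "as during 'pretrain'?"
            )
        lines.append("Original error message: {}".format(e))
        # The hunk ends the process via msg.fail(..., exits=1); raising
        # SystemExit with a message is used here as a comparable stand-in.
        raise SystemExit("\n".join(lines))


def mismatched_update():
    raise ValueError("shape mismatch")


try:
    update_with_hint(mismatched_update, init_tok2vec="weights.bin")
except SystemExit as exit_info:
    report = str(exit_info)
```

The hint is only added when pretrained tok2vec weights were supplied, since that is the case where differing width or depth settings are a likely cause.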
|
||||
|
@ -495,6 +537,8 @@ def train(
|
|||
"score = {}".format(best_score, current_score)
|
||||
)
|
||||
break
|
||||
except Exception as e:
|
||||
msg.warn("Aborting and saving the final best model. Encountered exception: {}".format(e))
|
||||
finally:
|
||||
best_pipes = nlp.pipe_names
|
||||
if disabled_pipes:
|
||||
|
|
|
@ -144,10 +144,12 @@ def parse_deps(orig_doc, options={}):
|
|||
for span, tag, lemma, ent_type in spans:
|
||||
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
|
||||
retokenizer.merge(span, attrs=attrs)
|
||||
if options.get("fine_grained"):
|
||||
words = [{"text": w.text, "tag": w.tag_} for w in doc]
|
||||
else:
|
||||
words = [{"text": w.text, "tag": w.pos_} for w in doc]
|
||||
fine_grained = options.get("fine_grained")
|
||||
add_lemma = options.get("add_lemma")
|
||||
words = [{"text": w.text,
|
||||
"tag": w.tag_ if fine_grained else w.pos_,
|
||||
"lemma": w.lemma_ if add_lemma else None} for w in doc]
|
||||
|
||||
arcs = []
|
||||
for word in doc:
|
||||
if word.i < word.head.i:
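The rewritten list comprehension in `parse_deps` picks the tag and lemma per token from two options. The sketch below reproduces that selection on plain data, assuming a hypothetical `Token` namedtuple with only the attributes the comprehension reads; `build_words` is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy tokens with just the attributes used here
Token = namedtuple("Token", ["text", "tag_", "pos_", "lemma_"])


def build_words(doc, options):
    fine_grained = options.get("fine_grained")
    add_lemma = options.get("add_lemma")
    return [
        {
            "text": w.text,
            "tag": w.tag_ if fine_grained else w.pos_,
            "lemma": w.lemma_ if add_lemma else None,
        }
        for w in doc
    ]


doc = [Token("cats", "NNS", "NOUN", "cat"), Token("sleep", "VBP", "VERB", "sleep")]
coarse = build_words(doc, {})
detailed = build_words(doc, {"fine_grained": True, "add_lemma": True})
```

With no options set, each word carries the coarse-grained tag and a `lemma` of `None`, which the renderer can use to decide whether a lemma row is drawn.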
|
||||
|
|
|
@ -3,7 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
import uuid
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from ..util import minify_html, escape_html, registry
|
||||
from ..errors import Errors
|
||||
|
@ -83,7 +83,7 @@ class DependencyRenderer(object):
|
|||
self.width = self.offset_x + len(words) * self.distance
|
||||
self.height = self.offset_y + 3 * self.word_spacing
|
||||
self.id = render_id
|
||||
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
|
||||
words = [self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words)]
|
||||
arcs = [
|
||||
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
|
||||
for i, a in enumerate(arcs)
|
||||
|
@ -101,7 +101,7 @@ class DependencyRenderer(object):
|
|||
lang=self.lang,
|
||||
)
|
||||
|
||||
def render_word(self, text, tag, i):
|
||||
def render_word(self, text, tag, lemma, i,):
|
||||
"""Render individual word.
|
||||
|
||||
text (unicode): Word text.
|
||||
|
@ -114,6 +114,8 @@ class DependencyRenderer(object):
|
|||
if self.direction == "rtl":
|
||||
x = self.width - x
|
||||
html_text = escape_html(text)
|
||||
if lemma is not None:
|
||||
return TPL_DEP_WORDS_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x, y=y)
|
||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||
|
||||
def render_arrow(self, label, start, end, direction, i):
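`render_word` now chooses between two templates depending on whether a lemma is present. A minimal sketch of that branch follows, assuming simplified stand-in templates (`TPL_WORD`, `TPL_WORD_LEMMA`) and the standard library's `html.escape` in place of spaCy's `escape_html`.

```python
from html import escape

# Simplified stand-ins for TPL_DEP_WORDS and TPL_DEP_WORDS_LEMMA
TPL_WORD = (
    '<tspan class="displacy-word" x="{x}">{text}</tspan>'
    '<tspan class="displacy-tag" x="{x}">{tag}</tspan>'
)
TPL_WORD_LEMMA = (
    '<tspan class="displacy-word" x="{x}">{text}</tspan>'
    '<tspan class="displacy-lemma" x="{x}">{lemma}</tspan>'
    '<tspan class="displacy-tag" x="{x}">{tag}</tspan>'
)


def render_word(text, tag, lemma=None, x=0):
    html_text = escape(text)
    # Only switch templates when a lemma was actually supplied
    if lemma is not None:
        return TPL_WORD_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x)
    return TPL_WORD.format(text=html_text, tag=tag, x=x)


plain = render_word("R&D", "NOUN")
with_lemma = render_word("cats", "NOUN", lemma="cat")
```

Testing for `None` instead of truthiness means an empty-string lemma still selects the lemma template.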
|
||||
|
|
|
@ -18,6 +18,15 @@ TPL_DEP_WORDS = """
|
|||
"""
|
||||
|
||||
|
||||
TPL_DEP_WORDS_LEMMA = """
|
||||
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
|
||||
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
|
||||
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
|
||||
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
|
||||
</text>
|
||||
"""
|
||||
|
||||
|
||||
TPL_DEP_ARCS = """
|
||||
<g class="displacy-arrow">
|
||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||
|
|
|
@ -22,14 +22,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
|
|||
durfte durften
|
||||
|
||||
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
|
||||
einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
|
||||
einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
|
||||
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
|
||||
|
||||
früher fünf fünfte fünften fünfter fünftes für
|
||||
|
||||
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
|
||||
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
|
||||
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
|
||||
gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
|
||||
großen grosser großer grosses großes gut gute guter gutes
|
||||
|
||||
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
|
||||
|
@ -47,9 +47,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
|
|||
lang lange leicht leider lieber los
|
||||
|
||||
machen macht machte mag magst man manche manchem manchen mancher manches mehr
|
||||
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
|
||||
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
|
||||
musste mussten
|
||||
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
|
||||
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
|
||||
|
||||
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
|
||||
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .tag_map_general import TAG_MAP
|
||||
from ..tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .lemmatizer import GreekLemmatizer
|
||||
|
|
|
@ -1,27 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
|
||||
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
|
||||
|
||||
|
||||
TAG_MAP = {
|
||||
"ADJ": {POS: ADJ},
|
||||
"ADV": {POS: ADV},
|
||||
"INTJ": {POS: INTJ},
|
||||
"NOUN": {POS: NOUN},
|
||||
"PROPN": {POS: PROPN},
|
||||
"VERB": {POS: VERB},
|
||||
"ADP": {POS: ADP},
|
||||
"CCONJ": {POS: CCONJ},
|
||||
"SCONJ": {POS: SCONJ},
|
||||
"PART": {POS: PART},
|
||||
"PUNCT": {POS: PUNCT},
|
||||
"SYM": {POS: SYM},
|
||||
"NUM": {POS: NUM},
|
||||
"PRON": {POS: PRON},
|
||||
"AUX": {POS: AUX},
|
||||
"SPACE": {POS: SPACE},
|
||||
"DET": {POS: DET},
|
||||
"X": {POS: X},
|
||||
}
|
|
@ -14,6 +14,7 @@ for exc_data in [
|
|||
{ORTH: "alv.", LEMMA: "arvonlisävero"},
|
||||
{ORTH: "ark.", LEMMA: "arkisin"},
|
||||
{ORTH: "as.", LEMMA: "asunto"},
|
||||
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
|
||||
{ORTH: "ed.", LEMMA: "edellinen"},
|
||||
{ORTH: "esim.", LEMMA: "esimerkki"},
|
||||
{ORTH: "huom.", LEMMA: "huomautus"},
|
||||
|
@ -27,6 +28,7 @@ for exc_data in [
|
|||
{ORTH: "läh.", LEMMA: "lähettäjä"},
|
||||
{ORTH: "miel.", LEMMA: "mieluummin"},
|
||||
{ORTH: "milj.", LEMMA: "miljoona"},
|
||||
{ORTH: "Mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "mm.", LEMMA: "muun muassa"},
|
||||
{ORTH: "myöh.", LEMMA: "myöhempi"},
|
||||
{ORTH: "n.", LEMMA: "noin"},
|
||||
|
|
|
@ -3,6 +3,8 @@ from __future__ import unicode_literals
|
|||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ..norm_exceptions import BASE_NORMS
|
||||
|
@ -24,6 +26,9 @@ class RomanianDefaults(Language.Defaults):
|
|||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
|
||||
|
||||
|
|
164
spacy/lang/ro/punctuation.py
Normal file
|
@ -0,0 +1,164 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import itertools
|
||||
|
||||
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
|
||||
from ..char_classes import LIST_ICONS, CURRENCY
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
|
||||
|
||||
|
||||
_list_icons = [x for x in LIST_ICONS if x != "°"]
|
||||
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
|
||||
|
||||
|
||||
_ro_variants = {
    "Ă": ["Ă", "A"],
    "Â": ["Â", "A"],
    "Î": ["Î", "I"],
    "Ș": ["Ș", "Ş", "S"],
    "Ț": ["Ț", "Ţ", "T"],
}


def _make_ro_variants(tokens):
    variants = []
    for token in tokens:
        upper_token = token.upper()
        upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
        upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
        for variant in upper_variants:
            variants.extend([variant, variant.lower(), variant.title()])
    return sorted(list(set(variants)))
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class prefixes
|
||||
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
|
||||
_ud_rrt_prefixes = [
|
||||
"a-",
|
||||
"c-",
|
||||
"ce-",
|
||||
"cu-",
|
||||
"d-",
|
||||
"de-",
|
||||
"dintr-",
|
||||
"e-",
|
||||
"făr-",
|
||||
"i-",
|
||||
"l-",
|
||||
"le-",
|
||||
"m-",
|
||||
"mi-",
|
||||
"n-",
|
||||
"ne-",
|
||||
"p-",
|
||||
"pe-",
|
||||
"prim-",
|
||||
"printr-",
|
||||
"s-",
|
||||
"se-",
|
||||
"te-",
|
||||
"v-",
|
||||
"într-",
|
||||
"ș-",
|
||||
"și-",
|
||||
"ți-",
|
||||
]
|
||||
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
|
||||
|
||||
|
||||
# UD_Romanian-RRT closed class suffixes without NUM
|
||||
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
|
||||
_ud_rrt_suffixes = [
|
||||
"-a",
|
||||
"-aceasta",
|
||||
"-ai",
|
||||
"-al",
|
||||
"-ale",
|
||||
"-alta",
|
||||
"-am",
|
||||
"-ar",
|
||||
"-astea",
|
||||
"-atâta",
|
||||
"-au",
|
||||
"-aș",
|
||||
"-ați",
|
||||
"-i",
|
||||
"-ilor",
|
||||
"-l",
|
||||
"-le",
|
||||
"-lea",
|
||||
"-mea",
|
||||
"-meu",
|
||||
"-mi",
|
||||
"-mă",
|
||||
"-n",
|
||||
"-ndărătul",
|
||||
"-ne",
|
||||
"-o",
|
||||
"-oi",
|
||||
"-or",
|
||||
"-s",
|
||||
"-se",
|
||||
"-si",
|
||||
"-te",
|
||||
"-ul",
|
||||
"-ului",
|
||||
"-un",
|
||||
"-uri",
|
||||
"-urile",
|
||||
"-urilor",
|
||||
"-veți",
|
||||
"-vă",
|
||||
"-ăștia",
|
||||
"-și",
|
||||
"-ți",
|
||||
]
|
||||
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
|
||||
|
||||
|
||||
_prefixes = (
|
||||
["§", "%", "=", "—", "–", r"\+(?![0-9])"]
|
||||
+ _ud_rrt_prefix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_CURRENCY
|
||||
+ LIST_ICONS
|
||||
)
|
||||
|
||||
|
||||
_suffixes = (
|
||||
_ud_rrt_suffix_variants
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ _list_icons
|
||||
+ ["—", "–"]
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ _list_icons
|
||||
+ [
|
||||
r"(?<=[0-9])[+\*^](?=[0-9-])",
|
||||
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
|
||||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
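The variant expansion used for the Romanian prefixes and suffixes can be checked on a single token. The following is a sketch under stated assumptions: `RO_VARIANTS` is a two-letter subset of `_ro_variants`, written with Unicode escapes to keep the comma-below and cedilla forms distinguishable, and `make_variants` is a condensed rewrite of `_make_ro_variants`, not the function itself.

```python
import itertools

# Subset of the mapping in the file: each comma-below capital maps to
# itself, its cedilla look-alike and the plain ASCII letter.
RO_VARIANTS = {
    "\u0218": ["\u0218", "\u015e", "S"],  # S with comma below, S with cedilla
    "\u021a": ["\u021a", "\u0162", "T"],  # T with comma below, T with cedilla
}


def make_variants(tokens):
    variants = set()
    for token in tokens:
        # Work on the upper-cased token so the mapping needs capitals only
        per_char = [RO_VARIANTS.get(c, [c]) for c in token.upper()]
        for chars in itertools.product(*per_char):
            variant = "".join(chars)
            variants.update([variant, variant.lower(), variant.title()])
    return sorted(variants)


# "\u0219-" is the prefix written with a lower-case s with comma below
prefix_variants = make_variants(["\u0219-"])
```

For this one prefix the expansion yields six strings: the comma-below, cedilla and ASCII spellings, each in upper and lower case. Longer tokens grow multiplicatively, because `itertools.product` combines the alternatives for every character.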
|
|
@ -2,6 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import ORTH
|
||||
from .punctuation import _make_ro_variants
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
@ -45,8 +46,52 @@ for orth in [
|
|||
"dpdv",
|
||||
"șamd.",
|
||||
"ș.a.m.d.",
|
||||
# below: from UD_Romanian-RRT:
|
||||
"A.c.",
|
||||
"A.f.",
|
||||
"A.r.",
|
||||
"Al.",
|
||||
"Art.",
|
||||
"Aug.",
|
||||
"Bd.",
|
||||
"Dem.",
|
||||
"Dr.",
|
||||
"Fig.",
|
||||
"Fr.",
|
||||
"Gh.",
|
||||
"Gr.",
|
||||
"Lt.",
|
||||
"Nr.",
|
||||
"Obs.",
|
||||
"Prof.",
|
||||
"Sf.",
|
||||
"a.m.",
|
||||
"a.r.",
|
||||
"alin.",
|
||||
"art.",
|
||||
"d-l",
|
||||
"d-lui",
|
||||
"d-nei",
|
||||
"ex.",
|
||||
"fig.",
|
||||
"ian.",
|
||||
"lit.",
|
||||
"lt.",
|
||||
"p.a.",
|
||||
"p.m.",
|
||||
"pct.",
|
||||
"prep.",
|
||||
"sf.",
|
||||
"tel.",
|
||||
"univ.",
|
||||
"îngr.",
|
||||
"într-adevăr",
|
||||
"Șt.",
|
||||
"ș.a.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
# note: does not distinguish capitalized-only exceptions from others
|
||||
for variant in _make_ro_variants([orth]):
|
||||
_exc[variant] = [{ORTH: variant}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = _exc
|
||||
|
|
|
@ -608,6 +608,7 @@ class Language(object):
|
|||
link_vectors_to_models(self.vocab)
|
||||
if self.vocab.vectors.data.shape[1]:
|
||||
cfg["pretrained_vectors"] = self.vocab.vectors.name
|
||||
cfg['pretrained_dims'] = self.vocab.vectors.data.shape[1]
|
||||
if sgd is None:
|
||||
sgd = create_default_optimizer(Model.ops)
|
||||
self._optimizer = sgd
|
||||
|
|
|
@ -8,7 +8,7 @@ from ..language import component
|
|||
from ..errors import Errors
|
||||
from ..compat import basestring_
|
||||
from ..util import ensure_path, to_disk, from_disk
|
||||
from ..tokens import Span
|
||||
from ..tokens import Doc, Span
|
||||
from ..matcher import Matcher, PhraseMatcher
|
||||
|
||||
DEFAULT_ENT_ID_SEP = "||"
|
||||
|
@ -162,6 +162,7 @@ class EntityRuler(object):
|
|||
@property
|
||||
def patterns(self):
|
||||
"""Get all patterns that were added to the entity ruler.
|
||||
|
||||
RETURNS (list): The original patterns, one dictionary per pattern.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#patterns
|
||||
|
@ -194,6 +195,7 @@ class EntityRuler(object):
|
|||
|
||||
DOCS: https://spacy.io/api/entityruler#add_patterns
|
||||
"""
|
||||
|
||||
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
|
||||
try:
|
||||
current_index = self.nlp.pipe_names.index(self.name)
|
||||
|
@ -203,7 +205,33 @@ class EntityRuler(object):
|
|||
except ValueError:
|
||||
subsequent_pipes = []
|
||||
with self.nlp.disable_pipes(subsequent_pipes):
|
||||
token_patterns = []
|
||||
phrase_pattern_labels = []
|
||||
phrase_pattern_texts = []
|
||||
phrase_pattern_ids = []
|
||||
|
||||
for entry in patterns:
|
||||
if isinstance(entry["pattern"], basestring_):
|
||||
phrase_pattern_labels.append(entry["label"])
|
||||
phrase_pattern_texts.append(entry["pattern"])
|
||||
phrase_pattern_ids.append(entry.get("id"))
|
||||
elif isinstance(entry["pattern"], list):
|
||||
token_patterns.append(entry)
|
||||
|
||||
phrase_patterns = []
|
||||
for label, pattern, ent_id in zip(
|
||||
phrase_pattern_labels,
|
||||
self.nlp.pipe(phrase_pattern_texts),
|
||||
phrase_pattern_ids
|
||||
):
|
||||
phrase_pattern = {
|
||||
"label": label, "pattern": pattern, "id": ent_id
|
||||
}
|
||||
if ent_id:
|
||||
phrase_pattern["id"] = ent_id
|
||||
phrase_patterns.append(phrase_pattern)
|
||||
|
||||
for entry in token_patterns + phrase_patterns:
|
||||
label = entry["label"]
|
||||
if "id" in entry:
|
||||
ent_label = label
|
||||
|
@ -212,8 +240,8 @@ class EntityRuler(object):
|
|||
self._ent_ids[key] = (ent_label, entry["id"])
|
||||
|
||||
pattern = entry["pattern"]
|
||||
if isinstance(pattern, basestring_):
|
||||
self.phrase_patterns[label].append(self.nlp(pattern))
|
||||
if isinstance(pattern, Doc):
|
||||
self.phrase_patterns[label].append(pattern)
|
||||
elif isinstance(pattern, list):
|
||||
self.token_patterns[label].append(pattern)
|
||||
else:
|
||||
|
@ -226,6 +254,8 @@ class EntityRuler(object):
|
|||
def _split_label(self, label):
|
||||
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
|
||||
|
||||
label (str): The value of label in a pattern entry
|
||||
|
||||
RETURNS (tuple): ent_label, ent_id
|
||||
"""
|
||||
if self.ent_id_sep in label:
|
||||
|
@ -239,6 +269,9 @@ class EntityRuler(object):
|
|||
def _create_label(self, label, ent_id):
|
||||
"""Join Entity label with ent_id if the pattern has an `id` attribute
|
||||
|
||||
label (str): The label to set for ent.label_
|
||||
ent_id (str): The entity ID from the pattern's `id` attribute
|
||||
|
||||
RETURNS (str): The ent_label joined with configured `ent_id_sep`
|
||||
"""
|
||||
if isinstance(ent_id, basestring_):
|
||||
|
@ -250,6 +283,7 @@ class EntityRuler(object):
|
|||
|
||||
patterns_bytes (bytes): The bytestring to load.
|
||||
**kwargs: Other config parameters, mostly for consistency.
|
||||
|
||||
RETURNS (EntityRuler): The loaded entity ruler.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#from_bytes
|
||||
|
@ -292,6 +326,7 @@ class EntityRuler(object):
|
|||
|
||||
path (unicode / Path): The JSONL file to load.
|
||||
**kwargs: Other config parameters, mostly for consistency.
|
||||
|
||||
RETURNS (EntityRuler): The loaded entity ruler.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#from_disk
|
||||
|
|
|
@ -1044,6 +1044,7 @@ class TextCategorizer(Pipe):
|
|||
self.add_label(cat)
|
||||
if self.model is True:
|
||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||
self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims")
|
||||
self.require_labels()
|
||||
self.model = self.Model(len(self.labels), **self.cfg)
|
||||
link_vectors_to_models(self.vocab)
|
||||
|
|
|
@ -463,3 +463,5 @@ cdef enum symbol_t:
|
|||
|
||||
ENT_KB_ID
|
||||
ENT_ID
|
||||
|
||||
IDX
|
|
@ -93,6 +93,7 @@ IDS = {
|
|||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
"IDX": IDX,
|
||||
|
||||
"ADJ": ADJ,
|
||||
"ADP": ADP,
|
||||
|
|
|
@ -66,3 +66,14 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
|||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||
|
||||
|
||||
def test_doc_array_idx(en_vocab):
|
||||
"""Test that Doc.to_array can retrieve token start indices"""
|
||||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
offsets = Doc(en_vocab, words=words).to_array("IDX")
|
||||
|
||||
assert offsets[0] == 0
|
||||
assert offsets[1] == 3
|
||||
assert offsets[2] == 11
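
The expected values in this test follow from what `IDX` stores: the character offset at which each token starts in the document text. A `Doc` built from `words` alone assumes one trailing space per token, so the same numbers can be reproduced without spaCy. The following is a minimal sketch under that assumption; `char_offsets` is a hypothetical helper for illustration and not part of spaCy.

```python
def char_offsets(words, spaces=None):
    """Return the start character offset of each word.

    Hypothetical helper: mimics how token start offsets accumulate when
    every word is followed by a single space unless `spaces` says otherwise.
    """
    if spaces is None:
        # Assumption: one trailing space after every word.
        spaces = [True] * len(words)
    offsets = []
    position = 0
    for word, has_space in zip(words, spaces):
        offsets.append(position)
        position += len(word) + (1 if has_space else 0)
    return offsets


offsets = char_offsets(["An", "example", "sentence"])
print(offsets)  # [0, 3, 11]
```

"An" occupies characters 0 and 1 plus a space, so "example" starts at 3; "example" plus its space adds 8 more, so "sentence" starts at 11.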
|
||||
|
|
|
@ -7,7 +7,7 @@ import numpy
|
|||
from spacy.tokens import Doc, Span
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.errors import ModelsWarning
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
@ -274,6 +274,39 @@ def test_doc_is_nered(en_vocab):
|
|||
assert new_doc.is_nered
|
||||
|
||||
|
||||
def test_doc_from_array_sent_starts(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||
doc[i].dep_ = dep
|
||||
doc[i].head = doc[head]
|
||||
if head == i:
|
||||
doc[i].is_sent_start = True
|
||||
doc.is_parsed
|
||||
|
||||
attrs = [SENT_START, HEAD]
|
||||
arr = doc.to_array(attrs)
|
||||
new_doc = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
new_doc.from_array(attrs, arr)
|
||||
|
||||
attrs = [SENT_START, DEP]
|
||||
arr = doc.to_array(attrs)
|
||||
new_doc = Doc(en_vocab, words=words)
|
||||
new_doc.from_array(attrs, arr)
|
||||
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
|
||||
assert not new_doc.is_parsed
|
||||
|
||||
attrs = [HEAD, DEP]
|
||||
arr = doc.to_array(attrs)
|
||||
new_doc = Doc(en_vocab, words=words)
|
||||
new_doc.from_array(attrs, arr)
|
||||
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
|
||||
assert new_doc.is_parsed
|
||||
|
||||
|
||||
def test_doc_lang(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "world"])
|
||||
assert doc.lang_ == "en"
|
||||
|
|
|
@ -279,3 +279,12 @@ def test_filter_spans(doc):
|
|||
assert len(filtered[1]) == 5
|
||||
assert filtered[0].start == 1 and filtered[0].end == 4
|
||||
assert filtered[1].start == 5 and filtered[1].end == 10
|
||||
|
||||
|
||||
def test_span_eq_hash(doc, doc_not_parsed):
|
||||
assert doc[0:2] == doc[0:2]
|
||||
assert doc[0:2] != doc[1:3]
|
||||
assert doc[0:2] != doc_not_parsed[0:2]
|
||||
assert hash(doc[0:2]) == hash(doc[0:2])
|
||||
assert hash(doc[0:2]) != hash(doc[1:3])
|
||||
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])
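
These assertions only hold together if equality and hashing are derived from the same fields, which is what the `Span` change in this commit arranges. Below is a minimal pure-Python sketch of that contract. `SpanKey` is a hypothetical stand-in, not spaCy's `Span`, and `doc_id` is an assumed placeholder for the parent `Doc`.

```python
class SpanKey(object):
    """Hypothetical stand-in for a span: equality and hash share one key."""

    def __init__(self, doc_id, start_char, end_char, label=0, kb_id=0):
        self.doc_id = doc_id
        self.start_char = start_char
        self.end_char = end_char
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # Single source of truth for both __eq__ and __hash__.
        return (self.doc_id, self.start_char, self.end_char, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, SpanKey):
            return NotImplemented
        return self._key() == other._key()

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self._key())


a = SpanKey("doc1", 0, 5)
b = SpanKey("doc1", 0, 5)
c = SpanKey("doc1", 0, 5, label=7)  # same offsets, different label
d = SpanKey("doc2", 0, 5)           # same offsets, different document

print(a == b, a == c, a == d)  # True False False
print(len({a, b, c, d}))       # 3
```

Because both methods read the same tuple, two objects that compare equal always hash equally, so spans behave predictably as set members and dictionary keys.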
|
||||
|
|
|
@ -10,28 +10,33 @@ ABBREVIATION_TESTS = [
|
|||
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
|
||||
),
|
||||
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
|
||||
(
|
||||
"Vuonna 1 eaa. tapahtui kauheita.",
|
||||
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
|
||||
),
|
||||
]
|
||||
|
||||
HYPHENATED_TESTS = [
|
||||
(
|
||||
"1700-luvulle sijoittuva taide-elokuva",
|
||||
["1700-luvulle", "sijoittuva", "taide-elokuva"],
|
||||
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
|
||||
[
|
||||
"1700-luvulle",
|
||||
"sijoittuva",
|
||||
"taide-elokuva",
|
||||
"Wikimedia-säätiön",
|
||||
"Varsinais-Suomen",
|
||||
],
|
||||
)
|
||||
]
|
||||
|
||||
ABBREVIATION_INFLECTION_TESTS = [
|
||||
(
|
||||
"VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
|
||||
["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
|
||||
["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"],
|
||||
),
|
||||
(
|
||||
"ALV:n osuus on 24 %.",
|
||||
["ALV:n", "osuus", "on", "24", "%", "."]
|
||||
),
|
||||
(
|
||||
"Hiihtäjä oli kilpailun 14:s.",
|
||||
["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
|
||||
)
|
||||
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
|
||||
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
|
||||
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
|
|||
deps = displacy.parse_deps(doc)
|
||||
assert isinstance(deps, dict)
|
||||
assert deps["words"] == [
|
||||
{"text": "This", "tag": "DET"},
|
||||
{"text": "is", "tag": "AUX"},
|
||||
{"text": "a", "tag": "DET"},
|
||||
{"text": "sentence", "tag": "NOUN"},
|
||||
{"lemma": None, "text": "This", "tag": "DET"},
|
||||
{"lemma": None, "text": "is", "tag": "AUX"},
|
||||
{"lemma": None, "text": "a", "tag": "DET"},
|
||||
{"lemma": None, "text": "sentence", "tag": "NOUN"},
|
||||
]
|
||||
assert deps["arcs"] == [
|
||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||
|
|
|
@ -95,7 +95,11 @@ def assert_docs_equal(doc1, doc2):
|
|||
|
||||
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
|
||||
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
|
||||
assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
|
||||
for ent1, ent2 in zip(doc1.ents, doc2.ents):
|
||||
assert ent1.start == ent2.start
|
||||
assert ent1.end == ent2.end
|
||||
assert ent1.label == ent2.label
|
||||
assert ent1.kb_id == ent2.kb_id
|
||||
|
||||
|
||||
def assert_packed_msg_equal(b1, b2):
|
||||
|
|
|
@ -23,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
|||
from ..typedefs cimport attr_t, flags_t
|
||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
|
||||
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
|
||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
|
||||
from ..attrs import intify_attrs, IDS
|
||||
|
@ -73,6 +73,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
|
|||
return token.ent_id
|
||||
elif feat_name == ENT_KB_ID:
|
||||
return token.ent_kb_id
|
||||
elif feat_name == IDX:
|
||||
return token.idx
|
||||
else:
|
||||
return Lexeme.get_struct_attr(token.lex, feat_name)
|
||||
|
||||
|
@ -813,7 +815,7 @@ cdef class Doc:
|
|||
if attr_ids[j] != TAG:
|
||||
Token.set_struct_attr(token, attr_ids[j], array[i, j])
|
||||
# Set flags
|
||||
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
|
||||
self.is_parsed = bool(self.is_parsed or HEAD in attrs)
|
||||
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
|
||||
# If document is parsed, set children
|
||||
if self.is_parsed:
|
||||
|
|
|
@ -127,22 +127,27 @@ cdef class Span:
|
|||
return False
|
||||
else:
|
||||
return True
|
||||
# Eq
|
||||
# <
|
||||
if op == 0:
|
||||
return self.start_char < other.start_char
|
||||
# <=
|
||||
elif op == 1:
|
||||
return self.start_char <= other.start_char
|
||||
# ==
|
||||
elif op == 2:
|
||||
return self.start_char == other.start_char and self.end_char == other.end_char
|
||||
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
|
||||
# !=
|
||||
elif op == 3:
|
||||
return self.start_char != other.start_char or self.end_char != other.end_char
|
||||
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
|
||||
# >
|
||||
elif op == 4:
|
||||
return self.start_char > other.start_char
|
||||
# >=
|
||||
elif op == 5:
|
||||
return self.start_char >= other.start_char
|
||||
|
||||
def __hash__(self):
|
||||
return hash((self.doc, self.label, self.start_char, self.end_char))
|
||||
return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))
|
||||
|
||||
def __len__(self):
|
||||
"""Get the number of tokens in the span.
|
||||
|
|
|
@ -283,7 +283,11 @@ cdef class Vectors:
|
|||
|
||||
DOCS: https://spacy.io/api/vectors#add
|
||||
"""
|
||||
key = get_string_id(key)
|
||||
# use int for all keys and rows in key2row for more efficient access
|
||||
# and serialization
|
||||
key = int(get_string_id(key))
|
||||
if row is not None:
|
||||
row = int(row)
|
||||
if row is None and key in self.key2row:
|
||||
row = self.key2row[key]
|
||||
elif row is None:
|
||||
|
|
|
@ -239,6 +239,7 @@ If a setting is not present in the options, the default value will be used.
|
|||
| Name | Type | Description | Default |
|
||||
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` | bool | Print the lemmas in a separate row below the token texts in the `dep` visualisation. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
|
|
|
@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
|
|||
be restored and deserialized – including the entity ruler. This lets you ship
|
||||
powerful model packages with binary weights _and_ rules included!
|
||||
|
||||
### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
|
||||
|
||||
When using a large number of **phrase patterns** (roughly more than 10,000), it's useful to understand how the `add_patterns` function of the `EntityRuler` works. For each **phrase pattern**,
|
||||
the `EntityRuler` calls the `nlp` object to construct a `Doc` object. This is needed in case you want
|
||||
to add the EntityRuler at the end of an existing pipeline with, for example, a POS tagger and want to
|
||||
extract matches based on the pattern's POS signature.
|
||||
|
||||
In this case you would pass a config value of `phrase_matcher_attr="POS"` for the EntityRuler.
|
||||
|
||||
Running the full language pipeline on every pattern scales linearly with the number of patterns and can therefore take a long time for large pattern lists.
|
||||
|
||||
As of spaCy 2.2.4, the `add_patterns` function has been refactored to use `nlp.pipe` on all phrase patterns, resulting in a speedup of about 10x with 5,000 phrase patterns and about 20x with 100,000.
|
||||
|
||||
Even with this speedup, and especially if you're using an older version, the `add_patterns` function can still take a long time.
|
||||
|
||||
An easy workaround to make this function run faster is disabling the other language pipes
|
||||
while adding the phrase patterns.
|
||||
|
||||
```python
|
||||
entityruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
|
||||
|
||||
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
|
||||
with nlp.disable_pipes(*other_pipes):
|
||||
    entityruler.add_patterns(patterns)
|
||||
```
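
Batching the phrase patterns through `nlp.pipe` depends on first separating phrase patterns (strings) from token patterns (lists), so that only the strings are processed by the pipeline. The following is a minimal sketch of that partition step in plain Python, without spaCy. `split_patterns` is a hypothetical helper name used for illustration only, and the sketch assumes Python 3 `str` values rather than spaCy's compatibility string type.

```python
def split_patterns(patterns):
    """Partition pattern entries into token patterns and phrase patterns.

    Hypothetical helper: entries whose "pattern" is a string are phrase
    patterns, entries whose "pattern" is a list are token patterns.
    """
    token_patterns = []
    phrase_patterns = []
    for entry in patterns:
        if isinstance(entry["pattern"], str):
            phrase_patterns.append(entry)
        elif isinstance(entry["pattern"], list):
            token_patterns.append(entry)
    return token_patterns, phrase_patterns


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    {"label": "ORG", "pattern": "Explosion", "id": "explosion"},
]
token_patterns, phrase_patterns = split_patterns(patterns)
phrase_texts = [entry["pattern"] for entry in phrase_patterns]
print(phrase_texts)  # ['Apple', 'Explosion']
```

Only `phrase_texts` would need to be turned into `Doc` objects; the token patterns can be passed on unchanged.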
|
||||
|
||||
## Combining models and rules {#models-rules}
|
||||
|
||||
You can combine statistical and rule-based components in a variety of ways.
|
||||
|
|
|
@ -999,6 +999,17 @@
|
|||
"author": "Graphbrain",
|
||||
"category": ["standalone"]
|
||||
},
|
||||
{
|
||||
"type": "education",
|
||||
"id": "nostarch-nlp-python",
|
||||
"title": "Natural Language Processing Using Python",
|
||||
"slogan": "No Starch Press, 2020",
|
||||
"description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
|
||||
"cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
|
||||
"url": "https://nostarch.com/NLPPython",
|
||||
"author": "Yuli Vasiliev",
|
||||
"category": ["books"]
|
||||
},
|
||||
{
|
||||
"type": "education",
|
||||
"id": "oreilly-python-ds",
|
||||
|
|