Merge branch 'master' into spacy.io

This commit is contained in:
Ines Montani 2020-02-23 12:04:20 +01:00
commit 89967f3701
36 changed files with 704 additions and 84 deletions

.github/contributors/Jan-711.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jan Jessewitsch |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 16.02.2020 |
| GitHub username | Jan-711 |
| Website (optional) | |

.github/contributors/MisterKeefe.md

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tom Keefe |
| Company name (if applicable) | / |
| Title or role (if applicable) | / |
| Date | 18 February 2020 |
| GitHub username | MisterKeefe |
| Website (optional) | / |


@ -1,5 +1,5 @@
recursive-include include *.h
recursive-include spacy *.txt
recursive-include spacy *.txt *.pyx *.pxd
include LICENSE
include README.md
include bin/spacy


@ -26,12 +26,12 @@ DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
@st.cache(ignore_hash=True)
@st.cache(allow_output_mutation=True)
def load_model(name):
return spacy.load(name)
@st.cache(ignore_hash=True)
@st.cache(allow_output_mutation=True)
def process_text(model_name, text):
nlp = load_model(model_name)
return nlp(text)
@ -79,7 +79,9 @@ if "ner" in nlp.pipe_names:
st.header("Named Entities")
st.sidebar.header("Named Entities")
label_set = nlp.get_pipe("ner").labels
labels = st.sidebar.multiselect("Entity labels", label_set, label_set)
labels = st.sidebar.multiselect(
"Entity labels", options=label_set, default=list(label_set)
)
html = displacy.render(doc, style="ent", options={"ents": labels})
# Newlines seem to mess with the rendering
html = html.replace("\n", " ")


@ -92,3 +92,5 @@ cdef enum attr_id_t:
LANG
ENT_KB_ID = symbols.ENT_KB_ID
ENT_ID = symbols.ENT_ID
IDX


@ -91,6 +91,7 @@ IDS = {
"SPACY": SPACY,
"PROB": PROB,
"LANG": LANG,
"IDX": IDX
}


@ -34,7 +34,7 @@ from .train import _load_pretrained_tok2vec
vectors_model=("Name or path to spaCy model with vectors to learn from"),
output_dir=("Directory to write models to on each epoch", "positional", None, str),
width=("Width of CNN layers", "option", "cw", int),
depth=("Depth of CNN layers", "option", "cd", int),
conv_depth=("Depth of CNN layers", "option", "cd", int),
cnn_window=("Window size for CNN layers", "option", "cW", int),
cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int),
use_chars=("Whether to use character-based embedding", "flag", "chr", bool),
@ -84,7 +84,7 @@ def pretrain(
vectors_model,
output_dir,
width=96,
depth=4,
conv_depth=4,
bilstm_depth=0,
cnn_pieces=3,
sa_depth=0,
@ -132,9 +132,15 @@ def pretrain(
msg.info("Using GPU" if has_gpu else "Not using GPU")
output_dir = Path(output_dir)
if output_dir.exists() and [p for p in output_dir.iterdir()]:
msg.warn(
"Output directory is not empty",
"It is better to use an empty directory or to refer to a new output path. "
"The new directory will then be created for you.",
)
if not output_dir.exists():
output_dir.mkdir()
msg.good("Created output directory")
msg.good("Created output directory: {}".format(output_dir))
srsly.write_json(output_dir / "config.json", config)
msg.good("Saved settings to config.json")
@ -162,7 +168,7 @@ def pretrain(
Tok2Vec(
width,
embed_rows,
conv_depth=depth,
conv_depth=conv_depth,
pretrained_vectors=pretrained_vectors,
bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental.
subword_features=not use_chars, # Set to False for Chinese etc


@ -14,6 +14,7 @@ import contextlib
import random
from .._ml import create_default_optimizer
from ..util import use_gpu as set_gpu
from ..attrs import PROB, IS_OOV, CLUSTER, LANG
from ..gold import GoldCorpus
from ..compat import path2str
@ -32,6 +33,13 @@ from .. import about
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
replace_components=("Replace components from base model", "flag", "R", bool),
vectors=("Model to load vectors from", "option", "v", str),
width=("Width of CNN layers of Tok2Vec component", "option", "cw", int),
conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int),
cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int),
cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int),
use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool),
bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int),
embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int),
n_iter=("Number of iterations", "option", "n", int),
n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
n_examples=("Number of examples", "option", "ns", int),
@ -63,6 +71,13 @@ def train(
pipeline="tagger,parser,ner",
replace_components=False,
vectors=None,
width=96,
conv_depth=4,
cnn_window=1,
cnn_pieces=3,
use_chars=False,
bilstm_depth=0,
embed_rows=2000,
n_iter=30,
n_early_stopping=None,
n_examples=0,
@ -115,6 +130,7 @@ def train(
)
if not output_path.exists():
output_path.mkdir()
msg.good("Created output directory: {}".format(output_path))
# Take dropout and batch size as generators of values -- dropout
# starts high and decays sharply, to force the optimizer to explore.
@ -147,6 +163,18 @@ def train(
disabled_pipes = None
pipes_added = False
msg.text("Training pipeline: {}".format(pipeline))
if use_gpu >= 0:
activated_gpu = None
try:
activated_gpu = set_gpu(use_gpu)
except Exception as e:
msg.warn("Exception: {}".format(e))
if activated_gpu is not None:
msg.text("Using GPU: {}".format(use_gpu))
else:
msg.warn("Unable to activate GPU: {}".format(use_gpu))
msg.text("Using CPU only")
use_gpu = -1
if base_model:
msg.text("Starting with base model '{}'".format(base_model))
nlp = util.load_model(base_model)
@ -237,7 +265,15 @@ def train(
optimizer = create_default_optimizer(Model.ops)
else:
# Start with a blank model, call begin_training
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
cfg = {"device": use_gpu}
cfg["conv_depth"] = conv_depth
cfg["token_vector_width"] = width
cfg["bilstm_depth"] = bilstm_depth
cfg["cnn_maxout_pieces"] = cnn_pieces
cfg["embed_size"] = embed_rows
cfg["conv_window"] = cnn_window
cfg["subword_features"] = not use_chars
optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg)
nlp._optimizer = None
@ -362,13 +398,19 @@ def train(
if not batch:
continue
docs, golds = zip(*batch)
nlp.update(
docs,
golds,
sgd=optimizer,
drop=next(dropout_rates),
losses=losses,
)
try:
nlp.update(
docs,
golds,
sgd=optimizer,
drop=next(dropout_rates),
losses=losses,
)
except ValueError as e:
msg.warn("Error during training")
if init_tok2vec:
msg.warn("Did you provide the same parameters during 'train' as during 'pretrain'?")
msg.fail("Original error message: {}".format(e), exits=1)
if raw_text:
# If raw text is available, perform 'rehearsal' updates,
# which use unlabelled data to reduce overfitting.
@ -495,6 +537,8 @@ def train(
"score = {}".format(best_score, current_score)
)
break
except Exception as e:
msg.warn("Aborting and saving the final best model. Encountered exception: {}".format(e))
finally:
best_pipes = nlp.pipe_names
if disabled_pipes:
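
Reviewer sketch (not part of the commit): the new `train` options are renamed on their way into `nlp.begin_training`. The helper name `build_tok2vec_cfg` is hypothetical. The key names and defaults are copied from the hunks above, and `use_gpu=-1` is assumed here to mean CPU.

```python
def build_tok2vec_cfg(use_gpu=-1, width=96, conv_depth=4, cnn_window=1,
                      cnn_pieces=3, use_chars=False, bilstm_depth=0,
                      embed_rows=2000):
    """Map the CLI option names onto the keyword names used in the cfg dict.

    Hypothetical helper for illustration; the commit builds this dict inline.
    """
    return {
        "device": use_gpu,
        "conv_depth": conv_depth,
        "token_vector_width": width,        # CLI name: width
        "bilstm_depth": bilstm_depth,
        "cnn_maxout_pieces": cnn_pieces,    # CLI name: cnn_pieces
        "embed_size": embed_rows,           # CLI name: embed_rows
        "conv_window": cnn_window,          # CLI name: cnn_window
        "subword_features": not use_chars,  # inverted flag
    }


cfg = build_tok2vec_cfg(width=128, use_chars=True)
```

Note the inversion on the last key: passing `use_chars` turns `subword_features` off, which is the detail most likely to cause a mismatch between `pretrain` and `train` settings.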


@ -144,10 +144,12 @@ def parse_deps(orig_doc, options={}):
for span, tag, lemma, ent_type in spans:
attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
retokenizer.merge(span, attrs=attrs)
if options.get("fine_grained"):
words = [{"text": w.text, "tag": w.tag_} for w in doc]
else:
words = [{"text": w.text, "tag": w.pos_} for w in doc]
fine_grained = options.get("fine_grained")
add_lemma = options.get("add_lemma")
words = [{"text": w.text,
"tag": w.tag_ if fine_grained else w.pos_,
"lemma": w.lemma_ if add_lemma else None} for w in doc]
arcs = []
for word in doc:
if word.i < word.head.i:
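
Reviewer sketch (not part of the commit): how the two options select what ends up in each word dict. `FakeToken` and `build_words` are hypothetical stand-ins so the logic can run without a spaCy model; only the attribute names `text`, `tag_`, `pos_` and `lemma_` come from the hunk above.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token with only the attributes used here.
FakeToken = namedtuple("FakeToken", ["text", "tag_", "pos_", "lemma_"])


def build_words(tokens, options):
    """Build the per-word dicts the same way the new list comprehension does."""
    fine_grained = options.get("fine_grained")
    add_lemma = options.get("add_lemma")
    return [
        {
            "text": t.text,
            "tag": t.tag_ if fine_grained else t.pos_,
            "lemma": t.lemma_ if add_lemma else None,
        }
        for t in tokens
    ]


tokens = [FakeToken("cats", "NNS", "NOUN", "cat")]
words_default = build_words(tokens, {})
words_lemma = build_words(tokens, {"add_lemma": True, "fine_grained": True})
```

With no options the `lemma` key is present but `None`, which is what lets the renderer fall back to the template without a lemma row.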


@ -3,7 +3,7 @@ from __future__ import unicode_literals
import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html, registry
from ..errors import Errors
@ -83,7 +83,7 @@ class DependencyRenderer(object):
self.width = self.offset_x + len(words) * self.distance
self.height = self.offset_y + 3 * self.word_spacing
self.id = render_id
words = [self.render_word(w["text"], w["tag"], i) for i, w in enumerate(words)]
words = [self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words)]
arcs = [
self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i)
for i, a in enumerate(arcs)
@ -101,7 +101,7 @@ class DependencyRenderer(object):
lang=self.lang,
)
def render_word(self, text, tag, i):
def render_word(self, text, tag, lemma, i):
"""Render individual word.
text (unicode): Word text.
@ -114,6 +114,8 @@ class DependencyRenderer(object):
if self.direction == "rtl":
x = self.width - x
html_text = escape_html(text)
if lemma is not None:
return TPL_DEP_WORDS_LEMMA.format(text=html_text, tag=tag, lemma=lemma, x=x, y=y)
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
def render_arrow(self, label, start, end, direction, i):


@ -18,6 +18,15 @@ TPL_DEP_WORDS = """
"""
TPL_DEP_WORDS_LEMMA = """
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
<tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
<tspan class="displacy-lemma" dy="2em" fill="currentColor" x="{x}">{lemma}</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
</text>
"""
TPL_DEP_ARCS = """
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>


@ -22,14 +22,14 @@ dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
durfte durften
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
einem einen einer eines einige einigen einiger einiges einmal einmaleins elf en
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
früher fünf fünfte fünften fünfter fünftes für
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
gewesen gewollt geworden gibt ging gleich gross groß grosse große grossen
großen grosser großer grosses großes gut gute guter gutes
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
@ -47,9 +47,8 @@ kleines kommen kommt können könnt konnte könnte konnten kurz
lang lange leicht leider lieber los
machen macht machte mag magst man manche manchem manchen mancher manches mehr
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
musste mussten
mein meine meinem meinen meiner meines mich mir mit mittel mochte möchte mochten
mögen möglich mögt morgen muss muß müssen musst müsst musste mussten
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur


@ -3,7 +3,7 @@
from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .tag_map_general import TAG_MAP
from ..tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from .lemmatizer import GreekLemmatizer


@ -1,27 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from ...symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ
from ...symbols import PUNCT, NUM, AUX, X, ADJ, VERB, PART, SPACE, CCONJ
TAG_MAP = {
"ADJ": {POS: ADJ},
"ADV": {POS: ADV},
"INTJ": {POS: INTJ},
"NOUN": {POS: NOUN},
"PROPN": {POS: PROPN},
"VERB": {POS: VERB},
"ADP": {POS: ADP},
"CCONJ": {POS: CCONJ},
"SCONJ": {POS: SCONJ},
"PART": {POS: PART},
"PUNCT": {POS: PUNCT},
"SYM": {POS: SYM},
"NUM": {POS: NUM},
"PRON": {POS: PRON},
"AUX": {POS: AUX},
"SPACE": {POS: SPACE},
"DET": {POS: DET},
"X": {POS: X},
}


@ -14,6 +14,7 @@ for exc_data in [
{ORTH: "alv.", LEMMA: "arvonlisävero"},
{ORTH: "ark.", LEMMA: "arkisin"},
{ORTH: "as.", LEMMA: "asunto"},
{ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"},
{ORTH: "ed.", LEMMA: "edellinen"},
{ORTH: "esim.", LEMMA: "esimerkki"},
{ORTH: "huom.", LEMMA: "huomautus"},
@ -27,6 +28,7 @@ for exc_data in [
{ORTH: "läh.", LEMMA: "lähettäjä"},
{ORTH: "miel.", LEMMA: "mieluummin"},
{ORTH: "milj.", LEMMA: "miljoona"},
{ORTH: "Mm.", LEMMA: "muun muassa"},
{ORTH: "mm.", LEMMA: "muun muassa"},
{ORTH: "myöh.", LEMMA: "myöhempi"},
{ORTH: "n.", LEMMA: "noin"},


@ -3,6 +3,8 @@ from __future__ import unicode_literals
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .stop_words import STOP_WORDS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ..norm_exceptions import BASE_NORMS
@ -24,6 +26,9 @@ class RomanianDefaults(Language.Defaults):
)
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
prefixes = TOKENIZER_PREFIXES
suffixes = TOKENIZER_SUFFIXES
infixes = TOKENIZER_INFIXES
tag_map = TAG_MAP


@ -0,0 +1,164 @@
# coding: utf8
from __future__ import unicode_literals
import itertools
from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY
from ..char_classes import LIST_ICONS, CURRENCY
from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT
_list_icons = [x for x in LIST_ICONS if x != "°"]
_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
_ro_variants = {
"Ă": ["Ă", "A"],
"Â": ["Â", "A"],
"Î": ["Î", "I"],
"Ș": ["Ș", "Ş", "S"],
"Ț": ["Ț", "Ţ", "T"],
}
def _make_ro_variants(tokens):
variants = []
for token in tokens:
upper_token = token.upper()
upper_char_variants = [_ro_variants.get(c, [c]) for c in upper_token]
upper_variants = ["".join(x) for x in itertools.product(*upper_char_variants)]
for variant in upper_variants:
variants.extend([variant, variant.lower(), variant.title()])
return sorted(list(set(variants)))
# UD_Romanian-RRT closed class prefixes
# POS: ADP|AUX|CCONJ|DET|NUM|PART|PRON|SCONJ
_ud_rrt_prefixes = [
"a-",
"c-",
"ce-",
"cu-",
"d-",
"de-",
"dintr-",
"e-",
"făr-",
"i-",
"l-",
"le-",
"m-",
"mi-",
"n-",
"ne-",
"p-",
"pe-",
"prim-",
"printr-",
"s-",
"se-",
"te-",
"v-",
"într-",
"ș-",
"și-",
"ți-",
]
_ud_rrt_prefix_variants = _make_ro_variants(_ud_rrt_prefixes)
# UD_Romanian-RRT closed class suffixes without NUM
# POS: ADP|AUX|CCONJ|DET|PART|PRON|SCONJ
_ud_rrt_suffixes = [
"-a",
"-aceasta",
"-ai",
"-al",
"-ale",
"-alta",
"-am",
"-ar",
"-astea",
"-atâta",
"-au",
"-aș",
"-ați",
"-i",
"-ilor",
"-l",
"-le",
"-lea",
"-mea",
"-meu",
"-mi",
"-mă",
"-n",
"-ndărătul",
"-ne",
"-o",
"-oi",
"-or",
"-s",
"-se",
"-si",
"-te",
"-ul",
"-ului",
"-un",
"-uri",
"-urile",
"-urilor",
"-veți",
"-vă",
"-ăștia",
"-și",
"-ți",
]
_ud_rrt_suffix_variants = _make_ro_variants(_ud_rrt_suffixes)
_prefixes = (
["§", "%", "=", "\u2014", "\u2013", r"\+(?![0-9])"]
+ _ud_rrt_prefix_variants
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ LIST_CURRENCY
+ LIST_ICONS
)
_suffixes = (
_ud_rrt_suffix_variants
+ LIST_PUNCT
+ LIST_ELLIPSES
+ LIST_QUOTES
+ _list_icons
+ ["\u2014", "\u2013"]
+ [
r"(?<=[0-9])\+",
r"(?<=°[FfCcKk])\.",
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
),
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
]
)
_infixes = (
LIST_ELLIPSES
+ _list_icons
+ [
r"(?<=[0-9])[+\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
]
)
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes
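
Reviewer sketch (not part of the commit): what the variant expansion produces for a single prefix. This is a standalone copy of the helper's logic with public names (`RO_VARIANTS`, `make_ro_variants`) chosen here for illustration. The mapping is copied from the hunk above.

```python
import itertools

# Copied from the diff: each uppercase Romanian letter maps to itself plus
# its cedilla and/or plain-ASCII fallbacks.
RO_VARIANTS = {
    "Ă": ["Ă", "A"],
    "Â": ["Â", "A"],
    "Î": ["Î", "I"],
    "Ș": ["Ș", "Ş", "S"],
    "Ț": ["Ț", "Ţ", "T"],
}


def make_ro_variants(tokens):
    """Expand each token into upper, lower and title case for every spelling."""
    variants = set()
    for token in tokens:
        # One list of alternatives per character of the uppercased token.
        per_char = [RO_VARIANTS.get(c, [c]) for c in token.upper()]
        for combo in itertools.product(*per_char):
            upper = "".join(combo)
            variants.update([upper, upper.lower(), upper.title()])
    return sorted(variants)


variants = make_ro_variants(["și-"])
```

For this one prefix the expansion yields three spellings (comma-below, cedilla, plain ASCII) in three casings each. The number of variants grows multiplicatively with the number of diacritic letters in a token.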


@ -2,6 +2,7 @@
from __future__ import unicode_literals
from ...symbols import ORTH
from .punctuation import _make_ro_variants
_exc = {}
@ -45,8 +46,52 @@ for orth in [
"dpdv",
"șamd.",
"ș.a.m.d.",
# below: from UD_Romanian-RRT:
"A.c.",
"A.f.",
"A.r.",
"Al.",
"Art.",
"Aug.",
"Bd.",
"Dem.",
"Dr.",
"Fig.",
"Fr.",
"Gh.",
"Gr.",
"Lt.",
"Nr.",
"Obs.",
"Prof.",
"Sf.",
"a.m.",
"a.r.",
"alin.",
"art.",
"d-l",
"d-lui",
"d-nei",
"ex.",
"fig.",
"ian.",
"lit.",
"lt.",
"p.a.",
"p.m.",
"pct.",
"prep.",
"sf.",
"tel.",
"univ.",
"îngr.",
"într-adevăr",
"Șt.",
"ș.a.",
]:
_exc[orth] = [{ORTH: orth}]
# note: does not distinguish capitalized-only exceptions from others
for variant in _make_ro_variants([orth]):
_exc[variant] = [{ORTH: variant}]
TOKENIZER_EXCEPTIONS = _exc


@ -608,6 +608,7 @@ class Language(object):
link_vectors_to_models(self.vocab)
if self.vocab.vectors.data.shape[1]:
cfg["pretrained_vectors"] = self.vocab.vectors.name
cfg["pretrained_dims"] = self.vocab.vectors.data.shape[1]
if sgd is None:
sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd


@ -8,7 +8,7 @@ from ..language import component
from ..errors import Errors
from ..compat import basestring_
from ..util import ensure_path, to_disk, from_disk
from ..tokens import Span
from ..tokens import Doc, Span
from ..matcher import Matcher, PhraseMatcher
DEFAULT_ENT_ID_SEP = "||"
@ -162,6 +162,7 @@ class EntityRuler(object):
@property
def patterns(self):
"""Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns
@ -194,6 +195,7 @@ class EntityRuler(object):
DOCS: https://spacy.io/api/entityruler#add_patterns
"""
# disable the nlp components after this one in case they hadn't been initialized / deserialised yet
try:
current_index = self.nlp.pipe_names.index(self.name)
@ -203,7 +205,33 @@ class EntityRuler(object):
except ValueError:
subsequent_pipes = []
with self.nlp.disable_pipes(subsequent_pipes):
token_patterns = []
phrase_pattern_labels = []
phrase_pattern_texts = []
phrase_pattern_ids = []
for entry in patterns:
if isinstance(entry["pattern"], basestring_):
phrase_pattern_labels.append(entry["label"])
phrase_pattern_texts.append(entry["pattern"])
phrase_pattern_ids.append(entry.get("id"))
elif isinstance(entry["pattern"], list):
token_patterns.append(entry)
phrase_patterns = []
for label, pattern, ent_id in zip(
phrase_pattern_labels,
self.nlp.pipe(phrase_pattern_texts),
phrase_pattern_ids
):
phrase_pattern = {
"label": label, "pattern": pattern, "id": ent_id
}
if ent_id:
phrase_pattern["id"] = ent_id
phrase_patterns.append(phrase_pattern)
for entry in token_patterns + phrase_patterns:
label = entry["label"]
if "id" in entry:
ent_label = label
@ -212,8 +240,8 @@ class EntityRuler(object):
self._ent_ids[key] = (ent_label, entry["id"])
pattern = entry["pattern"]
if isinstance(pattern, basestring_):
self.phrase_patterns[label].append(self.nlp(pattern))
if isinstance(pattern, Doc):
self.phrase_patterns[label].append(pattern)
elif isinstance(pattern, list):
self.token_patterns[label].append(pattern)
else:
@ -226,6 +254,8 @@ class EntityRuler(object):
def _split_label(self, label):
"""Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
label (str): The value of label in a pattern entry
RETURNS (tuple): ent_label, ent_id
"""
if self.ent_id_sep in label:
@ -239,6 +269,9 @@ class EntityRuler(object):
def _create_label(self, label, ent_id):
"""Join Entity label with ent_id if the pattern has an `id` attribute
label (str): The label to set for ent.label_
ent_id (str): The entity ID to join to the label
RETURNS (str): The ent_label joined with configured `ent_id_sep`
"""
if isinstance(ent_id, basestring_):
@ -250,6 +283,7 @@ class EntityRuler(object):
patterns_bytes (bytes): The bytestring to load.
**kwargs: Other config parameters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_bytes
@ -292,6 +326,7 @@ class EntityRuler(object):
path (unicode / Path): The JSONL file to load.
**kwargs: Other config parameters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_disk
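
Reviewer sketch (not part of the commit): the reworked `add_patterns` first separates string patterns from token patterns so that all strings can go through one batched `pipe` call instead of one `nlp(...)` call each. `split_patterns` and `fake_pipe` are hypothetical names. `fake_pipe` only mimics the "one output per input text, in order" contract assumed of `nlp.pipe`.

```python
def split_patterns(patterns, pipe):
    """Separate token patterns from phrase patterns and batch-process the latter."""
    token_patterns = []
    labels, texts, ids = [], [], []
    for entry in patterns:
        if isinstance(entry["pattern"], str):
            labels.append(entry["label"])
            texts.append(entry["pattern"])
            ids.append(entry.get("id"))
        elif isinstance(entry["pattern"], list):
            token_patterns.append(entry)
    # A single batched call; zip relies on pipe preserving input order.
    phrase_patterns = [
        {"label": label, "pattern": processed, "id": ent_id}
        for label, processed, ent_id in zip(labels, pipe(texts), ids)
    ]
    return token_patterns, phrase_patterns


def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: yields one processed object per text."""
    for text in texts:
        yield tuple(text.split())


patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    {"label": "ORG", "pattern": "Explosion AI"},
]
token_patterns, phrase_patterns = split_patterns(patterns, fake_pipe)
```

One consequence worth checking in review: token patterns are registered before phrase patterns, so the original interleaving of the input list is not preserved.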


@ -1044,6 +1044,7 @@ class TextCategorizer(Pipe):
self.add_label(cat)
if self.model is True:
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims")
self.require_labels()
self.model = self.Model(len(self.labels), **self.cfg)
link_vectors_to_models(self.vocab)


@ -463,3 +463,5 @@ cdef enum symbol_t:
ENT_KB_ID
ENT_ID
IDX


@ -93,6 +93,7 @@ IDS = {
"SPACY": SPACY,
"PROB": PROB,
"LANG": LANG,
"IDX": IDX,
"ADJ": ADJ,
"ADP": ADP,


@ -66,3 +66,14 @@ def test_doc_array_to_from_string_attrs(en_vocab, attrs):
words = ["An", "example", "sentence"]
doc = Doc(en_vocab, words=words)
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
def test_doc_array_idx(en_vocab):
"""Test that Doc.to_array can retrieve token start indices"""
words = ["An", "example", "sentence"]
doc = Doc(en_vocab, words=words)
offsets = Doc(en_vocab, words=words).to_array("IDX")
assert offsets[0] == 0
assert offsets[1] == 3
assert offsets[2] == 11
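
Reviewer sketch (not part of the commit): where the expected values 0, 3 and 11 come from. It assumes, as the test does implicitly, that a `Doc` built from `words` alone treats every word as followed by a single space. `token_start_offsets` is a hypothetical helper, not a spaCy function.

```python
def token_start_offsets(words, spaces=None):
    """Compute the character offset at which each word starts."""
    if spaces is None:
        # Assumption: every word is followed by one space.
        spaces = [True] * len(words)
    offsets = []
    position = 0
    for word, space in zip(words, spaces):
        offsets.append(position)
        position += len(word) + (1 if space else 0)
    return offsets


offsets = token_start_offsets(["An", "example", "sentence"])
```

"An" occupies characters 0 to 1 plus a space, so "example" starts at 3. "example" is seven characters plus a space, so "sentence" starts at 11.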


@ -7,7 +7,7 @@ import numpy
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab
from spacy.errors import ModelsWarning
from spacy.attrs import ENT_TYPE, ENT_IOB
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP
from ..util import get_doc
@ -274,6 +274,39 @@ def test_doc_is_nered(en_vocab):
assert new_doc.is_nered
def test_doc_from_array_sent_starts(en_vocab):
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep"]
doc = Doc(en_vocab, words=words)
for i, (dep, head) in enumerate(zip(deps, heads)):
doc[i].dep_ = dep
doc[i].head = doc[head]
if head == i:
doc[i].is_sent_start = True
doc.is_parsed
attrs = [SENT_START, HEAD]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
with pytest.raises(ValueError):
new_doc.from_array(attrs, arr)
attrs = [SENT_START, DEP]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
new_doc.from_array(attrs, arr)
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
assert not new_doc.is_parsed
attrs = [HEAD, DEP]
arr = doc.to_array(attrs)
new_doc = Doc(en_vocab, words=words)
new_doc.from_array(attrs, arr)
assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc]
assert new_doc.is_parsed
def test_doc_lang(en_vocab):
doc = Doc(en_vocab, words=["Hello", "world"])
assert doc.lang_ == "en"

View File

@ -279,3 +279,12 @@ def test_filter_spans(doc):
assert len(filtered[1]) == 5
assert filtered[0].start == 1 and filtered[0].end == 4
assert filtered[1].start == 5 and filtered[1].end == 10
def test_span_eq_hash(doc, doc_not_parsed):
assert doc[0:2] == doc[0:2]
assert doc[0:2] != doc[1:3]
assert doc[0:2] != doc_not_parsed[0:2]
assert hash(doc[0:2]) == hash(doc[0:2])
assert hash(doc[0:2]) != hash(doc[1:3])
assert hash(doc[0:2]) != hash(doc_not_parsed[0:2])

View File

@ -10,28 +10,33 @@ ABBREVIATION_TESTS = [
["Hyvää", "uutta", "vuotta", "t.", "siht.", "Niemelä", "!"],
),
("Paino on n. 2.2 kg", ["Paino", "on", "n.", "2.2", "kg"]),
(
"Vuonna 1 eaa. tapahtui kauheita.",
["Vuonna", "1", "eaa.", "tapahtui", "kauheita", "."],
),
]
HYPHENATED_TESTS = [
(
"1700-luvulle sijoittuva taide-elokuva",
["1700-luvulle", "sijoittuva", "taide-elokuva"],
"1700-luvulle sijoittuva taide-elokuva Wikimedia-säätiön Varsinais-Suomen",
[
"1700-luvulle",
"sijoittuva",
"taide-elokuva",
"Wikimedia-säätiön",
"Varsinais-Suomen",
],
)
]
ABBREVIATION_INFLECTION_TESTS = [
(
"VTT:ssa ennen v:ta 2010 suoritetut mittaukset",
["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"]
["VTT:ssa", "ennen", "v:ta", "2010", "suoritetut", "mittaukset"],
),
(
"ALV:n osuus on 24 %.",
["ALV:n", "osuus", "on", "24", "%", "."]
),
(
"Hiihtäjä oli kilpailun 14:s.",
["Hiihtäjä", "oli", "kilpailun", "14:s", "."]
)
("ALV:n osuus on 24 %.", ["ALV:n", "osuus", "on", "24", "%", "."]),
("Hiihtäjä oli kilpailun 14:s.", ["Hiihtäjä", "oli", "kilpailun", "14:s", "."]),
("EU:n toimesta tehtiin jotain.", ["EU:n", "toimesta", "tehtiin", "jotain", "."]),
]

View File

@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "AUX"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
{"lemma": None, "text": "This", "tag": "DET"},
{"lemma": None, "text": "is", "tag": "AUX"},
{"lemma": None, "text": "a", "tag": "DET"},
{"lemma": None, "text": "sentence", "tag": "NOUN"},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},

View File

@ -95,7 +95,11 @@ def assert_docs_equal(doc1, doc2):
assert [t.ent_type for t in doc1] == [t.ent_type for t in doc2]
assert [t.ent_iob for t in doc1] == [t.ent_iob for t in doc2]
assert [ent for ent in doc1.ents] == [ent for ent in doc2.ents]
for ent1, ent2 in zip(doc1.ents, doc2.ents):
assert ent1.start == ent2.start
assert ent1.end == ent2.end
assert ent1.label == ent2.label
assert ent1.kb_id == ent2.kb_id
def assert_packed_msg_equal(b1, b2):

View File

@ -23,7 +23,7 @@ from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, attr_id_t
from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..attrs import intify_attrs, IDS
@ -73,6 +73,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
return token.ent_id
elif feat_name == ENT_KB_ID:
return token.ent_kb_id
elif feat_name == IDX:
return token.idx
else:
return Lexeme.get_struct_attr(token.lex, feat_name)
@ -813,7 +815,7 @@ cdef class Doc:
if attr_ids[j] != TAG:
Token.set_struct_attr(token, attr_ids[j], array[i, j])
# Set flags
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
self.is_parsed = bool(self.is_parsed or HEAD in attrs)
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
# If document is parsed, set children
if self.is_parsed:

View File

@ -127,22 +127,27 @@ cdef class Span:
return False
else:
return True
# Eq
# <
if op == 0:
return self.start_char < other.start_char
# <=
elif op == 1:
return self.start_char <= other.start_char
# ==
elif op == 2:
return self.start_char == other.start_char and self.end_char == other.end_char
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) == (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
# !=
elif op == 3:
return self.start_char != other.start_char or self.end_char != other.end_char
return (self.doc, self.start_char, self.end_char, self.label, self.kb_id) != (other.doc, other.start_char, other.end_char, other.label, other.kb_id)
# >
elif op == 4:
return self.start_char > other.start_char
# >=
elif op == 5:
return self.start_char >= other.start_char
def __hash__(self):
return hash((self.doc, self.label, self.start_char, self.end_char))
return hash((self.doc, self.start_char, self.end_char, self.label, self.kb_id))
def __len__(self):
"""Get the number of tokens in the span.

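The `==`, `!=` and `__hash__` changes in the hunk above all use the same five fields, which keeps equal spans hashing equally. A minimal plain-Python sketch of that pattern, assuming a hypothetical `ToySpan` class that stands in for the Cython `Span` (it is not spaCy's implementation, and `doc` is simply any hashable object here):

```python
class ToySpan:
    """Hypothetical stand-in for Span: equality and hash share one key."""

    def __init__(self, doc, start_char, end_char, label=0, kb_id=0):
        self.doc = doc
        self.start_char = start_char
        self.end_char = end_char
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # Same field order as the tuples compared in the hunk above.
        return (self.doc, self.start_char, self.end_char, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Hashing the same key guarantees equal objects have equal hashes.
        return hash(self._key())


doc_a, doc_b = "doc-a", "doc-b"
same = ToySpan(doc_a, 0, 5) == ToySpan(doc_a, 0, 5)
different_label = ToySpan(doc_a, 0, 5, label=1) == ToySpan(doc_a, 0, 5, label=2)
different_doc = ToySpan(doc_a, 0, 5) == ToySpan(doc_b, 0, 5)
```

With a shared key, two spans over the same characters but with different labels, knowledge base IDs or parent documents no longer compare equal, which matches what `test_span_eq_hash` checks for spans taken from two different docs.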
View File

@ -283,7 +283,11 @@ cdef class Vectors:
DOCS: https://spacy.io/api/vectors#add
"""
key = get_string_id(key)
# use int for all keys and rows in key2row for more efficient access
# and serialization
key = int(get_string_id(key))
if row is not None:
row = int(row)
if row is None and key in self.key2row:
row = self.key2row[key]
elif row is None:

View File

@ -239,6 +239,7 @@ If a setting is not present in the options, the default value will be used.
| Name | Type | Description | Default |
| ------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` | bool | Print the lemmas in a separate row below the token texts in the `dep` visualisation. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |

View File

@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
be restored and deserialized including the entity ruler. This lets you ship
powerful model packages with binary weights _and_ rules included!
### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
When using a large number of **phrase patterns** (roughly more than 10,000), it's useful to understand how the `add_patterns` function of the EntityRuler works. For each **phrase pattern**,
the EntityRuler calls the `nlp` object to construct a `Doc` object. This is needed in case you
add the EntityRuler at the end of an existing pipeline with, for example, a POS tagger and want to
extract matches based on the pattern's POS signature.
In this case you would pass a config value of `phrase_matcher_attr="POS"` for the EntityRuler.
Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time with a large number of phrase patterns.
As of spaCy 2.2.4, the `add_patterns` function has been refactored to use `nlp.pipe` on all phrase patterns, resulting in a speedup of about 10x with 5,000 phrase patterns and about 20x with 100,000.
Even with this speedup (and especially if you're using an older version), the `add_patterns` function can still take a long time.
An easy workaround to make this function run faster is to disable the other language pipes
while adding the phrase patterns.
```python
from spacy.pipeline import EntityRuler

entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
# Disable every pipe except the tagger while the patterns are added
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*other_pipes):
    entityruler.add_patterns(patterns)
```
## Combining models and rules {#models-rules}
You can combine statistical and rule-based components in a variety of ways.

View File

@ -999,6 +999,17 @@
"author": "Graphbrain",
"category": ["standalone"]
},
{
"type": "education",
"id": "nostarch-nlp-python",
"title": "Natural Language Processing Using Python",
"slogan": "No Starch Press, 2020",
"description": "Natural Language Processing Using Python is an introduction to natural language processing (NLP), the task of converting human language into data that a computer can process. The book uses spaCy, a leading Python library for NLP, to guide readers through common NLP tasks related to generating and understanding human language with code. It addresses problems like understanding a user's intent, continuing a conversation with a human, and maintaining the state of a conversation.",
"cover": "https://nostarch.com/sites/default/files/styles/uc_product_full/public/NaturalLanguageProcessing_final_v01.jpg",
"url": "https://nostarch.com/NLPPython",
"author": "Yuli Vasiliev",
"category": ["books"]
},
{
"type": "education",
"id": "oreilly-python-ds",