Merge pull request #5755 from adrianeboyd/v2.3.x

Update v2.3.x from master
Adriane Boyd 2020-07-13 15:30:40 +02:00 committed by GitHub
commit bf778f59c7
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
17 changed files with 572 additions and 124 deletions

.github/contributors/gandersen101.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Grant Andersen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07.06.2020 |
| GitHub username | gandersen101 |
| Website (optional) | |

.github/contributors/jbesomi.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jonathan B. |
| Company name (if applicable) | besomi.ai |
| Title or role (if applicable) | - |
| Date | 07.07.2020 |
| GitHub username | jbesomi |
| Website (optional) | besomi.ai |

.github/contributors/mikeizbicki.md

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mike Izbicki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02 Jun 2020 |
| GitHub username | mikeizbicki |
| Website (optional) | https://izbicki.me |

View File

@@ -14,7 +14,7 @@ from thinc.api import with_getitem, flatten_add_lengths
from thinc.api import uniqued, wrap, noop
from thinc.linear.linear import LinearModel
from thinc.neural.ops import NumpyOps, CupyOps
from thinc.neural.util import get_array_module, copy_array
from thinc.neural.util import get_array_module, copy_array, to_categorical
from thinc.neural.optimizers import Adam
from thinc import describe
@@ -840,6 +840,8 @@ def masked_language_model(vocab, model, mask_prob=0.15):
def mlm_backward(d_output, sgd=None):
d_output *= 1 - mask
# Rescale gradient for number of instances.
d_output *= mask.size - mask.sum()
return backprop(d_output, sgd=sgd)
return output, mlm_backward
@@ -944,7 +946,7 @@ class CharacterEmbed(Model):
# for the tip.
nCv = self.ops.xp.arange(self.nC)
for doc in docs:
doc_ids = doc.to_utf8_array(nr_char=self.nC)
doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC))
doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM))
# Let's say I have a 2d array of indices, and a 3d table of data. What numpy
# incantation do I chant to get
@@ -986,3 +988,17 @@ def get_cossim_loss(yh, y, ignore_zeros=False):
losses[zero_indices] = 0
loss = losses.sum()
return loss, -d_yh
def get_characters_loss(ops, docs, prediction, nr_char=10):
target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
target_ids = target_ids.reshape((-1,))
target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f")
target = target.reshape((-1, 256*nr_char))
diff = prediction - target
loss = (diff**2).sum()
d_target = diff / float(prediction.shape[0])
return loss, d_target
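As an aside, the question in the CharacterEmbed comment above ("a 2d array of indices, and a 3d table of data") has a standard numpy answer: index the table with a broadcast pair of index arrays. A minimal sketch, with all shapes assumed purely for illustration:

import numpy

n_tokens, n_chars, width = 5, 8, 64                    # assumed sizes, for illustration only
table = numpy.zeros((n_chars, 256, width), dtype="f")  # one 256-entry table per character slot
ids = numpy.random.randint(0, 256, size=(n_tokens, n_chars))
slots = numpy.arange(n_chars)
# Broadcasting (n_chars,) against (n_tokens, n_chars) selects
# output[i, j] == table[j, ids[i, j]], giving shape (n_tokens, n_chars, width).
output = table[slots, ids]
assert output.shape == (n_tokens, n_chars, width)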

View File

@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.3.1"
__version__ = "2.3.2"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@@ -18,7 +18,8 @@ from ..errors import Errors
from ..tokens import Doc
from ..attrs import ID, HEAD
from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
from .._ml import masked_language_model, get_cossim_loss
from .._ml import masked_language_model, get_cossim_loss, get_characters_loss
from .._ml import MultiSoftmax
from .. import util
from .train import _load_pretrained_tok2vec
@@ -42,7 +43,7 @@ from .train import _load_pretrained_tok2vec
bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
embed_rows=("Number of embedding rows", "option", "er", int),
loss_func=(
"Loss function to use for the objective. Either 'L2' or 'cosine'",
"Loss function to use for the objective. Either 'characters', 'L2' or 'cosine'",
"option",
"L",
str,
@@ -85,11 +86,11 @@ def pretrain(
output_dir,
width=96,
conv_depth=4,
bilstm_depth=0,
cnn_pieces=3,
sa_depth=0,
use_chars=False,
cnn_window=1,
bilstm_depth=0,
use_chars=False,
embed_rows=2000,
loss_func="cosine",
use_vectors=False,
@@ -124,11 +125,7 @@ def pretrain(
config[key] = str(config[key])
util.fix_random_seed(seed)
has_gpu = prefer_gpu()
if has_gpu:
import torch
torch.set_default_tensor_type("torch.cuda.FloatTensor")
has_gpu = prefer_gpu(gpu_id=1)
msg.info("Using GPU" if has_gpu else "Not using GPU")
output_dir = Path(output_dir)
@@ -174,6 +171,7 @@ def pretrain(
subword_features=not use_chars, # Set to False for Chinese etc
cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation.
),
objective=loss_func
)
# Load in pretrained weights
if init_tok2vec is not None:
@@ -264,7 +262,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
RETURNS loss: A float for the loss.
"""
predictions, backprop = model.begin_update(docs, drop=drop)
loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
if objective == "characters":
loss, gradients = get_characters_loss(model.ops, docs, predictions)
else:
loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
backprop(gradients, sgd=optimizer)
# Don't want to return a cupy object here
# The gradients are modified in-place by the BERT MLM,
@@ -326,16 +327,23 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
return loss, d_target
def create_pretraining_model(nlp, tok2vec):
def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10):
"""Define a network for the pretraining. We simply add an output layer onto
the tok2vec input model. The tok2vec input model needs to be a model that
takes a batch of Doc objects (as a list), and returns a list of arrays.
Each array in the output needs to have one row per token in the doc.
"""
output_size = nlp.vocab.vectors.data.shape[1]
output_layer = chain(
LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
)
if objective == "characters":
out_sizes = [256] * nr_char
output_layer = chain(
LN(Maxout(300, pieces=3)),
MultiSoftmax(out_sizes, 300)
)
else:
output_size = nlp.vocab.vectors.data.shape[1]
output_layer = chain(
LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
)
# This is annoying, but the parser etc have the flatten step after
# the tok2vec. To load the weights in cleanly, we need to match
the shape of the models' components exactly. So what we can
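For reference, a small numpy-only sketch of how the "characters" objective's target is laid out, mirroring get_characters_loss above; the word and the nr_char value are made up for illustration:

import numpy

nr_char = 4
byte_ids = numpy.array([[104, 105, 255, 255]])              # e.g. the bytes of "hi", padded with 255
one_hot = numpy.eye(256, dtype="f")[byte_ids.reshape(-1)]   # like to_categorical(..., nb_classes=256)
target = one_hot.reshape((-1, 256 * nr_char))               # one row per token, here (1, 1024)
# The MultiSoftmax output layer predicts the same (n_tokens, 256 * nr_char) layout,
# so the loss is just the squared difference between prediction and target.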

View File

@@ -285,7 +285,7 @@ def train(
if base_model and not pipes_added:
# Start with an existing model, use default optimizer
optimizer = create_default_optimizer(Model.ops)
optimizer = nlp.resume_training(device=use_gpu)
else:
# Start with a blank model, call begin_training
cfg = {"device": use_gpu}
@@ -576,6 +576,8 @@ def train(
with nlp.use_params(optimizer.averages):
final_model_path = output_path / "model-final"
nlp.to_disk(final_model_path)
srsly.write_json(final_model_path / "meta.json", meta)
meta_loc = output_path / "model-final" / "meta.json"
final_meta = srsly.read_json(meta_loc)
final_meta.setdefault("accuracy", {})

View File

@@ -18,41 +18,6 @@ def _return_en(_):
return "en"
def en_is_base_form(univ_pos, morphology=None):
"""
Check whether we're dealing with an uninflected paradigm, so we can
avoid lemmatization entirely.
univ_pos (unicode / int): The token's universal part-of-speech tag.
morphology (dict): The token's morphological features following the
Universal Dependencies scheme.
"""
if morphology is None:
morphology = {}
if univ_pos == "noun" and morphology.get("Number") == "sing":
return True
elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
return True
# This maps 'VBP' to base form -- probably just need 'IS_BASE'
# morphology
elif univ_pos == "verb" and (
morphology.get("VerbForm") == "fin"
and morphology.get("Tense") == "pres"
and morphology.get("Number") is None
):
return True
elif univ_pos == "adj" and morphology.get("Degree") == "pos":
return True
elif morphology.get("VerbForm") == "inf":
return True
elif morphology.get("VerbForm") == "none":
return True
elif morphology.get("Degree") == "pos":
return True
else:
return False
class EnglishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
@@ -61,7 +26,6 @@ class EnglishDefaults(Language.Defaults):
tag_map = TAG_MAP
stop_words = STOP_WORDS
morph_rules = MORPH_RULES
is_base_form = en_is_base_form
syntax_iterators = SYNTAX_ITERATORS
single_orth_variants = [
{"tags": ["NFP"], "variants": ["", "..."]},
@@ -72,6 +36,41 @@ class EnglishDefaults(Language.Defaults):
{"tags": ["``", "''"], "variants": [('"', '"'), ("", "")]},
]
@classmethod
def is_base_form(cls, univ_pos, morphology=None):
"""
Check whether we're dealing with an uninflected paradigm, so we can
avoid lemmatization entirely.
univ_pos (unicode / int): The token's universal part-of-speech tag.
morphology (dict): The token's morphological features following the
Universal Dependencies scheme.
"""
if morphology is None:
morphology = {}
if univ_pos == "noun" and morphology.get("Number") == "sing":
return True
elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
return True
# This maps 'VBP' to base form -- probably just need 'IS_BASE'
# morphology
elif univ_pos == "verb" and (
morphology.get("VerbForm") == "fin"
and morphology.get("Tense") == "pres"
and morphology.get("Number") is None
):
return True
elif univ_pos == "adj" and morphology.get("Degree") == "pos":
return True
elif morphology.get("VerbForm") == "inf":
return True
elif morphology.get("VerbForm") == "none":
return True
elif morphology.get("Degree") == "pos":
return True
else:
return False
class English(Language):
lang = "en"
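Since is_base_form is now a classmethod on EnglishDefaults, it can be pickled and called without instantiating the language class. A quick illustration of the rule it implements, assuming spaCy v2.3.2 is installed:

from spacy.lang.en import EnglishDefaults

# A singular noun or an infinitive verb counts as an uninflected base form,
# so lemmatization can be skipped entirely (see the docstring above).
assert EnglishDefaults.is_base_form("noun", {"Number": "sing"}) is True
assert EnglishDefaults.is_base_form("verb", {"VerbForm": "fin", "Tense": "past"}) is False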

View File

@@ -45,9 +45,6 @@ class FrenchLemmatizer(Lemmatizer):
univ_pos = "sconj"
else:
return [self.lookup(string)]
# See Issue #435 for example of where this logic is required.
if self.is_base_form(univ_pos, morphology):
return list(set([string.lower()]))
index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {})
rules_table = self.lookups.get_table("lemma_rules", {})
@@ -59,43 +56,6 @@ class FrenchLemmatizer(Lemmatizer):
)
return lemmas
def is_base_form(self, univ_pos, morphology=None):
"""
Check whether we're dealing with an uninflected paradigm, so we can
avoid lemmatization entirely.
"""
morphology = {} if morphology is None else morphology
others = [
key
for key in morphology
if key not in (POS, "Number", "POS", "VerbForm", "Tense")
]
if univ_pos == "noun" and morphology.get("Number") == "sing":
return True
elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
return True
# This maps 'VBP' to base form -- probably just need 'IS_BASE'
# morphology
elif univ_pos == "verb" and (
morphology.get("VerbForm") == "fin"
and morphology.get("Tense") == "pres"
and morphology.get("Number") is None
and not others
):
return True
elif univ_pos == "adj" and morphology.get("Degree") == "pos":
return True
elif VerbForm_inf in morphology:
return True
elif VerbForm_none in morphology:
return True
elif Number_sing in morphology:
return True
elif Degree_pos in morphology:
return True
else:
return False
def noun(self, string, morphology=None):
return self(string, "noun", morphology)

View File

@@ -42,7 +42,11 @@ def check_spaces(text, tokens):
class KoreanTokenizer(DummyTokenizer):
def __init__(self, cls, nlp=None):
self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
self.Tokenizer = try_mecab_import()
MeCab = try_mecab_import()
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
def __del__(self):
self.mecab_tokenizer.__del__()
def __call__(self, text):
dtokens = list(self.detailed_tokens(text))
@@ -58,17 +62,16 @@ class KoreanTokenizer(DummyTokenizer):
def detailed_tokens(self, text):
# POS tag (품사 태그)[0], semantic class (의미 부류)[1], jongseong presence (종성 유무)[2], reading (읽기)[3],
# type (타입)[4], start pos (첫번째 품사)[5], end pos (마지막 품사)[6], expression (표현)[7], *
with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
for node in tokenizer.parse(text, as_nodes=True):
if node.is_eos():
break
surface = node.surface
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
if lemma == "*":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
for node in self.mecab_tokenizer.parse(text, as_nodes=True):
if node.is_eos():
break
surface = node.surface
feature = node.feature
tag, _, expr = feature.partition(",")
lemma, _, remainder = expr.partition("/")
if lemma == "*":
lemma = surface
yield {"surface": surface, "lemma": lemma, "tag": tag}
class KoreanDefaults(Language.Defaults):

View File

@@ -21,7 +21,7 @@ class Lemmatizer(object):
def load(cls, *args, **kwargs):
raise NotImplementedError(Errors.E172)
def __init__(self, lookups, *args, is_base_form=None, **kwargs):
def __init__(self, lookups, is_base_form=None, *args, **kwargs):
"""Initialize a Lemmatizer.
lookups (Lookups): The lookups object containing the (optional) tables

View File

@@ -49,6 +49,14 @@ def Tok2Vec(width, embed_size, **kwargs):
>> LN(Maxout(width, width * 5, pieces=3)),
column=cols.index(ORTH),
)
elif char_embed:
embed = concatenate_lists(
CharacterEmbed(nM=64, nC=8),
FeatureExtracter(cols) >> with_flatten(glove),
)
reduce_dimensions = LN(
Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
)
else:
embed = uniqued(
(glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
@@ -81,7 +89,8 @@ def Tok2Vec(width, embed_size, **kwargs):
)
else:
tok2vec = FeatureExtracter(cols) >> with_flatten(
embed >> convolution ** conv_depth, pad=conv_depth
embed
>> convolution ** conv_depth, pad=conv_depth
)
if bilstm_depth >= 1:
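The dimension arithmetic behind reduce_dimensions above: CharacterEmbed(nM=64, nC=8) contributes 64 * 8 = 512 features per token, and the concatenated word embedding contributes width more, so the Maxout that projects back down to width needs an input size of 64 * 8 + width.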

View File

@@ -33,6 +33,7 @@ from .._ml import build_text_classifier, build_simple_cnn_text_classifier
from .._ml import build_bow_text_classifier, build_nel_encoder
from .._ml import link_vectors_to_models, zero_init, flatten
from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss
from .._ml import MultiSoftmax, get_characters_loss
from ..errors import Errors, TempErrors, Warnings
from .. import util
@@ -846,11 +847,15 @@ class MultitaskObjective(Tagger):
class ClozeMultitask(Pipe):
@classmethod
def Model(cls, vocab, tok2vec, **cfg):
output_size = vocab.vectors.data.shape[1]
output_layer = chain(
LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
zero_init(Affine(output_size, output_size, drop_factor=0.0))
)
if cfg["objective"] == "characters":
out_sizes = [256] * cfg.get("nr_char", 4)
output_layer = MultiSoftmax(out_sizes)
else:
output_size = vocab.vectors.data.shape[1]
output_layer = chain(
LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
zero_init(Affine(output_size, output_size, drop_factor=0.0))
)
model = chain(tok2vec, output_layer)
model = masked_language_model(vocab, model)
model.tok2vec = tok2vec
@@ -861,6 +866,8 @@ class ClozeMultitask(Pipe):
self.vocab = vocab
self.model = model
self.cfg = cfg
self.cfg.setdefault("objective", "characters")
self.cfg.setdefault("nr_char", 4)
def set_annotations(self, docs, dep_ids, tensors=None):
pass
@@ -869,7 +876,8 @@
tok2vec=None, sgd=None, **kwargs):
link_vectors_to_models(self.vocab)
if self.model is True:
self.model = self.Model(self.vocab, tok2vec)
kwargs.update(self.cfg)
self.model = self.Model(self.vocab, tok2vec, **kwargs)
X = self.model.ops.allocate((5, self.model.tok2vec.nO))
self.model.output_layer.begin_training(X)
if sgd is None:
@@ -883,13 +891,16 @@
return tokvecs, vectors
def get_loss(self, docs, vectors, prediction):
# The simplest way to implement this would be to vstack the
# token.vector values, but that's a bit inefficient, especially on GPU.
# Instead we fetch the index into the vectors table for each of our tokens,
# and look them up all at once. This prevents data copying.
ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
target = vectors[ids]
loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
if self.cfg["objective"] == "characters":
loss, gradient = get_characters_loss(self.model.ops, docs, prediction)
else:
# The simplest way to implement this would be to vstack the
# token.vector values, but that's a bit inefficient, especially on GPU.
# Instead we fetch the index into the vectors table for each of our tokens,
# and look them up all at once. This prevents data copying.
ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
target = vectors[ids]
loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
return float(loss), gradient
def update(self, docs, golds, drop=0., sgd=None, losses=None):
@@ -906,6 +917,20 @@
if losses is not None:
losses[self.name] += loss
@staticmethod
def decode_utf8_predictions(char_array):
# The format alternates filling from start and end, and 255 is missing
words = []
char_array = char_array.reshape((char_array.shape[0], -1, 256))
nr_char = char_array.shape[1]
char_array = char_array.argmax(axis=-1)
for row in char_array:
starts = [chr(c) for c in row[::2] if c != 255]
ends = [chr(c) for c in row[1::2] if c != 255]
word = "".join(starts + list(reversed(ends)))
words.append(word)
return words
@component("textcat", assigns=["doc.cats"])
class TextCategorizer(Pipe):
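A standalone sketch (ASCII-only, plain Python) of the alternating start/end layout that decode_utf8_predictions assumes; encode_word below is a hypothetical stand-in for Doc.to_utf8_array, written only to illustrate the format:

def encode_word(word, nr_char=4, missing=255):
    # Even slots take characters from the start, odd slots from the end,
    # so a small slot budget still keeps both the prefix and the suffix.
    row = [missing] * nr_char
    for i in range(min(nr_char, len(word))):
        row[i] = ord(word[i // 2]) if i % 2 == 0 else ord(word[-(i // 2 + 1)])
    return row

def decode_word(row, missing=255):
    starts = [chr(c) for c in row[::2] if c != missing]
    ends = [chr(c) for c in row[1::2] if c != missing]
    return "".join(starts + list(reversed(ends)))

assert decode_word(encode_word("abc")) == "abc"
assert decode_word(encode_word("tokenization")) == "toon"  # only 4 character slots survive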
@@ -1069,6 +1094,7 @@ cdef class DependencyParser(Parser):
assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
requires = []
TransitionSystem = ArcEager
nr_feature = 8
@property
def postprocesses(self):

View File

@@ -59,7 +59,7 @@ def test_issue2626_2835(en_tokenizer, text):
def test_issue2656(en_tokenizer):
"""Test that tokenizer correctly splits of punctuation after numbers with
"""Test that tokenizer correctly splits off punctuation after numbers with
decimal points.
"""
doc = en_tokenizer("I went for 40.3, and got home by 10.0.")

View File

@@ -121,6 +121,7 @@ def test_issue3248_1():
assert len(matcher) == 2
@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form")
def test_issue3248_2():
"""Test that the PhraseMatcher can be pickled correctly."""
nlp = English()

View File

@@ -473,7 +473,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
| `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
| `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
| `--embed-rows`, `-er` | option | Number of embedding rows. |
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. |
| `--dropout`, `-d` | option | Dropout rate. |
| `--batch-size`, `-bs` | option | Number of words per training batch. |
| `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |
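For example, pretraining with the new objective might be invoked like this (the corpus, vectors model and output paths below are placeholders):

$ python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrained_chars --loss-func characters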

View File

@@ -1,5 +1,58 @@
{
"resources": [
{
"id": "spacy-streamlit",
"title": "spacy-streamlit",
"slogan": "spaCy building blocks for Streamlit apps",
"github": "explosion/spacy-streamlit",
"description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
"pip": "spacy-streamlit",
"category": ["visualizers"],
"thumb": "https://i.imgur.com/mhEjluE.jpg",
"image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
"code_example": [
"import spacy_streamlit",
"",
"models = [\"en_core_web_sm\", \"en_core_web_md\"]",
"default_text = \"Sundar Pichai is the CEO of Google.\"",
"spacy_streamlit.visualize(models, default_text))"
],
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
}
},
{
"id": "spaczz",
"title": "spaczz",
"slogan": "Fuzzy matching and more for spaCy.",
"description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
"github": "gandersen101/spaczz",
"pip": "spaczz",
"code_example": [
"import spacy",
"from spaczz.pipeline import SpaczzRuler",
"",
"nlp = spacy.blank('en')",
"ruler = SpaczzRuler(nlp)",
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
"nlp.add_pipe(ruler)",
"",
"doc = nlp('Oops, I spelled Bill Gatez wrong.')",
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
"author": "Grant Andersen",
"author_links": {
"twitter": "gandersen101",
"github": "gandersen101"
},
"category": ["pipeline"],
"tags": ["fuzzy-matching", "regex"]
},
{
"id": "spacy-universal-sentence-encoder",
"title": "SpaCy - Universal Sentence Encoder",
@@ -1237,6 +1290,19 @@
"youtube": "K1elwpgDdls",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-course-es",
"title": "NLP avanzado con spaCy · Un curso en línea gratis",
"description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
"url": "https://course.spacy.io/es",
"author": "Camila Gutiérrez",
"author_links": {
"twitter": "Mariacamilagl30"
},
"youtube": "RNiLVCE5d4k",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-1",
@@ -1293,6 +1359,20 @@
"youtube": "IqOJU1-_Fi0",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-5",
"title": "Intro to NLP with spaCy (5)",
"slogan": "Episode 5: Rules vs. Machine Learning",
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"author": "Vincent Warmerdam",
"author_links": {
"twitter": "fishnets88",
"github": "koaning"
},
"youtube": "f4sqeLRzkPg",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-irl-entity-linking",
@@ -2347,6 +2427,32 @@
},
"category": ["pipeline", "conversational", "research"],
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
},
{
"id": "texthero",
"title": "Texthero",
"slogan": "Text preprocessing, representation and visualization from zero to hero.",
"description": "Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
"github": "jbesomi/texthero",
"pip": "texthero",
"code_example": [
"import texthero as hero",
"import pandas as pd",
"",
"df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
"df['named_entities'] = hero.named_entities(df['text'])",
"df.head()"
],
"code_language": "python",
"url": "https://texthero.org",
"thumb": "https://texthero.org/img/T.png",
"image": "https://texthero.org/docs/assets/texthero.png",
"author": "Jonathan Besomi",
"author_links": {
"github": "jbesomi",
"website": "https://besomi.ai"
},
"category": ["standalone"]
}
],