Merge pull request #5755 from adrianeboyd/v2.3.x

Update v2.3.x from master
Adriane Boyd 2020-07-13 15:30:40 +02:00 committed by GitHub
commit bf778f59c7
17 changed files with 572 additions and 124 deletions

.github/contributors/gandersen101.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Grant Andersen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07.06.2020 |
| GitHub username | gandersen101 |
| Website (optional) | |

.github/contributors/jbesomi.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jonathan B. |
| Company name (if applicable) | besomi.ai |
| Title or role (if applicable) | - |
| Date | 07.07.2020 |
| GitHub username | jbesomi |
| Website (optional) | besomi.ai |

.github/contributors/mikeizbicki.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mike Izbicki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02 Jun 2020 |
| GitHub username | mikeizbicki |
| Website (optional) | https://izbicki.me |

View File

@@ -14,7 +14,7 @@ from thinc.api import with_getitem, flatten_add_lengths
 from thinc.api import uniqued, wrap, noop
 from thinc.linear.linear import LinearModel
 from thinc.neural.ops import NumpyOps, CupyOps
-from thinc.neural.util import get_array_module, copy_array
+from thinc.neural.util import get_array_module, copy_array, to_categorical
 from thinc.neural.optimizers import Adam
 from thinc import describe
@@ -840,6 +840,8 @@ def masked_language_model(vocab, model, mask_prob=0.15):
         def mlm_backward(d_output, sgd=None):
             d_output *= 1 - mask
+            # Rescale gradient for number of instances.
+            d_output *= mask.size - mask.sum()
             return backprop(d_output, sgd=sgd)
         return output, mlm_backward
@@ -944,7 +946,7 @@ class CharacterEmbed(Model):
         # for the tip.
         nCv = self.ops.xp.arange(self.nC)
         for doc in docs:
-            doc_ids = doc.to_utf8_array(nr_char=self.nC)
+            doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC))
             doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM))
             # Let's say I have a 2d array of indices, and a 3d table of data. What numpy
             # incantation do I chant to get
@@ -986,3 +988,17 @@ def get_cossim_loss(yh, y, ignore_zeros=False):
         losses[zero_indices] = 0
     loss = losses.sum()
     return loss, -d_yh
+
+
+def get_characters_loss(ops, docs, prediction, nr_char=10):
+    target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
+    target_ids = target_ids.reshape((-1,))
+    target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f")
+    target = target.reshape((-1, 256*nr_char))
+    diff = prediction - target
+    loss = (diff**2).sum()
+    d_target = diff / float(prediction.shape[0])
+    return loss, d_target
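
Note: the new get_characters_loss is a squared-error loss against a one-hot encoding of each token's leading/trailing UTF-8 bytes. A minimal numpy sketch of the same computation, with illustrative names that are not part of spaCy's API:

import numpy

def characters_loss_sketch(prediction, target_ids, nr_char=10):
    # target_ids: (n_tokens, nr_char) array of UTF-8 byte values (0-255)
    # prediction: (n_tokens, 256 * nr_char) array of predicted byte scores
    one_hot = numpy.eye(256, dtype="f")[target_ids.reshape(-1)]
    target = one_hot.reshape((-1, 256 * nr_char))
    diff = prediction - target
    loss = (diff ** 2).sum()
    d_prediction = diff / float(prediction.shape[0])  # gradient w.r.t. the prediction
    return loss, d_prediction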

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.3.1"
+__version__ = "2.3.2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@@ -18,7 +18,8 @@ from ..errors import Errors
 from ..tokens import Doc
 from ..attrs import ID, HEAD
 from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
-from .._ml import masked_language_model, get_cossim_loss
+from .._ml import masked_language_model, get_cossim_loss, get_characters_loss
+from .._ml import MultiSoftmax
 from .. import util
 from .train import _load_pretrained_tok2vec
@@ -42,7 +43,7 @@ from .train import _load_pretrained_tok2vec
    bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
    embed_rows=("Number of embedding rows", "option", "er", int),
    loss_func=(
-        "Loss function to use for the objective. Either 'L2' or 'cosine'",
+        "Loss function to use for the objective. Either 'characters', 'L2' or 'cosine'",
        "option",
        "L",
        str,
@@ -85,11 +86,11 @@ def pretrain(
    output_dir,
    width=96,
    conv_depth=4,
-    bilstm_depth=0,
    cnn_pieces=3,
    sa_depth=0,
-    use_chars=False,
    cnn_window=1,
+    bilstm_depth=0,
+    use_chars=False,
    embed_rows=2000,
    loss_func="cosine",
    use_vectors=False,
@@ -124,11 +125,7 @@ def pretrain(
        config[key] = str(config[key])
    util.fix_random_seed(seed)
-    has_gpu = prefer_gpu()
-    if has_gpu:
-        import torch
-        torch.set_default_tensor_type("torch.cuda.FloatTensor")
+    has_gpu = prefer_gpu(gpu_id=1)
    msg.info("Using GPU" if has_gpu else "Not using GPU")
    output_dir = Path(output_dir)
@@ -174,6 +171,7 @@ def pretrain(
            subword_features=not use_chars,  # Set to False for Chinese etc
            cnn_maxout_pieces=cnn_pieces,  # If set to 1, use Mish activation.
        ),
+        objective=loss_func
    )
    # Load in pretrained weights
    if init_tok2vec is not None:
@@ -264,7 +262,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
    RETURNS loss: A float for the loss.
    """
    predictions, backprop = model.begin_update(docs, drop=drop)
-    loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
+    if objective == "characters":
+        loss, gradients = get_characters_loss(model.ops, docs, predictions)
+    else:
+        loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
    backprop(gradients, sgd=optimizer)
    # Don't want to return a cupy object here
    # The gradients are modified in-place by the BERT MLM,
@@ -326,16 +327,23 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
    return loss, d_target

-def create_pretraining_model(nlp, tok2vec):
+def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10):
    """Define a network for the pretraining. We simply add an output layer onto
    the tok2vec input model. The tok2vec input model needs to be a model that
    takes a batch of Doc objects (as a list), and returns a list of arrays.
    Each array in the output needs to have one row per token in the doc.
    """
-    output_size = nlp.vocab.vectors.data.shape[1]
-    output_layer = chain(
-        LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
-    )
+    if objective == "characters":
+        out_sizes = [256] * nr_char
+        output_layer = chain(
+            LN(Maxout(300, pieces=3)),
+            MultiSoftmax(out_sizes, 300)
+        )
+    else:
+        output_size = nlp.vocab.vectors.data.shape[1]
+        output_layer = chain(
+            LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
+        )
    # This is annoying, but the parser etc have the flatten step after
    # the tok2vec. To load the weights in cleanly, we need to match
    # the shape of the models' components exactly. So what we cann
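
Usage sketch for the new objective, calling the pretrain entry point from Python instead of the command line; the corpus path, vectors package and output directory below are placeholders:

from spacy.cli import pretrain

# Roughly equivalent to:
#   python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrain-out --loss-func characters --use-chars
pretrain(
    "texts.jsonl",        # raw-text corpus, one JSON object per line
    "en_vectors_web_lg",  # model providing the vocab; its vectors are not the target here
    "./pretrain-out",     # directory that receives the model*.bin weights
    loss_func="characters",
    use_chars=True,
)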

View File

@@ -285,7 +285,7 @@ def train(
    if base_model and not pipes_added:
        # Start with an existing model, use default optimizer
-        optimizer = create_default_optimizer(Model.ops)
+        optimizer = nlp.resume_training(device=use_gpu)
    else:
        # Start with a blank model, call begin_training
        cfg = {"device": use_gpu}
@@ -576,6 +576,8 @@ def train(
        with nlp.use_params(optimizer.averages):
            final_model_path = output_path / "model-final"
            nlp.to_disk(final_model_path)
+            srsly.write_json(final_model_path / "meta.json", meta)
        meta_loc = output_path / "model-final" / "meta.json"
        final_meta = srsly.read_json(meta_loc)
        final_meta.setdefault("accuracy", {})
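
The first hunk means that training started from a base model now goes through nlp.resume_training, which prepares the existing pipeline for further updates and returns an optimizer. A rough sketch, assuming a small English model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")          # the base model passed via --base-model
optimizer = nlp.resume_training(device=-1)  # keep the existing weights, get an optimizer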

View File

@@ -18,41 +18,6 @@ def _return_en(_):
     return "en"

-def en_is_base_form(univ_pos, morphology=None):
-    """
-    Check whether we're dealing with an uninflected paradigm, so we can
-    avoid lemmatization entirely.
-
-    univ_pos (unicode / int): The token's universal part-of-speech tag.
-    morphology (dict): The token's morphological features following the
-        Universal Dependencies scheme.
-    """
-    if morphology is None:
-        morphology = {}
-    if univ_pos == "noun" and morphology.get("Number") == "sing":
-        return True
-    elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
-        return True
-    # This maps 'VBP' to base form -- probably just need 'IS_BASE'
-    # morphology
-    elif univ_pos == "verb" and (
-        morphology.get("VerbForm") == "fin"
-        and morphology.get("Tense") == "pres"
-        and morphology.get("Number") is None
-    ):
-        return True
-    elif univ_pos == "adj" and morphology.get("Degree") == "pos":
-        return True
-    elif morphology.get("VerbForm") == "inf":
-        return True
-    elif morphology.get("VerbForm") == "none":
-        return True
-    elif morphology.get("Degree") == "pos":
-        return True
-    else:
-        return False
-
 class EnglishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
@@ -61,7 +26,6 @@ class EnglishDefaults(Language.Defaults):
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
     morph_rules = MORPH_RULES
-    is_base_form = en_is_base_form
     syntax_iterators = SYNTAX_ITERATORS
     single_orth_variants = [
         {"tags": ["NFP"], "variants": ["…", "..."]},
@@ -72,6 +36,41 @@ class EnglishDefaults(Language.Defaults):
         {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
     ]

+    @classmethod
+    def is_base_form(cls, univ_pos, morphology=None):
+        """
+        Check whether we're dealing with an uninflected paradigm, so we can
+        avoid lemmatization entirely.
+
+        univ_pos (unicode / int): The token's universal part-of-speech tag.
+        morphology (dict): The token's morphological features following the
+            Universal Dependencies scheme.
+        """
+        if morphology is None:
+            morphology = {}
+        if univ_pos == "noun" and morphology.get("Number") == "sing":
+            return True
+        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+            return True
+        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
+        # morphology
+        elif univ_pos == "verb" and (
+            morphology.get("VerbForm") == "fin"
+            and morphology.get("Tense") == "pres"
+            and morphology.get("Number") is None
+        ):
+            return True
+        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+            return True
+        elif morphology.get("VerbForm") == "inf":
+            return True
+        elif morphology.get("VerbForm") == "none":
+            return True
+        elif morphology.get("Degree") == "pos":
+            return True
+        else:
+            return False
+

 class English(Language):
     lang = "en"
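
The base-form check now lives on EnglishDefaults as a classmethod, so the English defaults no longer need a module-level helper (see the related Python 2 pickling caveat in the test change further down). A quick sketch of calling it directly:

from spacy.lang.en import EnglishDefaults

# Singular nouns and infinitives count as uninflected base forms; finite verbs do not.
assert EnglishDefaults.is_base_form("noun", {"Number": "sing"}) is True
assert EnglishDefaults.is_base_form("verb", {"VerbForm": "fin"}) is False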

View File

@@ -45,9 +45,6 @@ class FrenchLemmatizer(Lemmatizer):
             univ_pos = "sconj"
         else:
             return [self.lookup(string)]
-        # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
-            return list(set([string.lower()]))
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
@@ -59,43 +56,6 @@ class FrenchLemmatizer(Lemmatizer):
         )
         return lemmas

-    def is_base_form(self, univ_pos, morphology=None):
-        """
-        Check whether we're dealing with an uninflected paradigm, so we can
-        avoid lemmatization entirely.
-        """
-        morphology = {} if morphology is None else morphology
-        others = [
-            key
-            for key in morphology
-            if key not in (POS, "Number", "POS", "VerbForm", "Tense")
-        ]
-        if univ_pos == "noun" and morphology.get("Number") == "sing":
-            return True
-        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
-            return True
-        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
-        # morphology
-        elif univ_pos == "verb" and (
-            morphology.get("VerbForm") == "fin"
-            and morphology.get("Tense") == "pres"
-            and morphology.get("Number") is None
-            and not others
-        ):
-            return True
-        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
-            return True
-        elif VerbForm_inf in morphology:
-            return True
-        elif VerbForm_none in morphology:
-            return True
-        elif Number_sing in morphology:
-            return True
-        elif Degree_pos in morphology:
-            return True
-        else:
-            return False
-
     def noun(self, string, morphology=None):
         return self(string, "noun", morphology)

View File

@@ -42,7 +42,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.Tokenizer = try_mecab_import()
+        MeCab = try_mecab_import()
+        self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+
+    def __del__(self):
+        self.mecab_tokenizer.__del__()

     def __call__(self, text):
         dtokens = list(self.detailed_tokens(text))
@@ -58,17 +62,16 @@ class KoreanTokenizer(DummyTokenizer):
     def detailed_tokens(self, text):
         # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
         # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
-        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-            for node in tokenizer.parse(text, as_nodes=True):
-                if node.is_eos():
-                    break
-                surface = node.surface
-                feature = node.feature
-                tag, _, expr = feature.partition(",")
-                lemma, _, remainder = expr.partition("/")
-                if lemma == "*":
-                    lemma = surface
-                yield {"surface": surface, "lemma": lemma, "tag": tag}
+        for node in self.mecab_tokenizer.parse(text, as_nodes=True):
+            if node.is_eos():
+                break
+            surface = node.surface
+            feature = node.feature
+            tag, _, expr = feature.partition(",")
+            lemma, _, remainder = expr.partition("/")
+            if lemma == "*":
+                lemma = surface
+            yield {"surface": surface, "lemma": lemma, "tag": tag}

 class KoreanDefaults(Language.Defaults):
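
Holding on to a single MeCab instance avoids re-creating the tokenizer on every call; usage is unchanged. A quick sketch, assuming the optional natto-py and mecab-ko dependencies are installed:

import spacy

nlp = spacy.blank("ko")
doc = nlp("안녕하세요.")
print([(token.text, token.tag_, token.lemma_) for token in doc])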

View File

@@ -21,7 +21,7 @@ class Lemmatizer(object):
     def load(cls, *args, **kwargs):
         raise NotImplementedError(Errors.E172)

-    def __init__(self, lookups, *args, is_base_form=None, **kwargs):
+    def __init__(self, lookups, is_base_form=None, *args, **kwargs):
         """Initialize a Lemmatizer.

         lookups (Lookups): The lookups object containing the (optional) tables
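
Moving is_base_form ahead of *args lets callers pass it positionally as the second argument. A construction sketch; the empty Lookups is only a placeholder:

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
from spacy.lang.en import EnglishDefaults

lemmatizer = Lemmatizer(Lookups(), EnglishDefaults.is_base_form)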

View File

@@ -49,6 +49,14 @@ def Tok2Vec(width, embed_size, **kwargs):
             >> LN(Maxout(width, width * 5, pieces=3)),
             column=cols.index(ORTH),
         )
+    elif char_embed:
+        embed = concatenate_lists(
+            CharacterEmbed(nM=64, nC=8),
+            FeatureExtracter(cols) >> with_flatten(glove),
+        )
+        reduce_dimensions = LN(
+            Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
+        )
     else:
         embed = uniqued(
             (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
@@ -81,7 +89,8 @@ def Tok2Vec(width, embed_size, **kwargs):
         )
     else:
         tok2vec = FeatureExtracter(cols) >> with_flatten(
-            embed >> convolution ** conv_depth, pad=conv_depth
+            embed
+            >> convolution ** conv_depth, pad=conv_depth
         )

     if bilstm_depth >= 1:
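
The new branch concatenates the CharacterEmbed table with the hashed token features when character embeddings are requested. A construction sketch, assuming char_embed is accepted through Tok2Vec's keyword arguments as elsewhere in spaCy 2.x:

from spacy._ml import Tok2Vec

# Build the tok2vec layer with the character-embedding branch enabled.
tok2vec = Tok2Vec(width=96, embed_size=2000, char_embed=True, conv_depth=4)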

View File

@@ -33,6 +33,7 @@ from .._ml import build_text_classifier, build_simple_cnn_text_classifier
 from .._ml import build_bow_text_classifier, build_nel_encoder
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss
+from .._ml import MultiSoftmax, get_characters_loss
 from ..errors import Errors, TempErrors, Warnings
 from .. import util
@@ -846,11 +847,15 @@ class MultitaskObjective(Tagger):
 class ClozeMultitask(Pipe):
     @classmethod
     def Model(cls, vocab, tok2vec, **cfg):
-        output_size = vocab.vectors.data.shape[1]
-        output_layer = chain(
-            LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
-            zero_init(Affine(output_size, output_size, drop_factor=0.0))
-        )
+        if cfg["objective"] == "characters":
+            out_sizes = [256] * cfg.get("nr_char", 4)
+            output_layer = MultiSoftmax(out_sizes)
+        else:
+            output_size = vocab.vectors.data.shape[1]
+            output_layer = chain(
+                LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
+                zero_init(Affine(output_size, output_size, drop_factor=0.0))
+            )
         model = chain(tok2vec, output_layer)
         model = masked_language_model(vocab, model)
         model.tok2vec = tok2vec
@@ -861,6 +866,8 @@ class ClozeMultitask(Pipe):
         self.vocab = vocab
         self.model = model
         self.cfg = cfg
+        self.cfg.setdefault("objective", "characters")
+        self.cfg.setdefault("nr_char", 4)

     def set_annotations(self, docs, dep_ids, tensors=None):
         pass
@@ -869,7 +876,8 @@ class ClozeMultitask(Pipe):
                        tok2vec=None, sgd=None, **kwargs):
         link_vectors_to_models(self.vocab)
         if self.model is True:
-            self.model = self.Model(self.vocab, tok2vec)
+            kwargs.update(self.cfg)
+            self.model = self.Model(self.vocab, tok2vec, **kwargs)
         X = self.model.ops.allocate((5, self.model.tok2vec.nO))
         self.model.output_layer.begin_training(X)
         if sgd is None:
@@ -883,13 +891,16 @@ class ClozeMultitask(Pipe):
         return tokvecs, vectors

     def get_loss(self, docs, vectors, prediction):
-        # The simplest way to implement this would be to vstack the
-        # token.vector values, but that's a bit inefficient, especially on GPU.
-        # Instead we fetch the index into the vectors table for each of our tokens,
-        # and look them up all at once. This prevents data copying.
-        ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
-        target = vectors[ids]
-        loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
+        if self.cfg["objective"] == "characters":
+            loss, gradient = get_characters_loss(self.model.ops, docs, prediction)
+        else:
+            # The simplest way to implement this would be to vstack the
+            # token.vector values, but that's a bit inefficient, especially on GPU.
+            # Instead we fetch the index into the vectors table for each of our tokens,
+            # and look them up all at once. This prevents data copying.
+            ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
+            target = vectors[ids]
+            loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
         return float(loss), gradient

     def update(self, docs, golds, drop=0., sgd=None, losses=None):
@@ -906,6 +917,20 @@ class ClozeMultitask(Pipe):
         if losses is not None:
             losses[self.name] += loss

+    @staticmethod
+    def decode_utf8_predictions(char_array):
+        # The format alternates filling from start and end, and 255 is missing
+        words = []
+        char_array = char_array.reshape((char_array.shape[0], -1, 256))
+        nr_char = char_array.shape[1]
+        char_array = char_array.argmax(axis=-1)
+        for row in char_array:
+            starts = [chr(c) for c in row[::2] if c != 255]
+            ends = [chr(c) for c in row[1::2] if c != 255]
+            word = "".join(starts + list(reversed(ends)))
+            words.append(word)
+        return words
+

 @component("textcat", assigns=["doc.cats"])
 class TextCategorizer(Pipe):
@@ -1069,6 +1094,7 @@ cdef class DependencyParser(Parser):
     assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
     requires = []
     TransitionSystem = ArcEager
+    nr_feature = 8

     @property
     def postprocesses(self):
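
The decode_utf8_predictions helper added above assumes the alternating layout produced by Doc.to_utf8_array: even positions hold bytes from the start of the word, odd positions hold bytes from the end (reversed), and 255 marks padding. A toy illustration of that decoding step:

import numpy

row = numpy.array([ord("s"), ord("y"), ord("p"), ord("c"), ord("a"), 255])
starts = [chr(c) for c in row[::2] if c != 255]
ends = [chr(c) for c in row[1::2] if c != 255]
print("".join(starts + list(reversed(ends))))  # -> "spacy"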

View File

@@ -59,7 +59,7 @@ def test_issue2626_2835(en_tokenizer, text):
 def test_issue2656(en_tokenizer):
-    """Test that tokenizer correctly splits of punctuation after numbers with
+    """Test that tokenizer correctly splits off punctuation after numbers with
     decimal points.
     """
     doc = en_tokenizer("I went for 40.3, and got home by 10.0.")

View File

@@ -121,6 +121,7 @@ def test_issue3248_1():
     assert len(matcher) == 2

+@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form")
 def test_issue3248_2():
     """Test that the PhraseMatcher can be pickled correctly."""
     nlp = English()
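
The new skip marker reflects that the bound is_base_form method can't be pickled on Python 2; on Python 3 the PhraseMatcher (and the English defaults it references) pickles fine. A rough sketch of what the test exercises:

import pickle

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST", None, nlp("spaCy"))  # v2-style add(key, on_match, *docs)
restored = pickle.loads(pickle.dumps(matcher))
assert len(restored) == len(matcher)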

View File

@@ -473,7 +473,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
 | `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
 | `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
 | `--embed-rows`, `-er` | option | Number of embedding rows. |
-| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
+| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. |
 | `--dropout`, `-d` | option | Dropout rate. |
 | `--batch-size`, `-bs` | option | Number of words per training batch. |
 | `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |

View File

@@ -1,5 +1,58 @@
{
"resources": [
{
"id": "spacy-streamlit",
"title": "spacy-streamlit",
"slogan": "spaCy building blocks for Streamlit apps",
"github": "explosion/spacy-streamlit",
"description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
"pip": "spacy-streamlit",
"category": ["visualizers"],
"thumb": "https://i.imgur.com/mhEjluE.jpg",
"image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
"code_example": [
"import spacy_streamlit",
"",
"models = [\"en_core_web_sm\", \"en_core_web_md\"]",
"default_text = \"Sundar Pichai is the CEO of Google.\"",
"spacy_streamlit.visualize(models, default_text))"
],
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
}
},
{
"id": "spaczz",
"title": "spaczz",
"slogan": "Fuzzy matching and more for spaCy.",
"description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
"github": "gandersen101/spaczz",
"pip": "spaczz",
"code_example": [
"import spacy",
"from spaczz.pipeline import SpaczzRuler",
"",
"nlp = spacy.blank('en')",
"ruler = SpaczzRuler(nlp)",
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
"nlp.add_pipe(ruler)",
"",
"doc = nlp('Oops, I spelled Bill Gatez wrong.')",
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
"author": "Grant Andersen",
"author_links": {
"twitter": "gandersen101",
"github": "gandersen101"
},
"category": ["pipeline"],
"tags": ["fuzzy-matching", "regex"]
},
{
"id": "spacy-universal-sentence-encoder",
"title": "SpaCy - Universal Sentence Encoder",
@@ -1237,6 +1290,19 @@
"youtube": "K1elwpgDdls",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-course-es",
"title": "NLP avanzado con spaCy · Un curso en línea gratis",
"description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
"url": "https://course.spacy.io/es",
"author": "Camila Gutiérrez",
"author_links": {
"twitter": "Mariacamilagl30"
},
"youtube": "RNiLVCE5d4k",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-1",
@@ -1293,6 +1359,20 @@
"youtube": "IqOJU1-_Fi0",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-5",
"title": "Intro to NLP with spaCy (5)",
"slogan": "Episode 5: Rules vs. Machine Learning",
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"author": "Vincent Warmerdam",
"author_links": {
"twitter": "fishnets88",
"github": "koaning"
},
"youtube": "f4sqeLRzkPg",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-irl-entity-linking",
@@ -2347,6 +2427,32 @@
},
"category": ["pipeline", "conversational", "research"],
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
},
{
"id": "texthero",
"title": "Texthero",
"slogan": "Text preprocessing, representation and visualization from zero to hero.",
"description": "Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
"github": "jbesomi/texthero",
"pip": "texthero",
"code_example": [
"import texthero as hero",
"import pandas as pd",
"",
"df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
"df['named_entities'] = hero.named_entities(df['text'])",
"df.head()"
],
"code_language": "python",
"url": "https://texthero.org",
"thumb": "https://texthero.org/img/T.png",
"image": "https://texthero.org/docs/assets/texthero.png",
"author": "Jonathan Besomi",
"author_links": {
"github": "jbesomi",
"website": "https://besomi.ai"
},
"category": ["standalone"]
}
],