Mirror of https://github.com/explosion/spaCy.git (synced 2025-07-03 19:33:19 +03:00)

Merge pull request #5755 from adrianeboyd/v2.3.x: Update v2.3.x from master

This change is contained in commit bf778f59c7.
.github/contributors/gandersen101.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Grant Andersen       |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 07.06.2020           |
| GitHub username                | gandersen101         |
| Website (optional)             |                      |
.github/contributors/jbesomi.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
*(The body of this file repeats the spaCy contributor agreement text shown above verbatim, with the individual-contributor statement checked.)*
## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Jonathan B.          |
| Company name (if applicable)   | besomi.ai            |
| Title or role (if applicable)  | -                    |
| Date                           | 07.07.2020           |
| GitHub username                | jbesomi              |
| Website (optional)             | besomi.ai            |
.github/contributors/mikeizbicki.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
*(The body of this file repeats the spaCy contributor agreement text shown above verbatim, with the individual-contributor statement checked.)*
## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Mike Izbicki         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 02 Jun 2020          |
| GitHub username                | mikeizbicki          |
| Website (optional)             | https://izbicki.me   |
spacy/_ml.py (20 lines changed)
@@ -14,7 +14,7 @@ from thinc.api import with_getitem, flatten_add_lengths
 from thinc.api import uniqued, wrap, noop
 from thinc.linear.linear import LinearModel
 from thinc.neural.ops import NumpyOps, CupyOps
-from thinc.neural.util import get_array_module, copy_array
+from thinc.neural.util import get_array_module, copy_array, to_categorical
 from thinc.neural.optimizers import Adam

 from thinc import describe
@@ -840,6 +840,8 @@ def masked_language_model(vocab, model, mask_prob=0.15):

     def mlm_backward(d_output, sgd=None):
         d_output *= 1 - mask
+        # Rescale gradient for number of instances.
+        d_output *= mask.size - mask.sum()
         return backprop(d_output, sgd=sgd)

     return output, mlm_backward
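A note on the rescaling added above: `d_output *= 1 - mask` zeroes the gradient rows where `mask` is set, and the new line multiplies what survives by `mask.size - mask.sum()`, the count of unmasked entries. A toy numpy sketch of just that arithmetic (hypothetical shapes and values, not spaCy's actual training loop):

```python
import numpy

# Toy gradient: 4 tokens, 3 output dims.
d_output = numpy.ones((4, 3), dtype="f")
# mask flags 2 of the 4 rows; their gradient is dropped.
mask = numpy.array([[1.0], [0.0], [1.0], [0.0]], dtype="f")

d_output *= 1 - mask                 # rows 0 and 2 become zero
d_output *= mask.size - mask.sum()   # rescale survivors by 4 - 2 = 2
print(d_output)                      # rows 1 and 3 are now 2.0, rows 0 and 2 are 0.0
```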
@@ -944,7 +946,7 @@ class CharacterEmbed(Model):
         # for the tip.
         nCv = self.ops.xp.arange(self.nC)
         for doc in docs:
-            doc_ids = doc.to_utf8_array(nr_char=self.nC)
+            doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC))
             doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM))
             # Let's say I have a 2d array of indices, and a 3d table of data. What numpy
             # incantation do I chant to get
@@ -986,3 +988,17 @@ def get_cossim_loss(yh, y, ignore_zeros=False):
         losses[zero_indices] = 0
     loss = losses.sum()
     return loss, -d_yh


+def get_characters_loss(ops, docs, prediction, nr_char=10):
+    target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
+    target_ids = target_ids.reshape((-1,))
+    target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f")
+    target = target.reshape((-1, 256*nr_char))
+    diff = prediction - target
+    loss = (diff**2).sum()
+    d_target = diff / float(prediction.shape[0])
+    return loss, d_target
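The new `get_characters_loss` builds a one-hot byte target per character slot and takes a squared-error loss against the model's prediction. A self-contained numpy sketch of the same computation on toy data (the `target_ids` array below stands in for `doc.to_utf8_array()` output, and `numpy.eye` stands in for thinc's `to_categorical`):

```python
import numpy

nr_char, n_tokens = 4, 3
# Toy UTF-8 byte ids, one row per token; 255 marks missing slots.
target_ids = numpy.array([[72, 105, 255, 255],
                          [89, 111, 117, 255],
                          [33, 255, 255, 255]])
# One-hot over 256 byte values, then flatten to one row per token.
target = numpy.eye(256, dtype="f")[target_ids.reshape(-1)]
target = target.reshape((n_tokens, 256 * nr_char))

prediction = numpy.random.uniform(0, 1, target.shape).astype("f")
diff = prediction - target
loss = (diff ** 2).sum()
d_target = diff / float(prediction.shape[0])
print(loss, d_target.shape)  # scalar loss, gradient shaped like the prediction
```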
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.3.1"
+__version__ = "2.3.2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
@@ -18,7 +18,8 @@ from ..errors import Errors
 from ..tokens import Doc
 from ..attrs import ID, HEAD
 from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
-from .._ml import masked_language_model, get_cossim_loss
+from .._ml import masked_language_model, get_cossim_loss, get_characters_loss
+from .._ml import MultiSoftmax
 from .. import util
 from .train import _load_pretrained_tok2vec
@@ -42,7 +43,7 @@ from .train import _load_pretrained_tok2vec
    bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
    embed_rows=("Number of embedding rows", "option", "er", int),
    loss_func=(
-        "Loss function to use for the objective. Either 'L2' or 'cosine'",
+        "Loss function to use for the objective. Either 'characters', 'L2' or 'cosine'",
        "option",
        "L",
        str,
@@ -85,11 +86,11 @@ def pretrain(
     output_dir,
     width=96,
     conv_depth=4,
-    bilstm_depth=0,
     cnn_pieces=3,
     sa_depth=0,
-    use_chars=False,
     cnn_window=1,
+    bilstm_depth=0,
+    use_chars=False,
     embed_rows=2000,
     loss_func="cosine",
     use_vectors=False,
@@ -124,11 +125,7 @@ def pretrain(
         config[key] = str(config[key])
     util.fix_random_seed(seed)

-    has_gpu = prefer_gpu()
-    if has_gpu:
-        import torch
-
-        torch.set_default_tensor_type("torch.cuda.FloatTensor")
+    has_gpu = prefer_gpu(gpu_id=1)
     msg.info("Using GPU" if has_gpu else "Not using GPU")

     output_dir = Path(output_dir)
@@ -174,6 +171,7 @@ def pretrain(
             subword_features=not use_chars,  # Set to False for Chinese etc
             cnn_maxout_pieces=cnn_pieces,  # If set to 1, use Mish activation.
         ),
+        objective=loss_func
     )
     # Load in pretrained weights
     if init_tok2vec is not None:
@@ -264,6 +262,9 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
     RETURNS loss: A float for the loss.
     """
     predictions, backprop = model.begin_update(docs, drop=drop)
+    if objective == "characters":
+        loss, gradients = get_characters_loss(model.ops, docs, predictions)
+    else:
         loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
     backprop(gradients, sgd=optimizer)
     # Don't want to return a cupy object here
@@ -326,12 +327,19 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
     return loss, d_target


-def create_pretraining_model(nlp, tok2vec):
+def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10):
     """Define a network for the pretraining. We simply add an output layer onto
     the tok2vec input model. The tok2vec input model needs to be a model that
     takes a batch of Doc objects (as a list), and returns a list of arrays.
     Each array in the output needs to have one row per token in the doc.
     """
+    if objective == "characters":
+        out_sizes = [256] * nr_char
+        output_layer = chain(
+            LN(Maxout(300, pieces=3)),
+            MultiSoftmax(out_sizes, 300)
+        )
+    else:
         output_size = nlp.vocab.vectors.data.shape[1]
         output_layer = chain(
             LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
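The `characters` branch above ends in a `MultiSoftmax` layer: one independent 256-way softmax per predicted character slot, concatenated into a single output row. A minimal numpy sketch of that output shape and normalization (toy sizes, a hypothetical stand-in for thinc's implementation):

```python
import numpy

def multi_softmax(scores, out_sizes):
    """Apply an independent softmax over each block of columns."""
    outputs, start = [], 0
    for size in out_sizes:
        block = scores[:, start:start + size]
        e = numpy.exp(block - block.max(axis=1, keepdims=True))
        outputs.append(e / e.sum(axis=1, keepdims=True))
        start += size
    return numpy.concatenate(outputs, axis=1)

nr_char, n_tokens = 4, 3
scores = numpy.random.normal(size=(n_tokens, 256 * nr_char))
probs = multi_softmax(scores, [256] * nr_char)
# Each 256-column block sums to 1 per row: one distribution per character slot.
assert numpy.allclose(probs.reshape(n_tokens, nr_char, 256).sum(axis=-1), 1.0)
```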
@@ -285,7 +285,7 @@ def train(

     if base_model and not pipes_added:
         # Start with an existing model, use default optimizer
-        optimizer = create_default_optimizer(Model.ops)
+        optimizer = nlp.resume_training(device=use_gpu)
     else:
         # Start with a blank model, call begin_training
         cfg = {"device": use_gpu}
@@ -576,6 +576,8 @@ def train(
         with nlp.use_params(optimizer.averages):
             final_model_path = output_path / "model-final"
             nlp.to_disk(final_model_path)
+            srsly.write_json(final_model_path / "meta.json", meta)

         meta_loc = output_path / "model-final" / "meta.json"
         final_meta = srsly.read_json(meta_loc)
         final_meta.setdefault("accuracy", {})
@@ -18,7 +18,26 @@ def _return_en(_):
     return "en"


-def en_is_base_form(univ_pos, morphology=None):
+class EnglishDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = _return_en
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    tag_map = TAG_MAP
+    stop_words = STOP_WORDS
+    morph_rules = MORPH_RULES
+    syntax_iterators = SYNTAX_ITERATORS
+    single_orth_variants = [
+        {"tags": ["NFP"], "variants": ["…", "..."]},
+        {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
+    ]
+    paired_orth_variants = [
+        {"tags": ["``", "''"], "variants": [("'", "'"), ("‘", "’")]},
+        {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
+    ]
+
+    @classmethod
+    def is_base_form(cls, univ_pos, morphology=None):
         """
         Check whether we're dealing with an uninflected paradigm, so we can
         avoid lemmatization entirely.
@@ -53,26 +72,6 @@ def en_is_base_form(univ_pos, morphology=None):
         return False


-class EnglishDefaults(Language.Defaults):
-    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
-    lex_attr_getters.update(LEX_ATTRS)
-    lex_attr_getters[LANG] = _return_en
-    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
-    tag_map = TAG_MAP
-    stop_words = STOP_WORDS
-    morph_rules = MORPH_RULES
-    is_base_form = en_is_base_form
-    syntax_iterators = SYNTAX_ITERATORS
-    single_orth_variants = [
-        {"tags": ["NFP"], "variants": ["…", "..."]},
-        {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]},
-    ]
-    paired_orth_variants = [
-        {"tags": ["``", "''"], "variants": [("'", "'"), ("‘", "’")]},
-        {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
-    ]
-
-
 class English(Language):
     lang = "en"
     Defaults = EnglishDefaults
@@ -45,9 +45,6 @@ class FrenchLemmatizer(Lemmatizer):
             univ_pos = "sconj"
         else:
             return [self.lookup(string)]
-        # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
-            return list(set([string.lower()]))
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
@@ -59,43 +56,6 @@ class FrenchLemmatizer(Lemmatizer):
         )
         return lemmas

-    def is_base_form(self, univ_pos, morphology=None):
-        """
-        Check whether we're dealing with an uninflected paradigm, so we can
-        avoid lemmatization entirely.
-        """
-        morphology = {} if morphology is None else morphology
-        others = [
-            key
-            for key in morphology
-            if key not in (POS, "Number", "POS", "VerbForm", "Tense")
-        ]
-        if univ_pos == "noun" and morphology.get("Number") == "sing":
-            return True
-        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
-            return True
-        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
-        # morphology
-        elif univ_pos == "verb" and (
-            morphology.get("VerbForm") == "fin"
-            and morphology.get("Tense") == "pres"
-            and morphology.get("Number") is None
-            and not others
-        ):
-            return True
-        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
-            return True
-        elif VerbForm_inf in morphology:
-            return True
-        elif VerbForm_none in morphology:
-            return True
-        elif Number_sing in morphology:
-            return True
-        elif Degree_pos in morphology:
-            return True
-        else:
-            return False
-
     def noun(self, string, morphology=None):
         return self(string, "noun", morphology)
@@ -42,7 +42,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.Tokenizer = try_mecab_import()
+        MeCab = try_mecab_import()
+        self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+
+    def __del__(self):
+        self.mecab_tokenizer.__del__()

     def __call__(self, text):
         dtokens = list(self.detailed_tokens(text))
@@ -58,8 +62,7 @@ class KoreanTokenizer(DummyTokenizer):
     def detailed_tokens(self, text):
         # POS tag[0], semantic class[1], jongseong (final consonant) presence[2], reading[3],
         # type[4], start pos[5], end pos[6], expression[7], *
-        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-            for node in tokenizer.parse(text, as_nodes=True):
+        for node in self.mecab_tokenizer.parse(text, as_nodes=True):
             if node.is_eos():
                 break
             surface = node.surface
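The change above constructs one natto-py `MeCab` instance in `__init__` and reuses it, instead of opening a new tokenizer on every call. A minimal sketch of the reuse pattern with the same calls the diff uses (assumes natto-py and a Korean MeCab dictionary are installed):

```python
from natto import MeCab  # natto-py; requires MeCab plus a Korean dictionary

# Build the tokenizer once, with the same format string as the diff.
mecab_tokenizer = MeCab("-F%f[0],%f[7]")

def detailed_tokens(text):
    # Reuse the single instance instead of re-creating it per call.
    for node in mecab_tokenizer.parse(text, as_nodes=True):
        if node.is_eos():
            break
        yield node.surface, node.feature

for surface, feature in detailed_tokens("안녕하세요."):
    print(surface, feature)
```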
@@ -21,7 +21,7 @@ class Lemmatizer(object):
     def load(cls, *args, **kwargs):
         raise NotImplementedError(Errors.E172)

-    def __init__(self, lookups, *args, is_base_form=None, **kwargs):
+    def __init__(self, lookups, is_base_form=None, *args, **kwargs):
         """Initialize a Lemmatizer.

         lookups (Lookups): The lookups object containing the (optional) tables
@@ -49,6 +49,14 @@ def Tok2Vec(width, embed_size, **kwargs):
             >> LN(Maxout(width, width * 5, pieces=3)),
             column=cols.index(ORTH),
         )
+    elif char_embed:
+        embed = concatenate_lists(
+            CharacterEmbed(nM=64, nC=8),
+            FeatureExtracter(cols) >> with_flatten(glove),
+        )
+        reduce_dimensions = LN(
+            Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
+        )
     else:
         embed = uniqued(
             (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
@@ -81,7 +89,8 @@ def Tok2Vec(width, embed_size, **kwargs):
         )
     else:
         tok2vec = FeatureExtracter(cols) >> with_flatten(
-            embed >> convolution ** conv_depth, pad=conv_depth
+            embed
+            >> convolution ** conv_depth, pad=conv_depth
         )

     if bilstm_depth >= 1:
@@ -33,6 +33,7 @@ from .._ml import build_text_classifier, build_simple_cnn_text_classifier
 from .._ml import build_bow_text_classifier, build_nel_encoder
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss
+from .._ml import MultiSoftmax, get_characters_loss
 from ..errors import Errors, TempErrors, Warnings
 from .. import util
@@ -846,6 +847,10 @@ class MultitaskObjective(Tagger):
 class ClozeMultitask(Pipe):
     @classmethod
     def Model(cls, vocab, tok2vec, **cfg):
+        if cfg["objective"] == "characters":
+            out_sizes = [256] * cfg.get("nr_char", 4)
+            output_layer = MultiSoftmax(out_sizes)
+        else:
             output_size = vocab.vectors.data.shape[1]
             output_layer = chain(
                 LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
@@ -861,6 +866,8 @@ class ClozeMultitask(Pipe):
         self.vocab = vocab
         self.model = model
         self.cfg = cfg
+        self.cfg.setdefault("objective", "characters")
+        self.cfg.setdefault("nr_char", 4)

     def set_annotations(self, docs, dep_ids, tensors=None):
         pass
@@ -869,7 +876,8 @@ class ClozeMultitask(Pipe):
                        tok2vec=None, sgd=None, **kwargs):
         link_vectors_to_models(self.vocab)
         if self.model is True:
-            self.model = self.Model(self.vocab, tok2vec)
+            kwargs.update(self.cfg)
+            self.model = self.Model(self.vocab, tok2vec, **kwargs)
         X = self.model.ops.allocate((5, self.model.tok2vec.nO))
         self.model.output_layer.begin_training(X)
         if sgd is None:
@@ -883,6 +891,9 @@ class ClozeMultitask(Pipe):
         return tokvecs, vectors

     def get_loss(self, docs, vectors, prediction):
+        if self.cfg["objective"] == "characters":
+            loss, gradient = get_characters_loss(self.model.ops, docs, prediction)
+        else:
             # The simplest way to implement this would be to vstack the
             # token.vector values, but that's a bit inefficient, especially on GPU.
             # Instead we fetch the index into the vectors table for each of our tokens,
@@ -906,6 +917,20 @@ class ClozeMultitask(Pipe):
         if losses is not None:
             losses[self.name] += loss

+    @staticmethod
+    def decode_utf8_predictions(char_array):
+        # The format alternates filling from start and end, and 255 is missing
+        words = []
+        char_array = char_array.reshape((char_array.shape[0], -1, 256))
+        nr_char = char_array.shape[1]
+        char_array = char_array.argmax(axis=-1)
+        for row in char_array:
+            starts = [chr(c) for c in row[::2] if c != 255]
+            ends = [chr(c) for c in row[1::2] if c != 255]
+            word = "".join(starts + list(reversed(ends)))
+            words.append(word)
+        return words
+
+
 @component("textcat", assigns=["doc.cats"])
 class TextCategorizer(Pipe):
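The interleaved byte layout decoded above (even slots fill from the word's start, odd slots from its end, 255 for padding) can be checked with a quick standalone sketch (toy input, same decoding logic as the method):

```python
import numpy

def decode_row(row):
    # Even slots fill from the word's start, odd slots from its end; 255 = missing.
    starts = [chr(c) for c in row[::2] if c != 255]
    ends = [chr(c) for c in row[1::2] if c != 255]
    return "".join(starts + list(reversed(ends)))

# "spaCy" packed as [s, y, p, C, a, 255]: start bytes s, p, a; end bytes y, C.
row = numpy.array([ord("s"), ord("y"), ord("p"), ord("C"), ord("a"), 255])
assert decode_row(row) == "spaCy"
```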
@@ -1069,6 +1094,7 @@ cdef class DependencyParser(Parser):
     assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
     requires = []
     TransitionSystem = ArcEager
+    nr_feature = 8

     @property
     def postprocesses(self):
@@ -59,7 +59,7 @@ def test_issue2626_2835(en_tokenizer, text):


 def test_issue2656(en_tokenizer):
-    """Test that tokenizer correctly splits of punctuation after numbers with
+    """Test that tokenizer correctly splits off punctuation after numbers with
     decimal points.
     """
     doc = en_tokenizer("I went for 40.3, and got home by 10.0.")
@@ -121,6 +121,7 @@ def test_issue3248_1():
     assert len(matcher) == 2


+@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form")
 def test_issue3248_2():
     """Test that the PhraseMatcher can be pickled correctly."""
     nlp = English()
@@ -473,7 +473,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
 | `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
 | `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
 | `--embed-rows`, `-er` | option | Number of embedding rows. |
-| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
+| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. |
 | `--dropout`, `-d` | option | Dropout rate. |
 | `--batch-size`, `-bs` | option | Number of words per training batch. |
 | `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |
@@ -1,5 +1,58 @@
 {
     "resources": [
+        {
+            "id": "spacy-streamlit",
+            "title": "spacy-streamlit",
+            "slogan": "spaCy building blocks for Streamlit apps",
+            "github": "explosion/spacy-streamlit",
+            "description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
+            "pip": "spacy-streamlit",
+            "category": ["visualizers"],
+            "thumb": "https://i.imgur.com/mhEjluE.jpg",
+            "image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
+            "code_example": [
+                "import spacy_streamlit",
+                "",
+                "models = [\"en_core_web_sm\", \"en_core_web_md\"]",
+                "default_text = \"Sundar Pichai is the CEO of Google.\"",
+                "spacy_streamlit.visualize(models, default_text)"
+            ],
+            "author": "Ines Montani",
+            "author_links": {
+                "twitter": "_inesmontani",
+                "github": "ines",
+                "website": "https://ines.io"
+            }
+        },
+        {
+            "id": "spaczz",
+            "title": "spaczz",
+            "slogan": "Fuzzy matching and more for spaCy.",
+            "description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
+            "github": "gandersen101/spaczz",
+            "pip": "spaczz",
+            "code_example": [
+                "import spacy",
+                "from spaczz.pipeline import SpaczzRuler",
+                "",
+                "nlp = spacy.blank('en')",
+                "ruler = SpaczzRuler(nlp)",
+                "ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
+                "nlp.add_pipe(ruler)",
+                "",
+                "doc = nlp('Oops, I spelled Bill Gatez wrong.')",
+                "print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
+            ],
+            "code_language": "python",
+            "url": "https://spaczz.readthedocs.io/en/latest/",
+            "author": "Grant Andersen",
+            "author_links": {
+                "twitter": "gandersen101",
+                "github": "gandersen101"
+            },
+            "category": ["pipeline"],
+            "tags": ["fuzzy-matching", "regex"]
+        },
         {
             "id": "spacy-universal-sentence-encoder",
             "title": "SpaCy - Universal Sentence Encoder",
@@ -1237,6 +1290,19 @@
             "youtube": "K1elwpgDdls",
             "category": ["videos"]
         },
+        {
+            "type": "education",
+            "id": "video-spacy-course-es",
+            "title": "NLP avanzado con spaCy · Un curso en línea gratis",
+            "description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
+            "url": "https://course.spacy.io/es",
+            "author": "Camila Gutiérrez",
+            "author_links": {
+                "twitter": "Mariacamilagl30"
+            },
+            "youtube": "RNiLVCE5d4k",
+            "category": ["videos"]
+        },
         {
             "type": "education",
             "id": "video-intro-to-nlp-episode-1",
@@ -1293,6 +1359,20 @@
             "youtube": "IqOJU1-_Fi0",
             "category": ["videos"]
         },
+        {
+            "type": "education",
+            "id": "video-intro-to-nlp-episode-5",
+            "title": "Intro to NLP with spaCy (5)",
+            "slogan": "Episode 5: Rules vs. Machine Learning",
+            "description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
+            "author": "Vincent Warmerdam",
+            "author_links": {
+                "twitter": "fishnets88",
+                "github": "koaning"
+            },
+            "youtube": "f4sqeLRzkPg",
+            "category": ["videos"]
+        },
         {
             "type": "education",
             "id": "video-spacy-irl-entity-linking",
@@ -2347,6 +2427,32 @@
             },
             "category": ["pipeline", "conversational", "research"],
             "tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
+        },
+        {
+            "id": "texthero",
+            "title": "Texthero",
+            "slogan": "Text preprocessing, representation and visualization from zero to hero.",
+            "description": "Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
+            "github": "jbesomi/texthero",
+            "pip": "texthero",
+            "code_example": [
+                "import texthero as hero",
+                "import pandas as pd",
+                "",
+                "df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
+                "df['named_entities'] = hero.named_entities(df['text'])",
+                "df.head()"
+            ],
+            "code_language": "python",
+            "url": "https://texthero.org",
+            "thumb": "https://texthero.org/img/T.png",
+            "image": "https://texthero.org/docs/assets/texthero.png",
+            "author": "Jonathan Besomi",
+            "author_links": {
+                "github": "jbesomi",
+                "website": "https://besomi.ai"
+            },
+            "category": ["standalone"]
         }
     ],
|
||||||
|
|
||||||
|
|
Loading…
Reference in New Issue
Block a user