Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-03 22:06:37 +03:00)
Merge branch 'master' into spacy.io
Commit 02de21d8b4
106
.github/contributors/GuiGel.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Guillaume Gelabert |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2019-11-15 |
|
||||
| GitHub username | GuiGel |
|
||||
| Website (optional) | |
|
106
.github/contributors/erip.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Elijah Rippeth |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2019-11-16 |
|
||||
| GitHub username | erip |
|
||||
| Website (optional) | |
|
|
@ -50,15 +50,16 @@ jobs:
|
|||
Python36Mac:
|
||||
imageName: 'macos-10.13'
|
||||
python.version: '3.6'
|
||||
Python37Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.7'
|
||||
Python37Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.7'
|
||||
Python37Mac:
|
||||
imageName: 'macos-10.13'
|
||||
python.version: '3.7'
|
||||
# Don't test on 3.7 for now to speed up builds
|
||||
# Python37Linux:
|
||||
# imageName: 'ubuntu-16.04'
|
||||
# python.version: '3.7'
|
||||
# Python37Windows:
|
||||
# imageName: 'vs2017-win2016'
|
||||
# python.version: '3.7'
|
||||
# Python37Mac:
|
||||
# imageName: 'macos-10.13'
|
||||
# python.version: '3.7'
|
||||
Python38Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.8'
|
||||
|
|
|
@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.2"
+__version__ = "2.2.3"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
|
|
|
@ -529,6 +529,7 @@ class Errors(object):
|
|||
E185 = ("Received invalid attribute in component attribute declaration: "
|
||||
"{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
|
||||
E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
|
||||
E187 = ("Only unicode strings are supported as labels.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
|
|
@ -31,6 +31,10 @@ _latin_u_supplement = r"\u00C0-\u00D6\u00D8-\u00DE"
|
|||
_latin_l_supplement = r"\u00DF-\u00F6\u00F8-\u00FF"
|
||||
_latin_supplement = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
|
||||
|
||||
_hangul_syllables = r"\uAC00-\uD7AF"
|
||||
_hangul_jamo = r"\u1100-\u11FF"
|
||||
_hangul = _hangul_syllables + _hangul_jamo
|
||||
|
||||
# letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
|
||||
_latin_u_extendedA = (
|
||||
r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"
|
||||
|
@ -202,7 +206,15 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
|
|||
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
|
||||
|
||||
_uncased = (
|
||||
_bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu
|
||||
_bengali
|
||||
+ _hebrew
|
||||
+ _persian
|
||||
+ _sinhala
|
||||
+ _hindi
|
||||
+ _kannada
|
||||
+ _tamil
|
||||
+ _telugu
|
||||
+ _hangul
|
||||
)
|
||||
|
||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
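Note on the Hangul ranges: the two ranges added above can be checked with a plain `re` character class. This is a minimal sketch, not spaCy's `group_chars` helper; the `is_hangul` function and `_hangul_re` pattern are hypothetical names used only for illustration.

```python
import re

# Ranges copied from the hunk above.
_hangul_syllables = r"\uAC00-\uD7AF"
_hangul_jamo = r"\u1100-\u11FF"
_hangul = _hangul_syllables + _hangul_jamo

# Hypothetical helper: one character class built from the combined ranges.
_hangul_re = re.compile("^[" + _hangul + "]+$")


def is_hangul(text):
    """Return True if every character falls in one of the two Hangul ranges."""
    return bool(_hangul_re.match(text))
```

Because `_hangul` is appended to `_uncased`, these characters end up in `ALPHA` without being treated as upper or lower case.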
|
||||
|
|
67
spacy/lang/ko/lex_attrs.py
Normal file
|
@ -0,0 +1,67 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+
+_num_words = [
+    "영",
+    "공",
+    # Native Korean number system
+    "하나",
+    "둘",
+    "셋",
+    "넷",
+    "다섯",
+    "여섯",
+    "일곱",
+    "여덟",
+    "아홉",
+    "열",
+    "스물",
+    "서른",
+    "마흔",
+    "쉰",
+    "예순",
+    "일흔",
+    "여든",
+    "아흔",
+    # Sino-Korean number system
+    "일",
+    "이",
+    "삼",
+    "사",
+    "오",
+    "육",
+    "칠",
+    "팔",
+    "구",
+    "십",
+    "백",
+    "천",
+    "만",
+    "십만",
+    "백만",
+    "천만",
+    "일억",
+    "십억",
+    "백억",
+]
+
+
+def like_num(text):
+    if text.startswith(("+", "-", "±", "~")):
+        text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+    if text.isdigit():
+        return True
+    if text.count("/") == 1:
+        num, denom = text.split("/")
+        if num.isdigit() and denom.isdigit():
+            return True
+    if any(char.lower() in _num_words for char in text):
+        return True
+    return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
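Note on `like_num`: the sketch below is a self-contained copy of the function with a shortened `_num_words` list (shortened here for brevity, which is an assumption of this example). It shows which inputs are accepted. As written in the hunk, the last check tests each character rather than the whole token, so a multi-character entry is only matched when one of its characters is itself in the list.

```python
# Shortened list for illustration only; the full list is in the hunk above.
_num_words = ["영", "공", "하나", "둘", "일", "이", "삼", "사", "십", "백", "천", "만", "십만"]


def like_num(text):
    # Strip one leading sign-like character.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 1/2.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # This iterates over single characters, not whole tokens.
    if any(char.lower() in _num_words for char in text):
        return True
    return False
```

With this logic `like_num("십만")` is true because "십" is in the list, while `like_num("하나")` is false because neither "하" nor "나" is a list entry on its own. A token such as "사과" is accepted because it contains "사". A whole-token membership test would behave differently, but that would be a change to the committed code, not a description of it.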
|
|
@ -6,9 +6,7 @@ from ...symbols import ORTH, LEMMA, NORM
 # TODO
 # treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
 
-_exc = {
-
-}
+_exc = {}
 
 # translate / delete what is not necessary
 for exc_data in [
|
||||
|
|
|
@ -14,6 +14,7 @@ from .tag_map import TAG_MAP
|
|||
def try_jieba_import(use_jieba):
|
||||
try:
|
||||
import jieba
|
||||
|
||||
return jieba
|
||||
except ImportError:
|
||||
if use_jieba:
|
||||
|
@ -34,7 +35,9 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
def __call__(self, text):
|
||||
# use jieba
|
||||
if self.use_jieba:
|
||||
jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
|
||||
jieba_words = list(
|
||||
[x for x in self.jieba_seg.cut(text, cut_all=False) if x]
|
||||
)
|
||||
words = [jieba_words[0]]
|
||||
spaces = [False]
|
||||
for i in range(1, len(jieba_words)):
|
||||
|
|
|
@ -292,13 +292,14 @@ class EntityRuler(object):
|
|||
self.add_patterns(patterns)
|
||||
else:
|
||||
cfg = {}
|
||||
deserializers = {
|
||||
deserializers_patterns = {
|
||||
"patterns": lambda p: self.add_patterns(
|
||||
srsly.read_jsonl(p.with_suffix(".jsonl"))
|
||||
),
|
||||
"cfg": lambda p: cfg.update(srsly.read_json(p)),
|
||||
)}
|
||||
deserializers_cfg = {
|
||||
"cfg": lambda p: cfg.update(srsly.read_json(p))
|
||||
}
|
||||
from_disk(path, deserializers, {})
|
||||
from_disk(path, deserializers_cfg, {})
|
||||
self.overwrite = cfg.get("overwrite", False)
|
||||
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
|
||||
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
|
||||
|
@ -307,6 +308,7 @@ class EntityRuler(object):
|
|||
self.phrase_matcher = PhraseMatcher(
|
||||
self.nlp.vocab, attr=self.phrase_matcher_attr
|
||||
)
|
||||
from_disk(path, deserializers_patterns, {})
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **kwargs):
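Note on the `from_disk` change: the two hunks above split deserialization into two passes, so the config is read before the phrase matcher is rebuilt and the patterns are added only afterwards. The ordering can be sketched without spaCy as follows. The `files` dict and the `from_disk_two_phase` function are hypothetical stand-ins for the on-disk directory and the real method.

```python
def from_disk_two_phase(files):
    """Load config first, then patterns, mirroring the order in the hunks above.

    `files` is a hypothetical dict standing in for the serialized directory.
    """
    state = {"cfg": {}, "matcher_attr": None, "patterns": []}
    # Pass 1: read only the config.
    state["cfg"].update(files.get("cfg", {}))
    # Rebuild the matcher setting from the config before any pattern is added.
    state["matcher_attr"] = state["cfg"].get("phrase_matcher_attr")
    # Pass 2: add the patterns, which now see the configured attribute.
    for pattern in files.get("patterns", []):
        state["patterns"].append((state["matcher_attr"], pattern))
    return state
```

If both keys were handled in a single pass, the patterns could be added before `phrase_matcher_attr` had been restored, which appears to be what the split is meant to avoid.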
|
||||
|
|
|
@ -13,6 +13,7 @@ from thinc.misc import LayerNorm
|
|||
from thinc.neural.util import to_categorical
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from ..compat import basestring_
|
||||
from ..tokens.doc cimport Doc
|
||||
from ..syntax.nn_parser cimport Parser
|
||||
from ..syntax.ner cimport BiluoPushDown
|
||||
|
@ -547,6 +548,8 @@ class Tagger(Pipe):
|
|||
return build_tagger_model(n_tags, **cfg)
|
||||
|
||||
def add_label(self, label, values=None):
|
||||
if not isinstance(label, basestring_):
|
||||
raise ValueError(Errors.E187)
|
||||
if label in self.labels:
|
||||
return 0
|
||||
if self.model not in (True, False, None):
|
||||
|
@ -1016,6 +1019,8 @@ class TextCategorizer(Pipe):
|
|||
return float(mean_square_error), d_scores
|
||||
|
||||
def add_label(self, label):
|
||||
if not isinstance(label, basestring_):
|
||||
raise ValueError(Errors.E187)
|
||||
if label in self.labels:
|
||||
return 0
|
||||
if self.model not in (None, True, False):
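Note on the new label check: both `add_label` methods now reject non-string labels with `E187`. A minimal sketch of the guard in isolation, assuming Python 3, where `basestring_` corresponds to `str`. The `Errors` stub and the `LabelStore` class are hypothetical and only mimic the shape of the real code.

```python
class Errors:
    # Message text copied from the E187 entry added in this commit.
    E187 = "Only unicode strings are supported as labels."


class LabelStore:
    """Hypothetical stand-in for a pipeline component that keeps labels."""

    def __init__(self):
        self.labels = []

    def add_label(self, label):
        # Reject anything that is not a string before touching the model.
        if not isinstance(label, str):
            raise ValueError(Errors.E187)
        # Existing labels are a no-op, signalled by returning 0.
        if label in self.labels:
            return 0
        self.labels.append(label)
        return 1
```

Calling `add_label(1)` on such an object raises `ValueError` instead of silently storing an integer label.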
|
||||
|
|
|
@ -271,7 +271,9 @@ class Scorer(object):
|
|||
self.labelled_per_dep[token.dep_.lower()] = PRFScore()
|
||||
if token.dep_.lower() not in cand_deps_per_dep:
|
||||
cand_deps_per_dep[token.dep_.lower()] = set()
|
||||
cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
|
||||
cand_deps_per_dep[token.dep_.lower()].add(
|
||||
(gold_i, gold_head, token.dep_.lower())
|
||||
)
|
||||
if "-" not in [token[-1] for token in gold.orig_annot]:
|
||||
# Find all NER labels in gold and doc
|
||||
ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])
|
||||
|
@ -304,7 +306,9 @@ class Scorer(object):
|
|||
self.tags.score_set(cand_tags, gold_tags)
|
||||
self.labelled.score_set(cand_deps, gold_deps)
|
||||
for dep in self.labelled_per_dep:
|
||||
self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
|
||||
self.labelled_per_dep[dep].score_set(
|
||||
cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set())
|
||||
)
|
||||
self.unlabelled.score_set(
|
||||
set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
|
||||
)
|
||||
|
|
|
@ -42,11 +42,17 @@ cdef WeightsC get_c_weights(model) except *:
|
|||
cdef precompute_hiddens state2vec = model.state2vec
|
||||
output.feat_weights = state2vec.get_feat_weights()
|
||||
output.feat_bias = <const float*>state2vec.bias.data
|
||||
cdef np.ndarray vec2scores_W = model.vec2scores.W
|
||||
cdef np.ndarray vec2scores_b = model.vec2scores.b
|
||||
cdef np.ndarray vec2scores_W
|
||||
cdef np.ndarray vec2scores_b
|
||||
if model.vec2scores is None:
|
||||
output.hidden_weights = NULL
|
||||
output.hidden_bias = NULL
|
||||
else:
|
||||
vec2scores_W = model.vec2scores.W
|
||||
vec2scores_b = model.vec2scores.b
|
||||
output.hidden_weights = <const float*>vec2scores_W.data
|
||||
output.hidden_bias = <const float*>vec2scores_b.data
|
||||
cdef np.ndarray class_mask = model._class_mask
|
||||
output.hidden_weights = <const float*>vec2scores_W.data
|
||||
output.hidden_bias = <const float*>vec2scores_b.data
|
||||
output.seen_classes = <const float*>class_mask.data
|
||||
return output
|
||||
|
||||
|
@ -54,7 +60,10 @@ cdef WeightsC get_c_weights(model) except *:
|
|||
cdef SizesC get_c_sizes(model, int batch_size) except *:
|
||||
cdef SizesC output
|
||||
output.states = batch_size
|
||||
output.classes = model.vec2scores.nO
|
||||
if model.vec2scores is None:
|
||||
output.classes = model.state2vec.nO
|
||||
else:
|
||||
output.classes = model.vec2scores.nO
|
||||
output.hiddens = model.state2vec.nO
|
||||
output.pieces = model.state2vec.nP
|
||||
output.feats = model.state2vec.nF
|
||||
|
@ -105,11 +114,12 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
|
|||
|
||||
cdef void predict_states(ActivationsC* A, StateC** states,
|
||||
const WeightsC* W, SizesC n) nogil:
|
||||
cdef double one = 1.0
|
||||
resize_activations(A, n)
|
||||
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
||||
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
|
||||
for i in range(n.states):
|
||||
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
|
||||
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
||||
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
|
||||
sum_state_features(A.unmaxed,
|
||||
W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces)
|
||||
for i in range(n.states):
|
||||
|
@ -120,18 +130,20 @@ cdef void predict_states(ActivationsC* A, StateC** states,
|
|||
which = Vec.arg_max(&A.unmaxed[index], n.pieces)
|
||||
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
|
||||
memset(A.scores, 0, n.states * n.classes * sizeof(float))
|
||||
cdef double one = 1.0
|
||||
# Compute hidden-to-output
|
||||
blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
|
||||
n.states, n.classes, n.hiddens, one,
|
||||
<float*>A.hiddens, n.hiddens, 1,
|
||||
<float*>W.hidden_weights, n.hiddens, 1,
|
||||
one,
|
||||
<float*>A.scores, n.classes, 1)
|
||||
# Add bias
|
||||
for i in range(n.states):
|
||||
VecVec.add_i(&A.scores[i*n.classes],
|
||||
W.hidden_bias, 1., n.classes)
|
||||
if W.hidden_weights == NULL:
|
||||
memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
|
||||
else:
|
||||
# Compute hidden-to-output
|
||||
blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
|
||||
n.states, n.classes, n.hiddens, one,
|
||||
<float*>A.hiddens, n.hiddens, 1,
|
||||
<float*>W.hidden_weights, n.hiddens, 1,
|
||||
one,
|
||||
<float*>A.scores, n.classes, 1)
|
||||
# Add bias
|
||||
for i in range(n.states):
|
||||
VecVec.add_i(&A.scores[i*n.classes],
|
||||
W.hidden_bias, 1., n.classes)
|
||||
# Set unseen classes to minimum value
|
||||
i = 0
|
||||
min_ = A.scores[0]
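Note on the `NULL` weights branch: when there is no upper layer, `predict_states` copies the hidden activations straight into the scores. Otherwise it multiplies by the transposed weights and adds the bias. A pure-Python sketch of that branching, assuming the weights are stored as classes × hiddens, as the transposed `gemm` call suggests. The `predict_scores` function is a hypothetical name, not part of the codebase.

```python
def predict_scores(hiddens, weights=None, bias=None):
    """Compute class scores for a batch of hidden vectors.

    hiddens: list of rows, one per state.
    weights: list of rows, one per class (or None when there is no upper layer).
    bias: one value per class (ignored when weights is None).
    """
    if weights is None:
        # No upper layer: the hidden activations are the scores.
        return [list(row) for row in hiddens]
    scores = []
    for row in hiddens:
        # Dot product of the hidden row with each class's weight row, plus bias.
        scores.append(
            [sum(h * w for h, w in zip(row, w_row)) + b for w_row, b in zip(weights, bias)]
        )
    return scores
```

The copy branch only makes sense when the hidden width equals the number of classes, which matches the `hidden_width = nr_class` assignment for `depth == 0` elsewhere in this commit.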
|
||||
|
@ -219,7 +231,9 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no
|
|||
class ParserModel(Model):
|
||||
def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None):
|
||||
Model.__init__(self)
|
||||
self._layers = [tok2vec, lower_model, upper_model]
|
||||
self._layers = [tok2vec, lower_model]
|
||||
if upper_model is not None:
|
||||
self._layers.append(upper_model)
|
||||
self.unseen_classes = set()
|
||||
if unseen_classes:
|
||||
for class_ in unseen_classes:
|
||||
|
@ -234,6 +248,8 @@ class ParserModel(Model):
|
|||
return step_model, finish_parser_update
|
||||
|
||||
def resize_output(self, new_output):
|
||||
if len(self._layers) == 2:
|
||||
return
|
||||
if new_output == self.upper.nO:
|
||||
return
|
||||
smaller = self.upper
|
||||
|
@ -275,12 +291,24 @@ class ParserModel(Model):
|
|||
class ParserStepModel(Model):
|
||||
def __init__(self, docs, layers, unseen_classes=None, drop=0.):
|
||||
self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop)
|
||||
if layers[1].nP >= 2:
|
||||
activation = "maxout"
|
||||
elif len(layers) == 2:
|
||||
activation = None
|
||||
else:
|
||||
activation = "relu"
|
||||
self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
|
||||
drop=drop)
|
||||
self.vec2scores = layers[-1]
|
||||
self.cuda_stream = util.get_cuda_stream()
|
||||
activation=activation, drop=drop)
|
||||
if len(layers) == 3:
|
||||
self.vec2scores = layers[-1]
|
||||
else:
|
||||
self.vec2scores = None
|
||||
self.cuda_stream = util.get_cuda_stream(non_blocking=True)
|
||||
self.backprops = []
|
||||
self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
|
||||
if self.vec2scores is None:
|
||||
self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f')
|
||||
else:
|
||||
self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
|
||||
self._class_mask.fill(1)
|
||||
if unseen_classes is not None:
|
||||
for class_ in unseen_classes:
|
||||
|
@ -302,10 +330,15 @@ class ParserStepModel(Model):
|
|||
def begin_update(self, states, drop=0.):
|
||||
token_ids = self.get_token_ids(states)
|
||||
vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0)
|
||||
mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
|
||||
if mask is not None:
|
||||
vector *= mask
|
||||
scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
|
||||
if self.vec2scores is not None:
|
||||
mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
|
||||
if mask is not None:
|
||||
vector *= mask
|
||||
scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
|
||||
else:
|
||||
scores = NumpyOps().asarray(vector)
|
||||
get_d_vector = lambda d_scores, sgd=None: d_scores
|
||||
mask = None
|
||||
# If the class is unseen, make sure its score is minimum
|
||||
scores[:, self._class_mask == 0] = numpy.nanmin(scores)
|
||||
|
||||
|
@ -342,12 +375,12 @@ class ParserStepModel(Model):
|
|||
return ids
|
||||
|
||||
def make_updates(self, sgd):
|
||||
# Tells CUDA to block, so our async copies complete.
|
||||
if self.cuda_stream is not None:
|
||||
self.cuda_stream.synchronize()
|
||||
# Add a padding vector to the d_tokvecs gradient, so that missing
|
||||
# values don't affect the real gradient.
|
||||
d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
|
||||
# Tells CUDA to block, so our async copies complete.
|
||||
if self.cuda_stream is not None:
|
||||
self.cuda_stream.synchronize()
|
||||
for ids, d_vector, bp_vector in self.backprops:
|
||||
d_state_features = bp_vector((d_vector, ids), sgd=sgd)
|
||||
ids = ids.flatten()
|
||||
|
@ -385,9 +418,10 @@ cdef class precompute_hiddens:
|
|||
cdef np.ndarray bias
|
||||
cdef object _cuda_stream
|
||||
cdef object _bp_hiddens
|
||||
cdef object activation
|
||||
|
||||
def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
|
||||
drop=0.):
|
||||
activation="maxout", drop=0.):
|
||||
gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop)
|
||||
cdef np.ndarray cached
|
||||
if not isinstance(gpu_cached, numpy.ndarray):
|
||||
|
@ -405,6 +439,8 @@ cdef class precompute_hiddens:
|
|||
self.nP = getattr(lower_model, 'nP', 1)
|
||||
self.nO = cached.shape[2]
|
||||
self.ops = lower_model.ops
|
||||
assert activation in (None, "relu", "maxout")
|
||||
self.activation = activation
|
||||
self._is_synchronized = False
|
||||
self._cuda_stream = cuda_stream
|
||||
self._cached = cached
|
||||
|
@ -417,7 +453,7 @@ cdef class precompute_hiddens:
|
|||
return <float*>self._cached.data
|
||||
|
||||
def __call__(self, X):
|
||||
return self.begin_update(X)[0]
|
||||
return self.begin_update(X, drop=None)[0]
|
||||
|
||||
def begin_update(self, token_ids, drop=0.):
|
||||
cdef np.ndarray state_vector = numpy.zeros(
|
||||
|
@ -450,28 +486,35 @@ cdef class precompute_hiddens:
|
|||
else:
|
||||
ops = CupyOps()
|
||||
|
||||
if self.nP == 1:
|
||||
state_vector = state_vector.reshape(state_vector.shape[:-1])
|
||||
mask = state_vector >= 0.
|
||||
state_vector *= mask
|
||||
else:
|
||||
if self.activation == "maxout":
|
||||
state_vector, mask = ops.maxout(state_vector)
|
||||
else:
|
||||
state_vector = state_vector.reshape(state_vector.shape[:-1])
|
||||
if self.activation == "relu":
|
||||
mask = state_vector >= 0.
|
||||
state_vector *= mask
|
||||
else:
|
||||
mask = None
|
||||
|
||||
def backprop_nonlinearity(d_best, sgd=None):
|
||||
if isinstance(d_best, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
mask_ = ops.asarray(mask)
|
||||
|
||||
if mask is not None:
|
||||
mask_ = ops.asarray(mask)
|
||||
# This will usually be on GPU
|
||||
d_best = ops.asarray(d_best)
|
||||
# Fix nans (which can occur from unseen classes.)
|
||||
d_best[ops.xp.isnan(d_best)] = 0.
|
||||
if self.nP == 1:
|
||||
if self.activation == "maxout":
|
||||
mask_ = ops.asarray(mask)
|
||||
return ops.backprop_maxout(d_best, mask_, self.nP)
|
||||
elif self.activation == "relu":
|
||||
mask_ = ops.asarray(mask)
|
||||
d_best *= mask_
|
||||
d_best = d_best.reshape((d_best.shape + (1,)))
|
||||
return d_best
|
||||
else:
|
||||
return ops.backprop_maxout(d_best, mask_, self.nP)
|
||||
return d_best.reshape((d_best.shape + (1,)))
|
||||
return state_vector, backprop_nonlinearity
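Note on the `activation` switch: the forward pass now picks between maxout, ReLU and no nonlinearity based on `self.activation` instead of inferring it from `nP`. The three cases can be sketched on plain lists. The `apply_activation` function is a hypothetical illustration and does not use thinc's `ops`.

```python
def apply_activation(rows, activation):
    """Apply the chosen nonlinearity to one state's hidden units.

    rows: list of hidden units, each a list of nP piece values.
    """
    assert activation in (None, "relu", "maxout")
    if activation == "maxout":
        # Keep the largest piece for each hidden unit.
        return [max(pieces) for pieces in rows]
    # For relu and None the piece axis is dropped, so nP is assumed to be 1.
    values = [pieces[0] for pieces in rows]
    if activation == "relu":
        return [v if v >= 0.0 else 0.0 for v in values]
    return values
```

With `activation=None` the values pass through unchanged, which is the case used when the parser has no upper layer.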
|
||||
|
|
|
@ -100,10 +100,30 @@ cdef cppclass StateC:
|
|||
free(this.shifted - PADDING)
|
||||
|
||||
void set_context_tokens(int* ids, int n) nogil:
|
||||
if n == 2:
|
||||
if n == 1:
|
||||
if this.B(0) >= 0:
|
||||
ids[0] = this.B(0)
|
||||
else:
|
||||
ids[0] = -1
|
||||
elif n == 2:
|
||||
ids[0] = this.B(0)
|
||||
ids[1] = this.S(0)
|
||||
if n == 8:
|
||||
elif n == 3:
|
||||
if this.B(0) >= 0:
|
||||
ids[0] = this.B(0)
|
||||
else:
|
||||
ids[0] = -1
|
||||
# First word of entity, if any
|
||||
if this.entity_is_open():
|
||||
ids[1] = this.E(0)
|
||||
else:
|
||||
ids[1] = -1
|
||||
# Last word of entity, if within entity
|
||||
if ids[0] == -1 or ids[1] == -1:
|
||||
ids[2] = -1
|
||||
else:
|
||||
ids[2] = ids[0] - 1
|
||||
elif n == 8:
|
||||
ids[0] = this.B(0)
|
||||
ids[1] = this.B(1)
|
||||
ids[2] = this.S(0)
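Note on the three-token context: the new `n == 3` branch fills the feature ids with the first buffer token, the first token of the open entity (if any), and the token just before the buffer. A Python sketch of the same logic, with the state accessors replaced by plain arguments. The `context_tokens_3` function and its parameter names are hypothetical.

```python
def context_tokens_3(b0, entity_is_open, e0):
    """Mirror the n == 3 branch of set_context_tokens.

    b0: index of the first buffer token (negative if the buffer is empty).
    entity_is_open: whether an entity is currently open.
    e0: index of the first token of the open entity.
    """
    ids = [-1, -1, -1]
    ids[0] = b0 if b0 >= 0 else -1
    # First word of the entity, if any.
    ids[1] = e0 if entity_is_open else -1
    # Last word of the entity, only when both other slots are valid.
    if ids[0] == -1 or ids[1] == -1:
        ids[2] = -1
    else:
        ids[2] = ids[0] - 1
    return ids
```

For example, with the buffer at token 5 and an entity opened at token 3, the result is `[5, 3, 4]`.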
|
||||
|
|
|
@ -22,7 +22,7 @@ from thinc.extra.search cimport Beam
|
|||
from thinc.api import chain, clone
|
||||
from thinc.v2v import Model, Maxout, Affine
|
||||
from thinc.misc import LayerNorm
|
||||
from thinc.neural.ops import CupyOps
|
||||
from thinc.neural.ops import NumpyOps, CupyOps
|
||||
from thinc.neural.util import get_array_module
|
||||
from thinc.linalg cimport Vec, VecVec
|
||||
import srsly
|
||||
|
@ -61,13 +61,17 @@ cdef class Parser:
|
|||
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
|
||||
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
|
||||
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
|
||||
if depth != 1:
|
||||
nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature)
|
||||
if depth not in (0, 1):
|
||||
raise ValueError(TempErrors.T004.format(value=depth))
|
||||
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
|
||||
cfg.get('maxout_pieces', 2))
|
||||
token_vector_width = util.env_opt('token_vector_width',
|
||||
cfg.get('token_vector_width', 96))
|
||||
hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64))
|
||||
if depth == 0:
|
||||
hidden_width = nr_class
|
||||
parser_maxout_pieces = 1
|
||||
embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000))
|
||||
pretrained_vectors = cfg.get('pretrained_vectors', None)
|
||||
tok2vec = Tok2Vec(token_vector_width, embed_size,
|
||||
|
@ -80,16 +84,19 @@ cdef class Parser:
|
|||
tok2vec = chain(tok2vec, flatten)
|
||||
tok2vec.nO = token_vector_width
|
||||
lower = PrecomputableAffine(hidden_width,
|
||||
nF=cls.nr_feature, nI=token_vector_width,
|
||||
nF=nr_feature_tokens, nI=token_vector_width,
|
||||
nP=parser_maxout_pieces)
|
||||
lower.nP = parser_maxout_pieces
|
||||
|
||||
with Model.use_device('cpu'):
|
||||
upper = Affine(nr_class, hidden_width, drop_factor=0.0)
|
||||
upper.W *= 0
|
||||
if depth == 1:
|
||||
with Model.use_device('cpu'):
|
||||
upper = Affine(nr_class, hidden_width, drop_factor=0.0)
|
||||
upper.W *= 0
|
||||
else:
|
||||
upper = None
|
||||
|
||||
cfg = {
|
||||
'nr_class': nr_class,
|
||||
'nr_feature_tokens': nr_feature_tokens,
|
||||
'hidden_depth': depth,
|
||||
'token_vector_width': token_vector_width,
|
||||
'hidden_width': hidden_width,
|
||||
|
@ -133,6 +140,7 @@ cdef class Parser:
|
|||
if 'beam_update_prob' not in cfg:
|
||||
cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0)
|
||||
cfg.setdefault('cnn_maxout_pieces', 3)
|
||||
cfg.setdefault("nr_feature_tokens", self.nr_feature)
|
||||
self.cfg = cfg
|
||||
self.model = model
|
||||
self._multitasks = []
|
||||
|
@ -299,7 +307,7 @@ cdef class Parser:
|
|||
token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
|
||||
dtype='i', order='C')
|
||||
cdef int* c_ids
|
||||
cdef int nr_feature = self.nr_feature
|
||||
cdef int nr_feature = self.cfg["nr_feature_tokens"]
|
||||
cdef int n_states
|
||||
model = self.model(docs)
|
||||
todo = [beam for beam in beams if not beam.is_done]
|
||||
|
@ -502,7 +510,7 @@ cdef class Parser:
|
|||
self.moves.preprocess_gold(gold)
|
||||
model, finish_update = self.model.begin_update(docs, drop=drop)
|
||||
states_d_scores, backprops, beams = _beam_utils.update_beam(
|
||||
self.moves, self.nr_feature, 10000, states, golds, model.state2vec,
|
||||
self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec,
|
||||
model.vec2scores, width, drop=drop, losses=losses,
|
||||
beam_density=beam_density)
|
||||
for i, d_scores in enumerate(states_d_scores):
|
||||
|
|
|
@ -2,6 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import re
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.util import compile_prefix_regex, compile_suffix_regex
|
||||
|
@ -19,13 +20,14 @@ def custom_en_tokenizer(en_vocab):
|
|||
r"[\[\]!&:,()\*—–\/-]",
|
||||
]
|
||||
infix_re = compile_infix_regex(custom_infixes)
|
||||
token_match_re = re.compile("a-b")
|
||||
return Tokenizer(
|
||||
en_vocab,
|
||||
English.Defaults.tokenizer_exceptions,
|
||||
prefix_re.search,
|
||||
suffix_re.search,
|
||||
infix_re.finditer,
|
||||
token_match=None,
|
||||
token_match=token_match_re.match,
|
||||
)
|
||||
|
||||
|
||||
|
@ -74,3 +76,81 @@ def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
|
|||
"Megaregion",
|
||||
".",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_token_match(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions a-b not used for the greater Southern California Megaregion."
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"a-b",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_rules(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"are",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
":)",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_rules_property(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
|
||||
rules = custom_en_tokenizer.rules
|
||||
del rules[":)"]
|
||||
custom_en_tokenizer.rules = rules
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"are",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
":",
|
||||
")",
|
||||
]
|
||||
|
|
|
@ -259,6 +259,27 @@ def test_block_ner():
|
|||
assert [token.ent_type_ for token in doc] == expected_types
|
||||
|
||||
|
||||
def test_change_number_features():
|
||||
# Test the default number of features
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
ner.add_label("PERSON")
|
||||
nlp.begin_training()
|
||||
assert ner.model.lower.nF == ner.nr_feature
|
||||
# Test we can change it
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
ner.add_label("PERSON")
|
||||
nlp.begin_training(
|
||||
component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}}
|
||||
)
|
||||
assert ner.model.lower.nF == 3
|
||||
# Test the model runs
|
||||
nlp("hello world")
|
||||
|
||||
|
||||
class BlockerComponent1(object):
|
||||
name = "my_blocker"
|
||||
|
||||
|
|
14
spacy/tests/pipeline/test_tagger.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import Tagger
|
||||
|
||||
|
||||
def test_label_types():
|
||||
nlp = Language()
|
||||
nlp.add_pipe(nlp.create_pipe("tagger"))
|
||||
nlp.get_pipe("tagger").add_label("A")
|
||||
with pytest.raises(ValueError):
|
||||
nlp.get_pipe("tagger").add_label(9)
|
|
@ -62,3 +62,11 @@ def test_textcat_learns_multilabel():
|
|||
assert score < 0.5
|
||||
else:
|
||||
assert score > 0.5
|
||||
|
||||
|
||||
def test_label_types():
|
||||
nlp = Language()
|
||||
nlp.add_pipe(nlp.create_pipe("textcat"))
|
||||
nlp.get_pipe("textcat").add_label("answer")
|
||||
with pytest.raises(ValueError):
|
||||
nlp.get_pipe("textcat").add_label(9)
|
||||
|
|
|
@ -3,9 +3,9 @@ from __future__ import unicode_literals
|
|||
|
||||
import srsly
|
||||
from spacy.gold import GoldCorpus
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue4402():
|
||||
|
|
|
@ -1,7 +1,6 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from mock import Mock
|
||||
from spacy.matcher import DependencyMatcher
|
||||
from ..util import get_doc
|
||||
|
@ -11,8 +10,14 @@ def test_issue4590(en_vocab):
|
|||
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
|
||||
pattern = [
|
||||
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
|
||||
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||
{
|
||||
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
|
||||
"PATTERN": {"ORTH": "fox"},
|
||||
},
|
||||
{
|
||||
"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
|
||||
"PATTERN": {"ORTH": "fox"},
|
||||
},
|
||||
]
|
||||
|
||||
on_match = Mock()
|
||||
|
@ -31,4 +36,3 @@ def test_issue4590(en_vocab):
|
|||
on_match_args = on_match.call_args
|
||||
|
||||
assert on_match_args[0][3] == matches
|
||||
|
||||
|
|
65
spacy/tests/regression/test_issue4651.py
Normal file
|
@ -0,0 +1,65 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue4651_with_phrase_matcher_attr():
|
||||
"""Test that the EntityRuler PhraseMatcher is deserialized correctly using
|
||||
the method from_disk when the EntityRuler argument phrase_matcher_attr is
|
||||
specified.
|
||||
"""
|
||||
text = "Spacy is a python library for nlp"
|
||||
|
||||
nlp = English()
|
||||
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
|
||||
patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc = nlp(text)
|
||||
res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
|
||||
|
||||
nlp_reloaded = English()
|
||||
with make_tempdir() as d:
|
||||
file_path = d / "entityruler"
|
||||
ruler.to_disk(file_path)
|
||||
ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)
|
||||
|
||||
nlp_reloaded.add_pipe(ruler_reloaded)
|
||||
doc_reloaded = nlp_reloaded(text)
|
||||
res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]
|
||||
|
||||
assert res == res_reloaded
|
||||
|
||||
|
||||
def test_issue4651_without_phrase_matcher_attr():
|
||||
"""Test that the EntityRuler PhraseMatcher is deserialized correctly using
|
||||
the method from_disk when the EntityRuler argument phrase_matcher_attr is
|
||||
not specified.
|
||||
"""
|
||||
text = "Spacy is a python library for nlp"
|
||||
|
||||
nlp = English()
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc = nlp(text)
|
||||
res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
|
||||
|
||||
nlp_reloaded = English()
|
||||
with make_tempdir() as d:
|
||||
file_path = d / "entityruler"
|
||||
ruler.to_disk(file_path)
|
||||
ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)
|
||||
|
||||
nlp_reloaded.add_pipe(ruler_reloaded)
|
||||
doc_reloaded = nlp_reloaded(text)
|
||||
res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]
|
||||
|
||||
assert res == res_reloaded
|
|
@ -12,8 +12,22 @@ from .util import get_doc
|
|||
test_las_apple = [
|
||||
[
|
||||
"Apple is looking at buying U.K. startup for $ 1 billion",
|
||||
{"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
|
||||
"deps": ['nsubj', 'aux', 'ROOT', 'prep', 'pcomp', 'compound', 'dobj', 'prep', 'quantmod', 'compound', 'pobj']},
|
||||
{
|
||||
"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
|
||||
"deps": [
|
||||
"nsubj",
|
||||
"aux",
|
||||
"ROOT",
|
||||
"prep",
|
||||
"pcomp",
|
||||
"compound",
|
||||
"dobj",
|
||||
"prep",
|
||||
"quantmod",
|
||||
"compound",
|
||||
"pobj",
|
||||
],
|
||||
},
|
||||
]
|
||||
]
|
||||
|
||||
|
@ -59,7 +73,7 @@ def test_las_per_type(en_vocab):
|
|||
en_vocab,
|
||||
words=input_.split(" "),
|
||||
heads=([h - i for i, h in enumerate(annot["heads"])]),
|
||||
deps=annot["deps"]
|
||||
deps=annot["deps"],
|
||||
)
|
||||
gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"])
|
||||
doc[0].dep_ = "compound"
|
||||
|
|
65
spacy/tests/tokenizer/test_explain.py
Normal file
|
@ -0,0 +1,65 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.util import get_lang_class
|
||||
|
||||
# Only include languages with no external dependencies
|
||||
# "is" seems to confuse importlib, so we're also excluding it for now
|
||||
# excluded: ja, ru, th, uk, vi, zh, is
|
||||
LANGUAGES = [
|
||||
pytest.param("fr", marks=pytest.mark.slow()),
|
||||
pytest.param("af", marks=pytest.mark.slow()),
|
||||
pytest.param("ar", marks=pytest.mark.slow()),
|
||||
pytest.param("bg", marks=pytest.mark.slow()),
|
||||
"bn",
|
||||
pytest.param("ca", marks=pytest.mark.slow()),
|
||||
pytest.param("cs", marks=pytest.mark.slow()),
|
||||
pytest.param("da", marks=pytest.mark.slow()),
|
||||
pytest.param("de", marks=pytest.mark.slow()),
|
||||
"el",
|
||||
"en",
|
||||
pytest.param("es", marks=pytest.mark.slow()),
|
||||
pytest.param("et", marks=pytest.mark.slow()),
|
||||
pytest.param("fa", marks=pytest.mark.slow()),
|
||||
pytest.param("fi", marks=pytest.mark.slow()),
|
||||
"fr",
|
||||
pytest.param("ga", marks=pytest.mark.slow()),
|
||||
pytest.param("he", marks=pytest.mark.slow()),
|
||||
pytest.param("hi", marks=pytest.mark.slow()),
|
||||
pytest.param("hr", marks=pytest.mark.slow()),
|
||||
"hu",
|
||||
pytest.param("id", marks=pytest.mark.slow()),
|
||||
pytest.param("it", marks=pytest.mark.slow()),
|
||||
pytest.param("kn", marks=pytest.mark.slow()),
|
||||
pytest.param("lb", marks=pytest.mark.slow()),
|
||||
pytest.param("lt", marks=pytest.mark.slow()),
|
||||
pytest.param("lv", marks=pytest.mark.slow()),
|
||||
pytest.param("nb", marks=pytest.mark.slow()),
|
||||
pytest.param("nl", marks=pytest.mark.slow()),
|
||||
"pl",
|
||||
pytest.param("pt", marks=pytest.mark.slow()),
|
||||
pytest.param("ro", marks=pytest.mark.slow()),
|
||||
pytest.param("si", marks=pytest.mark.slow()),
|
||||
pytest.param("sk", marks=pytest.mark.slow()),
|
||||
pytest.param("sl", marks=pytest.mark.slow()),
|
||||
pytest.param("sq", marks=pytest.mark.slow()),
|
||||
pytest.param("sr", marks=pytest.mark.slow()),
|
||||
pytest.param("sv", marks=pytest.mark.slow()),
|
||||
pytest.param("ta", marks=pytest.mark.slow()),
|
||||
pytest.param("te", marks=pytest.mark.slow()),
|
||||
pytest.param("tl", marks=pytest.mark.slow()),
|
||||
pytest.param("tr", marks=pytest.mark.slow()),
|
||||
pytest.param("tt", marks=pytest.mark.slow()),
|
||||
pytest.param("ur", marks=pytest.mark.slow()),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("lang", LANGUAGES)
|
||||
def test_tokenizer_explain(lang):
|
||||
tokenizer = get_lang_class(lang).Defaults.create_tokenizer()
|
||||
examples = pytest.importorskip("spacy.lang.{}.examples".format(lang))
|
||||
for sentence in examples.sentences:
|
||||
tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
|
||||
debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
|
||||
assert tokens == debug_tokens
|
|
@ -15,6 +15,8 @@ import re
|
|||
from .tokens.doc cimport Doc
|
||||
from .strings cimport hash_string
|
||||
from .compat import unescape_unicode
|
||||
from .attrs import intify_attrs
|
||||
from .symbols import ORTH
|
||||
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
from . import util
|
||||
|
@ -57,9 +59,7 @@ cdef class Tokenizer:
|
|||
self.infix_finditer = infix_finditer
|
||||
self.vocab = vocab
|
||||
self._rules = {}
|
||||
if rules is not None:
|
||||
for chunk, substrings in sorted(rules.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
self._load_special_tokenization(rules)
|
||||
|
||||
property token_match:
|
||||
def __get__(self):
|
||||
|
@ -93,6 +93,18 @@ cdef class Tokenizer:
|
|||
self._infix_finditer = infix_finditer
|
||||
self._flush_cache()
|
||||
|
||||
property rules:
|
||||
def __get__(self):
|
||||
return self._rules
|
||||
|
||||
def __set__(self, rules):
|
||||
self._rules = {}
|
||||
self._reset_cache([key for key in self._cache])
|
||||
self._reset_specials()
|
||||
self._cache = PreshMap()
|
||||
self._specials = PreshMap()
|
||||
self._load_special_tokenization(rules)
|
||||
|
||||
def __reduce__(self):
|
||||
args = (self.vocab,
|
||||
self._rules,
|
||||
|
@ -227,10 +239,6 @@ cdef class Tokenizer:
|
|||
cdef unicode minus_suf
|
||||
cdef size_t last_size = 0
|
||||
while string and len(string) != last_size:
|
||||
if self.token_match and self.token_match(string) \
|
||||
and not self.find_prefix(string) \
|
||||
and not self.find_suffix(string):
|
||||
break
|
||||
if self._specials.get(hash_string(string)) != NULL:
|
||||
has_special[0] = 1
|
||||
break
|
||||
|
@ -393,8 +401,9 @@ cdef class Tokenizer:
|
|||
|
||||
def _load_special_tokenization(self, special_cases):
|
||||
"""Add special-case tokenization rules."""
|
||||
for chunk, substrings in sorted(special_cases.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
if special_cases is not None:
|
||||
for chunk, substrings in sorted(special_cases.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
|
||||
def add_special_case(self, unicode string, substrings):
|
||||
"""Add a special-case tokenization rule.
|
||||
|
@ -423,6 +432,73 @@ cdef class Tokenizer:
|
|||
self.mem.free(stale_cached)
|
||||
self._rules[string] = substrings
|
||||
|
||||
def explain(self, text):
|
||||
"""A debugging tokenizer that provides information about which
|
||||
tokenizer rule or pattern was matched for each token. The tokens
|
||||
produced are identical to `nlp.tokenizer()` except for whitespace
|
||||
tokens.
|
||||
|
||||
text (unicode): The string to tokenize.
|
||||
RETURNS (list): A list of (pattern_string, token_string) tuples
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#explain
|
||||
"""
|
||||
prefix_search = self.prefix_search
|
||||
suffix_search = self.suffix_search
|
||||
infix_finditer = self.infix_finditer
|
||||
token_match = self.token_match
|
||||
special_cases = {}
|
||||
for orth, special_tokens in self.rules.items():
|
||||
special_cases[orth] = [intify_attrs(special_token, strings_map=self.vocab.strings, _do_deprecated=True) for special_token in special_tokens]
|
||||
tokens = []
|
||||
for substring in text.split():
|
||||
suffixes = []
|
||||
while substring:
|
||||
while prefix_search(substring) or suffix_search(substring):
|
||||
if substring in special_cases:
|
||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
substring = ''
|
||||
break
|
||||
if prefix_search(substring):
|
||||
split = prefix_search(substring).end()
|
||||
# break if pattern matches the empty string
|
||||
if split == 0:
|
||||
break
|
||||
tokens.append(("PREFIX", substring[:split]))
|
||||
substring = substring[split:]
|
||||
if substring in special_cases:
|
||||
continue
|
||||
if suffix_search(substring):
|
||||
split = suffix_search(substring).start()
|
||||
# break if pattern matches the empty string
|
||||
if split == len(substring):
|
||||
break
|
||||
suffixes.append(("SUFFIX", substring[split:]))
|
||||
substring = substring[:split]
|
||||
if substring in special_cases:
|
||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
substring = ''
|
||||
elif token_match is not None and token_match(substring):
|
||||
tokens.append(("TOKEN_MATCH", substring))
|
||||
substring = ''
|
||||
elif list(infix_finditer(substring)):
|
||||
infixes = infix_finditer(substring)
|
||||
offset = 0
|
||||
for match in infixes:
|
||||
if substring[offset : match.start()]:
|
||||
tokens.append(("TOKEN", substring[offset : match.start()]))
|
||||
if substring[match.start() : match.end()]:
|
||||
tokens.append(("INFIX", substring[match.start() : match.end()]))
|
||||
offset = match.end()
|
||||
if substring[offset:]:
|
||||
tokens.append(("TOKEN", substring[offset:]))
|
||||
substring = ''
|
||||
elif substring:
|
||||
tokens.append(("TOKEN", substring))
|
||||
substring = ''
|
||||
tokens.extend(reversed(suffixes))
|
||||
return tokens
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
|
@ -507,8 +583,7 @@ cdef class Tokenizer:
|
|||
self._reset_specials()
|
||||
self._cache = PreshMap()
|
||||
self._specials = PreshMap()
|
||||
for string, substrings in data.get("rules", {}).items():
|
||||
self.add_special_case(string, substrings)
|
||||
self._load_special_tokenization(data.get("rules", {}))
|
||||
|
||||
return self
|
||||
|
||||
|
|
|
@ -301,13 +301,13 @@ def get_component_name(component):
|
|||
return repr(component)
|
||||
|
||||
|
||||
def get_cuda_stream(require=False):
|
||||
def get_cuda_stream(require=False, non_blocking=True):
|
||||
if CudaStream is None:
|
||||
return None
|
||||
elif isinstance(Model.ops, NumpyOps):
|
||||
return None
|
||||
else:
|
||||
return CudaStream()
|
||||
return CudaStream(non_blocking=non_blocking)
|
||||
|
||||
|
||||
def get_async(stream, numpy_array):
|
||||
|
|
|
@ -265,17 +265,12 @@ cdef class Vectors:
|
|||
rows = [self.key2row.get(key, -1.) for key in keys]
|
||||
return xp.asarray(rows, dtype="i")
|
||||
else:
|
||||
targets = set()
|
||||
row2key = {row: key for key, row in self.key2row.items()}
|
||||
if row is not None:
|
||||
targets.add(row)
|
||||
return row2key[row]
|
||||
else:
|
||||
targets.update(rows)
|
||||
results = []
|
||||
for key, row in self.key2row.items():
|
||||
if row in targets:
|
||||
results.append(key)
|
||||
targets.remove(row)
|
||||
return xp.asarray(results, dtype="uint64")
|
||||
results = [row2key[row] for row in rows]
|
||||
return xp.asarray(results, dtype="uint64")
|
||||
|
||||
def add(self, key, *, vector=None, row=None):
|
||||
"""Add a key to the table. Keys can be mapped to an existing vector
|
||||
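The rewritten lookup above inverts `key2row` once and then indexes it per row. The following is a minimal sketch of the two strategies, assuming a plain `dict` with insertion order (Python 3.7+) instead of the real `Vectors` table. The function names and the sample keys are hypothetical. It suggests that the two versions may differ in result order and in which key is returned when several keys share a row.

```python
# Hypothetical table: keys 102 and 103 both point at row 1.
key2row = {101: 0, 102: 1, 103: 1}


def find_keys_scan(key2row, rows):
    """Mirror of the removed lines: scan all keys, keep the first key per row."""
    targets = set(rows)
    results = []
    for key, row in key2row.items():
        if row in targets:
            results.append(key)
            targets.remove(row)
    return results


def find_keys_inverted(key2row, rows):
    """Mirror of the added lines: invert the mapping, then look up each row."""
    row2key = {row: key for key, row in key2row.items()}
    return [row2key[row] for row in rows]


scan_result = find_keys_scan(key2row, [1, 0])          # ordered by table iteration
inverted_result = find_keys_inverted(key2row, [1, 0])  # ordered like `rows`
```

In this sketch the inverted version follows the order of `rows` and keeps the last key seen for a shared row, while the scan follows table order and keeps the first.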
|
|
|
@ -58,4 +58,5 @@ Update the evaluation scores from a single [`Doc`](/api/doc) /
|
|||
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
|
||||
| `textcat_score` <Tag variant="new">2.2</Tag> | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). |
|
||||
| `textcats_per_cat` <Tag variant="new">2.2</Tag> | dict | Scores per textcat label, keyed by label. |
|
||||
| `las_per_type` <Tag variant="new">2.2.3</Tag> | dict | Labelled dependency scores, keyed by label. |
|
||||
| `scores` | dict | All scores, keyed by type. |
|
||||
|
|
|
@ -34,15 +34,15 @@ the
|
|||
> tokenizer = nlp.Defaults.create_tokenizer(nlp)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ----------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match` | callable | A boolean function matching strings to be recognized as tokens. |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
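As a small illustration of what "the signature of `re.compile(string).match`" means for `token_match`, the sketch below uses only the standard library and the same `"a-b"` pattern as the test fixture in this commit. The `$`-anchored variant is an assumption added for contrast and is not part of the commit.

```python
import re

# Bound method with the expected signature: takes a string, returns a match or None.
token_match = re.compile("a-b").match

starts_with_pattern = token_match("a-b")   # match object
not_at_start = token_match("xa-b")         # None: match() only tries position 0
longer_string = token_match("a-bc")        # match object: the end is not anchored

# Hypothetical stricter variant that also requires the string to end there.
strict_match = re.compile(r"a-b$").match
strict_longer = strict_match("a-bc")       # None
```

Because `match` anchors only at the start, a pattern meant to cover a whole token usually needs an explicit end anchor.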
|
||||
|
||||
## Tokenizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -128,6 +128,25 @@ and examples.
|
|||
| `string` | unicode | The string to specially tokenize. |
|
||||
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
|
||||
|
||||
## Tokenizer.explain {#explain tag="method"}
|
||||
|
||||
Tokenize a string with a slow debugging tokenizer that provides information
|
||||
about which tokenizer rule or pattern was matched for each token. The tokens
|
||||
produced are identical to `Tokenizer.__call__` except for whitespace tokens.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tok_exp = nlp.tokenizer.explain("(don't)")
|
||||
> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
|
||||
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------| -------- | --------------------------------------------------- |
|
||||
| `text` | unicode | The string to tokenize with the debugging tokenizer |
|
||||
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples |
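The returned tuples are plain Python data, so they can be post-processed without spaCy. The sketch below groups tokens by the rule that produced them. The input list is copied from the example above rather than computed, and folding `SPECIAL-1`, `SPECIAL-2` into one `SPECIAL` bucket is an illustrative choice, not library behaviour.

```python
from collections import defaultdict

# Copied from the example above; normally this would be the result of
# nlp.tokenizer.explain("(don't)").
tok_exp = [("PREFIX", "("), ("SPECIAL-1", "do"), ("SPECIAL-2", "n't"), ("SUFFIX", ")")]

by_rule = defaultdict(list)
for rule, token in tok_exp:
    # "SPECIAL-1" -> "SPECIAL"; rule names without a hyphen are unchanged.
    by_rule[rule.split("-")[0]].append(token)

grouped = dict(by_rule)
```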
|
||||
|
||||
## Tokenizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
Serialize the tokenizer to disk.
|
||||
|
@ -198,12 +217,14 @@ it.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| `token_match` | - | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
|
||||
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -792,6 +792,33 @@ The algorithm can be summarized as follows:
|
|||
tokens on all infixes.
|
||||
8. Once we can't consume any more of the string, handle it as a single token.
|
||||
|
||||
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
|
||||
|
||||
A working implementation of the pseudo-code above is available for debugging as
|
||||
[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
|
||||
tuples showing which tokenizer rule or pattern was matched for each token. The
|
||||
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
from spacy.lang.en import English
|
||||
|
||||
nlp = English()
|
||||
text = '''"Let's go!"'''
|
||||
doc = nlp(text)
|
||||
tok_exp = nlp.tokenizer.explain(text)
|
||||
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
|
||||
for t in tok_exp:
|
||||
print(t[1], "\\t", t[0])
|
||||
|
||||
# " PREFIX
|
||||
# Let SPECIAL-1
|
||||
# 's SPECIAL-2
|
||||
# go TOKEN
|
||||
# ! SUFFIX
|
||||
# " SUFFIX
|
||||
```
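To make the control flow easier to follow, here is a stripped-down, standard-library sketch of the same loop. It is not spaCy's implementation. The regular expressions and the two special cases are hypothetical stand-ins for the much larger language defaults, and the `token_match` step is left out.

```python
import re

# Hypothetical, heavily reduced rule set.
PREFIX_SEARCH = re.compile(r"^[(\"']").search
SUFFIX_SEARCH = re.compile(r"[)\"'!.,]$").search
INFIX_FINDITER = re.compile(r"[-~]").finditer
SPECIAL_CASES = {"don't": ["do", "n't"], "Let's": ["Let", "'s"]}


def explain_sketch(text):
    """Return (rule, token) tuples for a whitespace-split string."""
    tokens = []
    for substring in text.split():
        suffixes = []
        # Strip prefixes, then suffixes, until nothing matches or a special case remains.
        while substring not in SPECIAL_CASES:
            prefix = PREFIX_SEARCH(substring)
            if prefix:
                tokens.append(("PREFIX", substring[: prefix.end()]))
                substring = substring[prefix.end():]
                continue
            suffix = SUFFIX_SEARCH(substring)
            if suffix:
                suffixes.append(("SUFFIX", substring[suffix.start():]))
                substring = substring[: suffix.start()]
                continue
            break
        infixes = list(INFIX_FINDITER(substring))
        if substring in SPECIAL_CASES:
            pieces = SPECIAL_CASES[substring]
            tokens.extend(("SPECIAL-%d" % (i + 1), p) for i, p in enumerate(pieces))
        elif infixes:
            offset = 0
            for match in infixes:
                if substring[offset:match.start()]:
                    tokens.append(("TOKEN", substring[offset:match.start()]))
                tokens.append(("INFIX", match.group()))
                offset = match.end()
            if substring[offset:]:
                tokens.append(("TOKEN", substring[offset:]))
        elif substring:
            tokens.append(("TOKEN", substring))
        # Suffixes were collected from the outside in, so emit them in reverse.
        tokens.extend(reversed(suffixes))
    return tokens
```

With these toy rules, `explain_sketch('"Let\'s go!"')` yields the same token sequence as the example output above. Real tokenizers will differ wherever the language defaults contain rules that this sketch lacks.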
|
||||
|
||||
### Customizing spaCy's Tokenizer class {#native-tokenizers}
|
||||
|
||||
Let's imagine you wanted to create a tokenizer for a new language or specific
|
||||
|
|
|
@ -1679,13 +1679,14 @@
|
|||
"slogan": "Information extraction from English and German texts based on predicate logic",
|
||||
"github": "msg-systems/holmes-extractor",
|
||||
"url": "https://github.com/msg-systems/holmes-extractor",
|
||||
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
|
||||
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.xt.msg.team).",
|
||||
"pip": "holmes-extractor",
|
||||
"category": ["conversational", "standalone"],
|
||||
"tags": ["chatbots", "text-processing"],
|
||||
"thumb": "https://raw.githubusercontent.com/msg-systems/holmes-extractor/master/docs/holmes_thumbnail.png",
|
||||
"code_example": [
|
||||
"import holmes_extractor as holmes",
|
||||
"holmes_manager = holmes.Manager(model='en_coref_lg')",
|
||||
"holmes_manager = holmes.Manager(model='en_core_web_lg')",
|
||||
"holmes_manager.register_search_phrase('A big dog chases a cat')",
|
||||
"holmes_manager.start_chatbot_mode_console()"
|
||||
],
|
||||
|
|