diff --git a/.github/contributors/GuiGel.md b/.github/contributors/GuiGel.md new file mode 100644 index 000000000..43fb0f757 --- /dev/null +++ b/.github/contributors/GuiGel.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. 
The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Guillaume Gelabert | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-11-15 | +| GitHub username | GuiGel | +| Website (optional) | | diff --git a/.github/contributors/erip.md b/.github/contributors/erip.md new file mode 100644 index 000000000..56df07338 --- /dev/null +++ b/.github/contributors/erip.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Elijah Rippeth | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2019-11-16 | +| GitHub username | erip | +| Website (optional) | | diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 029cc9dd0..054365336 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -50,15 +50,16 @@ jobs: Python36Mac: imageName: 'macos-10.13' python.version: '3.6' - Python37Linux: - imageName: 'ubuntu-16.04' - python.version: '3.7' - Python37Windows: - imageName: 'vs2017-win2016' - python.version: '3.7' - Python37Mac: - imageName: 'macos-10.13' - python.version: '3.7' + # Don't test on 3.7 for now to speed up builds + # Python37Linux: + # imageName: 'ubuntu-16.04' + # python.version: '3.7' + # Python37Windows: + # imageName: 'vs2017-win2016' + # python.version: '3.7' + # Python37Mac: + # imageName: 'macos-10.13' + # python.version: '3.7' Python38Linux: imageName: 'ubuntu-16.04' python.version: '3.8' diff --git a/spacy/about.py b/spacy/about.py index c6db9700f..a1880fb54 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy" -__version__ = "2.2.2" +__version__ = "2.2.3" __release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/errors.py b/spacy/errors.py index c708f0a5b..3e62b5a3e 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -529,6 +529,7 @@ class Errors(object): E185 = ("Received invalid attribute in component attribute declaration: " "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") E186 = ("'{tok_a}' and '{tok_b}' are different texts.") + E187 = ("Only unicode strings are supported as labels.") @add_codes diff --git a/spacy/lang/char_classes.py b/spacy/lang/char_classes.py index 5ed2a2a8c..2c8823867 100644 --- a/spacy/lang/char_classes.py +++ b/spacy/lang/char_classes.py @@ -31,6 +31,10 @@ _latin_u_supplement = r"\u00C0-\u00D6\u00D8-\u00DE" _latin_l_supplement = r"\u00DF-\u00F6\u00F8-\u00FF" _latin_supplement = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF" +_hangul_syllables = r"\uAC00-\uD7AF" +_hangul_jamo = r"\u1100-\u11FF" +_hangul = _hangul_syllables + _hangul_jamo + # letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh _latin_u_extendedA = ( r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C" @@ -202,7 +206,15 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower _uncased = ( - _bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu + _bengali + + _hebrew + + _persian + + _sinhala + + _hindi + + _kannada + + _tamil + + _telugu + + _hangul ) ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased) diff --git a/spacy/lang/ko/lex_attrs.py b/spacy/lang/ko/lex_attrs.py new file mode 100644 index 000000000..1904a0ece --- /dev/null +++ b/spacy/lang/ko/lex_attrs.py @@ -0,0 +1,67 @@ +# coding: utf8 +from __future__ import unicode_literals + +from ...attrs import LIKE_NUM + + +_num_words = [ + "영", + "공", + # Native Korean number system + "하나", + "둘", + "셋", + "넷", + "다섯", + "여섯", + "일곱", + "여덟", + "아홉", + "열", + "스물", + "서른", + "마흔", + "쉰", + "예순", + "일흔", + "여든", + "아흔", + # 
Sino-Korean number system + "일", + "이", + "삼", + "사", + "오", + "육", + "칠", + "팔", + "구", + "십", + "백", + "천", + "만", + "십만", + "백만", + "천만", + "일억", + "십억", + "백억", +] + + +def like_num(text): + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if any(char.lower() in _num_words for char in text): + return True + return False + + +LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/lb/tokenizer_exceptions.py b/spacy/lang/lb/tokenizer_exceptions.py index 18b58f2b1..d84372aef 100644 --- a/spacy/lang/lb/tokenizer_exceptions.py +++ b/spacy/lang/lb/tokenizer_exceptions.py @@ -6,9 +6,7 @@ from ...symbols import ORTH, LEMMA, NORM # TODO # treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions) -_exc = { - -} +_exc = {} # translate / delete what is not necessary for exc_data in [ diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index 5bd7b7335..8179b4551 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -14,6 +14,7 @@ from .tag_map import TAG_MAP def try_jieba_import(use_jieba): try: import jieba + return jieba except ImportError: if use_jieba: @@ -34,7 +35,9 @@ class ChineseTokenizer(DummyTokenizer): def __call__(self, text): # use jieba if self.use_jieba: - jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x]) + jieba_words = list( + [x for x in self.jieba_seg.cut(text, cut_all=False) if x] + ) words = [jieba_words[0]] spaces = [False] for i in range(1, len(jieba_words)): diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index d926b987b..205697637 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -292,13 +292,14 @@ class EntityRuler(object): self.add_patterns(patterns) else: cfg = {} - deserializers = { + deserializers_patterns = { "patterns": lambda p: self.add_patterns( srsly.read_jsonl(p.with_suffix(".jsonl")) - ), - "cfg": lambda p: cfg.update(srsly.read_json(p)), + )} + deserializers_cfg = { + "cfg": lambda p: cfg.update(srsly.read_json(p)) } - from_disk(path, deserializers, {}) + from_disk(path, deserializers_cfg, {}) self.overwrite = cfg.get("overwrite", False) self.phrase_matcher_attr = cfg.get("phrase_matcher_attr") self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) @@ -307,6 +308,7 @@ class EntityRuler(object): self.phrase_matcher = PhraseMatcher( self.nlp.vocab, attr=self.phrase_matcher_attr ) + from_disk(path, deserializers_patterns, {}) return self def to_disk(self, path, **kwargs): diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index d29cf9ce9..8bf36e9c2 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -13,6 +13,7 @@ from thinc.misc import LayerNorm from thinc.neural.util import to_categorical from thinc.neural.util import get_array_module +from ..compat import basestring_ from ..tokens.doc cimport Doc from ..syntax.nn_parser cimport Parser from ..syntax.ner cimport BiluoPushDown @@ -547,6 +548,8 @@ class Tagger(Pipe): return build_tagger_model(n_tags, **cfg) def add_label(self, label, values=None): + if not isinstance(label, basestring_): + raise ValueError(Errors.E187) if label in self.labels: return 0 if self.model not in (True, False, None): @@ -1016,6 +1019,8 @@ class TextCategorizer(Pipe): return float(mean_square_error), d_scores def add_label(self, 
label): + if not isinstance(label, basestring_): + raise ValueError(Errors.E187) if label in self.labels: return 0 if self.model not in (None, True, False): diff --git a/spacy/scorer.py b/spacy/scorer.py index 0b4843f41..7b05b11fd 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -271,7 +271,9 @@ class Scorer(object): self.labelled_per_dep[token.dep_.lower()] = PRFScore() if token.dep_.lower() not in cand_deps_per_dep: cand_deps_per_dep[token.dep_.lower()] = set() - cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower())) + cand_deps_per_dep[token.dep_.lower()].add( + (gold_i, gold_head, token.dep_.lower()) + ) if "-" not in [token[-1] for token in gold.orig_annot]: # Find all NER labels in gold and doc ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents]) @@ -304,7 +306,9 @@ class Scorer(object): self.tags.score_set(cand_tags, gold_tags) self.labelled.score_set(cand_deps, gold_deps) for dep in self.labelled_per_dep: - self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set())) + self.labelled_per_dep[dep].score_set( + cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()) + ) self.unlabelled.score_set( set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps) ) diff --git a/spacy/syntax/_parser_model.pyx b/spacy/syntax/_parser_model.pyx index 77bd43ed7..8b6448a46 100644 --- a/spacy/syntax/_parser_model.pyx +++ b/spacy/syntax/_parser_model.pyx @@ -42,11 +42,17 @@ cdef WeightsC get_c_weights(model) except *: cdef precompute_hiddens state2vec = model.state2vec output.feat_weights = state2vec.get_feat_weights() output.feat_bias = state2vec.bias.data - cdef np.ndarray vec2scores_W = model.vec2scores.W - cdef np.ndarray vec2scores_b = model.vec2scores.b + cdef np.ndarray vec2scores_W + cdef np.ndarray vec2scores_b + if model.vec2scores is None: + output.hidden_weights = NULL + output.hidden_bias = NULL + else: + vec2scores_W = model.vec2scores.W + vec2scores_b = model.vec2scores.b + output.hidden_weights = vec2scores_W.data + output.hidden_bias = vec2scores_b.data cdef np.ndarray class_mask = model._class_mask - output.hidden_weights = vec2scores_W.data - output.hidden_bias = vec2scores_b.data output.seen_classes = class_mask.data return output @@ -54,7 +60,10 @@ cdef WeightsC get_c_weights(model) except *: cdef SizesC get_c_sizes(model, int batch_size) except *: cdef SizesC output output.states = batch_size - output.classes = model.vec2scores.nO + if model.vec2scores is None: + output.classes = model.state2vec.nO + else: + output.classes = model.vec2scores.nO output.hiddens = model.state2vec.nO output.pieces = model.state2vec.nP output.feats = model.state2vec.nF @@ -105,11 +114,12 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil: cdef void predict_states(ActivationsC* A, StateC** states, const WeightsC* W, SizesC n) nogil: + cdef double one = 1.0 resize_activations(A, n) - memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float)) - memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float)) for i in range(n.states): states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats) + memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float)) + memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float)) sum_state_features(A.unmaxed, W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces) for i in range(n.states): @@ -120,18 +130,20 @@ cdef void predict_states(ActivationsC* A, StateC** states, which = 
Vec.arg_max(&A.unmaxed[index], n.pieces) A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which] memset(A.scores, 0, n.states * n.classes * sizeof(float)) - cdef double one = 1.0 - # Compute hidden-to-output - blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE, - n.states, n.classes, n.hiddens, one, - A.hiddens, n.hiddens, 1, - W.hidden_weights, n.hiddens, 1, - one, - A.scores, n.classes, 1) - # Add bias - for i in range(n.states): - VecVec.add_i(&A.scores[i*n.classes], - W.hidden_bias, 1., n.classes) + if W.hidden_weights == NULL: + memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float)) + else: + # Compute hidden-to-output + blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE, + n.states, n.classes, n.hiddens, one, + A.hiddens, n.hiddens, 1, + W.hidden_weights, n.hiddens, 1, + one, + A.scores, n.classes, 1) + # Add bias + for i in range(n.states): + VecVec.add_i(&A.scores[i*n.classes], + W.hidden_bias, 1., n.classes) # Set unseen classes to minimum value i = 0 min_ = A.scores[0] @@ -219,7 +231,9 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no class ParserModel(Model): def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None): Model.__init__(self) - self._layers = [tok2vec, lower_model, upper_model] + self._layers = [tok2vec, lower_model] + if upper_model is not None: + self._layers.append(upper_model) self.unseen_classes = set() if unseen_classes: for class_ in unseen_classes: @@ -234,6 +248,8 @@ class ParserModel(Model): return step_model, finish_parser_update def resize_output(self, new_output): + if len(self._layers) == 2: + return if new_output == self.upper.nO: return smaller = self.upper @@ -275,12 +291,24 @@ class ParserModel(Model): class ParserStepModel(Model): def __init__(self, docs, layers, unseen_classes=None, drop=0.): self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop) + if layers[1].nP >= 2: + activation = "maxout" + elif len(layers) == 2: + activation = None + else: + activation = "relu" self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1], - drop=drop) - self.vec2scores = layers[-1] - self.cuda_stream = util.get_cuda_stream() + activation=activation, drop=drop) + if len(layers) == 3: + self.vec2scores = layers[-1] + else: + self.vec2scores = None + self.cuda_stream = util.get_cuda_stream(non_blocking=True) self.backprops = [] - self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f') + if self.vec2scores is None: + self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f') + else: + self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f') self._class_mask.fill(1) if unseen_classes is not None: for class_ in unseen_classes: @@ -302,10 +330,15 @@ class ParserStepModel(Model): def begin_update(self, states, drop=0.): token_ids = self.get_token_ids(states) vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0) - mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop) - if mask is not None: - vector *= mask - scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop) + if self.vec2scores is not None: + mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop) + if mask is not None: + vector *= mask + scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop) + else: + scores = NumpyOps().asarray(vector) + get_d_vector = lambda d_scores, sgd=None: d_scores + mask = None # If the class is unseen, make sure its score is minimum scores[:, self._class_mask == 0] = numpy.nanmin(scores) @@ -342,12 +375,12 
@@ class ParserStepModel(Model): return ids def make_updates(self, sgd): - # Tells CUDA to block, so our async copies complete. - if self.cuda_stream is not None: - self.cuda_stream.synchronize() # Add a padding vector to the d_tokvecs gradient, so that missing # values don't affect the real gradient. d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1])) + # Tells CUDA to block, so our async copies complete. + if self.cuda_stream is not None: + self.cuda_stream.synchronize() for ids, d_vector, bp_vector in self.backprops: d_state_features = bp_vector((d_vector, ids), sgd=sgd) ids = ids.flatten() @@ -385,9 +418,10 @@ cdef class precompute_hiddens: cdef np.ndarray bias cdef object _cuda_stream cdef object _bp_hiddens + cdef object activation def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None, - drop=0.): + activation="maxout", drop=0.): gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop) cdef np.ndarray cached if not isinstance(gpu_cached, numpy.ndarray): @@ -405,6 +439,8 @@ cdef class precompute_hiddens: self.nP = getattr(lower_model, 'nP', 1) self.nO = cached.shape[2] self.ops = lower_model.ops + assert activation in (None, "relu", "maxout") + self.activation = activation self._is_synchronized = False self._cuda_stream = cuda_stream self._cached = cached @@ -417,7 +453,7 @@ cdef class precompute_hiddens: return self._cached.data def __call__(self, X): - return self.begin_update(X)[0] + return self.begin_update(X, drop=None)[0] def begin_update(self, token_ids, drop=0.): cdef np.ndarray state_vector = numpy.zeros( @@ -450,28 +486,35 @@ cdef class precompute_hiddens: else: ops = CupyOps() - if self.nP == 1: - state_vector = state_vector.reshape(state_vector.shape[:-1]) - mask = state_vector >= 0. - state_vector *= mask - else: + if self.activation == "maxout": state_vector, mask = ops.maxout(state_vector) + else: + state_vector = state_vector.reshape(state_vector.shape[:-1]) + if self.activation == "relu": + mask = state_vector >= 0. + state_vector *= mask + else: + mask = None def backprop_nonlinearity(d_best, sgd=None): if isinstance(d_best, numpy.ndarray): ops = NumpyOps() else: ops = CupyOps() - mask_ = ops.asarray(mask) - + if mask is not None: + mask_ = ops.asarray(mask) # This will usually be on GPU d_best = ops.asarray(d_best) # Fix nans (which can occur from unseen classes.) d_best[ops.xp.isnan(d_best)] = 0. 
- if self.nP == 1: + if self.activation == "maxout": + mask_ = ops.asarray(mask) + return ops.backprop_maxout(d_best, mask_, self.nP) + elif self.activation == "relu": + mask_ = ops.asarray(mask) d_best *= mask_ d_best = d_best.reshape((d_best.shape + (1,))) return d_best else: - return ops.backprop_maxout(d_best, mask_, self.nP) + return d_best.reshape((d_best.shape + (1,))) return state_vector, backprop_nonlinearity diff --git a/spacy/syntax/_state.pxd b/spacy/syntax/_state.pxd index 65c0a3b4d..141d796a4 100644 --- a/spacy/syntax/_state.pxd +++ b/spacy/syntax/_state.pxd @@ -100,10 +100,30 @@ cdef cppclass StateC: free(this.shifted - PADDING) void set_context_tokens(int* ids, int n) nogil: - if n == 2: + if n == 1: + if this.B(0) >= 0: + ids[0] = this.B(0) + else: + ids[0] = -1 + elif n == 2: ids[0] = this.B(0) ids[1] = this.S(0) - if n == 8: + elif n == 3: + if this.B(0) >= 0: + ids[0] = this.B(0) + else: + ids[0] = -1 + # First word of entity, if any + if this.entity_is_open(): + ids[1] = this.E(0) + else: + ids[1] = -1 + # Last word of entity, if within entity + if ids[0] == -1 or ids[1] == -1: + ids[2] = -1 + else: + ids[2] = ids[0] - 1 + elif n == 8: ids[0] = this.B(0) ids[1] = this.B(1) ids[2] = this.S(0) diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx index 0ed7e6952..8493140b8 100644 --- a/spacy/syntax/nn_parser.pyx +++ b/spacy/syntax/nn_parser.pyx @@ -22,7 +22,7 @@ from thinc.extra.search cimport Beam from thinc.api import chain, clone from thinc.v2v import Model, Maxout, Affine from thinc.misc import LayerNorm -from thinc.neural.ops import CupyOps +from thinc.neural.ops import NumpyOps, CupyOps from thinc.neural.util import get_array_module from thinc.linalg cimport Vec, VecVec import srsly @@ -61,13 +61,17 @@ cdef class Parser: t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3)) bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0)) - if depth != 1: + nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature) + if depth not in (0, 1): raise ValueError(TempErrors.T004.format(value=depth)) parser_maxout_pieces = util.env_opt('parser_maxout_pieces', cfg.get('maxout_pieces', 2)) token_vector_width = util.env_opt('token_vector_width', cfg.get('token_vector_width', 96)) hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64)) + if depth == 0: + hidden_width = nr_class + parser_maxout_pieces = 1 embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000)) pretrained_vectors = cfg.get('pretrained_vectors', None) tok2vec = Tok2Vec(token_vector_width, embed_size, @@ -80,16 +84,19 @@ cdef class Parser: tok2vec = chain(tok2vec, flatten) tok2vec.nO = token_vector_width lower = PrecomputableAffine(hidden_width, - nF=cls.nr_feature, nI=token_vector_width, + nF=nr_feature_tokens, nI=token_vector_width, nP=parser_maxout_pieces) lower.nP = parser_maxout_pieces - - with Model.use_device('cpu'): - upper = Affine(nr_class, hidden_width, drop_factor=0.0) - upper.W *= 0 + if depth == 1: + with Model.use_device('cpu'): + upper = Affine(nr_class, hidden_width, drop_factor=0.0) + upper.W *= 0 + else: + upper = None cfg = { 'nr_class': nr_class, + 'nr_feature_tokens': nr_feature_tokens, 'hidden_depth': depth, 'token_vector_width': token_vector_width, 'hidden_width': hidden_width, @@ -133,6 +140,7 @@ cdef class Parser: if 'beam_update_prob' not in cfg: cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0) 
cfg.setdefault('cnn_maxout_pieces', 3) + cfg.setdefault("nr_feature_tokens", self.nr_feature) self.cfg = cfg self.model = model self._multitasks = [] @@ -299,7 +307,7 @@ cdef class Parser: token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature), dtype='i', order='C') cdef int* c_ids - cdef int nr_feature = self.nr_feature + cdef int nr_feature = self.cfg["nr_feature_tokens"] cdef int n_states model = self.model(docs) todo = [beam for beam in beams if not beam.is_done] @@ -502,7 +510,7 @@ cdef class Parser: self.moves.preprocess_gold(gold) model, finish_update = self.model.begin_update(docs, drop=drop) states_d_scores, backprops, beams = _beam_utils.update_beam( - self.moves, self.nr_feature, 10000, states, golds, model.state2vec, + self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec, model.vec2scores, width, drop=drop, losses=losses, beam_density=beam_density) for i, d_scores in enumerate(states_d_scores): diff --git a/spacy/tests/lang/en/test_customized_tokenizer.py b/spacy/tests/lang/en/test_customized_tokenizer.py index fdac32a90..7f939011f 100644 --- a/spacy/tests/lang/en/test_customized_tokenizer.py +++ b/spacy/tests/lang/en/test_customized_tokenizer.py @@ -2,6 +2,7 @@ from __future__ import unicode_literals import pytest +import re from spacy.lang.en import English from spacy.tokenizer import Tokenizer from spacy.util import compile_prefix_regex, compile_suffix_regex @@ -19,13 +20,14 @@ def custom_en_tokenizer(en_vocab): r"[\[\]!&:,()\*—–\/-]", ] infix_re = compile_infix_regex(custom_infixes) + token_match_re = re.compile("a-b") return Tokenizer( en_vocab, English.Defaults.tokenizer_exceptions, prefix_re.search, suffix_re.search, infix_re.finditer, - token_match=None, + token_match=token_match_re.match, ) @@ -74,3 +76,81 @@ def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer): "Megaregion", ".", ] + + +def test_en_customized_tokenizer_handles_token_match(custom_en_tokenizer): + sentence = "The 8 and 10-county definitions a-b not used for the greater Southern California Megaregion." + context = [word.text for word in custom_en_tokenizer(sentence)] + assert context == [ + "The", + "8", + "and", + "10", + "-", + "county", + "definitions", + "a-b", + "not", + "used", + "for", + "the", + "greater", + "Southern", + "California", + "Megaregion", + ".", + ] + + +def test_en_customized_tokenizer_handles_rules(custom_en_tokenizer): + sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)" + context = [word.text for word in custom_en_tokenizer(sentence)] + assert context == [ + "The", + "8", + "and", + "10", + "-", + "county", + "definitions", + "are", + "not", + "used", + "for", + "the", + "greater", + "Southern", + "California", + "Megaregion", + ".", + ":)", + ] + + +def test_en_customized_tokenizer_handles_rules_property(custom_en_tokenizer): + sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. 
:)" + rules = custom_en_tokenizer.rules + del rules[":)"] + custom_en_tokenizer.rules = rules + context = [word.text for word in custom_en_tokenizer(sentence)] + assert context == [ + "The", + "8", + "and", + "10", + "-", + "county", + "definitions", + "are", + "not", + "used", + "for", + "the", + "greater", + "Southern", + "California", + "Megaregion", + ".", + ":", + ")", + ] diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index d05403891..8329391ca 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -259,6 +259,27 @@ def test_block_ner(): assert [token.ent_type_ for token in doc] == expected_types +def test_change_number_features(): + # Test the default number features + nlp = English() + ner = nlp.create_pipe("ner") + nlp.add_pipe(ner) + ner.add_label("PERSON") + nlp.begin_training() + assert ner.model.lower.nF == ner.nr_feature + # Test we can change it + nlp = English() + ner = nlp.create_pipe("ner") + nlp.add_pipe(ner) + ner.add_label("PERSON") + nlp.begin_training( + component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}} + ) + assert ner.model.lower.nF == 3 + # Test the model runs + nlp("hello world") + + class BlockerComponent1(object): name = "my_blocker" diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py new file mode 100644 index 000000000..d0331602c --- /dev/null +++ b/spacy/tests/pipeline/test_tagger.py @@ -0,0 +1,14 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.language import Language +from spacy.pipeline import Tagger + + +def test_label_types(): + nlp = Language() + nlp.add_pipe(nlp.create_pipe("tagger")) + nlp.get_pipe("tagger").add_label("A") + with pytest.raises(ValueError): + nlp.get_pipe("tagger").add_label(9) diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index ef70dc013..b7db85056 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -62,3 +62,11 @@ def test_textcat_learns_multilabel(): assert score < 0.5 else: assert score > 0.5 + + +def test_label_types(): + nlp = Language() + nlp.add_pipe(nlp.create_pipe("textcat")) + nlp.get_pipe("textcat").add_label("answer") + with pytest.raises(ValueError): + nlp.get_pipe("textcat").add_label(9) diff --git a/spacy/tests/regression/test_issue4402.py b/spacy/tests/regression/test_issue4402.py index 2e1b69000..d3b4bdf9a 100644 --- a/spacy/tests/regression/test_issue4402.py +++ b/spacy/tests/regression/test_issue4402.py @@ -3,9 +3,9 @@ from __future__ import unicode_literals import srsly from spacy.gold import GoldCorpus - from spacy.lang.en import English -from spacy.tests.util import make_tempdir + +from ..util import make_tempdir def test_issue4402(): diff --git a/spacy/tests/regression/test_issue4590.py b/spacy/tests/regression/test_issue4590.py index 6a43dfea9..8ec9a0bd1 100644 --- a/spacy/tests/regression/test_issue4590.py +++ b/spacy/tests/regression/test_issue4590.py @@ -1,7 +1,6 @@ # coding: utf-8 from __future__ import unicode_literals -import pytest from mock import Mock from spacy.matcher import DependencyMatcher from ..util import get_doc @@ -11,8 +10,14 @@ def test_issue4590(en_vocab): """Test that matches param in on_match method are the same as matches run with no on_match method""" pattern = [ {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, - {"SPEC": 
{"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}}, + { + "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, + "PATTERN": {"ORTH": "fox"}, + }, + { + "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, + "PATTERN": {"ORTH": "fox"}, + }, ] on_match = Mock() @@ -23,12 +28,11 @@ def test_issue4590(en_vocab): text = "The quick brown fox jumped over the lazy fox" heads = [3, 2, 1, 1, 0, -1, 2, 1, -3] deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"] - + doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps) - + matches = matcher(doc) - + on_match_args = on_match.call_args assert on_match_args[0][3] == matches - diff --git a/spacy/tests/regression/test_issue4651.py b/spacy/tests/regression/test_issue4651.py new file mode 100644 index 000000000..eb49f4a38 --- /dev/null +++ b/spacy/tests/regression/test_issue4651.py @@ -0,0 +1,65 @@ +# coding: utf-8 +from __future__ import unicode_literals + +from spacy.lang.en import English +from spacy.pipeline import EntityRuler + +from ..util import make_tempdir + + +def test_issue4651_with_phrase_matcher_attr(): + """Test that the EntityRuler PhraseMatcher is deserialize correctly using + the method from_disk when the EntityRuler argument phrase_matcher_attr is + specified. + """ + text = "Spacy is a python library for nlp" + + nlp = English() + ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER") + patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + doc = nlp(text) + res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] + + nlp_reloaded = English() + with make_tempdir() as d: + file_path = d / "entityruler" + ruler.to_disk(file_path) + ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path) + + nlp_reloaded.add_pipe(ruler_reloaded) + doc_reloaded = nlp_reloaded(text) + res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] + + assert res == res_reloaded + + +def test_issue4651_without_phrase_matcher_attr(): + """Test that the EntityRuler PhraseMatcher is deserialize correctly using + the method from_disk when the EntityRuler argument phrase_matcher_attr is + not specified. + """ + text = "Spacy is a python library for nlp" + + nlp = English() + ruler = EntityRuler(nlp) + patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] + ruler.add_patterns(patterns) + nlp.add_pipe(ruler) + + doc = nlp(text) + res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] + + nlp_reloaded = English() + with make_tempdir() as d: + file_path = d / "entityruler" + ruler.to_disk(file_path) + ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path) + + nlp_reloaded.add_pipe(ruler_reloaded) + doc_reloaded = nlp_reloaded(text) + res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] + + assert res == res_reloaded diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index c59358a6b..2a4ef0f40 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -12,8 +12,22 @@ from .util import get_doc test_las_apple = [ [ "Apple is looking at buying U.K. 
startup for $ 1 billion", - {"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7], - "deps": ['nsubj', 'aux', 'ROOT', 'prep', 'pcomp', 'compound', 'dobj', 'prep', 'quantmod', 'compound', 'pobj']}, + { + "heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7], + "deps": [ + "nsubj", + "aux", + "ROOT", + "prep", + "pcomp", + "compound", + "dobj", + "prep", + "quantmod", + "compound", + "pobj", + ], + }, ] ] @@ -59,7 +73,7 @@ def test_las_per_type(en_vocab): en_vocab, words=input_.split(" "), heads=([h - i for i, h in enumerate(annot["heads"])]), - deps=annot["deps"] + deps=annot["deps"], ) gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"]) doc[0].dep_ = "compound" diff --git a/spacy/tests/tokenizer/test_explain.py b/spacy/tests/tokenizer/test_explain.py new file mode 100644 index 000000000..2d71588cc --- /dev/null +++ b/spacy/tests/tokenizer/test_explain.py @@ -0,0 +1,65 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.util import get_lang_class + +# Only include languages with no external dependencies +# "is" seems to confuse importlib, so we're also excluding it for now +# excluded: ja, ru, th, uk, vi, zh, is +LANGUAGES = [ + pytest.param("fr", marks=pytest.mark.slow()), + pytest.param("af", marks=pytest.mark.slow()), + pytest.param("ar", marks=pytest.mark.slow()), + pytest.param("bg", marks=pytest.mark.slow()), + "bn", + pytest.param("ca", marks=pytest.mark.slow()), + pytest.param("cs", marks=pytest.mark.slow()), + pytest.param("da", marks=pytest.mark.slow()), + pytest.param("de", marks=pytest.mark.slow()), + "el", + "en", + pytest.param("es", marks=pytest.mark.slow()), + pytest.param("et", marks=pytest.mark.slow()), + pytest.param("fa", marks=pytest.mark.slow()), + pytest.param("fi", marks=pytest.mark.slow()), + "fr", + pytest.param("ga", marks=pytest.mark.slow()), + pytest.param("he", marks=pytest.mark.slow()), + pytest.param("hi", marks=pytest.mark.slow()), + pytest.param("hr", marks=pytest.mark.slow()), + "hu", + pytest.param("id", marks=pytest.mark.slow()), + pytest.param("it", marks=pytest.mark.slow()), + pytest.param("kn", marks=pytest.mark.slow()), + pytest.param("lb", marks=pytest.mark.slow()), + pytest.param("lt", marks=pytest.mark.slow()), + pytest.param("lv", marks=pytest.mark.slow()), + pytest.param("nb", marks=pytest.mark.slow()), + pytest.param("nl", marks=pytest.mark.slow()), + "pl", + pytest.param("pt", marks=pytest.mark.slow()), + pytest.param("ro", marks=pytest.mark.slow()), + pytest.param("si", marks=pytest.mark.slow()), + pytest.param("sk", marks=pytest.mark.slow()), + pytest.param("sl", marks=pytest.mark.slow()), + pytest.param("sq", marks=pytest.mark.slow()), + pytest.param("sr", marks=pytest.mark.slow()), + pytest.param("sv", marks=pytest.mark.slow()), + pytest.param("ta", marks=pytest.mark.slow()), + pytest.param("te", marks=pytest.mark.slow()), + pytest.param("tl", marks=pytest.mark.slow()), + pytest.param("tr", marks=pytest.mark.slow()), + pytest.param("tt", marks=pytest.mark.slow()), + pytest.param("ur", marks=pytest.mark.slow()), +] + + +@pytest.mark.parametrize("lang", LANGUAGES) +def test_tokenizer_explain(lang): + tokenizer = get_lang_class(lang).Defaults.create_tokenizer() + examples = pytest.importorskip("spacy.lang.{}.examples".format(lang)) + for sentence in examples.sentences: + tokens = [t.text for t in tokenizer(sentence) if not t.is_space] + debug_tokens = [t[1] for t in tokenizer.explain(sentence)] + assert tokens == debug_tokens diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index b39bb1ecb..230f41921 
100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -15,6 +15,8 @@ import re from .tokens.doc cimport Doc from .strings cimport hash_string from .compat import unescape_unicode +from .attrs import intify_attrs +from .symbols import ORTH from .errors import Errors, Warnings, deprecation_warning from . import util @@ -57,9 +59,7 @@ cdef class Tokenizer: self.infix_finditer = infix_finditer self.vocab = vocab self._rules = {} - if rules is not None: - for chunk, substrings in sorted(rules.items()): - self.add_special_case(chunk, substrings) + self._load_special_tokenization(rules) property token_match: def __get__(self): @@ -93,6 +93,18 @@ cdef class Tokenizer: self._infix_finditer = infix_finditer self._flush_cache() + property rules: + def __get__(self): + return self._rules + + def __set__(self, rules): + self._rules = {} + self._reset_cache([key for key in self._cache]) + self._reset_specials() + self._cache = PreshMap() + self._specials = PreshMap() + self._load_special_tokenization(rules) + def __reduce__(self): args = (self.vocab, self._rules, @@ -227,10 +239,6 @@ cdef class Tokenizer: cdef unicode minus_suf cdef size_t last_size = 0 while string and len(string) != last_size: - if self.token_match and self.token_match(string) \ - and not self.find_prefix(string) \ - and not self.find_suffix(string): - break if self._specials.get(hash_string(string)) != NULL: has_special[0] = 1 break @@ -393,8 +401,9 @@ cdef class Tokenizer: def _load_special_tokenization(self, special_cases): """Add special-case tokenization rules.""" - for chunk, substrings in sorted(special_cases.items()): - self.add_special_case(chunk, substrings) + if special_cases is not None: + for chunk, substrings in sorted(special_cases.items()): + self.add_special_case(chunk, substrings) def add_special_case(self, unicode string, substrings): """Add a special-case tokenization rule. @@ -423,6 +432,73 @@ cdef class Tokenizer: self.mem.free(stale_cached) self._rules[string] = substrings + def explain(self, text): + """A debugging tokenizer that provides information about which + tokenizer rule or pattern was matched for each token. The tokens + produced are identical to `nlp.tokenizer()` except for whitespace + tokens. + + string (unicode): The string to tokenize. 
+ RETURNS (list): A list of (pattern_string, token_string) tuples + + DOCS: https://spacy.io/api/tokenizer#explain + """ + prefix_search = self.prefix_search + suffix_search = self.suffix_search + infix_finditer = self.infix_finditer + token_match = self.token_match + special_cases = {} + for orth, special_tokens in self.rules.items(): + special_cases[orth] = [intify_attrs(special_token, strings_map=self.vocab.strings, _do_deprecated=True) for special_token in special_tokens] + tokens = [] + for substring in text.split(): + suffixes = [] + while substring: + while prefix_search(substring) or suffix_search(substring): + if substring in special_cases: + tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring])) + substring = '' + break + if prefix_search(substring): + split = prefix_search(substring).end() + # break if pattern matches the empty string + if split == 0: + break + tokens.append(("PREFIX", substring[:split])) + substring = substring[split:] + if substring in special_cases: + continue + if suffix_search(substring): + split = suffix_search(substring).start() + # break if pattern matches the empty string + if split == len(substring): + break + suffixes.append(("SUFFIX", substring[split:])) + substring = substring[:split] + if substring in special_cases: + tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring])) + substring = '' + elif token_match(substring): + tokens.append(("TOKEN_MATCH", substring)) + substring = '' + elif list(infix_finditer(substring)): + infixes = infix_finditer(substring) + offset = 0 + for match in infixes: + if substring[offset : match.start()]: + tokens.append(("TOKEN", substring[offset : match.start()])) + if substring[match.start() : match.end()]: + tokens.append(("INFIX", substring[match.start() : match.end()])) + offset = match.end() + if substring[offset:]: + tokens.append(("TOKEN", substring[offset:])) + substring = '' + elif substring: + tokens.append(("TOKEN", substring)) + substring = '' + tokens.extend(reversed(suffixes)) + return tokens + def to_disk(self, path, **kwargs): """Save the current state to a directory. @@ -507,8 +583,7 @@ cdef class Tokenizer: self._reset_specials() self._cache = PreshMap() self._specials = PreshMap() - for string, substrings in data.get("rules", {}).items(): - self.add_special_case(string, substrings) + self._load_special_tokenization(data.get("rules", {})) return self diff --git a/spacy/util.py b/spacy/util.py index 2d5a56806..57ed244f8 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -301,13 +301,13 @@ def get_component_name(component): return repr(component) -def get_cuda_stream(require=False): +def get_cuda_stream(require=False, non_blocking=True): if CudaStream is None: return None elif isinstance(Model.ops, NumpyOps): return None else: - return CudaStream() + return CudaStream(non_blocking=non_blocking) def get_async(stream, numpy_array): diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index 44dddb30c..6b26bf123 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -265,17 +265,12 @@ cdef class Vectors: rows = [self.key2row.get(key, -1.) 
for key in keys] return xp.asarray(rows, dtype="i") else: - targets = set() + row2key = {row: key for key, row in self.key2row.items()} if row is not None: - targets.add(row) + return row2key[row] else: - targets.update(rows) - results = [] - for key, row in self.key2row.items(): - if row in targets: - results.append(key) - targets.remove(row) - return xp.asarray(results, dtype="uint64") + results = [row2key[row] for row in rows] + return xp.asarray(results, dtype="uint64") def add(self, key, *, vector=None, row=None): """Add a key to the table. Keys can be mapped to an existing vector diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index 35348217b..b1824573c 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -58,4 +58,5 @@ Update the evaluation scores from a single [`Doc`](/api/doc) / | `ents_per_type` 2.1.5 | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. | | `textcat_score` 2.2 | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). | | `textcats_per_cat` 2.2 | dict | Scores per textcat label, keyed by label. | +| `las_per_type` 2.2.3 | dict | Labelled dependency scores, keyed by label. | | `scores` | dict | All scores, keyed by type. | diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index d6ab73f14..7462af739 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -34,15 +34,15 @@ the > tokenizer = nlp.Defaults.create_tokenizer(nlp) > ``` -| Name | Type | Description | -| ---------------- | ----------- | ----------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | A storage container for lexical types. | -| `rules` | dict | Exceptions and special-cases for the tokenizer. | -| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. | -| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. | -| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. | -| `token_match` | callable | A boolean function matching strings to be recognized as tokens. | -| **RETURNS** | `Tokenizer` | The newly constructed object. | +| Name | Type | Description | +| ---------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | A storage container for lexical types. | +| `rules` | dict | Exceptions and special-cases for the tokenizer. | +| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. | +| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. | +| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. | +| `token_match` | callable | A function matching the signature of `re.compile(string).match to find token matches. | +| **RETURNS** | `Tokenizer` | The newly constructed object. | ## Tokenizer.\_\_call\_\_ {#call tag="method"} @@ -128,6 +128,25 @@ and examples. | `string` | unicode | The string to specially tokenize. | | `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. 
The `ORTH` fields of the attributes must exactly match the string when they are concatenated. | +## Tokenizer.explain {#explain tag="method"} + +Tokenize a string with a slow debugging tokenizer that provides information +about which tokenizer rule or pattern was matched for each token. The tokens +produced are identical to `Tokenizer.__call__` except for whitespace tokens. + +> #### Example +> +> ```python +> tok_exp = nlp.tokenizer.explain("(don't)") +> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"] +> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"] +> ``` + +| Name | Type | Description | +| ------------| -------- | --------------------------------------------------- | +| `string` | unicode | The string to tokenize with the debugging tokenizer | +| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples | + ## Tokenizer.to_disk {#to_disk tag="method"} Serialize the tokenizer to disk. @@ -198,12 +217,14 @@ it. ## Attributes {#attributes} -| Name | Type | Description | -| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | -| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. | -| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. | -| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. | +| Name | Type | Description | +| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | +| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. | +| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. | +| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. | +| `token_match` | - | A function matching the signature of `re.compile(string).match to find token matches. Returns an `re.MatchObject` or `None. | +| `rules` | dict | A dictionary of tokenizer exceptions and special cases. | ## Serialization fields {#serialization-fields} diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index db3aac686..3af7d9fd1 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -792,6 +792,33 @@ The algorithm can be summarized as follows: tokens on all infixes. 8. Once we can't consume any more of the string, handle it as a single token. +#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"} + +A working implementation of the pseudo-code above is available for debugging as +[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of +tuples showing which tokenizer rule or pattern was matched for each token. 
The +tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens: + +```python +### {executable="true"} +from spacy.lang.en import English + +nlp = English() +text = '''"Let's go!"''' +doc = nlp(text) +tok_exp = nlp.tokenizer.explain(text) +assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp] +for t in tok_exp: + print(t[1], "\\t", t[0]) + +# " PREFIX +# Let SPECIAL-1 +# 's SPECIAL-2 +# go TOKEN +# ! SUFFIX +# " SUFFIX +``` + ### Customizing spaCy's Tokenizer class {#native-tokenizers} Let's imagine you wanted to create a tokenizer for a new language or specific diff --git a/website/meta/universe.json b/website/meta/universe.json index 40ebfaaa7..98a7807ca 100644 --- a/website/meta/universe.json +++ b/website/meta/universe.json @@ -1679,13 +1679,14 @@ "slogan": "Information extraction from English and German texts based on predicate logic", "github": "msg-systems/holmes-extractor", "url": "https://github.com/msg-systems/holmes-extractor", - "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.", + "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.xt.msg.team).", "pip": "holmes-extractor", "category": ["conversational", "standalone"], "tags": ["chatbots", "text-processing"], + "thumb": "https://raw.githubusercontent.com/msg-systems/holmes-extractor/master/docs/holmes_thumbnail.png", "code_example": [ "import holmes_extractor as holmes", - "holmes_manager = holmes.Manager(model='en_coref_lg')", + "holmes_manager = holmes.Manager(model='en_core_web_lg')", "holmes_manager.register_search_phrase('A big dog chases a cat')", "holmes_manager.start_chatbot_mode_console()" ],
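For readers who want to try the two tokenizer additions from this patch together, here is a minimal sketch (assuming spaCy 2.2.3 with the English language data installed) that exercises the new `Tokenizer.explain` method and the new writable `rules` property; it is an illustration built from the documented behaviour above, not part of the patch itself:

```python
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.tokenizer

# explain() reports which rule or pattern produced each token, e.g. for "(don't)":
# ("PREFIX", "("), ("SPECIAL-1", "do"), ("SPECIAL-2", "n't"), ("SUFFIX", ")")
for pattern, token in tokenizer.explain("(don't)"):
    print(pattern, token)

# The rules property exposes the special-case table; assigning to it resets the
# tokenizer's cache and reloads the special cases, as in the regression test above.
rules = dict(tokenizer.rules)
rules.pop(":)", None)  # drop one special case, if present
tokenizer.rules = rules
```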