Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-05 23:06:28 +03:00)

Merge branch 'master' into spacy.io
This merge is contained in commit 02de21d8b4.

.github/contributors/GuiGel.md (vendored, 106 lines, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry              |
| ------------------------------ | ------------------ |
| Name                           | Guillaume Gelabert |
| Company name (if applicable)   |                    |
| Title or role (if applicable)  |                    |
| Date                           | 2019-11-15         |
| GitHub username                | GuiGel             |
| Website (optional)             |                    |

.github/contributors/erip.md (vendored, 106 lines, new file)
@@ -0,0 +1,106 @@
[Same spaCy contributor agreement text as .github/contributors/GuiGel.md above, signed as an individual, followed by these contributor details:]

## Contributor Details

| Field                          | Entry          |
| ------------------------------ | -------------- |
| Name                           | Elijah Rippeth |
| Company name (if applicable)   |                |
| Title or role (if applicable)  |                |
| Date                           | 2019-11-16     |
| GitHub username                | erip           |
| Website (optional)             |                |

@@ -50,15 +50,16 @@ jobs:
     Python36Mac:
       imageName: 'macos-10.13'
       python.version: '3.6'
-    Python37Linux:
-      imageName: 'ubuntu-16.04'
-      python.version: '3.7'
-    Python37Windows:
-      imageName: 'vs2017-win2016'
-      python.version: '3.7'
-    Python37Mac:
-      imageName: 'macos-10.13'
-      python.version: '3.7'
+    # Don't test on 3.7 for now to speed up builds
+    # Python37Linux:
+    #   imageName: 'ubuntu-16.04'
+    #   python.version: '3.7'
+    # Python37Windows:
+    #   imageName: 'vs2017-win2016'
+    #   python.version: '3.7'
+    # Python37Mac:
+    #   imageName: 'macos-10.13'
+    #   python.version: '3.7'
     Python38Linux:
       imageName: 'ubuntu-16.04'
       python.version: '3.8'

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.2"
+__version__ = "2.2.3"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

@@ -529,6 +529,7 @@ class Errors(object):
     E185 = ("Received invalid attribute in component attribute declaration: "
             "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
     E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
+    E187 = ("Only unicode strings are supported as labels.")


 @add_codes

@@ -31,6 +31,10 @@ _latin_u_supplement = r"\u00C0-\u00D6\u00D8-\u00DE"
 _latin_l_supplement = r"\u00DF-\u00F6\u00F8-\u00FF"
 _latin_supplement = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

+_hangul_syllables = r"\uAC00-\uD7AF"
+_hangul_jamo = r"\u1100-\u11FF"
+_hangul = _hangul_syllables + _hangul_jamo
+
 # letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
 _latin_u_extendedA = (
     r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"

@@ -202,7 +206,15 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
 _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower

 _uncased = (
-    _bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu
+    _bengali
+    + _hebrew
+    + _persian
+    + _sinhala
+    + _hindi
+    + _kannada
+    + _tamil
+    + _telugu
+    + _hangul
 )

 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)

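For orientation (not part of the commit): after this change, Hangul syllables and jamo fall inside the ALPHA character group that the tokenizer's punctuation rules are built from. A minimal sketch, assuming the spaCy 2.2.x module path spacy.lang.char_classes:

    # Sketch only: check that Hangul characters are covered by the ALPHA ranges.
    import re
    from spacy.lang.char_classes import ALPHA

    # ALPHA is a string of character ranges intended for use inside a regex class.
    alpha_re = re.compile("[{a}]".format(a=ALPHA))
    assert alpha_re.match("한") is not None  # Hangul syllable (U+D55C) now matches
    assert alpha_re.match("ᄀ") is not None  # Hangul jamo (U+1100) now matches
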
spacy/lang/ko/lex_attrs.py (67 lines, new file)
@@ -0,0 +1,67 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = [
    "영",
    "공",
    # Native Korean number system
    "하나",
    "둘",
    "셋",
    "넷",
    "다섯",
    "여섯",
    "일곱",
    "여덟",
    "아홉",
    "열",
    "스물",
    "서른",
    "마흔",
    "쉰",
    "예순",
    "일흔",
    "여든",
    "아흔",
    # Sino-Korean number system
    "일",
    "이",
    "삼",
    "사",
    "오",
    "육",
    "칠",
    "팔",
    "구",
    "십",
    "백",
    "천",
    "만",
    "십만",
    "백만",
    "천만",
    "일억",
    "십억",
    "백억",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if any(char.lower() in _num_words for char in text):
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}

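For orientation (not part of the commit), a quick sketch of what the new Korean LIKE_NUM hook accepts; the import path is the file added above:

    from spacy.lang.ko.lex_attrs import like_num

    assert like_num("십만")    # Sino-Korean numeral ("hundred thousand")
    assert like_num("둘")      # native Korean numeral ("two")
    assert like_num("3,000")   # separators are stripped before the digit check
    assert not like_num("나무") # a word containing no numeral characters
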
@@ -6,9 +6,7 @@ from ...symbols import ORTH, LEMMA, NORM
 # TODO
 # treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)

-_exc = {
-
-}
+_exc = {}

 # translate / delete what is not necessary
 for exc_data in [

@@ -14,6 +14,7 @@ from .tag_map import TAG_MAP
 def try_jieba_import(use_jieba):
     try:
         import jieba
+
         return jieba
     except ImportError:
         if use_jieba:

@@ -34,7 +35,9 @@ class ChineseTokenizer(DummyTokenizer):
     def __call__(self, text):
         # use jieba
         if self.use_jieba:
-            jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
+            jieba_words = list(
+                [x for x in self.jieba_seg.cut(text, cut_all=False) if x]
+            )
             words = [jieba_words[0]]
             spaces = [False]
             for i in range(1, len(jieba_words)):

@@ -292,13 +292,14 @@ class EntityRuler(object):
             self.add_patterns(patterns)
         else:
             cfg = {}
-            deserializers = {
+            deserializers_patterns = {
                 "patterns": lambda p: self.add_patterns(
                     srsly.read_jsonl(p.with_suffix(".jsonl"))
-                ),
-                "cfg": lambda p: cfg.update(srsly.read_json(p)),
+                )}
+            deserializers_cfg = {
+                "cfg": lambda p: cfg.update(srsly.read_json(p))
             }
-            from_disk(path, deserializers, {})
+            from_disk(path, deserializers_cfg, {})
             self.overwrite = cfg.get("overwrite", False)
             self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
             self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)

@@ -307,6 +308,7 @@ class EntityRuler(object):
                 self.phrase_matcher = PhraseMatcher(
                     self.nlp.vocab, attr=self.phrase_matcher_attr
                 )
+            from_disk(path, deserializers_patterns, {})
         return self

     def to_disk(self, path, **kwargs):

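For orientation (not part of the commit): splitting the deserializers ensures the config (and with it phrase_matcher_attr) is read and the PhraseMatcher rebuilt before the patterns are loaded. The round trip this fixes looks roughly like the regression test added later in this commit; the on-disk path here is hypothetical:

    from spacy.lang.en import English
    from spacy.pipeline import EntityRuler

    nlp = English()
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    ruler.add_patterns([{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}])
    ruler.to_disk("/tmp/entityruler")  # hypothetical path

    # Reloading now honours the stored phrase_matcher_attr ("LOWER")
    ruler_reloaded = EntityRuler(English()).from_disk("/tmp/entityruler")
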
@@ -13,6 +13,7 @@ from thinc.misc import LayerNorm
 from thinc.neural.util import to_categorical
 from thinc.neural.util import get_array_module

+from ..compat import basestring_
 from ..tokens.doc cimport Doc
 from ..syntax.nn_parser cimport Parser
 from ..syntax.ner cimport BiluoPushDown

@@ -547,6 +548,8 @@ class Tagger(Pipe):
         return build_tagger_model(n_tags, **cfg)

     def add_label(self, label, values=None):
+        if not isinstance(label, basestring_):
+            raise ValueError(Errors.E187)
         if label in self.labels:
             return 0
         if self.model not in (True, False, None):

@@ -1016,6 +1019,8 @@ class TextCategorizer(Pipe):
         return float(mean_square_error), d_scores

     def add_label(self, label):
+        if not isinstance(label, basestring_):
+            raise ValueError(Errors.E187)
         if label in self.labels:
             return 0
         if self.model not in (None, True, False):

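For orientation (not part of the commit): with the new basestring_ check, passing a non-string label to Tagger.add_label or TextCategorizer.add_label now fails immediately with E187 instead of breaking later. Roughly, mirroring the tests added below:

    from spacy.language import Language

    nlp = Language()
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("tagger").add_label("A")    # fine: a unicode string
    try:
        nlp.get_pipe("tagger").add_label(9)  # not a string
    except ValueError as err:
        print(err)  # E187: Only unicode strings are supported as labels.
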
@@ -271,7 +271,9 @@ class Scorer(object):
                     self.labelled_per_dep[token.dep_.lower()] = PRFScore()
                 if token.dep_.lower() not in cand_deps_per_dep:
                     cand_deps_per_dep[token.dep_.lower()] = set()
-                cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
+                cand_deps_per_dep[token.dep_.lower()].add(
+                    (gold_i, gold_head, token.dep_.lower())
+                )
         if "-" not in [token[-1] for token in gold.orig_annot]:
             # Find all NER labels in gold and doc
             ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])

@@ -304,7 +306,9 @@ class Scorer(object):
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
         for dep in self.labelled_per_dep:
-            self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
+            self.labelled_per_dep[dep].score_set(
+                cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set())
+            )
         self.unlabelled.score_set(
             set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
         )

@@ -42,11 +42,17 @@ cdef WeightsC get_c_weights(model) except *:
     cdef precompute_hiddens state2vec = model.state2vec
     output.feat_weights = state2vec.get_feat_weights()
    output.feat_bias = <const float*>state2vec.bias.data
-    cdef np.ndarray vec2scores_W = model.vec2scores.W
-    cdef np.ndarray vec2scores_b = model.vec2scores.b
-    cdef np.ndarray class_mask = model._class_mask
-    output.hidden_weights = <const float*>vec2scores_W.data
-    output.hidden_bias = <const float*>vec2scores_b.data
+    cdef np.ndarray vec2scores_W
+    cdef np.ndarray vec2scores_b
+    if model.vec2scores is None:
+        output.hidden_weights = NULL
+        output.hidden_bias = NULL
+    else:
+        vec2scores_W = model.vec2scores.W
+        vec2scores_b = model.vec2scores.b
+        output.hidden_weights = <const float*>vec2scores_W.data
+        output.hidden_bias = <const float*>vec2scores_b.data
+    cdef np.ndarray class_mask = model._class_mask
     output.seen_classes = <const float*>class_mask.data
     return output

@@ -54,6 +60,9 @@ cdef WeightsC get_c_weights(model) except *:
 cdef SizesC get_c_sizes(model, int batch_size) except *:
     cdef SizesC output
     output.states = batch_size
-    output.classes = model.vec2scores.nO
+    if model.vec2scores is None:
+        output.classes = model.state2vec.nO
+    else:
+        output.classes = model.vec2scores.nO
     output.hiddens = model.state2vec.nO
     output.pieces = model.state2vec.nP

@@ -105,11 +114,12 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:

 cdef void predict_states(ActivationsC* A, StateC** states,
         const WeightsC* W, SizesC n) nogil:
+    cdef double one = 1.0
     resize_activations(A, n)
-    memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
-    memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
     for i in range(n.states):
         states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
+    memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
+    memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
     sum_state_features(A.unmaxed,
         W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces)
     for i in range(n.states):

@@ -120,7 +130,9 @@ cdef void predict_states(ActivationsC* A, StateC** states,
             which = Vec.arg_max(&A.unmaxed[index], n.pieces)
             A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
     memset(A.scores, 0, n.states * n.classes * sizeof(float))
-    cdef double one = 1.0
+    if W.hidden_weights == NULL:
+        memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
+    else:
         # Compute hidden-to-output
         blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
             n.states, n.classes, n.hiddens, one,

@@ -219,7 +231,9 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no
 class ParserModel(Model):
     def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None):
         Model.__init__(self)
-        self._layers = [tok2vec, lower_model, upper_model]
+        self._layers = [tok2vec, lower_model]
+        if upper_model is not None:
+            self._layers.append(upper_model)
         self.unseen_classes = set()
         if unseen_classes:
             for class_ in unseen_classes:

@@ -234,6 +248,8 @@ class ParserModel(Model):
         return step_model, finish_parser_update

     def resize_output(self, new_output):
+        if len(self._layers) == 2:
+            return
         if new_output == self.upper.nO:
             return
         smaller = self.upper

@@ -275,11 +291,23 @@ class ParserModel(Model):
 class ParserStepModel(Model):
     def __init__(self, docs, layers, unseen_classes=None, drop=0.):
         self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop)
+        if layers[1].nP >= 2:
+            activation = "maxout"
+        elif len(layers) == 2:
+            activation = None
+        else:
+            activation = "relu"
         self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
-                                            drop=drop)
-        self.vec2scores = layers[-1]
-        self.cuda_stream = util.get_cuda_stream()
+                                            activation=activation, drop=drop)
+        if len(layers) == 3:
+            self.vec2scores = layers[-1]
+        else:
+            self.vec2scores = None
+        self.cuda_stream = util.get_cuda_stream(non_blocking=True)
         self.backprops = []
-        self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
+        if self.vec2scores is None:
+            self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f')
+        else:
+            self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
         self._class_mask.fill(1)
         if unseen_classes is not None:

@@ -302,10 +330,15 @@ class ParserStepModel(Model):
     def begin_update(self, states, drop=0.):
         token_ids = self.get_token_ids(states)
         vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0)
-        mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
-        if mask is not None:
-            vector *= mask
-        scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
+        if self.vec2scores is not None:
+            mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
+            if mask is not None:
+                vector *= mask
+            scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
+        else:
+            scores = NumpyOps().asarray(vector)
+            get_d_vector = lambda d_scores, sgd=None: d_scores
+            mask = None
         # If the class is unseen, make sure its score is minimum
         scores[:, self._class_mask == 0] = numpy.nanmin(scores)

@@ -342,12 +375,12 @@ class ParserStepModel(Model):
         return ids

     def make_updates(self, sgd):
-        # Tells CUDA to block, so our async copies complete.
-        if self.cuda_stream is not None:
-            self.cuda_stream.synchronize()
         # Add a padding vector to the d_tokvecs gradient, so that missing
         # values don't affect the real gradient.
         d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
+        # Tells CUDA to block, so our async copies complete.
+        if self.cuda_stream is not None:
+            self.cuda_stream.synchronize()
         for ids, d_vector, bp_vector in self.backprops:
             d_state_features = bp_vector((d_vector, ids), sgd=sgd)
             ids = ids.flatten()

@@ -385,9 +418,10 @@ cdef class precompute_hiddens:
     cdef np.ndarray bias
     cdef object _cuda_stream
     cdef object _bp_hiddens
+    cdef object activation

     def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
-                 drop=0.):
+                 activation="maxout", drop=0.):
         gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop)
         cdef np.ndarray cached
         if not isinstance(gpu_cached, numpy.ndarray):

@@ -405,6 +439,8 @@ cdef class precompute_hiddens:
         self.nP = getattr(lower_model, 'nP', 1)
         self.nO = cached.shape[2]
         self.ops = lower_model.ops
+        assert activation in (None, "relu", "maxout")
+        self.activation = activation
         self._is_synchronized = False
         self._cuda_stream = cuda_stream
         self._cached = cached

@@ -417,7 +453,7 @@ cdef class precompute_hiddens:
         return <float*>self._cached.data

     def __call__(self, X):
-        return self.begin_update(X)[0]
+        return self.begin_update(X, drop=None)[0]

     def begin_update(self, token_ids, drop=0.):
         cdef np.ndarray state_vector = numpy.zeros(

@@ -450,28 +486,35 @@ cdef class precompute_hiddens:
         else:
             ops = CupyOps()

-        if self.nP == 1:
+        if self.activation == "maxout":
+            state_vector, mask = ops.maxout(state_vector)
+        else:
             state_vector = state_vector.reshape(state_vector.shape[:-1])
-            mask = state_vector >= 0.
-            state_vector *= mask
-        else:
-            state_vector, mask = ops.maxout(state_vector)
+            if self.activation == "relu":
+                mask = state_vector >= 0.
+                state_vector *= mask
+            else:
+                mask = None

         def backprop_nonlinearity(d_best, sgd=None):
             if isinstance(d_best, numpy.ndarray):
                 ops = NumpyOps()
             else:
                 ops = CupyOps()
-            mask_ = ops.asarray(mask)
+            if mask is not None:
+                mask_ = ops.asarray(mask)
             # This will usually be on GPU
             d_best = ops.asarray(d_best)
             # Fix nans (which can occur from unseen classes.)
             d_best[ops.xp.isnan(d_best)] = 0.
-            if self.nP == 1:
+            if self.activation == "maxout":
+                mask_ = ops.asarray(mask)
+                return ops.backprop_maxout(d_best, mask_, self.nP)
+            elif self.activation == "relu":
+                mask_ = ops.asarray(mask)
                 d_best *= mask_
                 d_best = d_best.reshape((d_best.shape + (1,)))
                 return d_best
             else:
-                return ops.backprop_maxout(d_best, mask_, self.nP)
+                return d_best.reshape((d_best.shape + (1,)))
         return state_vector, backprop_nonlinearity

@@ -100,10 +100,30 @@ cdef cppclass StateC:
         free(this.shifted - PADDING)

     void set_context_tokens(int* ids, int n) nogil:
-        if n == 2:
+        if n == 1:
+            if this.B(0) >= 0:
+                ids[0] = this.B(0)
+            else:
+                ids[0] = -1
+        elif n == 2:
             ids[0] = this.B(0)
             ids[1] = this.S(0)
-        if n == 8:
+        elif n == 3:
+            if this.B(0) >= 0:
+                ids[0] = this.B(0)
+            else:
+                ids[0] = -1
+            # First word of entity, if any
+            if this.entity_is_open():
+                ids[1] = this.E(0)
+            else:
+                ids[1] = -1
+            # Last word of entity, if within entity
+            if ids[0] == -1 or ids[1] == -1:
+                ids[2] = -1
+            else:
+                ids[2] = ids[0] - 1
+        elif n == 8:
             ids[0] = this.B(0)
             ids[1] = this.B(1)
             ids[2] = this.S(0)

@@ -22,7 +22,7 @@ from thinc.extra.search cimport Beam
 from thinc.api import chain, clone
 from thinc.v2v import Model, Maxout, Affine
 from thinc.misc import LayerNorm
-from thinc.neural.ops import CupyOps
+from thinc.neural.ops import NumpyOps, CupyOps
 from thinc.neural.util import get_array_module
 from thinc.linalg cimport Vec, VecVec
 import srsly

@@ -61,13 +61,17 @@ cdef class Parser:
         t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
         bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
         self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
-        if depth != 1:
+        nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature)
+        if depth not in (0, 1):
             raise ValueError(TempErrors.T004.format(value=depth))
         parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
                                             cfg.get('maxout_pieces', 2))
         token_vector_width = util.env_opt('token_vector_width',
                                           cfg.get('token_vector_width', 96))
         hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64))
+        if depth == 0:
+            hidden_width = nr_class
+            parser_maxout_pieces = 1
         embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000))
         pretrained_vectors = cfg.get('pretrained_vectors', None)
         tok2vec = Tok2Vec(token_vector_width, embed_size,

@@ -80,16 +84,19 @@ cdef class Parser:
         tok2vec = chain(tok2vec, flatten)
         tok2vec.nO = token_vector_width
         lower = PrecomputableAffine(hidden_width,
-                                    nF=cls.nr_feature, nI=token_vector_width,
+                                    nF=nr_feature_tokens, nI=token_vector_width,
                                     nP=parser_maxout_pieces)
         lower.nP = parser_maxout_pieces
-        with Model.use_device('cpu'):
-            upper = Affine(nr_class, hidden_width, drop_factor=0.0)
-        upper.W *= 0
+        if depth == 1:
+            with Model.use_device('cpu'):
+                upper = Affine(nr_class, hidden_width, drop_factor=0.0)
+            upper.W *= 0
+        else:
+            upper = None

         cfg = {
             'nr_class': nr_class,
+            'nr_feature_tokens': nr_feature_tokens,
             'hidden_depth': depth,
             'token_vector_width': token_vector_width,
             'hidden_width': hidden_width,

@@ -133,6 +140,7 @@ cdef class Parser:
         if 'beam_update_prob' not in cfg:
             cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0)
         cfg.setdefault('cnn_maxout_pieces', 3)
+        cfg.setdefault("nr_feature_tokens", self.nr_feature)
         self.cfg = cfg
         self.model = model
         self._multitasks = []

@@ -299,7 +307,7 @@ cdef class Parser:
         token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
                                 dtype='i', order='C')
         cdef int* c_ids
-        cdef int nr_feature = self.nr_feature
+        cdef int nr_feature = self.cfg["nr_feature_tokens"]
         cdef int n_states
         model = self.model(docs)
         todo = [beam for beam in beams if not beam.is_done]

@@ -502,7 +510,7 @@ cdef class Parser:
             self.moves.preprocess_gold(gold)
         model, finish_update = self.model.begin_update(docs, drop=drop)
         states_d_scores, backprops, beams = _beam_utils.update_beam(
-            self.moves, self.nr_feature, 10000, states, golds, model.state2vec,
+            self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec,
             model.vec2scores, width, drop=drop, losses=losses,
             beam_density=beam_density)
         for i, d_scores in enumerate(states_d_scores):

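For orientation (not part of the commit): the new "nr_feature_tokens" setting can be passed through the component config when training begins, which is exactly what the NER test added later in this commit does:

    from spacy.lang.en import English

    nlp = English()
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("PERSON")
    nlp.begin_training(component_cfg={"ner": {"nr_feature_tokens": 3}})
    assert ner.model.lower.nF == 3  # lower layer now uses 3 context tokens
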
@@ -2,6 +2,7 @@
 from __future__ import unicode_literals

 import pytest
+import re
 from spacy.lang.en import English
 from spacy.tokenizer import Tokenizer
 from spacy.util import compile_prefix_regex, compile_suffix_regex

@@ -19,13 +20,14 @@ def custom_en_tokenizer(en_vocab):
         r"[\[\]!&:,()\*—–\/-]",
     ]
     infix_re = compile_infix_regex(custom_infixes)
+    token_match_re = re.compile("a-b")
     return Tokenizer(
         en_vocab,
         English.Defaults.tokenizer_exceptions,
         prefix_re.search,
         suffix_re.search,
         infix_re.finditer,
-        token_match=None,
+        token_match=token_match_re.match,
     )

@@ -74,3 +76,81 @@ def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
         "Megaregion",
         ".",
     ]
+
+
+def test_en_customized_tokenizer_handles_token_match(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions a-b not used for the greater Southern California Megaregion."
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "a-b",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+    ]
+
+
+def test_en_customized_tokenizer_handles_rules(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "are",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+        ":)",
+    ]
+
+
+def test_en_customized_tokenizer_handles_rules_property(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
+    rules = custom_en_tokenizer.rules
+    del rules[":)"]
+    custom_en_tokenizer.rules = rules
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "are",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+        ":",
+        ")",
+    ]

@@ -259,6 +259,27 @@ def test_block_ner():
     assert [token.ent_type_ for token in doc] == expected_types


+def test_change_number_features():
+    # Test the default number features
+    nlp = English()
+    ner = nlp.create_pipe("ner")
+    nlp.add_pipe(ner)
+    ner.add_label("PERSON")
+    nlp.begin_training()
+    assert ner.model.lower.nF == ner.nr_feature
+    # Test we can change it
+    nlp = English()
+    ner = nlp.create_pipe("ner")
+    nlp.add_pipe(ner)
+    ner.add_label("PERSON")
+    nlp.begin_training(
+        component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}}
+    )
+    assert ner.model.lower.nF == 3
+    # Test the model runs
+    nlp("hello world")
+
+
 class BlockerComponent1(object):
     name = "my_blocker"

spacy/tests/pipeline/test_tagger.py (14 lines, new file)
@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.language import Language
from spacy.pipeline import Tagger


def test_label_types():
    nlp = Language()
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("tagger").add_label("A")
    with pytest.raises(ValueError):
        nlp.get_pipe("tagger").add_label(9)

@@ -62,3 +62,11 @@ def test_textcat_learns_multilabel():
             assert score < 0.5
         else:
             assert score > 0.5
+
+
+def test_label_types():
+    nlp = Language()
+    nlp.add_pipe(nlp.create_pipe("textcat"))
+    nlp.get_pipe("textcat").add_label("answer")
+    with pytest.raises(ValueError):
+        nlp.get_pipe("textcat").add_label(9)

@@ -3,9 +3,9 @@ from __future__ import unicode_literals

 import srsly
 from spacy.gold import GoldCorpus

 from spacy.lang.en import English
-from spacy.tests.util import make_tempdir
+from ..util import make_tempdir


 def test_issue4402():

@@ -1,7 +1,6 @@
 # coding: utf-8
 from __future__ import unicode_literals

-import pytest
 from mock import Mock
 from spacy.matcher import DependencyMatcher
 from ..util import get_doc

@@ -11,8 +10,14 @@ def test_issue4590(en_vocab):
     """Test that matches param in on_match method are the same as matches run with no on_match method"""
     pattern = [
         {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
+        {
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
+        {
+            "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
     ]

     on_match = Mock()

@@ -31,4 +36,3 @@ def test_issue4590(en_vocab):
     on_match_args = on_match.call_args

     assert on_match_args[0][3] == matches
-

spacy/tests/regression/test_issue4651.py (65 lines, new file)
@@ -0,0 +1,65 @@
# coding: utf-8
from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

from ..util import make_tempdir


def test_issue4651_with_phrase_matcher_attr():
    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
    the method from_disk when the EntityRuler argument phrase_matcher_attr is
    specified.
    """
    text = "Spacy is a python library for nlp"

    nlp = English()
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    doc = nlp(text)
    res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]

    nlp_reloaded = English()
    with make_tempdir() as d:
        file_path = d / "entityruler"
        ruler.to_disk(file_path)
        ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)

    nlp_reloaded.add_pipe(ruler_reloaded)
    doc_reloaded = nlp_reloaded(text)
    res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]

    assert res == res_reloaded


def test_issue4651_without_phrase_matcher_attr():
    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
    the method from_disk when the EntityRuler argument phrase_matcher_attr is
    not specified.
    """
    text = "Spacy is a python library for nlp"

    nlp = English()
    ruler = EntityRuler(nlp)
    patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    doc = nlp(text)
    res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]

    nlp_reloaded = English()
    with make_tempdir() as d:
        file_path = d / "entityruler"
        ruler.to_disk(file_path)
        ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)

    nlp_reloaded.add_pipe(ruler_reloaded)
    doc_reloaded = nlp_reloaded(text)
    res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]

    assert res == res_reloaded

@@ -12,8 +12,22 @@ from .util import get_doc
 test_las_apple = [
     [
         "Apple is looking at buying U.K. startup for $ 1 billion",
-        {"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
-         "deps": ['nsubj', 'aux', 'ROOT', 'prep', 'pcomp', 'compound', 'dobj', 'prep', 'quantmod', 'compound', 'pobj']},
+        {
+            "heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
+            "deps": [
+                "nsubj",
+                "aux",
+                "ROOT",
+                "prep",
+                "pcomp",
+                "compound",
+                "dobj",
+                "prep",
+                "quantmod",
+                "compound",
+                "pobj",
+            ],
+        },
     ]
 ]

@@ -59,7 +73,7 @@ def test_las_per_type(en_vocab):
         en_vocab,
         words=input_.split(" "),
         heads=([h - i for i, h in enumerate(annot["heads"])]),
-        deps=annot["deps"]
+        deps=annot["deps"],
     )
     gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"])
     doc[0].dep_ = "compound"

spacy/tests/tokenizer/test_explain.py (65 lines, new file)
@@ -0,0 +1,65 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.util import get_lang_class

# Only include languages with no external dependencies
# "is" seems to confuse importlib, so we're also excluding it for now
# excluded: ja, ru, th, uk, vi, zh, is
LANGUAGES = [
    pytest.param("fr", marks=pytest.mark.slow()),
    pytest.param("af", marks=pytest.mark.slow()),
    pytest.param("ar", marks=pytest.mark.slow()),
    pytest.param("bg", marks=pytest.mark.slow()),
    "bn",
    pytest.param("ca", marks=pytest.mark.slow()),
    pytest.param("cs", marks=pytest.mark.slow()),
    pytest.param("da", marks=pytest.mark.slow()),
    pytest.param("de", marks=pytest.mark.slow()),
    "el",
    "en",
    pytest.param("es", marks=pytest.mark.slow()),
    pytest.param("et", marks=pytest.mark.slow()),
    pytest.param("fa", marks=pytest.mark.slow()),
    pytest.param("fi", marks=pytest.mark.slow()),
    "fr",
    pytest.param("ga", marks=pytest.mark.slow()),
    pytest.param("he", marks=pytest.mark.slow()),
    pytest.param("hi", marks=pytest.mark.slow()),
    pytest.param("hr", marks=pytest.mark.slow()),
    "hu",
    pytest.param("id", marks=pytest.mark.slow()),
    pytest.param("it", marks=pytest.mark.slow()),
    pytest.param("kn", marks=pytest.mark.slow()),
    pytest.param("lb", marks=pytest.mark.slow()),
    pytest.param("lt", marks=pytest.mark.slow()),
    pytest.param("lv", marks=pytest.mark.slow()),
    pytest.param("nb", marks=pytest.mark.slow()),
    pytest.param("nl", marks=pytest.mark.slow()),
    "pl",
    pytest.param("pt", marks=pytest.mark.slow()),
    pytest.param("ro", marks=pytest.mark.slow()),
    pytest.param("si", marks=pytest.mark.slow()),
    pytest.param("sk", marks=pytest.mark.slow()),
    pytest.param("sl", marks=pytest.mark.slow()),
    pytest.param("sq", marks=pytest.mark.slow()),
    pytest.param("sr", marks=pytest.mark.slow()),
    pytest.param("sv", marks=pytest.mark.slow()),
    pytest.param("ta", marks=pytest.mark.slow()),
    pytest.param("te", marks=pytest.mark.slow()),
    pytest.param("tl", marks=pytest.mark.slow()),
    pytest.param("tr", marks=pytest.mark.slow()),
    pytest.param("tt", marks=pytest.mark.slow()),
    pytest.param("ur", marks=pytest.mark.slow()),
]


@pytest.mark.parametrize("lang", LANGUAGES)
def test_tokenizer_explain(lang):
    tokenizer = get_lang_class(lang).Defaults.create_tokenizer()
    examples = pytest.importorskip("spacy.lang.{}.examples".format(lang))
    for sentence in examples.sentences:
        tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
        debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
        assert tokens == debug_tokens

@@ -15,6 +15,8 @@ import re
from .tokens.doc cimport Doc
from .strings cimport hash_string
from .compat import unescape_unicode
+from .attrs import intify_attrs
+from .symbols import ORTH

from .errors import Errors, Warnings, deprecation_warning
from . import util

@@ -57,9 +59,7 @@ cdef class Tokenizer:
        self.infix_finditer = infix_finditer
        self.vocab = vocab
        self._rules = {}
-        if rules is not None:
-            for chunk, substrings in sorted(rules.items()):
-                self.add_special_case(chunk, substrings)
+        self._load_special_tokenization(rules)

    property token_match:
        def __get__(self):

@@ -93,6 +93,18 @@ cdef class Tokenizer:
            self._infix_finditer = infix_finditer
            self._flush_cache()

+    property rules:
+        def __get__(self):
+            return self._rules
+
+        def __set__(self, rules):
+            self._rules = {}
+            self._reset_cache([key for key in self._cache])
+            self._reset_specials()
+            self._cache = PreshMap()
+            self._specials = PreshMap()
+            self._load_special_tokenization(rules)
+
    def __reduce__(self):
        args = (self.vocab,
                self._rules,

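For context, the `rules` property added above makes the special cases both readable and replaceable at run time. A minimal usage sketch (the `dont` special case is an illustrative assumption, not part of this commit):

```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
# Reading the property returns the tokenizer's current special-case rules.
assert len(nlp.tokenizer.rules) > 0
# Assigning replaces all special cases; per __set__ above, the token cache and
# specials map are flushed before the new rules are loaded.
nlp.tokenizer.rules = {"dont": [{ORTH: "do"}, {ORTH: "nt"}]}
assert "dont" in nlp.tokenizer.rules
assert [t.text for t in nlp.tokenizer("dont")] == ["do", "nt"]
```
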
@@ -227,10 +239,6 @@ cdef class Tokenizer:
        cdef unicode minus_suf
        cdef size_t last_size = 0
        while string and len(string) != last_size:
-            if self.token_match and self.token_match(string) \
-                    and not self.find_prefix(string) \
-                    and not self.find_suffix(string):
-                break
            if self._specials.get(hash_string(string)) != NULL:
                has_special[0] = 1
                break

@@ -393,6 +401,7 @@ cdef class Tokenizer:

    def _load_special_tokenization(self, special_cases):
        """Add special-case tokenization rules."""
+        if special_cases is not None:
            for chunk, substrings in sorted(special_cases.items()):
                self.add_special_case(chunk, substrings)

@@ -423,6 +432,73 @@ cdef class Tokenizer:
        self.mem.free(stale_cached)
        self._rules[string] = substrings

+    def explain(self, text):
+        """A debugging tokenizer that provides information about which
+        tokenizer rule or pattern was matched for each token. The tokens
+        produced are identical to `nlp.tokenizer()` except for whitespace
+        tokens.
+
+        string (unicode): The string to tokenize.
+        RETURNS (list): A list of (pattern_string, token_string) tuples
+
+        DOCS: https://spacy.io/api/tokenizer#explain
+        """
+        prefix_search = self.prefix_search
+        suffix_search = self.suffix_search
+        infix_finditer = self.infix_finditer
+        token_match = self.token_match
+        special_cases = {}
+        for orth, special_tokens in self.rules.items():
+            special_cases[orth] = [intify_attrs(special_token, strings_map=self.vocab.strings, _do_deprecated=True) for special_token in special_tokens]
+        tokens = []
+        for substring in text.split():
+            suffixes = []
+            while substring:
+                while prefix_search(substring) or suffix_search(substring):
+                    if substring in special_cases:
+                        tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
+                        substring = ''
+                        break
+                    if prefix_search(substring):
+                        split = prefix_search(substring).end()
+                        # break if pattern matches the empty string
+                        if split == 0:
+                            break
+                        tokens.append(("PREFIX", substring[:split]))
+                        substring = substring[split:]
+                        if substring in special_cases:
+                            continue
+                    if suffix_search(substring):
+                        split = suffix_search(substring).start()
+                        # break if pattern matches the empty string
+                        if split == len(substring):
+                            break
+                        suffixes.append(("SUFFIX", substring[split:]))
+                        substring = substring[:split]
+                if substring in special_cases:
+                    tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
+                    substring = ''
+                elif token_match(substring):
+                    tokens.append(("TOKEN_MATCH", substring))
+                    substring = ''
+                elif list(infix_finditer(substring)):
+                    infixes = infix_finditer(substring)
+                    offset = 0
+                    for match in infixes:
+                        if substring[offset : match.start()]:
+                            tokens.append(("TOKEN", substring[offset : match.start()]))
+                        if substring[match.start() : match.end()]:
+                            tokens.append(("INFIX", substring[match.start() : match.end()]))
+                        offset = match.end()
+                    if substring[offset:]:
+                        tokens.append(("TOKEN", substring[offset:]))
+                    substring = ''
+                elif substring:
+                    tokens.append(("TOKEN", substring))
+                    substring = ''
+            tokens.extend(reversed(suffixes))
+        return tokens
+
    def to_disk(self, path, **kwargs):
        """Save the current state to a directory.

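For orientation, a minimal usage sketch of the `explain` method added above, mirroring the example this commit adds to the API docs further down:

```python
from spacy.lang.en import English

nlp = English()
# Each tuple pairs the matched rule or pattern name with the token text.
tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
```
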
@@ -507,8 +583,7 @@ cdef class Tokenizer:
        self._reset_specials()
        self._cache = PreshMap()
        self._specials = PreshMap()
-        for string, substrings in data.get("rules", {}).items():
-            self.add_special_case(string, substrings)
+        self._load_special_tokenization(data.get("rules", {}))

        return self

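Because deserialization now routes through `_load_special_tokenization`, special-case rules survive a save/load round trip. A small sketch (the temporary-file handling is illustrative, not from this commit):

```python
import tempfile
from pathlib import Path

from spacy.lang.en import English

nlp = English()
assert len(nlp.tokenizer.rules) > 0  # English ships with tokenizer exceptions

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"
    nlp.tokenizer.to_disk(path)    # serializes rules along with the affix patterns
    nlp.tokenizer.from_disk(path)  # reloads them via _load_special_tokenization

assert len(nlp.tokenizer.rules) > 0
```
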
@@ -301,13 +301,13 @@ def get_component_name(component):
    return repr(component)


-def get_cuda_stream(require=False):
+def get_cuda_stream(require=False, non_blocking=True):
    if CudaStream is None:
        return None
    elif isinstance(Model.ops, NumpyOps):
        return None
    else:
-        return CudaStream()
+        return CudaStream(non_blocking=non_blocking)


def get_async(stream, numpy_array):

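A quick sketch of the updated helper, assuming it lives in `spacy.util` as the surrounding `get_component_name`/`get_async` context suggests:

```python
from spacy.util import get_cuda_stream

# Still returns None on CPU-only setups (NumpyOps or no CuPy); on GPU the
# stream is now created with non_blocking=True unless overridden.
stream = get_cuda_stream()
blocking_stream = get_cuda_stream(non_blocking=False)
print(stream, blocking_stream)
```
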
@@ -265,16 +265,11 @@ cdef class Vectors:
            rows = [self.key2row.get(key, -1.) for key in keys]
            return xp.asarray(rows, dtype="i")
        else:
-            targets = set()
+            row2key = {row: key for key, row in self.key2row.items()}
            if row is not None:
-                targets.add(row)
+                return row2key[row]
            else:
-                targets.update(rows)
-                results = []
-                for key, row in self.key2row.items():
-                    if row in targets:
-                        results.append(key)
-                        targets.remove(row)
+                results = [row2key[row] for row in rows]
                return xp.asarray(results, dtype="uint64")

    def add(self, key, *, vector=None, row=None):

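The rewritten branch builds a reverse `row2key` mapping once instead of scanning `key2row` for every target row. A rough sketch of the lookups it serves (the toy vector data is an assumption for illustration):

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 4))
vectors.add("cat", vector=numpy.asarray([1.0, 2.0, 3.0, 4.0], dtype="f"))

row = vectors.find(key="cat")         # key -> row index
key = vectors.find(row=int(row))      # row -> hash key, now a direct dict lookup
keys = vectors.find(rows=[int(row)])  # rows -> array of hash keys
assert keys[0] == key
```
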
@@ -58,4 +58,5 @@ Update the evaluation scores from a single [`Doc`](/api/doc) /
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
| `textcat_score` <Tag variant="new">2.2</Tag> | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). |
| `textcats_per_cat` <Tag variant="new">2.2</Tag> | dict | Scores per textcat label, keyed by label. |
+| `las_per_type` <Tag variant="new">2.2.3</Tag> | dict | Labelled dependency scores, keyed by label. |
| `scores` | dict | All scores, keyed by type. |

@@ -35,13 +35,13 @@ the
> ```

| Name             | Type        | Description                                                                             |
-| ---------------- | ----------- | ------------------------------------------------------------------------------------ |
+| ---------------- | ----------- | --------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab`     | A storage container for lexical types.                                                  |
| `rules`          | dict        | Exceptions and special-cases for the tokenizer.                                         |
| `prefix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match prefixes.    |
| `suffix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match suffixes.    |
| `infix_finditer` | callable    | A function matching the signature of `re.compile(string).finditer` to find infixes.    |
-| `token_match`    | callable    | A boolean function matching strings to be recognized as tokens.                        |
+| `token_match`    | callable    | A function matching the signature of `re.compile(string).match` to find token matches. |
| **RETURNS**      | `Tokenizer` | The newly constructed object.                                                           |

## Tokenizer.\_\_call\_\_ {#call tag="method"}

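As a hedged illustration of the `token_match` signature described in the row above (the hyphen infix and the regexes are assumptions for the sketch, not part of this commit):

```python
import re

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
infix_re = re.compile(r"-")
hyphen_word_re = re.compile(r"^[A-Za-z]+(-[A-Za-z]+)+$")

# token_match is checked before infixes, so fully hyphenated words stay whole
# even though the infix pattern would otherwise split on every hyphen.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    infix_finditer=infix_re.finditer,
    token_match=hyphen_word_re.match,
)
doc = nlp("state-of-the-art results")
assert [t.text for t in doc] == ["state-of-the-art", "results"]
```
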
@@ -128,6 +128,25 @@ and examples.
| `string`      | unicode  | The string to specially tokenize.                                                                                                                                         |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |

+## Tokenizer.explain {#explain tag="method"}
+
+Tokenize a string with a slow debugging tokenizer that provides information
+about which tokenizer rule or pattern was matched for each token. The tokens
+produced are identical to `Tokenizer.__call__` except for whitespace tokens.
+
+> #### Example
+>
+> ```python
+> tok_exp = nlp.tokenizer.explain("(don't)")
+> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
+> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
+> ```
+
+| Name        | Type    | Description                                           |
+| ----------- | ------- | ----------------------------------------------------- |
+| `string`    | unicode | The string to tokenize with the debugging tokenizer.  |
+| **RETURNS** | list    | A list of `(pattern_string, token_string)` tuples.    |
+
## Tokenizer.to_disk {#to_disk tag="method"}

Serialize the tokenizer to disk.

@@ -199,11 +218,13 @@ it.
## Attributes {#attributes}

| Name             | Type    | Description                                                                                                                    |
-| ---------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------- |
+| ---------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `vocab`          | `Vocab` | The vocab object of the parent `Doc`.                                                                                          |
| `prefix_search`  | -       | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`.               |
| `suffix_search`  | -       | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`.                 |
| `infix_finditer` | -       | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects.    |
+| `token_match`    | -       | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
+| `rules`          | dict    | A dictionary of tokenizer exceptions and special cases.                                                                        |

## Serialization fields {#serialization-fields}

@@ -792,6 +792,33 @@ The algorithm can be summarized as follows:
   tokens on all infixes.
8. Once we can't consume any more of the string, handle it as a single token.

+#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
+
+A working implementation of the pseudo-code above is available for debugging as
+[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
+tuples showing which tokenizer rule or pattern was matched for each token. The
+tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+
+nlp = English()
+text = '''"Let's go!"'''
+doc = nlp(text)
+tok_exp = nlp.tokenizer.explain(text)
+assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
+for t in tok_exp:
+    print(t[1], "\\t", t[0])
+
+# " PREFIX
+# Let SPECIAL-1
+# 's SPECIAL-2
+# go TOKEN
+# ! SUFFIX
+# " SUFFIX
+```
+
### Customizing spaCy's Tokenizer class {#native-tokenizers}

Let's imagine you wanted to create a tokenizer for a new language or specific

@@ -1679,13 +1679,14 @@
      "slogan": "Information extraction from English and German texts based on predicate logic",
      "github": "msg-systems/holmes-extractor",
      "url": "https://github.com/msg-systems/holmes-extractor",
-      "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
+      "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.xt.msg.team).",
      "pip": "holmes-extractor",
      "category": ["conversational", "standalone"],
      "tags": ["chatbots", "text-processing"],
+      "thumb": "https://raw.githubusercontent.com/msg-systems/holmes-extractor/master/docs/holmes_thumbnail.png",
      "code_example": [
        "import holmes_extractor as holmes",
-        "holmes_manager = holmes.Manager(model='en_coref_lg')",
+        "holmes_manager = holmes.Manager(model='en_core_web_lg')",
        "holmes_manager.register_search_phrase('A big dog chases a cat')",
        "holmes_manager.start_chatbot_mode_console()"
      ],