Update from develop

Matthew Honnibal 2019-03-08 16:56:54 +01:00
commit 4cf897e8e1
44 changed files with 2179 additions and 1407 deletions

106
.github/contributors/adrienball.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | ------------------------------- |
| Name | Adrien Ball |
| Company name (if applicable) | |
| Title or role (if applicable) | Machine Learning Engineer |
| Date | 2019-03-07 |
| GitHub username | adrienball |
| Website (optional) | https://medium.com/@adrien_ball |

106
.github/contributors/danielkingai2.md vendored Normal file
View File

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Daniel King |
| Company name (if applicable) | Allen Institute for Artificial Intelligence |
| Title or role (if applicable) | Predoctoral Young Investigator |
| Date | 03/06/2019 |
| GitHub username | danielkingai2 |
| Website (optional) | |

View File

@@ -27,14 +27,17 @@ def download(model, direct=False, *pip_args):
     can be shortcut, model name or, if --direct flag is set, full model name
     with version. For direct downloads, the compatibility check will be skipped.
     """
+    dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
     if direct:
-        dl = download_model("{m}/{m}.tar.gz#egg={m}".format(m=model), pip_args)
+        components = model.split("-")
+        model_name = "".join(components[:-1])
+        version = components[-1]
+        dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
     else:
         shortcuts = get_json(about.__shortcuts__, "available shortcuts")
         model_name = shortcuts.get(model, model)
         compatibility = get_compatibility()
         version = get_version(model_name, compatibility)
-        dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
         dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
     if dl != 0:  # if download subprocess doesn't return 0, exit
         sys.exit(dl)
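The direct branch in the hunk above now splits a full model name into a name and a version and feeds both into the same URL template as the shortcut branch. A minimal sketch of that step, assuming names of the form `name-version` such as `en_core_web_sm-2.1.0`; `split_direct_name` and `build_download_path` are hypothetical helper names used only for illustration and are not part of spaCy:

```python
# Same template string as in the diff above.
DL_TPL = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"


def split_direct_name(model):
    """Split 'name-version' into (name, version), mirroring the diff."""
    components = model.split("-")
    # Everything before the last hyphen is treated as the model name.
    model_name = "".join(components[:-1])
    version = components[-1]
    return model_name, version


def build_download_path(model):
    """Fill the URL template with the parsed name and version."""
    model_name, version = split_direct_name(model)
    return DL_TPL.format(m=model_name, v=version)


print(build_download_path("en_core_web_sm-2.1.0"))
# en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
```

Because the pieces are rejoined with an empty string, a model name that itself contained a hyphen would lose it; the sketch keeps that behaviour rather than guessing at the intended one.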

View File

@@ -1,4 +1,11 @@
 # coding: utf8
+"""
+Helpers for Python and platform compatibility. To distinguish them from
+the builtin functions, replacement functions are suffixed with an underscore,
+e.g. `unicode_`.
+
+DOCS: https://spacy.io/api/top-level#compat
+"""
 from __future__ import unicode_literals
 
 import os
@@ -64,19 +71,23 @@ elif is_python3:
 
 
 def b_to_str(b_str):
+    """Convert a bytes object to a string.
+
+    b_str (bytes): The object to convert.
+    RETURNS (unicode): The converted string.
+    """
     if is_python2:
         return b_str
-    # important: if no encoding is set, string becomes "b'...'"
+    # Important: if no encoding is set, string becomes "b'...'"
     return str(b_str, encoding="utf8")
 
 
-def getattr_(obj, name, *default):
-    if is_python3 and isinstance(name, bytes):
-        name = name.decode("utf8")
-    return getattr(obj, name, *default)
-
-
 def symlink_to(orig, dest):
+    """Create a symlink. Used for model shortcut links.
+
+    orig (unicode / Path): The origin path.
+    dest (unicode / Path): The destination path of the symlink.
+    """
     if is_windows:
         import subprocess
 
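The comment in `b_to_str` above notes that converting bytes without an encoding yields the `b'...'` representation rather than the decoded text. A small sketch of the difference, assuming Python 3 only (the Python 2 branch of the original is left out):

```python
def b_to_str(b_str):
    """Python 3 only sketch: decode bytes to text using UTF-8."""
    return str(b_str, encoding="utf8")


raw = b"caf\xc3\xa9"

# Without an encoding, str() returns the repr of the bytes object.
print(str(b"abc"))    # b'abc'

# With an encoding, the bytes are actually decoded.
print(b_to_str(raw))  # café
```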
@@ -86,6 +97,10 @@ def symlink_to(orig, dest):
 
 
 def symlink_remove(link):
+    """Remove a symlink. Used for model shortcut links.
+
+    link (unicode / Path): The path to the symlink.
+    """
     # https://stackoverflow.com/q/26554135/6400719
     if os.path.isdir(path2str(link)) and is_windows:
         # this should only be on Py2.7 and windows
@@ -95,6 +110,18 @@ def symlink_remove(link):
 
 
 def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
+    """Check if a specific configuration of Python version and operating system
+    matches the user's setup. Mostly used to display targeted error messages.
+
+    python2 (bool): spaCy is executed with Python 2.x.
+    python3 (bool): spaCy is executed with Python 3.x.
+    windows (bool): spaCy is executed on Windows.
+    linux (bool): spaCy is executed on Linux.
+    osx (bool): spaCy is executed on OS X or macOS.
+    RETURNS (bool): Whether the configuration matches the user's platform.
+
+    DOCS: https://spacy.io/api/top-level#compat.is_config
+    """
     return (
         python2 in (None, is_python2)
         and python3 in (None, is_python3)
@@ -104,19 +131,14 @@ def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
     )
 
 
-def normalize_string_keys(old):
-    """Given a dictionary, make sure keys are unicode strings, not bytes."""
-    new = {}
-    for key, value in old.items():
-        if isinstance(key, bytes_):
-            new[key.decode("utf8")] = value
-        else:
-            new[key] = value
-    return new
-
-
 def import_file(name, loc):
-    loc = str(loc)
+    """Import module from a file. Used to load models from a directory.
+
+    name (unicode): Name of module to load.
+    loc (unicode / Path): Path to the file.
+    RETURNS: The loaded module.
+    """
+    loc = path2str(loc)
     if is_python_pre_3_5:
         import imp
 
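The `is_config` docstring above describes a check where every argument left as `None` is ignored and every other argument must equal the detected platform flag. A self-contained sketch of that `value in (None, flag)` pattern follows. The diff only shows the first two clauses of the return expression, so the remaining clauses and the way the flags are detected here are assumptions made for illustration:

```python
import sys

# Assumed flag definitions; the real module defines its own.
is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith("win")
is_linux = sys.platform.startswith("linux")
is_osx = sys.platform == "darwin"


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Return True if every non-None argument equals the detected flag."""
    return (
        python2 in (None, is_python2)
        and python3 in (None, is_python3)
        # The clauses below are assumed to follow the same pattern.
        and windows in (None, is_windows)
        and linux in (None, is_linux)
        and osx in (None, is_osx)
    )


# No constraints given: always matches.
print(is_config())
# A constraint that contradicts the detected flag never matches.
print(is_config(python3=not is_python3))
```

The tuple membership test keeps the function to a single expression: `None` means "don't care", and a boolean only passes when it equals the detected value.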

View File

@@ -1,4 +1,10 @@
 # coding: utf8
+"""
+spaCy's built in visualization suite for dependencies and named entities.
+
+DOCS: https://spacy.io/api/top-level#displacy
+USAGE: https://spacy.io/usage/visualizers
+"""
 from __future__ import unicode_literals
 
 from .render import DependencyRenderer, EntityRenderer
@@ -25,6 +31,9 @@ def render(
     options (dict): Visualiser-specific options, e.g. colors.
     manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
     RETURNS (unicode): Rendered HTML markup.
+
+    DOCS: https://spacy.io/api/top-level#displacy.render
+    USAGE: https://spacy.io/usage/visualizers
     """
     factories = {
         "dep": (DependencyRenderer, parse_deps),
@@ -71,6 +80,9 @@ def serve(
     manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
     port (int): Port to serve visualisation.
     host (unicode): Host to serve visualisation.
+
+    DOCS: https://spacy.io/api/top-level#displacy.serve
+    USAGE: https://spacy.io/usage/visualizers
     """
     from wsgiref import simple_server
 

View File

@@ -338,6 +338,17 @@ class Errors(object):
             "or with a getter AND setter.")
     E120 = ("Can't set custom extension attributes during retokenization. "
             "Expected dict mapping attribute names to values, but got: {value}")
+    E121 = ("Can't bulk merge spans. Attribute length {attr_len} should be "
+            "equal to span length ({span_len}).")
+    E122 = ("Cannot find token to be split. Did it get merged?")
+    E123 = ("Cannot find head of token to be split. Did it get merged?")
+    E124 = ("Cannot read from file: {path}. Supported formats: .json, .msg")
+    E125 = ("Unexpected value: {value}")
+    E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
+            "This is likely a bug in spaCy, so feel free to open an issue.")
+    E127 = ("Cannot create phrase pattern representation for length 0. This "
+            "is likely a bug in spaCy.")
 
 
 @add_codes
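The new messages above are plain format strings stored as class attributes, and the class is wrapped by an `add_codes` decorator. The decorator's body is not part of this diff, so the following is only a sketch of how such a decorator could prefix each message with its attribute name; the implementation shown is an assumption, not a copy of the project's code:

```python
def add_codes(err_cls):
    """Assumed behaviour: prefix each message with its code, e.g. '[E124] ...'."""

    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            # Look the message up on the wrapped class, then add the code.
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)

    return ErrorsWithCodes()


@add_codes
class Errors(object):
    # Two of the messages added in the hunk above.
    E124 = "Cannot read from file: {path}. Supported formats: .json, .msg"
    E125 = "Unexpected value: {value}"


message = Errors.E124.format(path="corpus.txt")
print(message)
# [E124] Cannot read from file: corpus.txt. Supported formats: .json, .msg
```

Call sites then only fill in the placeholders, as in `Errors.E124.format(path=...)`, and the code is attached in one place.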

View File

@@ -14,34 +14,38 @@ from . import _align
 from .syntax import nonproj
 from .tokens import Doc, Span
 from .errors import Errors
+from .compat import path2str
 from . import util
 from .util import minibatch, itershuffle
 
 from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek
 
 
+punct_re = re.compile(r"\W")
+
+
 def tags_to_entities(tags):
     entities = []
     start = None
     for i, tag in enumerate(tags):
         if tag is None:
             continue
-        if tag.startswith('O'):
+        if tag.startswith("O"):
             # TODO: We shouldn't be getting these malformed inputs. Fix this.
             if start is not None:
                 start = None
             continue
-        elif tag == '-':
+        elif tag == "-":
             continue
-        elif tag.startswith('I'):
+        elif tag.startswith("I"):
             if start is None:
-                raise ValueError(Errors.E067.format(tags=tags[:i+1]))
+                raise ValueError(Errors.E067.format(tags=tags[:i + 1]))
             continue
-        if tag.startswith('U'):
+        if tag.startswith("U"):
             entities.append((tag[2:], i, i))
-        elif tag.startswith('B'):
+        elif tag.startswith("B"):
             start = i
-        elif tag.startswith('L'):
+        elif tag.startswith("L"):
             entities.append((tag[2:], start, i))
             start = None
         else:
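The `tags_to_entities` function in the hunk above walks a BILUO tag sequence and collects `(label, start, end)` triples with inclusive token indices. A standalone sketch of the same walk, with plain `ValueError` messages standing in for the `Errors.E067` style codes, which are not reproduced here:

```python
def tags_to_entities(tags):
    """Collect (label, start, end) spans from BILUO tags; end is inclusive."""
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag is None or tag == "-":
            # Missing annotation: nothing to record.
            continue
        if tag.startswith("O"):
            start = None
        elif tag.startswith("I"):
            if start is None:
                # Placeholder message; the original raises a coded error.
                raise ValueError("I- tag without open entity: %r" % (tags[: i + 1],))
        elif tag.startswith("U"):
            entities.append((tag[2:], i, i))
        elif tag.startswith("B"):
            start = i
        elif tag.startswith("L"):
            entities.append((tag[2:], start, i))
            start = None
        else:
            raise ValueError("Unexpected tag: %r" % (tag,))
    return entities


tags = ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]
print(tags_to_entities(tags))
# [('ORG', 1, 3), ('GPE', 5, 5)]
```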
@@ -60,19 +64,18 @@ def merge_sents(sents):
         m_deps[3].extend(head + i for head in heads)
         m_deps[4].extend(labels)
         m_deps[5].extend(ner)
-        m_brackets.extend((b['first'] + i, b['last'] + i, b['label'])
+        m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
                           for b in brackets)
         i += len(ids)
     return [(m_deps, m_brackets)]
 
 
-punct_re = re.compile(r'\W')
 def align(cand_words, gold_words):
     if cand_words == gold_words:
         alignment = numpy.arange(len(cand_words))
         return 0, alignment, alignment, {}, {}
-    cand_words = [w.replace(' ', '').lower() for w in cand_words]
-    gold_words = [w.replace(' ', '').lower() for w in gold_words]
+    cand_words = [w.replace(" ", "").lower() for w in cand_words]
+    gold_words = [w.replace(" ", "").lower() for w in gold_words]
     cost, i2j, j2i, matrix = _align.align(cand_words, gold_words)
     i2j_multi, j2i_multi = _align.multi_align(i2j, j2i, [len(w) for w in cand_words],
                                 [len(w) for w in gold_words])
@@ -89,7 +92,10 @@ def align(cand_words, gold_words):
 
 class GoldCorpus(object):
     """An annotated corpus, using the JSON file format. Manages
-    annotations for tagging, dependency parsing and NER."""
+    annotations for tagging, dependency parsing and NER.
+
+    DOCS: https://spacy.io/api/goldcorpus
+    """
     def __init__(self, train, dev, gold_preproc=False, limit=None):
         """Create a GoldCorpus.
 
@@ -101,12 +107,10 @@ class GoldCorpus(object):
         if isinstance(train, str) or isinstance(train, Path):
             train = self.read_tuples(self.walk_corpus(train))
             dev = self.read_tuples(self.walk_corpus(dev))
-
-        # Write temp directory with one doc per file, so we can shuffle
-        # and stream
+        # Write temp directory with one doc per file, so we can shuffle and stream
         self.tmp_dir = Path(tempfile.mkdtemp())
-        self.write_msgpack(self.tmp_dir / 'train', train, limit=self.limit)
-        self.write_msgpack(self.tmp_dir / 'dev', dev, limit=self.limit)
+        self.write_msgpack(self.tmp_dir / "train", train, limit=self.limit)
+        self.write_msgpack(self.tmp_dir / "dev", dev, limit=self.limit)
 
     def __del__(self):
         shutil.rmtree(self.tmp_dir)
@@ -117,7 +121,7 @@ class GoldCorpus(object):
             directory.mkdir()
         n = 0
         for i, doc_tuple in enumerate(doc_tuples):
-            srsly.write_msgpack(directory / '{}.msg'.format(i), [doc_tuple])
+            srsly.write_msgpack(directory / "{}.msg".format(i), [doc_tuple])
             n += len(doc_tuple[1])
             if limit and n >= limit:
                 break
@@ -134,11 +138,11 @@ class GoldCorpus(object):
             if str(path) in seen:
                 continue
             seen.add(str(path))
-            if path.parts[-1].startswith('.'):
+            if path.parts[-1].startswith("."):
                 continue
             elif path.is_dir():
                 paths.extend(path.iterdir())
-            elif path.parts[-1].endswith('.json'):
+            elif path.parts[-1].endswith(".json"):
                 locs.append(path)
         return locs
 
@@ -147,13 +151,12 @@ class GoldCorpus(object):
         i = 0
         for loc in locs:
             loc = util.ensure_path(loc)
-            if loc.parts[-1].endswith('json'):
+            if loc.parts[-1].endswith("json"):
                 gold_tuples = read_json_file(loc)
-            elif loc.parts[-1].endswith('msg'):
+            elif loc.parts[-1].endswith("msg"):
                 gold_tuples = srsly.read_msgpack(loc)
             else:
-                msg = "Cannot read from file: %s. Supported formats: .json, .msg"
-                raise ValueError(msg % loc)
+                raise ValueError(Errors.E124.format(path=path2str(loc)))
             for item in gold_tuples:
                 yield item
                 i += len(item[1])
@@ -162,12 +165,12 @@ class GoldCorpus(object):
 
     @property
     def dev_tuples(self):
-        locs = (self.tmp_dir / 'dev').iterdir()
+        locs = (self.tmp_dir / "dev").iterdir()
         yield from self.read_tuples(locs, limit=self.limit)
 
     @property
     def train_tuples(self):
-        locs = (self.tmp_dir / 'train').iterdir()
+        locs = (self.tmp_dir / "train").iterdir()
         yield from self.read_tuples(locs, limit=self.limit)
 
     def count_train(self):
@@ -193,8 +196,7 @@ class GoldCorpus(object):
         yield from gold_docs
 
     def dev_docs(self, nlp, gold_preproc=False):
-        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples,
-                                        gold_preproc=gold_preproc)
+        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
         yield from gold_docs
 
     @classmethod
@@ -205,32 +207,29 @@ class GoldCorpus(object):
                 raw_text = None
             else:
                 paragraph_tuples = merge_sents(paragraph_tuples)
-            docs = cls._make_docs(nlp, raw_text, paragraph_tuples,
-                                  gold_preproc, noise_level=noise_level)
+            docs = cls._make_docs(nlp, raw_text, paragraph_tuples, gold_preproc,
+                                  noise_level=noise_level)
             golds = cls._make_golds(docs, paragraph_tuples, make_projective)
             for doc, gold in zip(docs, golds):
                 if (not max_length) or len(doc) < max_length:
                     yield doc, gold
 
     @classmethod
-    def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc,
-                   noise_level=0.0):
+    def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0):
         if raw_text is not None:
             raw_text = add_noise(raw_text, noise_level)
             return [nlp.make_doc(raw_text)]
         else:
-            return [Doc(nlp.vocab,
-                        words=add_noise(sent_tuples[1], noise_level))
+            return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level))
                     for (sent_tuples, brackets) in paragraph_tuples]
 
     @classmethod
     def _make_golds(cls, docs, paragraph_tuples, make_projective):
         if len(docs) != len(paragraph_tuples):
-            raise ValueError(Errors.E070.format(n_docs=len(docs),
-                                                n_annots=len(paragraph_tuples)))
+            n_annots = len(paragraph_tuples)
+            raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
         if len(docs) == 1:
-            return [GoldParse.from_annot_tuples(docs[0],
-                                                paragraph_tuples[0][0],
+            return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0],
                                                 make_projective=make_projective)]
         else:
             return [GoldParse.from_annot_tuples(doc, sent_tuples,
@@ -247,18 +246,18 @@ def add_noise(orig, noise_level):
         corrupted = [w for w in corrupted if w]
         return corrupted
     else:
-        return ''.join(_corrupt(c, noise_level) for c in orig)
+        return "".join(_corrupt(c, noise_level) for c in orig)
 
 
 def _corrupt(c, noise_level):
     if random.random() >= noise_level:
         return c
-    elif c == ' ':
-        return '\n'
-    elif c == '\n':
-        return ' '
-    elif c in ['.', "'", "!", "?", ',']:
-        return ''
+    elif c == " ":
+        return "\n"
+    elif c == "\n":
+        return " "
+    elif c in [".", "'", "!", "?", ","]:
+        return ""
     else:
         return c.lower()
 
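The `_corrupt` helper above applies character-level noise: with probability `noise_level` it swaps spaces and newlines, drops some punctuation, or lowercases the character. A runnable sketch of the string path only; `add_noise_to_text` is a hypothetical name, and the list branch and any early return of the real `add_noise` are left out because they are only partly visible in the diff:

```python
import random


def _corrupt(c, noise_level):
    """Return c unchanged, or a corrupted variant with probability noise_level."""
    if random.random() >= noise_level:
        return c
    elif c == " ":
        return "\n"
    elif c == "\n":
        return " "
    elif c in [".", "'", "!", "?", ","]:
        return ""
    else:
        return c.lower()


def add_noise_to_text(text, noise_level):
    """Hypothetical wrapper: corrupt each character of a string independently."""
    return "".join(_corrupt(c, noise_level) for c in text)


# random.random() is always in [0, 1), so 0.0 never corrupts and 1.0 always does.
print(repr(add_noise_to_text("Hello, World!", 0.0)))  # 'Hello, World!'
print(repr(add_noise_to_text("Hello, World!", 1.0)))  # 'hello\nworld'
```

Values between the two extremes corrupt a random subset of characters, so their output varies from run to run unless the random seed is fixed.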
@@ -284,30 +283,30 @@ def json_to_tuple(doc):
     YIELDS (tuple): The reformatted data.
     """
     paragraphs = []
-    for paragraph in doc['paragraphs']:
+    for paragraph in doc["paragraphs"]:
         sents = []
-        for sent in paragraph['sentences']:
+        for sent in paragraph["sentences"]:
             words = []
             ids = []
             tags = []
             heads = []
             labels = []
             ner = []
-            for i, token in enumerate(sent['tokens']):
-                words.append(token['orth'])
+            for i, token in enumerate(sent["tokens"]):
+                words.append(token["orth"])
                 ids.append(i)
-                tags.append(token.get('tag', '-'))
-                heads.append(token.get('head', 0) + i)
-                labels.append(token.get('dep', ''))
+                tags.append(token.get('tag', "-"))
+                heads.append(token.get("head", 0) + i)
+                labels.append(token.get("dep", ""))
                 # Ensure ROOT label is case-insensitive
-                if labels[-1].lower() == 'root':
-                    labels[-1] = 'ROOT'
-                ner.append(token.get('ner', '-'))
+                if labels[-1].lower() == "root":
+                    labels[-1] = "ROOT"
+                ner.append(token.get("ner", "-"))
             sents.append([
                 [ids, words, tags, heads, labels, ner],
-                sent.get('brackets', [])])
+                sent.get("brackets", [])])
         if sents:
-            yield [paragraph.get('raw', None), sents]
+            yield [paragraph.get("raw", None), sents]
 
 
 def read_json_file(loc, docs_filter=None, limit=None):
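The `json_to_tuple` hunk above flattens the nested JSON training format into per-sentence lists, converting each token's relative `head` offset into an absolute index. A standalone sketch of that conversion on a toy document; the sample sentence and its annotations are invented for illustration:

```python
def json_to_tuple(doc):
    """Yield [raw_text, sentences] per paragraph, as in the hunk above."""
    for paragraph in doc["paragraphs"]:
        sents = []
        for sent in paragraph["sentences"]:
            words, ids, tags, heads, labels, ner = [], [], [], [], [], []
            for i, token in enumerate(sent["tokens"]):
                words.append(token["orth"])
                ids.append(i)
                tags.append(token.get("tag", "-"))
                # Heads are stored as offsets relative to the token itself.
                heads.append(token.get("head", 0) + i)
                label = token.get("dep", "")
                # Normalise any casing of "root" to "ROOT".
                labels.append("ROOT" if label.lower() == "root" else label)
                ner.append(token.get("ner", "-"))
            sents.append([[ids, words, tags, heads, labels, ner],
                          sent.get("brackets", [])])
        if sents:
            yield [paragraph.get("raw", None), sents]


# Invented example document.
doc = {"paragraphs": [{"raw": "I like cats", "sentences": [{"tokens": [
    {"orth": "I", "tag": "PRP", "head": 1, "dep": "nsubj"},
    {"orth": "like", "tag": "VBP", "head": 0, "dep": "root"},
    {"orth": "cats", "tag": "NNS", "head": -1, "dep": "dobj"},
]}]}]}

raw, sents = next(json_to_tuple(doc))
(ids, words, tags, heads, labels, ner), brackets = sents[0]
print(heads)   # [1, 1, 1]
print(labels)  # ['nsubj', 'ROOT', 'dobj']
```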
@@ -329,7 +328,7 @@ def _json_iterate(loc):
     # It's okay to read in the whole file -- just don't parse it into JSON.
     cdef bytes py_raw
     loc = util.ensure_path(loc)
-    with loc.open('rb') as file_:
+    with loc.open("rb") as file_:
         py_raw = file_.read()
     raw = <char*>py_raw
     cdef int square_depth = 0
@@ -339,11 +338,11 @@ def _json_iterate(loc):
     cdef int start = -1
     cdef char c
     cdef char quote = ord('"')
-    cdef char backslash = ord('\\')
-    cdef char open_square = ord('[')
-    cdef char close_square = ord(']')
-    cdef char open_curly = ord('{')
-    cdef char close_curly = ord('}')
+    cdef char backslash = ord("\\")
+    cdef char open_square = ord("[")
+    cdef char close_square = ord("]")
+    cdef char open_curly = ord("{")
+    cdef char close_curly = ord("}")
     for i in range(len(py_raw)):
         c = raw[i]
         if escape:
@@ -368,7 +367,7 @@ def _json_iterate(loc):
         elif c == close_curly:
             curly_depth -= 1
             if square_depth == 1 and curly_depth == 0:
-                py_str = py_raw[start : i + 1].decode('utf8')
+                py_str = py_raw[start : i + 1].decode("utf8")
                 try:
                     yield srsly.json_loads(py_str)
                 except Exception:
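The `_json_iterate` fragments above scan the raw file once, track bracket and brace depth, and parse each top-level object of the outer array on its own instead of loading the whole document as JSON. Only parts of the loop are visible in the diff, so the pure-Python version below is a reconstruction under that assumption; it uses the standard `json` module in place of `srsly`, works on a `str` rather than a `char*`, and `iter_top_level_objects` is a hypothetical name:

```python
import json


def iter_top_level_objects(raw):
    """Yield each object that sits directly inside the outermost JSON array."""
    square_depth = 0
    curly_depth = 0
    inside_string = False
    escape = False
    start = -1
    for i, c in enumerate(raw):
        if escape:
            # Character after a backslash: skip it, whatever it is.
            escape = False
            continue
        if c == "\\":
            escape = True
            continue
        if c == '"':
            inside_string = not inside_string
            continue
        if inside_string:
            # Brackets inside string values must not affect the depth.
            continue
        if c == "[":
            square_depth += 1
        elif c == "]":
            square_depth -= 1
        elif c == "{":
            if square_depth == 1 and curly_depth == 0:
                start = i
            curly_depth += 1
        elif c == "}":
            curly_depth -= 1
            if square_depth == 1 and curly_depth == 0:
                yield json.loads(raw[start : i + 1])
                start = -1


raw = '[{"id": 1, "text": "a ] } \\" tricky"}, {"id": 2, "tags": ["x", {"k": "v"}]}]'
objs = list(iter_top_level_objects(raw))
print(len(objs))  # 2
```

Parsing one object at a time keeps peak memory tied to the largest single document rather than to the whole corpus file.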
@@ -388,7 +387,7 @@ def iob_to_biluo(tags):
 
 
 def _consume_os(tags):
-    while tags and tags[0] == 'O':
+    while tags and tags[0] == "O":
         yield tags.pop(0)
 
 
@@ -396,24 +395,27 @@ def _consume_ent(tags):
     if not tags:
         return []
     tag = tags.pop(0)
-    target_in = 'I' + tag[1:]
-    target_last = 'L' + tag[1:]
+    target_in = "I" + tag[1:]
+    target_last = "L" + tag[1:]
     length = 1
     while tags and tags[0] in {target_in, target_last}:
         length += 1
         tags.pop(0)
     label = tag[2:]
     if length == 1:
-        return ['U-' + label]
+        return ["U-" + label]
     else:
-        start = 'B-' + label
-        end = 'L-' + label
-        middle = ['I-%s' % label for _ in range(1, length - 1)]
+        start = "B-" + label
+        end = "L-" + label
+        middle = ["I-%s" % label for _ in range(1, length - 1)]
         return [start] + middle + [end]
 
 
 cdef class GoldParse:
-    """Collection for training annotations."""
+    """Collection for training annotations.
+
+    DOCS: https://spacy.io/api/goldparse
+    """
     @classmethod
     def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
         _, words, tags, heads, deps, entities = annot_tuples
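The `_consume_os` and `_consume_ent` helpers in the hunks above convert IOB tags to BILUO by repeatedly consuming a run of `O` tags and then one entity. The driver function `iob_to_biluo` is only named in a hunk header, so the loop below is an assumed reconstruction of how the two helpers fit together:

```python
def _consume_os(tags):
    """Yield and remove the leading run of 'O' tags."""
    while tags and tags[0] == "O":
        yield tags.pop(0)


def _consume_ent(tags):
    """Remove one entity from the front of tags and return it as BILUO tags."""
    if not tags:
        return []
    tag = tags.pop(0)
    target_in = "I" + tag[1:]
    target_last = "L" + tag[1:]
    length = 1
    while tags and tags[0] in {target_in, target_last}:
        length += 1
        tags.pop(0)
    label = tag[2:]
    if length == 1:
        return ["U-" + label]
    start = "B-" + label
    end = "L-" + label
    middle = ["I-%s" % label for _ in range(1, length - 1)]
    return [start] + middle + [end]


def iob_to_biluo(tags):
    """Assumed driver: alternate between 'O' runs and entities until done."""
    out = []
    tags = list(tags)  # work on a copy, since the helpers pop from the list
    while tags:
        out.extend(_consume_os(tags))
        out.extend(_consume_ent(tags))
    return out


print(iob_to_biluo(["O", "B-PER", "I-PER", "O", "B-LOC"]))
# ['O', 'B-PER', 'L-PER', 'O', 'U-LOC']
```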
@ -458,13 +460,13 @@ cdef class GoldParse:
if morphology is None: if morphology is None:
morphology = [None for _ in words] morphology = [None for _ in words]
if entities is None: if entities is None:
entities = ['-' for _ in doc] entities = ["-" for _ in doc]
elif len(entities) == 0: elif len(entities) == 0:
entities = ['O' for _ in doc] entities = ["O" for _ in doc]
else: else:
# Translate the None values to '-', to make processing easier. # Translate the None values to '-', to make processing easier.
# See Issue #2603 # See Issue #2603
entities = [(ent if ent is not None else '-') for ent in entities] entities = [(ent if ent is not None else "-") for ent in entities]
if not isinstance(entities[0], basestring): if not isinstance(entities[0], basestring):
# Assume we have entities specified by character offset. # Assume we have entities specified by character offset.
entities = biluo_tags_from_offsets(doc, entities) entities = biluo_tags_from_offsets(doc, entities)
@ -510,10 +512,10 @@ cdef class GoldParse:
for i, gold_i in enumerate(self.cand_to_gold): for i, gold_i in enumerate(self.cand_to_gold):
if doc[i].text.isspace(): if doc[i].text.isspace():
self.words[i] = doc[i].text self.words[i] = doc[i].text
self.tags[i] = '_SP' self.tags[i] = "_SP"
self.heads[i] = None self.heads[i] = None
self.labels[i] = None self.labels[i] = None
self.ner[i] = 'O' self.ner[i] = "O"
self.morphology[i] = set() self.morphology[i] = set()
if gold_i is None: if gold_i is None:
if i in i2j_multi: if i in i2j_multi:
@ -525,7 +527,7 @@ cdef class GoldParse:
# Set next word in multi-token span as head, until last # Set next word in multi-token span as head, until last
if not is_last: if not is_last:
self.heads[i] = i+1 self.heads[i] = i+1
self.labels[i] = 'subtok' self.labels[i] = "subtok"
else: else:
self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]] self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
self.labels[i] = deps[i2j_multi[i]] self.labels[i] = deps[i2j_multi[i]]
@ -534,24 +536,24 @@ cdef class GoldParse:
# BILOU tags. We can't have BB or LL etc. # BILOU tags. We can't have BB or LL etc.
# Case 1: O -- easy. # Case 1: O -- easy.
ner_tag = entities[i2j_multi[i]] ner_tag = entities[i2j_multi[i]]
if ner_tag == 'O': if ner_tag == "O":
self.ner[i] = 'O' self.ner[i] = "O"
# Case 2: U. This has to become a B I* L sequence. # Case 2: U. This has to become a B I* L sequence.
elif ner_tag.startswith('U-'): elif ner_tag.startswith("U-"):
if is_first: if is_first:
self.ner[i] = ner_tag.replace('U-', 'B-', 1) self.ner[i] = ner_tag.replace("U-", "B-", 1)
elif is_last: elif is_last:
self.ner[i] = ner_tag.replace('U-', 'L-', 1) self.ner[i] = ner_tag.replace("U-", "L-", 1)
else: else:
self.ner[i] = ner_tag.replace('U-', 'I-', 1) self.ner[i] = ner_tag.replace("U-", "I-", 1)
# Case 3: L. If not last, change to I. # Case 3: L. If not last, change to I.
elif ner_tag.startswith('L-'): elif ner_tag.startswith("L-"):
if is_last: if is_last:
self.ner[i] = ner_tag self.ner[i] = ner_tag
else: else:
self.ner[i] = ner_tag.replace('L-', 'I-', 1) self.ner[i] = ner_tag.replace("L-", "I-", 1)
# Case 4: I. Stays correct # Case 4: I. Stays correct
elif ner_tag.startswith('I-'): elif ner_tag.startswith("I-"):
self.ner[i] = ner_tag self.ner[i] = ner_tag
else: else:
self.words[i] = words[gold_i] self.words[i] = words[gold_i]
@ -613,7 +615,7 @@ def docs_to_json(docs, underscore=None):
return [doc.to_json(underscore=underscore) for doc in docs] return [doc.to_json(underscore=underscore) for doc in docs]
def biluo_tags_from_offsets(doc, entities, missing='O'): def biluo_tags_from_offsets(doc, entities, missing="O"):
"""Encode labelled spans into per-token tags, using the """Encode labelled spans into per-token tags, using the
Begin/In/Last/Unit/Out scheme (BILUO). Begin/In/Last/Unit/Out scheme (BILUO).
@ -636,11 +638,11 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
>>> entities = [(len('I like '), len('I like London'), 'LOC')] >>> entities = [(len('I like '), len('I like London'), 'LOC')]
>>> doc = nlp.tokenizer(text) >>> doc = nlp.tokenizer(text)
>>> tags = biluo_tags_from_offsets(doc, entities) >>> tags = biluo_tags_from_offsets(doc, entities)
>>> assert tags == ['O', 'O', 'U-LOC', 'O'] >>> assert tags == ["O", "O", 'U-LOC', "O"]
""" """
starts = {token.idx: token.i for token in doc} starts = {token.idx: token.i for token in doc}
ends = {token.idx+len(token): token.i for token in doc} ends = {token.idx + len(token): token.i for token in doc}
biluo = ['-' for _ in doc] biluo = ["-" for _ in doc]
# Handle entity cases # Handle entity cases
for start_char, end_char, label in entities: for start_char, end_char, label in entities:
start_token = starts.get(start_char) start_token = starts.get(start_char)
@ -648,19 +650,19 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
# Only interested if the tokenization is correct # Only interested if the tokenization is correct
if start_token is not None and end_token is not None: if start_token is not None and end_token is not None:
if start_token == end_token: if start_token == end_token:
biluo[start_token] = 'U-%s' % label biluo[start_token] = "U-%s" % label
else: else:
biluo[start_token] = 'B-%s' % label biluo[start_token] = "B-%s" % label
for i in range(start_token+1, end_token): for i in range(start_token+1, end_token):
biluo[i] = 'I-%s' % label biluo[i] = "I-%s" % label
biluo[end_token] = 'L-%s' % label biluo[end_token] = "L-%s" % label
# Now distinguish the O cases from ones where we miss the tokenization # Now distinguish the O cases from ones where we miss the tokenization
entity_chars = set() entity_chars = set()
for start_char, end_char, label in entities: for start_char, end_char, label in entities:
for i in range(start_char, end_char): for i in range(start_char, end_char):
entity_chars.add(i) entity_chars.add(i)
for token in doc: for token in doc:
for i in range(token.idx, token.idx+len(token)): for i in range(token.idx, token.idx + len(token)):
if i in entity_chars: if i in entity_chars:
break break
else: else:
@ -702,4 +704,4 @@ def offsets_from_biluo_tags(doc, tags):
def is_punct_label(label): def is_punct_label(label):
return label == 'P' or label.lower() == 'punct' return label == "P" or label.lower() == "punct"
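The BILUO encoding done by `biluo_tags_from_offsets` in the hunks above can be sketched in plain Python. This is a minimal illustration, not the spaCy implementation: the function name `biluo_from_offsets` and the `(start_char, text)` token pairs are hypothetical stand-ins for a `Doc` and its `token.idx` / `len(token)` values. The last step of the original is cut off by the hunk boundary, so the sketch assumes that tokens overlapping no entity span receive the `missing` tag, while tokens that overlap a misaligned span keep `"-"`.

```python
def biluo_from_offsets(tokens, entities, missing="O"):
    """Encode (start_char, end_char, label) spans as per-token BILUO tags.

    tokens: list of (start_char, text) pairs. This is a hypothetical
    stand-in for the token.idx / len(token) attributes of a spaCy Doc.
    """
    # Map character offsets of token starts and ends to token indices.
    starts = {idx: i for i, (idx, text) in enumerate(tokens)}
    ends = {idx + len(text): i for i, (idx, text) in enumerate(tokens)}
    tags = ["-" for _ in tokens]
    for start_char, end_char, label in entities:
        start_token = starts.get(start_char)
        end_token = ends.get(end_char)
        # Only tag spans whose boundaries line up with token boundaries.
        if start_token is not None and end_token is not None:
            if start_token == end_token:
                tags[start_token] = "U-%s" % label
            else:
                tags[start_token] = "B-%s" % label
                for i in range(start_token + 1, end_token):
                    tags[i] = "I-%s" % label
                tags[end_token] = "L-%s" % label
    # Collect every character covered by some entity span.
    entity_chars = set()
    for start_char, end_char, _ in entities:
        entity_chars.update(range(start_char, end_char))
    # Assumption: tokens that overlap no entity get the `missing` tag;
    # tokens overlapping a misaligned span keep "-".
    for i, (idx, text) in enumerate(tokens):
        if not any(c in entity_chars for c in range(idx, idx + len(text))):
            tags[i] = missing
    return tags


# "I like London." split into tokens with their character offsets.
tokens = [(0, "I"), (2, "like"), (7, "London"), (13, ".")]
print(biluo_from_offsets(tokens, [(7, 13, "LOC")]))
```

Keeping `"-"` separate from `"O"` lets downstream code tell "no entity here" apart from "an entity was annotated here, but the tokenization does not line up with it".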


@ -104,8 +104,9 @@ class Language(object):
Defaults (class): Settings, data and factory methods for creating the `nlp` Defaults (class): Settings, data and factory methods for creating the `nlp`
object and processing pipeline. object and processing pipeline.
lang (unicode): Two-letter language ID, i.e. ISO code. lang (unicode): Two-letter language ID, i.e. ISO code.
"""
DOCS: https://spacy.io/api/language
"""
Defaults = BaseDefaults Defaults = BaseDefaults
lang = None lang = None


@ -6,6 +6,13 @@ from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
class Lemmatizer(object): class Lemmatizer(object):
"""
The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
lookup tables.
DOCS: https://spacy.io/api/lemmatizer
"""
@classmethod @classmethod
def load(cls, path, index=None, exc=None, rules=None, lookup=None): def load(cls, path, index=None, exc=None, rules=None, lookup=None):
return cls(index, exc, rules, lookup) return cls(index, exc, rules, lookup)


@ -4,16 +4,19 @@ from __future__ import unicode_literals, print_function
# Compiler crashes on memory view coercion without this. Should report bug. # Compiler crashes on memory view coercion without this. Should report bug.
from cython.view cimport array as cvarray from cython.view cimport array as cvarray
from libc.string cimport memset
cimport numpy as np cimport numpy as np
np.import_array() np.import_array()
from libc.string cimport memset
import numpy import numpy
from thinc.neural.util import get_array_module
from .typedefs cimport attr_t, flags_t from .typedefs cimport attr_t, flags_t
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
from .attrs cimport PROB from .attrs cimport IS_CURRENCY, IS_OOV, PROB
from .attrs import intify_attrs from .attrs import intify_attrs
from .errors import Errors, Warnings, user_warning from .errors import Errors, Warnings, user_warning
@ -26,6 +29,8 @@ cdef class Lexeme:
word-type, as opposed to a word token. It therefore has no part-of-speech word-type, as opposed to a word token. It therefore has no part-of-speech
tag, dependency parse, or lemma (lemmatization depends on the tag, dependency parse, or lemma (lemmatization depends on the
part-of-speech tag). part-of-speech tag).
DOCS: https://spacy.io/api/lexeme
""" """
def __init__(self, Vocab vocab, attr_t orth): def __init__(self, Vocab vocab, attr_t orth):
"""Create a Lexeme object. """Create a Lexeme object.
@ -114,18 +119,19 @@ cdef class Lexeme:
RETURNS (float): A scalar similarity score. Higher is more similar. RETURNS (float): A scalar similarity score. Higher is more similar.
""" """
# Return 1.0 similarity for matches # Return 1.0 similarity for matches
if hasattr(other, 'orth'): if hasattr(other, "orth"):
if self.c.orth == other.orth: if self.c.orth == other.orth:
return 1.0 return 1.0
elif hasattr(other, '__len__') and len(other) == 1 \ elif hasattr(other, "__len__") and len(other) == 1 \
and hasattr(other[0], 'orth'): and hasattr(other[0], "orth"):
if self.c.orth == other[0].orth: if self.c.orth == other[0].orth:
return 1.0 return 1.0
if self.vector_norm == 0 or other.vector_norm == 0: if self.vector_norm == 0 or other.vector_norm == 0:
user_warning(Warnings.W008.format(obj='Lexeme')) user_warning(Warnings.W008.format(obj="Lexeme"))
return 0.0 return 0.0
return (numpy.dot(self.vector, other.vector) / vector = self.vector
(self.vector_norm * other.vector_norm)) xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
def to_bytes(self): def to_bytes(self):
lex_data = Lexeme.c_to_bytes(self.c) lex_data = Lexeme.c_to_bytes(self.c)
@ -134,7 +140,7 @@ cdef class Lexeme:
if (end-start) != sizeof(lex_data.data): if (end-start) != sizeof(lex_data.data):
raise ValueError(Errors.E072.format(length=end-start, raise ValueError(Errors.E072.format(length=end-start,
bad_length=sizeof(lex_data.data))) bad_length=sizeof(lex_data.data)))
byte_string = b'\0' * sizeof(lex_data.data) byte_string = b"\0" * sizeof(lex_data.data)
byte_chars = <char*>byte_string byte_chars = <char*>byte_string
for i in range(sizeof(lex_data.data)): for i in range(sizeof(lex_data.data)):
byte_chars[i] = lex_data.data[i] byte_chars[i] = lex_data.data[i]


@ -1,6 +1,8 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .matcher import Matcher # noqa: F401 from .matcher import Matcher
from .phrasematcher import PhraseMatcher # noqa: F401 from .phrasematcher import PhraseMatcher
from .dependencymatcher import DependencyTreeMatcher # noqa: F401 from .dependencymatcher import DependencyTreeMatcher
__all__ = ["Matcher", "PhraseMatcher", "DependencyTreeMatcher"]


@ -13,7 +13,7 @@ from .matcher import unpickle_matcher
from ..errors import Errors from ..errors import Errors
DELIMITER = '||' DELIMITER = "||"
INDEX_HEAD = 1 INDEX_HEAD = 1
INDEX_RELOP = 0 INDEX_RELOP = 0
@ -55,7 +55,8 @@ cdef class DependencyTreeMatcher:
return (unpickle_matcher, data, None, None) return (unpickle_matcher, data, None, None)
def __len__(self): def __len__(self):
"""Get the number of rules, which are edges ,added to the dependency tree matcher. """Get the number of rules, which are edges, added to the dependency
tree matcher.
RETURNS (int): The number of rules. RETURNS (int): The number of rules.
""" """
@ -73,19 +74,30 @@ cdef class DependencyTreeMatcher:
idx = 0 idx = 0
visitedNodes = {} visitedNodes = {}
for relation in pattern: for relation in pattern:
if 'PATTERN' not in relation or 'SPEC' not in relation: if "PATTERN" not in relation or "SPEC" not in relation:
raise ValueError(Errors.E098.format(key=key)) raise ValueError(Errors.E098.format(key=key))
if idx == 0: if idx == 0:
if not('NODE_NAME' in relation['SPEC'] and 'NBOR_RELOP' not in relation['SPEC'] and 'NBOR_NAME' not in relation['SPEC']): if not(
"NODE_NAME" in relation["SPEC"]
and "NBOR_RELOP" not in relation["SPEC"]
and "NBOR_NAME" not in relation["SPEC"]
):
raise ValueError(Errors.E099.format(key=key)) raise ValueError(Errors.E099.format(key=key))
visitedNodes[relation['SPEC']['NODE_NAME']] = True visitedNodes[relation["SPEC"]["NODE_NAME"]] = True
else: else:
if not('NODE_NAME' in relation['SPEC'] and 'NBOR_RELOP' in relation['SPEC'] and 'NBOR_NAME' in relation['SPEC']): if not(
"NODE_NAME" in relation["SPEC"]
and "NBOR_RELOP" in relation["SPEC"]
and "NBOR_NAME" in relation["SPEC"]
):
raise ValueError(Errors.E100.format(key=key)) raise ValueError(Errors.E100.format(key=key))
if relation['SPEC']['NODE_NAME'] in visitedNodes or relation['SPEC']['NBOR_NAME'] not in visitedNodes: if (
relation["SPEC"]["NODE_NAME"] in visitedNodes
or relation["SPEC"]["NBOR_NAME"] not in visitedNodes
):
raise ValueError(Errors.E101.format(key=key)) raise ValueError(Errors.E101.format(key=key))
visitedNodes[relation['SPEC']['NODE_NAME']] = True visitedNodes[relation["SPEC"]["NODE_NAME"]] = True
visitedNodes[relation['SPEC']['NBOR_NAME']] = True visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
idx = idx + 1 idx = idx + 1
def add(self, key, on_match, *patterns): def add(self, key, on_match, *patterns):
@ -93,55 +105,46 @@ cdef class DependencyTreeMatcher:
if len(pattern) == 0: if len(pattern) == 0:
raise ValueError(Errors.E012.format(key=key)) raise ValueError(Errors.E012.format(key=key))
self.validateInput(pattern,key) self.validateInput(pattern,key)
key = self._normalize_key(key) key = self._normalize_key(key)
_patterns = [] _patterns = []
for pattern in patterns: for pattern in patterns:
token_patterns = [] token_patterns = []
for i in range(len(pattern)): for i in range(len(pattern)):
token_pattern = [pattern[i]['PATTERN']] token_pattern = [pattern[i]["PATTERN"]]
token_patterns.append(token_pattern) token_patterns.append(token_pattern)
# self.patterns.append(token_patterns) # self.patterns.append(token_patterns)
_patterns.append(token_patterns) _patterns.append(token_patterns)
self._patterns.setdefault(key, []) self._patterns.setdefault(key, [])
self._callbacks[key] = on_match self._callbacks[key] = on_match
self._patterns[key].extend(_patterns) self._patterns[key].extend(_patterns)
# Add each node pattern of all the input patterns individually to the
# Add each node pattern of all the input patterns individually to the matcher. # matcher. This enables only a single instance of Matcher to be used.
# This enables only a single instance of Matcher to be used.
# Multiple adds are required to track each node pattern. # Multiple adds are required to track each node pattern.
_keys_to_token_list = [] _keys_to_token_list = []
for i in range(len(_patterns)): for i in range(len(_patterns)):
_keys_to_token = {} _keys_to_token = {}
# TODO : Better ways to hash edges in pattern? # TODO: Better ways to hash edges in pattern?
for j in range(len(_patterns[i])): for j in range(len(_patterns[i])):
k = self._normalize_key(unicode(key)+DELIMITER+unicode(i)+DELIMITER+unicode(j)) k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j))
self.token_matcher.add(k,None,_patterns[i][j]) self.token_matcher.add(k, None, _patterns[i][j])
_keys_to_token[k] = j _keys_to_token[k] = j
_keys_to_token_list.append(_keys_to_token) _keys_to_token_list.append(_keys_to_token)
self._keys_to_token.setdefault(key, []) self._keys_to_token.setdefault(key, [])
self._keys_to_token[key].extend(_keys_to_token_list) self._keys_to_token[key].extend(_keys_to_token_list)
_nodes_list = [] _nodes_list = []
for pattern in patterns: for pattern in patterns:
nodes = {} nodes = {}
for i in range(len(pattern)): for i in range(len(pattern)):
nodes[pattern[i]['SPEC']['NODE_NAME']]=i nodes[pattern[i]["SPEC"]["NODE_NAME"]] = i
_nodes_list.append(nodes) _nodes_list.append(nodes)
self._nodes.setdefault(key, []) self._nodes.setdefault(key, [])
self._nodes[key].extend(_nodes_list) self._nodes[key].extend(_nodes_list)
# Create an object tree to traverse later on. This data structure
# enables easy tree pattern match. Doc-Token based tree cannot be
# reused since it is memory-heavy and tightly coupled with the Doc.
self.retrieve_tree(patterns, _nodes_list,key)
# Create an object tree to traverse later on. def retrieve_tree(self, patterns, _nodes_list, key):
# This datastructure enable easy tree pattern match.
# Doc-Token based tree cannot be reused since it is memory heavy and
# tightly coupled with doc
self.retrieve_tree(patterns,_nodes_list,key)
def retrieve_tree(self,patterns,_nodes_list,key):
_heads_list = [] _heads_list = []
_root_list = [] _root_list = []
for i in range(len(patterns)): for i in range(len(patterns)):
@ -149,31 +152,29 @@ cdef class DependencyTreeMatcher:
root = -1 root = -1
for j in range(len(patterns[i])): for j in range(len(patterns[i])):
token_pattern = patterns[i][j] token_pattern = patterns[i][j]
if('NBOR_RELOP' not in token_pattern['SPEC']): if ("NBOR_RELOP" not in token_pattern["SPEC"]):
heads[j] = ('root',j) heads[j] = ('root', j)
root = j root = j
else: else:
heads[j] = (token_pattern['SPEC']['NBOR_RELOP'],_nodes_list[i][token_pattern['SPEC']['NBOR_NAME']]) heads[j] = (
token_pattern["SPEC"]["NBOR_RELOP"],
_nodes_list[i][token_pattern["SPEC"]["NBOR_NAME"]]
)
_heads_list.append(heads) _heads_list.append(heads)
_root_list.append(root) _root_list.append(root)
_tree_list = [] _tree_list = []
for i in range(len(patterns)): for i in range(len(patterns)):
tree = {} tree = {}
for j in range(len(patterns[i])): for j in range(len(patterns[i])):
if(_heads_list[i][j][INDEX_HEAD] == j): if(_heads_list[i][j][INDEX_HEAD] == j):
continue continue
head = _heads_list[i][j][INDEX_HEAD] head = _heads_list[i][j][INDEX_HEAD]
if(head not in tree): if(head not in tree):
tree[head] = [] tree[head] = []
tree[head].append( (_heads_list[i][j][INDEX_RELOP],j) ) tree[head].append((_heads_list[i][j][INDEX_RELOP], j))
_tree_list.append(tree) _tree_list.append(tree)
self._tree.setdefault(key, []) self._tree.setdefault(key, [])
self._tree[key].extend(_tree_list) self._tree[key].extend(_tree_list)
self._root.setdefault(key, []) self._root.setdefault(key, [])
self._root[key].extend(_root_list) self._root[key].extend(_root_list)
@ -199,7 +200,6 @@ cdef class DependencyTreeMatcher:
def __call__(self, Doc doc): def __call__(self, Doc doc):
matched_trees = [] matched_trees = []
matches = self.token_matcher(doc) matches = self.token_matcher(doc)
for key in list(self._patterns.keys()): for key in list(self._patterns.keys()):
_patterns_list = self._patterns[key] _patterns_list = self._patterns[key]
@ -216,39 +216,51 @@ cdef class DependencyTreeMatcher:
id_to_position = {} id_to_position = {}
for i in range(len(_nodes)): for i in range(len(_nodes)):
id_to_position[i]=[] id_to_position[i]=[]
# TODO: This could be taken outside to improve running time..?
# This could be taken outside to improve running time..?
for match_id, start, end in matches: for match_id, start, end in matches:
if match_id in _keys_to_token: if match_id in _keys_to_token:
id_to_position[_keys_to_token[match_id]].append(start) id_to_position[_keys_to_token[match_id]].append(start)
_node_operator_map = self.get_node_operator_map(
_node_operator_map = self.get_node_operator_map(doc,_tree,id_to_position,_nodes,_root) doc,
_tree,
id_to_position,
_nodes,_root
)
length = len(_nodes) length = len(_nodes)
if _root in id_to_position: if _root in id_to_position:
candidates = id_to_position[_root] candidates = id_to_position[_root]
for candidate in candidates: for candidate in candidates:
isVisited = {} isVisited = {}
self.dfs(candidate,_root,_tree,id_to_position,doc,isVisited,_node_operator_map) self.dfs(
# To check if the subtree pattern is completely identified. This is a heuristic. candidate,
# This is done to reduce the complexity of exponential unordered subtree matching. _root,_tree,
# Will give approximate matches in some cases. id_to_position,
doc,
isVisited,
_node_operator_map
)
# To check if the subtree pattern is completely
# identified. This is a heuristic. This is done to
# reduce the complexity of exponential unordered subtree
# matching. Will give approximate matches in some cases.
if(len(isVisited) == length): if(len(isVisited) == length):
matched_trees.append((key,list(isVisited))) matched_trees.append((key,list(isVisited)))
for i, (ent_id, nodes) in enumerate(matched_trees): for i, (ent_id, nodes) in enumerate(matched_trees):
on_match = self._callbacks.get(ent_id) on_match = self._callbacks.get(ent_id)
if on_match is not None: if on_match is not None:
on_match(self, doc, i, matches) on_match(self, doc, i, matches)
return matched_trees return matched_trees
def dfs(self,candidate,root,tree,id_to_position,doc,isVisited,_node_operator_map): def dfs(self,candidate,root,tree,id_to_position,doc,isVisited,_node_operator_map):
if(root in id_to_position and candidate in id_to_position[root]): if (root in id_to_position and candidate in id_to_position[root]):
# color the node since it is valid # Color the node since it is valid
isVisited[candidate] = True isVisited[candidate] = True
if root in tree: if root in tree:
for root_child in tree[root]: for root_child in tree[root]:
if candidate in _node_operator_map and root_child[INDEX_RELOP] in _node_operator_map[candidate]: if (
candidate in _node_operator_map
and root_child[INDEX_RELOP] in _node_operator_map[candidate]
):
candidate_children = _node_operator_map[candidate][root_child[INDEX_RELOP]] candidate_children = _node_operator_map[candidate][root_child[INDEX_RELOP]]
for candidate_child in candidate_children: for candidate_child in candidate_children:
result = self.dfs( result = self.dfs(
@ -275,72 +287,68 @@ cdef class DependencyTreeMatcher:
for child in tree[node]: for child in tree[node]:
all_operators.append(child[INDEX_RELOP]) all_operators.append(child[INDEX_RELOP])
all_operators = list(set(all_operators)) all_operators = list(set(all_operators))
all_nodes = [] all_nodes = []
for node in all_node_indices: for node in all_node_indices:
all_nodes = all_nodes + id_to_position[node] all_nodes = all_nodes + id_to_position[node]
all_nodes = list(set(all_nodes)) all_nodes = list(set(all_nodes))
for node in all_nodes: for node in all_nodes:
_node_operator_map[node] = {} _node_operator_map[node] = {}
for operator in all_operators: for operator in all_operators:
_node_operator_map[node][operator] = [] _node_operator_map[node][operator] = []
# Used to invoke methods for each operator # Used to invoke methods for each operator
switcher = { switcher = {
'<':self.dep, "<": self.dep,
'>':self.gov, ">": self.gov,
'>>':self.dep_chain, ">>": self.dep_chain,
'<<':self.gov_chain, "<<": self.gov_chain,
'.':self.imm_precede, ".": self.imm_precede,
'$+':self.imm_right_sib, "$+": self.imm_right_sib,
'$-':self.imm_left_sib, "$-": self.imm_left_sib,
'$++':self.right_sib, "$++": self.right_sib,
'$--':self.left_sib "$--": self.left_sib
} }
for operator in all_operators: for operator in all_operators:
for node in all_nodes: for node in all_nodes:
_node_operator_map[node][operator] = switcher.get(operator)(doc,node) _node_operator_map[node][operator] = switcher.get(operator)(doc,node)
return _node_operator_map return _node_operator_map
def dep(self,doc,node): def dep(self, doc, node):
return list(doc[node].head) return list(doc[node].head)
def gov(self,doc,node): def gov(self,doc,node):
return list(doc[node].children) return list(doc[node].children)
def dep_chain(self,doc,node): def dep_chain(self, doc, node):
return list(doc[node].ancestors) return list(doc[node].ancestors)
def gov_chain(self,doc,node): def gov_chain(self, doc, node):
return list(doc[node].subtree) return list(doc[node].subtree)
def imm_precede(self,doc,node): def imm_precede(self, doc, node):
if node>0: if node > 0:
return [doc[node-1]] return [doc[node - 1]]
return [] return []
def imm_right_sib(self,doc,node): def imm_right_sib(self, doc, node):
for idx in range(list(doc[node].head.children)): for idx in range(list(doc[node].head.children)):
if idx == node-1: if idx == node - 1:
return [doc[idx]] return [doc[idx]]
return [] return []
def imm_left_sib(self,doc,node): def imm_left_sib(self, doc, node):
for idx in range(list(doc[node].head.children)): for idx in range(list(doc[node].head.children)):
if idx == node+1: if idx == node + 1:
return [doc[idx]] return [doc[idx]]
return [] return []
def right_sib(self,doc,node): def right_sib(self, doc, node):
candidate_children = [] candidate_children = []
for idx in range(list(doc[node].head.children)): for idx in range(list(doc[node].head.children)):
if idx < node: if idx < node:
candidate_children.append(doc[idx]) candidate_children.append(doc[idx])
return candidate_children return candidate_children
def left_sib(self,doc,node): def left_sib(self, doc, node):
candidate_children = [] candidate_children = []
for idx in range(list(doc[node].head.children)): for idx in range(list(doc[node].head.children)):
if idx > node: if idx > node:

File diff suppressed because it is too large


@ -12,7 +12,7 @@ from ..vocab cimport Vocab
from ..tokens.doc cimport Doc, get_token_attr from ..tokens.doc cimport Doc, get_token_attr
from ..typedefs cimport attr_t, hash_t from ..typedefs cimport attr_t, hash_t
from ..errors import Warnings, deprecation_warning, user_warning from ..errors import Errors, Warnings, deprecation_warning, user_warning
from ..attrs import FLAG61 as U_ENT from ..attrs import FLAG61 as U_ENT
from ..attrs import FLAG60 as B2_ENT from ..attrs import FLAG60 as B2_ENT
from ..attrs import FLAG59 as B3_ENT from ..attrs import FLAG59 as B3_ENT
@ -25,6 +25,13 @@ from ..attrs import FLAG41 as I4_ENT
cdef class PhraseMatcher: cdef class PhraseMatcher:
"""Efficiently match large terminology lists. While the `Matcher` matches
sequences based on lists of token descriptions, the `PhraseMatcher` accepts
match patterns in the form of `Doc` objects.
DOCS: https://spacy.io/api/phrasematcher
USAGE: https://spacy.io/usage/rule-based-matching#phrasematcher
"""
cdef Pool mem cdef Pool mem
cdef Vocab vocab cdef Vocab vocab
cdef Matcher matcher cdef Matcher matcher
@ -36,7 +43,16 @@ cdef class PhraseMatcher:
cdef public object _docs cdef public object _docs
cdef public object _validate cdef public object _validate
def __init__(self, Vocab vocab, max_length=0, attr='ORTH', validate=False): def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False):
"""Initialize the PhraseMatcher.
vocab (Vocab): The shared vocabulary.
attr (int / unicode): Token attribute to match on.
validate (bool): Perform additional validation when patterns are added.
RETURNS (PhraseMatcher): The newly constructed object.
DOCS: https://spacy.io/api/phrasematcher#init
"""
if max_length != 0: if max_length != 0:
deprecation_warning(Warnings.W010) deprecation_warning(Warnings.W010)
self.mem = Pool() self.mem = Pool()
@ -54,7 +70,7 @@ cdef class PhraseMatcher:
[{B3_ENT: True}, {I3_ENT: True}, {L3_ENT: True}], [{B3_ENT: True}, {I3_ENT: True}, {L3_ENT: True}],
[{B4_ENT: True}, {I4_ENT: True}, {I4_ENT: True, "OP": "+"}, {L4_ENT: True}], [{B4_ENT: True}, {I4_ENT: True}, {I4_ENT: True, "OP": "+"}, {L4_ENT: True}],
] ]
self.matcher.add('Candidate', None, *abstract_patterns) self.matcher.add("Candidate", None, *abstract_patterns)
self._callbacks = {} self._callbacks = {}
self._docs = {} self._docs = {}
self._validate = validate self._validate = validate
@ -65,6 +81,8 @@ cdef class PhraseMatcher:
number of individual patterns. number of individual patterns.
RETURNS (int): The number of rules. RETURNS (int): The number of rules.
DOCS: https://spacy.io/api/phrasematcher#len
""" """
return len(self._docs) return len(self._docs)
@ -73,6 +91,8 @@ cdef class PhraseMatcher:
key (unicode): The match ID. key (unicode): The match ID.
RETURNS (bool): Whether the matcher contains rules for this match ID. RETURNS (bool): Whether the matcher contains rules for this match ID.
DOCS: https://spacy.io/api/phrasematcher#contains
""" """
cdef hash_t ent_id = self.matcher._normalize_key(key) cdef hash_t ent_id = self.matcher._normalize_key(key)
return ent_id in self._callbacks return ent_id in self._callbacks
@ -88,6 +108,8 @@ cdef class PhraseMatcher:
key (unicode): The match ID. key (unicode): The match ID.
on_match (callable): Callback executed on match. on_match (callable): Callback executed on match.
*docs (Doc): `Doc` objects representing match patterns. *docs (Doc): `Doc` objects representing match patterns.
DOCS: https://spacy.io/api/phrasematcher#add
""" """
cdef Doc doc cdef Doc doc
cdef hash_t ent_id = self.matcher._normalize_key(key) cdef hash_t ent_id = self.matcher._normalize_key(key)
@ -112,8 +134,7 @@ cdef class PhraseMatcher:
lexeme = self.vocab[attr_value] lexeme = self.vocab[attr_value]
lexeme.set_flag(tag, True) lexeme.set_flag(tag, True)
phrase_key[i] = lexeme.orth phrase_key[i] = lexeme.orth
phrase_hash = hash64(phrase_key, phrase_hash = hash64(phrase_key, length * sizeof(attr_t), 0)
length * sizeof(attr_t), 0)
self.phrase_ids.set(phrase_hash, <void*>ent_id) self.phrase_ids.set(phrase_hash, <void*>ent_id)
def __call__(self, Doc doc): def __call__(self, Doc doc):
@ -123,6 +144,8 @@ cdef class PhraseMatcher:
RETURNS (list): A list of `(key, start, end)` tuples, RETURNS (list): A list of `(key, start, end)` tuples,
describing the matches. A match tuple describes a span describing the matches. A match tuple describes a span
`doc[start:end]`. The `label_id` and `key` are both integers. `doc[start:end]`. The `label_id` and `key` are both integers.
DOCS: https://spacy.io/api/phrasematcher#call
""" """
matches = [] matches = []
if self.attr == ORTH: if self.attr == ORTH:
@ -158,6 +181,8 @@ cdef class PhraseMatcher:
If both return_matches and as_tuples are True, the output will If both return_matches and as_tuples are True, the output will
be a sequence of ((doc, matches), context) tuples. be a sequence of ((doc, matches), context) tuples.
YIELDS (Doc): Documents, in order. YIELDS (Doc): Documents, in order.
DOCS: https://spacy.io/api/phrasematcher#pipe
""" """
if as_tuples: if as_tuples:
for doc, context in stream: for doc, context in stream:
@ -180,8 +205,7 @@ cdef class PhraseMatcher:
phrase_key = <attr_t*>mem.alloc(end-start, sizeof(attr_t)) phrase_key = <attr_t*>mem.alloc(end-start, sizeof(attr_t))
for i, j in enumerate(range(start, end)): for i, j in enumerate(range(start, end)):
phrase_key[i] = doc.c[j].lex.orth phrase_key[i] = doc.c[j].lex.orth
cdef hash_t key = hash64(phrase_key, cdef hash_t key = hash64(phrase_key, (end-start) * sizeof(attr_t), 0)
(end-start) * sizeof(attr_t), 0)
ent_id = <hash_t>self.phrase_ids.get(key) ent_id = <hash_t>self.phrase_ids.get(key)
if ent_id == 0: if ent_id == 0:
return None return None
@ -203,12 +227,12 @@ cdef class PhraseMatcher:
# Concatenate the attr name and value to not pollute lexeme space # Concatenate the attr name and value to not pollute lexeme space
# e.g. 'POS-VERB' instead of just 'VERB', which could otherwise # e.g. 'POS-VERB' instead of just 'VERB', which could otherwise
# create false positive matches # create false positive matches
return 'matcher:{}-{}'.format(string_attr_name, string_attr_value) return "matcher:{}-{}".format(string_attr_name, string_attr_value)
def get_bilou(length): def get_bilou(length):
if length == 0: if length == 0:
raise ValueError("Length must be >= 1") raise ValueError(Errors.E127)
elif length == 1: elif length == 1:
return [U_ENT] return [U_ENT]
elif length == 2: elif length == 2:
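
The `get_bilou` hunk is cut off after the length-2 branch. As a rough illustration of the BILOU scheme it encodes, here is a minimal pure-Python sketch. The string tags and the generic handling of spans longer than two tokens are assumptions made for illustration; the real function returns integer constants such as `U_ENT`, and its remaining branches are not shown in this diff.

```python
def get_bilou_tags(length):
    """Return one BILOU tag per token for an entity span of `length` tokens.

    Hypothetical stand-in for get_bilou: uses plain strings instead of the
    integer constants (U_ENT etc.) used in the actual source.
    """
    if length < 1:
        raise ValueError("Length must be >= 1")
    if length == 1:
        # A single-token entity is a "unit".
        return ["U"]
    # Longer spans begin with B, end with L, and are I in between.
    return ["B"] + ["I"] * (length - 2) + ["L"]


print(get_bilou_tags(1))
print(get_bilou_tags(2))
print(get_bilou_tags(4))
```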

@ -1,9 +1,23 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from .pipes import Tagger, DependencyParser, EntityRecognizer # noqa from .pipes import Tagger, DependencyParser, EntityRecognizer, Morphologizer
from .pipes import TextCategorizer, Tensorizer, Pipe # noqa from .pipes import TextCategorizer, Tensorizer, Pipe
from .morphologizer import Morphologizer from .entityruler import EntityRuler
from .entityruler import EntityRuler # noqa from .hooks import SentenceSegmenter, SimilarityHook
from .hooks import SentenceSegmenter, SimilarityHook # noqa from .functions import merge_entities, merge_noun_chunks, merge_subtokens
from .functions import merge_entities, merge_noun_chunks, merge_subtokens # noqa
__all__ = [
"Tagger",
"DependencyParser",
"EntityRecognizer",
"TextCategorizer",
"Tensorizer",
"Pipe",
"EntityRuler",
"SentenceSegmenter",
"SimilarityHook",
"merge_entities",
"merge_noun_chunks",
"merge_subtokens",
]

@ -12,10 +12,20 @@ from ..matcher import Matcher, PhraseMatcher
class EntityRuler(object): class EntityRuler(object):
"""The EntityRuler lets you add spans to the `Doc.ents` using token-based
rules or exact phrase matches. It can be combined with the statistical
`EntityRecognizer` to boost accuracy, or used on its own to implement a
purely rule-based entity recognition system. After initialization, the
component is typically added to the pipeline using `nlp.add_pipe`.
DOCS: https://spacy.io/api/entityruler
USAGE: https://spacy.io/usage/rule-based-matching#entityruler
"""
name = "entity_ruler" name = "entity_ruler"
def __init__(self, nlp, **cfg): def __init__(self, nlp, **cfg):
"""Initialise the entity ruler. If patterns are supplied here, they """Initialize the entity ruler. If patterns are supplied here, they
need to be a list of dictionaries with a `"label"` and `"pattern"` need to be a list of dictionaries with a `"label"` and `"pattern"`
key. A pattern can either be a token pattern (list) or a phrase pattern key. A pattern can either be a token pattern (list) or a phrase pattern
(string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`. (string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`.
@ -29,6 +39,8 @@ class EntityRuler(object):
of a model pipeline, this will include all keyword arguments passed of a model pipeline, this will include all keyword arguments passed
to `spacy.load`. to `spacy.load`.
RETURNS (EntityRuler): The newly constructed object. RETURNS (EntityRuler): The newly constructed object.
DOCS: https://spacy.io/api/entityruler#init
""" """
self.nlp = nlp self.nlp = nlp
self.overwrite = cfg.get("overwrite_ents", False) self.overwrite = cfg.get("overwrite_ents", False)
@ -55,6 +67,8 @@ class EntityRuler(object):
doc (Doc): The Doc object in the pipeline. doc (Doc): The Doc object in the pipeline.
RETURNS (Doc): The Doc with added entities, if available. RETURNS (Doc): The Doc with added entities, if available.
DOCS: https://spacy.io/api/entityruler#call
""" """
matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc)) matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc))
matches = set( matches = set(
@ -83,6 +97,8 @@ class EntityRuler(object):
"""All labels present in the match patterns. """All labels present in the match patterns.
RETURNS (set): The string labels. RETURNS (set): The string labels.
DOCS: https://spacy.io/api/entityruler#labels
""" """
all_labels = set(self.token_patterns.keys()) all_labels = set(self.token_patterns.keys())
all_labels.update(self.phrase_patterns.keys()) all_labels.update(self.phrase_patterns.keys())
@ -93,6 +109,8 @@ class EntityRuler(object):
"""Get all patterns that were added to the entity ruler. """Get all patterns that were added to the entity ruler.
RETURNS (list): The original patterns, one dictionary per pattern. RETURNS (list): The original patterns, one dictionary per pattern.
DOCS: https://spacy.io/api/entityruler#patterns
""" """
all_patterns = [] all_patterns = []
for label, patterns in self.token_patterns.items(): for label, patterns in self.token_patterns.items():
@ -110,6 +128,8 @@ class EntityRuler(object):
{'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]} {'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}
patterns (list): The patterns to add. patterns (list): The patterns to add.
DOCS: https://spacy.io/api/entityruler#add_patterns
""" """
for entry in patterns: for entry in patterns:
label = entry["label"] label = entry["label"]
@ -131,6 +151,8 @@ class EntityRuler(object):
patterns_bytes (bytes): The bytestring to load. patterns_bytes (bytes): The bytestring to load.
**kwargs: Other config parameters, mostly for consistency. **kwargs: Other config parameters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler. RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_bytes
""" """
patterns = srsly.msgpack_loads(patterns_bytes) patterns = srsly.msgpack_loads(patterns_bytes)
self.add_patterns(patterns) self.add_patterns(patterns)
@ -140,6 +162,8 @@ class EntityRuler(object):
"""Serialize the entity ruler patterns to a bytestring. """Serialize the entity ruler patterns to a bytestring.
RETURNS (bytes): The serialized patterns. RETURNS (bytes): The serialized patterns.
DOCS: https://spacy.io/api/entityruler#to_bytes
""" """
return srsly.msgpack_dumps(self.patterns) return srsly.msgpack_dumps(self.patterns)
@ -150,6 +174,8 @@ class EntityRuler(object):
path (unicode / Path): The JSONL file to load. path (unicode / Path): The JSONL file to load.
**kwargs: Other config parameters, mostly for consistency. **kwargs: Other config parameters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler. RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler#from_disk
""" """
path = ensure_path(path) path = ensure_path(path)
path = path.with_suffix(".jsonl") path = path.with_suffix(".jsonl")
@ -164,6 +190,8 @@ class EntityRuler(object):
path (unicode / Path): The JSONL file to load. path (unicode / Path): The JSONL file to load.
**kwargs: Other config parameters, mostly for consistency. **kwargs: Other config parameters, mostly for consistency.
RETURNS (EntityRuler): The loaded entity ruler. RETURNS (EntityRuler): The loaded entity ruler.
DOCS: https://spacy.io/api/entityruler
""" """
path = ensure_path(path) path = ensure_path(path)
path = path.with_suffix(".jsonl") path = path.with_suffix(".jsonl")
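
The docstrings in this file describe patterns as dictionaries with a `"label"` and a `"pattern"` key, where the pattern is either a token pattern (list) or a phrase pattern (string), and the on-disk format is JSONL. The sketch below shows one way that split and a JSONL round trip could look using only the standard library. The helper names `split_patterns`, `dumps_jsonl` and `loads_jsonl` are hypothetical; the `EntityRuler` itself relies on `srsly` and on spaCy's `Matcher` and `PhraseMatcher`, which are not reproduced here.

```python
import json
from collections import defaultdict


def split_patterns(patterns):
    """Group patterns by label, separating token patterns from phrase patterns."""
    token_patterns = defaultdict(list)
    phrase_patterns = defaultdict(list)
    for entry in patterns:
        label, pattern = entry["label"], entry["pattern"]
        if isinstance(pattern, str):
            # Phrase pattern: an exact string to match.
            phrase_patterns[label].append(pattern)
        elif isinstance(pattern, list):
            # Token pattern: a list of per-token attribute dicts.
            token_patterns[label].append(pattern)
        else:
            raise ValueError("Pattern must be a string or a list of dicts")
    return dict(token_patterns), dict(phrase_patterns)


def dumps_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(entry) for entry in patterns)


def loads_jsonl(text):
    """Parse JSONL text back into a list of pattern dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
token_patterns, phrase_patterns = split_patterns(patterns)
restored = loads_jsonl(dumps_jsonl(patterns))
```

Keeping the original dictionaries intact, rather than only the compiled matchers, is what makes the round trip lossless in this sketch.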

@ -9,6 +9,8 @@ def merge_noun_chunks(doc):
doc (Doc): The Doc object. doc (Doc): The Doc object.
RETURNS (Doc): The Doc object with merged noun chunks. RETURNS (Doc): The Doc object with merged noun chunks.
DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
""" """
if not doc.is_parsed: if not doc.is_parsed:
return doc return doc
@ -23,7 +25,9 @@ def merge_entities(doc):
"""Merge entities into a single token. """Merge entities into a single token.
doc (Doc): The Doc object. doc (Doc): The Doc object.
RETURNS (Doc): The Doc object with merged noun entities. RETURNS (Doc): The Doc object with merged entities.
DOCS: https://spacy.io/api/pipeline-functions#merge_entities
""" """
with doc.retokenize() as retokenizer: with doc.retokenize() as retokenizer:
for ent in doc.ents: for ent in doc.ents:
@ -33,6 +37,14 @@ def merge_entities(doc):
def merge_subtokens(doc, label="subtok"): def merge_subtokens(doc, label="subtok"):
"""Merge subtokens into a single token.
doc (Doc): The Doc object.
label (unicode): The subtoken dependency label.
RETURNS (Doc): The Doc object with merged subtokens.
DOCS: https://spacy.io/api/pipeline-functions#merge_subtokens
"""
merger = Matcher(doc.vocab) merger = Matcher(doc.vocab)
merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
matches = merger(doc) matches = merger(doc)

@ -15,6 +15,8 @@ class SentenceSegmenter(object):
initialization, or assign a new strategy to the .strategy attribute. initialization, or assign a new strategy to the .strategy attribute.
Sentence detection strategies should be generators that take `Doc` objects Sentence detection strategies should be generators that take `Doc` objects
and yield `Span` objects for each sentence. and yield `Span` objects for each sentence.
DOCS: https://spacy.io/api/sentencesegmenter
""" """
name = "sentencizer" name = "sentencizer"

@ -6,9 +6,8 @@ from __future__ import unicode_literals
cimport numpy as np cimport numpy as np
import numpy import numpy
from collections import OrderedDict
import srsly import srsly
from collections import OrderedDict
from thinc.api import chain from thinc.api import chain
from thinc.v2v import Affine, Maxout, Softmax from thinc.v2v import Affine, Maxout, Softmax
from thinc.misc import LayerNorm from thinc.misc import LayerNorm
@ -284,9 +283,7 @@ class Tensorizer(Pipe):
""" """
for doc, tensor in zip(docs, tensors): for doc, tensor in zip(docs, tensors):
if tensor.shape[0] != len(doc): if tensor.shape[0] != len(doc):
raise ValueError( raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
Errors.E076.format(rows=tensor.shape[0], words=len(doc))
)
doc.tensor = tensor doc.tensor = tensor
def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
@ -346,14 +343,19 @@ class Tensorizer(Pipe):
class Tagger(Pipe): class Tagger(Pipe):
name = 'tagger' """Pipeline component for part-of-speech tagging.
DOCS: https://spacy.io/api/tagger
"""
name = "tagger"
def __init__(self, vocab, model=True, **cfg): def __init__(self, vocab, model=True, **cfg):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
self._rehearsal_model = None self._rehearsal_model = None
self.cfg = OrderedDict(sorted(cfg.items())) self.cfg = OrderedDict(sorted(cfg.items()))
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault("cnn_maxout_pieces", 2)
@property @property
def labels(self): def labels(self):
@ -404,7 +406,7 @@ class Tagger(Pipe):
cdef Vocab vocab = self.vocab cdef Vocab vocab = self.vocab
for i, doc in enumerate(docs): for i, doc in enumerate(docs):
doc_tag_ids = batch_tag_ids[i] doc_tag_ids = batch_tag_ids[i]
if hasattr(doc_tag_ids, 'get'): if hasattr(doc_tag_ids, "get"):
doc_tag_ids = doc_tag_ids.get() doc_tag_ids = doc_tag_ids.get()
for j, tag_id in enumerate(doc_tag_ids): for j, tag_id in enumerate(doc_tag_ids):
# Don't clobber preset POS tags # Don't clobber preset POS tags
@ -453,9 +455,9 @@ class Tagger(Pipe):
scores = self.model.ops.flatten(scores) scores = self.model.ops.flatten(scores)
tag_index = {tag: i for i, tag in enumerate(self.labels)} tag_index = {tag: i for i, tag in enumerate(self.labels)}
cdef int idx = 0 cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i') correct = numpy.zeros((scores.shape[0],), dtype="i")
guesses = scores.argmax(axis=1) guesses = scores.argmax(axis=1)
known_labels = numpy.ones((scores.shape[0], 1), dtype='f') known_labels = numpy.ones((scores.shape[0], 1), dtype="f")
for gold in golds: for gold in golds:
for tag in gold.tags: for tag in gold.tags:
if tag is None: if tag is None:
@ -466,7 +468,7 @@ class Tagger(Pipe):
correct[idx] = 0 correct[idx] = 0
known_labels[idx] = 0. known_labels[idx] = 0.
idx += 1 idx += 1
correct = self.model.ops.xp.array(correct, dtype='i') correct = self.model.ops.xp.array(correct, dtype="i")
d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1]) d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
d_scores *= self.model.ops.asarray(known_labels) d_scores *= self.model.ops.asarray(known_labels)
loss = (d_scores**2).sum() loss = (d_scores**2).sum()
@ -490,9 +492,9 @@ class Tagger(Pipe):
vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer, vocab.morphology.lemmatizer,
exc=vocab.morphology.exc) exc=vocab.morphology.exc)
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors') self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
if self.model is True: if self.model is True:
for hp in ['token_vector_width', 'conv_depth']: for hp in ["token_vector_width", "conv_depth"]:
if hp in kwargs: if hp in kwargs:
self.cfg[hp] = kwargs[hp] self.cfg[hp] = kwargs[hp]
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
@ -503,7 +505,7 @@ class Tagger(Pipe):
@classmethod @classmethod
def Model(cls, n_tags, **cfg): def Model(cls, n_tags, **cfg):
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'): if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"):
raise ValueError(TempErrors.T008) raise ValueError(TempErrors.T008)
return build_tagger_model(n_tags, **cfg) return build_tagger_model(n_tags, **cfg)
@ -538,25 +540,23 @@ class Tagger(Pipe):
def to_bytes(self, **exclude): def to_bytes(self, **exclude):
serialize = OrderedDict() serialize = OrderedDict()
if self.model not in (None, True, False): if self.model not in (None, True, False):
serialize['model'] = self.model.to_bytes serialize["model"] = self.model.to_bytes
serialize['vocab'] = self.vocab.to_bytes serialize["vocab"] = self.vocab.to_bytes
serialize['cfg'] = lambda: srsly.json_dumps(self.cfg) serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize['tag_map'] = lambda: srsly.msgpack_dumps(tag_map) serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
return util.to_bytes(serialize, exclude) return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models # TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg: if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name self.cfg["pretrained_vectors"] = self.vocab.vectors.name
if self.model is True: if self.model is True:
token_vector_width = util.env_opt( token_vector_width = util.env_opt(
'token_vector_width', "token_vector_width",
self.cfg.get('token_vector_width', 96)) self.cfg.get("token_vector_width", 96))
self.model = self.Model(self.vocab.morphology.n_tags, self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
**self.cfg)
self.model.from_bytes(b) self.model.from_bytes(b)
def load_tag_map(b): def load_tag_map(b):
@ -567,10 +567,10 @@ class Tagger(Pipe):
exc=self.vocab.morphology.exc) exc=self.vocab.morphology.exc)
deserialize = OrderedDict(( deserialize = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)), ("vocab", lambda b: self.vocab.from_bytes(b)),
('tag_map', load_tag_map), ("tag_map", load_tag_map),
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))), ("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
('model', lambda b: load_model(b)), ("model", lambda b: load_model(b)),
)) ))
util.from_bytes(bytes_data, deserialize, exclude) util.from_bytes(bytes_data, deserialize, exclude)
return self return self
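
The `to_bytes` and `from_bytes` methods above both build an ordered mapping from names to callbacks and pass it to `util.to_bytes` or `util.from_bytes` together with `exclude`. Those helpers are not part of this diff, so the following is only a sketch of how such a pair could behave. It uses JSON instead of msgpack, treats `exclude` as a plain collection of keys, and all function bodies here are assumptions rather than spaCy's implementation.

```python
import json
from collections import OrderedDict


def to_bytes(getters, exclude):
    """Call each getter unless its key is excluded, then pack the results."""
    serialized = OrderedDict()
    for key, getter in getters.items():
        if key not in exclude:
            serialized[key] = getter()
    return json.dumps(serialized).encode("utf8")


def from_bytes(bytes_data, setters, exclude):
    """Unpack the payload and feed each value to its setter, in setter order."""
    msg = json.loads(bytes_data.decode("utf8"))
    for key, setter in setters.items():
        if key not in exclude and key in msg:
            setter(msg[key])
    return msg


cfg = {"cnn_maxout_pieces": 2}
tag_map = {"NN": {"pos": "NOUN"}}

serialize = OrderedDict()
serialize["cfg"] = lambda: cfg
serialize["tag_map"] = lambda: tag_map
# Excluded keys never have their getter called.
data = to_bytes(serialize, exclude=["tag_map"])

loaded_cfg = {}
deserialize = OrderedDict((
    ("tag_map", lambda value: None),
    ("cfg", lambda value: loaded_cfg.update(value)),
))
from_bytes(data, deserialize, exclude=[])
```

Using callbacks rather than precomputed values means an excluded component is never serialized at all, and the ordered mapping fixes the order in which components are restored.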
@ -580,7 +580,7 @@ class Tagger(Pipe):
serialize = OrderedDict(( serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)), ('vocab', lambda p: self.vocab.to_disk(p)),
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)), ('tag_map', lambda p: srsly.write_msgpack(p, tag_map)),
('model', lambda p: p.open('wb').write(self.model.to_bytes())), ('model', lambda p: p.open("wb").write(self.model.to_bytes())),
('cfg', lambda p: srsly.write_json(p, self.cfg)) ('cfg', lambda p: srsly.write_json(p, self.cfg))
)) ))
util.to_disk(path, serialize, exclude) util.to_disk(path, serialize, exclude)
@ -588,11 +588,11 @@ class Tagger(Pipe):
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models # TODO: Remove this once we don't have to handle previous models
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg: if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
self.cfg['pretrained_vectors'] = self.vocab.vectors.name self.cfg["pretrained_vectors"] = self.vocab.vectors.name
if self.model is True: if self.model is True:
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
with p.open('rb') as file_: with p.open("rb") as file_:
self.model.from_bytes(file_.read()) self.model.from_bytes(file_.read())
def load_tag_map(p): def load_tag_map(p):
@ -603,10 +603,10 @@ class Tagger(Pipe):
exc=self.vocab.morphology.exc) exc=self.vocab.morphology.exc)
deserialize = OrderedDict(( deserialize = OrderedDict((
('cfg', lambda p: self.cfg.update(_load_cfg(p))), ("cfg", lambda p: self.cfg.update(_load_cfg(p))),
('vocab', lambda p: self.vocab.from_disk(p)), ("vocab", lambda p: self.vocab.from_disk(p)),
('tag_map', load_tag_map), ("tag_map", load_tag_map),
('model', load_model), ("model", load_model),
)) ))
util.from_disk(path, deserialize, exclude) util.from_disk(path, deserialize, exclude)
return self return self
@ -616,37 +616,38 @@ class MultitaskObjective(Tagger):
"""Experimental: Assist training of a parser or tagger, by training a """Experimental: Assist training of a parser or tagger, by training a
side-objective. side-objective.
""" """
name = 'nn_labeller'
name = "nn_labeller"
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
self.vocab = vocab self.vocab = vocab
self.model = model self.model = model
if target == 'dep': if target == "dep":
self.make_label = self.make_dep self.make_label = self.make_dep
elif target == 'tag': elif target == "tag":
self.make_label = self.make_tag self.make_label = self.make_tag
elif target == 'ent': elif target == "ent":
self.make_label = self.make_ent self.make_label = self.make_ent
elif target == 'dep_tag_offset': elif target == "dep_tag_offset":
self.make_label = self.make_dep_tag_offset self.make_label = self.make_dep_tag_offset
elif target == 'ent_tag': elif target == "ent_tag":
self.make_label = self.make_ent_tag self.make_label = self.make_ent_tag
elif target == 'sent_start': elif target == "sent_start":
self.make_label = self.make_sent_start self.make_label = self.make_sent_start
elif hasattr(target, '__call__'): elif hasattr(target, "__call__"):
self.make_label = target self.make_label = target
else: else:
raise ValueError(Errors.E016) raise ValueError(Errors.E016)
self.cfg = dict(cfg) self.cfg = dict(cfg)
self.cfg.setdefault('cnn_maxout_pieces', 2) self.cfg.setdefault("cnn_maxout_pieces", 2)
@property @property
def labels(self): def labels(self):
return self.cfg.setdefault('labels', {}) return self.cfg.setdefault("labels", {})
@labels.setter @labels.setter
def labels(self, value): def labels(self, value):
self.cfg['labels'] = value self.cfg["labels"] = value
def set_annotations(self, docs, dep_ids, tensors=None): def set_annotations(self, docs, dep_ids, tensors=None):
pass pass
@ -662,7 +663,7 @@ class MultitaskObjective(Tagger):
if label is not None and label not in self.labels: if label is not None and label not in self.labels:
self.labels[label] = len(self.labels) self.labels[label] = len(self.labels)
if self.model is True: if self.model is True:
token_vector_width = util.env_opt('token_vector_width') token_vector_width = util.env_opt("token_vector_width")
self.model = self.Model(len(self.labels), tok2vec=tok2vec) self.model = self.Model(len(self.labels), tok2vec=tok2vec)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
@ -671,7 +672,7 @@ class MultitaskObjective(Tagger):
@classmethod @classmethod
def Model(cls, n_tags, tok2vec=None, **cfg): def Model(cls, n_tags, tok2vec=None, **cfg):
token_vector_width = util.env_opt('token_vector_width', 96) token_vector_width = util.env_opt("token_vector_width", 96)
softmax = Softmax(n_tags, token_vector_width*2) softmax = Softmax(n_tags, token_vector_width*2)
model = chain( model = chain(
tok2vec, tok2vec,
@ -690,10 +691,10 @@ class MultitaskObjective(Tagger):
def get_loss(self, docs, golds, scores): def get_loss(self, docs, golds, scores):
if len(docs) != len(golds): if len(docs) != len(golds):
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs), raise ValueError(Errors.E077.format(value="loss", n_docs=len(docs),
n_golds=len(golds))) n_golds=len(golds)))
cdef int idx = 0 cdef int idx = 0
correct = numpy.zeros((scores.shape[0],), dtype='i') correct = numpy.zeros((scores.shape[0],), dtype="i")
guesses = scores.argmax(axis=1) guesses = scores.argmax(axis=1)
for i, gold in enumerate(golds): for i, gold in enumerate(golds):
for j in range(len(docs[i])): for j in range(len(docs[i])):
@ -705,7 +706,7 @@ class MultitaskObjective(Tagger):
else: else:
correct[idx] = self.labels[label] correct[idx] = self.labels[label]
idx += 1 idx += 1
correct = self.model.ops.xp.array(correct, dtype='i') correct = self.model.ops.xp.array(correct, dtype="i")
d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1]) d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
loss = (d_scores**2).sum() loss = (d_scores**2).sum()
return float(loss), d_scores return float(loss), d_scores
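
The loss in this hunk is the gradient `scores - to_categorical(correct)` and the sum of its squares. A small pure-Python sketch with nested lists in place of arrays may make the arithmetic concrete. The local `to_categorical` and the name `squared_error_loss` are illustrative stand-ins, not the functions imported by the module.

```python
def to_categorical(correct, nb_classes):
    """One-hot encode a list of class indices as rows of floats."""
    return [[1.0 if j == idx else 0.0 for j in range(nb_classes)]
            for idx in correct]


def squared_error_loss(scores, correct):
    """Return (loss, d_scores) where d_scores = scores - one_hot(correct)."""
    targets = to_categorical(correct, nb_classes=len(scores[0]))
    d_scores = [[s - t for s, t in zip(score_row, target_row)]
                for score_row, target_row in zip(scores, targets)]
    loss = sum(d ** 2 for row in d_scores for d in row)
    return float(loss), d_scores


# Two tokens, two classes; class 0 is correct for both.
scores = [[0.5, 0.5], [1.0, 0.0]]
correct = [0, 0]
loss, d_scores = squared_error_loss(scores, correct)
```

The second token is predicted perfectly, so its gradient row is all zeros and only the first token contributes to the loss.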
@ -733,25 +734,25 @@ class MultitaskObjective(Tagger):
offset = heads[i] - i offset = heads[i] - i
offset = min(offset, 2) offset = min(offset, 2)
offset = max(offset, -2) offset = max(offset, -2)
return '%s-%s:%d' % (deps[i], tags[i], offset) return "%s-%s:%d" % (deps[i], tags[i], offset)
@staticmethod @staticmethod
def make_ent_tag(i, words, tags, heads, deps, ents): def make_ent_tag(i, words, tags, heads, deps, ents):
if ents is None or ents[i] is None: if ents is None or ents[i] is None:
return None return None
else: else:
return '%s-%s' % (tags[i], ents[i]) return "%s-%s" % (tags[i], ents[i])
@staticmethod @staticmethod
def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}): def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}):
'''A multi-task objective for representing sentence boundaries, """A multi-task objective for representing sentence boundaries,
using BILU scheme. (O is impossible) using BILU scheme. (O is impossible)
The implementation of this method uses an internal cache that relies The implementation of this method uses an internal cache that relies
on the identity of the heads array, to avoid requiring a new piece on the identity of the heads array, to avoid requiring a new piece
of gold data. You can pass cache=False if you know the cache will of gold data. You can pass cache=False if you know the cache will
do the wrong thing. do the wrong thing.
''' """
assert len(words) == len(heads) assert len(words) == len(heads)
assert target < len(words), (target, len(words)) assert target < len(words), (target, len(words))
if cache: if cache:
@ -760,10 +761,10 @@ class MultitaskObjective(Tagger):
else: else:
for key in list(_cache.keys()): for key in list(_cache.keys()):
_cache.pop(key) _cache.pop(key)
sent_tags = ['I-SENT'] * len(words) sent_tags = ["I-SENT"] * len(words)
_cache[id(heads)] = sent_tags _cache[id(heads)] = sent_tags
else: else:
sent_tags = ['I-SENT'] * len(words) sent_tags = ["I-SENT"] * len(words)
def _find_root(child): def _find_root(child):
seen = set([child]) seen = set([child])
@ -781,10 +782,10 @@ class MultitaskObjective(Tagger):
sentences.setdefault(root, []).append(i) sentences.setdefault(root, []).append(i)
for root, span in sorted(sentences.items()): for root, span in sorted(sentences.items()):
if len(span) == 1: if len(span) == 1:
sent_tags[span[0]] = 'U-SENT' sent_tags[span[0]] = "U-SENT"
else: else:
sent_tags[span[0]] = 'B-SENT' sent_tags[span[0]] = "B-SENT"
sent_tags[span[-1]] = 'L-SENT' sent_tags[span[-1]] = "L-SENT"
return sent_tags[target] return sent_tags[target]
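
The visible parts of `make_sent_start` group tokens by the root of their dependency tree and then mark the first and last token of each group. The sketch below reassembles that logic as a standalone function without the cache. It assumes that `heads[i]` is the index of token `i`'s head and that a root token is its own head. The cycle guard in `find_root` is an addition for safety and is not taken from the source.

```python
def sentence_boundary_tags(heads):
    """Tag each token B-SENT, I-SENT, L-SENT or U-SENT from dependency heads.

    Assumes heads[i] is the index of token i's head and that a root token
    is its own head.
    """
    def find_root(child):
        seen = set()
        # Follow head links until reaching a root (or detecting a cycle).
        while heads[child] != child and child not in seen:
            seen.add(child)
            child = heads[child]
        return child

    sentences = {}
    for i in range(len(heads)):
        sentences.setdefault(find_root(i), []).append(i)

    sent_tags = ["I-SENT"] * len(heads)
    for root, span in sorted(sentences.items()):
        if len(span) == 1:
            sent_tags[span[0]] = "U-SENT"
        else:
            sent_tags[span[0]] = "B-SENT"
            sent_tags[span[-1]] = "L-SENT"
    return sent_tags


# Tokens 0-2 form one tree rooted at 1; token 3 is a one-token sentence.
heads = [1, 1, 1, 3]
tags = sentence_boundary_tags(heads)
```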
@ -854,6 +855,10 @@ class ClozeMultitask(Pipe):
class TextCategorizer(Pipe): class TextCategorizer(Pipe):
"""Pipeline component for text classification.
DOCS: https://spacy.io/api/textcategorizer
"""
name = 'textcat' name = 'textcat'
@classmethod @classmethod
@ -863,7 +868,7 @@ class TextCategorizer(Pipe):
token_vector_width = cfg["token_vector_width"] token_vector_width = cfg["token_vector_width"]
else: else:
token_vector_width = util.env_opt("token_vector_width", 96) token_vector_width = util.env_opt("token_vector_width", 96)
if cfg.get('architecture') == 'simple_cnn': if cfg.get("architecture") == "simple_cnn":
tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg) tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg)
return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg) return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg)
else: else:
@ -884,11 +889,11 @@ class TextCategorizer(Pipe):
@property @property
def labels(self): def labels(self):
return tuple(self.cfg.setdefault('labels', [])) return tuple(self.cfg.setdefault("labels", []))
@labels.setter @labels.setter
def labels(self, value): def labels(self, value):
self.cfg['labels'] = tuple(value) self.cfg["labels"] = tuple(value)
def __call__(self, doc): def __call__(self, doc):
scores, tensors = self.predict([doc]) scores, tensors = self.predict([doc])
@ -934,8 +939,8 @@ class TextCategorizer(Pipe):
losses[self.name] += (gradient**2).sum() losses[self.name] += (gradient**2).sum()
def get_loss(self, docs, golds, scores): def get_loss(self, docs, golds, scores):
truths = numpy.zeros((len(golds), len(self.labels)), dtype='f') truths = numpy.zeros((len(golds), len(self.labels)), dtype="f")
not_missing = numpy.ones((len(golds), len(self.labels)), dtype='f') not_missing = numpy.ones((len(golds), len(self.labels)), dtype="f")
for i, gold in enumerate(golds): for i, gold in enumerate(golds):
for j, label in enumerate(self.labels): for j, label in enumerate(self.labels):
if label in gold.cats: if label in gold.cats:
@ -956,20 +961,19 @@ class TextCategorizer(Pipe):
# This functionality was available previously, but was broken. # This functionality was available previously, but was broken.
# The problem is that we resize the last layer, but the last layer # The problem is that we resize the last layer, but the last layer
# is actually just an ensemble. We're not resizing the child layers # is actually just an ensemble. We're not resizing the child layers
# -- a huge problem. # - a huge problem.
raise ValueError(Errors.E116) raise ValueError(Errors.E116)
#smaller = self.model._layers[-1] # smaller = self.model._layers[-1]
#larger = Affine(len(self.labels)+1, smaller.nI) # larger = Affine(len(self.labels)+1, smaller.nI)
#copy_array(larger.W[:smaller.nO], smaller.W) # copy_array(larger.W[:smaller.nO], smaller.W)
#copy_array(larger.b[:smaller.nO], smaller.b) # copy_array(larger.b[:smaller.nO], smaller.b)
#self.model._layers[-1] = larger # self.model._layers[-1] = larger
self.labels = tuple(list(self.labels) + [label]) self.labels = tuple(list(self.labels) + [label])
return 1 return 1
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
**kwargs):
if self.model is True: if self.model is True:
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors') self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
self.model = self.Model(len(self.labels), **self.cfg) self.model = self.Model(len(self.labels), **self.cfg)
link_vectors_to_models(self.vocab) link_vectors_to_models(self.vocab)
if sgd is None: if sgd is None:
@ -978,7 +982,12 @@ class TextCategorizer(Pipe):
cdef class DependencyParser(Parser): cdef class DependencyParser(Parser):
name = 'parser' """Pipeline component for dependency parsing.
DOCS: https://spacy.io/api/dependencyparser
"""
name = "parser"
TransitionSystem = ArcEager TransitionSystem = ArcEager
@property @property
@ -986,7 +995,7 @@ cdef class DependencyParser(Parser):
return [nonproj.deprojectivize] return [nonproj.deprojectivize]
def add_multitask_objective(self, target): def add_multitask_objective(self, target):
if target == 'cloze': if target == "cloze":
cloze = ClozeMultitask(self.vocab) cloze = ClozeMultitask(self.vocab)
self._multitasks.append(cloze) self._multitasks.append(cloze)
else: else:
@ -1000,8 +1009,7 @@ cdef class DependencyParser(Parser):
tok2vec=tok2vec, sgd=sgd) tok2vec=tok2vec, sgd=sgd)
def __reduce__(self): def __reduce__(self):
return (DependencyParser, (self.vocab, self.moves, self.model), return (DependencyParser, (self.vocab, self.moves, self.model), None, None)
None, None)
@property @property
def labels(self): def labels(self):
@ -1010,6 +1018,11 @@ cdef class DependencyParser(Parser):
cdef class EntityRecognizer(Parser): cdef class EntityRecognizer(Parser):
"""Pipeline component for named entity recognition.
DOCS: https://spacy.io/api/entityrecognizer
"""
name = "ner" name = "ner"
TransitionSystem = BiluoPushDown TransitionSystem = BiluoPushDown
nr_feature = 6 nr_feature = 6
@ -1040,4 +1053,4 @@ cdef class EntityRecognizer(Parser):
if move[0] in ("B", "I", "L", "U"))) if move[0] in ("B", "I", "L", "U")))
__all__ = ['Tagger', 'DependencyParser', 'EntityRecognizer', 'Tensorizer', 'TextCategorizer'] __all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer"]

@ -20,7 +20,7 @@ from . import util
def get_string_id(key): def get_string_id(key):
"""Get a string ID, handling the reserved symbols correctly. If the key is """Get a string ID, handling the reserved symbols correctly. If the key is
already an ID, return it. already an ID, return it.
This function optimises for convenience over performance, so shouldn't be This function optimises for convenience over performance, so shouldn't be
used in tight loops. used in tight loops.
""" """
@ -31,12 +31,12 @@ def get_string_id(key):
elif not key: elif not key:
return 0 return 0
else: else:
chars = key.encode('utf8') chars = key.encode("utf8")
return hash_utf8(chars, len(chars)) return hash_utf8(chars, len(chars))
cpdef hash_t hash_string(unicode string) except 0: cpdef hash_t hash_string(unicode string) except 0:
chars = string.encode('utf8') chars = string.encode("utf8")
return hash_utf8(chars, len(chars)) return hash_utf8(chars, len(chars))
@ -51,9 +51,9 @@ cdef uint32_t hash32_utf8(char* utf8_string, int length) nogil:
cdef unicode decode_Utf8Str(const Utf8Str* string): cdef unicode decode_Utf8Str(const Utf8Str* string):
cdef int i, length cdef int i, length
if string.s[0] < sizeof(string.s) and string.s[0] != 0: if string.s[0] < sizeof(string.s) and string.s[0] != 0:
return string.s[1:string.s[0]+1].decode('utf8') return string.s[1:string.s[0]+1].decode("utf8")
elif string.p[0] < 255: elif string.p[0] < 255:
return string.p[1:string.p[0]+1].decode('utf8') return string.p[1:string.p[0]+1].decode("utf8")
else: else:
i = 0 i = 0
length = 0 length = 0
@ -62,7 +62,7 @@ cdef unicode decode_Utf8Str(const Utf8Str* string):
length += 255 length += 255
length += string.p[i] length += string.p[i]
i += 1 i += 1
return string.p[i:length + i].decode('utf8') return string.p[i:length + i].decode("utf8")
cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) except *: cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) except *:
@ -91,7 +91,10 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
cdef class StringStore: cdef class StringStore:
"""Look up strings by 64-bit hashes.""" """Look up strings by 64-bit hashes.
DOCS: https://spacy.io/api/stringstore
"""
def __init__(self, strings=None, freeze=False): def __init__(self, strings=None, freeze=False):
"""Create the StringStore. """Create the StringStore.
@ -113,7 +116,7 @@ cdef class StringStore:
if isinstance(string_or_id, basestring) and len(string_or_id) == 0: if isinstance(string_or_id, basestring) and len(string_or_id) == 0:
return 0 return 0
elif string_or_id == 0: elif string_or_id == 0:
return u'' return ""
elif string_or_id in SYMBOLS_BY_STR: elif string_or_id in SYMBOLS_BY_STR:
return SYMBOLS_BY_STR[string_or_id] return SYMBOLS_BY_STR[string_or_id]
cdef hash_t key cdef hash_t key
@ -193,7 +196,7 @@ cdef class StringStore:
elif isinstance(string, unicode): elif isinstance(string, unicode):
key = hash_string(string) key = hash_string(string)
else: else:
string = string.encode('utf8') string = string.encode("utf8")
key = hash_utf8(string, len(string)) key = hash_utf8(string, len(string))
if key < len(SYMBOLS_BY_INT): if key < len(SYMBOLS_BY_INT):
return True return True
@ -308,7 +311,7 @@ cdef class StringStore:
cdef const Utf8Str* intern_unicode(self, unicode py_string): cdef const Utf8Str* intern_unicode(self, unicode py_string):
# 0 means missing, but we don't bother offsetting the index. # 0 means missing, but we don't bother offsetting the index.
cdef bytes byte_string = py_string.encode('utf8') cdef bytes byte_string = py_string.encode("utf8")
return self._intern_utf8(byte_string, len(byte_string)) return self._intern_utf8(byte_string, len(byte_string))
@cython.final @cython.final
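
The `decode_Utf8Str` hunks above read a length prefix before the UTF-8 payload: one byte for short strings, and a run of 255 bytes followed by a remainder byte for longer ones. A minimal pure-Python sketch of that scheme follows. It assumes the loop condition elided between the hunks is `string.p[i] == 255`; `encode_prefixed` and `decode_prefixed` are hypothetical names for illustration, not functions in `strings.pyx`.

```python
def encode_prefixed(text):
    """Prefix the UTF-8 bytes of `text` with their length (hypothetical helper)."""
    data = text.encode("utf8")
    length = len(data)
    prefix = bytearray()
    # Each 255 byte means "add 255 and keep reading"; the final byte (< 255)
    # carries the remainder.
    while length >= 255:
        prefix.append(255)
        length -= 255
    prefix.append(length)
    return bytes(prefix) + data


def decode_prefixed(buf):
    """Inverse of encode_prefixed, following the loop shown in the diff."""
    i = 0
    length = 0
    while buf[i] == 255:  # assumed condition; this line is elided in the diff
        i += 1
        length += 255
    length += buf[i]
    i += 1
    return buf[i:length + i].decode("utf8")
```

A string shorter than 255 bytes gets a single prefix byte, which matches the `string.p[0] < 255` branch in the diff.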


@ -18,8 +18,8 @@ LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
@pytest.mark.parametrize("lang", LANGUAGES) @pytest.mark.parametrize("lang", LANGUAGES)
def test_lang_initialize(lang, capfd): def test_lang_initialize(lang, capfd):
"""Test that languages can be initialized.""" """Test that languages can be initialized."""
nlp = get_lang_class(lang)() # noqa: F841 nlp = get_lang_class(lang)()
# Check for stray print statements (see #3342) # Check for stray print statements (see #3342)
doc = nlp("test") doc = nlp("test") # noqa: F841
captured = capfd.readouterr() captured = capfd.readouterr()
assert not captured.out assert not captured.out
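
The test above uses pytest's `capfd` fixture to assert that initializing a language and tokenizing a string prints nothing. The same idea can be sketched with the standard library alone. Note the difference: `contextlib.redirect_stdout` only intercepts writes to Python's `sys.stdout`, whereas `capfd` captures at the file-descriptor level. `run_silently`, `noisy_tokenize` and `quiet_tokenize` are hypothetical names used only for this illustration.

```python
import contextlib
import io


def run_silently(func, *args, **kwargs):
    """Call func and return (result, text written to sys.stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_tokenize(text):
    # Hypothetical function with a leftover debug print.
    print("debug:", text)
    return text.split()


def quiet_tokenize(text):
    # Same behaviour without the stray print.
    return text.split()
```

A check in the style of the test would then assert that the captured text is empty for `quiet_tokenize` and fail for `noisy_tokenize`.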


@ -3,16 +3,18 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from collections import OrderedDict
from cython.operator cimport dereference as deref from cython.operator cimport dereference as deref
from cython.operator cimport preincrement as preinc from cython.operator cimport preincrement as preinc
from cymem.cymem cimport Pool from cymem.cymem cimport Pool
from preshed.maps cimport PreshMap from preshed.maps cimport PreshMap
import re
cimport cython cimport cython
from collections import OrderedDict
import re
from .tokens.doc cimport Doc from .tokens.doc cimport Doc
from .strings cimport hash_string from .strings cimport hash_string
from .errors import Errors, Warnings, deprecation_warning from .errors import Errors, Warnings, deprecation_warning
from . import util from . import util
@ -20,6 +22,8 @@ from . import util
cdef class Tokenizer: cdef class Tokenizer:
"""Segment text, and create Doc objects with the discovered segment """Segment text, and create Doc objects with the discovered segment
boundaries. boundaries.
DOCS: https://spacy.io/api/tokenizer
""" """
def __init__(self, Vocab vocab, rules=None, prefix_search=None, def __init__(self, Vocab vocab, rules=None, prefix_search=None,
suffix_search=None, infix_finditer=None, token_match=None): suffix_search=None, infix_finditer=None, token_match=None):
@ -40,6 +44,8 @@ cdef class Tokenizer:
EXAMPLE: EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab) >>> tokenizer = Tokenizer(nlp.vocab)
>>> tokenizer = English().Defaults.create_tokenizer(nlp) >>> tokenizer = English().Defaults.create_tokenizer(nlp)
DOCS: https://spacy.io/api/tokenizer#init
""" """
self.mem = Pool() self.mem = Pool()
self._cache = PreshMap() self._cache = PreshMap()
@ -73,6 +79,8 @@ cdef class Tokenizer:
string (unicode): The string to tokenize. string (unicode): The string to tokenize.
RETURNS (Doc): A container for linguistic annotations. RETURNS (Doc): A container for linguistic annotations.
DOCS: https://spacy.io/api/tokenizer#call
""" """
if len(string) >= (2 ** 30): if len(string) >= (2 ** 30):
raise ValueError(Errors.E025.format(length=len(string))) raise ValueError(Errors.E025.format(length=len(string)))
@ -114,7 +122,7 @@ cdef class Tokenizer:
cache_hit = self._try_cache(key, doc) cache_hit = self._try_cache(key, doc)
if not cache_hit: if not cache_hit:
self._tokenize(doc, span, key) self._tokenize(doc, span, key)
doc.c[doc.length - 1].spacy = string[-1] == ' ' and not in_ws doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
return doc return doc
def pipe(self, texts, batch_size=1000, n_threads=2): def pipe(self, texts, batch_size=1000, n_threads=2):
@ -122,9 +130,9 @@ cdef class Tokenizer:
texts: A sequence of unicode texts. texts: A sequence of unicode texts.
batch_size (int): Number of texts to accumulate in an internal buffer. batch_size (int): Number of texts to accumulate in an internal buffer.
n_threads (int): Number of threads to use, if the implementation
supports multi-threading. The default tokenizer is single-threaded.
YIELDS (Doc): A sequence of Doc objects, in order. YIELDS (Doc): A sequence of Doc objects, in order.
DOCS: https://spacy.io/api/tokenizer#pipe
""" """
for text in texts: for text in texts:
yield self(text) yield self(text)
@ -235,7 +243,7 @@ cdef class Tokenizer:
if not matches: if not matches:
tokens.push_back(self.vocab.get(tokens.mem, string), False) tokens.push_back(self.vocab.get(tokens.mem, string), False)
else: else:
# let's say we have dyn-o-mite-dave - the regex finds the # Let's say we have dyn-o-mite-dave - the regex finds the
# start and end positions of the hyphens # start and end positions of the hyphens
start = 0 start = 0
start_before_infixes = start start_before_infixes = start
@ -257,7 +265,6 @@ cdef class Tokenizer:
# https://github.com/explosion/spaCy/issues/768) # https://github.com/explosion/spaCy/issues/768)
infix_span = string[infix_start:infix_end] infix_span = string[infix_start:infix_end]
tokens.push_back(self.vocab.get(tokens.mem, infix_span), False) tokens.push_back(self.vocab.get(tokens.mem, infix_span), False)
start = infix_end start = infix_end
span = string[start:] span = string[start:]
if span: if span:
@ -274,7 +281,7 @@ cdef class Tokenizer:
for i in range(n): for i in range(n):
if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL: if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL:
return 0 return 0
# See https://github.com/explosion/spaCy/issues/1250 # See #1250
if has_special: if has_special:
return 0 return 0
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
@ -293,6 +300,8 @@ cdef class Tokenizer:
RETURNS (list): A list of `re.MatchObject` objects that have `.start()` RETURNS (list): A list of `re.MatchObject` objects that have `.start()`
and `.end()` methods, denoting the placement of internal segment and `.end()` methods, denoting the placement of internal segment
separators, e.g. hyphens. separators, e.g. hyphens.
DOCS: https://spacy.io/api/tokenizer#find_infix
""" """
if self.infix_finditer is None: if self.infix_finditer is None:
return 0 return 0
@ -304,6 +313,8 @@ cdef class Tokenizer:
string (unicode): The string to segment. string (unicode): The string to segment.
RETURNS (int): The length of the prefix if present, otherwise `None`. RETURNS (int): The length of the prefix if present, otherwise `None`.
DOCS: https://spacy.io/api/tokenizer#find_prefix
""" """
if self.prefix_search is None: if self.prefix_search is None:
return 0 return 0
@ -316,6 +327,8 @@ cdef class Tokenizer:
string (unicode): The string to segment. string (unicode): The string to segment.
RETURNS (int): The length of the suffix if present, otherwise `None`. RETURNS (int): The length of the suffix if present, otherwise `None`.
DOCS: https://spacy.io/api/tokenizer#find_suffix
""" """
if self.suffix_search is None: if self.suffix_search is None:
return 0 return 0
@ -334,6 +347,8 @@ cdef class Tokenizer:
token_attrs (iterable): A sequence of dicts, where each dict describes token_attrs (iterable): A sequence of dicts, where each dict describes
a token and its attributes. The `ORTH` fields of the attributes a token and its attributes. The `ORTH` fields of the attributes
must exactly match the string when they are concatenated. must exactly match the string when they are concatenated.
DOCS: https://spacy.io/api/tokenizer#add_special_case
""" """
substrings = list(substrings) substrings = list(substrings)
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
@ -350,8 +365,10 @@ cdef class Tokenizer:
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist. Paths may be either strings or Path-like objects.
DOCS: https://spacy.io/api/tokenizer#to_disk
""" """
with path.open('wb') as file_: with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude)) file_.write(self.to_bytes(**exclude))
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
@ -361,8 +378,10 @@ cdef class Tokenizer:
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects. strings or `Path`-like objects.
RETURNS (Tokenizer): The modified `Tokenizer` object. RETURNS (Tokenizer): The modified `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_disk
""" """
with path.open('rb') as file_: with path.open("rb") as file_:
bytes_data = file_.read() bytes_data = file_.read()
self.from_bytes(bytes_data, **exclude) self.from_bytes(bytes_data, **exclude)
return self return self
@ -372,14 +391,16 @@ cdef class Tokenizer:
**exclude: Named attributes to prevent from being serialized. **exclude: Named attributes to prevent from being serialized.
RETURNS (bytes): The serialized form of the `Tokenizer` object. RETURNS (bytes): The serialized form of the `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#to_bytes
""" """
serializers = OrderedDict(( serializers = OrderedDict((
('vocab', lambda: self.vocab.to_bytes()), ("vocab", lambda: self.vocab.to_bytes()),
('prefix_search', lambda: _get_regex_pattern(self.prefix_search)), ("prefix_search", lambda: _get_regex_pattern(self.prefix_search)),
('suffix_search', lambda: _get_regex_pattern(self.suffix_search)), ("suffix_search", lambda: _get_regex_pattern(self.suffix_search)),
('infix_finditer', lambda: _get_regex_pattern(self.infix_finditer)), ("infix_finditer", lambda: _get_regex_pattern(self.infix_finditer)),
('token_match', lambda: _get_regex_pattern(self.token_match)), ("token_match", lambda: _get_regex_pattern(self.token_match)),
('exceptions', lambda: OrderedDict(sorted(self._rules.items()))) ("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
)) ))
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
@ -389,26 +410,28 @@ cdef class Tokenizer:
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. **exclude: Named attributes to prevent from being loaded.
RETURNS (Tokenizer): The `Tokenizer` object. RETURNS (Tokenizer): The `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_bytes
""" """
data = OrderedDict() data = OrderedDict()
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)), ("vocab", lambda b: self.vocab.from_bytes(b)),
('prefix_search', lambda b: data.setdefault('prefix_search', b)), ("prefix_search", lambda b: data.setdefault("prefix_search", b)),
('suffix_search', lambda b: data.setdefault('suffix_search', b)), ("suffix_search", lambda b: data.setdefault("suffix_search", b)),
('infix_finditer', lambda b: data.setdefault('infix_finditer', b)), ("infix_finditer", lambda b: data.setdefault("infix_finditer", b)),
('token_match', lambda b: data.setdefault('token_match', b)), ("token_match", lambda b: data.setdefault("token_match", b)),
('exceptions', lambda b: data.setdefault('rules', b)) ("exceptions", lambda b: data.setdefault("rules", b))
)) ))
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if data.get('prefix_search'): if data.get("prefix_search"):
self.prefix_search = re.compile(data['prefix_search']).search self.prefix_search = re.compile(data["prefix_search"]).search
if data.get('suffix_search'): if data.get("suffix_search"):
self.suffix_search = re.compile(data['suffix_search']).search self.suffix_search = re.compile(data["suffix_search"]).search
if data.get('infix_finditer'): if data.get("infix_finditer"):
self.infix_finditer = re.compile(data['infix_finditer']).finditer self.infix_finditer = re.compile(data["infix_finditer"]).finditer
if data.get('token_match'): if data.get("token_match"):
self.token_match = re.compile(data['token_match']).match self.token_match = re.compile(data["token_match"]).match
for string, substrings in data.get('rules', {}).items(): for string, substrings in data.get("rules", {}).items():
self.add_special_case(string, substrings) self.add_special_case(string, substrings)
return self return self
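
The `to_bytes` and `from_bytes` hunks above store each matcher as its regex pattern string and recompile it on load, with named entries that can be excluded. A reduced sketch of that round trip follows. It is not the spaCy implementation: it uses `json` in place of `util.to_bytes`, takes `exclude` as a tuple rather than keyword arguments, and assumes the pattern can be recovered from the bound method's `__self__`. `MiniTokenizer` and `_get_pattern` are hypothetical names.

```python
import json
import re
from collections import OrderedDict


def _get_pattern(bound_method):
    # A bound method such as `compiled.search` keeps its regex on `__self__`.
    return None if bound_method is None else bound_method.__self__.pattern


class MiniTokenizer(object):
    """Hypothetical stand-in holding only a prefix and a suffix matcher."""

    def __init__(self, prefix=None, suffix=None):
        self.prefix_search = re.compile(prefix).search if prefix else None
        self.suffix_search = re.compile(suffix).search if suffix else None

    def to_bytes(self, exclude=()):
        # Name -> zero-argument getter, evaluated only for included names.
        serializers = OrderedDict((
            ("prefix_search", lambda: _get_pattern(self.prefix_search)),
            ("suffix_search", lambda: _get_pattern(self.suffix_search)),
        ))
        data = OrderedDict(
            (name, get()) for name, get in serializers.items()
            if name not in exclude
        )
        return json.dumps(data).encode("utf8")

    def from_bytes(self, bytes_data, exclude=()):
        data = json.loads(bytes_data.decode("utf8"))
        for name in ("prefix_search", "suffix_search"):
            # Missing or empty patterns leave the current matcher untouched.
            if name not in exclude and data.get(name):
                setattr(self, name, re.compile(data[name]).search)
        return self
```

Storing the pattern string rather than the compiled object keeps the payload independent of the `re` module's internal representation.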


@ -1,5 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
from .doc import Doc from .doc import Doc
from .token import Token from .token import Token
from .span import Span from .span import Span
__all__ = ['Doc', 'Token', 'Span'] __all__ = ["Doc", "Token", "Span"]


@ -6,11 +6,11 @@ from __future__ import unicode_literals
from libc.string cimport memcpy, memset from libc.string cimport memcpy, memset
from libc.stdlib cimport malloc, free from libc.stdlib cimport malloc, free
import numpy
from cymem.cymem cimport Pool from cymem.cymem cimport Pool
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
import numpy
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
from .span cimport Span from .span cimport Span
from .token cimport Token from .token cimport Token
@ -26,11 +26,16 @@ from ..strings import get_string_id
cdef class Retokenizer: cdef class Retokenizer:
"""Helper class for doc.retokenize() context manager.""" """Helper class for doc.retokenize() context manager.
DOCS: https://spacy.io/api/doc#retokenize
USAGE: https://spacy.io/usage/linguistic-features#retokenization
"""
cdef Doc doc cdef Doc doc
cdef list merges cdef list merges
cdef list splits cdef list splits
cdef set tokens_to_merge cdef set tokens_to_merge
def __init__(self, doc): def __init__(self, doc):
self.doc = doc self.doc = doc
self.merges = [] self.merges = []
@ -40,6 +45,11 @@ cdef class Retokenizer:
def merge(self, Span span, attrs=SimpleFrozenDict()): def merge(self, Span span, attrs=SimpleFrozenDict()):
"""Mark a span for merging. The attrs will be applied to the resulting """Mark a span for merging. The attrs will be applied to the resulting
token. token.
span (Span): The span to merge.
attrs (dict): Attributes to set on the merged token.
DOCS: https://spacy.io/api/doc#retokenizer.merge
""" """
for token in span: for token in span:
if token.i in self.tokens_to_merge: if token.i in self.tokens_to_merge:
@ -58,6 +68,16 @@ cdef class Retokenizer:
def split(self, Token token, orths, heads, attrs=SimpleFrozenDict()): def split(self, Token token, orths, heads, attrs=SimpleFrozenDict()):
"""Mark a Token for splitting, into the specified orths. The attrs """Mark a Token for splitting, into the specified orths. The attrs
will be applied to each subtoken. will be applied to each subtoken.
token (Token): The token to split.
orths (list): The verbatim text of the split tokens. Needs to match the
text of the original token.
heads (list): List of token or `(token, subtoken)` tuples specifying the
tokens to attach the newly split subtokens to.
attrs (dict): Attributes to set on all split tokens. Attribute names
mapped to list of per-token attribute values.
DOCS: https://spacy.io/api/doc#retokenizer.split
""" """
if ''.join(orths) != token.text: if ''.join(orths) != token.text:
raise ValueError(Errors.E117.format(new=''.join(orths), old=token.text)) raise ValueError(Errors.E117.format(new=''.join(orths), old=token.text))
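
The check above rejects a split whose pieces do not concatenate back to the original token text. A standalone sketch of the same validation, which also returns the character offsets of each piece, might look like this. `check_split` and its error message are hypothetical; the real message lives in `Errors.E117`, whose text is not shown in this diff.

```python
def check_split(text, orths):
    """Return (start, end) character offsets for each piece of `text`.

    Raises ValueError when the pieces do not join back to `text`.
    """
    joined = "".join(orths)
    if joined != text:
        # Placeholder message, not the wording of Errors.E117.
        raise ValueError(
            "Can't split {!r}: pieces join to {!r}".format(text, joined)
        )
    offsets = []
    start = 0
    for orth in orths:
        offsets.append((start, start + len(orth)))
        start += len(orth)
    return offsets
```

For example, `check_split("NewYork", ["New", "York"])` yields one offset pair per subtoken, while `["New", "york"]` raises because the casing differs from the original text.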
@ -104,14 +124,12 @@ cdef class Retokenizer:
# referred to in the splits. If we merged these tokens previously, we # referred to in the splits. If we merged these tokens previously, we
# have to raise an error # have to raise an error
if token_index == -1: if token_index == -1:
raise IndexError( raise IndexError(Errors.E122)
"Cannot find token to be split. Did it get merged?")
head_indices = [] head_indices = []
for head_char, subtoken in heads: for head_char, subtoken in heads:
head_index = token_by_start(self.doc.c, self.doc.length, head_char) head_index = token_by_start(self.doc.c, self.doc.length, head_char)
if head_index == -1: if head_index == -1:
raise IndexError( raise IndexError(Errors.E123)
"Cannot find head of token to be split. Did it get merged?")
# We want to refer to the token index of the head *after* the # We want to refer to the token index of the head *after* the
# mergery. We need to account for the extra tokens introduced. # mergery. We need to account for the extra tokens introduced.
# e.g., let's say we have [ab, c] and we want a and b to depend # e.g., let's say we have [ab, c] and we want a and b to depend
@ -206,7 +224,6 @@ def _merge(Doc doc, int start, int end, attributes):
doc.c[i].head -= i doc.c[i].head -= i
# Set the left/right children, left/right edges # Set the left/right children, left/right edges
set_children_from_heads(doc.c, doc.length) set_children_from_heads(doc.c, doc.length)
# Clear the cached Python objects
# Return the merged Python object # Return the merged Python object
return doc[start] return doc[start]
@ -336,7 +353,7 @@ def _bulk_merge(Doc doc, merges):
# Make sure ent_iob remains consistent # Make sure ent_iob remains consistent
for (span, _) in merges: for (span, _) in merges:
if(span.end < len(offsets)): if(span.end < len(offsets)):
#if it's not the last span # If it's not the last span
token_after_span_position = offsets[span.end] token_after_span_position = offsets[span.end]
if doc.c[token_after_span_position].ent_iob == 1\ if doc.c[token_after_span_position].ent_iob == 1\
and doc.c[token_after_span_position - 1].ent_iob in (0, 2): and doc.c[token_after_span_position - 1].ent_iob in (0, 2):


@ -1,3 +1,4 @@
# coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import numpy import numpy
@ -16,9 +17,8 @@ class Binder(object):
def __init__(self, attrs=None): def __init__(self, attrs=None):
"""Create a Binder object, to hold serialized annotations. """Create a Binder object, to hold serialized annotations.
attrs (list): attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
List of attributes to serialize. 'orth' and 'spacy' are always always serialized, so they're not required. Defaults to None.
serialized, so they're not required. Defaults to None.
""" """
attrs = attrs or [] attrs = attrs or []
self.attrs = list(attrs) self.attrs = list(attrs)


@ -7,28 +7,25 @@ from __future__ import unicode_literals
cimport cython cimport cython
cimport numpy as np cimport numpy as np
from libc.string cimport memcpy, memset
from libc.math cimport sqrt
import numpy import numpy
import numpy.linalg import numpy.linalg
import struct import struct
import srsly import srsly
from thinc.neural.util import get_array_module, copy_array from thinc.neural.util import get_array_module, copy_array
import srsly
from libc.string cimport memcpy, memset
from libc.math cimport sqrt
from .span cimport Span
from .token cimport Token
from .span cimport Span from .span cimport Span
from .token cimport Token from .token cimport Token
from ..lexeme cimport Lexeme, EMPTY_LEXEME from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t from ..typedefs cimport attr_t, flags_t
from ..attrs import intify_attrs, IDS
from ..attrs cimport attr_id_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, SENT_START from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..attrs import intify_attrs, IDS
from ..util import normalize_slice from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle, basestring_ from ..compat import is_config, copy_reg, pickle, basestring_
from ..errors import deprecation_warning, models_warning, user_warning from ..errors import deprecation_warning, models_warning, user_warning
@ -37,6 +34,7 @@ from .. import util
from .underscore import Underscore, get_ext_args from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer from ._retokenize import Retokenizer
DEF PADDING = 5 DEF PADDING = 5
@ -77,7 +75,7 @@ def _get_chunker(lang):
return None return None
except KeyError: except KeyError:
return None return None
return cls.Defaults.syntax_iterators.get(u'noun_chunks') return cls.Defaults.syntax_iterators.get("noun_chunks")
cdef class Doc: cdef class Doc:
@ -94,23 +92,60 @@ cdef class Doc:
>>> from spacy.tokens import Doc >>> from spacy.tokens import Doc
>>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
spaces=[True, False, False]) spaces=[True, False, False])
DOCS: https://spacy.io/api/doc
""" """
@classmethod @classmethod
def set_extension(cls, name, **kwargs): def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False): """Define a custom attribute which becomes available as `Doc._`.
raise ValueError(Errors.E090.format(name=name, obj='Doc'))
name (unicode): Name of the attribute to set.
default: Optional default value of the attribute.
getter (callable): Optional getter function.
setter (callable): Optional setter function.
method (callable): Optional method for method extension.
force (bool): Force overwriting existing attribute.
DOCS: https://spacy.io/api/doc#set_extension
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
"""
if cls.has_extension(name) and not kwargs.get("force", False):
raise ValueError(Errors.E090.format(name=name, obj="Doc"))
Underscore.doc_extensions[name] = get_ext_args(**kwargs) Underscore.doc_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
"""Look up a previously registered extension by name.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
DOCS: https://spacy.io/api/doc#get_extension
"""
return Underscore.doc_extensions.get(name) return Underscore.doc_extensions.get(name)
@classmethod @classmethod
def has_extension(cls, name): def has_extension(cls, name):
"""Check whether an extension has been registered.
name (unicode): Name of the extension.
RETURNS (bool): Whether the extension has been registered.
DOCS: https://spacy.io/api/doc#has_extension
"""
return name in Underscore.doc_extensions return name in Underscore.doc_extensions
@classmethod @classmethod
def remove_extension(cls, name): def remove_extension(cls, name):
"""Remove a previously registered extension.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
removed extension.
DOCS: https://spacy.io/api/doc#remove_extension
"""
if not cls.has_extension(name): if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name)) raise ValueError(Errors.E046.format(name=name))
return Underscore.doc_extensions.pop(name) return Underscore.doc_extensions.pop(name)
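
The four classmethods documented above form a small class-level registry: set (guarded by `force`), get, has and remove. A reduced sketch of that pattern follows. `ExtensionRegistry` is a hypothetical class that stores only a default value per name, whereas the diff stores the result of `get_ext_args(**kwargs)`; the error messages are placeholders for `Errors.E090` and `Errors.E046`.

```python
class ExtensionRegistry(object):
    """Hypothetical class-level registry keyed by attribute name."""

    extensions = {}

    @classmethod
    def set_extension(cls, name, default=None, force=False):
        # Refuse to overwrite silently unless the caller passes force=True.
        if cls.has_extension(name) and not force:
            raise ValueError("Extension {!r} already exists".format(name))
        cls.extensions[name] = default

    @classmethod
    def get_extension(cls, name):
        # Returns None for unknown names, like dict.get in the diff.
        return cls.extensions.get(name)

    @classmethod
    def has_extension(cls, name):
        return name in cls.extensions

    @classmethod
    def remove_extension(cls, name):
        if not cls.has_extension(name):
            raise ValueError("Extension {!r} is not registered".format(name))
        return cls.extensions.pop(name)
```

Because the dictionary lives on the class, every instance sees the same registered names, which is why overwriting requires an explicit `force`.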
@ -128,6 +163,8 @@ cdef class Doc:
it is not. If `None`, defaults to `[True]*len(words)` it is not. If `None`, defaults to `[True]*len(words)`
user_data (dict or None): Optional extra data to attach to the Doc. user_data (dict or None): Optional extra data to attach to the Doc.
RETURNS (Doc): The newly constructed object. RETURNS (Doc): The newly constructed object.
DOCS: https://spacy.io/api/doc#init
""" """
self.vocab = vocab self.vocab = vocab
size = 20 size = 20
@ -151,7 +188,7 @@ cdef class Doc:
self.user_hooks = {} self.user_hooks = {}
self.user_token_hooks = {} self.user_token_hooks = {}
self.user_span_hooks = {} self.user_span_hooks = {}
self.tensor = numpy.zeros((0,), dtype='float32') self.tensor = numpy.zeros((0,), dtype="float32")
self.user_data = {} if user_data is None else user_data self.user_data = {} if user_data is None else user_data
self._vector = None self._vector = None
self.noun_chunks_iterator = _get_chunker(self.vocab.lang) self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
@ -184,6 +221,7 @@ cdef class Doc:
@property @property
def _(self): def _(self):
"""Custom extension attributes registered via `set_extension`."""
return Underscore(Underscore.doc_extensions, self) return Underscore(Underscore.doc_extensions, self)
@property @property
@ -195,7 +233,7 @@ cdef class Doc:
b) sent.is_parsed is set to True; b) sent.is_parsed is set to True;
c) At least one token other than the first where sent_start is not None. c) At least one token other than the first where sent_start is not None.
""" """
if 'sents' in self.user_hooks: if "sents" in self.user_hooks:
return True return True
if self.is_parsed: if self.is_parsed:
return True return True
@ -227,11 +265,12 @@ cdef class Doc:
supported, as `Span` objects must be contiguous (cannot have gaps). supported, as `Span` objects must be contiguous (cannot have gaps).
You can use negative indices and open-ended ranges, which have You can use negative indices and open-ended ranges, which have
their normal Python semantics. their normal Python semantics.
DOCS: https://spacy.io/api/doc#getitem
""" """
if isinstance(i, slice): if isinstance(i, slice):
start, stop = normalize_slice(len(self), i.start, i.stop, i.step) start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
return Span(self, start, stop, label=0) return Span(self, start, stop, label=0)
if i < 0: if i < 0:
i = self.length + i i = self.length + i
bounds_check(i, self.length, PADDING) bounds_check(i, self.length, PADDING)
@ -244,8 +283,7 @@ cdef class Doc:
than-Python speeds are required, you can instead access the annotations than-Python speeds are required, you can instead access the annotations
as a numpy array, or access the underlying C data directly from Cython. as a numpy array, or access the underlying C data directly from Cython.
EXAMPLE: DOCS: https://spacy.io/api/doc#iter
>>> for token in doc
""" """
cdef int i cdef int i
for i in range(self.length): for i in range(self.length):
@ -256,16 +294,15 @@ cdef class Doc:
RETURNS (int): The number of tokens in the document. RETURNS (int): The number of tokens in the document.
EXAMPLE: DOCS: https://spacy.io/api/doc#len
>>> len(doc)
""" """
return self.length return self.length
def __unicode__(self): def __unicode__(self):
return u''.join([t.text_with_ws for t in self]) return "".join([t.text_with_ws for t in self])
def __bytes__(self): def __bytes__(self):
return u''.join([t.text_with_ws for t in self]).encode('utf-8') return "".join([t.text_with_ws for t in self]).encode("utf-8")
def __str__(self): def __str__(self):
if is_config(python3=True): if is_config(python3=True):
@ -290,6 +327,8 @@ cdef class Doc:
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
the span. the span.
RETURNS (Span): The newly constructed object. RETURNS (Span): The newly constructed object.
DOCS: https://spacy.io/api/doc#char_span
""" """
if not isinstance(label, int): if not isinstance(label, int):
label = self.vocab.strings.add(label) label = self.vocab.strings.add(label)
@ -311,9 +350,11 @@ cdef class Doc:
other (object): The object to compare with. By default, accepts `Doc`, other (object): The object to compare with. By default, accepts `Doc`,
`Span`, `Token` and `Lexeme` objects. `Span`, `Token` and `Lexeme` objects.
RETURNS (float): A scalar similarity score. Higher is more similar. RETURNS (float): A scalar similarity score. Higher is more similar.
DOCS: https://spacy.io/api/doc#similarity
""" """
if 'similarity' in self.user_hooks: if "similarity" in self.user_hooks:
return self.user_hooks['similarity'](self, other) return self.user_hooks["similarity"](self, other)
if isinstance(other, (Lexeme, Token)) and self.length == 1: if isinstance(other, (Lexeme, Token)) and self.length == 1:
if self.c[0].lex.orth == other.orth: if self.c[0].lex.orth == other.orth:
return 1.0 return 1.0
@ -325,21 +366,25 @@ cdef class Doc:
else: else:
return 1.0 return 1.0
if self.vocab.vectors.n_keys == 0: if self.vocab.vectors.n_keys == 0:
models_warning(Warnings.W007.format(obj='Doc')) models_warning(Warnings.W007.format(obj="Doc"))
if self.vector_norm == 0 or other.vector_norm == 0: if self.vector_norm == 0 or other.vector_norm == 0:
user_warning(Warnings.W008.format(obj='Doc')) user_warning(Warnings.W008.format(obj="Doc"))
return 0.0 return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) vector = self.vector
xp = get_array_module(vector)
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
property has_vector: property has_vector:
"""A boolean value indicating whether a word vector is associated with """A boolean value indicating whether a word vector is associated with
the object. the object.
RETURNS (bool): Whether a word vector is associated with the object. RETURNS (bool): Whether a word vector is associated with the object.
DOCS: https://spacy.io/api/doc#has_vector
""" """
def __get__(self): def __get__(self):
if 'has_vector' in self.user_hooks: if "has_vector" in self.user_hooks:
return self.user_hooks['has_vector'](self) return self.user_hooks["has_vector"](self)
elif self.vocab.vectors.data.size: elif self.vocab.vectors.data.size:
return True return True
elif self.tensor.size: elif self.tensor.size:
@ -353,28 +398,25 @@ cdef class Doc:
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
representing the document's semantics. representing the document's semantics.
DOCS: https://spacy.io/api/doc#vector
""" """
def __get__(self): def __get__(self):
if 'vector' in self.user_hooks: if "vector" in self.user_hooks:
return self.user_hooks['vector'](self) return self.user_hooks["vector"](self)
if self._vector is not None: if self._vector is not None:
return self._vector return self._vector
elif not len(self): elif not len(self):
self._vector = numpy.zeros((self.vocab.vectors_length,), self._vector = numpy.zeros((self.vocab.vectors_length,), dtype="f")
dtype='f')
return self._vector return self._vector
elif self.vocab.vectors.data.size > 0: elif self.vocab.vectors.data.size > 0:
vector = numpy.zeros((self.vocab.vectors_length,), dtype='f') self._vector = sum(t.vector for t in self) / len(self)
for token in self.c[:self.length]:
vector += self.vocab.get_vector(token.lex.orth)
self._vector = vector / len(self)
return self._vector return self._vector
elif self.tensor.size > 0: elif self.tensor.size > 0:
self._vector = self.tensor.mean(axis=0) self._vector = self.tensor.mean(axis=0)
return self._vector return self._vector
else: else:
return numpy.zeros((self.vocab.vectors_length,), return numpy.zeros((self.vocab.vectors_length,), dtype="float32")
dtype='float32')
def __set__(self, value): def __set__(self, value):
self._vector = value self._vector = value
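Review note: the new branch above replaces the per-lexeme lookup loop with a plain average over `t.vector` for every token. A minimal sketch of that averaging, assuming plain Python lists stand in for the numpy arrays; `mean_vector` and its `width` parameter are hypothetical names used only for illustration:

```python
def mean_vector(token_vectors, width):
    """Average per-token vectors; an empty document yields a zero vector."""
    if not token_vectors:
        # Mirrors the `elif not len(self)` branch: no tokens, all zeros.
        return [0.0] * width
    totals = [0.0] * width
    for vec in token_vectors:
        for i, value in enumerate(vec):
            totals[i] += value
    # Mirrors `sum(t.vector for t in self) / len(self)`.
    return [total / len(token_vectors) for total in totals]


if __name__ == "__main__":
    print(mean_vector([[1.0, 2.0], [3.0, 4.0]], width=2))
```

The zero-vector fallback matters because dividing by `len(self)` would otherwise fail on an empty document.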
@ -383,10 +425,12 @@ cdef class Doc:
"""The L2 norm of the document's vector representation. """The L2 norm of the document's vector representation.
RETURNS (float): The L2 norm of the vector representation. RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/doc#vector_norm
""" """
def __get__(self): def __get__(self):
if 'vector_norm' in self.user_hooks: if "vector_norm" in self.user_hooks:
return self.user_hooks['vector_norm'](self) return self.user_hooks["vector_norm"](self)
cdef float value cdef float value
cdef double norm = 0 cdef double norm = 0
if self._vector_norm is None: if self._vector_norm is None:
@ -405,7 +449,7 @@ cdef class Doc:
RETURNS (unicode): The original verbatim text of the document. RETURNS (unicode): The original verbatim text of the document.
""" """
def __get__(self): def __get__(self):
return u''.join(t.text_with_ws for t in self) return "".join(t.text_with_ws for t in self)
property text_with_ws: property text_with_ws:
"""An alias of `Doc.text`, provided for duck-type compatibility with """An alias of `Doc.text`, provided for duck-type compatibility with
@ -417,21 +461,12 @@ cdef class Doc:
return self.text return self.text
property ents: property ents:
"""Iterate over the entities in the document. Yields named-entity """The named entities in the document. Returns a tuple of named entity
`Span` objects, if the entity recognizer has been applied to the `Span` objects, if the entity recognizer has been applied.
document.
YIELDS (Span): Entities in the document. RETURNS (tuple): Entities in the document, one `Span` per entity.
EXAMPLE: Iterate over the span to get individual Token objects, DOCS: https://spacy.io/api/doc#ents
or access the label:
>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> ents = list(tokens.ents)
>>> assert ents[0].label == 346
>>> assert ents[0].label_ == 'PERSON'
>>> assert ents[0].orth_ == 'Best'
>>> assert ents[0].text == 'Mr. Best'
""" """
def __get__(self): def __get__(self):
cdef int i cdef int i
@ -443,8 +478,8 @@ cdef class Doc:
token = &self.c[i] token = &self.c[i]
if token.ent_iob == 1: if token.ent_iob == 1:
if start == -1: if start == -1:
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]] seq = ["%s|%s" % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
raise ValueError(Errors.E093.format(seq=' '.join(seq))) raise ValueError(Errors.E093.format(seq=" ".join(seq)))
elif token.ent_iob == 2 or token.ent_iob == 0: elif token.ent_iob == 2 or token.ent_iob == 0:
if start != -1: if start != -1:
output.append(Span(self, start, i, label=label)) output.append(Span(self, start, i, label=label))
@ -466,7 +501,6 @@ cdef class Doc:
# prediction # prediction
# 3. Test basic data-driven ORTH gazetteer # 3. Test basic data-driven ORTH gazetteer
# 4. Test more nuanced date and currency regex # 4. Test more nuanced date and currency regex
tokens_in_ents = {} tokens_in_ents = {}
cdef attr_t entity_type cdef attr_t entity_type
cdef int ent_start, ent_end cdef int ent_start, ent_end
@ -480,7 +514,6 @@ cdef class Doc:
self.vocab.strings[tokens_in_ents[token_index][2]]), self.vocab.strings[tokens_in_ents[token_index][2]]),
span2=(ent_start, ent_end, self.vocab.strings[entity_type]))) span2=(ent_start, ent_end, self.vocab.strings[entity_type])))
tokens_in_ents[token_index] = (ent_start, ent_end, entity_type) tokens_in_ents[token_index] = (ent_start, ent_end, entity_type)
cdef int i cdef int i
for i in range(self.length): for i in range(self.length):
self.c[i].ent_type = 0 self.c[i].ent_type = 0
@ -511,6 +544,8 @@ cdef class Doc:
clauses. clauses.
YIELDS (Span): Noun chunks in the document. YIELDS (Span): Noun chunks in the document.
DOCS: https://spacy.io/api/doc#noun_chunks
""" """
def __get__(self): def __get__(self):
if not self.is_parsed: if not self.is_parsed:
@ -534,15 +569,15 @@ cdef class Doc:
dependency parse. If the parser is disabled, the `sents` iterator will dependency parse. If the parser is disabled, the `sents` iterator will
be unavailable. be unavailable.
EXAMPLE: YIELDS (Span): Sentences in the document.
>>> doc = nlp("This is a sentence. Here's another...")
>>> assert [s.root.text for s in doc.sents] == ["is", "'s"] DOCS: https://spacy.io/api/doc#sents
""" """
def __get__(self): def __get__(self):
if not self.is_sentenced: if not self.is_sentenced:
raise ValueError(Errors.E030) raise ValueError(Errors.E030)
if 'sents' in self.user_hooks: if "sents" in self.user_hooks:
yield from self.user_hooks['sents'](self) yield from self.user_hooks["sents"](self)
else: else:
start = 0 start = 0
for i in range(1, self.length): for i in range(1, self.length):
@ -607,17 +642,16 @@ cdef class Doc:
if isinstance(py_attr_ids, basestring_): if isinstance(py_attr_ids, basestring_):
# Handle inputs like doc.to_array('ORTH') # Handle inputs like doc.to_array('ORTH')
py_attr_ids = [py_attr_ids] py_attr_ids = [py_attr_ids]
elif not hasattr(py_attr_ids, '__iter__'): elif not hasattr(py_attr_ids, "__iter__"):
# Handle inputs like doc.to_array(ORTH) # Handle inputs like doc.to_array(ORTH)
py_attr_ids = [py_attr_ids] py_attr_ids = [py_attr_ids]
# Allow strings, e.g. 'lemma' or 'LEMMA' # Allow strings, e.g. 'lemma' or 'LEMMA'
py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, 'upper') else id_) py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
for id_ in py_attr_ids] for id_ in py_attr_ids]
# Make an array from the attributes --- otherwise our inner loop is # Make an array from the attributes --- otherwise our inner loop is
# Python dict iteration. # Python dict iteration.
cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype='i') cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype="i")
output = numpy.ndarray(shape=(self.length, len(attr_ids)), output = numpy.ndarray(shape=(self.length, len(attr_ids)), dtype=numpy.uint64)
dtype=numpy.uint64)
c_output = <attr_t*>output.data c_output = <attr_t*>output.data
c_attr_ids = <attr_id_t*>attr_ids.data c_attr_ids = <attr_id_t*>attr_ids.data
cdef TokenC* token cdef TokenC* token
@ -629,8 +663,7 @@ cdef class Doc:
# Handle 1d case # Handle 1d case
return output if len(attr_ids) >= 2 else output.reshape((self.length,)) return output if len(attr_ids) >= 2 else output.reshape((self.length,))
def count_by(self, attr_id_t attr_id, exclude=None, def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
PreshCounter counts=None):
"""Count the frequencies of a given attribute. Produces a dict of """Count the frequencies of a given attribute. Produces a dict of
`{attribute (int): count (int)}` frequencies, keyed by the values of `{attribute (int): count (int)}` frequencies, keyed by the values of
the given attribute ID. the given attribute ID.
@ -638,13 +671,7 @@ cdef class Doc:
attr_id (int): The attribute ID to key the counts. attr_id (int): The attribute ID to key the counts.
RETURNS (dict): A dictionary mapping attributes to integer counts. RETURNS (dict): A dictionary mapping attributes to integer counts.
EXAMPLE: DOCS: https://spacy.io/api/doc#count_by
>>> from spacy import attrs
>>> doc = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880], [11880], [7561], [12800]])
""" """
cdef int i cdef int i
cdef attr_t attr cdef attr_t attr
@ -685,13 +712,21 @@ cdef class Doc:
cdef void set_parse(self, const TokenC* parsed) nogil: cdef void set_parse(self, const TokenC* parsed) nogil:
# TODO: This method is fairly misleading atm. It's used by Parser # TODO: This method is fairly misleading atm. It's used by Parser
# to actually apply the parse calculated. Need to rethink this. # to actually apply the parse calculated. Need to rethink this.
# Probably we should use from_array? # Probably we should use from_array?
self.is_parsed = True self.is_parsed = True
for i in range(self.length): for i in range(self.length):
self.c[i] = parsed[i] self.c[i] = parsed[i]
def from_array(self, attrs, array): def from_array(self, attrs, array):
"""Load attributes from a numpy array. Write to a `Doc` object, from an
`(M, N)` array of attributes.
attrs (list): A list of attribute ID ints.
array (numpy.ndarray[ndim=2, dtype='int32']): The attribute values.
RETURNS (Doc): Itself.
DOCS: https://spacy.io/api/doc#from_array
"""
if SENT_START in attrs and HEAD in attrs: if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032) raise ValueError(Errors.E032)
cdef int i, col cdef int i, col
@ -715,10 +750,10 @@ cdef class Doc:
for i in range(length): for i in range(length):
if array[i, col] != 0: if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col]) self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
# set flags # Set flags
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs) self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs) self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
# if document is parsed, set children # If document is parsed, set children
if self.is_parsed: if self.is_parsed:
set_children_from_heads(self.c, self.length) set_children_from_heads(self.c, self.length)
return self return self
@ -730,6 +765,8 @@ cdef class Doc:
RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape
(n, n), where n = len(self). (n, n), where n = len(self).
DOCS: https://spacy.io/api/doc#get_lca_matrix
""" """
return numpy.asarray(_get_lca_matrix(self, 0, len(self))) return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
@ -738,9 +775,11 @@ cdef class Doc:
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist. Paths may be either strings or Path-like objects.
DOCS: https://spacy.io/api/doc#to_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
with path.open('wb') as file_: with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude)) file_.write(self.to_bytes(**exclude))
def from_disk(self, path, **exclude): def from_disk(self, path, **exclude):
@ -750,9 +789,11 @@ cdef class Doc:
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects. strings or `Path`-like objects.
RETURNS (Doc): The modified `Doc` object. RETURNS (Doc): The modified `Doc` object.
DOCS: https://spacy.io/api/doc#from_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
with path.open('rb') as file_: with path.open("rb") as file_:
bytes_data = file_.read() bytes_data = file_.read()
return self.from_bytes(bytes_data, **exclude) return self.from_bytes(bytes_data, **exclude)
@ -761,15 +802,16 @@ cdef class Doc:
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
all annotations. all annotations.
DOCS: https://spacy.io/api/doc#to_bytes
""" """
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
if self.is_tagged: if self.is_tagged:
array_head.append(TAG) array_head.append(TAG)
# if doc parsed add head and dep attribute # If doc parsed add head and dep attribute
if self.is_parsed: if self.is_parsed:
array_head.extend([HEAD, DEP]) array_head.extend([HEAD, DEP])
# otherwise add sent_start # Otherwise add sent_start
else: else:
array_head.append(SENT_START) array_head.append(SENT_START)
# Msgpack doesn't distinguish between lists and tuples, which is # Msgpack doesn't distinguish between lists and tuples, which is
@ -777,17 +819,16 @@ cdef class Doc:
# keys, we must have tuples. In values we just have to hope # keys, we must have tuples. In values we just have to hope
# users don't mind getting a list instead of a tuple. # users don't mind getting a list instead of a tuple.
serializers = { serializers = {
'text': lambda: self.text, "text": lambda: self.text,
'array_head': lambda: array_head, "array_head": lambda: array_head,
'array_body': lambda: self.to_array(array_head), "array_body": lambda: self.to_array(array_head),
'sentiment': lambda: self.sentiment, "sentiment": lambda: self.sentiment,
'tensor': lambda: self.tensor, "tensor": lambda: self.tensor,
} }
if 'user_data' not in exclude and self.user_data: if "user_data" not in exclude and self.user_data:
user_data_keys, user_data_values = list(zip(*self.user_data.items())) user_data_keys, user_data_values = list(zip(*self.user_data.items()))
serializers['user_data_keys'] = lambda: srsly.msgpack_dumps(user_data_keys) serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
serializers['user_data_values'] = lambda: srsly.msgpack_dumps(user_data_values) serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **exclude):
@ -795,42 +836,40 @@ cdef class Doc:
data (bytes): The string to load from. data (bytes): The string to load from.
RETURNS (Doc): Itself. RETURNS (Doc): Itself.
DOCS: https://spacy.io/api/doc#from_bytes
""" """
if self.length != 0: if self.length != 0:
raise ValueError(Errors.E033.format(length=self.length)) raise ValueError(Errors.E033.format(length=self.length))
deserializers = { deserializers = {
'text': lambda b: None, "text": lambda b: None,
'array_head': lambda b: None, "array_head": lambda b: None,
'array_body': lambda b: None, "array_body": lambda b: None,
'sentiment': lambda b: None, "sentiment": lambda b: None,
'tensor': lambda b: None, "tensor": lambda b: None,
'user_data_keys': lambda b: None, "user_data_keys": lambda b: None,
'user_data_values': lambda b: None, "user_data_values": lambda b: None,
} }
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
# Msgpack doesn't distinguish between lists and tuples, which is # Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within # vexing for user data. As a best guess, we *know* that within
# keys, we must have tuples. In values we just have to hope # keys, we must have tuples. In values we just have to hope
# users don't mind getting a list instead of a tuple. # users don't mind getting a list instead of a tuple.
if 'user_data' not in exclude and 'user_data_keys' in msg: if "user_data" not in exclude and "user_data_keys" in msg:
user_data_keys = srsly.msgpack_loads(msg['user_data_keys'], use_list=False) user_data_keys = srsly.msgpack_loads(msg["user_data_keys"], use_list=False)
user_data_values = srsly.msgpack_loads(msg['user_data_values']) user_data_values = srsly.msgpack_loads(msg["user_data_values"])
for key, value in zip(user_data_keys, user_data_values): for key, value in zip(user_data_keys, user_data_values):
self.user_data[key] = value self.user_data[key] = value
cdef int i, start, end, has_space cdef int i, start, end, has_space
if "sentiment" not in exclude and "sentiment" in msg:
if 'sentiment' not in exclude and 'sentiment' in msg: self.sentiment = msg["sentiment"]
self.sentiment = msg['sentiment'] if "tensor" not in exclude and "tensor" in msg:
if 'tensor' not in exclude and 'tensor' in msg: self.tensor = msg["tensor"]
self.tensor = msg['tensor']
start = 0 start = 0
cdef const LexemeC* lex cdef const LexemeC* lex
cdef unicode orth_ cdef unicode orth_
text = msg['text'] text = msg["text"]
attrs = msg['array_body'] attrs = msg["array_body"]
for i in range(attrs.shape[0]): for i in range(attrs.shape[0]):
end = start + attrs[i, 0] end = start + attrs[i, 0]
has_space = attrs[i, 1] has_space = attrs[i, 1]
@ -838,11 +877,11 @@ cdef class Doc:
lex = self.vocab.get(self.mem, orth_) lex = self.vocab.get(self.mem, orth_)
self.push_back(lex, has_space) self.push_back(lex, has_space)
start = end + has_space start = end + has_space
self.from_array(msg['array_head'][2:], attrs[:, 2:]) self.from_array(msg["array_head"][2:], attrs[:, 2:])
return self return self
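Review note: the loop in `from_bytes` rebuilds the tokens from the raw text using only two columns of the array, the token length and the trailing-space flag. A minimal sketch of that offset arithmetic, assuming the token text is taken as `text[start:end]` (that line is not visible in this hunk); `split_tokens` is a hypothetical helper, not part of spaCy:

```python
def split_tokens(text, lengths_and_spaces):
    """Rebuild (token_text, has_space) pairs from per-token lengths.

    Each entry is (length, has_space), like the LENGTH and SPACY columns.
    """
    tokens = []
    start = 0
    for length, has_space in lengths_and_spaces:
        end = start + length
        tokens.append((text[start:end], bool(has_space)))
        # Skip the single trailing space, if the token owns one.
        start = end + has_space
    return tokens


if __name__ == "__main__":
    print(split_tokens("Hello world!", [(5, 1), (5, 0), (1, 0)]))
```

Because the offsets are derived rather than stored, the serialized text and the two columns have to agree exactly for the round trip to be lossless.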
def extend_tensor(self, tensor): def extend_tensor(self, tensor):
'''Concatenate a new tensor onto the doc.tensor object. """Concatenate a new tensor onto the doc.tensor object.
The doc.tensor attribute holds dense feature vectors The doc.tensor attribute holds dense feature vectors
computed by the models in the pipeline. Let's say a computed by the models in the pipeline. Let's say a
@ -850,7 +889,7 @@ cdef class Doc:
per word. doc.tensor.shape will be (30, 128). After per word. doc.tensor.shape will be (30, 128). After
calling doc.extend_tensor with an array of shape (30, 64), calling doc.extend_tensor with an array of shape (30, 64),
doc.tensor.shape == (30, 192). doc.tensor.shape == (30, 192).
''' """
xp = get_array_module(self.tensor) xp = get_array_module(self.tensor)
if self.tensor.size == 0: if self.tensor.size == 0:
self.tensor.resize(tensor.shape, refcheck=False) self.tensor.resize(tensor.shape, refcheck=False)
@ -859,7 +898,7 @@ cdef class Doc:
self.tensor = xp.hstack((self.tensor, tensor)) self.tensor = xp.hstack((self.tensor, tensor))
def retokenize(self): def retokenize(self):
'''Context manager to handle retokenization of the Doc. """Context manager to handle retokenization of the Doc.
Modifications to the Doc's tokenization are stored, and then Modifications to the Doc's tokenization are stored, and then
made all at once when the context manager exits. This is made all at once when the context manager exits. This is
much more efficient, and less error-prone. much more efficient, and less error-prone.
@ -867,7 +906,10 @@ cdef class Doc:
All views of the Doc (Span and Token) created before the All views of the Doc (Span and Token) created before the
retokenization are invalidated, although they may accidentally retokenization are invalidated, although they may accidentally
continue to work. continue to work.
'''
DOCS: https://spacy.io/api/doc#retokenize
USAGE: https://spacy.io/usage/linguistic-features#retokenization
"""
return Retokenizer(self) return Retokenizer(self)
def _bulk_merge(self, spans, attributes): def _bulk_merge(self, spans, attributes):
@ -883,9 +925,10 @@ cdef class Doc:
RETURNS (Token): The first newly merged token. RETURNS (Token): The first newly merged token.
""" """
cdef unicode tag, lemma, ent_type cdef unicode tag, lemma, ent_type
attr_len = len(attributes)
assert len(attributes) == len(spans), "attribute length should be equal to span length" + str(len(attributes)) +\ span_len = len(spans)
str(len(spans)) if not attr_len == span_len:
raise ValueError(Errors.E121.format(attr_len=attr_len, span_len=span_len))
with self.retokenize() as retokenizer: with self.retokenize() as retokenizer:
for i, span in enumerate(spans): for i, span in enumerate(spans):
fix_attributes(self, attributes[i]) fix_attributes(self, attributes[i])
@ -916,13 +959,10 @@ cdef class Doc:
elif not args: elif not args:
fix_attributes(self, attributes) fix_attributes(self, attributes)
elif args: elif args:
raise ValueError(Errors.E034.format(n_args=len(args), raise ValueError(Errors.E034.format(n_args=len(args), args=repr(args),
args=repr(args),
kwargs=repr(attributes))) kwargs=repr(attributes)))
remove_label_if_necessary(attributes) remove_label_if_necessary(attributes)
attributes = intify_attrs(attributes, strings_map=self.vocab.strings) attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
cdef int start = token_by_start(self.c, self.length, start_idx) cdef int start = token_by_start(self.c, self.length, start_idx)
if start == -1: if start == -1:
return None return None
@ -939,44 +979,47 @@ cdef class Doc:
raise ValueError(Errors.E105) raise ValueError(Errors.E105)
def to_json(self, underscore=None): def to_json(self, underscore=None):
"""Convert a Doc to JSON. Produces the same format used by the spacy """Convert a Doc to JSON. The format it produces will be the new format
train command. for the `spacy train` command (not implemented yet).
underscore (list): Optional list of string names of custom doc._. underscore (list): Optional list of string names of custom doc._.
attributes. Attribute values need to be JSON-serializable. Values will attributes. Attribute values need to be JSON-serializable. Values will
be added to an "_" key in the data, e.g. "_": {"foo": "bar"}. be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.
RETURNS (dict): The data in spaCy's JSON format. RETURNS (dict): The data in spaCy's JSON format.
DOCS: https://spacy.io/api/doc#to_json
""" """
data = {'text': self.text} data = {"text": self.text}
data['ents'] = [{'start': ent.start_char, 'end': ent.end_char, if self.ents:
'label': ent.label_} for ent in self.ents] data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
"label": ent.label_} for ent in self.ents]
sents = list(self.sents) sents = list(self.sents)
if sents: if sents:
data['sents'] = [{'start': sent.start_char, 'end': sent.end_char} data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
for sent in sents] for sent in sents]
if self.cats: if self.cats:
data['cats'] = self.cats data["cats"] = self.cats
data['tokens'] = [] data["tokens"] = []
for token in self: for token in self:
token_data = {'id': token.i, 'start': token.idx, 'end': token.idx + len(token)} token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
if token.pos_: if token.pos_:
token_data['pos'] = token.pos_ token_data["pos"] = token.pos_
if token.tag_: if token.tag_:
token_data['tag'] = token.tag_ token_data["tag"] = token.tag_
if token.dep_: if token.dep_:
token_data['dep'] = token.dep_ token_data["dep"] = token.dep_
if token.head: if token.head:
token_data['head'] = token.head.i token_data["head"] = token.head.i
data['tokens'].append(token_data) data["tokens"].append(token_data)
if underscore: if underscore:
data['_'] = {} data["_"] = {}
for attr in underscore: for attr in underscore:
if not self.has_extension(attr): if not self.has_extension(attr):
raise ValueError(Errors.E106.format(attr=attr, opts=underscore)) raise ValueError(Errors.E106.format(attr=attr, opts=underscore))
value = self._.get(attr) value = self._.get(attr)
if not srsly.is_json_serializable(value): if not srsly.is_json_serializable(value):
raise ValueError(Errors.E107.format(attr=attr, value=repr(value))) raise ValueError(Errors.E107.format(attr=attr, value=repr(value)))
data['_'][attr] = value data["_"][attr] = value
return data return data
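Review note: with this change the `"ents"` key is only emitted when the document has entities, matching how `"sents"` and `"cats"` were already handled. A rough sketch of the resulting shape, assuming plain tuples stand in for `Span` and `Token` objects; `doc_to_json` and its argument layout are hypothetical, and the per-token `pos`/`tag`/`dep`/`head` fields are left out:

```python
def doc_to_json(text, tokens, ents=(), sents=(), cats=None):
    """Build a to_json-style dict; optional keys are omitted when empty.

    tokens: (start_char, word) pairs; ents: (start, end, label) triples;
    sents: (start, end) pairs.
    """
    data = {"text": text}
    if ents:
        data["ents"] = [{"start": s, "end": e, "label": label}
                        for s, e, label in ents]
    if sents:
        data["sents"] = [{"start": s, "end": e} for s, e in sents]
    if cats:
        data["cats"] = cats
    data["tokens"] = [{"id": i, "start": start, "end": start + len(word)}
                      for i, (start, word) in enumerate(tokens)]
    return data


if __name__ == "__main__":
    print(doc_to_json("Hi there", [(0, "Hi"), (3, "there")]))
```

Consumers of the output should therefore test for the key rather than assume an empty list is present.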
@ -1008,9 +1051,8 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
tokens[i].r_kids = 0 tokens[i].r_kids = 0
tokens[i].l_edge = i tokens[i].l_edge = i
tokens[i].r_edge = i tokens[i].r_edge = i
# Three times, for non-projectivity # Three times, for non-projectivity. See issue #3170. This isn't a very
# See issue #3170. This isn't a very satisfying fix, but I think it's # satisfying fix, but I think it's sufficient.
# sufficient.
for loop_count in range(3): for loop_count in range(3):
# Set left edges # Set left edges
for i in range(length): for i in range(length):
@ -1022,7 +1064,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
head.l_edge = child.l_edge head.l_edge = child.l_edge
if child.r_edge > head.r_edge: if child.r_edge > head.r_edge:
head.r_edge = child.r_edge head.r_edge = child.r_edge
# Set right edges --- same as above, but iterate in reverse # Set right edges - same as above, but iterate in reverse
for i in range(length-1, -1, -1): for i in range(length-1, -1, -1):
child = &tokens[i] child = &tokens[i]
head = &tokens[i + child.head] head = &tokens[i + child.head]
@ -1053,20 +1095,14 @@ cdef int _get_tokens_lca(Token token_j, Token token_k):
return token_k.i return token_k.i
elif token_k.head == token_j: elif token_k.head == token_j:
return token_j.i return token_j.i
token_j_ancestors = set(token_j.ancestors) token_j_ancestors = set(token_j.ancestors)
if token_k in token_j_ancestors: if token_k in token_j_ancestors:
return token_k.i return token_k.i
for token_k_ancestor in token_k.ancestors: for token_k_ancestor in token_k.ancestors:
if token_k_ancestor == token_j: if token_k_ancestor == token_j:
return token_j.i return token_j.i
if token_k_ancestor in token_j_ancestors: if token_k_ancestor in token_j_ancestors:
return token_k_ancestor.i return token_k_ancestor.i
return -1 return -1
@ -1084,12 +1120,10 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
with shape (n, n), where n = len(doc). with shape (n, n), where n = len(doc).
""" """
cdef int [:,:] lca_matrix cdef int [:,:] lca_matrix
n_tokens= end - start n_tokens= end - start
lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32) lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32)
lca_mat.fill(-1) lca_mat.fill(-1)
lca_matrix = lca_mat lca_matrix = lca_mat
for j in range(n_tokens): for j in range(n_tokens):
token_j = doc[start + j] token_j = doc[start + j]
# the common ancestor of token and itself is itself: # the common ancestor of token and itself is itself:
@ -1110,7 +1144,6 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
else: else:
lca_matrix[j, k] = lca - start lca_matrix[j, k] = lca - start
lca_matrix[k, j] = lca - start lca_matrix[k, j] = lca - start
return lca_matrix return lca_matrix
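Review note: for readers following `_get_tokens_lca` and `_get_lca_matrix`, here is a pure-Python sketch of the same idea. It assumes `heads` holds absolute head indices with each root pointing at itself and that the heads form well-formed trees; the function names are hypothetical and the real code works on `Token` objects and a typed memoryview:

```python
def ancestors(heads, i):
    """Yield the chain of heads above token i; a root is its own head."""
    while heads[i] != i:
        i = heads[i]
        yield i


def lca(heads, j, k):
    """Return the lowest common ancestor of tokens j and k, or -1."""
    j_chain = {j, *ancestors(heads, j)}
    # Walk upwards from k; the first node also above (or equal to) j wins.
    for node in [k, *ancestors(heads, k)]:
        if node in j_chain:
            return node
    return -1


def lca_matrix(heads):
    """Build the symmetric (n, n) matrix of pairwise LCA indices."""
    n = len(heads)
    matrix = [[-1] * n for _ in range(n)]
    for j in range(n):
        matrix[j][j] = j  # a token is its own common ancestor
        for k in range(j + 1, n):
            matrix[j][k] = matrix[k][j] = lca(heads, j, k)
    return matrix


if __name__ == "__main__":
    # "She ate pizza": both outer tokens attach to "ate" (index 1).
    print(lca_matrix([1, 1, 1]))
```

Tokens in different trees, for example in separate sentences, share no ancestor and keep the `-1` fill value.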
@ -1124,8 +1157,7 @@ def pickle_doc(doc):
def unpickle_doc(vocab, hooks_and_data, bytes_data): def unpickle_doc(vocab, hooks_and_data, bytes_data):
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data) user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude="user_data")
exclude='user_data')
doc.user_hooks.update(doc_hooks) doc.user_hooks.update(doc_hooks)
doc.user_span_hooks.update(span_hooks) doc.user_span_hooks.update(span_hooks)
doc.user_token_hooks.update(token_hooks) doc.user_token_hooks.update(token_hooks)
@ -1134,19 +1166,22 @@ def unpickle_doc(vocab, hooks_and_data, bytes_data):
copy_reg.pickle(Doc, pickle_doc, unpickle_doc) copy_reg.pickle(Doc, pickle_doc, unpickle_doc)
def remove_label_if_necessary(attributes): def remove_label_if_necessary(attributes):
# More deprecated attribute handling =/ # More deprecated attribute handling =/
if 'label' in attributes: if "label" in attributes:
attributes['ent_type'] = attributes.pop('label') attributes["ent_type"] = attributes.pop("label")
def fix_attributes(doc, attributes): def fix_attributes(doc, attributes):
if 'label' in attributes and 'ent_type' not in attributes: if "label" in attributes and "ent_type" not in attributes:
if isinstance(attributes['label'], int): if isinstance(attributes["label"], int):
attributes[ENT_TYPE] = attributes['label'] attributes[ENT_TYPE] = attributes["label"]
else: else:
attributes[ENT_TYPE] = doc.vocab.strings[attributes['label']] attributes[ENT_TYPE] = doc.vocab.strings[attributes["label"]]
if 'ent_type' in attributes: if "ent_type" in attributes:
attributes[ENT_TYPE] = attributes['ent_type'] attributes[ENT_TYPE] = attributes["ent_type"]
def get_entity_info(ent_info): def get_entity_info(ent_info):
if isinstance(ent_info, Span): if isinstance(ent_info, Span):


@ -1,11 +1,13 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
from collections import defaultdict
cimport numpy as np cimport numpy as np
from libc.math cimport sqrt
import numpy import numpy
import numpy.linalg import numpy.linalg
from libc.math cimport sqrt from thinc.neural.util import get_array_module
from collections import defaultdict
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
from .token cimport TokenC from .token cimport TokenC
@ -13,9 +15,10 @@ from ..structs cimport TokenC, LexemeC
from ..typedefs cimport flags_t, attr_t, hash_t from ..typedefs cimport flags_t, attr_t, hash_t
from ..attrs cimport attr_id_t from ..attrs cimport attr_id_t
from ..parts_of_speech cimport univ_pos_t from ..parts_of_speech cimport univ_pos_t
from ..util import normalize_slice
from ..attrs cimport * from ..attrs cimport *
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from ..util import normalize_slice
from ..compat import is_config, basestring_ from ..compat import is_config, basestring_
from ..errors import Errors, TempErrors, Warnings, user_warning, models_warning from ..errors import Errors, TempErrors, Warnings, user_warning, models_warning
from ..errors import deprecation_warning from ..errors import deprecation_warning
@ -23,29 +26,66 @@ from .underscore import Underscore, get_ext_args
cdef class Span: cdef class Span:
"""A slice from a Doc object.""" """A slice from a Doc object.
DOCS: https://spacy.io/api/span
"""
@classmethod @classmethod
def set_extension(cls, name, **kwargs): def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False): """Define a custom attribute which becomes available as `Span._`.
raise ValueError(Errors.E090.format(name=name, obj='Span'))
name (unicode): Name of the attribute to set.
default: Optional default value of the attribute.
getter (callable): Optional getter function.
setter (callable): Optional setter function.
method (callable): Optional method for method extension.
force (bool): Force overwriting existing attribute.
DOCS: https://spacy.io/api/span#set_extension
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
"""
if cls.has_extension(name) and not kwargs.get("force", False):
raise ValueError(Errors.E090.format(name=name, obj="Span"))
Underscore.span_extensions[name] = get_ext_args(**kwargs) Underscore.span_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
"""Look up a previously registered extension by name.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
DOCS: https://spacy.io/api/span#get_extension
"""
return Underscore.span_extensions.get(name) return Underscore.span_extensions.get(name)
@classmethod @classmethod
def has_extension(cls, name): def has_extension(cls, name):
"""Check whether an extension has been registered.
name (unicode): Name of the extension.
RETURNS (bool): Whether the extension has been registered.
DOCS: https://spacy.io/api/span#has_extension
"""
return name in Underscore.span_extensions return name in Underscore.span_extensions
@classmethod @classmethod
def remove_extension(cls, name): def remove_extension(cls, name):
"""Remove a previously registered extension.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
removed extension.
DOCS: https://spacy.io/api/span#remove_extension
"""
if not cls.has_extension(name): if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name)) raise ValueError(Errors.E046.format(name=name))
return Underscore.span_extensions.pop(name) return Underscore.span_extensions.pop(name)
def __cinit__(self, Doc doc, int start, int end, label=0, def __cinit__(self, Doc doc, int start, int end, label=0, vector=None,
vector=None, vector_norm=None): vector_norm=None):
"""Create a `Span` object from the slice `doc[start : end]`. """Create a `Span` object from the slice `doc[start : end]`.
doc (Doc): The parent document. doc (Doc): The parent document.
@ -55,6 +95,8 @@ cdef class Span:
vector (ndarray[ndim=1, dtype='float32']): A meaning representation vector (ndarray[ndim=1, dtype='float32']): A meaning representation
of the span. of the span.
RETURNS (Span): The newly constructed object. RETURNS (Span): The newly constructed object.
DOCS: https://spacy.io/api/span#init
""" """
if not (0 <= start <= end <= len(doc)): if not (0 <= start <= end <= len(doc)):
raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc))) raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
@ -101,6 +143,8 @@ cdef class Span:
"""Get the number of tokens in the span. """Get the number of tokens in the span.
RETURNS (int): The number of tokens in the span. RETURNS (int): The number of tokens in the span.
DOCS: https://spacy.io/api/span#len
""" """
self._recalculate_indices() self._recalculate_indices()
if self.end < self.start: if self.end < self.start:
@ -110,7 +154,7 @@ cdef class Span:
def __repr__(self): def __repr__(self):
if is_config(python3=True): if is_config(python3=True):
return self.text return self.text
return self.text.encode('utf-8') return self.text.encode("utf-8")
def __getitem__(self, object i): def __getitem__(self, object i):
"""Get a `Token` or a `Span` object """Get a `Token` or a `Span` object
@ -119,9 +163,7 @@ cdef class Span:
the span to get. the span to get.
RETURNS (Token or Span): The token at `span[i]`. RETURNS (Token or Span): The token at `span[i]`.
EXAMPLE: DOCS: https://spacy.io/api/span#getitem
>>> span[0]
>>> span[1:3]
""" """
self._recalculate_indices() self._recalculate_indices()
if isinstance(i, slice): if isinstance(i, slice):
@ -137,6 +179,8 @@ cdef class Span:
"""Iterate over `Token` objects. """Iterate over `Token` objects.
YIELDS (Token): A `Token` object. YIELDS (Token): A `Token` object.
DOCS: https://spacy.io/api/span#iter
""" """
self._recalculate_indices() self._recalculate_indices()
for i in range(self.start, self.end): for i in range(self.start, self.end):
@ -147,31 +191,32 @@ cdef class Span:
@property @property
def _(self): def _(self):
"""User space for adding custom attribute extensions.""" """Custom extension attributes registered via `set_extension`."""
return Underscore(Underscore.span_extensions, self, return Underscore(Underscore.span_extensions, self,
start=self.start_char, end=self.end_char) start=self.start_char, end=self.end_char)
def as_doc(self): def as_doc(self):
# TODO: fix """Create a `Doc` object with a copy of the `Span`'s data.
"""Create a `Doc` object with a copy of the Span's data.
RETURNS (Doc): The `Doc` copy of the span. RETURNS (Doc): The `Doc` copy of the span.
DOCS: https://spacy.io/api/span#as_doc
""" """
cdef Doc doc = Doc(self.doc.vocab, # TODO: Fix!
words=[t.text for t in self], words = [t.text for t in self]
spaces=[bool(t.whitespace_) for t in self]) spaces = [bool(t.whitespace_) for t in self]
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE] array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
if self.doc.is_tagged: if self.doc.is_tagged:
array_head.append(TAG) array_head.append(TAG)
# if doc parsed add head and dep attribute # If doc parsed add head and dep attribute
if self.doc.is_parsed: if self.doc.is_parsed:
array_head.extend([HEAD, DEP]) array_head.extend([HEAD, DEP])
# otherwise add sent_start # Otherwise add sent_start
else: else:
array_head.append(SENT_START) array_head.append(SENT_START)
array = self.doc.to_array(array_head) array = self.doc.to_array(array_head)
doc.from_array(array_head, array[self.start : self.end]) doc.from_array(array_head, array[self.start : self.end])
doc.noun_chunks_iterator = self.doc.noun_chunks_iterator doc.noun_chunks_iterator = self.doc.noun_chunks_iterator
doc.user_hooks = self.doc.user_hooks doc.user_hooks = self.doc.user_hooks
doc.user_span_hooks = self.doc.user_span_hooks doc.user_span_hooks = self.doc.user_span_hooks
@ -180,7 +225,7 @@ cdef class Span:
doc.vector_norm = self.vector_norm doc.vector_norm = self.vector_norm
doc.tensor = self.doc.tensor[self.start : self.end] doc.tensor = self.doc.tensor[self.start : self.end]
for key, value in self.doc.cats.items(): for key, value in self.doc.cats.items():
if hasattr(key, '__len__') and len(key) == 3: if hasattr(key, "__len__") and len(key) == 3:
cat_start, cat_end, cat_label = key cat_start, cat_end, cat_label = key
if cat_start == self.start_char and cat_end == self.end_char: if cat_start == self.start_char and cat_end == self.end_char:
doc.cats[cat_label] = value doc.cats[cat_label] = value
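Note: the loop above copies a category into the new `Doc` only when its key is a three-item `(start_char, end_char, label)` entry whose offsets match the span exactly. A sketch of that filter on a plain dict, assuming a hypothetical helper name `cats_for_span`:

```python
def cats_for_span(cats, start_char, end_char):
    """Hypothetical helper: pick span-level categories out of a cats dict."""
    selected = {}
    for key, value in cats.items():
        # Only keys of length 3 are read as (start_char, end_char, label).
        if hasattr(key, "__len__") and len(key) == 3:
            cat_start, cat_end, cat_label = key
            # Keep the category only if it covers exactly this span.
            if cat_start == start_char and cat_end == end_char:
                selected[cat_label] = value
    return selected


cats = {(0, 5, "GREETING"): 0.9, (6, 11, "NAME"): 0.8, "SPORTS": 0.1}
span_cats = cats_for_span(cats, 0, 5)
```

One consequence of the length check, in the sketch as in the diff: a plain string key that happens to be three characters long also passes it and gets unpacked character by character.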
@ -206,6 +251,8 @@ cdef class Span:
RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape
(n, n), where n = len(self). (n, n), where n = len(self).
DOCS: https://spacy.io/api/span#get_lca_matrix
""" """
return numpy.asarray(_get_lca_matrix(self.doc, self.start, self.end)) return numpy.asarray(_get_lca_matrix(self.doc, self.start, self.end))
@ -216,24 +263,28 @@ cdef class Span:
other (object): The object to compare with. By default, accepts `Doc`, other (object): The object to compare with. By default, accepts `Doc`,
`Span`, `Token` and `Lexeme` objects. `Span`, `Token` and `Lexeme` objects.
RETURNS (float): A scalar similarity score. Higher is more similar. RETURNS (float): A scalar similarity score. Higher is more similar.
DOCS: https://spacy.io/api/span#similarity
""" """
if 'similarity' in self.doc.user_span_hooks: if "similarity" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['similarity'](self, other) return self.doc.user_span_hooks["similarity"](self, other)
if len(self) == 1 and hasattr(other, 'orth'): if len(self) == 1 and hasattr(other, "orth"):
if self[0].orth == other.orth: if self[0].orth == other.orth:
return 1.0 return 1.0
elif hasattr(other, '__len__') and len(self) == len(other): elif hasattr(other, "__len__") and len(self) == len(other):
for i in range(len(self)): for i in range(len(self)):
if self[i].orth != getattr(other[i], 'orth', None): if self[i].orth != getattr(other[i], "orth", None):
break break
else: else:
return 1.0 return 1.0
if self.vocab.vectors.n_keys == 0: if self.vocab.vectors.n_keys == 0:
models_warning(Warnings.W007.format(obj='Span')) models_warning(Warnings.W007.format(obj="Span"))
if self.vector_norm == 0.0 or other.vector_norm == 0.0: if self.vector_norm == 0.0 or other.vector_norm == 0.0:
user_warning(Warnings.W008.format(obj='Span')) user_warning(Warnings.W008.format(obj="Span"))
return 0.0 return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm) vector = self.vector
xp = get_array_module(vector)
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
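Note: the final return is a cosine similarity, with an early `0.0` when either norm is zero. The change from `numpy.dot` to `xp.dot` appears intended to let the dot product run on whichever array library backs the vector. A dependency-free sketch of the same arithmetic, using plain lists and a hypothetical function name:

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Hypothetical helper mirroring the arithmetic of the final return."""
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    # Guard against division by zero, as the W008 branch does.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```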
cpdef np.ndarray to_array(self, object py_attr_ids): cpdef np.ndarray to_array(self, object py_attr_ids):
"""Given a list of M attribute IDs, export the tokens to a numpy """Given a list of M attribute IDs, export the tokens to a numpy
@ -248,8 +299,8 @@ cdef class Span:
cdef int i, j cdef int i, j
cdef attr_id_t feature cdef attr_id_t feature
cdef np.ndarray[attr_t, ndim=2] output cdef np.ndarray[attr_t, ndim=2] output
# Make an array from the attributes --- otherwise our inner loop is Python # Make an array from the attributes - otherwise our inner loop is Python
# dict iteration. # dict iteration
cdef np.ndarray[attr_t, ndim=1] attr_ids = numpy.asarray(py_attr_ids, dtype=numpy.uint64) cdef np.ndarray[attr_t, ndim=1] attr_ids = numpy.asarray(py_attr_ids, dtype=numpy.uint64)
cdef int length = self.end - self.start cdef int length = self.end - self.start
output = numpy.ndarray(shape=(length, len(attr_ids)), dtype=numpy.uint64) output = numpy.ndarray(shape=(length, len(attr_ids)), dtype=numpy.uint64)
@ -279,12 +330,11 @@ cdef class Span:
property sent: property sent:
"""RETURNS (Span): The sentence span that the span is a part of.""" """RETURNS (Span): The sentence span that the span is a part of."""
def __get__(self): def __get__(self):
if 'sent' in self.doc.user_span_hooks: if "sent" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['sent'](self) return self.doc.user_span_hooks["sent"](self)
# This should raise if we're not parsed # This should raise if not parsed / no custom sentence boundaries
# or doesn't have any sbd component :)
self.doc.sents self.doc.sents
# if doc is parsed we can use the deps to find the sentence # If doc is parsed we can use the deps to find the sentence
# otherwise we use the `sent_start` token attribute # otherwise we use the `sent_start` token attribute
cdef int n = 0 cdef int n = 0
cdef int i cdef int i
@ -297,11 +347,11 @@ cdef class Span:
raise RuntimeError(Errors.E038) raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1] return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced: elif self.doc.is_sentenced:
# find start of the sentence # Find start of the sentence
start = self.start start = self.start
while self.doc.c[start].sent_start != 1 and start > 0: while self.doc.c[start].sent_start != 1 and start > 0:
start += -1 start += -1
# find end of the sentence # Find end of the sentence
end = self.end end = self.end
n = 0 n = 0
while end < self.doc.length and self.doc.c[end].sent_start != 1: while end < self.doc.length and self.doc.c[end].sent_start != 1:
@ -312,7 +362,13 @@ cdef class Span:
return self.doc[start:end] return self.doc[start:end]
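Note: in the `is_sentenced` branch, the sentence is found by scanning left from the span start and right from the span end until a token with `sent_start == 1` is met. A sketch over a plain list of flags, assuming a hypothetical helper `sentence_bounds` and omitting the loop counter that the hunk elides:

```python
def sentence_bounds(sent_starts, start, end):
    """Hypothetical helper: widen [start, end) to the enclosing sentence.

    sent_starts[i] == 1 marks token i as the first token of a sentence.
    """
    # Walk left to the nearest sentence start (or the first token).
    while start > 0 and sent_starts[start] != 1:
        start -= 1
    # Walk right to the next sentence start (or the end of the document).
    while end < len(sent_starts) and sent_starts[end] != 1:
        end += 1
    return start, end


flags = [1, 0, 0, 1, 0, 0, 0]
bounds = sentence_bounds(flags, 4, 5)
```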
property ents: property ents:
"""RETURNS (list): A list of tokens that belong to the current span.""" """The named entities in the span. Returns a tuple of named entity
`Span` objects, if the entity recognizer has been applied.
RETURNS (tuple): Entities in the span, one `Span` per entity.
DOCS: https://spacy.io/api/span#ents
"""
def __get__(self): def __get__(self):
ents = [] ents = []
for ent in self.doc.ents: for ent in self.doc.ents:
@ -321,11 +377,16 @@ cdef class Span:
return ents return ents
property has_vector: property has_vector:
"""RETURNS (bool): Whether a word vector is associated with the object. """A boolean value indicating whether a word vector is associated with
the object.
RETURNS (bool): Whether a word vector is associated with the object.
DOCS: https://spacy.io/api/span#has_vector
""" """
def __get__(self): def __get__(self):
if 'has_vector' in self.doc.user_span_hooks: if "has_vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['has_vector'](self) return self.doc.user_span_hooks["has_vector"](self)
elif self.vocab.vectors.data.size > 0: elif self.vocab.vectors.data.size > 0:
return any(token.has_vector for token in self) return any(token.has_vector for token in self)
elif self.doc.tensor.size > 0: elif self.doc.tensor.size > 0:
@ -339,19 +400,26 @@ cdef class Span:
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
representing the span's semantics. representing the span's semantics.
DOCS: https://spacy.io/api/span#vector
""" """
def __get__(self): def __get__(self):
if 'vector' in self.doc.user_span_hooks: if "vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['vector'](self) return self.doc.user_span_hooks["vector"](self)
if self._vector is None: if self._vector is None:
self._vector = sum(t.vector for t in self) / len(self) self._vector = sum(t.vector for t in self) / len(self)
return self._vector return self._vector
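Note: without a user hook, the span vector is the element-wise mean of its token vectors, and the norm reported for it is the ordinary L2 norm. A sketch with plain lists, where `mean_vector` and `l2_norm` are hypothetical names and every token vector is assumed to have the same length:

```python
import math


def mean_vector(token_vectors):
    """Hypothetical helper: element-wise average of equal-length vectors."""
    count = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dims)]


def l2_norm(vector):
    """Hypothetical helper: square root of the sum of squared components."""
    return math.sqrt(sum(value * value for value in vector))


span_vector = mean_vector([[1.0, 2.0], [3.0, 4.0]])
span_norm = l2_norm([3.0, 4.0])
```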
property vector_norm: property vector_norm:
"""RETURNS (float): The L2 norm of the vector representation.""" """The L2 norm of the span's vector representation.
RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/span#vector_norm
"""
def __get__(self): def __get__(self):
if 'vector_norm' in self.doc.user_span_hooks: if "vector_norm" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['vector_norm'](self) return self.doc.user_span_hooks["vector_norm"](self)
cdef float value cdef float value
cdef double norm = 0 cdef double norm = 0
if self._vector_norm is None: if self._vector_norm is None:
@ -366,8 +434,8 @@ cdef class Span:
negativity of the span. negativity of the span.
""" """
def __get__(self): def __get__(self):
if 'sentiment' in self.doc.user_span_hooks: if "sentiment" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['sentiment'](self) return self.doc.user_span_hooks["sentiment"](self)
else: else:
return sum([token.sentiment for token in self]) / len(self) return sum([token.sentiment for token in self]) / len(self)
@ -387,7 +455,7 @@ cdef class Span:
whitespace). whitespace).
""" """
def __get__(self): def __get__(self):
return u''.join([t.text_with_ws for t in self]) return "".join([t.text_with_ws for t in self])
property noun_chunks: property noun_chunks:
"""Yields base noun-phrase `Span` objects, if the document has been """Yields base noun-phrase `Span` objects, if the document has been
@ -396,7 +464,9 @@ cdef class Span:
NP-level coordination, no prepositional phrases, and no relative NP-level coordination, no prepositional phrases, and no relative
clauses. clauses.
YIELDS (Span): Base noun-phrase `Span` objects YIELDS (Span): Base noun-phrase `Span` objects.
DOCS: https://spacy.io/api/span#noun_chunks
""" """
def __get__(self): def __get__(self):
if not self.doc.is_parsed: if not self.doc.is_parsed:
@ -415,52 +485,18 @@ cdef class Span:
yield span yield span
property root: property root:
"""The token within the span that's highest in the parse tree. """The token with the shortest path to the root of the
If there's a tie, the earliest is preferred. sentence (or the root itself). If multiple tokens are equally
high in the tree, the first token is taken.
RETURNS (Token): The root token. RETURNS (Token): The root token.
EXAMPLE: The root token has the shortest path to the root of the DOCS: https://spacy.io/api/span#root
sentence (or is the root itself). If multiple words are equally
high in the tree, the first word is taken. For example:
>>> toks = nlp(u'I like New York in Autumn.')
Let's name the indices easier than writing `toks[4]` etc.
>>> i, like, new, york, in_, autumn, dot = range(len(toks))
The head of 'new' is 'York', and the head of "York" is "like"
>>> toks[new].head.text
'York'
>>> toks[york].head.text
'like'
Create a span for "New York". Its root is "York".
>>> new_york = toks[new:york+1]
>>> new_york.root.text
'York'
Here's a more complicated case, raised by issue #214:
>>> toks = nlp(u'to, north and south carolina')
>>> to, north, and_, south, carolina = toks
>>> south.head.text, carolina.head.text
('north', 'to')
Here "south" is a child of "north", which is a child of "carolina".
Carolina is the root of the span:
>>> south_carolina = toks[-2:]
>>> south_carolina.root.text
'carolina'
""" """
def __get__(self): def __get__(self):
self._recalculate_indices() self._recalculate_indices()
if 'root' in self.doc.user_span_hooks: if "root" in self.doc.user_span_hooks:
return self.doc.user_span_hooks['root'](self) return self.doc.user_span_hooks["root"](self)
# This should probably be called 'head', and the other one called # This should probably be called 'head', and the other one called
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/ # 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
cdef int i cdef int i
@ -492,10 +528,12 @@ cdef class Span:
return self.doc[root] return self.doc[root]
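Note: the docstring defines the span root as the token with the shortest path to the sentence root, the first one winning a tie. The sketch below implements that definition directly on a list of head indices. It is not the code elided by this hunk. `span_root` is a hypothetical name, the sentence root is assumed to be its own head, and the tree is assumed to contain no cycles.

```python
def span_root(heads, start, end):
    """Hypothetical helper: index of the span token closest to the root.

    heads[i] is the index of token i's head; the root points to itself.
    """
    def depth(i):
        # Count the steps from token i up to the sentence root.
        steps = 0
        while heads[i] != i:
            i = heads[i]
            steps += 1
        return steps

    # min() returns the first minimal item, so ties go to the earliest token.
    return min(range(start, end), key=depth)


# "I like New York in Autumn ." with "like" as root. The heads of "New" and
# "York" follow the removed docstring; the remaining heads are assumed.
heads = [1, 1, 3, 1, 1, 4, 1]
new_york_root = span_root(heads, 2, 4)
```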
property lefts: property lefts:
""" Tokens that are to the left of the span, whose head is within the """Tokens that are to the left of the span, whose head is within the
`Span`. `Span`.
YIELDS (Token): A left-child of a token of the span. YIELDS (Token): A left-child of a token of the span.
DOCS: https://spacy.io/api/span#lefts
""" """
def __get__(self): def __get__(self):
for token in reversed(self): # Reverse, so we get tokens in order for token in reversed(self): # Reverse, so we get tokens in order
@ -508,6 +546,8 @@ cdef class Span:
`Span`. `Span`.
YIELDS (Token): A right-child of a token of the span. YIELDS (Token): A right-child of a token of the span.
DOCS: https://spacy.io/api/span#rights
""" """
def __get__(self): def __get__(self):
for token in self: for token in self:
@ -516,15 +556,25 @@ cdef class Span:
yield right yield right
property n_lefts: property n_lefts:
"""RETURNS (int): The number of leftward immediate children of the """The number of tokens that are to the left of the span, whose
heads are within the span.
RETURNS (int): The number of leftward immediate children of the
span, in the syntactic dependency parse. span, in the syntactic dependency parse.
DOCS: https://spacy.io/api/span#n_lefts
""" """
def __get__(self): def __get__(self):
return len(list(self.lefts)) return len(list(self.lefts))
property n_rights: property n_rights:
"""RETURNS (int): The number of rightward immediate children of the """The number of tokens that are to the right of the span, whose
heads are within the span.
RETURNS (int): The number of rightward immediate children of the
span, in the syntactic dependency parse. span, in the syntactic dependency parse.
DOCS: https://spacy.io/api/span#n_rights
""" """
def __get__(self): def __get__(self):
return len(list(self.rights)) return len(list(self.rights))
@ -533,6 +583,8 @@ cdef class Span:
"""Tokens within the span and tokens which descend from them. """Tokens within the span and tokens which descend from them.
YIELDS (Token): A token within the span, or a descendant from it. YIELDS (Token): A token within the span, or a descendant from it.
DOCS: https://spacy.io/api/span#subtree
""" """
def __get__(self): def __get__(self):
for word in self.lefts: for word in self.lefts:
@ -547,7 +599,7 @@ cdef class Span:
return self.root.ent_id return self.root.ent_id
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError(TempErrors.T007.format(attr='ent_id')) raise NotImplementedError(TempErrors.T007.format(attr="ent_id"))
property ent_id_: property ent_id_:
"""RETURNS (unicode): The (string) entity ID.""" """RETURNS (unicode): The (string) entity ID."""
@ -555,10 +607,10 @@ cdef class Span:
return self.root.ent_id_ return self.root.ent_id_
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError(TempErrors.T007.format(attr='ent_id_')) raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
property orth_: property orth_:
"""Verbatim text content (identical to Span.text). Exists mostly for """Verbatim text content (identical to `Span.text`). Exists mostly for
consistency with other attributes. consistency with other attributes.
RETURNS (unicode): The span's text.""" RETURNS (unicode): The span's text."""
@ -568,27 +620,28 @@ cdef class Span:
property lemma_: property lemma_:
"""RETURNS (unicode): The span's lemma.""" """RETURNS (unicode): The span's lemma."""
def __get__(self): def __get__(self):
return ' '.join([t.lemma_ for t in self]).strip() return " ".join([t.lemma_ for t in self]).strip()
property upper_: property upper_:
"""Deprecated. Use Span.text.upper() instead.""" """Deprecated. Use `Span.text.upper()` instead."""
def __get__(self): def __get__(self):
return ''.join([t.text_with_ws.upper() for t in self]).strip() return "".join([t.text_with_ws.upper() for t in self]).strip()
property lower_: property lower_:
"""Deprecated. Use Span.text.lower() instead.""" """Deprecated. Use `Span.text.lower()` instead."""
def __get__(self): def __get__(self):
return ''.join([t.text_with_ws.lower() for t in self]).strip() return "".join([t.text_with_ws.lower() for t in self]).strip()
property string: property string:
"""Deprecated: Use Span.text_with_ws instead.""" """Deprecated: Use `Span.text_with_ws` instead."""
def __get__(self): def __get__(self):
return ''.join([t.text_with_ws for t in self]) return "".join([t.text_with_ws for t in self])
property label_: property label_:
"""RETURNS (unicode): The span's label.""" """RETURNS (unicode): The span's label."""
def __get__(self): def __get__(self):
return self.doc.vocab.strings[self.label] return self.doc.vocab.strings[self.label]
def __set__(self, unicode label_): def __set__(self, unicode label_):
self.label = self.doc.vocab.strings.add(label_) self.label = self.doc.vocab.strings.add(label_)


@ -8,42 +8,83 @@ from cpython.mem cimport PyMem_Malloc, PyMem_Free
from cython.view cimport array as cvarray from cython.view cimport array as cvarray
cimport numpy as np cimport numpy as np
np.import_array() np.import_array()
import numpy import numpy
from thinc.neural.util import get_array_module
from ..typedefs cimport hash_t from ..typedefs cimport hash_t
from ..lexeme cimport Lexeme from ..lexeme cimport Lexeme
from .. import parts_of_speech
from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
from ..symbols cimport conj
from .. import parts_of_speech
from .. import util
from ..compat import is_config from ..compat import is_config
from ..errors import Errors, Warnings, user_warning, models_warning from ..errors import Errors, Warnings, user_warning, models_warning
from .. import util
from .underscore import Underscore, get_ext_args from .underscore import Underscore, get_ext_args
from .morphanalysis cimport MorphAnalysis from .morphanalysis cimport MorphAnalysis
cdef class Token: cdef class Token:
"""An individual token i.e. a word, punctuation symbol, whitespace, """An individual token i.e. a word, punctuation symbol, whitespace,
etc.""" etc.
DOCS: https://spacy.io/api/token
"""
@classmethod @classmethod
def set_extension(cls, name, **kwargs): def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False): """Define a custom attribute which becomes available as `Token._`.
raise ValueError(Errors.E090.format(name=name, obj='Token'))
name (unicode): Name of the attribute to set.
default: Optional default value of the attribute.
getter (callable): Optional getter function.
setter (callable): Optional setter function.
method (callable): Optional method for method extension.
force (bool): Force overwriting existing attribute.
DOCS: https://spacy.io/api/token#set_extension
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
"""
if cls.has_extension(name) and not kwargs.get("force", False):
raise ValueError(Errors.E090.format(name=name, obj="Token"))
Underscore.token_extensions[name] = get_ext_args(**kwargs) Underscore.token_extensions[name] = get_ext_args(**kwargs)
@classmethod @classmethod
def get_extension(cls, name): def get_extension(cls, name):
"""Look up a previously registered extension by name.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
DOCS: https://spacy.io/api/token#get_extension
"""
return Underscore.token_extensions.get(name) return Underscore.token_extensions.get(name)
@classmethod @classmethod
def has_extension(cls, name): def has_extension(cls, name):
"""Check whether an extension has been registered.
name (unicode): Name of the extension.
RETURNS (bool): Whether the extension has been registered.
DOCS: https://spacy.io/api/token#has_extension
"""
return name in Underscore.token_extensions return name in Underscore.token_extensions
@classmethod @classmethod
def remove_extension(cls, name): def remove_extension(cls, name):
"""Remove a previously registered extension.
name (unicode): Name of the extension.
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
removed extension.
DOCS: https://spacy.io/api/token#remove_extension
"""
if not cls.has_extension(name): if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name)) raise ValueError(Errors.E046.format(name=name))
return Underscore.token_extensions.pop(name) return Underscore.token_extensions.pop(name)
@ -54,6 +95,8 @@ cdef class Token:
vocab (Vocab): A storage container for lexical types. vocab (Vocab): A storage container for lexical types.
doc (Doc): The parent document. doc (Doc): The parent document.
offset (int): The index of the token within the document. offset (int): The index of the token within the document.
DOCS: https://spacy.io/api/token#init
""" """
self.vocab = vocab self.vocab = vocab
self.doc = doc self.doc = doc
@ -67,6 +110,8 @@ cdef class Token:
"""The number of unicode characters in the token, i.e. `token.text`. """The number of unicode characters in the token, i.e. `token.text`.
RETURNS (int): The number of unicode characters in the token. RETURNS (int): The number of unicode characters in the token.
DOCS: https://spacy.io/api/token#len
""" """
return self.c.lex.length return self.c.lex.length
@ -121,6 +166,7 @@ cdef class Token:
@property @property
def _(self): def _(self):
"""Custom extension attributes registered via `set_extension`."""
return Underscore(Underscore.token_extensions, self, return Underscore(Underscore.token_extensions, self,
start=self.idx, end=None) start=self.idx, end=None)
@ -130,12 +176,7 @@ cdef class Token:
flag_id (int): The ID of the flag attribute. flag_id (int): The ID of the flag attribute.
RETURNS (bool): Whether the flag is set. RETURNS (bool): Whether the flag is set.
EXAMPLE: DOCS: https://spacy.io/api/token#check_flag
>>> from spacy.attrs import IS_TITLE
>>> doc = nlp(u'Give it back! He pleaded.')
>>> token = doc[0]
>>> token.check_flag(IS_TITLE)
True
""" """
return Lexeme.c_check_flag(self.c.lex, flag_id) return Lexeme.c_check_flag(self.c.lex, flag_id)
@ -144,6 +185,8 @@ cdef class Token:
i (int): The relative position of the token to get. Defaults to 1. i (int): The relative position of the token to get. Defaults to 1.
RETURNS (Token): The token at position `self.doc[self.i+i]`. RETURNS (Token): The token at position `self.doc[self.i+i]`.
DOCS: https://spacy.io/api/token#nbor
""" """
if self.i+i < 0 or (self.i+i >= len(self.doc)): if self.i+i < 0 or (self.i+i >= len(self.doc)):
raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc))) raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
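Note: `nbor` takes an offset relative to the token and rejects any target outside the document, so a negative result never wraps around to the end as ordinary Python indexing would. A sketch of the bounds check on plain integers, with a hypothetical helper name and error message:

```python
def nbor_index(i, offset, length):
    """Hypothetical helper: absolute index of the neighbour at i + offset."""
    target = i + offset
    # Reject targets before the first token or past the last one.
    if target < 0 or target >= length:
        raise IndexError(
            "Neighbour %d of token %d is outside a document of length %d"
            % (offset, i, length)
        )
    return target


next_index = nbor_index(2, 1, 5)
```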
@ -156,22 +199,25 @@ cdef class Token:
other (object): The object to compare with. By default, accepts `Doc`, other (object): The object to compare with. By default, accepts `Doc`,
`Span`, `Token` and `Lexeme` objects. `Span`, `Token` and `Lexeme` objects.
RETURNS (float): A scalar similarity score. Higher is more similar. RETURNS (float): A scalar similarity score. Higher is more similar.
DOCS: https://spacy.io/api/token#similarity
""" """
if 'similarity' in self.doc.user_token_hooks: if "similarity" in self.doc.user_token_hooks:
return self.doc.user_token_hooks['similarity'](self, other) return self.doc.user_token_hooks["similarity"](self, other)
if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, "__getitem__"): if hasattr(other, "__len__") and len(other) == 1 and hasattr(other, "__getitem__"):
if self.c.lex.orth == getattr(other[0], 'orth', None): if self.c.lex.orth == getattr(other[0], "orth", None):
return 1.0 return 1.0
elif hasattr(other, 'orth'): elif hasattr(other, "orth"):
if self.c.lex.orth == other.orth: if self.c.lex.orth == other.orth:
return 1.0 return 1.0
if self.vocab.vectors.n_keys == 0: if self.vocab.vectors.n_keys == 0:
models_warning(Warnings.W007.format(obj='Token')) models_warning(Warnings.W007.format(obj="Token"))
if self.vector_norm == 0 or other.vector_norm == 0: if self.vector_norm == 0 or other.vector_norm == 0:
user_warning(Warnings.W008.format(obj='Token')) user_warning(Warnings.W008.format(obj="Token"))
return 0.0 return 0.0
return (numpy.dot(self.vector, other.vector) / vector = self.vector
(self.vector_norm * other.vector_norm)) xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
property morph: property morph:
def __get__(self): def __get__(self):
@ -205,7 +251,7 @@ cdef class Token:
def __get__(self): def __get__(self):
cdef unicode orth = self.vocab.strings[self.c.lex.orth] cdef unicode orth = self.vocab.strings[self.c.lex.orth]
if self.c.spacy: if self.c.spacy:
return orth + u' ' return orth + " "
else: else:
return orth return orth
@ -218,8 +264,8 @@ cdef class Token:
"""RETURNS (float): A scalar value indicating the positivity or """RETURNS (float): A scalar value indicating the positivity or
negativity of the token.""" negativity of the token."""
def __get__(self): def __get__(self):
if 'sentiment' in self.doc.user_token_hooks: if "sentiment" in self.doc.user_token_hooks:
return self.doc.user_token_hooks['sentiment'](self) return self.doc.user_token_hooks["sentiment"](self)
return self.c.lex.sentiment return self.c.lex.sentiment
property lang: property lang:
@ -301,6 +347,7 @@ cdef class Token:
"""RETURNS (uint64): ID of coarse-grained part-of-speech tag.""" """RETURNS (uint64): ID of coarse-grained part-of-speech tag."""
def __get__(self): def __get__(self):
return self.c.pos return self.c.pos
def __set__(self, pos): def __set__(self, pos):
self.c.pos = pos self.c.pos = pos
@ -325,10 +372,12 @@ cdef class Token:
the object. the object.
RETURNS (bool): Whether a word vector is associated with the object. RETURNS (bool): Whether a word vector is associated with the object.
DOCS: https://spacy.io/api/token#has_vector
""" """
def __get__(self): def __get__(self):
if 'has_vector' in self.doc.user_token_hooks: if 'has_vector' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['has_vector'](self) return self.doc.user_token_hooks["has_vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0: if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return True return True
return self.vocab.has_vector(self.c.lex.orth) return self.vocab.has_vector(self.c.lex.orth)
@ -338,10 +387,12 @@ cdef class Token:
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
representing the token's semantics. representing the token's semantics.
DOCS: https://spacy.io/api/token#vector
""" """
def __get__(self): def __get__(self):
if 'vector' in self.doc.user_token_hooks: if 'vector' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['vector'](self) return self.doc.user_token_hooks["vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0: if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return self.doc.tensor[self.i] return self.doc.tensor[self.i]
else: else:
@ -351,23 +402,35 @@ cdef class Token:
"""The L2 norm of the token's vector representation. """The L2 norm of the token's vector representation.
RETURNS (float): The L2 norm of the vector representation. RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/token#vector_norm
""" """
def __get__(self): def __get__(self):
if 'vector_norm' in self.doc.user_token_hooks: if 'vector_norm' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['vector_norm'](self) return self.doc.user_token_hooks["vector_norm"](self)
vector = self.vector vector = self.vector
return numpy.sqrt((vector ** 2).sum()) return numpy.sqrt((vector ** 2).sum())
property n_lefts: property n_lefts:
"""RETURNS (int): The number of leftward immediate children of the """The number of leftward immediate children of the word, in the
syntactic dependency parse.
RETURNS (int): The number of leftward immediate children of the
word, in the syntactic dependency parse. word, in the syntactic dependency parse.
DOCS: https://spacy.io/api/token#n_lefts
""" """
def __get__(self): def __get__(self):
return self.c.l_kids return self.c.l_kids
property n_rights: property n_rights:
"""RETURNS (int): The number of rightward immediate children of the """The number of rightward immediate children of the word, in the
syntactic dependency parse.
RETURNS (int): The number of rightward immediate children of the
word, in the syntactic dependency parse. word, in the syntactic dependency parse.
DOCS: https://spacy.io/api/token#n_rights
""" """
def __get__(self): def __get__(self):
return self.c.r_kids return self.c.r_kids
@ -376,7 +439,7 @@ cdef class Token:
"""RETURNS (Span): The sentence span that the token is a part of.""" """RETURNS (Span): The sentence span that the token is a part of."""
def __get__(self): def __get__(self):
if 'sent' in self.doc.user_token_hooks: if 'sent' in self.doc.user_token_hooks:
return self.doc.user_token_hooks['sent'](self) return self.doc.user_token_hooks["sent"](self)
return self.doc[self.i : self.i+1].sent return self.doc[self.i : self.i+1].sent
property sent_start: property sent_start:
@ -393,8 +456,13 @@ cdef class Token:
self.is_sent_start = value self.is_sent_start = value
property is_sent_start: property is_sent_start:
"""RETURNS (bool / None): Whether the token starts a sentence. """A boolean value indicating whether the token starts a sentence.
`None` if unknown. Defaults to `True` for the first token in the `Doc`.
RETURNS (bool / None): Whether the token starts a sentence.
None if unknown. None if unknown.
DOCS: https://spacy.io/api/token#is_sent_start
""" """
def __get__(self): def __get__(self):
if self.c.sent_start == 0: if self.c.sent_start == 0:
@ -421,6 +489,8 @@ cdef class Token:
dependency parse. dependency parse.
YIELDS (Token): A left-child of the token. YIELDS (Token): A left-child of the token.
DOCS: https://spacy.io/api/token#lefts
""" """
def __get__(self): def __get__(self):
cdef int nr_iter = 0 cdef int nr_iter = 0
@ -432,13 +502,15 @@ cdef class Token:
nr_iter += 1 nr_iter += 1
# This is ugly, but it's a way to guard out infinite loops # This is ugly, but it's a way to guard out infinite loops
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr='token.lefts')) raise RuntimeError(Errors.E045.format(attr="token.lefts"))
property rights: property rights:
"""The rightward immediate children of the word, in the syntactic """The rightward immediate children of the word, in the syntactic
dependency parse. dependency parse.
YIELDS (Token): A right-child of the token. YIELDS (Token): A right-child of the token.
DOCS: https://spacy.io/api/token#rights
""" """
def __get__(self): def __get__(self):
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i) cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
@ -450,7 +522,7 @@ cdef class Token:
ptr -= 1 ptr -= 1
nr_iter += 1 nr_iter += 1
if nr_iter >= 10000000: if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr='token.rights')) raise RuntimeError(Errors.E045.format(attr="token.rights"))
tokens.reverse() tokens.reverse()
for t in tokens: for t in tokens:
yield t yield t
@ -458,7 +530,9 @@ cdef class Token:
property children: property children:
"""A sequence of the token's immediate syntactic children. """A sequence of the token's immediate syntactic children.
YIELDS (Token): A child token such that child.head==self YIELDS (Token): A child token such that `child.head==self`.
DOCS: https://spacy.io/api/token#children
""" """
def __get__(self): def __get__(self):
yield from self.lefts yield from self.lefts
@ -470,6 +544,8 @@ cdef class Token:
YIELDS (Token): A descendent token such that YIELDS (Token): A descendent token such that
`self.is_ancestor(descendent) or token == self`. `self.is_ancestor(descendent) or token == self`.
DOCS: https://spacy.io/api/token#subtree
""" """
def __get__(self): def __get__(self):
for word in self.lefts: for word in self.lefts:
@ -499,11 +575,13 @@ cdef class Token:
YIELDS (Token): A sequence of ancestor tokens such that YIELDS (Token): A sequence of ancestor tokens such that
`ancestor.is_ancestor(self)`. `ancestor.is_ancestor(self)`.
DOCS: https://spacy.io/api/token#ancestors
""" """
def __get__(self): def __get__(self):
cdef const TokenC* head_ptr = self.c cdef const TokenC* head_ptr = self.c
# guard against infinite loop, no token can have # Guard against infinite loop, no token can have
# more ancestors than tokens in the tree # more ancestors than tokens in the tree.
cdef int i = 0 cdef int i = 0
while head_ptr.head != 0 and i < self.doc.length: while head_ptr.head != 0 and i < self.doc.length:
head_ptr += head_ptr.head head_ptr += head_ptr.head
@ -516,6 +594,8 @@ cdef class Token:
descendant (Token): Another token. descendant (Token): Another token.
RETURNS (bool): Whether this token is the ancestor of the descendant. RETURNS (bool): Whether this token is the ancestor of the descendant.
DOCS: https://spacy.io/api/token#is_ancestor
""" """
if self.doc is not descendant.doc: if self.doc is not descendant.doc:
return False return False
@ -531,34 +611,28 @@ cdef class Token:
return self.doc[self.i + self.c.head] return self.doc[self.i + self.c.head]
def __set__(self, Token new_head): def __set__(self, Token new_head):
# this function sets the head of self to new_head # This function sets the head of self to new_head and updates the
# and updates the counters for left/right dependents # counters for left/right dependents and left/right corner for the
# and left/right corner for the new and the old head # new and the old head
# Do nothing if old head is new head
# do nothing if old head is new head
if self.i + self.c.head == new_head.i: if self.i + self.c.head == new_head.i:
return return
cdef Token old_head = self.head cdef Token old_head = self.head
cdef int rel_newhead_i = new_head.i - self.i cdef int rel_newhead_i = new_head.i - self.i
# Is the new head a descendant of the old head
# is the new head a descendant of the old head
cdef bint is_desc = old_head.is_ancestor(new_head) cdef bint is_desc = old_head.is_ancestor(new_head)
cdef int new_edge cdef int new_edge
cdef Token anc, child cdef Token anc, child
# Update number of deps of old head
# update number of deps of old head
if self.c.head > 0: # left dependent if self.c.head > 0: # left dependent
old_head.c.l_kids -= 1 old_head.c.l_kids -= 1
if self.c.l_edge == old_head.c.l_edge: if self.c.l_edge == old_head.c.l_edge:
# the token dominates the left edge so the left edge of # The token dominates the left edge so the left edge of
# the head may change when the token is reattached, it may # the head may change when the token is reattached, it may
# not change if the new head is a descendant of the current # not change if the new head is a descendant of the current
# head # head.
new_edge = self.c.l_edge new_edge = self.c.l_edge
# the new l_edge is the left-most l_edge on any of the # The new l_edge is the left-most l_edge on any of the
# other dependents where the l_edge is left of the head, # other dependents where the l_edge is left of the head,
# otherwise it is the head # otherwise it is the head
if not is_desc: if not is_desc:
@ -569,21 +643,18 @@ cdef class Token:
if child.c.l_edge < new_edge: if child.c.l_edge < new_edge:
new_edge = child.c.l_edge new_edge = child.c.l_edge
old_head.c.l_edge = new_edge old_head.c.l_edge = new_edge
# Walk up the tree from old_head and assign new l_edge to
# walk up the tree from old_head and assign new l_edge to
# ancestors until an ancestor already has an l_edge that's # ancestors until an ancestor already has an l_edge that's
# further left # further left
for anc in old_head.ancestors: for anc in old_head.ancestors:
if anc.c.l_edge <= new_edge: if anc.c.l_edge <= new_edge:
break break
anc.c.l_edge = new_edge anc.c.l_edge = new_edge
elif self.c.head < 0: # right dependent elif self.c.head < 0: # right dependent
old_head.c.r_kids -= 1 old_head.c.r_kids -= 1
# do the same thing as for l_edge # Do the same thing as for l_edge
if self.c.r_edge == old_head.c.r_edge: if self.c.r_edge == old_head.c.r_edge:
new_edge = self.c.r_edge new_edge = self.c.r_edge
if not is_desc: if not is_desc:
new_edge = old_head.i new_edge = old_head.i
for child in old_head.children: for child in old_head.children:
@ -592,16 +663,14 @@ cdef class Token:
if child.c.r_edge > new_edge: if child.c.r_edge > new_edge:
new_edge = child.c.r_edge new_edge = child.c.r_edge
old_head.c.r_edge = new_edge old_head.c.r_edge = new_edge
for anc in old_head.ancestors: for anc in old_head.ancestors:
if anc.c.r_edge >= new_edge: if anc.c.r_edge >= new_edge:
break break
anc.c.r_edge = new_edge anc.c.r_edge = new_edge
# Update number of deps of new head
# update number of deps of new head
if rel_newhead_i > 0: # left dependent if rel_newhead_i > 0: # left dependent
new_head.c.l_kids += 1 new_head.c.l_kids += 1
# walk up the tree from new head and set l_edge to self.l_edge # Walk up the tree from new head and set l_edge to self.l_edge
# until you hit a token with an l_edge further to the left # until you hit a token with an l_edge further to the left
if self.c.l_edge < new_head.c.l_edge: if self.c.l_edge < new_head.c.l_edge:
new_head.c.l_edge = self.c.l_edge new_head.c.l_edge = self.c.l_edge
@ -609,34 +678,33 @@ cdef class Token:
if anc.c.l_edge <= self.c.l_edge: if anc.c.l_edge <= self.c.l_edge:
break break
anc.c.l_edge = self.c.l_edge anc.c.l_edge = self.c.l_edge
elif rel_newhead_i < 0: # right dependent elif rel_newhead_i < 0: # right dependent
new_head.c.r_kids += 1 new_head.c.r_kids += 1
# do the same as for l_edge # Do the same as for l_edge
if self.c.r_edge > new_head.c.r_edge: if self.c.r_edge > new_head.c.r_edge:
new_head.c.r_edge = self.c.r_edge new_head.c.r_edge = self.c.r_edge
for anc in new_head.ancestors: for anc in new_head.ancestors:
if anc.c.r_edge >= self.c.r_edge: if anc.c.r_edge >= self.c.r_edge:
break break
anc.c.r_edge = self.c.r_edge anc.c.r_edge = self.c.r_edge
# Set new head
# set new head
self.c.head = rel_newhead_i self.c.head = rel_newhead_i
property conjuncts: property conjuncts:
"""A sequence of coordinated tokens, including the token itself. """A sequence of coordinated tokens, including the token itself.
YIELDS (Token): A coordinated token. YIELDS (Token): A coordinated token.
DOCS: https://spacy.io/api/token#conjuncts
""" """
def __get__(self): def __get__(self):
"""Get a list of conjoined words."""
cdef Token word cdef Token word
if 'conjuncts' in self.doc.user_token_hooks: if "conjuncts" in self.doc.user_token_hooks:
yield from self.doc.user_token_hooks['conjuncts'](self) yield from self.doc.user_token_hooks["conjuncts"](self)
else: else:
if self.dep_ != 'conj': if self.dep != conj:
for word in self.rights: for word in self.rights:
if word.dep_ == 'conj': if word.dep == conj:
yield word yield word
yield from word.conjuncts yield from word.conjuncts
@ -673,7 +741,7 @@ cdef class Token:
RETURNS (unicode): IOB code of named entity tag. RETURNS (unicode): IOB code of named entity tag.
""" """
def __get__(self): def __get__(self):
iob_strings = ('', 'I', 'O', 'B') iob_strings = ("", "I", "O", "B")
return iob_strings[self.c.ent_iob] return iob_strings[self.c.ent_iob]
property ent_id: property ent_id:
@ -700,7 +768,7 @@ cdef class Token:
"""RETURNS (unicode): The trailing whitespace character, if present. """RETURNS (unicode): The trailing whitespace character, if present.
""" """
def __get__(self): def __get__(self):
return ' ' if self.c.spacy else '' return " " if self.c.spacy else ""
property orth_: property orth_:
"""RETURNS (unicode): Verbatim text content (identical to """RETURNS (unicode): Verbatim text content (identical to
@ -773,6 +841,7 @@ cdef class Token:
"""RETURNS (unicode): Coarse-grained part-of-speech tag.""" """RETURNS (unicode): Coarse-grained part-of-speech tag."""
def __get__(self): def __get__(self):
return parts_of_speech.NAMES[self.c.pos] return parts_of_speech.NAMES[self.c.pos]
def __set__(self, pos_name): def __set__(self, pos_name):
self.c.pos = parts_of_speech.IDS[pos_name] self.c.pos = parts_of_speech.IDS[pos_name]
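
The `ancestors` getter in the diff above walks up the parse tree by repeatedly adding a relative head offset, and it stops after at most `doc.length` steps so that a malformed tree cannot loop forever. A minimal pure-Python sketch of that idea follows. It assumes a plain list of relative offsets rather than spaCy's `TokenC` structs, and the names `iter_ancestors` and `heads` are illustrative, not part of spaCy.

```python
def iter_ancestors(heads, i):
    """Yield the indices of the ancestors of token i, nearest first.

    heads[j] is the offset from token j to its head (0 marks the root),
    mirroring the `self.i + self.c.head` arithmetic used in the diff.
    This is an illustrative helper, not spaCy API.
    """
    steps = 0
    # No token can have more ancestors than there are tokens, so the
    # step counter bounds the walk even if the offsets form a cycle.
    while heads[i] != 0 and steps < len(heads):
        i += heads[i]
        yield i
        steps += 1


# "She ate the pizza": She -> ate, ate is the root, the -> pizza, pizza -> ate
heads = [1, 0, 1, -2]
print(list(iter_ancestors(heads, 2)))  # ancestors of "the"
```

With a well-formed tree the counter never triggers. It only matters when the offsets contain a cycle, in which case the walk ends after `len(heads)` steps instead of hanging.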


@ -1,30 +1,31 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
cimport numpy as np
from cython.operator cimport dereference as deref
from libcpp.set cimport set as cppset
import functools import functools
import numpy import numpy
from collections import OrderedDict from collections import OrderedDict
import srsly import srsly
cimport numpy as np
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from thinc.neural._classes.model import Model from thinc.neural._classes.model import Model
from .strings cimport StringStore from .strings cimport StringStore
from .strings import get_string_id from .strings import get_string_id
from .compat import basestring_, path2str from .compat import basestring_, path2str
from .errors import Errors from .errors import Errors
from . import util from . import util
from cython.operator cimport dereference as deref
from libcpp.set cimport set as cppset
def unpickle_vectors(bytes_data): def unpickle_vectors(bytes_data):
return Vectors().from_bytes(bytes_data) return Vectors().from_bytes(bytes_data)
class GlobalRegistry(object): class GlobalRegistry(object):
'''Global store of vectors, to avoid repeatedly loading the data.''' """Global store of vectors, to avoid repeatedly loading the data."""
data = {} data = {}
@classmethod @classmethod
@ -46,8 +47,10 @@ cdef class Vectors:
rows in the vectors.data table. rows in the vectors.data table.
Multiple keys can be mapped to the same vector, and not all of the rows in Multiple keys can be mapped to the same vector, and not all of the rows in
the table need to be assigned --- so len(list(vectors.keys())) may be the table need to be assigned - so len(list(vectors.keys())) may be
greater or smaller than vectors.shape[0]. greater or smaller than vectors.shape[0].
DOCS: https://spacy.io/api/vectors
""" """
cdef public object name cdef public object name
cdef public object data cdef public object data
@ -62,12 +65,14 @@ cdef class Vectors:
keys (iterable): A sequence of keys, aligned with the data. keys (iterable): A sequence of keys, aligned with the data.
name (string): A name to identify the vectors table. name (string): A name to identify the vectors table.
RETURNS (Vectors): The newly created object. RETURNS (Vectors): The newly created object.
DOCS: https://spacy.io/api/vectors#init
""" """
self.name = name self.name = name
if data is None: if data is None:
if shape is None: if shape is None:
shape = (0,0) shape = (0,0)
data = numpy.zeros(shape, dtype='f') data = numpy.zeros(shape, dtype="f")
self.data = data self.data = data
self.key2row = OrderedDict() self.key2row = OrderedDict()
if self.data is not None: if self.data is not None:
@ -84,23 +89,40 @@ cdef class Vectors:
in the vector table. in the vector table.
RETURNS (tuple): A `(rows, dims)` pair. RETURNS (tuple): A `(rows, dims)` pair.
DOCS: https://spacy.io/api/vectors#shape
""" """
return self.data.shape return self.data.shape
@property @property
def size(self): def size(self):
"""RETURNS (int): rows*dims""" """The vector size, i.e. rows * dims.
RETURNS (int): The vector size.
DOCS: https://spacy.io/api/vectors#size
"""
return self.data.shape[0] * self.data.shape[1] return self.data.shape[0] * self.data.shape[1]
@property @property
def is_full(self): def is_full(self):
"""RETURNS (bool): `True` if no slots are available for new keys.""" """Whether the vectors table is full.
RETURNS (bool): `True` if no slots are available for new keys.
DOCS: https://spacy.io/api/vectors#is_full
"""
return self._unset.size() == 0 return self._unset.size() == 0
@property @property
def n_keys(self): def n_keys(self):
"""RETURNS (int) The number of keys in the table. Note that this is the """Get the number of keys in the table. Note that this is the number
number of all keys, not just unique vectors.""" of all keys, not just unique vectors.
RETURNS (int): The number of keys in the table.
DOCS: https://spacy.io/api/vectors#n_keys
"""
return len(self.key2row) return len(self.key2row)
def __reduce__(self): def __reduce__(self):
@ -111,6 +133,8 @@ cdef class Vectors:
key (int): The key to get the vector for. key (int): The key to get the vector for.
RETURNS (ndarray): The vector for the key. RETURNS (ndarray): The vector for the key.
DOCS: https://spacy.io/api/vectors#getitem
""" """
i = self.key2row[key] i = self.key2row[key]
if i is None: if i is None:
@ -123,6 +147,8 @@ cdef class Vectors:
key (int): The key to set the vector for. key (int): The key to set the vector for.
vector (ndarray): The vector to set. vector (ndarray): The vector to set.
DOCS: https://spacy.io/api/vectors#setitem
""" """
i = self.key2row[key] i = self.key2row[key]
self.data[i] = vector self.data[i] = vector
@ -133,6 +159,8 @@ cdef class Vectors:
"""Iterate over the keys in the table. """Iterate over the keys in the table.
YIELDS (int): A key in the table. YIELDS (int): A key in the table.
DOCS: https://spacy.io/api/vectors#iter
""" """
yield from self.key2row yield from self.key2row
@ -140,6 +168,8 @@ cdef class Vectors:
"""Return the number of vectors in the table. """Return the number of vectors in the table.
RETURNS (int): The number of vectors in the data. RETURNS (int): The number of vectors in the data.
DOCS: https://spacy.io/api/vectors#len
""" """
return self.data.shape[0] return self.data.shape[0]
@ -148,6 +178,8 @@ cdef class Vectors:
key (int): The key to check. key (int): The key to check.
RETURNS (bool): Whether the key has a vector entry. RETURNS (bool): Whether the key has a vector entry.
DOCS: https://spacy.io/api/vectors#contains
""" """
return key in self.key2row return key in self.key2row
@ -159,6 +191,12 @@ cdef class Vectors:
If the number of vectors is reduced, keys mapped to rows that have been If the number of vectors is reduced, keys mapped to rows that have been
deleted are removed. These removed items are returned as a list of deleted are removed. These removed items are returned as a list of
`(key, row)` tuples. `(key, row)` tuples.
shape (tuple): A `(rows, dims)` tuple.
inplace (bool): Reallocate the memory.
RETURNS (list): The removed items as a list of `(key, row)` tuples.
DOCS: https://spacy.io/api/vectors#resize
""" """
if inplace: if inplace:
self.data.resize(shape, refcheck=False) self.data.resize(shape, refcheck=False)
@ -175,10 +213,7 @@ cdef class Vectors:
return removed_items return removed_items
def keys(self): def keys(self):
"""A sequence of the keys in the table. """RETURNS (iterable): A sequence of keys in the table."""
RETURNS (iterable): The keys.
"""
return self.key2row.keys() return self.key2row.keys()
def values(self): def values(self):
@ -188,6 +223,8 @@ cdef class Vectors:
returned may be less than the length of the vectors table. returned may be less than the length of the vectors table.
YIELDS (ndarray): A vector in the table. YIELDS (ndarray): A vector in the table.
DOCS: https://spacy.io/api/vectors#values
""" """
for row, vector in enumerate(range(self.data.shape[0])): for row, vector in enumerate(range(self.data.shape[0])):
if not self._unset.count(row): if not self._unset.count(row):
@ -197,6 +234,8 @@ cdef class Vectors:
"""Iterate over `(key, vector)` pairs. """Iterate over `(key, vector)` pairs.
YIELDS (tuple): A key/vector pair. YIELDS (tuple): A key/vector pair.
DOCS: https://spacy.io/api/vectors#items
""" """
for key, row in self.key2row.items(): for key, row in self.key2row.items():
yield key, self.data[row] yield key, self.data[row]
@ -215,7 +254,7 @@ cdef class Vectors:
RETURNS: The requested key, keys, row or rows. RETURNS: The requested key, keys, row or rows.
""" """
if sum(arg is None for arg in (key, keys, row, rows)) != 3: if sum(arg is None for arg in (key, keys, row, rows)) != 3:
bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows} bad_kwargs = {"key": key, "keys": keys, "row": row, "rows": rows}
raise ValueError(Errors.E059.format(kwargs=bad_kwargs)) raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
xp = get_array_module(self.data) xp = get_array_module(self.data)
if key is not None: if key is not None:
@ -224,7 +263,7 @@ cdef class Vectors:
elif keys is not None: elif keys is not None:
keys = [get_string_id(key) for key in keys] keys = [get_string_id(key) for key in keys]
rows = [self.key2row.get(key, -1.) for key in keys] rows = [self.key2row.get(key, -1.) for key in keys]
return xp.asarray(rows, dtype='i') return xp.asarray(rows, dtype="i")
else: else:
targets = set() targets = set()
if row is not None: if row is not None:
@ -236,7 +275,7 @@ cdef class Vectors:
if row in targets: if row in targets:
results.append(key) results.append(key)
targets.remove(row) targets.remove(row)
return xp.asarray(results, dtype='uint64') return xp.asarray(results, dtype="uint64")
def add(self, key, *, vector=None, row=None): def add(self, key, *, vector=None, row=None):
"""Add a key to the table. Keys can be mapped to an existing vector """Add a key to the table. Keys can be mapped to an existing vector
@ -246,6 +285,8 @@ cdef class Vectors:
vector (ndarray / None): A vector to add for the key. vector (ndarray / None): A vector to add for the key.
row (int / None): The row number of a vector to map the key to. row (int / None): The row number of a vector to map the key to.
RETURNS (int): The row the vector was added to. RETURNS (int): The row the vector was added to.
DOCS: https://spacy.io/api/vectors#add
""" """
key = get_string_id(key) key = get_string_id(key)
if row is None and key in self.key2row: if row is None and key in self.key2row:
@ -292,11 +333,10 @@ cdef class Vectors:
sims = xp.dot(batch, vectors.T) sims = xp.dot(batch, vectors.T)
best_rows[i:i+batch_size] = sims.argmax(axis=1) best_rows[i:i+batch_size] = sims.argmax(axis=1)
scores[i:i+batch_size] = sims.max(axis=1) scores[i:i+batch_size] = sims.max(axis=1)
xp = get_array_module(self.data) xp = get_array_module(self.data)
row2key = {row: key for key, row in self.key2row.items()} row2key = {row: key for key, row in self.key2row.items()}
keys = xp.asarray( keys = xp.asarray(
[row2key[row] for row in best_rows if row in row2key], dtype='uint64') [row2key[row] for row in best_rows if row in row2key], dtype="uint64")
return (keys, best_rows, scores) return (keys, best_rows, scores)
def from_glove(self, path): def from_glove(self, path):
@ -308,29 +348,30 @@ cdef class Vectors:
path (unicode / Path): The path to load the GloVe vectors from. path (unicode / Path): The path to load the GloVe vectors from.
RETURNS: A `StringStore` object, holding the key-to-string mapping. RETURNS: A `StringStore` object, holding the key-to-string mapping.
DOCS: https://spacy.io/api/vectors#from_glove
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
width = None width = None
for name in path.iterdir(): for name in path.iterdir():
if name.parts[-1].startswith('vectors'): if name.parts[-1].startswith("vectors"):
_, dims, dtype, _2 = name.parts[-1].split('.') _, dims, dtype, _2 = name.parts[-1].split('.')
width = int(dims) width = int(dims)
break break
else: else:
raise IOError(Errors.E061.format(filename=path)) raise IOError(Errors.E061.format(filename=path))
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims, bin_loc = path / "vectors.{dims}.{dtype}.bin".format(dims=dims, dtype=dtype)
dtype=dtype)
xp = get_array_module(self.data) xp = get_array_module(self.data)
self.data = None self.data = None
with bin_loc.open('rb') as file_: with bin_loc.open("rb") as file_:
self.data = xp.fromfile(file_, dtype=dtype) self.data = xp.fromfile(file_, dtype=dtype)
if dtype != 'float32': if dtype != "float32":
self.data = xp.ascontiguousarray(self.data, dtype='float32') self.data = xp.ascontiguousarray(self.data, dtype="float32")
if self.data.ndim == 1: if self.data.ndim == 1:
self.data = self.data.reshape((self.data.size//width, width)) self.data = self.data.reshape((self.data.size//width, width))
n = 0 n = 0
strings = StringStore() strings = StringStore()
with (path / 'vocab.txt').open('r') as file_: with (path / "vocab.txt").open("r") as file_:
for i, line in enumerate(file_): for i, line in enumerate(file_):
key = strings.add(line.strip()) key = strings.add(line.strip())
self.add(key, row=i) self.add(key, row=i)
@ -341,16 +382,17 @@ cdef class Vectors:
path (unicode / Path): A path to a directory, which will be created if path (unicode / Path): A path to a directory, which will be created if
it doesn't exist. Either a string or a Path-like object. it doesn't exist. Either a string or a Path-like object.
DOCS: https://spacy.io/api/vectors#to_disk
""" """
xp = get_array_module(self.data) xp = get_array_module(self.data)
if xp is numpy: if xp is numpy:
save_array = lambda arr, file_: xp.save(file_, arr, save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False)
allow_pickle=False)
else: else:
save_array = lambda arr, file_: xp.save(file_, arr) save_array = lambda arr, file_: xp.save(file_, arr)
serializers = OrderedDict(( serializers = OrderedDict((
('vectors', lambda p: save_array(self.data, p.open('wb'))), ("vectors", lambda p: save_array(self.data, p.open("wb"))),
('key2row', lambda p: srsly.write_msgpack(p, self.key2row)) ("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
)) ))
return util.to_disk(path, serializers, exclude) return util.to_disk(path, serializers, exclude)
@ -360,6 +402,8 @@ cdef class Vectors:
path (unicode / Path): Directory path, string or Path-like object. path (unicode / Path): Directory path, string or Path-like object.
RETURNS (Vectors): The modified object. RETURNS (Vectors): The modified object.
DOCS: https://spacy.io/api/vectors#from_disk
""" """
def load_key2row(path): def load_key2row(path):
if path.exists(): if path.exists():
@ -380,9 +424,9 @@ cdef class Vectors:
self.data = xp.load(str(path)) self.data = xp.load(str(path))
serializers = OrderedDict(( serializers = OrderedDict((
('key2row', load_key2row), ("key2row", load_key2row),
('keys', load_keys), ("keys", load_keys),
('vectors', load_vectors), ("vectors", load_vectors),
)) ))
util.from_disk(path, serializers, exclude) util.from_disk(path, serializers, exclude)
return self return self
@ -392,15 +436,17 @@ cdef class Vectors:
**exclude: Named attributes to prevent from being serialized. **exclude: Named attributes to prevent from being serialized.
RETURNS (bytes): The serialized form of the `Vectors` object. RETURNS (bytes): The serialized form of the `Vectors` object.
DOCS: https://spacy.io/api/vectors#to_bytes
""" """
def serialize_weights(): def serialize_weights():
if hasattr(self.data, 'to_bytes'): if hasattr(self.data, "to_bytes"):
return self.data.to_bytes() return self.data.to_bytes()
else: else:
return srsly.msgpack_dumps(self.data) return srsly.msgpack_dumps(self.data)
serializers = OrderedDict(( serializers = OrderedDict((
('key2row', lambda: srsly.msgpack_dumps(self.key2row)), ("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
('vectors', serialize_weights) ("vectors", serialize_weights)
)) ))
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
@ -410,16 +456,18 @@ cdef class Vectors:
data (bytes): The data to load from. data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. **exclude: Named attributes to prevent from being loaded.
RETURNS (Vectors): The `Vectors` object. RETURNS (Vectors): The `Vectors` object.
DOCS: https://spacy.io/api/vectors#from_bytes
""" """
def deserialize_weights(b): def deserialize_weights(b):
if hasattr(self.data, 'from_bytes'): if hasattr(self.data, "from_bytes"):
self.data.from_bytes(b) self.data.from_bytes(b)
else: else:
self.data = srsly.msgpack_loads(b) self.data = srsly.msgpack_loads(b)
deserializers = OrderedDict(( deserializers = OrderedDict((
('key2row', lambda b: self.key2row.update(srsly.msgpack_loads(b))), ("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
('vectors', deserialize_weights) ("vectors", deserialize_weights)
)) ))
util.from_bytes(data, deserializers, exclude) util.from_bytes(data, deserializers, exclude)
return self return self
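
The `Vectors` docstrings above describe a table in which several keys may share one row and some rows may have no key, so `n_keys` and the number of rows can differ. The sketch below illustrates that bookkeeping in plain Python. `TinyVectorTable` is a hypothetical, heavily simplified stand-in written for this example. It is not spaCy's implementation and ignores hashing, array backends and resizing.

```python
class TinyVectorTable:
    """Toy key-to-row vector table (illustrative only, not spaCy's Vectors)."""

    def __init__(self, n_rows, width):
        self.data = [[0.0] * width for _ in range(n_rows)]
        self.key2row = {}
        self._unset = set(range(n_rows))  # rows with no key assigned yet

    def add(self, key, vector=None, row=None):
        """Map key to a row, optionally storing a vector in that row."""
        if row is None:
            if key in self.key2row:
                row = self.key2row[key]   # reuse the key's existing row
            elif not self._unset:
                raise ValueError("table is full")
            else:
                row = min(self._unset)    # take the first free row
        self.key2row[key] = row
        self._unset.discard(row)
        if vector is not None:
            self.data[row] = list(vector)
        return row

    @property
    def n_keys(self):
        """Number of keys, counting keys that share a row separately."""
        return len(self.key2row)

    @property
    def is_full(self):
        """True if no rows are left for new vectors."""
        return not self._unset

    def __len__(self):
        """Number of rows in the table."""
        return len(self.data)


table = TinyVectorTable(n_rows=2, width=2)
table.add("cat", vector=[1.0, 0.0])
table.add("kitten", row=0)          # second key sharing row 0
table.add("dog", vector=[0.0, 1.0])
print(table.n_keys, len(table), table.is_full)
```

Here three keys map onto two rows, so `n_keys` exceeds `len(table)`, which is the situation the class docstring warns about.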


@ -4,9 +4,9 @@ from __future__ import unicode_literals
import numpy import numpy
import srsly import srsly
from collections import OrderedDict from collections import OrderedDict
from thinc.neural.util import get_array_module from thinc.neural.util import get_array_module
from .lexeme cimport EMPTY_LEXEME from .lexeme cimport EMPTY_LEXEME
from .lexeme cimport Lexeme from .lexeme cimport Lexeme
from .typedefs cimport attr_t from .typedefs cimport attr_t
@ -27,6 +27,8 @@ cdef class Vocab:
"""A look-up table that allows you to access `Lexeme` objects. The `Vocab` """A look-up table that allows you to access `Lexeme` objects. The `Vocab`
instance also provides access to the `StringStore`, and owns underlying instance also provides access to the `StringStore`, and owns underlying
C-data that is shared between `Doc` objects. C-data that is shared between `Doc` objects.
DOCS: https://spacy.io/api/vocab
""" """
def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None, def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None,
strings=tuple(), oov_prob=-20., **deprecated_kwargs): strings=tuple(), oov_prob=-20., **deprecated_kwargs):
@ -62,7 +64,7 @@ cdef class Vocab:
langfunc = None langfunc = None
if self.lex_attr_getters: if self.lex_attr_getters:
langfunc = self.lex_attr_getters.get(LANG, None) langfunc = self.lex_attr_getters.get(LANG, None)
return langfunc('_') if langfunc else '' return langfunc("_") if langfunc else ""
def __len__(self): def __len__(self):
"""The current number of lexemes stored. """The current number of lexemes stored.
@ -87,11 +89,7 @@ cdef class Vocab:
available bit will be chosen. available bit will be chosen.
RETURNS (int): The integer ID by which the flag value can be checked. RETURNS (int): The integer ID by which the flag value can be checked.
EXAMPLE: DOCS: https://spacy.io/api/vocab#add_flag
>>> my_product_getter = lambda text: text in ['spaCy', 'dislaCy']
>>> MY_PRODUCT = nlp.vocab.add_flag(my_product_getter)
>>> doc = nlp(u'I like spaCy')
>>> assert doc[2].check_flag(MY_PRODUCT) == True
""" """
if flag_id == -1: if flag_id == -1:
for bit in range(1, 64): for bit in range(1, 64):
@ -112,7 +110,7 @@ cdef class Vocab:
`Lexeme` if necessary using memory acquired from the given pool. If the `Lexeme` if necessary using memory acquired from the given pool. If the
pool is the lexicon's own memory, the lexeme is saved in the lexicon. pool is the lexicon's own memory, the lexeme is saved in the lexicon.
""" """
if string == u'': if string == "":
return &EMPTY_LEXEME return &EMPTY_LEXEME
cdef LexemeC* lex cdef LexemeC* lex
cdef hash_t key = self.strings[string] cdef hash_t key = self.strings[string]
@ -176,10 +174,12 @@ cdef class Vocab:
string (unicode): The ID string. string (unicode): The ID string.
RETURNS (bool): Whether the string has an entry in the vocabulary. RETURNS (bool): Whether the string has an entry in the vocabulary.
DOCS: https://spacy.io/api/vocab#contains
""" """
cdef hash_t int_key cdef hash_t int_key
if isinstance(key, bytes): if isinstance(key, bytes):
int_key = self.strings[key.decode('utf8')] int_key = self.strings[key.decode("utf8")]
elif isinstance(key, unicode): elif isinstance(key, unicode):
int_key = self.strings[key] int_key = self.strings[key]
else: else:
@ -191,6 +191,8 @@ cdef class Vocab:
"""Iterate over the lexemes in the vocabulary. """Iterate over the lexemes in the vocabulary.
YIELDS (Lexeme): An entry in the vocabulary. YIELDS (Lexeme): An entry in the vocabulary.
DOCS: https://spacy.io/api/vocab#iter
""" """
cdef attr_t key cdef attr_t key
cdef size_t addr cdef size_t addr
@ -210,8 +212,10 @@ cdef class Vocab:
RETURNS (Lexeme): The lexeme indicated by the given ID. RETURNS (Lexeme): The lexeme indicated by the given ID.
EXAMPLE: EXAMPLE:
>>> apple = nlp.vocab.strings['apple'] >>> apple = nlp.vocab.strings["apple"]
>>> assert nlp.vocab[apple] == nlp.vocab[u'apple'] >>> assert nlp.vocab[apple] == nlp.vocab[u"apple"]
DOCS: https://spacy.io/api/vocab#getitem
""" """
cdef attr_t orth cdef attr_t orth
if isinstance(id_or_string, unicode): if isinstance(id_or_string, unicode):
@ -284,6 +288,8 @@ cdef class Vocab:
`(string, score)` tuples, where `string` is the entry the removed `(string, score)` tuples, where `string` is the entry the removed
word was mapped to, and `score` the similarity score between the word was mapped to, and `score` the similarity score between the
two words. two words.
DOCS: https://spacy.io/api/vocab#prune_vectors
""" """
xp = get_array_module(self.vectors.data) xp = get_array_module(self.vectors.data)
# Make prob negative so it sorts by rank ascending # Make prob negative so it sorts by rank ascending
@ -291,16 +297,12 @@ cdef class Vocab:
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth) priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
for lex in self if lex.orth in self.vectors.key2row] for lex in self if lex.orth in self.vectors.key2row]
priority.sort() priority.sort()
indices = xp.asarray([i for (prob, i, key) in priority], dtype='i') indices = xp.asarray([i for (prob, i, key) in priority], dtype="i")
keys = xp.asarray([key for (prob, i, key) in priority], dtype='uint64') keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")
keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]]) keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]])
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]]) toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
self.vectors = Vectors(data=keep, keys=keys) self.vectors = Vectors(data=keep, keys=keys)
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size) syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
remap = {} remap = {}
for i, key in enumerate(keys[nr_row:]): for i, key in enumerate(keys[nr_row:]):
self.vectors.add(key, row=syn_rows[i]) self.vectors.add(key, row=syn_rows[i])
@ -319,21 +321,22 @@ cdef class Vocab:
RETURNS (numpy.ndarray): A word vector. Size RETURNS (numpy.ndarray): A word vector. Size
and shape determined by the `vocab.vectors` instance. Usually, a and shape determined by the `vocab.vectors` instance. Usually, a
numpy ndarray of shape (300,) and dtype float32. numpy ndarray of shape (300,) and dtype float32.
DOCS: https://spacy.io/api/vocab#get_vector
""" """
if isinstance(orth, basestring_): if isinstance(orth, basestring_):
orth = self.strings.add(orth) orth = self.strings.add(orth)
word = self[orth].orth_ word = self[orth].orth_
if orth in self.vectors.key2row: if orth in self.vectors.key2row:
return self.vectors[orth] return self.vectors[orth]
# Assign default ngram limits to minn and maxn which is the length of the word. # Assign default ngram limits to minn and maxn which is the length of the word.
if minn is None: if minn is None:
minn = len(word) minn = len(word)
if maxn is None: if maxn is None:
maxn = len(word) maxn = len(word)
vectors = numpy.zeros((self.vectors_length,), dtype='f') vectors = numpy.zeros((self.vectors_length,), dtype="f")
# Fasttext's ngram computation taken from
# Fasttext's ngram computation taken from https://github.com/facebookresearch/fastText # https://github.com/facebookresearch/fastText
ngrams_size = 0; ngrams_size = 0;
for i in range(len(word)): for i in range(len(word)):
ngram = "" ngram = ""
@ -356,12 +359,16 @@ cdef class Vocab:
n = n + 1 n = n + 1
if ngrams_size > 0: if ngrams_size > 0:
vectors = vectors * (1.0/ngrams_size) vectors = vectors * (1.0/ngrams_size)
return vectors return vectors
def set_vector(self, orth, vector): def set_vector(self, orth, vector):
"""Set a vector for a word in the vocabulary. Words can be referenced """Set a vector for a word in the vocabulary. Words can be referenced
by string or int ID. by string or int ID.
orth (int / unicode): The word.
vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set.
DOCS: https://spacy.io/api/vocab#set_vector
""" """
if isinstance(orth, basestring_): if isinstance(orth, basestring_):
orth = self.strings.add(orth) orth = self.strings.add(orth)
@ -372,13 +379,19 @@ cdef class Vocab:
else: else:
width = self.vectors.shape[1] width = self.vectors.shape[1]
self.vectors.resize((new_rows, width)) self.vectors.resize((new_rows, width))
lex = self[orth] # Adds worse to vocab lex = self[orth] # Adds words to vocab
self.vectors.add(orth, vector=vector) self.vectors.add(orth, vector=vector)
self.vectors.add(orth, vector=vector) self.vectors.add(orth, vector=vector)
def has_vector(self, orth): def has_vector(self, orth):
"""Check whether a word has a vector. Returns False if no vectors have """Check whether a word has a vector. Returns False if no vectors have
been loaded. Words can be looked up by string or int ID.""" been loaded. Words can be looked up by string or int ID.
orth (int / unicode): The word.
RETURNS (bool): Whether the word has a vector.
DOCS: https://spacy.io/api/vocab#has_vector
"""
if isinstance(orth, basestring_): if isinstance(orth, basestring_):
orth = self.strings.add(orth) orth = self.strings.add(orth)
return orth in self.vectors return orth in self.vectors
@ -388,12 +401,14 @@ cdef class Vocab:
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist. Paths may be either strings or Path-like objects.
DOCS: https://spacy.io/api/vocab#to_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
if not path.exists(): if not path.exists():
path.mkdir() path.mkdir()
self.strings.to_disk(path / 'strings.json') self.strings.to_disk(path / "strings.json")
with (path / 'lexemes.bin').open('wb') as file_: with (path / "lexemes.bin").open('wb') as file_:
file_.write(self.lexemes_to_bytes()) file_.write(self.lexemes_to_bytes())
if self.vectors is not None: if self.vectors is not None:
self.vectors.to_disk(path) self.vectors.to_disk(path)
@ -405,13 +420,15 @@ cdef class Vocab:
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects. strings or `Path`-like objects.
RETURNS (Vocab): The modified `Vocab` object. RETURNS (Vocab): The modified `Vocab` object.
DOCS: https://spacy.io/api/vocab#from_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
self.strings.from_disk(path / 'strings.json') self.strings.from_disk(path / "strings.json")
with (path / 'lexemes.bin').open('rb') as file_: with (path / "lexemes.bin").open("rb") as file_:
self.lexemes_from_bytes(file_.read()) self.lexemes_from_bytes(file_.read())
if self.vectors is not None: if self.vectors is not None:
self.vectors.from_disk(path, exclude='strings.json') self.vectors.from_disk(path, exclude="strings.json")
if self.vectors.name is not None: if self.vectors.name is not None:
link_vectors_to_models(self) link_vectors_to_models(self)
return self return self
@ -421,6 +438,8 @@ cdef class Vocab:
**exclude: Named attributes to prevent from being serialized. **exclude: Named attributes to prevent from being serialized.
RETURNS (bytes): The serialized form of the `Vocab` object. RETURNS (bytes): The serialized form of the `Vocab` object.
DOCS: https://spacy.io/api/vocab#to_bytes
""" """
def deserialize_vectors(): def deserialize_vectors():
if self.vectors is None: if self.vectors is None:
@ -429,9 +448,9 @@ cdef class Vocab:
return self.vectors.to_bytes() return self.vectors.to_bytes()
getters = OrderedDict(( getters = OrderedDict((
('strings', lambda: self.strings.to_bytes()), ("strings", lambda: self.strings.to_bytes()),
('lexemes', lambda: self.lexemes_to_bytes()), ("lexemes", lambda: self.lexemes_to_bytes()),
('vectors', deserialize_vectors) ("vectors", deserialize_vectors)
)) ))
return util.to_bytes(getters, exclude) return util.to_bytes(getters, exclude)
@ -441,6 +460,8 @@ cdef class Vocab:
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. **exclude: Named attributes to prevent from being loaded.
RETURNS (Vocab): The `Vocab` object. RETURNS (Vocab): The `Vocab` object.
DOCS: https://spacy.io/api/vocab#from_bytes
""" """
def serialize_vectors(b): def serialize_vectors(b):
if self.vectors is None: if self.vectors is None:
@ -448,9 +469,9 @@ cdef class Vocab:
else: else:
return self.vectors.from_bytes(b) return self.vectors.from_bytes(b)
setters = OrderedDict(( setters = OrderedDict((
('strings', lambda b: self.strings.from_bytes(b)), ("strings", lambda b: self.strings.from_bytes(b)),
('lexemes', lambda b: self.lexemes_from_bytes(b)), ("lexemes", lambda b: self.lexemes_from_bytes(b)),
('vectors', lambda b: serialize_vectors(b)) ("vectors", lambda b: serialize_vectors(b))
)) ))
util.from_bytes(bytes_data, setters, exclude) util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None: if self.vectors.name is not None:
@ -467,7 +488,7 @@ cdef class Vocab:
if addr == 0: if addr == 0:
continue continue
size += sizeof(lex_data.data) size += sizeof(lex_data.data)
byte_string = b'\0' * size byte_string = b"\0" * size
byte_ptr = <unsigned char*>byte_string byte_ptr = <unsigned char*>byte_string
cdef int j cdef int j
cdef int i = 0 cdef int i = 0


@ -186,8 +186,8 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| ----------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. | | `lang` | positional | Model language. |
| `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. | | `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. |
| `train_path` | positional | Location of JSON-formatted training data. | | `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. | | `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. | | `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | | `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--vectors`, `-v` | option | Model to load vectors from. | | `--vectors`, `-v` | option | Model to load vectors from. |


@ -1,7 +1,7 @@
--- ---
title: DependencyParser title: DependencyParser
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/pipes.pyx
--- ---
This class is a subclass of `Pipe` and follows the same API. The pipeline This class is a subclass of `Pipe` and follows the same API. The pipeline
@ -211,7 +211,7 @@ Modify the pipe's model, to use the given parameter values.
> ```python > ```python
> parser = DependencyParser(nlp.vocab) > parser = DependencyParser(nlp.vocab)
> with parser.use_params(): > with parser.use_params():
> parser.to_disk('/best_model') > parser.to_disk("/best_model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -226,7 +226,7 @@ Add a new label to the pipe.
> >
> ```python > ```python
> parser = DependencyParser(nlp.vocab) > parser = DependencyParser(nlp.vocab)
> parser.add_label('MY_LABEL') > parser.add_label("MY_LABEL")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -241,7 +241,7 @@ Serialize the pipe to disk.
> >
> ```python > ```python
> parser = DependencyParser(nlp.vocab) > parser = DependencyParser(nlp.vocab)
> parser.to_disk('/path/to/parser') > parser.to_disk("/path/to/parser")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -256,7 +256,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
> >
> ```python > ```python
> parser = DependencyParser(nlp.vocab) > parser = DependencyParser(nlp.vocab)
> parser.from_disk('/path/to/parser') > parser.from_disk("/path/to/parser")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -266,7 +266,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
## DependencyParser.to_bytes {#to_bytes tag="method"} ## DependencyParser.to_bytes {#to_bytes tag="method"}
> #### example > #### Example
> >
> ```python > ```python
> parser = DependencyParser(nlp.vocab) > parser = DependencyParser(nlp.vocab)


@ -127,6 +127,7 @@ details, see the documentation on
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. | | `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | | `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
| `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. | | `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. |
| `force` | bool | Force overwriting existing attribute. |
## Doc.get_extension {#get_extension tag="classmethod" new="2"} ## Doc.get_extension {#get_extension tag="classmethod" new="2"}
@ -263,6 +264,46 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
| ----------- | -------------------------------------- | ----------------------------------------------- | | ----------- | -------------------------------------- | ----------------------------------------------- |
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. | | **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
## Doc.to_json {#to_json tag="method" new="2.1"}
Convert a Doc to JSON. The format it produces will be the new format for the
[`spacy train`](/api/cli#train) command (not implemented yet). If custom
underscore attributes are specified, their values need to be JSON-serializable.
They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`.
> #### Example
>
> ```python
> doc = nlp(u"Hello")
> json_doc = doc.to_json()
> ```
>
> #### Result
>
> ```python
> {
> "text": "Hello",
> "ents": [],
> "sents": [{"start": 0, "end": 5}],
> "tokens": [{"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0}
> ]
> }
> ```
| Name | Type | Description |
| ------------ | ---- | ------------------------------------------------------------------------------ |
| `underscore` | list | Optional list of string names of custom JSON-serializable `doc._.` attributes. |
| **RETURNS** | dict | The JSON-formatted data. |
<Infobox title="Deprecation note" variant="warning">
spaCy previously implemented a `Doc.print_tree` method that returned a similar
JSON-formatted representation of a `Doc`. As of v2.1, this method is deprecated
in favor of `Doc.to_json`. If you need more complex nested representations, you
might want to write your own function to extract the data.
</Infobox>
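As a rough illustration of the data shape described above, the following sketch uses only the standard library and does not call spaCy. The dict is a hand-written stand-in for what `doc.to_json(underscore=["foo"])` might return: the values are copied from the result shown earlier, and the `foo` attribute is a hypothetical custom extension. It assumes that `start` and `end` are character offsets into `text`, which is consistent with the example but not stated explicitly.

```python
import json

# Hand-written stand-in for the output of doc.to_json(underscore=["foo"]).
# "foo" is a hypothetical custom attribute; the rest mirrors the documented result.
json_doc = {
    "text": "Hello",
    "ents": [],
    "sents": [{"start": 0, "end": 5}],
    "tokens": [
        {"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0}
    ],
    "_": {"foo": "bar"},
}

# Assumption: start/end are character offsets, so slicing recovers the token text.
token = json_doc["tokens"][0]
token_text = json_doc["text"][token["start"]:token["end"]]

# Because every value is JSON-serializable, the data survives a round trip.
restored = json.loads(json.dumps(json_doc))
```

The round trip is the reason custom underscore values need to be JSON-serializable: a value such as a `set` or an arbitrary object would make `json.dumps` raise a `TypeError`.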
## Doc.to_array {#to_array tag="method"} ## Doc.to_array {#to_array tag="method"}
Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence
@ -310,7 +351,7 @@ array of attributes.
| Name | Type | Description | | Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------- | | ----------- | -------------------------------------- | ----------------------------- |
| `attrs` | ints | A list of attribute ID ints. | | `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. | | `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
| **RETURNS** | `Doc` | Itself. | | **RETURNS** | `Doc` | Itself. |
@ -429,14 +470,16 @@ to specify how the new subtokens should be integrated into the dependency tree.
The list of per-token heads can either be a token in the original document, e.g. The list of per-token heads can either be a token in the original document, e.g.
`doc[2]`, or a tuple consisting of the token in the original document and its `doc[2]`, or a tuple consisting of the token in the original document and its
subtoken index. For example, `(doc[3], 1)` will attach the subtoken to the subtoken index. For example, `(doc[3], 1)` will attach the subtoken to the
second subtoken of `doc[3]`. This mechanism allows attaching subtokens to other second subtoken of `doc[3]`.
newly created subtokens, without having to keep track of the changing token
indices. If the specified head token will be split within the retokenizer block This mechanism allows attaching subtokens to other newly created subtokens,
and no subtoken index is specified, it will default to `0`. Attributes to set on without having to keep track of the changing token indices. If the specified
subtokens can be provided as a list of values. They'll be applied to the head token will be split within the retokenizer block and no subtoken index is
resulting token (if they're context-dependent token attributes like `LEMMA` or specified, it will default to `0`. Attributes to set on subtokens can be
`DEP`) or to the underlying lexeme (if they're context-independent lexical provided as a list of values. They'll be applied to the resulting token (if
attributes like `LOWER` or `IS_STOP`). they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
underlying lexeme (if they're context-independent lexical attributes like
`LOWER` or `IS_STOP`).
> #### Example > #### Example
> >
@ -487,8 +530,8 @@ and end token boundaries, the document remains unchanged.
## Doc.ents {#ents tag="property" model="NER"} ## Doc.ents {#ents tag="property" model="NER"}
Iterate over the entities in the document. Yields named-entity `Span` objects, The named entities in the document. Returns a tuple of named entity `Span`
if the entity recognizer has been applied to the document. objects, if the entity recognizer has been applied.
> #### Example > #### Example
> >
@ -500,9 +543,9 @@ if the entity recognizer has been applied to the document.
> assert ents[0].text == u"Mr. Best" > assert ents[0].text == u"Mr. Best"
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------ | ------------------------- | | ----------- | ----- | ------------------------------------------------ |
| **YIELDS** | `Span` | Entities in the document. | | **RETURNS** | tuple | Entities in the document, one `Span` per entity. |
## Doc.noun_chunks {#noun_chunks tag="property" model="parser"} ## Doc.noun_chunks {#noun_chunks tag="property" model="parser"}
@ -541,9 +584,9 @@ will be unavailable.
> assert [s.root.text for s in sents] == [u"is", u"'s"] > assert [s.root.text for s in sents] == [u"is", u"'s"]
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ---------------------------------- | ----------- | | ---------- | ------ | -------------------------- |
| **YIELDS** | `Span | Sentences in the document. | | **YIELDS** | `Span` | Sentences in the document. |
## Doc.has_vector {#has_vector tag="property" model="vectors"} ## Doc.has_vector {#has_vector tag="property" model="vectors"}


@ -1,7 +1,7 @@
--- ---
title: EntityRecognizer title: EntityRecognizer
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/pipes.pyx
--- ---
This class is a subclass of `Pipe` and follows the same API. The pipeline This class is a subclass of `Pipe` and follows the same API. The pipeline
@ -211,7 +211,7 @@ Modify the pipe's model, to use the given parameter values.
> ```python > ```python
> ner = EntityRecognizer(nlp.vocab) > ner = EntityRecognizer(nlp.vocab)
> with ner.use_params(): > with ner.use_params():
> ner.to_disk('/best_model') > ner.to_disk("/best_model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -226,7 +226,7 @@ Add a new label to the pipe.
> >
> ```python > ```python
> ner = EntityRecognizer(nlp.vocab) > ner = EntityRecognizer(nlp.vocab)
> ner.add_label('MY_LABEL') > ner.add_label("MY_LABEL")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -241,7 +241,7 @@ Serialize the pipe to disk.
> >
> ```python > ```python
> ner = EntityRecognizer(nlp.vocab) > ner = EntityRecognizer(nlp.vocab)
> ner.to_disk('/path/to/ner') > ner.to_disk("/path/to/ner")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -256,7 +256,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
> >
> ```python > ```python
> ner = EntityRecognizer(nlp.vocab) > ner = EntityRecognizer(nlp.vocab)
> ner.from_disk('/path/to/ner') > ner.from_disk("/path/to/ner")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -266,7 +266,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
## EntityRecognizer.to_bytes {#to_bytes tag="method"} ## EntityRecognizer.to_bytes {#to_bytes tag="method"}
> #### example > #### Example
> >
> ```python > ```python
> ner = EntityRecognizer(nlp.vocab) > ner = EntityRecognizer(nlp.vocab)


@ -1,7 +1,7 @@
--- ---
title: EntityRuler title: EntityRuler
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/entityruler.py
new: 2.1 new: 2.1
--- ---
@ -128,7 +128,7 @@ newline-delimited JSON (JSONL).
> >
> ```python > ```python
> ruler = EntityRuler(nlp) > ruler = EntityRuler(nlp)
> ruler.to_disk('/path/to/rules.jsonl') > ruler.to_disk("/path/to/rules.jsonl")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -144,7 +144,7 @@ JSON (JSONL) with one entry per line.
> >
> ```python > ```python
> ruler = EntityRuler(nlp) > ruler = EntityRuler(nlp)
> ruler.from_disk('/path/to/rules.jsonl') > ruler.from_disk("/path/to/rules.jsonl")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |


@ -1,7 +1,7 @@
--- ---
title: Pipeline Functions title: Pipeline Functions
teaser: Other built-in pipeline components and helpers teaser: Other built-in pipeline components and helpers
source: spacy/pipeline.pyx source: spacy/pipeline/functions.py
menu: menu:
- ['merge_noun_chunks', 'merge_noun_chunks'] - ['merge_noun_chunks', 'merge_noun_chunks']
- ['merge_entities', 'merge_entities'] - ['merge_entities', 'merge_entities']
@ -73,10 +73,10 @@ components to the end of the pipeline and after all other components.
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | | `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with merged entities. | | **RETURNS** | `Doc` | The modified `Doc` with merged entities. |
## merge_subtokens {#merge_entities tag="function" new="2.1"} ## merge_subtokens {#merge_subtokens tag="function" new="2.1"}
Merge subtokens into a single token. Also available via the string name Merge subtokens into a single token. Also available via the string name
`"merge_entities"`. After initialization, the component is typically added to `"merge_subtokens"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
As of v2.1, the parser is able to predict "subtokens" that should be merged into As of v2.1, the parser is able to predict "subtokens" that should be merged into


@ -1,7 +1,7 @@
--- ---
title: SentenceSegmenter title: SentenceSegmenter
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/hooks.py
--- ---
A simple spaCy hook, to allow custom sentence boundary detection logic that A simple spaCy hook, to allow custom sentence boundary detection logic that


@ -260,8 +260,8 @@ Retokenize the document, such that the span is merged into a single token.
## Span.ents {#ents tag="property" new="2.0.12" model="ner"} ## Span.ents {#ents tag="property" new="2.0.12" model="ner"}
Iterate over the entities in the span. Yields named-entity `Span` objects, if The named entities in the span. Returns a tuple of named entity `Span` objects,
the entity recognizer has been applied to the parent document. if the entity recognizer has been applied.
> #### Example > #### Example
> >
@ -274,9 +274,9 @@ the entity recognizer has been applied to the parent document.
> assert ents[0].text == u"Mr. Best" > assert ents[0].text == u"Mr. Best"
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------ | ------------------------- | | ----------- | ----- | -------------------------------------------- |
| **YIELDS** | `Span` | Entities in the document. | | **RETURNS** | tuple | Entities in the span, one `Span` per entity. |
## Span.as_doc {#as_doc tag="method"} ## Span.as_doc {#as_doc tag="method"}
@ -297,8 +297,9 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
## Span.root {#root tag="property" model="parser"} ## Span.root {#root tag="property" model="parser"}
The token within the span that's highest in the parse tree. If there's a tie, The token with the shortest path to the root of the sentence (or the root
the earliest is preferred. itself). If multiple tokens are equally high in the tree, the first token is
taken.
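The selection rule can be sketched in plain Python. This is only an illustration of the rule as worded above, not spaCy's implementation: `span_root` and the `heads` list are hypothetical, and the sketch assumes each token stores the index of its head, with the sentence root pointing to itself.

```python
def span_root(heads, start, end):
    """Return the index of the token in [start, end) closest to the sentence root.

    heads[i] is the index of token i's head; the root token is its own head.
    """
    def depth(i):
        # Count the arcs between token i and the sentence root.
        steps = 0
        while heads[i] != i:
            i = heads[i]
            steps += 1
        return steps

    # min() returns the first minimal item, so ties go to the earliest token.
    return min(range(start, end), key=depth)


# Toy tree for "I like New York in Autumn ." with hypothetical head indices.
heads = [1, 1, 3, 1, 1, 4, 1]
root_of_like_new_york = span_root(heads, 1, 4)  # span "like New York"
root_of_new_york = span_root(heads, 2, 4)       # span "New York"
root_of_york_in = span_root(heads, 3, 5)        # "York" and "in" are equally high
```

In the last call both tokens are one arc away from the root, so the first of them is returned.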
> #### Example > #### Example
> >

View File

@ -1,7 +1,7 @@
--- ---
title: Tagger title: Tagger
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/pipes.pyx
--- ---
This class is a subclass of `Pipe` and follows the same API. The pipeline This class is a subclass of `Pipe` and follows the same API. The pipeline
@ -209,7 +209,7 @@ Modify the pipe's model, to use the given parameter values.
> ```python > ```python
> tagger = Tagger(nlp.vocab) > tagger = Tagger(nlp.vocab)
> with tagger.use_params(): > with tagger.use_params():
> tagger.to_disk('/best_model') > tagger.to_disk("/best_model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -225,7 +225,7 @@ Add a new label to the pipe.
> ```python > ```python
> from spacy.symbols import POS > from spacy.symbols import POS
> tagger = Tagger(nlp.vocab) > tagger = Tagger(nlp.vocab)
> tagger.add_label('MY_LABEL', {POS: 'NOUN'}) > tagger.add_label("MY_LABEL", {POS: 'NOUN'})
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -241,7 +241,7 @@ Serialize the pipe to disk.
> >
> ```python > ```python
> tagger = Tagger(nlp.vocab) > tagger = Tagger(nlp.vocab)
> tagger.to_disk('/path/to/tagger') > tagger.to_disk("/path/to/tagger")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -256,7 +256,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
> >
> ```python > ```python
> tagger = Tagger(nlp.vocab) > tagger = Tagger(nlp.vocab)
> tagger.from_disk('/path/to/tagger') > tagger.from_disk("/path/to/tagger")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -266,7 +266,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
## Tagger.to_bytes {#to_bytes tag="method"} ## Tagger.to_bytes {#to_bytes tag="method"}
> #### example > #### Example
> >
> ```python > ```python
> tagger = Tagger(nlp.vocab) > tagger = Tagger(nlp.vocab)


@ -1,7 +1,7 @@
--- ---
title: TextCategorizer title: TextCategorizer
tag: class tag: class
source: spacy/pipeline.pyx source: spacy/pipeline/pipes.pyx
new: 2 new: 2
--- ---
@ -227,7 +227,7 @@ Modify the pipe's model, to use the given parameter values.
> ```python > ```python
> textcat = TextCategorizer(nlp.vocab) > textcat = TextCategorizer(nlp.vocab)
> with textcat.use_params(): > with textcat.use_params():
> textcat.to_disk('/best_model') > textcat.to_disk("/best_model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -242,7 +242,7 @@ Add a new label to the pipe.
> >
> ```python > ```python
> textcat = TextCategorizer(nlp.vocab) > textcat = TextCategorizer(nlp.vocab)
> textcat.add_label('MY_LABEL') > textcat.add_label("MY_LABEL")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -257,7 +257,7 @@ Serialize the pipe to disk.
> >
> ```python > ```python
> textcat = TextCategorizer(nlp.vocab) > textcat = TextCategorizer(nlp.vocab)
> textcat.to_disk('/path/to/textcat') > textcat.to_disk("/path/to/textcat")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -272,7 +272,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
> >
> ```python > ```python
> textcat = TextCategorizer(nlp.vocab) > textcat = TextCategorizer(nlp.vocab)
> textcat.from_disk('/path/to/textcat') > textcat.from_disk("/path/to/textcat")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
@ -282,7 +282,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
## TextCategorizer.to_bytes {#to_bytes tag="method"} ## TextCategorizer.to_bytes {#to_bytes tag="method"}
> #### example > #### Example
> >
> ```python > ```python
> textcat = TextCategorizer(nlp.vocab) > textcat = TextCategorizer(nlp.vocab)


@ -324,7 +324,7 @@ A sequence containing the token and all the token's syntactic descendants.
## Token.is_sent_start {#is_sent_start tag="property" new="2"} ## Token.is_sent_start {#is_sent_start tag="property" new="2"}
A boolean value indicating whether the token starts a sentence. `None` if A boolean value indicating whether the token starts a sentence. `None` if
unknown. Defaults to `True` for the first token in the `doc`. unknown. Defaults to `True` for the first token in the `Doc`.
> #### Example > #### Example
> >


@ -116,6 +116,72 @@ details and examples.
| `string` | unicode | The string to specially tokenize. | | `string` | unicode | The string to specially tokenize. |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. | | `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
## Tokenizer.to_disk {#to_disk tag="method"}
Serialize the tokenizer to disk.
> #### Example
>
> ```python
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.to_disk("/path/to/tokenizer")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## Tokenizer.from_disk {#from_disk tag="method"}
Load the tokenizer from disk. Modifies the object in place and returns it.
> #### Example
>
> ```python
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.from_disk("/path/to/tokenizer")
> ```
| Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
## Tokenizer.to_bytes {#to_bytes tag="method"}
> #### Example
>
> ```python
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer_bytes = tokenizer.to_bytes()
> ```
Serialize the tokenizer to a bytestring.
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
## Tokenizer.from_bytes {#from_bytes tag="method"}
Load the tokenizer from a bytestring. Modifies the object in place and returns
it.
> #### Example
>
> ```python
> tokenizer_bytes = tokenizer.to_bytes()
> tokenizer = Tokenizer(nlp.vocab)
> tokenizer.from_bytes(tokenizer_bytes)
> ```
| Name | Type | Description |
| ------------ | ----------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
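The "modifies the object in place and returns it" contract, together with the `**exclude` keyword arguments, can be illustrated without spaCy installed. The class below is a hypothetical stand-in written for this sketch. It is not the `Tokenizer` implementation, and its attribute names and JSON encoding are assumptions.

```python
import json


class ToySerializable:
    """Hypothetical stand-in for illustration; NOT spaCy's Tokenizer."""

    def __init__(self, prefixes=(), suffixes=()):
        self.prefixes = list(prefixes)
        self.suffixes = list(suffixes)

    def to_bytes(self, **exclude):
        # Serialize every attribute that was not named in `exclude`.
        data = {
            name: getattr(self, name)
            for name in ("prefixes", "suffixes")
            if name not in exclude
        }
        return json.dumps(data, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        # Update attributes in place, skipping any named in `exclude`.
        data = json.loads(bytes_data.decode("utf8"))
        for name, value in data.items():
            if name not in exclude:
                setattr(self, name, value)
        return self  # returning the same object allows call chaining


source = ToySerializable(prefixes=["("], suffixes=[")"])
payload = source.to_bytes()

target = ToySerializable()
returned = target.from_bytes(payload, suffixes=True)  # skip loading suffixes

print(returned is target)  # True
print(target.prefixes)     # ['(']
print(target.suffixes)     # []
```

Because `from_bytes` returns the object it modified, construction and loading can be written as one expression, for example `ToySerializable().from_bytes(payload)`.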
## Attributes {#attributes}
| Name | Type | Description |


@ -642,7 +642,7 @@ All Python code is written in an **intersection of Python 2 and Python 3**. This
is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or
platform compatibility only lives in `spacy.compat`. To distinguish them from
the builtin functions, replacement functions are suffixed with an underscore,
-e.e `unicode_`.
+e.g. `unicode_`.
> #### Example
>
@ -660,7 +660,7 @@ e.e `unicode_`.
| `compat.input_` | `raw_input` | `input` |
| `compat.path2str` | `str(path)` with `.decode('utf8')` | `str(path)` |
-### compat.is_config {#is_config tag="function"}
+### compat.is_config {#compat.is_config tag="function"}
Check if a specific configuration of Python version and operating system matches
the user's setup. Mostly used to display targeted error messages.
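The pattern described here (underscore-suffixed replacements plus a configuration check) can be sketched with the standard library alone. This is an illustrative reimplementation, not the source of `spacy.compat`: the flag names and the `is_config` parameters are assumptions, since the text above does not list them.

```python
import sys

# Assumed module-level flags describing the running interpreter and platform.
is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith("win")
is_linux = sys.platform.startswith("linux")
is_osx = sys.platform == "darwin"

# Replacement functions carry a trailing underscore so they cannot be
# confused with the builtins they stand in for.
if is_python2:
    unicode_ = unicode  # noqa: F821 (only defined on Python 2)
    input_ = raw_input  # noqa: F821 (only defined on Python 2)
else:
    unicode_ = str
    input_ = input


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Return True if every flag that was passed matches the current setup.

    Flags left as None are ignored, so callers only name what they care about.
    """
    requested = (python2, python3, windows, linux, osx)
    actual = (is_python2, is_python3, is_windows, is_linux, is_osx)
    return all(want is None or want == have for want, have in zip(requested, actual))


# Typical use: show a targeted hint only to the affected users.
if is_config(python2=True, windows=True):
    print("Hint for Python 2 on Windows users")
```

Treating `None` as "don't care" keeps call sites short, because a caller names only the conditions that matter for a given error message.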


@ -424,7 +424,7 @@ take a path to a JSON file containing the patterns. This lets you reuse the
component with different patterns, depending on your application:
```python
-html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
+html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json")
```
<Infobox title="📖 Processing pipelines">