Merge branch 'master' into feature/lemmatizer

Ines Montani 2019-03-16 13:44:22 +01:00
commit 278e9d2eb0
115 changed files with 2640 additions and 2222 deletions

.github/contributors/Poluglottos.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ryan Ford |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Mar 13 2019 |
| GitHub username | Poluglottos |
| Website (optional) | |

.github/contributors/tmetzl.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tim Metzler |
| Company name (if applicable) | University of Applied Sciences Bonn-Rhein-Sieg |
| Title or role (if applicable) | |
| Date | 03/10/2019 |
| GitHub username | tmetzl |
| Website (optional) | |


@ -12,7 +12,7 @@ currently supports tokenization for **45+ languages**. It features the
and easy **deep learning** integration. It's commercial open-source software,
released under the MIT license.
💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
💫 **Version 2.0 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
@ -25,19 +25,17 @@ released under the MIT license.
## 📖 Documentation
| Documentation | |
| --------------- | -------------------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [New in v2.1] | New features, backwards incompatibilities and migration guide. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
| [Changelog] | Changes and version history. |
| [Contribute] | How to contribute to the spaCy project and code base. |
| Documentation | |
| --------------- | ----------------------------------------------------- |
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
| [Usage Guides] | How to use spaCy and its features. |
| [API Reference] | The detailed reference for spaCy's API. |
| [Models] | Download statistical language models for spaCy. |
| [Universe] | Libraries, extensions, demos, books and courses. |
| [Changelog] | Changes and version history. |
| [Contribute] | How to contribute to the spaCy project and code base. |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v2.1]: https://spacy.io/usage/v2-1
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[models]: https://spacy.io/models


@ -7,6 +7,7 @@ git diff-index --quiet HEAD
git checkout $1
git pull origin $1
git push origin $1
version=$(grep "__version__ = " spacy/about.py)
version=${version/__version__ = }
@ -15,4 +16,4 @@ version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
git tag "v$version"
git push origin --tags
git push origin "v$version" --tags

bin/train_word_vectors.py (new file, 107 lines)

@ -0,0 +1,107 @@
#!/usr/bin/env python
from __future__ import print_function, unicode_literals, division
import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
from preshed.counter import PreshCounter
import plac
import spacy
logger = logging.getLogger(__name__)
class Corpus(object):
def __init__(self, directory, min_freq=10):
self.directory = directory
self.counts = PreshCounter()
self.strings = {}
self.min_freq = min_freq
def count_doc(self, doc):
# Get counts for this document
for word in doc:
self.counts.inc(word.orth, 1)
return len(doc)
def __iter__(self):
for text_loc in iter_dir(self.directory):
with text_loc.open("r", encoding="utf-8") as file_:
text = file_.read()
yield text
def iter_dir(loc):
dir_path = Path(loc)
for fn_path in dir_path.iterdir():
if fn_path.is_dir():
for sub_path in fn_path.iterdir():
yield sub_path
else:
yield fn_path
@plac.annotations(
lang=("ISO language code"),
in_dir=("Location of input directory"),
out_loc=("Location of output file"),
n_workers=("Number of workers", "option", "n", int),
size=("Dimension of the word vectors", "option", "d", int),
window=("Context window size", "option", "w", int),
min_count=("Min count", "option", "m", int),
negative=("Number of negative samples", "option", "g", int),
nr_iter=("Number of iterations", "option", "i", int),
)
def main(
lang,
in_dir,
out_loc,
negative=5,
n_workers=4,
window=5,
size=128,
min_count=10,
nr_iter=2,
):
logging.basicConfig(
format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)
model = Word2Vec(
size=size,
window=window,
min_count=min_count,
workers=n_workers,
sample=1e-5,
negative=negative,
)
nlp = spacy.blank(lang)
corpus = Corpus(in_dir)
total_words = 0
total_sents = 0
for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
with text_loc.open("r", encoding="utf-8") as file_:
text = file_.read()
total_sents += text.count("\n")
doc = nlp(text)
total_words += corpus.count_doc(doc)
logger.info(
"PROGRESS: at batch #%i, processed %i words, keeping %i word types",
text_no,
total_words,
len(corpus.strings),
)
model.corpus_count = total_sents
model.raw_vocab = defaultdict(int)
for orth, freq in corpus.counts:
if freq >= min_count:
model.raw_vocab[nlp.vocab.strings[orth]] = freq
model.scale_vocab()
model.finalize_vocab()
model.iter = nr_iter
model.train(corpus)
model.save(out_loc)
if __name__ == "__main__":
plac.call(main)
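The script is a plac CLI; a minimal sketch of invoking it, assuming gensim is installed, the script is importable as a module, and ./corpus holds plain-text files (the module name, paths and option values below are placeholders):

```python
# Hypothetical invocation of the new script's main() via plac; the module name
# and paths are assumptions for illustration only.
import plac
from train_word_vectors import main  # i.e. bin/train_word_vectors.py on the path

plac.call(main, ["en", "./corpus", "./vectors.model", "-d", "128", "-i", "2"])
```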


@ -49,7 +49,7 @@ class SentimentAnalyser(object):
y = self._model.predict(X)
self.set_sentiment(doc, y)
def pipe(self, docs, batch_size=1000, n_threads=2):
def pipe(self, docs, batch_size=1000):
for minibatch in cytoolz.partition_all(batch_size, docs):
minibatch = list(minibatch)
sentences = []
@ -176,7 +176,7 @@ def evaluate(model_dir, texts, labels, max_length=100):
correct = 0
i = 0
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
for doc in nlp.pipe(texts, batch_size=1000):
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
i += 1
return float(correct) / i


@ -4,7 +4,7 @@ preshed>=2.0.1,<2.1.0
thinc>=7.0.2,<7.1.0
blis>=0.2.2,<0.3.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.0.12,<1.1.0
wasabi>=0.1.3,<1.1.0
srsly>=0.0.5,<1.1.0
# Third party dependencies
numpy>=1.15.0


@ -97,6 +97,7 @@ def with_cpu(ops, model):
"""Wrap a model that should run on CPU, transferring inputs and outputs
as necessary."""
model.to_cpu()
def with_cpu_forward(inputs, drop=0.):
cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
gpu_outputs = _to_device(ops, cpu_outputs)


@ -4,7 +4,7 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "2.1.0a10"
__version__ = "2.1.0a13"
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
__uri__ = "https://spacy.io"
__author__ = "Explosion AI"


@ -6,7 +6,7 @@ from pathlib import Path
from wasabi import Printer
import srsly
from .converters import conllu2json, conllubio2json, iob2json, conll_ner2json
from .converters import conllu2json, iob2json, conll_ner2json
from .converters import ner_jsonl2json
@ -14,7 +14,7 @@ from .converters import ner_jsonl2json
# entry to this dict with the file extension mapped to the converter function
# imported from /converters.
CONVERTERS = {
"conllubio": conllubio2json,
"conllubio": conllu2json,
"conllu": conllu2json,
"conll": conllu2json,
"ner": conll_ner2json,


@ -1,5 +1,4 @@
from .conllu2json import conllu2json # noqa: F401
from .conllubio2json import conllubio2json # noqa: F401
from .iob2json import iob2json # noqa: F401
from .conll_ner2json import conll_ner2json # noqa: F401
from .jsonl2json import ner_jsonl2json # noqa: F401


@ -71,6 +71,7 @@ def read_conllx(input_data, use_morphology=False, n=0):
dep = "ROOT" if dep == "root" else dep
tag = pos if tag == "_" else tag
tag = tag + "__" + morph if use_morphology else tag
iob = iob if iob else "O"
tokens.append((id_, word, tag, head, dep, iob))
except: # noqa: E722
print(line)


@ -1,85 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from ...gold import iob_to_biluo
def conllubio2json(input_data, n_sents=10, use_morphology=False, lang=None):
"""
Convert conllu files into JSON format for use with train cli.
use_morphology parameter enables appending morphology to tags, which is
useful for languages such as Spanish, where UD tags are not so rich.
"""
# by @dvsrepo, via #11 explosion/spacy-dev-resources
docs = []
sentences = []
conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
for i, (raw_text, tokens) in enumerate(conll_tuples):
sentence, brackets = tokens[0]
sentences.append(generate_sentence(sentence))
# Real-sized documents could be extracted using the comments on the
# conllu document
if len(sentences) % n_sents == 0:
doc = create_doc(sentences, i)
docs.append(doc)
sentences = []
return docs
def read_conllx(input_data, use_morphology=False, n=0):
i = 0
for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n")
if lines:
while lines[0].startswith("#"):
lines.pop(0)
tokens = []
for line in lines:
parts = line.split("\t")
id_, word, lemma, pos, tag, morph, head, dep, _1, ner = parts
if "-" in id_ or "." in id_:
continue
try:
id_ = int(id_) - 1
head = (int(head) - 1) if head != "0" else id_
dep = "ROOT" if dep == "root" else dep
tag = pos if tag == "_" else tag
tag = tag + "__" + morph if use_morphology else tag
ner = ner if ner else "O"
tokens.append((id_, word, tag, head, dep, ner))
except: # noqa: E722
print(line)
raise
tuples = [list(t) for t in zip(*tokens)]
yield (None, [[tuples, []]])
i += 1
if n >= 1 and i >= n:
break
def generate_sentence(sent):
(id_, word, tag, head, dep, ner) = sent
sentence = {}
tokens = []
ner = iob_to_biluo(ner)
for i, id in enumerate(id_):
token = {}
token["orth"] = word[i]
token["tag"] = tag[i]
token["head"] = head[i] - id
token["dep"] = dep[i]
token["ner"] = ner[i]
tokens.append(token)
sentence["tokens"] = tokens
return sentence
def create_doc(sentences, id):
doc = {}
paragraph = {}
doc["id"] = id
doc["paragraphs"] = []
paragraph["sentences"] = sentences
doc["paragraphs"].append(paragraph)
return doc


@ -41,24 +41,32 @@ def download(model, direct=False, *pip_args):
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
if dl != 0: # if download subprocess doesn't return 0, exit
sys.exit(dl)
try:
# Get package path here because link uses
# pip.get_installed_distributions() to check if model is a
# package, which fails if model was just installed via
# subprocess
package_path = get_package_path(model_name)
link(model_name, model, force=True, model_path=package_path)
except: # noqa: E722
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
msg.warn(
"Download successful but linking failed",
"Creating a shortcut link for 'en' didn't work (maybe you "
"don't have admin permissions?), but you can still load the "
"model via its full package name: "
"nlp = spacy.load('{}')".format(model_name),
)
msg.good(
"Download and installation successful",
"You can now load the model via spacy.load('{}')".format(model_name),
)
# Only create symlink if the model is installed via a shortcut like 'en'.
# There's no real advantage over an additional symlink for en_core_web_sm
# and if anything, it's more error prone and causes more confusion.
if model in shortcuts:
try:
# Get package path here because link uses
# pip.get_installed_distributions() to check if model is a
# package, which fails if model was just installed via
# subprocess
package_path = get_package_path(model_name)
link(model_name, model, force=True, model_path=package_path)
except: # noqa: E722
# Dirty, but since spacy.download and the auto-linking is
# mostly a convenience wrapper, it's best to show a success
# message and loading instructions, even if linking fails.
msg.warn(
"Download successful but linking failed",
"Creating a shortcut link for '{}' didn't work (maybe you "
"don't have admin permissions?), but you can still load "
"the model via its full package name: "
"nlp = spacy.load('{}')".format(model, model_name),
)
def get_json(url, desc):
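A sketch of the resulting behaviour, based on the diff above (shortcut and package names are the usual spaCy ones):

```python
# Only shortcut downloads (e.g. "en") still attempt to create a symlink;
# downloading by full package name installs the package without linking.
#
#   python -m spacy download en               # install + link shortcut "en"
#   python -m spacy download en_core_web_sm   # install only, no symlink
#
# Either way the model loads by its full package name once installed:
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the package is installed
```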


@ -161,7 +161,7 @@ def parse_deps(orig_doc, options={}):
"dir": "right",
}
)
return {"words": words, "arcs": arcs}
return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
def parse_ents(doc, options={}):
@ -177,7 +177,8 @@ def parse_ents(doc, options={}):
if not ents:
user_warning(Warnings.W006)
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
return {"text": doc.text, "ents": ents, "title": title}
settings = get_doc_settings(doc)
return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
def set_render_wrapper(func):
@ -195,3 +196,10 @@ def set_render_wrapper(func):
if not hasattr(func, "__call__"):
raise ValueError(Errors.E110.format(obj=type(func)))
RENDER_WRAPPER = func
def get_doc_settings(doc):
return {
"lang": doc.lang_,
"direction": doc.vocab.writing_system.get("direction", "ltr"),
}
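A minimal sketch of the new return value; the manually set entity is illustrative:

```python
# parse_ents (and parse_deps) now attach a "settings" entry with the doc's
# language and writing direction, read from the vocab's writing_system.
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is a company")
doc.ents = [Span(doc, 0, 1, label=nlp.vocab.strings.add("ORG"))]

parsed = displacy.parse_ents(doc)
print(parsed["settings"])  # expected: {'lang': 'en', 'direction': 'ltr'}
```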


@ -3,10 +3,13 @@ from __future__ import unicode_literals
import uuid
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html
DEFAULT_LANG = "en"
DEFAULT_DIR = "ltr"
class DependencyRenderer(object):
"""Render dependency parses as SVGs."""
@ -30,6 +33,8 @@ class DependencyRenderer(object):
self.color = options.get("color", "#000000")
self.bg = options.get("bg", "#ffffff")
self.font = options.get("font", "Arial")
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
def render(self, parsed, page=False, minify=False):
"""Render complete markup.
@ -42,13 +47,19 @@ class DependencyRenderer(object):
# Create a random ID prefix to make sure parses don't receive the
# same ID, even if they're identical
id_prefix = uuid.uuid4().hex
rendered = [
self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
for i, p in enumerate(parsed)
]
rendered = []
for i, p in enumerate(parsed):
if i == 0:
self.direction = p["settings"].get("direction", DEFAULT_DIR)
self.lang = p["settings"].get("lang", DEFAULT_LANG)
render_id = "{}-{}".format(id_prefix, i)
svg = self.render_svg(render_id, p["words"], p["arcs"])
rendered.append(svg)
if page:
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
markup = TPL_PAGE.format(content=content)
markup = TPL_PAGE.format(
content=content, lang=self.lang, dir=self.direction
)
else:
markup = "".join(rendered)
if minify:
@ -83,6 +94,8 @@ class DependencyRenderer(object):
bg=self.bg,
font=self.font,
content=content,
dir=self.direction,
lang=self.lang,
)
def render_word(self, text, tag, i):
@ -95,11 +108,13 @@ class DependencyRenderer(object):
"""
y = self.offset_y + self.word_spacing
x = self.offset_x + i * self.distance
if self.direction == "rtl":
x = self.width - x
html_text = escape_html(text)
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
def render_arrow(self, label, start, end, direction, i):
"""Render indivicual arrow.
"""Render individual arrow.
label (unicode): Dependency label.
start (int): Index of start word.
@ -110,6 +125,8 @@ class DependencyRenderer(object):
"""
level = self.levels.index(end - start) + 1
x_start = self.offset_x + start * self.distance + self.arrow_spacing
if self.direction == "rtl":
x_start = self.width - x_start
y = self.offset_y
x_end = (
self.offset_x
@ -117,6 +134,8 @@ class DependencyRenderer(object):
+ start * self.distance
- self.arrow_spacing * (self.highest_level - level) / 4
)
if self.direction == "rtl":
x_end = self.width - x_end
y_curve = self.offset_y - level * self.distance / 2
if self.compact:
y_curve = self.offset_y - level * self.distance / 6
@ -124,12 +143,14 @@ class DependencyRenderer(object):
y_curve = -self.distance
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
arc = self.get_arc(x_start, y, y_curve, x_end)
label_side = "right" if self.direction == "rtl" else "left"
return TPL_DEP_ARCS.format(
id=self.id,
i=i,
stroke=self.arrow_stroke,
head=arrowhead,
label=label,
label_side=label_side,
arc=arc,
)
@ -219,6 +240,8 @@ class EntityRenderer(object):
self.default_color = "#ddd"
self.colors = colors
self.ents = options.get("ents", None)
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
def render(self, parsed, page=False, minify=False):
"""Render complete markup.
@ -228,12 +251,15 @@ class EntityRenderer(object):
minify (bool): Minify HTML markup.
RETURNS (unicode): Rendered HTML markup.
"""
rendered = [
self.render_ents(p["text"], p["ents"], p.get("title", None)) for p in parsed
]
rendered = []
for i, p in enumerate(parsed):
if i == 0:
self.direction = p["settings"].get("direction", DEFAULT_DIR)
self.lang = p["settings"].get("lang", DEFAULT_LANG)
rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
if page:
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
markup = TPL_PAGE.format(content=docs)
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
else:
markup = "".join(rendered)
if minify:
@ -261,12 +287,16 @@ class EntityRenderer(object):
markup += "</br>"
if self.ents is None or label.upper() in self.ents:
color = self.colors.get(label.upper(), self.default_color)
markup += TPL_ENT.format(label=label, text=entity, bg=color)
ent_settings = {"label": label, "text": entity, "bg": color}
if self.direction == "rtl":
markup += TPL_ENT_RTL.format(**ent_settings)
else:
markup += TPL_ENT.format(**ent_settings)
else:
markup += entity
offset = end
markup += escape_html(text[offset:])
markup = TPL_ENTS.format(content=markup, colors=self.colors)
markup = TPL_ENTS.format(content=markup, dir=self.direction)
if title:
markup = TPL_TITLE.format(title=title) + markup
return markup
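A sketch of the effect for a right-to-left language; the Arabic text and entity are illustrative:

```python
# With an RTL writing_system (e.g. Arabic), the renderers now pick up the
# direction from the parsed settings and emit RTL markup.
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("ar")
doc = nlp("هذا نص قصير")
doc.ents = [Span(doc, 0, 1, label=nlp.vocab.strings.add("MISC"))]

html = displacy.render(doc, style="ent", page=True)
# The page markup should now carry lang="ar" and direction: rtl, and the
# entities use the RTL template.
```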


@ -6,7 +6,7 @@ from __future__ import unicode_literals
# Jupyter to render it properly in a cell
TPL_DEP_SVG = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
"""
@ -22,7 +22,7 @@ TPL_DEP_ARCS = """
<g class="displacy-arrow">
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath>
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="currentColor" text-anchor="middle">{label}</textPath>
</text>
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
</g>
@ -39,7 +39,7 @@ TPL_TITLE = """
TPL_ENTS = """
<div class="entities" style="line-height: 2.5">{content}</div>
<div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
"""
@ -50,14 +50,21 @@ TPL_ENT = """
</mark>
"""
TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>
"""
TPL_PAGE = """
<!DOCTYPE html>
<html>
<html lang="{lang}">
<head>
<title>displaCy</title>
</head>
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body>
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: {dir}">{content}</body>
</html>
"""


@ -70,6 +70,16 @@ class Warnings(object):
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
"efficient and less error-prone Doc.retokenize context manager "
"instead.")
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
        "methods is deprecated and should be replaced with `exclude`. This "
        "makes it consistent with the other serializable objects.")
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
"being serialized or deserialized is deprecated. Please use the "
"`exclude` argument instead. For example: exclude=['{arg}'].")
W016 = ("The keyword argument `n_threads` is now deprecated, as "
"the v2.x models cannot release the global interpreter lock. "
"Future versions may introduce a `n_process` argument for "
"parallel inference via multiprocessing.")
@add_codes
@ -348,7 +358,15 @@ class Errors(object):
"This is likely a bug in spaCy, so feel free to open an issue.")
E127 = ("Cannot create phrase pattern representation for length 0. This "
"is likely a bug in spaCy.")
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
"arguments to exclude fields from being serialized or deserialized "
"is now deprecated. Please use the `exclude` argument instead. "
"For example: exclude=['{arg}'].")
E129 = ("Cannot write the label of an existing Span object because a Span "
"is a read-only view of the underlying Token objects stored in the Doc. "
"Instead, create a new Span object and specify the `label` keyword argument, "
"for example:\nfrom spacy.tokens import Span\n"
"span = Span(doc, start={start}, end={end}, label='{label}')")
@add_codes
class TempErrors(object):
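E129 points to the documented workaround; a short sketch of it (the text and label are illustrative):

```python
# A Span is a read-only view, so instead of assigning to span.label_, create a
# new Span with the desired label, as the error message suggests.
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("San Francisco is foggy")
span = doc[0:2]
labeled = Span(doc, span.start, span.end, label=nlp.vocab.strings.add("GPE"))
doc.ents = list(doc.ents) + [labeled]
print([(ent.text, ent.label_) for ent in doc.ents])
```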


@ -23,6 +23,7 @@ class ArabicDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Arabic(Language):


@ -34,10 +34,10 @@ TAG_MAP = {
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"PDT": {POS: DET, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},


@ -27,6 +27,7 @@ class PersianDefaults(Language.Defaults):
stop_words = STOP_WORDS
tag_map = TAG_MAP
suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Persian(Language):


@ -14,6 +14,7 @@ class HebrewDefaults(Language.Defaults):
lex_attr_getters[LANG] = lambda text: "he"
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
stop_words = STOP_WORDS
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Hebrew(Language):


@ -8,15 +8,13 @@ from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc, Token
from ...tokens import Doc
from ...compat import copy_reg
from ...util import DummyTokenizer
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
# TODO: Is this the right place for this?
Token.set_extension("mecab_tag", default=None)
def try_mecab_import():
"""Mecab is required for Japanese support, so check for it.
@ -81,10 +79,12 @@ class JapaneseTokenizer(DummyTokenizer):
words = [x.surface for x in dtokens]
spaces = [False] * len(words)
doc = Doc(self.vocab, words=words, spaces=spaces)
mecab_tags = []
for token, dtoken in zip(doc, dtokens):
token._.mecab_tag = dtoken.pos
mecab_tags.append(dtoken.pos)
token.tag_ = resolve_pos(dtoken)
token.lemma_ = dtoken.lemma
doc.user_data["mecab_tags"] = mecab_tags
return doc
@ -93,6 +93,7 @@ class JapaneseDefaults(Language.Defaults):
lex_attr_getters[LANG] = lambda _text: "ja"
stop_words = STOP_WORDS
tag_map = TAG_MAP
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
@classmethod
def create_tokenizer(cls, nlp=None):
@ -107,4 +108,11 @@ class Japanese(Language):
return self.tokenizer(text)
def pickle_japanese(instance):
return Japanese, tuple()
copy_reg.pickle(Japanese, pickle_japanese)
__all__ = ["Japanese"]


@ -14,6 +14,7 @@ class ChineseDefaults(Language.Defaults):
use_jieba = True
tokenizer_exceptions = BASE_EXCEPTIONS
stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
class Chinese(Language):
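A sketch of the new writing_system defaults, assuming the vocab exposes them as get_doc_settings above implies (languages with external tokenizer dependencies are left out):

```python
# Arabic, Hebrew and Persian now declare an RTL writing system; the base
# default stays left-to-right with case and letters.
import spacy

for lang in ("en", "ar", "he", "fa"):
    nlp = spacy.blank(lang)
    print(lang, nlp.vocab.writing_system)
```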


@ -29,7 +29,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from .errors import Errors, Warnings, deprecation_warning
from . import util
from . import about
@ -95,6 +95,7 @@ class BaseDefaults(object):
morph_rules = {}
lex_attr_getters = LEX_ATTRS
syntax_iterators = {}
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
class Language(object):
@ -107,6 +108,7 @@ class Language(object):
DOCS: https://spacy.io/api/language
"""
Defaults = BaseDefaults
lang = None
@ -195,6 +197,7 @@ class Language(object):
self._meta = value
# Conveniences to access pipeline components
# Shouldn't be used anymore!
@property
def tensorizer(self):
return self.get_pipe("tensorizer")
@ -228,6 +231,8 @@ class Language(object):
name (unicode): Name of pipeline component to get.
RETURNS (callable): The pipeline component.
DOCS: https://spacy.io/api/language#get_pipe
"""
for pipe_name, component in self.pipeline:
if pipe_name == name:
@ -240,6 +245,8 @@ class Language(object):
name (unicode): Factory name to look up in `Language.factories`.
config (dict): Configuration parameters to initialise component.
RETURNS (callable): Pipeline component.
DOCS: https://spacy.io/api/language#create_pipe
"""
if name not in self.factories:
if name == "sbd":
@ -266,9 +273,7 @@ class Language(object):
first (bool): Insert component first / not first in the pipeline.
last (bool): Insert component last / not last in the pipeline.
EXAMPLE:
>>> nlp.add_pipe(component, before='ner')
>>> nlp.add_pipe(component, name='custom_name', last=True)
DOCS: https://spacy.io/api/language#add_pipe
"""
if not hasattr(component, "__call__"):
msg = Errors.E003.format(component=repr(component), name=name)
@ -310,6 +315,8 @@ class Language(object):
name (unicode): Name of the component.
RETURNS (bool): Whether a component of the name exists in the pipeline.
DOCS: https://spacy.io/api/language#has_pipe
"""
return name in self.pipe_names
@ -318,6 +325,8 @@ class Language(object):
name (unicode): Name of the component to replace.
component (callable): Pipeline component.
DOCS: https://spacy.io/api/language#replace_pipe
"""
if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
@ -328,6 +337,8 @@ class Language(object):
old_name (unicode): Name of the component to rename.
new_name (unicode): New name of the component.
DOCS: https://spacy.io/api/language#rename_pipe
"""
if old_name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
@ -341,36 +352,39 @@ class Language(object):
name (unicode): Name of the component to remove.
RETURNS (tuple): A `(name, component)` tuple of the removed component.
DOCS: https://spacy.io/api/language#remove_pipe
"""
if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name))
def __call__(self, text, disable=[]):
def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences,
and can contain arbitrary whitespace. Alignment into the original string
is preserved.
text (unicode): The text to be processed.
disable (list): Names of the pipeline components to disable.
component_cfg (dict): An optional dictionary with extra keyword arguments
for specific components.
RETURNS (Doc): A container for accessing the annotations.
EXAMPLE:
>>> tokens = nlp('An example sentence. Another example sentence.')
>>> tokens[0].text, tokens[0].head.tag_
('An', 'NN')
DOCS: https://spacy.io/api/language#call
"""
if len(text) > self.max_length:
raise ValueError(
Errors.E088.format(length=len(text), max_length=self.max_length)
)
doc = self.make_doc(text)
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline:
if name in disable:
continue
if not hasattr(proc, "__call__"):
raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc)
doc = proc(doc, **component_cfg.get(name, {}))
if doc is None:
raise ValueError(Errors.E005.format(name=name))
return doc
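A minimal sketch of component_cfg on a single call; the component and its threshold argument are hypothetical:

```python
# Extra keyword arguments can now be routed to one component per call.
import spacy

nlp = spacy.blank("en")

def scorer_component(doc, threshold=0.5):
    # Made-up component that could act on the threshold keyword argument.
    return doc

nlp.add_pipe(scorer_component, name="scorer")
doc = nlp("Some text", component_cfg={"scorer": {"threshold": 0.75}})
```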
@ -381,24 +395,14 @@ class Language(object):
of the block. Otherwise, a DisabledPipes object is returned, that has
a `.restore()` method you can use to undo your changes.
EXAMPLE:
>>> nlp.add_pipe('parser')
>>> nlp.add_pipe('tagger')
>>> with nlp.disable_pipes('parser', 'tagger'):
>>> assert not nlp.has_pipe('parser')
>>> assert nlp.has_pipe('parser')
>>> disabled = nlp.disable_pipes('parser')
>>> assert len(disabled) == 1
>>> assert not nlp.has_pipe('parser')
>>> disabled.restore()
>>> assert nlp.has_pipe('parser')
DOCS: https://spacy.io/api/language#disable_pipes
"""
return DisabledPipes(self, *names)
def make_doc(self, text):
return self.tokenizer(text)
def update(self, docs, golds, drop=0.0, sgd=None, losses=None):
def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None):
"""Update the models in the pipeline.
docs (iterable): A batch of `Doc` objects.
@ -407,11 +411,7 @@ class Language(object):
sgd (callable): An optimizer.
RETURNS (dict): Results from the update.
EXAMPLE:
>>> with nlp.begin_training(gold) as (trainer, optimizer):
>>> for epoch in trainer.epochs(gold):
>>> for docs, golds in epoch:
>>> state = nlp.update(docs, golds, sgd=optimizer)
DOCS: https://spacy.io/api/language#update
"""
if len(docs) != len(golds):
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
@ -421,7 +421,6 @@ class Language(object):
if self._optimizer is None:
self._optimizer = create_default_optimizer(Model.ops)
sgd = self._optimizer
# Allow dict of args to GoldParse, instead of GoldParse objects.
gold_objs = []
doc_objs = []
@ -442,14 +441,17 @@ class Language(object):
get_grads.alpha = sgd.alpha
get_grads.b1 = sgd.b1
get_grads.b2 = sgd.b2
pipes = list(self.pipeline)
random.shuffle(pipes)
if component_cfg is None:
component_cfg = {}
for name, proc in pipes:
if not hasattr(proc, "update"):
continue
grads = {}
proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
kwargs = component_cfg.get(name, {})
kwargs.setdefault("drop", drop)
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
for key, (W, dW) in grads.items():
sgd(W, dW, key=key)
@ -473,6 +475,7 @@ class Language(object):
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
>>> nlp.rehearse(raw_batch)
"""
# TODO: document
if len(docs) == 0:
return
if sgd is None:
@ -495,7 +498,6 @@ class Language(object):
get_grads.alpha = sgd.alpha
get_grads.b1 = sgd.b1
get_grads.b2 = sgd.b2
for name, proc in pipes:
if not hasattr(proc, "rehearse"):
continue
@ -503,7 +505,6 @@ class Language(object):
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
for key, (W, dW) in grads.items():
sgd(W, dW, key=key)
return losses
def preprocess_gold(self, docs_golds):
@ -519,13 +520,16 @@ class Language(object):
for doc, gold in docs_golds:
yield doc, gold
def begin_training(self, get_gold_tuples=None, sgd=None, **cfg):
def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg):
"""Allocate models, pre-process training data and acquire a trainer and
optimizer. Used as a contextmanager.
get_gold_tuples (function): Function returning gold data
component_cfg (dict): Config parameters for specific components.
**cfg: Config parameters.
RETURNS: An optimizer
RETURNS: An optimizer.
DOCS: https://spacy.io/api/language#begin_training
"""
if get_gold_tuples is None:
get_gold_tuples = lambda: []
@ -545,10 +549,17 @@ class Language(object):
if sgd is None:
sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline:
if hasattr(proc, "begin_training"):
kwargs = component_cfg.get(name, {})
kwargs.update(cfg)
proc.begin_training(
get_gold_tuples, pipeline=self.pipeline, sgd=self._optimizer, **cfg
get_gold_tuples,
pipeline=self.pipeline,
sgd=self._optimizer,
**kwargs
)
return self._optimizer
@ -576,20 +587,27 @@ class Language(object):
proc._rehearsal_model = deepcopy(proc.model)
return self._optimizer
def evaluate(self, docs_golds, verbose=False, batch_size=256):
scorer = Scorer()
def evaluate(
self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
):
if scorer is None:
scorer = Scorer()
docs, golds = zip(*docs_golds)
docs = list(docs)
golds = list(golds)
for name, pipe in self.pipeline:
kwargs = component_cfg.get(name, {})
kwargs.setdefault("batch_size", batch_size)
if not hasattr(pipe, "pipe"):
docs = (pipe(doc) for doc in docs)
docs = (pipe(doc, **kwargs) for doc in docs)
else:
docs = pipe.pipe(docs, batch_size=batch_size)
docs = pipe.pipe(docs, **kwargs)
for doc, gold in zip(docs, golds):
if verbose:
print(doc)
scorer.score(doc, gold, verbose=verbose)
kwargs = component_cfg.get("scorer", {})
kwargs.setdefault("verbose", verbose)
scorer.score(doc, gold, **kwargs)
return scorer
@contextmanager
@ -628,49 +646,57 @@ class Language(object):
self,
texts,
as_tuples=False,
n_threads=2,
n_threads=-1,
batch_size=1000,
disable=[],
cleanup=False,
component_cfg=None,
):
"""Process texts as a stream, and yield `Doc` objects in order.
texts (iterator): A sequence of texts to process.
as_tuples (bool):
If set to True, inputs should be a sequence of
as_tuples (bool): If set to True, inputs should be a sequence of
(text, context) tuples. Output will then be a sequence of
(doc, context) tuples. Defaults to False.
n_threads (int): Currently inactive.
batch_size (int): The number of texts to buffer.
disable (list): Names of the pipeline components to disable.
cleanup (bool): If True, unneeded strings are freed,
to control memory use. Experimental.
cleanup (bool): If True, unneeded strings are freed to control memory
use. Experimental.
component_cfg (dict): An optional dictionary with extra keyword
arguments for specific components.
YIELDS (Doc): Documents in the order of the original text.
EXAMPLE:
>>> texts = [u'One document.', u'...', u'Lots of documents']
>>> for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
>>> assert doc.is_parsed
DOCS: https://spacy.io/api/language#pipe
"""
if n_threads != -1:
deprecation_warning(Warnings.W016)
if as_tuples:
text_context1, text_context2 = itertools.tee(texts)
texts = (tc[0] for tc in text_context1)
contexts = (tc[1] for tc in text_context2)
docs = self.pipe(
texts, n_threads=n_threads, batch_size=batch_size, disable=disable
texts,
batch_size=batch_size,
disable=disable,
component_cfg=component_cfg,
)
for doc, context in izip(docs, contexts):
yield (doc, context)
return
docs = (self.make_doc(text) for text in texts)
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline:
if name in disable:
continue
kwargs = component_cfg.get(name, {})
# Allow component_cfg to overwrite the top-level kwargs.
kwargs.setdefault("batch_size", batch_size)
if hasattr(proc, "pipe"):
docs = proc.pipe(docs, n_threads=n_threads, batch_size=batch_size)
docs = proc.pipe(docs, **kwargs)
else:
# Apply the function, but yield the doc
docs = _pipe(proc, docs)
docs = _pipe(proc, docs, kwargs)
# Track weakrefs of "recent" documents, so that we can see when they
# expire from memory. When they do, we know we don't need old strings.
# This way, we avoid maintaining an unbounded growth in string entries
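A sketch of the updated pipe call; n_threads is omitted because it now only triggers the W016 deprecation warning:

```python
# Batching is controlled by batch_size (optionally overridden per component
# via component_cfg); n_threads is still accepted but deprecated.
import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document.", "Lots of documents."]
for doc in nlp.pipe(texts, batch_size=50):
    print(doc.text)
```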
@ -701,124 +727,114 @@ class Language(object):
self.tokenizer._reset_cache(keys)
nr_seen = 0
def to_disk(self, path, disable=tuple()):
def to_disk(self, path, exclude=tuple(), disable=None):
"""Save the current state to a directory. If a model is loaded, this
will include the model.
path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be strings or `Path`-like objects.
disable (list): Names of pipeline components to disable and prevent
from being saved.
path (unicode or Path): Path to a directory, which will be created if
it doesn't exist.
exclude (list): Names of components or serialization fields to exclude.
EXAMPLE:
>>> nlp.to_disk('/path/to/models')
DOCS: https://spacy.io/api/language#to_disk
"""
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
path = util.ensure_path(path)
serializers = OrderedDict(
(
("tokenizer", lambda p: self.tokenizer.to_disk(p, vocab=False)),
("meta.json", lambda p: p.open("w").write(srsly.json_dumps(self.meta))),
)
)
serializers = OrderedDict()
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(p, exclude=["vocab"])
serializers["meta.json"] = lambda p: p.open("w").write(srsly.json_dumps(self.meta))
for name, proc in self.pipeline:
if not hasattr(proc, "name"):
continue
if name in disable:
if name in exclude:
continue
if not hasattr(proc, "to_disk"):
continue
serializers[name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
util.to_disk(path, serializers, {p: False for p in disable})
util.to_disk(path, serializers, exclude)
def from_disk(self, path, disable=tuple()):
def from_disk(self, path, exclude=tuple(), disable=None):
"""Loads state from a directory. Modifies the object in place and
returns it. If the saved `Language` object contains a model, the
model will be loaded.
path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects.
disable (list): Names of the pipeline components to disable.
path (unicode or Path): A path to a directory.
exclude (list): Names of components or serialization fields to exclude.
RETURNS (Language): The modified `Language` object.
EXAMPLE:
>>> from spacy.language import Language
>>> nlp = Language().from_disk('/path/to/models')
DOCS: https://spacy.io/api/language#from_disk
"""
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
path = util.ensure_path(path)
deserializers = OrderedDict(
(
("meta.json", lambda p: self.meta.update(srsly.read_json(p))),
(
"vocab",
lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
),
),
("tokenizer", lambda p: self.tokenizer.from_disk(p, vocab=False)),
)
)
deserializers = OrderedDict()
deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
for name, proc in self.pipeline:
if name in disable:
if name in exclude:
continue
if not hasattr(proc, "from_disk"):
continue
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
exclude = {p: False for p in disable}
if not (path / "vocab").exists():
exclude["vocab"] = True
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
if not (path / "vocab").exists() and "vocab" not in exclude:
# Convert to list here in case exclude is (default) tuple
exclude = list(exclude) + ["vocab"]
util.from_disk(path, deserializers, exclude)
self._path = path
return self
def to_bytes(self, disable=[], **exclude):
def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
"""Serialize the current state to a binary string.
disable (list): Names of pipeline components to disable and prevent
from being serialized.
exclude (list): Names of components or serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Language` object.
DOCS: https://spacy.io/api/language#to_bytes
"""
serializers = OrderedDict(
(
("vocab", lambda: self.vocab.to_bytes()),
("tokenizer", lambda: self.tokenizer.to_bytes(vocab=False)),
("meta", lambda: srsly.json_dumps(self.meta)),
)
)
for i, (name, proc) in enumerate(self.pipeline):
if name in disable:
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
serializers = OrderedDict()
serializers["vocab"] = lambda: self.vocab.to_bytes()
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
for name, proc in self.pipeline:
if name in exclude:
continue
if not hasattr(proc, "to_bytes"):
continue
serializers[i] = lambda proc=proc: proc.to_bytes(vocab=False)
serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, disable=[]):
def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
"""Load state from a binary string.
bytes_data (bytes): The data to load from.
disable (list): Names of the pipeline components to disable.
exclude (list): Names of components or serialization fields to exclude.
RETURNS (Language): The `Language` object.
DOCS: https://spacy.io/api/language#from_bytes
"""
deserializers = OrderedDict(
(
("meta", lambda b: self.meta.update(srsly.json_loads(b))),
(
"vocab",
lambda b: (
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
),
),
("tokenizer", lambda b: self.tokenizer.from_bytes(b, vocab=False)),
)
)
for i, (name, proc) in enumerate(self.pipeline):
if name in disable:
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
deserializers = OrderedDict()
deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b))
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(b, exclude=["vocab"])
for name, proc in self.pipeline:
if name in exclude:
continue
if not hasattr(proc, "from_bytes"):
continue
deserializers[i] = lambda b, proc=proc: proc.from_bytes(b, vocab=False)
util.from_bytes(bytes_data, deserializers, {})
deserializers[name] = lambda b, proc=proc: proc.from_bytes(b, exclude=["vocab"])
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_bytes(bytes_data, deserializers, exclude)
return self
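A sketch of the new exclude argument on the serialization methods; "vocab" here names a serialization field, and pipeline component names work the same way:

```python
# exclude replaces the deprecated disable argument (W014) and the old
# **kwargs-style field exclusion (W015/E128).
import spacy

nlp = spacy.blank("en")
data = nlp.to_bytes(exclude=["vocab"])
nlp2 = spacy.blank("en").from_bytes(data, exclude=["vocab"])
```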
@ -873,7 +889,12 @@ class DisabledPipes(list):
self[:] = []
def _pipe(func, docs):
def _pipe(func, docs, kwargs):
# We added some args for pipe that __call__ doesn't expect.
kwargs = dict(kwargs)
for arg in ["n_threads", "batch_size"]:
if arg in kwargs:
kwargs.pop(arg)
for doc in docs:
doc = func(doc)
doc = func(doc, **kwargs)
yield doc


@ -161,17 +161,17 @@ cdef class Lexeme:
Lexeme.c_from_bytes(self.c, lex_data)
self.orth = self.c.orth
property has_vector:
@property
def has_vector(self):
"""RETURNS (bool): Whether a word vector is associated with the object.
"""
def __get__(self):
return self.vocab.has_vector(self.c.orth)
return self.vocab.has_vector(self.c.orth)
property vector_norm:
@property
def vector_norm(self):
"""RETURNS (float): The L2 norm of the vector representation."""
def __get__(self):
vector = self.vector
return numpy.sqrt((vector**2).sum())
vector = self.vector
return numpy.sqrt((vector**2).sum())
property vector:
"""A real-valued meaning representation.
@ -209,17 +209,17 @@ cdef class Lexeme:
def __set__(self, float sentiment):
self.c.sentiment = sentiment
property orth_:
@property
def orth_(self):
"""RETURNS (unicode): The original verbatim text of the lexeme
(identical to `Lexeme.text`). Exists mostly for consistency with
the other attributes."""
def __get__(self):
return self.vocab.strings[self.c.orth]
return self.vocab.strings[self.c.orth]
property text:
@property
def text(self):
"""RETURNS (unicode): The original verbatim text of the lexeme."""
def __get__(self):
return self.orth_
return self.orth_
property lower:
"""RETURNS (unicode): Lowercase form of the lexeme."""


@ -19,7 +19,7 @@ from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH
from ._schemas import TOKEN_PATTERN_SCHEMA
from ..util import get_json_validator, validate_json
from ..errors import Errors, MatchPatternError
from ..errors import Errors, MatchPatternError, Warnings, deprecation_warning
from ..strings import get_string_id
from ..attrs import IDS
@ -153,15 +153,15 @@ cdef class Matcher:
return default
return (self._callbacks[key], self._patterns[key])
def pipe(self, docs, batch_size=1000, n_threads=2):
def pipe(self, docs, batch_size=1000, n_threads=-1):
"""Match a stream of documents, yielding them in turn.
docs (iterable): A stream of documents.
batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel, if the implementation supports multi-threading.
YIELDS (Doc): Documents, in order.
"""
if n_threads != -1:
deprecation_warning(Warnings.W016)
for doc in docs:
self(doc)
yield doc
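A sketch of the updated Matcher.pipe; the pattern and texts are illustrative:

```python
# Matcher.pipe still accepts n_threads but only warns; it simply yields the
# docs in order.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO", None, [{"LOWER": "hello"}])

docs = nlp.pipe(["hello world", "another text"])
for doc in matcher.pipe(docs, batch_size=50):
    print(matcher(doc))
```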


@ -166,14 +166,12 @@ cdef class PhraseMatcher:
on_match(self, doc, i, matches)
return matches
def pipe(self, stream, batch_size=1000, n_threads=1, return_matches=False,
def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
as_tuples=False):
"""Match a stream of documents, yielding them in turn.
docs (iterable): A stream of documents.
batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel, if the implementation supports multi-threading.
return_matches (bool): Yield the match lists along with the docs, making
results (doc, matches) tuples.
as_tuples (bool): Interpret the input stream as (doc, context) tuples,
@ -184,6 +182,8 @@ cdef class PhraseMatcher:
DOCS: https://spacy.io/api/phrasematcher#pipe
"""
if n_threads != -1:
deprecation_warning(Warnings.W016)
if as_tuples:
for doc, context in stream:
matches = self(doc)

View File

@ -136,7 +136,7 @@ class MorphologyClassMap(object):
cdef class Morphology:
'''Store the possible morphological analyses for a language, and index them
by hash.
To save space on each token, tokens only know the hash of their morphological
analysis, so queries of morphological attributes are delegated
to this class.
@ -200,7 +200,7 @@ cdef class Morphology:
return []
else:
return tag_to_json(tag)
cpdef update(self, hash_t morph, features):
"""Update a morphological analysis with new feature values."""
tag = (<MorphAnalysisC*>self.tags.get(morph))[0]
@ -248,7 +248,7 @@ cdef class Morphology:
if feat in self._feat_map.id2feat})
attrs = intify_attrs(attrs, self.strings, _do_deprecated=True)
self.exc[(tag_str, self.strings.add(orth_str))] = attrs
cdef hash_t insert(self, MorphAnalysisC tag) except 0:
cdef hash_t key = hash_tag(tag)
if self.tags.get(key) == NULL:
@ -256,7 +256,7 @@ cdef class Morphology:
tag_ptr[0] = tag
self.tags.set(key, <void*>tag_ptr)
return key
cdef int assign_untagged(self, TokenC* token) except -1:
"""Set morphological attributes on a token without a POS tag. Uses
the lemmatizer's lookup() method, which looks up the string in the
@ -631,7 +631,7 @@ cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil:
return 1
else:
return 0
cdef int set_feature(MorphAnalysisC* tag,
univ_field_t field, attr_t feature, int value) except -1:
if value == True:

View File

@ -141,16 +141,21 @@ class Pipe(object):
with self.model.use_params(params):
yield
def to_bytes(self, **exclude):
"""Serialize the pipe to a bytestring."""
def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the pipe to a bytestring.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized object.
"""
serialize = OrderedDict()
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
if self.model not in (True, False, None):
serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load the pipe from a bytestring."""
def load_model(b):
@ -161,26 +166,25 @@ class Pipe(object):
self.model = self.Model(**self.cfg)
self.model.from_bytes(b)
deserialize = OrderedDict(
(
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
("vocab", lambda b: self.vocab.from_bytes(b)),
("model", load_model),
)
)
deserialize = OrderedDict()
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(self, path, **exclude):
def to_disk(self, path, exclude=tuple(), **kwargs):
"""Serialize the pipe to disk."""
serialize = OrderedDict()
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
if self.model not in (None, True, False):
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, **exclude):
def from_disk(self, path, exclude=tuple(), **kwargs):
"""Load the pipe from disk."""
def load_model(p):
@ -191,13 +195,11 @@ class Pipe(object):
self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open("rb").read())
deserialize = OrderedDict(
(
("cfg", lambda p: self.cfg.update(_load_cfg(p))),
("vocab", lambda p: self.vocab.from_disk(p)),
("model", load_model),
)
)
deserialize = OrderedDict()
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
deserialize["model"] = load_model
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
@ -255,7 +257,6 @@ class Tensorizer(Pipe):
stream (iterator): A sequence of `Doc` objects to process.
batch_size (int): Number of `Doc` objects to group.
n_threads (int): Number of threads.
YIELDS (iterator): A sequence of `Doc` objects, in order of input.
"""
for docs in util.minibatch(stream, size=batch_size):
@ -541,7 +542,7 @@ class Tagger(Pipe):
with self.model.use_params(params):
yield
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
serialize = OrderedDict()
if self.model not in (None, True, False):
serialize["model"] = self.model.to_bytes
@ -549,9 +550,10 @@ class Tagger(Pipe):
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def load_model(b):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
@ -576,20 +578,22 @@ class Tagger(Pipe):
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
("model", lambda b: load_model(b)),
))
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(self, path, **exclude):
def to_disk(self, path, exclude=tuple(), **kwargs):
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)),
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)),
('model', lambda p: p.open("wb").write(self.model.to_bytes())),
('cfg', lambda p: srsly.write_json(p, self.cfg))
("vocab", lambda p: self.vocab.to_disk(p)),
("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
("model", lambda p: p.open("wb").write(self.model.to_bytes())),
("cfg", lambda p: srsly.write_json(p, self.cfg))
))
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude)
def from_disk(self, path, **exclude):
def from_disk(self, path, exclude=tuple(), **kwargs):
def load_model(p):
# TODO: Remove this once we don't have to handle previous models
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
@ -612,6 +616,7 @@ class Tagger(Pipe):
("tag_map", load_tag_map),
("model", load_model),
))
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude)
return self
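
A short sketch of the list-based exclude API on pipeline components, here a Tagger with no model loaded; the old boolean-flag style of excluding fields now raises a ValueError instead:

from spacy.vocab import Vocab
from spacy.pipeline import Tagger

tagger = Tagger(Vocab())
data = tagger.to_bytes(exclude=["vocab"])            # serialize without the vocab
tagger2 = Tagger(Vocab()).from_bytes(data, exclude=["vocab"])
# tagger.to_bytes(cfg=False)                         # old-style boolean flag, now a ValueError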

View File

@ -248,19 +248,17 @@ cdef class StringStore:
self.add(word)
return self
def to_bytes(self, **exclude):
def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized.
RETURNS (bytes): The serialized form of the `StringStore` object.
"""
return srsly.json_dumps(list(self))
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, **kwargs):
"""Load state from a binary string.
bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded.
RETURNS (StringStore): The `StringStore` object.
"""
strings = srsly.json_loads(bytes_data)

View File

@ -157,6 +157,10 @@ cdef void cpu_log_loss(float* d_scores,
cdef double max_, gmax, Z, gZ
best = arg_max_if_gold(scores, costs, is_valid, O)
guess = arg_max_if_valid(scores, is_valid, O)
if best == -1 or guess == -1:
# These shouldn't happen, but if they do, we want to make sure we don't
# cause an OOB access.
return
Z = 1e-10
gZ = 1e-10
max_ = scores[guess]

View File

@ -323,6 +323,12 @@ cdef cppclass StateC:
if this._s_i >= 1:
this._s_i -= 1
void force_final() nogil:
# This should only be used in desperate situations, as it may leave
# the analysis in an unexpected state.
this._s_i = 0
this._b_i = this.length
void unshift() nogil:
this._b_i -= 1
this._buffer[this._b_i] = this.S(0)

View File

@ -369,9 +369,9 @@ cdef class ArcEager(TransitionSystem):
actions[LEFT].setdefault('dep', 0)
return actions
property action_types:
def __get__(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
@property
def action_types(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
def get_cost(self, StateClass state, GoldParse gold, action):
cdef Transition t = self.lookup_transition(action)
@ -384,7 +384,7 @@ cdef class ArcEager(TransitionSystem):
cdef Transition t = self.lookup_transition(action)
t.do(state.c, t.label)
return state
def is_gold_parse(self, StateClass state, GoldParse gold):
predicted = set()
truth = set()

View File

@ -80,9 +80,9 @@ cdef class BiluoPushDown(TransitionSystem):
actions[action][label] += 1
return actions
property action_types:
def __get__(self):
return (BEGIN, IN, LAST, UNIT, OUT)
@property
def action_types(self):
return (BEGIN, IN, LAST, UNIT, OUT)
def move_name(self, int move, attr_t label):
if move == OUT:
@ -257,30 +257,42 @@ cdef class Missing:
cdef class Begin:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
# Ensure we don't clobber preset entities. If no entity preset,
# ent_iob is 0
cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 1:
cdef int preset_ent_label = st.B_(0).ent_type
# If we're the last token of the input, we can't B -- must U or O.
if st.B(1) == -1:
return False
elif preset_ent_iob == 2:
elif st.entity_is_open():
return False
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
elif label == 0:
return False
# If the next word is B or O, we can't B now
elif preset_ent_iob == 1 or preset_ent_iob == 2:
# Ensure we don't clobber preset entities. If no entity preset,
# ent_iob is 0
return False
elif preset_ent_iob == 3:
# Okay, we're in a preset entity.
if label != preset_ent_label:
# If label isn't right, reject
return False
elif st.B_(1).ent_iob != 1:
# If next token isn't marked I, we need to make U, not B.
return False
else:
# Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace.
return True
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
# If the next word is B or O, we can't B now
return False
# If the current word is B, and the next word isn't I, the current word
# is really U
elif preset_ent_iob == 3 and st.B_(1).ent_iob != 1:
return False
# Don't allow entities to extend across sentence boundaries
elif st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries
return False
# Don't allow entities to start on whitespace
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
return False
else:
return label != 0 and not st.entity_is_open()
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
@ -314,18 +326,27 @@ cdef class In:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 2:
if label == 0:
return False
elif st.E_(0).ent_type != label:
return False
elif not st.entity_is_open():
return False
elif st.B(1) == -1:
# If we're at the end, we can't I.
return False
elif preset_ent_iob == 2:
return False
elif preset_ent_iob == 3:
return False
# TODO: Is this quite right? I think it's supposed to be ensuring the
# gazetteer matches are maintained
elif st.B(1) != -1 and st.B_(1).ent_iob != preset_ent_iob:
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
# If we know the next word is B or O, we can't be I (must be L)
return False
# Don't allow entities to extend across sentence boundaries
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries
return False
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
@ -370,9 +391,17 @@ cdef class In:
cdef class Last:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
if st.B_(1).ent_iob == 1:
if label == 0:
return False
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
elif not st.entity_is_open():
return False
elif st.E_(0).ent_type != label:
return False
elif st.B_(1).ent_iob == 1:
# If a preset entity has I next, we can't L here.
return False
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
@ -416,17 +445,29 @@ cdef class Unit:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 2:
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0:
return False
elif preset_ent_iob == 1:
elif st.entity_is_open():
return False
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
elif preset_ent_iob == 2:
# Don't clobber preset O
return False
elif st.B_(1).ent_iob == 1:
# If next token is In, we can't be Unit -- must be Begin
return False
elif preset_ent_iob == 3:
# Okay, there's a preset entity here
if label != preset_ent_label:
# Require labels to match
return False
else:
# Otherwise return True, ignoring the whitespace constraint.
return True
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
return False
return label != 0 and not st.entity_is_open()
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:
@ -461,11 +502,14 @@ cdef class Out:
@staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 3:
if st.entity_is_open():
return False
elif preset_ent_iob == 3:
return False
elif preset_ent_iob == 1:
return False
return not st.entity_is_open()
else:
return True
@staticmethod
cdef int transition(StateC* st, attr_t label) nogil:

View File

@ -205,13 +205,11 @@ cdef class Parser:
self.set_annotations([doc], states, tensors=None)
return doc
def pipe(self, docs, int batch_size=256, int n_threads=2, beam_width=None):
def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None):
"""Process a stream of documents.
stream: The sequence of documents to process.
batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel.
YIELDS (Doc): Documents, in order.
"""
if beam_width is None:
@ -221,14 +219,14 @@ cdef class Parser:
for batch in util.minibatch(docs, size=batch_size):
batch_in_order = list(batch)
by_length = sorted(batch_in_order, key=lambda doc: len(doc))
for subbatch in util.minibatch(by_length, size=batch_size//4):
for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)):
subbatch = list(subbatch)
parse_states = self.predict(subbatch, beam_width=beam_width,
beam_density=beam_density)
self.set_annotations(subbatch, parse_states, tensors=None)
for doc in batch_in_order:
yield doc
def require_model(self):
"""Raise an error if the component's model is not initialized."""
if getattr(self, 'model', None) in (None, True, False):
@ -272,7 +270,7 @@ cdef class Parser:
beams = self.moves.init_beams(docs, beam_width, beam_density=beam_density)
# This is pretty dirty, but the NER can resize itself in init_batch,
# if labels are missing. We therefore have to check whether we need to
# expand our model output.
# expand our model output.
self.model.resize_output(self.moves.n_moves)
model = self.model(docs)
token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
@ -363,9 +361,14 @@ cdef class Parser:
for i in range(batch_size):
self.moves.set_valid(is_valid, states[i])
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
action = self.moves.c[guess]
action.do(states[i], action.label)
states[i].push_hist(guess)
if guess == -1:
# This shouldn't happen, but it's hard to raise an error here,
# and we don't want to infinite loop. So, force to end state.
states[i].force_final()
else:
action = self.moves.c[guess]
action.do(states[i], action.label)
states[i].push_hist(guess)
free(is_valid)
def transition_beams(self, beams, float[:, ::1] scores):
@ -437,7 +440,7 @@ cdef class Parser:
if self._rehearsal_model is None:
return None
losses.setdefault(self.name, 0.)
states = self.moves.init_batch(docs)
# This is pretty dirty, but the NER can resize itself in init_batch,
# if labels are missing. We therefore have to check whether we need to
@ -598,22 +601,24 @@ cdef class Parser:
self.cfg.update(cfg)
return sgd
def to_disk(self, path, **exclude):
def to_disk(self, path, exclude=tuple(), **kwargs):
serializers = {
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
'vocab': lambda p: self.vocab.to_disk(p),
'moves': lambda p: self.moves.to_disk(p, strings=False),
'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
'cfg': lambda p: srsly.write_json(p, self.cfg)
}
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
util.to_disk(path, serializers, exclude)
def from_disk(self, path, **exclude):
def from_disk(self, path, exclude=tuple(), **kwargs):
deserializers = {
'vocab': lambda p: self.vocab.from_disk(p),
'moves': lambda p: self.moves.from_disk(p, strings=False),
'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
'model': lambda p: None
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_disk(path, deserializers, exclude)
if 'model' not in exclude:
path = util.ensure_path(path)
@ -627,22 +632,24 @@ cdef class Parser:
self.cfg.update(cfg)
return self
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
serializers = OrderedDict((
('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
('vocab', lambda: self.vocab.to_bytes()),
('moves', lambda: self.moves.to_bytes(strings=False)),
('moves', lambda: self.moves.to_bytes(exclude=["strings"])),
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
))
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)),
('moves', lambda b: self.moves.from_bytes(b, strings=False)),
('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])),
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
('model', lambda b: None)
))
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models

View File

@ -94,6 +94,13 @@ cdef class TransitionSystem:
raise ValueError(Errors.E024)
return history
def apply_transition(self, StateClass state, name):
if not self.is_valid(state, name):
raise ValueError(
"Cannot apply transition {name}: invalid for the current state.".format(name=name))
action = self.lookup_transition(name)
action.do(state.c, action.label)
cdef int initialize_state(self, StateC* state) nogil:
pass
@ -201,30 +208,32 @@ cdef class TransitionSystem:
self.labels[action][label_name] = new_freq-1
return 1
def to_disk(self, path, **exclude):
def to_disk(self, path, **kwargs):
with path.open('wb') as file_:
file_.write(self.to_bytes(**exclude))
file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude):
def from_disk(self, path, **kwargs):
with path.open('rb') as file_:
byte_data = file_.read()
self.from_bytes(byte_data, **exclude)
self.from_bytes(byte_data, **kwargs)
return self
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
transitions = []
serializers = {
'moves': lambda: srsly.json_dumps(self.labels),
'strings': lambda: self.strings.to_bytes()
}
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
labels = {}
deserializers = {
'moves': lambda b: labels.update(srsly.json_loads(b)),
'strings': lambda b: self.strings.from_bytes(b)
}
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
self.initialize_actions(labels)
return self

View File

@ -1,46 +1,44 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.tokens import Doc
from spacy.attrs import ORTH, SHAPE, POS, DEP
from ..util import get_doc
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
text = "An example sentence"
tokens = en_tokenizer(text)
example = tokens.vocab["example"]
def test_doc_array_attr_of_token(en_vocab):
doc = Doc(en_vocab, words=["An", "example", "sentence"])
example = doc.vocab["example"]
assert example.orth != example.shape
feats_array = tokens.to_array((ORTH, SHAPE))
feats_array = doc.to_array((ORTH, SHAPE))
assert feats_array[0][0] != feats_array[0][1]
assert feats_array[0][0] != feats_array[0][1]
def test_doc_stringy_array_attr_of_token(en_tokenizer, en_vocab):
text = "An example sentence"
tokens = en_tokenizer(text)
example = tokens.vocab["example"]
def test_doc_stringy_array_attr_of_token(en_vocab):
doc = Doc(en_vocab, words=["An", "example", "sentence"])
example = doc.vocab["example"]
assert example.orth != example.shape
feats_array = tokens.to_array((ORTH, SHAPE))
feats_array_stringy = tokens.to_array(("ORTH", "SHAPE"))
feats_array = doc.to_array((ORTH, SHAPE))
feats_array_stringy = doc.to_array(("ORTH", "SHAPE"))
assert feats_array_stringy[0][0] == feats_array[0][0]
assert feats_array_stringy[0][1] == feats_array[0][1]
def test_doc_scalar_attr_of_token(en_tokenizer, en_vocab):
text = "An example sentence"
tokens = en_tokenizer(text)
example = tokens.vocab["example"]
def test_doc_scalar_attr_of_token(en_vocab):
doc = Doc(en_vocab, words=["An", "example", "sentence"])
example = doc.vocab["example"]
assert example.orth != example.shape
feats_array = tokens.to_array(ORTH)
feats_array = doc.to_array(ORTH)
assert feats_array.shape == (3,)
def test_doc_array_tag(en_tokenizer):
text = "A nice sentence."
def test_doc_array_tag(en_vocab):
words = ["A", "nice", "sentence", "."]
pos = ["DET", "ADJ", "NOUN", "PUNCT"]
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
doc = get_doc(en_vocab, words=words, pos=pos)
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
feats_array = doc.to_array((ORTH, POS))
assert feats_array[0][1] == doc[0].pos
@ -49,13 +47,22 @@ def test_doc_array_tag(en_tokenizer):
assert feats_array[3][1] == doc[3].pos
def test_doc_array_dep(en_tokenizer):
text = "A nice sentence."
def test_doc_array_dep(en_vocab):
words = ["A", "nice", "sentence", "."]
deps = ["det", "amod", "ROOT", "punct"]
tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
doc = get_doc(en_vocab, words=words, deps=deps)
feats_array = doc.to_array((ORTH, DEP))
assert feats_array[0][1] == doc[0].dep
assert feats_array[1][1] == doc[1].dep
assert feats_array[2][1] == doc[2].dep
assert feats_array[3][1] == doc[3].dep
@pytest.mark.parametrize("attrs", [["ORTH", "SHAPE"], "IS_ALPHA"])
def test_doc_array_to_from_string_attrs(en_vocab, attrs):
"""Test that both Doc.to_array and Doc.from_array accept string attrs,
as well as single attrs and sequences of attrs.
"""
words = ["An", "example", "sentence"]
doc = Doc(en_vocab, words=words)
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))

View File

@ -4,9 +4,10 @@ from __future__ import unicode_literals
import pytest
import numpy
from spacy.tokens import Doc
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab
from spacy.errors import ModelsWarning
from spacy.attrs import ENT_TYPE, ENT_IOB
from ..util import get_doc
@ -112,14 +113,14 @@ def test_doc_api_serialize(en_tokenizer, text):
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(tensor=False), tensor=False
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
)
assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens]
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(sentiment=False), sentiment=False
tokens.to_bytes(exclude=["sentiment"]), exclude=["sentiment"]
)
assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens]
@ -256,3 +257,24 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
assert lca[1, 1] == 1
assert lca[0, 1] == 2
assert lca[1, 2] == 2
def test_doc_is_nered(en_vocab):
words = ["I", "live", "in", "New", "York"]
doc = Doc(en_vocab, words=words)
assert not doc.is_nered
doc.ents = [Span(doc, 3, 5, label="GPE")]
assert doc.is_nered
# Test creating doc from array with unknown values
arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64")
doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr)
assert doc.is_nered
# Test serialization
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
assert new_doc.is_nered
def test_doc_lang(en_vocab):
doc = Doc(en_vocab, words=["Hello", "world"])
assert doc.lang_ == "en"
assert doc.lang == en_vocab.strings["en"]

View File

@ -178,11 +178,10 @@ def test_span_string_label(doc):
assert span.label == doc.vocab.strings["hello"]
def test_span_string_set_label(doc):
def test_span_label_readonly(doc):
span = Span(doc, 0, 1)
span.label_ = "hello"
assert span.label_ == "hello"
assert span.label == doc.vocab.strings["hello"]
with pytest.raises(NotImplementedError):
span.label_ = "hello"
def test_span_ents_property(doc):

View File

@ -199,3 +199,31 @@ def test_token0_has_sent_start_true():
assert doc[0].is_sent_start is True
assert doc[1].is_sent_start is None
assert not doc.is_sentenced
def test_token_api_conjuncts_chain(en_vocab):
words = "The boy and the girl and the man went .".split()
heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1]
deps = ["det", "nsubj", "cc", "det", "conj", "cc", "det", "conj", "ROOT", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[1].conjuncts] == ["girl", "man"]
assert [w.text for w in doc[4].conjuncts] == ["boy", "man"]
assert [w.text for w in doc[7].conjuncts] == ["boy", "girl"]
def test_token_api_conjuncts_simple(en_vocab):
words = "They came and went .".split()
heads = [1, 0, -1, -2, -1]
deps = ["nsubj", "ROOT", "cc", "conj", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[1].conjuncts] == ["went"]
assert [w.text for w in doc[3].conjuncts] == ["came"]
def test_token_api_non_conjuncts(en_vocab):
words = "They came .".split()
heads = [1, 0, -1]
deps = ["nsubj", "ROOT", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[0].conjuncts] == []
assert [w.text for w in doc[1].conjuncts] == []

View File

@ -106,3 +106,37 @@ def test_underscore_raises_for_invalid(invalid_kwargs):
def test_underscore_accepts_valid(valid_kwargs):
valid_kwargs["force"] = True
Doc.set_extension("test", **valid_kwargs)
def test_underscore_mutable_defaults_list(en_vocab):
"""Test that mutable default arguments are handled correctly (see #2581)."""
Doc.set_extension("mutable", default=[])
doc1 = Doc(en_vocab, words=["one"])
doc2 = Doc(en_vocab, words=["two"])
doc1._.mutable.append("foo")
assert len(doc1._.mutable) == 1
assert doc1._.mutable[0] == "foo"
assert len(doc2._.mutable) == 0
doc1._.mutable = ["bar", "baz"]
doc1._.mutable.append("foo")
assert len(doc1._.mutable) == 3
assert len(doc2._.mutable) == 0
def test_underscore_mutable_defaults_dict(en_vocab):
"""Test that mutable default arguments are handled correctly (see #2581)."""
Token.set_extension("mutable", default={})
token1 = Doc(en_vocab, words=["one"])[0]
token2 = Doc(en_vocab, words=["two"])[0]
token1._.mutable["foo"] = "bar"
assert len(token1._.mutable) == 1
assert token1._.mutable["foo"] == "bar"
assert len(token2._.mutable) == 0
token1._.mutable["foo"] = "baz"
assert len(token1._.mutable) == 1
assert token1._.mutable["foo"] == "baz"
token1._.mutable["x"] = []
token1._.mutable["x"].append("y")
assert len(token1._.mutable) == 2
assert token1._.mutable["x"] == ["y"]
assert len(token2._.mutable) == 0

View File

@ -2,22 +2,24 @@
from __future__ import unicode_literals
import pytest
import numpy
from spacy.tokens import Doc
from spacy.displacy import render
from spacy.gold import iob_to_biluo
from spacy.lang.it import Italian
import numpy
from spacy.lang.en import English
from ..util import add_vecs_to_vocab, get_doc
@pytest.mark.xfail(
reason="The dot is now properly split off, but the prefix/suffix rules are not applied again afterwards."
"This means that the quote will still be attached to the remaining token."
)
@pytest.mark.xfail
def test_issue2070():
"""Test that checks that a dot followed by a quote is handled appropriately."""
"""Test that checks that a dot followed by a quote is handled
appropriately.
"""
# Problem: The dot is now properly split off, but the prefix/suffix rules
# are not applied again afterwards. This means that the quote will still be
# attached to the remaining token.
nlp = English()
doc = nlp('First sentence."A quoted sentence" he said ...')
assert len(doc) == 11
@ -37,6 +39,26 @@ def test_issue2179():
assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",)
def test_issue2203(en_vocab):
"""Test that lemmas are set correctly in doc.from_array."""
words = ["I", "'ll", "survive"]
tags = ["PRP", "MD", "VB"]
lemmas = ["-PRON-", "will", "survive"]
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
doc = Doc(en_vocab, words=words)
# Work around lemma corruption problem and set lemmas after tags
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags
assert [t.lemma_ for t in doc] == lemmas
# We need to serialize both tag and lemma, since this is what causes the bug
doc_array = doc.to_array(["TAG", "LEMMA"])
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
assert [t.tag_ for t in new_doc] == tags
assert [t.lemma_ for t in new_doc] == lemmas
def test_issue2219(en_vocab):
vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
add_vecs_to_vocab(en_vocab, vectors)

View File

@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.pipeline import EntityRuler, EntityRecognizer
def test_issue3345():
"""Test case where preset entity crosses sentence boundary."""
nlp = English()
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
doc[4].is_sent_start = True
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
ner = EntityRecognizer(doc.vocab)
# Add the OUT action. I wouldn't have thought this would be necessary...
ner.moves.add_action(5, "")
ner.add_label("GPE")
doc = ruler(doc)
# Get into the state just before "New"
state = ner.moves.init_batch([doc])[0]
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
# Check that B-GPE is valid.
assert ner.moves.is_valid(state, "B-GPE")

View File

@ -0,0 +1,21 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
from spacy.matcher import Matcher, PhraseMatcher
def test_issue3410():
texts = ["Hello world", "This is a test"]
nlp = English()
matcher = Matcher(nlp.vocab)
phrasematcher = PhraseMatcher(nlp.vocab)
with pytest.deprecated_call():
docs = list(nlp.pipe(texts, n_threads=4))
with pytest.deprecated_call():
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
with pytest.deprecated_call():
list(matcher.pipe(docs, n_threads=4))
with pytest.deprecated_call():
list(phrasematcher.pipe(docs, n_threads=4))

View File

@ -1,6 +1,7 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.tokens import Doc
from spacy.compat import path2str
@ -41,3 +42,18 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
doc.to_disk(file_path)
doc_d = Doc(en_vocab).from_disk(file_path)
assert doc.to_bytes() == doc_d.to_bytes()
def test_serialize_doc_exclude(en_vocab):
doc = Doc(en_vocab, words=["hello", "world"])
doc.user_data["foo"] = "bar"
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
assert new_doc.user_data["foo"] == "bar"
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(), exclude=["user_data"])
assert not new_doc.user_data
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
assert not new_doc.user_data
with pytest.raises(ValueError):
doc.to_bytes(user_data=False)
with pytest.raises(ValueError):
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)

View File

@ -52,3 +52,19 @@ def test_serialize_with_custom_tokenizer():
nlp.tokenizer = custom_tokenizer(nlp)
with make_tempdir() as d:
nlp.to_disk(d)
def test_serialize_language_exclude(meta_data):
name = "name-in-fixture"
nlp = Language(meta=meta_data)
assert nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes())
assert nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"])
assert not new_nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
assert not new_nlp.meta["name"] == name
with pytest.raises(ValueError):
nlp.to_bytes(meta=False)
with pytest.raises(ValueError):
Language().from_bytes(nlp.to_bytes(), meta=False)

View File

@ -55,7 +55,9 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
parser_d = Parser(en_vocab)
parser_d.model, _ = parser_d.Model(0)
parser_d = parser_d.from_disk(file_path)
assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False)
parser_bytes = parser.to_bytes(exclude=["model"])
parser_d_bytes = parser_d.to_bytes(exclude=["model"])
assert parser_bytes == parser_d_bytes
def test_to_from_bytes(parser, blank_parser):
@ -114,3 +116,25 @@ def test_serialize_textcat_empty(en_vocab):
# See issue #1105
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
textcat.to_bytes()
@pytest.mark.parametrize("Parser", test_parsers)
def test_serialize_pipe_exclude(en_vocab, Parser):
def get_new_parser():
new_parser = Parser(en_vocab)
new_parser.model, _ = new_parser.Model(0)
return new_parser
parser = Parser(en_vocab)
parser.model, _ = parser.Model(0)
parser.cfg["foo"] = "bar"
new_parser = get_new_parser().from_bytes(parser.to_bytes())
assert "foo" in new_parser.cfg
new_parser = get_new_parser().from_bytes(parser.to_bytes(), exclude=["cfg"])
assert "foo" not in new_parser.cfg
new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["cfg"]))
assert "foo" not in new_parser.cfg
with pytest.raises(ValueError):
parser.to_bytes(cfg=False)
with pytest.raises(ValueError):
get_new_parser().from_bytes(parser.to_bytes(), cfg=False)

View File

@ -12,13 +12,12 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])]
test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
@pytest.mark.xfail
@pytest.mark.parametrize("text", ["rat"])
def test_serialize_vocab(en_vocab, text):
text_hash = en_vocab.strings.add(text)
vocab_bytes = en_vocab.to_bytes()
new_vocab = Vocab().from_bytes(vocab_bytes)
assert new_vocab.strings(text_hash) == text
assert new_vocab.strings[text_hash] == text
@pytest.mark.parametrize("strings1,strings2", test_strings)
@ -69,6 +68,15 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr):
assert vocab2[strings[0]].norm_ == lex_attr
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
def test_deserialize_vocab_seen_entries(strings, lex_attr):
# Reported in #2153
vocab = Vocab(strings=strings)
length = len(vocab)
vocab.from_bytes(vocab.to_bytes())
assert len(vocab) == length
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
vocab1 = Vocab(strings=strings)

View File

@ -1,38 +1,28 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import os
from pathlib import Path
from spacy.compat import symlink_to, symlink_remove, path2str
from spacy.cli.converters import conllu2json
@pytest.fixture
def target_local_path():
return Path("./foo-target")
@pytest.fixture
def link_local_path():
return Path("./foo-symlink")
@pytest.fixture(scope="function")
def setup_target(request, target_local_path, link_local_path):
if not target_local_path.exists():
os.mkdir(path2str(target_local_path))
# yield -- need to cleanup even if assertion fails
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
def cleanup():
symlink_remove(link_local_path)
os.rmdir(path2str(target_local_path))
request.addfinalizer(cleanup)
def test_create_symlink_windows(setup_target, target_local_path, link_local_path):
assert target_local_path.exists()
symlink_to(link_local_path, target_local_path)
assert link_local_path.exists()
def test_cli_converters_conllu2json():
# https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu
lines = [
"1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO",
"2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER",
"3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tI-PER",
"4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
]
input_data = "\n".join(lines)
converted = conllu2json(input_data, n_sents=1)
assert len(converted) == 1
assert converted[0]["id"] == 0
assert len(converted[0]["paragraphs"]) == 1
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
sent = converted[0]["paragraphs"][0]["sentences"][0]
assert len(sent["tokens"]) == 4
tokens = sent["tokens"]
assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår"]
assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
assert [t["head"] for t in tokens] == [1, 2, -1, 0]
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]

View File

@ -0,0 +1,90 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy import displacy
from spacy.tokens import Span
from spacy.lang.fa import Persian
from .util import get_doc
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
ents = displacy.parse_ents(doc)
assert isinstance(ents, dict)
assert ents["text"] == "But Google is starting from behind "
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
def test_displacy_parse_deps(en_vocab):
"""Test that deps and tags on a Doc are converted into displaCy's format."""
words = ["This", "is", "a", "sentence"]
heads = [1, 0, 1, -2]
pos = ["DET", "VERB", "DET", "NOUN"]
tags = ["DT", "VBZ", "DT", "NN"]
deps = ["nsubj", "ROOT", "det", "attr"]
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "VERB"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
{"start": 2, "end": 3, "label": "det", "dir": "left"},
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
]
def test_displacy_spans(en_vocab):
"""Test that displaCy can render Spans."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc[1:4], style="ent")
assert html.startswith("<div")
def test_displacy_raises_for_wrong_type(en_vocab):
with pytest.raises(ValueError):
displacy.render("hello world")
def test_displacy_rtl():
# Source: http://www.sobhe.ir/hazm/ is this correct?
words = ["ما", "بسیار", "کتاب", "می\u200cخوانیم"]
# These are (likely) wrong, but it's just for testing
pos = ["PRO", "ADV", "N_PL", "V_SUB"] # needs to match lang.fa.tag_map
deps = ["foo", "bar", "foo", "baz"]
heads = [1, 0, 1, -2]
nlp = Persian()
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
doc.ents = [Span(doc, 1, 3, label="TEST")]
html = displacy.render(doc, page=True, style="dep")
assert "direction: rtl" in html
assert 'direction="rtl"' in html
assert 'lang="{}"'.format(nlp.lang) in html
html = displacy.render(doc, page=True, style="ent")
assert "direction: rtl" in html
assert 'lang="{}"'.format(nlp.lang) in html
def test_displacy_render_wrapper(en_vocab):
"""Test that displaCy accepts custom rendering wrapper."""
def wrapper(html):
return "TEST" + html + "TEST"
displacy.set_render_wrapper(wrapper)
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc, style="ent")
assert html.startswith("TEST<div")
assert html.endswith("/div>TEST")
# Restore
displacy.set_render_wrapper(lambda html: html)

View File

@ -2,14 +2,35 @@
from __future__ import unicode_literals
import pytest
import os
from pathlib import Path
from spacy import util
from spacy import displacy
from spacy import prefer_gpu, require_gpu
from spacy.tokens import Span
from spacy.compat import symlink_to, symlink_remove, path2str
from spacy._ml import PrecomputableAffine
from .util import get_doc
@pytest.fixture
def symlink_target():
return Path("./foo-target")
@pytest.fixture
def symlink():
return Path("./foo-symlink")
@pytest.fixture(scope="function")
def symlink_setup_target(request, symlink_target, symlink):
if not symlink_target.exists():
os.mkdir(path2str(symlink_target))
# yield -- need to cleanup even if assertion fails
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
def cleanup():
symlink_remove(symlink)
os.rmdir(path2str(symlink_target))
request.addfinalizer(cleanup)
@pytest.mark.parametrize("text", ["hello/world", "hello world"])
@ -31,66 +52,6 @@ def test_util_get_package_path(package):
assert isinstance(path, Path)
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
ents = displacy.parse_ents(doc)
assert isinstance(ents, dict)
assert ents["text"] == "But Google is starting from behind "
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
def test_displacy_parse_deps(en_vocab):
"""Test that deps and tags on a Doc are converted into displaCy's format."""
words = ["This", "is", "a", "sentence"]
heads = [1, 0, 1, -2]
pos = ["DET", "VERB", "DET", "NOUN"]
tags = ["DT", "VBZ", "DT", "NN"]
deps = ["nsubj", "ROOT", "det", "attr"]
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "VERB"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
{"start": 2, "end": 3, "label": "det", "dir": "left"},
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
]
def test_displacy_spans(en_vocab):
"""Test that displaCy can render Spans."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc[1:4], style="ent")
assert html.startswith("<div")
def test_displacy_render_wrapper(en_vocab):
"""Test that displaCy accepts custom rendering wrapper."""
def wrapper(html):
return "TEST" + html + "TEST"
displacy.set_render_wrapper(wrapper)
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc, style="ent")
assert html.startswith("TEST<div")
assert html.endswith("/div>TEST")
def test_displacy_raises_for_wrong_type(en_vocab):
with pytest.raises(ValueError):
displacy.render("hello world")
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
assert model.W.shape == (nF, nO, nP, nI)
@ -124,3 +85,9 @@ def test_prefer_gpu():
def test_require_gpu():
with pytest.raises(ValueError):
require_gpu()
def test_create_symlink_windows(symlink_setup_target, symlink_target, symlink):
assert symlink_target.exists()
symlink_to(symlink, symlink_target)
assert symlink.exists()

View File

@ -45,3 +45,8 @@ def test_vocab_api_contains(en_vocab, text):
_ = en_vocab[text] # noqa: F841
assert text in en_vocab
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
def test_vocab_writing_system(en_vocab):
assert en_vocab.writing_system["direction"] == "ltr"
assert en_vocab.writing_system["has_case"] is True

View File

@ -125,7 +125,7 @@ cdef class Tokenizer:
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
return doc
def pipe(self, texts, batch_size=1000, n_threads=2):
def pipe(self, texts, batch_size=1000, n_threads=-1):
"""Tokenize a stream of texts.
texts: A sequence of unicode texts.
@ -134,6 +134,8 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#pipe
"""
if n_threads != -1:
deprecation_warning(Warnings.W016)
for text in texts:
yield self(text)
@ -360,36 +362,37 @@ cdef class Tokenizer:
self._cache.set(key, cached)
self._rules[string] = substrings
def to_disk(self, path, **exclude):
def to_disk(self, path, **kwargs):
"""Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects.
it doesn't exist.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/tokenizer#to_disk
"""
with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude))
file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude):
def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and
returns it.
path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects.
path (unicode or Path): A path to a directory.
exclude (list): String names of serialization fields to exclude.
RETURNS (Tokenizer): The modified `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_disk
"""
with path.open("rb") as file_:
bytes_data = file_.read()
self.from_bytes(bytes_data, **exclude)
self.from_bytes(bytes_data, **kwargs)
return self
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#to_bytes
@ -402,13 +405,14 @@ cdef class Tokenizer:
("token_match", lambda: _get_regex_pattern(self.token_match)),
("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
))
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load state from a binary string.
bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded.
exclude (list): String names of serialization fields to exclude.
RETURNS (Tokenizer): The `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_bytes
@ -422,6 +426,7 @@ cdef class Tokenizer:
("token_match", lambda b: data.setdefault("token_match", b)),
("exceptions", lambda b: data.setdefault("rules", b))
))
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude)
if data.get("prefix_search"):
self.prefix_search = re.compile(data["prefix_search"]).search
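
A brief sketch of the corresponding Tokenizer usage, assuming a blank English pipeline; serialization fields are excluded by name rather than via boolean keyword flags:

from spacy.lang.en import English

nlp = English()
tok_bytes = nlp.tokenizer.to_bytes(exclude=["vocab"])  # skip the shared vocab
nlp.tokenizer.from_bytes(tok_bytes, exclude=["vocab"])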

View File

@ -240,8 +240,18 @@ cdef class Doc:
for i in range(1, self.length):
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
return True
else:
return False
return False
@property
def is_nered(self):
"""Check if the document has named entities set. Will return True if
*any* of the tokens has a named entity tag set (even if the others are
unknown values).
"""
for i in range(self.length):
if self.c[i].ent_iob != 0:
return True
return False
def __getitem__(self, object i):
"""Get a `Token` or `Span` object.
@ -374,7 +384,8 @@ cdef class Doc:
xp = get_array_module(vector)
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
property has_vector:
@property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with
the object.
@ -382,15 +393,14 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#has_vector
"""
def __get__(self):
if "has_vector" in self.user_hooks:
return self.user_hooks["has_vector"](self)
elif self.vocab.vectors.data.size:
return True
elif self.tensor.size:
return True
else:
return False
if "has_vector" in self.user_hooks:
return self.user_hooks["has_vector"](self)
elif self.vocab.vectors.data.size:
return True
elif self.tensor.size:
return True
else:
return False
property vector:
"""A real-valued meaning representation. Defaults to an average of the
@ -443,22 +453,22 @@ cdef class Doc:
def __set__(self, value):
self._vector_norm = value
property text:
@property
def text(self):
"""A unicode representation of the document text.
RETURNS (unicode): The original verbatim text of the document.
"""
def __get__(self):
return "".join(t.text_with_ws for t in self)
return "".join(t.text_with_ws for t in self)
property text_with_ws:
@property
def text_with_ws(self):
"""An alias of `Doc.text`, provided for duck-type compatibility with
`Span` and `Token`.
RETURNS (unicode): The original verbatim text of the document.
"""
def __get__(self):
return self.text
return self.text
property ents:
"""The named entities in the document. Returns a tuple of named entity
@ -535,7 +545,8 @@ cdef class Doc:
# Set start as B
self.c[start].ent_iob = 3
property noun_chunks:
@property
def noun_chunks(self):
"""Iterate over the base noun phrases in the document. Yields base
noun-phrase #[code Span] objects, if the document has been
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
@ -547,22 +558,22 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#noun_chunks
"""
def __get__(self):
if not self.is_parsed:
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375.
spans = []
if self.noun_chunks_iterator is not None:
for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans:
yield span
if not self.is_parsed:
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375.
spans = []
if self.noun_chunks_iterator is not None:
for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans:
yield span
property sents:
@property
def sents(self):
"""Iterate over the sentences in the document. Yields sentence `Span`
objects. Sentence spans have no label. To improve accuracy on informal
texts, spaCy calculates sentence boundaries from the syntactic
@ -573,19 +584,28 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#sents
"""
def __get__(self):
if not self.is_sentenced:
raise ValueError(Errors.E030)
if "sents" in self.user_hooks:
yield from self.user_hooks["sents"](self)
else:
start = 0
for i in range(1, self.length):
if self.c[i].sent_start == 1:
yield Span(self, start, i)
start = i
if start != self.length:
yield Span(self, start, self.length)
if not self.is_sentenced:
raise ValueError(Errors.E030)
if "sents" in self.user_hooks:
yield from self.user_hooks["sents"](self)
else:
start = 0
for i in range(1, self.length):
if self.c[i].sent_start == 1:
yield Span(self, start, i)
start = i
if start != self.length:
yield Span(self, start, self.length)
@property
def lang(self):
"""RETURNS (uint64): ID of the language of the doc's vocabulary."""
return self.vocab.strings[self.vocab.lang]
@property
def lang_(self):
"""RETURNS (unicode): Language of the doc's vocabulary, e.g. 'en'."""
return self.vocab.lang
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
if self.length == 0:
@ -727,6 +747,18 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#from_array
"""
# Handle scalar/list inputs of strings/ints for py_attr_ids
# See also #3064
if isinstance(attrs, basestring_):
# Handle inputs like doc.to_array('ORTH')
attrs = [attrs]
elif not hasattr(attrs, "__iter__"):
# Handle inputs like doc.to_array(ORTH)
attrs = [attrs]
# Allow strings, e.g. 'lemma' or 'LEMMA'
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
for id_ in attrs]
if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032)
cdef int i, col
@ -739,17 +771,20 @@ cdef class Doc:
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
for i, attr_id in enumerate(attrs):
attr_ids[i] = attr_id
if len(array.shape) == 1:
array = array.reshape((array.size, 1))
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
if TAG in attrs:
col = attrs.index(TAG)
for i in range(length):
if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
# Now load the data
for i in range(self.length):
token = &self.c[i]
for j in range(n_attrs):
Token.set_struct_attr(token, attr_ids[j], array[i, j])
# Auxiliary loading logic
for col, attr_id in enumerate(attrs):
if attr_id == TAG:
for i in range(length):
if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
if attr_ids[j] != TAG:
Token.set_struct_attr(token, attr_ids[j], array[i, j])
# Set flags
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
@ -770,24 +805,26 @@ cdef class Doc:
"""
return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
def to_disk(self, path, **exclude):
def to_disk(self, path, **kwargs):
"""Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/doc#to_disk
"""
path = util.ensure_path(path)
with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude))
file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude):
def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and
returns it.
path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects.
exclude (list): String names of serialization fields to exclude.
RETURNS (Doc): The modified `Doc` object.
DOCS: https://spacy.io/api/doc#from_disk
@ -795,11 +832,12 @@ cdef class Doc:
path = util.ensure_path(path)
with path.open("rb") as file_:
bytes_data = file_.read()
return self.from_bytes(bytes_data, **exclude)
return self.from_bytes(bytes_data, **kwargs)
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize, i.e. export the document contents to a binary string.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
all annotations.
@ -825,16 +863,22 @@ cdef class Doc:
"sentiment": lambda: self.sentiment,
"tensor": lambda: self.tensor,
}
for key in kwargs:
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
raise ValueError(Errors.E128.format(arg=key))
if "user_data" not in exclude and self.user_data:
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
if "user_data_keys" not in exclude:
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
if "user_data_values" not in exclude:
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Deserialize, i.e. import the document contents from a binary string.
data (bytes): The string to load from.
exclude (list): String names of serialization fields to exclude.
RETURNS (Doc): Itself.
DOCS: https://spacy.io/api/doc#from_bytes
@ -850,6 +894,9 @@ cdef class Doc:
"user_data_keys": lambda b: None,
"user_data_values": lambda b: None,
}
for key in kwargs:
if key in deserializers or key in ("user_data",):
raise ValueError(Errors.E128.format(arg=key))
msg = util.from_bytes(bytes_data, deserializers, exclude)
# Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within
@ -990,11 +1037,11 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_json
"""
data = {"text": self.text}
if self.ents:
if self.is_nered:
data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
"label": ent.label_} for ent in self.ents]
sents = list(self.sents)
if sents:
if self.is_sentenced:
sents = list(self.sents)
data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
for sent in sents]
if self.cats:
@ -1002,13 +1049,11 @@ cdef class Doc:
data["tokens"] = []
for token in self:
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
if token.pos_:
if self.is_tagged:
token_data["pos"] = token.pos_
if token.tag_:
token_data["tag"] = token.tag_
if token.dep_:
if self.is_parsed:
token_data["dep"] = token.dep_
if token.head:
token_data["head"] = token.head.i
data["tokens"].append(token_data)
if underscore:
@ -1179,7 +1224,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
def pickle_doc(doc):
bytes_data = doc.to_bytes(vocab=False, user_data=False)
bytes_data = doc.to_bytes(exclude=["vocab", "user_data"])
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
doc.user_token_hooks)
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
@ -1188,7 +1233,7 @@ def pickle_doc(doc):
def unpickle_doc(vocab, hooks_and_data, bytes_data):
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude="user_data")
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude=["user_data"])
doc.user_hooks.update(doc_hooks)
doc.user_span_hooks.update(span_hooks)
doc.user_token_hooks.update(token_hooks)
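Taken together, the pickle helpers above now route everything through the list-based `exclude` argument instead of the old `vocab=False`-style keywords. A small sketch of the same pattern in user code, dropping user data from the byte round trip (blank pipeline, no model needed):

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
doc = nlp(u"Hello world")
doc.user_data["my_key"] = "my_value"
# Serialize without the user data, then restore into a fresh Doc
data = doc.to_bytes(exclude=["user_data"])
doc2 = Doc(nlp.vocab).from_bytes(data)
assert doc2.text == doc.text
assert "my_key" not in doc2.user_data
```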


@ -322,46 +322,47 @@ cdef class Span:
self.start = start
self.end = end + 1
property vocab:
@property
def vocab(self):
"""RETURNS (Vocab): The Span's Doc's vocab."""
def __get__(self):
return self.doc.vocab
return self.doc.vocab
property sent:
@property
def sent(self):
"""RETURNS (Span): The sentence span that the span is a part of."""
def __get__(self):
if "sent" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sent"](self)
# This should raise if not parsed / no custom sentence boundaries
self.doc.sents
# If doc is parsed we can use the deps to find the sentence
# otherwise we use the `sent_start` token attribute
cdef int n = 0
cdef int i
if self.doc.is_parsed:
root = &self.doc.c[self.start]
while root.head != 0:
root += root.head
n += 1
if n >= self.doc.length:
raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced:
# Find start of the sentence
start = self.start
while self.doc.c[start].sent_start != 1 and start > 0:
start += -1
# Find end of the sentence
end = self.end
n = 0
while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1
n += 1
if n >= self.doc.length:
break
return self.doc[start:end]
if "sent" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sent"](self)
# This should raise if not parsed / no custom sentence boundaries
self.doc.sents
# If doc is parsed we can use the deps to find the sentence
# otherwise we use the `sent_start` token attribute
cdef int n = 0
cdef int i
if self.doc.is_parsed:
root = &self.doc.c[self.start]
while root.head != 0:
root += root.head
n += 1
if n >= self.doc.length:
raise RuntimeError(Errors.E038)
return self.doc[root.l_edge:root.r_edge + 1]
elif self.doc.is_sentenced:
# Find start of the sentence
start = self.start
while self.doc.c[start].sent_start != 1 and start > 0:
start += -1
# Find end of the sentence
end = self.end
n = 0
while end < self.doc.length and self.doc.c[end].sent_start != 1:
end += 1
n += 1
if n >= self.doc.length:
break
return self.doc[start:end]
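Because the flattened property falls back to the `sent_start` attributes when there is no parse, rule-based sentence boundaries are enough. A quick sketch, assuming the `sentencizer` factory (no statistical model required):

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
doc = nlp(u"Hello world. This is a test.")
span = doc[3:5]  # "This is"
# The containing sentence is found via sent_start, not the dependency parse
assert span.sent.text == "This is a test."
```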
property ents:
@property
def ents(self):
"""The named entities in the span. Returns a tuple of named entity
`Span` objects, if the entity recognizer has been applied.
@ -369,14 +370,14 @@ cdef class Span:
DOCS: https://spacy.io/api/span#ents
"""
def __get__(self):
ents = []
for ent in self.doc.ents:
if ent.start >= self.start and ent.end <= self.end:
ents.append(ent)
return ents
ents = []
for ent in self.doc.ents:
if ent.start >= self.start and ent.end <= self.end:
ents.append(ent)
return ents
property has_vector:
@property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with
the object.
@ -384,17 +385,17 @@ cdef class Span:
DOCS: https://spacy.io/api/span#has_vector
"""
def __get__(self):
if "has_vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["has_vector"](self)
elif self.vocab.vectors.data.size > 0:
return any(token.has_vector for token in self)
elif self.doc.tensor.size > 0:
return True
else:
return False
if "has_vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["has_vector"](self)
elif self.vocab.vectors.data.size > 0:
return any(token.has_vector for token in self)
elif self.doc.tensor.size > 0:
return True
else:
return False
property vector:
@property
def vector(self):
"""A real-valued meaning representation. Defaults to an average of the
token vectors.
@ -403,61 +404,61 @@ cdef class Span:
DOCS: https://spacy.io/api/span#vector
"""
def __get__(self):
if "vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self)
if self._vector is None:
self._vector = sum(t.vector for t in self) / len(self)
return self._vector
if "vector" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self)
if self._vector is None:
self._vector = sum(t.vector for t in self) / len(self)
return self._vector
property vector_norm:
@property
def vector_norm(self):
"""The L2 norm of the span's vector representation.
RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/span#vector_norm
"""
def __get__(self):
if "vector_norm" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self)
cdef float value
cdef double norm = 0
if self._vector_norm is None:
norm = 0
for value in self.vector:
norm += value * value
self._vector_norm = sqrt(norm) if norm != 0 else 0
return self._vector_norm
if "vector_norm" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["vector"](self)
cdef float value
cdef double norm = 0
if self._vector_norm is None:
norm = 0
for value in self.vector:
norm += value * value
self._vector_norm = sqrt(norm) if norm != 0 else 0
return self._vector_norm
property sentiment:
@property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or
negativity of the span.
"""
def __get__(self):
if "sentiment" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sentiment"](self)
else:
return sum([token.sentiment for token in self]) / len(self)
if "sentiment" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["sentiment"](self)
else:
return sum([token.sentiment for token in self]) / len(self)
property text:
@property
def text(self):
"""RETURNS (unicode): The original verbatim text of the span."""
def __get__(self):
text = self.text_with_ws
if self[-1].whitespace_:
text = text[:-1]
return text
text = self.text_with_ws
if self[-1].whitespace_:
text = text[:-1]
return text
property text_with_ws:
@property
def text_with_ws(self):
"""The text content of the span with a trailing whitespace character if
the last token has one.
RETURNS (unicode): The text content of the span (with trailing
whitespace).
"""
def __get__(self):
return "".join([t.text_with_ws for t in self])
return "".join([t.text_with_ws for t in self])
property noun_chunks:
@property
def noun_chunks(self):
"""Yields base noun-phrase `Span` objects, if the document has been
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
phrase that does not permit other NPs to be nested within it so no
@ -468,23 +469,23 @@ cdef class Span:
DOCS: https://spacy.io/api/span#noun_chunks
"""
def __get__(self):
if not self.doc.is_parsed:
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375
spans = []
cdef attr_t label
if self.doc.noun_chunks_iterator is not None:
for start, end, label in self.doc.noun_chunks_iterator(self):
spans.append(Span(self.doc, start, end, label=label))
for span in spans:
yield span
if not self.doc.is_parsed:
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375
spans = []
cdef attr_t label
if self.doc.noun_chunks_iterator is not None:
for start, end, label in self.doc.noun_chunks_iterator(self):
spans.append(Span(self.doc, start, end, label=label))
for span in spans:
yield span
property root:
@property
def root(self):
"""The token with the shortest path to the root of the
sentence (or the root itself). If multiple tokens are equally
high in the tree, the first token is taken.
@ -493,41 +494,51 @@ cdef class Span:
DOCS: https://spacy.io/api/span#root
"""
def __get__(self):
self._recalculate_indices()
if "root" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["root"](self)
# This should probably be called 'head', and the other one called
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
cdef int i
# First, we scan through the Span, and check whether there's a word
# with head==0, i.e. a sentence root. If so, we can return it. The
# longer the span, the more likely it contains a sentence root, and
# in this case we return in linear time.
for i in range(self.start, self.end):
if self.doc.c[i].head == 0:
return self.doc[i]
# If we don't have a sentence root, we do something that's not so
# algorithmically clever, but I think should be quite fast,
# especially for short spans.
# For each word, we count the path length, and arg min this measure.
# We could use better tree logic to save steps here...But I
# think this should be okay.
cdef int current_best = self.doc.length
cdef int root = -1
for i in range(self.start, self.end):
if self.start <= (i+self.doc.c[i].head) < self.end:
continue
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
if words_to_root < current_best:
current_best = words_to_root
root = i
if root == -1:
return self.doc[self.start]
else:
return self.doc[root]
self._recalculate_indices()
if "root" in self.doc.user_span_hooks:
return self.doc.user_span_hooks["root"](self)
# This should probably be called 'head', and the other one called
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
cdef int i
# First, we scan through the Span, and check whether there's a word
# with head==0, i.e. a sentence root. If so, we can return it. The
# longer the span, the more likely it contains a sentence root, and
# in this case we return in linear time.
for i in range(self.start, self.end):
if self.doc.c[i].head == 0:
return self.doc[i]
# If we don't have a sentence root, we do something that's not so
# algorithmically clever, but I think should be quite fast,
# especially for short spans.
# For each word, we count the path length, and arg min this measure.
# We could use better tree logic to save steps here...But I
# think this should be okay.
cdef int current_best = self.doc.length
cdef int root = -1
for i in range(self.start, self.end):
if self.start <= (i+self.doc.c[i].head) < self.end:
continue
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
if words_to_root < current_best:
current_best = words_to_root
root = i
if root == -1:
return self.doc[self.start]
else:
return self.doc[root]
property lefts:
@property
def conjuncts(self):
"""Tokens that are conjoined to the span's root.
RETURNS (tuple): A tuple of Token objects.
DOCS: https://spacy.io/api/span#conjuncts

"""
return self.root.conjuncts
@property
def lefts(self):
"""Tokens that are to the left of the span, whose head is within the
`Span`.
@ -535,13 +546,13 @@ cdef class Span:
DOCS: https://spacy.io/api/span#lefts
"""
def __get__(self):
for token in reversed(self): # Reverse, so we get tokens in order
for left in token.lefts:
if left.i < self.start:
yield left
for token in reversed(self): # Reverse, so we get tokens in order
for left in token.lefts:
if left.i < self.start:
yield left
property rights:
@property
def rights(self):
"""Tokens that are to the right of the Span, whose head is within the
`Span`.
@ -549,13 +560,13 @@ cdef class Span:
DOCS: https://spacy.io/api/span#rights
"""
def __get__(self):
for token in self:
for right in token.rights:
if right.i >= self.end:
yield right
for token in self:
for right in token.rights:
if right.i >= self.end:
yield right
property n_lefts:
@property
def n_lefts(self):
"""The number of tokens that are to the left of the span, whose
heads are within the span.
@ -564,10 +575,10 @@ cdef class Span:
DOCS: https://spacy.io/api/span#n_lefts
"""
def __get__(self):
return len(list(self.lefts))
return len(list(self.lefts))
property n_rights:
@property
def n_rights(self):
"""The number of tokens that are to the right of the span, whose
heads are within the span.
@ -576,22 +587,21 @@ cdef class Span:
DOCS: https://spacy.io/api/span#n_rights
"""
def __get__(self):
return len(list(self.rights))
return len(list(self.rights))
property subtree:
@property
def subtree(self):
"""Tokens within the span and tokens which descend from them.
YIELDS (Token): A token within the span, or a descendant from it.
DOCS: https://spacy.io/api/span#subtree
"""
def __get__(self):
for word in self.lefts:
yield from word.subtree
yield from self
for word in self.rights:
yield from word.subtree
for word in self.lefts:
yield from word.subtree
yield from self
for word in self.rights:
yield from word.subtree
property ent_id:
"""RETURNS (uint64): The entity ID."""
@ -609,33 +619,33 @@ cdef class Span:
def __set__(self, hash_t key):
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
property orth_:
@property
def orth_(self):
"""Verbatim text content (identical to `Span.text`). Exists mostly for
consistency with other attributes.
RETURNS (unicode): The span's text."""
def __get__(self):
return self.text
return self.text
property lemma_:
@property
def lemma_(self):
"""RETURNS (unicode): The span's lemma."""
def __get__(self):
return " ".join([t.lemma_ for t in self]).strip()
return " ".join([t.lemma_ for t in self]).strip()
property upper_:
@property
def upper_(self):
"""Deprecated. Use `Span.text.upper()` instead."""
def __get__(self):
return "".join([t.text_with_ws.upper() for t in self]).strip()
return "".join([t.text_with_ws.upper() for t in self]).strip()
property lower_:
@property
def lower_(self):
"""Deprecated. Use `Span.text.lower()` instead."""
def __get__(self):
return "".join([t.text_with_ws.lower() for t in self]).strip()
return "".join([t.text_with_ws.lower() for t in self]).strip()
property string:
@property
def string(self):
"""Deprecated: Use `Span.text_with_ws` instead."""
def __get__(self):
return "".join([t.text_with_ws for t in self])
return "".join([t.text_with_ws for t in self])
property label_:
"""RETURNS (unicode): The span's label."""
@ -643,7 +653,9 @@ cdef class Span:
return self.doc.vocab.strings[self.label]
def __set__(self, unicode label_):
self.label = self.doc.vocab.strings.add(label_)
if not label_:
label_ = ''
raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))
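In other words, `label_` can no longer be re-assigned on an existing span; the label is meant to be supplied when the span is created. A small sketch of the supported pattern (blank pipeline):

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp(u"New York is busy")
# Assign the label at creation time instead of setting span.label_ afterwards
span = Span(doc, 0, 2, label="GPE")
assert span.label_ == "GPE"
```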
cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:


@ -219,115 +219,115 @@ cdef class Token:
xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
property morph:
def __get__(self):
return MorphAnalysis.from_id(self.vocab, self.c.morph)
@property
def morph(self):
return MorphAnalysis.from_id(self.vocab, self.c.morph)
property lex_id:
@property
def lex_id(self):
"""RETURNS (int): Sequential ID of the token's lexical type."""
def __get__(self):
return self.c.lex.id
return self.c.lex.id
property rank:
@property
def rank(self):
"""RETURNS (int): Sequential ID of the token's lexical type, used to
index into tables, e.g. for word vectors."""
def __get__(self):
return self.c.lex.id
return self.c.lex.id
property string:
@property
def string(self):
"""Deprecated: Use Token.text_with_ws instead."""
def __get__(self):
return self.text_with_ws
return self.text_with_ws
property text:
@property
def text(self):
"""RETURNS (unicode): The original verbatim text of the token."""
def __get__(self):
return self.orth_
return self.orth_
property text_with_ws:
@property
def text_with_ws(self):
"""RETURNS (unicode): The text content of the span (with trailing
whitespace).
"""
def __get__(self):
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
if self.c.spacy:
return orth + " "
else:
return orth
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
if self.c.spacy:
return orth + " "
else:
return orth
property prob:
@property
def prob(self):
"""RETURNS (float): Smoothed log probability estimate of token type."""
def __get__(self):
return self.c.lex.prob
return self.c.lex.prob
property sentiment:
@property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or
negativity of the token."""
def __get__(self):
if "sentiment" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["sentiment"](self)
return self.c.lex.sentiment
if "sentiment" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["sentiment"](self)
return self.c.lex.sentiment
property lang:
@property
def lang(self):
"""RETURNS (uint64): ID of the language of the parent document's
vocabulary.
"""
def __get__(self):
return self.c.lex.lang
return self.c.lex.lang
property idx:
@property
def idx(self):
"""RETURNS (int): The character offset of the token within the parent
document.
"""
def __get__(self):
return self.c.idx
return self.c.idx
property cluster:
@property
def cluster(self):
"""RETURNS (int): Brown cluster ID."""
def __get__(self):
return self.c.lex.cluster
return self.c.lex.cluster
property orth:
@property
def orth(self):
"""RETURNS (uint64): ID of the verbatim text content."""
def __get__(self):
return self.c.lex.orth
return self.c.lex.orth
property lower:
@property
def lower(self):
"""RETURNS (uint64): ID of the lowercase token text."""
def __get__(self):
return self.c.lex.lower
return self.c.lex.lower
property norm:
@property
def norm(self):
"""RETURNS (uint64): ID of the token's norm, i.e. a normalised form of
the token text. Usually set in the language's tokenizer exceptions
or norm exceptions.
"""
def __get__(self):
if self.c.norm == 0:
return self.c.lex.norm
else:
return self.c.norm
if self.c.norm == 0:
return self.c.lex.norm
else:
return self.c.norm
property shape:
@property
def shape(self):
"""RETURNS (uint64): ID of the token's shape, a transform of the
token's string, to show orthographic features (e.g. "Xxxx", "dd").
"""
def __get__(self):
return self.c.lex.shape
return self.c.lex.shape
property prefix:
@property
def prefix(self):
"""RETURNS (uint64): ID of a length-N substring from the start of the
token. Defaults to `N=1`.
"""
def __get__(self):
return self.c.lex.prefix
return self.c.lex.prefix
property suffix:
@property
def suffix(self):
"""RETURNS (uint64): ID of a length-N substring from the end of the
token. Defaults to `N=3`.
"""
def __get__(self):
return self.c.lex.suffix
return self.c.lex.suffix
property lemma:
"""RETURNS (uint64): ID of the base form of the word, with no
@ -367,7 +367,8 @@ cdef class Token:
def __set__(self, attr_t label):
self.c.dep = label
property has_vector:
@property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with
the object.
@ -375,14 +376,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#has_vector
"""
def __get__(self):
if 'has_vector' in self.doc.user_token_hooks:
return self.doc.user_token_hooks["has_vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return True
return self.vocab.has_vector(self.c.lex.orth)
if "has_vector" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["has_vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return True
return self.vocab.has_vector(self.c.lex.orth)
property vector:
@property
def vector(self):
"""A real-valued meaning representation.
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
@ -390,28 +391,28 @@ cdef class Token:
DOCS: https://spacy.io/api/token#vector
"""
def __get__(self):
if 'vector' in self.doc.user_token_hooks:
return self.doc.user_token_hooks["vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return self.doc.tensor[self.i]
else:
return self.vocab.get_vector(self.c.lex.orth)
if "vector" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["vector"](self)
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
return self.doc.tensor[self.i]
else:
return self.vocab.get_vector(self.c.lex.orth)
property vector_norm:
@property
def vector_norm(self):
"""The L2 norm of the token's vector representation.
RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/token#vector_norm
"""
def __get__(self):
if 'vector_norm' in self.doc.user_token_hooks:
return self.doc.user_token_hooks["vector_norm"](self)
vector = self.vector
return numpy.sqrt((vector ** 2).sum())
if "vector_norm" in self.doc.user_token_hooks:
return self.doc.user_token_hooks["vector_norm"](self)
vector = self.vector
return numpy.sqrt((vector ** 2).sum())
property n_lefts:
@property
def n_lefts(self):
"""The number of leftward immediate children of the word, in the
syntactic dependency parse.
@ -420,10 +421,10 @@ cdef class Token:
DOCS: https://spacy.io/api/token#n_lefts
"""
def __get__(self):
return self.c.l_kids
return self.c.l_kids
property n_rights:
@property
def n_rights(self):
"""The number of rightward immediate children of the word, in the
syntactic dependency parse.
@ -432,15 +433,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#n_rights
"""
def __get__(self):
return self.c.r_kids
return self.c.r_kids
property sent:
@property
def sent(self):
"""RETURNS (Span): The sentence span that the token is a part of."""
def __get__(self):
if 'sent' in self.doc.user_token_hooks:
return self.doc.user_token_hooks["sent"](self)
return self.doc[self.i : self.i+1].sent
if 'sent' in self.doc.user_token_hooks:
return self.doc.user_token_hooks["sent"](self)
return self.doc[self.i : self.i+1].sent
property sent_start:
def __get__(self):
@ -484,7 +484,8 @@ cdef class Token:
else:
raise ValueError(Errors.E044.format(value=value))
property lefts:
@property
def lefts(self):
"""The leftward immediate children of the word, in the syntactic
dependency parse.
@ -492,19 +493,19 @@ cdef class Token:
DOCS: https://spacy.io/api/token#lefts
"""
def __get__(self):
cdef int nr_iter = 0
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
while ptr < self.c:
if ptr + ptr.head == self.c:
yield self.doc[ptr - (self.c - self.i)]
ptr += 1
nr_iter += 1
# This is ugly, but it's a way to guard against infinite loops
if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
cdef int nr_iter = 0
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
while ptr < self.c:
if ptr + ptr.head == self.c:
yield self.doc[ptr - (self.c - self.i)]
ptr += 1
nr_iter += 1
# This is ugly, but it's a way to guard against infinite loops
if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
property rights:
@property
def rights(self):
"""The rightward immediate children of the word, in the syntactic
dependency parse.
@ -512,33 +513,33 @@ cdef class Token:
DOCS: https://spacy.io/api/token#rights
"""
def __get__(self):
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
tokens = []
cdef int nr_iter = 0
while ptr > self.c:
if ptr + ptr.head == self.c:
tokens.append(self.doc[ptr - (self.c - self.i)])
ptr -= 1
nr_iter += 1
if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr="token.rights"))
tokens.reverse()
for t in tokens:
yield t
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
tokens = []
cdef int nr_iter = 0
while ptr > self.c:
if ptr + ptr.head == self.c:
tokens.append(self.doc[ptr - (self.c - self.i)])
ptr -= 1
nr_iter += 1
if nr_iter >= 10000000:
raise RuntimeError(Errors.E045.format(attr="token.rights"))
tokens.reverse()
for t in tokens:
yield t
property children:
@property
def children(self):
"""A sequence of the token's immediate syntactic children.
YIELDS (Token): A child token such that `child.head==self`.
DOCS: https://spacy.io/api/token#children
"""
def __get__(self):
yield from self.lefts
yield from self.rights
yield from self.lefts
yield from self.rights
property subtree:
@property
def subtree(self):
"""A sequence containing the token and all the token's syntactic
descendants.
@ -547,30 +548,30 @@ cdef class Token:
DOCS: https://spacy.io/api/token#subtree
"""
def __get__(self):
for word in self.lefts:
yield from word.subtree
yield self
for word in self.rights:
yield from word.subtree
for word in self.lefts:
yield from word.subtree
yield self
for word in self.rights:
yield from word.subtree
property left_edge:
@property
def left_edge(self):
"""The leftmost token of this token's syntactic descendents.
RETURNS (Token): The first token such that `self.is_ancestor(token)`.
"""
def __get__(self):
return self.doc[self.c.l_edge]
return self.doc[self.c.l_edge]
property right_edge:
@property
def right_edge(self):
"""The rightmost token of this token's syntactic descendents.
RETURNS (Token): The last token such that `self.is_ancestor(token)`.
"""
def __get__(self):
return self.doc[self.c.r_edge]
return self.doc[self.c.r_edge]
property ancestors:
@property
def ancestors(self):
"""A sequence of this token's syntactic ancestors.
YIELDS (Token): A sequence of ancestor tokens such that
@ -578,15 +579,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#ancestors
"""
def __get__(self):
cdef const TokenC* head_ptr = self.c
# Guard against infinite loop, no token can have
# more ancestors than tokens in the tree.
cdef int i = 0
while head_ptr.head != 0 and i < self.doc.length:
head_ptr += head_ptr.head
yield self.doc[head_ptr - (self.c - self.i)]
i += 1
cdef const TokenC* head_ptr = self.c
# Guard against infinite loop, no token can have
# more ancestors than tokens in the tree.
cdef int i = 0
while head_ptr.head != 0 and i < self.doc.length:
head_ptr += head_ptr.head
yield self.doc[head_ptr - (self.c - self.i)]
i += 1
def is_ancestor(self, descendant):
"""Check whether this token is a parent, grandparent, etc. of another
@ -690,23 +690,31 @@ cdef class Token:
# Set new head
self.c.head = rel_newhead_i
property conjuncts:
@property
def conjuncts(self):
"""A sequence of coordinated tokens, including the token itself.
YIELDS (Token): A coordinated token.
RETURNS (tuple): The coordinated tokens.
DOCS: https://spacy.io/api/token#conjuncts
"""
def __get__(self):
cdef Token word
if "conjuncts" in self.doc.user_token_hooks:
yield from self.doc.user_token_hooks["conjuncts"](self)
cdef Token word, child
if "conjuncts" in self.doc.user_token_hooks:
return tuple(self.doc.user_token_hooks["conjuncts"](self))
start = self
while start.i != start.head.i:
if start.dep == conj:
start = start.head
else:
if self.dep != conj:
for word in self.rights:
if word.dep == conj:
yield word
yield from word.conjuncts
break
queue = [start]
output = [start]
for word in queue:
for child in word.rights:
if child.c.dep == conj:
output.append(child)
queue.append(child)
return tuple([w for w in output if w.i != self.i])
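`Token.conjuncts` is now computed with a breadth-first walk over `conj` children, starting from the head of the coordination chain, and it returns a tuple instead of yielding. A sketch of the new behaviour, assuming the `en_core_web_sm` model is installed (the exact conjuncts depend on the parse):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
doc = nlp(u"I like apples and oranges")
apples = doc[2]
# A tuple of the coordinated tokens, excluding the token itself
assert isinstance(apples.conjuncts, tuple)
print([t.text for t in apples.conjuncts])  # e.g. ['oranges']
```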
property ent_type:
"""RETURNS (uint64): Named entity type."""
@ -716,15 +724,6 @@ cdef class Token:
def __set__(self, ent_type):
self.c.ent_type = ent_type
property ent_iob:
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
is assigned.
RETURNS (uint64): IOB code of named entity tag.
"""
def __get__(self):
return self.c.ent_iob
property ent_type_:
"""RETURNS (unicode): Named entity type."""
def __get__(self):
@ -733,16 +732,25 @@ cdef class Token:
def __set__(self, ent_type):
self.c.ent_type = self.vocab.strings.add(ent_type)
property ent_iob_:
@property
def ent_iob(self):
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
is assigned.
RETURNS (uint64): IOB code of named entity tag.
"""
return self.c.ent_iob
@property
def ent_iob_(self):
"""IOB code of named entity tag. "B" means the token begins an entity,
"I" means it is inside an entity, "O" means it is outside an entity,
and "" means no entity tag is set.
RETURNS (unicode): IOB code of named entity tag.
"""
def __get__(self):
iob_strings = ("", "I", "O", "B")
return iob_strings[self.c.ent_iob]
iob_strings = ("", "I", "O", "B")
return iob_strings[self.c.ent_iob]
property ent_id:
"""RETURNS (uint64): ID of the entity the token is an instance of,
@ -764,26 +772,25 @@ cdef class Token:
def __set__(self, name):
self.c.ent_id = self.vocab.strings.add(name)
property whitespace_:
"""RETURNS (unicode): The trailing whitespace character, if present.
"""
def __get__(self):
return " " if self.c.spacy else ""
@property
def whitespace_(self):
"""RETURNS (unicode): The trailing whitespace character, if present."""
return " " if self.c.spacy else ""
property orth_:
@property
def orth_(self):
"""RETURNS (unicode): Verbatim text content (identical to
`Token.text`). Exists mostly for consistency with the other
attributes.
"""
def __get__(self):
return self.vocab.strings[self.c.lex.orth]
return self.vocab.strings[self.c.lex.orth]
property lower_:
@property
def lower_(self):
"""RETURNS (unicode): The lowercase token text. Equivalent to
`Token.text.lower()`.
"""
def __get__(self):
return self.vocab.strings[self.c.lex.lower]
return self.vocab.strings[self.c.lex.lower]
property norm_:
"""RETURNS (unicode): The token's norm, i.e. a normalised form of the
@ -796,33 +803,33 @@ cdef class Token:
def __set__(self, unicode norm_):
self.c.norm = self.vocab.strings.add(norm_)
property shape_:
@property
def shape_(self):
"""RETURNS (unicode): Transform of the tokens's string, to show
orthographic features. For example, "Xxxx" or "dd".
"""
def __get__(self):
return self.vocab.strings[self.c.lex.shape]
return self.vocab.strings[self.c.lex.shape]
property prefix_:
@property
def prefix_(self):
"""RETURNS (unicode): A length-N substring from the start of the token.
Defaults to `N=1`.
"""
def __get__(self):
return self.vocab.strings[self.c.lex.prefix]
return self.vocab.strings[self.c.lex.prefix]
property suffix_:
@property
def suffix_(self):
"""RETURNS (unicode): A length-N substring from the end of the token.
Defaults to `N=3`.
"""
def __get__(self):
return self.vocab.strings[self.c.lex.suffix]
return self.vocab.strings[self.c.lex.suffix]
property lang_:
@property
def lang_(self):
"""RETURNS (unicode): Language of the parent document's vocabulary,
e.g. 'en'.
"""
def __get__(self):
return self.vocab.strings[self.c.lex.lang]
return self.vocab.strings[self.c.lex.lang]
property lemma_:
"""RETURNS (unicode): The token lemma, i.e. the base form of the word,
@ -861,110 +868,110 @@ cdef class Token:
def __set__(self, unicode label):
self.c.dep = self.vocab.strings.add(label)
property is_oov:
@property
def is_oov(self):
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
property is_stop:
@property
def is_stop(self):
"""RETURNS (bool): Whether the token is a stop word, i.e. part of a
"stop list" defined by the language data.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
property is_alpha:
@property
def is_alpha(self):
"""RETURNS (bool): Whether the token consists of alpha characters.
Equivalent to `token.text.isalpha()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
property is_ascii:
@property
def is_ascii(self):
"""RETURNS (bool): Whether the token consists of ASCII characters.
Equivalent to `all(ord(c) < 128 for c in token.text)`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
property is_digit:
@property
def is_digit(self):
"""RETURNS (bool): Whether the token consists of digits. Equivalent to
`token.text.isdigit()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
property is_lower:
@property
def is_lower(self):
"""RETURNS (bool): Whether the token is in lowercase. Equivalent to
`token.text.islower()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
property is_upper:
@property
def is_upper(self):
"""RETURNS (bool): Whether the token is in uppercase. Equivalent to
`token.text.isupper()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
property is_title:
@property
def is_title(self):
"""RETURNS (bool): Whether the token is in titlecase. Equivalent to
`token.text.istitle()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
property is_punct:
@property
def is_punct(self):
"""RETURNS (bool): Whether the token is punctuation."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
property is_space:
@property
def is_space(self):
"""RETURNS (bool): Whether the token consists of whitespace characters.
Equivalent to `token.text.isspace()`.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
property is_bracket:
@property
def is_bracket(self):
"""RETURNS (bool): Whether the token is a bracket."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
property is_quote:
@property
def is_quote(self):
"""RETURNS (bool): Whether the token is a quotation mark."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
property is_left_punct:
@property
def is_left_punct(self):
"""RETURNS (bool): Whether the token is a left punctuation mark."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
property is_right_punct:
@property
def is_right_punct(self):
"""RETURNS (bool): Whether the token is a right punctuation mark."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
property is_currency:
@property
def is_currency(self):
"""RETURNS (bool): Whether the token is a currency symbol."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
property like_url:
@property
def like_url(self):
"""RETURNS (bool): Whether the token resembles a URL."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
property like_num:
@property
def like_num(self):
"""RETURNS (bool): Whether the token resembles a number, e.g. "10.9",
"10", "ten", etc.
"""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
property like_email:
@property
def like_email(self):
"""RETURNS (bool): Whether the token resembles an email address."""
def __get__(self):
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)


@ -2,11 +2,13 @@
from __future__ import unicode_literals
import functools
import copy
from ..errors import Errors
class Underscore(object):
mutable_types = (dict, list, set)
doc_extensions = {}
span_extensions = {}
token_extensions = {}
@ -32,7 +34,15 @@ class Underscore(object):
elif method is not None:
return functools.partial(method, self._obj)
else:
return self._doc.user_data.get(self._get_key(name), default)
key = self._get_key(name)
if key in self._doc.user_data:
return self._doc.user_data[key]
elif isinstance(default, self.mutable_types):
# Handle mutable default arguments (see #2581)
new_default = copy.copy(default)
self.__setattr__(name, new_default)
return new_default
return default
def __setattr__(self, name, value):
if name not in self._extensions:
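With this change, a mutable default such as a list or dict is copied the first time it is read, so extensions no longer share state across documents (see #2581). A minimal sketch:

```python
from spacy.lang.en import English
from spacy.tokens import Doc

Doc.set_extension("tags", default=[])
nlp = English()
doc1 = nlp(u"one")
doc2 = nlp(u"two")
doc1._.tags.append("a")   # mutates doc1's own copy of the default
assert doc2._.tags == []  # doc2 still sees an untouched default
```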


@ -25,7 +25,7 @@ except ImportError:
from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
from .compat import import_file
from .errors import Errors
from .errors import Errors, Warnings, deprecation_warning
LANGUAGES = {}
@ -38,6 +38,18 @@ def set_env_log(value):
_PRINT_ENV = value
def lang_class_is_loaded(lang):
"""Check whether a Language class is already loaded. Language classes are
loaded lazily, to avoid expensive setup code associated with the language
data.
lang (unicode): Two-letter language code, e.g. 'en'.
RETURNS (bool): Whether a Language class has been loaded.
"""
global LANGUAGES
return lang in LANGUAGES
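A quick sketch of the helper's behaviour in a fresh interpreter, assuming no Language class has been requested yet:

```python
from spacy.util import get_lang_class, lang_class_is_loaded

assert lang_class_is_loaded("en") is False  # nothing imported yet
get_lang_class("en")                        # imports and caches spacy.lang.en.English
assert lang_class_is_loaded("en") is True
```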
def get_lang_class(lang):
"""Import and load a Language class.
@ -565,7 +577,8 @@ def itershuffle(iterable, bufsize=1000):
def to_bytes(getters, exclude):
serialized = OrderedDict()
for key, getter in getters.items():
if key not in exclude:
# Split to support file names like meta.json
if key.split(".")[0] not in exclude:
serialized[key] = getter()
return srsly.msgpack_dumps(serialized)
@ -573,7 +586,8 @@ def to_bytes(getters, exclude):
def from_bytes(bytes_data, setters, exclude):
msg = srsly.msgpack_loads(bytes_data)
for key, setter in setters.items():
if key not in exclude and key in msg:
# Split to support file names like meta.json
if key.split(".")[0] not in exclude and key in msg:
setter(msg[key])
return msg
@ -583,7 +597,8 @@ def to_disk(path, writers, exclude):
if not path.exists():
path.mkdir()
for key, writer in writers.items():
if key not in exclude:
# Split to support file names like meta.json
if key.split(".")[0] not in exclude:
writer(path / key)
return path
@ -591,7 +606,8 @@ def to_disk(path, writers, exclude):
def from_disk(path, readers, exclude):
path = ensure_path(path)
for key, reader in readers.items():
if key not in exclude:
# Split to support file names like meta.json
if key.split(".")[0] not in exclude:
reader(path / key)
return path
@ -677,6 +693,23 @@ def validate_json(data, validator):
return errors
def get_serialization_exclude(serializers, exclude, kwargs):
"""Helper function to validate serialization args and manage transition from
keyword arguments (pre v2.1) to exclude argument.
"""
exclude = list(exclude)
# Split to support file names like meta.json
options = [name.split(".")[0] for name in serializers]
for key, value in kwargs.items():
if key in ("vocab",) and value is False:
deprecation_warning(Warnings.W015.format(arg=key))
exclude.append(key)
elif key.split(".")[0] in options:
raise ValueError(Errors.E128.format(arg=key))
# TODO: user warning?
return exclude
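A sketch of how the helper maps the old keyword style onto the new list-based `exclude`. It is an internal function, shown here only to illustrate the transition; the `vocab=False` entry triggers a deprecation warning:

```python
from spacy.util import get_serialization_exclude

serializers = {"vocab": None, "meta.json": None, "tensor": None}
# Old-style `vocab=False` becomes an entry in the exclude list
exclude = get_serialization_exclude(serializers, ["tensor"], {"vocab": False})
assert set(exclude) == {"tensor", "vocab"}
```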
class SimpleFrozenDict(dict):
"""Simplified implementation of a frozen dict, mainly used as default
function or method argument (for arguments that should default to empty
@ -696,14 +729,14 @@ class SimpleFrozenDict(dict):
class DummyTokenizer(object):
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
# allow serialization (see #1557)
def to_bytes(self, **exclude):
def to_bytes(self, **kwargs):
return b""
def from_bytes(self, _bytes_data, **exclude):
def from_bytes(self, _bytes_data, **kwargs):
return self
def to_disk(self, _path, **exclude):
def to_disk(self, _path, **kwargs):
return None
def from_disk(self, _path, **exclude):
def from_disk(self, _path, **kwargs):
return self


@ -377,11 +377,11 @@ cdef class Vectors:
self.add(key, row=i)
return strings
def to_disk(self, path, **exclude):
def to_disk(self, path, **kwargs):
"""Save the current state to a directory.
path (unicode / Path): A path to a directory, which will be created if
it doesn't exist. Either a string or a Path-like object.
it doesn't exist.
DOCS: https://spacy.io/api/vectors#to_disk
"""
@ -394,9 +394,9 @@ cdef class Vectors:
("vectors", lambda p: save_array(self.data, p.open("wb"))),
("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
))
return util.to_disk(path, serializers, exclude)
return util.to_disk(path, serializers, [])
def from_disk(self, path, **exclude):
def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and
returns it.
@ -428,13 +428,13 @@ cdef class Vectors:
("keys", load_keys),
("vectors", load_vectors),
))
util.from_disk(path, serializers, exclude)
util.from_disk(path, serializers, [])
return self
def to_bytes(self, **exclude):
def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Vectors` object.
DOCS: https://spacy.io/api/vectors#to_bytes
@ -444,17 +444,18 @@ cdef class Vectors:
return self.data.to_bytes()
else:
return srsly.msgpack_dumps(self.data)
serializers = OrderedDict((
("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
("vectors", serialize_weights)
))
return util.to_bytes(serializers, exclude)
return util.to_bytes(serializers, [])
def from_bytes(self, data, **exclude):
def from_bytes(self, data, **kwargs):
"""Load state from a binary string.
data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded.
exclude (list): String names of serialization fields to exclude.
RETURNS (Vectors): The `Vectors` object.
DOCS: https://spacy.io/api/vectors#from_bytes
@ -469,5 +470,5 @@ cdef class Vectors:
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
("vectors", deserialize_weights)
))
util.from_bytes(data, deserializers, exclude)
util.from_bytes(data, deserializers, [])
return self
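Since `Vectors` now passes an empty exclude list to the serialization helpers, its own fields are always written. A minimal sketch of the byte round trip:

```python
import numpy
from spacy.vectors import Vectors

data = numpy.zeros((2, 4), dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog"])
# Round-trip through bytes; key2row and the vector table are restored
vectors2 = Vectors().from_bytes(vectors.to_bytes())
assert vectors2.shape == (2, 4)
assert len(vectors2.key2row) == 2
```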


@ -1,6 +1,7 @@
# coding: utf8
# cython: profile=True
from __future__ import unicode_literals
from libc.string cimport memcpy
import numpy
import srsly
@ -59,12 +60,23 @@ cdef class Vocab:
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
self.vectors = Vectors()
property lang:
@property
def lang(self):
langfunc = None
if self.lex_attr_getters:
langfunc = self.lex_attr_getters.get(LANG, None)
return langfunc("_") if langfunc else ""
property writing_system:
"""A dict with information about the language's writing system. To get
the data, we use the vocab.lang property to fetch the Language class.
If the Language class is not loaded, an empty dict is returned.
"""
def __get__(self):
langfunc = None
if self.lex_attr_getters:
langfunc = self.lex_attr_getters.get(LANG, None)
return langfunc("_") if langfunc else ""
if not util.lang_class_is_loaded(self.lang):
return {}
lang_class = util.get_lang_class(self.lang)
return dict(lang_class.Defaults.writing_system)
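The lookup only consults Language classes that are already loaded, so an unloaded language yields an empty dict. A quick sketch using `spacy.blank`, which goes through `get_lang_class` and therefore registers the class (the keys assume the defaults defined on the base Language class):

```python
import spacy

nlp = spacy.blank("en")              # loads and registers the English class
ws = nlp.vocab.writing_system
assert ws.get("direction") == "ltr"  # from Language.Defaults.writing_system
```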
def __len__(self):
"""The current number of lexemes stored.
@ -396,47 +408,57 @@ cdef class Vocab:
orth = self.strings.add(orth)
return orth in self.vectors
def to_disk(self, path, **exclude):
def to_disk(self, path, exclude=tuple(), **kwargs):
"""Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects.
it doesn't exist.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/vocab#to_disk
"""
path = util.ensure_path(path)
if not path.exists():
path.mkdir()
self.strings.to_disk(path / "strings.json")
with (path / "lexemes.bin").open('wb') as file_:
file_.write(self.lexemes_to_bytes())
if self.vectors is not None:
setters = ["strings", "lexemes", "vectors"]
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
if "strings" not in exclude:
self.strings.to_disk(path / "strings.json")
if "lexemes" not in exclude:
with (path / "lexemes.bin").open("wb") as file_:
file_.write(self.lexemes_to_bytes())
if "vectors" not in "exclude" and self.vectors is not None:
self.vectors.to_disk(path)
def from_disk(self, path, **exclude):
def from_disk(self, path, exclude=tuple(), **kwargs):
"""Loads state from a directory. Modifies the object in place and
returns it.
path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects.
path (unicode or Path): A path to a directory.
exclude (list): String names of serialization fields to exclude.
RETURNS (Vocab): The modified `Vocab` object.
DOCS: https://spacy.io/api/vocab#to_disk
"""
path = util.ensure_path(path)
self.strings.from_disk(path / "strings.json")
with (path / "lexemes.bin").open("rb") as file_:
self.lexemes_from_bytes(file_.read())
if self.vectors is not None:
self.vectors.from_disk(path, exclude="strings.json")
if self.vectors.name is not None:
link_vectors_to_models(self)
getters = ["strings", "lexemes", "vectors"]
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
if "strings" not in exclude:
self.strings.from_disk(path / "strings.json") # TODO: add exclude?
if "lexemes" not in exclude:
with (path / "lexemes.bin").open("rb") as file_:
self.lexemes_from_bytes(file_.read())
if "vectors" not in exclude:
if self.vectors is not None:
self.vectors.from_disk(path, exclude=["strings"])
if self.vectors.name is not None:
link_vectors_to_models(self)
return self
def to_bytes(self, **exclude):
def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Vocab` object.
DOCS: https://spacy.io/api/vocab#to_bytes
@ -452,13 +474,14 @@ cdef class Vocab:
("lexemes", lambda: self.lexemes_to_bytes()),
("vectors", deserialize_vectors)
))
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
return util.to_bytes(getters, exclude)
def from_bytes(self, bytes_data, **exclude):
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load state from a binary string.
bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded.
exclude (list): String names of serialization fields to exclude.
RETURNS (Vocab): The `Vocab` object.
DOCS: https://spacy.io/api/vocab#from_bytes
@ -468,11 +491,13 @@ cdef class Vocab:
return None
else:
return self.vectors.from_bytes(b)
setters = OrderedDict((
("strings", lambda b: self.strings.from_bytes(b)),
("lexemes", lambda b: self.lexemes_from_bytes(b)),
("vectors", lambda b: serialize_vectors(b))
))
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None:
link_vectors_to_models(self)
@ -518,7 +543,10 @@ cdef class Vocab:
for j in range(sizeof(lex_data.data)):
lex_data.data[j] = bytes_ptr[i+j]
Lexeme.c_from_bytes(lexeme, lex_data)
prev_entry = self._by_orth.get(lexeme.orth)
if prev_entry != NULL:
memcpy(prev_entry, lexeme, sizeof(LexemeC))
continue
ptr = self.strings._map.get(lexeme.orth)
if ptr == NULL:
continue

27
website/.eslintrc Normal file

@ -0,0 +1,27 @@
{
"extends": ["standard", "prettier"],
"plugins": ["standard", "react", "react-hooks"],
"rules": {
"no-var": "error",
"no-unused-vars": 1,
"arrow-spacing": ["error", { "before": true, "after": true }],
"indent": ["error", 4],
"semi": ["error", "never"],
"arrow-parens": ["error", "as-needed"],
"standard/object-curly-even-spacing": ["error", "either"],
"standard/array-bracket-even-spacing": ["error", "either"],
"standard/computed-property-even-spacing": ["error", "even"],
"standard/no-callback-literal": ["error", ["cb", "callback"]],
"react/jsx-uses-react": "error",
"react/jsx-uses-vars": "error",
"react-hooks/rules-of-hooks": "error",
"react-hooks/exhaustive-deps": "warn"
},
"parser": "babel-eslint",
"parserOptions": {
"ecmaVersion": 8
},
"env": {
"browser": true
}
}


@ -78,7 +78,7 @@ assigned by spaCy's [models](/models). The individual mapping is specific to the
training corpus and can be defined in the respective language data's
[`tag_map.py`](/usage/adding-languages#tag-map).
<Accordion title="Universal Part-of-speech Tags">
<Accordion title="Universal Part-of-speech Tags" id="pos-universal">
spaCy also maps all language-specific part-of-speech tags to a small, fixed set
of word type tags following the
@ -269,7 +269,7 @@ This section lists the syntactic dependency labels assigned by spaCy's
[models](/models). The individual labels are language-specific and depend on the
training corpus.
<Accordion title="Universal Dependency Labels">
<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
used in all languages trained on Universal Dependency Corpora.


@ -244,9 +244,10 @@ Serialize the pipe to disk.
> parser.to_disk("/path/to/parser")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## DependencyParser.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ------------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
## DependencyParser.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring.
| Name | Type | Description |
| ----------- | ----- | ----------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
## DependencyParser.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> parser.from_bytes(parser_bytes)
> ```
| Name | Type | Description |
| ------------ | ------------------ | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
| Name | Type | Description |
| ------------ | ------------------ | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
## DependencyParser.labels {#labels tag="property"}
@ -312,3 +314,21 @@ The labels currently added to the component.
| Name | Type | Description |
| ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = parser.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
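For example, to leave the shared vocab out of the byte representation (a sketch, assuming the `en_core_web_sm` model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
parser = nlp.get_pipe("parser")
# Serialize the pipe without the shared vocab
parser_bytes = parser.to_bytes(exclude=["vocab"])
```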


@ -237,7 +237,7 @@ attribute ID.
> from spacy.attrs import ORTH
> doc = nlp(u"apple apple orange banana")
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
> doc.to_array([attrs.ORTH])
> doc.to_array([ORTH])
> # array([[11880], [11880], [7561], [12800]])
> ```
@ -349,11 +349,12 @@ array of attributes.
> assert doc[0].pos_ == doc2[0].pos_
> ```
| Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------- |
| `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
| **RETURNS** | `Doc` | Itself. |
| Name | Type | Description |
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
| `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | Itself. |
## Doc.to_disk {#to_disk tag="method" new="2"}
@ -365,9 +366,10 @@ Save the current state to a directory.
> doc.to_disk("/path/to/doc")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Doc.from_disk {#from_disk tag="method" new="2"}
@ -384,6 +386,7 @@ Loads state from a directory. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | The modified `Doc` object. |
## Doc.to_bytes {#to_bytes tag="method"}
@ -397,9 +400,10 @@ Serialize, i.e. export the document contents to a binary string.
> doc_bytes = doc.to_bytes()
> ```
| Name | Type | Description |
| ----------- | ----- | --------------------------------------------------------------------- |
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
## Doc.from_bytes {#from_bytes tag="method"}
@ -416,10 +420,11 @@ Deserialize, i.e. import the document contents from a binary string.
> assert doc.text == doc2.text
> ```
| Name | Type | Description |
| ----------- | ----- | ------------------------ |
| `data` | bytes | The string to load from. |
| **RETURNS** | `Doc` | The `Doc` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `data` | bytes | The string to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | The `Doc` object. |
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
@ -640,20 +645,45 @@ The L2 norm of the document's vector representation.
## Attributes {#attributes}
| Name | Type | Description |
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `text` | unicode | A unicode representation of the document text. |
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
| Name | Type | Description |
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `text` | unicode | A unicode representation of the document text. |
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. |
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `sentiment` | float | The document's positivity/negativity score, if available. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
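
For illustration, the new language attributes can be checked on any processed `Doc`. A minimal sketch, assuming an English model such as `en_core_web_sm` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any installed English model works
doc = nlp(u"This is a sentence about Facebook.")

assert doc.lang_ == "en"                     # language of the document's vocabulary
assert doc.lang == nlp.vocab.strings[u"en"]  # the same value as an integer ID
assert doc.is_nered                          # the NER has set entity annotations
```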
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = doc.to_bytes(exclude=["text", "tensor"])
> doc.from_disk("./doc.bin", exclude=["user_data"])
> ```
| Name | Description |
| ------------------ | --------------------------------------------- |
| `text` | The value of the `Doc.text` attribute. |
| `sentiment` | The value of the `Doc.sentiment` attribute. |
| `tensor` | The value of the `Doc.tensor` attribute. |
| `user_data` | The value of the `Doc.user_data` dictionary. |
| `user_data_keys` | The keys of the `Doc.user_data` dictionary. |
| `user_data_values` | The values of the `Doc.user_data` dictionary. |
View File
@ -244,9 +244,10 @@ Serialize the pipe to disk.
> ner.to_disk("/path/to/ner")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## EntityRecognizer.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ------------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring.
| Name | Type | Description |
| ----------- | ----- | ----------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> ner.from_bytes(ner_bytes)
> ```
| Name | Type | Description |
| ------------ | ------------------ | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
| Name | Type | Description |
| ------------ | ------------------ | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
## EntityRecognizer.labels {#labels tag="property"}
@ -312,3 +314,21 @@ The labels currently added to the component.
| Name | Type | Description |
| ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> ner.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
View File
@ -91,13 +91,14 @@ multiprocessing.
> assert doc.is_parsed
> ```
| Name | Type | Description |
| ------------ | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | - | A sequence of unicode objects. |
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
| `batch_size` | int | The number of texts to buffer. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
| Name | Type | Description |
| -------------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | - | A sequence of unicode objects. |
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
| `batch_size` | int | The number of texts to buffer. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
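
A minimal sketch of the new `component_cfg` argument; the model name and the `"beam_width"` setting are illustrative assumptions rather than guaranteed keys:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = [u"This is the first text.", u"And this is the second one."]

# settings are keyed by component name and passed through to that component
for doc in nlp.pipe(texts, batch_size=50, component_cfg={"parser": {"beam_width": 8}}):
    assert doc.is_parsed
```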
## Language.update {#update tag="method"}
@ -112,13 +113,14 @@ Update the models in the pipeline.
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
| `drop` | float | The dropout rate. |
| `sgd` | callable | An optimizer. |
| **RETURNS** | dict | Results from the update. |
| Name | Type | Description |
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
| `drop` | float | The dropout rate. |
| `sgd` | callable | An optimizer. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | dict | Results from the update. |
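
A rough sketch of passing per-component settings during training; whether a given key is accepted depends on that component's own `update()` signature, so the `"drop"` override below is an assumption:

```python
import spacy

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
ner.add_label("ORG")
nlp.add_pipe(ner)
optimizer = nlp.begin_training()

nlp.update([u"Uber blew through $1 million"],
           [{"entities": [(0, 4, "ORG")]}],
           drop=0.35, sgd=optimizer,
           component_cfg={"ner": {"drop": 0.2}})  # assumed per-component override
```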
## Language.begin_training {#begin_training tag="method"}
@ -130,11 +132,12 @@ Allocate models, pre-process training data and acquire an optimizer.
> optimizer = nlp.begin_training(gold_tuples)
> ```
| Name | Type | Description |
| ------------- | -------- | ---------------------------- |
| `gold_tuples` | iterable | Gold-standard training data. |
| `**cfg` | - | Config parameters. |
| **RETURNS** | callable | An optimizer. |
| Name | Type | Description |
| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Gold-standard training data. |
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| `**cfg` | - | Config parameters (sent to all components). |
| **RETURNS** | callable | An optimizer. |
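
A sketch of how `component_cfg` differs from `**cfg`: settings passed as `**cfg` go to every component, while `component_cfg` targets a single component by name. The `"beam_width"` key is purely illustrative:

```python
import spacy

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
ner.add_label("ORG")
nlp.add_pipe(ner)

# illustrative only: the setting is stored in the NER's config, not validated here
optimizer = nlp.begin_training(component_cfg={"ner": {"beam_width": 8}})
```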
## Language.use_params {#use_params tag="contextmanager, method"}
@ -327,7 +330,7 @@ the model**.
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved. |
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
## Language.from_disk {#from_disk tag="method" new="2"}
@ -349,22 +352,22 @@ loaded object.
> nlp = English().from_disk("/path/to/en_model")
> ```
| Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | The modified `Language` object. |
| Name | Type | Description |
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Language` | The modified `Language` object. |
<Infobox title="Changed in v2.0" variant="warning">
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
to improve consistency across classes. Pipeline components to prevent from being
loaded can now be added as a list to `disable`, instead of specifying one
keyword argument per component.
loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1),
instead of specifying one keyword argument per component.
```diff
- nlp = spacy.load("en", tagger=False, entity=False)
+ nlp = English().from_disk("/model", disable=["tagger', 'ner"])
+ nlp = English().from_disk("/model", exclude=["tagger", "ner"])
```
</Infobox>
@ -379,10 +382,10 @@ Serialize the current state to a binary string.
> nlp_bytes = nlp.to_bytes()
> ```
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Language` object. |
| Name | Type | Description |
| ----------- | ----- | ----------------------------------------------------------------------------------------- |
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Language` object. |
## Language.from_bytes {#from_bytes tag="method"}
@ -400,20 +403,21 @@ available to the loaded object.
> nlp2.from_bytes(nlp_bytes)
> ```
| Name | Type | Description |
| ------------ | ---------- | --------------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | The `Language` object. |
| Name | Type | Description |
| ------------ | ---------- | ----------------------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Language` | The `Language` object. |
<Infobox title="Changed in v2.0" variant="warning">
Pipeline components to prevent from being loaded can now be added as a list to
`disable`, instead of specifying one keyword argument per component.
`disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument
per component.
```diff
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"])
+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])
```
</Infobox>
@ -437,3 +441,23 @@ Pipeline components to prevent from being loaded can now be added as a list to
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
> nlp.from_disk("./model-data", exclude=["ner"])
> ```
| Name | Description |
| ----------- | -------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `tokenizer` | Tokenization rules and exceptions. |
| `meta` | The meta data, available as `Language.meta`. |
| ... | String names of pipeline components, e.g. `"ner"`. |
View File
@ -316,6 +316,22 @@ taken.
| ----------- | ------- | --------------- |
| **RETURNS** | `Token` | The root token. |
## Span.conjuncts {#conjuncts tag="property" model="parser"}
A tuple of tokens coordinated to `span.root`.
> #### Example
>
> ```python
> doc = nlp(u"I like apples and oranges")
> apples_conjuncts = doc[2:3].conjuncts
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
> ```
| Name | Type | Description |
| ----------- | ------- | ----------------------- |
| **RETURNS** | `tuple` | The coordinated tokens. |
## Span.lefts {#lefts tag="property" model="parser"}
Tokens that are to the left of the span, whose heads are within the span.
View File
@ -151,10 +151,9 @@ Serialize the current state to a binary string.
> store_bytes = stringstore.to_bytes()
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------ |
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
## StringStore.from_bytes {#from_bytes tag="method"}
@ -168,11 +167,10 @@ Load state from a binary string.
> new_store = StringStore().from_bytes(store_bytes)
> ```
| Name | Type | Description |
| ------------ | ------------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `StringStore` | The `StringStore` object. |
| Name | Type | Description |
| ------------ | ------------- | ------------------------- |
| `bytes_data` | bytes | The data to load from. |
| **RETURNS** | `StringStore` | The `StringStore` object. |
## Utilities {#util}
View File
@ -244,9 +244,10 @@ Serialize the pipe to disk.
> tagger.to_disk("/path/to/tagger")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tagger.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
## Tagger.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring.
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
## Tagger.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> tagger.from_bytes(tagger_bytes)
> ```
| Name | Type | Description |
| ------------ | -------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `Tagger` | The `Tagger` object. |
| Name | Type | Description |
| ------------ | -------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tagger` | The `Tagger` object. |
## Tagger.labels {#labels tag="property"}
@ -314,3 +316,22 @@ tags by default, e.g. `VERB`, `NOUN` and so on.
| Name | Type | Description |
| ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> tagger.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| --------- | ------------------------------------------------------------------------------------------ |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tags. |
View File
@ -260,9 +260,10 @@ Serialize the pipe to disk.
> textcat.to_disk("/path/to/textcat")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## TextCategorizer.from_disk {#from_disk tag="method"}
@ -278,6 +279,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ----------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
## TextCategorizer.to_bytes {#to_bytes tag="method"}
@ -291,10 +293,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring.
| Name | Type | Description |
| ----------- | ----- | ---------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
## TextCategorizer.from_bytes {#from_bytes tag="method"}
@ -308,11 +310,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> textcat.from_bytes(textcat_bytes)
> ```
| Name | Type | Description |
| ------------ | ----------------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
| Name | Type | Description |
| ------------ | ----------------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
## TextCategorizer.labels {#labels tag="property"}
@ -328,3 +330,21 @@ The labels currently added to the component.
| Name | Type | Description |
| ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> textcat.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
View File
@ -211,7 +211,7 @@ The rightmost token of this token's syntactic descendants.
## Token.conjuncts {#conjuncts tag="property" model="parser"}
A sequence of coordinated tokens, including the token itself.
A tuple of coordinated tokens, not including the token itself.
> #### Example
>
@ -221,9 +221,9 @@ A sequence of coordinated tokens, including the token itself.
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
> ```
| Name | Type | Description |
| ---------- | ------- | -------------------- |
| **YIELDS** | `Token` | A coordinated token. |
| Name | Type | Description |
| ----------- | ------- | ----------------------- |
| **RETURNS** | `tuple` | The coordinated tokens. |
## Token.children {#children tag="property" model="parser"}
View File
@ -127,9 +127,10 @@ Serialize the tokenizer to disk.
> tokenizer.to_disk("/path/to/tokenizer")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tokenizer.from_disk {#from_disk tag="method"}
@ -145,6 +146,7 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
## Tokenizer.to_bytes {#to_bytes tag="method"}
@ -158,10 +160,10 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
Serialize the tokenizer to a bytestring.
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
## Tokenizer.from_bytes {#from_bytes tag="method"}
@ -176,11 +178,11 @@ it.
> tokenizer.from_bytes(tokenizer_bytes)
> ```
| Name | Type | Description |
| ------------ | ----------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
| Name | Type | Description |
| ------------ | ----------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
## Attributes {#attributes}
@ -190,3 +192,25 @@ it.
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
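
A short sketch of how these attributes are typically customized, following the documented pattern for modifying rule sets; the added suffix rule is just an example:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# add a custom suffix rule on top of the language defaults and plug the
# compiled regex's bound .search method back into the tokenizer
suffixes = nlp.Defaults.suffixes + (r"-+$",)
suffix_re = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_re.search
```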
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
> tokenizer.from_disk("./data", exclude=["token_match"])
> ```
| Name | Description |
| ---------------- | --------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |
View File
@ -351,6 +351,24 @@ the two-letter language code.
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. |
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
Check whether a `Language` class is already loaded. `Language` classes are
loaded lazily, to avoid expensive setup code associated with the language data.
> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```
| Name | Type | Description |
| ----------- | ------- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded. |
### util.load_model {#util.load_model tag="function" new="2"}
Load a model from a shortcut link, package or data path. If called with a
View File
@ -311,10 +311,9 @@ Save the current state to a directory.
>
> ```
| Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `**exclude` | - | Named attributes to prevent from being saved. |
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## Vectors.from_disk {#from_disk tag="method"}
@ -342,10 +341,9 @@ Serialize the current state to a binary string.
> vectors_bytes = vectors.to_bytes()
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------- |
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
## Vectors.from_bytes {#from_bytes tag="method"}
@ -360,11 +358,10 @@ Load state from a binary string.
> new_vectors.from_bytes(vectors_bytes)
> ```
| Name | Type | Description |
| ----------- | --------- | ---------------------------------------------- |
| `data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `Vectors` | The `Vectors` object. |
| Name | Type | Description |
| ----------- | --------- | ---------------------- |
| `data` | bytes | The data to load from. |
| **RETURNS** | `Vectors` | The `Vectors` object. |
## Attributes {#attributes}
View File
@ -221,9 +221,10 @@ Save the current state to a directory.
> nlp.vocab.to_disk("/path/to/vocab")
> ```
| Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Vocab.from_disk {#from_disk tag="method" new="2"}
@ -239,6 +240,7 @@ Loads state from a directory. Modifies the object in place and returns it.
| Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Vocab` | The modified `Vocab` object. |
## Vocab.to_bytes {#to_bytes tag="method"}
@ -251,10 +253,10 @@ Serialize the current state to a binary string.
> vocab_bytes = nlp.vocab.to_bytes()
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. |
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------- |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
## Vocab.from_bytes {#from_bytes tag="method"}
@ -269,11 +271,11 @@ Load state from a binary string.
> vocab.from_bytes(vocab_bytes)
> ```
| Name | Type | Description |
| ------------ | ------- | ---------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. |
| **RETURNS** | `Vocab` | The `Vocab` object. |
| Name | Type | Description |
| ------------ | ------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Vocab` | The `Vocab` object. |
## Attributes {#attributes}
@ -286,8 +288,28 @@ Load state from a binary string.
> assert type(PERSON) == int
> ```
| Name | Type | Description |
| ------------------------------------ | ------------- | --------------------------------------------- |
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
| `vectors_length` | int | Number of dimensions for each word vector. |
| Name | Type | Description |
| --------------------------------------------- | ------------- | ------------------------------------------------------------ |
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
| `vectors_length` | int | Number of dimensions for each word vector. |
| `writing_system` <Tag variant="new">2.1</Tag> | dict | A dict with information about the language's writing system. |
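
For illustration, the new entry can be inspected like any other attribute; the exact keys shown are assumptions based on the English defaults:

```python
import spacy

nlp = spacy.blank("en")
ws = nlp.vocab.writing_system
# expected to look roughly like {"direction": "ltr", "has_case": True, "has_letters": True}
assert "direction" in ws
```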
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = vocab.to_bytes(exclude=["strings", "vectors"])
> vocab.from_disk("./vocab", exclude=["strings"])
> ```
| Name | Description |
| --------- | ----------------------------------------------------- |
| `strings` | The strings in the [`StringStore`](/api/stringstore). |
| `lexemes` | The lexeme data. |
| `vectors` | The word vectors, if available. |
View File
@ -39,9 +39,9 @@ together all components and creating the `Language` subclass for example,
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
[stop_words.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/stop_words.py
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tokenizer_exceptions.py
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[norm_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
[punctuation.py]:
@ -49,12 +49,12 @@ together all components and creating the `Language` subclass for example,
[char_classes.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py
[lex_attrs.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lex_attrs.py
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[lemmatizer.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lemmatizer.py
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
[tag_map.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tag_map.py
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
View File
@ -33,9 +33,22 @@ list containing the component names:
import Accordion from 'components/accordion.js'
<Accordion title="Does the order of pipeline components matter?">
<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
No
In spaCy v2.x, the statistical components like the tagger or parser are
independent and don't share any data between themselves. For example, the named
entity recognizer doesn't use any features set by the tagger and parser, and so
on. This means that you can swap them, or remove single components from the
pipeline without affecting the others.
However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions.
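
For example, adding the [`EntityRuler`](/api/entityruler) before the statistical recognizer could look like the following sketch; the model name is an assumption:

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": u"Explosion AI"}])
nlp.add_pipe(ruler, before="ner")  # runs before the statistical NER

doc = nlp(u"Explosion AI develops spaCy.")
assert (u"Explosion AI", u"ORG") in [(ent.text, ent.label_) for ent in doc.ents]
```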
</Accordion>
View File
@ -39,7 +39,7 @@ and morphological analysis.
</div>
<Infobox title="Table of Contents">
<Infobox title="Table of Contents" id="toc">
- [Language data 101](#101)
- [The Language subclass](#language-subclass)
@ -105,15 +105,15 @@ to know the language's character set. If the language you're adding uses
non-latin characters, you might need to define the required character classes in
the global
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
For efficiency, spaCy uses hard-coded unicode ranges to define character classes,
the definitions of which can be found on [Wikipedia](https://en.wikipedia.org/wiki/Unicode_block).
If the language requires very specific punctuation
rules, you should consider overwriting the default regular expressions with your
own in the language's `Defaults`.
For efficiency, spaCy uses hard-coded unicode ranges to define character
classes, the definitions of which can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language
requires very specific punctuation rules, you should consider overwriting the
default regular expressions with your own in the language's `Defaults`.
</Infobox>
### Creating a `Language` subclass {#language-subclass}
### Creating a language subclass {#language-subclass}
Language-specific code and resources should be organized into a sub-package of
spaCy, named according to the language's
@ -121,9 +121,9 @@ spaCy, named according to the language's
code and resources specific to Spanish are placed into a directory
`spacy/lang/es`, which can be imported as `spacy.lang.es`.
To get started, you can use our
[templates](https://github.com/explosion/spacy-dev-resources/templates/new_language)
for the most important files. Here's what the class template looks like:
To get started, you can check out the
[existing languages](https://github.com/explosion/spacy/tree/master/spacy/lang).
Here's what the class could look like:
```python
### __init__.py (excerpt)
@ -614,7 +614,7 @@ require models to be trained from labeled examples. The word vectors, word
probabilities and word clusters also require training, although these can be
trained from unlabeled text, which tends to be much easier to collect.
### Creating a vocabulary file
### Creating a vocabulary file {#vocab-file}
spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
instance. The vocabulary caches lexical features. spaCy loads the vocabulary
@ -631,20 +631,20 @@ of using deep learning for NLP with limited labeled data. The vectors are also
useful by themselves – they power the `.similarity` methods in spaCy. For best
results, you should pre-process the text with spaCy before training the Word2vec
model. This ensures your tokenization will match. You can use our
[word vectors training script](https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py),
[word vectors training script](https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py),
which pre-processes the text with your language-specific tokenizer and trains
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
file should consist of one word and vector per line.
```python
https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py
https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py
```
If you don't have a large sample of text available, you can also convert word
vectors produced by a variety of other tools into spaCy's format. See the docs
on [converting word vectors](/usage/vectors-similarity#converting) for details.
### Creating or converting a training corpus
### Creating or converting a training corpus {#training-corpus}
The easiest way to train spaCy's tagger, parser, entity recognizer or text
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
View File
@ -29,7 +29,7 @@ Here's a quick comparison of the functionalities offered by spaCy,
| Entity linking | ❌ | ❌ | ❌ |
| Coreference resolution | ❌ | ❌ | ✅ |
### When should I use what?
### When should I use what? {#comparison-usage}
Natural Language Understanding is an active area of research and development, so
there are many different tools or technologies catering to different use-cases.
View File
@ -28,7 +28,7 @@ import QuickstartInstall from 'widgets/quickstart-install.js'
## Installation instructions {#installation}
### pip
### pip {#pip}
Using pip, spaCy releases are available as source packages and binary wheels (as
of v2.0.13).
@ -58,7 +58,7 @@ source .env/bin/activate
pip install spacy
```
### conda
### conda {#conda}
Thanks to our great community, we've been able to re-add conda support. You can
also install spaCy via `conda-forge`:
@ -194,7 +194,7 @@ official distributions these are:
| Python 3.4 | Visual Studio 2010 |
| Python 3.5+ | Visual Studio 2015 |
### Run tests
### Run tests {#run-tests}
spaCy comes with an
[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests).
@ -418,7 +418,7 @@ either of these, clone your repository again.
</Accordion>
## Changelog
## Changelog {#changelog}
import Changelog from 'widgets/changelog.js'
View File
@ -298,9 +298,9 @@ different languages, see the
The best way to understand spaCy's dependency parser is interactively. To make
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
or a list of `Doc` objects to displaCy and run
[`displacy.serve`](top-level#displacy.serve) to run the web server, or
[`displacy.render`](top-level#displacy.render) to generate the raw markup. If
you want to know how to write rules that hook into some type of syntactic
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
If you want to know how to write rules that hook into some type of syntactic
construction, just plug the sentence into the visualizer and see how spaCy
annotates it.
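
A minimal sketch of that workflow; the model name is an assumption, and any installed model with a parser works:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

# displacy.render(doc, style="dep") would return the raw markup instead
displacy.serve(doc, style="dep")
```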
@ -621,7 +621,7 @@ For more details on the language-specific data, see the usage guide on
</Infobox>
<Accordion title="Should I change the language data or add custom tokenizer rules?">
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
Tokenization rules that are specific to one language, but can be **generalized
across that language** should ideally live in the language data in
View File
@ -41,7 +41,7 @@ contribute to model development.
> If a model is available for a language, you can download it using the
> [`spacy download`](/api/cli#download) command. In order to use languages that
> don't yet come with a model, you have to import them directly, or use
> [`spacy.blank`](api/top-level#spacy.blank):
> [`spacy.blank`](/api/top-level#spacy.blank):
>
> ```python
> from spacy.lang.fi import Finnish
View File
@ -46,7 +46,8 @@ components. spaCy then does the following:
3. Add each pipeline component to the pipeline in order, using
[`add_pipe`](/api/language#add_pipe).
4. Make the **model data** available to the `Language` class by calling
[`from_disk`](language#from_disk) with the path to the model data directory.
[`from_disk`](/api/language#from_disk) with the path to the model data
directory.
So when you call this...
@ -110,7 +111,7 @@ print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
```
### Built-in pipeline components
### Built-in pipeline components {#built-in}
spaCy ships with several built-in pipeline components that are also available in
the `Language.factories`. This means that you can initialize them by calling
@ -426,7 +427,7 @@ spaCy, and implement your own models trained with other machine learning
libraries. It also lets you take advantage of spaCy's data structures and the
`Doc` object as the "single source of truth".
<Accordion title="Why ._ and not just a top-level attribute?">
<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
@ -437,7 +438,7 @@ immediately know what's built-in and what's custom for example,
</Accordion>
<Accordion title="How is the ._ implemented?">
<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
Extension definitions – the defaults, methods, getters and setters you pass in
to `set_extension` – are stored in class attributes on the `Underscore` class.
@ -458,9 +459,7 @@ There are three main types of extensions, which can be defined using the
1. **Attribute extensions.** Set a default value for an attribute, which can be
overwritten manually at any time. Attribute extensions work like "normal"
variables and are the quickest way to store arbitrary information on a `Doc`,
`Span` or `Token`. Attribute defaults behaves just like argument defaults
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments),
and should not be used for mutable values like dictionaries or lists.
`Span` or `Token`.
```python
Doc.set_extension("hello", default=True)
@ -527,25 +526,6 @@ Once you've registered your custom attribute, you can also use the built-in
especially useful if you want to pass in a string instead of calling
`doc._.my_attr`.
<Infobox title="Using mutable default values" variant="danger">
When using **mutable values** like dictionaries or lists as the `default`
argument, keep in mind that they behave just like mutable default arguments
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments).
This can easily cause unintended results, like the same value being set on _all_
objects instead of only one particular instance. In most cases, it's better to
use **getters and setters**, and only set the `default` for boolean or string
values.
```diff
+ Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)
- Doc.set_extension('fruits', default={})
- doc._.fruits['apple'] = u'🍎' # all docs now have {'apple': u'🍎'}
```
</Infobox>
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
This example shows the implementation of a pipeline component that fetches
View File
@ -15,7 +15,7 @@ their relationships. This means you can easily access and analyze the
surrounding tokens, merge spans into single tokens or add entries to the named
entities in `doc.ents`.
<Accordion title="Should I use rules or train a model?">
<Accordion title="Should I use rules or train a model?" id="rules-vs-model">
For complex tasks, it's usually better to train a statistical entity recognition
model. However, statistical models require training data, so for many
@ -41,7 +41,7 @@ on [rule-based entity recognition](#entityruler).
</Accordion>
<Accordion title="When should I use the token matcher vs. the phrase matcher?">
<Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">
The `PhraseMatcher` is useful if you already have a large terminology list or
gazetteer consisting of single or multi-token phrases that you want to find
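A minimal sketch of that use case, with a made-up two-entry "gazetteer" standing in for a large terminology list:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
# Stand-in for a large terminology list or gazetteer.
terms = [u"machine learning", u"natural language processing"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("TERMS", None, *patterns)

doc = nlp(u"I like natural language processing and machine learning.")
matches = [(doc[start:end].text, match_id) for match_id, start, end in matcher(doc)]
print(matches)
```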

@ -22,7 +22,7 @@ the changes, see [this table](/usage/v2#incompat) and the notes on
</Infobox>
### Serializing the pipeline
### Serializing the pipeline {#pipeline}
When serializing the pipeline, keep in mind that this will only save out the
**binary data for the individual components** to allow spaCy to restore them
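A rough sketch of what that means in practice: the byte data only covers the individual components, so the pipeline has to be re-created before the data can be loaded back in (the `sentencizer` pipe here is just a stand-in for whatever components your pipeline contains):

```python
import spacy
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))

# to_bytes() only saves out the binary data of the individual components...
bytes_data = nlp.to_bytes()

# ...so to restore the pipeline, re-create it first, then load the data in.
nlp2 = spacy.blank("en")
for pipe_name in nlp.pipe_names:
    nlp2.add_pipe(nlp2.create_pipe(pipe_name))
nlp2.from_bytes(bytes_data)
```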
@ -361,7 +361,7 @@ In theory, the entry point mechanism also lets you overwrite built-in factories
including the tokenizer. By default, spaCy will output a warning in these
cases, to prevent accidental overwrites and unintended results.
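For reference, a hedged sketch of what such an entry point registration might look like in a package's `setup.py`, assuming the `spacy_factories` entry point group and a made-up `my_package.create_component` factory function:

```python
from setuptools import setup

setup(
    name="my-spacy-component",
    packages=["my_package"],
    # "spacy_factories" is the entry point group assumed here; "my_component"
    # becomes the factory name spaCy looks up, and my_package.create_component
    # is a hypothetical function that returns the pipeline component.
    entry_points={
        "spacy_factories": ["my_component = my_package:create_component"]
    },
)
```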
#### Advanced components with settings
#### Advanced components with settings {#advanced-cfg}
The `**cfg` keyword arguments that the factory receives are passed down all the
way from `spacy.load`. This means that the factory can respond to custom

@ -50,7 +50,7 @@ systems, or to pre-process text for **deep learning**.
</div>
<Infobox title="Table of contents">
<Infobox title="Table of contents" id="toc">
- [Features](#features)
- [Linguistic annotations](#annotations)

@ -14,7 +14,7 @@ faster runtime, and many bug fixes, v2.1 also introduces experimental support
for some exciting new NLP innovations. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).
### BERT/ULMFit/Elmo-style pre-training {tag="experimental"}
### BERT/ULMFit/Elmo-style pre-training {#pretraining tag="experimental"}
> #### Example
>
@ -39,7 +39,7 @@ it.
</Infobox>
### Extended match pattern API
### Extended match pattern API {#matcher-api}
> #### Example
>
@ -67,7 +67,7 @@ values.
</Infobox>
### Easy rule-based entity recognition
### Easy rule-based entity recognition {#entity-ruler}
> #### Example
>
@ -91,7 +91,7 @@ flexibility.
</Infobox>
### Phrase matching with other attributes
### Phrase matching with other attributes {#phrasematcher}
> #### Example
>
@ -115,7 +115,7 @@ or `POS` for finding sequences of the same part-of-speech tags.
</Infobox>
### Retokenizer for merging and splitting
### Retokenizer for merging and splitting {#retokenizer}
> #### Example
>
@ -142,7 +142,7 @@ deprecated.
</Infobox>
### Components and languages via entry points
### Components and languages via entry points {#entry-points}
> #### Example
>
@ -169,7 +169,7 @@ is required.
</Infobox>
### Improved documentation
### Improved documentation {#docs}
Although it looks pretty much the same, we've rebuilt the entire documentation
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
@ -237,6 +237,19 @@ if all of your models are up to date, you can run the
+ retokenizer.merge(doc[6:8])
```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes`
now support a single `exclude` argument to provide a list of string names to
exclude. The docs have been updated to list the available serialization fields
for each class. The `disable` argument on the [`Language`](/api/language)
serialization methods has been renamed to `exclude` for consistency.
```diff
- nlp.to_disk("/path", disable=["parser", "ner"])
+ nlp.to_disk("/path", exclude=["parser", "ner"])
- data = nlp.tokenizer.to_bytes(vocab=False)
+ data = nlp.tokenizer.to_bytes(exclude=["vocab"])
```
- For better compatibility with the Universal Dependencies data, the lemmatizer
now preserves capitalization, e.g. for proper nouns. See
[this issue](https://github.com/explosion/spaCy/issues/3256) for details.
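A quick way to check this behavior locally (requires an installed model such as `en_core_web_sm`; the expected capitalization is an assumption based on the note above, not verified here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"London is a big city in the United Kingdom.")
# Based on the note above, the lemma of a proper noun like "London" is
# expected to keep its capitalization instead of being lowercased.
print([(token.text, token.lemma_) for token in doc])
```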

@ -39,7 +39,7 @@ also add your own custom attributes, properties and methods to the `Doc`,
</div>
<Infobox title="Table of Contents">
<Infobox title="Table of Contents" id="toc">
- [Summary](#summary)
- [New features](#features)

@ -75,7 +75,7 @@ arcs.
| `font` | unicode | Font name or font family for all text. | `"Arial"` |
For a list of all available options, see the
[`displacy` API documentation](top-level#displacy_options).
[`displacy` API documentation](/api/top-level#displacy_options).
> #### Options example
>
@ -283,7 +283,7 @@ from pathlib import Path
nlp = spacy.load("en_core_web_sm")
sentences = [u"This is an example.", u"This is another one."]
for sent in sentences:
doc = nlp(sentence)
doc = nlp(sent)
svg = displacy.render(doc, style="dep")
file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
output_path = Path("/images/" + file_name)

@ -23,6 +23,11 @@
"list": "89ad33e698"
},
"docSearch": {
"apiKey": "f7dbcd148fae73db20b6ad33d03cc9e8",
"indexName": "dev_spacy_netlify",
"appId": "Y7BGGRAPHC"
},
"_docSearch": {
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
"indexName": "spacy"
},

@ -524,6 +524,22 @@
},
"category": ["standalone", "research"]
},
{
"id": "scispacy",
"title": "scispaCy",
"slogan": "A full spaCy pipeline and models for scientific/biomedical documents",
"github": "allenai/scispacy",
"pip": "scispacy",
"thumb": "https://i.imgur.com/dJQSclW.png",
"url": "https://allenai.github.io/scispacy/",
"author": " Allen Institute for Artificial Intelligence",
"author_links": {
"github": "allenai",
"twitter": "allenai_org",
"website": "http://allenai.org"
},
"category": ["models", "research"]
},
{
"id": "textacy",
"slogan": "NLP, before and after spaCy",
@ -851,6 +867,22 @@
},
"category": ["courses"]
},
{
"type": "education",
"id": "datacamp-advanced-nlp",
"title": "Advanced Natural Language Processing with spaCy",
"slogan": "Datacamp, 2019",
"description": "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"url": "https://www.datacamp.com/courses/advanced-nlp-with-spacy",
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["courses"]
},
{
"type": "education",
"id": "learning-path-spacy",
@ -910,6 +942,7 @@
"description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
"soundcloud": "559200912",
"thumb": "https://i.imgur.com/hOBQEzc.jpg",
"url": "https://soundcloud.com/nlp-highlights/78-where-do-corpora-come-from-with-matt-honnibal-and-ines-montani",
"author": "Matt Gardner, Waleed Ammar (Allen AI)",
"author_links": {
"website": "https://soundcloud.com/nlp-highlights"
@ -925,12 +958,28 @@
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
"iframe_height": 200,
"thumb": "https://i.imgur.com/rpo6BuY.png",
"url": "https://www.podcastinit.com/episode-87-spacy-with-matthew-honnibal/",
"author": "Tobias Macey",
"author_links": {
"website": "https://www.podcastinit.com"
},
"category": ["podcasts"]
},
{
"type": "education",
"id": "talk-python-podcast",
"title": "Talk Python 202: Building a software business",
"slogan": "March 2019",
"description": "One core question around open source is how do you fund it? Well, there is always that PayPal donate button. But that's been a tremendous failure for many projects. Often the go-to answer is consulting. But what if you don't want to trade time for money? You could take things up a notch and change the equation, exchanging value for money. That's what Ines Montani and her co-founder did when they started Explosion AI with spaCy as the foundation.",
"thumb": "https://i.imgur.com/q1twuK8.png",
"url": "https://talkpython.fm/episodes/show/202/building-a-software-business",
"soundcloud": "588364857",
"author": "Michael Kennedy",
"author_links": {
"website": "https://talkpython.fm/"
},
"category": ["podcasts"]
},
{
"id": "adam_qas",
"title": "ADAM: Question Answering System",

@ -1833,9 +1833,9 @@
}
},
"acorn": {
"version": "6.1.0",
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.0.tgz",
"integrity": "sha512-MW/FjM+IvU9CgBzjO3UIPCE2pyEwUsoFl+VGdczOPEdxfGFjuKny/gN54mOuX7Qxmb9Rg9MCn2oKiSUeW+pjrw=="
"version": "6.1.1",
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.1.tgz",
"integrity": "sha512-jPTiwtOxaHNaAPg/dmrJ/beuzLRnXtB0kQPQ8JpotKJgTB6rX6c8mlf315941pyjBSaPg8NHXS9fhP4u17DpGA=="
},
"acorn-dynamic-import": {
"version": "3.0.0",
@ -5958,9 +5958,9 @@
"integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ="
},
"eslint": {
"version": "5.14.1",
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.14.1.tgz",
"integrity": "sha512-CyUMbmsjxedx8B0mr79mNOqetvkbij/zrXnFeK2zc3pGRn3/tibjiNAv/3UxFEyfMDjh+ZqTrJrEGBFiGfD5Og==",
"version": "5.15.1",
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.15.1.tgz",
"integrity": "sha512-NTcm6vQ+PTgN3UBsALw5BMhgO6i5EpIjQF/Xb5tIh3sk9QhrFafujUOczGz4J24JBlzWclSB9Vmx8d+9Z6bFCg==",
"requires": {
"@babel/code-frame": "^7.0.0",
"ajv": "^6.9.1",
@ -5968,7 +5968,7 @@
"cross-spawn": "^6.0.5",
"debug": "^4.0.1",
"doctrine": "^3.0.0",
"eslint-scope": "^4.0.0",
"eslint-scope": "^4.0.2",
"eslint-utils": "^1.3.1",
"eslint-visitor-keys": "^1.0.0",
"espree": "^5.0.1",
@ -6001,9 +6001,9 @@
},
"dependencies": {
"ajv": {
"version": "6.9.2",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
"version": "6.10.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
"requires": {
"fast-deep-equal": "^2.0.1",
"fast-json-stable-stringify": "^2.0.0",
@ -6037,9 +6037,9 @@
}
},
"eslint-scope": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.0.tgz",
"integrity": "sha512-1G6UTDi7Jc1ELFwnR58HV4fK9OQK4S6N985f166xqXxpjU6plxFISJa2Ba9KCQuFa8RCnj/lSFJbHo7UFDBnUA==",
"version": "4.0.2",
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.2.tgz",
"integrity": "sha512-5q1+B/ogmHl8+paxtOKx38Z8LtWkVGuNt3+GQNErqwLl6ViNp/gdJGMCjZNxZ8j/VYjDNZ2Fo+eQc1TAVPIzbg==",
"requires": {
"esrecurse": "^4.1.0",
"estraverse": "^4.1.1"
@ -6448,52 +6448,6 @@
}
}
},
"expand-range": {
"version": "1.8.2",
"resolved": "http://registry.npmjs.org/expand-range/-/expand-range-1.8.2.tgz",
"integrity": "sha1-opnv/TNf4nIeuujiV+x5ZE/IUzc=",
"requires": {
"fill-range": "^2.1.0"
},
"dependencies": {
"fill-range": {
"version": "2.2.4",
"resolved": "https://registry.npmjs.org/fill-range/-/fill-range-2.2.4.tgz",
"integrity": "sha512-cnrcCbj01+j2gTG921VZPnHbjmdAf8oQV/iGeV2kZxGSyfYjjTyY79ErsK1WJWMpw6DaApEX72binqJE+/d+5Q==",
"requires": {
"is-number": "^2.1.0",
"isobject": "^2.0.0",
"randomatic": "^3.0.0",
"repeat-element": "^1.1.2",
"repeat-string": "^1.5.2"
}
},
"is-number": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/is-number/-/is-number-2.1.0.tgz",
"integrity": "sha1-Afy7s5NGOlSPL0ZszhbezknbkI8=",
"requires": {
"kind-of": "^3.0.2"
}
},
"isobject": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/isobject/-/isobject-2.1.0.tgz",
"integrity": "sha1-8GVWEJaj8dou9GJy+BXIQNh+DIk=",
"requires": {
"isarray": "1.0.0"
}
},
"kind-of": {
"version": "3.2.2",
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
"requires": {
"is-buffer": "^1.1.5"
}
}
}
},
"expand-template": {
"version": "2.0.3",
"resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz",
@ -6818,11 +6772,6 @@
"resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz",
"integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw=="
},
"filename-regex": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/filename-regex/-/filename-regex-2.0.1.tgz",
"integrity": "sha1-wcS5vuPglyXdsQa3XB4wH+LxiyY="
},
"filename-reserved-regex": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz",
@ -7130,468 +7079,6 @@
"resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz",
"integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8="
},
"fsevents": {
"version": "1.2.4",
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-1.2.4.tgz",
"integrity": "sha512-z8H8/diyk76B7q5wg+Ud0+CqzcAF3mBBI/bA5ne5zrRUUIvNkJY//D3BqyH571KuAC4Nr7Rw7CjWX4r0y9DvNg==",
"optional": true,
"requires": {
"nan": "^2.9.2",
"node-pre-gyp": "^0.10.0"
},
"dependencies": {
"abbrev": {
"version": "1.1.1",
"bundled": true,
"optional": true
},
"ansi-regex": {
"version": "2.1.1",
"bundled": true
},
"aproba": {
"version": "1.2.0",
"bundled": true,
"optional": true
},
"are-we-there-yet": {
"version": "1.1.4",
"bundled": true,
"optional": true,
"requires": {
"delegates": "^1.0.0",
"readable-stream": "^2.0.6"
}
},
"balanced-match": {
"version": "1.0.0",
"bundled": true
},
"brace-expansion": {
"version": "1.1.11",
"bundled": true,
"requires": {
"balanced-match": "^1.0.0",
"concat-map": "0.0.1"
}
},
"chownr": {
"version": "1.0.1",
"bundled": true,
"optional": true
},
"code-point-at": {
"version": "1.1.0",
"bundled": true
},
"concat-map": {
"version": "0.0.1",
"bundled": true
},
"console-control-strings": {
"version": "1.1.0",
"bundled": true
},
"core-util-is": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"debug": {
"version": "2.6.9",
"bundled": true,
"optional": true,
"requires": {
"ms": "2.0.0"
}
},
"deep-extend": {
"version": "0.5.1",
"bundled": true,
"optional": true
},
"delegates": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"detect-libc": {
"version": "1.0.3",
"bundled": true,
"optional": true
},
"fs-minipass": {
"version": "1.2.5",
"bundled": true,
"optional": true,
"requires": {
"minipass": "^2.2.1"
}
},
"fs.realpath": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"gauge": {
"version": "2.7.4",
"bundled": true,
"optional": true,
"requires": {
"aproba": "^1.0.3",
"console-control-strings": "^1.0.0",
"has-unicode": "^2.0.0",
"object-assign": "^4.1.0",
"signal-exit": "^3.0.0",
"string-width": "^1.0.1",
"strip-ansi": "^3.0.1",
"wide-align": "^1.1.0"
}
},
"glob": {
"version": "7.1.2",
"bundled": true,
"optional": true,
"requires": {
"fs.realpath": "^1.0.0",
"inflight": "^1.0.4",
"inherits": "2",
"minimatch": "^3.0.4",
"once": "^1.3.0",
"path-is-absolute": "^1.0.0"
}
},
"has-unicode": {
"version": "2.0.1",
"bundled": true,
"optional": true
},
"iconv-lite": {
"version": "0.4.21",
"bundled": true,
"optional": true,
"requires": {
"safer-buffer": "^2.1.0"
}
},
"ignore-walk": {
"version": "3.0.1",
"bundled": true,
"optional": true,
"requires": {
"minimatch": "^3.0.4"
}
},
"inflight": {
"version": "1.0.6",
"bundled": true,
"optional": true,
"requires": {
"once": "^1.3.0",
"wrappy": "1"
}
},
"inherits": {
"version": "2.0.3",
"bundled": true
},
"ini": {
"version": "1.3.5",
"bundled": true,
"optional": true
},
"is-fullwidth-code-point": {
"version": "1.0.0",
"bundled": true,
"requires": {
"number-is-nan": "^1.0.0"
}
},
"isarray": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"minimatch": {
"version": "3.0.4",
"bundled": true,
"requires": {
"brace-expansion": "^1.1.7"
}
},
"minimist": {
"version": "0.0.8",
"bundled": true
},
"minipass": {
"version": "2.2.4",
"bundled": true,
"requires": {
"safe-buffer": "^5.1.1",
"yallist": "^3.0.0"
}
},
"minizlib": {
"version": "1.1.0",
"bundled": true,
"optional": true,
"requires": {
"minipass": "^2.2.1"
}
},
"mkdirp": {
"version": "0.5.1",
"bundled": true,
"requires": {
"minimist": "0.0.8"
}
},
"ms": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"needle": {
"version": "2.2.0",
"bundled": true,
"optional": true,
"requires": {
"debug": "^2.1.2",
"iconv-lite": "^0.4.4",
"sax": "^1.2.4"
}
},
"node-pre-gyp": {
"version": "0.10.0",
"bundled": true,
"optional": true,
"requires": {
"detect-libc": "^1.0.2",
"mkdirp": "^0.5.1",
"needle": "^2.2.0",
"nopt": "^4.0.1",
"npm-packlist": "^1.1.6",
"npmlog": "^4.0.2",
"rc": "^1.1.7",
"rimraf": "^2.6.1",
"semver": "^5.3.0",
"tar": "^4"
}
},
"nopt": {
"version": "4.0.1",
"bundled": true,
"optional": true,
"requires": {
"abbrev": "1",
"osenv": "^0.1.4"
}
},
"npm-bundled": {
"version": "1.0.3",
"bundled": true,
"optional": true
},
"npm-packlist": {
"version": "1.1.10",
"bundled": true,
"optional": true,
"requires": {
"ignore-walk": "^3.0.1",
"npm-bundled": "^1.0.1"
}
},
"npmlog": {
"version": "4.1.2",
"bundled": true,
"optional": true,
"requires": {
"are-we-there-yet": "~1.1.2",
"console-control-strings": "~1.1.0",
"gauge": "~2.7.3",
"set-blocking": "~2.0.0"
}
},
"number-is-nan": {
"version": "1.0.1",
"bundled": true
},
"object-assign": {
"version": "4.1.1",
"bundled": true,
"optional": true
},
"once": {
"version": "1.4.0",
"bundled": true,
"requires": {
"wrappy": "1"
}
},
"os-homedir": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"os-tmpdir": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"osenv": {
"version": "0.1.5",
"bundled": true,
"optional": true,
"requires": {
"os-homedir": "^1.0.0",
"os-tmpdir": "^1.0.0"
}
},
"path-is-absolute": {
"version": "1.0.1",
"bundled": true,
"optional": true
},
"process-nextick-args": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"rc": {
"version": "1.2.7",
"bundled": true,
"optional": true,
"requires": {
"deep-extend": "^0.5.1",
"ini": "~1.3.0",
"minimist": "^1.2.0",
"strip-json-comments": "~2.0.1"
},
"dependencies": {
"minimist": {
"version": "1.2.0",
"bundled": true,
"optional": true
}
}
},
"readable-stream": {
"version": "2.3.6",
"bundled": true,
"optional": true,
"requires": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
"isarray": "~1.0.0",
"process-nextick-args": "~2.0.0",
"safe-buffer": "~5.1.1",
"string_decoder": "~1.1.1",
"util-deprecate": "~1.0.1"
}
},
"rimraf": {
"version": "2.6.2",
"bundled": true,
"optional": true,
"requires": {
"glob": "^7.0.5"
}
},
"safe-buffer": {
"version": "5.1.1",
"bundled": true
},
"safer-buffer": {
"version": "2.1.2",
"bundled": true,
"optional": true
},
"sax": {
"version": "1.2.4",
"bundled": true,
"optional": true
},
"semver": {
"version": "5.5.0",
"bundled": true,
"optional": true
},
"set-blocking": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"signal-exit": {
"version": "3.0.2",
"bundled": true,
"optional": true
},
"string-width": {
"version": "1.0.2",
"bundled": true,
"requires": {
"code-point-at": "^1.0.0",
"is-fullwidth-code-point": "^1.0.0",
"strip-ansi": "^3.0.0"
}
},
"string_decoder": {
"version": "1.1.1",
"bundled": true,
"optional": true,
"requires": {
"safe-buffer": "~5.1.0"
}
},
"strip-ansi": {
"version": "3.0.1",
"bundled": true,
"requires": {
"ansi-regex": "^2.0.0"
}
},
"strip-json-comments": {
"version": "2.0.1",
"bundled": true,
"optional": true
},
"tar": {
"version": "4.4.1",
"bundled": true,
"optional": true,
"requires": {
"chownr": "^1.0.1",
"fs-minipass": "^1.2.5",
"minipass": "^2.2.4",
"minizlib": "^1.1.0",
"mkdirp": "^0.5.0",
"safe-buffer": "^5.1.1",
"yallist": "^3.0.2"
}
},
"util-deprecate": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"wide-align": {
"version": "1.1.2",
"bundled": true,
"optional": true,
"requires": {
"string-width": "^1.0.2"
}
},
"wrappy": {
"version": "1.0.2",
"bundled": true
},
"yallist": {
"version": "3.0.2",
"bundled": true
}
}
},
"fstream": {
"version": "1.0.11",
"resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz",
@ -8322,14 +7809,14 @@
}
},
"gatsby-source-filesystem": {
"version": "2.0.20",
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.20.tgz",
"integrity": "sha512-nS2hBsqKEQIJ5Yd+g9p++FcsfmvbQmZlBUzx04VPBYZBu2LuLA/ZxQkmdiTNnbDQ18KJw0Zu2PnmUerPnEMqyg==",
"version": "2.0.24",
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.24.tgz",
"integrity": "sha512-KzyHzuXni9hOiZFDgeoH5ABJZqb59fSJNGr2C4U6B1AlGXFMucFK45Fh3V8axtpi833bIbCb9rGmK+tvL4Qb1w==",
"requires": {
"@babel/runtime": "^7.0.0",
"better-queue": "^3.8.7",
"bluebird": "^3.5.0",
"chokidar": "^1.7.0",
"chokidar": "^2.1.2",
"file-type": "^10.2.0",
"fs-extra": "^5.0.0",
"got": "^7.1.0",
@ -8343,83 +7830,6 @@
"xstate": "^3.1.0"
},
"dependencies": {
"anymatch": {
"version": "1.3.2",
"resolved": "https://registry.npmjs.org/anymatch/-/anymatch-1.3.2.tgz",
"integrity": "sha512-0XNayC8lTHQ2OI8aljNCN3sSx6hsr/1+rlcDAotXJR7C1oZZHCNsfpbKwMjRA3Uqb5tF1Rae2oloTr4xpq+WjA==",
"requires": {
"micromatch": "^2.1.5",
"normalize-path": "^2.0.0"
}
},
"arr-diff": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/arr-diff/-/arr-diff-2.0.0.tgz",
"integrity": "sha1-jzuCf5Vai9ZpaX5KQlasPOrjVs8=",
"requires": {
"arr-flatten": "^1.0.1"
}
},
"array-unique": {
"version": "0.2.1",
"resolved": "https://registry.npmjs.org/array-unique/-/array-unique-0.2.1.tgz",
"integrity": "sha1-odl8yvy8JiXMcPrc6zalDFiwGlM="
},
"braces": {
"version": "1.8.5",
"resolved": "https://registry.npmjs.org/braces/-/braces-1.8.5.tgz",
"integrity": "sha1-uneWLhLf+WnWt2cR6RS3N4V79qc=",
"requires": {
"expand-range": "^1.8.1",
"preserve": "^0.2.0",
"repeat-element": "^1.1.2"
}
},
"chokidar": {
"version": "1.7.0",
"resolved": "https://registry.npmjs.org/chokidar/-/chokidar-1.7.0.tgz",
"integrity": "sha1-eY5ol3gVHIB2tLNg5e3SjNortGg=",
"requires": {
"anymatch": "^1.3.0",
"async-each": "^1.0.0",
"fsevents": "^1.0.0",
"glob-parent": "^2.0.0",
"inherits": "^2.0.1",
"is-binary-path": "^1.0.0",
"is-glob": "^2.0.0",
"path-is-absolute": "^1.0.0",
"readdirp": "^2.0.0"
}
},
"expand-brackets": {
"version": "0.1.5",
"resolved": "https://registry.npmjs.org/expand-brackets/-/expand-brackets-0.1.5.tgz",
"integrity": "sha1-3wcoTjQqgHzXM6xa9yQR5YHRF3s=",
"requires": {
"is-posix-bracket": "^0.1.0"
}
},
"extglob": {
"version": "0.3.2",
"resolved": "https://registry.npmjs.org/extglob/-/extglob-0.3.2.tgz",
"integrity": "sha1-Lhj/PS9JqydlzskCPwEdqo2DSaE=",
"requires": {
"is-extglob": "^1.0.0"
}
},
"file-type": {
"version": "10.7.1",
"resolved": "https://registry.npmjs.org/file-type/-/file-type-10.7.1.tgz",
"integrity": "sha512-kUc4EE9q3MH6kx70KumPOvXLZLEJZzY9phEVg/bKWyGZ+OA9KoKZzFR4HS0yDmNv31sJkdf4hbTERIfplF9OxQ=="
},
"glob-parent": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
"requires": {
"is-glob": "^2.0.0"
}
},
"got": {
"version": "7.1.0",
"resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz",
@ -8441,47 +7851,6 @@
"url-to-options": "^1.0.1"
}
},
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
},
"kind-of": {
"version": "3.2.2",
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
"requires": {
"is-buffer": "^1.1.5"
}
},
"micromatch": {
"version": "2.3.11",
"resolved": "https://registry.npmjs.org/micromatch/-/micromatch-2.3.11.tgz",
"integrity": "sha1-hmd8l9FyCzY0MdBNDRUpO9OMFWU=",
"requires": {
"arr-diff": "^2.0.0",
"array-unique": "^0.2.1",
"braces": "^1.8.2",
"expand-brackets": "^0.1.4",
"extglob": "^0.3.1",
"filename-regex": "^2.0.0",
"is-extglob": "^1.0.0",
"is-glob": "^2.0.1",
"kind-of": "^3.0.2",
"normalize-path": "^2.0.1",
"object.omit": "^2.0.0",
"parse-glob": "^3.0.4",
"regex-cache": "^0.4.2"
}
},
"pify": {
"version": "4.0.1",
"resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz",
@ -8493,12 +7862,12 @@
"integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74="
},
"read-chunk": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.0.0.tgz",
"integrity": "sha512-8lBUVPjj9TC5bKLBacB+rpexM03+LWiYbv6ma3BeWmUYXGxqA1WNNgIZHq/iIsCrbFMzPhFbkOqdsyOFRnuoXg==",
"version": "3.1.0",
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.1.0.tgz",
"integrity": "sha512-ZdiZJXXoZYE08SzZvTipHhI+ZW0FpzxmFtLI3vIeMuRN9ySbIZ+SZawKogqJ7dxW9fJ/W73BNtxu4Zu/bZp+Ng==",
"requires": {
"pify": "^4.0.0",
"with-open-file": "^0.1.3"
"pify": "^4.0.1",
"with-open-file": "^0.1.5"
}
}
}
@ -8742,38 +8111,6 @@
"path-is-absolute": "^1.0.0"
}
},
"glob-base": {
"version": "0.3.0",
"resolved": "https://registry.npmjs.org/glob-base/-/glob-base-0.3.0.tgz",
"integrity": "sha1-27Fk9iIbHAscz4Kuoyi0l98Oo8Q=",
"requires": {
"glob-parent": "^2.0.0",
"is-glob": "^2.0.0"
},
"dependencies": {
"glob-parent": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
"requires": {
"is-glob": "^2.0.0"
}
},
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
}
}
},
"glob-parent": {
"version": "3.1.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz",
@ -10110,19 +9447,6 @@
"resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz",
"integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE="
},
"is-dotfile": {
"version": "1.0.3",
"resolved": "https://registry.npmjs.org/is-dotfile/-/is-dotfile-1.0.3.tgz",
"integrity": "sha1-pqLzL/0t+wT1yiXs0Pa4PPeYoeE="
},
"is-equal-shallow": {
"version": "0.1.3",
"resolved": "https://registry.npmjs.org/is-equal-shallow/-/is-equal-shallow-0.1.3.tgz",
"integrity": "sha1-IjgJj8Ih3gvPpdnqxMRdY4qhxTQ=",
"requires": {
"is-primitive": "^2.0.0"
}
},
"is-extendable": {
"version": "0.1.1",
"resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz",
@ -10263,16 +9587,6 @@
"resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz",
"integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84="
},
"is-posix-bracket": {
"version": "0.1.1",
"resolved": "https://registry.npmjs.org/is-posix-bracket/-/is-posix-bracket-0.1.1.tgz",
"integrity": "sha1-MzTceXdDaOkvAW5vvAqI9c1ua8Q="
},
"is-primitive": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/is-primitive/-/is-primitive-2.0.0.tgz",
"integrity": "sha1-IHurkWOEmcB7Kt8kCkGochADRXU="
},
"is-promise": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz",
@ -11162,11 +10476,6 @@
"resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz",
"integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw=="
},
"math-random": {
"version": "1.0.1",
"resolved": "https://registry.npmjs.org/math-random/-/math-random-1.0.1.tgz",
"integrity": "sha1-izqsWIuKZuSXXjzepn97sylgH6w="
},
"md-attr-parser": {
"version": "1.2.1",
"resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz",
@ -12230,15 +11539,6 @@
"es-abstract": "^1.5.1"
}
},
"object.omit": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/object.omit/-/object.omit-2.0.1.tgz",
"integrity": "sha1-Gpx0SCnznbuFjHbKNXmuKlTr0fo=",
"requires": {
"for-own": "^0.1.4",
"is-extendable": "^0.1.1"
}
},
"object.pick": {
"version": "1.3.0",
"resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz",
@ -12579,32 +11879,6 @@
"path-root": "^0.1.1"
}
},
"parse-glob": {
"version": "3.0.4",
"resolved": "https://registry.npmjs.org/parse-glob/-/parse-glob-3.0.4.tgz",
"integrity": "sha1-ssN2z7EfNVE7rdFz7wu246OIORw=",
"requires": {
"glob-base": "^0.3.0",
"is-dotfile": "^1.0.0",
"is-extglob": "^1.0.0",
"is-glob": "^2.0.0"
},
"dependencies": {
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
}
}
},
"parse-headers": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz",
@ -14769,11 +14043,6 @@
"resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz",
"integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw="
},
"preserve": {
"version": "0.2.0",
"resolved": "https://registry.npmjs.org/preserve/-/preserve-0.2.0.tgz",
"integrity": "sha1-gV7R9uvGWSb4ZbMQwHE7yzMVzks="
},
"prettier": {
"version": "1.16.4",
"resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz",
@ -14982,23 +14251,6 @@
"resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz",
"integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU="
},
"randomatic": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/randomatic/-/randomatic-3.1.1.tgz",
"integrity": "sha512-TuDE5KxZ0J461RVjrJZCJc+J+zCkTb1MbH9AQUq68sMhOMcy9jLcb3BrZKgp9q9Ncltdg4QVqWrH02W2EFFVYw==",
"requires": {
"is-number": "^4.0.0",
"kind-of": "^6.0.0",
"math-random": "^1.0.1"
},
"dependencies": {
"is-number": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/is-number/-/is-number-4.0.0.tgz",
"integrity": "sha512-rSklcAIlf1OmFdyAqbnWTLVelsQ58uvZ66S/ZyawjWqIviTWCjg2PzVGw8WUA+nNuPTqb4wgA+NszrJ+08LlgQ=="
}
}
},
"randombytes": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz",
@ -15458,14 +14710,6 @@
"private": "^0.1.6"
}
},
"regex-cache": {
"version": "0.4.4",
"resolved": "https://registry.npmjs.org/regex-cache/-/regex-cache-0.4.4.tgz",
"integrity": "sha512-nVIZwtCjkC9YgvWkpM55B5rBhBYRZhAaJbgcFYXXsHnbZ9UZI9nnVWYZpBlCqv9ho2eZryPnWrZGsOdPwVWXWQ==",
"requires": {
"is-equal-shallow": "^0.1.3"
}
},
"regex-not": {
"version": "1.0.2",
"resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz",
@ -17710,9 +16954,9 @@
},
"dependencies": {
"ajv": {
"version": "6.9.2",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
"version": "6.10.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
"requires": {
"fast-deep-equal": "^2.0.1",
"fast-json-stable-stringify": "^2.0.0",
@ -17721,26 +16965,26 @@
}
},
"ansi-regex": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.0.0.tgz",
"integrity": "sha512-iB5Dda8t/UqpPI/IjsejXu5jOGDrzn41wJyljwPH65VCIbk6+1BzFIMJGFwTNrYXT1CrD+B4l19U7awiQ8rk7w=="
"version": "4.1.0",
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.1.0.tgz",
"integrity": "sha512-1apePfXM1UOSqw0o9IiFAovVz9M5S1Dg+4TrDwfMewQ6p/rmMueb7tWZjQ1rx4Loy1ArBggoqGpfqqdI4rondg=="
},
"string-width": {
"version": "3.0.0",
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.0.0.tgz",
"integrity": "sha512-rr8CUxBbvOZDUvc5lNIJ+OC1nPVpz+Siw9VBtUjB9b6jZehZLFt0JMCZzShFHIsI8cbhm0EsNIfWJMFV3cu3Ew==",
"version": "3.1.0",
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.1.0.tgz",
"integrity": "sha512-vafcv6KjVZKSgz06oM/H6GDBrAtz8vdhQakGjFIvNrHA6y3HCF1CInLy+QLq8dTJPQ1b+KDUqDFctkdRW44e1w==",
"requires": {
"emoji-regex": "^7.0.1",
"is-fullwidth-code-point": "^2.0.0",
"strip-ansi": "^5.0.0"
"strip-ansi": "^5.1.0"
}
},
"strip-ansi": {
"version": "5.0.0",
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.0.0.tgz",
"integrity": "sha512-Uu7gQyZI7J7gn5qLn1Np3G9vcYGTVqB+lFTytnDJv83dd8T22aGH451P3jueT2/QemInJDfxHB5Tde5OzgG1Ow==",
"version": "5.1.0",
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.1.0.tgz",
"integrity": "sha512-TjxrkPONqO2Z8QDCpeE2j6n0M6EwxzyDgzEeGp+FbdvaJAt//ClYi6W5my+3ROlC/hZX2KACUwDfK49Ka5eDvg==",
"requires": {
"ansi-regex": "^4.0.0"
"ansi-regex": "^4.1.0"
}
}
}

@ -12,7 +12,6 @@
"@mdx-js/tag": "^0.17.5",
"@phosphor/widgets": "^1.6.0",
"@rehooks/online-status": "^1.0.0",
"@sindresorhus/slugify": "^0.8.0",
"@svgr/webpack": "^4.1.0",
"autoprefixer": "^9.4.7",
"classnames": "^2.2.6",
@ -35,7 +34,7 @@
"gatsby-remark-prismjs": "^3.2.4",
"gatsby-remark-smartypants": "^2.0.8",
"gatsby-remark-unwrap-images": "^1.0.1",
"gatsby-source-filesystem": "^2.0.20",
"gatsby-source-filesystem": "^2.0.24",
"gatsby-transformer-remark": "^2.2.5",
"gatsby-transformer-sharp": "^2.1.13",
"html-to-react": "^1.3.4",
@ -62,7 +61,8 @@
"md-attr-parser": "^1.2.1",
"prettier": "^1.16.4",
"raw-loader": "^1.0.0",
"unist-util-visit": "^1.4.0"
"unist-util-visit": "^1.4.0",
"@sindresorhus/slugify": "^0.8.0"
},
"repository": {
"type": "git",

@ -1,33 +1,38 @@
import React, { useState } from 'react'
import React, { useState, useEffect } from 'react'
import PropTypes from 'prop-types'
import classNames from 'classnames'
import slugify from '@sindresorhus/slugify'
import Link from './link'
import classes from '../styles/accordion.module.sass'
const Accordion = ({ title, id, expanded, children }) => {
const anchorId = id ? id : slugify(title)
const [isExpanded, setIsExpanded] = useState(expanded)
const [isExpanded, setIsExpanded] = useState(true)
const contentClassNames = classNames(classes.content, {
[classes.hidden]: !isExpanded,
})
const iconClassNames = classNames({
[classes.hidden]: isExpanded,
})
// Make sure accordion is expanded if JS is disabled
useEffect(() => setIsExpanded(expanded), [])
return (
<section id={anchorId}>
<section className="accordion" id={id}>
<div className={classes.root}>
<h3>
<h4>
<button
className={classes.button}
aria-expanded={String(isExpanded)}
onClick={() => setIsExpanded(!isExpanded)}
>
<span>
{title}
{isExpanded && (
<Link to={`#${anchorId}`} className={classes.anchor} hidden>
<span className="heading-text">{title}</span>
{isExpanded && !!id && (
<Link
to={`#${id}`}
className={classes.anchor}
hidden
onClick={event => event.stopPropagation()}
>
&para;
</Link>
)}
@ -44,7 +49,7 @@ const Accordion = ({ title, id, expanded, children }) => {
<rect height={2} width={8} x={1} y={4} />
</svg>
</button>
</h3>
</h4>
<div className={contentClassNames}>{children}</div>
</div>
</section>

@ -33,10 +33,11 @@ const GitHubCode = ({ url, lang, errorMsg, className }) => {
})
.catch(err => {
setCode(errorMsg)
console.error(err)
})
setInitialized(true)
}
}, [])
}, [initialized, rawUrl, errorMsg])
const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code)

@ -5,13 +5,13 @@ import classNames from 'classnames'
import Icon from './icon'
import classes from '../styles/infobox.module.sass'
const Infobox = ({ title, variant, className, children }) => {
const Infobox = ({ title, id, variant, className, children }) => {
const infoboxClassNames = classNames(classes.root, className, {
[classes.warning]: variant === 'warning',
[classes.danger]: variant === 'danger',
})
return (
<aside className={infoboxClassNames}>
<aside className={infoboxClassNames} id={id}>
{title && (
<h4 className={classes.title}>
{variant !== 'default' && (
@ -31,6 +31,7 @@ Infobox.defaultProps = {
Infobox.propTypes = {
title: PropTypes.string,
id: PropTypes.string,
variant: PropTypes.oneOf(['default', 'warning', 'danger']),
className: PropTypes.string,
children: PropTypes.node.isRequired,

@ -232,6 +232,7 @@ Juniper.defaultProps = {
theme: 'default',
isolateCells: true,
useBinder: true,
storageKey: 'juniper',
useStorage: true,
storageExpire: 60,
debug: false,

@ -34,22 +34,19 @@ const Progress = () => {
setOffset(getOffset())
}
useEffect(
() => {
if (!initialized && progressRef.current) {
handleResize()
setInitialized(true)
}
window.addEventListener('scroll', handleScroll)
window.addEventListener('resize', handleResize)
useEffect(() => {
if (!initialized && progressRef.current) {
handleResize()
setInitialized(true)
}
window.addEventListener('scroll', handleScroll)
window.addEventListener('resize', handleResize)
return () => {
window.removeEventListener('scroll', handleScroll)
window.removeEventListener('resize', handleResize)
}
},
[progressRef]
)
return () => {
window.removeEventListener('scroll', handleScroll)
window.removeEventListener('resize', handleResize)
}
}, [initialized, progressRef])
const { height, vh } = offset
const total = 100 - ((height - scrollY - vh) / height) * 100

@ -8,6 +8,12 @@ import Icon from './icon'
import { H2 } from './typography'
import classes from '../styles/quickstart.module.sass'
function getNewChecked(optionId, checkedForId, multiple) {
if (!multiple) return [optionId]
if (checkedForId.includes(optionId)) return checkedForId.filter(opt => opt !== optionId)
return [...checkedForId, optionId]
}
const Quickstart = ({ data, title, description, id, children }) => {
const [styles, setStyles] = useState({})
const [checked, setChecked] = useState({})
@ -38,13 +44,13 @@ const Quickstart = ({ data, title, description, id, children }) => {
setStyles(initialStyles)
setInitialized(true)
}
})
}, [data, initialized])
return !data.length ? null : (
<Section id={id}>
<div className={classes.root}>
{title && (
<H2 className={classes.title}>
<H2 className={classes.title} name={id}>
<a href={`#${id}`}>{title}</a>
</H2>
)}
@ -76,13 +82,11 @@ const Quickstart = ({ data, title, description, id, children }) => {
onChange={() => {
const newChecked = {
...checked,
[id]: !multiple
? [option.id]
: checkedForId.includes(option.id)
? checkedForId.filter(
opt => opt !== option.id
)
: [...checkedForId, option.id],
[id]: getNewChecked(
option.id,
checkedForId,
multiple
),
}
setChecked(newChecked)
setStyles({
