Merge branch 'master' into feature/lemmatizer

Ines Montani 2019-03-16 13:44:22 +01:00
commit 278e9d2eb0
115 changed files with 2640 additions and 2222 deletions

106
.github/contributors/Poluglottos.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [X] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ryan Ford |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | Mar 13 2019 |
| GitHub username | Poluglottos |
| Website (optional) | |

106
.github/contributors/tmetzl.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Tim Metzler |
| Company name (if applicable) | University of Applied Sciences Bonn-Rhein-Sieg |
| Title or role (if applicable) | |
| Date | 03/10/2019 |
| GitHub username | tmetzl |
| Website (optional) | |

View File

@ -12,7 +12,7 @@ currently supports tokenization for **45+ languages**. It features the
 and easy **deep learning** integration. It's commercial open-source software,
 released under the MIT license.
-💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
+💫 **Version 2.0 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
@ -25,19 +25,17 @@ released under the MIT license.
 ## 📖 Documentation
-| Documentation | |
-| --------------- | -------------------------------------------------------------- |
-| [spaCy 101] | New to spaCy? Here's everything you need to know! |
-| [Usage Guides] | How to use spaCy and its features. |
-| [New in v2.1] | New features, backwards incompatibilities and migration guide. |
-| [API Reference] | The detailed reference for spaCy's API. |
-| [Models] | Download statistical language models for spaCy. |
-| [Universe] | Libraries, extensions, demos, books and courses. |
-| [Changelog] | Changes and version history. |
-| [Contribute] | How to contribute to the spaCy project and code base. |
+| Documentation | |
+| --------------- | ----------------------------------------------------- |
+| [spaCy 101] | New to spaCy? Here's everything you need to know! |
+| [Usage Guides] | How to use spaCy and its features. |
+| [API Reference] | The detailed reference for spaCy's API. |
+| [Models] | Download statistical language models for spaCy. |
+| [Universe] | Libraries, extensions, demos, books and courses. |
+| [Changelog] | Changes and version history. |
+| [Contribute] | How to contribute to the spaCy project and code base. |
 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.1]: https://spacy.io/usage/v2-1
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models

View File

@ -7,6 +7,7 @@ git diff-index --quiet HEAD
 git checkout $1
 git pull origin $1
+git push origin $1
 version=$(grep "__version__ = " spacy/about.py)
 version=${version/__version__ = }
@ -15,4 +16,4 @ version=${version/\'/}
 version=${version/\"/}
 version=${version/\"/}
 git tag "v$version"
-git push origin --tags
+git push origin "v$version" --tags
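The script above derives the tag name by grepping the `__version__` line out of `spacy/about.py` and stripping the assignment and the quotes. A minimal Python sketch of the same extraction, for illustration only: `read_version` and the sample text are hypothetical and not part of this commit.

```python
import re


def read_version(about_text):
    """Return the version string assigned to __version__ in the given text."""
    # Accept either single or double quotes, as the shell script strips both.
    match = re.search(r"__version__\s*=\s*[\"']([^\"']+)[\"']", about_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# Hypothetical stand-in for the contents of spacy/about.py
sample = '__title__ = "spacy-nightly"\n__version__ = "2.1.0a13"\n'
tag = "v" + read_version(sample)
print(tag)
```

The resulting `tag` is what the script passes to `git tag` and then pushes explicitly.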

107
bin/train_word_vectors.py Normal file
View File

@ -0,0 +1,107 @@
#!/usr/bin/env python
from __future__ import print_function, unicode_literals, division
import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
from preshed.counter import PreshCounter
import plac
import spacy
logger = logging.getLogger(__name__)
class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq
    def count_doc(self, doc):
        # Get counts for this document
        for word in doc:
            self.counts.inc(word.orth, 1)
        return len(doc)
    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with text_loc.open("r", encoding="utf-8") as file_:
                text = file_.read()
            yield text
def iter_dir(loc):
    dir_path = Path(loc)
    for fn_path in dir_path.iterdir():
        if fn_path.is_dir():
            for sub_path in fn_path.iterdir():
                yield sub_path
        else:
            yield fn_path
@plac.annotations(
    lang=("ISO language code"),
    in_dir=("Location of input directory"),
    out_loc=("Location of output file"),
    n_workers=("Number of workers", "option", "n", int),
    size=("Dimension of the word vectors", "option", "d", int),
    window=("Context window size", "option", "w", int),
    min_count=("Min count", "option", "m", int),
    negative=("Number of negative samples", "option", "g", int),
    nr_iter=("Number of iterations", "option", "i", int),
)
def main(
    lang,
    in_dir,
    out_loc,
    negative=5,
    n_workers=4,
    window=5,
    size=128,
    min_count=10,
    nr_iter=2,
):
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
    )
    model = Word2Vec(
        size=size,
        window=window,
        min_count=min_count,
        workers=n_workers,
        sample=1e-5,
        negative=negative,
    )
    nlp = spacy.blank(lang)
    corpus = Corpus(in_dir)
    total_words = 0
    total_sents = 0
    for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
        with text_loc.open("r", encoding="utf-8") as file_:
            text = file_.read()
        total_sents += text.count("\n")
        doc = nlp(text)
        total_words += corpus.count_doc(doc)
        logger.info(
            "PROGRESS: at batch #%i, processed %i words, keeping %i word types",
            text_no,
            total_words,
            len(corpus.strings),
        )
    model.corpus_count = total_sents
    model.raw_vocab = defaultdict(int)
    for orth, freq in corpus.counts:
        if freq >= min_count:
            model.raw_vocab[nlp.vocab.strings[orth]] = freq
    model.scale_vocab()
    model.finalize_vocab()
    model.iter = nr_iter
    model.train(corpus)
    model.save(out_loc)
if __name__ == "__main__":
    plac.call(main)
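The script counts token frequencies over the whole corpus and copies only words seen at least `min_count` times into the model's raw vocabulary. A rough, dependency-free sketch of that filtering step follows. It assumes `collections.Counter` as a stand-in for `PreshCounter` and whitespace splitting as a stand-in for the spaCy tokenizer; `build_raw_vocab` is a hypothetical helper, not part of the script.

```python
from collections import Counter


def build_raw_vocab(texts, min_count):
    """Count tokens over all texts and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Assumption: str.split() replaces the spaCy tokenizer used in the script.
        counts.update(text.split())
    return {word: freq for word, freq in counts.items() if freq >= min_count}


texts = ["a b a", "a c b"]
vocab = build_raw_vocab(texts, min_count=2)
print(vocab)
```

In the script itself the counter is keyed by `word.orth` hashes, so each surviving key is mapped back to its string through `nlp.vocab.strings` before being stored.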

View File

@ -49,7 +49,7 @@ class SentimentAnalyser(object):
         y = self._model.predict(X)
         self.set_sentiment(doc, y)
-    def pipe(self, docs, batch_size=1000, n_threads=2):
+    def pipe(self, docs, batch_size=1000):
         for minibatch in cytoolz.partition_all(batch_size, docs):
             minibatch = list(minibatch)
             sentences = []
@ -176,7 +176,7 @ def evaluate(model_dir, texts, labels, max_length=100):
     correct = 0
     i = 0
-    for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
+    for doc in nlp.pipe(texts, batch_size=1000):
         correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
         i += 1
     return float(correct) / i
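The `pipe` method above consumes its input in fixed-size minibatches via `cytoolz.partition_all`. As an illustration of that batching behaviour, here is a pure-Python stand-in written with `itertools.islice`. It is a sketch and not the cytoolz implementation.

```python
from itertools import islice


def partition_all(n, iterable):
    """Yield tuples of up to n items; the last tuple may be shorter."""
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, n))
        if not batch:
            return
        yield batch


batches = list(partition_all(3, range(8)))
print(batches)
```

The final batch is simply shorter when the input length is not a multiple of the batch size, so no documents are dropped.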

View File

@ -4,7 +4,7 @@ preshed>=2.0.1,<2.1.0
 thinc>=7.0.2,<7.1.0
 blis>=0.2.2,<0.3.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.0.12,<1.1.0
+wasabi>=0.1.3,<1.1.0
 srsly>=0.0.5,<1.1.0
 # Third party dependencies
 numpy>=1.15.0

View File

@ -97,6 +97,7 @@ def with_cpu(ops, model):
"""Wrap a model that should run on CPU, transferring inputs and outputs """Wrap a model that should run on CPU, transferring inputs and outputs
as necessary.""" as necessary."""
model.to_cpu() model.to_cpu()
def with_cpu_forward(inputs, drop=0.): def with_cpu_forward(inputs, drop=0.):
cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop) cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
gpu_outputs = _to_device(ops, cpu_outputs) gpu_outputs = _to_device(ops, cpu_outputs)

View File

@ -4,7 +4,7 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "2.1.0a10"
+__version__ = "2.1.0a13"
 __summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
 __uri__ = "https://spacy.io"
 __author__ = "Explosion AI"

View File

@ -6,7 +6,7 @@ from pathlib import Path
 from wasabi import Printer
 import srsly
-from .converters import conllu2json, conllubio2json, iob2json, conll_ner2json
+from .converters import conllu2json, iob2json, conll_ner2json
 from .converters import ner_jsonl2json
@ -14,7 +14,7 @ from .converters import ner_jsonl2json
 # entry to this dict with the file extension mapped to the converter function
 # imported from /converters.
 CONVERTERS = {
-    "conllubio": conllubio2json,
+    "conllubio": conllu2json,
     "conllu": conllu2json,
     "conll": conllu2json,
     "ner": conll_ner2json,
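After this change the `conllubio` extension is served by the same converter as `conllu` and `conll`. The sketch below illustrates extension-based dispatch over such a table. The converter bodies are placeholders and `pick_converter` is a hypothetical helper; the real lookup in the convert command may differ.

```python
def conllu2json(data):
    # Placeholder body; the real converter returns JSON-ready documents.
    return "conllu2json"


def conll_ner2json(data):
    # Placeholder body, for illustration only.
    return "conll_ner2json"


CONVERTERS = {
    "conllubio": conllu2json,
    "conllu": conllu2json,
    "conll": conllu2json,
    "ner": conll_ner2json,
}


def pick_converter(file_name):
    """Look up the converter registered for a file's extension."""
    ext = file_name.rsplit(".", 1)[-1] if "." in file_name else ""
    if ext not in CONVERTERS:
        raise ValueError("Unknown format: {!r}".format(ext))
    return CONVERTERS[ext]


print(pick_converter("train.conllubio")("").upper())
```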

View File

@ -1,5 +1,4 @@
 from .conllu2json import conllu2json  # noqa: F401
-from .conllubio2json import conllubio2json  # noqa: F401
 from .iob2json import iob2json  # noqa: F401
 from .conll_ner2json import conll_ner2json  # noqa: F401
 from .jsonl2json import ner_jsonl2json  # noqa: F401

View File

@ -71,6 +71,7 @@ def read_conllx(input_data, use_morphology=False, n=0):
                     dep = "ROOT" if dep == "root" else dep
                     tag = pos if tag == "_" else tag
                     tag = tag + "__" + morph if use_morphology else tag
+                    iob = iob if iob else "O"
                     tokens.append((id_, word, tag, head, dep, iob))
                 except:  # noqa: E722
                     print(line)
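The added line gives every token an IOB tag, falling back to `"O"` when the column is empty. A small sketch of the per-line normalisation follows. It assumes the ten-column layout that the removed `conllubio2json.py` unpacks (with the entity tag in the last column); `parse_token_line` is a hypothetical helper, not a function from this commit.

```python
def parse_token_line(line, use_morphology=False):
    """Normalise one tab-separated token line into a 6-tuple."""
    # Assumption: ten columns, entity tag last, as in the removed converter.
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = line.split("\t")
    id_ = int(id_) - 1
    head = (int(head) - 1) if head != "0" else id_
    dep = "ROOT" if dep == "root" else dep
    tag = pos if tag == "_" else tag
    tag = tag + "__" + morph if use_morphology else tag
    iob = iob if iob else "O"  # empty entity column defaults to "outside"
    return (id_, word, tag, head, dep, iob)


root = parse_token_line("1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t")
ent = parse_token_line(
    "2\tBerlin\tBerlin\tPROPN\tNNP\tNumber=Sing\t1\tnmod\t_\tB-LOC",
    use_morphology=True,
)
print(root, ent)
```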

View File

@ -1,85 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from ...gold import iob_to_biluo
def conllubio2json(input_data, n_sents=10, use_morphology=False, lang=None):
    """
    Convert conllu files into JSON format for use with train cli.
    use_morphology parameter enables appending morphology to tags, which is
    useful for languages such as Spanish, where UD tags are not so rich.
    """
    # by @dvsrepo, via #11 explosion/spacy-dev-resources
    docs = []
    sentences = []
    conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
    for i, (raw_text, tokens) in enumerate(conll_tuples):
        sentence, brackets = tokens[0]
        sentences.append(generate_sentence(sentence))
        # Real-sized documents could be extracted using the comments on the
        # conluu document
        if len(sentences) % n_sents == 0:
            doc = create_doc(sentences, i)
            docs.append(doc)
            sentences = []
    return docs
def read_conllx(input_data, use_morphology=False, n=0):
    i = 0
    for sent in input_data.strip().split("\n\n"):
        lines = sent.strip().split("\n")
        if lines:
            while lines[0].startswith("#"):
                lines.pop(0)
            tokens = []
            for line in lines:
                parts = line.split("\t")
                id_, word, lemma, pos, tag, morph, head, dep, _1, ner = parts
                if "-" in id_ or "." in id_:
                    continue
                try:
                    id_ = int(id_) - 1
                    head = (int(head) - 1) if head != "0" else id_
                    dep = "ROOT" if dep == "root" else dep
                    tag = pos if tag == "_" else tag
                    tag = tag + "__" + morph if use_morphology else tag
                    ner = ner if ner else "O"
                    tokens.append((id_, word, tag, head, dep, ner))
                except:  # noqa: E722
                    print(line)
                    raise
            tuples = [list(t) for t in zip(*tokens)]
            yield (None, [[tuples, []]])
            i += 1
            if n >= 1 and i >= n:
                break
def generate_sentence(sent):
    (id_, word, tag, head, dep, ner) = sent
    sentence = {}
    tokens = []
    ner = iob_to_biluo(ner)
    for i, id in enumerate(id_):
        token = {}
        token["orth"] = word[i]
        token["tag"] = tag[i]
        token["head"] = head[i] - id
        token["dep"] = dep[i]
        token["ner"] = ner[i]
        tokens.append(token)
    sentence["tokens"] = tokens
    return sentence
def create_doc(sentences, id):
    doc = {}
    paragraph = {}
    doc["id"] = id
    doc["paragraphs"] = []
    paragraph["sentences"] = sentences
    doc["paragraphs"].append(paragraph)
    return doc
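The removed converter groups sentences into documents of `n_sents` each and only emits a document once a group is full. A short sketch of that grouping logic, reduced to plain lists, shows one consequence: as written, a trailing partial group is never flushed. `group_sentences` is a hypothetical name used here for illustration.

```python
def group_sentences(sentences, n_sents):
    """Group sentences into documents the way the removed loop does."""
    docs = []
    current = []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if len(current) % n_sents == 0:
            # The document id is the index of the last sentence in the group.
            docs.append({"id": i, "paragraphs": [{"sentences": current}]})
            current = []
    # No final flush: sentences left in `current` are not returned.
    return docs


docs = group_sentences(["s0", "s1", "s2", "s3", "s4"], n_sents=2)
print([d["id"] for d in docs])
```

With five sentences and `n_sents=2`, only two documents come back and `s4` is left out.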

View File

@ -41,24 +41,32 @@ def download(model, direct=False, *pip_args):
         dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
         if dl != 0:  # if download subprocess doesn't return 0, exit
             sys.exit(dl)
-        try:
-            # Get package path here because link uses
-            # pip.get_installed_distributions() to check if model is a
-            # package, which fails if model was just installed via
-            # subprocess
-            package_path = get_package_path(model_name)
-            link(model_name, model, force=True, model_path=package_path)
-        except:  # noqa: E722
-            # Dirty, but since spacy.download and the auto-linking is
-            # mostly a convenience wrapper, it's best to show a success
-            # message and loading instructions, even if linking fails.
-            msg.warn(
-                "Download successful but linking failed",
-                "Creating a shortcut link for 'en' didn't work (maybe you "
-                "don't have admin permissions?), but you can still load the "
-                "model via its full package name: "
-                "nlp = spacy.load('{}')".format(model_name),
-            )
+        msg.good(
+            "Download and installation successful",
+            "You can now load the model via spacy.load('{}')".format(model_name),
+        )
+        # Only create symlink if the model is installed via a shortcut like 'en'.
+        # There's no real advantage over an additional symlink for en_core_web_sm
+        # and if anything, it's more error prone and causes more confusion.
+        if model in shortcuts:
+            try:
+                # Get package path here because link uses
+                # pip.get_installed_distributions() to check if model is a
+                # package, which fails if model was just installed via
+                # subprocess
+                package_path = get_package_path(model_name)
+                link(model_name, model, force=True, model_path=package_path)
+            except:  # noqa: E722
+                # Dirty, but since spacy.download and the auto-linking is
+                # mostly a convenience wrapper, it's best to show a success
+                # message and loading instructions, even if linking fails.
+                msg.warn(
+                    "Download successful but linking failed",
+                    "Creating a shortcut link for '{}' didn't work (maybe you "
+                    "don't have admin permissions?), but you can still load "
+                    "the model via its full package name: "
+                    "nlp = spacy.load('{}')".format(model, model_name),
+                )
 def get_json(url, desc):
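With this change a shortcut link is created only when the requested name is itself a shortcut. The sketch below isolates that decision. The shortcut table and `resolve_download` are hypothetical, and it assumes a name that is not a shortcut is already a full package name.

```python
def resolve_download(model, shortcuts):
    """Return (package_name, should_link) for a requested model name."""
    # Assumption: anything that is not a shortcut is a full package name.
    package_name = shortcuts.get(model, model)
    should_link = model in shortcuts
    return package_name, should_link


SHORTCUTS = {"en": "en_core_web_sm"}  # hypothetical shortcut table
via_shortcut = resolve_download("en", SHORTCUTS)
via_full_name = resolve_download("en_core_web_sm", SHORTCUTS)
print(via_shortcut, via_full_name)
```

Both calls resolve to the same package, but only the first one asks for a link.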

View File

@ -161,7 +161,7 @@ def parse_deps(orig_doc, options={}):
                         "dir": "right",
                     }
                 )
-    return {"words": words, "arcs": arcs}
+    return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
 def parse_ents(doc, options={}):
@ -177,7 +177,8 @ def parse_ents(doc, options={}):
     if not ents:
         user_warning(Warnings.W006)
     title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
-    return {"text": doc.text, "ents": ents, "title": title}
+    settings = get_doc_settings(doc)
+    return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
 def set_render_wrapper(func):
@ -195,3 +196,10 @ def set_render_wrapper(func):
     if not hasattr(func, "__call__"):
         raise ValueError(Errors.E110.format(obj=type(func)))
     RENDER_WRAPPER = func
+def get_doc_settings(doc):
+    return {
+        "lang": doc.lang_,
+        "direction": doc.vocab.writing_system.get("direction", "ltr"),
+    }
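`get_doc_settings` reads the language code and writing direction off the document and falls back to `"ltr"` when the vocabulary declares no direction. The sketch below exercises that fallback using `types.SimpleNamespace` objects as stand-ins for a spaCy `Doc`; the stand-in documents are assumptions made for illustration.

```python
from types import SimpleNamespace


def get_doc_settings(doc):
    return {
        "lang": doc.lang_,
        "direction": doc.vocab.writing_system.get("direction", "ltr"),
    }


# Stand-in objects exposing only the attributes the function reads.
doc_rtl = SimpleNamespace(
    lang_="he", vocab=SimpleNamespace(writing_system={"direction": "rtl"})
)
doc_default = SimpleNamespace(lang_="en", vocab=SimpleNamespace(writing_system={}))

print(get_doc_settings(doc_rtl))
print(get_doc_settings(doc_default))
```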

View File

@ -3,10 +3,13 @@ from __future__ import unicode_literals
 import uuid
-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
-from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
+from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
+from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
 from ..util import minify_html, escape_html
+DEFAULT_LANG = "en"
+DEFAULT_DIR = "ltr"
 class DependencyRenderer(object):
     """Render dependency parses as SVGs."""
@ -30,6 +33,8 @ class DependencyRenderer(object):
         self.color = options.get("color", "#000000")
         self.bg = options.get("bg", "#ffffff")
         self.font = options.get("font", "Arial")
+        self.direction = DEFAULT_DIR
+        self.lang = DEFAULT_LANG
     def render(self, parsed, page=False, minify=False):
         """Render complete markup.
@ -42,13 +47,19 @ class DependencyRenderer(object):
         # Create a random ID prefix to make sure parses don't receive the
         # same ID, even if they're identical
         id_prefix = uuid.uuid4().hex
-        rendered = [
-            self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
-            for i, p in enumerate(parsed)
-        ]
+        rendered = []
+        for i, p in enumerate(parsed):
+            if i == 0:
+                self.direction = p["settings"].get("direction", DEFAULT_DIR)
+                self.lang = p["settings"].get("lang", DEFAULT_LANG)
+            render_id = "{}-{}".format(id_prefix, i)
+            svg = self.render_svg(render_id, p["words"], p["arcs"])
+            rendered.append(svg)
         if page:
             content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
-            markup = TPL_PAGE.format(content=content)
+            markup = TPL_PAGE.format(
+                content=content, lang=self.lang, dir=self.direction
+            )
         else:
             markup = "".join(rendered)
         if minify:
@ -83,6 +94,8 @ class DependencyRenderer(object):
             bg=self.bg,
             font=self.font,
             content=content,
+            dir=self.direction,
+            lang=self.lang,
         )
     def render_word(self, text, tag, i):
@ -95,11 +108,13 @ class DependencyRenderer(object):
         """
         y = self.offset_y + self.word_spacing
         x = self.offset_x + i * self.distance
+        if self.direction == "rtl":
+            x = self.width - x
         html_text = escape_html(text)
         return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
     def render_arrow(self, label, start, end, direction, i):
-        """Render indivicual arrow.
+        """Render individual arrow.
         label (unicode): Dependency label.
         start (int): Index of start word.
@ -110,6 +125,8 @ class DependencyRenderer(object):
         """
         level = self.levels.index(end - start) + 1
         x_start = self.offset_x + start * self.distance + self.arrow_spacing
+        if self.direction == "rtl":
+            x_start = self.width - x_start
         y = self.offset_y
         x_end = (
             self.offset_x
@ -117,6 +134,8 @ class DependencyRenderer(object):
             + start * self.distance
             - self.arrow_spacing * (self.highest_level - level) / 4
         )
+        if self.direction == "rtl":
+            x_end = self.width - x_end
         y_curve = self.offset_y - level * self.distance / 2
         if self.compact:
             y_curve = self.offset_y - level * self.distance / 6
@ -124,12 +143,14 @ class DependencyRenderer(object):
             y_curve = -self.distance
         arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
         arc = self.get_arc(x_start, y, y_curve, x_end)
+        label_side = "right" if self.direction == "rtl" else "left"
         return TPL_DEP_ARCS.format(
             id=self.id,
             i=i,
             stroke=self.arrow_stroke,
             head=arrowhead,
             label=label,
+            label_side=label_side,
             arc=arc,
         )
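For right-to-left text, the renderer keeps its left-to-right layout arithmetic and mirrors each x coordinate across the SVG width. A minimal sketch of that mirroring for word positions follows. The numeric values are arbitrary, `word_x` is a hypothetical free-function version of the logic in `render_word`, and the width formula is an assumption.

```python
def word_x(i, offset_x, distance, width, direction="ltr"):
    """Horizontal position of the i-th word, mirrored for RTL."""
    x = offset_x + i * distance
    if direction == "rtl":
        x = width - x
    return x


offset_x, distance, n_words = 50, 175, 3
# Assumption for the example: the SVG is wide enough for all words plus offset.
width = offset_x + n_words * distance

ltr = [word_x(i, offset_x, distance, width) for i in range(n_words)]
rtl = [word_x(i, offset_x, distance, width, "rtl") for i in range(n_words)]
print(ltr, rtl)
```

The first word ends up nearest the right edge in RTL mode, and the spacing between words is unchanged.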
@ -219,6 +240,8 @@ class EntityRenderer(object):
self.default_color = "#ddd" self.default_color = "#ddd"
self.colors = colors self.colors = colors
self.ents = options.get("ents", None) self.ents = options.get("ents", None)
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
def render(self, parsed, page=False, minify=False): def render(self, parsed, page=False, minify=False):
"""Render complete markup. """Render complete markup.
@ -228,12 +251,15 @@ class EntityRenderer(object):
minify (bool): Minify HTML markup. minify (bool): Minify HTML markup.
RETURNS (unicode): Rendered HTML markup. RETURNS (unicode): Rendered HTML markup.
""" """
rendered = [ rendered = []
self.render_ents(p["text"], p["ents"], p.get("title", None)) for p in parsed for i, p in enumerate(parsed):
] if i == 0:
self.direction = p["settings"].get("direction", DEFAULT_DIR)
self.lang = p["settings"].get("lang", DEFAULT_LANG)
rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
if page: if page:
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered]) docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
markup = TPL_PAGE.format(content=docs) markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
else: else:
markup = "".join(rendered) markup = "".join(rendered)
if minify: if minify:
@ -261,12 +287,16 @@ class EntityRenderer(object):
markup += "</br>" markup += "</br>"
if self.ents is None or label.upper() in self.ents: if self.ents is None or label.upper() in self.ents:
color = self.colors.get(label.upper(), self.default_color) color = self.colors.get(label.upper(), self.default_color)
markup += TPL_ENT.format(label=label, text=entity, bg=color) ent_settings = {"label": label, "text": entity, "bg": color}
if self.direction == "rtl":
markup += TPL_ENT_RTL.format(**ent_settings)
else:
markup += TPL_ENT.format(**ent_settings)
else: else:
markup += entity markup += entity
offset = end offset = end
markup += escape_html(text[offset:]) markup += escape_html(text[offset:])
markup = TPL_ENTS.format(content=markup, colors=self.colors) markup = TPL_ENTS.format(content=markup, dir=self.direction)
if title: if title:
markup = TPL_TITLE.format(title=title) + markup markup = TPL_TITLE.format(title=title) + markup
return markup return markup
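Note on the hunk above: the direction handling in `render_ents` amounts to building one settings dict and picking a template by `self.direction`. A minimal sketch of that pattern follows. The templates are simplified stand-ins and `ENT_LTR`, `ENT_RTL` and `render_entity` are hypothetical names, not displaCy's own.

```python
# Simplified stand-ins for TPL_ENT / TPL_ENT_RTL (hypothetical markup, not displaCy's).
ENT_LTR = '<mark style="background: {bg}">{text} <span style="margin-left: 0.5rem">{label}</span></mark>'
ENT_RTL = '<mark style="background: {bg}">{text} <span style="margin-right: 0.5rem">{label}</span></mark>'


def render_entity(text, label, bg="#ddd", direction="ltr"):
    """Render one entity, choosing the template by writing direction."""
    settings = {"label": label, "text": text, "bg": bg}
    template = ENT_RTL if direction == "rtl" else ENT_LTR
    return template.format(**settings)


print(render_entity("Google", "ORG"))
print(render_entity("جوجل", "ORG", direction="rtl"))
```

Keeping the settings in a single dict means both templates always receive the same fields, so only the markup differs between the two directions.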


@ -6,7 +6,7 @@ from __future__ import unicode_literals
# Jupyter to render it properly in a cell # Jupyter to render it properly in a cell
TPL_DEP_SVG = """ TPL_DEP_SVG = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
""" """
@ -22,7 +22,7 @@ TPL_DEP_ARCS = """
<g class="displacy-arrow"> <g class="displacy-arrow">
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/> <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px"> <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath> <textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="currentColor" text-anchor="middle">{label}</textPath>
</text> </text>
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/> <path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
</g> </g>
@ -39,7 +39,7 @@ TPL_TITLE = """
TPL_ENTS = """ TPL_ENTS = """
<div class="entities" style="line-height: 2.5">{content}</div> <div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
""" """
@ -50,14 +50,21 @@ TPL_ENT = """
</mark> </mark>
""" """
TPL_ENT_RTL = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
{text}
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
</mark>
"""
TPL_PAGE = """ TPL_PAGE = """
<!DOCTYPE html> <!DOCTYPE html>
<html> <html lang="{lang}">
<head> <head>
<title>displaCy</title> <title>displaCy</title>
</head> </head>
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body> <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: {dir}">{content}</body>
</html> </html>
""" """


@ -70,6 +70,16 @@ class Warnings(object):
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more " W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
"efficient and less error-prone Doc.retokenize context manager " "efficient and less error-prone Doc.retokenize context manager "
"instead.") "instead.")
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
"methods is deprecated and should be replaced with `exclude`. This makes it "
"consistent with the other serializable objects.")
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
"being serialized or deserialized is deprecated. Please use the "
"`exclude` argument instead. For example: exclude=['{arg}'].")
W016 = ("The keyword argument `n_threads` is now deprecated, as "
"the v2.x models cannot release the global interpreter lock. "
"Future versions may introduce a `n_process` argument for "
"parallel inference via multiprocessing.")
@add_codes @add_codes
@ -348,7 +358,15 @@ class Errors(object):
"This is likely a bug in spaCy, so feel free to open an issue.") "This is likely a bug in spaCy, so feel free to open an issue.")
E127 = ("Cannot create phrase pattern representation for length 0. This " E127 = ("Cannot create phrase pattern representation for length 0. This "
"is likely a bug in spaCy.") "is likely a bug in spaCy.")
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
"arguments to exclude fields from being serialized or deserialized "
"is now deprecated. Please use the `exclude` argument instead. "
"For example: exclude=['{arg}'].")
E129 = ("Cannot write the label of an existing Span object because a Span "
"is a read-only view of the underlying Token objects stored in the Doc. "
"Instead, create a new Span object and specify the `label` keyword argument, "
"for example:\nfrom spacy.tokens import Span\n"
"span = Span(doc, start={start}, end={end}, label='{label}')")
@add_codes @add_codes
class TempErrors(object): class TempErrors(object):
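Note on the new codes above: each message is a template whose placeholders (such as `{arg}`) are filled in at the call site before the warning is emitted. A minimal sketch of that flow, using only the standard `warnings` module. `MESSAGES` and `warn_deprecated` are hypothetical names, and the `[W015]` prefix is an assumption about how codes are displayed.

```python
import warnings

# Hypothetical message table, modelled on the Warnings class in the hunk above.
MESSAGES = {
    "W015": (
        "As of v2.1.0, the use of keyword arguments to exclude fields from "
        "being serialized or deserialized is deprecated. Please use the "
        "`exclude` argument instead. For example: exclude=['{arg}']."
    ),
}


def warn_deprecated(code, **fields):
    """Fill in the template for `code` and emit it as a DeprecationWarning."""
    message = "[{}] {}".format(code, MESSAGES[code].format(**fields))
    warnings.warn(message, DeprecationWarning, stacklevel=2)
    return message


# Record the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated("W015", arg="vocab")

print(caught[0].message)
```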


@ -23,6 +23,7 @@ class ArabicDefaults(Language.Defaults):
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Arabic(Language): class Arabic(Language):
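Note on `writing_system`: the attribute is a plain class-level dict, so a language overrides it by redefining it on its own `Defaults` subclass and everything else inherits the base value. A minimal sketch under that assumption. `EnglishDefaults` and `text_direction` are illustrative names that do not appear in this diff.

```python
class BaseDefaults(object):
    # Fallback: left-to-right, cased, alphabetic script.
    writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}


class ArabicDefaults(BaseDefaults):
    writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}


class EnglishDefaults(BaseDefaults):
    pass  # inherits the base writing system unchanged


def text_direction(defaults):
    """Hypothetical helper: the direction a renderer should use."""
    return defaults.writing_system.get("direction", "ltr")


print(text_direction(ArabicDefaults))
print(text_direction(EnglishDefaults))
```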


@ -34,10 +34,10 @@ TAG_MAP = {
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"}, "NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"}, "PDT": {POS: DET, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"}, "POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"}, "PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"}, "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"}, "RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"}, "RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"}, "RBS": {POS: ADV, "Degree": "sup"},


@ -27,6 +27,7 @@ class PersianDefaults(Language.Defaults):
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP
suffixes = TOKENIZER_SUFFIXES suffixes = TOKENIZER_SUFFIXES
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Persian(Language): class Persian(Language):


@ -14,6 +14,7 @@ class HebrewDefaults(Language.Defaults):
lex_attr_getters[LANG] = lambda text: "he" lex_attr_getters[LANG] = lambda text: "he"
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS) tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
stop_words = STOP_WORDS stop_words = STOP_WORDS
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
class Hebrew(Language): class Hebrew(Language):


@ -8,15 +8,13 @@ from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from ...attrs import LANG from ...attrs import LANG
from ...language import Language from ...language import Language
from ...tokens import Doc, Token from ...tokens import Doc
from ...compat import copy_reg
from ...util import DummyTokenizer from ...util import DummyTokenizer
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"]) ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
# TODO: Is this the right place for this?
Token.set_extension("mecab_tag", default=None)
def try_mecab_import(): def try_mecab_import():
"""Mecab is required for Japanese support, so check for it. """Mecab is required for Japanese support, so check for it.
@ -81,10 +79,12 @@ class JapaneseTokenizer(DummyTokenizer):
words = [x.surface for x in dtokens] words = [x.surface for x in dtokens]
spaces = [False] * len(words) spaces = [False] * len(words)
doc = Doc(self.vocab, words=words, spaces=spaces) doc = Doc(self.vocab, words=words, spaces=spaces)
mecab_tags = []
for token, dtoken in zip(doc, dtokens): for token, dtoken in zip(doc, dtokens):
token._.mecab_tag = dtoken.pos mecab_tags.append(dtoken.pos)
token.tag_ = resolve_pos(dtoken) token.tag_ = resolve_pos(dtoken)
token.lemma_ = dtoken.lemma token.lemma_ = dtoken.lemma
doc.user_data["mecab_tags"] = mecab_tags
return doc return doc
@ -93,6 +93,7 @@ class JapaneseDefaults(Language.Defaults):
lex_attr_getters[LANG] = lambda _text: "ja" lex_attr_getters[LANG] = lambda _text: "ja"
stop_words = STOP_WORDS stop_words = STOP_WORDS
tag_map = TAG_MAP tag_map = TAG_MAP
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
@classmethod @classmethod
def create_tokenizer(cls, nlp=None): def create_tokenizer(cls, nlp=None):
@ -107,4 +108,11 @@ class Japanese(Language):
return self.tokenizer(text) return self.tokenizer(text)
def pickle_japanese(instance):
return Japanese, tuple()
copy_reg.pickle(Japanese, pickle_japanese)
__all__ = ["Japanese"] __all__ = ["Japanese"]


@ -14,6 +14,7 @@ class ChineseDefaults(Language.Defaults):
use_jieba = True use_jieba = True
tokenizer_exceptions = BASE_EXCEPTIONS tokenizer_exceptions = BASE_EXCEPTIONS
stop_words = STOP_WORDS stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
class Chinese(Language): class Chinese(Language):


@ -29,7 +29,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors from .errors import Errors, Warnings, deprecation_warning
from . import util from . import util
from . import about from . import about
@ -95,6 +95,7 @@ class BaseDefaults(object):
morph_rules = {} morph_rules = {}
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
syntax_iterators = {} syntax_iterators = {}
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
class Language(object): class Language(object):
@ -107,6 +108,7 @@ class Language(object):
DOCS: https://spacy.io/api/language DOCS: https://spacy.io/api/language
""" """
Defaults = BaseDefaults Defaults = BaseDefaults
lang = None lang = None
@ -195,6 +197,7 @@ class Language(object):
self._meta = value self._meta = value
# Conveniences to access pipeline components # Conveniences to access pipeline components
# Shouldn't be used anymore!
@property @property
def tensorizer(self): def tensorizer(self):
return self.get_pipe("tensorizer") return self.get_pipe("tensorizer")
@ -228,6 +231,8 @@ class Language(object):
name (unicode): Name of pipeline component to get. name (unicode): Name of pipeline component to get.
RETURNS (callable): The pipeline component. RETURNS (callable): The pipeline component.
DOCS: https://spacy.io/api/language#get_pipe
""" """
for pipe_name, component in self.pipeline: for pipe_name, component in self.pipeline:
if pipe_name == name: if pipe_name == name:
@ -240,6 +245,8 @@ class Language(object):
name (unicode): Factory name to look up in `Language.factories`. name (unicode): Factory name to look up in `Language.factories`.
config (dict): Configuration parameters to initialise component. config (dict): Configuration parameters to initialise component.
RETURNS (callable): Pipeline component. RETURNS (callable): Pipeline component.
DOCS: https://spacy.io/api/language#create_pipe
""" """
if name not in self.factories: if name not in self.factories:
if name == "sbd": if name == "sbd":
@ -266,9 +273,7 @@ class Language(object):
first (bool): Insert component first / not first in the pipeline. first (bool): Insert component first / not first in the pipeline.
last (bool): Insert component last / not last in the pipeline. last (bool): Insert component last / not last in the pipeline.
EXAMPLE: DOCS: https://spacy.io/api/language#add_pipe
>>> nlp.add_pipe(component, before='ner')
>>> nlp.add_pipe(component, name='custom_name', last=True)
""" """
if not hasattr(component, "__call__"): if not hasattr(component, "__call__"):
msg = Errors.E003.format(component=repr(component), name=name) msg = Errors.E003.format(component=repr(component), name=name)
@ -310,6 +315,8 @@ class Language(object):
name (unicode): Name of the component. name (unicode): Name of the component.
RETURNS (bool): Whether a component of the name exists in the pipeline. RETURNS (bool): Whether a component of the name exists in the pipeline.
DOCS: https://spacy.io/api/language#has_pipe
""" """
return name in self.pipe_names return name in self.pipe_names
@ -318,6 +325,8 @@ class Language(object):
name (unicode): Name of the component to replace. name (unicode): Name of the component to replace.
component (callable): Pipeline component. component (callable): Pipeline component.
DOCS: https://spacy.io/api/language#replace_pipe
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
@ -328,6 +337,8 @@ class Language(object):
old_name (unicode): Name of the component to rename. old_name (unicode): Name of the component to rename.
new_name (unicode): New name of the component. new_name (unicode): New name of the component.
DOCS: https://spacy.io/api/language#rename_pipe
""" """
if old_name not in self.pipe_names: if old_name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names)) raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
@ -341,36 +352,39 @@ class Language(object):
name (unicode): Name of the component to remove. name (unicode): Name of the component to remove.
RETURNS (tuple): A `(name, component)` tuple of the removed component. RETURNS (tuple): A `(name, component)` tuple of the removed component.
DOCS: https://spacy.io/api/language#remove_pipe
""" """
if name not in self.pipe_names: if name not in self.pipe_names:
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
return self.pipeline.pop(self.pipe_names.index(name)) return self.pipeline.pop(self.pipe_names.index(name))
def __call__(self, text, disable=[]): def __call__(self, text, disable=[], component_cfg=None):
"""Apply the pipeline to some text. The text can span multiple sentences, """Apply the pipeline to some text. The text can span multiple sentences,
and can contain arbitrary whitespace. Alignment into the original string and can contain arbitrary whitespace. Alignment into the original string
is preserved. is preserved.
text (unicode): The text to be processed. text (unicode): The text to be processed.
disable (list): Names of the pipeline components to disable. disable (list): Names of the pipeline components to disable.
component_cfg (dict): An optional dictionary with extra keyword arguments
for specific components.
RETURNS (Doc): A container for accessing the annotations. RETURNS (Doc): A container for accessing the annotations.
EXAMPLE: DOCS: https://spacy.io/api/language#call
>>> tokens = nlp('An example sentence. Another example sentence.')
>>> tokens[0].text, tokens[0].head.tag_
('An', 'NN')
""" """
if len(text) > self.max_length: if len(text) > self.max_length:
raise ValueError( raise ValueError(
Errors.E088.format(length=len(text), max_length=self.max_length) Errors.E088.format(length=len(text), max_length=self.max_length)
) )
doc = self.make_doc(text) doc = self.make_doc(text)
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
continue continue
if not hasattr(proc, "__call__"): if not hasattr(proc, "__call__"):
raise ValueError(Errors.E003.format(component=type(proc), name=name)) raise ValueError(Errors.E003.format(component=type(proc), name=name))
doc = proc(doc) doc = proc(doc, **component_cfg.get(name, {}))
if doc is None: if doc is None:
raise ValueError(Errors.E005.format(name=name)) raise ValueError(Errors.E005.format(name=name))
return doc return doc
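Note on `component_cfg` in `__call__`: the dict maps a component name to the extra keyword arguments that component should receive, and components without an entry are called with none. A minimal sketch of the dispatch, with plain functions on token lists standing in for real pipeline components. `run_pipeline`, `lowercase` and `truncate` are hypothetical.

```python
def run_pipeline(doc, pipeline, disable=(), component_cfg=None):
    """Apply each (name, component) pair, forwarding per-component kwargs."""
    if component_cfg is None:
        component_cfg = {}
    for name, proc in pipeline:
        if name in disable:
            continue
        doc = proc(doc, **component_cfg.get(name, {}))
        if doc is None:
            raise ValueError("Component '{}' returned None".format(name))
    return doc


# Toy components operating on a list of tokens (stand-ins for real pipes).
def lowercase(doc):
    return [token.lower() for token in doc]


def truncate(doc, max_len=10):
    return doc[:max_len]


pipeline = [("lowercase", lowercase), ("truncate", truncate)]
result = run_pipeline(
    ["A", "B", "C"], pipeline, component_cfg={"truncate": {"max_len": 2}}
)
print(result)
```

Only `truncate` receives `max_len`; `lowercase` is called with the document alone.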
@ -381,24 +395,14 @@ class Language(object):
of the block. Otherwise, a DisabledPipes object is returned, that has of the block. Otherwise, a DisabledPipes object is returned, that has
a `.restore()` method you can use to undo your changes. a `.restore()` method you can use to undo your changes.
EXAMPLE: DOCS: https://spacy.io/api/language#disable_pipes
>>> nlp.add_pipe('parser')
>>> nlp.add_pipe('tagger')
>>> with nlp.disable_pipes('parser', 'tagger'):
>>> assert not nlp.has_pipe('parser')
>>> assert nlp.has_pipe('parser')
>>> disabled = nlp.disable_pipes('parser')
>>> assert len(disabled) == 1
>>> assert not nlp.has_pipe('parser')
>>> disabled.restore()
>>> assert nlp.has_pipe('parser')
""" """
return DisabledPipes(self, *names) return DisabledPipes(self, *names)
def make_doc(self, text): def make_doc(self, text):
return self.tokenizer(text) return self.tokenizer(text)
def update(self, docs, golds, drop=0.0, sgd=None, losses=None): def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None):
"""Update the models in the pipeline. """Update the models in the pipeline.
docs (iterable): A batch of `Doc` objects. docs (iterable): A batch of `Doc` objects.
@ -407,11 +411,7 @@ class Language(object):
sgd (callable): An optimizer. sgd (callable): An optimizer.
RETURNS (dict): Results from the update. RETURNS (dict): Results from the update.
EXAMPLE: DOCS: https://spacy.io/api/language#update
>>> with nlp.begin_training(gold) as (trainer, optimizer):
>>> for epoch in trainer.epochs(gold):
>>> for docs, golds in epoch:
>>> state = nlp.update(docs, golds, sgd=optimizer)
""" """
if len(docs) != len(golds): if len(docs) != len(golds):
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds))) raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
@ -421,7 +421,6 @@ class Language(object):
if self._optimizer is None: if self._optimizer is None:
self._optimizer = create_default_optimizer(Model.ops) self._optimizer = create_default_optimizer(Model.ops)
sgd = self._optimizer sgd = self._optimizer
# Allow dict of args to GoldParse, instead of GoldParse objects. # Allow dict of args to GoldParse, instead of GoldParse objects.
gold_objs = [] gold_objs = []
doc_objs = [] doc_objs = []
@ -442,14 +441,17 @@ class Language(object):
get_grads.alpha = sgd.alpha get_grads.alpha = sgd.alpha
get_grads.b1 = sgd.b1 get_grads.b1 = sgd.b1
get_grads.b2 = sgd.b2 get_grads.b2 = sgd.b2
pipes = list(self.pipeline) pipes = list(self.pipeline)
random.shuffle(pipes) random.shuffle(pipes)
if component_cfg is None:
component_cfg = {}
for name, proc in pipes: for name, proc in pipes:
if not hasattr(proc, "update"): if not hasattr(proc, "update"):
continue continue
grads = {} grads = {}
proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses) kwargs = component_cfg.get(name, {})
kwargs.setdefault("drop", drop)
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
for key, (W, dW) in grads.items(): for key, (W, dW) in grads.items():
sgd(W, dW, key=key) sgd(W, dW, key=key)
@ -473,6 +475,7 @@ class Language(object):
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)] >>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
>>> nlp.rehearse(raw_batch) >>> nlp.rehearse(raw_batch)
""" """
# TODO: document
if len(docs) == 0: if len(docs) == 0:
return return
if sgd is None: if sgd is None:
@ -495,7 +498,6 @@ class Language(object):
get_grads.alpha = sgd.alpha get_grads.alpha = sgd.alpha
get_grads.b1 = sgd.b1 get_grads.b1 = sgd.b1
get_grads.b2 = sgd.b2 get_grads.b2 = sgd.b2
for name, proc in pipes: for name, proc in pipes:
if not hasattr(proc, "rehearse"): if not hasattr(proc, "rehearse"):
continue continue
@ -503,7 +505,6 @@ class Language(object):
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {})) proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
for key, (W, dW) in grads.items(): for key, (W, dW) in grads.items():
sgd(W, dW, key=key) sgd(W, dW, key=key)
return losses return losses
def preprocess_gold(self, docs_golds): def preprocess_gold(self, docs_golds):
@ -519,13 +520,16 @@ class Language(object):
for doc, gold in docs_golds: for doc, gold in docs_golds:
yield doc, gold yield doc, gold
def begin_training(self, get_gold_tuples=None, sgd=None, **cfg): def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg):
"""Allocate models, pre-process training data and acquire a trainer and """Allocate models, pre-process training data and acquire a trainer and
optimizer. Used as a contextmanager. optimizer. Used as a contextmanager.
get_gold_tuples (function): Function returning gold data get_gold_tuples (function): Function returning gold data
component_cfg (dict): Config parameters for specific components.
**cfg: Config parameters. **cfg: Config parameters.
RETURNS: An optimizer RETURNS: An optimizer.
DOCS: https://spacy.io/api/language#begin_training
""" """
if get_gold_tuples is None: if get_gold_tuples is None:
get_gold_tuples = lambda: [] get_gold_tuples = lambda: []
@ -545,10 +549,17 @@ class Language(object):
if sgd is None: if sgd is None:
sgd = create_default_optimizer(Model.ops) sgd = create_default_optimizer(Model.ops)
self._optimizer = sgd self._optimizer = sgd
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline: for name, proc in self.pipeline:
if hasattr(proc, "begin_training"): if hasattr(proc, "begin_training"):
kwargs = component_cfg.get(name, {})
kwargs.update(cfg)
proc.begin_training( proc.begin_training(
get_gold_tuples, pipeline=self.pipeline, sgd=self._optimizer, **cfg get_gold_tuples,
pipeline=self.pipeline,
sgd=self._optimizer,
**kwargs
) )
return self._optimizer return self._optimizer
@ -576,20 +587,27 @@ class Language(object):
proc._rehearsal_model = deepcopy(proc.model) proc._rehearsal_model = deepcopy(proc.model)
return self._optimizer return self._optimizer
def evaluate(self, docs_golds, verbose=False, batch_size=256): def evaluate(
scorer = Scorer() self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
):
if scorer is None:
scorer = Scorer()
if component_cfg is None:
component_cfg = {}
docs, golds = zip(*docs_golds) docs, golds = zip(*docs_golds)
docs = list(docs) docs = list(docs)
golds = list(golds) golds = list(golds)
for name, pipe in self.pipeline: for name, pipe in self.pipeline:
kwargs = component_cfg.get(name, {})
kwargs.setdefault("batch_size", batch_size)
if not hasattr(pipe, "pipe"): if not hasattr(pipe, "pipe"):
docs = (pipe(doc) for doc in docs) docs = (pipe(doc, **kwargs) for doc in docs)
else: else:
docs = pipe.pipe(docs, batch_size=batch_size) docs = pipe.pipe(docs, **kwargs)
for doc, gold in zip(docs, golds): for doc, gold in zip(docs, golds):
if verbose: if verbose:
print(doc) print(doc)
scorer.score(doc, gold, verbose=verbose) kwargs = component_cfg.get("scorer", {})
kwargs.setdefault("verbose", verbose)
scorer.score(doc, gold, **kwargs)
return scorer return scorer
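Note on precedence: `update`, `evaluate` and `pipe` all resolve keyword arguments the same way, by looking up the component's dict and then calling `setdefault` for the top-level value, so an entry in `component_cfg` wins over the method argument. A sketch of that rule as a helper. `resolve_kwargs` is a hypothetical name; unlike the code in the diff, it copies the dict, so the caller's `component_cfg` is not modified, and it treats `None` as empty.

```python
def resolve_kwargs(component_cfg, name, **defaults):
    """Merge top-level defaults into one component's settings.

    Values in component_cfg take precedence. The component's dict is
    copied first, so the caller's config is left untouched.
    """
    if component_cfg is None:
        component_cfg = {}
    kwargs = dict(component_cfg.get(name, {}))
    for key, value in defaults.items():
        kwargs.setdefault(key, value)
    return kwargs


cfg = {"ner": {"batch_size": 32}}
print(resolve_kwargs(cfg, "ner", batch_size=256))      # component setting wins
print(resolve_kwargs(cfg, "tagger", batch_size=256))   # falls back to the default
print(resolve_kwargs(None, "tagger", batch_size=256))  # None is treated as empty
```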
@contextmanager @contextmanager
@ -628,49 +646,57 @@ class Language(object):
self, self,
texts, texts,
as_tuples=False, as_tuples=False,
n_threads=2, n_threads=-1,
batch_size=1000, batch_size=1000,
disable=[], disable=[],
cleanup=False, cleanup=False,
component_cfg=None,
): ):
"""Process texts as a stream, and yield `Doc` objects in order. """Process texts as a stream, and yield `Doc` objects in order.
texts (iterator): A sequence of texts to process. texts (iterator): A sequence of texts to process.
as_tuples (bool): as_tuples (bool): If set to True, inputs should be a sequence of
If set to True, inputs should be a sequence of
(text, context) tuples. Output will then be a sequence of (text, context) tuples. Output will then be a sequence of
(doc, context) tuples. Defaults to False. (doc, context) tuples. Defaults to False.
n_threads (int): Currently inactive.
batch_size (int): The number of texts to buffer. batch_size (int): The number of texts to buffer.
disable (list): Names of the pipeline components to disable. disable (list): Names of the pipeline components to disable.
cleanup (bool): If True, unneeded strings are freed, cleanup (bool): If True, unneeded strings are freed to control memory
to control memory use. Experimental. use. Experimental.
component_cfg (dict): An optional dictionary with extra keyword
arguments for specific components.
YIELDS (Doc): Documents in the order of the original text. YIELDS (Doc): Documents in the order of the original text.
EXAMPLE: DOCS: https://spacy.io/api/language#pipe
>>> texts = [u'One document.', u'...', u'Lots of documents']
>>> for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
>>> assert doc.is_parsed
""" """
if n_threads != -1:
deprecation_warning(Warnings.W016)
if as_tuples: if as_tuples:
text_context1, text_context2 = itertools.tee(texts) text_context1, text_context2 = itertools.tee(texts)
texts = (tc[0] for tc in text_context1) texts = (tc[0] for tc in text_context1)
contexts = (tc[1] for tc in text_context2) contexts = (tc[1] for tc in text_context2)
docs = self.pipe( docs = self.pipe(
texts, n_threads=n_threads, batch_size=batch_size, disable=disable texts,
batch_size=batch_size,
disable=disable,
component_cfg=component_cfg,
) )
for doc, context in izip(docs, contexts): for doc, context in izip(docs, contexts):
yield (doc, context) yield (doc, context)
return return
docs = (self.make_doc(text) for text in texts) docs = (self.make_doc(text) for text in texts)
if component_cfg is None:
component_cfg = {}
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in disable:
continue continue
kwargs = component_cfg.get(name, {})
# Allow component_cfg to overwrite the top-level kwargs.
kwargs.setdefault("batch_size", batch_size)
if hasattr(proc, "pipe"): if hasattr(proc, "pipe"):
docs = proc.pipe(docs, n_threads=n_threads, batch_size=batch_size) docs = proc.pipe(docs, **kwargs)
else: else:
# Apply the function, but yield the doc # Apply the function, but yield the doc
docs = _pipe(proc, docs) docs = _pipe(proc, docs, kwargs)
# Track weakrefs of "recent" documents, so that we can see when they # Track weakrefs of "recent" documents, so that we can see when they
# expire from memory. When they do, we know we don't need old strings. # expire from memory. When they do, we know we don't need old strings.
# This way, we avoid maintaining an unbounded growth in string entries # This way, we avoid maintaining an unbounded growth in string entries
@ -701,124 +727,114 @@ class Language(object):
self.tokenizer._reset_cache(keys) self.tokenizer._reset_cache(keys)
nr_seen = 0 nr_seen = 0
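Note on the `n_threads=-1` default in `pipe`: changing the default to a sentinel lets the method tell "argument not passed" apart from any value a caller supplies, so the deprecation warning fires only for explicit use. A minimal sketch with a toy generator. The function body and warning text here are illustrative, not spaCy's.

```python
import warnings


def pipe(texts, n_threads=-1, batch_size=1000):
    """Toy stand-in for a streaming API with a deprecated argument."""
    # -1 is the sentinel: any other value means the caller passed n_threads.
    if n_threads != -1:
        warnings.warn(
            "`n_threads` is deprecated and has no effect.", DeprecationWarning
        )
    for text in texts:
        yield text.split()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy = list(pipe(["a b"], n_threads=4))  # warns once
    quiet = list(pipe(["c d"]))               # no warning

print(len(caught))
```

Because the function is a generator, the check runs when iteration starts rather than when `pipe` is called.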
def to_disk(self, path, disable=tuple()): def to_disk(self, path, exclude=tuple(), disable=None):
"""Save the current state to a directory. If a model is loaded, this """Save the current state to a directory. If a model is loaded, this
will include the model. will include the model.
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): Path to a directory, which will be created if
it doesn't exist. Paths may be strings or `Path`-like objects. it doesn't exist.
disable (list): Names of pipeline components to disable and prevent exclude (list): Names of components or serialization fields to exclude.
from being saved.
EXAMPLE: DOCS: https://spacy.io/api/language#to_disk
>>> nlp.to_disk('/path/to/models')
""" """
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
path = util.ensure_path(path) path = util.ensure_path(path)
serializers = OrderedDict( serializers = OrderedDict()
( serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(p, exclude=["vocab"])
("tokenizer", lambda p: self.tokenizer.to_disk(p, vocab=False)), serializers["meta.json"] = lambda p: p.open("w").write(srsly.json_dumps(self.meta))
("meta.json", lambda p: p.open("w").write(srsly.json_dumps(self.meta))),
)
)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if not hasattr(proc, "name"): if not hasattr(proc, "name"):
continue continue
if name in disable: if name in exclude:
continue continue
if not hasattr(proc, "to_disk"): if not hasattr(proc, "to_disk"):
continue continue
serializers[name] = lambda p, proc=proc: proc.to_disk(p, vocab=False) serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
serializers["vocab"] = lambda p: self.vocab.to_disk(p) serializers["vocab"] = lambda p: self.vocab.to_disk(p)
util.to_disk(path, serializers, {p: False for p in disable}) util.to_disk(path, serializers, exclude)
def from_disk(self, path, disable=tuple()): def from_disk(self, path, exclude=tuple(), disable=None):
"""Loads state from a directory. Modifies the object in place and """Loads state from a directory. Modifies the object in place and
returns it. If the saved `Language` object contains a model, the returns it. If the saved `Language` object contains a model, the
model will be loaded. model will be loaded.
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory.
strings or `Path`-like objects. exclude (list): Names of components or serialization fields to exclude.
disable (list): Names of the pipeline components to disable.
RETURNS (Language): The modified `Language` object. RETURNS (Language): The modified `Language` object.
EXAMPLE: DOCS: https://spacy.io/api/language#from_disk
>>> from spacy.language import Language
>>> nlp = Language().from_disk('/path/to/models')
""" """
if disable is not None:
deprecation_warning(Warnings.W014)
exclude = disable
path = util.ensure_path(path) path = util.ensure_path(path)
deserializers = OrderedDict( deserializers = OrderedDict()
( deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
("meta.json", lambda p: self.meta.update(srsly.read_json(p))), deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
( deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
"vocab",
lambda p: (
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
),
),
("tokenizer", lambda p: self.tokenizer.from_disk(p, vocab=False)),
)
)
for name, proc in self.pipeline: for name, proc in self.pipeline:
if name in disable: if name in exclude:
continue continue
if not hasattr(proc, "from_disk"): if not hasattr(proc, "from_disk"):
continue continue
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False) deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
exclude = {p: False for p in disable} if not (path / "vocab").exists() and "vocab" not in exclude:
if not (path / "vocab").exists(): # Convert to list here in case exclude is (default) tuple
exclude["vocab"] = True exclude = list(exclude) + ["vocab"]
util.from_disk(path, deserializers, exclude) util.from_disk(path, deserializers, exclude)
self._path = path self._path = path
return self return self
def to_bytes(self, disable=[], **exclude): def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
"""Serialize the current state to a binary string. """Serialize the current state to a binary string.
disable (list): Nameds of pipeline components to disable and prevent exclude (list): Names of components or serialization fields to exclude.
from being serialized.
RETURNS (bytes): The serialized form of the `Language` object. RETURNS (bytes): The serialized form of the `Language` object.
DOCS: https://spacy.io/api/language#to_bytes
""" """
serializers = OrderedDict( if disable is not None:
( deprecation_warning(Warnings.W014)
("vocab", lambda: self.vocab.to_bytes()), exclude = disable
("tokenizer", lambda: self.tokenizer.to_bytes(vocab=False)), serializers = OrderedDict()
("meta", lambda: srsly.json_dumps(self.meta)), serializers["vocab"] = lambda: self.vocab.to_bytes()
) serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
) serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
for i, (name, proc) in enumerate(self.pipeline): for name, proc in self.pipeline:
if name in disable: if name in exclude:
continue continue
if not hasattr(proc, "to_bytes"): if not hasattr(proc, "to_bytes"):
continue continue
serializers[i] = lambda proc=proc: proc.to_bytes(vocab=False) serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, disable=[]): def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
"""Load state from a binary string. """Load state from a binary string.
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
disable (list): Names of the pipeline components to disable. exclude (list): Names of components or serialization fields to exclude.
RETURNS (Language): The `Language` object. RETURNS (Language): The `Language` object.
DOCS: https://spacy.io/api/language#from_bytes
""" """
deserializers = OrderedDict( if disable is not None:
( deprecation_warning(Warnings.W014)
("meta", lambda b: self.meta.update(srsly.json_loads(b))), exclude = disable
( deserializers = OrderedDict()
"vocab", deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b))
lambda b: ( deserializers["vocab"] = lambda b: self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self) deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(b, exclude=["vocab"])
), for name, proc in self.pipeline:
), if name in exclude:
("tokenizer", lambda b: self.tokenizer.from_bytes(b, vocab=False)),
)
)
for i, (name, proc) in enumerate(self.pipeline):
if name in disable:
continue continue
if not hasattr(proc, "from_bytes"): if not hasattr(proc, "from_bytes"):
continue continue
deserializers[i] = lambda b, proc=proc: proc.from_bytes(b, vocab=False) deserializers[name] = lambda b, proc=proc: proc.from_bytes(b, exclude=["vocab"])
util.from_bytes(bytes_data, deserializers, {}) exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_bytes(bytes_data, deserializers, exclude)
return self return self
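Note on the `disable` to `exclude` rename above: each serialization method keeps the old keyword working by copying it into `exclude` after emitting a deprecation warning. The following is a minimal sketch of that pattern in isolation. The function name `resolve_exclude` and the warning text are hypothetical, and the standard `warnings` module stands in for `deprecation_warning(Warnings.W014)`.

```python
import warnings


def resolve_exclude(exclude=tuple(), disable=None):
    """Return the effective exclude list, honouring the deprecated `disable`."""
    if disable is not None:
        # Hypothetical stand-in for deprecation_warning(Warnings.W014).
        warnings.warn("'disable' is deprecated; use 'exclude'", DeprecationWarning)
        # The deprecated argument takes precedence, as in the diff above.
        exclude = disable
    # Convert to a list so callers can append to it (the default is a tuple).
    return list(exclude)
```

Using `None` as the default for `disable` is what lets the shim tell "not passed" apart from "passed an empty list".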
@ -873,7 +889,12 @@ class DisabledPipes(list):
self[:] = [] self[:] = []
def _pipe(func, docs): def _pipe(func, docs, kwargs):
# We added some args for pipe that __call__ doesn't expect.
kwargs = dict(kwargs)
for arg in ["n_threads", "batch_size"]:
if arg in kwargs:
kwargs.pop(arg)
for doc in docs: for doc in docs:
doc = func(doc) doc = func(doc, **kwargs)
yield doc yield doc
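Note on the `_pipe` change above: the helper now forwards keyword arguments to each component but first drops the ones that only make sense for `pipe`. A small self-contained sketch of the same idea follows. The names `pipe_with_kwargs` and `upper` are illustrative only, and `dict.pop(key, None)` is used here in place of the explicit membership check.

```python
def pipe_with_kwargs(func, docs, kwargs):
    # Copy so the caller's dict is not mutated.
    kwargs = dict(kwargs)
    # Drop arguments that the per-item callable does not expect.
    for arg in ["n_threads", "batch_size"]:
        kwargs.pop(arg, None)
    for doc in docs:
        yield func(doc, **kwargs)


def upper(text, suffix=""):
    # Toy "component" that accepts one extra keyword argument.
    return text.upper() + suffix


result = list(pipe_with_kwargs(upper, ["a", "b"], {"batch_size": 2, "suffix": "!"}))
```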

@ -161,17 +161,17 @@ cdef class Lexeme:
Lexeme.c_from_bytes(self.c, lex_data) Lexeme.c_from_bytes(self.c, lex_data)
self.orth = self.c.orth self.orth = self.c.orth
property has_vector: @property
def has_vector(self):
"""RETURNS (bool): Whether a word vector is associated with the object. """RETURNS (bool): Whether a word vector is associated with the object.
""" """
def __get__(self): return self.vocab.has_vector(self.c.orth)
return self.vocab.has_vector(self.c.orth)
property vector_norm: @property
def vector_norm(self):
"""RETURNS (float): The L2 norm of the vector representation.""" """RETURNS (float): The L2 norm of the vector representation."""
def __get__(self): vector = self.vector
vector = self.vector return numpy.sqrt((vector**2).sum())
return numpy.sqrt((vector**2).sum())
property vector: property vector:
"""A real-valued meaning representation. """A real-valued meaning representation.
@ -209,17 +209,17 @@ cdef class Lexeme:
def __set__(self, float sentiment): def __set__(self, float sentiment):
self.c.sentiment = sentiment self.c.sentiment = sentiment
property orth_: @property
def orth_(self):
"""RETURNS (unicode): The original verbatim text of the lexeme """RETURNS (unicode): The original verbatim text of the lexeme
(identical to `Lexeme.text`). Exists mostly for consistency with (identical to `Lexeme.text`). Exists mostly for consistency with
the other attributes.""" the other attributes."""
def __get__(self): return self.vocab.strings[self.c.orth]
return self.vocab.strings[self.c.orth]
property text: @property
def text(self):
"""RETURNS (unicode): The original verbatim text of the lexeme.""" """RETURNS (unicode): The original verbatim text of the lexeme."""
def __get__(self): return self.orth_
return self.orth_
property lower: property lower:
"""RETURNS (unicode): Lowercase form of the lexeme.""" """RETURNS (unicode): Lowercase form of the lexeme."""

@ -19,7 +19,7 @@ from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH
from ._schemas import TOKEN_PATTERN_SCHEMA from ._schemas import TOKEN_PATTERN_SCHEMA
from ..util import get_json_validator, validate_json from ..util import get_json_validator, validate_json
from ..errors import Errors, MatchPatternError from ..errors import Errors, MatchPatternError, Warnings, deprecation_warning
from ..strings import get_string_id from ..strings import get_string_id
from ..attrs import IDS from ..attrs import IDS
@ -153,15 +153,15 @@ cdef class Matcher:
return default return default
return (self._callbacks[key], self._patterns[key]) return (self._callbacks[key], self._patterns[key])
def pipe(self, docs, batch_size=1000, n_threads=2): def pipe(self, docs, batch_size=1000, n_threads=-1):
"""Match a stream of documents, yielding them in turn. """Match a stream of documents, yielding them in turn.
docs (iterable): A stream of documents. docs (iterable): A stream of documents.
batch_size (int): Number of documents to accumulate into a working set. batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel, if the implementation supports multi-threading.
YIELDS (Doc): Documents, in order. YIELDS (Doc): Documents, in order.
""" """
if n_threads != -1:
deprecation_warning(Warnings.W016)
for doc in docs: for doc in docs:
self(doc) self(doc)
yield doc yield doc
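Note on the `n_threads=-1` default above: the argument is kept in the signature for compatibility, and `-1` acts as a sentinel meaning "not set by the caller". The sketch below assumes a plain Python generator and uses the standard `warnings` module in place of `deprecation_warning(Warnings.W016)`; the name `pipe_docs` and the warning text are hypothetical.

```python
import warnings


def pipe_docs(docs, batch_size=1000, n_threads=-1):
    # -1 is a sentinel: any other value means the caller set it explicitly.
    if n_threads != -1:
        warnings.warn("n_threads is deprecated and has no effect", DeprecationWarning)
    for doc in docs:
        yield doc
```

Because this sketch is a generator function, the check runs when iteration starts, not when `pipe_docs(...)` is called.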

@ -166,14 +166,12 @@ cdef class PhraseMatcher:
on_match(self, doc, i, matches) on_match(self, doc, i, matches)
return matches return matches
def pipe(self, stream, batch_size=1000, n_threads=1, return_matches=False, def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
as_tuples=False): as_tuples=False):
"""Match a stream of documents, yielding them in turn. """Match a stream of documents, yielding them in turn.
docs (iterable): A stream of documents. docs (iterable): A stream of documents.
batch_size (int): Number of documents to accumulate into a working set. batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel, if the implementation supports multi-threading.
return_matches (bool): Yield the match lists along with the docs, making return_matches (bool): Yield the match lists along with the docs, making
results (doc, matches) tuples. results (doc, matches) tuples.
as_tuples (bool): Interpret the input stream as (doc, context) tuples, as_tuples (bool): Interpret the input stream as (doc, context) tuples,
@ -184,6 +182,8 @@ cdef class PhraseMatcher:
DOCS: https://spacy.io/api/phrasematcher#pipe DOCS: https://spacy.io/api/phrasematcher#pipe
""" """
if n_threads != -1:
deprecation_warning(Warnings.W016)
if as_tuples: if as_tuples:
for doc, context in stream: for doc, context in stream:
matches = self(doc) matches = self(doc)

@ -141,16 +141,21 @@ class Pipe(object):
with self.model.use_params(params): with self.model.use_params(params):
yield yield
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the pipe to a bytestring.""" """Serialize the pipe to a bytestring.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized object.
"""
serialize = OrderedDict() serialize = OrderedDict()
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
if self.model not in (True, False, None): if self.model not in (True, False, None):
serialize["model"] = self.model.to_bytes serialize["model"] = self.model.to_bytes
serialize["vocab"] = self.vocab.to_bytes serialize["vocab"] = self.vocab.to_bytes
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude) return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load the pipe from a bytestring.""" """Load the pipe from a bytestring."""
def load_model(b): def load_model(b):
@ -161,26 +166,25 @@ class Pipe(object):
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(b) self.model.from_bytes(b)
deserialize = OrderedDict( deserialize = OrderedDict()
( deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))), deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
("vocab", lambda b: self.vocab.from_bytes(b)), deserialize["model"] = load_model
("model", load_model), exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
)
)
util.from_bytes(bytes_data, deserialize, exclude) util.from_bytes(bytes_data, deserialize, exclude)
return self return self
def to_disk(self, path, **exclude): def to_disk(self, path, exclude=tuple(), **kwargs):
"""Serialize the pipe to disk.""" """Serialize the pipe to disk."""
serialize = OrderedDict() serialize = OrderedDict()
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["vocab"] = lambda p: self.vocab.to_disk(p) serialize["vocab"] = lambda p: self.vocab.to_disk(p)
if self.model not in (None, True, False): if self.model not in (None, True, False):
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes()) serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude) util.to_disk(path, serialize, exclude)
def from_disk(self, path, **exclude): def from_disk(self, path, exclude=tuple(), **kwargs):
"""Load the pipe from disk.""" """Load the pipe from disk."""
def load_model(p): def load_model(p):
@ -191,13 +195,11 @@ class Pipe(object):
self.model = self.Model(**self.cfg) self.model = self.Model(**self.cfg)
self.model.from_bytes(p.open("rb").read()) self.model.from_bytes(p.open("rb").read())
deserialize = OrderedDict( deserialize = OrderedDict()
( deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
("cfg", lambda p: self.cfg.update(_load_cfg(p))), deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
("vocab", lambda p: self.vocab.from_disk(p)), deserialize["model"] = load_model
("model", load_model), exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
)
)
util.from_disk(path, deserialize, exclude) util.from_disk(path, deserialize, exclude)
return self return self
@ -255,7 +257,6 @@ class Tensorizer(Pipe):
stream (iterator): A sequence of `Doc` objects to process. stream (iterator): A sequence of `Doc` objects to process.
batch_size (int): Number of `Doc` objects to group. batch_size (int): Number of `Doc` objects to group.
n_threads (int): Number of threads.
YIELDS (iterator): A sequence of `Doc` objects, in order of input. YIELDS (iterator): A sequence of `Doc` objects, in order of input.
""" """
for docs in util.minibatch(stream, size=batch_size): for docs in util.minibatch(stream, size=batch_size):
@ -541,7 +542,7 @@ class Tagger(Pipe):
with self.model.use_params(params): with self.model.use_params(params):
yield yield
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
serialize = OrderedDict() serialize = OrderedDict()
if self.model not in (None, True, False): if self.model not in (None, True, False):
serialize["model"] = self.model.to_bytes serialize["model"] = self.model.to_bytes
@ -549,9 +550,10 @@ class Tagger(Pipe):
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map) serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
return util.to_bytes(serialize, exclude) return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
def load_model(b): def load_model(b):
# TODO: Remove this once we don't have to handle previous models # TODO: Remove this once we don't have to handle previous models
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
@ -576,20 +578,22 @@ class Tagger(Pipe):
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))), ("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
("model", lambda b: load_model(b)), ("model", lambda b: load_model(b)),
)) ))
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_bytes(bytes_data, deserialize, exclude) util.from_bytes(bytes_data, deserialize, exclude)
return self return self
def to_disk(self, path, **exclude): def to_disk(self, path, exclude=tuple(), **kwargs):
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
serialize = OrderedDict(( serialize = OrderedDict((
('vocab', lambda p: self.vocab.to_disk(p)), ("vocab", lambda p: self.vocab.to_disk(p)),
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)), ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
('model', lambda p: p.open("wb").write(self.model.to_bytes())), ("model", lambda p: p.open("wb").write(self.model.to_bytes())),
('cfg', lambda p: srsly.write_json(p, self.cfg)) ("cfg", lambda p: srsly.write_json(p, self.cfg))
)) ))
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
util.to_disk(path, serialize, exclude) util.to_disk(path, serialize, exclude)
def from_disk(self, path, **exclude): def from_disk(self, path, exclude=tuple(), **kwargs):
def load_model(p): def load_model(p):
# TODO: Remove this once we don't have to handle previous models # TODO: Remove this once we don't have to handle previous models
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
@ -612,6 +616,7 @@ class Tagger(Pipe):
("tag_map", load_tag_map), ("tag_map", load_tag_map),
("model", load_model), ("model", load_model),
)) ))
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
util.from_disk(path, deserialize, exclude) util.from_disk(path, deserialize, exclude)
return self return self
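Note on `util.get_serialization_exclude`: the helper is called throughout the hunks above but its body is not part of this excerpt. The sketch below is only an assumption about what such a helper could do, namely fold legacy keyword arguments such as `vocab=False` into the new `exclude` list. The name `merge_legacy_exclude`, the warning text, and the error message are all hypothetical and not taken from the library.

```python
import warnings


def merge_legacy_exclude(serializers, exclude, kwargs):
    """Fold legacy `name=False` keyword arguments into an exclude list.

    Assumed behaviour, for illustration only.
    """
    exclude = list(exclude)
    for key, value in kwargs.items():
        if key in serializers and value is False:
            # Old call style, e.g. to_bytes(vocab=False).
            warnings.warn(
                "Passing {}=False is deprecated; use exclude=['{}']".format(key, key),
                DeprecationWarning,
            )
            exclude.append(key)
        else:
            raise ValueError("Unexpected serialization argument: {}".format(key))
    return exclude
```

With a helper of this shape, `to_bytes(vocab=False)` and `to_bytes(exclude=["vocab"])` would end up excluding the same field.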

@ -248,19 +248,17 @@ cdef class StringStore:
self.add(word) self.add(word)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string. """Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized.
RETURNS (bytes): The serialized form of the `StringStore` object. RETURNS (bytes): The serialized form of the `StringStore` object.
""" """
return srsly.json_dumps(list(self)) return srsly.json_dumps(list(self))
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, **kwargs):
"""Load state from a binary string. """Load state from a binary string.
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded.
RETURNS (StringStore): The `StringStore` object. RETURNS (StringStore): The `StringStore` object.
""" """
strings = srsly.json_loads(bytes_data) strings = srsly.json_loads(bytes_data)

@ -157,6 +157,10 @@ cdef void cpu_log_loss(float* d_scores,
cdef double max_, gmax, Z, gZ cdef double max_, gmax, Z, gZ
best = arg_max_if_gold(scores, costs, is_valid, O) best = arg_max_if_gold(scores, costs, is_valid, O)
guess = arg_max_if_valid(scores, is_valid, O) guess = arg_max_if_valid(scores, is_valid, O)
if best == -1 or guess == -1:
# These shouldn't happen, but if they do, we want to make sure we don't
# cause an OOB access.
return
Z = 1e-10 Z = 1e-10
gZ = 1e-10 gZ = 1e-10
max_ = scores[guess] max_ = scores[guess]
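Note on the `-1` guard above: the argmax helpers signal "no valid class" with `-1`, and the result is then used as an array index. A pure-Python sketch of the same guard is shown below. The functions are simplified stand-ins written for this note, not the Cython implementations.

```python
def arg_max_if_valid(scores, is_valid):
    # Return the index of the best valid score, or -1 if nothing is valid.
    best = -1
    for i, score in enumerate(scores):
        if is_valid[i] and (best == -1 or score > scores[best]):
            best = i
    return best


def score_of_guess(scores, is_valid):
    guess = arg_max_if_valid(scores, is_valid)
    if guess == -1:
        # Bail out early instead of indexing with -1.
        return None
    return scores[guess]
```

In Python, `scores[-1]` would silently return the last element; on a raw C array it is an out-of-bounds read, which is what the early `return` in the diff avoids.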

@ -323,6 +323,12 @@ cdef cppclass StateC:
if this._s_i >= 1: if this._s_i >= 1:
this._s_i -= 1 this._s_i -= 1
void force_final() nogil:
# This should only be used in desperate situations, as it may leave
# the analysis in an unexpected state.
this._s_i = 0
this._b_i = this.length
void unshift() nogil: void unshift() nogil:
this._b_i -= 1 this._b_i -= 1
this._buffer[this._b_i] = this.S(0) this._buffer[this._b_i] = this.S(0)

@ -369,9 +369,9 @@ cdef class ArcEager(TransitionSystem):
actions[LEFT].setdefault('dep', 0) actions[LEFT].setdefault('dep', 0)
return actions return actions
property action_types: @property
def __get__(self): def action_types(self):
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK) return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
def get_cost(self, StateClass state, GoldParse gold, action): def get_cost(self, StateClass state, GoldParse gold, action):
cdef Transition t = self.lookup_transition(action) cdef Transition t = self.lookup_transition(action)

@ -80,9 +80,9 @@ cdef class BiluoPushDown(TransitionSystem):
actions[action][label] += 1 actions[action][label] += 1
return actions return actions
property action_types: @property
def __get__(self): def action_types(self):
return (BEGIN, IN, LAST, UNIT, OUT) return (BEGIN, IN, LAST, UNIT, OUT)
def move_name(self, int move, attr_t label): def move_name(self, int move, attr_t label):
if move == OUT: if move == OUT:
@ -257,30 +257,42 @@ cdef class Missing:
cdef class Begin: cdef class Begin:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
# Ensure we don't clobber preset entities. If no entity preset,
# ent_iob is 0
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 1: cdef int preset_ent_label = st.B_(0).ent_type
# If we're the last token of the input, we can't B -- must U or O.
if st.B(1) == -1:
return False return False
elif preset_ent_iob == 2: elif st.entity_is_open():
return False return False
elif preset_ent_iob == 3 and st.B_(0).ent_type != label: elif label == 0:
return False return False
# If the next word is B or O, we can't B now elif preset_ent_iob == 1 or preset_ent_iob == 2:
# Ensure we don't clobber preset entities. If no entity preset,
# ent_iob is 0
return False
elif preset_ent_iob == 3:
# Okay, we're in a preset entity.
if label != preset_ent_label:
# If label isn't right, reject
return False
elif st.B_(1).ent_iob != 1:
# If next token isn't marked I, we need to make U, not B.
return False
else:
# Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace.
return True
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3: elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
# If the next word is B or O, we can't B now
return False return False
# If the current word is B, and the next word isn't I, the current word
# is really U
elif preset_ent_iob == 3 and st.B_(1).ent_iob != 1:
return False
# Don't allow entities to extend across sentence boundaries
elif st.B_(1).sent_start == 1: elif st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries
return False return False
# Don't allow entities to start on whitespace # Don't allow entities to start on whitespace
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE): elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
return False return False
else: else:
return label != 0 and not st.entity_is_open() return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) nogil:
@ -314,18 +326,27 @@ cdef class In:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 2: if label == 0:
return False
elif st.E_(0).ent_type != label:
return False
elif not st.entity_is_open():
return False
elif st.B(1) == -1:
# If we're at the end, we can't I.
return False
elif preset_ent_iob == 2:
return False return False
elif preset_ent_iob == 3: elif preset_ent_iob == 3:
return False return False
# TODO: Is this quite right? I think it's supposed to be ensuring the elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
# gazetteer matches are maintained # If we know the next word is B or O, we can't be I (must be L)
elif st.B(1) != -1 and st.B_(1).ent_iob != preset_ent_iob:
return False return False
# Don't allow entities to extend across sentence boundaries
elif st.B(1) != -1 and st.B_(1).sent_start == 1: elif st.B(1) != -1 and st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries
return False return False
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label else:
return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) nogil:
@ -370,9 +391,17 @@ cdef class In:
cdef class Last: cdef class Last:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
if st.B_(1).ent_iob == 1: if label == 0:
return False return False
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label elif not st.entity_is_open():
return False
elif st.E_(0).ent_type != label:
return False
elif st.B_(1).ent_iob == 1:
# If a preset entity has I next, we can't L here.
return False
else:
return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) nogil:
@ -416,17 +445,29 @@ cdef class Unit:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 2: cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0:
return False return False
elif preset_ent_iob == 1: elif st.entity_is_open():
return False return False
elif preset_ent_iob == 3 and st.B_(0).ent_type != label: elif preset_ent_iob == 2:
# Don't clobber preset O
return False return False
elif st.B_(1).ent_iob == 1: elif st.B_(1).ent_iob == 1:
# If next token is In, we can't be Unit -- must be Begin
return False return False
elif preset_ent_iob == 3:
# Okay, there's a preset entity here
if label != preset_ent_label:
# Require labels to match
return False
else:
# Otherwise return True, ignoring the whitespace constraint.
return True
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE): elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
return False return False
return label != 0 and not st.entity_is_open() else:
return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) nogil:
@ -461,11 +502,14 @@ cdef class Out:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
if preset_ent_iob == 3: if st.entity_is_open():
return False
elif preset_ent_iob == 3:
return False return False
elif preset_ent_iob == 1: elif preset_ent_iob == 1:
return False return False
return not st.entity_is_open() else:
return True
@staticmethod @staticmethod
cdef int transition(StateC* st, attr_t label) nogil: cdef int transition(StateC* st, attr_t label) nogil:
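Note on the reordered validity checks above: the new `Unit.is_valid` tests the cheap, general conditions first and lets a preset `B` entity bypass the whitespace constraint. The sketch below mirrors that order in plain Python so it can be exercised without the parser state. `TokenState` and `unit_is_valid` are hypothetical names, and the IOB codes follow the comments in the diff (0 = no preset, 1 = I, 2 = O, 3 = B).

```python
from dataclasses import dataclass


@dataclass
class TokenState:
    ent_iob: int = 0       # 0 = no preset, 1 = I, 2 = O, 3 = B
    ent_type: int = 0      # preset entity label, 0 if none
    is_space: bool = False


def unit_is_valid(label, current, next_token, entity_is_open):
    if label == 0:
        return False
    elif entity_is_open:
        return False
    elif current.ent_iob == 2:
        # Don't clobber a preset O.
        return False
    elif next_token.ent_iob == 1:
        # Next token continues an entity, so this one must be Begin.
        return False
    elif current.ent_iob == 3:
        # Preset entity: labels must match; whitespace is ignored.
        return label == current.ent_type
    elif current.is_space:
        return False
    else:
        return True
```

The last two branches show the behavioural change: a whitespace token is normally rejected, but a preset single-token entity on whitespace is accepted when the label matches.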

@ -205,13 +205,11 @@ cdef class Parser:
self.set_annotations([doc], states, tensors=None) self.set_annotations([doc], states, tensors=None)
return doc return doc
def pipe(self, docs, int batch_size=256, int n_threads=2, beam_width=None): def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None):
"""Process a stream of documents. """Process a stream of documents.
stream: The sequence of documents to process. stream: The sequence of documents to process.
batch_size (int): Number of documents to accumulate into a working set. batch_size (int): Number of documents to accumulate into a working set.
n_threads (int): The number of threads with which to work on the buffer
in parallel.
YIELDS (Doc): Documents, in order. YIELDS (Doc): Documents, in order.
""" """
if beam_width is None: if beam_width is None:
@ -221,7 +219,7 @@ cdef class Parser:
for batch in util.minibatch(docs, size=batch_size): for batch in util.minibatch(docs, size=batch_size):
batch_in_order = list(batch) batch_in_order = list(batch)
by_length = sorted(batch_in_order, key=lambda doc: len(doc)) by_length = sorted(batch_in_order, key=lambda doc: len(doc))
for subbatch in util.minibatch(by_length, size=batch_size//4): for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)):
subbatch = list(subbatch) subbatch = list(subbatch)
parse_states = self.predict(subbatch, beam_width=beam_width, parse_states = self.predict(subbatch, beam_width=beam_width,
beam_density=beam_density) beam_density=beam_density)
@ -363,9 +361,14 @@ cdef class Parser:
for i in range(batch_size): for i in range(batch_size):
self.moves.set_valid(is_valid, states[i]) self.moves.set_valid(is_valid, states[i])
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class) guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
action = self.moves.c[guess] if guess == -1:
action.do(states[i], action.label) # This shouldn't happen, but it's hard to raise an error here,
states[i].push_hist(guess) # and we don't want to infinite loop. So, force to end state.
states[i].force_final()
else:
action = self.moves.c[guess]
action.do(states[i], action.label)
states[i].push_hist(guess)
free(is_valid) free(is_valid)
def transition_beams(self, beams, float[:, ::1] scores): def transition_beams(self, beams, float[:, ::1] scores):
@ -598,22 +601,24 @@ cdef class Parser:
self.cfg.update(cfg) self.cfg.update(cfg)
return sgd return sgd
def to_disk(self, path, **exclude): def to_disk(self, path, exclude=tuple(), **kwargs):
serializers = { serializers = {
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True), 'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
'vocab': lambda p: self.vocab.to_disk(p), 'vocab': lambda p: self.vocab.to_disk(p),
'moves': lambda p: self.moves.to_disk(p, strings=False), 'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
'cfg': lambda p: srsly.write_json(p, self.cfg) 'cfg': lambda p: srsly.write_json(p, self.cfg)
} }
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
util.to_disk(path, serializers, exclude) util.to_disk(path, serializers, exclude)
def from_disk(self, path, **exclude): def from_disk(self, path, exclude=tuple(), **kwargs):
deserializers = { deserializers = {
'vocab': lambda p: self.vocab.from_disk(p), 'vocab': lambda p: self.vocab.from_disk(p),
'moves': lambda p: self.moves.from_disk(p, strings=False), 'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
'cfg': lambda p: self.cfg.update(srsly.read_json(p)), 'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
'model': lambda p: None 'model': lambda p: None
} }
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
util.from_disk(path, deserializers, exclude) util.from_disk(path, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
path = util.ensure_path(path) path = util.ensure_path(path)
@ -627,22 +632,24 @@ cdef class Parser:
self.cfg.update(cfg) self.cfg.update(cfg)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
serializers = OrderedDict(( serializers = OrderedDict((
('model', lambda: (self.model.to_bytes() if self.model is not True else True)), ('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
('vocab', lambda: self.vocab.to_bytes()), ('vocab', lambda: self.vocab.to_bytes()),
('moves', lambda: self.moves.to_bytes(strings=False)), ('moves', lambda: self.moves.to_bytes(exclude=["strings"])),
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)) ('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
)) ))
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
deserializers = OrderedDict(( deserializers = OrderedDict((
('vocab', lambda b: self.vocab.from_bytes(b)), ('vocab', lambda b: self.vocab.from_bytes(b)),
('moves', lambda b: self.moves.from_bytes(b, strings=False)), ('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])),
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))), ('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
('model', lambda b: None) ('model', lambda b: None)
)) ))
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if 'model' not in exclude: if 'model' not in exclude:
# TODO: Remove this once we don't have to handle previous models # TODO: Remove this once we don't have to handle previous models
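Note on the sub-batch size change in `Parser.pipe` above: `batch_size//4` evaluates to zero for any `batch_size` below 4, and the new `max(..., 2)` puts a floor under it. The arithmetic can be checked directly; `sub_batch_size` is a name made up for this illustration.

```python
def sub_batch_size(batch_size):
    # Same expression as in the diff: a quarter of the batch, but at least 2.
    return max(batch_size // 4, 2)


# Map each batch size to (old expression, new expression).
sizes = {n: (n // 4, sub_batch_size(n)) for n in (1, 3, 4, 256)}
```

For large batches the two expressions agree; they differ only for small batches, where the old one could yield a size of zero or one.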

@ -94,6 +94,13 @@ cdef class TransitionSystem:
raise ValueError(Errors.E024) raise ValueError(Errors.E024)
return history return history
def apply_transition(self, StateClass state, name):
if not self.is_valid(state, name):
raise ValueError(
"Cannot apply transition {name}: invalid for the current state.".format(name=name))
action = self.lookup_transition(name)
action.do(state.c, action.label)
cdef int initialize_state(self, StateC* state) nogil: cdef int initialize_state(self, StateC* state) nogil:
pass pass
@ -201,30 +208,32 @@ cdef class TransitionSystem:
self.labels[action][label_name] = new_freq-1 self.labels[action][label_name] = new_freq-1
return 1 return 1
def to_disk(self, path, **exclude): def to_disk(self, path, **kwargs):
with path.open('wb') as file_: with path.open('wb') as file_:
file_.write(self.to_bytes(**exclude)) file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude): def from_disk(self, path, **kwargs):
with path.open('rb') as file_: with path.open('rb') as file_:
byte_data = file_.read() byte_data = file_.read()
self.from_bytes(byte_data, **exclude) self.from_bytes(byte_data, **kwargs)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
transitions = [] transitions = []
serializers = { serializers = {
'moves': lambda: srsly.json_dumps(self.labels), 'moves': lambda: srsly.json_dumps(self.labels),
'strings': lambda: self.strings.to_bytes() 'strings': lambda: self.strings.to_bytes()
} }
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
labels = {} labels = {}
deserializers = { deserializers = {
'moves': lambda b: labels.update(srsly.json_loads(b)), 'moves': lambda b: labels.update(srsly.json_loads(b)),
'strings': lambda b: self.strings.from_bytes(b) 'strings': lambda b: self.strings.from_bytes(b)
} }
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
self.initialize_actions(labels) self.initialize_actions(labels)
return self return self


@ -1,46 +1,44 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest
from spacy.tokens import Doc
from spacy.attrs import ORTH, SHAPE, POS, DEP from spacy.attrs import ORTH, SHAPE, POS, DEP
from ..util import get_doc from ..util import get_doc
def test_doc_array_attr_of_token(en_tokenizer, en_vocab): def test_doc_array_attr_of_token(en_vocab):
text = "An example sentence" doc = Doc(en_vocab, words=["An", "example", "sentence"])
tokens = en_tokenizer(text) example = doc.vocab["example"]
example = tokens.vocab["example"]
assert example.orth != example.shape assert example.orth != example.shape
feats_array = tokens.to_array((ORTH, SHAPE)) feats_array = doc.to_array((ORTH, SHAPE))
assert feats_array[0][0] != feats_array[0][1] assert feats_array[0][0] != feats_array[0][1]
assert feats_array[0][0] != feats_array[0][1] assert feats_array[0][0] != feats_array[0][1]
def test_doc_stringy_array_attr_of_token(en_tokenizer, en_vocab): def test_doc_stringy_array_attr_of_token(en_vocab):
text = "An example sentence" doc = Doc(en_vocab, words=["An", "example", "sentence"])
tokens = en_tokenizer(text) example = doc.vocab["example"]
example = tokens.vocab["example"]
assert example.orth != example.shape assert example.orth != example.shape
feats_array = tokens.to_array((ORTH, SHAPE)) feats_array = doc.to_array((ORTH, SHAPE))
feats_array_stringy = tokens.to_array(("ORTH", "SHAPE")) feats_array_stringy = doc.to_array(("ORTH", "SHAPE"))
assert feats_array_stringy[0][0] == feats_array[0][0] assert feats_array_stringy[0][0] == feats_array[0][0]
assert feats_array_stringy[0][1] == feats_array[0][1] assert feats_array_stringy[0][1] == feats_array[0][1]
def test_doc_scalar_attr_of_token(en_tokenizer, en_vocab): def test_doc_scalar_attr_of_token(en_vocab):
text = "An example sentence" doc = Doc(en_vocab, words=["An", "example", "sentence"])
tokens = en_tokenizer(text) example = doc.vocab["example"]
example = tokens.vocab["example"]
assert example.orth != example.shape assert example.orth != example.shape
feats_array = tokens.to_array(ORTH) feats_array = doc.to_array(ORTH)
assert feats_array.shape == (3,) assert feats_array.shape == (3,)
def test_doc_array_tag(en_tokenizer): def test_doc_array_tag(en_vocab):
text = "A nice sentence." words = ["A", "nice", "sentence", "."]
pos = ["DET", "ADJ", "NOUN", "PUNCT"] pos = ["DET", "ADJ", "NOUN", "PUNCT"]
tokens = en_tokenizer(text) doc = get_doc(en_vocab, words=words, pos=pos)
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
feats_array = doc.to_array((ORTH, POS)) feats_array = doc.to_array((ORTH, POS))
assert feats_array[0][1] == doc[0].pos assert feats_array[0][1] == doc[0].pos
@ -49,13 +47,22 @@ def test_doc_array_tag(en_tokenizer):
assert feats_array[3][1] == doc[3].pos assert feats_array[3][1] == doc[3].pos
def test_doc_array_dep(en_tokenizer): def test_doc_array_dep(en_vocab):
text = "A nice sentence." words = ["A", "nice", "sentence", "."]
deps = ["det", "amod", "ROOT", "punct"] deps = ["det", "amod", "ROOT", "punct"]
tokens = en_tokenizer(text) doc = get_doc(en_vocab, words=words, deps=deps)
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
feats_array = doc.to_array((ORTH, DEP)) feats_array = doc.to_array((ORTH, DEP))
assert feats_array[0][1] == doc[0].dep assert feats_array[0][1] == doc[0].dep
assert feats_array[1][1] == doc[1].dep assert feats_array[1][1] == doc[1].dep
assert feats_array[2][1] == doc[2].dep assert feats_array[2][1] == doc[2].dep
assert feats_array[3][1] == doc[3].dep assert feats_array[3][1] == doc[3].dep
@pytest.mark.parametrize("attrs", [["ORTH", "SHAPE"], "IS_ALPHA"])
def test_doc_array_to_from_string_attrs(en_vocab, attrs):
"""Test that both Doc.to_array and Doc.from_array accept string attrs,
as well as single attrs and sequences of attrs.
"""
words = ["An", "example", "sentence"]
doc = Doc(en_vocab, words=words)
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))


@ -4,9 +4,10 @@ from __future__ import unicode_literals
import pytest import pytest
import numpy import numpy
from spacy.tokens import Doc from spacy.tokens import Doc, Span
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.errors import ModelsWarning from spacy.errors import ModelsWarning
from spacy.attrs import ENT_TYPE, ENT_IOB
from ..util import get_doc from ..util import get_doc
@ -112,14 +113,14 @@ def test_doc_api_serialize(en_tokenizer, text):
assert [t.orth for t in tokens] == [t.orth for t in new_tokens] assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
new_tokens = Doc(tokens.vocab).from_bytes( new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(tensor=False), tensor=False tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
) )
assert tokens.text == new_tokens.text assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens] assert [t.text for t in tokens] == [t.text for t in new_tokens]
assert [t.orth for t in tokens] == [t.orth for t in new_tokens] assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
new_tokens = Doc(tokens.vocab).from_bytes( new_tokens = Doc(tokens.vocab).from_bytes(
tokens.to_bytes(sentiment=False), sentiment=False tokens.to_bytes(exclude=["sentiment"]), exclude=["sentiment"]
) )
assert tokens.text == new_tokens.text assert tokens.text == new_tokens.text
assert [t.text for t in tokens] == [t.text for t in new_tokens] assert [t.text for t in tokens] == [t.text for t in new_tokens]
@ -256,3 +257,24 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
assert lca[1, 1] == 1 assert lca[1, 1] == 1
assert lca[0, 1] == 2 assert lca[0, 1] == 2
assert lca[1, 2] == 2 assert lca[1, 2] == 2
def test_doc_is_nered(en_vocab):
words = ["I", "live", "in", "New", "York"]
doc = Doc(en_vocab, words=words)
assert not doc.is_nered
doc.ents = [Span(doc, 3, 5, label="GPE")]
assert doc.is_nered
# Test creating doc from array with unknown values
arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64")
doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr)
assert doc.is_nered
# Test serialization
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
assert new_doc.is_nered
def test_doc_lang(en_vocab):
doc = Doc(en_vocab, words=["Hello", "world"])
assert doc.lang_ == "en"
assert doc.lang == en_vocab.strings["en"]


@ -178,11 +178,10 @@ def test_span_string_label(doc):
assert span.label == doc.vocab.strings["hello"] assert span.label == doc.vocab.strings["hello"]
def test_span_string_set_label(doc): def test_span_label_readonly(doc):
span = Span(doc, 0, 1) span = Span(doc, 0, 1)
span.label_ = "hello" with pytest.raises(NotImplementedError):
assert span.label_ == "hello" span.label_ = "hello"
assert span.label == doc.vocab.strings["hello"]
def test_span_ents_property(doc): def test_span_ents_property(doc):


@ -199,3 +199,31 @@ def test_token0_has_sent_start_true():
assert doc[0].is_sent_start is True assert doc[0].is_sent_start is True
assert doc[1].is_sent_start is None assert doc[1].is_sent_start is None
assert not doc.is_sentenced assert not doc.is_sentenced
def test_token_api_conjuncts_chain(en_vocab):
words = "The boy and the girl and the man went .".split()
heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1]
deps = ["det", "nsubj", "cc", "det", "conj", "cc", "det", "conj", "ROOT", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[1].conjuncts] == ["girl", "man"]
assert [w.text for w in doc[4].conjuncts] == ["boy", "man"]
assert [w.text for w in doc[7].conjuncts] == ["boy", "girl"]
def test_token_api_conjuncts_simple(en_vocab):
words = "They came and went .".split()
heads = [1, 0, -1, -2, -1]
deps = ["nsubj", "ROOT", "cc", "conj", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[1].conjuncts] == ["went"]
assert [w.text for w in doc[3].conjuncts] == ["came"]
def test_token_api_non_conjuncts(en_vocab):
words = "They came .".split()
heads = [1, 0, -1]
deps = ["nsubj", "ROOT", "punct"]
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
assert [w.text for w in doc[0].conjuncts] == []
assert [w.text for w in doc[1].conjuncts] == []


@ -106,3 +106,37 @@ def test_underscore_raises_for_invalid(invalid_kwargs):
def test_underscore_accepts_valid(valid_kwargs): def test_underscore_accepts_valid(valid_kwargs):
valid_kwargs["force"] = True valid_kwargs["force"] = True
Doc.set_extension("test", **valid_kwargs) Doc.set_extension("test", **valid_kwargs)
def test_underscore_mutable_defaults_list(en_vocab):
"""Test that mutable default arguments are handled correctly (see #2581)."""
Doc.set_extension("mutable", default=[])
doc1 = Doc(en_vocab, words=["one"])
doc2 = Doc(en_vocab, words=["two"])
doc1._.mutable.append("foo")
assert len(doc1._.mutable) == 1
assert doc1._.mutable[0] == "foo"
assert len(doc2._.mutable) == 0
doc1._.mutable = ["bar", "baz"]
doc1._.mutable.append("foo")
assert len(doc1._.mutable) == 3
assert len(doc2._.mutable) == 0
def test_underscore_mutable_defaults_dict(en_vocab):
"""Test that mutable default arguments are handled correctly (see #2581)."""
Token.set_extension("mutable", default={})
token1 = Doc(en_vocab, words=["one"])[0]
token2 = Doc(en_vocab, words=["two"])[0]
token1._.mutable["foo"] = "bar"
assert len(token1._.mutable) == 1
assert token1._.mutable["foo"] == "bar"
assert len(token2._.mutable) == 0
token1._.mutable["foo"] = "baz"
assert len(token1._.mutable) == 1
assert token1._.mutable["foo"] == "baz"
token1._.mutable["x"] = []
token1._.mutable["x"].append("y")
assert len(token1._.mutable) == 2
assert token1._.mutable["x"] == ["y"]
assert len(token2._.mutable) == 0
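Note: the two tests above guard against a mutable default being shared between objects (see #2581). The following is a minimal standalone sketch of that failure mode and one possible remedy, copying the default on first access. `SharedDefault` and `CopiedDefault` are hypothetical names used only for illustration; this is not spaCy's actual `Underscore` implementation.

```python
import copy


class SharedDefault:
    """Naive store: every key falls back to the very same default object."""

    def __init__(self, default):
        self.default = default
        self.values = {}

    def get(self, key):
        return self.values.get(key, self.default)


class CopiedDefault(SharedDefault):
    """Copies the default the first time a key is read, so mutations stay local."""

    def get(self, key):
        if key not in self.values:
            self.values[key] = copy.deepcopy(self.default)
        return self.values[key]


shared = SharedDefault(default=[])
shared.get("doc1").append("foo")
leaked = len(shared.get("doc2"))  # doc2 sees the item appended via doc1

copied = CopiedDefault(default=[])
copied.get("doc1").append("foo")
isolated = len(copied.get("doc2"))  # doc2 still gets its own empty list
```

In the sketch, `leaked` ends up as 1 and `isolated` as 0, which mirrors the `len(doc2._.mutable) == 0` assertions in the tests.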


@ -2,22 +2,24 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
import numpy
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.displacy import render from spacy.displacy import render
from spacy.gold import iob_to_biluo from spacy.gold import iob_to_biluo
from spacy.lang.it import Italian from spacy.lang.it import Italian
import numpy
from spacy.lang.en import English from spacy.lang.en import English
from ..util import add_vecs_to_vocab, get_doc from ..util import add_vecs_to_vocab, get_doc
@pytest.mark.xfail( @pytest.mark.xfail
reason="The dot is now properly split off, but the prefix/suffix rules are not applied again afterwards."
"This means that the quote will still be attached to the remaining token."
)
def test_issue2070(): def test_issue2070():
"""Test that checks that a dot followed by a quote is handled appropriately.""" """Test that checks that a dot followed by a quote is handled
appropriately.
"""
# Problem: The dot is now properly split off, but the prefix/suffix rules
# are not applied again afterwards. This means that the quote will still be
# attached to the remaining token.
nlp = English() nlp = English()
doc = nlp('First sentence."A quoted sentence" he said ...') doc = nlp('First sentence."A quoted sentence" he said ...')
assert len(doc) == 11 assert len(doc) == 11
@ -37,6 +39,26 @@ def test_issue2179():
assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",) assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",)
def test_issue2203(en_vocab):
"""Test that lemmas are set correctly in doc.from_array."""
words = ["I", "'ll", "survive"]
tags = ["PRP", "MD", "VB"]
lemmas = ["-PRON-", "will", "survive"]
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
doc = Doc(en_vocab, words=words)
# Work around lemma corruption problem and set lemmas after tags
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags
assert [t.lemma_ for t in doc] == lemmas
# We need to serialize both tag and lemma, since this is what causes the bug
doc_array = doc.to_array(["TAG", "LEMMA"])
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
assert [t.tag_ for t in new_doc] == tags
assert [t.lemma_ for t in new_doc] == lemmas
def test_issue2219(en_vocab): def test_issue2219(en_vocab):
vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])] vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
add_vecs_to_vocab(en_vocab, vectors) add_vecs_to_vocab(en_vocab, vectors)


@ -0,0 +1,26 @@
# coding: utf8
from __future__ import unicode_literals
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.pipeline import EntityRuler, EntityRecognizer
def test_issue3345():
"""Test case where preset entity crosses sentence boundary."""
nlp = English()
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
doc[4].is_sent_start = True
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
ner = EntityRecognizer(doc.vocab)
# Add the OUT action. I wouldn't have thought this would be necessary...
ner.moves.add_action(5, "")
ner.add_label("GPE")
doc = ruler(doc)
# Get into the state just before "New"
state = ner.moves.init_batch([doc])[0]
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
ner.moves.apply_transition(state, "O")
# Check that B-GPE is valid.
assert ner.moves.is_valid(state, "B-GPE")


@ -0,0 +1,21 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from spacy.lang.en import English
from spacy.matcher import Matcher, PhraseMatcher
def test_issue3410():
texts = ["Hello world", "This is a test"]
nlp = English()
matcher = Matcher(nlp.vocab)
phrasematcher = PhraseMatcher(nlp.vocab)
with pytest.deprecated_call():
docs = list(nlp.pipe(texts, n_threads=4))
with pytest.deprecated_call():
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
with pytest.deprecated_call():
list(matcher.pipe(docs, n_threads=4))
with pytest.deprecated_call():
list(phrasematcher.pipe(docs, n_threads=4))
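Note: the test above expects a deprecation warning whenever `n_threads` is passed to a `pipe` method. Here is a rough sketch of that pattern using only the standard `warnings` module. The `pipe` function and its message text are stand-ins (assumptions), not spaCy's `deprecation_warning` helper or its `W016` message.

```python
import warnings


def pipe(texts, batch_size=1000, n_threads=-1):
    """Yield whitespace-split texts; warn if the ignored n_threads is set."""
    if n_threads != -1:
        warnings.warn("n_threads is deprecated and has no effect", DeprecationWarning)
    for text in texts:
        yield text.split()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    gen = pipe(["Hello world"], n_threads=4)
    before = len(caught)  # generator body has not run yet, so no warning
    tokens = list(gen)  # consuming the generator triggers the warning
    after = len(caught)
    category = caught[0].category
```

Because `pipe` in this sketch is a generator, the warning is only emitted once it is consumed. That is one reason to wrap such calls in `list(...)` inside the warning-capturing context, as the test does.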


@ -1,6 +1,7 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.compat import path2str from spacy.compat import path2str
@ -41,3 +42,18 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
doc.to_disk(file_path) doc.to_disk(file_path)
doc_d = Doc(en_vocab).from_disk(file_path) doc_d = Doc(en_vocab).from_disk(file_path)
assert doc.to_bytes() == doc_d.to_bytes() assert doc.to_bytes() == doc_d.to_bytes()
def test_serialize_doc_exclude(en_vocab):
doc = Doc(en_vocab, words=["hello", "world"])
doc.user_data["foo"] = "bar"
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
assert new_doc.user_data["foo"] == "bar"
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(), exclude=["user_data"])
assert not new_doc.user_data
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
assert not new_doc.user_data
with pytest.raises(ValueError):
doc.to_bytes(user_data=False)
with pytest.raises(ValueError):
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
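Note: the test above exercises the new calling convention, in which field names go into an `exclude` list and old-style keyword flags such as `user_data=False` raise a `ValueError`. Below is a simplified standalone sketch of that behaviour. `resolve_exclude` and `to_dict` are hypothetical helpers written for illustration; they are not spaCy's `util.get_serialization_exclude` or `util.to_bytes`, whose real logic may differ.

```python
from collections import OrderedDict


def resolve_exclude(serializers, exclude, kwargs):
    """Return the excluded field names, rejecting legacy keyword flags."""
    exclude = list(exclude)
    for key in kwargs:
        if key in serializers:
            raise ValueError(
                "Use exclude=['{}'] instead of a keyword argument".format(key)
            )
    return exclude


def to_dict(serializers, exclude=tuple(), **kwargs):
    """Call every serializer whose name is not excluded."""
    exclude = resolve_exclude(serializers, exclude, kwargs)
    return OrderedDict(
        (name, getter()) for name, getter in serializers.items() if name not in exclude
    )


serializers = OrderedDict(
    (("text", lambda: "hello world"), ("user_data", lambda: {"foo": "bar"}))
)
full = to_dict(serializers)
trimmed = to_dict(serializers, exclude=["user_data"])
try:
    to_dict(serializers, user_data=False)
    legacy_rejected = False
except ValueError:
    legacy_rejected = True
```

Passing the exclusions as an explicit list keeps arbitrary keyword arguments free for other uses and lets a misspelled or outdated flag fail loudly instead of being silently ignored.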


@ -52,3 +52,19 @@ def test_serialize_with_custom_tokenizer():
nlp.tokenizer = custom_tokenizer(nlp) nlp.tokenizer = custom_tokenizer(nlp)
with make_tempdir() as d: with make_tempdir() as d:
nlp.to_disk(d) nlp.to_disk(d)
def test_serialize_language_exclude(meta_data):
name = "name-in-fixture"
nlp = Language(meta=meta_data)
assert nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes())
assert new_nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"])
assert not new_nlp.meta["name"] == name
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
assert not new_nlp.meta["name"] == name
with pytest.raises(ValueError):
nlp.to_bytes(meta=False)
with pytest.raises(ValueError):
Language().from_bytes(nlp.to_bytes(), meta=False)


@ -55,7 +55,9 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
parser_d = Parser(en_vocab) parser_d = Parser(en_vocab)
parser_d.model, _ = parser_d.Model(0) parser_d.model, _ = parser_d.Model(0)
parser_d = parser_d.from_disk(file_path) parser_d = parser_d.from_disk(file_path)
assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False) parser_bytes = parser.to_bytes(exclude=["model"])
parser_d_bytes = parser_d.to_bytes(exclude=["model"])
assert parser_bytes == parser_d_bytes
def test_to_from_bytes(parser, blank_parser): def test_to_from_bytes(parser, blank_parser):
@ -114,3 +116,25 @@ def test_serialize_textcat_empty(en_vocab):
# See issue #1105 # See issue #1105
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"]) textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
textcat.to_bytes() textcat.to_bytes()
@pytest.mark.parametrize("Parser", test_parsers)
def test_serialize_pipe_exclude(en_vocab, Parser):
def get_new_parser():
new_parser = Parser(en_vocab)
new_parser.model, _ = new_parser.Model(0)
return new_parser
parser = Parser(en_vocab)
parser.model, _ = parser.Model(0)
parser.cfg["foo"] = "bar"
new_parser = get_new_parser().from_bytes(parser.to_bytes())
assert "foo" in new_parser.cfg
new_parser = get_new_parser().from_bytes(parser.to_bytes(), exclude=["cfg"])
assert "foo" not in new_parser.cfg
new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["cfg"]))
assert "foo" not in new_parser.cfg
with pytest.raises(ValueError):
parser.to_bytes(cfg=False)
with pytest.raises(ValueError):
get_new_parser().from_bytes(parser.to_bytes(), cfg=False)


@ -12,13 +12,12 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])]
test_strings_attrs = [(["rats", "are", "cute"], "Hello")] test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
@pytest.mark.xfail
@pytest.mark.parametrize("text", ["rat"]) @pytest.mark.parametrize("text", ["rat"])
def test_serialize_vocab(en_vocab, text): def test_serialize_vocab(en_vocab, text):
text_hash = en_vocab.strings.add(text) text_hash = en_vocab.strings.add(text)
vocab_bytes = en_vocab.to_bytes() vocab_bytes = en_vocab.to_bytes()
new_vocab = Vocab().from_bytes(vocab_bytes) new_vocab = Vocab().from_bytes(vocab_bytes)
assert new_vocab.strings(text_hash) == text assert new_vocab.strings[text_hash] == text
@pytest.mark.parametrize("strings1,strings2", test_strings) @pytest.mark.parametrize("strings1,strings2", test_strings)
@ -69,6 +68,15 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr):
assert vocab2[strings[0]].norm_ == lex_attr assert vocab2[strings[0]].norm_ == lex_attr
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
def test_deserialize_vocab_seen_entries(strings, lex_attr):
# Reported in #2153
vocab = Vocab(strings=strings)
length = len(vocab)
vocab.from_bytes(vocab.to_bytes())
assert len(vocab) == length
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr): def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
vocab1 = Vocab(strings=strings) vocab1 = Vocab(strings=strings)


@ -1,38 +1,28 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest from spacy.cli.converters import conllu2json
import os
from pathlib import Path
from spacy.compat import symlink_to, symlink_remove, path2str
@pytest.fixture def test_cli_converters_conllu2json():
def target_local_path(): # https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu
return Path("./foo-target") lines = [
"1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO",
"2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER",
@pytest.fixture "3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tI-PER",
def link_local_path(): "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
return Path("./foo-symlink") ]
input_data = "\n".join(lines)
converted = conllu2json(input_data, n_sents=1)
@pytest.fixture(scope="function") assert len(converted) == 1
def setup_target(request, target_local_path, link_local_path): assert converted[0]["id"] == 0
if not target_local_path.exists(): assert len(converted[0]["paragraphs"]) == 1
os.mkdir(path2str(target_local_path)) assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
sent = converted[0]["paragraphs"][0]["sentences"][0]
# yield -- need to cleanup even if assertion fails assert len(sent["tokens"]) == 4
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240 tokens = sent["tokens"]
def cleanup(): assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår"]
symlink_remove(link_local_path) assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
os.rmdir(path2str(target_local_path)) assert [t["head"] for t in tokens] == [1, 2, -1, 0]
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
request.addfinalizer(cleanup) assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
def test_create_symlink_windows(setup_target, target_local_path, link_local_path):
assert target_local_path.exists()
symlink_to(link_local_path, target_local_path)
assert link_local_path.exists()


@ -0,0 +1,90 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy import displacy
from spacy.tokens import Span
from spacy.lang.fa import Persian
from .util import get_doc
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
ents = displacy.parse_ents(doc)
assert isinstance(ents, dict)
assert ents["text"] == "But Google is starting from behind "
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
def test_displacy_parse_deps(en_vocab):
"""Test that deps and tags on a Doc are converted into displaCy's format."""
words = ["This", "is", "a", "sentence"]
heads = [1, 0, 1, -2]
pos = ["DET", "VERB", "DET", "NOUN"]
tags = ["DT", "VBZ", "DT", "NN"]
deps = ["nsubj", "ROOT", "det", "attr"]
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "VERB"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
{"start": 2, "end": 3, "label": "det", "dir": "left"},
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
]
def test_displacy_spans(en_vocab):
"""Test that displaCy can render Spans."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc[1:4], style="ent")
assert html.startswith("<div")
def test_displacy_raises_for_wrong_type(en_vocab):
with pytest.raises(ValueError):
displacy.render("hello world")
def test_displacy_rtl():
# Source: http://www.sobhe.ir/hazm/ is this correct?
words = ["ما", "بسیار", "کتاب", "می\u200cخوانیم"]
# These are (likely) wrong, but it's just for testing
pos = ["PRO", "ADV", "N_PL", "V_SUB"] # needs to match lang.fa.tag_map
deps = ["foo", "bar", "foo", "baz"]
heads = [1, 0, 1, -2]
nlp = Persian()
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
doc.ents = [Span(doc, 1, 3, label="TEST")]
html = displacy.render(doc, page=True, style="dep")
assert "direction: rtl" in html
assert 'direction="rtl"' in html
assert 'lang="{}"'.format(nlp.lang) in html
html = displacy.render(doc, page=True, style="ent")
assert "direction: rtl" in html
assert 'lang="{}"'.format(nlp.lang) in html
def test_displacy_render_wrapper(en_vocab):
"""Test that displaCy accepts custom rendering wrapper."""
def wrapper(html):
return "TEST" + html + "TEST"
displacy.set_render_wrapper(wrapper)
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc, style="ent")
assert html.startswith("TEST<div")
assert html.endswith("/div>TEST")
# Restore
displacy.set_render_wrapper(lambda html: html)


@ -2,14 +2,35 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
import os
from pathlib import Path from pathlib import Path
from spacy import util from spacy import util
from spacy import displacy
from spacy import prefer_gpu, require_gpu from spacy import prefer_gpu, require_gpu
from spacy.tokens import Span from spacy.compat import symlink_to, symlink_remove, path2str
from spacy._ml import PrecomputableAffine from spacy._ml import PrecomputableAffine
from .util import get_doc
@pytest.fixture
def symlink_target():
return Path("./foo-target")
@pytest.fixture
def symlink():
return Path("./foo-symlink")
@pytest.fixture(scope="function")
def symlink_setup_target(request, symlink_target, symlink):
if not symlink_target.exists():
os.mkdir(path2str(symlink_target))
# yield -- need to cleanup even if assertion fails
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
def cleanup():
symlink_remove(symlink)
os.rmdir(path2str(symlink_target))
request.addfinalizer(cleanup)
@pytest.mark.parametrize("text", ["hello/world", "hello world"]) @pytest.mark.parametrize("text", ["hello/world", "hello world"])
@ -31,66 +52,6 @@ def test_util_get_package_path(package):
assert isinstance(path, Path) assert isinstance(path, Path)
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
ents = displacy.parse_ents(doc)
assert isinstance(ents, dict)
assert ents["text"] == "But Google is starting from behind "
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
def test_displacy_parse_deps(en_vocab):
"""Test that deps and tags on a Doc are converted into displaCy's format."""
words = ["This", "is", "a", "sentence"]
heads = [1, 0, 1, -2]
pos = ["DET", "VERB", "DET", "NOUN"]
tags = ["DT", "VBZ", "DT", "NN"]
deps = ["nsubj", "ROOT", "det", "attr"]
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
deps = displacy.parse_deps(doc)
assert isinstance(deps, dict)
assert deps["words"] == [
{"text": "This", "tag": "DET"},
{"text": "is", "tag": "VERB"},
{"text": "a", "tag": "DET"},
{"text": "sentence", "tag": "NOUN"},
]
assert deps["arcs"] == [
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
{"start": 2, "end": 3, "label": "det", "dir": "left"},
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
]
def test_displacy_spans(en_vocab):
"""Test that displaCy can render Spans."""
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc[1:4], style="ent")
assert html.startswith("<div")
def test_displacy_render_wrapper(en_vocab):
"""Test that displaCy accepts custom rendering wrapper."""
def wrapper(html):
return "TEST" + html + "TEST"
displacy.set_render_wrapper(wrapper)
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
html = displacy.render(doc, style="ent")
assert html.startswith("TEST<div")
assert html.endswith("/div>TEST")
def test_displacy_raises_for_wrong_type(en_vocab):
with pytest.raises(ValueError):
displacy.render("hello world")
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP) model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
assert model.W.shape == (nF, nO, nP, nI) assert model.W.shape == (nF, nO, nP, nI)
@ -124,3 +85,9 @@ def test_prefer_gpu():
def test_require_gpu(): def test_require_gpu():
with pytest.raises(ValueError): with pytest.raises(ValueError):
require_gpu() require_gpu()
def test_create_symlink_windows(symlink_setup_target, symlink_target, symlink):
assert symlink_target.exists()
symlink_to(symlink, symlink_target)
assert symlink.exists()


@ -45,3 +45,8 @@ def test_vocab_api_contains(en_vocab, text):
_ = en_vocab[text] # noqa: F841 _ = en_vocab[text] # noqa: F841
assert text in en_vocab assert text in en_vocab
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
def test_vocab_writing_system(en_vocab):
assert en_vocab.writing_system["direction"] == "ltr"
assert en_vocab.writing_system["has_case"] is True


@ -125,7 +125,7 @@ cdef class Tokenizer:
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
return doc return doc
def pipe(self, texts, batch_size=1000, n_threads=2): def pipe(self, texts, batch_size=1000, n_threads=-1):
"""Tokenize a stream of texts. """Tokenize a stream of texts.
texts: A sequence of unicode texts. texts: A sequence of unicode texts.
@ -134,6 +134,8 @@ cdef class Tokenizer:
DOCS: https://spacy.io/api/tokenizer#pipe DOCS: https://spacy.io/api/tokenizer#pipe
""" """
if n_threads != -1:
deprecation_warning(Warnings.W016)
for text in texts: for text in texts:
yield self(text) yield self(text)
@ -360,36 +362,37 @@ cdef class Tokenizer:
self._cache.set(key, cached) self._cache.set(key, cached)
self._rules[string] = substrings self._rules[string] = substrings
def to_disk(self, path, **exclude): def to_disk(self, path, **kwargs):
"""Save the current state to a directory. """Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/tokenizer#to_disk DOCS: https://spacy.io/api/tokenizer#to_disk
""" """
with path.open("wb") as file_: with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude)) file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude): def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and """Loads state from a directory. Modifies the object in place and
returns it. returns it.
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory.
strings or `Path`-like objects. exclude (list): String names of serialization fields to exclude.
RETURNS (Tokenizer): The modified `Tokenizer` object. RETURNS (Tokenizer): The modified `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_disk DOCS: https://spacy.io/api/tokenizer#from_disk
""" """
with path.open("rb") as file_: with path.open("rb") as file_:
bytes_data = file_.read() bytes_data = file_.read()
self.from_bytes(bytes_data, **exclude) self.from_bytes(bytes_data, **kwargs)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the current state to a binary string. """Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized. exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Tokenizer` object. RETURNS (bytes): The serialized form of the `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#to_bytes DOCS: https://spacy.io/api/tokenizer#to_bytes
@ -402,13 +405,14 @@ cdef class Tokenizer:
("token_match", lambda: _get_regex_pattern(self.token_match)), ("token_match", lambda: _get_regex_pattern(self.token_match)),
("exceptions", lambda: OrderedDict(sorted(self._rules.items()))) ("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
)) ))
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load state from a binary string. """Load state from a binary string.
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. exclude (list): String names of serialization fields to exclude.
RETURNS (Tokenizer): The `Tokenizer` object. RETURNS (Tokenizer): The `Tokenizer` object.
DOCS: https://spacy.io/api/tokenizer#from_bytes DOCS: https://spacy.io/api/tokenizer#from_bytes
@ -422,6 +426,7 @@ cdef class Tokenizer:
("token_match", lambda b: data.setdefault("token_match", b)), ("token_match", lambda b: data.setdefault("token_match", b)),
("exceptions", lambda b: data.setdefault("rules", b)) ("exceptions", lambda b: data.setdefault("rules", b))
)) ))
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
if data.get("prefix_search"): if data.get("prefix_search"):
self.prefix_search = re.compile(data["prefix_search"]).search self.prefix_search = re.compile(data["prefix_search"]).search
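
The serialization methods in this diff move from boolean keyword arguments to an explicit `exclude` list, with `util.get_serialization_exclude` bridging the two. The sketch below illustrates one way such a bridge could work; `merge_exclude` and `to_bytes_sketch` are hypothetical names, and the behavior shown (warn on `name=False`, reject other uses of a serializer name as a keyword) is an assumption, not the library's actual code.

```python
import warnings


def merge_exclude(serializers, exclude=(), **legacy_kwargs):
    """Combine an `exclude` list with legacy `name=False` keyword arguments."""
    merged = list(exclude)
    for key, value in legacy_kwargs.items():
        if key not in serializers:
            continue  # unrelated keyword, ignored in this sketch
        if value is False:
            warnings.warn(
                "%s=False is deprecated, use exclude=['%s']" % (key, key),
                DeprecationWarning,
            )
            merged.append(key)
        else:
            raise ValueError("use the exclude argument instead of %r" % key)
    return merged


def to_bytes_sketch(serializers, exclude=()):
    """Call every serializer whose name is not excluded."""
    return {name: get() for name, get in serializers.items() if name not in exclude}


serializers = {
    "vocab": lambda: b"vocab-bytes",
    "prefix_search": lambda: b"prefix-bytes",
    "exceptions": lambda: b"rule-bytes",
}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    exclude = merge_exclude(serializers, ["exceptions"], vocab=False)
payload = to_bytes_sketch(serializers, exclude)
```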

@ -240,8 +240,18 @@ cdef class Doc:
for i in range(1, self.length): for i in range(1, self.length):
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1: if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
return True return True
else: return False
return False
@property
def is_nered(self):
"""Check if the document has named entities set. Will return True if
*any* of the tokens has a named entity tag set (even if the others are
unknown values).
"""
for i in range(self.length):
if self.c[i].ent_iob != 0:
return True
return False
def __getitem__(self, object i): def __getitem__(self, object i):
"""Get a `Token` or `Span` object. """Get a `Token` or `Span` object.
@ -374,7 +384,8 @@ cdef class Doc:
xp = get_array_module(vector) xp = get_array_module(vector)
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
property has_vector: @property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with """A boolean value indicating whether a word vector is associated with
the object. the object.
@ -382,15 +393,14 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#has_vector DOCS: https://spacy.io/api/doc#has_vector
""" """
def __get__(self): if "has_vector" in self.user_hooks:
if "has_vector" in self.user_hooks: return self.user_hooks["has_vector"](self)
return self.user_hooks["has_vector"](self) elif self.vocab.vectors.data.size:
elif self.vocab.vectors.data.size: return True
return True elif self.tensor.size:
elif self.tensor.size: return True
return True else:
else: return False
return False
property vector: property vector:
"""A real-valued meaning representation. Defaults to an average of the """A real-valued meaning representation. Defaults to an average of the
@ -443,22 +453,22 @@ cdef class Doc:
def __set__(self, value): def __set__(self, value):
self._vector_norm = value self._vector_norm = value
property text: @property
def text(self):
"""A unicode representation of the document text. """A unicode representation of the document text.
RETURNS (unicode): The original verbatim text of the document. RETURNS (unicode): The original verbatim text of the document.
""" """
def __get__(self): return "".join(t.text_with_ws for t in self)
return "".join(t.text_with_ws for t in self)
property text_with_ws: @property
def text_with_ws(self):
"""An alias of `Doc.text`, provided for duck-type compatibility with """An alias of `Doc.text`, provided for duck-type compatibility with
`Span` and `Token`. `Span` and `Token`.
RETURNS (unicode): The original verbatim text of the document. RETURNS (unicode): The original verbatim text of the document.
""" """
def __get__(self): return self.text
return self.text
property ents: property ents:
"""The named entities in the document. Returns a tuple of named entity """The named entities in the document. Returns a tuple of named entity
@ -535,7 +545,8 @@ cdef class Doc:
# Set start as B # Set start as B
self.c[start].ent_iob = 3 self.c[start].ent_iob = 3
property noun_chunks: @property
def noun_chunks(self):
"""Iterate over the base noun phrases in the document. Yields base """Iterate over the base noun phrases in the document. Yields base
noun-phrase `Span` objects, if the document has been noun-phrase `Span` objects, if the document has been
syntactically parsed. A base noun phrase, or "NP chunk", is a noun syntactically parsed. A base noun phrase, or "NP chunk", is a noun
@ -547,22 +558,22 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#noun_chunks DOCS: https://spacy.io/api/doc#noun_chunks
""" """
def __get__(self): if not self.is_parsed:
if not self.is_parsed: raise ValueError(Errors.E029)
raise ValueError(Errors.E029) # Accumulate the result before beginning to iterate over it. This
# Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us
# prevents the tokenisation from being changed out from under us # during the iteration. The tricky thing here is that Span accepts
# during the iteration. The tricky thing here is that Span accepts # its tokenisation changing, so it's okay once we have the Span
# its tokenisation changing, so it's okay once we have the Span # objects. See Issue #375.
# objects. See Issue #375. spans = []
spans = [] if self.noun_chunks_iterator is not None:
if self.noun_chunks_iterator is not None: for start, end, label in self.noun_chunks_iterator(self):
for start, end, label in self.noun_chunks_iterator(self): spans.append(Span(self, start, end, label=label))
spans.append(Span(self, start, end, label=label)) for span in spans:
for span in spans: yield span
yield span
property sents: @property
def sents(self):
"""Iterate over the sentences in the document. Yields sentence `Span` """Iterate over the sentences in the document. Yields sentence `Span`
objects. Sentence spans have no label. To improve accuracy on informal objects. Sentence spans have no label. To improve accuracy on informal
texts, spaCy calculates sentence boundaries from the syntactic texts, spaCy calculates sentence boundaries from the syntactic
@ -573,19 +584,28 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#sents DOCS: https://spacy.io/api/doc#sents
""" """
def __get__(self): if not self.is_sentenced:
if not self.is_sentenced: raise ValueError(Errors.E030)
raise ValueError(Errors.E030) if "sents" in self.user_hooks:
if "sents" in self.user_hooks: yield from self.user_hooks["sents"](self)
yield from self.user_hooks["sents"](self) else:
else: start = 0
start = 0 for i in range(1, self.length):
for i in range(1, self.length): if self.c[i].sent_start == 1:
if self.c[i].sent_start == 1: yield Span(self, start, i)
yield Span(self, start, i) start = i
start = i if start != self.length:
if start != self.length: yield Span(self, start, self.length)
yield Span(self, start, self.length)
@property
def lang(self):
"""RETURNS (uint64): ID of the language of the doc's vocabulary."""
return self.vocab.strings[self.vocab.lang]
@property
def lang_(self):
"""RETURNS (unicode): Language of the doc's vocabulary, e.g. 'en'."""
return self.vocab.lang
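
The fallback branch of `Doc.sents` above walks the per-token `sent_start` flags and closes a span whenever a new sentence begins. A plain-Python sketch of that loop, using a list of integers in place of the token structs (the function name `sentence_spans` is made up for illustration):

```python
def sentence_spans(sent_starts):
    """Return (start, end) token offsets from per-token sent_start flags.

    A flag of 1 marks the first token of a sentence; end is exclusive.
    """
    spans = []
    start = 0
    for i in range(1, len(sent_starts)):
        if sent_starts[i] == 1:
            spans.append((start, i))
            start = i
    # Emit the trailing sentence, if any tokens remain
    if start != len(sent_starts):
        spans.append((start, len(sent_starts)))
    return spans


spans = sentence_spans([1, 0, 0, 1, 0])
```

The first token's flag is never inspected, since a document always starts a sentence at offset 0.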
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1: cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
if self.length == 0: if self.length == 0:
@ -727,6 +747,18 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#from_array DOCS: https://spacy.io/api/doc#from_array
""" """
# Handle scalar/list inputs of strings/ints for py_attr_ids
# See also #3064
if isinstance(attrs, basestring_):
# Handle inputs like doc.to_array('ORTH')
attrs = [attrs]
elif not hasattr(attrs, "__iter__"):
# Handle inputs like doc.to_array(ORTH)
attrs = [attrs]
# Allow strings, e.g. 'lemma' or 'LEMMA'
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
for id_ in attrs]
if SENT_START in attrs and HEAD in attrs: if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032) raise ValueError(Errors.E032)
cdef int i, col cdef int i, col
@ -739,17 +771,20 @@ cdef class Doc:
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t)) attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
for i, attr_id in enumerate(attrs): for i, attr_id in enumerate(attrs):
attr_ids[i] = attr_id attr_ids[i] = attr_id
if len(array.shape) == 1:
array = array.reshape((array.size, 1))
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
if TAG in attrs:
col = attrs.index(TAG)
for i in range(length):
if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
# Now load the data # Now load the data
for i in range(self.length): for i in range(self.length):
token = &self.c[i] token = &self.c[i]
for j in range(n_attrs): for j in range(n_attrs):
Token.set_struct_attr(token, attr_ids[j], array[i, j]) if attr_ids[j] != TAG:
# Auxiliary loading logic Token.set_struct_attr(token, attr_ids[j], array[i, j])
for col, attr_id in enumerate(attrs):
if attr_id == TAG:
for i in range(length):
if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
# Set flags # Set flags
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs) self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs) self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
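
The input handling added at the top of `from_array` accepts a single attribute, a string name, or a mixed list, and converts everything to integer IDs. A standalone sketch of that normalization follows; the `IDS` mapping here holds made-up values for illustration and is not spaCy's attribute table.

```python
# Hypothetical name-to-ID table; the real values live in spaCy's attrs module
IDS = {"ORTH": 1, "LEMMA": 2, "TAG": 3}


def normalize_attrs(attrs):
    """Return a list of integer attribute IDs for scalar, string or list input."""
    if isinstance(attrs, str):
        attrs = [attrs]  # a single name such as 'ORTH'
    elif not hasattr(attrs, "__iter__"):
        attrs = [attrs]  # a single integer ID
    # Strings are looked up case-insensitively, integers pass through
    return [IDS[a.upper()] if hasattr(a, "upper") else a for a in attrs]
```

Checking for the string case first matters, because a string is itself iterable and would otherwise be split into characters.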
@ -770,24 +805,26 @@ cdef class Doc:
""" """
return numpy.asarray(_get_lca_matrix(self, 0, len(self))) return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
def to_disk(self, path, **exclude): def to_disk(self, path, **kwargs):
"""Save the current state to a directory. """Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist. Paths may be either strings or Path-like objects.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/doc#to_disk DOCS: https://spacy.io/api/doc#to_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
with path.open("wb") as file_: with path.open("wb") as file_:
file_.write(self.to_bytes(**exclude)) file_.write(self.to_bytes(**kwargs))
def from_disk(self, path, **exclude): def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and """Loads state from a directory. Modifies the object in place and
returns it. returns it.
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects. strings or `Path`-like objects.
exclude (list): String names of serialization fields to exclude.
RETURNS (Doc): The modified `Doc` object. RETURNS (Doc): The modified `Doc` object.
DOCS: https://spacy.io/api/doc#from_disk DOCS: https://spacy.io/api/doc#from_disk
@ -795,11 +832,12 @@ cdef class Doc:
path = util.ensure_path(path) path = util.ensure_path(path)
with path.open("rb") as file_: with path.open("rb") as file_:
bytes_data = file_.read() bytes_data = file_.read()
return self.from_bytes(bytes_data, **exclude) return self.from_bytes(bytes_data, **kwargs)
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize, i.e. export the document contents to a binary string. """Serialize, i.e. export the document contents to a binary string.
exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
all annotations. all annotations.
@ -825,16 +863,22 @@ cdef class Doc:
"sentiment": lambda: self.sentiment, "sentiment": lambda: self.sentiment,
"tensor": lambda: self.tensor, "tensor": lambda: self.tensor,
} }
for key in kwargs:
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
raise ValueError(Errors.E128.format(arg=key))
if "user_data" not in exclude and self.user_data: if "user_data" not in exclude and self.user_data:
user_data_keys, user_data_values = list(zip(*self.user_data.items())) user_data_keys, user_data_values = list(zip(*self.user_data.items()))
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys) if "user_data_keys" not in exclude:
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values) serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
if "user_data_values" not in exclude:
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Deserialize, i.e. import the document contents from a binary string. """Deserialize, i.e. import the document contents from a binary string.
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
exclude (list): String names of serialization fields to exclude.
RETURNS (Doc): Itself. RETURNS (Doc): Itself.
DOCS: https://spacy.io/api/doc#from_bytes DOCS: https://spacy.io/api/doc#from_bytes
@ -850,6 +894,9 @@ cdef class Doc:
"user_data_keys": lambda b: None, "user_data_keys": lambda b: None,
"user_data_values": lambda b: None, "user_data_values": lambda b: None,
} }
for key in kwargs:
if key in deserializers or key in ("user_data",):
raise ValueError(Errors.E128.format(arg=key))
msg = util.from_bytes(bytes_data, deserializers, exclude) msg = util.from_bytes(bytes_data, deserializers, exclude)
# Msgpack doesn't distinguish between lists and tuples, which is # Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within # vexing for user data. As a best guess, we *know* that within
@ -990,11 +1037,11 @@ cdef class Doc:
DOCS: https://spacy.io/api/doc#to_json DOCS: https://spacy.io/api/doc#to_json
""" """
data = {"text": self.text} data = {"text": self.text}
if self.ents: if self.is_nered:
data["ents"] = [{"start": ent.start_char, "end": ent.end_char, data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
"label": ent.label_} for ent in self.ents] "label": ent.label_} for ent in self.ents]
sents = list(self.sents) if self.is_sentenced:
if sents: sents = list(self.sents)
data["sents"] = [{"start": sent.start_char, "end": sent.end_char} data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
for sent in sents] for sent in sents]
if self.cats: if self.cats:
@ -1002,13 +1049,11 @@ cdef class Doc:
data["tokens"] = [] data["tokens"] = []
for token in self: for token in self:
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)} token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
if token.pos_: if self.is_tagged:
token_data["pos"] = token.pos_ token_data["pos"] = token.pos_
if token.tag_:
token_data["tag"] = token.tag_ token_data["tag"] = token.tag_
if token.dep_: if self.is_parsed:
token_data["dep"] = token.dep_ token_data["dep"] = token.dep_
if token.head:
token_data["head"] = token.head.i token_data["head"] = token.head.i
data["tokens"].append(token_data) data["tokens"].append(token_data)
if underscore: if underscore:
@ -1179,7 +1224,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
def pickle_doc(doc): def pickle_doc(doc):
bytes_data = doc.to_bytes(vocab=False, user_data=False) bytes_data = doc.to_bytes(exclude=["vocab", "user_data"])
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks, hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
doc.user_token_hooks) doc.user_token_hooks)
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data)) return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
@ -1188,7 +1233,7 @@ def pickle_doc(doc):
def unpickle_doc(vocab, hooks_and_data, bytes_data): def unpickle_doc(vocab, hooks_and_data, bytes_data):
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data) user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude="user_data") doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude=["user_data"])
doc.user_hooks.update(doc_hooks) doc.user_hooks.update(doc_hooks)
doc.user_span_hooks.update(span_hooks) doc.user_span_hooks.update(span_hooks)
doc.user_token_hooks.update(token_hooks) doc.user_token_hooks.update(token_hooks)
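
The `to_json` change above keys each output field on a document-level flag (`is_tagged`, `is_parsed`) instead of on whether an individual token value happens to be truthy. A rough sketch of that pattern on plain dictionaries, where the token layout and the function name `tokens_to_json` are assumptions made for illustration:

```python
def tokens_to_json(tokens, is_tagged=False, is_parsed=False):
    """Build per-token dicts, adding annotation keys only when the flag is set."""
    out = []
    for i, tok in enumerate(tokens):
        entry = {"id": i, "start": tok["idx"], "end": tok["idx"] + len(tok["text"])}
        if is_tagged:
            entry["pos"] = tok.get("pos", "")
            entry["tag"] = tok.get("tag", "")
        if is_parsed:
            entry["dep"] = tok.get("dep", "")
            entry["head"] = tok.get("head", i)
        out.append(entry)
    return out


tokens = [{"text": "Hello", "idx": 0, "pos": "INTJ", "tag": "UH"},
          {"text": "world", "idx": 6, "pos": "", "tag": ""}]
untagged = tokens_to_json(tokens)
tagged = tokens_to_json(tokens, is_tagged=True)
```

With the flag set, every token carries the same keys even when a value is empty, which keeps the output schema uniform.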

@ -322,46 +322,47 @@ cdef class Span:
self.start = start self.start = start
self.end = end + 1 self.end = end + 1
property vocab: @property
def vocab(self):
"""RETURNS (Vocab): The Span's Doc's vocab.""" """RETURNS (Vocab): The Span's Doc's vocab."""
def __get__(self): return self.doc.vocab
return self.doc.vocab
property sent: @property
def sent(self):
"""RETURNS (Span): The sentence span that the span is a part of.""" """RETURNS (Span): The sentence span that the span is a part of."""
def __get__(self): if "sent" in self.doc.user_span_hooks:
if "sent" in self.doc.user_span_hooks: return self.doc.user_span_hooks["sent"](self)
return self.doc.user_span_hooks["sent"](self) # This should raise if not parsed / no custom sentence boundaries
# This should raise if not parsed / no custom sentence boundaries self.doc.sents
self.doc.sents # If doc is parsed we can use the deps to find the sentence
# If doc is parsed we can use the deps to find the sentence # otherwise we use the `sent_start` token attribute
# otherwise we use the `sent_start` token attribute cdef int n = 0
cdef int n = 0 cdef int i
cdef int i if self.doc.is_parsed:
if self.doc.is_parsed: root = &self.doc.c[self.start]
root = &self.doc.c[self.start] while root.head != 0:
while root.head != 0: root += root.head
root += root.head n += 1
n += 1 if n >= self.doc.length:
if n >= self.doc.length: raise RuntimeError(Errors.E038)
raise RuntimeError(Errors.E038) return self.doc[root.l_edge:root.r_edge + 1]
return self.doc[root.l_edge:root.r_edge + 1] elif self.doc.is_sentenced:
elif self.doc.is_sentenced: # Find start of the sentence
# Find start of the sentence start = self.start
start = self.start while self.doc.c[start].sent_start != 1 and start > 0:
while self.doc.c[start].sent_start != 1 and start > 0: start += -1
start += -1 # Find end of the sentence
# Find end of the sentence end = self.end
end = self.end n = 0
n = 0 while end < self.doc.length and self.doc.c[end].sent_start != 1:
while end < self.doc.length and self.doc.c[end].sent_start != 1: end += 1
end += 1 n += 1
n += 1 if n >= self.doc.length:
if n >= self.doc.length: break
break return self.doc[start:end]
return self.doc[start:end]
property ents: @property
def ents(self):
"""The named entities in the span. Returns a tuple of named entity """The named entities in the span. Returns a tuple of named entity
`Span` objects, if the entity recognizer has been applied. `Span` objects, if the entity recognizer has been applied.
@ -369,14 +370,14 @@ cdef class Span:
DOCS: https://spacy.io/api/span#ents DOCS: https://spacy.io/api/span#ents
""" """
def __get__(self): ents = []
ents = [] for ent in self.doc.ents:
for ent in self.doc.ents: if ent.start >= self.start and ent.end <= self.end:
if ent.start >= self.start and ent.end <= self.end: ents.append(ent)
ents.append(ent) return ents
return ents
property has_vector: @property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with """A boolean value indicating whether a word vector is associated with
the object. the object.
@ -384,17 +385,17 @@ cdef class Span:
DOCS: https://spacy.io/api/span#has_vector DOCS: https://spacy.io/api/span#has_vector
""" """
def __get__(self): if "has_vector" in self.doc.user_span_hooks:
if "has_vector" in self.doc.user_span_hooks: return self.doc.user_span_hooks["has_vector"](self)
return self.doc.user_span_hooks["has_vector"](self) elif self.vocab.vectors.data.size > 0:
elif self.vocab.vectors.data.size > 0: return any(token.has_vector for token in self)
return any(token.has_vector for token in self) elif self.doc.tensor.size > 0:
elif self.doc.tensor.size > 0: return True
return True else:
else: return False
return False
property vector: @property
def vector(self):
"""A real-valued meaning representation. Defaults to an average of the """A real-valued meaning representation. Defaults to an average of the
token vectors. token vectors.
@ -403,61 +404,61 @@ cdef class Span:
DOCS: https://spacy.io/api/span#vector DOCS: https://spacy.io/api/span#vector
""" """
def __get__(self): if "vector" in self.doc.user_span_hooks:
if "vector" in self.doc.user_span_hooks: return self.doc.user_span_hooks["vector"](self)
return self.doc.user_span_hooks["vector"](self) if self._vector is None:
if self._vector is None: self._vector = sum(t.vector for t in self) / len(self)
self._vector = sum(t.vector for t in self) / len(self) return self._vector
return self._vector
property vector_norm: @property
def vector_norm(self):
"""The L2 norm of the span's vector representation. """The L2 norm of the span's vector representation.
RETURNS (float): The L2 norm of the vector representation. RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/span#vector_norm DOCS: https://spacy.io/api/span#vector_norm
""" """
def __get__(self): if "vector_norm" in self.doc.user_span_hooks:
if "vector_norm" in self.doc.user_span_hooks: return self.doc.user_span_hooks["vector_norm"](self)
return self.doc.user_span_hooks["vector_norm"](self) cdef float value
cdef float value cdef double norm = 0
cdef double norm = 0 if self._vector_norm is None:
if self._vector_norm is None: norm = 0
norm = 0 for value in self.vector:
for value in self.vector: norm += value * value
norm += value * value self._vector_norm = sqrt(norm) if norm != 0 else 0
self._vector_norm = sqrt(norm) if norm != 0 else 0 return self._vector_norm
return self._vector_norm
property sentiment: @property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or """RETURNS (float): A scalar value indicating the positivity or
negativity of the span. negativity of the span.
""" """
def __get__(self): if "sentiment" in self.doc.user_span_hooks:
if "sentiment" in self.doc.user_span_hooks: return self.doc.user_span_hooks["sentiment"](self)
return self.doc.user_span_hooks["sentiment"](self) else:
else: return sum([token.sentiment for token in self]) / len(self)
return sum([token.sentiment for token in self]) / len(self)
property text: @property
def text(self):
"""RETURNS (unicode): The original verbatim text of the span.""" """RETURNS (unicode): The original verbatim text of the span."""
def __get__(self): text = self.text_with_ws
text = self.text_with_ws if self[-1].whitespace_:
if self[-1].whitespace_: text = text[:-1]
text = text[:-1] return text
return text
property text_with_ws: @property
def text_with_ws(self):
"""The text content of the span with a trailing whitespace character if """The text content of the span with a trailing whitespace character if
the last token has one. the last token has one.
RETURNS (unicode): The text content of the span (with trailing RETURNS (unicode): The text content of the span (with trailing
whitespace). whitespace).
""" """
def __get__(self): return "".join([t.text_with_ws for t in self])
return "".join([t.text_with_ws for t in self])
property noun_chunks: @property
def noun_chunks(self):
"""Yields base noun-phrase `Span` objects, if the document has been """Yields base noun-phrase `Span` objects, if the document has been
syntactically parsed. A base noun phrase, or "NP chunk", is a noun syntactically parsed. A base noun phrase, or "NP chunk", is a noun
phrase that does not permit other NPs to be nested within it so no phrase that does not permit other NPs to be nested within it so no
@ -468,23 +469,23 @@ cdef class Span:
DOCS: https://spacy.io/api/span#noun_chunks DOCS: https://spacy.io/api/span#noun_chunks
""" """
def __get__(self): if not self.doc.is_parsed:
if not self.doc.is_parsed: raise ValueError(Errors.E029)
raise ValueError(Errors.E029) # Accumulate the result before beginning to iterate over it. This
# Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us
# prevents the tokenisation from being changed out from under us # during the iteration. The tricky thing here is that Span accepts
# during the iteration. The tricky thing here is that Span accepts # its tokenisation changing, so it's okay once we have the Span
# its tokenisation changing, so it's okay once we have the Span # objects. See Issue #375
# objects. See Issue #375 spans = []
spans = [] cdef attr_t label
cdef attr_t label if self.doc.noun_chunks_iterator is not None:
if self.doc.noun_chunks_iterator is not None: for start, end, label in self.doc.noun_chunks_iterator(self):
for start, end, label in self.doc.noun_chunks_iterator(self): spans.append(Span(self.doc, start, end, label=label))
spans.append(Span(self.doc, start, end, label=label)) for span in spans:
for span in spans: yield span
yield span
property root: @property
def root(self):
"""The token with the shortest path to the root of the """The token with the shortest path to the root of the
sentence (or the root itself). If multiple tokens are equally sentence (or the root itself). If multiple tokens are equally
high in the tree, the first token is taken. high in the tree, the first token is taken.
@ -493,41 +494,51 @@ cdef class Span:
DOCS: https://spacy.io/api/span#root DOCS: https://spacy.io/api/span#root
""" """
def __get__(self): self._recalculate_indices()
self._recalculate_indices() if "root" in self.doc.user_span_hooks:
if "root" in self.doc.user_span_hooks: return self.doc.user_span_hooks["root"](self)
return self.doc.user_span_hooks["root"](self) # This should probably be called 'head', and the other one called
# This should probably be called 'head', and the other one called # 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/ cdef int i
cdef int i # First, we scan through the Span, and check whether there's a word
# First, we scan through the Span, and check whether there's a word # with head==0, i.e. a sentence root. If so, we can return it. The
# with head==0, i.e. a sentence root. If so, we can return it. The # longer the span, the more likely it contains a sentence root, and
# longer the span, the more likely it contains a sentence root, and # in this case we return in linear time.
# in this case we return in linear time. for i in range(self.start, self.end):
for i in range(self.start, self.end): if self.doc.c[i].head == 0:
if self.doc.c[i].head == 0: return self.doc[i]
return self.doc[i] # If we don't have a sentence root, we do something that's not so
# If we don't have a sentence root, we do something that's not so # algorithmically clever, but I think should be quite fast,
# algorithmically clever, but I think should be quite fast, # especially for short spans.
# especially for short spans. # For each word, we count the path length, and arg min this measure.
# For each word, we count the path length, and arg min this measure. # We could use better tree logic to save steps here...But I
# We could use better tree logic to save steps here...But I # think this should be okay.
# think this should be okay. cdef int current_best = self.doc.length
cdef int current_best = self.doc.length cdef int root = -1
cdef int root = -1 for i in range(self.start, self.end):
for i in range(self.start, self.end): if self.start <= (i+self.doc.c[i].head) < self.end:
if self.start <= (i+self.doc.c[i].head) < self.end: continue
continue words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length) if words_to_root < current_best:
if words_to_root < current_best: current_best = words_to_root
current_best = words_to_root root = i
root = i if root == -1:
if root == -1: return self.doc[self.start]
return self.doc[self.start] else:
else: return self.doc[root]
return self.doc[root]
property lefts: @property
def conjuncts(self):
"""Tokens that are conjoined to the span's root.
RETURNS (tuple): A tuple of Token objects.
DOCS: https://spacy.io/api/span#conjuncts
"""
return self.root.conjuncts
@property
def lefts(self):
"""Tokens that are to the left of the span, whose head is within the """Tokens that are to the left of the span, whose head is within the
`Span`. `Span`.
@ -535,13 +546,13 @@ cdef class Span:
DOCS: https://spacy.io/api/span#lefts DOCS: https://spacy.io/api/span#lefts
""" """
def __get__(self): for token in reversed(self): # Reverse, so we get tokens in order
for token in reversed(self): # Reverse, so we get tokens in order for left in token.lefts:
for left in token.lefts: if left.i < self.start:
if left.i < self.start: yield left
yield left
property rights: @property
def rights(self):
"""Tokens that are to the right of the Span, whose head is within the """Tokens that are to the right of the Span, whose head is within the
`Span`. `Span`.
@ -549,13 +560,13 @@ cdef class Span:
DOCS: https://spacy.io/api/span#rights DOCS: https://spacy.io/api/span#rights
""" """
def __get__(self): for token in self:
for token in self: for right in token.rights:
for right in token.rights: if right.i >= self.end:
if right.i >= self.end: yield right
yield right
property n_lefts: @property
def n_lefts(self):
"""The number of tokens that are to the left of the span, whose """The number of tokens that are to the left of the span, whose
heads are within the span. heads are within the span.
@ -564,10 +575,10 @@ cdef class Span:
DOCS: https://spacy.io/api/span#n_lefts DOCS: https://spacy.io/api/span#n_lefts
""" """
def __get__(self): return len(list(self.lefts))
return len(list(self.lefts))
property n_rights: @property
def n_rights(self):
"""The number of tokens that are to the right of the span, whose """The number of tokens that are to the right of the span, whose
heads are within the span. heads are within the span.
@ -576,22 +587,21 @@ cdef class Span:
DOCS: https://spacy.io/api/span#n_rights DOCS: https://spacy.io/api/span#n_rights
""" """
def __get__(self): return len(list(self.rights))
return len(list(self.rights))
property subtree: @property
def subtree(self):
"""Tokens within the span and tokens which descend from them. """Tokens within the span and tokens which descend from them.
YIELDS (Token): A token within the span, or a descendant from it. YIELDS (Token): A token within the span, or a descendant from it.
DOCS: https://spacy.io/api/span#subtree DOCS: https://spacy.io/api/span#subtree
""" """
def __get__(self): for word in self.lefts:
for word in self.lefts: yield from word.subtree
yield from word.subtree yield from self
yield from self for word in self.rights:
for word in self.rights: yield from word.subtree
yield from word.subtree
property ent_id: property ent_id:
"""RETURNS (uint64): The entity ID.""" """RETURNS (uint64): The entity ID."""
@ -609,33 +619,33 @@ cdef class Span:
def __set__(self, hash_t key): def __set__(self, hash_t key):
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_")) raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
property orth_: @property
def orth_(self):
"""Verbatim text content (identical to `Span.text`). Exists mostly for """Verbatim text content (identical to `Span.text`). Exists mostly for
consistency with other attributes. consistency with other attributes.
RETURNS (unicode): The span's text.""" RETURNS (unicode): The span's text."""
def __get__(self): return self.text
return self.text
property lemma_: @property
def lemma_(self):
"""RETURNS (unicode): The span's lemma.""" """RETURNS (unicode): The span's lemma."""
def __get__(self): return " ".join([t.lemma_ for t in self]).strip()
return " ".join([t.lemma_ for t in self]).strip()
property upper_: @property
def upper_(self):
"""Deprecated. Use `Span.text.upper()` instead.""" """Deprecated. Use `Span.text.upper()` instead."""
def __get__(self): return "".join([t.text_with_ws.upper() for t in self]).strip()
return "".join([t.text_with_ws.upper() for t in self]).strip()
property lower_: @property
def lower_(self):
"""Deprecated. Use `Span.text.lower()` instead.""" """Deprecated. Use `Span.text.lower()` instead."""
def __get__(self): return "".join([t.text_with_ws.lower() for t in self]).strip()
return "".join([t.text_with_ws.lower() for t in self]).strip()
property string: @property
def string(self):
"""Deprecated: Use `Span.text_with_ws` instead.""" """Deprecated: Use `Span.text_with_ws` instead."""
def __get__(self): return "".join([t.text_with_ws for t in self])
return "".join([t.text_with_ws for t in self])
property label_: property label_:
"""RETURNS (unicode): The span's label.""" """RETURNS (unicode): The span's label."""
@ -643,7 +653,9 @@ cdef class Span:
return self.doc.vocab.strings[self.label] return self.doc.vocab.strings[self.label]
def __set__(self, unicode label_): def __set__(self, unicode label_):
self.label = self.doc.vocab.strings.add(label_) if not label_:
label_ = ''
raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))
cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1: cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:


@ -219,115 +219,115 @@ cdef class Token:
xp = get_array_module(vector) xp = get_array_module(vector)
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
property morph: @property
def __get__(self): def morph(self):
return MorphAnalysis.from_id(self.vocab, self.c.morph) return MorphAnalysis.from_id(self.vocab, self.c.morph)
property lex_id: @property
def lex_id(self):
"""RETURNS (int): Sequential ID of the token's lexical type.""" """RETURNS (int): Sequential ID of the token's lexical type."""
def __get__(self): return self.c.lex.id
return self.c.lex.id
property rank: @property
def rank(self):
"""RETURNS (int): Sequential ID of the token's lexical type, used to """RETURNS (int): Sequential ID of the token's lexical type, used to
index into tables, e.g. for word vectors.""" index into tables, e.g. for word vectors."""
def __get__(self): return self.c.lex.id
return self.c.lex.id
property string: @property
def string(self):
"""Deprecated: Use Token.text_with_ws instead.""" """Deprecated: Use Token.text_with_ws instead."""
def __get__(self): return self.text_with_ws
return self.text_with_ws
property text: @property
def text(self):
"""RETURNS (unicode): The original verbatim text of the token.""" """RETURNS (unicode): The original verbatim text of the token."""
def __get__(self): return self.orth_
return self.orth_
property text_with_ws: @property
def text_with_ws(self):
"""RETURNS (unicode): The text content of the token (with trailing """RETURNS (unicode): The text content of the token (with trailing
whitespace). whitespace).
""" """
def __get__(self): cdef unicode orth = self.vocab.strings[self.c.lex.orth]
cdef unicode orth = self.vocab.strings[self.c.lex.orth] if self.c.spacy:
if self.c.spacy: return orth + " "
return orth + " " else:
else: return orth
return orth
property prob: @property
def prob(self):
"""RETURNS (float): Smoothed log probability estimate of token type.""" """RETURNS (float): Smoothed log probability estimate of token type."""
def __get__(self): return self.c.lex.prob
return self.c.lex.prob
property sentiment: @property
def sentiment(self):
"""RETURNS (float): A scalar value indicating the positivity or """RETURNS (float): A scalar value indicating the positivity or
negativity of the token.""" negativity of the token."""
def __get__(self): if "sentiment" in self.doc.user_token_hooks:
if "sentiment" in self.doc.user_token_hooks: return self.doc.user_token_hooks["sentiment"](self)
return self.doc.user_token_hooks["sentiment"](self) return self.c.lex.sentiment
return self.c.lex.sentiment
property lang: @property
def lang(self):
"""RETURNS (uint64): ID of the language of the parent document's """RETURNS (uint64): ID of the language of the parent document's
vocabulary. vocabulary.
""" """
def __get__(self): return self.c.lex.lang
return self.c.lex.lang
property idx: @property
def idx(self):
"""RETURNS (int): The character offset of the token within the parent """RETURNS (int): The character offset of the token within the parent
document. document.
""" """
def __get__(self): return self.c.idx
return self.c.idx
property cluster: @property
def cluster(self):
"""RETURNS (int): Brown cluster ID.""" """RETURNS (int): Brown cluster ID."""
def __get__(self): return self.c.lex.cluster
return self.c.lex.cluster
property orth: @property
def orth(self):
"""RETURNS (uint64): ID of the verbatim text content.""" """RETURNS (uint64): ID of the verbatim text content."""
def __get__(self): return self.c.lex.orth
return self.c.lex.orth
property lower: @property
def lower(self):
"""RETURNS (uint64): ID of the lowercase token text.""" """RETURNS (uint64): ID of the lowercase token text."""
def __get__(self): return self.c.lex.lower
return self.c.lex.lower
property norm: @property
def norm(self):
"""RETURNS (uint64): ID of the token's norm, i.e. a normalised form of """RETURNS (uint64): ID of the token's norm, i.e. a normalised form of
the token text. Usually set in the language's tokenizer exceptions the token text. Usually set in the language's tokenizer exceptions
or norm exceptions. or norm exceptions.
""" """
def __get__(self): if self.c.norm == 0:
if self.c.norm == 0: return self.c.lex.norm
return self.c.lex.norm else:
else: return self.c.norm
return self.c.norm
property shape: @property
def shape(self):
"""RETURNS (uint64): ID of the token's shape, a transform of the """RETURNS (uint64): ID of the token's shape, a transform of the
token's string, to show orthographic features (e.g. "Xxxx", "dd"). token's string, to show orthographic features (e.g. "Xxxx", "dd").
""" """
def __get__(self): return self.c.lex.shape
return self.c.lex.shape
property prefix: @property
def prefix(self):
"""RETURNS (uint64): ID of a length-N substring from the start of the """RETURNS (uint64): ID of a length-N substring from the start of the
token. Defaults to `N=1`. token. Defaults to `N=1`.
""" """
def __get__(self): return self.c.lex.prefix
return self.c.lex.prefix
property suffix: @property
def suffix(self):
"""RETURNS (uint64): ID of a length-N substring from the end of the """RETURNS (uint64): ID of a length-N substring from the end of the
token. Defaults to `N=3`. token. Defaults to `N=3`.
""" """
def __get__(self): return self.c.lex.suffix
return self.c.lex.suffix
property lemma: property lemma:
"""RETURNS (uint64): ID of the base form of the word, with no """RETURNS (uint64): ID of the base form of the word, with no
@ -367,7 +367,8 @@ cdef class Token:
def __set__(self, attr_t label): def __set__(self, attr_t label):
self.c.dep = label self.c.dep = label
property has_vector: @property
def has_vector(self):
"""A boolean value indicating whether a word vector is associated with """A boolean value indicating whether a word vector is associated with
the object. the object.
@ -375,14 +376,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#has_vector DOCS: https://spacy.io/api/token#has_vector
""" """
def __get__(self): if "has_vector" in self.doc.user_token_hooks:
if 'has_vector' in self.doc.user_token_hooks: return self.doc.user_token_hooks["has_vector"](self)
return self.doc.user_token_hooks["has_vector"](self) if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0: return True
return True return self.vocab.has_vector(self.c.lex.orth)
return self.vocab.has_vector(self.c.lex.orth)
property vector: @property
def vector(self):
"""A real-valued meaning representation. """A real-valued meaning representation.
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
@ -390,28 +391,28 @@ cdef class Token:
DOCS: https://spacy.io/api/token#vector DOCS: https://spacy.io/api/token#vector
""" """
def __get__(self): if "vector" in self.doc.user_token_hooks:
if 'vector' in self.doc.user_token_hooks: return self.doc.user_token_hooks["vector"](self)
return self.doc.user_token_hooks["vector"](self) if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0: return self.doc.tensor[self.i]
return self.doc.tensor[self.i] else:
else: return self.vocab.get_vector(self.c.lex.orth)
return self.vocab.get_vector(self.c.lex.orth)
property vector_norm: @property
def vector_norm(self):
"""The L2 norm of the token's vector representation. """The L2 norm of the token's vector representation.
RETURNS (float): The L2 norm of the vector representation. RETURNS (float): The L2 norm of the vector representation.
DOCS: https://spacy.io/api/token#vector_norm DOCS: https://spacy.io/api/token#vector_norm
""" """
def __get__(self): if "vector_norm" in self.doc.user_token_hooks:
if 'vector_norm' in self.doc.user_token_hooks: return self.doc.user_token_hooks["vector_norm"](self)
return self.doc.user_token_hooks["vector_norm"](self) vector = self.vector
vector = self.vector return numpy.sqrt((vector ** 2).sum())
return numpy.sqrt((vector ** 2).sum())
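
The `vector_norm` getter above computes an L2 norm, and the similarity expression earlier in this file divides a dot product by the product of two such norms. A small sketch of both formulas in plain Python, using lists instead of numpy arrays so it has no dependencies; the function names are illustrative and the sketch does not handle zero vectors.

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))
```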
property n_lefts: @property
def n_lefts(self):
"""The number of leftward immediate children of the word, in the """The number of leftward immediate children of the word, in the
syntactic dependency parse. syntactic dependency parse.
@ -420,10 +421,10 @@ cdef class Token:
DOCS: https://spacy.io/api/token#n_lefts DOCS: https://spacy.io/api/token#n_lefts
""" """
def __get__(self): return self.c.l_kids
return self.c.l_kids
property n_rights: @property
def n_rights(self):
"""The number of rightward immediate children of the word, in the """The number of rightward immediate children of the word, in the
syntactic dependency parse. syntactic dependency parse.
@ -432,15 +433,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#n_rights DOCS: https://spacy.io/api/token#n_rights
""" """
def __get__(self): return self.c.r_kids
return self.c.r_kids
property sent: @property
def sent(self):
"""RETURNS (Span): The sentence span that the token is a part of.""" """RETURNS (Span): The sentence span that the token is a part of."""
def __get__(self): if 'sent' in self.doc.user_token_hooks:
if 'sent' in self.doc.user_token_hooks: return self.doc.user_token_hooks["sent"](self)
return self.doc.user_token_hooks["sent"](self) return self.doc[self.i : self.i+1].sent
return self.doc[self.i : self.i+1].sent
property sent_start: property sent_start:
def __get__(self): def __get__(self):
@ -484,7 +484,8 @@ cdef class Token:
else: else:
raise ValueError(Errors.E044.format(value=value)) raise ValueError(Errors.E044.format(value=value))
property lefts: @property
def lefts(self):
"""The leftward immediate children of the word, in the syntactic """The leftward immediate children of the word, in the syntactic
dependency parse. dependency parse.
@ -492,19 +493,19 @@ cdef class Token:
DOCS: https://spacy.io/api/token#lefts DOCS: https://spacy.io/api/token#lefts
""" """
def __get__(self): cdef int nr_iter = 0
cdef int nr_iter = 0 cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge) while ptr < self.c:
while ptr < self.c: if ptr + ptr.head == self.c:
if ptr + ptr.head == self.c: yield self.doc[ptr - (self.c - self.i)]
yield self.doc[ptr - (self.c - self.i)] ptr += 1
ptr += 1 nr_iter += 1
nr_iter += 1 # This is ugly, but it's a way to guard against infinite loops
# This is ugly, but it's a way to guard against infinite loops if nr_iter >= 10000000:
if nr_iter >= 10000000: raise RuntimeError(Errors.E045.format(attr="token.lefts"))
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
property rights: @property
def rights(self):
"""The rightward immediate children of the word, in the syntactic """The rightward immediate children of the word, in the syntactic
dependency parse. dependency parse.
@ -512,33 +513,33 @@ cdef class Token:
DOCS: https://spacy.io/api/token#rights DOCS: https://spacy.io/api/token#rights
""" """
def __get__(self): cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i) tokens = []
tokens = [] cdef int nr_iter = 0
cdef int nr_iter = 0 while ptr > self.c:
while ptr > self.c: if ptr + ptr.head == self.c:
if ptr + ptr.head == self.c: tokens.append(self.doc[ptr - (self.c - self.i)])
tokens.append(self.doc[ptr - (self.c - self.i)]) ptr -= 1
ptr -= 1 nr_iter += 1
nr_iter += 1 if nr_iter >= 10000000:
if nr_iter >= 10000000: raise RuntimeError(Errors.E045.format(attr="token.rights"))
raise RuntimeError(Errors.E045.format(attr="token.rights")) tokens.reverse()
tokens.reverse() for t in tokens:
for t in tokens: yield t
yield t
property children: @property
def children(self):
"""A sequence of the token's immediate syntactic children. """A sequence of the token's immediate syntactic children.
YIELDS (Token): A child token such that `child.head==self`. YIELDS (Token): A child token such that `child.head==self`.
DOCS: https://spacy.io/api/token#children DOCS: https://spacy.io/api/token#children
""" """
def __get__(self): yield from self.lefts
yield from self.lefts yield from self.rights
yield from self.rights
property subtree: @property
def subtree(self):
"""A sequence containing the token and all the token's syntactic """A sequence containing the token and all the token's syntactic
descendants. descendants.
@ -547,30 +548,30 @@ cdef class Token:
DOCS: https://spacy.io/api/token#subtree DOCS: https://spacy.io/api/token#subtree
""" """
def __get__(self): for word in self.lefts:
for word in self.lefts: yield from word.subtree
yield from word.subtree yield self
yield self for word in self.rights:
for word in self.rights: yield from word.subtree
yield from word.subtree
property left_edge: @property
def left_edge(self):
"""The leftmost token of this token's syntactic descendants. """The leftmost token of this token's syntactic descendants.
RETURNS (Token): The first token such that `self.is_ancestor(token)`. RETURNS (Token): The first token such that `self.is_ancestor(token)`.
""" """
def __get__(self): return self.doc[self.c.l_edge]
return self.doc[self.c.l_edge]
property right_edge: @property
def right_edge(self):
"""The rightmost token of this token's syntactic descendants. """The rightmost token of this token's syntactic descendants.
RETURNS (Token): The last token such that `self.is_ancestor(token)`. RETURNS (Token): The last token such that `self.is_ancestor(token)`.
""" """
def __get__(self): return self.doc[self.c.r_edge]
return self.doc[self.c.r_edge]
property ancestors: @property
def ancestors(self):
"""A sequence of this token's syntactic ancestors. """A sequence of this token's syntactic ancestors.
YIELDS (Token): A sequence of ancestor tokens such that YIELDS (Token): A sequence of ancestor tokens such that
@ -578,15 +579,14 @@ cdef class Token:
DOCS: https://spacy.io/api/token#ancestors DOCS: https://spacy.io/api/token#ancestors
""" """
def __get__(self): cdef const TokenC* head_ptr = self.c
cdef const TokenC* head_ptr = self.c # Guard against infinite loop, no token can have
# Guard against infinite loop, no token can have # more ancestors than tokens in the tree.
# more ancestors than tokens in the tree. cdef int i = 0
cdef int i = 0 while head_ptr.head != 0 and i < self.doc.length:
while head_ptr.head != 0 and i < self.doc.length: head_ptr += head_ptr.head
head_ptr += head_ptr.head yield self.doc[head_ptr - (self.c - self.i)]
yield self.doc[head_ptr - (self.c - self.i)] i += 1
i += 1
def is_ancestor(self, descendant): def is_ancestor(self, descendant):
"""Check whether this token is a parent, grandparent, etc. of another """Check whether this token is a parent, grandparent, etc. of another
@ -690,23 +690,31 @@ cdef class Token:
# Set new head # Set new head
self.c.head = rel_newhead_i self.c.head = rel_newhead_i
property conjuncts: @property
def conjuncts(self):
"""A sequence of coordinated tokens, not including the token itself. """A sequence of coordinated tokens, not including the token itself.
YIELDS (Token): A coordinated token. RETURNS (tuple): The coordinated tokens.
DOCS: https://spacy.io/api/token#conjuncts DOCS: https://spacy.io/api/token#conjuncts
""" """
def __get__(self): cdef Token word, child
cdef Token word if "conjuncts" in self.doc.user_token_hooks:
if "conjuncts" in self.doc.user_token_hooks: return tuple(self.doc.user_token_hooks["conjuncts"](self))
yield from self.doc.user_token_hooks["conjuncts"](self) start = self
while start.i != start.head.i:
if start.dep == conj:
start = start.head
else: else:
if self.dep != conj: break
for word in self.rights: queue = [start]
if word.dep == conj: output = [start]
yield word for word in queue:
yield from word.conjuncts for child in word.rights:
if child.c.dep == conj:
output.append(child)
queue.append(child)
return tuple([w for w in output if w.i != self.i])
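
The rewritten `conjuncts` first climbs `conj` arcs to the first conjunct, then collects every token reachable through rightward `conj` children, and finally drops the token it was called on. A pure-Python sketch of that traversal, assuming a toy tree stored as absolute head indices plus string labels; `conjuncts`, `heads`, and `deps` here are illustrative stand-ins rather than the spaCy objects.

```python
def conjuncts(heads, deps, i):
    """Return indices of the tokens coordinated with token i, excluding i.

    heads[k] is the absolute index of token k's head (the root points to
    itself); deps[k] is the label of the arc from token k to its head.
    """
    # Climb to the first conjunct of the coordination.
    start = i
    while start != heads[start] and deps[start] == "conj":
        start = heads[start]
    # Breadth-first walk over rightward "conj" children. Appending to the
    # list while iterating over it is what drives the traversal.
    queue = [start]
    for word in queue:
        for child in range(word + 1, len(heads)):
            if heads[child] == word and deps[child] == "conj":
                queue.append(child)
    return tuple(k for k in queue if k != i)


# "apples , pears and plums"
heads = [0, 0, 0, 2, 2]
deps = ["ROOT", "punct", "conj", "cc", "conj"]
```

Calling the sketch on any of the three nouns returns the other two, which is the symmetry the old generator version did not provide for tokens that were themselves `conj` dependents.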
property ent_type: property ent_type:
"""RETURNS (uint64): Named entity type.""" """RETURNS (uint64): Named entity type."""
@ -716,15 +724,6 @@ cdef class Token:
def __set__(self, ent_type): def __set__(self, ent_type):
self.c.ent_type = ent_type self.c.ent_type = ent_type
property ent_iob:
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
is assigned.
RETURNS (uint64): IOB code of named entity tag.
"""
def __get__(self):
return self.c.ent_iob
property ent_type_: property ent_type_:
"""RETURNS (unicode): Named entity type.""" """RETURNS (unicode): Named entity type."""
def __get__(self): def __get__(self):
@ -733,16 +732,25 @@ cdef class Token:
def __set__(self, ent_type): def __set__(self, ent_type):
self.c.ent_type = self.vocab.strings.add(ent_type) self.c.ent_type = self.vocab.strings.add(ent_type)
property ent_iob_: @property
def ent_iob(self):
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
is assigned.
RETURNS (uint64): IOB code of named entity tag.
"""
return self.c.ent_iob
@property
def ent_iob_(self):
"""IOB code of named entity tag. "B" means the token begins an entity, """IOB code of named entity tag. "B" means the token begins an entity,
"I" means it is inside an entity, "O" means it is outside an entity, "I" means it is inside an entity, "O" means it is outside an entity,
and "" means no entity tag is set. and "" means no entity tag is set.
RETURNS (unicode): IOB code of named entity tag. RETURNS (unicode): IOB code of named entity tag.
""" """
def __get__(self): iob_strings = ("", "I", "O", "B")
iob_strings = ("", "I", "O", "B") return iob_strings[self.c.ent_iob]
return iob_strings[self.c.ent_iob]
property ent_id: property ent_id:
"""RETURNS (uint64): ID of the entity the token is an instance of, """RETURNS (uint64): ID of the entity the token is an instance of,
@ -764,26 +772,25 @@ cdef class Token:
def __set__(self, name): def __set__(self, name):
self.c.ent_id = self.vocab.strings.add(name) self.c.ent_id = self.vocab.strings.add(name)
property whitespace_: @property
"""RETURNS (unicode): The trailing whitespace character, if present. def whitespace_(self):
""" """RETURNS (unicode): The trailing whitespace character, if present."""
def __get__(self): return " " if self.c.spacy else ""
return " " if self.c.spacy else ""
property orth_: @property
def orth_(self):
"""RETURNS (unicode): Verbatim text content (identical to """RETURNS (unicode): Verbatim text content (identical to
`Token.text`). Exists mostly for consistency with the other `Token.text`). Exists mostly for consistency with the other
attributes. attributes.
""" """
def __get__(self): return self.vocab.strings[self.c.lex.orth]
return self.vocab.strings[self.c.lex.orth]
property lower_: @property
def lower_(self):
"""RETURNS (unicode): The lowercase token text. Equivalent to """RETURNS (unicode): The lowercase token text. Equivalent to
`Token.text.lower()`. `Token.text.lower()`.
""" """
def __get__(self): return self.vocab.strings[self.c.lex.lower]
return self.vocab.strings[self.c.lex.lower]
property norm_: property norm_:
"""RETURNS (unicode): The token's norm, i.e. a normalised form of the """RETURNS (unicode): The token's norm, i.e. a normalised form of the
@ -796,33 +803,33 @@ cdef class Token:
def __set__(self, unicode norm_): def __set__(self, unicode norm_):
self.c.norm = self.vocab.strings.add(norm_) self.c.norm = self.vocab.strings.add(norm_)
property shape_: @property
def shape_(self):
"""RETURNS (unicode): Transform of the token's string, to show """RETURNS (unicode): Transform of the token's string, to show
orthographic features. For example, "Xxxx" or "dd". orthographic features. For example, "Xxxx" or "dd".
""" """
def __get__(self): return self.vocab.strings[self.c.lex.shape]
return self.vocab.strings[self.c.lex.shape]
property prefix_: @property
def prefix_(self):
"""RETURNS (unicode): A length-N substring from the start of the token. """RETURNS (unicode): A length-N substring from the start of the token.
Defaults to `N=1`. Defaults to `N=1`.
""" """
def __get__(self): return self.vocab.strings[self.c.lex.prefix]
return self.vocab.strings[self.c.lex.prefix]
property suffix_: @property
def suffix_(self):
"""RETURNS (unicode): A length-N substring from the end of the token. """RETURNS (unicode): A length-N substring from the end of the token.
Defaults to `N=3`. Defaults to `N=3`.
""" """
def __get__(self): return self.vocab.strings[self.c.lex.suffix]
return self.vocab.strings[self.c.lex.suffix]
property lang_: @property
def lang_(self):
"""RETURNS (unicode): Language of the parent document's vocabulary, """RETURNS (unicode): Language of the parent document's vocabulary,
e.g. 'en'. e.g. 'en'.
""" """
def __get__(self): return self.vocab.strings[self.c.lex.lang]
return self.vocab.strings[self.c.lex.lang]
property lemma_: property lemma_:
"""RETURNS (unicode): The token lemma, i.e. the base form of the word, """RETURNS (unicode): The token lemma, i.e. the base form of the word,
@ -861,110 +868,110 @@ cdef class Token:
def __set__(self, unicode label): def __set__(self, unicode label):
self.c.dep = self.vocab.strings.add(label) self.c.dep = self.vocab.strings.add(label)
property is_oov: @property
def is_oov(self):
"""RETURNS (bool): Whether the token is out-of-vocabulary.""" """RETURNS (bool): Whether the token is out-of-vocabulary."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_OOV)
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
property is_stop: @property
def is_stop(self):
"""RETURNS (bool): Whether the token is a stop word, i.e. part of a """RETURNS (bool): Whether the token is a stop word, i.e. part of a
"stop list" defined by the language data. "stop list" defined by the language data.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_STOP)
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
property is_alpha: @property
def is_alpha(self):
"""RETURNS (bool): Whether the token consists of alpha characters. """RETURNS (bool): Whether the token consists of alpha characters.
Equivalent to `token.text.isalpha()`. Equivalent to `token.text.isalpha()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
property is_ascii: @property
def is_ascii(self):
"""RETURNS (bool): Whether the token consists of ASCII characters. """RETURNS (bool): Whether the token consists of ASCII characters.
Equivalent to `not any(ord(c) >= 128 for c in token.text)`. Equivalent to `not any(ord(c) >= 128 for c in token.text)`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
property is_digit: @property
def is_digit(self):
"""RETURNS (bool): Whether the token consists of digits. Equivalent to """RETURNS (bool): Whether the token consists of digits. Equivalent to
`token.text.isdigit()`. `token.text.isdigit()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
property is_lower: @property
def is_lower(self):
"""RETURNS (bool): Whether the token is in lowercase. Equivalent to """RETURNS (bool): Whether the token is in lowercase. Equivalent to
`token.text.islower()`. `token.text.islower()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
property is_upper: @property
def is_upper(self):
"""RETURNS (bool): Whether the token is in uppercase. Equivalent to """RETURNS (bool): Whether the token is in uppercase. Equivalent to
`token.text.isupper()` `token.text.isupper()`
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
property is_title: @property
def is_title(self):
"""RETURNS (bool): Whether the token is in titlecase. Equivalent to """RETURNS (bool): Whether the token is in titlecase. Equivalent to
`token.text.istitle()`. `token.text.istitle()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
property is_punct: @property
def is_punct(self):
"""RETURNS (bool): Whether the token is punctuation.""" """RETURNS (bool): Whether the token is punctuation."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
property is_space: @property
def is_space(self):
"""RETURNS (bool): Whether the token consists of whitespace characters. """RETURNS (bool): Whether the token consists of whitespace characters.
Equivalent to `token.text.isspace()`. Equivalent to `token.text.isspace()`.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
property is_bracket: @property
def is_bracket(self):
"""RETURNS (bool): Whether the token is a bracket.""" """RETURNS (bool): Whether the token is a bracket."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
property is_quote: @property
def is_quote(self):
"""RETURNS (bool): Whether the token is a quotation mark.""" """RETURNS (bool): Whether the token is a quotation mark."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
property is_left_punct: @property
def is_left_punct(self):
"""RETURNS (bool): Whether the token is a left punctuation mark.""" """RETURNS (bool): Whether the token is a left punctuation mark."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
property is_right_punct: @property
def is_right_punct(self):
"""RETURNS (bool): Whether the token is a right punctuation mark.""" """RETURNS (bool): Whether the token is a right punctuation mark."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
property is_currency: @property
def is_currency(self):
"""RETURNS (bool): Whether the token is a currency symbol.""" """RETURNS (bool): Whether the token is a currency symbol."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
property like_url: @property
def like_url(self):
"""RETURNS (bool): Whether the token resembles a URL.""" """RETURNS (bool): Whether the token resembles a URL."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
property like_num: @property
def like_num(self):
"""RETURNS (bool): Whether the token resembles a number, e.g. "10.9", """RETURNS (bool): Whether the token resembles a number, e.g. "10.9",
"10", "ten", etc. "10", "ten", etc.
""" """
def __get__(self): return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
property like_email: @property
def like_email(self):
"""RETURNS (bool): Whether the token resembles an email address.""" """RETURNS (bool): Whether the token resembles an email address."""
def __get__(self): return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)


@ -2,11 +2,13 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import functools import functools
import copy
from ..errors import Errors from ..errors import Errors
class Underscore(object): class Underscore(object):
mutable_types = (dict, list, set)
doc_extensions = {} doc_extensions = {}
span_extensions = {} span_extensions = {}
token_extensions = {} token_extensions = {}
@ -32,7 +34,15 @@ class Underscore(object):
elif method is not None: elif method is not None:
return functools.partial(method, self._obj) return functools.partial(method, self._obj)
else: else:
return self._doc.user_data.get(self._get_key(name), default) key = self._get_key(name)
if key in self._doc.user_data:
return self._doc.user_data[key]
elif isinstance(default, self.mutable_types):
# Handle mutable default arguments (see #2581)
new_default = copy.copy(default)
self.__setattr__(name, new_default)
return new_default
return default
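Note on the hunk above: the new branch copies a mutable default before returning it, so that in-place edits on one object do not leak into every other object sharing the same registered default. A minimal sketch of that idea follows; `ExtensionStore`, `defaults` and `get` are hypothetical names used only for illustration and are not part of spaCy.

```python
import copy

MUTABLE_TYPES = (dict, list, set)


class ExtensionStore(object):
    """Hypothetical stand-in for the Underscore / user_data pair in the hunk."""

    def __init__(self, defaults):
        self.defaults = defaults  # name -> default, registered once and shared
        self.user_data = {}       # per-object storage

    def get(self, name):
        if name in self.user_data:
            return self.user_data[name]
        default = self.defaults[name]
        if isinstance(default, MUTABLE_TYPES):
            # Store a shallow copy so in-place edits stay on this object
            new_default = copy.copy(default)
            self.user_data[name] = new_default
            return new_default
        return default


defaults = {"tags": [], "count": 0}
doc_a = ExtensionStore(defaults)
doc_b = ExtensionStore(defaults)
doc_a.get("tags").append("x")  # mutates doc_a's private copy only
```

Because the copy is shallow (`copy.copy`), nested containers inside the default would still be shared in this sketch.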
def __setattr__(self, name, value): def __setattr__(self, name, value):
if name not in self._extensions: if name not in self._extensions:


@ -25,7 +25,7 @@ except ImportError:
from .symbols import ORTH from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, unicode_ from .compat import cupy, CudaStream, path2str, basestring_, unicode_
from .compat import import_file from .compat import import_file
from .errors import Errors from .errors import Errors, Warnings, deprecation_warning
LANGUAGES = {} LANGUAGES = {}
@ -38,6 +38,18 @@ def set_env_log(value):
_PRINT_ENV = value _PRINT_ENV = value
def lang_class_is_loaded(lang):
    """Check whether a Language class is already loaded. Language classes are
    loaded lazily, to avoid expensive setup code associated with the language
    data.

    lang (unicode): Two-letter language code, e.g. 'en'.
    RETURNS (bool): Whether a Language class has been loaded.
    """
    global LANGUAGES
    return lang in LANGUAGES
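Note on `lang_class_is_loaded`: the point of the check is that a caller can ask whether a class is cached without triggering the lazy import. The sketch below illustrates that pattern under stated assumptions: `LOADERS`, the `English` stand-in class and the `writing_system` helper are hypothetical, and the dict values are illustrative only.

```python
LANGUAGES = {}  # language code -> loaded class, filled on first request


class English(object):
    # Hypothetical stand-in for a Language class; values are illustrative
    writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}


# Hypothetical loader table standing in for the lazy import machinery
LOADERS = {"en": lambda: English}


def lang_class_is_loaded(lang):
    # Cheap membership test, never triggers loading
    return lang in LANGUAGES


def get_lang_class(lang):
    # Loads (and caches) the class on first use
    if lang not in LANGUAGES:
        LANGUAGES[lang] = LOADERS[lang]()
    return LANGUAGES[lang]


def writing_system(lang):
    # Return an empty dict rather than forcing the expensive load
    if not lang_class_is_loaded(lang):
        return {}
    return dict(get_lang_class(lang).writing_system)


before = writing_system("en")  # nothing loaded yet
get_lang_class("en")           # explicit load
after = writing_system("en")
```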
def get_lang_class(lang): def get_lang_class(lang):
"""Import and load a Language class. """Import and load a Language class.
@ -565,7 +577,8 @@ def itershuffle(iterable, bufsize=1000):
def to_bytes(getters, exclude): def to_bytes(getters, exclude):
serialized = OrderedDict() serialized = OrderedDict()
for key, getter in getters.items(): for key, getter in getters.items():
if key not in exclude: # Split to support file names like meta.json
if key.split(".")[0] not in exclude:
serialized[key] = getter() serialized[key] = getter()
return srsly.msgpack_dumps(serialized) return srsly.msgpack_dumps(serialized)
@ -573,7 +586,8 @@ def to_bytes(getters, exclude):
def from_bytes(bytes_data, setters, exclude): def from_bytes(bytes_data, setters, exclude):
msg = srsly.msgpack_loads(bytes_data) msg = srsly.msgpack_loads(bytes_data)
for key, setter in setters.items(): for key, setter in setters.items():
if key not in exclude and key in msg: # Split to support file names like meta.json
if key.split(".")[0] not in exclude and key in msg:
setter(msg[key]) setter(msg[key])
return msg return msg
@ -583,7 +597,8 @@ def to_disk(path, writers, exclude):
if not path.exists(): if not path.exists():
path.mkdir() path.mkdir()
for key, writer in writers.items(): for key, writer in writers.items():
if key not in exclude: # Split to support file names like meta.json
if key.split(".")[0] not in exclude:
writer(path / key) writer(path / key)
return path return path
@ -591,7 +606,8 @@ def to_disk(path, writers, exclude):
def from_disk(path, readers, exclude): def from_disk(path, readers, exclude):
path = ensure_path(path) path = ensure_path(path)
for key, reader in readers.items(): for key, reader in readers.items():
if key not in exclude: # Split to support file names like meta.json
if key.split(".")[0] not in exclude:
reader(path / key) reader(path / key)
return path return path
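Note on the four hunks above: each helper now compares only the part of the key before the first dot against `exclude`, so a serializer registered as `meta.json` is skipped by the name `meta`. A minimal, self-contained sketch of that matching rule follows. It swaps `json` in for `srsly.msgpack_dumps` / `srsly.msgpack_loads` purely so that it runs with the standard library, and the `getters`, `setters` and `state` values are made up for illustration.

```python
import json
from collections import OrderedDict


def to_bytes(getters, exclude):
    serialized = OrderedDict()
    for key, getter in getters.items():
        # "meta.json" is matched by the exclude name "meta"
        if key.split(".")[0] not in exclude:
            serialized[key] = getter()
    return json.dumps(serialized).encode("utf8")


def from_bytes(bytes_data, setters, exclude):
    msg = json.loads(bytes_data.decode("utf8"))
    for key, setter in setters.items():
        # Skip excluded fields and fields missing from the payload
        if key.split(".")[0] not in exclude and key in msg:
            setter(msg[key])
    return msg


state = {}
getters = OrderedDict((
    ("meta.json", lambda: {"lang": "en"}),
    ("strings", lambda: ["apple", "orange"]),
))
setters = OrderedDict((
    ("meta.json", lambda value: state.update(meta=value)),
    ("strings", lambda value: state.update(strings=value)),
))

data = to_bytes(getters, exclude=["meta"])
loaded = from_bytes(data, setters, exclude=[])
# Passing the full file name does not match the split key in this sketch
unfiltered = json.loads(to_bytes(getters, exclude=["meta.json"]).decode("utf8"))
```

One consequence visible in the sketch: since only the key is split, an `exclude` entry that itself contains the extension (`"meta.json"`) is not matched.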
@ -677,6 +693,23 @@ def validate_json(data, validator):
return errors return errors
def get_serialization_exclude(serializers, exclude, kwargs):
    """Helper function to validate serialization args and manage transition from
    keyword arguments (pre v2.1) to exclude argument.
    """
    exclude = list(exclude)
    # Split to support file names like meta.json
    options = [name.split(".")[0] for name in serializers]
    for key, value in kwargs.items():
        if key in ("vocab",) and value is False:
            deprecation_warning(Warnings.W015.format(arg=key))
            exclude.append(key)
        elif key.split(".")[0] in options:
            raise ValueError(Errors.E128.format(arg=key))
        # TODO: user warning?
    return exclude
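Note on `get_serialization_exclude`: the helper merges the old keyword style (`vocab=False`) into the new `exclude` list with a deprecation warning, and rejects other field names passed as keywords. The sketch below mirrors that control flow so the behavior can be tried in isolation. The warning and error messages are placeholders, not the real `W015` / `E128` texts, and `warnings.warn` stands in for spaCy's `deprecation_warning`.

```python
import warnings


def get_serialization_exclude(serializers, exclude, kwargs):
    exclude = list(exclude)
    # Split to support file names like meta.json
    options = [name.split(".")[0] for name in serializers]
    for key, value in kwargs.items():
        if key in ("vocab",) and value is False:
            # Placeholder message; stands in for Warnings.W015
            warnings.warn(
                "'{}=False' is deprecated, use exclude=['{}']".format(key, key),
                DeprecationWarning,
            )
            exclude.append(key)
        elif key.split(".")[0] in options:
            # Placeholder message; stands in for Errors.E128
            raise ValueError("Unsupported keyword argument '{}'".format(key))
    return exclude


fields = ["strings", "lexemes", "vectors"]

# Old-style vocab=False is folded into the exclude list, with a warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_serialization_exclude(fields, ("vectors",), {"vocab": False})

# Any other field name passed as a keyword is rejected
try:
    get_serialization_exclude(fields, [], {"strings": False})
    raised = False
except ValueError:
    raised = True
```

Keyword arguments that match neither case are silently ignored here, which is the situation the `TODO` in the helper refers to.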
class SimpleFrozenDict(dict): class SimpleFrozenDict(dict):
"""Simplified implementation of a frozen dict, mainly used as default """Simplified implementation of a frozen dict, mainly used as default
function or method argument (for arguments that should default to empty function or method argument (for arguments that should default to empty
@ -696,14 +729,14 @@ class SimpleFrozenDict(dict):
class DummyTokenizer(object): class DummyTokenizer(object):
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
# allow serialization (see #1557) # allow serialization (see #1557)
def to_bytes(self, **exclude): def to_bytes(self, **kwargs):
return b"" return b""
def from_bytes(self, _bytes_data, **exclude): def from_bytes(self, _bytes_data, **kwargs):
return self return self
def to_disk(self, _path, **exclude): def to_disk(self, _path, **kwargs):
return None return None
def from_disk(self, _path, **exclude): def from_disk(self, _path, **kwargs):
return self return self


@ -377,11 +377,11 @@ cdef class Vectors:
self.add(key, row=i) self.add(key, row=i)
return strings return strings
def to_disk(self, path, **exclude): def to_disk(self, path, **kwargs):
"""Save the current state to a directory. """Save the current state to a directory.
path (unicode / Path): A path to a directory, which will be created if path (unicode / Path): A path to a directory, which will be created if
it doesn't exist. Either a string or a Path-like object. it doesn't exist.
DOCS: https://spacy.io/api/vectors#to_disk DOCS: https://spacy.io/api/vectors#to_disk
""" """
@ -394,9 +394,9 @@ cdef class Vectors:
("vectors", lambda p: save_array(self.data, p.open("wb"))), ("vectors", lambda p: save_array(self.data, p.open("wb"))),
("key2row", lambda p: srsly.write_msgpack(p, self.key2row)) ("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
)) ))
return util.to_disk(path, serializers, exclude) return util.to_disk(path, serializers, [])
def from_disk(self, path, **exclude): def from_disk(self, path, **kwargs):
"""Loads state from a directory. Modifies the object in place and """Loads state from a directory. Modifies the object in place and
returns it. returns it.
@ -428,13 +428,13 @@ cdef class Vectors:
("keys", load_keys), ("keys", load_keys),
("vectors", load_vectors), ("vectors", load_vectors),
)) ))
util.from_disk(path, serializers, exclude) util.from_disk(path, serializers, [])
return self return self
def to_bytes(self, **exclude): def to_bytes(self, **kwargs):
"""Serialize the current state to a binary string. """Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized. exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Vectors` object. RETURNS (bytes): The serialized form of the `Vectors` object.
DOCS: https://spacy.io/api/vectors#to_bytes DOCS: https://spacy.io/api/vectors#to_bytes
@ -444,17 +444,18 @@ cdef class Vectors:
return self.data.to_bytes() return self.data.to_bytes()
else: else:
return srsly.msgpack_dumps(self.data) return srsly.msgpack_dumps(self.data)
serializers = OrderedDict(( serializers = OrderedDict((
("key2row", lambda: srsly.msgpack_dumps(self.key2row)), ("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
("vectors", serialize_weights) ("vectors", serialize_weights)
)) ))
return util.to_bytes(serializers, exclude) return util.to_bytes(serializers, [])
def from_bytes(self, data, **exclude): def from_bytes(self, data, **kwargs):
"""Load state from a binary string. """Load state from a binary string.
data (bytes): The data to load from. data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. exclude (list): String names of serialization fields to exclude.
RETURNS (Vectors): The `Vectors` object. RETURNS (Vectors): The `Vectors` object.
DOCS: https://spacy.io/api/vectors#from_bytes DOCS: https://spacy.io/api/vectors#from_bytes
@ -469,5 +470,5 @@ cdef class Vectors:
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))), ("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
("vectors", deserialize_weights) ("vectors", deserialize_weights)
)) ))
util.from_bytes(data, deserializers, exclude) util.from_bytes(data, deserializers, [])
return self return self


@ -1,6 +1,7 @@
# coding: utf8 # coding: utf8
# cython: profile=True # cython: profile=True
from __future__ import unicode_literals from __future__ import unicode_literals
from libc.string cimport memcpy
import numpy import numpy
import srsly import srsly
@ -59,12 +60,23 @@ cdef class Vocab:
self.morphology = Morphology(self.strings, tag_map, lemmatizer) self.morphology = Morphology(self.strings, tag_map, lemmatizer)
self.vectors = Vectors() self.vectors = Vectors()
property lang: @property
def lang(self):
langfunc = None
if self.lex_attr_getters:
langfunc = self.lex_attr_getters.get(LANG, None)
return langfunc("_") if langfunc else ""
property writing_system:
"""A dict with information about the language's writing system. To get
the data, we use the vocab.lang property to fetch the Language class.
If the Language class is not loaded, an empty dict is returned.
"""
def __get__(self): def __get__(self):
langfunc = None if not util.lang_class_is_loaded(self.lang):
if self.lex_attr_getters: return {}
langfunc = self.lex_attr_getters.get(LANG, None) lang_class = util.get_lang_class(self.lang)
return langfunc("_") if langfunc else "" return dict(lang_class.Defaults.writing_system)
def __len__(self): def __len__(self):
"""The current number of lexemes stored. """The current number of lexemes stored.
@ -396,47 +408,57 @@ cdef class Vocab:
orth = self.strings.add(orth) orth = self.strings.add(orth)
return orth in self.vectors return orth in self.vectors
def to_disk(self, path, **exclude): def to_disk(self, path, exclude=tuple(), **kwargs):
"""Save the current state to a directory. """Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects. it doesn't exist.
exclude (list): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/vocab#to_disk DOCS: https://spacy.io/api/vocab#to_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
if not path.exists(): if not path.exists():
path.mkdir() path.mkdir()
self.strings.to_disk(path / "strings.json") setters = ["strings", "lexemes", "vectors"]
with (path / "lexemes.bin").open('wb') as file_: exclude = util.get_serialization_exclude(setters, exclude, kwargs)
file_.write(self.lexemes_to_bytes()) if "strings" not in exclude:
if self.vectors is not None: self.strings.to_disk(path / "strings.json")
if "lexemes" not in exclude:
with (path / "lexemes.bin").open("wb") as file_:
file_.write(self.lexemes_to_bytes())
if "vectors" not in exclude and self.vectors is not None:
self.vectors.to_disk(path) self.vectors.to_disk(path)
def from_disk(self, path, **exclude): def from_disk(self, path, exclude=tuple(), **kwargs):
"""Loads state from a directory. Modifies the object in place and """Loads state from a directory. Modifies the object in place and
returns it. returns it.
path (unicode or Path): A path to a directory. Paths may be either path (unicode or Path): A path to a directory.
strings or `Path`-like objects. exclude (list): String names of serialization fields to exclude.
RETURNS (Vocab): The modified `Vocab` object. RETURNS (Vocab): The modified `Vocab` object.
DOCS: https://spacy.io/api/vocab#from_disk DOCS: https://spacy.io/api/vocab#from_disk
""" """
path = util.ensure_path(path) path = util.ensure_path(path)
self.strings.from_disk(path / "strings.json") getters = ["strings", "lexemes", "vectors"]
with (path / "lexemes.bin").open("rb") as file_: exclude = util.get_serialization_exclude(getters, exclude, kwargs)
self.lexemes_from_bytes(file_.read()) if "strings" not in exclude:
if self.vectors is not None: self.strings.from_disk(path / "strings.json") # TODO: add exclude?
self.vectors.from_disk(path, exclude="strings.json") if "lexemes" not in exclude:
if self.vectors.name is not None: with (path / "lexemes.bin").open("rb") as file_:
link_vectors_to_models(self) self.lexemes_from_bytes(file_.read())
if "vectors" not in exclude:
if self.vectors is not None:
self.vectors.from_disk(path, exclude=["strings"])
if self.vectors.name is not None:
link_vectors_to_models(self)
return self return self
def to_bytes(self, **exclude): def to_bytes(self, exclude=tuple(), **kwargs):
"""Serialize the current state to a binary string. """Serialize the current state to a binary string.
**exclude: Named attributes to prevent from being serialized. exclude (list): String names of serialization fields to exclude.
RETURNS (bytes): The serialized form of the `Vocab` object. RETURNS (bytes): The serialized form of the `Vocab` object.
DOCS: https://spacy.io/api/vocab#to_bytes DOCS: https://spacy.io/api/vocab#to_bytes
@ -452,13 +474,14 @@ cdef class Vocab:
("lexemes", lambda: self.lexemes_to_bytes()), ("lexemes", lambda: self.lexemes_to_bytes()),
("vectors", deserialize_vectors) ("vectors", deserialize_vectors)
)) ))
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
return util.to_bytes(getters, exclude) return util.to_bytes(getters, exclude)
def from_bytes(self, bytes_data, **exclude): def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
"""Load state from a binary string. """Load state from a binary string.
bytes_data (bytes): The data to load from. bytes_data (bytes): The data to load from.
**exclude: Named attributes to prevent from being loaded. exclude (list): String names of serialization fields to exclude.
RETURNS (Vocab): The `Vocab` object. RETURNS (Vocab): The `Vocab` object.
DOCS: https://spacy.io/api/vocab#from_bytes DOCS: https://spacy.io/api/vocab#from_bytes
@ -468,11 +491,13 @@ cdef class Vocab:
return None return None
else: else:
return self.vectors.from_bytes(b) return self.vectors.from_bytes(b)
setters = OrderedDict(( setters = OrderedDict((
("strings", lambda b: self.strings.from_bytes(b)), ("strings", lambda b: self.strings.from_bytes(b)),
("lexemes", lambda b: self.lexemes_from_bytes(b)), ("lexemes", lambda b: self.lexemes_from_bytes(b)),
("vectors", lambda b: serialize_vectors(b)) ("vectors", lambda b: serialize_vectors(b))
)) ))
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
util.from_bytes(bytes_data, setters, exclude) util.from_bytes(bytes_data, setters, exclude)
if self.vectors.name is not None: if self.vectors.name is not None:
link_vectors_to_models(self) link_vectors_to_models(self)
@ -518,7 +543,10 @@ cdef class Vocab:
for j in range(sizeof(lex_data.data)): for j in range(sizeof(lex_data.data)):
lex_data.data[j] = bytes_ptr[i+j] lex_data.data[j] = bytes_ptr[i+j]
Lexeme.c_from_bytes(lexeme, lex_data) Lexeme.c_from_bytes(lexeme, lex_data)
prev_entry = self._by_orth.get(lexeme.orth)
if prev_entry != NULL:
memcpy(prev_entry, lexeme, sizeof(LexemeC))
continue
ptr = self.strings._map.get(lexeme.orth) ptr = self.strings._map.get(lexeme.orth)
if ptr == NULL: if ptr == NULL:
continue continue

27
website/.eslintrc Normal file

@ -0,0 +1,27 @@
{
    "extends": ["standard", "prettier"],
    "plugins": ["standard", "react", "react-hooks"],
    "rules": {
        "no-var": "error",
        "no-unused-vars": 1,
        "arrow-spacing": ["error", { "before": true, "after": true }],
        "indent": ["error", 4],
        "semi": ["error", "never"],
        "arrow-parens": ["error", "as-needed"],
        "standard/object-curly-even-spacing": ["error", "either"],
        "standard/array-bracket-even-spacing": ["error", "either"],
        "standard/computed-property-even-spacing": ["error", "even"],
        "standard/no-callback-literal": ["error", ["cb", "callback"]],
        "react/jsx-uses-react": "error",
        "react/jsx-uses-vars": "error",
        "react-hooks/rules-of-hooks": "error",
        "react-hooks/exhaustive-deps": "warn"
    },
    "parser": "babel-eslint",
    "parserOptions": {
        "ecmaVersion": 8
    },
    "env": {
        "browser": true
    }
}


@ -78,7 +78,7 @@ assigned by spaCy's [models](/models). The individual mapping is specific to the
training corpus and can be defined in the respective language data's training corpus and can be defined in the respective language data's
[`tag_map.py`](/usage/adding-languages#tag-map). [`tag_map.py`](/usage/adding-languages#tag-map).
<Accordion title="Universal Part-of-speech Tags"> <Accordion title="Universal Part-of-speech Tags" id="pos-universal">
spaCy also maps all language-specific part-of-speech tags to a small, fixed set spaCy also maps all language-specific part-of-speech tags to a small, fixed set
of word type tags following the of word type tags following the
@ -269,7 +269,7 @@ This section lists the syntactic dependency labels assigned by spaCy's
[models](/models). The individual labels are language-specific and depend on the [models](/models). The individual labels are language-specific and depend on the
training corpus. training corpus.
<Accordion title="Universal Dependency Labels"> <Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
used in all languages trained on Universal Dependency Corpora. used in all languages trained on Universal Dependency Corpora.


@ -244,9 +244,10 @@ Serialize the pipe to disk.
> parser.to_disk("/path/to/parser") > parser.to_disk("/path/to/parser")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## DependencyParser.from_disk {#from_disk tag="method"} ## DependencyParser.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------------------ | -------------------------------------------------------------------------- | | ----------- | ------------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. | | **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
## DependencyParser.to_bytes {#to_bytes tag="method"} ## DependencyParser.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring. Serialize the pipe to a bytestring.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ----------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. | | **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
## DependencyParser.from_bytes {#from_bytes tag="method"} ## DependencyParser.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> parser.from_bytes(parser_bytes) > parser.from_bytes(parser_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ------------------ | ---------------------------------------------- | | ------------ | ------------------ | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. | | **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
## DependencyParser.labels {#labels tag="property"} ## DependencyParser.labels {#labels tag="property"}
@ -312,3 +314,21 @@ The labels currently added to the component.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ---------------------------------- | | ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. | | **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> parser.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |


@ -237,7 +237,7 @@ attribute ID.
> from spacy.attrs import ORTH > from spacy.attrs import ORTH
> doc = nlp(u"apple apple orange banana") > doc = nlp(u"apple apple orange banana")
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2} > assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
> doc.to_array([attrs.ORTH]) > doc.to_array([ORTH])
> # array([[11880], [11880], [7561], [12800]]) > # array([[11880], [11880], [7561], [12800]])
> ``` > ```
@ -349,11 +349,12 @@ array of attributes.
> assert doc[0].pos_ == doc2[0].pos_ > assert doc[0].pos_ == doc2[0].pos_
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------- | | ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
| `attrs` | list | A list of attribute ID ints. | | `attrs` | list | A list of attribute ID ints. |
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. | | `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
| **RETURNS** | `Doc` | Itself. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | Itself. |
## Doc.to_disk {#to_disk tag="method" new="2"} ## Doc.to_disk {#to_disk tag="method" new="2"}
@ -365,9 +366,10 @@ Save the current state to a directory.
> doc.to_disk("/path/to/doc") > doc.to_disk("/path/to/doc")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Doc.from_disk {#from_disk tag="method" new="2"} ## Doc.from_disk {#from_disk tag="method" new="2"}
@ -384,6 +386,7 @@ Loads state from a directory. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | The modified `Doc` object. | | **RETURNS** | `Doc` | The modified `Doc` object. |
## Doc.to_bytes {#to_bytes tag="method"} ## Doc.to_bytes {#to_bytes tag="method"}
@ -397,9 +400,10 @@ Serialize, i.e. export the document contents to a binary string.
> doc_bytes = doc.to_bytes() > doc_bytes = doc.to_bytes()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | --------------------------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
## Doc.from_bytes {#from_bytes tag="method"} ## Doc.from_bytes {#from_bytes tag="method"}
@ -416,10 +420,11 @@ Deserialize, i.e. import the document contents from a binary string.
> assert doc.text == doc2.text > assert doc.text == doc2.text
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ------------------------ | | ----------- | ----- | ------------------------------------------------------------------------- |
| `data` | bytes | The string to load from. | | `data` | bytes | The string to load from. |
| **RETURNS** | `Doc` | The `Doc` object. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Doc` | The `Doc` object. |
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"} ## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
@ -640,20 +645,45 @@ The L2 norm of the document's vector representation.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Type | Description | | Name | Type | Description |
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `text` | unicode | A unicode representation of the document text. | | `text` | unicode | A unicode representation of the document text. |
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. | | `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. | | `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
| `vocab` | `Vocab` | The store of lexical types. | | `vocab` | `Vocab` | The store of lexical types. |
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. | | `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. | | `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
| `user_data` | - | A generic storage area, for user custom data. | | `user_data` | - | A generic storage area, for user custom data. |
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. | | `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. | | `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. | | `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
| `sentiment` | float | The document's positivity/negativity score, if available. | | `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. | | `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. | | `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. | | `sentiment` | float | The document's positivity/negativity score, if available. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | | `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = doc.to_bytes(exclude=["text", "tensor"])
> doc.from_disk("./doc.bin", exclude=["user_data"])
> ```
| Name | Description |
| ------------------ | --------------------------------------------- |
| `text` | The value of the `Doc.text` attribute. |
| `sentiment` | The value of the `Doc.sentiment` attribute. |
| `tensor` | The value of the `Doc.tensor` attribute. |
| `user_data` | The value of the `Doc.user_data` dictionary. |
| `user_data_keys` | The keys of the `Doc.user_data` dictionary. |
| `user_data_values` | The values of the `Doc.user_data` dictionary. |
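
The filtering behaviour described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's implementation: `ToyDoc`, its JSON payload, and its three fields are hypothetical stand-ins that only mirror the idea of skipping named fields on both the writing and the reading side.

```python
import json


class ToyDoc:
    """Hypothetical stand-in for an object with named serialization fields."""

    def __init__(self, text, sentiment=0.0, user_data=None):
        self.text = text
        self.sentiment = sentiment
        self.user_data = user_data or {}

    def to_bytes(self, exclude=()):
        # Each field name maps to a getter; names listed in `exclude` are skipped.
        serializers = {
            "text": lambda: self.text,
            "sentiment": lambda: self.sentiment,
            "user_data": lambda: self.user_data,
        }
        payload = {
            name: get() for name, get in serializers.items() if name not in exclude
        }
        return json.dumps(payload, sort_keys=True).encode("utf8")

    def from_bytes(self, data, exclude=()):
        # Fields listed in `exclude` keep their current value instead of being loaded.
        payload = json.loads(data.decode("utf8"))
        for name, value in payload.items():
            if name not in exclude:
                setattr(self, name, value)
        return self


doc = ToyDoc("Hello world", sentiment=0.5, user_data={"source": "demo"})
data = doc.to_bytes(exclude=["user_data"])
restored = ToyDoc("").from_bytes(data, exclude=["sentiment"])
```

In this sketch, excluding a field when writing leaves it out of the payload, while excluding it when reading leaves the receiving object's existing value untouched.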


@ -244,9 +244,10 @@ Serialize the pipe to disk.
> ner.to_disk("/path/to/ner") > ner.to_disk("/path/to/ner")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## EntityRecognizer.from_disk {#from_disk tag="method"} ## EntityRecognizer.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------------------ | -------------------------------------------------------------------------- | | ----------- | ------------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. | | **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
## EntityRecognizer.to_bytes {#to_bytes tag="method"} ## EntityRecognizer.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring. Serialize the pipe to a bytestring.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ----------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. | | **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
## EntityRecognizer.from_bytes {#from_bytes tag="method"} ## EntityRecognizer.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> ner.from_bytes(ner_bytes) > ner.from_bytes(ner_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ------------------ | ---------------------------------------------- | | ------------ | ------------------ | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. | | **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
## EntityRecognizer.labels {#labels tag="property"} ## EntityRecognizer.labels {#labels tag="property"}
@ -312,3 +314,21 @@ The labels currently added to the component.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ---------------------------------- | | ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. | | **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> ner.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |


@ -91,13 +91,14 @@ multiprocessing.
> assert doc.is_parsed > assert doc.is_parsed
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `texts` | - | A sequence of unicode objects. | | `texts` | - | A sequence of unicode objects. |
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. | | `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
| `batch_size` | int | The number of texts to buffer. | | `batch_size` | int | The number of texts to buffer. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | | `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **YIELDS** | `Doc` | Documents in the order of the original text. | | `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **YIELDS** | `Doc` | Documents in the order of the original text. |
## Language.update {#update tag="method"} ## Language.update {#update tag="method"}
@ -112,13 +113,14 @@ Update the models in the pipeline.
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer) > nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. | | `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). | | `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
| `drop` | float | The dropout rate. | | `drop` | float | The dropout rate. |
| `sgd` | callable | An optimizer. | | `sgd` | callable | An optimizer. |
| **RETURNS** | dict | Results from the update. | | `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | dict | Results from the update. |
## Language.begin_training {#begin_training tag="method"} ## Language.begin_training {#begin_training tag="method"}
@ -130,11 +132,12 @@ Allocate models, pre-process training data and acquire an optimizer.
> optimizer = nlp.begin_training(gold_tuples) > optimizer = nlp.begin_training(gold_tuples)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------- | -------- | ---------------------------- | | -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
| `gold_tuples` | iterable | Gold-standard training data. | | `gold_tuples` | iterable | Gold-standard training data. |
| `**cfg` | - | Config parameters. | | `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
| **RETURNS** | callable | An optimizer. | | `**cfg` | - | Config parameters (sent to all components). |
| **RETURNS** | callable | An optimizer. |
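
As a rough sketch of how a `component_cfg` dictionary keyed by component name could sit alongside `**cfg` settings that are sent to every component: the helper `resolve_component_cfg` and the keys `width` and `depth` are hypothetical, and letting the per-component entry win on a name clash is an assumption of this sketch, not documented behaviour.

```python
def resolve_component_cfg(name, component_cfg=None, **cfg):
    """Build the keyword arguments for one (hypothetical) pipeline component."""
    merged = dict(cfg)  # settings shared by all components
    # Assumption: an entry for this specific component overrides shared settings.
    merged.update((component_cfg or {}).get(name, {}))
    return merged


# Hypothetical config: only the "ner" component receives an extra setting.
component_cfg = {"ner": {"width": 64}}
ner_cfg = resolve_component_cfg("ner", component_cfg, depth=2)
tagger_cfg = resolve_component_cfg("tagger", component_cfg, depth=2)
```

Components without an entry in `component_cfg` simply receive the shared settings.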
## Language.use_params {#use_params tag="contextmanager, method"} ## Language.use_params {#use_params tag="contextmanager, method"}
@ -327,7 +330,7 @@ the model**.
| Name | Type | Description | | Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved. | | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
## Language.from_disk {#from_disk tag="method" new="2"} ## Language.from_disk {#from_disk tag="method" new="2"}
@ -349,22 +352,22 @@ loaded object.
> nlp = English().from_disk("/path/to/en_model") > nlp = English().from_disk("/path/to/en_model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------- | | ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Language` | The modified `Language` object. | | **RETURNS** | `Language` | The modified `Language` object. |
<Infobox title="Changed in v2.0" variant="warning"> <Infobox title="Changed in v2.0" variant="warning">
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`, As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
to improve consistency across classes. Pipeline components to prevent from being to improve consistency across classes. Pipeline components to prevent from being
loaded can now be added as a list to `disable`, instead of specifying one loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1),
keyword argument per component. instead of specifying one keyword argument per component.
```diff ```diff
- nlp = spacy.load("en", tagger=False, entity=False) - nlp = spacy.load("en", tagger=False, entity=False)
+ nlp = English().from_disk("/model", disable=["tagger', 'ner"]) + nlp = English().from_disk("/model", exclude=["tagger", "ner"])
``` ```
</Infobox> </Infobox>
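
For code that still passes one keyword argument per component, the migration shown in the infobox can be sketched as a small helper. `legacy_flags_to_exclude` and the `LEGACY_NAMES` mapping are hypothetical names introduced for illustration; the `entity` to `ner` rename is taken from the diff example above.

```python
# Assumption based on the example above: the old "entity" flag maps to "ner".
LEGACY_NAMES = {"entity": "ner"}


def legacy_flags_to_exclude(**flags):
    """Turn v1-style `component=False` flags into a list of names to exclude."""
    exclude = []
    for name, enabled in flags.items():
        if enabled is False:
            exclude.append(LEGACY_NAMES.get(name, name))
    return exclude


exclude = legacy_flags_to_exclude(tagger=False, entity=False, parser=True)
```

The resulting list could then be passed as the `exclude` argument instead of the individual flags.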
@ -379,10 +382,10 @@ Serialize the current state to a binary string.
> nlp_bytes = nlp.to_bytes() > nlp_bytes = nlp.to_bytes()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- | | ----------- | ----- | ----------------------------------------------------------------------------------------- |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized. | | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Language` object. | | **RETURNS** | bytes | The serialized form of the `Language` object. |
## Language.from_bytes {#from_bytes tag="method"} ## Language.from_bytes {#from_bytes tag="method"}
@ -400,20 +403,21 @@ available to the loaded object.
> nlp2.from_bytes(nlp_bytes) > nlp2.from_bytes(nlp_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ---------- | --------------------------------------------------------------------------------- | | ------------ | ---------- | ----------------------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Language` | The `Language` object. | | **RETURNS** | `Language` | The `Language` object. |
<Infobox title="Changed in v2.0" variant="warning"> <Infobox title="Changed in v2.0" variant="warning">
Pipeline components to prevent from being loaded can now be added as a list to Pipeline components to prevent from being loaded can now be added as a list to
`disable`, instead of specifying one keyword argument per component. `disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument
per component.
```diff ```diff
- nlp = English().from_bytes(bytes, tagger=False, entity=False) - nlp = English().from_bytes(bytes, tagger=False, entity=False)
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"]) + nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])
``` ```
</Infobox> </Infobox>
@ -437,3 +441,23 @@ Pipeline components to prevent from being loaded can now be added as a list to
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. | | `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | | `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. | | `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
> nlp.from_disk("./model-data", exclude=["ner"])
> ```
| Name | Description |
| ----------- | -------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `tokenizer` | Tokenization rules and exceptions. |
| `meta` | The meta data, available as `Language.meta`. |
| ... | String names of pipeline components, e.g. `"ner"`. |


@ -316,6 +316,22 @@ taken.
| ----------- | ------- | --------------- | | ----------- | ------- | --------------- |
| **RETURNS** | `Token` | The root token. | | **RETURNS** | `Token` | The root token. |
## Span.conjuncts {#conjuncts tag="property" model="parser"}
A tuple of tokens coordinated to `span.root`.
> #### Example
>
> ```python
> doc = nlp(u"I like apples and oranges")
> apples_conjuncts = doc[2:3].conjuncts
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
> ```
| Name | Type | Description |
| ----------- | ------- | ----------------------- |
| **RETURNS** | `tuple` | The coordinated tokens. |
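
To make "coordinated to `span.root`" concrete, here is a toy sketch over a hand-written dependency parse. It does not call spaCy: `ToyToken`, the `conjuncts` function, and the parse itself are hypothetical, and treating coordination as symmetric (asking from either conjunct returns the other) is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical token record: index, text, index of the head, dependency label.
ToyToken = namedtuple("ToyToken", ["i", "text", "head", "dep"])


def conjuncts(tokens, i):
    """Return the tokens coordinated with tokens[i], excluding tokens[i] itself."""
    # Walk up "conj" arcs to reach the first conjunct of the coordination.
    start = i
    while tokens[start].dep == "conj" and tokens[start].head != start:
        start = tokens[start].head
    # Collect every token attached below it through chains of "conj" arcs.
    found, frontier = [], [start]
    while frontier:
        current = frontier.pop(0)
        for tok in tokens:
            if tok.dep == "conj" and tok.head == current and tok.i != current:
                found.append(tok.i)
                frontier.append(tok.i)
    members = sorted(set([start] + found))
    return tuple(tokens[j] for j in members if j != i)


# Hand-written parse of "I like apples and oranges".
tokens = [
    ToyToken(0, "I", 1, "nsubj"),
    ToyToken(1, "like", 1, "ROOT"),
    ToyToken(2, "apples", 1, "dobj"),
    ToyToken(3, "and", 2, "cc"),
    ToyToken(4, "oranges", 2, "conj"),
]
assert [t.text for t in conjuncts(tokens, 2)] == ["oranges"]
```

For a span, the same lookup would be applied to the index of the span's root token.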
## Span.lefts {#lefts tag="property" model="parser"} ## Span.lefts {#lefts tag="property" model="parser"}
Tokens that are to the left of the span, whose heads are within the span. Tokens that are to the left of the span, whose heads are within the span.


@ -151,10 +151,9 @@ Serialize the current state to a binary string.
> store_bytes = stringstore.to_bytes() > store_bytes = stringstore.to_bytes()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- | | ----------- | ----- | ------------------------------------------------ |
| `**exclude` | - | Named attributes to prevent from being serialized. | | **RETURNS** | bytes | The serialized form of the `StringStore` object. |
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
## StringStore.from_bytes {#from_bytes tag="method"} ## StringStore.from_bytes {#from_bytes tag="method"}
@ -168,11 +167,10 @@ Load state from a binary string.
> new_store = StringStore().from_bytes(store_bytes) > new_store = StringStore().from_bytes(store_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ------------- | ---------------------------------------------- | | ------------ | ------------- | ------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | **RETURNS** | `StringStore` | The `StringStore` object. |
| **RETURNS** | `StringStore` | The `StringStore` object. |
## Utilities {#util} ## Utilities {#util}


@ -244,9 +244,10 @@ Serialize the pipe to disk.
> tagger.to_disk("/path/to/tagger") > tagger.to_disk("/path/to/tagger")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tagger.from_disk {#from_disk tag="method"} ## Tagger.from_disk {#from_disk tag="method"}
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tagger` | The modified `Tagger` object. | | **RETURNS** | `Tagger` | The modified `Tagger` object. |
## Tagger.to_bytes {#to_bytes tag="method"} ## Tagger.to_bytes {#to_bytes tag="method"}
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring. Serialize the pipe to a bytestring.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Tagger` object. | | **RETURNS** | bytes | The serialized form of the `Tagger` object. |
## Tagger.from_bytes {#from_bytes tag="method"} ## Tagger.from_bytes {#from_bytes tag="method"}
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> tagger.from_bytes(tagger_bytes) > tagger.from_bytes(tagger_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | -------- | ---------------------------------------------- | | ------------ | -------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tagger` | The `Tagger` object. | | **RETURNS** | `Tagger` | The `Tagger` object. |
## Tagger.labels {#labels tag="property"} ## Tagger.labels {#labels tag="property"}
@ -314,3 +316,22 @@ tags by default, e.g. `VERB`, `NOUN` and so on.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ---------------------------------- | | ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. | | **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> tagger.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| --------- | ------------------------------------------------------------------------------------------ |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |
| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tags. |


@ -260,9 +260,10 @@ Serialize the pipe to disk.
> textcat.to_disk("/path/to/textcat") > textcat.to_disk("/path/to/textcat")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## TextCategorizer.from_disk {#from_disk tag="method"} ## TextCategorizer.from_disk {#from_disk tag="method"}
@ -278,6 +279,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----------------- | -------------------------------------------------------------------------- | | ----------- | ----------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. | | **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
## TextCategorizer.to_bytes {#to_bytes tag="method"} ## TextCategorizer.to_bytes {#to_bytes tag="method"}
@ -291,10 +293,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
Serialize the pipe to a bytestring. Serialize the pipe to a bytestring.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ---------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. | | **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
## TextCategorizer.from_bytes {#from_bytes tag="method"} ## TextCategorizer.from_bytes {#from_bytes tag="method"}
@ -308,11 +310,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
> textcat.from_bytes(textcat_bytes) > textcat.from_bytes(textcat_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ----------------- | ---------------------------------------------- | | ------------ | ----------------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. | | **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
## TextCategorizer.labels {#labels tag="property"} ## TextCategorizer.labels {#labels tag="property"}
@ -328,3 +330,21 @@ The labels currently added to the component.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | ---------------------------------- | | ----------- | ----- | ---------------------------------- |
| **RETURNS** | tuple | The labels added to the component. | | **RETURNS** | tuple | The labels added to the component. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> textcat.to_disk("/path", exclude=["vocab"])
> ```
| Name | Description |
| ------- | -------------------------------------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `cfg` | The config file. You usually don't want to exclude this. |
| `model` | The binary model data. You usually don't want to exclude this. |


@ -211,7 +211,7 @@ The rightmost token of this token's syntactic descendants.
## Token.conjuncts {#conjuncts tag="property" model="parser"} ## Token.conjuncts {#conjuncts tag="property" model="parser"}
A sequence of coordinated tokens, including the token itself. A tuple of coordinated tokens, not including the token itself.
> #### Example > #### Example
> >
@ -221,9 +221,9 @@ A sequence of coordinated tokens, including the token itself.
> assert [t.text for t in apples_conjuncts] == [u"oranges"] > assert [t.text for t in apples_conjuncts] == [u"oranges"]
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------- | -------------------- | | ----------- | ------- | ----------------------- |
| **YIELDS** | `Token` | A coordinated token. | | **RETURNS** | `tuple` | The coordinated tokens. |
## Token.children {#children tag="property" model="parser"} ## Token.children {#children tag="property" model="parser"}


@ -127,9 +127,10 @@ Serialize the tokenizer to disk.
> tokenizer.to_disk("/path/to/tokenizer") > tokenizer.to_disk("/path/to/tokenizer")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tokenizer.from_disk {#from_disk tag="method"} ## Tokenizer.from_disk {#from_disk tag="method"}
@ -145,6 +146,7 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. | | **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
## Tokenizer.to_bytes {#to_bytes tag="method"} ## Tokenizer.to_bytes {#to_bytes tag="method"}
@ -158,10 +160,10 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
Serialize the tokenizer to a bytestring. Serialize the tokenizer to a bytestring.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. | | **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
## Tokenizer.from_bytes {#from_bytes tag="method"} ## Tokenizer.from_bytes {#from_bytes tag="method"}
@ -176,11 +178,11 @@ it.
> tokenizer.from_bytes(tokenizer_bytes) > tokenizer.from_bytes(tokenizer_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ----------- | ---------------------------------------------- | | ------------ | ----------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. | | **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
## Attributes {#attributes} ## Attributes {#attributes}
@ -190,3 +192,25 @@ it.
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. | | `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. | | `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. | | `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
> tokenizer.from_disk("./data", exclude=["token_match"])
> ```
| Name | Description |
| ---------------- | --------------------------------- |
| `vocab` | The shared [`Vocab`](/api/vocab). |
| `prefix_search` | The prefix rules. |
| `suffix_search` | The suffix rules. |
| `infix_finditer` | The infix rules. |
| `token_match` | The token match expression. |
| `exceptions` | The tokenizer exception rules. |
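As a rough illustration of the idea behind `exclude`, the sketch below filters named fields during a round trip. `ToyTokenizer` and its JSON encoding are hypothetical stand-ins invented for this example; they are not spaCy's `Tokenizer` or its actual serialization format.

```python
import json


class ToyTokenizer:
    """Hypothetical stand-in for illustration, not spaCy's Tokenizer."""

    def __init__(self, exceptions=None, token_match=None):
        self.exceptions = exceptions or {}
        self.token_match = token_match

    def _fields(self):
        # Each serialization field name maps to the data stored under it.
        return {"exceptions": self.exceptions, "token_match": self.token_match}

    def to_bytes(self, exclude=tuple()):
        # Keep only the fields whose names are not listed in `exclude`.
        data = {k: v for k, v in self._fields().items() if k not in exclude}
        return json.dumps(data, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data, exclude=tuple()):
        data = json.loads(bytes_data.decode("utf8"))
        for name, value in data.items():
            # Excluded fields keep whatever value the object already has.
            if name not in exclude:
                setattr(self, name, value)
        return self


source = ToyTokenizer(exceptions={"don't": ["do", "n't"]}, token_match="^https?://")
data = source.to_bytes(exclude=["token_match"])
target = ToyTokenizer(token_match="^www\\.").from_bytes(data)
```

Because `token_match` was excluded when saving, `target` keeps its own `token_match` and only picks up the exceptions from `source`.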


@ -351,6 +351,24 @@ the two-letter language code.
| `name` | unicode | Two-letter language code, e.g. `'en'`. | | `name` | unicode | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. | | `cls` | `Language` | The language class, e.g. `English`. |
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
Check whether a `Language` class is already loaded. `Language` classes are
loaded lazily, to avoid expensive setup code associated with the language data.
> #### Example
>
> ```python
> lang_cls = util.get_lang_class("en")
> assert util.lang_class_is_loaded("en") is True
> assert util.lang_class_is_loaded("de") is False
> ```
| Name | Type | Description |
| ----------- | ------- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded. |
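Lazy loading of this kind can be pictured as a cache that is only filled on first request. The following is a minimal sketch under that assumption: the `toy_` functions and `_expensive_setup` are hypothetical names made up for this example and do not reflect how `spacy.util` is implemented.

```python
_LOADED = {}  # language code -> class, filled only on first request


def _expensive_setup(name):
    # Hypothetical placeholder for importing a language module and its data.
    return type(name.upper(), (object,), {"lang": name})


def toy_get_lang_class(name):
    # Run the setup only once per language code, then reuse the cached class.
    if name not in _LOADED:
        _LOADED[name] = _expensive_setup(name)
    return _LOADED[name]


def toy_lang_class_is_loaded(name):
    # Checking the cache never triggers the expensive setup.
    return name in _LOADED


cls = toy_get_lang_class("en")
```

The check only inspects the cache, so asking about a language that was never requested does not load it as a side effect.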
### util.load_model {#util.load_model tag="function" new="2"} ### util.load_model {#util.load_model tag="function" new="2"}
Load a model from a shortcut link, package or data path. If called with a Load a model from a shortcut link, package or data path. If called with a


@ -311,10 +311,9 @@ Save the current state to a directory.
> >
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `**exclude` | - | Named attributes to prevent from being saved. |
## Vectors.from_disk {#from_disk tag="method"} ## Vectors.from_disk {#from_disk tag="method"}
@ -342,10 +341,9 @@ Serialize the current state to a binary string.
> vectors_bytes = vectors.to_bytes() > vectors_bytes = vectors.to_bytes()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- | | ----------- | ----- | -------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | **RETURNS** | bytes | The serialized form of the `Vectors` object. |
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
## Vectors.from_bytes {#from_bytes tag="method"} ## Vectors.from_bytes {#from_bytes tag="method"}
@ -360,11 +358,10 @@ Load state from a binary string.
> new_vectors.from_bytes(vectors_bytes) > new_vectors.from_bytes(vectors_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | --------- | ---------------------------------------------- | | ----------- | --------- | ---------------------- |
| `data` | bytes | The data to load from. | | `data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | **RETURNS** | `Vectors` | The `Vectors` object. |
| **RETURNS** | `Vectors` | The `Vectors` object. |
## Attributes {#attributes} ## Attributes {#attributes}


@ -221,9 +221,10 @@ Save the current state to a directory.
> nlp.vocab.to_disk("/path/to/vocab") > nlp.vocab.to_disk("/path/to/vocab")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Vocab.from_disk {#from_disk tag="method" new="2"} ## Vocab.from_disk {#from_disk tag="method" new="2"}
@ -239,6 +240,7 @@ Loads state from a directory. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ---------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Vocab` | The modified `Vocab` object. | | **RETURNS** | `Vocab` | The modified `Vocab` object. |
## Vocab.to_bytes {#to_bytes tag="method"} ## Vocab.to_bytes {#to_bytes tag="method"}
@ -251,10 +253,10 @@ Serialize the current state to a binary string.
> vocab_bytes = nlp.vocab.to_bytes() > vocab_bytes = nlp.vocab.to_bytes()
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----- | -------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------------------- |
| `**exclude` | - | Named attributes to prevent from being serialized. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | bytes | The serialized form of the `Vocab` object. | | **RETURNS** | bytes | The serialized form of the `Vocab` object. |
## Vocab.from_bytes {#from_bytes tag="method"} ## Vocab.from_bytes {#from_bytes tag="method"}
@ -269,11 +271,11 @@ Load state from a binary string.
> vocab.from_bytes(vocab_bytes) > vocab.from_bytes(vocab_bytes)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------ | ------- | ---------------------------------------------- | | ------------ | ------- | ------------------------------------------------------------------------- |
| `bytes_data` | bytes | The data to load from. | | `bytes_data` | bytes | The data to load from. |
| `**exclude` | - | Named attributes to prevent from being loaded. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Vocab` | The `Vocab` object. | | **RETURNS** | `Vocab` | The `Vocab` object. |
## Attributes {#attributes} ## Attributes {#attributes}
@ -286,8 +288,28 @@ Load state from a binary string.
> assert type(PERSON) == int > assert type(PERSON) == int
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------------------------------ | ------------- | --------------------------------------------- | | --------------------------------------------- | ------------- | ------------------------------------------------------------ |
| `strings` | `StringStore` | A table managing the string-to-int mapping. | | `strings` | `StringStore` | A table managing the string-to-int mapping. |
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. | | `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
| `vectors_length` | int | Number of dimensions for each word vector. | | `vectors_length` | int | Number of dimensions for each word vector. |
| `writing_system` <Tag variant="new">2.1</Tag> | dict | A dict with information about the language's writing system. |
## Serialization fields {#serialization-fields}
During serialization, spaCy will export several data fields used to restore
different aspects of the object. If needed, you can exclude them from
serialization by passing in the string names via the `exclude` argument.
> #### Example
>
> ```python
> data = vocab.to_bytes(exclude=["strings", "vectors"])
> vocab.from_disk("./vocab", exclude=["strings"])
> ```
| Name | Description |
| --------- | ----------------------------------------------------- |
| `strings` | The strings in the [`StringStore`](/api/stringstore). |
| `lexemes` | The lexeme data. |
| `vectors` | The word vectors, if available. |


@ -39,9 +39,9 @@ together all components and creating the `Language` subclass for example,
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. | | **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
[stop_words.py]: [stop_words.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/stop_words.py https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]: [tokenizer_exceptions.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tokenizer_exceptions.py https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[norm_exceptions.py]: [norm_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
[punctuation.py]: [punctuation.py]:
@ -49,12 +49,12 @@ together all components and creating the `Language` subclass for example,
[char_classes.py]: [char_classes.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py
[lex_attrs.py]: [lex_attrs.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lex_attrs.py https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]: [syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[lemmatizer.py]: [lemmatizer.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lemmatizer.py https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
[tag_map.py]: [tag_map.py]:
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tag_map.py https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]: [morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py


@ -33,9 +33,22 @@ list containing the component names:
import Accordion from 'components/accordion.js' import Accordion from 'components/accordion.js'
<Accordion title="Does the order of pipeline components matter?"> <Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
No In spaCy v2.x, the statistical components like the tagger or parser are
independent and don't share any data between themselves. For example, the named
entity recognizer doesn't use any features set by the tagger and parser, and so
on. This means that you can swap them, or remove single components from the
pipeline without affecting the others.
However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions.
</Accordion> </Accordion>
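To make the dependency between a custom component and an earlier one concrete, here is a toy sketch in plain Python. The dict-based "doc" and the functions `toy_tagger`, `toy_lemmatizer` and `run_pipeline` are hypothetical and only mimic the idea; they are not spaCy components.

```python
def toy_tagger(doc):
    # Assign a made-up part-of-speech tag to every token.
    doc["tags"] = ["VERB" if tok.endswith("ing") else "NOUN" for tok in doc["tokens"]]
    return doc


def toy_lemmatizer(doc):
    # Depends on the tags, so it fails if the tagger has not run yet.
    if "tags" not in doc:
        raise ValueError("toy_lemmatizer needs tags; add it after toy_tagger")
    doc["lemmas"] = [
        tok[:-3] if tag == "VERB" else tok
        for tok, tag in zip(doc["tokens"], doc["tags"])
    ]
    return doc


def run_pipeline(tokens, components):
    # Call each component in order on the same doc, like a processing pipeline.
    doc = {"tokens": list(tokens)}
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(["walking", "dog"], [toy_tagger, toy_lemmatizer])
```

Swapping the two components makes `toy_lemmatizer` raise a `ValueError`, which is the kind of ordering constraint a custom component can introduce.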


@ -39,7 +39,7 @@ and morphological analysis.
</div> </div>
<Infobox title="Table of Contents"> <Infobox title="Table of Contents" id="toc">
- [Language data 101](#101) - [Language data 101](#101)
- [The Language subclass](#language-subclass) - [The Language subclass](#language-subclass)
@ -105,15 +105,15 @@ to know the language's character set. If the language you're adding uses
non-Latin characters, you might need to define the required character classes in non-Latin characters, you might need to define the required character classes in
the global the global
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py). [`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
For efficiency, spaCy uses hard-coded Unicode ranges to define character classes, For efficiency, spaCy uses hard-coded Unicode ranges to define character
the definitions of which can be found on [Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). classes, the definitions of which can be found on
If the language requires very specific punctuation [Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language
rules, you should consider overwriting the default regular expressions with your requires very specific punctuation rules, you should consider overwriting the
own in the language's `Defaults`. default regular expressions with your own in the language's `Defaults`.
</Infobox> </Infobox>
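As a small sketch of what hard-coded Unicode ranges can look like, the snippet below builds a regular expression character class from two ranges. The variable names are hypothetical and chosen for this example only; they are not the names used in `char_classes.py`.

```python
import re

# Hard-coded ranges: basic Latin letters and the Cyrillic block (U+0400 to U+04FF).
LATIN_BASIC = "A-Za-z"
CYRILLIC = "\u0400-\u04FF"

# Concatenate the ranges into one character class and match runs of letters.
ALPHA = LATIN_BASIC + CYRILLIC
alpha_re = re.compile("[{}]+".format(ALPHA))

matches = alpha_re.findall("spaCy и Москва 2019")
```

Listing ranges explicitly keeps the pattern cheap to compile and match, at the cost of having to add a range for every script the language needs.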
### Creating a `Language` subclass {#language-subclass} ### Creating a language subclass {#language-subclass}
Language-specific code and resources should be organized into a sub-package of Language-specific code and resources should be organized into a sub-package of
spaCy, named according to the language's spaCy, named according to the language's
@ -121,9 +121,9 @@ spaCy, named according to the language's
code and resources specific to Spanish are placed into a directory code and resources specific to Spanish are placed into a directory
`spacy/lang/es`, which can be imported as `spacy.lang.es`. `spacy/lang/es`, which can be imported as `spacy.lang.es`.
To get started, you can use our To get started, you can check out the
[templates](https://github.com/explosion/spacy-dev-resources/templates/new_language) [existing languages](https://github.com/explosion/spacy/tree/master/spacy/lang).
for the most important files. Here's what the class template looks like: Here's what the class could look like:
```python ```python
### __init__.py (excerpt) ### __init__.py (excerpt)
@ -614,7 +614,7 @@ require models to be trained from labeled examples. The word vectors, word
probabilities and word clusters also require training, although these can be probabilities and word clusters also require training, although these can be
trained from unlabeled text, which tends to be much easier to collect. trained from unlabeled text, which tends to be much easier to collect.
### Creating a vocabulary file ### Creating a vocabulary file {#vocab-file}
spaCy expects that common words will be cached in a [`Vocab`](/api/vocab) spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
instance. The vocabulary caches lexical features. spaCy loads the vocabulary instance. The vocabulary caches lexical features. spaCy loads the vocabulary
@ -631,20 +631,20 @@ of using deep learning for NLP with limited labeled data. The vectors are also
useful by themselves: they power the `.similarity` methods in spaCy. For best useful by themselves: they power the `.similarity` methods in spaCy. For best
results, you should pre-process the text with spaCy before training the Word2vec results, you should pre-process the text with spaCy before training the Word2vec
model. This ensures your tokenization will match. You can use our model. This ensures your tokenization will match. You can use our
[word vectors training script](https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py), [word vectors training script](https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py),
which pre-processes the text with your language-specific tokenizer and trains which pre-processes the text with your language-specific tokenizer and trains
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin` the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
file should consist of one word and vector per line. file should consist of one word and vector per line.
```python ```python
https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py
``` ```
If you don't have a large sample of text available, you can also convert word If you don't have a large sample of text available, you can also convert word
vectors produced by a variety of other tools into spaCy's format. See the docs vectors produced by a variety of other tools into spaCy's format. See the docs
on [converting word vectors](/usage/vectors-similarity#converting) for details. on [converting word vectors](/usage/vectors-similarity#converting) for details.
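The "one word and vector per line" layout mentioned above can be read with a few lines of plain Python. This is a sketch that assumes a text file with space-separated values; the function name `parse_vector_lines` and the sample data are made up for illustration, and real vector files may differ (for example, by starting with a header line).

```python
def parse_vector_lines(lines):
    """Map each word to its vector, assuming 'word v1 v2 ...' per line."""
    vectors = {}
    for line in lines:
        pieces = line.rstrip().split(" ")
        if len(pieces) < 2:
            # Skip blank lines or lines without any vector values.
            continue
        vectors[pieces[0]] = [float(value) for value in pieces[1:]]
    return vectors


sample = ["apple 0.1 0.2 0.3", "banana 0.4 0.5 0.6", ""]
vectors = parse_vector_lines(sample)
```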
### Creating or converting a training corpus ### Creating or converting a training corpus {#training-corpus}
The easiest way to train spaCy's tagger, parser, entity recognizer or text The easiest way to train spaCy's tagger, parser, entity recognizer or text
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility. categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.


@ -29,7 +29,7 @@ Here's a quick comparison of the functionalities offered by spaCy,
| Entity linking | ❌ | ❌ | ❌ | | Entity linking | ❌ | ❌ | ❌ |
| Coreference resolution | ❌ | ❌ | ✅ | | Coreference resolution | ❌ | ❌ | ✅ |
### When should I use what? ### When should I use what? {#comparison-usage}
Natural Language Understanding is an active area of research and development, so Natural Language Understanding is an active area of research and development, so
there are many different tools or technologies catering to different use-cases. there are many different tools or technologies catering to different use-cases.


@ -28,7 +28,7 @@ import QuickstartInstall from 'widgets/quickstart-install.js'
## Installation instructions {#installation} ## Installation instructions {#installation}
### pip ### pip {#pip}
Using pip, spaCy releases are available as source packages and binary wheels (as Using pip, spaCy releases are available as source packages and binary wheels (as
of v2.0.13). of v2.0.13).
@ -58,7 +58,7 @@ source .env/bin/activate
pip install spacy pip install spacy
``` ```
### conda ### conda {#conda}
Thanks to our great community, we've been able to re-add conda support. You can Thanks to our great community, we've been able to re-add conda support. You can
also install spaCy via `conda-forge`: also install spaCy via `conda-forge`:
@ -194,7 +194,7 @@ official distributions these are:
| Python 3.4 | Visual Studio 2010 | | Python 3.4 | Visual Studio 2010 |
| Python 3.5+ | Visual Studio 2015 | | Python 3.5+ | Visual Studio 2015 |
### Run tests ### Run tests {#run-tests}
spaCy comes with an spaCy comes with an
[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests). [extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests).
@ -418,7 +418,7 @@ either of these, clone your repository again.
</Accordion> </Accordion>
## Changelog ## Changelog {#changelog}
import Changelog from 'widgets/changelog.js' import Changelog from 'widgets/changelog.js'


@ -298,9 +298,9 @@ different languages, see the
The best way to understand spaCy's dependency parser is interactively. To make The best way to understand spaCy's dependency parser is interactively. To make
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc` this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
or a list of `Doc` objects to displaCy and run or a list of `Doc` objects to displaCy and run
[`displacy.serve`](top-level#displacy.serve) to run the web server, or [`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
[`displacy.render`](top-level#displacy.render) to generate the raw markup. If [`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
you want to know how to write rules that hook into some type of syntactic If you want to know how to write rules that hook into some type of syntactic
construction, just plug the sentence into the visualizer and see how spaCy construction, just plug the sentence into the visualizer and see how spaCy
annotates it. annotates it.
@ -621,7 +621,7 @@ For more details on the language-specific data, see the usage guide on
</Infobox> </Infobox>
<Accordion title="Should I change the language data or add custom tokenizer rules?"> <Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
Tokenization rules that are specific to one language, but can be **generalized Tokenization rules that are specific to one language, but can be **generalized
across that language** should ideally live in the language data in across that language** should ideally live in the language data in


@ -41,7 +41,7 @@ contribute to model development.
> If a model is available for a language, you can download it using the > If a model is available for a language, you can download it using the
> [`spacy download`](/api/cli#download) command. In order to use languages that > [`spacy download`](/api/cli#download) command. In order to use languages that
> don't yet come with a model, you have to import them directly, or use > don't yet come with a model, you have to import them directly, or use
> [`spacy.blank`](api/top-level#spacy.blank): > [`spacy.blank`](/api/top-level#spacy.blank):
> >
> ```python > ```python
> from spacy.lang.fi import Finnish > from spacy.lang.fi import Finnish


@ -46,7 +46,8 @@ components. spaCy then does the following:
3. Add each pipeline component to the pipeline in order, using 3. Add each pipeline component to the pipeline in order, using
[`add_pipe`](/api/language#add_pipe). [`add_pipe`](/api/language#add_pipe).
4. Make the **model data** available to the `Language` class by calling 4. Make the **model data** available to the `Language` class by calling
[`from_disk`](language#from_disk) with the path to the model data directory. [`from_disk`](/api/language#from_disk) with the path to the model data
directory.
So when you call this... So when you call this...
@ -110,7 +111,7 @@ print(nlp.pipe_names)
# ['tagger', 'parser', 'ner'] # ['tagger', 'parser', 'ner']
``` ```
### Built-in pipeline components ### Built-in pipeline components {#built-in}
spaCy ships with several built-in pipeline components that are also available in spaCy ships with several built-in pipeline components that are also available in
the `Language.factories`. This means that you can initialize them by calling the `Language.factories`. This means that you can initialize them by calling
@ -426,7 +427,7 @@ spaCy, and implement your own models trained with other machine learning
libraries. It also lets you take advantage of spaCy's data structures and the libraries. It also lets you take advantage of spaCy's data structures and the
`Doc` object as the "single source of truth". `Doc` object as the "single source of truth".
<Accordion title="Why ._ and not just a top-level attribute?"> <Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example, separation and makes it easier to ensure backwards compatibility. For example,
@ -437,7 +438,7 @@ immediately know what's built-in and what's custom for example,
</Accordion> </Accordion>
<Accordion title="How is the ._ implemented?"> <Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
Extension definitions (the defaults, methods, getters and setters you pass in Extension definitions (the defaults, methods, getters and setters you pass in
to `set_extension`) are stored in class attributes on the `Underscore` class. to `set_extension`) are stored in class attributes on the `Underscore` class.
@ -458,9 +459,7 @@ There are three main types of extensions, which can be defined using the
1. **Attribute extensions.** Set a default value for an attribute, which can be 1. **Attribute extensions.** Set a default value for an attribute, which can be
overwritten manually at any time. Attribute extensions work like "normal" overwritten manually at any time. Attribute extensions work like "normal"
variables and are the quickest way to store arbitrary information on a `Doc`, variables and are the quickest way to store arbitrary information on a `Doc`,
`Span` or `Token`. Attribute defaults behaves just like argument defaults `Span` or `Token`.
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments),
and should not be used for mutable values like dictionaries or lists.
```python ```python
Doc.set_extension("hello", default=True) Doc.set_extension("hello", default=True)
@ -527,25 +526,6 @@ Once you've registered your custom attribute, you can also use the built-in
especially useful if you want to pass in a string instead of calling especially useful if you want to pass in a string instead of calling
`doc._.my_attr`. `doc._.my_attr`.
<Infobox title="Using mutable default values" variant="danger">
When using **mutable values** like dictionaries or lists as the `default`
argument, keep in mind that they behave just like mutable default arguments
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments).
This can easily cause unintended results, like the same value being set on _all_
objects instead of only one particular instance. In most cases, it's better to
use **getters and setters**, and only set the `default` for boolean or string
values.
```diff
+ Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)
- Doc.set_extension('fruits', default={})
- doc._.fruits['apple'] = u'🍎' # all docs now have {'apple': u'🍎'}
```
</Infobox>
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3} ### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
This example shows the implementation of a pipeline component that fetches This example shows the implementation of a pipeline component that fetches


@ -15,7 +15,7 @@ their relationships. This means you can easily access and analyze the
surrounding tokens, merge spans into single tokens or add entries to the named surrounding tokens, merge spans into single tokens or add entries to the named
entities in `doc.ents`. entities in `doc.ents`.
<Accordion title="Should I use rules or train a model?"> <Accordion title="Should I use rules or train a model?" id="rules-vs-model">
For complex tasks, it's usually better to train a statistical entity recognition For complex tasks, it's usually better to train a statistical entity recognition
model. However, statistical models require training data, so for many model. However, statistical models require training data, so for many
@ -41,7 +41,7 @@ on [rule-based entity recognition](#entityruler).
</Accordion> </Accordion>
<Accordion title="When should I use the token matcher vs. the phrase matcher?"> <Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">
The `PhraseMatcher` is useful if you already have a large terminology list or The `PhraseMatcher` is useful if you already have a large terminology list or
gazetteer consisting of single or multi-token phrases that you want to find gazetteer consisting of single or multi-token phrases that you want to find


@ -22,7 +22,7 @@ the changes, see [this table](/usage/v2#incompat) and the notes on
</Infobox> </Infobox>
### Serializing the pipeline ### Serializing the pipeline {#pipeline}
When serializing the pipeline, keep in mind that this will only save out the When serializing the pipeline, keep in mind that this will only save out the
**binary data for the individual components** to allow spaCy to restore them **binary data for the individual components** to allow spaCy to restore them
@ -361,7 +361,7 @@ In theory, the entry point mechanism also lets you overwrite built-in factories
including the tokenizer. By default, spaCy will output a warning in these including the tokenizer. By default, spaCy will output a warning in these
cases, to prevent accidental overwrites and unintended results. cases, to prevent accidental overwrites and unintended results.
#### Advanced components with settings #### Advanced components with settings {#advanced-cfg}
The `**cfg` keyword arguments that the factory receives are passed down all the The `**cfg` keyword arguments that the factory receives are passed down all the
way from `spacy.load`. This means that the factory can respond to custom way from `spacy.load`. This means that the factory can respond to custom


@ -50,7 +50,7 @@ systems, or to pre-process text for **deep learning**.
</div> </div>
<Infobox title="Table of contents"> <Infobox title="Table of contents" id="toc">
- [Features](#features) - [Features](#features)
- [Linguistic annotations](#annotations) - [Linguistic annotations](#annotations)


@ -14,7 +14,7 @@ faster runtime, and many bug fixes, v2.1 also introduces experimental support
for some exciting new NLP innovations. For the full changelog, see the for some exciting new NLP innovations. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0). [release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).
### BERT/ULMFit/Elmo-style pre-training {tag="experimental"} ### BERT/ULMFit/Elmo-style pre-training {#pretraining tag="experimental"}
> #### Example > #### Example
> >
@ -39,7 +39,7 @@ it.
</Infobox> </Infobox>
### Extended match pattern API ### Extended match pattern API {#matcher-api}
> #### Example > #### Example
> >
@ -67,7 +67,7 @@ values.
</Infobox> </Infobox>
### Easy rule-based entity recognition ### Easy rule-based entity recognition {#entity-ruler}
> #### Example > #### Example
> >
@ -91,7 +91,7 @@ flexibility.
</Infobox> </Infobox>
### Phrase matching with other attributes ### Phrase matching with other attributes {#phrasematcher}
> #### Example > #### Example
> >
@ -115,7 +115,7 @@ or `POS` for finding sequences of the same part-of-speech tags.
</Infobox> </Infobox>
### Retokenizer for merging and splitting ### Retokenizer for merging and splitting {#retokenizer}
> #### Example > #### Example
> >
@ -142,7 +142,7 @@ deprecated.
</Infobox> </Infobox>
### Components and languages via entry points ### Components and languages via entry points {#entry-points}
> #### Example > #### Example
> >
@ -169,7 +169,7 @@ is required.
</Infobox> </Infobox>
### Improved documentation ### Improved documentation {#docs}
Although it looks pretty much the same, we've rebuilt the entire documentation Although it looks pretty much the same, we've rebuilt the entire documentation
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
@ -237,6 +237,19 @@ if all of your models are up to date, you can run the
+ retokenizer.merge(doc[6:8]) + retokenizer.merge(doc[6:8])
``` ```
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes`
now support a single `exclude` argument to provide a list of string names to
exclude. The docs have been updated to list the available serialization fields
for each class. The `disable` argument on the [`Language`](/api/language)
serialization methods has been renamed to `exclude` for consistency.
```diff
- nlp.to_disk("/path", disable=["parser", "ner"])
+ nlp.to_disk("/path", exclude=["parser", "ner"])
- data = nlp.tokenizer.to_bytes(vocab=False)
+ data = nlp.tokenizer.to_bytes(exclude=["vocab"])
```
- For better compatibility with the Universal Dependencies data, the lemmatizer - For better compatibility with the Universal Dependencies data, the lemmatizer
now preserves capitalization, e.g. for proper nouns. See now preserves capitalization, e.g. for proper nouns. See
[this issue](https://github.com/explosion/spaCy/issues/3256) for details. [this issue](https://github.com/explosion/spaCy/issues/3256) for details.
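
The `exclude` rename in the serialization note above can be illustrated with a small migration sketch. The helper `build_exclude` below is hypothetical and not part of spaCy; it only shows, under that assumption, how v2.0-style arguments (`disable=[...]` or flags such as `vocab=False`) might be normalized into the single `exclude` list that the v2.1 methods accept.

```python
import warnings


def build_exclude(exclude=None, disable=None, **flags):
    """Hypothetical helper (not part of spaCy): normalize v2.0-style
    serialization arguments into a single list of names to exclude."""
    names = list(exclude or [])
    if disable is not None:
        # v2.0 spelling on the Language methods, renamed to `exclude` in v2.1.
        warnings.warn("'disable' was renamed to 'exclude'", DeprecationWarning)
        names.extend(disable)
    # Old-style keyword flags such as vocab=False mean "leave this field out".
    for field, keep in flags.items():
        if keep is False:
            names.append(field)
    # Drop duplicates while preserving the original order.
    return list(dict.fromkeys(names))


print(build_exclude(disable=["parser", "ner"]))
print(build_exclude(vocab=False))
```

The resulting list could then be passed on as the `exclude` argument, for example `nlp.to_disk("/path", exclude=build_exclude(disable=["parser", "ner"]))`, matching the `+` lines in the diff above.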


@ -39,7 +39,7 @@ also add your own custom attributes, properties and methods to the `Doc`,
</div> </div>
<Infobox title="Table of Contents"> <Infobox title="Table of Contents" id="toc">
- [Summary](#summary) - [Summary](#summary)
- [New features](#features) - [New features](#features)


@ -75,7 +75,7 @@ arcs.
| `font` | unicode | Font name or font family for all text. | `"Arial"` | | `font` | unicode | Font name or font family for all text. | `"Arial"` |
For a list of all available options, see the For a list of all available options, see the
[`displacy` API documentation](top-level#displacy_options). [`displacy` API documentation](/api/top-level#displacy_options).
> #### Options example > #### Options example
> >
@ -283,7 +283,7 @@ from pathlib import Path
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
sentences = [u"This is an example.", u"This is another one."] sentences = [u"This is an example.", u"This is another one."]
for sent in sentences: for sent in sentences:
doc = nlp(sentence) doc = nlp(sent)
svg = displacy.render(doc, style="dep") svg = displacy.render(doc, style="dep")
file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg" file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
output_path = Path("/images/" + file_name) output_path = Path("/images/" + file_name)


@ -23,6 +23,11 @@
"list": "89ad33e698" "list": "89ad33e698"
}, },
"docSearch": { "docSearch": {
"apiKey": "f7dbcd148fae73db20b6ad33d03cc9e8",
"indexName": "dev_spacy_netlify",
"appId": "Y7BGGRAPHC"
},
"_docSearch": {
"apiKey": "371e26ed49d29a27bd36273dfdaf89af", "apiKey": "371e26ed49d29a27bd36273dfdaf89af",
"indexName": "spacy" "indexName": "spacy"
}, },


@ -524,6 +524,22 @@
}, },
"category": ["standalone", "research"] "category": ["standalone", "research"]
}, },
{
"id": "scispacy",
"title": "scispaCy",
"slogan": "A full spaCy pipeline and models for scientific/biomedical documents",
"github": "allenai/scispacy",
"pip": "scispacy",
"thumb": "https://i.imgur.com/dJQSclW.png",
"url": "https://allenai.github.io/scispacy/",
"author": " Allen Institute for Artificial Intelligence",
"author_links": {
"github": "allenai",
"twitter": "allenai_org",
"website": "http://allenai.org"
},
"category": ["models", "research"]
},
{ {
"id": "textacy", "id": "textacy",
"slogan": "NLP, before and after spaCy", "slogan": "NLP, before and after spaCy",
@ -851,6 +867,22 @@
}, },
"category": ["courses"] "category": ["courses"]
}, },
{
"type": "education",
"id": "datacamp-advanced-nlp",
"title": "Advanced Natural Language Processing with spaCy",
"slogan": "Datacamp, 2019",
"description": "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
"url": "https://www.datacamp.com/courses/advanced-nlp-with-spacy",
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
},
"category": ["courses"]
},
{ {
"type": "education", "type": "education",
"id": "learning-path-spacy", "id": "learning-path-spacy",
@ -910,6 +942,7 @@
"description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.", "description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
"soundcloud": "559200912", "soundcloud": "559200912",
"thumb": "https://i.imgur.com/hOBQEzc.jpg", "thumb": "https://i.imgur.com/hOBQEzc.jpg",
"url": "https://soundcloud.com/nlp-highlights/78-where-do-corpora-come-from-with-matt-honnibal-and-ines-montani",
"author": "Matt Gardner, Waleed Ammar (Allen AI)", "author": "Matt Gardner, Waleed Ammar (Allen AI)",
"author_links": { "author_links": {
"website": "https://soundcloud.com/nlp-highlights" "website": "https://soundcloud.com/nlp-highlights"
@ -925,12 +958,28 @@
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176", "iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
"iframe_height": 200, "iframe_height": 200,
"thumb": "https://i.imgur.com/rpo6BuY.png", "thumb": "https://i.imgur.com/rpo6BuY.png",
"url": "https://www.podcastinit.com/episode-87-spacy-with-matthew-honnibal/",
"author": "Tobias Macey", "author": "Tobias Macey",
"author_links": { "author_links": {
"website": "https://www.podcastinit.com" "website": "https://www.podcastinit.com"
}, },
"category": ["podcasts"] "category": ["podcasts"]
}, },
{
"type": "education",
"id": "talk-python-podcast",
"title": "Talk Python 202: Building a software business",
"slogan": "March 2019",
"description": "One core question around open source is how do you fund it? Well, there is always that PayPal donate button. But that's been a tremendous failure for many projects. Often the go-to answer is consulting. But what if you don't want to trade time for money? You could take things up a notch and change the equation, exchanging value for money. That's what Ines Montani and her co-founder did when they started Explosion AI with spaCy as the foundation.",
"thumb": "https://i.imgur.com/q1twuK8.png",
"url": "https://talkpython.fm/episodes/show/202/building-a-software-business",
"soundcloud": "588364857",
"author": "Michael Kennedy",
"author_links": {
"website": "https://talkpython.fm/"
},
"category": ["podcasts"]
},
{ {
"id": "adam_qas", "id": "adam_qas",
"title": "ADAM: Question Answering System", "title": "ADAM: Question Answering System",


@ -1833,9 +1833,9 @@
} }
}, },
"acorn": { "acorn": {
"version": "6.1.0", "version": "6.1.1",
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.0.tgz", "resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.1.tgz",
"integrity": "sha512-MW/FjM+IvU9CgBzjO3UIPCE2pyEwUsoFl+VGdczOPEdxfGFjuKny/gN54mOuX7Qxmb9Rg9MCn2oKiSUeW+pjrw==" "integrity": "sha512-jPTiwtOxaHNaAPg/dmrJ/beuzLRnXtB0kQPQ8JpotKJgTB6rX6c8mlf315941pyjBSaPg8NHXS9fhP4u17DpGA=="
}, },
"acorn-dynamic-import": { "acorn-dynamic-import": {
"version": "3.0.0", "version": "3.0.0",
@ -5958,9 +5958,9 @@
"integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ=" "integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ="
}, },
"eslint": { "eslint": {
"version": "5.14.1", "version": "5.15.1",
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.14.1.tgz", "resolved": "https://registry.npmjs.org/eslint/-/eslint-5.15.1.tgz",
"integrity": "sha512-CyUMbmsjxedx8B0mr79mNOqetvkbij/zrXnFeK2zc3pGRn3/tibjiNAv/3UxFEyfMDjh+ZqTrJrEGBFiGfD5Og==", "integrity": "sha512-NTcm6vQ+PTgN3UBsALw5BMhgO6i5EpIjQF/Xb5tIh3sk9QhrFafujUOczGz4J24JBlzWclSB9Vmx8d+9Z6bFCg==",
"requires": { "requires": {
"@babel/code-frame": "^7.0.0", "@babel/code-frame": "^7.0.0",
"ajv": "^6.9.1", "ajv": "^6.9.1",
@ -5968,7 +5968,7 @@
"cross-spawn": "^6.0.5", "cross-spawn": "^6.0.5",
"debug": "^4.0.1", "debug": "^4.0.1",
"doctrine": "^3.0.0", "doctrine": "^3.0.0",
"eslint-scope": "^4.0.0", "eslint-scope": "^4.0.2",
"eslint-utils": "^1.3.1", "eslint-utils": "^1.3.1",
"eslint-visitor-keys": "^1.0.0", "eslint-visitor-keys": "^1.0.0",
"espree": "^5.0.1", "espree": "^5.0.1",
@ -6001,9 +6001,9 @@
}, },
"dependencies": { "dependencies": {
"ajv": { "ajv": {
"version": "6.9.2", "version": "6.10.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz", "resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==", "integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
"requires": { "requires": {
"fast-deep-equal": "^2.0.1", "fast-deep-equal": "^2.0.1",
"fast-json-stable-stringify": "^2.0.0", "fast-json-stable-stringify": "^2.0.0",
@ -6037,9 +6037,9 @@
} }
}, },
"eslint-scope": { "eslint-scope": {
"version": "4.0.0", "version": "4.0.2",
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.0.tgz", "resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.2.tgz",
"integrity": "sha512-1G6UTDi7Jc1ELFwnR58HV4fK9OQK4S6N985f166xqXxpjU6plxFISJa2Ba9KCQuFa8RCnj/lSFJbHo7UFDBnUA==", "integrity": "sha512-5q1+B/ogmHl8+paxtOKx38Z8LtWkVGuNt3+GQNErqwLl6ViNp/gdJGMCjZNxZ8j/VYjDNZ2Fo+eQc1TAVPIzbg==",
"requires": { "requires": {
"esrecurse": "^4.1.0", "esrecurse": "^4.1.0",
"estraverse": "^4.1.1" "estraverse": "^4.1.1"
@ -6448,52 +6448,6 @@
} }
} }
}, },
"expand-range": {
"version": "1.8.2",
"resolved": "http://registry.npmjs.org/expand-range/-/expand-range-1.8.2.tgz",
"integrity": "sha1-opnv/TNf4nIeuujiV+x5ZE/IUzc=",
"requires": {
"fill-range": "^2.1.0"
},
"dependencies": {
"fill-range": {
"version": "2.2.4",
"resolved": "https://registry.npmjs.org/fill-range/-/fill-range-2.2.4.tgz",
"integrity": "sha512-cnrcCbj01+j2gTG921VZPnHbjmdAf8oQV/iGeV2kZxGSyfYjjTyY79ErsK1WJWMpw6DaApEX72binqJE+/d+5Q==",
"requires": {
"is-number": "^2.1.0",
"isobject": "^2.0.0",
"randomatic": "^3.0.0",
"repeat-element": "^1.1.2",
"repeat-string": "^1.5.2"
}
},
"is-number": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/is-number/-/is-number-2.1.0.tgz",
"integrity": "sha1-Afy7s5NGOlSPL0ZszhbezknbkI8=",
"requires": {
"kind-of": "^3.0.2"
}
},
"isobject": {
"version": "2.1.0",
"resolved": "https://registry.npmjs.org/isobject/-/isobject-2.1.0.tgz",
"integrity": "sha1-8GVWEJaj8dou9GJy+BXIQNh+DIk=",
"requires": {
"isarray": "1.0.0"
}
},
"kind-of": {
"version": "3.2.2",
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
"requires": {
"is-buffer": "^1.1.5"
}
}
}
},
"expand-template": { "expand-template": {
"version": "2.0.3", "version": "2.0.3",
"resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz", "resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz",
@ -6818,11 +6772,6 @@
"resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz", "resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz",
"integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw==" "integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw=="
}, },
"filename-regex": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/filename-regex/-/filename-regex-2.0.1.tgz",
"integrity": "sha1-wcS5vuPglyXdsQa3XB4wH+LxiyY="
},
"filename-reserved-regex": { "filename-reserved-regex": {
"version": "2.0.0", "version": "2.0.0",
"resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz", "resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz",
@ -7130,468 +7079,6 @@
"resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz", "resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz",
"integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8=" "integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8="
}, },
"fsevents": {
"version": "1.2.4",
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-1.2.4.tgz",
"integrity": "sha512-z8H8/diyk76B7q5wg+Ud0+CqzcAF3mBBI/bA5ne5zrRUUIvNkJY//D3BqyH571KuAC4Nr7Rw7CjWX4r0y9DvNg==",
"optional": true,
"requires": {
"nan": "^2.9.2",
"node-pre-gyp": "^0.10.0"
},
"dependencies": {
"abbrev": {
"version": "1.1.1",
"bundled": true,
"optional": true
},
"ansi-regex": {
"version": "2.1.1",
"bundled": true
},
"aproba": {
"version": "1.2.0",
"bundled": true,
"optional": true
},
"are-we-there-yet": {
"version": "1.1.4",
"bundled": true,
"optional": true,
"requires": {
"delegates": "^1.0.0",
"readable-stream": "^2.0.6"
}
},
"balanced-match": {
"version": "1.0.0",
"bundled": true
},
"brace-expansion": {
"version": "1.1.11",
"bundled": true,
"requires": {
"balanced-match": "^1.0.0",
"concat-map": "0.0.1"
}
},
"chownr": {
"version": "1.0.1",
"bundled": true,
"optional": true
},
"code-point-at": {
"version": "1.1.0",
"bundled": true
},
"concat-map": {
"version": "0.0.1",
"bundled": true
},
"console-control-strings": {
"version": "1.1.0",
"bundled": true
},
"core-util-is": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"debug": {
"version": "2.6.9",
"bundled": true,
"optional": true,
"requires": {
"ms": "2.0.0"
}
},
"deep-extend": {
"version": "0.5.1",
"bundled": true,
"optional": true
},
"delegates": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"detect-libc": {
"version": "1.0.3",
"bundled": true,
"optional": true
},
"fs-minipass": {
"version": "1.2.5",
"bundled": true,
"optional": true,
"requires": {
"minipass": "^2.2.1"
}
},
"fs.realpath": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"gauge": {
"version": "2.7.4",
"bundled": true,
"optional": true,
"requires": {
"aproba": "^1.0.3",
"console-control-strings": "^1.0.0",
"has-unicode": "^2.0.0",
"object-assign": "^4.1.0",
"signal-exit": "^3.0.0",
"string-width": "^1.0.1",
"strip-ansi": "^3.0.1",
"wide-align": "^1.1.0"
}
},
"glob": {
"version": "7.1.2",
"bundled": true,
"optional": true,
"requires": {
"fs.realpath": "^1.0.0",
"inflight": "^1.0.4",
"inherits": "2",
"minimatch": "^3.0.4",
"once": "^1.3.0",
"path-is-absolute": "^1.0.0"
}
},
"has-unicode": {
"version": "2.0.1",
"bundled": true,
"optional": true
},
"iconv-lite": {
"version": "0.4.21",
"bundled": true,
"optional": true,
"requires": {
"safer-buffer": "^2.1.0"
}
},
"ignore-walk": {
"version": "3.0.1",
"bundled": true,
"optional": true,
"requires": {
"minimatch": "^3.0.4"
}
},
"inflight": {
"version": "1.0.6",
"bundled": true,
"optional": true,
"requires": {
"once": "^1.3.0",
"wrappy": "1"
}
},
"inherits": {
"version": "2.0.3",
"bundled": true
},
"ini": {
"version": "1.3.5",
"bundled": true,
"optional": true
},
"is-fullwidth-code-point": {
"version": "1.0.0",
"bundled": true,
"requires": {
"number-is-nan": "^1.0.0"
}
},
"isarray": {
"version": "1.0.0",
"bundled": true,
"optional": true
},
"minimatch": {
"version": "3.0.4",
"bundled": true,
"requires": {
"brace-expansion": "^1.1.7"
}
},
"minimist": {
"version": "0.0.8",
"bundled": true
},
"minipass": {
"version": "2.2.4",
"bundled": true,
"requires": {
"safe-buffer": "^5.1.1",
"yallist": "^3.0.0"
}
},
"minizlib": {
"version": "1.1.0",
"bundled": true,
"optional": true,
"requires": {
"minipass": "^2.2.1"
}
},
"mkdirp": {
"version": "0.5.1",
"bundled": true,
"requires": {
"minimist": "0.0.8"
}
},
"ms": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"needle": {
"version": "2.2.0",
"bundled": true,
"optional": true,
"requires": {
"debug": "^2.1.2",
"iconv-lite": "^0.4.4",
"sax": "^1.2.4"
}
},
"node-pre-gyp": {
"version": "0.10.0",
"bundled": true,
"optional": true,
"requires": {
"detect-libc": "^1.0.2",
"mkdirp": "^0.5.1",
"needle": "^2.2.0",
"nopt": "^4.0.1",
"npm-packlist": "^1.1.6",
"npmlog": "^4.0.2",
"rc": "^1.1.7",
"rimraf": "^2.6.1",
"semver": "^5.3.0",
"tar": "^4"
}
},
"nopt": {
"version": "4.0.1",
"bundled": true,
"optional": true,
"requires": {
"abbrev": "1",
"osenv": "^0.1.4"
}
},
"npm-bundled": {
"version": "1.0.3",
"bundled": true,
"optional": true
},
"npm-packlist": {
"version": "1.1.10",
"bundled": true,
"optional": true,
"requires": {
"ignore-walk": "^3.0.1",
"npm-bundled": "^1.0.1"
}
},
"npmlog": {
"version": "4.1.2",
"bundled": true,
"optional": true,
"requires": {
"are-we-there-yet": "~1.1.2",
"console-control-strings": "~1.1.0",
"gauge": "~2.7.3",
"set-blocking": "~2.0.0"
}
},
"number-is-nan": {
"version": "1.0.1",
"bundled": true
},
"object-assign": {
"version": "4.1.1",
"bundled": true,
"optional": true
},
"once": {
"version": "1.4.0",
"bundled": true,
"requires": {
"wrappy": "1"
}
},
"os-homedir": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"os-tmpdir": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"osenv": {
"version": "0.1.5",
"bundled": true,
"optional": true,
"requires": {
"os-homedir": "^1.0.0",
"os-tmpdir": "^1.0.0"
}
},
"path-is-absolute": {
"version": "1.0.1",
"bundled": true,
"optional": true
},
"process-nextick-args": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"rc": {
"version": "1.2.7",
"bundled": true,
"optional": true,
"requires": {
"deep-extend": "^0.5.1",
"ini": "~1.3.0",
"minimist": "^1.2.0",
"strip-json-comments": "~2.0.1"
},
"dependencies": {
"minimist": {
"version": "1.2.0",
"bundled": true,
"optional": true
}
}
},
"readable-stream": {
"version": "2.3.6",
"bundled": true,
"optional": true,
"requires": {
"core-util-is": "~1.0.0",
"inherits": "~2.0.3",
"isarray": "~1.0.0",
"process-nextick-args": "~2.0.0",
"safe-buffer": "~5.1.1",
"string_decoder": "~1.1.1",
"util-deprecate": "~1.0.1"
}
},
"rimraf": {
"version": "2.6.2",
"bundled": true,
"optional": true,
"requires": {
"glob": "^7.0.5"
}
},
"safe-buffer": {
"version": "5.1.1",
"bundled": true
},
"safer-buffer": {
"version": "2.1.2",
"bundled": true,
"optional": true
},
"sax": {
"version": "1.2.4",
"bundled": true,
"optional": true
},
"semver": {
"version": "5.5.0",
"bundled": true,
"optional": true
},
"set-blocking": {
"version": "2.0.0",
"bundled": true,
"optional": true
},
"signal-exit": {
"version": "3.0.2",
"bundled": true,
"optional": true
},
"string-width": {
"version": "1.0.2",
"bundled": true,
"requires": {
"code-point-at": "^1.0.0",
"is-fullwidth-code-point": "^1.0.0",
"strip-ansi": "^3.0.0"
}
},
"string_decoder": {
"version": "1.1.1",
"bundled": true,
"optional": true,
"requires": {
"safe-buffer": "~5.1.0"
}
},
"strip-ansi": {
"version": "3.0.1",
"bundled": true,
"requires": {
"ansi-regex": "^2.0.0"
}
},
"strip-json-comments": {
"version": "2.0.1",
"bundled": true,
"optional": true
},
"tar": {
"version": "4.4.1",
"bundled": true,
"optional": true,
"requires": {
"chownr": "^1.0.1",
"fs-minipass": "^1.2.5",
"minipass": "^2.2.4",
"minizlib": "^1.1.0",
"mkdirp": "^0.5.0",
"safe-buffer": "^5.1.1",
"yallist": "^3.0.2"
}
},
"util-deprecate": {
"version": "1.0.2",
"bundled": true,
"optional": true
},
"wide-align": {
"version": "1.1.2",
"bundled": true,
"optional": true,
"requires": {
"string-width": "^1.0.2"
}
},
"wrappy": {
"version": "1.0.2",
"bundled": true
},
"yallist": {
"version": "3.0.2",
"bundled": true
}
}
},
"fstream": { "fstream": {
"version": "1.0.11", "version": "1.0.11",
"resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz", "resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz",
@ -8322,14 +7809,14 @@
} }
}, },
"gatsby-source-filesystem": { "gatsby-source-filesystem": {
"version": "2.0.20", "version": "2.0.24",
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.20.tgz", "resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.24.tgz",
"integrity": "sha512-nS2hBsqKEQIJ5Yd+g9p++FcsfmvbQmZlBUzx04VPBYZBu2LuLA/ZxQkmdiTNnbDQ18KJw0Zu2PnmUerPnEMqyg==", "integrity": "sha512-KzyHzuXni9hOiZFDgeoH5ABJZqb59fSJNGr2C4U6B1AlGXFMucFK45Fh3V8axtpi833bIbCb9rGmK+tvL4Qb1w==",
"requires": { "requires": {
"@babel/runtime": "^7.0.0", "@babel/runtime": "^7.0.0",
"better-queue": "^3.8.7", "better-queue": "^3.8.7",
"bluebird": "^3.5.0", "bluebird": "^3.5.0",
"chokidar": "^1.7.0", "chokidar": "^2.1.2",
"file-type": "^10.2.0", "file-type": "^10.2.0",
"fs-extra": "^5.0.0", "fs-extra": "^5.0.0",
"got": "^7.1.0", "got": "^7.1.0",
@ -8343,83 +7830,6 @@
"xstate": "^3.1.0" "xstate": "^3.1.0"
}, },
"dependencies": { "dependencies": {
"anymatch": {
"version": "1.3.2",
"resolved": "https://registry.npmjs.org/anymatch/-/anymatch-1.3.2.tgz",
"integrity": "sha512-0XNayC8lTHQ2OI8aljNCN3sSx6hsr/1+rlcDAotXJR7C1oZZHCNsfpbKwMjRA3Uqb5tF1Rae2oloTr4xpq+WjA==",
"requires": {
"micromatch": "^2.1.5",
"normalize-path": "^2.0.0"
}
},
"arr-diff": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/arr-diff/-/arr-diff-2.0.0.tgz",
"integrity": "sha1-jzuCf5Vai9ZpaX5KQlasPOrjVs8=",
"requires": {
"arr-flatten": "^1.0.1"
}
},
"array-unique": {
"version": "0.2.1",
"resolved": "https://registry.npmjs.org/array-unique/-/array-unique-0.2.1.tgz",
"integrity": "sha1-odl8yvy8JiXMcPrc6zalDFiwGlM="
},
"braces": {
"version": "1.8.5",
"resolved": "https://registry.npmjs.org/braces/-/braces-1.8.5.tgz",
"integrity": "sha1-uneWLhLf+WnWt2cR6RS3N4V79qc=",
"requires": {
"expand-range": "^1.8.1",
"preserve": "^0.2.0",
"repeat-element": "^1.1.2"
}
},
"chokidar": {
"version": "1.7.0",
"resolved": "https://registry.npmjs.org/chokidar/-/chokidar-1.7.0.tgz",
"integrity": "sha1-eY5ol3gVHIB2tLNg5e3SjNortGg=",
"requires": {
"anymatch": "^1.3.0",
"async-each": "^1.0.0",
"fsevents": "^1.0.0",
"glob-parent": "^2.0.0",
"inherits": "^2.0.1",
"is-binary-path": "^1.0.0",
"is-glob": "^2.0.0",
"path-is-absolute": "^1.0.0",
"readdirp": "^2.0.0"
}
},
"expand-brackets": {
"version": "0.1.5",
"resolved": "https://registry.npmjs.org/expand-brackets/-/expand-brackets-0.1.5.tgz",
"integrity": "sha1-3wcoTjQqgHzXM6xa9yQR5YHRF3s=",
"requires": {
"is-posix-bracket": "^0.1.0"
}
},
"extglob": {
"version": "0.3.2",
"resolved": "https://registry.npmjs.org/extglob/-/extglob-0.3.2.tgz",
"integrity": "sha1-Lhj/PS9JqydlzskCPwEdqo2DSaE=",
"requires": {
"is-extglob": "^1.0.0"
}
},
"file-type": {
"version": "10.7.1",
"resolved": "https://registry.npmjs.org/file-type/-/file-type-10.7.1.tgz",
"integrity": "sha512-kUc4EE9q3MH6kx70KumPOvXLZLEJZzY9phEVg/bKWyGZ+OA9KoKZzFR4HS0yDmNv31sJkdf4hbTERIfplF9OxQ=="
},
"glob-parent": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
"requires": {
"is-glob": "^2.0.0"
}
},
"got": { "got": {
"version": "7.1.0", "version": "7.1.0",
"resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz", "resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz",
@ -8441,47 +7851,6 @@
"url-to-options": "^1.0.1" "url-to-options": "^1.0.1"
} }
}, },
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
},
"kind-of": {
"version": "3.2.2",
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
"requires": {
"is-buffer": "^1.1.5"
}
},
"micromatch": {
"version": "2.3.11",
"resolved": "https://registry.npmjs.org/micromatch/-/micromatch-2.3.11.tgz",
"integrity": "sha1-hmd8l9FyCzY0MdBNDRUpO9OMFWU=",
"requires": {
"arr-diff": "^2.0.0",
"array-unique": "^0.2.1",
"braces": "^1.8.2",
"expand-brackets": "^0.1.4",
"extglob": "^0.3.1",
"filename-regex": "^2.0.0",
"is-extglob": "^1.0.0",
"is-glob": "^2.0.1",
"kind-of": "^3.0.2",
"normalize-path": "^2.0.1",
"object.omit": "^2.0.0",
"parse-glob": "^3.0.4",
"regex-cache": "^0.4.2"
}
},
"pify": { "pify": {
"version": "4.0.1", "version": "4.0.1",
"resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz", "resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz",
@ -8493,12 +7862,12 @@
"integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74=" "integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74="
}, },
"read-chunk": { "read-chunk": {
"version": "3.0.0", "version": "3.1.0",
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.0.0.tgz", "resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.1.0.tgz",
"integrity": "sha512-8lBUVPjj9TC5bKLBacB+rpexM03+LWiYbv6ma3BeWmUYXGxqA1WNNgIZHq/iIsCrbFMzPhFbkOqdsyOFRnuoXg==", "integrity": "sha512-ZdiZJXXoZYE08SzZvTipHhI+ZW0FpzxmFtLI3vIeMuRN9ySbIZ+SZawKogqJ7dxW9fJ/W73BNtxu4Zu/bZp+Ng==",
"requires": { "requires": {
"pify": "^4.0.0", "pify": "^4.0.1",
"with-open-file": "^0.1.3" "with-open-file": "^0.1.5"
} }
} }
} }
@ -8742,38 +8111,6 @@
"path-is-absolute": "^1.0.0" "path-is-absolute": "^1.0.0"
} }
}, },
"glob-base": {
"version": "0.3.0",
"resolved": "https://registry.npmjs.org/glob-base/-/glob-base-0.3.0.tgz",
"integrity": "sha1-27Fk9iIbHAscz4Kuoyi0l98Oo8Q=",
"requires": {
"glob-parent": "^2.0.0",
"is-glob": "^2.0.0"
},
"dependencies": {
"glob-parent": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
"requires": {
"is-glob": "^2.0.0"
}
},
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
}
}
},
"glob-parent": { "glob-parent": {
"version": "3.1.0", "version": "3.1.0",
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz", "resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz",
@ -10110,19 +9447,6 @@
"resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz", "resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz",
"integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE=" "integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE="
}, },
"is-dotfile": {
"version": "1.0.3",
"resolved": "https://registry.npmjs.org/is-dotfile/-/is-dotfile-1.0.3.tgz",
"integrity": "sha1-pqLzL/0t+wT1yiXs0Pa4PPeYoeE="
},
"is-equal-shallow": {
"version": "0.1.3",
"resolved": "https://registry.npmjs.org/is-equal-shallow/-/is-equal-shallow-0.1.3.tgz",
"integrity": "sha1-IjgJj8Ih3gvPpdnqxMRdY4qhxTQ=",
"requires": {
"is-primitive": "^2.0.0"
}
},
"is-extendable": { "is-extendable": {
"version": "0.1.1", "version": "0.1.1",
"resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz", "resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz",
@ -10263,16 +9587,6 @@
"resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz", "resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz",
"integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84=" "integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84="
}, },
"is-posix-bracket": {
"version": "0.1.1",
"resolved": "https://registry.npmjs.org/is-posix-bracket/-/is-posix-bracket-0.1.1.tgz",
"integrity": "sha1-MzTceXdDaOkvAW5vvAqI9c1ua8Q="
},
"is-primitive": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/is-primitive/-/is-primitive-2.0.0.tgz",
"integrity": "sha1-IHurkWOEmcB7Kt8kCkGochADRXU="
},
"is-promise": { "is-promise": {
"version": "2.1.0", "version": "2.1.0",
"resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz", "resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz",
@ -11162,11 +10476,6 @@
"resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz", "resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz",
"integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw==" "integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw=="
}, },
"math-random": {
"version": "1.0.1",
"resolved": "https://registry.npmjs.org/math-random/-/math-random-1.0.1.tgz",
"integrity": "sha1-izqsWIuKZuSXXjzepn97sylgH6w="
},
"md-attr-parser": { "md-attr-parser": {
"version": "1.2.1", "version": "1.2.1",
"resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz", "resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz",
@ -12230,15 +11539,6 @@
"es-abstract": "^1.5.1" "es-abstract": "^1.5.1"
} }
}, },
"object.omit": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/object.omit/-/object.omit-2.0.1.tgz",
"integrity": "sha1-Gpx0SCnznbuFjHbKNXmuKlTr0fo=",
"requires": {
"for-own": "^0.1.4",
"is-extendable": "^0.1.1"
}
},
"object.pick": { "object.pick": {
"version": "1.3.0", "version": "1.3.0",
"resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz", "resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz",
@ -12579,32 +11879,6 @@
"path-root": "^0.1.1" "path-root": "^0.1.1"
} }
}, },
"parse-glob": {
"version": "3.0.4",
"resolved": "https://registry.npmjs.org/parse-glob/-/parse-glob-3.0.4.tgz",
"integrity": "sha1-ssN2z7EfNVE7rdFz7wu246OIORw=",
"requires": {
"glob-base": "^0.3.0",
"is-dotfile": "^1.0.0",
"is-extglob": "^1.0.0",
"is-glob": "^2.0.0"
},
"dependencies": {
"is-extglob": {
"version": "1.0.0",
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
},
"is-glob": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
"requires": {
"is-extglob": "^1.0.0"
}
}
}
},
"parse-headers": { "parse-headers": {
"version": "2.0.1", "version": "2.0.1",
"resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz", "resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz",
@ -14769,11 +14043,6 @@
"resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz", "resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz",
"integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw=" "integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw="
}, },
"preserve": {
"version": "0.2.0",
"resolved": "https://registry.npmjs.org/preserve/-/preserve-0.2.0.tgz",
"integrity": "sha1-gV7R9uvGWSb4ZbMQwHE7yzMVzks="
},
"prettier": { "prettier": {
"version": "1.16.4", "version": "1.16.4",
"resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz", "resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz",
@ -14982,23 +14251,6 @@
"resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz", "resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz",
"integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU=" "integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU="
}, },
"randomatic": {
"version": "3.1.1",
"resolved": "https://registry.npmjs.org/randomatic/-/randomatic-3.1.1.tgz",
"integrity": "sha512-TuDE5KxZ0J461RVjrJZCJc+J+zCkTb1MbH9AQUq68sMhOMcy9jLcb3BrZKgp9q9Ncltdg4QVqWrH02W2EFFVYw==",
"requires": {
"is-number": "^4.0.0",
"kind-of": "^6.0.0",
"math-random": "^1.0.1"
},
"dependencies": {
"is-number": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/is-number/-/is-number-4.0.0.tgz",
"integrity": "sha512-rSklcAIlf1OmFdyAqbnWTLVelsQ58uvZ66S/ZyawjWqIviTWCjg2PzVGw8WUA+nNuPTqb4wgA+NszrJ+08LlgQ=="
}
}
},
"randombytes": { "randombytes": {
"version": "2.1.0", "version": "2.1.0",
"resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz", "resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz",
@ -15458,14 +14710,6 @@
"private": "^0.1.6" "private": "^0.1.6"
} }
}, },
"regex-cache": {
"version": "0.4.4",
"resolved": "https://registry.npmjs.org/regex-cache/-/regex-cache-0.4.4.tgz",
"integrity": "sha512-nVIZwtCjkC9YgvWkpM55B5rBhBYRZhAaJbgcFYXXsHnbZ9UZI9nnVWYZpBlCqv9ho2eZryPnWrZGsOdPwVWXWQ==",
"requires": {
"is-equal-shallow": "^0.1.3"
}
},
"regex-not": { "regex-not": {
"version": "1.0.2", "version": "1.0.2",
"resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz", "resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz",
@ -17710,9 +16954,9 @@
}, },
"dependencies": { "dependencies": {
"ajv": { "ajv": {
"version": "6.9.2", "version": "6.10.0",
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz", "resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==", "integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
"requires": { "requires": {
"fast-deep-equal": "^2.0.1", "fast-deep-equal": "^2.0.1",
"fast-json-stable-stringify": "^2.0.0", "fast-json-stable-stringify": "^2.0.0",
@ -17721,26 +16965,26 @@
} }
}, },
"ansi-regex": { "ansi-regex": {
"version": "4.0.0", "version": "4.1.0",
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.0.0.tgz", "resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.1.0.tgz",
"integrity": "sha512-iB5Dda8t/UqpPI/IjsejXu5jOGDrzn41wJyljwPH65VCIbk6+1BzFIMJGFwTNrYXT1CrD+B4l19U7awiQ8rk7w==" "integrity": "sha512-1apePfXM1UOSqw0o9IiFAovVz9M5S1Dg+4TrDwfMewQ6p/rmMueb7tWZjQ1rx4Loy1ArBggoqGpfqqdI4rondg=="
}, },
"string-width": { "string-width": {
"version": "3.0.0", "version": "3.1.0",
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.0.0.tgz", "resolved": "https://registry.npmjs.org/string-width/-/string-width-3.1.0.tgz",
"integrity": "sha512-rr8CUxBbvOZDUvc5lNIJ+OC1nPVpz+Siw9VBtUjB9b6jZehZLFt0JMCZzShFHIsI8cbhm0EsNIfWJMFV3cu3Ew==", "integrity": "sha512-vafcv6KjVZKSgz06oM/H6GDBrAtz8vdhQakGjFIvNrHA6y3HCF1CInLy+QLq8dTJPQ1b+KDUqDFctkdRW44e1w==",
"requires": { "requires": {
"emoji-regex": "^7.0.1", "emoji-regex": "^7.0.1",
"is-fullwidth-code-point": "^2.0.0", "is-fullwidth-code-point": "^2.0.0",
"strip-ansi": "^5.0.0" "strip-ansi": "^5.1.0"
} }
}, },
"strip-ansi": { "strip-ansi": {
"version": "5.0.0", "version": "5.1.0",
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.0.0.tgz", "resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.1.0.tgz",
"integrity": "sha512-Uu7gQyZI7J7gn5qLn1Np3G9vcYGTVqB+lFTytnDJv83dd8T22aGH451P3jueT2/QemInJDfxHB5Tde5OzgG1Ow==", "integrity": "sha512-TjxrkPONqO2Z8QDCpeE2j6n0M6EwxzyDgzEeGp+FbdvaJAt//ClYi6W5my+3ROlC/hZX2KACUwDfK49Ka5eDvg==",
"requires": { "requires": {
"ansi-regex": "^4.0.0" "ansi-regex": "^4.1.0"
} }
} }
} }


@ -12,7 +12,6 @@
"@mdx-js/tag": "^0.17.5", "@mdx-js/tag": "^0.17.5",
"@phosphor/widgets": "^1.6.0", "@phosphor/widgets": "^1.6.0",
"@rehooks/online-status": "^1.0.0", "@rehooks/online-status": "^1.0.0",
"@sindresorhus/slugify": "^0.8.0",
"@svgr/webpack": "^4.1.0", "@svgr/webpack": "^4.1.0",
"autoprefixer": "^9.4.7", "autoprefixer": "^9.4.7",
"classnames": "^2.2.6", "classnames": "^2.2.6",
@ -35,7 +34,7 @@
"gatsby-remark-prismjs": "^3.2.4", "gatsby-remark-prismjs": "^3.2.4",
"gatsby-remark-smartypants": "^2.0.8", "gatsby-remark-smartypants": "^2.0.8",
"gatsby-remark-unwrap-images": "^1.0.1", "gatsby-remark-unwrap-images": "^1.0.1",
"gatsby-source-filesystem": "^2.0.20", "gatsby-source-filesystem": "^2.0.24",
"gatsby-transformer-remark": "^2.2.5", "gatsby-transformer-remark": "^2.2.5",
"gatsby-transformer-sharp": "^2.1.13", "gatsby-transformer-sharp": "^2.1.13",
"html-to-react": "^1.3.4", "html-to-react": "^1.3.4",
@ -62,7 +61,8 @@
"md-attr-parser": "^1.2.1", "md-attr-parser": "^1.2.1",
"prettier": "^1.16.4", "prettier": "^1.16.4",
"raw-loader": "^1.0.0", "raw-loader": "^1.0.0",
"unist-util-visit": "^1.4.0" "unist-util-visit": "^1.4.0",
"@sindresorhus/slugify": "^0.8.0"
}, },
"repository": { "repository": {
"type": "git", "type": "git",


@ -1,33 +1,38 @@
import React, { useState } from 'react' import React, { useState, useEffect } from 'react'
import PropTypes from 'prop-types' import PropTypes from 'prop-types'
import classNames from 'classnames' import classNames from 'classnames'
import slugify from '@sindresorhus/slugify'
import Link from './link' import Link from './link'
import classes from '../styles/accordion.module.sass' import classes from '../styles/accordion.module.sass'
const Accordion = ({ title, id, expanded, children }) => { const Accordion = ({ title, id, expanded, children }) => {
const anchorId = id ? id : slugify(title) const [isExpanded, setIsExpanded] = useState(true)
const [isExpanded, setIsExpanded] = useState(expanded)
const contentClassNames = classNames(classes.content, { const contentClassNames = classNames(classes.content, {
[classes.hidden]: !isExpanded, [classes.hidden]: !isExpanded,
}) })
const iconClassNames = classNames({ const iconClassNames = classNames({
[classes.hidden]: isExpanded, [classes.hidden]: isExpanded,
}) })
// Make sure accordion is expanded if JS is disabled
useEffect(() => setIsExpanded(expanded), [])
return ( return (
<section id={anchorId}> <section className="accordion" id={id}>
<div className={classes.root}> <div className={classes.root}>
<h3> <h4>
<button <button
className={classes.button} className={classes.button}
aria-expanded={String(isExpanded)} aria-expanded={String(isExpanded)}
onClick={() => setIsExpanded(!isExpanded)} onClick={() => setIsExpanded(!isExpanded)}
> >
<span> <span>
{title} <span className="heading-text">{title}</span>
{isExpanded && ( {isExpanded && !!id && (
<Link to={`#${anchorId}`} className={classes.anchor} hidden> <Link
to={`#${id}`}
className={classes.anchor}
hidden
onClick={event => event.stopPropagation()}
>
&para; &para;
</Link> </Link>
)} )}
@ -44,7 +49,7 @@ const Accordion = ({ title, id, expanded, children }) => {
<rect height={2} width={8} x={1} y={4} /> <rect height={2} width={8} x={1} y={4} />
</svg> </svg>
</button> </button>
</h3> </h4>
<div className={contentClassNames}>{children}</div> <div className={contentClassNames}>{children}</div>
</div> </div>
</section> </section>


@ -33,10 +33,11 @@ const GitHubCode = ({ url, lang, errorMsg, className }) => {
}) })
.catch(err => { .catch(err => {
setCode(errorMsg) setCode(errorMsg)
console.error(err)
}) })
setInitialized(true) setInitialized(true)
} }
}, []) }, [initialized, rawUrl, errorMsg])
const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code) const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code)


@ -5,13 +5,13 @@ import classNames from 'classnames'
import Icon from './icon' import Icon from './icon'
import classes from '../styles/infobox.module.sass' import classes from '../styles/infobox.module.sass'
const Infobox = ({ title, variant, className, children }) => { const Infobox = ({ title, id, variant, className, children }) => {
const infoboxClassNames = classNames(classes.root, className, { const infoboxClassNames = classNames(classes.root, className, {
[classes.warning]: variant === 'warning', [classes.warning]: variant === 'warning',
[classes.danger]: variant === 'danger', [classes.danger]: variant === 'danger',
}) })
return ( return (
<aside className={infoboxClassNames}> <aside className={infoboxClassNames} id={id}>
{title && ( {title && (
<h4 className={classes.title}> <h4 className={classes.title}>
{variant !== 'default' && ( {variant !== 'default' && (
@ -31,6 +31,7 @@ Infobox.defaultProps = {
Infobox.propTypes = { Infobox.propTypes = {
title: PropTypes.string, title: PropTypes.string,
id: PropTypes.string,
variant: PropTypes.oneOf(['default', 'warning', 'danger']), variant: PropTypes.oneOf(['default', 'warning', 'danger']),
className: PropTypes.string, className: PropTypes.string,
children: PropTypes.node.isRequired, children: PropTypes.node.isRequired,


@ -232,6 +232,7 @@ Juniper.defaultProps = {
theme: 'default', theme: 'default',
isolateCells: true, isolateCells: true,
useBinder: true, useBinder: true,
storageKey: 'juniper',
useStorage: true, useStorage: true,
storageExpire: 60, storageExpire: 60,
debug: false, debug: false,


@ -34,22 +34,19 @@ const Progress = () => {
setOffset(getOffset()) setOffset(getOffset())
} }
useEffect( useEffect(() => {
() => { if (!initialized && progressRef.current) {
if (!initialized && progressRef.current) { handleResize()
handleResize() setInitialized(true)
setInitialized(true) }
} window.addEventListener('scroll', handleScroll)
window.addEventListener('scroll', handleScroll) window.addEventListener('resize', handleResize)
window.addEventListener('resize', handleResize)
return () => { return () => {
window.removeEventListener('scroll', handleScroll) window.removeEventListener('scroll', handleScroll)
window.removeEventListener('resize', handleResize) window.removeEventListener('resize', handleResize)
} }
}, }, [initialized, progressRef])
[progressRef]
)
const { height, vh } = offset const { height, vh } = offset
const total = 100 - ((height - scrollY - vh) / height) * 100 const total = 100 - ((height - scrollY - vh) / height) * 100
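The `total` expression above can be checked with a few round numbers. The sketch below wraps it in a hypothetical helper, `getProgress`, which is not part of the diff. It assumes that `height` is the full page height, `scrollY` the current scroll offset, and `vh` the viewport height. That reading is inferred from the variable names only, since `getOffset` is not shown in this hunk.

```javascript
// Hypothetical helper (not in the diff) wrapping the expression from the hunk.
// Assumed meaning of the arguments, inferred from their names:
//   height  - full height of the page
//   scrollY - current vertical scroll offset
//   vh      - height of the viewport
function getProgress(height, scrollY, vh) {
    // Share of the page that is not still below the viewport, in percent
    return 100 - ((height - scrollY - vh) / height) * 100
}

// Illustrative values only: a 2000px page viewed through a 500px viewport
const atTop = getProgress(2000, 0, 500) // 1500px still below the viewport
const midway = getProgress(2000, 500, 500) // 1000px still below the viewport
const atBottom = getProgress(2000, 1500, 500) // nothing left below the viewport
console.log(atTop, midway, atBottom)
```

Under these assumptions the value starts above zero at the top of the page, because the first viewport already counts as seen, and reaches 100 once the bottom of the page is in view.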


@ -8,6 +8,12 @@ import Icon from './icon'
import { H2 } from './typography' import { H2 } from './typography'
import classes from '../styles/quickstart.module.sass' import classes from '../styles/quickstart.module.sass'
function getNewChecked(optionId, checkedForId, multiple) {
if (!multiple) return [optionId]
if (checkedForId.includes(optionId)) return checkedForId.filter(opt => opt !== optionId)
return [...checkedForId, optionId]
}
const Quickstart = ({ data, title, description, id, children }) => { const Quickstart = ({ data, title, description, id, children }) => {
const [styles, setStyles] = useState({}) const [styles, setStyles] = useState({})
const [checked, setChecked] = useState({}) const [checked, setChecked] = useState({})
@ -38,13 +44,13 @@ const Quickstart = ({ data, title, description, id, children }) => {
setStyles(initialStyles) setStyles(initialStyles)
setInitialized(true) setInitialized(true)
} }
}) }, [data, initialized])
return !data.length ? null : ( return !data.length ? null : (
<Section id={id}> <Section id={id}>
<div className={classes.root}> <div className={classes.root}>
{title && ( {title && (
<H2 className={classes.title}> <H2 className={classes.title} name={id}>
<a href={`#${id}`}>{title}</a> <a href={`#${id}`}>{title}</a>
</H2> </H2>
)} )}
@ -76,13 +82,11 @@ const Quickstart = ({ data, title, description, id, children }) => {
onChange={() => { onChange={() => {
const newChecked = { const newChecked = {
...checked, ...checked,
[id]: !multiple [id]: getNewChecked(
? [option.id] option.id,
: checkedForId.includes(option.id) checkedForId,
? checkedForId.filter( multiple
opt => opt !== option.id ),
)
: [...checkedForId, option.id],
} }
setChecked(newChecked) setChecked(newChecked)
setStyles({ setStyles({
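The `getNewChecked` helper added in this hunk replaces the nested ternary in the `onChange` handler. A minimal standalone sketch of its behavior follows. The function body is copied from the diff; the option ids `'pip'` and `'conda'` are hypothetical and used for illustration only.

```javascript
// Copied from the hunk above: compute the next list of checked option ids.
function getNewChecked(optionId, checkedForId, multiple) {
    // Single-choice group: the clicked option replaces any previous selection
    if (!multiple) return [optionId]
    // Multi-choice group: clicking an already checked option unchecks it
    if (checkedForId.includes(optionId)) return checkedForId.filter(opt => opt !== optionId)
    // Multi-choice group: otherwise add the option to the current selection
    return [...checkedForId, optionId]
}

// Hypothetical option ids, for illustration only
const single = getNewChecked('conda', ['pip'], false)
const added = getNewChecked('conda', ['pip'], true)
const removed = getNewChecked('pip', ['pip', 'conda'], true)
console.log(single, added, removed)
```

All three branches return a new array rather than mutating `checkedForId`, so the result can be spread into the `checked` state object as the handler does.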


@ -6,19 +6,20 @@ import Icon from './icon'
import classes from '../styles/search.module.sass' import classes from '../styles/search.module.sass'
const Search = ({ id, placeholder, settings }) => { const Search = ({ id, placeholder, settings }) => {
const { apiKey, indexName } = settings const { apiKey, indexName, appId } = settings
const [isInitialized, setIsInitialized] = useState(false) const [initialized, setInitialized] = useState(false)
useEffect(() => { useEffect(() => {
if (!isInitialized) { if (!initialized) {
setIsInitialized(true) setInitialized(true)
window.docsearch({ window.docsearch({
appId,
apiKey, apiKey,
indexName, indexName,
inputSelector: `#${id}`, inputSelector: `#${id}`,
debug: false, debug: false,
}) })
} }
}, window.docsearch) }, [initialized, apiKey, indexName, id])
return ( return (
<form className={classes.root}> <form className={classes.root}>
<label htmlFor={id} className={classes.icon}> <label htmlFor={id} className={classes.icon}>

Some files were not shown because too many files have changed in this diff