mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Merge branch 'master' into feature/lemmatizer
This commit is contained in:
commit
278e9d2eb0
106
.github/contributors/Poluglottos.md
vendored
Normal file
106
.github/contributors/Poluglottos.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Ryan Ford |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | Mar 13 2019 |
|
||||
| GitHub username | Poluglottos |
|
||||
| Website (optional) | |
|
106
.github/contributors/tmetzl.md
vendored
Normal file
106
.github/contributors/tmetzl.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Tim Metzler |
|
||||
| Company name (if applicable) | University of Applied Sciences Bonn-Rhein-Sieg |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 03/10/2019 |
|
||||
| GitHub username | tmetzl |
|
||||
| Website (optional) | |
|
22
README.md
22
README.md
|
@ -12,7 +12,7 @@ currently supports tokenization for **45+ languages**. It features the
|
|||
and easy **deep learning** integration. It's commercial open-source software,
|
||||
released under the MIT license.
|
||||
|
||||
💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
|
||||
💫 **Version 2.0 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
|
||||
|
||||
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
|
||||
[![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
|
||||
|
@ -25,19 +25,17 @@ released under the MIT license.
|
|||
|
||||
## 📖 Documentation
|
||||
|
||||
| Documentation | |
|
||||
| --------------- | -------------------------------------------------------------- |
|
||||
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
|
||||
| [Usage Guides] | How to use spaCy and its features. |
|
||||
| [New in v2.1] | New features, backwards incompatibilities and migration guide. |
|
||||
| [API Reference] | The detailed reference for spaCy's API. |
|
||||
| [Models] | Download statistical language models for spaCy. |
|
||||
| [Universe] | Libraries, extensions, demos, books and courses. |
|
||||
| [Changelog] | Changes and version history. |
|
||||
| [Contribute] | How to contribute to the spaCy project and code base. |
|
||||
| Documentation | |
|
||||
| --------------- | ----------------------------------------------------- |
|
||||
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
|
||||
| [Usage Guides] | How to use spaCy and its features. |
|
||||
| [API Reference] | The detailed reference for spaCy's API. |
|
||||
| [Models] | Download statistical language models for spaCy. |
|
||||
| [Universe] | Libraries, extensions, demos, books and courses. |
|
||||
| [Changelog] | Changes and version history. |
|
||||
| [Contribute] | How to contribute to the spaCy project and code base. |
|
||||
|
||||
[spacy 101]: https://spacy.io/usage/spacy-101
|
||||
[new in v2.1]: https://spacy.io/usage/v2-1
|
||||
[usage guides]: https://spacy.io/usage/
|
||||
[api reference]: https://spacy.io/api/
|
||||
[models]: https://spacy.io/models
|
||||
|
|
|
@ -7,6 +7,7 @@ git diff-index --quiet HEAD
|
|||
|
||||
git checkout $1
|
||||
git pull origin $1
|
||||
git push origin $1
|
||||
|
||||
version=$(grep "__version__ = " spacy/about.py)
|
||||
version=${version/__version__ = }
|
||||
|
@ -15,4 +16,4 @@ version=${version/\'/}
|
|||
version=${version/\"/}
|
||||
version=${version/\"/}
|
||||
git tag "v$version"
|
||||
git push origin --tags
|
||||
git push origin "v$version" --tags
|
||||
|
|
107
bin/train_word_vectors.py
Normal file
107
bin/train_word_vectors.py
Normal file
|
@ -0,0 +1,107 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
from gensim.models import Word2Vec
|
||||
from preshed.counter import PreshCounter
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Corpus(object):
|
||||
def __init__(self, directory, min_freq=10):
|
||||
self.directory = directory
|
||||
self.counts = PreshCounter()
|
||||
self.strings = {}
|
||||
self.min_freq = min_freq
|
||||
|
||||
def count_doc(self, doc):
|
||||
# Get counts for this document
|
||||
for word in doc:
|
||||
self.counts.inc(word.orth, 1)
|
||||
return len(doc)
|
||||
|
||||
def __iter__(self):
|
||||
for text_loc in iter_dir(self.directory):
|
||||
with text_loc.open("r", encoding="utf-8") as file_:
|
||||
text = file_.read()
|
||||
yield text
|
||||
|
||||
|
||||
def iter_dir(loc):
|
||||
dir_path = Path(loc)
|
||||
for fn_path in dir_path.iterdir():
|
||||
if fn_path.is_dir():
|
||||
for sub_path in fn_path.iterdir():
|
||||
yield sub_path
|
||||
else:
|
||||
yield fn_path
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("ISO language code"),
|
||||
in_dir=("Location of input directory"),
|
||||
out_loc=("Location of output file"),
|
||||
n_workers=("Number of workers", "option", "n", int),
|
||||
size=("Dimension of the word vectors", "option", "d", int),
|
||||
window=("Context window size", "option", "w", int),
|
||||
min_count=("Min count", "option", "m", int),
|
||||
negative=("Number of negative samples", "option", "g", int),
|
||||
nr_iter=("Number of iterations", "option", "i", int),
|
||||
)
|
||||
def main(
|
||||
lang,
|
||||
in_dir,
|
||||
out_loc,
|
||||
negative=5,
|
||||
n_workers=4,
|
||||
window=5,
|
||||
size=128,
|
||||
min_count=10,
|
||||
nr_iter=2,
|
||||
):
|
||||
logging.basicConfig(
|
||||
format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
|
||||
)
|
||||
model = Word2Vec(
|
||||
size=size,
|
||||
window=window,
|
||||
min_count=min_count,
|
||||
workers=n_workers,
|
||||
sample=1e-5,
|
||||
negative=negative,
|
||||
)
|
||||
nlp = spacy.blank(lang)
|
||||
corpus = Corpus(in_dir)
|
||||
total_words = 0
|
||||
total_sents = 0
|
||||
for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
|
||||
with text_loc.open("r", encoding="utf-8") as file_:
|
||||
text = file_.read()
|
||||
total_sents += text.count("\n")
|
||||
doc = nlp(text)
|
||||
total_words += corpus.count_doc(doc)
|
||||
logger.info(
|
||||
"PROGRESS: at batch #%i, processed %i words, keeping %i word types",
|
||||
text_no,
|
||||
total_words,
|
||||
len(corpus.strings),
|
||||
)
|
||||
model.corpus_count = total_sents
|
||||
model.raw_vocab = defaultdict(int)
|
||||
for orth, freq in corpus.counts:
|
||||
if freq >= min_count:
|
||||
model.raw_vocab[nlp.vocab.strings[orth]] = freq
|
||||
model.scale_vocab()
|
||||
model.finalize_vocab()
|
||||
model.iter = nr_iter
|
||||
model.train(corpus)
|
||||
model.save(out_loc)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -49,7 +49,7 @@ class SentimentAnalyser(object):
|
|||
y = self._model.predict(X)
|
||||
self.set_sentiment(doc, y)
|
||||
|
||||
def pipe(self, docs, batch_size=1000, n_threads=2):
|
||||
def pipe(self, docs, batch_size=1000):
|
||||
for minibatch in cytoolz.partition_all(batch_size, docs):
|
||||
minibatch = list(minibatch)
|
||||
sentences = []
|
||||
|
@ -176,7 +176,7 @@ def evaluate(model_dir, texts, labels, max_length=100):
|
|||
|
||||
correct = 0
|
||||
i = 0
|
||||
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
|
||||
for doc in nlp.pipe(texts, batch_size=1000):
|
||||
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
||||
i += 1
|
||||
return float(correct) / i
|
||||
|
|
|
@ -4,7 +4,7 @@ preshed>=2.0.1,<2.1.0
|
|||
thinc>=7.0.2,<7.1.0
|
||||
blis>=0.2.2,<0.3.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.0.12,<1.1.0
|
||||
wasabi>=0.1.3,<1.1.0
|
||||
srsly>=0.0.5,<1.1.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
|
|
|
@ -97,6 +97,7 @@ def with_cpu(ops, model):
|
|||
"""Wrap a model that should run on CPU, transferring inputs and outputs
|
||||
as necessary."""
|
||||
model.to_cpu()
|
||||
|
||||
def with_cpu_forward(inputs, drop=0.):
|
||||
cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
|
||||
gpu_outputs = _to_device(ops, cpu_outputs)
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
# fmt: off
|
||||
|
||||
__title__ = "spacy-nightly"
|
||||
__version__ = "2.1.0a10"
|
||||
__version__ = "2.1.0a13"
|
||||
__summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
|
||||
__uri__ = "https://spacy.io"
|
||||
__author__ = "Explosion AI"
|
||||
|
|
|
@ -6,7 +6,7 @@ from pathlib import Path
|
|||
from wasabi import Printer
|
||||
import srsly
|
||||
|
||||
from .converters import conllu2json, conllubio2json, iob2json, conll_ner2json
|
||||
from .converters import conllu2json, iob2json, conll_ner2json
|
||||
from .converters import ner_jsonl2json
|
||||
|
||||
|
||||
|
@ -14,7 +14,7 @@ from .converters import ner_jsonl2json
|
|||
# entry to this dict with the file extension mapped to the converter function
|
||||
# imported from /converters.
|
||||
CONVERTERS = {
|
||||
"conllubio": conllubio2json,
|
||||
"conllubio": conllu2json,
|
||||
"conllu": conllu2json,
|
||||
"conll": conllu2json,
|
||||
"ner": conll_ner2json,
|
||||
|
|
|
@ -1,5 +1,4 @@
|
|||
from .conllu2json import conllu2json # noqa: F401
|
||||
from .conllubio2json import conllubio2json # noqa: F401
|
||||
from .iob2json import iob2json # noqa: F401
|
||||
from .conll_ner2json import conll_ner2json # noqa: F401
|
||||
from .jsonl2json import ner_jsonl2json # noqa: F401
|
||||
|
|
|
@ -71,6 +71,7 @@ def read_conllx(input_data, use_morphology=False, n=0):
|
|||
dep = "ROOT" if dep == "root" else dep
|
||||
tag = pos if tag == "_" else tag
|
||||
tag = tag + "__" + morph if use_morphology else tag
|
||||
iob = iob if iob else "O"
|
||||
tokens.append((id_, word, tag, head, dep, iob))
|
||||
except: # noqa: E722
|
||||
print(line)
|
||||
|
|
|
@ -1,85 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...gold import iob_to_biluo
|
||||
|
||||
|
||||
def conllubio2json(input_data, n_sents=10, use_morphology=False, lang=None):
|
||||
"""
|
||||
Convert conllu files into JSON format for use with train cli.
|
||||
use_morphology parameter enables appending morphology to tags, which is
|
||||
useful for languages such as Spanish, where UD tags are not so rich.
|
||||
"""
|
||||
# by @dvsrepo, via #11 explosion/spacy-dev-resources
|
||||
docs = []
|
||||
sentences = []
|
||||
conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
|
||||
for i, (raw_text, tokens) in enumerate(conll_tuples):
|
||||
sentence, brackets = tokens[0]
|
||||
sentences.append(generate_sentence(sentence))
|
||||
# Real-sized documents could be extracted using the comments on the
|
||||
# conluu document
|
||||
if len(sentences) % n_sents == 0:
|
||||
doc = create_doc(sentences, i)
|
||||
docs.append(doc)
|
||||
sentences = []
|
||||
return docs
|
||||
|
||||
|
||||
def read_conllx(input_data, use_morphology=False, n=0):
|
||||
i = 0
|
||||
for sent in input_data.strip().split("\n\n"):
|
||||
lines = sent.strip().split("\n")
|
||||
if lines:
|
||||
while lines[0].startswith("#"):
|
||||
lines.pop(0)
|
||||
tokens = []
|
||||
for line in lines:
|
||||
|
||||
parts = line.split("\t")
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, ner = parts
|
||||
if "-" in id_ or "." in id_:
|
||||
continue
|
||||
try:
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head != "0" else id_
|
||||
dep = "ROOT" if dep == "root" else dep
|
||||
tag = pos if tag == "_" else tag
|
||||
tag = tag + "__" + morph if use_morphology else tag
|
||||
ner = ner if ner else "O"
|
||||
tokens.append((id_, word, tag, head, dep, ner))
|
||||
except: # noqa: E722
|
||||
print(line)
|
||||
raise
|
||||
tuples = [list(t) for t in zip(*tokens)]
|
||||
yield (None, [[tuples, []]])
|
||||
i += 1
|
||||
if n >= 1 and i >= n:
|
||||
break
|
||||
|
||||
|
||||
def generate_sentence(sent):
|
||||
(id_, word, tag, head, dep, ner) = sent
|
||||
sentence = {}
|
||||
tokens = []
|
||||
ner = iob_to_biluo(ner)
|
||||
for i, id in enumerate(id_):
|
||||
token = {}
|
||||
token["orth"] = word[i]
|
||||
token["tag"] = tag[i]
|
||||
token["head"] = head[i] - id
|
||||
token["dep"] = dep[i]
|
||||
token["ner"] = ner[i]
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
|
||||
|
||||
|
||||
def create_doc(sentences, id):
|
||||
doc = {}
|
||||
paragraph = {}
|
||||
doc["id"] = id
|
||||
doc["paragraphs"] = []
|
||||
paragraph["sentences"] = sentences
|
||||
doc["paragraphs"].append(paragraph)
|
||||
return doc
|
|
@ -41,24 +41,32 @@ def download(model, direct=False, *pip_args):
|
|||
dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
|
||||
if dl != 0: # if download subprocess doesn't return 0, exit
|
||||
sys.exit(dl)
|
||||
try:
|
||||
# Get package path here because link uses
|
||||
# pip.get_installed_distributions() to check if model is a
|
||||
# package, which fails if model was just installed via
|
||||
# subprocess
|
||||
package_path = get_package_path(model_name)
|
||||
link(model_name, model, force=True, model_path=package_path)
|
||||
except: # noqa: E722
|
||||
# Dirty, but since spacy.download and the auto-linking is
|
||||
# mostly a convenience wrapper, it's best to show a success
|
||||
# message and loading instructions, even if linking fails.
|
||||
msg.warn(
|
||||
"Download successful but linking failed",
|
||||
"Creating a shortcut link for 'en' didn't work (maybe you "
|
||||
"don't have admin permissions?), but you can still load the "
|
||||
"model via its full package name: "
|
||||
"nlp = spacy.load('{}')".format(model_name),
|
||||
)
|
||||
msg.good(
|
||||
"Download and installation successful",
|
||||
"You can now load the model via spacy.load('{}')".format(model_name),
|
||||
)
|
||||
# Only create symlink if the model is installed via a shortcut like 'en'.
|
||||
# There's no real advantage over an additional symlink for en_core_web_sm
|
||||
# and if anything, it's more error prone and causes more confusion.
|
||||
if model in shortcuts:
|
||||
try:
|
||||
# Get package path here because link uses
|
||||
# pip.get_installed_distributions() to check if model is a
|
||||
# package, which fails if model was just installed via
|
||||
# subprocess
|
||||
package_path = get_package_path(model_name)
|
||||
link(model_name, model, force=True, model_path=package_path)
|
||||
except: # noqa: E722
|
||||
# Dirty, but since spacy.download and the auto-linking is
|
||||
# mostly a convenience wrapper, it's best to show a success
|
||||
# message and loading instructions, even if linking fails.
|
||||
msg.warn(
|
||||
"Download successful but linking failed",
|
||||
"Creating a shortcut link for '{}' didn't work (maybe you "
|
||||
"don't have admin permissions?), but you can still load "
|
||||
"the model via its full package name: "
|
||||
"nlp = spacy.load('{}')".format(model, model_name),
|
||||
)
|
||||
|
||||
|
||||
def get_json(url, desc):
|
||||
|
|
|
@ -161,7 +161,7 @@ def parse_deps(orig_doc, options={}):
|
|||
"dir": "right",
|
||||
}
|
||||
)
|
||||
return {"words": words, "arcs": arcs}
|
||||
return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
|
||||
|
||||
|
||||
def parse_ents(doc, options={}):
|
||||
|
@ -177,7 +177,8 @@ def parse_ents(doc, options={}):
|
|||
if not ents:
|
||||
user_warning(Warnings.W006)
|
||||
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
|
||||
return {"text": doc.text, "ents": ents, "title": title}
|
||||
settings = get_doc_settings(doc)
|
||||
return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
|
||||
|
||||
|
||||
def set_render_wrapper(func):
|
||||
|
@ -195,3 +196,10 @@ def set_render_wrapper(func):
|
|||
if not hasattr(func, "__call__"):
|
||||
raise ValueError(Errors.E110.format(obj=type(func)))
|
||||
RENDER_WRAPPER = func
|
||||
|
||||
|
||||
def get_doc_settings(doc):
|
||||
return {
|
||||
"lang": doc.lang_,
|
||||
"direction": doc.vocab.writing_system.get("direction", "ltr"),
|
||||
}
|
||||
|
|
|
@ -3,10 +3,13 @@ from __future__ import unicode_literals
|
|||
|
||||
import uuid
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
|
||||
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from ..util import minify_html, escape_html
|
||||
|
||||
DEFAULT_LANG = "en"
|
||||
DEFAULT_DIR = "ltr"
|
||||
|
||||
|
||||
class DependencyRenderer(object):
|
||||
"""Render dependency parses as SVGs."""
|
||||
|
@ -30,6 +33,8 @@ class DependencyRenderer(object):
|
|||
self.color = options.get("color", "#000000")
|
||||
self.bg = options.get("bg", "#ffffff")
|
||||
self.font = options.get("font", "Arial")
|
||||
self.direction = DEFAULT_DIR
|
||||
self.lang = DEFAULT_LANG
|
||||
|
||||
def render(self, parsed, page=False, minify=False):
|
||||
"""Render complete markup.
|
||||
|
@ -42,13 +47,19 @@ class DependencyRenderer(object):
|
|||
# Create a random ID prefix to make sure parses don't receive the
|
||||
# same ID, even if they're identical
|
||||
id_prefix = uuid.uuid4().hex
|
||||
rendered = [
|
||||
self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
|
||||
for i, p in enumerate(parsed)
|
||||
]
|
||||
rendered = []
|
||||
for i, p in enumerate(parsed):
|
||||
if i == 0:
|
||||
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||
render_id = "{}-{}".format(id_prefix, i)
|
||||
svg = self.render_svg(render_id, p["words"], p["arcs"])
|
||||
rendered.append(svg)
|
||||
if page:
|
||||
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
|
||||
markup = TPL_PAGE.format(content=content)
|
||||
markup = TPL_PAGE.format(
|
||||
content=content, lang=self.lang, dir=self.direction
|
||||
)
|
||||
else:
|
||||
markup = "".join(rendered)
|
||||
if minify:
|
||||
|
@ -83,6 +94,8 @@ class DependencyRenderer(object):
|
|||
bg=self.bg,
|
||||
font=self.font,
|
||||
content=content,
|
||||
dir=self.direction,
|
||||
lang=self.lang,
|
||||
)
|
||||
|
||||
def render_word(self, text, tag, i):
|
||||
|
@ -95,11 +108,13 @@ class DependencyRenderer(object):
|
|||
"""
|
||||
y = self.offset_y + self.word_spacing
|
||||
x = self.offset_x + i * self.distance
|
||||
if self.direction == "rtl":
|
||||
x = self.width - x
|
||||
html_text = escape_html(text)
|
||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||
|
||||
def render_arrow(self, label, start, end, direction, i):
|
||||
"""Render indivicual arrow.
|
||||
"""Render individual arrow.
|
||||
|
||||
label (unicode): Dependency label.
|
||||
start (int): Index of start word.
|
||||
|
@ -110,6 +125,8 @@ class DependencyRenderer(object):
|
|||
"""
|
||||
level = self.levels.index(end - start) + 1
|
||||
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
||||
if self.direction == "rtl":
|
||||
x_start = self.width - x_start
|
||||
y = self.offset_y
|
||||
x_end = (
|
||||
self.offset_x
|
||||
|
@ -117,6 +134,8 @@ class DependencyRenderer(object):
|
|||
+ start * self.distance
|
||||
- self.arrow_spacing * (self.highest_level - level) / 4
|
||||
)
|
||||
if self.direction == "rtl":
|
||||
x_end = self.width - x_end
|
||||
y_curve = self.offset_y - level * self.distance / 2
|
||||
if self.compact:
|
||||
y_curve = self.offset_y - level * self.distance / 6
|
||||
|
@ -124,12 +143,14 @@ class DependencyRenderer(object):
|
|||
y_curve = -self.distance
|
||||
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
||||
arc = self.get_arc(x_start, y, y_curve, x_end)
|
||||
label_side = "right" if self.direction == "rtl" else "left"
|
||||
return TPL_DEP_ARCS.format(
|
||||
id=self.id,
|
||||
i=i,
|
||||
stroke=self.arrow_stroke,
|
||||
head=arrowhead,
|
||||
label=label,
|
||||
label_side=label_side,
|
||||
arc=arc,
|
||||
)
|
||||
|
||||
|
@ -219,6 +240,8 @@ class EntityRenderer(object):
|
|||
self.default_color = "#ddd"
|
||||
self.colors = colors
|
||||
self.ents = options.get("ents", None)
|
||||
self.direction = DEFAULT_DIR
|
||||
self.lang = DEFAULT_LANG
|
||||
|
||||
def render(self, parsed, page=False, minify=False):
|
||||
"""Render complete markup.
|
||||
|
@ -228,12 +251,15 @@ class EntityRenderer(object):
|
|||
minify (bool): Minify HTML markup.
|
||||
RETURNS (unicode): Rendered HTML markup.
|
||||
"""
|
||||
rendered = [
|
||||
self.render_ents(p["text"], p["ents"], p.get("title", None)) for p in parsed
|
||||
]
|
||||
rendered = []
|
||||
for i, p in enumerate(parsed):
|
||||
if i == 0:
|
||||
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||
rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
|
||||
if page:
|
||||
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
|
||||
markup = TPL_PAGE.format(content=docs)
|
||||
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
|
||||
else:
|
||||
markup = "".join(rendered)
|
||||
if minify:
|
||||
|
@ -261,12 +287,16 @@ class EntityRenderer(object):
|
|||
markup += "</br>"
|
||||
if self.ents is None or label.upper() in self.ents:
|
||||
color = self.colors.get(label.upper(), self.default_color)
|
||||
markup += TPL_ENT.format(label=label, text=entity, bg=color)
|
||||
ent_settings = {"label": label, "text": entity, "bg": color}
|
||||
if self.direction == "rtl":
|
||||
markup += TPL_ENT_RTL.format(**ent_settings)
|
||||
else:
|
||||
markup += TPL_ENT.format(**ent_settings)
|
||||
else:
|
||||
markup += entity
|
||||
offset = end
|
||||
markup += escape_html(text[offset:])
|
||||
markup = TPL_ENTS.format(content=markup, colors=self.colors)
|
||||
markup = TPL_ENTS.format(content=markup, dir=self.direction)
|
||||
if title:
|
||||
markup = TPL_TITLE.format(title=title) + markup
|
||||
return markup
|
||||
|
|
|
@ -6,7 +6,7 @@ from __future__ import unicode_literals
|
|||
# Jupyter to render it properly in a cell
|
||||
|
||||
TPL_DEP_SVG = """
|
||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg>
|
||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
|
||||
"""
|
||||
|
||||
|
||||
|
@ -22,7 +22,7 @@ TPL_DEP_ARCS = """
|
|||
<g class="displacy-arrow">
|
||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
|
||||
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath>
|
||||
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="currentColor" text-anchor="middle">{label}</textPath>
|
||||
</text>
|
||||
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
|
||||
</g>
|
||||
|
@ -39,7 +39,7 @@ TPL_TITLE = """
|
|||
|
||||
|
||||
TPL_ENTS = """
|
||||
<div class="entities" style="line-height: 2.5">{content}</div>
|
||||
<div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
|
||||
"""
|
||||
|
||||
|
||||
|
@ -50,14 +50,21 @@ TPL_ENT = """
|
|||
</mark>
|
||||
"""
|
||||
|
||||
TPL_ENT_RTL = """
|
||||
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
|
||||
{text}
|
||||
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
|
||||
</mark>
|
||||
"""
|
||||
|
||||
|
||||
TPL_PAGE = """
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<html lang="{lang}">
|
||||
<head>
|
||||
<title>displaCy</title>
|
||||
</head>
|
||||
|
||||
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body>
|
||||
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: {dir}">{content}</body>
|
||||
</html>
|
||||
"""
|
||||
|
|
|
@ -70,6 +70,16 @@ class Warnings(object):
|
|||
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
|
||||
"efficient and less error-prone Doc.retokenize context manager "
|
||||
"instead.")
|
||||
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
|
||||
"methods is and should be replaced with `exclude`. This makes it "
|
||||
"consistent with the other objects serializable.")
|
||||
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||
"being serialized or deserialized is deprecated. Please use the "
|
||||
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
|
||||
"the v2.x models cannot release the global interpreter lock. "
|
||||
"Future versions may introduce a `n_process` argument for "
|
||||
"parallel inference via multiprocessing.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
@ -348,7 +358,15 @@ class Errors(object):
|
|||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E127 = ("Cannot create phrase pattern representation for length 0. This "
|
||||
"is likely a bug in spaCy.")
|
||||
|
||||
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
|
||||
"arguments to exclude fields from being serialized or deserialized "
|
||||
"is now deprecated. Please use the `exclude` argument instead. "
|
||||
"For example: exclude=['{arg}'].")
|
||||
E129 = ("Cannot write the label of an existing Span object because a Span "
|
||||
"is a read-only view of the underlying Token objects stored in the Doc. "
|
||||
"Instead, create a new Span object and specify the `label` keyword argument, "
|
||||
"for example:\nfrom spacy.tokens import Span\n"
|
||||
"span = Span(doc, start={start}, end={end}, label='{label}')")
|
||||
|
||||
@add_codes
|
||||
class TempErrors(object):
|
||||
|
|
|
@ -23,6 +23,7 @@ class ArabicDefaults(Language.Defaults):
|
|||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||
|
||||
|
||||
class Arabic(Language):
|
||||
|
|
|
@ -34,10 +34,10 @@ TAG_MAP = {
|
|||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
"NNS": {POS: NOUN, "Number": "plur"},
|
||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||
"PDT": {POS: DET, "AdjType": "pdt", "PronType": "prn"},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||
"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
|
|
|
@ -27,6 +27,7 @@ class PersianDefaults(Language.Defaults):
|
|||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||
|
||||
|
||||
class Persian(Language):
|
||||
|
|
|
@ -14,6 +14,7 @@ class HebrewDefaults(Language.Defaults):
|
|||
lex_attr_getters[LANG] = lambda text: "he"
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
|
||||
stop_words = STOP_WORDS
|
||||
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||
|
||||
|
||||
class Hebrew(Language):
|
||||
|
|
|
@ -8,15 +8,13 @@ from .stop_words import STOP_WORDS
|
|||
from .tag_map import TAG_MAP
|
||||
from ...attrs import LANG
|
||||
from ...language import Language
|
||||
from ...tokens import Doc, Token
|
||||
from ...tokens import Doc
|
||||
from ...compat import copy_reg
|
||||
from ...util import DummyTokenizer
|
||||
|
||||
|
||||
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
|
||||
|
||||
# TODO: Is this the right place for this?
|
||||
Token.set_extension("mecab_tag", default=None)
|
||||
|
||||
|
||||
def try_mecab_import():
|
||||
"""Mecab is required for Japanese support, so check for it.
|
||||
|
@ -81,10 +79,12 @@ class JapaneseTokenizer(DummyTokenizer):
|
|||
words = [x.surface for x in dtokens]
|
||||
spaces = [False] * len(words)
|
||||
doc = Doc(self.vocab, words=words, spaces=spaces)
|
||||
mecab_tags = []
|
||||
for token, dtoken in zip(doc, dtokens):
|
||||
token._.mecab_tag = dtoken.pos
|
||||
mecab_tags.append(dtoken.pos)
|
||||
token.tag_ = resolve_pos(dtoken)
|
||||
token.lemma_ = dtoken.lemma
|
||||
doc.user_data["mecab_tags"] = mecab_tags
|
||||
return doc
|
||||
|
||||
|
||||
|
@ -93,6 +93,7 @@ class JapaneseDefaults(Language.Defaults):
|
|||
lex_attr_getters[LANG] = lambda _text: "ja"
|
||||
stop_words = STOP_WORDS
|
||||
tag_map = TAG_MAP
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
|
||||
@classmethod
|
||||
def create_tokenizer(cls, nlp=None):
|
||||
|
@ -107,4 +108,11 @@ class Japanese(Language):
|
|||
return self.tokenizer(text)
|
||||
|
||||
|
||||
def pickle_japanese(instance):
|
||||
return Japanese, tuple()
|
||||
|
||||
|
||||
copy_reg.pickle(Japanese, pickle_japanese)
|
||||
|
||||
|
||||
__all__ = ["Japanese"]
|
||||
|
|
|
@ -14,6 +14,7 @@ class ChineseDefaults(Language.Defaults):
|
|||
use_jieba = True
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
stop_words = STOP_WORDS
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
|
||||
|
||||
class Chinese(Language):
|
||||
|
|
|
@ -29,7 +29,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
|
|||
from .lang.tokenizer_exceptions import TOKEN_MATCH
|
||||
from .lang.tag_map import TAG_MAP
|
||||
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
||||
from .errors import Errors
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
from . import util
|
||||
from . import about
|
||||
|
||||
|
@ -95,6 +95,7 @@ class BaseDefaults(object):
|
|||
morph_rules = {}
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
syntax_iterators = {}
|
||||
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
|
||||
|
||||
|
||||
class Language(object):
|
||||
|
@ -107,6 +108,7 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language
|
||||
"""
|
||||
|
||||
Defaults = BaseDefaults
|
||||
lang = None
|
||||
|
||||
|
@ -195,6 +197,7 @@ class Language(object):
|
|||
self._meta = value
|
||||
|
||||
# Conveniences to access pipeline components
|
||||
# Shouldn't be used anymore!
|
||||
@property
|
||||
def tensorizer(self):
|
||||
return self.get_pipe("tensorizer")
|
||||
|
@ -228,6 +231,8 @@ class Language(object):
|
|||
|
||||
name (unicode): Name of pipeline component to get.
|
||||
RETURNS (callable): The pipeline component.
|
||||
|
||||
DOCS: https://spacy.io/api/language#get_pipe
|
||||
"""
|
||||
for pipe_name, component in self.pipeline:
|
||||
if pipe_name == name:
|
||||
|
@ -240,6 +245,8 @@ class Language(object):
|
|||
name (unicode): Factory name to look up in `Language.factories`.
|
||||
config (dict): Configuration parameters to initialise component.
|
||||
RETURNS (callable): Pipeline component.
|
||||
|
||||
DOCS: https://spacy.io/api/language#create_pipe
|
||||
"""
|
||||
if name not in self.factories:
|
||||
if name == "sbd":
|
||||
|
@ -266,9 +273,7 @@ class Language(object):
|
|||
first (bool): Insert component first / not first in the pipeline.
|
||||
last (bool): Insert component last / not last in the pipeline.
|
||||
|
||||
EXAMPLE:
|
||||
>>> nlp.add_pipe(component, before='ner')
|
||||
>>> nlp.add_pipe(component, name='custom_name', last=True)
|
||||
DOCS: https://spacy.io/api/language#add_pipe
|
||||
"""
|
||||
if not hasattr(component, "__call__"):
|
||||
msg = Errors.E003.format(component=repr(component), name=name)
|
||||
|
@ -310,6 +315,8 @@ class Language(object):
|
|||
|
||||
name (unicode): Name of the component.
|
||||
RETURNS (bool): Whether a component of the name exists in the pipeline.
|
||||
|
||||
DOCS: https://spacy.io/api/language#has_pipe
|
||||
"""
|
||||
return name in self.pipe_names
|
||||
|
||||
|
@ -318,6 +325,8 @@ class Language(object):
|
|||
|
||||
name (unicode): Name of the component to replace.
|
||||
component (callable): Pipeline component.
|
||||
|
||||
DOCS: https://spacy.io/api/language#replace_pipe
|
||||
"""
|
||||
if name not in self.pipe_names:
|
||||
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||
|
@ -328,6 +337,8 @@ class Language(object):
|
|||
|
||||
old_name (unicode): Name of the component to rename.
|
||||
new_name (unicode): New name of the component.
|
||||
|
||||
DOCS: https://spacy.io/api/language#rename_pipe
|
||||
"""
|
||||
if old_name not in self.pipe_names:
|
||||
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
|
||||
|
@ -341,36 +352,39 @@ class Language(object):
|
|||
|
||||
name (unicode): Name of the component to remove.
|
||||
RETURNS (tuple): A `(name, component)` tuple of the removed component.
|
||||
|
||||
DOCS: https://spacy.io/api/language#remove_pipe
|
||||
"""
|
||||
if name not in self.pipe_names:
|
||||
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||
return self.pipeline.pop(self.pipe_names.index(name))
|
||||
|
||||
def __call__(self, text, disable=[]):
|
||||
def __call__(self, text, disable=[], component_cfg=None):
|
||||
"""Apply the pipeline to some text. The text can span multiple sentences,
|
||||
and can contain arbtrary whitespace. Alignment into the original string
|
||||
is preserved.
|
||||
|
||||
text (unicode): The text to be processed.
|
||||
disable (list): Names of the pipeline components to disable.
|
||||
component_cfg (dict): An optional dictionary with extra keyword arguments
|
||||
for specific components.
|
||||
RETURNS (Doc): A container for accessing the annotations.
|
||||
|
||||
EXAMPLE:
|
||||
>>> tokens = nlp('An example sentence. Another example sentence.')
|
||||
>>> tokens[0].text, tokens[0].head.tag_
|
||||
('An', 'NN')
|
||||
DOCS: https://spacy.io/api/language#call
|
||||
"""
|
||||
if len(text) > self.max_length:
|
||||
raise ValueError(
|
||||
Errors.E088.format(length=len(text), max_length=self.max_length)
|
||||
)
|
||||
doc = self.make_doc(text)
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
for name, proc in self.pipeline:
|
||||
if name in disable:
|
||||
continue
|
||||
if not hasattr(proc, "__call__"):
|
||||
raise ValueError(Errors.E003.format(component=type(proc), name=name))
|
||||
doc = proc(doc)
|
||||
doc = proc(doc, **component_cfg.get(name, {}))
|
||||
if doc is None:
|
||||
raise ValueError(Errors.E005.format(name=name))
|
||||
return doc
|
||||
|
@ -381,24 +395,14 @@ class Language(object):
|
|||
of the block. Otherwise, a DisabledPipes object is returned, that has
|
||||
a `.restore()` method you can use to undo your changes.
|
||||
|
||||
EXAMPLE:
|
||||
>>> nlp.add_pipe('parser')
|
||||
>>> nlp.add_pipe('tagger')
|
||||
>>> with nlp.disable_pipes('parser', 'tagger'):
|
||||
>>> assert not nlp.has_pipe('parser')
|
||||
>>> assert nlp.has_pipe('parser')
|
||||
>>> disabled = nlp.disable_pipes('parser')
|
||||
>>> assert len(disabled) == 1
|
||||
>>> assert not nlp.has_pipe('parser')
|
||||
>>> disabled.restore()
|
||||
>>> assert nlp.has_pipe('parser')
|
||||
DOCS: https://spacy.io/api/language#disable_pipes
|
||||
"""
|
||||
return DisabledPipes(self, *names)
|
||||
|
||||
def make_doc(self, text):
|
||||
return self.tokenizer(text)
|
||||
|
||||
def update(self, docs, golds, drop=0.0, sgd=None, losses=None):
|
||||
def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None):
|
||||
"""Update the models in the pipeline.
|
||||
|
||||
docs (iterable): A batch of `Doc` objects.
|
||||
|
@ -407,11 +411,7 @@ class Language(object):
|
|||
sgd (callable): An optimizer.
|
||||
RETURNS (dict): Results from the update.
|
||||
|
||||
EXAMPLE:
|
||||
>>> with nlp.begin_training(gold) as (trainer, optimizer):
|
||||
>>> for epoch in trainer.epochs(gold):
|
||||
>>> for docs, golds in epoch:
|
||||
>>> state = nlp.update(docs, golds, sgd=optimizer)
|
||||
DOCS: https://spacy.io/api/language#update
|
||||
"""
|
||||
if len(docs) != len(golds):
|
||||
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
|
||||
|
@ -421,7 +421,6 @@ class Language(object):
|
|||
if self._optimizer is None:
|
||||
self._optimizer = create_default_optimizer(Model.ops)
|
||||
sgd = self._optimizer
|
||||
|
||||
# Allow dict of args to GoldParse, instead of GoldParse objects.
|
||||
gold_objs = []
|
||||
doc_objs = []
|
||||
|
@ -442,14 +441,17 @@ class Language(object):
|
|||
get_grads.alpha = sgd.alpha
|
||||
get_grads.b1 = sgd.b1
|
||||
get_grads.b2 = sgd.b2
|
||||
|
||||
pipes = list(self.pipeline)
|
||||
random.shuffle(pipes)
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
for name, proc in pipes:
|
||||
if not hasattr(proc, "update"):
|
||||
continue
|
||||
grads = {}
|
||||
proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
|
||||
kwargs = component_cfg.get(name, {})
|
||||
kwargs.setdefault("drop", drop)
|
||||
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
|
||||
for key, (W, dW) in grads.items():
|
||||
sgd(W, dW, key=key)
|
||||
|
||||
|
@ -473,6 +475,7 @@ class Language(object):
|
|||
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
|
||||
>>> nlp.rehearse(raw_batch)
|
||||
"""
|
||||
# TODO: document
|
||||
if len(docs) == 0:
|
||||
return
|
||||
if sgd is None:
|
||||
|
@ -495,7 +498,6 @@ class Language(object):
|
|||
get_grads.alpha = sgd.alpha
|
||||
get_grads.b1 = sgd.b1
|
||||
get_grads.b2 = sgd.b2
|
||||
|
||||
for name, proc in pipes:
|
||||
if not hasattr(proc, "rehearse"):
|
||||
continue
|
||||
|
@ -503,7 +505,6 @@ class Language(object):
|
|||
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
|
||||
for key, (W, dW) in grads.items():
|
||||
sgd(W, dW, key=key)
|
||||
|
||||
return losses
|
||||
|
||||
def preprocess_gold(self, docs_golds):
|
||||
|
@ -519,13 +520,16 @@ class Language(object):
|
|||
for doc, gold in docs_golds:
|
||||
yield doc, gold
|
||||
|
||||
def begin_training(self, get_gold_tuples=None, sgd=None, **cfg):
|
||||
def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg):
|
||||
"""Allocate models, pre-process training data and acquire a trainer and
|
||||
optimizer. Used as a contextmanager.
|
||||
|
||||
get_gold_tuples (function): Function returning gold data
|
||||
component_cfg (dict): Config parameters for specific components.
|
||||
**cfg: Config parameters.
|
||||
RETURNS: An optimizer
|
||||
RETURNS: An optimizer.
|
||||
|
||||
DOCS: https://spacy.io/api/language#begin_training
|
||||
"""
|
||||
if get_gold_tuples is None:
|
||||
get_gold_tuples = lambda: []
|
||||
|
@ -545,10 +549,17 @@ class Language(object):
|
|||
if sgd is None:
|
||||
sgd = create_default_optimizer(Model.ops)
|
||||
self._optimizer = sgd
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
for name, proc in self.pipeline:
|
||||
if hasattr(proc, "begin_training"):
|
||||
kwargs = component_cfg.get(name, {})
|
||||
kwargs.update(cfg)
|
||||
proc.begin_training(
|
||||
get_gold_tuples, pipeline=self.pipeline, sgd=self._optimizer, **cfg
|
||||
get_gold_tuples,
|
||||
pipeline=self.pipeline,
|
||||
sgd=self._optimizer,
|
||||
**kwargs
|
||||
)
|
||||
return self._optimizer
|
||||
|
||||
|
@ -576,20 +587,27 @@ class Language(object):
|
|||
proc._rehearsal_model = deepcopy(proc.model)
|
||||
return self._optimizer
|
||||
|
||||
def evaluate(self, docs_golds, verbose=False, batch_size=256):
|
||||
scorer = Scorer()
|
||||
def evaluate(
|
||||
self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
|
||||
):
|
||||
if scorer is None:
|
||||
scorer = Scorer()
|
||||
docs, golds = zip(*docs_golds)
|
||||
docs = list(docs)
|
||||
golds = list(golds)
|
||||
for name, pipe in self.pipeline:
|
||||
kwargs = component_cfg.get(name, {})
|
||||
kwargs.setdefault("batch_size", batch_size)
|
||||
if not hasattr(pipe, "pipe"):
|
||||
docs = (pipe(doc) for doc in docs)
|
||||
docs = (pipe(doc, **kwargs) for doc in docs)
|
||||
else:
|
||||
docs = pipe.pipe(docs, batch_size=batch_size)
|
||||
docs = pipe.pipe(docs, **kwargs)
|
||||
for doc, gold in zip(docs, golds):
|
||||
if verbose:
|
||||
print(doc)
|
||||
scorer.score(doc, gold, verbose=verbose)
|
||||
kwargs = component_cfg.get("scorer", {})
|
||||
kwargs.setdefault("verbose", verbose)
|
||||
scorer.score(doc, gold, **kwargs)
|
||||
return scorer
|
||||
|
||||
@contextmanager
|
||||
|
@ -628,49 +646,57 @@ class Language(object):
|
|||
self,
|
||||
texts,
|
||||
as_tuples=False,
|
||||
n_threads=2,
|
||||
n_threads=-1,
|
||||
batch_size=1000,
|
||||
disable=[],
|
||||
cleanup=False,
|
||||
component_cfg=None,
|
||||
):
|
||||
"""Process texts as a stream, and yield `Doc` objects in order.
|
||||
|
||||
texts (iterator): A sequence of texts to process.
|
||||
as_tuples (bool):
|
||||
If set to True, inputs should be a sequence of
|
||||
as_tuples (bool): If set to True, inputs should be a sequence of
|
||||
(text, context) tuples. Output will then be a sequence of
|
||||
(doc, context) tuples. Defaults to False.
|
||||
n_threads (int): Currently inactive.
|
||||
batch_size (int): The number of texts to buffer.
|
||||
disable (list): Names of the pipeline components to disable.
|
||||
cleanup (bool): If True, unneeded strings are freed,
|
||||
to control memory use. Experimental.
|
||||
cleanup (bool): If True, unneeded strings are freed to control memory
|
||||
use. Experimental.
|
||||
component_cfg (dict): An optional dictionary with extra keyword
|
||||
arguments for specific components.
|
||||
YIELDS (Doc): Documents in the order of the original text.
|
||||
|
||||
EXAMPLE:
|
||||
>>> texts = [u'One document.', u'...', u'Lots of documents']
|
||||
>>> for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
|
||||
>>> assert doc.is_parsed
|
||||
DOCS: https://spacy.io/api/language#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
deprecation_warning(Warnings.W016)
|
||||
if as_tuples:
|
||||
text_context1, text_context2 = itertools.tee(texts)
|
||||
texts = (tc[0] for tc in text_context1)
|
||||
contexts = (tc[1] for tc in text_context2)
|
||||
docs = self.pipe(
|
||||
texts, n_threads=n_threads, batch_size=batch_size, disable=disable
|
||||
texts,
|
||||
batch_size=batch_size,
|
||||
disable=disable,
|
||||
component_cfg=component_cfg,
|
||||
)
|
||||
for doc, context in izip(docs, contexts):
|
||||
yield (doc, context)
|
||||
return
|
||||
docs = (self.make_doc(text) for text in texts)
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
for name, proc in self.pipeline:
|
||||
if name in disable:
|
||||
continue
|
||||
kwargs = component_cfg.get(name, {})
|
||||
# Allow component_cfg to overwrite the top-level kwargs.
|
||||
kwargs.setdefault("batch_size", batch_size)
|
||||
if hasattr(proc, "pipe"):
|
||||
docs = proc.pipe(docs, n_threads=n_threads, batch_size=batch_size)
|
||||
docs = proc.pipe(docs, **kwargs)
|
||||
else:
|
||||
# Apply the function, but yield the doc
|
||||
docs = _pipe(proc, docs)
|
||||
docs = _pipe(proc, docs, kwargs)
|
||||
# Track weakrefs of "recent" documents, so that we can see when they
|
||||
# expire from memory. When they do, we know we don't need old strings.
|
||||
# This way, we avoid maintaining an unbounded growth in string entries
|
||||
|
@ -701,124 +727,114 @@ class Language(object):
|
|||
self.tokenizer._reset_cache(keys)
|
||||
nr_seen = 0
|
||||
|
||||
def to_disk(self, path, disable=tuple()):
|
||||
def to_disk(self, path, exclude=tuple(), disable=None):
|
||||
"""Save the current state to a directory. If a model is loaded, this
|
||||
will include the model.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be strings or `Path`-like objects.
|
||||
disable (list): Names of pipeline components to disable and prevent
|
||||
from being saved.
|
||||
path (unicode or Path): Path to a directory, which will be created if
|
||||
it doesn't exist.
|
||||
exclude (list): Names of components or serialization fields to exclude.
|
||||
|
||||
EXAMPLE:
|
||||
>>> nlp.to_disk('/path/to/models')
|
||||
DOCS: https://spacy.io/api/language#to_disk
|
||||
"""
|
||||
if disable is not None:
|
||||
deprecation_warning(Warnings.W014)
|
||||
exclude = disable
|
||||
path = util.ensure_path(path)
|
||||
serializers = OrderedDict(
|
||||
(
|
||||
("tokenizer", lambda p: self.tokenizer.to_disk(p, vocab=False)),
|
||||
("meta.json", lambda p: p.open("w").write(srsly.json_dumps(self.meta))),
|
||||
)
|
||||
)
|
||||
serializers = OrderedDict()
|
||||
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(p, exclude=["vocab"])
|
||||
serializers["meta.json"] = lambda p: p.open("w").write(srsly.json_dumps(self.meta))
|
||||
for name, proc in self.pipeline:
|
||||
if not hasattr(proc, "name"):
|
||||
continue
|
||||
if name in disable:
|
||||
if name in exclude:
|
||||
continue
|
||||
if not hasattr(proc, "to_disk"):
|
||||
continue
|
||||
serializers[name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
|
||||
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
|
||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
util.to_disk(path, serializers, {p: False for p in disable})
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(self, path, disable=tuple()):
|
||||
def from_disk(self, path, exclude=tuple(), disable=None):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it. If the saved `Language` object contains a model, the
|
||||
model will be loaded.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
disable (list): Names of the pipeline components to disable.
|
||||
path (unicode or Path): A path to a directory.
|
||||
exclude (list): Names of components or serialization fields to exclude.
|
||||
RETURNS (Language): The modified `Language` object.
|
||||
|
||||
EXAMPLE:
|
||||
>>> from spacy.language import Language
|
||||
>>> nlp = Language().from_disk('/path/to/models')
|
||||
DOCS: https://spacy.io/api/language#from_disk
|
||||
"""
|
||||
if disable is not None:
|
||||
deprecation_warning(Warnings.W014)
|
||||
exclude = disable
|
||||
path = util.ensure_path(path)
|
||||
deserializers = OrderedDict(
|
||||
(
|
||||
("meta.json", lambda p: self.meta.update(srsly.read_json(p))),
|
||||
(
|
||||
"vocab",
|
||||
lambda p: (
|
||||
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
|
||||
),
|
||||
),
|
||||
("tokenizer", lambda p: self.tokenizer.from_disk(p, vocab=False)),
|
||||
)
|
||||
)
|
||||
deserializers = OrderedDict()
|
||||
deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
|
||||
deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
|
||||
deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
|
||||
for name, proc in self.pipeline:
|
||||
if name in disable:
|
||||
if name in exclude:
|
||||
continue
|
||||
if not hasattr(proc, "from_disk"):
|
||||
continue
|
||||
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
|
||||
exclude = {p: False for p in disable}
|
||||
if not (path / "vocab").exists():
|
||||
exclude["vocab"] = True
|
||||
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
|
||||
if not (path / "vocab").exists() and "vocab" not in exclude:
|
||||
# Convert to list here in case exclude is (default) tuple
|
||||
exclude = list(exclude) + ["vocab"]
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
self._path = path
|
||||
return self
|
||||
|
||||
def to_bytes(self, disable=[], **exclude):
|
||||
def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
disable (list): Nameds of pipeline components to disable and prevent
|
||||
from being serialized.
|
||||
exclude (list): Names of components or serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Language` object.
|
||||
|
||||
DOCS: https://spacy.io/api/language#to_bytes
|
||||
"""
|
||||
serializers = OrderedDict(
|
||||
(
|
||||
("vocab", lambda: self.vocab.to_bytes()),
|
||||
("tokenizer", lambda: self.tokenizer.to_bytes(vocab=False)),
|
||||
("meta", lambda: srsly.json_dumps(self.meta)),
|
||||
)
|
||||
)
|
||||
for i, (name, proc) in enumerate(self.pipeline):
|
||||
if name in disable:
|
||||
if disable is not None:
|
||||
deprecation_warning(Warnings.W014)
|
||||
exclude = disable
|
||||
serializers = OrderedDict()
|
||||
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
|
||||
for name, proc in self.pipeline:
|
||||
if name in exclude:
|
||||
continue
|
||||
if not hasattr(proc, "to_bytes"):
|
||||
continue
|
||||
serializers[i] = lambda proc=proc: proc.to_bytes(vocab=False)
|
||||
serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, disable=[]):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
disable (list): Names of the pipeline components to disable.
|
||||
exclude (list): Names of components or serialization fields to exclude.
|
||||
RETURNS (Language): The `Language` object.
|
||||
|
||||
DOCS: https://spacy.io/api/language#from_bytes
|
||||
"""
|
||||
deserializers = OrderedDict(
|
||||
(
|
||||
("meta", lambda b: self.meta.update(srsly.json_loads(b))),
|
||||
(
|
||||
"vocab",
|
||||
lambda b: (
|
||||
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
|
||||
),
|
||||
),
|
||||
("tokenizer", lambda b: self.tokenizer.from_bytes(b, vocab=False)),
|
||||
)
|
||||
)
|
||||
for i, (name, proc) in enumerate(self.pipeline):
|
||||
if name in disable:
|
||||
if disable is not None:
|
||||
deprecation_warning(Warnings.W014)
|
||||
exclude = disable
|
||||
deserializers = OrderedDict()
|
||||
deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b))
|
||||
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
|
||||
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(b, exclude=["vocab"])
|
||||
for name, proc in self.pipeline:
|
||||
if name in exclude:
|
||||
continue
|
||||
if not hasattr(proc, "from_bytes"):
|
||||
continue
|
||||
deserializers[i] = lambda b, proc=proc: proc.from_bytes(b, vocab=False)
|
||||
util.from_bytes(bytes_data, deserializers, {})
|
||||
deserializers[name] = lambda b, proc=proc: proc.from_bytes(b, exclude=["vocab"])
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserializers, exclude)
|
||||
return self
|
||||
|
||||
|
||||
|
@ -873,7 +889,12 @@ class DisabledPipes(list):
|
|||
self[:] = []
|
||||
|
||||
|
||||
def _pipe(func, docs):
|
||||
def _pipe(func, docs, kwargs):
|
||||
# We added some args for pipe that __call__ doesn't expect.
|
||||
kwargs = dict(kwargs)
|
||||
for arg in ["n_threads", "batch_size"]:
|
||||
if arg in kwargs:
|
||||
kwargs.pop(arg)
|
||||
for doc in docs:
|
||||
doc = func(doc)
|
||||
doc = func(doc, **kwargs)
|
||||
yield doc
|
||||
|
|
|
@ -161,17 +161,17 @@ cdef class Lexeme:
|
|||
Lexeme.c_from_bytes(self.c, lex_data)
|
||||
self.orth = self.c.orth
|
||||
|
||||
property has_vector:
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.has_vector(self.c.orth)
|
||||
return self.vocab.has_vector(self.c.orth)
|
||||
|
||||
property vector_norm:
|
||||
@property
|
||||
def vector_norm(self):
|
||||
"""RETURNS (float): The L2 norm of the vector representation."""
|
||||
def __get__(self):
|
||||
vector = self.vector
|
||||
return numpy.sqrt((vector**2).sum())
|
||||
vector = self.vector
|
||||
return numpy.sqrt((vector**2).sum())
|
||||
|
||||
property vector:
|
||||
"""A real-valued meaning representation.
|
||||
|
@ -209,17 +209,17 @@ cdef class Lexeme:
|
|||
def __set__(self, float sentiment):
|
||||
self.c.sentiment = sentiment
|
||||
|
||||
property orth_:
|
||||
@property
|
||||
def orth_(self):
|
||||
"""RETURNS (unicode): The original verbatim text of the lexeme
|
||||
(identical to `Lexeme.text`). Exists mostly for consistency with
|
||||
the other attributes."""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.orth]
|
||||
return self.vocab.strings[self.c.orth]
|
||||
|
||||
property text:
|
||||
@property
|
||||
def text(self):
|
||||
"""RETURNS (unicode): The original verbatim text of the lexeme."""
|
||||
def __get__(self):
|
||||
return self.orth_
|
||||
return self.orth_
|
||||
|
||||
property lower:
|
||||
"""RETURNS (unicode): Lowercase form of the lexeme."""
|
||||
|
|
|
@ -19,7 +19,7 @@ from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH
|
|||
|
||||
from ._schemas import TOKEN_PATTERN_SCHEMA
|
||||
from ..util import get_json_validator, validate_json
|
||||
from ..errors import Errors, MatchPatternError
|
||||
from ..errors import Errors, MatchPatternError, Warnings, deprecation_warning
|
||||
from ..strings import get_string_id
|
||||
from ..attrs import IDS
|
||||
|
||||
|
@ -153,15 +153,15 @@ cdef class Matcher:
|
|||
return default
|
||||
return (self._callbacks[key], self._patterns[key])
|
||||
|
||||
def pipe(self, docs, batch_size=1000, n_threads=2):
|
||||
def pipe(self, docs, batch_size=1000, n_threads=-1):
|
||||
"""Match a stream of documents, yielding them in turn.
|
||||
|
||||
docs (iterable): A stream of documents.
|
||||
batch_size (int): Number of documents to accumulate into a working set.
|
||||
n_threads (int): The number of threads with which to work on the buffer
|
||||
in parallel, if the implementation supports multi-threading.
|
||||
YIELDS (Doc): Documents, in order.
|
||||
"""
|
||||
if n_threads != -1:
|
||||
deprecation_warning(Warnings.W016)
|
||||
for doc in docs:
|
||||
self(doc)
|
||||
yield doc
|
||||
|
|
|
@ -166,14 +166,12 @@ cdef class PhraseMatcher:
|
|||
on_match(self, doc, i, matches)
|
||||
return matches
|
||||
|
||||
def pipe(self, stream, batch_size=1000, n_threads=1, return_matches=False,
|
||||
def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
|
||||
as_tuples=False):
|
||||
"""Match a stream of documents, yielding them in turn.
|
||||
|
||||
docs (iterable): A stream of documents.
|
||||
batch_size (int): Number of documents to accumulate into a working set.
|
||||
n_threads (int): The number of threads with which to work on the buffer
|
||||
in parallel, if the implementation supports multi-threading.
|
||||
return_matches (bool): Yield the match lists along with the docs, making
|
||||
results (doc, matches) tuples.
|
||||
as_tuples (bool): Interpret the input stream as (doc, context) tuples,
|
||||
|
@ -184,6 +182,8 @@ cdef class PhraseMatcher:
|
|||
|
||||
DOCS: https://spacy.io/api/phrasematcher#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
deprecation_warning(Warnings.W016)
|
||||
if as_tuples:
|
||||
for doc, context in stream:
|
||||
matches = self(doc)
|
||||
|
|
|
@ -136,7 +136,7 @@ class MorphologyClassMap(object):
|
|||
cdef class Morphology:
|
||||
'''Store the possible morphological analyses for a language, and index them
|
||||
by hash.
|
||||
|
||||
|
||||
To save space on each token, tokens only know the hash of their morphological
|
||||
analysis, so queries of morphological attributes are delegated
|
||||
to this class.
|
||||
|
@ -200,7 +200,7 @@ cdef class Morphology:
|
|||
return []
|
||||
else:
|
||||
return tag_to_json(tag)
|
||||
|
||||
|
||||
cpdef update(self, hash_t morph, features):
|
||||
"""Update a morphological analysis with new feature values."""
|
||||
tag = (<MorphAnalysisC*>self.tags.get(morph))[0]
|
||||
|
@ -248,7 +248,7 @@ cdef class Morphology:
|
|||
if feat in self._feat_map.id2feat})
|
||||
attrs = intify_attrs(attrs, self.strings, _do_deprecated=True)
|
||||
self.exc[(tag_str, self.strings.add(orth_str))] = attrs
|
||||
|
||||
|
||||
cdef hash_t insert(self, MorphAnalysisC tag) except 0:
|
||||
cdef hash_t key = hash_tag(tag)
|
||||
if self.tags.get(key) == NULL:
|
||||
|
@ -256,7 +256,7 @@ cdef class Morphology:
|
|||
tag_ptr[0] = tag
|
||||
self.tags.set(key, <void*>tag_ptr)
|
||||
return key
|
||||
|
||||
|
||||
cdef int assign_untagged(self, TokenC* token) except -1:
|
||||
"""Set morphological attributes on a token without a POS tag. Uses
|
||||
the lemmatizer's lookup() method, which looks up the string in the
|
||||
|
@ -631,7 +631,7 @@ cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil:
|
|||
return 1
|
||||
else:
|
||||
return 0
|
||||
|
||||
|
||||
cdef int set_feature(MorphAnalysisC* tag,
|
||||
univ_field_t field, attr_t feature, int value) except -1:
|
||||
if value == True:
|
||||
|
|
|
@ -141,16 +141,21 @@ class Pipe(object):
|
|||
with self.model.use_params(params):
|
||||
yield
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
"""Serialize the pipe to a bytestring."""
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the pipe to a bytestring.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized object.
|
||||
"""
|
||||
serialize = OrderedDict()
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
if self.model not in (True, False, None):
|
||||
serialize["model"] = self.model.to_bytes
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load the pipe from a bytestring."""
|
||||
|
||||
def load_model(b):
|
||||
|
@ -161,26 +166,25 @@ class Pipe(object):
|
|||
self.model = self.Model(**self.cfg)
|
||||
self.model.from_bytes(b)
|
||||
|
||||
deserialize = OrderedDict(
|
||||
(
|
||||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
("vocab", lambda b: self.vocab.from_bytes(b)),
|
||||
("model", load_model),
|
||||
)
|
||||
)
|
||||
deserialize = OrderedDict()
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["model"] = load_model
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Serialize the pipe to disk."""
|
||||
serialize = OrderedDict()
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
if self.model not in (None, True, False):
|
||||
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Load the pipe from disk."""
|
||||
|
||||
def load_model(p):
|
||||
|
@ -191,13 +195,11 @@ class Pipe(object):
|
|||
self.model = self.Model(**self.cfg)
|
||||
self.model.from_bytes(p.open("rb").read())
|
||||
|
||||
deserialize = OrderedDict(
|
||||
(
|
||||
("cfg", lambda p: self.cfg.update(_load_cfg(p))),
|
||||
("vocab", lambda p: self.vocab.from_disk(p)),
|
||||
("model", load_model),
|
||||
)
|
||||
)
|
||||
deserialize = OrderedDict()
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["model"] = load_model
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
||||
|
@ -255,7 +257,6 @@ class Tensorizer(Pipe):
|
|||
|
||||
stream (iterator): A sequence of `Doc` objects to process.
|
||||
batch_size (int): Number of `Doc` objects to group.
|
||||
n_threads (int): Number of threads.
|
||||
YIELDS (iterator): A sequence of `Doc` objects, in order of input.
|
||||
"""
|
||||
for docs in util.minibatch(stream, size=batch_size):
|
||||
|
@ -541,7 +542,7 @@ class Tagger(Pipe):
|
|||
with self.model.use_params(params):
|
||||
yield
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
serialize = OrderedDict()
|
||||
if self.model not in (None, True, False):
|
||||
serialize["model"] = self.model.to_bytes
|
||||
|
@ -549,9 +550,10 @@ class Tagger(Pipe):
|
|||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
def load_model(b):
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||
|
@ -576,20 +578,22 @@ class Tagger(Pipe):
|
|||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
("model", lambda b: load_model(b)),
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||
serialize = OrderedDict((
|
||||
('vocab', lambda p: self.vocab.to_disk(p)),
|
||||
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)),
|
||||
('model', lambda p: p.open("wb").write(self.model.to_bytes())),
|
||||
('cfg', lambda p: srsly.write_json(p, self.cfg))
|
||||
("vocab", lambda p: self.vocab.to_disk(p)),
|
||||
("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
|
||||
("model", lambda p: p.open("wb").write(self.model.to_bytes())),
|
||||
("cfg", lambda p: srsly.write_json(p, self.cfg))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def load_model(p):
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||
|
@ -612,6 +616,7 @@ class Tagger(Pipe):
|
|||
("tag_map", load_tag_map),
|
||||
("model", load_model),
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
||||
|
|
|
@ -248,19 +248,17 @@ cdef class StringStore:
|
|||
self.add(word)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
RETURNS (bytes): The serialized form of the `StringStore` object.
|
||||
"""
|
||||
return srsly.json_dumps(list(self))
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
RETURNS (StringStore): The `StringStore` object.
|
||||
"""
|
||||
strings = srsly.json_loads(bytes_data)
|
||||
|
|
|
@ -157,6 +157,10 @@ cdef void cpu_log_loss(float* d_scores,
|
|||
cdef double max_, gmax, Z, gZ
|
||||
best = arg_max_if_gold(scores, costs, is_valid, O)
|
||||
guess = arg_max_if_valid(scores, is_valid, O)
|
||||
if best == -1 or guess == -1:
|
||||
# These shouldn't happen, but if they do, we want to make sure we don't
|
||||
# cause an OOB access.
|
||||
return
|
||||
Z = 1e-10
|
||||
gZ = 1e-10
|
||||
max_ = scores[guess]
|
||||
|
|
|
@ -323,6 +323,12 @@ cdef cppclass StateC:
|
|||
if this._s_i >= 1:
|
||||
this._s_i -= 1
|
||||
|
||||
void force_final() nogil:
|
||||
# This should only be used in desperate situations, as it may leave
|
||||
# the analysis in an unexpected state.
|
||||
this._s_i = 0
|
||||
this._b_i = this.length
|
||||
|
||||
void unshift() nogil:
|
||||
this._b_i -= 1
|
||||
this._buffer[this._b_i] = this.S(0)
|
||||
|
|
|
@ -369,9 +369,9 @@ cdef class ArcEager(TransitionSystem):
|
|||
actions[LEFT].setdefault('dep', 0)
|
||||
return actions
|
||||
|
||||
property action_types:
|
||||
def __get__(self):
|
||||
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
|
||||
@property
|
||||
def action_types(self):
|
||||
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
|
||||
|
||||
def get_cost(self, StateClass state, GoldParse gold, action):
|
||||
cdef Transition t = self.lookup_transition(action)
|
||||
|
@ -384,7 +384,7 @@ cdef class ArcEager(TransitionSystem):
|
|||
cdef Transition t = self.lookup_transition(action)
|
||||
t.do(state.c, t.label)
|
||||
return state
|
||||
|
||||
|
||||
def is_gold_parse(self, StateClass state, GoldParse gold):
|
||||
predicted = set()
|
||||
truth = set()
|
||||
|
|
|
@ -80,9 +80,9 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
actions[action][label] += 1
|
||||
return actions
|
||||
|
||||
property action_types:
|
||||
def __get__(self):
|
||||
return (BEGIN, IN, LAST, UNIT, OUT)
|
||||
@property
|
||||
def action_types(self):
|
||||
return (BEGIN, IN, LAST, UNIT, OUT)
|
||||
|
||||
def move_name(self, int move, attr_t label):
|
||||
if move == OUT:
|
||||
|
@ -257,30 +257,42 @@ cdef class Missing:
|
|||
cdef class Begin:
|
||||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
# Ensure we don't clobber preset entities. If no entity preset,
|
||||
# ent_iob is 0
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 1:
|
||||
cdef int preset_ent_label = st.B_(0).ent_type
|
||||
# If we're the last token of the input, we can't B -- must U or O.
|
||||
if st.B(1) == -1:
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
elif st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
||||
elif label == 0:
|
||||
return False
|
||||
# If the next word is B or O, we can't B now
|
||||
elif preset_ent_iob == 1 or preset_ent_iob == 2:
|
||||
# Ensure we don't clobber preset entities. If no entity preset,
|
||||
# ent_iob is 0
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
# Okay, we're in a preset entity.
|
||||
if label != preset_ent_label:
|
||||
# If label isn't right, reject
|
||||
return False
|
||||
elif st.B_(1).ent_iob != 1:
|
||||
# If next token isn't marked I, we need to make U, not B.
|
||||
return False
|
||||
else:
|
||||
# Otherwise, force acceptance, even if we're across a sentence
|
||||
# boundary or the token is whitespace.
|
||||
return True
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If the next word is B or O, we can't B now
|
||||
return False
|
||||
# If the current word is B, and the next word isn't I, the current word
|
||||
# is really U
|
||||
elif preset_ent_iob == 3 and st.B_(1).ent_iob != 1:
|
||||
return False
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
elif st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
return False
|
||||
# Don't allow entities to start on whitespace
|
||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||
return False
|
||||
else:
|
||||
return label != 0 and not st.entity_is_open()
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -314,18 +326,27 @@ cdef class In:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 2:
|
||||
if label == 0:
|
||||
return False
|
||||
elif st.E_(0).ent_type != label:
|
||||
return False
|
||||
elif not st.entity_is_open():
|
||||
return False
|
||||
elif st.B(1) == -1:
|
||||
# If we're at the end, we can't I.
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
return False
|
||||
# TODO: Is this quite right? I think it's supposed to be ensuring the
|
||||
# gazetteer matches are maintained
|
||||
elif st.B(1) != -1 and st.B_(1).ent_iob != preset_ent_iob:
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If we know the next word is B or O, we can't be I (must be L)
|
||||
return False
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
return False
|
||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -370,9 +391,17 @@ cdef class In:
|
|||
cdef class Last:
|
||||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
if st.B_(1).ent_iob == 1:
|
||||
if label == 0:
|
||||
return False
|
||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
||||
elif not st.entity_is_open():
|
||||
return False
|
||||
elif st.E_(0).ent_type != label:
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
# If a preset entity has I next, we can't L here.
|
||||
return False
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -416,17 +445,29 @@ cdef class Unit:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 2:
|
||||
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||
if label == 0:
|
||||
return False
|
||||
elif preset_ent_iob == 1:
|
||||
elif st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
||||
elif preset_ent_iob == 2:
|
||||
# Don't clobber preset O
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
# If next token is In, we can't be Unit -- must be Begin
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
# Okay, there's a preset entity here
|
||||
if label != preset_ent_label:
|
||||
# Require labels to match
|
||||
return False
|
||||
else:
|
||||
# Otherwise return True, ignoring the whitespace constraint.
|
||||
return True
|
||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||
return False
|
||||
return label != 0 and not st.entity_is_open()
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -461,11 +502,14 @@ cdef class Out:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 3:
|
||||
if st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
return False
|
||||
elif preset_ent_iob == 1:
|
||||
return False
|
||||
return not st.entity_is_open()
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
|
|
@ -205,13 +205,11 @@ cdef class Parser:
|
|||
self.set_annotations([doc], states, tensors=None)
|
||||
return doc
|
||||
|
||||
def pipe(self, docs, int batch_size=256, int n_threads=2, beam_width=None):
|
||||
def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None):
|
||||
"""Process a stream of documents.
|
||||
|
||||
stream: The sequence of documents to process.
|
||||
batch_size (int): Number of documents to accumulate into a working set.
|
||||
n_threads (int): The number of threads with which to work on the buffer
|
||||
in parallel.
|
||||
YIELDS (Doc): Documents, in order.
|
||||
"""
|
||||
if beam_width is None:
|
||||
|
@ -221,14 +219,14 @@ cdef class Parser:
|
|||
for batch in util.minibatch(docs, size=batch_size):
|
||||
batch_in_order = list(batch)
|
||||
by_length = sorted(batch_in_order, key=lambda doc: len(doc))
|
||||
for subbatch in util.minibatch(by_length, size=batch_size//4):
|
||||
for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)):
|
||||
subbatch = list(subbatch)
|
||||
parse_states = self.predict(subbatch, beam_width=beam_width,
|
||||
beam_density=beam_density)
|
||||
self.set_annotations(subbatch, parse_states, tensors=None)
|
||||
for doc in batch_in_order:
|
||||
yield doc
|
||||
|
||||
|
||||
def require_model(self):
|
||||
"""Raise an error if the component's model is not initialized."""
|
||||
if getattr(self, 'model', None) in (None, True, False):
|
||||
|
@ -272,7 +270,7 @@ cdef class Parser:
|
|||
beams = self.moves.init_beams(docs, beam_width, beam_density=beam_density)
|
||||
# This is pretty dirty, but the NER can resize itself in init_batch,
|
||||
# if labels are missing. We therefore have to check whether we need to
|
||||
# expand our model output.
|
||||
# expand our model output.
|
||||
self.model.resize_output(self.moves.n_moves)
|
||||
model = self.model(docs)
|
||||
token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
|
||||
|
@ -363,9 +361,14 @@ cdef class Parser:
|
|||
for i in range(batch_size):
|
||||
self.moves.set_valid(is_valid, states[i])
|
||||
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
|
||||
action = self.moves.c[guess]
|
||||
action.do(states[i], action.label)
|
||||
states[i].push_hist(guess)
|
||||
if guess == -1:
|
||||
# This shouldn't happen, but it's hard to raise an error here,
|
||||
# and we don't want to infinite loop. So, force to end state.
|
||||
states[i].force_final()
|
||||
else:
|
||||
action = self.moves.c[guess]
|
||||
action.do(states[i], action.label)
|
||||
states[i].push_hist(guess)
|
||||
free(is_valid)
|
||||
|
||||
def transition_beams(self, beams, float[:, ::1] scores):
|
||||
|
@ -437,7 +440,7 @@ cdef class Parser:
|
|||
if self._rehearsal_model is None:
|
||||
return None
|
||||
losses.setdefault(self.name, 0.)
|
||||
|
||||
|
||||
states = self.moves.init_batch(docs)
|
||||
# This is pretty dirty, but the NER can resize itself in init_batch,
|
||||
# if labels are missing. We therefore have to check whether we need to
|
||||
|
@ -598,22 +601,24 @@ cdef class Parser:
|
|||
self.cfg.update(cfg)
|
||||
return sgd
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
serializers = {
|
||||
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
||||
'vocab': lambda p: self.vocab.to_disk(p),
|
||||
'moves': lambda p: self.moves.to_disk(p, strings=False),
|
||||
'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
|
||||
'cfg': lambda p: srsly.write_json(p, self.cfg)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
deserializers = {
|
||||
'vocab': lambda p: self.vocab.from_disk(p),
|
||||
'moves': lambda p: self.moves.from_disk(p, strings=False),
|
||||
'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
|
||||
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
'model': lambda p: None
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
path = util.ensure_path(path)
|
||||
|
@ -627,22 +632,24 @@ cdef class Parser:
|
|||
self.cfg.update(cfg)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
serializers = OrderedDict((
|
||||
('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
|
||||
('vocab', lambda: self.vocab.to_bytes()),
|
||||
('moves', lambda: self.moves.to_bytes(strings=False)),
|
||||
('moves', lambda: self.moves.to_bytes(exclude=["strings"])),
|
||||
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
deserializers = OrderedDict((
|
||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||
('moves', lambda b: self.moves.from_bytes(b, strings=False)),
|
||||
('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])),
|
||||
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
('model', lambda b: None)
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
|
|
|
@ -94,6 +94,13 @@ cdef class TransitionSystem:
|
|||
raise ValueError(Errors.E024)
|
||||
return history
|
||||
|
||||
def apply_transition(self, StateClass state, name):
|
||||
if not self.is_valid(state, name):
|
||||
raise ValueError(
|
||||
"Cannot apply transition {name}: invalid for the current state.".format(name=name))
|
||||
action = self.lookup_transition(name)
|
||||
action.do(state.c, action.label)
|
||||
|
||||
cdef int initialize_state(self, StateC* state) nogil:
|
||||
pass
|
||||
|
||||
|
@ -201,30 +208,32 @@ cdef class TransitionSystem:
|
|||
self.labels[action][label_name] = new_freq-1
|
||||
return 1
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
with path.open('wb') as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
with path.open('rb') as file_:
|
||||
byte_data = file_.read()
|
||||
self.from_bytes(byte_data, **exclude)
|
||||
self.from_bytes(byte_data, **kwargs)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
transitions = []
|
||||
serializers = {
|
||||
'moves': lambda: srsly.json_dumps(self.labels),
|
||||
'strings': lambda: self.strings.to_bytes()
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
labels = {}
|
||||
deserializers = {
|
||||
'moves': lambda b: labels.update(srsly.json_loads(b)),
|
||||
'strings': lambda b: self.strings.from_bytes(b)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
self.initialize_actions(labels)
|
||||
return self
|
||||
|
|
|
@ -1,46 +1,44 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_array_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
||||
feats_array = doc.to_array((ORTH, SHAPE))
|
||||
assert feats_array[0][0] != feats_array[0][1]
|
||||
assert feats_array[0][0] != feats_array[0][1]
|
||||
|
||||
|
||||
def test_doc_stringy_array_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_stringy_array_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
||||
feats_array_stringy = tokens.to_array(("ORTH", "SHAPE"))
|
||||
feats_array = doc.to_array((ORTH, SHAPE))
|
||||
feats_array_stringy = doc.to_array(("ORTH", "SHAPE"))
|
||||
assert feats_array_stringy[0][0] == feats_array[0][0]
|
||||
assert feats_array_stringy[0][1] == feats_array[0][1]
|
||||
|
||||
|
||||
def test_doc_scalar_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_scalar_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array(ORTH)
|
||||
feats_array = doc.to_array(ORTH)
|
||||
assert feats_array.shape == (3,)
|
||||
|
||||
|
||||
def test_doc_array_tag(en_tokenizer):
|
||||
text = "A nice sentence."
|
||||
def test_doc_array_tag(en_vocab):
|
||||
words = ["A", "nice", "sentence", "."]
|
||||
pos = ["DET", "ADJ", "NOUN", "PUNCT"]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
|
||||
doc = get_doc(en_vocab, words=words, pos=pos)
|
||||
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
||||
feats_array = doc.to_array((ORTH, POS))
|
||||
assert feats_array[0][1] == doc[0].pos
|
||||
|
@ -49,13 +47,22 @@ def test_doc_array_tag(en_tokenizer):
|
|||
assert feats_array[3][1] == doc[3].pos
|
||||
|
||||
|
||||
def test_doc_array_dep(en_tokenizer):
|
||||
text = "A nice sentence."
|
||||
def test_doc_array_dep(en_vocab):
|
||||
words = ["A", "nice", "sentence", "."]
|
||||
deps = ["det", "amod", "ROOT", "punct"]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
doc = get_doc(en_vocab, words=words, deps=deps)
|
||||
feats_array = doc.to_array((ORTH, DEP))
|
||||
assert feats_array[0][1] == doc[0].dep
|
||||
assert feats_array[1][1] == doc[1].dep
|
||||
assert feats_array[2][1] == doc[2].dep
|
||||
assert feats_array[3][1] == doc[3].dep
|
||||
|
||||
|
||||
@pytest.mark.parametrize("attrs", [["ORTH", "SHAPE"], "IS_ALPHA"])
|
||||
def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
||||
"""Test that both Doc.to_array and Doc.from_array accept string attrs,
|
||||
as well as single attrs and sequences of attrs.
|
||||
"""
|
||||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||
|
|
|
@ -4,9 +4,10 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.errors import ModelsWarning
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
@ -112,14 +113,14 @@ def test_doc_api_serialize(en_tokenizer, text):
|
|||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(tensor=False), tensor=False
|
||||
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||
)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(sentiment=False), sentiment=False
|
||||
tokens.to_bytes(exclude=["sentiment"]), exclude=["sentiment"]
|
||||
)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
|
@ -256,3 +257,24 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
|
|||
assert lca[1, 1] == 1
|
||||
assert lca[0, 1] == 2
|
||||
assert lca[1, 2] == 2
|
||||
|
||||
|
||||
def test_doc_is_nered(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
assert not doc.is_nered
|
||||
doc.ents = [Span(doc, 3, 5, label="GPE")]
|
||||
assert doc.is_nered
|
||||
# Test creating doc from array with unknown values
|
||||
arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64")
|
||||
doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr)
|
||||
assert doc.is_nered
|
||||
# Test serialization
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||
assert new_doc.is_nered
|
||||
|
||||
|
||||
def test_doc_lang(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "world"])
|
||||
assert doc.lang_ == "en"
|
||||
assert doc.lang == en_vocab.strings["en"]
|
||||
|
|
|
@ -178,11 +178,10 @@ def test_span_string_label(doc):
|
|||
assert span.label == doc.vocab.strings["hello"]
|
||||
|
||||
|
||||
def test_span_string_set_label(doc):
|
||||
def test_span_label_readonly(doc):
|
||||
span = Span(doc, 0, 1)
|
||||
span.label_ = "hello"
|
||||
assert span.label_ == "hello"
|
||||
assert span.label == doc.vocab.strings["hello"]
|
||||
with pytest.raises(NotImplementedError):
|
||||
span.label_ = "hello"
|
||||
|
||||
|
||||
def test_span_ents_property(doc):
|
||||
|
|
|
@ -199,3 +199,31 @@ def test_token0_has_sent_start_true():
|
|||
assert doc[0].is_sent_start is True
|
||||
assert doc[1].is_sent_start is None
|
||||
assert not doc.is_sentenced
|
||||
|
||||
|
||||
def test_token_api_conjuncts_chain(en_vocab):
|
||||
words = "The boy and the girl and the man went .".split()
|
||||
heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1]
|
||||
deps = ["det", "nsubj", "cc", "det", "conj", "cc", "det", "conj", "ROOT", "punct"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["girl", "man"]
|
||||
assert [w.text for w in doc[4].conjuncts] == ["boy", "man"]
|
||||
assert [w.text for w in doc[7].conjuncts] == ["boy", "girl"]
|
||||
|
||||
|
||||
def test_token_api_conjuncts_simple(en_vocab):
|
||||
words = "They came and went .".split()
|
||||
heads = [1, 0, -1, -2, -1]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||
|
||||
|
||||
def test_token_api_non_conjuncts(en_vocab):
|
||||
words = "They came .".split()
|
||||
heads = [1, 0, -1]
|
||||
deps = ["nsubj", "ROOT", "punct"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[0].conjuncts] == []
|
||||
assert [w.text for w in doc[1].conjuncts] == []
|
||||
|
|
|
@ -106,3 +106,37 @@ def test_underscore_raises_for_invalid(invalid_kwargs):
|
|||
def test_underscore_accepts_valid(valid_kwargs):
|
||||
valid_kwargs["force"] = True
|
||||
Doc.set_extension("test", **valid_kwargs)
|
||||
|
||||
|
||||
def test_underscore_mutable_defaults_list(en_vocab):
|
||||
"""Test that mutable default arguments are handled correctly (see #2581)."""
|
||||
Doc.set_extension("mutable", default=[])
|
||||
doc1 = Doc(en_vocab, words=["one"])
|
||||
doc2 = Doc(en_vocab, words=["two"])
|
||||
doc1._.mutable.append("foo")
|
||||
assert len(doc1._.mutable) == 1
|
||||
assert doc1._.mutable[0] == "foo"
|
||||
assert len(doc2._.mutable) == 0
|
||||
doc1._.mutable = ["bar", "baz"]
|
||||
doc1._.mutable.append("foo")
|
||||
assert len(doc1._.mutable) == 3
|
||||
assert len(doc2._.mutable) == 0
|
||||
|
||||
|
||||
def test_underscore_mutable_defaults_dict(en_vocab):
|
||||
"""Test that mutable default arguments are handled correctly (see #2581)."""
|
||||
Token.set_extension("mutable", default={})
|
||||
token1 = Doc(en_vocab, words=["one"])[0]
|
||||
token2 = Doc(en_vocab, words=["two"])[0]
|
||||
token1._.mutable["foo"] = "bar"
|
||||
assert len(token1._.mutable) == 1
|
||||
assert token1._.mutable["foo"] == "bar"
|
||||
assert len(token2._.mutable) == 0
|
||||
token1._.mutable["foo"] = "baz"
|
||||
assert len(token1._.mutable) == 1
|
||||
assert token1._.mutable["foo"] == "baz"
|
||||
token1._.mutable["x"] = []
|
||||
token1._.mutable["x"].append("y")
|
||||
assert len(token1._.mutable) == 2
|
||||
assert token1._.mutable["x"] == ["y"]
|
||||
assert len(token2._.mutable) == 0
|
||||
|
|
|
@ -2,22 +2,24 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.displacy import render
|
||||
from spacy.gold import iob_to_biluo
|
||||
from spacy.lang.it import Italian
|
||||
import numpy
|
||||
from spacy.lang.en import English
|
||||
|
||||
from ..util import add_vecs_to_vocab, get_doc
|
||||
|
||||
|
||||
@pytest.mark.xfail(
|
||||
reason="The dot is now properly split off, but the prefix/suffix rules are not applied again afterwards."
|
||||
"This means that the quote will still be attached to the remaining token."
|
||||
)
|
||||
@pytest.mark.xfail
|
||||
def test_issue2070():
|
||||
"""Test that checks that a dot followed by a quote is handled appropriately."""
|
||||
"""Test that checks that a dot followed by a quote is handled
|
||||
appropriately.
|
||||
"""
|
||||
# Problem: The dot is now properly split off, but the prefix/suffix rules
|
||||
# are not applied again afterwards. This means that the quote will still be
|
||||
# attached to the remaining token.
|
||||
nlp = English()
|
||||
doc = nlp('First sentence."A quoted sentence" he said ...')
|
||||
assert len(doc) == 11
|
||||
|
@ -37,6 +39,26 @@ def test_issue2179():
|
|||
assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",)
|
||||
|
||||
|
||||
def test_issue2203(en_vocab):
|
||||
"""Test that lemmas are set correctly in doc.from_array."""
|
||||
words = ["I", "'ll", "survive"]
|
||||
tags = ["PRP", "MD", "VB"]
|
||||
lemmas = ["-PRON-", "will", "survive"]
|
||||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corrpution problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
assert [t.lemma_ for t in doc] == lemmas
|
||||
# We need to serialize both tag and lemma, since this is what causes the bug
|
||||
doc_array = doc.to_array(["TAG", "LEMMA"])
|
||||
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
|
||||
assert [t.tag_ for t in new_doc] == tags
|
||||
assert [t.lemma_ for t in new_doc] == lemmas
|
||||
|
||||
|
||||
def test_issue2219(en_vocab):
|
||||
vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
|
||||
add_vecs_to_vocab(en_vocab, vectors)
|
||||
|
|
26
spacy/tests/regression/test_issue3345.py
Normal file
26
spacy/tests/regression/test_issue3345.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline import EntityRuler, EntityRecognizer
|
||||
|
||||
|
||||
def test_issue3345():
|
||||
"""Test case where preset entity crosses sentence boundary."""
|
||||
nlp = English()
|
||||
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
|
||||
doc[4].is_sent_start = True
|
||||
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
|
||||
ner = EntityRecognizer(doc.vocab)
|
||||
# Add the OUT action. I wouldn't have thought this would be necessary...
|
||||
ner.moves.add_action(5, "")
|
||||
ner.add_label("GPE")
|
||||
doc = ruler(doc)
|
||||
# Get into the state just before "New"
|
||||
state = ner.moves.init_batch([doc])[0]
|
||||
ner.moves.apply_transition(state, "O")
|
||||
ner.moves.apply_transition(state, "O")
|
||||
ner.moves.apply_transition(state, "O")
|
||||
# Check that B-GPE is valid.
|
||||
assert ner.moves.is_valid(state, "B-GPE")
|
21
spacy/tests/regression/test_issue3410.py
Normal file
21
spacy/tests/regression/test_issue3410.py
Normal file
|
@ -0,0 +1,21 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.en import English
|
||||
from spacy.matcher import Matcher, PhraseMatcher
|
||||
|
||||
|
||||
def test_issue3410():
|
||||
texts = ["Hello world", "This is a test"]
|
||||
nlp = English()
|
||||
matcher = Matcher(nlp.vocab)
|
||||
phrasematcher = PhraseMatcher(nlp.vocab)
|
||||
with pytest.deprecated_call():
|
||||
docs = list(nlp.pipe(texts, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
list(matcher.pipe(docs, n_threads=4))
|
||||
with pytest.deprecated_call():
|
||||
list(phrasematcher.pipe(docs, n_threads=4))
|
|
@ -1,6 +1,7 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
from spacy.compat import path2str
|
||||
|
||||
|
@ -41,3 +42,18 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
|
|||
doc.to_disk(file_path)
|
||||
doc_d = Doc(en_vocab).from_disk(file_path)
|
||||
assert doc.to_bytes() == doc_d.to_bytes()
|
||||
|
||||
|
||||
def test_serialize_doc_exclude(en_vocab):
|
||||
doc = Doc(en_vocab, words=["hello", "world"])
|
||||
doc.user_data["foo"] = "bar"
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||
assert new_doc.user_data["foo"] == "bar"
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(), exclude=["user_data"])
|
||||
assert not new_doc.user_data
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
|
||||
assert not new_doc.user_data
|
||||
with pytest.raises(ValueError):
|
||||
doc.to_bytes(user_data=False)
|
||||
with pytest.raises(ValueError):
|
||||
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
|
||||
|
|
|
@ -52,3 +52,19 @@ def test_serialize_with_custom_tokenizer():
|
|||
nlp.tokenizer = custom_tokenizer(nlp)
|
||||
with make_tempdir() as d:
|
||||
nlp.to_disk(d)
|
||||
|
||||
|
||||
def test_serialize_language_exclude(meta_data):
|
||||
name = "name-in-fixture"
|
||||
nlp = Language(meta=meta_data)
|
||||
assert nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes())
|
||||
assert nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"])
|
||||
assert not new_nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
|
||||
assert not new_nlp.meta["name"] == name
|
||||
with pytest.raises(ValueError):
|
||||
nlp.to_bytes(meta=False)
|
||||
with pytest.raises(ValueError):
|
||||
Language().from_bytes(nlp.to_bytes(), meta=False)
|
||||
|
|
|
@ -55,7 +55,9 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
|
|||
parser_d = Parser(en_vocab)
|
||||
parser_d.model, _ = parser_d.Model(0)
|
||||
parser_d = parser_d.from_disk(file_path)
|
||||
assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False)
|
||||
parser_bytes = parser.to_bytes(exclude=["model"])
|
||||
parser_d_bytes = parser_d.to_bytes(exclude=["model"])
|
||||
assert parser_bytes == parser_d_bytes
|
||||
|
||||
|
||||
def test_to_from_bytes(parser, blank_parser):
|
||||
|
@ -114,3 +116,25 @@ def test_serialize_textcat_empty(en_vocab):
|
|||
# See issue #1105
|
||||
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
|
||||
textcat.to_bytes()
|
||||
|
||||
|
||||
@pytest.mark.parametrize("Parser", test_parsers)
|
||||
def test_serialize_pipe_exclude(en_vocab, Parser):
|
||||
def get_new_parser():
|
||||
new_parser = Parser(en_vocab)
|
||||
new_parser.model, _ = new_parser.Model(0)
|
||||
return new_parser
|
||||
|
||||
parser = Parser(en_vocab)
|
||||
parser.model, _ = parser.Model(0)
|
||||
parser.cfg["foo"] = "bar"
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes())
|
||||
assert "foo" in new_parser.cfg
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes(), exclude=["cfg"])
|
||||
assert "foo" not in new_parser.cfg
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["cfg"]))
|
||||
assert "foo" not in new_parser.cfg
|
||||
with pytest.raises(ValueError):
|
||||
parser.to_bytes(cfg=False)
|
||||
with pytest.raises(ValueError):
|
||||
get_new_parser().from_bytes(parser.to_bytes(), cfg=False)
|
||||
|
|
|
@ -12,13 +12,12 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])]
|
|||
test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize("text", ["rat"])
|
||||
def test_serialize_vocab(en_vocab, text):
|
||||
text_hash = en_vocab.strings.add(text)
|
||||
vocab_bytes = en_vocab.to_bytes()
|
||||
new_vocab = Vocab().from_bytes(vocab_bytes)
|
||||
assert new_vocab.strings(text_hash) == text
|
||||
assert new_vocab.strings[text_hash] == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings1,strings2", test_strings)
|
||||
|
@ -69,6 +68,15 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr):
|
|||
assert vocab2[strings[0]].norm_ == lex_attr
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||
def test_deserialize_vocab_seen_entries(strings, lex_attr):
|
||||
# Reported in #2153
|
||||
vocab = Vocab(strings=strings)
|
||||
length = len(vocab)
|
||||
vocab.from_bytes(vocab.to_bytes())
|
||||
assert len(vocab) == length
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
|
||||
vocab1 = Vocab(strings=strings)
|
||||
|
|
|
@ -1,38 +1,28 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import os
|
||||
from pathlib import Path
|
||||
from spacy.compat import symlink_to, symlink_remove, path2str
|
||||
from spacy.cli.converters import conllu2json
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def target_local_path():
|
||||
return Path("./foo-target")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def link_local_path():
|
||||
return Path("./foo-symlink")
|
||||
|
||||
|
||||
@pytest.fixture(scope="function")
|
||||
def setup_target(request, target_local_path, link_local_path):
|
||||
if not target_local_path.exists():
|
||||
os.mkdir(path2str(target_local_path))
|
||||
|
||||
# yield -- need to cleanup even if assertion fails
|
||||
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
|
||||
def cleanup():
|
||||
symlink_remove(link_local_path)
|
||||
os.rmdir(path2str(target_local_path))
|
||||
|
||||
request.addfinalizer(cleanup)
|
||||
|
||||
|
||||
def test_create_symlink_windows(setup_target, target_local_path, link_local_path):
|
||||
assert target_local_path.exists()
|
||||
|
||||
symlink_to(link_local_path, target_local_path)
|
||||
assert link_local_path.exists()
|
||||
def test_cli_converters_conllu2json():
|
||||
# https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu
|
||||
lines = [
|
||||
"1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO",
|
||||
"2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER",
|
||||
"3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tI-PER",
|
||||
"4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
|
||||
]
|
||||
input_data = "\n".join(lines)
|
||||
converted = conllu2json(input_data, n_sents=1)
|
||||
assert len(converted) == 1
|
||||
assert converted[0]["id"] == 0
|
||||
assert len(converted[0]["paragraphs"]) == 1
|
||||
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
|
||||
sent = converted[0]["paragraphs"][0]["sentences"][0]
|
||||
assert len(sent["tokens"]) == 4
|
||||
tokens = sent["tokens"]
|
||||
assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår"]
|
||||
assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
|
||||
assert [t["head"] for t in tokens] == [1, 2, -1, 0]
|
||||
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
|
||||
assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
|
||||
|
|
90
spacy/tests/test_displacy.py
Normal file
90
spacy/tests/test_displacy.py
Normal file
|
@ -0,0 +1,90 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy import displacy
|
||||
from spacy.tokens import Span
|
||||
from spacy.lang.fa import Persian
|
||||
|
||||
from .util import get_doc
|
||||
|
||||
|
||||
def test_displacy_parse_ents(en_vocab):
|
||||
"""Test that named entities on a Doc are converted into displaCy's format."""
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
ents = displacy.parse_ents(doc)
|
||||
assert isinstance(ents, dict)
|
||||
assert ents["text"] == "But Google is starting from behind "
|
||||
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
|
||||
|
||||
|
||||
def test_displacy_parse_deps(en_vocab):
|
||||
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
heads = [1, 0, 1, -2]
|
||||
pos = ["DET", "VERB", "DET", "NOUN"]
|
||||
tags = ["DT", "VBZ", "DT", "NN"]
|
||||
deps = ["nsubj", "ROOT", "det", "attr"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
|
||||
deps = displacy.parse_deps(doc)
|
||||
assert isinstance(deps, dict)
|
||||
assert deps["words"] == [
|
||||
{"text": "This", "tag": "DET"},
|
||||
{"text": "is", "tag": "VERB"},
|
||||
{"text": "a", "tag": "DET"},
|
||||
{"text": "sentence", "tag": "NOUN"},
|
||||
]
|
||||
assert deps["arcs"] == [
|
||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||
{"start": 2, "end": 3, "label": "det", "dir": "left"},
|
||||
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
|
||||
]
|
||||
|
||||
|
||||
def test_displacy_spans(en_vocab):
|
||||
"""Test that displaCy can render Spans."""
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
html = displacy.render(doc[1:4], style="ent")
|
||||
assert html.startswith("<div")
|
||||
|
||||
|
||||
def test_displacy_raises_for_wrong_type(en_vocab):
|
||||
with pytest.raises(ValueError):
|
||||
displacy.render("hello world")
|
||||
|
||||
|
||||
def test_displacy_rtl():
|
||||
# Source: http://www.sobhe.ir/hazm/ – is this correct?
|
||||
words = ["ما", "بسیار", "کتاب", "می\u200cخوانیم"]
|
||||
# These are (likely) wrong, but it's just for testing
|
||||
pos = ["PRO", "ADV", "N_PL", "V_SUB"] # needs to match lang.fa.tag_map
|
||||
deps = ["foo", "bar", "foo", "baz"]
|
||||
heads = [1, 0, 1, -2]
|
||||
nlp = Persian()
|
||||
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
|
||||
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
||||
html = displacy.render(doc, page=True, style="dep")
|
||||
assert "direction: rtl" in html
|
||||
assert 'direction="rtl"' in html
|
||||
assert 'lang="{}"'.format(nlp.lang) in html
|
||||
html = displacy.render(doc, page=True, style="ent")
|
||||
assert "direction: rtl" in html
|
||||
assert 'lang="{}"'.format(nlp.lang) in html
|
||||
|
||||
|
||||
def test_displacy_render_wrapper(en_vocab):
|
||||
"""Test that displaCy accepts custom rendering wrapper."""
|
||||
|
||||
def wrapper(html):
|
||||
return "TEST" + html + "TEST"
|
||||
|
||||
displacy.set_render_wrapper(wrapper)
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
html = displacy.render(doc, style="ent")
|
||||
assert html.startswith("TEST<div")
|
||||
assert html.endswith("/div>TEST")
|
||||
# Restore
|
||||
displacy.set_render_wrapper(lambda html: html)
|
|
@ -2,14 +2,35 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import os
|
||||
from pathlib import Path
|
||||
from spacy import util
|
||||
from spacy import displacy
|
||||
from spacy import prefer_gpu, require_gpu
|
||||
from spacy.tokens import Span
|
||||
from spacy.compat import symlink_to, symlink_remove, path2str
|
||||
from spacy._ml import PrecomputableAffine
|
||||
|
||||
from .util import get_doc
|
||||
|
||||
@pytest.fixture
|
||||
def symlink_target():
|
||||
return Path("./foo-target")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def symlink():
|
||||
return Path("./foo-symlink")
|
||||
|
||||
|
||||
@pytest.fixture(scope="function")
|
||||
def symlink_setup_target(request, symlink_target, symlink):
|
||||
if not symlink_target.exists():
|
||||
os.mkdir(path2str(symlink_target))
|
||||
# yield -- need to cleanup even if assertion fails
|
||||
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
|
||||
def cleanup():
|
||||
symlink_remove(symlink)
|
||||
os.rmdir(path2str(symlink_target))
|
||||
|
||||
request.addfinalizer(cleanup)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text", ["hello/world", "hello world"])
|
||||
|
@ -31,66 +52,6 @@ def test_util_get_package_path(package):
|
|||
assert isinstance(path, Path)
|
||||
|
||||
|
||||
def test_displacy_parse_ents(en_vocab):
|
||||
"""Test that named entities on a Doc are converted into displaCy's format."""
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
ents = displacy.parse_ents(doc)
|
||||
assert isinstance(ents, dict)
|
||||
assert ents["text"] == "But Google is starting from behind "
|
||||
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
|
||||
|
||||
|
||||
def test_displacy_parse_deps(en_vocab):
|
||||
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
heads = [1, 0, 1, -2]
|
||||
pos = ["DET", "VERB", "DET", "NOUN"]
|
||||
tags = ["DT", "VBZ", "DT", "NN"]
|
||||
deps = ["nsubj", "ROOT", "det", "attr"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
|
||||
deps = displacy.parse_deps(doc)
|
||||
assert isinstance(deps, dict)
|
||||
assert deps["words"] == [
|
||||
{"text": "This", "tag": "DET"},
|
||||
{"text": "is", "tag": "VERB"},
|
||||
{"text": "a", "tag": "DET"},
|
||||
{"text": "sentence", "tag": "NOUN"},
|
||||
]
|
||||
assert deps["arcs"] == [
|
||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||
{"start": 2, "end": 3, "label": "det", "dir": "left"},
|
||||
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
|
||||
]
|
||||
|
||||
|
||||
def test_displacy_spans(en_vocab):
|
||||
"""Test that displaCy can render Spans."""
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
html = displacy.render(doc[1:4], style="ent")
|
||||
assert html.startswith("<div")
|
||||
|
||||
|
||||
def test_displacy_render_wrapper(en_vocab):
|
||||
"""Test that displaCy accepts custom rendering wrapper."""
|
||||
|
||||
def wrapper(html):
|
||||
return "TEST" + html + "TEST"
|
||||
|
||||
displacy.set_render_wrapper(wrapper)
|
||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||
html = displacy.render(doc, style="ent")
|
||||
assert html.startswith("TEST<div")
|
||||
assert html.endswith("/div>TEST")
|
||||
|
||||
|
||||
def test_displacy_raises_for_wrong_type(en_vocab):
|
||||
with pytest.raises(ValueError):
|
||||
displacy.render("hello world")
|
||||
|
||||
|
||||
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
||||
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
|
||||
assert model.W.shape == (nF, nO, nP, nI)
|
||||
|
@ -124,3 +85,9 @@ def test_prefer_gpu():
|
|||
def test_require_gpu():
|
||||
with pytest.raises(ValueError):
|
||||
require_gpu()
|
||||
|
||||
|
||||
def test_create_symlink_windows(symlink_setup_target, symlink_target, symlink):
|
||||
assert symlink_target.exists()
|
||||
symlink_to(symlink, symlink_target)
|
||||
assert symlink.exists()
|
||||
|
|
|
@ -45,3 +45,8 @@ def test_vocab_api_contains(en_vocab, text):
|
|||
_ = en_vocab[text] # noqa: F841
|
||||
assert text in en_vocab
|
||||
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
|
||||
|
||||
|
||||
def test_vocab_writing_system(en_vocab):
|
||||
assert en_vocab.writing_system["direction"] == "ltr"
|
||||
assert en_vocab.writing_system["has_case"] is True
|
||||
|
|
|
@ -125,7 +125,7 @@ cdef class Tokenizer:
|
|||
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
||||
return doc
|
||||
|
||||
def pipe(self, texts, batch_size=1000, n_threads=2):
|
||||
def pipe(self, texts, batch_size=1000, n_threads=-1):
|
||||
"""Tokenize a stream of texts.
|
||||
|
||||
texts: A sequence of unicode texts.
|
||||
|
@ -134,6 +134,8 @@ cdef class Tokenizer:
|
|||
|
||||
DOCS: https://spacy.io/api/tokenizer#pipe
|
||||
"""
|
||||
if n_threads != -1:
|
||||
deprecation_warning(Warnings.W016)
|
||||
for text in texts:
|
||||
yield self(text)
|
||||
|
||||
|
@ -360,36 +362,37 @@ cdef class Tokenizer:
|
|||
self._cache.set(key, cached)
|
||||
self._rules[string] = substrings
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
it doesn't exist.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||
"""
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
path (unicode or Path): A path to a directory.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Tokenizer): The modified `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||
"""
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
self.from_bytes(bytes_data, **exclude)
|
||||
self.from_bytes(bytes_data, **kwargs)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
||||
|
@ -402,13 +405,14 @@ cdef class Tokenizer:
|
|||
("token_match", lambda: _get_regex_pattern(self.token_match)),
|
||||
("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Tokenizer): The `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_bytes
|
||||
|
@ -422,6 +426,7 @@ cdef class Tokenizer:
|
|||
("token_match", lambda b: data.setdefault("token_match", b)),
|
||||
("exceptions", lambda b: data.setdefault("rules", b))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if data.get("prefix_search"):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
|
|
|
@ -240,8 +240,18 @@ cdef class Doc:
|
|||
for i in range(1, self.length):
|
||||
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
return False
|
||||
|
||||
@property
|
||||
def is_nered(self):
|
||||
"""Check if the document has named entities set. Will return True if
|
||||
*any* of the tokens has a named entity tag set (even if the others are
|
||||
uknown values).
|
||||
"""
|
||||
for i in range(self.length):
|
||||
if self.c[i].ent_iob != 0:
|
||||
return True
|
||||
return False
|
||||
|
||||
def __getitem__(self, object i):
|
||||
"""Get a `Token` or `Span` object.
|
||||
|
@ -374,7 +384,8 @@ cdef class Doc:
|
|||
xp = get_array_module(vector)
|
||||
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||
|
||||
property has_vector:
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""A boolean value indicating whether a word vector is associated with
|
||||
the object.
|
||||
|
||||
|
@ -382,15 +393,14 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if "has_vector" in self.user_hooks:
|
||||
return self.user_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size:
|
||||
return True
|
||||
elif self.tensor.size:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
if "has_vector" in self.user_hooks:
|
||||
return self.user_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size:
|
||||
return True
|
||||
elif self.tensor.size:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
property vector:
|
||||
"""A real-valued meaning representation. Defaults to an average of the
|
||||
|
@ -443,22 +453,22 @@ cdef class Doc:
|
|||
def __set__(self, value):
|
||||
self._vector_norm = value
|
||||
|
||||
property text:
|
||||
@property
|
||||
def text(self):
|
||||
"""A unicode representation of the document text.
|
||||
|
||||
RETURNS (unicode): The original verbatim text of the document.
|
||||
"""
|
||||
def __get__(self):
|
||||
return "".join(t.text_with_ws for t in self)
|
||||
return "".join(t.text_with_ws for t in self)
|
||||
|
||||
property text_with_ws:
|
||||
@property
|
||||
def text_with_ws(self):
|
||||
"""An alias of `Doc.text`, provided for duck-type compatibility with
|
||||
`Span` and `Token`.
|
||||
|
||||
RETURNS (unicode): The original verbatim text of the document.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.text
|
||||
return self.text
|
||||
|
||||
property ents:
|
||||
"""The named entities in the document. Returns a tuple of named entity
|
||||
|
@ -535,7 +545,8 @@ cdef class Doc:
|
|||
# Set start as B
|
||||
self.c[start].ent_iob = 3
|
||||
|
||||
property noun_chunks:
|
||||
@property
|
||||
def noun_chunks(self):
|
||||
"""Iterate over the base noun phrases in the document. Yields base
|
||||
noun-phrase #[code Span] objects, if the document has been
|
||||
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
||||
|
@ -547,22 +558,22 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#noun_chunks
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
# its tokenisation changing, so it's okay once we have the Span
|
||||
# objects. See Issue #375.
|
||||
spans = []
|
||||
if self.noun_chunks_iterator is not None:
|
||||
for start, end, label in self.noun_chunks_iterator(self):
|
||||
spans.append(Span(self, start, end, label=label))
|
||||
for span in spans:
|
||||
yield span
|
||||
if not self.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
# its tokenisation changing, so it's okay once we have the Span
|
||||
# objects. See Issue #375.
|
||||
spans = []
|
||||
if self.noun_chunks_iterator is not None:
|
||||
for start, end, label in self.noun_chunks_iterator(self):
|
||||
spans.append(Span(self, start, end, label=label))
|
||||
for span in spans:
|
||||
yield span
|
||||
|
||||
property sents:
|
||||
@property
|
||||
def sents(self):
|
||||
"""Iterate over the sentences in the document. Yields sentence `Span`
|
||||
objects. Sentence spans have no label. To improve accuracy on informal
|
||||
texts, spaCy calculates sentence boundaries from the syntactic
|
||||
|
@ -573,19 +584,28 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#sents
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.is_sentenced:
|
||||
raise ValueError(Errors.E030)
|
||||
if "sents" in self.user_hooks:
|
||||
yield from self.user_hooks["sents"](self)
|
||||
else:
|
||||
start = 0
|
||||
for i in range(1, self.length):
|
||||
if self.c[i].sent_start == 1:
|
||||
yield Span(self, start, i)
|
||||
start = i
|
||||
if start != self.length:
|
||||
yield Span(self, start, self.length)
|
||||
if not self.is_sentenced:
|
||||
raise ValueError(Errors.E030)
|
||||
if "sents" in self.user_hooks:
|
||||
yield from self.user_hooks["sents"](self)
|
||||
else:
|
||||
start = 0
|
||||
for i in range(1, self.length):
|
||||
if self.c[i].sent_start == 1:
|
||||
yield Span(self, start, i)
|
||||
start = i
|
||||
if start != self.length:
|
||||
yield Span(self, start, self.length)
|
||||
|
||||
@property
|
||||
def lang(self):
|
||||
"""RETURNS (uint64): ID of the language of the doc's vocabulary."""
|
||||
return self.vocab.strings[self.vocab.lang]
|
||||
|
||||
@property
|
||||
def lang_(self):
|
||||
"""RETURNS (unicode): Language of the doc's vocabulary, e.g. 'en'."""
|
||||
return self.vocab.lang
|
||||
|
||||
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
|
||||
if self.length == 0:
|
||||
|
@ -727,6 +747,18 @@ cdef class Doc:
|
|||
|
||||
DOCS: https://spacy.io/api/doc#from_array
|
||||
"""
|
||||
# Handle scalar/list inputs of strings/ints for py_attr_ids
|
||||
# See also #3064
|
||||
if isinstance(attrs, basestring_):
|
||||
# Handle inputs like doc.to_array('ORTH')
|
||||
attrs = [attrs]
|
||||
elif not hasattr(attrs, "__iter__"):
|
||||
# Handle inputs like doc.to_array(ORTH)
|
||||
attrs = [attrs]
|
||||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in attrs]
|
||||
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(Errors.E032)
|
||||
cdef int i, col
|
||||
|
@ -739,17 +771,20 @@ cdef class Doc:
|
|||
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
|
||||
for i, attr_id in enumerate(attrs):
|
||||
attr_ids[i] = attr_id
|
||||
if len(array.shape) == 1:
|
||||
array = array.reshape((array.size, 1))
|
||||
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
||||
if TAG in attrs:
|
||||
col = attrs.index(TAG)
|
||||
for i in range(length):
|
||||
if array[i, col] != 0:
|
||||
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
|
||||
# Now load the data
|
||||
for i in range(self.length):
|
||||
token = &self.c[i]
|
||||
for j in range(n_attrs):
|
||||
Token.set_struct_attr(token, attr_ids[j], array[i, j])
|
||||
# Auxiliary loading logic
|
||||
for col, attr_id in enumerate(attrs):
|
||||
if attr_id == TAG:
|
||||
for i in range(length):
|
||||
if array[i, col] != 0:
|
||||
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
|
||||
if attr_ids[j] != TAG:
|
||||
Token.set_struct_attr(token, attr_ids[j], array[i, j])
|
||||
# Set flags
|
||||
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
|
||||
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
|
||||
|
@ -770,24 +805,26 @@ cdef class Doc:
|
|||
"""
|
||||
return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Doc): The modified `Doc` object.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_disk
|
||||
|
@ -795,11 +832,12 @@ cdef class Doc:
|
|||
path = util.ensure_path(path)
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
return self.from_bytes(bytes_data, **exclude)
|
||||
return self.from_bytes(bytes_data, **kwargs)
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize, i.e. export the document contents to a binary string.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
|
||||
all annotations.
|
||||
|
||||
|
@ -825,16 +863,22 @@ cdef class Doc:
|
|||
"sentiment": lambda: self.sentiment,
|
||||
"tensor": lambda: self.tensor,
|
||||
}
|
||||
for key in kwargs:
|
||||
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
if "user_data" not in exclude and self.user_data:
|
||||
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
||||
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
|
||||
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
|
||||
if "user_data_keys" not in exclude:
|
||||
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
|
||||
if "user_data_values" not in exclude:
|
||||
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Deserialize, i.e. import the document contents from a binary string.
|
||||
|
||||
data (bytes): The string to load from.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Doc): Itself.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_bytes
|
||||
|
@ -850,6 +894,9 @@ cdef class Doc:
|
|||
"user_data_keys": lambda b: None,
|
||||
"user_data_values": lambda b: None,
|
||||
}
|
||||
for key in kwargs:
|
||||
if key in deserializers or key in ("user_data",):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||
# vexing for user data. As a best guess, we *know* that within
|
||||
|
@ -990,11 +1037,11 @@ cdef class Doc:
|
|||
DOCS: https://spacy.io/api/doc#to_json
|
||||
"""
|
||||
data = {"text": self.text}
|
||||
if self.ents:
|
||||
if self.is_nered:
|
||||
data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
|
||||
"label": ent.label_} for ent in self.ents]
|
||||
sents = list(self.sents)
|
||||
if sents:
|
||||
if self.is_sentenced:
|
||||
sents = list(self.sents)
|
||||
data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
|
||||
for sent in sents]
|
||||
if self.cats:
|
||||
|
@ -1002,13 +1049,11 @@ cdef class Doc:
|
|||
data["tokens"] = []
|
||||
for token in self:
|
||||
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
|
||||
if token.pos_:
|
||||
if self.is_tagged:
|
||||
token_data["pos"] = token.pos_
|
||||
if token.tag_:
|
||||
token_data["tag"] = token.tag_
|
||||
if token.dep_:
|
||||
if self.is_parsed:
|
||||
token_data["dep"] = token.dep_
|
||||
if token.head:
|
||||
token_data["head"] = token.head.i
|
||||
data["tokens"].append(token_data)
|
||||
if underscore:
|
||||
|
@ -1179,7 +1224,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
|
|||
|
||||
|
||||
def pickle_doc(doc):
|
||||
bytes_data = doc.to_bytes(vocab=False, user_data=False)
|
||||
bytes_data = doc.to_bytes(exclude=["vocab", "user_data"])
|
||||
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
|
||||
doc.user_token_hooks)
|
||||
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
|
||||
|
@ -1188,7 +1233,7 @@ def pickle_doc(doc):
|
|||
def unpickle_doc(vocab, hooks_and_data, bytes_data):
|
||||
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
|
||||
|
||||
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude="user_data")
|
||||
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude=["user_data"])
|
||||
doc.user_hooks.update(doc_hooks)
|
||||
doc.user_span_hooks.update(span_hooks)
|
||||
doc.user_token_hooks.update(token_hooks)
|
||||
|
|
|
@ -322,46 +322,47 @@ cdef class Span:
|
|||
self.start = start
|
||||
self.end = end + 1
|
||||
|
||||
property vocab:
|
||||
@property
|
||||
def vocab(self):
|
||||
"""RETURNS (Vocab): The Span's Doc's vocab."""
|
||||
def __get__(self):
|
||||
return self.doc.vocab
|
||||
return self.doc.vocab
|
||||
|
||||
property sent:
|
||||
@property
|
||||
def sent(self):
|
||||
"""RETURNS (Span): The sentence span that the span is a part of."""
|
||||
def __get__(self):
|
||||
if "sent" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sent"](self)
|
||||
# This should raise if not parsed / no custom sentence boundaries
|
||||
self.doc.sents
|
||||
# If doc is parsed we can use the deps to find the sentence
|
||||
# otherwise we use the `sent_start` token attribute
|
||||
cdef int n = 0
|
||||
cdef int i
|
||||
if self.doc.is_parsed:
|
||||
root = &self.doc.c[self.start]
|
||||
while root.head != 0:
|
||||
root += root.head
|
||||
n += 1
|
||||
if n >= self.doc.length:
|
||||
raise RuntimeError(Errors.E038)
|
||||
return self.doc[root.l_edge:root.r_edge + 1]
|
||||
elif self.doc.is_sentenced:
|
||||
# Find start of the sentence
|
||||
start = self.start
|
||||
while self.doc.c[start].sent_start != 1 and start > 0:
|
||||
start += -1
|
||||
# Find end of the sentence
|
||||
end = self.end
|
||||
n = 0
|
||||
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
||||
end += 1
|
||||
n += 1
|
||||
if n >= self.doc.length:
|
||||
break
|
||||
return self.doc[start:end]
|
||||
if "sent" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sent"](self)
|
||||
# This should raise if not parsed / no custom sentence boundaries
|
||||
self.doc.sents
|
||||
# If doc is parsed we can use the deps to find the sentence
|
||||
# otherwise we use the `sent_start` token attribute
|
||||
cdef int n = 0
|
||||
cdef int i
|
||||
if self.doc.is_parsed:
|
||||
root = &self.doc.c[self.start]
|
||||
while root.head != 0:
|
||||
root += root.head
|
||||
n += 1
|
||||
if n >= self.doc.length:
|
||||
raise RuntimeError(Errors.E038)
|
||||
return self.doc[root.l_edge:root.r_edge + 1]
|
||||
elif self.doc.is_sentenced:
|
||||
# Find start of the sentence
|
||||
start = self.start
|
||||
while self.doc.c[start].sent_start != 1 and start > 0:
|
||||
start += -1
|
||||
# Find end of the sentence
|
||||
end = self.end
|
||||
n = 0
|
||||
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
||||
end += 1
|
||||
n += 1
|
||||
if n >= self.doc.length:
|
||||
break
|
||||
return self.doc[start:end]
|
||||
|
||||
property ents:
|
||||
@property
|
||||
def ents(self):
|
||||
"""The named entities in the span. Returns a tuple of named entity
|
||||
`Span` objects, if the entity recognizer has been applied.
|
||||
|
||||
|
@ -369,14 +370,14 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#ents
|
||||
"""
|
||||
def __get__(self):
|
||||
ents = []
|
||||
for ent in self.doc.ents:
|
||||
if ent.start >= self.start and ent.end <= self.end:
|
||||
ents.append(ent)
|
||||
return ents
|
||||
ents = []
|
||||
for ent in self.doc.ents:
|
||||
if ent.start >= self.start and ent.end <= self.end:
|
||||
ents.append(ent)
|
||||
return ents
|
||||
|
||||
property has_vector:
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""A boolean value indicating whether a word vector is associated with
|
||||
the object.
|
||||
|
||||
|
@ -384,17 +385,17 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if "has_vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size > 0:
|
||||
return any(token.has_vector for token in self)
|
||||
elif self.doc.tensor.size > 0:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
if "has_vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size > 0:
|
||||
return any(token.has_vector for token in self)
|
||||
elif self.doc.tensor.size > 0:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
property vector:
|
||||
@property
|
||||
def vector(self):
|
||||
"""A real-valued meaning representation. Defaults to an average of the
|
||||
token vectors.
|
||||
|
||||
|
@ -403,61 +404,61 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if "vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
if self._vector is None:
|
||||
self._vector = sum(t.vector for t in self) / len(self)
|
||||
return self._vector
|
||||
if "vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
if self._vector is None:
|
||||
self._vector = sum(t.vector for t in self) / len(self)
|
||||
return self._vector
|
||||
|
||||
property vector_norm:
|
||||
@property
|
||||
def vector_norm(self):
|
||||
"""The L2 norm of the span's vector representation.
|
||||
|
||||
RETURNS (float): The L2 norm of the vector representation.
|
||||
|
||||
DOCS: https://spacy.io/api/span#vector_norm
|
||||
"""
|
||||
def __get__(self):
|
||||
if "vector_norm" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
cdef float value
|
||||
cdef double norm = 0
|
||||
if self._vector_norm is None:
|
||||
norm = 0
|
||||
for value in self.vector:
|
||||
norm += value * value
|
||||
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
||||
return self._vector_norm
|
||||
if "vector_norm" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
cdef float value
|
||||
cdef double norm = 0
|
||||
if self._vector_norm is None:
|
||||
norm = 0
|
||||
for value in self.vector:
|
||||
norm += value * value
|
||||
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
||||
return self._vector_norm
|
||||
|
||||
property sentiment:
|
||||
@property
|
||||
def sentiment(self):
|
||||
"""RETURNS (float): A scalar value indicating the positivity or
|
||||
negativity of the span.
|
||||
"""
|
||||
def __get__(self):
|
||||
if "sentiment" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sentiment"](self)
|
||||
else:
|
||||
return sum([token.sentiment for token in self]) / len(self)
|
||||
if "sentiment" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sentiment"](self)
|
||||
else:
|
||||
return sum([token.sentiment for token in self]) / len(self)
|
||||
|
||||
property text:
|
||||
@property
|
||||
def text(self):
|
||||
"""RETURNS (unicode): The original verbatim text of the span."""
|
||||
def __get__(self):
|
||||
text = self.text_with_ws
|
||||
if self[-1].whitespace_:
|
||||
text = text[:-1]
|
||||
return text
|
||||
text = self.text_with_ws
|
||||
if self[-1].whitespace_:
|
||||
text = text[:-1]
|
||||
return text
|
||||
|
||||
property text_with_ws:
|
||||
@property
|
||||
def text_with_ws(self):
|
||||
"""The text content of the span with a trailing whitespace character if
|
||||
the last token has one.
|
||||
|
||||
RETURNS (unicode): The text content of the span (with trailing
|
||||
whitespace).
|
||||
"""
|
||||
def __get__(self):
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
property noun_chunks:
|
||||
@property
|
||||
def noun_chunks(self):
|
||||
"""Yields base noun-phrase `Span` objects, if the document has been
|
||||
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
||||
phrase that does not permit other NPs to be nested within it – so no
|
||||
|
@ -468,23 +469,23 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#noun_chunks
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.doc.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
# its tokenisation changing, so it's okay once we have the Span
|
||||
# objects. See Issue #375
|
||||
spans = []
|
||||
cdef attr_t label
|
||||
if self.doc.noun_chunks_iterator is not None:
|
||||
for start, end, label in self.doc.noun_chunks_iterator(self):
|
||||
spans.append(Span(self.doc, start, end, label=label))
|
||||
for span in spans:
|
||||
yield span
|
||||
if not self.doc.is_parsed:
|
||||
raise ValueError(Errors.E029)
|
||||
# Accumulate the result before beginning to iterate over it. This
|
||||
# prevents the tokenisation from being changed out from under us
|
||||
# during the iteration. The tricky thing here is that Span accepts
|
||||
# its tokenisation changing, so it's okay once we have the Span
|
||||
# objects. See Issue #375
|
||||
spans = []
|
||||
cdef attr_t label
|
||||
if self.doc.noun_chunks_iterator is not None:
|
||||
for start, end, label in self.doc.noun_chunks_iterator(self):
|
||||
spans.append(Span(self.doc, start, end, label=label))
|
||||
for span in spans:
|
||||
yield span
|
||||
|
||||
property root:
|
||||
@property
|
||||
def root(self):
|
||||
"""The token with the shortest path to the root of the
|
||||
sentence (or the root itself). If multiple tokens are equally
|
||||
high in the tree, the first token is taken.
|
||||
|
@ -493,41 +494,51 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#root
|
||||
"""
|
||||
def __get__(self):
|
||||
self._recalculate_indices()
|
||||
if "root" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["root"](self)
|
||||
# This should probably be called 'head', and the other one called
|
||||
# 'gov'. But we went with 'head' elsehwhere, and now we're stuck =/
|
||||
cdef int i
|
||||
# First, we scan through the Span, and check whether there's a word
|
||||
# with head==0, i.e. a sentence root. If so, we can return it. The
|
||||
# longer the span, the more likely it contains a sentence root, and
|
||||
# in this case we return in linear time.
|
||||
for i in range(self.start, self.end):
|
||||
if self.doc.c[i].head == 0:
|
||||
return self.doc[i]
|
||||
# If we don't have a sentence root, we do something that's not so
|
||||
# algorithmically clever, but I think should be quite fast,
|
||||
# especially for short spans.
|
||||
# For each word, we count the path length, and arg min this measure.
|
||||
# We could use better tree logic to save steps here...But I
|
||||
# think this should be okay.
|
||||
cdef int current_best = self.doc.length
|
||||
cdef int root = -1
|
||||
for i in range(self.start, self.end):
|
||||
if self.start <= (i+self.doc.c[i].head) < self.end:
|
||||
continue
|
||||
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
|
||||
if words_to_root < current_best:
|
||||
current_best = words_to_root
|
||||
root = i
|
||||
if root == -1:
|
||||
return self.doc[self.start]
|
||||
else:
|
||||
return self.doc[root]
|
||||
self._recalculate_indices()
|
||||
if "root" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["root"](self)
|
||||
# This should probably be called 'head', and the other one called
|
||||
# 'gov'. But we went with 'head' elsehwhere, and now we're stuck =/
|
||||
cdef int i
|
||||
# First, we scan through the Span, and check whether there's a word
|
||||
# with head==0, i.e. a sentence root. If so, we can return it. The
|
||||
# longer the span, the more likely it contains a sentence root, and
|
||||
# in this case we return in linear time.
|
||||
for i in range(self.start, self.end):
|
||||
if self.doc.c[i].head == 0:
|
||||
return self.doc[i]
|
||||
# If we don't have a sentence root, we do something that's not so
|
||||
# algorithmically clever, but I think should be quite fast,
|
||||
# especially for short spans.
|
||||
# For each word, we count the path length, and arg min this measure.
|
||||
# We could use better tree logic to save steps here...But I
|
||||
# think this should be okay.
|
||||
cdef int current_best = self.doc.length
|
||||
cdef int root = -1
|
||||
for i in range(self.start, self.end):
|
||||
if self.start <= (i+self.doc.c[i].head) < self.end:
|
||||
continue
|
||||
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
|
||||
if words_to_root < current_best:
|
||||
current_best = words_to_root
|
||||
root = i
|
||||
if root == -1:
|
||||
return self.doc[self.start]
|
||||
else:
|
||||
return self.doc[root]
|
||||
|
||||
property lefts:
|
||||
@property
|
||||
def conjuncts(self):
|
||||
"""Tokens that are conjoined to the span's root.
|
||||
|
||||
RETURNS (tuple): A tuple of Token objects.
|
||||
|
||||
DOCS: https://spacy.io/api/span#lefts
|
||||
"""
|
||||
return self.root.conjuncts
|
||||
|
||||
@property
|
||||
def lefts(self):
|
||||
"""Tokens that are to the left of the span, whose head is within the
|
||||
`Span`.
|
||||
|
||||
|
@ -535,13 +546,13 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
for token in reversed(self): # Reverse, so we get tokens in order
|
||||
for left in token.lefts:
|
||||
if left.i < self.start:
|
||||
yield left
|
||||
for token in reversed(self): # Reverse, so we get tokens in order
|
||||
for left in token.lefts:
|
||||
if left.i < self.start:
|
||||
yield left
|
||||
|
||||
property rights:
|
||||
@property
|
||||
def rights(self):
|
||||
"""Tokens that are to the right of the Span, whose head is within the
|
||||
`Span`.
|
||||
|
||||
|
@ -549,13 +560,13 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#rights
|
||||
"""
|
||||
def __get__(self):
|
||||
for token in self:
|
||||
for right in token.rights:
|
||||
if right.i >= self.end:
|
||||
yield right
|
||||
for token in self:
|
||||
for right in token.rights:
|
||||
if right.i >= self.end:
|
||||
yield right
|
||||
|
||||
property n_lefts:
|
||||
@property
|
||||
def n_lefts(self):
|
||||
"""The number of tokens that are to the left of the span, whose
|
||||
heads are within the span.
|
||||
|
||||
|
@ -564,10 +575,10 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#n_lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
return len(list(self.lefts))
|
||||
return len(list(self.lefts))
|
||||
|
||||
property n_rights:
|
||||
@property
|
||||
def n_rights(self):
|
||||
"""The number of tokens that are to the right of the span, whose
|
||||
heads are within the span.
|
||||
|
||||
|
@ -576,22 +587,21 @@ cdef class Span:
|
|||
|
||||
DOCS: https://spacy.io/api/span#n_rights
|
||||
"""
|
||||
def __get__(self):
|
||||
return len(list(self.rights))
|
||||
return len(list(self.rights))
|
||||
|
||||
property subtree:
|
||||
@property
|
||||
def subtree(self):
|
||||
"""Tokens within the span and tokens which descend from them.
|
||||
|
||||
YIELDS (Token): A token within the span, or a descendant from it.
|
||||
|
||||
DOCS: https://spacy.io/api/span#subtree
|
||||
"""
|
||||
def __get__(self):
|
||||
for word in self.lefts:
|
||||
yield from word.subtree
|
||||
yield from self
|
||||
for word in self.rights:
|
||||
yield from word.subtree
|
||||
for word in self.lefts:
|
||||
yield from word.subtree
|
||||
yield from self
|
||||
for word in self.rights:
|
||||
yield from word.subtree
|
||||
|
||||
property ent_id:
|
||||
"""RETURNS (uint64): The entity ID."""
|
||||
|
@ -609,33 +619,33 @@ cdef class Span:
|
|||
def __set__(self, hash_t key):
|
||||
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
|
||||
|
||||
property orth_:
|
||||
@property
|
||||
def orth_(self):
|
||||
"""Verbatim text content (identical to `Span.text`). Exists mostly for
|
||||
consistency with other attributes.
|
||||
|
||||
RETURNS (unicode): The span's text."""
|
||||
def __get__(self):
|
||||
return self.text
|
||||
return self.text
|
||||
|
||||
property lemma_:
|
||||
@property
|
||||
def lemma_(self):
|
||||
"""RETURNS (unicode): The span's lemma."""
|
||||
def __get__(self):
|
||||
return " ".join([t.lemma_ for t in self]).strip()
|
||||
return " ".join([t.lemma_ for t in self]).strip()
|
||||
|
||||
property upper_:
|
||||
@property
|
||||
def upper_(self):
|
||||
"""Deprecated. Use `Span.text.upper()` instead."""
|
||||
def __get__(self):
|
||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
||||
|
||||
property lower_:
|
||||
@property
|
||||
def lower_(self):
|
||||
"""Deprecated. Use `Span.text.lower()` instead."""
|
||||
def __get__(self):
|
||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
||||
|
||||
property string:
|
||||
@property
|
||||
def string(self):
|
||||
"""Deprecated: Use `Span.text_with_ws` instead."""
|
||||
def __get__(self):
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
property label_:
|
||||
"""RETURNS (unicode): The span's label."""
|
||||
|
@ -643,7 +653,9 @@ cdef class Span:
|
|||
return self.doc.vocab.strings[self.label]
|
||||
|
||||
def __set__(self, unicode label_):
|
||||
self.label = self.doc.vocab.strings.add(label_)
|
||||
if not label_:
|
||||
label_ = ''
|
||||
raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))
|
||||
|
||||
|
||||
cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
||||
|
|
|
@ -219,115 +219,115 @@ cdef class Token:
|
|||
xp = get_array_module(vector)
|
||||
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
|
||||
|
||||
property morph:
|
||||
def __get__(self):
|
||||
return MorphAnalysis.from_id(self.vocab, self.c.morph)
|
||||
@property
|
||||
def morph(self):
|
||||
return MorphAnalysis.from_id(self.vocab, self.c.morph)
|
||||
|
||||
property lex_id:
|
||||
@property
|
||||
def lex_id(self):
|
||||
"""RETURNS (int): Sequential ID of the token's lexical type."""
|
||||
def __get__(self):
|
||||
return self.c.lex.id
|
||||
return self.c.lex.id
|
||||
|
||||
property rank:
|
||||
@property
|
||||
def rank(self):
|
||||
"""RETURNS (int): Sequential ID of the token's lexical type, used to
|
||||
index into tables, e.g. for word vectors."""
|
||||
def __get__(self):
|
||||
return self.c.lex.id
|
||||
return self.c.lex.id
|
||||
|
||||
property string:
|
||||
@property
|
||||
def string(self):
|
||||
"""Deprecated: Use Token.text_with_ws instead."""
|
||||
def __get__(self):
|
||||
return self.text_with_ws
|
||||
return self.text_with_ws
|
||||
|
||||
property text:
|
||||
@property
|
||||
def text(self):
|
||||
"""RETURNS (unicode): The original verbatim text of the token."""
|
||||
def __get__(self):
|
||||
return self.orth_
|
||||
return self.orth_
|
||||
|
||||
property text_with_ws:
|
||||
@property
|
||||
def text_with_ws(self):
|
||||
"""RETURNS (unicode): The text content of the span (with trailing
|
||||
whitespace).
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
|
||||
if self.c.spacy:
|
||||
return orth + " "
|
||||
else:
|
||||
return orth
|
||||
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
|
||||
if self.c.spacy:
|
||||
return orth + " "
|
||||
else:
|
||||
return orth
|
||||
|
||||
property prob:
|
||||
@property
|
||||
def prob(self):
|
||||
"""RETURNS (float): Smoothed log probability estimate of token type."""
|
||||
def __get__(self):
|
||||
return self.c.lex.prob
|
||||
return self.c.lex.prob
|
||||
|
||||
property sentiment:
|
||||
@property
|
||||
def sentiment(self):
|
||||
"""RETURNS (float): A scalar value indicating the positivity or
|
||||
negativity of the token."""
|
||||
def __get__(self):
|
||||
if "sentiment" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["sentiment"](self)
|
||||
return self.c.lex.sentiment
|
||||
if "sentiment" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["sentiment"](self)
|
||||
return self.c.lex.sentiment
|
||||
|
||||
property lang:
|
||||
@property
|
||||
def lang(self):
|
||||
"""RETURNS (uint64): ID of the language of the parent document's
|
||||
vocabulary.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.lex.lang
|
||||
return self.c.lex.lang
|
||||
|
||||
property idx:
|
||||
@property
|
||||
def idx(self):
|
||||
"""RETURNS (int): The character offset of the token within the parent
|
||||
document.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.idx
|
||||
return self.c.idx
|
||||
|
||||
property cluster:
|
||||
@property
|
||||
def cluster(self):
|
||||
"""RETURNS (int): Brown cluster ID."""
|
||||
def __get__(self):
|
||||
return self.c.lex.cluster
|
||||
return self.c.lex.cluster
|
||||
|
||||
property orth:
|
||||
@property
|
||||
def orth(self):
|
||||
"""RETURNS (uint64): ID of the verbatim text content."""
|
||||
def __get__(self):
|
||||
return self.c.lex.orth
|
||||
return self.c.lex.orth
|
||||
|
||||
property lower:
|
||||
@property
|
||||
def lower(self):
|
||||
"""RETURNS (uint64): ID of the lowercase token text."""
|
||||
def __get__(self):
|
||||
return self.c.lex.lower
|
||||
return self.c.lex.lower
|
||||
|
||||
property norm:
|
||||
@property
|
||||
def norm(self):
|
||||
"""RETURNS (uint64): ID of the token's norm, i.e. a normalised form of
|
||||
the token text. Usually set in the language's tokenizer exceptions
|
||||
or norm exceptions.
|
||||
"""
|
||||
def __get__(self):
|
||||
if self.c.norm == 0:
|
||||
return self.c.lex.norm
|
||||
else:
|
||||
return self.c.norm
|
||||
if self.c.norm == 0:
|
||||
return self.c.lex.norm
|
||||
else:
|
||||
return self.c.norm
|
||||
|
||||
property shape:
|
||||
@property
|
||||
def shape(self):
|
||||
"""RETURNS (uint64): ID of the token's shape, a transform of the
|
||||
tokens's string, to show orthographic features (e.g. "Xxxx", "dd").
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.lex.shape
|
||||
return self.c.lex.shape
|
||||
|
||||
property prefix:
|
||||
@property
|
||||
def prefix(self):
|
||||
"""RETURNS (uint64): ID of a length-N substring from the start of the
|
||||
token. Defaults to `N=1`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.lex.prefix
|
||||
return self.c.lex.prefix
|
||||
|
||||
property suffix:
|
||||
@property
|
||||
def suffix(self):
|
||||
"""RETURNS (uint64): ID of a length-N substring from the end of the
|
||||
token. Defaults to `N=3`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.lex.suffix
|
||||
return self.c.lex.suffix
|
||||
|
||||
property lemma:
|
||||
"""RETURNS (uint64): ID of the base form of the word, with no
|
||||
|
@ -367,7 +367,8 @@ cdef class Token:
|
|||
def __set__(self, attr_t label):
|
||||
self.c.dep = label
|
||||
|
||||
property has_vector:
|
||||
@property
|
||||
def has_vector(self):
|
||||
"""A boolean value indicating whether a word vector is associated with
|
||||
the object.
|
||||
|
||||
|
@ -375,14 +376,14 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'has_vector' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["has_vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return True
|
||||
return self.vocab.has_vector(self.c.lex.orth)
|
||||
if "has_vector" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["has_vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return True
|
||||
return self.vocab.has_vector(self.c.lex.orth)
|
||||
|
||||
property vector:
|
||||
@property
|
||||
def vector(self):
|
||||
"""A real-valued meaning representation.
|
||||
|
||||
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
||||
|
@ -390,28 +391,28 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return self.doc.tensor[self.i]
|
||||
else:
|
||||
return self.vocab.get_vector(self.c.lex.orth)
|
||||
if "vector" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return self.doc.tensor[self.i]
|
||||
else:
|
||||
return self.vocab.get_vector(self.c.lex.orth)
|
||||
|
||||
property vector_norm:
|
||||
@property
|
||||
def vector_norm(self):
|
||||
"""The L2 norm of the token's vector representation.
|
||||
|
||||
RETURNS (float): The L2 norm of the vector representation.
|
||||
|
||||
DOCS: https://spacy.io/api/token#vector_norm
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector_norm' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["vector_norm"](self)
|
||||
vector = self.vector
|
||||
return numpy.sqrt((vector ** 2).sum())
|
||||
if "vector_norm" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["vector_norm"](self)
|
||||
vector = self.vector
|
||||
return numpy.sqrt((vector ** 2).sum())
|
||||
|
||||
property n_lefts:
|
||||
@property
|
||||
def n_lefts(self):
|
||||
"""The number of leftward immediate children of the word, in the
|
||||
syntactic dependency parse.
|
||||
|
||||
|
@ -420,10 +421,10 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#n_lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.l_kids
|
||||
return self.c.l_kids
|
||||
|
||||
property n_rights:
|
||||
@property
|
||||
def n_rights(self):
|
||||
"""The number of rightward immediate children of the word, in the
|
||||
syntactic dependency parse.
|
||||
|
||||
|
@ -432,15 +433,14 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#n_rights
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.r_kids
|
||||
return self.c.r_kids
|
||||
|
||||
property sent:
|
||||
@property
|
||||
def sent(self):
|
||||
"""RETURNS (Span): The sentence span that the token is a part of."""
|
||||
def __get__(self):
|
||||
if 'sent' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["sent"](self)
|
||||
return self.doc[self.i : self.i+1].sent
|
||||
if 'sent' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["sent"](self)
|
||||
return self.doc[self.i : self.i+1].sent
|
||||
|
||||
property sent_start:
|
||||
def __get__(self):
|
||||
|
@ -484,7 +484,8 @@ cdef class Token:
|
|||
else:
|
||||
raise ValueError(Errors.E044.format(value=value))
|
||||
|
||||
property lefts:
|
||||
@property
|
||||
def lefts(self):
|
||||
"""The leftward immediate children of the word, in the syntactic
|
||||
dependency parse.
|
||||
|
||||
|
@ -492,19 +493,19 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef int nr_iter = 0
|
||||
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
|
||||
while ptr < self.c:
|
||||
if ptr + ptr.head == self.c:
|
||||
yield self.doc[ptr - (self.c - self.i)]
|
||||
ptr += 1
|
||||
nr_iter += 1
|
||||
# This is ugly, but it's a way to guard out infinite loops
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
|
||||
cdef int nr_iter = 0
|
||||
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
|
||||
while ptr < self.c:
|
||||
if ptr + ptr.head == self.c:
|
||||
yield self.doc[ptr - (self.c - self.i)]
|
||||
ptr += 1
|
||||
nr_iter += 1
|
||||
# This is ugly, but it's a way to guard out infinite loops
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
|
||||
|
||||
property rights:
|
||||
@property
|
||||
def rights(self):
|
||||
"""The rightward immediate children of the word, in the syntactic
|
||||
dependency parse.
|
||||
|
||||
|
@ -512,33 +513,33 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#rights
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
|
||||
tokens = []
|
||||
cdef int nr_iter = 0
|
||||
while ptr > self.c:
|
||||
if ptr + ptr.head == self.c:
|
||||
tokens.append(self.doc[ptr - (self.c - self.i)])
|
||||
ptr -= 1
|
||||
nr_iter += 1
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr="token.rights"))
|
||||
tokens.reverse()
|
||||
for t in tokens:
|
||||
yield t
|
||||
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
|
||||
tokens = []
|
||||
cdef int nr_iter = 0
|
||||
while ptr > self.c:
|
||||
if ptr + ptr.head == self.c:
|
||||
tokens.append(self.doc[ptr - (self.c - self.i)])
|
||||
ptr -= 1
|
||||
nr_iter += 1
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr="token.rights"))
|
||||
tokens.reverse()
|
||||
for t in tokens:
|
||||
yield t
|
||||
|
||||
property children:
|
||||
@property
|
||||
def children(self):
|
||||
"""A sequence of the token's immediate syntactic children.
|
||||
|
||||
YIELDS (Token): A child token such that `child.head==self`.
|
||||
|
||||
DOCS: https://spacy.io/api/token#children
|
||||
"""
|
||||
def __get__(self):
|
||||
yield from self.lefts
|
||||
yield from self.rights
|
||||
yield from self.lefts
|
||||
yield from self.rights
|
||||
|
||||
property subtree:
|
||||
@property
|
||||
def subtree(self):
|
||||
"""A sequence containing the token and all the token's syntactic
|
||||
descendants.
|
||||
|
||||
|
@ -547,30 +548,30 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#subtree
|
||||
"""
|
||||
def __get__(self):
|
||||
for word in self.lefts:
|
||||
yield from word.subtree
|
||||
yield self
|
||||
for word in self.rights:
|
||||
yield from word.subtree
|
||||
for word in self.lefts:
|
||||
yield from word.subtree
|
||||
yield self
|
||||
for word in self.rights:
|
||||
yield from word.subtree
|
||||
|
||||
property left_edge:
|
||||
@property
|
||||
def left_edge(self):
|
||||
"""The leftmost token of this token's syntactic descendents.
|
||||
|
||||
RETURNS (Token): The first token such that `self.is_ancestor(token)`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.doc[self.c.l_edge]
|
||||
return self.doc[self.c.l_edge]
|
||||
|
||||
property right_edge:
|
||||
@property
|
||||
def right_edge(self):
|
||||
"""The rightmost token of this token's syntactic descendents.
|
||||
|
||||
RETURNS (Token): The last token such that `self.is_ancestor(token)`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.doc[self.c.r_edge]
|
||||
return self.doc[self.c.r_edge]
|
||||
|
||||
property ancestors:
|
||||
@property
|
||||
def ancestors(self):
|
||||
"""A sequence of this token's syntactic ancestors.
|
||||
|
||||
YIELDS (Token): A sequence of ancestor tokens such that
|
||||
|
@ -578,15 +579,14 @@ cdef class Token:
|
|||
|
||||
DOCS: https://spacy.io/api/token#ancestors
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef const TokenC* head_ptr = self.c
|
||||
# Guard against infinite loop, no token can have
|
||||
# more ancestors than tokens in the tree.
|
||||
cdef int i = 0
|
||||
while head_ptr.head != 0 and i < self.doc.length:
|
||||
head_ptr += head_ptr.head
|
||||
yield self.doc[head_ptr - (self.c - self.i)]
|
||||
i += 1
|
||||
cdef const TokenC* head_ptr = self.c
|
||||
# Guard against infinite loop, no token can have
|
||||
# more ancestors than tokens in the tree.
|
||||
cdef int i = 0
|
||||
while head_ptr.head != 0 and i < self.doc.length:
|
||||
head_ptr += head_ptr.head
|
||||
yield self.doc[head_ptr - (self.c - self.i)]
|
||||
i += 1
|
||||
|
||||
def is_ancestor(self, descendant):
|
||||
"""Check whether this token is a parent, grandparent, etc. of another
|
||||
|
@ -690,23 +690,31 @@ cdef class Token:
|
|||
# Set new head
|
||||
self.c.head = rel_newhead_i
|
||||
|
||||
property conjuncts:
|
||||
@property
|
||||
def conjuncts(self):
|
||||
"""A sequence of coordinated tokens, including the token itself.
|
||||
|
||||
YIELDS (Token): A coordinated token.
|
||||
RETURNS (tuple): The coordinated tokens.
|
||||
|
||||
DOCS: https://spacy.io/api/token#conjuncts
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef Token word
|
||||
if "conjuncts" in self.doc.user_token_hooks:
|
||||
yield from self.doc.user_token_hooks["conjuncts"](self)
|
||||
cdef Token word, child
|
||||
if "conjuncts" in self.doc.user_token_hooks:
|
||||
return tuple(self.doc.user_token_hooks["conjuncts"](self))
|
||||
start = self
|
||||
while start.i != start.head.i:
|
||||
if start.dep == conj:
|
||||
start = start.head
|
||||
else:
|
||||
if self.dep != conj:
|
||||
for word in self.rights:
|
||||
if word.dep == conj:
|
||||
yield word
|
||||
yield from word.conjuncts
|
||||
break
|
||||
queue = [start]
|
||||
output = [start]
|
||||
for word in queue:
|
||||
for child in word.rights:
|
||||
if child.c.dep == conj:
|
||||
output.append(child)
|
||||
queue.append(child)
|
||||
return tuple([w for w in output if w.i != self.i])
|
||||
|
||||
property ent_type:
|
||||
"""RETURNS (uint64): Named entity type."""
|
||||
|
@ -716,15 +724,6 @@ cdef class Token:
|
|||
def __set__(self, ent_type):
|
||||
self.c.ent_type = ent_type
|
||||
|
||||
property ent_iob:
|
||||
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
|
||||
is assigned.
|
||||
|
||||
RETURNS (uint64): IOB code of named entity tag.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.ent_iob
|
||||
|
||||
property ent_type_:
|
||||
"""RETURNS (unicode): Named entity type."""
|
||||
def __get__(self):
|
||||
|
@ -733,16 +732,25 @@ cdef class Token:
|
|||
def __set__(self, ent_type):
|
||||
self.c.ent_type = self.vocab.strings.add(ent_type)
|
||||
|
||||
property ent_iob_:
|
||||
@property
|
||||
def ent_iob(self):
|
||||
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
|
||||
is assigned.
|
||||
|
||||
RETURNS (uint64): IOB code of named entity tag.
|
||||
"""
|
||||
return self.c.ent_iob
|
||||
|
||||
@property
|
||||
def ent_iob_(self):
|
||||
"""IOB code of named entity tag. "B" means the token begins an entity,
|
||||
"I" means it is inside an entity, "O" means it is outside an entity,
|
||||
and "" means no entity tag is set.
|
||||
|
||||
RETURNS (unicode): IOB code of named entity tag.
|
||||
"""
|
||||
def __get__(self):
|
||||
iob_strings = ("", "I", "O", "B")
|
||||
return iob_strings[self.c.ent_iob]
|
||||
iob_strings = ("", "I", "O", "B")
|
||||
return iob_strings[self.c.ent_iob]
|
||||
|
||||
property ent_id:
|
||||
"""RETURNS (uint64): ID of the entity the token is an instance of,
|
||||
|
@ -764,26 +772,25 @@ cdef class Token:
|
|||
def __set__(self, name):
|
||||
self.c.ent_id = self.vocab.strings.add(name)
|
||||
|
||||
property whitespace_:
|
||||
"""RETURNS (unicode): The trailing whitespace character, if present.
|
||||
"""
|
||||
def __get__(self):
|
||||
return " " if self.c.spacy else ""
|
||||
@property
|
||||
def whitespace_(self):
|
||||
"""RETURNS (unicode): The trailing whitespace character, if present."""
|
||||
return " " if self.c.spacy else ""
|
||||
|
||||
property orth_:
|
||||
@property
|
||||
def orth_(self):
|
||||
"""RETURNS (unicode): Verbatim text content (identical to
|
||||
`Token.text`). Exists mostly for consistency with the other
|
||||
attributes.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.orth]
|
||||
return self.vocab.strings[self.c.lex.orth]
|
||||
|
||||
property lower_:
|
||||
@property
|
||||
def lower_(self):
|
||||
"""RETURNS (unicode): The lowercase token text. Equivalent to
|
||||
`Token.text.lower()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.lower]
|
||||
return self.vocab.strings[self.c.lex.lower]
|
||||
|
||||
property norm_:
|
||||
"""RETURNS (unicode): The token's norm, i.e. a normalised form of the
|
||||
|
@ -796,33 +803,33 @@ cdef class Token:
|
|||
def __set__(self, unicode norm_):
|
||||
self.c.norm = self.vocab.strings.add(norm_)
|
||||
|
||||
property shape_:
|
||||
@property
|
||||
def shape_(self):
|
||||
"""RETURNS (unicode): Transform of the tokens's string, to show
|
||||
orthographic features. For example, "Xxxx" or "dd".
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.shape]
|
||||
return self.vocab.strings[self.c.lex.shape]
|
||||
|
||||
property prefix_:
|
||||
@property
|
||||
def prefix_(self):
|
||||
"""RETURNS (unicode): A length-N substring from the start of the token.
|
||||
Defaults to `N=1`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.prefix]
|
||||
return self.vocab.strings[self.c.lex.prefix]
|
||||
|
||||
property suffix_:
|
||||
@property
|
||||
def suffix_(self):
|
||||
"""RETURNS (unicode): A length-N substring from the end of the token.
|
||||
Defaults to `N=3`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.suffix]
|
||||
return self.vocab.strings[self.c.lex.suffix]
|
||||
|
||||
property lang_:
|
||||
@property
|
||||
def lang_(self):
|
||||
"""RETURNS (unicode): Language of the parent document's vocabulary,
|
||||
e.g. 'en'.
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.vocab.strings[self.c.lex.lang]
|
||||
return self.vocab.strings[self.c.lex.lang]
|
||||
|
||||
property lemma_:
|
||||
"""RETURNS (unicode): The token lemma, i.e. the base form of the word,
|
||||
|
@ -861,110 +868,110 @@ cdef class Token:
|
|||
def __set__(self, unicode label):
|
||||
self.c.dep = self.vocab.strings.add(label)
|
||||
|
||||
property is_oov:
|
||||
@property
|
||||
def is_oov(self):
|
||||
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
|
||||
|
||||
property is_stop:
|
||||
@property
|
||||
def is_stop(self):
|
||||
"""RETURNS (bool): Whether the token is a stop word, i.e. part of a
|
||||
"stop list" defined by the language data.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
|
||||
|
||||
property is_alpha:
|
||||
@property
|
||||
def is_alpha(self):
|
||||
"""RETURNS (bool): Whether the token consists of alpha characters.
|
||||
Equivalent to `token.text.isalpha()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
|
||||
|
||||
property is_ascii:
|
||||
@property
|
||||
def is_ascii(self):
|
||||
"""RETURNS (bool): Whether the token consists of ASCII characters.
|
||||
Equivalent to `[any(ord(c) >= 128 for c in token.text)]`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
|
||||
|
||||
property is_digit:
|
||||
@property
|
||||
def is_digit(self):
|
||||
"""RETURNS (bool): Whether the token consists of digits. Equivalent to
|
||||
`token.text.isdigit()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
|
||||
|
||||
property is_lower:
|
||||
@property
|
||||
def is_lower(self):
|
||||
"""RETURNS (bool): Whether the token is in lowercase. Equivalent to
|
||||
`token.text.islower()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
|
||||
|
||||
property is_upper:
|
||||
@property
|
||||
def is_upper(self):
|
||||
"""RETURNS (bool): Whether the token is in uppercase. Equivalent to
|
||||
`token.text.isupper()`
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
|
||||
|
||||
property is_title:
|
||||
@property
|
||||
def is_title(self):
|
||||
"""RETURNS (bool): Whether the token is in titlecase. Equivalent to
|
||||
`token.text.istitle()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
|
||||
|
||||
property is_punct:
|
||||
@property
|
||||
def is_punct(self):
|
||||
"""RETURNS (bool): Whether the token is punctuation."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
|
||||
|
||||
property is_space:
|
||||
@property
|
||||
def is_space(self):
|
||||
"""RETURNS (bool): Whether the token consists of whitespace characters.
|
||||
Equivalent to `token.text.isspace()`.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
|
||||
|
||||
property is_bracket:
|
||||
@property
|
||||
def is_bracket(self):
|
||||
"""RETURNS (bool): Whether the token is a bracket."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
|
||||
|
||||
property is_quote:
|
||||
@property
|
||||
def is_quote(self):
|
||||
"""RETURNS (bool): Whether the token is a quotation mark."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
|
||||
|
||||
property is_left_punct:
|
||||
@property
|
||||
def is_left_punct(self):
|
||||
"""RETURNS (bool): Whether the token is a left punctuation mark."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
|
||||
|
||||
property is_right_punct:
|
||||
@property
|
||||
def is_right_punct(self):
|
||||
"""RETURNS (bool): Whether the token is a right punctuation mark."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
||||
|
||||
property is_currency:
|
||||
@property
|
||||
def is_currency(self):
|
||||
"""RETURNS (bool): Whether the token is a currency symbol."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
|
||||
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
|
||||
|
||||
property like_url:
|
||||
@property
|
||||
def like_url(self):
|
||||
"""RETURNS (bool): Whether the token resembles a URL."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
|
||||
|
||||
property like_num:
|
||||
@property
|
||||
def like_num(self):
|
||||
"""RETURNS (bool): Whether the token resembles a number, e.g. "10.9",
|
||||
"10", "ten", etc.
|
||||
"""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
|
||||
|
||||
property like_email:
|
||||
@property
|
||||
def like_email(self):
|
||||
"""RETURNS (bool): Whether the token resembles an email address."""
|
||||
def __get__(self):
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
|
||||
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
|
||||
|
|
|
@ -2,11 +2,13 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import functools
|
||||
import copy
|
||||
|
||||
from ..errors import Errors
|
||||
|
||||
|
||||
class Underscore(object):
|
||||
mutable_types = (dict, list, set)
|
||||
doc_extensions = {}
|
||||
span_extensions = {}
|
||||
token_extensions = {}
|
||||
|
@ -32,7 +34,15 @@ class Underscore(object):
|
|||
elif method is not None:
|
||||
return functools.partial(method, self._obj)
|
||||
else:
|
||||
return self._doc.user_data.get(self._get_key(name), default)
|
||||
key = self._get_key(name)
|
||||
if key in self._doc.user_data:
|
||||
return self._doc.user_data[key]
|
||||
elif isinstance(default, self.mutable_types):
|
||||
# Handle mutable default arguments (see #2581)
|
||||
new_default = copy.copy(default)
|
||||
self.__setattr__(name, new_default)
|
||||
return new_default
|
||||
return default
|
||||
|
||||
def __setattr__(self, name, value):
|
||||
if name not in self._extensions:
|
||||
|
|
|
@ -25,7 +25,7 @@ except ImportError:
|
|||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||
from .compat import import_file
|
||||
from .errors import Errors
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
|
||||
|
||||
LANGUAGES = {}
|
||||
|
@ -38,6 +38,18 @@ def set_env_log(value):
|
|||
_PRINT_ENV = value
|
||||
|
||||
|
||||
def lang_class_is_loaded(lang):
|
||||
"""Check whether a Language class is already loaded. Language classes are
|
||||
loaded lazily, to avoid expensive setup code associated with the language
|
||||
data.
|
||||
|
||||
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||
RETURNS (bool): Whether a Language class has been loaded.
|
||||
"""
|
||||
global LANGUAGES
|
||||
return lang in LANGUAGES
|
||||
|
||||
|
||||
def get_lang_class(lang):
|
||||
"""Import and load a Language class.
|
||||
|
||||
|
@ -565,7 +577,8 @@ def itershuffle(iterable, bufsize=1000):
|
|||
def to_bytes(getters, exclude):
|
||||
serialized = OrderedDict()
|
||||
for key, getter in getters.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
serialized[key] = getter()
|
||||
return srsly.msgpack_dumps(serialized)
|
||||
|
||||
|
@ -573,7 +586,8 @@ def to_bytes(getters, exclude):
|
|||
def from_bytes(bytes_data, setters, exclude):
|
||||
msg = srsly.msgpack_loads(bytes_data)
|
||||
for key, setter in setters.items():
|
||||
if key not in exclude and key in msg:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude and key in msg:
|
||||
setter(msg[key])
|
||||
return msg
|
||||
|
||||
|
@ -583,7 +597,8 @@ def to_disk(path, writers, exclude):
|
|||
if not path.exists():
|
||||
path.mkdir()
|
||||
for key, writer in writers.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
writer(path / key)
|
||||
return path
|
||||
|
||||
|
@ -591,7 +606,8 @@ def to_disk(path, writers, exclude):
|
|||
def from_disk(path, readers, exclude):
|
||||
path = ensure_path(path)
|
||||
for key, reader in readers.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
reader(path / key)
|
||||
return path
|
||||
|
||||
|
@ -677,6 +693,23 @@ def validate_json(data, validator):
|
|||
return errors
|
||||
|
||||
|
||||
def get_serialization_exclude(serializers, exclude, kwargs):
|
||||
"""Helper function to validate serialization args and manage transition from
|
||||
keyword arguments (pre v2.1) to exclude argument.
|
||||
"""
|
||||
exclude = list(exclude)
|
||||
# Split to support file names like meta.json
|
||||
options = [name.split(".")[0] for name in serializers]
|
||||
for key, value in kwargs.items():
|
||||
if key in ("vocab",) and value is False:
|
||||
deprecation_warning(Warnings.W015.format(arg=key))
|
||||
exclude.append(key)
|
||||
elif key.split(".")[0] in options:
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
# TODO: user warning?
|
||||
return exclude
|
||||
|
||||
|
||||
class SimpleFrozenDict(dict):
|
||||
"""Simplified implementation of a frozen dict, mainly used as default
|
||||
function or method argument (for arguments that should default to empty
|
||||
|
@ -696,14 +729,14 @@ class SimpleFrozenDict(dict):
|
|||
class DummyTokenizer(object):
|
||||
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
||||
# allow serialization (see #1557)
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
return b""
|
||||
|
||||
def from_bytes(self, _bytes_data, **exclude):
|
||||
def from_bytes(self, _bytes_data, **kwargs):
|
||||
return self
|
||||
|
||||
def to_disk(self, _path, **exclude):
|
||||
def to_disk(self, _path, **kwargs):
|
||||
return None
|
||||
|
||||
def from_disk(self, _path, **exclude):
|
||||
def from_disk(self, _path, **kwargs):
|
||||
return self
|
||||
|
|
|
@ -377,11 +377,11 @@ cdef class Vectors:
|
|||
self.add(key, row=i)
|
||||
return strings
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode / Path): A path to a directory, which will be created if
|
||||
it doesn't exists. Either a string or a Path-like object.
|
||||
it doesn't exists.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#to_disk
|
||||
"""
|
||||
|
@ -394,9 +394,9 @@ cdef class Vectors:
|
|||
("vectors", lambda p: save_array(self.data, p.open("wb"))),
|
||||
("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
|
||||
))
|
||||
return util.to_disk(path, serializers, exclude)
|
||||
return util.to_disk(path, serializers, [])
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
|
@ -428,13 +428,13 @@ cdef class Vectors:
|
|||
("keys", load_keys),
|
||||
("vectors", load_vectors),
|
||||
))
|
||||
util.from_disk(path, serializers, exclude)
|
||||
util.from_disk(path, serializers, [])
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Vectors` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#to_bytes
|
||||
|
@ -444,17 +444,18 @@ cdef class Vectors:
|
|||
return self.data.to_bytes()
|
||||
else:
|
||||
return srsly.msgpack_dumps(self.data)
|
||||
|
||||
serializers = OrderedDict((
|
||||
("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
|
||||
("vectors", serialize_weights)
|
||||
))
|
||||
return util.to_bytes(serializers, exclude)
|
||||
return util.to_bytes(serializers, [])
|
||||
|
||||
def from_bytes(self, data, **exclude):
|
||||
def from_bytes(self, data, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vectors): The `Vectors` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_bytes
|
||||
|
@ -469,5 +470,5 @@ cdef class Vectors:
|
|||
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
|
||||
("vectors", deserialize_weights)
|
||||
))
|
||||
util.from_bytes(data, deserializers, exclude)
|
||||
util.from_bytes(data, deserializers, [])
|
||||
return self
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
# coding: utf8
|
||||
# cython: profile=True
|
||||
from __future__ import unicode_literals
|
||||
from libc.string cimport memcpy
|
||||
|
||||
import numpy
|
||||
import srsly
|
||||
|
@ -59,12 +60,23 @@ cdef class Vocab:
|
|||
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
|
||||
self.vectors = Vectors()
|
||||
|
||||
property lang:
|
||||
@property
|
||||
def lang(self):
|
||||
langfunc = None
|
||||
if self.lex_attr_getters:
|
||||
langfunc = self.lex_attr_getters.get(LANG, None)
|
||||
return langfunc("_") if langfunc else ""
|
||||
|
||||
property writing_system:
|
||||
"""A dict with information about the language's writing system. To get
|
||||
the data, we use the vocab.lang property to fetch the Language class.
|
||||
If the Language class is not loaded, an empty dict is returned.
|
||||
"""
|
||||
def __get__(self):
|
||||
langfunc = None
|
||||
if self.lex_attr_getters:
|
||||
langfunc = self.lex_attr_getters.get(LANG, None)
|
||||
return langfunc("_") if langfunc else ""
|
||||
if not util.lang_class_is_loaded(self.lang):
|
||||
return {}
|
||||
lang_class = util.get_lang_class(self.lang)
|
||||
return dict(lang_class.Defaults.writing_system)
|
||||
|
||||
def __len__(self):
|
||||
"""The current number of lexemes stored.
|
||||
|
@ -396,47 +408,57 @@ cdef class Vocab:
|
|||
orth = self.strings.add(orth)
|
||||
return orth in self.vectors
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
it doesn't exist.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
self.strings.to_disk(path / "strings.json")
|
||||
with (path / "lexemes.bin").open('wb') as file_:
|
||||
file_.write(self.lexemes_to_bytes())
|
||||
if self.vectors is not None:
|
||||
setters = ["strings", "lexemes", "vectors"]
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.to_disk(path / "strings.json")
|
||||
if "lexemes" not in exclude:
|
||||
with (path / "lexemes.bin").open("wb") as file_:
|
||||
file_.write(self.lexemes_to_bytes())
|
||||
if "vectors" not in "exclude" and self.vectors is not None:
|
||||
self.vectors.to_disk(path)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
path (unicode or Path): A path to a directory.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vocab): The modified `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
self.strings.from_disk(path / "strings.json")
|
||||
with (path / "lexemes.bin").open("rb") as file_:
|
||||
self.lexemes_from_bytes(file_.read())
|
||||
if self.vectors is not None:
|
||||
self.vectors.from_disk(path, exclude="strings.json")
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
getters = ["strings", "lexemes", "vectors"]
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.from_disk(path / "strings.json") # TODO: add exclude?
|
||||
if "lexemes" not in exclude:
|
||||
with (path / "lexemes.bin").open("rb") as file_:
|
||||
self.lexemes_from_bytes(file_.read())
|
||||
if "vectors" not in exclude:
|
||||
if self.vectors is not None:
|
||||
self.vectors.from_disk(path, exclude=["strings"])
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_bytes
|
||||
|
@ -452,13 +474,14 @@ cdef class Vocab:
|
|||
("lexemes", lambda: self.lexemes_to_bytes()),
|
||||
("vectors", deserialize_vectors)
|
||||
))
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
return util.to_bytes(getters, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vocab): The `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#from_bytes
|
||||
|
@ -468,11 +491,13 @@ cdef class Vocab:
|
|||
return None
|
||||
else:
|
||||
return self.vectors.from_bytes(b)
|
||||
|
||||
setters = OrderedDict((
|
||||
("strings", lambda b: self.strings.from_bytes(b)),
|
||||
("lexemes", lambda b: self.lexemes_from_bytes(b)),
|
||||
("vectors", lambda b: serialize_vectors(b))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, setters, exclude)
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
|
@ -518,7 +543,10 @@ cdef class Vocab:
|
|||
for j in range(sizeof(lex_data.data)):
|
||||
lex_data.data[j] = bytes_ptr[i+j]
|
||||
Lexeme.c_from_bytes(lexeme, lex_data)
|
||||
|
||||
prev_entry = self._by_orth.get(lexeme.orth)
|
||||
if prev_entry != NULL:
|
||||
memcpy(prev_entry, lexeme, sizeof(LexemeC))
|
||||
continue
|
||||
ptr = self.strings._map.get(lexeme.orth)
|
||||
if ptr == NULL:
|
||||
continue
|
||||
|
|
27
website/.eslintrc
Normal file
27
website/.eslintrc
Normal file
|
@ -0,0 +1,27 @@
|
|||
{
|
||||
"extends": ["standard", "prettier"],
|
||||
"plugins": ["standard", "react", "react-hooks"],
|
||||
"rules": {
|
||||
"no-var": "error",
|
||||
"no-unused-vars": 1,
|
||||
"arrow-spacing": ["error", { "before": true, "after": true }],
|
||||
"indent": ["error", 4],
|
||||
"semi": ["error", "never"],
|
||||
"arrow-parens": ["error", "as-needed"],
|
||||
"standard/object-curly-even-spacing": ["error", "either"],
|
||||
"standard/array-bracket-even-spacing": ["error", "either"],
|
||||
"standard/computed-property-even-spacing": ["error", "even"],
|
||||
"standard/no-callback-literal": ["error", ["cb", "callback"]],
|
||||
"react/jsx-uses-react": "error",
|
||||
"react/jsx-uses-vars": "error",
|
||||
"react-hooks/rules-of-hooks": "error",
|
||||
"react-hooks/exhaustive-deps": "warn"
|
||||
},
|
||||
"parser": "babel-eslint",
|
||||
"parserOptions": {
|
||||
"ecmaVersion": 8
|
||||
},
|
||||
"env": {
|
||||
"browser": true
|
||||
}
|
||||
}
|
|
@ -78,7 +78,7 @@ assigned by spaCy's [models](/models). The individual mapping is specific to the
|
|||
training corpus and can be defined in the respective language data's
|
||||
[`tag_map.py`](/usage/adding-languages#tag-map).
|
||||
|
||||
<Accordion title="Universal Part-of-speech Tags">
|
||||
<Accordion title="Universal Part-of-speech Tags" id="pos-universal">
|
||||
|
||||
spaCy also maps all language-specific part-of-speech tags to a small, fixed set
|
||||
of word type tags following the
|
||||
|
@ -269,7 +269,7 @@ This section lists the syntactic dependency labels assigned by spaCy's
|
|||
[models](/models). The individual labels are language-specific and depend on the
|
||||
training corpus.
|
||||
|
||||
<Accordion title="Universal Dependency Labels">
|
||||
<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">
|
||||
|
||||
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
|
||||
used in all languages trained on Universal Dependency Corpora.
|
||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
|||
> parser.to_disk("/path/to/parser")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## DependencyParser.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> parser.from_bytes(parser_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.labels {#labels tag="property"}
|
||||
|
||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = parser.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
|
|
|
@ -237,7 +237,7 @@ attribute ID.
|
|||
> from spacy.attrs import ORTH
|
||||
> doc = nlp(u"apple apple orange banana")
|
||||
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
||||
> doc.to_array([attrs.ORTH])
|
||||
> doc.to_array([ORTH])
|
||||
> # array([[11880], [11880], [7561], [12800]])
|
||||
> ```
|
||||
|
||||
|
@ -349,11 +349,12 @@ array of attributes.
|
|||
> assert doc[0].pos_ == doc2[0].pos_
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------------------------------- | ----------------------------- |
|
||||
| `attrs` | list | A list of attribute ID ints. |
|
||||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||||
| **RETURNS** | `Doc` | Itself. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
|
||||
| `attrs` | list | A list of attribute ID ints. |
|
||||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | Itself. |
|
||||
|
||||
## Doc.to_disk {#to_disk tag="method" new="2"}
|
||||
|
||||
|
@ -365,9 +366,10 @@ Save the current state to a directory.
|
|||
> doc.to_disk("/path/to/doc")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -384,6 +386,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||||
|
||||
## Doc.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -397,9 +400,10 @@ Serialize, i.e. export the document contents to a binary string.
|
|||
> doc_bytes = doc.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||||
|
||||
## Doc.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -416,10 +420,11 @@ Deserialize, i.e. import the document contents from a binary string.
|
|||
> assert doc.text == doc2.text
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------ |
|
||||
| `data` | bytes | The string to load from. |
|
||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `data` | bytes | The string to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
||||
|
||||
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
||||
|
||||
|
@ -640,20 +645,45 @@ The L2 norm of the document's vector representation.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `text` | unicode | A unicode representation of the document text. |
|
||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||
| `vocab` | `Vocab` | The store of lexical types. |
|
||||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
| Name | Type | Description |
|
||||
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `text` | unicode | A unicode representation of the document text. |
|
||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||
| `vocab` | `Vocab` | The store of lexical types. |
|
||||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = doc.to_bytes(exclude=["text", "tensor"])
|
||||
> doc.from_disk("./doc.bin", exclude=["user_data"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------------------ | --------------------------------------------- |
|
||||
| `text` | The value of the `Doc.text` attribute. |
|
||||
| `sentiment` | The value of the `Doc.sentiment` attribute. |
|
||||
| `tensor` | The value of the `Doc.tensor` attribute. |
|
||||
| `user_data` | The value of the `Doc.user_data` dictionary. |
|
||||
| `user_data_keys` | The keys of the `Doc.user_data` dictionary. |
|
||||
| `user_data_values` | The values of the `Doc.user_data` dictionary. |
|
||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
|||
> ner.to_disk("/path/to/ner")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## EntityRecognizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> ner.from_bytes(ner_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.labels {#labels tag="property"}
|
||||
|
||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = ner.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
|
|
|
@ -91,13 +91,14 @@ multiprocessing.
|
|||
> assert doc.is_parsed
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `texts` | - | A sequence of unicode objects. |
|
||||
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
|
||||
| `batch_size` | int | The number of texts to buffer. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **YIELDS** | `Doc` | Documents in the order of the original text. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `texts` | - | A sequence of unicode objects. |
|
||||
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
|
||||
| `batch_size` | int | The number of texts to buffer. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||
| **YIELDS** | `Doc` | Documents in the order of the original text. |
|
||||
|
||||
## Language.update {#update tag="method"}
|
||||
|
||||
|
@ -112,13 +113,14 @@ Update the models in the pipeline.
|
|||
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
|
||||
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
|
||||
| `drop` | float | The dropout rate. |
|
||||
| `sgd` | callable | An optimizer. |
|
||||
| **RETURNS** | dict | Results from the update. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
|
||||
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
|
||||
| `drop` | float | The dropout rate. |
|
||||
| `sgd` | callable | An optimizer. |
|
||||
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||
| **RETURNS** | dict | Results from the update. |
|
||||
|
||||
## Language.begin_training {#begin_training tag="method"}
|
||||
|
||||
|
@ -130,11 +132,12 @@ Allocate models, pre-process training data and acquire an optimizer.
|
|||
> optimizer = nlp.begin_training(gold_tuples)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | -------- | ---------------------------- |
|
||||
| `gold_tuples` | iterable | Gold-standard training data. |
|
||||
| `**cfg` | - | Config parameters. |
|
||||
| **RETURNS** | callable | An optimizer. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
|
||||
| `gold_tuples` | iterable | Gold-standard training data. |
|
||||
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||
| `**cfg` | - | Config parameters (sent to all components). |
|
||||
| **RETURNS** | callable | An optimizer. |
|
||||
|
||||
## Language.use_params {#use_params tag="contextmanager, method"}
|
||||
|
||||
|
@ -327,7 +330,7 @@ the model**.
|
|||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Language.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -349,22 +352,22 @@ loaded object.
|
|||
> nlp = English().from_disk("/path/to/en_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
|
||||
to improve consistency across classes. Pipeline components to prevent from being
|
||||
loaded can now be added as a list to `disable`, instead of specifying one
|
||||
keyword argument per component.
|
||||
loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1),
|
||||
instead of specifying one keyword argument per component.
|
||||
|
||||
```diff
|
||||
- nlp = spacy.load("en", tagger=False, entity=False)
|
||||
+ nlp = English().from_disk("/model", disable=["tagger', 'ner"])
|
||||
+ nlp = English().from_disk("/model", exclude=["tagger", "ner"])
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
@ -379,10 +382,10 @@ Serialize the current state to a binary string.
|
|||
> nlp_bytes = nlp.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------------------------------------------- |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
||||
|
||||
## Language.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -400,20 +403,21 @@ available to the loaded object.
|
|||
> nlp2.from_bytes(nlp_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------- | --------------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | The `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------- | ----------------------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The `Language` object. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
Pipeline components to prevent from being loaded can now be added as a list to
|
||||
`disable`, instead of specifying one keyword argument per component.
|
||||
`disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument
|
||||
per component.
|
||||
|
||||
```diff
|
||||
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
|
||||
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"])
|
||||
+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
@ -437,3 +441,23 @@ Pipeline components to prevent from being loaded can now be added as a list to
|
|||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
|
||||
> nlp.from_disk("./model-data", exclude=["ner"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `tokenizer` | Tokenization rules and exceptions. |
|
||||
| `meta` | The meta data, available as `Language.meta`. |
|
||||
| ... | String names of pipeline components, e.g. `"ner"`. |
|
||||
|
|
|
@ -316,6 +316,22 @@ taken.
|
|||
| ----------- | ------- | --------------- |
|
||||
| **RETURNS** | `Token` | The root token. |
|
||||
|
||||
## Span.conjuncts {#conjuncts tag="property" model="parser"}
|
||||
|
||||
A tuple of tokens coordinated to `span.root`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> doc = nlp(u"I like apples and oranges")
|
||||
> apples_conjuncts = doc[2:3].conjuncts
|
||||
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------- |
|
||||
| **RETURNS** | `tuple` | The coordinated tokens. |
|
||||
|
||||
## Span.lefts {#lefts tag="property" model="parser"}
|
||||
|
||||
Tokens that are to the left of the span, whose heads are within the span.
|
||||
|
|
|
@ -151,10 +151,9 @@ Serialize the current state to a binary string.
|
|||
> store_bytes = stringstore.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------ |
|
||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
||||
|
||||
## StringStore.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -168,11 +167,10 @@ Load state from a binary string.
|
|||
> new_store = StringStore().from_bytes(store_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | ------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
||||
|
||||
## Utilities {#util}
|
||||
|
||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
|||
> tagger.to_disk("/path/to/tagger")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tagger.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
||||
|
||||
## Tagger.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
||||
|
||||
## Tagger.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> tagger.from_bytes(tagger_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | -------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | -------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
||||
|
||||
## Tagger.labels {#labels tag="property"}
|
||||
|
||||
|
@ -314,3 +316,22 @@ tags by default, e.g. `VERB`, `NOUN` and so on.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = tagger.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------ |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tag. |
|
||||
|
|
|
@ -260,9 +260,10 @@ Serialize the pipe to disk.
|
|||
> textcat.to_disk("/path/to/textcat")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## TextCategorizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -278,6 +279,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -291,10 +293,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -308,11 +310,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> textcat.from_bytes(textcat_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.labels {#labels tag="property"}
|
||||
|
||||
|
@ -328,3 +330,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = textcat.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
|
|
|
@ -211,7 +211,7 @@ The rightmost token of this token's syntactic descendants.
|
|||
|
||||
## Token.conjuncts {#conjuncts tag="property" model="parser"}
|
||||
|
||||
A sequence of coordinated tokens, including the token itself.
|
||||
A tuple of coordinated tokens, not including the token itself.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -221,9 +221,9 @@ A sequence of coordinated tokens, including the token itself.
|
|||
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------- | -------------------- |
|
||||
| **YIELDS** | `Token` | A coordinated token. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------- |
|
||||
| **RETURNS** | `tuple` | The coordinated tokens. |
|
||||
|
||||
## Token.children {#children tag="property" model="parser"}
|
||||
|
||||
|
|
|
@ -127,9 +127,10 @@ Serialize the tokenizer to disk.
|
|||
> tokenizer.to_disk("/path/to/tokenizer")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tokenizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -145,6 +146,7 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
||||
|
||||
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -158,10 +160,10 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the tokenizer to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
||||
|
||||
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -176,11 +178,11 @@ it.
|
|||
> tokenizer.from_bytes(tokenizer_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
@ -190,3 +192,25 @@ it.
|
|||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
|
||||
> tokenizer.from_disk("./data", exclude=["token_match"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------- | --------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `prefix_search` | The prefix rules. |
|
||||
| `suffix_search` | The suffix rules. |
|
||||
| `infix_finditer` | The infix rules. |
|
||||
| `token_match` | The token match expression. |
|
||||
| `exceptions` | The tokenizer exception rules. |
|
||||
|
|
|
@ -351,6 +351,24 @@ the two-letter language code.
|
|||
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||
| `cls` | `Language` | The language class, e.g. `English`. |
|
||||
|
||||
### util.lang_class_is_loaded (#util.lang_class_is_loaded tag="function" new="2.1")
|
||||
|
||||
Check whether a `Language` class is already loaded. `Language` classes are
|
||||
loaded lazily, to avoid expensive setup code associated with the language data.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> lang_cls = util.get_lang_class("en")
|
||||
> assert util.lang_class_is_loaded("en") is True
|
||||
> assert util.lang_class_is_loaded("de") is False
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------- |
|
||||
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||
| **RETURNS** | bool | Whether the class has been loaded. |
|
||||
|
||||
### util.load_model {#util.load_model tag="function" new="2"}
|
||||
|
||||
Load a model from a shortcut link, package or data path. If called with a
|
||||
|
|
|
@ -311,10 +311,9 @@ Save the current state to a directory.
|
|||
>
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `**exclude` | - | Named attributes to prevent from being saved. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## Vectors.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -342,10 +341,9 @@ Serialize the current state to a binary string.
|
|||
> vectors_bytes = vectors.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------- |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
||||
|
||||
## Vectors.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -360,11 +358,10 @@ Load state from a binary string.
|
|||
> new_vectors.from_bytes(vectors_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------- | ---------------------------------------------- |
|
||||
| `data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------- | ---------------------- |
|
||||
| `data` | bytes | The data to load from. |
|
||||
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
|
|
@ -221,9 +221,10 @@ Save the current state to a directory.
|
|||
> nlp.vocab.to_disk("/path/to/vocab")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -239,6 +240,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Vocab` | The modified `Vocab` object. |
|
||||
|
||||
## Vocab.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -251,10 +253,10 @@ Serialize the current state to a binary string.
|
|||
> vocab_bytes = nlp.vocab.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
||||
|
||||
## Vocab.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -269,11 +271,11 @@ Load state from a binary string.
|
|||
> vocab.from_bytes(vocab_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
@ -286,8 +288,28 @@ Load state from a binary string.
|
|||
> assert type(PERSON) == int
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------------------------ | ------------- | --------------------------------------------- |
|
||||
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
|
||||
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
|
||||
| `vectors_length` | int | Number of dimensions for each word vector. |
|
||||
| Name | Type | Description |
|
||||
| --------------------------------------------- | ------------- | ------------------------------------------------------------ |
|
||||
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
|
||||
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
|
||||
| `vectors_length` | int | Number of dimensions for each word vector. |
|
||||
| `writing_system` <Tag variant="new">2.1</Tag> | dict | A dict with information about the language's writing system. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = vocab.to_bytes(exclude=["strings", "vectors"])
|
||||
> vocab.from_disk("./vocab", exclude=["strings"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------- | ----------------------------------------------------- |
|
||||
| `strings` | The strings in the [`StringStore`](/api/stringstore). |
|
||||
| `lexemes` | The lexeme data. |
|
||||
| `vectors` | The word vectors, if available. |
|
||||
|
|
|
@ -39,9 +39,9 @@ together all components and creating the `Language` subclass – for example,
|
|||
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
|
||||
|
||||
[stop_words.py]:
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/stop_words.py
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||
[tokenizer_exceptions.py]:
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tokenizer_exceptions.py
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
|
||||
[norm_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
|
||||
[punctuation.py]:
|
||||
|
@ -49,12 +49,12 @@ together all components and creating the `Language` subclass – for example,
|
|||
[char_classes.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py
|
||||
[lex_attrs.py]:
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lex_attrs.py
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
|
||||
[syntax_iterators.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
||||
[lemmatizer.py]:
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lemmatizer.py
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
|
||||
[tag_map.py]:
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tag_map.py
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
|
||||
[morph_rules.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
|
||||
|
|
|
@ -33,9 +33,22 @@ list containing the component names:
|
|||
|
||||
import Accordion from 'components/accordion.js'
|
||||
|
||||
<Accordion title="Does the order of pipeline components matter?">
|
||||
<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
|
||||
|
||||
No
|
||||
In spaCy v2.x, the statistical components like the tagger or parser are
|
||||
independent and don't share any data between themselves. For example, the named
|
||||
entity recognizer doesn't use any features set by the tagger and parser, and so
|
||||
on. This means that you can swap them, or remove single components from the
|
||||
pipeline without affecting the others.
|
||||
|
||||
However, custom components may depend on annotations set by other components.
|
||||
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
|
||||
it'll only work if it's added after the tagger. The parser will respect
|
||||
pre-defined sentence boundaries, so if a previous component in the pipeline sets
|
||||
them, its dependency predictions may be different. Similarly, it matters if you
|
||||
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
|
||||
recognizer: if it's added before, the entity recognizer will take the existing
|
||||
entities into account when making predictions.
|
||||
|
||||
</Accordion>
|
||||
|
||||
|
|
|
@ -39,7 +39,7 @@ and morphological analysis.
|
|||
|
||||
</div>
|
||||
|
||||
<Infobox title="Table of Contents">
|
||||
<Infobox title="Table of Contents" id="toc">
|
||||
|
||||
- [Language data 101](#101)
|
||||
- [The Language subclass](#language-subclass)
|
||||
|
@ -105,15 +105,15 @@ to know the language's character set. If the language you're adding uses
|
|||
non-latin characters, you might need to define the required character classes in
|
||||
the global
|
||||
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
|
||||
For efficiency, spaCy uses hard-coded unicode ranges to define character classes,
|
||||
the definitions of which can be found on [Wikipedia](https://en.wikipedia.org/wiki/Unicode_block).
|
||||
If the language requires very specific punctuation
|
||||
rules, you should consider overwriting the default regular expressions with your
|
||||
own in the language's `Defaults`.
|
||||
For efficiency, spaCy uses hard-coded unicode ranges to define character
|
||||
classes, the definitions of which can be found on
|
||||
[Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language
|
||||
requires very specific punctuation rules, you should consider overwriting the
|
||||
default regular expressions with your own in the language's `Defaults`.
|
||||
|
||||
</Infobox>
|
||||
|
||||
### Creating a `Language` subclass {#language-subclass}
|
||||
### Creating a language subclass {#language-subclass}
|
||||
|
||||
Language-specific code and resources should be organized into a sub-package of
|
||||
spaCy, named according to the language's
|
||||
|
@ -121,9 +121,9 @@ spaCy, named according to the language's
|
|||
code and resources specific to Spanish are placed into a directory
|
||||
`spacy/lang/es`, which can be imported as `spacy.lang.es`.
|
||||
|
||||
To get started, you can use our
|
||||
[templates](https://github.com/explosion/spacy-dev-resources/templates/new_language)
|
||||
for the most important files. Here's what the class template looks like:
|
||||
To get started, you can check out the
|
||||
[existing languages](https://github.com/explosion/spacy/tree/master/spacy/lang).
|
||||
Here's what the class could look like:
|
||||
|
||||
```python
|
||||
### __init__.py (excerpt)
|
||||
|
@ -614,7 +614,7 @@ require models to be trained from labeled examples. The word vectors, word
|
|||
probabilities and word clusters also require training, although these can be
|
||||
trained from unlabeled text, which tends to be much easier to collect.
|
||||
|
||||
### Creating a vocabulary file
|
||||
### Creating a vocabulary file {#vocab-file}
|
||||
|
||||
spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
|
||||
instance. The vocabulary caches lexical features. spaCy loads the vocabulary
|
||||
|
@ -631,20 +631,20 @@ of using deep learning for NLP with limited labeled data. The vectors are also
|
|||
useful by themselves – they power the `.similarity` methods in spaCy. For best
|
||||
results, you should pre-process the text with spaCy before training the Word2vec
|
||||
model. This ensures your tokenization will match. You can use our
|
||||
[word vectors training script](https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py),
|
||||
[word vectors training script](https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py),
|
||||
which pre-processes the text with your language-specific tokenizer and trains
|
||||
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
|
||||
file should consist of one word and vector per line.
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py
|
||||
https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py
|
||||
```
|
||||
|
||||
If you don't have a large sample of text available, you can also convert word
|
||||
vectors produced by a variety of other tools into spaCy's format. See the docs
|
||||
on [converting word vectors](/usage/vectors-similarity#converting) for details.
|
||||
|
||||
### Creating or converting a training corpus
|
||||
### Creating or converting a training corpus {#training-corpus}
|
||||
|
||||
The easiest way to train spaCy's tagger, parser, entity recognizer or text
|
||||
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
||||
|
|
|
@ -29,7 +29,7 @@ Here's a quick comparison of the functionalities offered by spaCy,
|
|||
| Entity linking | ❌ | ❌ | ❌ |
|
||||
| Coreference resolution | ❌ | ❌ | ✅ |
|
||||
|
||||
### When should I use what?
|
||||
### When should I use what? {#comparison-usage}
|
||||
|
||||
Natural Language Understanding is an active area of research and development, so
|
||||
there are many different tools or technologies catering to different use-cases.
|
||||
|
|
|
@ -28,7 +28,7 @@ import QuickstartInstall from 'widgets/quickstart-install.js'
|
|||
|
||||
## Installation instructions {#installation}
|
||||
|
||||
### pip
|
||||
### pip {#pip}
|
||||
|
||||
Using pip, spaCy releases are available as source packages and binary wheels (as
|
||||
of v2.0.13).
|
||||
|
@ -58,7 +58,7 @@ source .env/bin/activate
|
|||
pip install spacy
|
||||
```
|
||||
|
||||
### conda
|
||||
### conda {#conda}
|
||||
|
||||
Thanks to our great community, we've been able to re-add conda support. You can
|
||||
also install spaCy via `conda-forge`:
|
||||
|
@ -194,7 +194,7 @@ official distributions these are:
|
|||
| Python 3.4 | Visual Studio 2010 |
|
||||
| Python 3.5+ | Visual Studio 2015 |
|
||||
|
||||
### Run tests
|
||||
### Run tests {#run-tests}
|
||||
|
||||
spaCy comes with an
|
||||
[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests).
|
||||
|
@ -418,7 +418,7 @@ either of these, clone your repository again.
|
|||
|
||||
</Accordion>
|
||||
|
||||
## Changelog
|
||||
## Changelog {#changelog}
|
||||
|
||||
import Changelog from 'widgets/changelog.js'
|
||||
|
||||
|
|
|
@ -298,9 +298,9 @@ different languages, see the
|
|||
The best way to understand spaCy's dependency parser is interactively. To make
|
||||
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
|
||||
or a list of `Doc` objects to displaCy and run
|
||||
[`displacy.serve`](top-level#displacy.serve) to run the web server, or
|
||||
[`displacy.render`](top-level#displacy.render) to generate the raw markup. If
|
||||
you want to know how to write rules that hook into some type of syntactic
|
||||
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
|
||||
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
|
||||
If you want to know how to write rules that hook into some type of syntactic
|
||||
construction, just plug the sentence into the visualizer and see how spaCy
|
||||
annotates it.
|
||||
|
||||
|
@ -621,7 +621,7 @@ For more details on the language-specific data, see the usage guide on
|
|||
|
||||
</Infobox>
|
||||
|
||||
<Accordion title="Should I change the language data or add custom tokenizer rules?">
|
||||
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
|
||||
|
||||
Tokenization rules that are specific to one language, but can be **generalized
|
||||
across that language** should ideally live in the language data in
|
||||
|
|
|
@ -41,7 +41,7 @@ contribute to model development.
|
|||
> If a model is available for a language, you can download it using the
|
||||
> [`spacy download`](/api/cli#download) command. In order to use languages that
|
||||
> don't yet come with a model, you have to import them directly, or use
|
||||
> [`spacy.blank`](api/top-level#spacy.blank):
|
||||
> [`spacy.blank`](/api/top-level#spacy.blank):
|
||||
>
|
||||
> ```python
|
||||
> from spacy.lang.fi import Finnish
|
||||
|
|
|
@ -46,7 +46,8 @@ components. spaCy then does the following:
|
|||
3. Add each pipeline component to the pipeline in order, using
|
||||
[`add_pipe`](/api/language#add_pipe).
|
||||
4. Make the **model data** available to the `Language` class by calling
|
||||
[`from_disk`](language#from_disk) with the path to the model data directory.
|
||||
[`from_disk`](/api/language#from_disk) with the path to the model data
|
||||
directory.
|
||||
|
||||
So when you call this...
|
||||
|
||||
|
@ -110,7 +111,7 @@ print(nlp.pipe_names)
|
|||
# ['tagger', 'parser', 'ner']
|
||||
```
|
||||
|
||||
### Built-in pipeline components
|
||||
### Built-in pipeline components {#built-in}
|
||||
|
||||
spaCy ships with several built-in pipeline components that are also available in
|
||||
the `Language.factories`. This means that you can initialize them by calling
|
||||
|
@ -426,7 +427,7 @@ spaCy, and implement your own models trained with other machine learning
|
|||
libraries. It also lets you take advantage of spaCy's data structures and the
|
||||
`Doc` object as the "single source of truth".
|
||||
|
||||
<Accordion title="Why ._ and not just a top-level attribute?">
|
||||
<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
|
||||
|
||||
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
|
||||
separation and makes it easier to ensure backwards compatibility. For example,
|
||||
|
@ -437,7 +438,7 @@ immediately know what's built-in and what's custom – for example,
|
|||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="How is the ._ implemented?">
|
||||
<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
|
||||
|
||||
Extension definitions – the defaults, methods, getters and setters you pass in
|
||||
to `set_extension` – are stored in class attributes on the `Underscore` class.
|
||||
|
@ -458,9 +459,7 @@ There are three main types of extensions, which can be defined using the
|
|||
1. **Attribute extensions.** Set a default value for an attribute, which can be
|
||||
overwritten manually at any time. Attribute extensions work like "normal"
|
||||
variables and are the quickest way to store arbitrary information on a `Doc`,
|
||||
`Span` or `Token`. Attribute defaults behaves just like argument defaults
|
||||
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments),
|
||||
and should not be used for mutable values like dictionaries or lists.
|
||||
`Span` or `Token`.
|
||||
|
||||
```python
|
||||
Doc.set_extension("hello", default=True)
|
||||
|
@ -527,25 +526,6 @@ Once you've registered your custom attribute, you can also use the built-in
|
|||
especially useful it you want to pass in a string instead of calling
|
||||
`doc._.my_attr`.
|
||||
|
||||
<Infobox title="Using mutable default values" variant="danger">
|
||||
|
||||
When using **mutable values** like dictionaries or lists as the `default`
|
||||
argument, keep in mind that they behave just like mutable default arguments
|
||||
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments).
|
||||
This can easily cause unintended results, like the same value being set on _all_
|
||||
objects instead of only one particular instance. In most cases, it's better to
|
||||
use **getters and setters**, and only set the `default` for boolean or string
|
||||
values.
|
||||
|
||||
```diff
|
||||
+ Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)
|
||||
|
||||
- Doc.set_extension('fruits', default={})
|
||||
- doc._.fruits['apple'] = u'🍎' # all docs now have {'apple': u'🍎'}
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
||||
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
|
||||
|
||||
This example shows the implementation of a pipeline component that fetches
|
||||
|
|
|
@ -15,7 +15,7 @@ their relationships. This means you can easily access and analyze the
|
|||
surrounding tokens, merge spans into single tokens or add entries to the named
|
||||
entities in `doc.ents`.
|
||||
|
||||
<Accordion title="Should I use rules or train a model?">
|
||||
<Accordion title="Should I use rules or train a model?" id="rules-vs-model">
|
||||
|
||||
For complex tasks, it's usually better to train a statistical entity recognition
|
||||
model. However, statistical models require training data, so for many
|
||||
|
@ -41,7 +41,7 @@ on [rule-based entity recognition](#entityruler).
|
|||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="When should I use the token matcher vs. the phrase matcher?">
|
||||
<Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">
|
||||
|
||||
The `PhraseMatcher` is useful if you already have a large terminology list or
|
||||
gazetteer consisting of single or multi-token phrases that you want to find
|
||||
|
|
|
@ -22,7 +22,7 @@ the changes, see [this table](/usage/v2#incompat) and the notes on
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Serializing the pipeline
|
||||
### Serializing the pipeline {#pipeline}
|
||||
|
||||
When serializing the pipeline, keep in mind that this will only save out the
|
||||
**binary data for the individual components** to allow spaCy to restore them –
|
||||
|
@ -361,7 +361,7 @@ In theory, the entry point mechanism also lets you overwrite built-in factories
|
|||
– including the tokenizer. By default, spaCy will output a warning in these
|
||||
cases, to prevent accidental overwrites and unintended results.
|
||||
|
||||
#### Advanced components with settings
|
||||
#### Advanced components with settings {#advanced-cfg}
|
||||
|
||||
The `**cfg` keyword arguments that the factory receives are passed down all the
|
||||
way from `spacy.load`. This means that the factory can respond to custom
|
||||
|
|
|
@ -50,7 +50,7 @@ systems, or to pre-process text for **deep learning**.
|
|||
|
||||
</div>
|
||||
|
||||
<Infobox title="Table of contents">
|
||||
<Infobox title="Table of contents" id="toc">
|
||||
|
||||
- [Features](#features)
|
||||
- [Linguistic annotations](#annotations)
|
||||
|
|
|
@ -14,7 +14,7 @@ faster runtime, and many bug fixes, v2.1 also introduces experimental support
|
|||
for some exciting new NLP innovations. For the full changelog, see the
|
||||
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).
|
||||
|
||||
### BERT/ULMFit/Elmo-style pre-training {tag="experimental"}
|
||||
### BERT/ULMFit/Elmo-style pre-training {#pretraining tag="experimental"}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -39,7 +39,7 @@ it.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Extended match pattern API
|
||||
### Extended match pattern API {#matcher-api}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -67,7 +67,7 @@ values.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Easy rule-based entity recognition
|
||||
### Easy rule-based entity recognition {#entity-ruler}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -91,7 +91,7 @@ flexibility.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Phrase matching with other attributes
|
||||
### Phrase matching with other attributes {#phrasematcher}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -115,7 +115,7 @@ or `POS` for finding sequences of the same part-of-speech tags.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Retokenizer for merging and splitting
|
||||
### Retokenizer for merging and splitting {#retokenizer}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -142,7 +142,7 @@ deprecated.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Components and languages via entry points
|
||||
### Components and languages via entry points {#entry-points}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -169,7 +169,7 @@ is required.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Improved documentation
|
||||
### Improved documentation {#docs}
|
||||
|
||||
Although it looks pretty much the same, we've rebuilt the entire documentation
|
||||
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
|
||||
|
@ -237,6 +237,19 @@ if all of your models are up to date, you can run the
|
|||
+ retokenizer.merge(doc[6:8])
|
||||
```
|
||||
|
||||
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes`
|
||||
now support a single `exclude` argument to provide a list of string names to
|
||||
exclude. The docs have been updated to list the available serialization fields
|
||||
for each class. The `disable` argument on the [`Language`](/api/language)
|
||||
serialization methods has been renamed to `exclude` for consistency.
|
||||
|
||||
```diff
|
||||
- nlp.to_disk("/path", disable=["parser", "ner"])
|
||||
+ nlp.to_disk("/path", exclude=["parser", "ner"])
|
||||
- data = nlp.tokenizer.to_bytes(vocab=False)
|
||||
+ data = nlp.tokenizer.to_bytes(exclude=["vocab"])
|
||||
```
|
||||
|
||||
- For better compatibility with the Universal Dependencies data, the lemmatizer
|
||||
now preserves capitalization, e.g. for proper nouns. See
|
||||
[this issue](https://github.com/explosion/spaCy/issues/3256) for details.
|
||||
|
|
|
@ -39,7 +39,7 @@ also add your own custom attributes, properties and methods to the `Doc`,
|
|||
|
||||
</div>
|
||||
|
||||
<Infobox title="Table of Contents">
|
||||
<Infobox title="Table of Contents" id="toc">
|
||||
|
||||
- [Summary](#summary)
|
||||
- [New features](#features)
|
||||
|
|
|
@ -75,7 +75,7 @@ arcs.
|
|||
| `font` | unicode | Font name or font family for all text. | `"Arial"` |
|
||||
|
||||
For a list of all available options, see the
|
||||
[`displacy` API documentation](top-level#displacy_options).
|
||||
[`displacy` API documentation](/api/top-level#displacy_options).
|
||||
|
||||
> #### Options example
|
||||
>
|
||||
|
@ -283,7 +283,7 @@ from pathlib import Path
|
|||
nlp = spacy.load("en_core_web_sm")
|
||||
sentences = [u"This is an example.", u"This is another one."]
|
||||
for sent in sentences:
|
||||
doc = nlp(sentence)
|
||||
doc = nlp(sent)
|
||||
svg = displacy.render(doc, style="dep")
|
||||
file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
|
||||
output_path = Path("/images/" + file_name)
|
||||
|
|
|
@ -23,6 +23,11 @@
|
|||
"list": "89ad33e698"
|
||||
},
|
||||
"docSearch": {
|
||||
"apiKey": "f7dbcd148fae73db20b6ad33d03cc9e8",
|
||||
"indexName": "dev_spacy_netlify",
|
||||
"appId": "Y7BGGRAPHC"
|
||||
},
|
||||
"_docSearch": {
|
||||
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
|
||||
"indexName": "spacy"
|
||||
},
|
||||
|
|
|
@ -524,6 +524,22 @@
|
|||
},
|
||||
"category": ["standalone", "research"]
|
||||
},
|
||||
{
|
||||
"id": "scispacy",
|
||||
"title": "scispaCy",
|
||||
"slogan": "A full spaCy pipeline and models for scientific/biomedical documents",
|
||||
"github": "allenai/scispacy",
|
||||
"pip": "scispacy",
|
||||
"thumb": "https://i.imgur.com/dJQSclW.png",
|
||||
"url": "https://allenai.github.io/scispacy/",
|
||||
"author": " Allen Institute for Artificial Intelligence",
|
||||
"author_links": {
|
||||
"github": "allenai",
|
||||
"twitter": "allenai_org",
|
||||
"website": "http://allenai.org"
|
||||
},
|
||||
"category": ["models", "research"]
|
||||
},
|
||||
{
|
||||
"id": "textacy",
|
||||
"slogan": "NLP, before and after spaCy",
|
||||
|
@ -851,6 +867,22 @@
|
|||
},
|
||||
"category": ["courses"]
|
||||
},
|
||||
{
|
||||
"type": "education",
|
||||
"id": "datacamp-advanced-nlp",
|
||||
"title": "Advanced Natural Language Processing with spaCy",
|
||||
"slogan": "Datacamp, 2019",
|
||||
"description": "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
|
||||
"url": "https://www.datacamp.com/courses/advanced-nlp-with-spacy",
|
||||
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
|
||||
"author": "Ines Montani",
|
||||
"author_links": {
|
||||
"twitter": "_inesmontani",
|
||||
"github": "ines",
|
||||
"website": "https://ines.io"
|
||||
},
|
||||
"category": ["courses"]
|
||||
},
|
||||
{
|
||||
"type": "education",
|
||||
"id": "learning-path-spacy",
|
||||
|
@ -910,6 +942,7 @@
|
|||
"description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
|
||||
"soundcloud": "559200912",
|
||||
"thumb": "https://i.imgur.com/hOBQEzc.jpg",
|
||||
"url": "https://soundcloud.com/nlp-highlights/78-where-do-corpora-come-from-with-matt-honnibal-and-ines-montani",
|
||||
"author": "Matt Gardner, Waleed Ammar (Allen AI)",
|
||||
"author_links": {
|
||||
"website": "https://soundcloud.com/nlp-highlights"
|
||||
|
@ -925,12 +958,28 @@
|
|||
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
|
||||
"iframe_height": 200,
|
||||
"thumb": "https://i.imgur.com/rpo6BuY.png",
|
||||
"url": "https://www.podcastinit.com/episode-87-spacy-with-matthew-honnibal/",
|
||||
"author": "Tobias Macey",
|
||||
"author_links": {
|
||||
"website": "https://www.podcastinit.com"
|
||||
},
|
||||
"category": ["podcasts"]
|
||||
},
|
||||
{
|
||||
"type": "education",
|
||||
"id": "talk-python-podcast",
|
||||
"title": "Talk Python 202: Building a software business",
|
||||
"slogan": "March 2019",
|
||||
"description": "One core question around open source is how do you fund it? Well, there is always that PayPal donate button. But that's been a tremendous failure for many projects. Often the go-to answer is consulting. But what if you don't want to trade time for money? You could take things up a notch and change the equation, exchanging value for money. That's what Ines Montani and her co-founder did when they started Explosion AI with spaCy as the foundation.",
|
||||
"thumb": "https://i.imgur.com/q1twuK8.png",
|
||||
"url": "https://talkpython.fm/episodes/show/202/building-a-software-business",
|
||||
"soundcloud": "588364857",
|
||||
"author": "Michael Kennedy",
|
||||
"author_links": {
|
||||
"website": "https://talkpython.fm/"
|
||||
},
|
||||
"category": ["podcasts"]
|
||||
},
|
||||
{
|
||||
"id": "adam_qas",
|
||||
"title": "ADAM: Question Answering System",
|
||||
|
|
828
website/package-lock.json
generated
828
website/package-lock.json
generated
|
@ -1833,9 +1833,9 @@
|
|||
}
|
||||
},
|
||||
"acorn": {
|
||||
"version": "6.1.0",
|
||||
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.0.tgz",
|
||||
"integrity": "sha512-MW/FjM+IvU9CgBzjO3UIPCE2pyEwUsoFl+VGdczOPEdxfGFjuKny/gN54mOuX7Qxmb9Rg9MCn2oKiSUeW+pjrw=="
|
||||
"version": "6.1.1",
|
||||
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.1.tgz",
|
||||
"integrity": "sha512-jPTiwtOxaHNaAPg/dmrJ/beuzLRnXtB0kQPQ8JpotKJgTB6rX6c8mlf315941pyjBSaPg8NHXS9fhP4u17DpGA=="
|
||||
},
|
||||
"acorn-dynamic-import": {
|
||||
"version": "3.0.0",
|
||||
|
@ -5958,9 +5958,9 @@
|
|||
"integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ="
|
||||
},
|
||||
"eslint": {
|
||||
"version": "5.14.1",
|
||||
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.14.1.tgz",
|
||||
"integrity": "sha512-CyUMbmsjxedx8B0mr79mNOqetvkbij/zrXnFeK2zc3pGRn3/tibjiNAv/3UxFEyfMDjh+ZqTrJrEGBFiGfD5Og==",
|
||||
"version": "5.15.1",
|
||||
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.15.1.tgz",
|
||||
"integrity": "sha512-NTcm6vQ+PTgN3UBsALw5BMhgO6i5EpIjQF/Xb5tIh3sk9QhrFafujUOczGz4J24JBlzWclSB9Vmx8d+9Z6bFCg==",
|
||||
"requires": {
|
||||
"@babel/code-frame": "^7.0.0",
|
||||
"ajv": "^6.9.1",
|
||||
|
@ -5968,7 +5968,7 @@
|
|||
"cross-spawn": "^6.0.5",
|
||||
"debug": "^4.0.1",
|
||||
"doctrine": "^3.0.0",
|
||||
"eslint-scope": "^4.0.0",
|
||||
"eslint-scope": "^4.0.2",
|
||||
"eslint-utils": "^1.3.1",
|
||||
"eslint-visitor-keys": "^1.0.0",
|
||||
"espree": "^5.0.1",
|
||||
|
@ -6001,9 +6001,9 @@
|
|||
},
|
||||
"dependencies": {
|
||||
"ajv": {
|
||||
"version": "6.9.2",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
|
||||
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
|
||||
"version": "6.10.0",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
|
||||
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
|
||||
"requires": {
|
||||
"fast-deep-equal": "^2.0.1",
|
||||
"fast-json-stable-stringify": "^2.0.0",
|
||||
|
@ -6037,9 +6037,9 @@
|
|||
}
|
||||
},
|
||||
"eslint-scope": {
|
||||
"version": "4.0.0",
|
||||
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.0.tgz",
|
||||
"integrity": "sha512-1G6UTDi7Jc1ELFwnR58HV4fK9OQK4S6N985f166xqXxpjU6plxFISJa2Ba9KCQuFa8RCnj/lSFJbHo7UFDBnUA==",
|
||||
"version": "4.0.2",
|
||||
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.2.tgz",
|
||||
"integrity": "sha512-5q1+B/ogmHl8+paxtOKx38Z8LtWkVGuNt3+GQNErqwLl6ViNp/gdJGMCjZNxZ8j/VYjDNZ2Fo+eQc1TAVPIzbg==",
|
||||
"requires": {
|
||||
"esrecurse": "^4.1.0",
|
||||
"estraverse": "^4.1.1"
|
||||
|
@ -6448,52 +6448,6 @@
|
|||
}
|
||||
}
|
||||
},
|
||||
"expand-range": {
|
||||
"version": "1.8.2",
|
||||
"resolved": "http://registry.npmjs.org/expand-range/-/expand-range-1.8.2.tgz",
|
||||
"integrity": "sha1-opnv/TNf4nIeuujiV+x5ZE/IUzc=",
|
||||
"requires": {
|
||||
"fill-range": "^2.1.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"fill-range": {
|
||||
"version": "2.2.4",
|
||||
"resolved": "https://registry.npmjs.org/fill-range/-/fill-range-2.2.4.tgz",
|
||||
"integrity": "sha512-cnrcCbj01+j2gTG921VZPnHbjmdAf8oQV/iGeV2kZxGSyfYjjTyY79ErsK1WJWMpw6DaApEX72binqJE+/d+5Q==",
|
||||
"requires": {
|
||||
"is-number": "^2.1.0",
|
||||
"isobject": "^2.0.0",
|
||||
"randomatic": "^3.0.0",
|
||||
"repeat-element": "^1.1.2",
|
||||
"repeat-string": "^1.5.2"
|
||||
}
|
||||
},
|
||||
"is-number": {
|
||||
"version": "2.1.0",
|
||||
"resolved": "https://registry.npmjs.org/is-number/-/is-number-2.1.0.tgz",
|
||||
"integrity": "sha1-Afy7s5NGOlSPL0ZszhbezknbkI8=",
|
||||
"requires": {
|
||||
"kind-of": "^3.0.2"
|
||||
}
|
||||
},
|
||||
"isobject": {
|
||||
"version": "2.1.0",
|
||||
"resolved": "https://registry.npmjs.org/isobject/-/isobject-2.1.0.tgz",
|
||||
"integrity": "sha1-8GVWEJaj8dou9GJy+BXIQNh+DIk=",
|
||||
"requires": {
|
||||
"isarray": "1.0.0"
|
||||
}
|
||||
},
|
||||
"kind-of": {
|
||||
"version": "3.2.2",
|
||||
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
|
||||
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
|
||||
"requires": {
|
||||
"is-buffer": "^1.1.5"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"expand-template": {
|
||||
"version": "2.0.3",
|
||||
"resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz",
|
||||
|
@ -6818,11 +6772,6 @@
|
|||
"resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz",
|
||||
"integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw=="
|
||||
},
|
||||
"filename-regex": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/filename-regex/-/filename-regex-2.0.1.tgz",
|
||||
"integrity": "sha1-wcS5vuPglyXdsQa3XB4wH+LxiyY="
|
||||
},
|
||||
"filename-reserved-regex": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz",
|
||||
|
@ -7130,468 +7079,6 @@
|
|||
"resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz",
|
||||
"integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8="
|
||||
},
|
||||
"fsevents": {
|
||||
"version": "1.2.4",
|
||||
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-1.2.4.tgz",
|
||||
"integrity": "sha512-z8H8/diyk76B7q5wg+Ud0+CqzcAF3mBBI/bA5ne5zrRUUIvNkJY//D3BqyH571KuAC4Nr7Rw7CjWX4r0y9DvNg==",
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"nan": "^2.9.2",
|
||||
"node-pre-gyp": "^0.10.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"abbrev": {
|
||||
"version": "1.1.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"ansi-regex": {
|
||||
"version": "2.1.1",
|
||||
"bundled": true
|
||||
},
|
||||
"aproba": {
|
||||
"version": "1.2.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"are-we-there-yet": {
|
||||
"version": "1.1.4",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"delegates": "^1.0.0",
|
||||
"readable-stream": "^2.0.6"
|
||||
}
|
||||
},
|
||||
"balanced-match": {
|
||||
"version": "1.0.0",
|
||||
"bundled": true
|
||||
},
|
||||
"brace-expansion": {
|
||||
"version": "1.1.11",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"balanced-match": "^1.0.0",
|
||||
"concat-map": "0.0.1"
|
||||
}
|
||||
},
|
||||
"chownr": {
|
||||
"version": "1.0.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"code-point-at": {
|
||||
"version": "1.1.0",
|
||||
"bundled": true
|
||||
},
|
||||
"concat-map": {
|
||||
"version": "0.0.1",
|
||||
"bundled": true
|
||||
},
|
||||
"console-control-strings": {
|
||||
"version": "1.1.0",
|
||||
"bundled": true
|
||||
},
|
||||
"core-util-is": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"debug": {
|
||||
"version": "2.6.9",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"ms": "2.0.0"
|
||||
}
|
||||
},
|
||||
"deep-extend": {
|
||||
"version": "0.5.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"delegates": {
|
||||
"version": "1.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"detect-libc": {
|
||||
"version": "1.0.3",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"fs-minipass": {
|
||||
"version": "1.2.5",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"minipass": "^2.2.1"
|
||||
}
|
||||
},
|
||||
"fs.realpath": {
|
||||
"version": "1.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"gauge": {
|
||||
"version": "2.7.4",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"aproba": "^1.0.3",
|
||||
"console-control-strings": "^1.0.0",
|
||||
"has-unicode": "^2.0.0",
|
||||
"object-assign": "^4.1.0",
|
||||
"signal-exit": "^3.0.0",
|
||||
"string-width": "^1.0.1",
|
||||
"strip-ansi": "^3.0.1",
|
||||
"wide-align": "^1.1.0"
|
||||
}
|
||||
},
|
||||
"glob": {
|
||||
"version": "7.1.2",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"fs.realpath": "^1.0.0",
|
||||
"inflight": "^1.0.4",
|
||||
"inherits": "2",
|
||||
"minimatch": "^3.0.4",
|
||||
"once": "^1.3.0",
|
||||
"path-is-absolute": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"has-unicode": {
|
||||
"version": "2.0.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"iconv-lite": {
|
||||
"version": "0.4.21",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"safer-buffer": "^2.1.0"
|
||||
}
|
||||
},
|
||||
"ignore-walk": {
|
||||
"version": "3.0.1",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"minimatch": "^3.0.4"
|
||||
}
|
||||
},
|
||||
"inflight": {
|
||||
"version": "1.0.6",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"once": "^1.3.0",
|
||||
"wrappy": "1"
|
||||
}
|
||||
},
|
||||
"inherits": {
|
||||
"version": "2.0.3",
|
||||
"bundled": true
|
||||
},
|
||||
"ini": {
|
||||
"version": "1.3.5",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"is-fullwidth-code-point": {
|
||||
"version": "1.0.0",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"number-is-nan": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"isarray": {
|
||||
"version": "1.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"minimatch": {
|
||||
"version": "3.0.4",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"brace-expansion": "^1.1.7"
|
||||
}
|
||||
},
|
||||
"minimist": {
|
||||
"version": "0.0.8",
|
||||
"bundled": true
|
||||
},
|
||||
"minipass": {
|
||||
"version": "2.2.4",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"safe-buffer": "^5.1.1",
|
||||
"yallist": "^3.0.0"
|
||||
}
|
||||
},
|
||||
"minizlib": {
|
||||
"version": "1.1.0",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"minipass": "^2.2.1"
|
||||
}
|
||||
},
|
||||
"mkdirp": {
|
||||
"version": "0.5.1",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"minimist": "0.0.8"
|
||||
}
|
||||
},
|
||||
"ms": {
|
||||
"version": "2.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"needle": {
|
||||
"version": "2.2.0",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"debug": "^2.1.2",
|
||||
"iconv-lite": "^0.4.4",
|
||||
"sax": "^1.2.4"
|
||||
}
|
||||
},
|
||||
"node-pre-gyp": {
|
||||
"version": "0.10.0",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"detect-libc": "^1.0.2",
|
||||
"mkdirp": "^0.5.1",
|
||||
"needle": "^2.2.0",
|
||||
"nopt": "^4.0.1",
|
||||
"npm-packlist": "^1.1.6",
|
||||
"npmlog": "^4.0.2",
|
||||
"rc": "^1.1.7",
|
||||
"rimraf": "^2.6.1",
|
||||
"semver": "^5.3.0",
|
||||
"tar": "^4"
|
||||
}
|
||||
},
|
||||
"nopt": {
|
||||
"version": "4.0.1",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"abbrev": "1",
|
||||
"osenv": "^0.1.4"
|
||||
}
|
||||
},
|
||||
"npm-bundled": {
|
||||
"version": "1.0.3",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"npm-packlist": {
|
||||
"version": "1.1.10",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"ignore-walk": "^3.0.1",
|
||||
"npm-bundled": "^1.0.1"
|
||||
}
|
||||
},
|
||||
"npmlog": {
|
||||
"version": "4.1.2",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"are-we-there-yet": "~1.1.2",
|
||||
"console-control-strings": "~1.1.0",
|
||||
"gauge": "~2.7.3",
|
||||
"set-blocking": "~2.0.0"
|
||||
}
|
||||
},
|
||||
"number-is-nan": {
|
||||
"version": "1.0.1",
|
||||
"bundled": true
|
||||
},
|
||||
"object-assign": {
|
||||
"version": "4.1.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"once": {
|
||||
"version": "1.4.0",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"wrappy": "1"
|
||||
}
|
||||
},
|
||||
"os-homedir": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"os-tmpdir": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"osenv": {
|
||||
"version": "0.1.5",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"os-homedir": "^1.0.0",
|
||||
"os-tmpdir": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"path-is-absolute": {
|
||||
"version": "1.0.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"process-nextick-args": {
|
||||
"version": "2.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"rc": {
|
||||
"version": "1.2.7",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"deep-extend": "^0.5.1",
|
||||
"ini": "~1.3.0",
|
||||
"minimist": "^1.2.0",
|
||||
"strip-json-comments": "~2.0.1"
|
||||
},
|
||||
"dependencies": {
|
||||
"minimist": {
|
||||
"version": "1.2.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
}
|
||||
}
|
||||
},
|
||||
"readable-stream": {
|
||||
"version": "2.3.6",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"core-util-is": "~1.0.0",
|
||||
"inherits": "~2.0.3",
|
||||
"isarray": "~1.0.0",
|
||||
"process-nextick-args": "~2.0.0",
|
||||
"safe-buffer": "~5.1.1",
|
||||
"string_decoder": "~1.1.1",
|
||||
"util-deprecate": "~1.0.1"
|
||||
}
|
||||
},
|
||||
"rimraf": {
|
||||
"version": "2.6.2",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"glob": "^7.0.5"
|
||||
}
|
||||
},
|
||||
"safe-buffer": {
|
||||
"version": "5.1.1",
|
||||
"bundled": true
|
||||
},
|
||||
"safer-buffer": {
|
||||
"version": "2.1.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"sax": {
|
||||
"version": "1.2.4",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"semver": {
|
||||
"version": "5.5.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"set-blocking": {
|
||||
"version": "2.0.0",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"signal-exit": {
|
||||
"version": "3.0.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"string-width": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"code-point-at": "^1.0.0",
|
||||
"is-fullwidth-code-point": "^1.0.0",
|
||||
"strip-ansi": "^3.0.0"
|
||||
}
|
||||
},
|
||||
"string_decoder": {
|
||||
"version": "1.1.1",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"safe-buffer": "~5.1.0"
|
||||
}
|
||||
},
|
||||
"strip-ansi": {
|
||||
"version": "3.0.1",
|
||||
"bundled": true,
|
||||
"requires": {
|
||||
"ansi-regex": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"strip-json-comments": {
|
||||
"version": "2.0.1",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"tar": {
|
||||
"version": "4.4.1",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"chownr": "^1.0.1",
|
||||
"fs-minipass": "^1.2.5",
|
||||
"minipass": "^2.2.4",
|
||||
"minizlib": "^1.1.0",
|
||||
"mkdirp": "^0.5.0",
|
||||
"safe-buffer": "^5.1.1",
|
||||
"yallist": "^3.0.2"
|
||||
}
|
||||
},
|
||||
"util-deprecate": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true,
|
||||
"optional": true
|
||||
},
|
||||
"wide-align": {
|
||||
"version": "1.1.2",
|
||||
"bundled": true,
|
||||
"optional": true,
|
||||
"requires": {
|
||||
"string-width": "^1.0.2"
|
||||
}
|
||||
},
|
||||
"wrappy": {
|
||||
"version": "1.0.2",
|
||||
"bundled": true
|
||||
},
|
||||
"yallist": {
|
||||
"version": "3.0.2",
|
||||
"bundled": true
|
||||
}
|
||||
}
|
||||
},
|
||||
"fstream": {
|
||||
"version": "1.0.11",
|
||||
"resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz",
|
||||
|
@ -8322,14 +7809,14 @@
|
|||
}
|
||||
},
|
||||
"gatsby-source-filesystem": {
|
||||
"version": "2.0.20",
|
||||
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.20.tgz",
|
||||
"integrity": "sha512-nS2hBsqKEQIJ5Yd+g9p++FcsfmvbQmZlBUzx04VPBYZBu2LuLA/ZxQkmdiTNnbDQ18KJw0Zu2PnmUerPnEMqyg==",
|
||||
"version": "2.0.24",
|
||||
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.24.tgz",
|
||||
"integrity": "sha512-KzyHzuXni9hOiZFDgeoH5ABJZqb59fSJNGr2C4U6B1AlGXFMucFK45Fh3V8axtpi833bIbCb9rGmK+tvL4Qb1w==",
|
||||
"requires": {
|
||||
"@babel/runtime": "^7.0.0",
|
||||
"better-queue": "^3.8.7",
|
||||
"bluebird": "^3.5.0",
|
||||
"chokidar": "^1.7.0",
|
||||
"chokidar": "^2.1.2",
|
||||
"file-type": "^10.2.0",
|
||||
"fs-extra": "^5.0.0",
|
||||
"got": "^7.1.0",
|
||||
|
@ -8343,83 +7830,6 @@
|
|||
"xstate": "^3.1.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"anymatch": {
|
||||
"version": "1.3.2",
|
||||
"resolved": "https://registry.npmjs.org/anymatch/-/anymatch-1.3.2.tgz",
|
||||
"integrity": "sha512-0XNayC8lTHQ2OI8aljNCN3sSx6hsr/1+rlcDAotXJR7C1oZZHCNsfpbKwMjRA3Uqb5tF1Rae2oloTr4xpq+WjA==",
|
||||
"requires": {
|
||||
"micromatch": "^2.1.5",
|
||||
"normalize-path": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"arr-diff": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/arr-diff/-/arr-diff-2.0.0.tgz",
|
||||
"integrity": "sha1-jzuCf5Vai9ZpaX5KQlasPOrjVs8=",
|
||||
"requires": {
|
||||
"arr-flatten": "^1.0.1"
|
||||
}
|
||||
},
|
||||
"array-unique": {
|
||||
"version": "0.2.1",
|
||||
"resolved": "https://registry.npmjs.org/array-unique/-/array-unique-0.2.1.tgz",
|
||||
"integrity": "sha1-odl8yvy8JiXMcPrc6zalDFiwGlM="
|
||||
},
|
||||
"braces": {
|
||||
"version": "1.8.5",
|
||||
"resolved": "https://registry.npmjs.org/braces/-/braces-1.8.5.tgz",
|
||||
"integrity": "sha1-uneWLhLf+WnWt2cR6RS3N4V79qc=",
|
||||
"requires": {
|
||||
"expand-range": "^1.8.1",
|
||||
"preserve": "^0.2.0",
|
||||
"repeat-element": "^1.1.2"
|
||||
}
|
||||
},
|
||||
"chokidar": {
|
||||
"version": "1.7.0",
|
||||
"resolved": "https://registry.npmjs.org/chokidar/-/chokidar-1.7.0.tgz",
|
||||
"integrity": "sha1-eY5ol3gVHIB2tLNg5e3SjNortGg=",
|
||||
"requires": {
|
||||
"anymatch": "^1.3.0",
|
||||
"async-each": "^1.0.0",
|
||||
"fsevents": "^1.0.0",
|
||||
"glob-parent": "^2.0.0",
|
||||
"inherits": "^2.0.1",
|
||||
"is-binary-path": "^1.0.0",
|
||||
"is-glob": "^2.0.0",
|
||||
"path-is-absolute": "^1.0.0",
|
||||
"readdirp": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"expand-brackets": {
|
||||
"version": "0.1.5",
|
||||
"resolved": "https://registry.npmjs.org/expand-brackets/-/expand-brackets-0.1.5.tgz",
|
||||
"integrity": "sha1-3wcoTjQqgHzXM6xa9yQR5YHRF3s=",
|
||||
"requires": {
|
||||
"is-posix-bracket": "^0.1.0"
|
||||
}
|
||||
},
|
||||
"extglob": {
|
||||
"version": "0.3.2",
|
||||
"resolved": "https://registry.npmjs.org/extglob/-/extglob-0.3.2.tgz",
|
||||
"integrity": "sha1-Lhj/PS9JqydlzskCPwEdqo2DSaE=",
|
||||
"requires": {
|
||||
"is-extglob": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"file-type": {
|
||||
"version": "10.7.1",
|
||||
"resolved": "https://registry.npmjs.org/file-type/-/file-type-10.7.1.tgz",
|
||||
"integrity": "sha512-kUc4EE9q3MH6kx70KumPOvXLZLEJZzY9phEVg/bKWyGZ+OA9KoKZzFR4HS0yDmNv31sJkdf4hbTERIfplF9OxQ=="
|
||||
},
|
||||
"glob-parent": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
|
||||
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
|
||||
"requires": {
|
||||
"is-glob": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"got": {
|
||||
"version": "7.1.0",
|
||||
"resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz",
|
||||
|
@ -8441,47 +7851,6 @@
|
|||
"url-to-options": "^1.0.1"
|
||||
}
|
||||
},
|
||||
"is-extglob": {
|
||||
"version": "1.0.0",
|
||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
||||
},
|
||||
"is-glob": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
||||
"requires": {
|
||||
"is-extglob": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"kind-of": {
|
||||
"version": "3.2.2",
|
||||
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
|
||||
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
|
||||
"requires": {
|
||||
"is-buffer": "^1.1.5"
|
||||
}
|
||||
},
|
||||
"micromatch": {
|
||||
"version": "2.3.11",
|
||||
"resolved": "https://registry.npmjs.org/micromatch/-/micromatch-2.3.11.tgz",
|
||||
"integrity": "sha1-hmd8l9FyCzY0MdBNDRUpO9OMFWU=",
|
||||
"requires": {
|
||||
"arr-diff": "^2.0.0",
|
||||
"array-unique": "^0.2.1",
|
||||
"braces": "^1.8.2",
|
||||
"expand-brackets": "^0.1.4",
|
||||
"extglob": "^0.3.1",
|
||||
"filename-regex": "^2.0.0",
|
||||
"is-extglob": "^1.0.0",
|
||||
"is-glob": "^2.0.1",
|
||||
"kind-of": "^3.0.2",
|
||||
"normalize-path": "^2.0.1",
|
||||
"object.omit": "^2.0.0",
|
||||
"parse-glob": "^3.0.4",
|
||||
"regex-cache": "^0.4.2"
|
||||
}
|
||||
},
|
||||
"pify": {
|
||||
"version": "4.0.1",
|
||||
"resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz",
|
||||
|
@ -8493,12 +7862,12 @@
|
|||
"integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74="
|
||||
},
|
||||
"read-chunk": {
|
||||
"version": "3.0.0",
|
||||
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.0.0.tgz",
|
||||
"integrity": "sha512-8lBUVPjj9TC5bKLBacB+rpexM03+LWiYbv6ma3BeWmUYXGxqA1WNNgIZHq/iIsCrbFMzPhFbkOqdsyOFRnuoXg==",
|
||||
"version": "3.1.0",
|
||||
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.1.0.tgz",
|
||||
"integrity": "sha512-ZdiZJXXoZYE08SzZvTipHhI+ZW0FpzxmFtLI3vIeMuRN9ySbIZ+SZawKogqJ7dxW9fJ/W73BNtxu4Zu/bZp+Ng==",
|
||||
"requires": {
|
||||
"pify": "^4.0.0",
|
||||
"with-open-file": "^0.1.3"
|
||||
"pify": "^4.0.1",
|
||||
"with-open-file": "^0.1.5"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -8742,38 +8111,6 @@
|
|||
"path-is-absolute": "^1.0.0"
|
||||
}
|
||||
},
|
||||
"glob-base": {
|
||||
"version": "0.3.0",
|
||||
"resolved": "https://registry.npmjs.org/glob-base/-/glob-base-0.3.0.tgz",
|
||||
"integrity": "sha1-27Fk9iIbHAscz4Kuoyi0l98Oo8Q=",
|
||||
"requires": {
|
||||
"glob-parent": "^2.0.0",
|
||||
"is-glob": "^2.0.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"glob-parent": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
|
||||
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
|
||||
"requires": {
|
||||
"is-glob": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"is-extglob": {
|
||||
"version": "1.0.0",
|
||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
||||
},
|
||||
"is-glob": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
||||
"requires": {
|
||||
"is-extglob": "^1.0.0"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"glob-parent": {
|
||||
"version": "3.1.0",
|
||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz",
|
||||
|
@ -10110,19 +9447,6 @@
|
|||
"resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz",
|
||||
"integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE="
|
||||
},
|
||||
"is-dotfile": {
|
||||
"version": "1.0.3",
|
||||
"resolved": "https://registry.npmjs.org/is-dotfile/-/is-dotfile-1.0.3.tgz",
|
||||
"integrity": "sha1-pqLzL/0t+wT1yiXs0Pa4PPeYoeE="
|
||||
},
|
||||
"is-equal-shallow": {
|
||||
"version": "0.1.3",
|
||||
"resolved": "https://registry.npmjs.org/is-equal-shallow/-/is-equal-shallow-0.1.3.tgz",
|
||||
"integrity": "sha1-IjgJj8Ih3gvPpdnqxMRdY4qhxTQ=",
|
||||
"requires": {
|
||||
"is-primitive": "^2.0.0"
|
||||
}
|
||||
},
|
||||
"is-extendable": {
|
||||
"version": "0.1.1",
|
||||
"resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz",
|
||||
|
@ -10263,16 +9587,6 @@
|
|||
"resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz",
|
||||
"integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84="
|
||||
},
|
||||
"is-posix-bracket": {
|
||||
"version": "0.1.1",
|
||||
"resolved": "https://registry.npmjs.org/is-posix-bracket/-/is-posix-bracket-0.1.1.tgz",
|
||||
"integrity": "sha1-MzTceXdDaOkvAW5vvAqI9c1ua8Q="
|
||||
},
|
||||
"is-primitive": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/is-primitive/-/is-primitive-2.0.0.tgz",
|
||||
"integrity": "sha1-IHurkWOEmcB7Kt8kCkGochADRXU="
|
||||
},
|
||||
"is-promise": {
|
||||
"version": "2.1.0",
|
||||
"resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz",
|
||||
|
@ -11162,11 +10476,6 @@
|
|||
"resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz",
|
||||
"integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw=="
|
||||
},
|
||||
"math-random": {
|
||||
"version": "1.0.1",
|
||||
"resolved": "https://registry.npmjs.org/math-random/-/math-random-1.0.1.tgz",
|
||||
"integrity": "sha1-izqsWIuKZuSXXjzepn97sylgH6w="
|
||||
},
|
||||
"md-attr-parser": {
|
||||
"version": "1.2.1",
|
||||
"resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz",
|
||||
|
@ -12230,15 +11539,6 @@
|
|||
"es-abstract": "^1.5.1"
|
||||
}
|
||||
},
|
||||
"object.omit": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/object.omit/-/object.omit-2.0.1.tgz",
|
||||
"integrity": "sha1-Gpx0SCnznbuFjHbKNXmuKlTr0fo=",
|
||||
"requires": {
|
||||
"for-own": "^0.1.4",
|
||||
"is-extendable": "^0.1.1"
|
||||
}
|
||||
},
|
||||
"object.pick": {
|
||||
"version": "1.3.0",
|
||||
"resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz",
|
||||
|
@ -12579,32 +11879,6 @@
|
|||
"path-root": "^0.1.1"
|
||||
}
|
||||
},
|
||||
"parse-glob": {
|
||||
"version": "3.0.4",
|
||||
"resolved": "https://registry.npmjs.org/parse-glob/-/parse-glob-3.0.4.tgz",
|
||||
"integrity": "sha1-ssN2z7EfNVE7rdFz7wu246OIORw=",
|
||||
"requires": {
|
||||
"glob-base": "^0.3.0",
|
||||
"is-dotfile": "^1.0.0",
|
||||
"is-extglob": "^1.0.0",
|
||||
"is-glob": "^2.0.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"is-extglob": {
|
||||
"version": "1.0.0",
|
||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
||||
},
|
||||
"is-glob": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
||||
"requires": {
|
||||
"is-extglob": "^1.0.0"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"parse-headers": {
|
||||
"version": "2.0.1",
|
||||
"resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz",
|
||||
|
@ -14769,11 +14043,6 @@
|
|||
"resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz",
|
||||
"integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw="
|
||||
},
|
||||
"preserve": {
|
||||
"version": "0.2.0",
|
||||
"resolved": "https://registry.npmjs.org/preserve/-/preserve-0.2.0.tgz",
|
||||
"integrity": "sha1-gV7R9uvGWSb4ZbMQwHE7yzMVzks="
|
||||
},
|
||||
"prettier": {
|
||||
"version": "1.16.4",
|
||||
"resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz",
|
||||
|
@ -14982,23 +14251,6 @@
|
|||
"resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz",
|
||||
"integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU="
|
||||
},
|
||||
"randomatic": {
|
||||
"version": "3.1.1",
|
||||
"resolved": "https://registry.npmjs.org/randomatic/-/randomatic-3.1.1.tgz",
|
||||
"integrity": "sha512-TuDE5KxZ0J461RVjrJZCJc+J+zCkTb1MbH9AQUq68sMhOMcy9jLcb3BrZKgp9q9Ncltdg4QVqWrH02W2EFFVYw==",
|
||||
"requires": {
|
||||
"is-number": "^4.0.0",
|
||||
"kind-of": "^6.0.0",
|
||||
"math-random": "^1.0.1"
|
||||
},
|
||||
"dependencies": {
|
||||
"is-number": {
|
||||
"version": "4.0.0",
|
||||
"resolved": "https://registry.npmjs.org/is-number/-/is-number-4.0.0.tgz",
|
||||
"integrity": "sha512-rSklcAIlf1OmFdyAqbnWTLVelsQ58uvZ66S/ZyawjWqIviTWCjg2PzVGw8WUA+nNuPTqb4wgA+NszrJ+08LlgQ=="
|
||||
}
|
||||
}
|
||||
},
|
||||
"randombytes": {
|
||||
"version": "2.1.0",
|
||||
"resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz",
|
||||
|
@ -15458,14 +14710,6 @@
|
|||
"private": "^0.1.6"
|
||||
}
|
||||
},
|
||||
"regex-cache": {
|
||||
"version": "0.4.4",
|
||||
"resolved": "https://registry.npmjs.org/regex-cache/-/regex-cache-0.4.4.tgz",
|
||||
"integrity": "sha512-nVIZwtCjkC9YgvWkpM55B5rBhBYRZhAaJbgcFYXXsHnbZ9UZI9nnVWYZpBlCqv9ho2eZryPnWrZGsOdPwVWXWQ==",
|
||||
"requires": {
|
||||
"is-equal-shallow": "^0.1.3"
|
||||
}
|
||||
},
|
||||
"regex-not": {
|
||||
"version": "1.0.2",
|
||||
"resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz",
|
||||
|
@ -17710,9 +16954,9 @@
|
|||
},
|
||||
"dependencies": {
|
||||
"ajv": {
|
||||
"version": "6.9.2",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
|
||||
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
|
||||
"version": "6.10.0",
|
||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
|
||||
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
|
||||
"requires": {
|
||||
"fast-deep-equal": "^2.0.1",
|
||||
"fast-json-stable-stringify": "^2.0.0",
|
||||
|
@ -17721,26 +16965,26 @@
|
|||
}
|
||||
},
|
||||
"ansi-regex": {
|
||||
"version": "4.0.0",
|
||||
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.0.0.tgz",
|
||||
"integrity": "sha512-iB5Dda8t/UqpPI/IjsejXu5jOGDrzn41wJyljwPH65VCIbk6+1BzFIMJGFwTNrYXT1CrD+B4l19U7awiQ8rk7w=="
|
||||
"version": "4.1.0",
|
||||
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.1.0.tgz",
|
||||
"integrity": "sha512-1apePfXM1UOSqw0o9IiFAovVz9M5S1Dg+4TrDwfMewQ6p/rmMueb7tWZjQ1rx4Loy1ArBggoqGpfqqdI4rondg=="
|
||||
},
|
||||
"string-width": {
|
||||
"version": "3.0.0",
|
||||
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.0.0.tgz",
|
||||
"integrity": "sha512-rr8CUxBbvOZDUvc5lNIJ+OC1nPVpz+Siw9VBtUjB9b6jZehZLFt0JMCZzShFHIsI8cbhm0EsNIfWJMFV3cu3Ew==",
|
||||
"version": "3.1.0",
|
||||
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.1.0.tgz",
|
||||
"integrity": "sha512-vafcv6KjVZKSgz06oM/H6GDBrAtz8vdhQakGjFIvNrHA6y3HCF1CInLy+QLq8dTJPQ1b+KDUqDFctkdRW44e1w==",
|
||||
"requires": {
|
||||
"emoji-regex": "^7.0.1",
|
||||
"is-fullwidth-code-point": "^2.0.0",
|
||||
"strip-ansi": "^5.0.0"
|
||||
"strip-ansi": "^5.1.0"
|
||||
}
|
||||
},
|
||||
"strip-ansi": {
|
||||
"version": "5.0.0",
|
||||
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.0.0.tgz",
|
||||
"integrity": "sha512-Uu7gQyZI7J7gn5qLn1Np3G9vcYGTVqB+lFTytnDJv83dd8T22aGH451P3jueT2/QemInJDfxHB5Tde5OzgG1Ow==",
|
||||
"version": "5.1.0",
|
||||
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.1.0.tgz",
|
||||
"integrity": "sha512-TjxrkPONqO2Z8QDCpeE2j6n0M6EwxzyDgzEeGp+FbdvaJAt//ClYi6W5my+3ROlC/hZX2KACUwDfK49Ka5eDvg==",
|
||||
"requires": {
|
||||
"ansi-regex": "^4.0.0"
|
||||
"ansi-regex": "^4.1.0"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -12,7 +12,6 @@
|
|||
"@mdx-js/tag": "^0.17.5",
|
||||
"@phosphor/widgets": "^1.6.0",
|
||||
"@rehooks/online-status": "^1.0.0",
|
||||
"@sindresorhus/slugify": "^0.8.0",
|
||||
"@svgr/webpack": "^4.1.0",
|
||||
"autoprefixer": "^9.4.7",
|
||||
"classnames": "^2.2.6",
|
||||
|
@ -35,7 +34,7 @@
|
|||
"gatsby-remark-prismjs": "^3.2.4",
|
||||
"gatsby-remark-smartypants": "^2.0.8",
|
||||
"gatsby-remark-unwrap-images": "^1.0.1",
|
||||
"gatsby-source-filesystem": "^2.0.20",
|
||||
"gatsby-source-filesystem": "^2.0.24",
|
||||
"gatsby-transformer-remark": "^2.2.5",
|
||||
"gatsby-transformer-sharp": "^2.1.13",
|
||||
"html-to-react": "^1.3.4",
|
||||
|
@ -62,7 +61,8 @@
|
|||
"md-attr-parser": "^1.2.1",
|
||||
"prettier": "^1.16.4",
|
||||
"raw-loader": "^1.0.0",
|
||||
"unist-util-visit": "^1.4.0"
|
||||
"unist-util-visit": "^1.4.0",
|
||||
"@sindresorhus/slugify": "^0.8.0"
|
||||
},
|
||||
"repository": {
|
||||
"type": "git",
|
||||
|
|
|
@ -1,33 +1,38 @@
|
|||
import React, { useState } from 'react'
|
||||
import React, { useState, useEffect } from 'react'
|
||||
import PropTypes from 'prop-types'
|
||||
import classNames from 'classnames'
|
||||
import slugify from '@sindresorhus/slugify'
|
||||
|
||||
import Link from './link'
|
||||
import classes from '../styles/accordion.module.sass'
|
||||
|
||||
const Accordion = ({ title, id, expanded, children }) => {
|
||||
const anchorId = id ? id : slugify(title)
|
||||
const [isExpanded, setIsExpanded] = useState(expanded)
|
||||
const [isExpanded, setIsExpanded] = useState(true)
|
||||
const contentClassNames = classNames(classes.content, {
|
||||
[classes.hidden]: !isExpanded,
|
||||
})
|
||||
const iconClassNames = classNames({
|
||||
[classes.hidden]: isExpanded,
|
||||
})
|
||||
// Make sure accordion is expanded if JS is disabled
|
||||
useEffect(() => setIsExpanded(expanded), [])
|
||||
return (
|
||||
<section id={anchorId}>
|
||||
<section className="accordion" id={id}>
|
||||
<div className={classes.root}>
|
||||
<h3>
|
||||
<h4>
|
||||
<button
|
||||
className={classes.button}
|
||||
aria-expanded={String(isExpanded)}
|
||||
onClick={() => setIsExpanded(!isExpanded)}
|
||||
>
|
||||
<span>
|
||||
{title}
|
||||
{isExpanded && (
|
||||
<Link to={`#${anchorId}`} className={classes.anchor} hidden>
|
||||
<span className="heading-text">{title}</span>
|
||||
{isExpanded && !!id && (
|
||||
<Link
|
||||
to={`#${id}`}
|
||||
className={classes.anchor}
|
||||
hidden
|
||||
onClick={event => event.stopPropagation()}
|
||||
>
|
||||
¶
|
||||
</Link>
|
||||
)}
|
||||
|
@ -44,7 +49,7 @@ const Accordion = ({ title, id, expanded, children }) => {
|
|||
<rect height={2} width={8} x={1} y={4} />
|
||||
</svg>
|
||||
</button>
|
||||
</h3>
|
||||
</h4>
|
||||
<div className={contentClassNames}>{children}</div>
|
||||
</div>
|
||||
</section>
|
||||
|
|
|
@ -33,10 +33,11 @@ const GitHubCode = ({ url, lang, errorMsg, className }) => {
|
|||
})
|
||||
.catch(err => {
|
||||
setCode(errorMsg)
|
||||
console.error(err)
|
||||
})
|
||||
setInitialized(true)
|
||||
}
|
||||
}, [])
|
||||
}, [initialized, rawUrl, errorMsg])
|
||||
|
||||
const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code)
|
||||
|
||||
|
|
|
@ -5,13 +5,13 @@ import classNames from 'classnames'
|
|||
import Icon from './icon'
|
||||
import classes from '../styles/infobox.module.sass'
|
||||
|
||||
const Infobox = ({ title, variant, className, children }) => {
|
||||
const Infobox = ({ title, id, variant, className, children }) => {
|
||||
const infoboxClassNames = classNames(classes.root, className, {
|
||||
[classes.warning]: variant === 'warning',
|
||||
[classes.danger]: variant === 'danger',
|
||||
})
|
||||
return (
|
||||
<aside className={infoboxClassNames}>
|
||||
<aside className={infoboxClassNames} id={id}>
|
||||
{title && (
|
||||
<h4 className={classes.title}>
|
||||
{variant !== 'default' && (
|
||||
|
@ -31,6 +31,7 @@ Infobox.defaultProps = {
|
|||
|
||||
Infobox.propTypes = {
|
||||
title: PropTypes.string,
|
||||
id: PropTypes.string,
|
||||
variant: PropTypes.oneOf(['default', 'warning', 'danger']),
|
||||
className: PropTypes.string,
|
||||
children: PropTypes.node.isRequired,
|
||||
|
|
|
@ -232,6 +232,7 @@ Juniper.defaultProps = {
|
|||
theme: 'default',
|
||||
isolateCells: true,
|
||||
useBinder: true,
|
||||
storageKey: 'juniper',
|
||||
useStorage: true,
|
||||
storageExpire: 60,
|
||||
debug: false,
|
||||
|
|
|
@ -34,22 +34,19 @@ const Progress = () => {
|
|||
setOffset(getOffset())
|
||||
}
|
||||
|
||||
useEffect(
|
||||
() => {
|
||||
if (!initialized && progressRef.current) {
|
||||
handleResize()
|
||||
setInitialized(true)
|
||||
}
|
||||
window.addEventListener('scroll', handleScroll)
|
||||
window.addEventListener('resize', handleResize)
|
||||
useEffect(() => {
|
||||
if (!initialized && progressRef.current) {
|
||||
handleResize()
|
||||
setInitialized(true)
|
||||
}
|
||||
window.addEventListener('scroll', handleScroll)
|
||||
window.addEventListener('resize', handleResize)
|
||||
|
||||
return () => {
|
||||
window.removeEventListener('scroll', handleScroll)
|
||||
window.removeEventListener('resize', handleResize)
|
||||
}
|
||||
},
|
||||
[progressRef]
|
||||
)
|
||||
return () => {
|
||||
window.removeEventListener('scroll', handleScroll)
|
||||
window.removeEventListener('resize', handleResize)
|
||||
}
|
||||
}, [initialized, progressRef])
|
||||
|
||||
const { height, vh } = offset
|
||||
const total = 100 - ((height - scrollY - vh) / height) * 100
|
||||
|
|
|
@ -8,6 +8,12 @@ import Icon from './icon'
|
|||
import { H2 } from './typography'
|
||||
import classes from '../styles/quickstart.module.sass'
|
||||
|
||||
function getNewChecked(optionId, checkedForId, multiple) {
|
||||
if (!multiple) return [optionId]
|
||||
if (checkedForId.includes(optionId)) return checkedForId.filter(opt => opt !== optionId)
|
||||
return [...checkedForId, optionId]
|
||||
}
|
||||
|
||||
const Quickstart = ({ data, title, description, id, children }) => {
|
||||
const [styles, setStyles] = useState({})
|
||||
const [checked, setChecked] = useState({})
|
||||
|
@ -38,13 +44,13 @@ const Quickstart = ({ data, title, description, id, children }) => {
|
|||
setStyles(initialStyles)
|
||||
setInitialized(true)
|
||||
}
|
||||
})
|
||||
}, [data, initialized])
|
||||
|
||||
return !data.length ? null : (
|
||||
<Section id={id}>
|
||||
<div className={classes.root}>
|
||||
{title && (
|
||||
<H2 className={classes.title}>
|
||||
<H2 className={classes.title} name={id}>
|
||||
<a href={`#${id}`}>{title}</a>
|
||||
</H2>
|
||||
)}
|
||||
|
@ -76,13 +82,11 @@ const Quickstart = ({ data, title, description, id, children }) => {
|
|||
onChange={() => {
|
||||
const newChecked = {
|
||||
...checked,
|
||||
[id]: !multiple
|
||||
? [option.id]
|
||||
: checkedForId.includes(option.id)
|
||||
? checkedForId.filter(
|
||||
opt => opt !== option.id
|
||||
)
|
||||
: [...checkedForId, option.id],
|
||||
[id]: getNewChecked(
|
||||
option.id,
|
||||
checkedForId,
|
||||
multiple
|
||||
),
|
||||
}
|
||||
setChecked(newChecked)
|
||||
setStyles({
|
||||
|
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user