mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Merge branch 'master' into feature/lemmatizer
commit
278e9d2eb0
106
.github/contributors/Poluglottos.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [X] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Ryan Ford |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | Mar 13 2019 |
+| GitHub username | Poluglottos |
+| Website (optional) | |
106
.github/contributors/tmetzl.md
vendored
Normal file
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+* you hereby assign to us joint ownership, and to the extent that such
+assignment is or becomes invalid, ineffective or unenforceable, you hereby
+grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+royalty-free, unrestricted license to exercise all rights under those
+copyrights. This includes, at our option, the right to sublicense these same
+rights to third parties through multiple levels of sublicensees or other
+licensing arrangements;
+
+* you agree that each of us can do all things in relation to your
+contribution as if each of us were the sole owners, and if one of us makes
+a derivative work of your contribution, the one who makes the derivative
+work (or has it made will be the sole owner of that derivative work;
+
+* you agree that you will not assert any moral rights in your contribution
+against us, our licensees or transferees;
+
+* you agree that we may register a copyright in your contribution and
+exercise all ownership rights associated with it; and
+
+* you agree that neither of us has any duty to consult with, obtain the
+consent of, pay or render an accounting to the other for any use or
+distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+* make, have made, use, sell, offer to sell, import, and otherwise transfer
+your contribution in whole or in part, alone or in combination with or
+included in any product, work or materials arising out of the project to
+which your contribution was submitted, and
+
+* at our option, to sublicense these same rights to third parties through
+multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+* Each contribution that you submit is and shall be an original work of
+authorship and you can legally grant the rights set out in this SCA;
+
+* to the best of your knowledge, each contribution will not violate any
+third party's copyrights, trademarks, patents, or other intellectual
+property rights; and
+
+* each contribution shall be in compliance with U.S. export control laws and
+other applicable export and import laws. You agree to notify us if you
+become aware of any circumstance which would make any of the foregoing
+representations inaccurate in any respect. We may publicly disclose your
+participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+* [x] I am signing on behalf of myself as an individual and no other person
+or entity, including my employer, has or will have rights with respect to my
+contributions.
+
+* [ ] I am signing on behalf of my employer or a legal entity and I have the
+actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Tim Metzler |
+| Company name (if applicable) | University of Applied Sciences Bonn-Rhein-Sieg |
+| Title or role (if applicable) | |
+| Date | 03/10/2019 |
+| GitHub username | tmetzl |
+| Website (optional) | |
22
README.md
@@ -12,7 +12,7 @@ currently supports tokenization for **45+ languages**. It features the
 and easy **deep learning** integration. It's commercial open-source software,
 released under the MIT license.
 
-💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
+💫 **Version 2.0 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 
 [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Travis Build Status](https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis)](https://travis-ci.org/explosion/spaCy)
@@ -25,19 +25,17 @@ released under the MIT license.
 
 ## 📖 Documentation
 
-| Documentation   |                                                                |
-| --------------- | -------------------------------------------------------------- |
-| [spaCy 101]     | New to spaCy? Here's everything you need to know!              |
-| [Usage Guides]  | How to use spaCy and its features.                             |
-| [New in v2.1]   | New features, backwards incompatibilities and migration guide. |
-| [API Reference] | The detailed reference for spaCy's API.                        |
-| [Models]        | Download statistical language models for spaCy.                |
-| [Universe]      | Libraries, extensions, demos, books and courses.               |
-| [Changelog]     | Changes and version history.                                   |
-| [Contribute]    | How to contribute to the spaCy project and code base.          |
+| Documentation   |                                                       |
+| --------------- | ----------------------------------------------------- |
+| [spaCy 101]     | New to spaCy? Here's everything you need to know!     |
+| [Usage Guides]  | How to use spaCy and its features.                    |
+| [API Reference] | The detailed reference for spaCy's API.               |
+| [Models]        | Download statistical language models for spaCy.       |
+| [Universe]      | Libraries, extensions, demos, books and courses.      |
+| [Changelog]     | Changes and version history.                          |
+| [Contribute]    | How to contribute to the spaCy project and code base. |
 
 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.1]: https://spacy.io/usage/v2-1
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -7,6 +7,7 @@ git diff-index --quiet HEAD
 
 git checkout $1
 git pull origin $1
+git push origin $1
 
 version=$(grep "__version__ = " spacy/about.py)
 version=${version/__version__ = }
@@ -15,4 +16,4 @@ version=${version/\'/}
 version=${version/\"/}
 version=${version/\"/}
 git tag "v$version"
-git push origin --tags
+git push origin "v$version" --tags
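The version handling in the release script above (grep the `__version__` line, strip the assignment and the quotes, prefix the result with `v`) can be sketched in Python as follows. This is only an illustration under stated assumptions: `tag_for` is a hypothetical helper that is not part of the repository, and it takes the text of `spacy/about.py` as a string instead of reading the file.

```python
import re


def tag_for(about_source):
    """Return the git tag name ("v" + version) for the given about.py text.

    Hypothetical helper for illustration; it mirrors the grep and
    parameter-substitution steps of the shell script.
    """
    # Match a line such as: __version__ = "2.1.0a13" (single or double quotes).
    match = re.search(r"""__version__\s*=\s*["']([^"']+)["']""", about_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return "v" + match.group(1)
```

Called on text containing `__version__ = "2.1.0a13"`, the helper yields the same tag name the script passes to `git tag` and `git push`.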
107
bin/train_word_vectors.py
Normal file
@@ -0,0 +1,107 @@
+#!/usr/bin/env python
+from __future__ import print_function, unicode_literals, division
+
+import logging
+from pathlib import Path
+from collections import defaultdict
+from gensim.models import Word2Vec
+from preshed.counter import PreshCounter
+import plac
+import spacy
+
+logger = logging.getLogger(__name__)
+
+
+class Corpus(object):
+    def __init__(self, directory, min_freq=10):
+        self.directory = directory
+        self.counts = PreshCounter()
+        self.strings = {}
+        self.min_freq = min_freq
+
+    def count_doc(self, doc):
+        # Get counts for this document
+        for word in doc:
+            self.counts.inc(word.orth, 1)
+        return len(doc)
+
+    def __iter__(self):
+        for text_loc in iter_dir(self.directory):
+            with text_loc.open("r", encoding="utf-8") as file_:
+                text = file_.read()
+            yield text
+
+
+def iter_dir(loc):
+    dir_path = Path(loc)
+    for fn_path in dir_path.iterdir():
+        if fn_path.is_dir():
+            for sub_path in fn_path.iterdir():
+                yield sub_path
+        else:
+            yield fn_path
+
+
+@plac.annotations(
+    lang=("ISO language code"),
+    in_dir=("Location of input directory"),
+    out_loc=("Location of output file"),
+    n_workers=("Number of workers", "option", "n", int),
+    size=("Dimension of the word vectors", "option", "d", int),
+    window=("Context window size", "option", "w", int),
+    min_count=("Min count", "option", "m", int),
+    negative=("Number of negative samples", "option", "g", int),
+    nr_iter=("Number of iterations", "option", "i", int),
+)
+def main(
+    lang,
+    in_dir,
+    out_loc,
+    negative=5,
+    n_workers=4,
+    window=5,
+    size=128,
+    min_count=10,
+    nr_iter=2,
+):
+    logging.basicConfig(
+        format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
+    )
+    model = Word2Vec(
+        size=size,
+        window=window,
+        min_count=min_count,
+        workers=n_workers,
+        sample=1e-5,
+        negative=negative,
+    )
+    nlp = spacy.blank(lang)
+    corpus = Corpus(in_dir)
+    total_words = 0
+    total_sents = 0
+    for text_no, text_loc in enumerate(iter_dir(corpus.directory)):
+        with text_loc.open("r", encoding="utf-8") as file_:
+            text = file_.read()
+        total_sents += text.count("\n")
+        doc = nlp(text)
+        total_words += corpus.count_doc(doc)
+        logger.info(
+            "PROGRESS: at batch #%i, processed %i words, keeping %i word types",
+            text_no,
+            total_words,
+            len(corpus.strings),
+        )
+    model.corpus_count = total_sents
+    model.raw_vocab = defaultdict(int)
+    for orth, freq in corpus.counts:
+        if freq >= min_count:
+            model.raw_vocab[nlp.vocab.strings[orth]] = freq
+    model.scale_vocab()
+    model.finalize_vocab()
+    model.iter = nr_iter
+    model.train(corpus)
+    model.save(out_loc)
+
+
+if __name__ == "__main__":
+    plac.call(main)
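The directory traversal in `iter_dir` above is easy to misread as recursive. The sketch below copies the function and runs it on a temporary, purely hypothetical directory layout to show that it yields top-level files plus the direct children of each top-level subdirectory, and nothing deeper. It assumes Python 3 (`tempfile.TemporaryDirectory`, `Path.write_text`), and `demo` is an illustrative name, not part of the script.

```python
import tempfile
from pathlib import Path


def iter_dir(loc):
    # Same traversal as in the script: top-level files, plus the direct
    # children of each top-level subdirectory (one level only).
    dir_path = Path(loc)
    for fn_path in dir_path.iterdir():
        if fn_path.is_dir():
            for sub_path in fn_path.iterdir():
                yield sub_path
        else:
            yield fn_path


def demo():
    # Hypothetical layout, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("top-level file", encoding="utf-8")
        (root / "sub").mkdir()
        (root / "sub" / "b.txt").write_text("nested file", encoding="utf-8")
        (root / "sub" / "deeper").mkdir()
        (root / "sub" / "deeper" / "c.txt").write_text("too deep", encoding="utf-8")
        # Collect names only, so the result does not depend on the temp path.
        return sorted(path.name for path in iter_dir(root))
```

In this layout `c.txt` is never yielded, while the second-level directory `deeper` is yielded as if it were a file, so a corpus nested more than one level deep would probably need to be flattened before being passed to the script.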
@@ -49,7 +49,7 @@ class SentimentAnalyser(object):
         y = self._model.predict(X)
         self.set_sentiment(doc, y)
 
-    def pipe(self, docs, batch_size=1000, n_threads=2):
+    def pipe(self, docs, batch_size=1000):
         for minibatch in cytoolz.partition_all(batch_size, docs):
             minibatch = list(minibatch)
             sentences = []
@@ -176,7 +176,7 @@ def evaluate(model_dir, texts, labels, max_length=100):
 
     correct = 0
     i = 0
-    for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
+    for doc in nlp.pipe(texts, batch_size=1000):
         correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
         i += 1
     return float(correct) / i
@@ -4,7 +4,7 @@ preshed>=2.0.1,<2.1.0
 thinc>=7.0.2,<7.1.0
 blis>=0.2.2,<0.3.0
 murmurhash>=0.28.0,<1.1.0
-wasabi>=0.0.12,<1.1.0
+wasabi>=0.1.3,<1.1.0
 srsly>=0.0.5,<1.1.0
 # Third party dependencies
 numpy>=1.15.0
@@ -97,6 +97,7 @@ def with_cpu(ops, model):
     """Wrap a model that should run on CPU, transferring inputs and outputs
     as necessary."""
     model.to_cpu()
+
     def with_cpu_forward(inputs, drop=0.):
         cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
         gpu_outputs = _to_device(ops, cpu_outputs)
@@ -4,7 +4,7 @@
 # fmt: off
 
 __title__ = "spacy-nightly"
-__version__ = "2.1.0a10"
+__version__ = "2.1.0a13"
 __summary__ = "Industrial-strength Natural Language Processing (NLP) with Python and Cython"
 __uri__ = "https://spacy.io"
 __author__ = "Explosion AI"
@@ -6,7 +6,7 @@ from pathlib import Path
 from wasabi import Printer
 import srsly
 
-from .converters import conllu2json, conllubio2json, iob2json, conll_ner2json
+from .converters import conllu2json, iob2json, conll_ner2json
 from .converters import ner_jsonl2json
 
 
@@ -14,7 +14,7 @@ from .converters import ner_jsonl2json
 # entry to this dict with the file extension mapped to the converter function
 # imported from /converters.
 CONVERTERS = {
-    "conllubio": conllubio2json,
+    "conllubio": conllu2json,
     "conllu": conllu2json,
     "conll": conllu2json,
     "ner": conll_ner2json,
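The comment in this hunk describes `CONVERTERS` as a mapping from file extension to converter function. A minimal sketch of that kind of lookup is shown below, assuming the extension is taken from the input file name. The two converter functions are stand-ins that only return a label, and `pick_converter` is a hypothetical helper; the real CLI code that consults the dict is not shown in this diff.

```python
def conllu2json(input_data, **kwargs):
    # Stand-in for the real converter; returns a label so the dispatch is visible.
    return ("conllu2json", input_data)


def conll_ner2json(input_data, **kwargs):
    # Stand-in for the real converter.
    return ("conll_ner2json", input_data)


# After this commit, "conllubio" files are handled by conllu2json as well.
CONVERTERS = {
    "conllubio": conllu2json,
    "conllu": conllu2json,
    "conll": conllu2json,
    "ner": conll_ner2json,
}


def pick_converter(file_name):
    """Look up a converter by file extension (hypothetical helper)."""
    if "." not in file_name:
        raise ValueError("no file extension in {!r}".format(file_name))
    extension = file_name.rsplit(".", 1)[-1]
    if extension not in CONVERTERS:
        raise ValueError("no converter registered for .{}".format(extension))
    return CONVERTERS[extension]
```

With this layout, retiring a converter only requires repointing its key, which is what the hunk does for `"conllubio"`.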
@@ -1,5 +1,4 @@
 from .conllu2json import conllu2json  # noqa: F401
-from .conllubio2json import conllubio2json  # noqa: F401
 from .iob2json import iob2json  # noqa: F401
 from .conll_ner2json import conll_ner2json  # noqa: F401
 from .jsonl2json import ner_jsonl2json  # noqa: F401
@@ -71,6 +71,7 @@ def read_conllx(input_data, use_morphology=False, n=0):
                     dep = "ROOT" if dep == "root" else dep
                     tag = pos if tag == "_" else tag
                     tag = tag + "__" + morph if use_morphology else tag
+                    iob = iob if iob else "O"
                     tokens.append((id_, word, tag, head, dep, iob))
                 except:  # noqa: E722
                     print(line)
@@ -1,85 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from ...gold import iob_to_biluo
-
-
-def conllubio2json(input_data, n_sents=10, use_morphology=False, lang=None):
-    """
-    Convert conllu files into JSON format for use with train cli.
-    use_morphology parameter enables appending morphology to tags, which is
-    useful for languages such as Spanish, where UD tags are not so rich.
-    """
-    # by @dvsrepo, via #11 explosion/spacy-dev-resources
-    docs = []
-    sentences = []
-    conll_tuples = read_conllx(input_data, use_morphology=use_morphology)
-    for i, (raw_text, tokens) in enumerate(conll_tuples):
-        sentence, brackets = tokens[0]
-        sentences.append(generate_sentence(sentence))
-        # Real-sized documents could be extracted using the comments on the
-        # conluu document
-        if len(sentences) % n_sents == 0:
-            doc = create_doc(sentences, i)
-            docs.append(doc)
-            sentences = []
-    return docs
-
-
-def read_conllx(input_data, use_morphology=False, n=0):
-    i = 0
-    for sent in input_data.strip().split("\n\n"):
-        lines = sent.strip().split("\n")
-        if lines:
-            while lines[0].startswith("#"):
-                lines.pop(0)
-            tokens = []
-            for line in lines:
-
-                parts = line.split("\t")
-                id_, word, lemma, pos, tag, morph, head, dep, _1, ner = parts
-                if "-" in id_ or "." in id_:
-                    continue
-                try:
-                    id_ = int(id_) - 1
-                    head = (int(head) - 1) if head != "0" else id_
-                    dep = "ROOT" if dep == "root" else dep
-                    tag = pos if tag == "_" else tag
-                    tag = tag + "__" + morph if use_morphology else tag
-                    ner = ner if ner else "O"
-                    tokens.append((id_, word, tag, head, dep, ner))
-                except:  # noqa: E722
-                    print(line)
-                    raise
-            tuples = [list(t) for t in zip(*tokens)]
-            yield (None, [[tuples, []]])
-            i += 1
-            if n >= 1 and i >= n:
-                break
-
-
-def generate_sentence(sent):
-    (id_, word, tag, head, dep, ner) = sent
-    sentence = {}
-    tokens = []
-    ner = iob_to_biluo(ner)
-    for i, id in enumerate(id_):
-        token = {}
-        token["orth"] = word[i]
-        token["tag"] = tag[i]
-        token["head"] = head[i] - id
-        token["dep"] = dep[i]
-        token["ner"] = ner[i]
-        tokens.append(token)
-    sentence["tokens"] = tokens
-    return sentence
-
-
-def create_doc(sentences, id):
-    doc = {}
-    paragraph = {}
-    doc["id"] = id
-    doc["paragraphs"] = []
-    paragraph["sentences"] = sentences
-    doc["paragraphs"].append(paragraph)
-    return doc
@@ -41,24 +41,32 @@ def download(model, direct=False, *pip_args):
         dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
         if dl != 0:  # if download subprocess doesn't return 0, exit
             sys.exit(dl)
-        try:
-            # Get package path here because link uses
-            # pip.get_installed_distributions() to check if model is a
-            # package, which fails if model was just installed via
-            # subprocess
-            package_path = get_package_path(model_name)
-            link(model_name, model, force=True, model_path=package_path)
-        except:  # noqa: E722
-            # Dirty, but since spacy.download and the auto-linking is
-            # mostly a convenience wrapper, it's best to show a success
-            # message and loading instructions, even if linking fails.
-            msg.warn(
-                "Download successful but linking failed",
-                "Creating a shortcut link for 'en' didn't work (maybe you "
-                "don't have admin permissions?), but you can still load the "
-                "model via its full package name: "
-                "nlp = spacy.load('{}')".format(model_name),
-            )
+        msg.good(
+            "Download and installation successful",
+            "You can now load the model via spacy.load('{}')".format(model_name),
+        )
+        # Only create symlink if the model is installed via a shortcut like 'en'.
+        # There's no real advantage over an additional symlink for en_core_web_sm
+        # and if anything, it's more error prone and causes more confusion.
+        if model in shortcuts:
+            try:
+                # Get package path here because link uses
+                # pip.get_installed_distributions() to check if model is a
+                # package, which fails if model was just installed via
+                # subprocess
+                package_path = get_package_path(model_name)
+                link(model_name, model, force=True, model_path=package_path)
+            except:  # noqa: E722
+                # Dirty, but since spacy.download and the auto-linking is
+                # mostly a convenience wrapper, it's best to show a success
+                # message and loading instructions, even if linking fails.
+                msg.warn(
+                    "Download successful but linking failed",
+                    "Creating a shortcut link for '{}' didn't work (maybe you "
+                    "don't have admin permissions?), but you can still load "
+                    "the model via its full package name: "
+                    "nlp = spacy.load('{}')".format(model, model_name),
+                )
 
 
 def get_json(url, desc):
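Two details of the new `download` code are easy to miss: the link is only attempted when the requested name is a shortcut, and the warning text relies on adjacent string literals being joined before `.format` is applied, so both `{}` placeholders are filled by the single call. The sketch below isolates those two points. `should_link` and `linking_failed_message` are hypothetical helper names, and the sample shortcut table is an assumption made for illustration.

```python
def should_link(model, shortcuts):
    # A shortcut link is only attempted when the user asked for a shortcut
    # name such as "en", not for a full package name.
    return model in shortcuts


def linking_failed_message(model, model_name):
    # Adjacent string literals are concatenated at compile time, so the
    # .format call below applies to the whole four-part string.
    return (
        "Creating a shortcut link for '{}' didn't work (maybe you "
        "don't have admin permissions?), but you can still load "
        "the model via its full package name: "
        "nlp = spacy.load('{}')".format(model, model_name)
    )


# Assumed sample data, not taken from the real shortcuts file.
SAMPLE_SHORTCUTS = {"en": "en_core_web_sm"}
```

Under these assumptions, `should_link("en_core_web_sm", SAMPLE_SHORTCUTS)` is false, so a download by full package name would skip the link step entirely.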
@@ -161,7 +161,7 @@ def parse_deps(orig_doc, options={}):
                     "dir": "right",
                 }
             )
-    return {"words": words, "arcs": arcs}
+    return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)}
 
 
 def parse_ents(doc, options={}):
@@ -177,7 +177,8 @@ def parse_ents(doc, options={}):
     if not ents:
         user_warning(Warnings.W006)
     title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
-    return {"text": doc.text, "ents": ents, "title": title}
+    settings = get_doc_settings(doc)
+    return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
 
 
 def set_render_wrapper(func):
@@ -195,3 +196,10 @@ def set_render_wrapper(func):
     if not hasattr(func, "__call__"):
         raise ValueError(Errors.E110.format(obj=type(func)))
     RENDER_WRAPPER = func
+
+
+def get_doc_settings(doc):
+    return {
+        "lang": doc.lang_,
+        "direction": doc.vocab.writing_system.get("direction", "ltr"),
+    }
|
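As a rough illustration of what the new `get_doc_settings` helper returns, the sketch below runs the same function body against a stand-in object. `fake_doc`, `plain_doc` and their `SimpleNamespace` structure are hypothetical and exist only so the snippet runs without spaCy installed. A real `Doc` is assumed to expose `lang_` and `vocab.writing_system` in the way the diff shows.

```python
from types import SimpleNamespace


def get_doc_settings(doc):
    # Same body as the helper added in this diff.
    return {
        "lang": doc.lang_,
        "direction": doc.vocab.writing_system.get("direction", "ltr"),
    }


# Hypothetical stand-in for an Arabic Doc (not a real spaCy object).
fake_doc = SimpleNamespace(
    lang_="ar",
    vocab=SimpleNamespace(
        writing_system={"direction": "rtl", "has_case": False, "has_letters": True}
    ),
)
# Hypothetical stand-in whose writing system has no "direction" key.
plain_doc = SimpleNamespace(lang_="en", vocab=SimpleNamespace(writing_system={}))

settings = get_doc_settings(fake_doc)
fallback = get_doc_settings(plain_doc)
print(settings, fallback)
```

The second call shows the `"ltr"` fallback that applies when a language's `writing_system` carries no `direction` entry.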
@ -3,10 +3,13 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import uuid
|
import uuid
|
||||||
|
|
||||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
|
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
|
||||||
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||||
from ..util import minify_html, escape_html
|
from ..util import minify_html, escape_html
|
||||||
|
|
||||||
|
DEFAULT_LANG = "en"
|
||||||
|
DEFAULT_DIR = "ltr"
|
||||||
|
|
||||||
|
|
||||||
class DependencyRenderer(object):
|
class DependencyRenderer(object):
|
||||||
"""Render dependency parses as SVGs."""
|
"""Render dependency parses as SVGs."""
|
||||||
|
@ -30,6 +33,8 @@ class DependencyRenderer(object):
|
||||||
self.color = options.get("color", "#000000")
|
self.color = options.get("color", "#000000")
|
||||||
self.bg = options.get("bg", "#ffffff")
|
self.bg = options.get("bg", "#ffffff")
|
||||||
self.font = options.get("font", "Arial")
|
self.font = options.get("font", "Arial")
|
||||||
|
self.direction = DEFAULT_DIR
|
||||||
|
self.lang = DEFAULT_LANG
|
||||||
|
|
||||||
def render(self, parsed, page=False, minify=False):
|
def render(self, parsed, page=False, minify=False):
|
||||||
"""Render complete markup.
|
"""Render complete markup.
|
||||||
|
@ -42,13 +47,19 @@ class DependencyRenderer(object):
|
||||||
# Create a random ID prefix to make sure parses don't receive the
|
# Create a random ID prefix to make sure parses don't receive the
|
||||||
# same ID, even if they're identical
|
# same ID, even if they're identical
|
||||||
id_prefix = uuid.uuid4().hex
|
id_prefix = uuid.uuid4().hex
|
||||||
rendered = [
|
rendered = []
|
||||||
self.render_svg("{}-{}".format(id_prefix, i), p["words"], p["arcs"])
|
for i, p in enumerate(parsed):
|
||||||
for i, p in enumerate(parsed)
|
if i == 0:
|
||||||
]
|
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||||
|
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||||
|
render_id = "{}-{}".format(id_prefix, i)
|
||||||
|
svg = self.render_svg(render_id, p["words"], p["arcs"])
|
||||||
|
rendered.append(svg)
|
||||||
if page:
|
if page:
|
||||||
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
|
content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
|
||||||
markup = TPL_PAGE.format(content=content)
|
markup = TPL_PAGE.format(
|
||||||
|
content=content, lang=self.lang, dir=self.direction
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
markup = "".join(rendered)
|
markup = "".join(rendered)
|
||||||
if minify:
|
if minify:
|
||||||
|
@ -83,6 +94,8 @@ class DependencyRenderer(object):
|
||||||
bg=self.bg,
|
bg=self.bg,
|
||||||
font=self.font,
|
font=self.font,
|
||||||
content=content,
|
content=content,
|
||||||
|
dir=self.direction,
|
||||||
|
lang=self.lang,
|
||||||
)
|
)
|
||||||
|
|
||||||
def render_word(self, text, tag, i):
|
def render_word(self, text, tag, i):
|
||||||
|
@ -95,11 +108,13 @@ class DependencyRenderer(object):
|
||||||
"""
|
"""
|
||||||
y = self.offset_y + self.word_spacing
|
y = self.offset_y + self.word_spacing
|
||||||
x = self.offset_x + i * self.distance
|
x = self.offset_x + i * self.distance
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x = self.width - x
|
||||||
html_text = escape_html(text)
|
html_text = escape_html(text)
|
||||||
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y)
|
||||||
|
|
||||||
def render_arrow(self, label, start, end, direction, i):
|
def render_arrow(self, label, start, end, direction, i):
|
||||||
"""Render indivicual arrow.
|
"""Render individual arrow.
|
||||||
|
|
||||||
label (unicode): Dependency label.
|
label (unicode): Dependency label.
|
||||||
start (int): Index of start word.
|
start (int): Index of start word.
|
||||||
|
@ -110,6 +125,8 @@ class DependencyRenderer(object):
|
||||||
"""
|
"""
|
||||||
level = self.levels.index(end - start) + 1
|
level = self.levels.index(end - start) + 1
|
||||||
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x_start = self.width - x_start
|
||||||
y = self.offset_y
|
y = self.offset_y
|
||||||
x_end = (
|
x_end = (
|
||||||
self.offset_x
|
self.offset_x
|
||||||
|
@ -117,6 +134,8 @@ class DependencyRenderer(object):
|
||||||
+ start * self.distance
|
+ start * self.distance
|
||||||
- self.arrow_spacing * (self.highest_level - level) / 4
|
- self.arrow_spacing * (self.highest_level - level) / 4
|
||||||
)
|
)
|
||||||
|
if self.direction == "rtl":
|
||||||
|
x_end = self.width - x_end
|
||||||
y_curve = self.offset_y - level * self.distance / 2
|
y_curve = self.offset_y - level * self.distance / 2
|
||||||
if self.compact:
|
if self.compact:
|
||||||
y_curve = self.offset_y - level * self.distance / 6
|
y_curve = self.offset_y - level * self.distance / 6
|
||||||
|
@ -124,12 +143,14 @@ class DependencyRenderer(object):
|
||||||
y_curve = -self.distance
|
y_curve = -self.distance
|
||||||
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
||||||
arc = self.get_arc(x_start, y, y_curve, x_end)
|
arc = self.get_arc(x_start, y, y_curve, x_end)
|
||||||
|
label_side = "right" if self.direction == "rtl" else "left"
|
||||||
return TPL_DEP_ARCS.format(
|
return TPL_DEP_ARCS.format(
|
||||||
id=self.id,
|
id=self.id,
|
||||||
i=i,
|
i=i,
|
||||||
stroke=self.arrow_stroke,
|
stroke=self.arrow_stroke,
|
||||||
head=arrowhead,
|
head=arrowhead,
|
||||||
label=label,
|
label=label,
|
||||||
|
label_side=label_side,
|
||||||
arc=arc,
|
arc=arc,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
@ -219,6 +240,8 @@ class EntityRenderer(object):
|
||||||
self.default_color = "#ddd"
|
self.default_color = "#ddd"
|
||||||
self.colors = colors
|
self.colors = colors
|
||||||
self.ents = options.get("ents", None)
|
self.ents = options.get("ents", None)
|
||||||
|
self.direction = DEFAULT_DIR
|
||||||
|
self.lang = DEFAULT_LANG
|
||||||
|
|
||||||
def render(self, parsed, page=False, minify=False):
|
def render(self, parsed, page=False, minify=False):
|
||||||
"""Render complete markup.
|
"""Render complete markup.
|
||||||
|
@ -228,12 +251,15 @@ class EntityRenderer(object):
|
||||||
minify (bool): Minify HTML markup.
|
minify (bool): Minify HTML markup.
|
||||||
RETURNS (unicode): Rendered HTML markup.
|
RETURNS (unicode): Rendered HTML markup.
|
||||||
"""
|
"""
|
||||||
rendered = [
|
rendered = []
|
||||||
self.render_ents(p["text"], p["ents"], p.get("title", None)) for p in parsed
|
for i, p in enumerate(parsed):
|
||||||
]
|
if i == 0:
|
||||||
|
self.direction = p["settings"].get("direction", DEFAULT_DIR)
|
||||||
|
self.lang = p["settings"].get("lang", DEFAULT_LANG)
|
||||||
|
rendered.append(self.render_ents(p["text"], p["ents"], p["title"]))
|
||||||
if page:
|
if page:
|
||||||
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
|
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
|
||||||
markup = TPL_PAGE.format(content=docs)
|
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
|
||||||
else:
|
else:
|
||||||
markup = "".join(rendered)
|
markup = "".join(rendered)
|
||||||
if minify:
|
if minify:
|
||||||
|
@ -261,12 +287,16 @@ class EntityRenderer(object):
|
||||||
markup += "</br>"
|
markup += "</br>"
|
||||||
if self.ents is None or label.upper() in self.ents:
|
if self.ents is None or label.upper() in self.ents:
|
||||||
color = self.colors.get(label.upper(), self.default_color)
|
color = self.colors.get(label.upper(), self.default_color)
|
||||||
markup += TPL_ENT.format(label=label, text=entity, bg=color)
|
ent_settings = {"label": label, "text": entity, "bg": color}
|
||||||
|
if self.direction == "rtl":
|
||||||
|
markup += TPL_ENT_RTL.format(**ent_settings)
|
||||||
|
else:
|
||||||
|
markup += TPL_ENT.format(**ent_settings)
|
||||||
else:
|
else:
|
||||||
markup += entity
|
markup += entity
|
||||||
offset = end
|
offset = end
|
||||||
markup += escape_html(text[offset:])
|
markup += escape_html(text[offset:])
|
||||||
markup = TPL_ENTS.format(content=markup, colors=self.colors)
|
markup = TPL_ENTS.format(content=markup, dir=self.direction)
|
||||||
if title:
|
if title:
|
||||||
markup = TPL_TITLE.format(title=title) + markup
|
markup = TPL_TITLE.format(title=title) + markup
|
||||||
return markup
|
return markup
|
||||||
|
|
|
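The right-to-left handling in `render_word` and `render_arrow` above comes down to mirroring each x coordinate against the SVG width. The sketch below isolates that step. The function name `word_x` and the numeric values are illustrative assumptions, not displaCy's defaults.

```python
def word_x(i, offset_x, distance, width, direction="ltr"):
    """Return the x position of the i-th word, mirrored for RTL text."""
    x = offset_x + i * distance
    if direction == "rtl":
        # Same mirroring step as `x = self.width - x` in the diff.
        x = width - x
    return x


# Illustrative numbers only (assumed, not taken from the renderer).
OFFSET_X, DISTANCE, WIDTH = 50, 100, 400

ltr_positions = [word_x(i, OFFSET_X, DISTANCE, WIDTH) for i in range(3)]
rtl_positions = [word_x(i, OFFSET_X, DISTANCE, WIDTH, "rtl") for i in range(3)]
print(ltr_positions, rtl_positions)
```

With these values the first word moves from the left edge to the right edge, and later words follow in descending x order, which is the reading order an RTL script needs.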
@ -6,7 +6,7 @@ from __future__ import unicode_literals
|
||||||
# Jupyter to render it properly in a cell
|
# Jupyter to render it properly in a cell
|
||||||
|
|
||||||
TPL_DEP_SVG = """
|
TPL_DEP_SVG = """
|
||||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg>
|
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="{lang}" id="{id}" class="displacy" width="{width}" height="{height}" direction="{dir}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}; direction: {dir}">{content}</svg>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@ -22,7 +22,7 @@ TPL_DEP_ARCS = """
|
||||||
<g class="displacy-arrow">
|
<g class="displacy-arrow">
|
||||||
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
<path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
|
||||||
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
|
<text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
|
||||||
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath>
|
<textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" side="{label_side}" fill="currentColor" text-anchor="middle">{label}</textPath>
|
||||||
</text>
|
</text>
|
||||||
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
|
<path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
|
||||||
</g>
|
</g>
|
||||||
|
@ -39,7 +39,7 @@ TPL_TITLE = """
|
||||||
|
|
||||||
|
|
||||||
TPL_ENTS = """
|
TPL_ENTS = """
|
||||||
<div class="entities" style="line-height: 2.5">{content}</div>
|
<div class="entities" style="line-height: 2.5; direction: {dir}">{content}</div>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@ -50,14 +50,21 @@ TPL_ENT = """
|
||||||
</mark>
|
</mark>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
TPL_ENT_RTL = """
|
||||||
|
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
|
||||||
|
{text}
|
||||||
|
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-right: 0.5rem">{label}</span>
|
||||||
|
</mark>
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
TPL_PAGE = """
|
TPL_PAGE = """
|
||||||
<!DOCTYPE html>
|
<!DOCTYPE html>
|
||||||
<html>
|
<html lang="{lang}">
|
||||||
<head>
|
<head>
|
||||||
<title>displaCy</title>
|
<title>displaCy</title>
|
||||||
</head>
|
</head>
|
||||||
|
|
||||||
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body>
|
<body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: {dir}">{content}</body>
|
||||||
</html>
|
</html>
|
||||||
"""
|
"""
|
||||||
|
|
|
@ -70,6 +70,16 @@ class Warnings(object):
|
||||||
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
|
W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
|
||||||
"efficient and less error-prone Doc.retokenize context manager "
|
"efficient and less error-prone Doc.retokenize context manager "
|
||||||
"instead.")
|
"instead.")
|
||||||
|
W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
|
||||||
|
"methods is and should be replaced with `exclude`. This makes it "
|
||||||
|
"consistent with the other objects serializable.")
|
||||||
|
W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
|
||||||
|
"being serialized or deserialized is deprecated. Please use the "
|
||||||
|
"`exclude` argument instead. For example: exclude=['{arg}'].")
|
||||||
|
W016 = ("The keyword argument `n_threads` on the is now deprecated, as "
|
||||||
|
"the v2.x models cannot release the global interpreter lock. "
|
||||||
|
"Future versions may introduce a `n_process` argument for "
|
||||||
|
"parallel inference via multiprocessing.")
|
||||||
|
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
|
@ -348,7 +358,15 @@ class Errors(object):
|
||||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||||
E127 = ("Cannot create phrase pattern representation for length 0. This "
|
E127 = ("Cannot create phrase pattern representation for length 0. This "
|
||||||
"is likely a bug in spaCy.")
|
"is likely a bug in spaCy.")
|
||||||
|
E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
|
||||||
|
"arguments to exclude fields from being serialized or deserialized "
|
||||||
|
"is now deprecated. Please use the `exclude` argument instead. "
|
||||||
|
"For example: exclude=['{arg}'].")
|
||||||
|
E129 = ("Cannot write the label of an existing Span object because a Span "
|
||||||
|
"is a read-only view of the underlying Token objects stored in the Doc. "
|
||||||
|
"Instead, create a new Span object and specify the `label` keyword argument, "
|
||||||
|
"for example:\nfrom spacy.tokens import Span\n"
|
||||||
|
"span = Span(doc, start={start}, end={end}, label='{label}')")
|
||||||
|
|
||||||
@add_codes
|
@add_codes
|
||||||
class TempErrors(object):
|
class TempErrors(object):
|
||||||
|
|
|
@ -23,6 +23,7 @@ class ArabicDefaults(Language.Defaults):
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
suffixes = TOKENIZER_SUFFIXES
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||||
|
|
||||||
|
|
||||||
class Arabic(Language):
|
class Arabic(Language):
|
||||||
|
|
|
@ -34,10 +34,10 @@ TAG_MAP = {
|
||||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||||
"NNS": {POS: NOUN, "Number": "plur"},
|
"NNS": {POS: NOUN, "Number": "plur"},
|
||||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
"PDT": {POS: DET, "AdjType": "pdt", "PronType": "prn"},
|
||||||
"POS": {POS: PART, "Poss": "yes"},
|
"POS": {POS: PART, "Poss": "yes"},
|
||||||
"PRP": {POS: PRON, "PronType": "prs"},
|
"PRP": {POS: PRON, "PronType": "prs"},
|
||||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
"PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"},
|
||||||
"RB": {POS: ADV, "Degree": "pos"},
|
"RB": {POS: ADV, "Degree": "pos"},
|
||||||
"RBR": {POS: ADV, "Degree": "comp"},
|
"RBR": {POS: ADV, "Degree": "comp"},
|
||||||
"RBS": {POS: ADV, "Degree": "sup"},
|
"RBS": {POS: ADV, "Degree": "sup"},
|
||||||
|
|
|
@ -27,6 +27,7 @@ class PersianDefaults(Language.Defaults):
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
suffixes = TOKENIZER_SUFFIXES
|
suffixes = TOKENIZER_SUFFIXES
|
||||||
|
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||||
|
|
||||||
|
|
||||||
class Persian(Language):
|
class Persian(Language):
|
||||||
|
|
|
@ -14,6 +14,7 @@ class HebrewDefaults(Language.Defaults):
|
||||||
lex_attr_getters[LANG] = lambda text: "he"
|
lex_attr_getters[LANG] = lambda text: "he"
|
||||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
|
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS)
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}
|
||||||
|
|
||||||
|
|
||||||
class Hebrew(Language):
|
class Hebrew(Language):
|
||||||
|
|
|
@ -8,15 +8,13 @@ from .stop_words import STOP_WORDS
|
||||||
from .tag_map import TAG_MAP
|
from .tag_map import TAG_MAP
|
||||||
from ...attrs import LANG
|
from ...attrs import LANG
|
||||||
from ...language import Language
|
from ...language import Language
|
||||||
from ...tokens import Doc, Token
|
from ...tokens import Doc
|
||||||
|
from ...compat import copy_reg
|
||||||
from ...util import DummyTokenizer
|
from ...util import DummyTokenizer
|
||||||
|
|
||||||
|
|
||||||
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
|
ShortUnitWord = namedtuple("ShortUnitWord", ["surface", "lemma", "pos"])
|
||||||
|
|
||||||
# TODO: Is this the right place for this?
|
|
||||||
Token.set_extension("mecab_tag", default=None)
|
|
||||||
|
|
||||||
|
|
||||||
def try_mecab_import():
|
def try_mecab_import():
|
||||||
"""Mecab is required for Japanese support, so check for it.
|
"""Mecab is required for Japanese support, so check for it.
|
||||||
|
@ -81,10 +79,12 @@ class JapaneseTokenizer(DummyTokenizer):
|
||||||
words = [x.surface for x in dtokens]
|
words = [x.surface for x in dtokens]
|
||||||
spaces = [False] * len(words)
|
spaces = [False] * len(words)
|
||||||
doc = Doc(self.vocab, words=words, spaces=spaces)
|
doc = Doc(self.vocab, words=words, spaces=spaces)
|
||||||
|
mecab_tags = []
|
||||||
for token, dtoken in zip(doc, dtokens):
|
for token, dtoken in zip(doc, dtokens):
|
||||||
token._.mecab_tag = dtoken.pos
|
mecab_tags.append(dtoken.pos)
|
||||||
token.tag_ = resolve_pos(dtoken)
|
token.tag_ = resolve_pos(dtoken)
|
||||||
token.lemma_ = dtoken.lemma
|
token.lemma_ = dtoken.lemma
|
||||||
|
doc.user_data["mecab_tags"] = mecab_tags
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
|
|
||||||
|
@ -93,6 +93,7 @@ class JapaneseDefaults(Language.Defaults):
|
||||||
lex_attr_getters[LANG] = lambda _text: "ja"
|
lex_attr_getters[LANG] = lambda _text: "ja"
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
|
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def create_tokenizer(cls, nlp=None):
|
def create_tokenizer(cls, nlp=None):
|
||||||
|
@ -107,4 +108,11 @@ class Japanese(Language):
|
||||||
return self.tokenizer(text)
|
return self.tokenizer(text)
|
||||||
|
|
||||||
|
|
||||||
|
def pickle_japanese(instance):
|
||||||
|
return Japanese, tuple()
|
||||||
|
|
||||||
|
|
||||||
|
copy_reg.pickle(Japanese, pickle_japanese)
|
||||||
|
|
||||||
|
|
||||||
__all__ = ["Japanese"]
|
__all__ = ["Japanese"]
|
||||||
|
|
|
@ -14,6 +14,7 @@ class ChineseDefaults(Language.Defaults):
|
||||||
use_jieba = True
|
use_jieba = True
|
||||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||||
|
|
||||||
|
|
||||||
class Chinese(Language):
|
class Chinese(Language):
|
||||||
|
|
|
@ -29,7 +29,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
|
||||||
from .lang.tokenizer_exceptions import TOKEN_MATCH
|
from .lang.tokenizer_exceptions import TOKEN_MATCH
|
||||||
from .lang.tag_map import TAG_MAP
|
from .lang.tag_map import TAG_MAP
|
||||||
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
from .lang.lex_attrs import LEX_ATTRS, is_stop
|
||||||
from .errors import Errors
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
from . import util
|
from . import util
|
||||||
from . import about
|
from . import about
|
||||||
|
|
||||||
|
@ -95,6 +95,7 @@ class BaseDefaults(object):
|
||||||
morph_rules = {}
|
morph_rules = {}
|
||||||
lex_attr_getters = LEX_ATTRS
|
lex_attr_getters = LEX_ATTRS
|
||||||
syntax_iterators = {}
|
syntax_iterators = {}
|
||||||
|
writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
|
||||||
|
|
||||||
|
|
||||||
class Language(object):
|
class Language(object):
|
||||||
|
@ -107,6 +108,7 @@ class Language(object):
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/language
|
DOCS: https://spacy.io/api/language
|
||||||
"""
|
"""
|
||||||
|
|
||||||
Defaults = BaseDefaults
|
Defaults = BaseDefaults
|
||||||
lang = None
|
lang = None
|
||||||
|
|
||||||
|
@ -195,6 +197,7 @@ class Language(object):
|
||||||
self._meta = value
|
self._meta = value
|
||||||
|
|
||||||
# Conveniences to access pipeline components
|
# Conveniences to access pipeline components
|
||||||
|
# Shouldn't be used anymore!
|
||||||
@property
|
@property
|
||||||
def tensorizer(self):
|
def tensorizer(self):
|
||||||
return self.get_pipe("tensorizer")
|
return self.get_pipe("tensorizer")
|
||||||
|
@ -228,6 +231,8 @@ class Language(object):
|
||||||
|
|
||||||
name (unicode): Name of pipeline component to get.
|
name (unicode): Name of pipeline component to get.
|
||||||
RETURNS (callable): The pipeline component.
|
RETURNS (callable): The pipeline component.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#get_pipe
|
||||||
"""
|
"""
|
||||||
for pipe_name, component in self.pipeline:
|
for pipe_name, component in self.pipeline:
|
||||||
if pipe_name == name:
|
if pipe_name == name:
|
||||||
|
@ -240,6 +245,8 @@ class Language(object):
|
||||||
name (unicode): Factory name to look up in `Language.factories`.
|
name (unicode): Factory name to look up in `Language.factories`.
|
||||||
config (dict): Configuration parameters to initialise component.
|
config (dict): Configuration parameters to initialise component.
|
||||||
RETURNS (callable): Pipeline component.
|
RETURNS (callable): Pipeline component.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#create_pipe
|
||||||
"""
|
"""
|
||||||
if name not in self.factories:
|
if name not in self.factories:
|
||||||
if name == "sbd":
|
if name == "sbd":
|
||||||
|
@ -266,9 +273,7 @@ class Language(object):
|
||||||
first (bool): Insert component first / not first in the pipeline.
|
first (bool): Insert component first / not first in the pipeline.
|
||||||
last (bool): Insert component last / not last in the pipeline.
|
last (bool): Insert component last / not last in the pipeline.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#add_pipe
|
||||||
>>> nlp.add_pipe(component, before='ner')
|
|
||||||
>>> nlp.add_pipe(component, name='custom_name', last=True)
|
|
||||||
"""
|
"""
|
||||||
if not hasattr(component, "__call__"):
|
if not hasattr(component, "__call__"):
|
||||||
msg = Errors.E003.format(component=repr(component), name=name)
|
msg = Errors.E003.format(component=repr(component), name=name)
|
||||||
|
@ -310,6 +315,8 @@ class Language(object):
|
||||||
|
|
||||||
name (unicode): Name of the component.
|
name (unicode): Name of the component.
|
||||||
RETURNS (bool): Whether a component of the name exists in the pipeline.
|
RETURNS (bool): Whether a component of the name exists in the pipeline.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#has_pipe
|
||||||
"""
|
"""
|
||||||
return name in self.pipe_names
|
return name in self.pipe_names
|
||||||
|
|
||||||
|
@ -318,6 +325,8 @@ class Language(object):
|
||||||
|
|
||||||
name (unicode): Name of the component to replace.
|
name (unicode): Name of the component to replace.
|
||||||
component (callable): Pipeline component.
|
component (callable): Pipeline component.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#replace_pipe
|
||||||
"""
|
"""
|
||||||
if name not in self.pipe_names:
|
if name not in self.pipe_names:
|
||||||
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||||
|
@ -328,6 +337,8 @@ class Language(object):
|
||||||
|
|
||||||
old_name (unicode): Name of the component to rename.
|
old_name (unicode): Name of the component to rename.
|
||||||
new_name (unicode): New name of the component.
|
new_name (unicode): New name of the component.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#rename_pipe
|
||||||
"""
|
"""
|
||||||
if old_name not in self.pipe_names:
|
if old_name not in self.pipe_names:
|
||||||
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
|
raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names))
|
||||||
|
@ -341,36 +352,39 @@ class Language(object):
|
||||||
|
|
||||||
name (unicode): Name of the component to remove.
|
name (unicode): Name of the component to remove.
|
||||||
RETURNS (tuple): A `(name, component)` tuple of the removed component.
|
RETURNS (tuple): A `(name, component)` tuple of the removed component.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#remove_pipe
|
||||||
"""
|
"""
|
||||||
if name not in self.pipe_names:
|
if name not in self.pipe_names:
|
||||||
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names))
|
||||||
return self.pipeline.pop(self.pipe_names.index(name))
|
return self.pipeline.pop(self.pipe_names.index(name))
|
||||||
|
|
||||||
def __call__(self, text, disable=[]):
|
def __call__(self, text, disable=[], component_cfg=None):
|
||||||
"""Apply the pipeline to some text. The text can span multiple sentences,
|
"""Apply the pipeline to some text. The text can span multiple sentences,
|
||||||
and can contain arbitrary whitespace. Alignment into the original string
|
and can contain arbitrary whitespace. Alignment into the original string
|
||||||
is preserved.
|
is preserved.
|
||||||
|
|
||||||
text (unicode): The text to be processed.
|
text (unicode): The text to be processed.
|
||||||
disable (list): Names of the pipeline components to disable.
|
disable (list): Names of the pipeline components to disable.
|
||||||
|
component_cfg (dict): An optional dictionary with extra keyword arguments
|
||||||
|
for specific components.
|
||||||
RETURNS (Doc): A container for accessing the annotations.
|
RETURNS (Doc): A container for accessing the annotations.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#call
|
||||||
>>> tokens = nlp('An example sentence. Another example sentence.')
|
|
||||||
>>> tokens[0].text, tokens[0].head.tag_
|
|
||||||
('An', 'NN')
|
|
||||||
"""
|
"""
|
||||||
if len(text) > self.max_length:
|
if len(text) > self.max_length:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
Errors.E088.format(length=len(text), max_length=self.max_length)
|
Errors.E088.format(length=len(text), max_length=self.max_length)
|
||||||
)
|
)
|
||||||
doc = self.make_doc(text)
|
doc = self.make_doc(text)
|
||||||
|
if component_cfg is None:
|
||||||
|
component_cfg = {}
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if name in disable:
|
if name in disable:
|
||||||
continue
|
continue
|
||||||
if not hasattr(proc, "__call__"):
|
if not hasattr(proc, "__call__"):
|
||||||
raise ValueError(Errors.E003.format(component=type(proc), name=name))
|
raise ValueError(Errors.E003.format(component=type(proc), name=name))
|
||||||
doc = proc(doc)
|
doc = proc(doc, **component_cfg.get(name, {}))
|
||||||
if doc is None:
|
if doc is None:
|
||||||
raise ValueError(Errors.E005.format(name=name))
|
raise ValueError(Errors.E005.format(name=name))
|
||||||
return doc
|
return doc
|
||||||
|
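The `component_cfg` argument added to `__call__` above routes extra keyword arguments to individual components by name. A minimal sketch of that dispatch follows, using plain strings in place of `Doc` objects. `run_pipeline`, `upper` and `repeat` are hypothetical names made up for this illustration and are not part of spaCy.

```python
def run_pipeline(doc, pipeline, disable=(), component_cfg=None):
    """Apply each (name, callable) pair in order, passing per-component kwargs."""
    if component_cfg is None:
        component_cfg = {}
    for name, proc in pipeline:
        if name in disable:
            continue
        # Components without an entry receive no extra arguments.
        doc = proc(doc, **component_cfg.get(name, {}))
        if doc is None:
            raise ValueError("Component '{}' returned None".format(name))
    return doc


# Toy components standing in for real pipeline components.
def upper(doc):
    return doc.upper()


def repeat(doc, times=1):
    return doc * times


pipeline = [("upper", upper), ("repeat", repeat)]
result = run_pipeline("ab", pipeline, component_cfg={"repeat": {"times": 2}})
skipped = run_pipeline(
    "ab", pipeline, disable=["upper"], component_cfg={"repeat": {"times": 2}}
)
print(result, skipped)
```

Defaulting `component_cfg` to `None` and replacing it with a fresh dict inside the function, as the diff does, avoids sharing one mutable default across calls.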
@ -381,24 +395,14 @@ class Language(object):
|
||||||
of the block. Otherwise, a DisabledPipes object is returned, that has
|
of the block. Otherwise, a DisabledPipes object is returned, that has
|
||||||
a `.restore()` method you can use to undo your changes.
|
a `.restore()` method you can use to undo your changes.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#disable_pipes
|
||||||
>>> nlp.add_pipe('parser')
|
|
||||||
>>> nlp.add_pipe('tagger')
|
|
||||||
>>> with nlp.disable_pipes('parser', 'tagger'):
|
|
||||||
>>> assert not nlp.has_pipe('parser')
|
|
||||||
>>> assert nlp.has_pipe('parser')
|
|
||||||
>>> disabled = nlp.disable_pipes('parser')
|
|
||||||
>>> assert len(disabled) == 1
|
|
||||||
>>> assert not nlp.has_pipe('parser')
|
|
||||||
>>> disabled.restore()
|
|
||||||
>>> assert nlp.has_pipe('parser')
|
|
||||||
"""
|
"""
|
||||||
return DisabledPipes(self, *names)
|
return DisabledPipes(self, *names)
|
||||||
|
|
||||||
def make_doc(self, text):
|
def make_doc(self, text):
|
||||||
return self.tokenizer(text)
|
return self.tokenizer(text)
|
||||||
|
|
||||||
def update(self, docs, golds, drop=0.0, sgd=None, losses=None):
|
def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None):
|
||||||
"""Update the models in the pipeline.
|
"""Update the models in the pipeline.
|
||||||
|
|
||||||
docs (iterable): A batch of `Doc` objects.
|
docs (iterable): A batch of `Doc` objects.
|
||||||
|
@ -407,11 +411,7 @@ class Language(object):
|
||||||
sgd (callable): An optimizer.
|
sgd (callable): An optimizer.
|
||||||
RETURNS (dict): Results from the update.
|
RETURNS (dict): Results from the update.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#update
|
||||||
>>> with nlp.begin_training(gold) as (trainer, optimizer):
|
|
||||||
>>> for epoch in trainer.epochs(gold):
|
|
||||||
>>> for docs, golds in epoch:
|
|
||||||
>>> state = nlp.update(docs, golds, sgd=optimizer)
|
|
||||||
"""
|
"""
|
||||||
if len(docs) != len(golds):
|
if len(docs) != len(golds):
|
||||||
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
|
raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds)))
|
||||||
|
@ -421,7 +421,6 @@ class Language(object):
|
||||||
if self._optimizer is None:
|
if self._optimizer is None:
|
||||||
self._optimizer = create_default_optimizer(Model.ops)
|
self._optimizer = create_default_optimizer(Model.ops)
|
||||||
sgd = self._optimizer
|
sgd = self._optimizer
|
||||||
|
|
||||||
# Allow dict of args to GoldParse, instead of GoldParse objects.
|
# Allow dict of args to GoldParse, instead of GoldParse objects.
|
||||||
gold_objs = []
|
gold_objs = []
|
||||||
doc_objs = []
|
doc_objs = []
|
||||||
|
@ -442,14 +441,17 @@ class Language(object):
|
||||||
get_grads.alpha = sgd.alpha
|
get_grads.alpha = sgd.alpha
|
||||||
get_grads.b1 = sgd.b1
|
get_grads.b1 = sgd.b1
|
||||||
get_grads.b2 = sgd.b2
|
get_grads.b2 = sgd.b2
|
||||||
|
|
||||||
pipes = list(self.pipeline)
|
pipes = list(self.pipeline)
|
||||||
random.shuffle(pipes)
|
random.shuffle(pipes)
|
||||||
|
if component_cfg is None:
|
||||||
|
component_cfg = {}
|
||||||
for name, proc in pipes:
|
for name, proc in pipes:
|
||||||
if not hasattr(proc, "update"):
|
if not hasattr(proc, "update"):
|
||||||
continue
|
continue
|
||||||
grads = {}
|
grads = {}
|
||||||
proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
|
kwargs = component_cfg.get(name, {})
|
||||||
|
kwargs.setdefault("drop", drop)
|
||||||
|
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
|
||||||
for key, (W, dW) in grads.items():
|
for key, (W, dW) in grads.items():
|
||||||
sgd(W, dW, key=key)
|
sgd(W, dW, key=key)
|
||||||
|
|
||||||
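The hunk above looks up per-component keyword arguments in `component_cfg` and falls back to the top-level `drop` value through `setdefault`. A minimal sketch of that lookup follows. The helper name `resolve_component_kwargs` is hypothetical and not part of spaCy. Unlike the diff, it copies the per-component dict so that `setdefault` does not write into the caller's config.

```python
def resolve_component_kwargs(component_cfg, name, **defaults):
    """Return keyword arguments for one component (hypothetical helper)."""
    if component_cfg is None:
        component_cfg = {}
    # Copy so that setdefault below never mutates the caller's config.
    kwargs = dict(component_cfg.get(name, {}))
    for key, value in defaults.items():
        # A component-specific value wins over the top-level default.
        kwargs.setdefault(key, value)
    return kwargs


cfg = {"tagger": {"drop": 0.5}}
tagger_kwargs = resolve_component_kwargs(cfg, "tagger", drop=0.2)
parser_kwargs = resolve_component_kwargs(cfg, "parser", drop=0.2)
no_cfg_kwargs = resolve_component_kwargs(None, "ner", drop=0.1)
```

Here `tagger` keeps its own `drop`, while `parser` and `ner` receive the top-level value.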
|
@ -473,6 +475,7 @@ class Language(object):
|
||||||
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
|
>>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)]
|
||||||
>>> nlp.rehearse(raw_batch)
|
>>> nlp.rehearse(raw_batch)
|
||||||
"""
|
"""
|
||||||
|
# TODO: document
|
||||||
if len(docs) == 0:
|
if len(docs) == 0:
|
||||||
return
|
return
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
|
@ -495,7 +498,6 @@ class Language(object):
|
||||||
get_grads.alpha = sgd.alpha
|
get_grads.alpha = sgd.alpha
|
||||||
get_grads.b1 = sgd.b1
|
get_grads.b1 = sgd.b1
|
||||||
get_grads.b2 = sgd.b2
|
get_grads.b2 = sgd.b2
|
||||||
|
|
||||||
for name, proc in pipes:
|
for name, proc in pipes:
|
||||||
if not hasattr(proc, "rehearse"):
|
if not hasattr(proc, "rehearse"):
|
||||||
continue
|
continue
|
||||||
|
@ -503,7 +505,6 @@ class Language(object):
|
||||||
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
|
proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {}))
|
||||||
for key, (W, dW) in grads.items():
|
for key, (W, dW) in grads.items():
|
||||||
sgd(W, dW, key=key)
|
sgd(W, dW, key=key)
|
||||||
|
|
||||||
return losses
|
return losses
|
||||||
|
|
||||||
def preprocess_gold(self, docs_golds):
|
def preprocess_gold(self, docs_golds):
|
||||||
|
@ -519,13 +520,16 @@ class Language(object):
|
||||||
for doc, gold in docs_golds:
|
for doc, gold in docs_golds:
|
||||||
yield doc, gold
|
yield doc, gold
|
||||||
|
|
||||||
def begin_training(self, get_gold_tuples=None, sgd=None, **cfg):
|
def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg):
|
||||||
"""Allocate models, pre-process training data and acquire a trainer and
|
"""Allocate models, pre-process training data and acquire a trainer and
|
||||||
optimizer. Used as a contextmanager.
|
optimizer. Used as a contextmanager.
|
||||||
|
|
||||||
get_gold_tuples (function): Function returning gold data
|
get_gold_tuples (function): Function returning gold data
|
||||||
|
component_cfg (dict): Config parameters for specific components.
|
||||||
**cfg: Config parameters.
|
**cfg: Config parameters.
|
||||||
RETURNS: An optimizer
|
RETURNS: An optimizer.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#begin_training
|
||||||
"""
|
"""
|
||||||
if get_gold_tuples is None:
|
if get_gold_tuples is None:
|
||||||
get_gold_tuples = lambda: []
|
get_gold_tuples = lambda: []
|
||||||
|
@ -545,10 +549,17 @@ class Language(object):
|
||||||
if sgd is None:
|
if sgd is None:
|
||||||
sgd = create_default_optimizer(Model.ops)
|
sgd = create_default_optimizer(Model.ops)
|
||||||
self._optimizer = sgd
|
self._optimizer = sgd
|
||||||
|
if component_cfg is None:
|
||||||
|
component_cfg = {}
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if hasattr(proc, "begin_training"):
|
if hasattr(proc, "begin_training"):
|
||||||
|
kwargs = component_cfg.get(name, {})
|
||||||
|
kwargs.update(cfg)
|
||||||
proc.begin_training(
|
proc.begin_training(
|
||||||
get_gold_tuples, pipeline=self.pipeline, sgd=self._optimizer, **cfg
|
get_gold_tuples,
|
||||||
|
pipeline=self.pipeline,
|
||||||
|
sgd=self._optimizer,
|
||||||
|
**kwargs
|
||||||
)
|
)
|
||||||
return self._optimizer
|
return self._optimizer
|
||||||
|
|
||||||
|
@ -576,20 +587,27 @@ class Language(object):
|
||||||
proc._rehearsal_model = deepcopy(proc.model)
|
proc._rehearsal_model = deepcopy(proc.model)
|
||||||
return self._optimizer
|
return self._optimizer
|
||||||
|
|
||||||
def evaluate(self, docs_golds, verbose=False, batch_size=256):
|
def evaluate(
|
||||||
scorer = Scorer()
|
self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None
|
||||||
|
):
|
||||||
|
if scorer is None:
|
||||||
|
scorer = Scorer()
|
||||||
docs, golds = zip(*docs_golds)
|
docs, golds = zip(*docs_golds)
|
||||||
docs = list(docs)
|
docs = list(docs)
|
||||||
golds = list(golds)
|
golds = list(golds)
|
||||||
for name, pipe in self.pipeline:
|
for name, pipe in self.pipeline:
|
||||||
|
kwargs = component_cfg.get(name, {})
|
||||||
|
kwargs.setdefault("batch_size", batch_size)
|
||||||
if not hasattr(pipe, "pipe"):
|
if not hasattr(pipe, "pipe"):
|
||||||
docs = (pipe(doc) for doc in docs)
|
docs = (pipe(doc, **kwargs) for doc in docs)
|
||||||
else:
|
else:
|
||||||
docs = pipe.pipe(docs, batch_size=batch_size)
|
docs = pipe.pipe(docs, **kwargs)
|
||||||
for doc, gold in zip(docs, golds):
|
for doc, gold in zip(docs, golds):
|
||||||
if verbose:
|
if verbose:
|
||||||
print(doc)
|
print(doc)
|
||||||
scorer.score(doc, gold, verbose=verbose)
|
kwargs = component_cfg.get("scorer", {})
|
||||||
|
kwargs.setdefault("verbose", verbose)
|
||||||
|
scorer.score(doc, gold, **kwargs)
|
||||||
return scorer
|
return scorer
|
||||||
|
|
||||||
@contextmanager
|
@contextmanager
|
||||||
|
@ -628,49 +646,57 @@ class Language(object):
|
||||||
self,
|
self,
|
||||||
texts,
|
texts,
|
||||||
as_tuples=False,
|
as_tuples=False,
|
||||||
n_threads=2,
|
n_threads=-1,
|
||||||
batch_size=1000,
|
batch_size=1000,
|
||||||
disable=[],
|
disable=[],
|
||||||
cleanup=False,
|
cleanup=False,
|
||||||
|
component_cfg=None,
|
||||||
):
|
):
|
||||||
"""Process texts as a stream, and yield `Doc` objects in order.
|
"""Process texts as a stream, and yield `Doc` objects in order.
|
||||||
|
|
||||||
texts (iterator): A sequence of texts to process.
|
texts (iterator): A sequence of texts to process.
|
||||||
as_tuples (bool):
|
as_tuples (bool): If set to True, inputs should be a sequence of
|
||||||
If set to True, inputs should be a sequence of
|
|
||||||
(text, context) tuples. Output will then be a sequence of
|
(text, context) tuples. Output will then be a sequence of
|
||||||
(doc, context) tuples. Defaults to False.
|
(doc, context) tuples. Defaults to False.
|
||||||
n_threads (int): Currently inactive.
|
|
||||||
batch_size (int): The number of texts to buffer.
|
batch_size (int): The number of texts to buffer.
|
||||||
disable (list): Names of the pipeline components to disable.
|
disable (list): Names of the pipeline components to disable.
|
||||||
cleanup (bool): If True, unneeded strings are freed,
|
cleanup (bool): If True, unneeded strings are freed to control memory
|
||||||
to control memory use. Experimental.
|
use. Experimental.
|
||||||
|
component_cfg (dict): An optional dictionary with extra keyword
|
||||||
|
arguments for specific components.
|
||||||
YIELDS (Doc): Documents in the order of the original text.
|
YIELDS (Doc): Documents in the order of the original text.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#pipe
|
||||||
>>> texts = [u'One document.', u'...', u'Lots of documents']
|
|
||||||
>>> for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
|
|
||||||
>>> assert doc.is_parsed
|
|
||||||
"""
|
"""
|
||||||
|
if n_threads != -1:
|
||||||
|
deprecation_warning(Warnings.W016)
|
||||||
if as_tuples:
|
if as_tuples:
|
||||||
text_context1, text_context2 = itertools.tee(texts)
|
text_context1, text_context2 = itertools.tee(texts)
|
||||||
texts = (tc[0] for tc in text_context1)
|
texts = (tc[0] for tc in text_context1)
|
||||||
contexts = (tc[1] for tc in text_context2)
|
contexts = (tc[1] for tc in text_context2)
|
||||||
docs = self.pipe(
|
docs = self.pipe(
|
||||||
texts, n_threads=n_threads, batch_size=batch_size, disable=disable
|
texts,
|
||||||
|
batch_size=batch_size,
|
||||||
|
disable=disable,
|
||||||
|
component_cfg=component_cfg,
|
||||||
)
|
)
|
||||||
for doc, context in izip(docs, contexts):
|
for doc, context in izip(docs, contexts):
|
||||||
yield (doc, context)
|
yield (doc, context)
|
||||||
return
|
return
|
||||||
docs = (self.make_doc(text) for text in texts)
|
docs = (self.make_doc(text) for text in texts)
|
||||||
|
if component_cfg is None:
|
||||||
|
component_cfg = {}
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if name in disable:
|
if name in disable:
|
||||||
continue
|
continue
|
||||||
|
kwargs = component_cfg.get(name, {})
|
||||||
|
# Allow component_cfg to overwrite the top-level kwargs.
|
||||||
|
kwargs.setdefault("batch_size", batch_size)
|
||||||
if hasattr(proc, "pipe"):
|
if hasattr(proc, "pipe"):
|
||||||
docs = proc.pipe(docs, n_threads=n_threads, batch_size=batch_size)
|
docs = proc.pipe(docs, **kwargs)
|
||||||
else:
|
else:
|
||||||
# Apply the function, but yield the doc
|
# Apply the function, but yield the doc
|
||||||
docs = _pipe(proc, docs)
|
docs = _pipe(proc, docs, kwargs)
|
||||||
# Track weakrefs of "recent" documents, so that we can see when they
|
# Track weakrefs of "recent" documents, so that we can see when they
|
||||||
# expire from memory. When they do, we know we don't need old strings.
|
# expire from memory. When they do, we know we don't need old strings.
|
||||||
# This way, we avoid maintaining an unbounded growth in string entries
|
# This way, we avoid maintaining an unbounded growth in string entries
|
||||||
|
@ -701,124 +727,114 @@ class Language(object):
|
||||||
self.tokenizer._reset_cache(keys)
|
self.tokenizer._reset_cache(keys)
|
||||||
nr_seen = 0
|
nr_seen = 0
|
||||||
|
|
||||||
def to_disk(self, path, disable=tuple()):
|
def to_disk(self, path, exclude=tuple(), disable=None):
|
||||||
"""Save the current state to a directory. If a model is loaded, this
|
"""Save the current state to a directory. If a model is loaded, this
|
||||||
will include the model.
|
will include the model.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory, which will be created if
|
path (unicode or Path): Path to a directory, which will be created if
|
||||||
it doesn't exist. Paths may be strings or `Path`-like objects.
|
it doesn't exist.
|
||||||
disable (list): Names of pipeline components to disable and prevent
|
exclude (list): Names of components or serialization fields to exclude.
|
||||||
from being saved.
|
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#to_disk
|
||||||
>>> nlp.to_disk('/path/to/models')
|
|
||||||
"""
|
"""
|
||||||
|
if disable is not None:
|
||||||
|
deprecation_warning(Warnings.W014)
|
||||||
|
exclude = disable
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
serializers = OrderedDict(
|
serializers = OrderedDict()
|
||||||
(
|
serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(p, exclude=["vocab"])
|
||||||
("tokenizer", lambda p: self.tokenizer.to_disk(p, vocab=False)),
|
serializers["meta.json"] = lambda p: p.open("w").write(srsly.json_dumps(self.meta))
|
||||||
("meta.json", lambda p: p.open("w").write(srsly.json_dumps(self.meta))),
|
|
||||||
)
|
|
||||||
)
|
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if not hasattr(proc, "name"):
|
if not hasattr(proc, "name"):
|
||||||
continue
|
continue
|
||||||
if name in disable:
|
if name in exclude:
|
||||||
continue
|
continue
|
||||||
if not hasattr(proc, "to_disk"):
|
if not hasattr(proc, "to_disk"):
|
||||||
continue
|
continue
|
||||||
serializers[name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
|
serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
|
||||||
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
serializers["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||||
util.to_disk(path, serializers, {p: False for p in disable})
|
util.to_disk(path, serializers, exclude)
|
||||||
|
|
||||||
def from_disk(self, path, disable=tuple()):
|
def from_disk(self, path, exclude=tuple(), disable=None):
|
||||||
"""Loads state from a directory. Modifies the object in place and
|
"""Loads state from a directory. Modifies the object in place and
|
||||||
returns it. If the saved `Language` object contains a model, the
|
returns it. If the saved `Language` object contains a model, the
|
||||||
model will be loaded.
|
model will be loaded.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory. Paths may be either
|
path (unicode or Path): A path to a directory.
|
||||||
strings or `Path`-like objects.
|
exclude (list): Names of components or serialization fields to exclude.
|
||||||
disable (list): Names of the pipeline components to disable.
|
|
||||||
RETURNS (Language): The modified `Language` object.
|
RETURNS (Language): The modified `Language` object.
|
||||||
|
|
||||||
EXAMPLE:
|
DOCS: https://spacy.io/api/language#from_disk
|
||||||
>>> from spacy.language import Language
|
|
||||||
>>> nlp = Language().from_disk('/path/to/models')
|
|
||||||
"""
|
"""
|
||||||
|
if disable is not None:
|
||||||
|
deprecation_warning(Warnings.W014)
|
||||||
|
exclude = disable
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
deserializers = OrderedDict(
|
deserializers = OrderedDict()
|
||||||
(
|
deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
|
||||||
("meta.json", lambda p: self.meta.update(srsly.read_json(p))),
|
deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
|
||||||
(
|
deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
|
||||||
"vocab",
|
|
||||||
lambda p: (
|
|
||||||
self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
|
|
||||||
),
|
|
||||||
),
|
|
||||||
("tokenizer", lambda p: self.tokenizer.from_disk(p, vocab=False)),
|
|
||||||
)
|
|
||||||
)
|
|
||||||
for name, proc in self.pipeline:
|
for name, proc in self.pipeline:
|
||||||
if name in disable:
|
if name in exclude:
|
||||||
continue
|
continue
|
||||||
if not hasattr(proc, "from_disk"):
|
if not hasattr(proc, "from_disk"):
|
||||||
continue
|
continue
|
||||||
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
|
deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
|
||||||
exclude = {p: False for p in disable}
|
if not (path / "vocab").exists() and "vocab" not in exclude:
|
||||||
if not (path / "vocab").exists():
|
# Convert to list here in case exclude is (default) tuple
|
||||||
exclude["vocab"] = True
|
exclude = list(exclude) + ["vocab"]
|
||||||
util.from_disk(path, deserializers, exclude)
|
util.from_disk(path, deserializers, exclude)
|
||||||
self._path = path
|
self._path = path
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, disable=[], **exclude):
|
def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
|
||||||
"""Serialize the current state to a binary string.
|
"""Serialize the current state to a binary string.
|
||||||
|
|
||||||
disable (list): Nameds of pipeline components to disable and prevent
|
exclude (list): Names of components or serialization fields to exclude.
|
||||||
from being serialized.
|
|
||||||
RETURNS (bytes): The serialized form of the `Language` object.
|
RETURNS (bytes): The serialized form of the `Language` object.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#to_bytes
|
||||||
"""
|
"""
|
||||||
serializers = OrderedDict(
|
if disable is not None:
|
||||||
(
|
deprecation_warning(Warnings.W014)
|
||||||
("vocab", lambda: self.vocab.to_bytes()),
|
exclude = disable
|
||||||
("tokenizer", lambda: self.tokenizer.to_bytes(vocab=False)),
|
serializers = OrderedDict()
|
||||||
("meta", lambda: srsly.json_dumps(self.meta)),
|
serializers["vocab"] = lambda: self.vocab.to_bytes()
|
||||||
)
|
serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
|
||||||
)
|
serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
|
||||||
for i, (name, proc) in enumerate(self.pipeline):
|
for name, proc in self.pipeline:
|
||||||
if name in disable:
|
if name in exclude:
|
||||||
continue
|
continue
|
||||||
if not hasattr(proc, "to_bytes"):
|
if not hasattr(proc, "to_bytes"):
|
||||||
continue
|
continue
|
||||||
serializers[i] = lambda proc=proc: proc.to_bytes(vocab=False)
|
serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
|
||||||
|
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, disable=[]):
|
def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
|
||||||
"""Load state from a binary string.
|
"""Load state from a binary string.
|
||||||
|
|
||||||
bytes_data (bytes): The data to load from.
|
bytes_data (bytes): The data to load from.
|
||||||
disable (list): Names of the pipeline components to disable.
|
exclude (list): Names of components or serialization fields to exclude.
|
||||||
RETURNS (Language): The `Language` object.
|
RETURNS (Language): The `Language` object.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/language#from_bytes
|
||||||
"""
|
"""
|
||||||
deserializers = OrderedDict(
|
if disable is not None:
|
||||||
(
|
deprecation_warning(Warnings.W014)
|
||||||
("meta", lambda b: self.meta.update(srsly.json_loads(b))),
|
exclude = disable
|
||||||
(
|
deserializers = OrderedDict()
|
||||||
"vocab",
|
deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b))
|
||||||
lambda b: (
|
deserializers["vocab"] = lambda b: self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
|
||||||
self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
|
deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(b, exclude=["vocab"])
|
||||||
),
|
for name, proc in self.pipeline:
|
||||||
),
|
if name in exclude:
|
||||||
("tokenizer", lambda b: self.tokenizer.from_bytes(b, vocab=False)),
|
|
||||||
)
|
|
||||||
)
|
|
||||||
for i, (name, proc) in enumerate(self.pipeline):
|
|
||||||
if name in disable:
|
|
||||||
continue
|
continue
|
||||||
if not hasattr(proc, "from_bytes"):
|
if not hasattr(proc, "from_bytes"):
|
||||||
continue
|
continue
|
||||||
deserializers[i] = lambda b, proc=proc: proc.from_bytes(b, vocab=False)
|
deserializers[name] = lambda b, proc=proc: proc.from_bytes(b, exclude=["vocab"])
|
||||||
util.from_bytes(bytes_data, deserializers, {})
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
|
util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
||||||
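The serialization methods above rename `disable` to `exclude`, but they keep accepting `disable` and emit a deprecation warning when it is used. The sketch below shows that pattern with only the standard library. `serialize_state` and the warning text are illustrative assumptions; spaCy's own code goes through `deprecation_warning(Warnings.W014)` and its `util` helpers.

```python
import warnings
from collections import OrderedDict


def serialize_state(getters, exclude=tuple(), disable=None):
    """Call every getter whose name is not excluded (hypothetical helper)."""
    if disable is not None:
        # Old keyword still works, but callers are told to migrate.
        warnings.warn("'disable' is deprecated, use 'exclude'", DeprecationWarning)
        exclude = disable
    result = OrderedDict()
    for name, getter in getters.items():
        if name in exclude:
            continue
        result[name] = getter()
    return result


getters = OrderedDict(
    [("vocab", lambda: b"v"), ("tokenizer", lambda: b"t"), ("meta.json", lambda: b"{}")]
)
new_style = serialize_state(getters, exclude=["vocab"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = serialize_state(getters, disable=["vocab"])
```

Both calls skip `vocab`. Only the second one records a `DeprecationWarning`.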
|
@ -873,7 +889,12 @@ class DisabledPipes(list):
|
||||||
self[:] = []
|
self[:] = []
|
||||||
|
|
||||||
|
|
||||||
def _pipe(func, docs):
|
def _pipe(func, docs, kwargs):
|
||||||
|
# We added some args for pipe that __call__ doesn't expect.
|
||||||
|
kwargs = dict(kwargs)
|
||||||
|
for arg in ["n_threads", "batch_size"]:
|
||||||
|
if arg in kwargs:
|
||||||
|
kwargs.pop(arg)
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
doc = func(doc)
|
doc = func(doc, **kwargs)
|
||||||
yield doc
|
yield doc
|
||||||
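The revised `_pipe` drops the `pipe`-only arguments before calling a component that has no `pipe` method. A small sketch of the same idea with plain callables follows. `pipe_with_call` and `upper_component` are made-up names, and strings stand in for `Doc` objects.

```python
def pipe_with_call(func, docs, kwargs):
    """Apply func to each doc, dropping arguments only pipe() understands."""
    # Work on a copy so the shared kwargs dict is left untouched.
    kwargs = dict(kwargs)
    for arg in ("n_threads", "batch_size"):
        kwargs.pop(arg, None)
    for doc in docs:
        yield func(doc, **kwargs)


def upper_component(doc, suffix=""):
    # Stand-in for a component's __call__, which does not accept batch_size.
    return doc.upper() + suffix


shared = {"batch_size": 1000, "suffix": "!"}
results = list(pipe_with_call(upper_component, ["a", "b"], shared))
```

Without the filtering step, `upper_component` would raise a `TypeError` on the unexpected `batch_size` argument.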
|
|
|
@ -161,17 +161,17 @@ cdef class Lexeme:
|
||||||
Lexeme.c_from_bytes(self.c, lex_data)
|
Lexeme.c_from_bytes(self.c, lex_data)
|
||||||
self.orth = self.c.orth
|
self.orth = self.c.orth
|
||||||
|
|
||||||
property has_vector:
|
@property
|
||||||
|
def has_vector(self):
|
||||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.has_vector(self.c.orth)
|
||||||
return self.vocab.has_vector(self.c.orth)
|
|
||||||
|
|
||||||
property vector_norm:
|
@property
|
||||||
|
def vector_norm(self):
|
||||||
"""RETURNS (float): The L2 norm of the vector representation."""
|
"""RETURNS (float): The L2 norm of the vector representation."""
|
||||||
def __get__(self):
|
vector = self.vector
|
||||||
vector = self.vector
|
return numpy.sqrt((vector**2).sum())
|
||||||
return numpy.sqrt((vector**2).sum())
|
|
||||||
|
|
||||||
property vector:
|
property vector:
|
||||||
"""A real-valued meaning representation.
|
"""A real-valued meaning representation.
|
||||||
|
@ -209,17 +209,17 @@ cdef class Lexeme:
|
||||||
def __set__(self, float sentiment):
|
def __set__(self, float sentiment):
|
||||||
self.c.sentiment = sentiment
|
self.c.sentiment = sentiment
|
||||||
|
|
||||||
property orth_:
|
@property
|
||||||
|
def orth_(self):
|
||||||
"""RETURNS (unicode): The original verbatim text of the lexeme
|
"""RETURNS (unicode): The original verbatim text of the lexeme
|
||||||
(identical to `Lexeme.text`). Exists mostly for consistency with
|
(identical to `Lexeme.text`). Exists mostly for consistency with
|
||||||
the other attributes."""
|
the other attributes."""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.orth]
|
||||||
return self.vocab.strings[self.c.orth]
|
|
||||||
|
|
||||||
property text:
|
@property
|
||||||
|
def text(self):
|
||||||
"""RETURNS (unicode): The original verbatim text of the lexeme."""
|
"""RETURNS (unicode): The original verbatim text of the lexeme."""
|
||||||
def __get__(self):
|
return self.orth_
|
||||||
return self.orth_
|
|
||||||
|
|
||||||
property lower:
|
property lower:
|
||||||
"""RETURNS (unicode): Lowercase form of the lexeme."""
|
"""RETURNS (unicode): Lowercase form of the lexeme."""
|
||||||
|
|
|
@ -19,7 +19,7 @@ from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH
|
||||||
|
|
||||||
from ._schemas import TOKEN_PATTERN_SCHEMA
|
from ._schemas import TOKEN_PATTERN_SCHEMA
|
||||||
from ..util import get_json_validator, validate_json
|
from ..util import get_json_validator, validate_json
|
||||||
from ..errors import Errors, MatchPatternError
|
from ..errors import Errors, MatchPatternError, Warnings, deprecation_warning
|
||||||
from ..strings import get_string_id
|
from ..strings import get_string_id
|
||||||
from ..attrs import IDS
|
from ..attrs import IDS
|
||||||
|
|
||||||
|
@ -153,15 +153,15 @@ cdef class Matcher:
|
||||||
return default
|
return default
|
||||||
return (self._callbacks[key], self._patterns[key])
|
return (self._callbacks[key], self._patterns[key])
|
||||||
|
|
||||||
def pipe(self, docs, batch_size=1000, n_threads=2):
|
def pipe(self, docs, batch_size=1000, n_threads=-1):
|
||||||
"""Match a stream of documents, yielding them in turn.
|
"""Match a stream of documents, yielding them in turn.
|
||||||
|
|
||||||
docs (iterable): A stream of documents.
|
docs (iterable): A stream of documents.
|
||||||
batch_size (int): Number of documents to accumulate into a working set.
|
batch_size (int): Number of documents to accumulate into a working set.
|
||||||
n_threads (int): The number of threads with which to work on the buffer
|
|
||||||
in parallel, if the implementation supports multi-threading.
|
|
||||||
YIELDS (Doc): Documents, in order.
|
YIELDS (Doc): Documents, in order.
|
||||||
"""
|
"""
|
||||||
|
if n_threads != -1:
|
||||||
|
deprecation_warning(Warnings.W016)
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
self(doc)
|
self(doc)
|
||||||
yield doc
|
yield doc
|
||||||
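Changing the `n_threads` default to `-1` turns it into a sentinel: any other value means the caller passed the argument explicitly, which now triggers a deprecation warning. A rough standard-library sketch follows. `stream_docs` and the message are assumptions, not spaCy's `Warnings.W016` text.

```python
import warnings


def stream_docs(docs, batch_size=1000, n_threads=-1):
    """Yield docs unchanged and warn if the inactive n_threads is set."""
    # -1 is the sentinel default, so any other value was passed explicitly.
    if n_threads != -1:
        warnings.warn("n_threads is deprecated and has no effect", DeprecationWarning)
    for doc in docs:
        yield doc


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_docs = list(stream_docs(["a", "b"]))
    n_default = len(caught)
    legacy_docs = list(stream_docs(["a", "b"], n_threads=4))
    n_legacy = len(caught) - n_default
```

Because `stream_docs` is a generator function, the check runs when iteration starts, not when the function is called.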
|
|
|
@ -166,14 +166,12 @@ cdef class PhraseMatcher:
|
||||||
on_match(self, doc, i, matches)
|
on_match(self, doc, i, matches)
|
||||||
return matches
|
return matches
|
||||||
|
|
||||||
def pipe(self, stream, batch_size=1000, n_threads=1, return_matches=False,
|
def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False,
|
||||||
as_tuples=False):
|
as_tuples=False):
|
||||||
"""Match a stream of documents, yielding them in turn.
|
"""Match a stream of documents, yielding them in turn.
|
||||||
|
|
||||||
stream (iterable): A stream of documents.
|
stream (iterable): A stream of documents.
|
||||||
batch_size (int): Number of documents to accumulate into a working set.
|
batch_size (int): Number of documents to accumulate into a working set.
|
||||||
n_threads (int): The number of threads with which to work on the buffer
|
|
||||||
in parallel, if the implementation supports multi-threading.
|
|
||||||
return_matches (bool): Yield the match lists along with the docs, making
|
return_matches (bool): Yield the match lists along with the docs, making
|
||||||
results (doc, matches) tuples.
|
results (doc, matches) tuples.
|
||||||
as_tuples (bool): Interpret the input stream as (doc, context) tuples,
|
as_tuples (bool): Interpret the input stream as (doc, context) tuples,
|
||||||
|
@ -184,6 +182,8 @@ cdef class PhraseMatcher:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/phrasematcher#pipe
|
DOCS: https://spacy.io/api/phrasematcher#pipe
|
||||||
"""
|
"""
|
||||||
|
if n_threads != -1:
|
||||||
|
deprecation_warning(Warnings.W016)
|
||||||
if as_tuples:
|
if as_tuples:
|
||||||
for doc, context in stream:
|
for doc, context in stream:
|
||||||
matches = self(doc)
|
matches = self(doc)
|
||||||
|
|
|
@ -141,16 +141,21 @@ class Pipe(object):
|
||||||
with self.model.use_params(params):
|
with self.model.use_params(params):
|
||||||
yield
|
yield
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
"""Serialize the pipe to a bytestring."""
|
"""Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
|
RETURNS (bytes): The serialized object.
|
||||||
|
"""
|
||||||
serialize = OrderedDict()
|
serialize = OrderedDict()
|
||||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||||
if self.model not in (True, False, None):
|
if self.model not in (True, False, None):
|
||||||
serialize["model"] = self.model.to_bytes
|
serialize["model"] = self.model.to_bytes
|
||||||
serialize["vocab"] = self.vocab.to_bytes
|
serialize["vocab"] = self.vocab.to_bytes
|
||||||
|
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||||
return util.to_bytes(serialize, exclude)
|
return util.to_bytes(serialize, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
"""Load the pipe from a bytestring."""
|
"""Load the pipe from a bytestring."""
|
||||||
|
|
||||||
def load_model(b):
|
def load_model(b):
|
||||||
|
@ -161,26 +166,25 @@ class Pipe(object):
|
||||||
self.model = self.Model(**self.cfg)
|
self.model = self.Model(**self.cfg)
|
||||||
self.model.from_bytes(b)
|
self.model.from_bytes(b)
|
||||||
|
|
||||||
deserialize = OrderedDict(
|
deserialize = OrderedDict()
|
||||||
(
|
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||||
("vocab", lambda b: self.vocab.from_bytes(b)),
|
deserialize["model"] = load_model
|
||||||
("model", load_model),
|
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||||
)
|
|
||||||
)
|
|
||||||
util.from_bytes(bytes_data, deserialize, exclude)
|
util.from_bytes(bytes_data, deserialize, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
"""Serialize the pipe to disk."""
|
"""Serialize the pipe to disk."""
|
||||||
serialize = OrderedDict()
|
serialize = OrderedDict()
|
||||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||||
if self.model not in (None, True, False):
|
if self.model not in (None, True, False):
|
||||||
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
|
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
|
||||||
|
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||||
util.to_disk(path, serialize, exclude)
|
util.to_disk(path, serialize, exclude)
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
"""Load the pipe from disk."""
|
"""Load the pipe from disk."""
|
||||||
|
|
||||||
def load_model(p):
|
def load_model(p):
|
||||||
|
@ -191,13 +195,11 @@ class Pipe(object):
|
||||||
self.model = self.Model(**self.cfg)
|
self.model = self.Model(**self.cfg)
|
||||||
self.model.from_bytes(p.open("rb").read())
|
self.model.from_bytes(p.open("rb").read())
|
||||||
|
|
||||||
deserialize = OrderedDict(
|
deserialize = OrderedDict()
|
||||||
(
|
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
|
||||||
("cfg", lambda p: self.cfg.update(_load_cfg(p))),
|
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||||
("vocab", lambda p: self.vocab.from_disk(p)),
|
deserialize["model"] = load_model
|
||||||
("model", load_model),
|
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||||
)
|
|
||||||
)
|
|
||||||
util.from_disk(path, deserialize, exclude)
|
util.from_disk(path, deserialize, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
@ -255,7 +257,6 @@ class Tensorizer(Pipe):
|
||||||
|
|
||||||
stream (iterator): A sequence of `Doc` objects to process.
|
stream (iterator): A sequence of `Doc` objects to process.
|
||||||
batch_size (int): Number of `Doc` objects to group.
|
batch_size (int): Number of `Doc` objects to group.
|
||||||
n_threads (int): Number of threads.
|
|
||||||
YIELDS (iterator): A sequence of `Doc` objects, in order of input.
|
YIELDS (iterator): A sequence of `Doc` objects, in order of input.
|
||||||
"""
|
"""
|
||||||
for docs in util.minibatch(stream, size=batch_size):
|
for docs in util.minibatch(stream, size=batch_size):
|
||||||
|
@ -541,7 +542,7 @@ class Tagger(Pipe):
|
||||||
with self.model.use_params(params):
|
with self.model.use_params(params):
|
||||||
yield
|
yield
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
serialize = OrderedDict()
|
serialize = OrderedDict()
|
||||||
if self.model not in (None, True, False):
|
if self.model not in (None, True, False):
|
||||||
serialize["model"] = self.model.to_bytes
|
serialize["model"] = self.model.to_bytes
|
||||||
|
@ -549,9 +550,10 @@ class Tagger(Pipe):
|
||||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||||
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
|
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
|
||||||
|
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||||
return util.to_bytes(serialize, exclude)
|
return util.to_bytes(serialize, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
def load_model(b):
|
def load_model(b):
|
||||||
# TODO: Remove this once we don't have to handle previous models
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||||
|
@ -576,20 +578,22 @@ class Tagger(Pipe):
|
||||||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||||
("model", lambda b: load_model(b)),
|
("model", lambda b: load_model(b)),
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||||
util.from_bytes(bytes_data, deserialize, exclude)
|
util.from_bytes(bytes_data, deserialize, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||||
serialize = OrderedDict((
|
serialize = OrderedDict((
|
||||||
('vocab', lambda p: self.vocab.to_disk(p)),
|
("vocab", lambda p: self.vocab.to_disk(p)),
|
||||||
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)),
|
("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
|
||||||
('model', lambda p: p.open("wb").write(self.model.to_bytes())),
|
("model", lambda p: p.open("wb").write(self.model.to_bytes())),
|
||||||
('cfg', lambda p: srsly.write_json(p, self.cfg))
|
("cfg", lambda p: srsly.write_json(p, self.cfg))
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||||
util.to_disk(path, serialize, exclude)
|
util.to_disk(path, serialize, exclude)
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
def load_model(p):
|
def load_model(p):
|
||||||
# TODO: Remove this once we don't have to handle previous models
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||||
|
@ -612,6 +616,7 @@ class Tagger(Pipe):
|
||||||
("tag_map", load_tag_map),
|
("tag_map", load_tag_map),
|
||||||
("model", load_model),
|
("model", load_model),
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||||
util.from_disk(path, deserialize, exclude)
|
util.from_disk(path, deserialize, exclude)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
|
|
|
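The Tagger hunks above replace the `**exclude` keyword style with an explicit `exclude` sequence plus `**kwargs`, and route both through `util.get_serialization_exclude`. That helper's body is not part of these hunks, so the sketch below uses a hypothetical stand-in (`get_exclude`) and an assumed rule that an old-style `name=False` keyword means "skip this component". It only illustrates the calling pattern, not spaCy's actual implementation.

```python
from collections import OrderedDict


def get_exclude(serializers, exclude, kwargs):
    """Hypothetical stand-in for util.get_serialization_exclude."""
    exclude = list(exclude)
    for key, value in kwargs.items():
        # Assumed legacy form: name=False meant "skip this component".
        if key in serializers and value is False:
            exclude.append(key)
        else:
            raise ValueError("Unsupported serialization argument: %s" % key)
    return exclude


def to_bytes(serializers, exclude=tuple(), **kwargs):
    """Run every serializer whose name is not excluded."""
    exclude = get_exclude(serializers, exclude, kwargs)
    return OrderedDict(
        (name, func()) for name, func in serializers.items() if name not in exclude
    )


# Toy serializers named after the keys used in the Tagger diff.
serializers = OrderedDict((
    ("cfg", lambda: b"{}"),
    ("tag_map", lambda: b"tags"),
    ("model", lambda: b"weights"),
))
new_style = to_bytes(serializers, exclude=["model"])
old_style = to_bytes(serializers, model=False)
```

Under these assumptions both calls skip the `model` entry; the list form has the advantage that component names no longer collide with other keyword arguments.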
@ -248,19 +248,17 @@ cdef class StringStore:
|
||||||
self.add(word)
|
self.add(word)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, **kwargs):
|
||||||
"""Serialize the current state to a binary string.
|
"""Serialize the current state to a binary string.
|
||||||
|
|
||||||
**exclude: Named attributes to prevent from being serialized.
|
|
||||||
RETURNS (bytes): The serialized form of the `StringStore` object.
|
RETURNS (bytes): The serialized form of the `StringStore` object.
|
||||||
"""
|
"""
|
||||||
return srsly.json_dumps(list(self))
|
return srsly.json_dumps(list(self))
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, **kwargs):
|
||||||
"""Load state from a binary string.
|
"""Load state from a binary string.
|
||||||
|
|
||||||
bytes_data (bytes): The data to load from.
|
bytes_data (bytes): The data to load from.
|
||||||
**exclude: Named attributes to prevent from being loaded.
|
|
||||||
RETURNS (StringStore): The `StringStore` object.
|
RETURNS (StringStore): The `StringStore` object.
|
||||||
"""
|
"""
|
||||||
strings = srsly.json_loads(bytes_data)
|
strings = srsly.json_loads(bytes_data)
|
||||||
|
|
|
@ -157,6 +157,10 @@ cdef void cpu_log_loss(float* d_scores,
|
||||||
cdef double max_, gmax, Z, gZ
|
cdef double max_, gmax, Z, gZ
|
||||||
best = arg_max_if_gold(scores, costs, is_valid, O)
|
best = arg_max_if_gold(scores, costs, is_valid, O)
|
||||||
guess = arg_max_if_valid(scores, is_valid, O)
|
guess = arg_max_if_valid(scores, is_valid, O)
|
||||||
|
if best == -1 or guess == -1:
|
||||||
|
# These shouldn't happen, but if they do, we want to make sure we don't
|
||||||
|
# cause an OOB access.
|
||||||
|
return
|
||||||
Z = 1e-10
|
Z = 1e-10
|
||||||
gZ = 1e-10
|
gZ = 1e-10
|
||||||
max_ = scores[guess]
|
max_ = scores[guess]
|
||||||
|
|
|
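The hunk above returns early when `best` or `guess` is `-1`, so that the later `scores[guess]` lookup cannot read out of bounds. The sketch below shows the same sentinel-and-guard idea in plain Python. The body of `arg_max_if_valid` is not shown in this diff, so the version here is an assumed reimplementation, and `pick_action` is a hypothetical caller.

```python
def arg_max_if_valid(scores, is_valid):
    """Index of the highest-scoring valid class, or -1 if none is valid."""
    best = -1
    for i, score in enumerate(scores):
        if is_valid[i] and (best == -1 or score > scores[best]):
            best = i
    return best


def pick_action(scores, is_valid):
    """Hypothetical caller that refuses to index with the -1 sentinel."""
    guess = arg_max_if_valid(scores, is_valid)
    if guess == -1:
        # Mirrors the early return in the diff: never use -1 as an index.
        return None
    return guess
```

In Python a stray `scores[-1]` would silently read the last element, while in the C-level code of the diff it would read outside the array, which is why the guard sits before any indexing.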
@ -323,6 +323,12 @@ cdef cppclass StateC:
|
||||||
if this._s_i >= 1:
|
if this._s_i >= 1:
|
||||||
this._s_i -= 1
|
this._s_i -= 1
|
||||||
|
|
||||||
|
void force_final() nogil:
|
||||||
|
# This should only be used in desperate situations, as it may leave
|
||||||
|
# the analysis in an unexpected state.
|
||||||
|
this._s_i = 0
|
||||||
|
this._b_i = this.length
|
||||||
|
|
||||||
void unshift() nogil:
|
void unshift() nogil:
|
||||||
this._b_i -= 1
|
this._b_i -= 1
|
||||||
this._buffer[this._b_i] = this.S(0)
|
this._buffer[this._b_i] = this.S(0)
|
||||||
|
|
|
@ -369,9 +369,9 @@ cdef class ArcEager(TransitionSystem):
|
||||||
actions[LEFT].setdefault('dep', 0)
|
actions[LEFT].setdefault('dep', 0)
|
||||||
return actions
|
return actions
|
||||||
|
|
||||||
property action_types:
|
@property
|
||||||
def __get__(self):
|
def action_types(self):
|
||||||
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
|
return (SHIFT, REDUCE, LEFT, RIGHT, BREAK)
|
||||||
|
|
||||||
def get_cost(self, StateClass state, GoldParse gold, action):
|
def get_cost(self, StateClass state, GoldParse gold, action):
|
||||||
cdef Transition t = self.lookup_transition(action)
|
cdef Transition t = self.lookup_transition(action)
|
||||||
|
|
|
@ -80,9 +80,9 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
actions[action][label] += 1
|
actions[action][label] += 1
|
||||||
return actions
|
return actions
|
||||||
|
|
||||||
property action_types:
|
@property
|
||||||
def __get__(self):
|
def action_types(self):
|
||||||
return (BEGIN, IN, LAST, UNIT, OUT)
|
return (BEGIN, IN, LAST, UNIT, OUT)
|
||||||
|
|
||||||
def move_name(self, int move, attr_t label):
|
def move_name(self, int move, attr_t label):
|
||||||
if move == OUT:
|
if move == OUT:
|
||||||
|
@ -257,30 +257,42 @@ cdef class Missing:
|
||||||
cdef class Begin:
|
cdef class Begin:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||||
# Ensure we don't clobber preset entities. If no entity preset,
|
|
||||||
# ent_iob is 0
|
|
||||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||||
if preset_ent_iob == 1:
|
cdef int preset_ent_label = st.B_(0).ent_type
|
||||||
|
# If we're the last token of the input, we can't B -- must U or O.
|
||||||
|
if st.B(1) == -1:
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 2:
|
elif st.entity_is_open():
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
elif label == 0:
|
||||||
return False
|
return False
|
||||||
# If the next word is B or O, we can't B now
|
elif preset_ent_iob == 1 or preset_ent_iob == 2:
|
||||||
|
# Ensure we don't clobber preset entities. If no entity preset,
|
||||||
|
# ent_iob is 0
|
||||||
|
return False
|
||||||
|
elif preset_ent_iob == 3:
|
||||||
|
# Okay, we're in a preset entity.
|
||||||
|
if label != preset_ent_label:
|
||||||
|
# If label isn't right, reject
|
||||||
|
return False
|
||||||
|
elif st.B_(1).ent_iob != 1:
|
||||||
|
# If next token isn't marked I, we need to make U, not B.
|
||||||
|
return False
|
||||||
|
else:
|
||||||
|
# Otherwise, force acceptance, even if we're across a sentence
|
||||||
|
# boundary or the token is whitespace.
|
||||||
|
return True
|
||||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||||
|
# If the next word is B or O, we can't B now
|
||||||
return False
|
return False
|
||||||
# If the current word is B, and the next word isn't I, the current word
|
|
||||||
# is really U
|
|
||||||
elif preset_ent_iob == 3 and st.B_(1).ent_iob != 1:
|
|
||||||
return False
|
|
||||||
# Don't allow entities to extend across sentence boundaries
|
|
||||||
elif st.B_(1).sent_start == 1:
|
elif st.B_(1).sent_start == 1:
|
||||||
|
# Don't allow entities to extend across sentence boundaries
|
||||||
return False
|
return False
|
||||||
# Don't allow entities to start on whitespace
|
# Don't allow entities to start on whitespace
|
||||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||||
return False
|
return False
|
||||||
else:
|
else:
|
||||||
return label != 0 and not st.entity_is_open()
|
return True
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef int transition(StateC* st, attr_t label) nogil:
|
cdef int transition(StateC* st, attr_t label) nogil:
|
||||||
|
@ -314,18 +326,27 @@ cdef class In:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||||
if preset_ent_iob == 2:
|
if label == 0:
|
||||||
|
return False
|
||||||
|
elif st.E_(0).ent_type != label:
|
||||||
|
return False
|
||||||
|
elif not st.entity_is_open():
|
||||||
|
return False
|
||||||
|
elif st.B(1) == -1:
|
||||||
|
# If we're at the end, we can't I.
|
||||||
|
return False
|
||||||
|
elif preset_ent_iob == 2:
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 3:
|
elif preset_ent_iob == 3:
|
||||||
return False
|
return False
|
||||||
# TODO: Is this quite right? I think it's supposed to be ensuring the
|
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||||
# gazetteer matches are maintained
|
# If we know the next word is B or O, we can't be I (must be L)
|
||||||
elif st.B(1) != -1 and st.B_(1).ent_iob != preset_ent_iob:
|
|
||||||
return False
|
return False
|
||||||
# Don't allow entities to extend across sentence boundaries
|
|
||||||
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
|
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
|
||||||
|
# Don't allow entities to extend across sentence boundaries
|
||||||
return False
|
return False
|
||||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
else:
|
||||||
|
return True
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef int transition(StateC* st, attr_t label) nogil:
|
cdef int transition(StateC* st, attr_t label) nogil:
|
||||||
|
@ -370,9 +391,17 @@ cdef class In:
|
||||||
cdef class Last:
|
cdef class Last:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||||
if st.B_(1).ent_iob == 1:
|
if label == 0:
|
||||||
return False
|
return False
|
||||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
elif not st.entity_is_open():
|
||||||
|
return False
|
||||||
|
elif st.E_(0).ent_type != label:
|
||||||
|
return False
|
||||||
|
elif st.B_(1).ent_iob == 1:
|
||||||
|
# If a preset entity has I next, we can't L here.
|
||||||
|
return False
|
||||||
|
else:
|
||||||
|
return True
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef int transition(StateC* st, attr_t label) nogil:
|
cdef int transition(StateC* st, attr_t label) nogil:
|
||||||
|
@ -416,17 +445,29 @@ cdef class Unit:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||||
if preset_ent_iob == 2:
|
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||||
|
if label == 0:
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 1:
|
elif st.entity_is_open():
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
elif preset_ent_iob == 2:
|
||||||
|
# Don't clobber preset O
|
||||||
return False
|
return False
|
||||||
elif st.B_(1).ent_iob == 1:
|
elif st.B_(1).ent_iob == 1:
|
||||||
|
# If next token is In, we can't be Unit -- must be Begin
|
||||||
return False
|
return False
|
||||||
|
elif preset_ent_iob == 3:
|
||||||
|
# Okay, there's a preset entity here
|
||||||
|
if label != preset_ent_label:
|
||||||
|
# Require labels to match
|
||||||
|
return False
|
||||||
|
else:
|
||||||
|
# Otherwise return True, ignoring the whitespace constraint.
|
||||||
|
return True
|
||||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||||
return False
|
return False
|
||||||
return label != 0 and not st.entity_is_open()
|
else:
|
||||||
|
return True
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef int transition(StateC* st, attr_t label) nogil:
|
cdef int transition(StateC* st, attr_t label) nogil:
|
||||||
|
@ -461,11 +502,14 @@ cdef class Out:
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||||
if preset_ent_iob == 3:
|
if st.entity_is_open():
|
||||||
|
return False
|
||||||
|
elif preset_ent_iob == 3:
|
||||||
return False
|
return False
|
||||||
elif preset_ent_iob == 1:
|
elif preset_ent_iob == 1:
|
||||||
return False
|
return False
|
||||||
return not st.entity_is_open()
|
else:
|
||||||
|
return True
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
cdef int transition(StateC* st, attr_t label) nogil:
|
cdef int transition(StateC* st, attr_t label) nogil:
|
||||||
|
|
|
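The rewritten `Begin.is_valid` above is easier to follow as a flat decision list. The sketch below is a plain-Python port of the new branch order for reading purposes only. The `Tok` dataclass, the `nxt is None` convention for "no next token" (`st.B(1) == -1` in the diff), and the `entity_is_open` argument are hypothetical; the `ent_iob` codes (1 = I, 2 = O, 3 = B, 0 = unset) follow the comments in the diff.

```python
from dataclasses import dataclass

# ent_iob codes as described by the comments in the diff.
UNSET, I, O, B = 0, 1, 2, 3


@dataclass
class Tok:
    """Hypothetical stand-in for the token struct returned by st.B_(i)."""
    ent_iob: int = UNSET
    ent_type: int = 0
    sent_start: int = 0
    is_space: bool = False


def begin_is_valid(cur, nxt, label, entity_is_open):
    if nxt is None:
        return False  # last token: must be U or O, not B
    elif entity_is_open:
        return False
    elif label == 0:
        return False
    elif cur.ent_iob in (I, O):
        return False  # don't clobber preset entities
    elif cur.ent_iob == B:
        if label != cur.ent_type:
            return False  # preset label must match
        elif nxt.ent_iob != I:
            return False  # next token isn't I, so this must be U
        else:
            return True  # forced, even across sentence starts or whitespace
    elif nxt.ent_iob in (O, B):
        return False  # next word is B or O, so we can't B now
    elif nxt.sent_start == 1:
        return False  # no entities across sentence boundaries
    elif cur.is_space:
        return False  # no entities starting on whitespace
    else:
        return True
```

The notable behavioural change in the diff is the preset-`B` branch: when the preset label matches and the next token is preset `I`, the move is accepted before the sentence-boundary and whitespace checks are reached.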
@ -205,13 +205,11 @@ cdef class Parser:
|
||||||
self.set_annotations([doc], states, tensors=None)
|
self.set_annotations([doc], states, tensors=None)
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def pipe(self, docs, int batch_size=256, int n_threads=2, beam_width=None):
|
def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None):
|
||||||
"""Process a stream of documents.
|
"""Process a stream of documents.
|
||||||
|
|
||||||
stream: The sequence of documents to process.
|
stream: The sequence of documents to process.
|
||||||
batch_size (int): Number of documents to accumulate into a working set.
|
batch_size (int): Number of documents to accumulate into a working set.
|
||||||
n_threads (int): The number of threads with which to work on the buffer
|
|
||||||
in parallel.
|
|
||||||
YIELDS (Doc): Documents, in order.
|
YIELDS (Doc): Documents, in order.
|
||||||
"""
|
"""
|
||||||
if beam_width is None:
|
if beam_width is None:
|
||||||
|
@ -221,7 +219,7 @@ cdef class Parser:
|
||||||
for batch in util.minibatch(docs, size=batch_size):
|
for batch in util.minibatch(docs, size=batch_size):
|
||||||
batch_in_order = list(batch)
|
batch_in_order = list(batch)
|
||||||
by_length = sorted(batch_in_order, key=lambda doc: len(doc))
|
by_length = sorted(batch_in_order, key=lambda doc: len(doc))
|
||||||
for subbatch in util.minibatch(by_length, size=batch_size//4):
|
for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)):
|
||||||
subbatch = list(subbatch)
|
subbatch = list(subbatch)
|
||||||
parse_states = self.predict(subbatch, beam_width=beam_width,
|
parse_states = self.predict(subbatch, beam_width=beam_width,
|
||||||
beam_density=beam_density)
|
beam_density=beam_density)
|
||||||
|
@ -363,9 +361,14 @@ cdef class Parser:
|
||||||
for i in range(batch_size):
|
for i in range(batch_size):
|
||||||
self.moves.set_valid(is_valid, states[i])
|
self.moves.set_valid(is_valid, states[i])
|
||||||
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
|
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
|
||||||
action = self.moves.c[guess]
|
if guess == -1:
|
||||||
action.do(states[i], action.label)
|
# This shouldn't happen, but it's hard to raise an error here,
|
||||||
states[i].push_hist(guess)
|
# and we don't want to infinite loop. So, force to end state.
|
||||||
|
states[i].force_final()
|
||||||
|
else:
|
||||||
|
action = self.moves.c[guess]
|
||||||
|
action.do(states[i], action.label)
|
||||||
|
states[i].push_hist(guess)
|
||||||
free(is_valid)
|
free(is_valid)
|
||||||
|
|
||||||
def transition_beams(self, beams, float[:, ::1] scores):
|
def transition_beams(self, beams, float[:, ::1] scores):
|
||||||
|
@ -598,22 +601,24 @@ cdef class Parser:
|
||||||
self.cfg.update(cfg)
|
self.cfg.update(cfg)
|
||||||
return sgd
|
return sgd
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
serializers = {
|
serializers = {
|
||||||
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
||||||
'vocab': lambda p: self.vocab.to_disk(p),
|
'vocab': lambda p: self.vocab.to_disk(p),
|
||||||
'moves': lambda p: self.moves.to_disk(p, strings=False),
|
'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
|
||||||
'cfg': lambda p: srsly.write_json(p, self.cfg)
|
'cfg': lambda p: srsly.write_json(p, self.cfg)
|
||||||
}
|
}
|
||||||
|
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||||
util.to_disk(path, serializers, exclude)
|
util.to_disk(path, serializers, exclude)
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
deserializers = {
|
deserializers = {
|
||||||
'vocab': lambda p: self.vocab.from_disk(p),
|
'vocab': lambda p: self.vocab.from_disk(p),
|
||||||
'moves': lambda p: self.moves.from_disk(p, strings=False),
|
'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
|
||||||
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
|
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
|
||||||
'model': lambda p: None
|
'model': lambda p: None
|
||||||
}
|
}
|
||||||
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
util.from_disk(path, deserializers, exclude)
|
util.from_disk(path, deserializers, exclude)
|
||||||
if 'model' not in exclude:
|
if 'model' not in exclude:
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
|
@ -627,22 +632,24 @@ cdef class Parser:
|
||||||
self.cfg.update(cfg)
|
self.cfg.update(cfg)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
serializers = OrderedDict((
|
serializers = OrderedDict((
|
||||||
('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
|
('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
|
||||||
('vocab', lambda: self.vocab.to_bytes()),
|
('vocab', lambda: self.vocab.to_bytes()),
|
||||||
('moves', lambda: self.moves.to_bytes(strings=False)),
|
('moves', lambda: self.moves.to_bytes(exclude=["strings"])),
|
||||||
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
|
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
deserializers = OrderedDict((
|
deserializers = OrderedDict((
|
||||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||||
('moves', lambda b: self.moves.from_bytes(b, strings=False)),
|
('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])),
|
||||||
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
|
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||||
('model', lambda b: None)
|
('model', lambda b: None)
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
if 'model' not in exclude:
|
if 'model' not in exclude:
|
||||||
# TODO: Remove this once we don't have to handle previous models
|
# TODO: Remove this once we don't have to handle previous models
|
||||||
|
|
|
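One small change in the `Parser.pipe` hunk above is `size=max(batch_size//4, 2)`: for `batch_size` below 4 the integer division yields 0. The sketch below illustrates why that matters, using a simplified `minibatch` helper written for this example (an assumption, not spaCy's `util.minibatch`) and a hypothetical `subbatches_by_length` wrapper that mirrors the sort-then-split step.

```python
from itertools import islice


def minibatch(items, size):
    """Simplified batching helper: yield lists of up to `size` items."""
    items = iter(items)
    while True:
        batch = list(islice(items, size))
        if not batch:
            break
        yield batch


def subbatches_by_length(docs, batch_size):
    """Sort each batch by length, then split it into smaller sub-batches."""
    out = []
    for batch in minibatch(docs, batch_size):
        by_length = sorted(batch, key=len)
        # Same floor as in the diff: never let the sub-batch size reach 0.
        out.extend(minibatch(by_length, max(batch_size // 4, 2)))
    return out


result = subbatches_by_length(["aaa", "b", "cc"], batch_size=3)
```

With this helper a size of 0 produces no batches at all, so without the floor every item of a small batch would be dropped; with the floor of 2, the three toy items come back as two sub-batches.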
@ -94,6 +94,13 @@ cdef class TransitionSystem:
|
||||||
raise ValueError(Errors.E024)
|
raise ValueError(Errors.E024)
|
||||||
return history
|
return history
|
||||||
|
|
||||||
|
def apply_transition(self, StateClass state, name):
|
||||||
|
if not self.is_valid(state, name):
|
||||||
|
raise ValueError(
|
||||||
|
"Cannot apply transition {name}: invalid for the current state.".format(name=name))
|
||||||
|
action = self.lookup_transition(name)
|
||||||
|
action.do(state.c, action.label)
|
||||||
|
|
||||||
cdef int initialize_state(self, StateC* state) nogil:
|
cdef int initialize_state(self, StateC* state) nogil:
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
@ -201,30 +208,32 @@ cdef class TransitionSystem:
|
||||||
self.labels[action][label_name] = new_freq-1
|
self.labels[action][label_name] = new_freq-1
|
||||||
return 1
|
return 1
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, **kwargs):
|
||||||
with path.open('wb') as file_:
|
with path.open('wb') as file_:
|
||||||
file_.write(self.to_bytes(**exclude))
|
file_.write(self.to_bytes(**kwargs))
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **kwargs):
|
||||||
with path.open('rb') as file_:
|
with path.open('rb') as file_:
|
||||||
byte_data = file_.read()
|
byte_data = file_.read()
|
||||||
self.from_bytes(byte_data, **exclude)
|
self.from_bytes(byte_data, **kwargs)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
transitions = []
|
transitions = []
|
||||||
serializers = {
|
serializers = {
|
||||||
'moves': lambda: srsly.json_dumps(self.labels),
|
'moves': lambda: srsly.json_dumps(self.labels),
|
||||||
'strings': lambda: self.strings.to_bytes()
|
'strings': lambda: self.strings.to_bytes()
|
||||||
}
|
}
|
||||||
|
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
labels = {}
|
labels = {}
|
||||||
deserializers = {
|
deserializers = {
|
||||||
'moves': lambda b: labels.update(srsly.json_loads(b)),
|
'moves': lambda b: labels.update(srsly.json_loads(b)),
|
||||||
'strings': lambda b: self.strings.from_bytes(b)
|
'strings': lambda b: self.strings.from_bytes(b)
|
||||||
}
|
}
|
||||||
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
self.initialize_actions(labels)
|
self.initialize_actions(labels)
|
||||||
return self
|
return self
|
||||||
|
|
|
@ -1,46 +1,44 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.tokens import Doc
|
||||||
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
||||||
|
|
||||||
from ..util import get_doc
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
def test_doc_array_attr_of_token(en_vocab):
|
||||||
text = "An example sentence"
|
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||||
tokens = en_tokenizer(text)
|
example = doc.vocab["example"]
|
||||||
example = tokens.vocab["example"]
|
|
||||||
assert example.orth != example.shape
|
assert example.orth != example.shape
|
||||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
feats_array = doc.to_array((ORTH, SHAPE))
|
||||||
assert feats_array[0][0] != feats_array[0][1]
|
assert feats_array[0][0] != feats_array[0][1]
|
||||||
assert feats_array[0][0] != feats_array[0][1]
|
assert feats_array[0][0] != feats_array[0][1]
|
||||||
|
|
||||||
|
|
||||||
def test_doc_stringy_array_attr_of_token(en_tokenizer, en_vocab):
|
def test_doc_stringy_array_attr_of_token(en_vocab):
|
||||||
text = "An example sentence"
|
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||||
tokens = en_tokenizer(text)
|
example = doc.vocab["example"]
|
||||||
example = tokens.vocab["example"]
|
|
||||||
assert example.orth != example.shape
|
assert example.orth != example.shape
|
||||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
feats_array = doc.to_array((ORTH, SHAPE))
|
||||||
feats_array_stringy = tokens.to_array(("ORTH", "SHAPE"))
|
feats_array_stringy = doc.to_array(("ORTH", "SHAPE"))
|
||||||
assert feats_array_stringy[0][0] == feats_array[0][0]
|
assert feats_array_stringy[0][0] == feats_array[0][0]
|
||||||
assert feats_array_stringy[0][1] == feats_array[0][1]
|
assert feats_array_stringy[0][1] == feats_array[0][1]
|
||||||
|
|
||||||
|
|
||||||
def test_doc_scalar_attr_of_token(en_tokenizer, en_vocab):
|
def test_doc_scalar_attr_of_token(en_vocab):
|
||||||
text = "An example sentence"
|
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||||
tokens = en_tokenizer(text)
|
example = doc.vocab["example"]
|
||||||
example = tokens.vocab["example"]
|
|
||||||
assert example.orth != example.shape
|
assert example.orth != example.shape
|
||||||
feats_array = tokens.to_array(ORTH)
|
feats_array = doc.to_array(ORTH)
|
||||||
assert feats_array.shape == (3,)
|
assert feats_array.shape == (3,)
|
||||||
|
|
||||||
|
|
||||||
def test_doc_array_tag(en_tokenizer):
|
def test_doc_array_tag(en_vocab):
|
||||||
text = "A nice sentence."
|
words = ["A", "nice", "sentence", "."]
|
||||||
pos = ["DET", "ADJ", "NOUN", "PUNCT"]
|
pos = ["DET", "ADJ", "NOUN", "PUNCT"]
|
||||||
tokens = en_tokenizer(text)
|
doc = get_doc(en_vocab, words=words, pos=pos)
|
||||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
|
|
||||||
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
||||||
feats_array = doc.to_array((ORTH, POS))
|
feats_array = doc.to_array((ORTH, POS))
|
||||||
assert feats_array[0][1] == doc[0].pos
|
assert feats_array[0][1] == doc[0].pos
|
||||||
|
@ -49,13 +47,22 @@ def test_doc_array_tag(en_tokenizer):
|
||||||
assert feats_array[3][1] == doc[3].pos
|
assert feats_array[3][1] == doc[3].pos
|
||||||
|
|
||||||
|
|
||||||
def test_doc_array_dep(en_tokenizer):
|
def test_doc_array_dep(en_vocab):
|
||||||
text = "A nice sentence."
|
words = ["A", "nice", "sentence", "."]
|
||||||
deps = ["det", "amod", "ROOT", "punct"]
|
deps = ["det", "amod", "ROOT", "punct"]
|
||||||
tokens = en_tokenizer(text)
|
doc = get_doc(en_vocab, words=words, deps=deps)
|
||||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
|
||||||
feats_array = doc.to_array((ORTH, DEP))
|
feats_array = doc.to_array((ORTH, DEP))
|
||||||
assert feats_array[0][1] == doc[0].dep
|
assert feats_array[0][1] == doc[0].dep
|
||||||
assert feats_array[1][1] == doc[1].dep
|
assert feats_array[1][1] == doc[1].dep
|
||||||
assert feats_array[2][1] == doc[2].dep
|
assert feats_array[2][1] == doc[2].dep
|
||||||
assert feats_array[3][1] == doc[3].dep
|
assert feats_array[3][1] == doc[3].dep
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("attrs", [["ORTH", "SHAPE"], "IS_ALPHA"])
|
||||||
|
def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
||||||
|
"""Test that both Doc.to_array and Doc.from_array accept string attrs,
|
||||||
|
as well as single attrs and sequences of attrs.
|
||||||
|
"""
|
||||||
|
words = ["An", "example", "sentence"]
|
||||||
|
doc = Doc(en_vocab, words=words)
|
||||||
|
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||||
|
|
|
@ -4,9 +4,10 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc, Span
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from spacy.errors import ModelsWarning
|
from spacy.errors import ModelsWarning
|
||||||
|
from spacy.attrs import ENT_TYPE, ENT_IOB
|
||||||
|
|
||||||
from ..util import get_doc
|
from ..util import get_doc
|
||||||
|
|
||||||
|
@ -112,14 +113,14 @@ def test_doc_api_serialize(en_tokenizer, text):
|
||||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||||
|
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(tensor=False), tensor=False
|
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||||
)
|
)
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||||
|
|
||||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(sentiment=False), sentiment=False
|
tokens.to_bytes(exclude=["sentiment"]), exclude=["sentiment"]
|
||||||
)
|
)
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
|
@ -256,3 +257,24 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
|
||||||
assert lca[1, 1] == 1
|
assert lca[1, 1] == 1
|
||||||
assert lca[0, 1] == 2
|
assert lca[0, 1] == 2
|
||||||
assert lca[1, 2] == 2
|
assert lca[1, 2] == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_is_nered(en_vocab):
|
||||||
|
words = ["I", "live", "in", "New", "York"]
|
||||||
|
doc = Doc(en_vocab, words=words)
|
||||||
|
assert not doc.is_nered
|
||||||
|
doc.ents = [Span(doc, 3, 5, label="GPE")]
|
||||||
|
assert doc.is_nered
|
||||||
|
# Test creating doc from array with unknown values
|
||||||
|
arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64")
|
||||||
|
doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr)
|
||||||
|
assert doc.is_nered
|
||||||
|
# Test serialization
|
||||||
|
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||||
|
assert new_doc.is_nered
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_lang(en_vocab):
|
||||||
|
doc = Doc(en_vocab, words=["Hello", "world"])
|
||||||
|
assert doc.lang_ == "en"
|
||||||
|
assert doc.lang == en_vocab.strings["en"]
|
||||||
|
|
|
@ -178,11 +178,10 @@ def test_span_string_label(doc):
|
||||||
assert span.label == doc.vocab.strings["hello"]
|
assert span.label == doc.vocab.strings["hello"]
|
||||||
|
|
||||||
|
|
||||||
def test_span_string_set_label(doc):
|
def test_span_label_readonly(doc):
|
||||||
span = Span(doc, 0, 1)
|
span = Span(doc, 0, 1)
|
||||||
span.label_ = "hello"
|
with pytest.raises(NotImplementedError):
|
||||||
assert span.label_ == "hello"
|
span.label_ = "hello"
|
||||||
assert span.label == doc.vocab.strings["hello"]
|
|
||||||
|
|
||||||
|
|
||||||
def test_span_ents_property(doc):
|
def test_span_ents_property(doc):
|
||||||
|
|
|
@ -199,3 +199,31 @@ def test_token0_has_sent_start_true():
|
||||||
assert doc[0].is_sent_start is True
|
assert doc[0].is_sent_start is True
|
||||||
assert doc[1].is_sent_start is None
|
assert doc[1].is_sent_start is None
|
||||||
assert not doc.is_sentenced
|
assert not doc.is_sentenced
|
||||||
|
|
||||||
|
|
||||||
|
def test_token_api_conjuncts_chain(en_vocab):
|
||||||
|
words = "The boy and the girl and the man went .".split()
|
||||||
|
heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1]
|
||||||
|
deps = ["det", "nsubj", "cc", "det", "conj", "cc", "det", "conj", "ROOT", "punct"]
|
||||||
|
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||||
|
assert [w.text for w in doc[1].conjuncts] == ["girl", "man"]
|
||||||
|
assert [w.text for w in doc[4].conjuncts] == ["boy", "man"]
|
||||||
|
assert [w.text for w in doc[7].conjuncts] == ["boy", "girl"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_token_api_conjuncts_simple(en_vocab):
|
||||||
|
words = "They came and went .".split()
|
||||||
|
heads = [1, 0, -1, -2, -1]
|
||||||
|
deps = ["nsubj", "ROOT", "cc", "conj", "punct"]
|
||||||
|
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||||
|
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||||
|
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_token_api_non_conjuncts(en_vocab):
|
||||||
|
words = "They came .".split()
|
||||||
|
heads = [1, 0, -1]
|
||||||
|
deps = ["nsubj", "ROOT", "punct"]
|
||||||
|
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||||
|
assert [w.text for w in doc[0].conjuncts] == []
|
||||||
|
assert [w.text for w in doc[1].conjuncts] == []
|
||||||
|
|
|
@ -106,3 +106,37 @@ def test_underscore_raises_for_invalid(invalid_kwargs):
|
||||||
def test_underscore_accepts_valid(valid_kwargs):
|
def test_underscore_accepts_valid(valid_kwargs):
|
||||||
valid_kwargs["force"] = True
|
valid_kwargs["force"] = True
|
||||||
Doc.set_extension("test", **valid_kwargs)
|
Doc.set_extension("test", **valid_kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
def test_underscore_mutable_defaults_list(en_vocab):
|
||||||
|
"""Test that mutable default arguments are handled correctly (see #2581)."""
|
||||||
|
Doc.set_extension("mutable", default=[])
|
||||||
|
doc1 = Doc(en_vocab, words=["one"])
|
||||||
|
doc2 = Doc(en_vocab, words=["two"])
|
||||||
|
doc1._.mutable.append("foo")
|
||||||
|
assert len(doc1._.mutable) == 1
|
||||||
|
assert doc1._.mutable[0] == "foo"
|
||||||
|
assert len(doc2._.mutable) == 0
|
||||||
|
doc1._.mutable = ["bar", "baz"]
|
||||||
|
doc1._.mutable.append("foo")
|
||||||
|
assert len(doc1._.mutable) == 3
|
||||||
|
assert len(doc2._.mutable) == 0
|
||||||
|
|
||||||
|
|
||||||
|
def test_underscore_mutable_defaults_dict(en_vocab):
|
||||||
|
"""Test that mutable default arguments are handled correctly (see #2581)."""
|
||||||
|
Token.set_extension("mutable", default={})
|
||||||
|
token1 = Doc(en_vocab, words=["one"])[0]
|
||||||
|
token2 = Doc(en_vocab, words=["two"])[0]
|
||||||
|
token1._.mutable["foo"] = "bar"
|
||||||
|
assert len(token1._.mutable) == 1
|
||||||
|
assert token1._.mutable["foo"] == "bar"
|
||||||
|
assert len(token2._.mutable) == 0
|
||||||
|
token1._.mutable["foo"] = "baz"
|
||||||
|
assert len(token1._.mutable) == 1
|
||||||
|
assert token1._.mutable["foo"] == "baz"
|
||||||
|
token1._.mutable["x"] = []
|
||||||
|
token1._.mutable["x"].append("y")
|
||||||
|
assert len(token1._.mutable) == 2
|
||||||
|
assert token1._.mutable["x"] == ["y"]
|
||||||
|
assert len(token2._.mutable) == 0
|
||||||
|
|
|
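The two `test_underscore_mutable_defaults_*` tests above guard against one shared default object leaking between instances. One common way to get that behaviour is to copy the default the first time each object reads it. The sketch below illustrates the idea with a hypothetical `Extensions` class; it is not spaCy's `Underscore` implementation, and the method names are assumptions made for the example.

```python
import copy


class Extensions(object):
    """Hypothetical per-object attribute store with class-level defaults."""

    defaults = {}

    def __init__(self):
        self._values = {}

    @classmethod
    def set_extension(cls, name, default=None):
        cls.defaults[name] = default

    def get(self, name):
        if name not in self._values:
            # Copy the default so each object mutates its own value.
            self._values[name] = copy.deepcopy(self.defaults[name])
        return self._values[name]

    def set(self, name, value):
        self._values[name] = value


Extensions.set_extension("mutable", default=[])
first, second = Extensions(), Extensions()
first.get("mutable").append("foo")
print(first.get("mutable"), second.get("mutable"))
```

Without the copy, both objects would append to the same list, which is the failure mode referenced as #2581 in the docstrings.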
@ -2,22 +2,24 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
import numpy
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc
|
||||||
from spacy.displacy import render
|
from spacy.displacy import render
|
||||||
from spacy.gold import iob_to_biluo
|
from spacy.gold import iob_to_biluo
|
||||||
from spacy.lang.it import Italian
|
from spacy.lang.it import Italian
|
||||||
import numpy
|
|
||||||
from spacy.lang.en import English
|
from spacy.lang.en import English
|
||||||
|
|
||||||
from ..util import add_vecs_to_vocab, get_doc
|
from ..util import add_vecs_to_vocab, get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail(
|
@pytest.mark.xfail
|
||||||
reason="The dot is now properly split off, but the prefix/suffix rules are not applied again afterwards."
|
|
||||||
"This means that the quote will still be attached to the remaining token."
|
|
||||||
)
|
|
||||||
def test_issue2070():
|
def test_issue2070():
|
||||||
"""Test that checks that a dot followed by a quote is handled appropriately."""
|
"""Test that checks that a dot followed by a quote is handled
|
||||||
|
appropriately.
|
||||||
|
"""
|
||||||
|
# Problem: The dot is now properly split off, but the prefix/suffix rules
|
||||||
|
# are not applied again afterwards. This means that the quote will still be
|
||||||
|
# attached to the remaining token.
|
||||||
nlp = English()
|
nlp = English()
|
||||||
doc = nlp('First sentence."A quoted sentence" he said ...')
|
doc = nlp('First sentence."A quoted sentence" he said ...')
|
||||||
assert len(doc) == 11
|
assert len(doc) == 11
|
||||||
|
@ -37,6 +39,26 @@ def test_issue2179():
|
||||||
assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",)
|
assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",)
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue2203(en_vocab):
|
||||||
|
"""Test that lemmas are set correctly in doc.from_array."""
|
||||||
|
words = ["I", "'ll", "survive"]
|
||||||
|
tags = ["PRP", "MD", "VB"]
|
||||||
|
lemmas = ["-PRON-", "will", "survive"]
|
||||||
|
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||||
|
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||||
|
doc = Doc(en_vocab, words=words)
|
||||||
|
# Work around lemma corruption problem and set lemmas after tags
|
||||||
|
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||||
|
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||||
|
assert [t.tag_ for t in doc] == tags
|
||||||
|
assert [t.lemma_ for t in doc] == lemmas
|
||||||
|
# We need to serialize both tag and lemma, since this is what causes the bug
|
||||||
|
doc_array = doc.to_array(["TAG", "LEMMA"])
|
||||||
|
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
|
||||||
|
assert [t.tag_ for t in new_doc] == tags
|
||||||
|
assert [t.lemma_ for t in new_doc] == lemmas
|
||||||
|
|
||||||
|
|
||||||
def test_issue2219(en_vocab):
|
def test_issue2219(en_vocab):
|
||||||
vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
|
vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])]
|
||||||
add_vecs_to_vocab(en_vocab, vectors)
|
add_vecs_to_vocab(en_vocab, vectors)
|
||||||
|
|
26
spacy/tests/regression/test_issue3345.py
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from spacy.lang.en import English
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.pipeline import EntityRuler, EntityRecognizer
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue3345():
|
||||||
|
"""Test case where preset entity crosses sentence boundary."""
|
||||||
|
nlp = English()
|
||||||
|
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
|
||||||
|
doc[4].is_sent_start = True
|
||||||
|
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
|
||||||
|
ner = EntityRecognizer(doc.vocab)
|
||||||
|
# Add the OUT action. I wouldn't have thought this would be necessary...
|
||||||
|
ner.moves.add_action(5, "")
|
||||||
|
ner.add_label("GPE")
|
||||||
|
doc = ruler(doc)
|
||||||
|
# Get into the state just before "New"
|
||||||
|
state = ner.moves.init_batch([doc])[0]
|
||||||
|
ner.moves.apply_transition(state, "O")
|
||||||
|
ner.moves.apply_transition(state, "O")
|
||||||
|
ner.moves.apply_transition(state, "O")
|
||||||
|
# Check that B-GPE is valid.
|
||||||
|
assert ner.moves.is_valid(state, "B-GPE")
|
21
spacy/tests/regression/test_issue3410.py
Normal file
|
@ -0,0 +1,21 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.en import English
|
||||||
|
from spacy.matcher import Matcher, PhraseMatcher
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue3410():
|
||||||
|
texts = ["Hello world", "This is a test"]
|
||||||
|
nlp = English()
|
||||||
|
matcher = Matcher(nlp.vocab)
|
||||||
|
phrasematcher = PhraseMatcher(nlp.vocab)
|
||||||
|
with pytest.deprecated_call():
|
||||||
|
docs = list(nlp.pipe(texts, n_threads=4))
|
||||||
|
with pytest.deprecated_call():
|
||||||
|
docs = list(nlp.tokenizer.pipe(texts, n_threads=4))
|
||||||
|
with pytest.deprecated_call():
|
||||||
|
list(matcher.pipe(docs, n_threads=4))
|
||||||
|
with pytest.deprecated_call():
|
||||||
|
list(phrasematcher.pipe(docs, n_threads=4))
|
|
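`test_issue3410` above expects a deprecation warning whenever `n_threads` is passed to a `pipe` method. A rough sketch of that pattern using only the standard library is shown below. The `pipe` function here is a hypothetical stand-in that splits on whitespace, and the warning text is an assumption; spaCy itself routes this through `deprecation_warning(Warnings.W016)`.

```python
import warnings


def pipe(texts, batch_size=1000, n_threads=-1):
    """Hypothetical stand-in for a `pipe` method; not spaCy's implementation."""
    if n_threads != -1:
        # Any explicit value counts as use of the deprecated argument.
        warnings.warn("n_threads is deprecated and has no effect", DeprecationWarning)
    for text in texts:
        yield text.split()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tokens = list(pipe(["Hello world", "This is a test"], n_threads=4))
print(len(caught), tokens[0])
```

Because `pipe` is a generator, the warning only fires once iteration starts, which is why the test wraps each call in `list(...)` inside the `pytest.deprecated_call()` block.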
@ -1,6 +1,7 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc
|
||||||
from spacy.compat import path2str
|
from spacy.compat import path2str
|
||||||
|
|
||||||
|
@ -41,3 +42,18 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
|
||||||
doc.to_disk(file_path)
|
doc.to_disk(file_path)
|
||||||
doc_d = Doc(en_vocab).from_disk(file_path)
|
doc_d = Doc(en_vocab).from_disk(file_path)
|
||||||
assert doc.to_bytes() == doc_d.to_bytes()
|
assert doc.to_bytes() == doc_d.to_bytes()
|
||||||
|
|
||||||
|
|
||||||
|
def test_serialize_doc_exclude(en_vocab):
|
||||||
|
doc = Doc(en_vocab, words=["hello", "world"])
|
||||||
|
doc.user_data["foo"] = "bar"
|
||||||
|
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||||
|
assert new_doc.user_data["foo"] == "bar"
|
||||||
|
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(), exclude=["user_data"])
|
||||||
|
assert not new_doc.user_data
|
||||||
|
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
|
||||||
|
assert not new_doc.user_data
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
doc.to_bytes(user_data=False)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
|
||||||
|
|
|
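`test_serialize_doc_exclude` above pins down the new calling convention: fields are skipped via `exclude=[...]`, and the old keyword style (`user_data=False`) raises `ValueError`. The sketch below shows one way such a check could work. Both `resolve_exclude` and `to_dict` are hypothetical names for illustration, and the logic is an assumption about the behaviour the tests describe, not a copy of `util.get_serialization_exclude`.

```python
def resolve_exclude(serializers, exclude, kwargs):
    """Hypothetical helper: keep `exclude`, reject old-style keyword flags."""
    exclude = list(exclude)
    for key, value in kwargs.items():
        if key in serializers:
            # Old style, e.g. user_data=False: point the caller to exclude=[...].
            raise ValueError(
                "Use exclude=[{!r}] instead of {}={!r}".format(key, key, value)
            )
    return exclude


def to_dict(serializers, exclude=tuple(), **kwargs):
    """Run every serializer whose name is not excluded."""
    exclude = resolve_exclude(serializers, exclude, kwargs)
    return {name: get() for name, get in serializers.items() if name not in exclude}


serializers = {
    "text": lambda: "hello world",
    "user_data": lambda: {"foo": "bar"},
}
print(sorted(to_dict(serializers, exclude=["user_data"])))
```

Raising instead of silently accepting the old keywords makes a stale call site fail loudly rather than serialize a field the caller meant to drop.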
@ -52,3 +52,19 @@ def test_serialize_with_custom_tokenizer():
|
||||||
nlp.tokenizer = custom_tokenizer(nlp)
|
nlp.tokenizer = custom_tokenizer(nlp)
|
||||||
with make_tempdir() as d:
|
with make_tempdir() as d:
|
||||||
nlp.to_disk(d)
|
nlp.to_disk(d)
|
||||||
|
|
||||||
|
|
||||||
|
def test_serialize_language_exclude(meta_data):
|
||||||
|
name = "name-in-fixture"
|
||||||
|
nlp = Language(meta=meta_data)
|
||||||
|
assert nlp.meta["name"] == name
|
||||||
|
new_nlp = Language().from_bytes(nlp.to_bytes())
|
||||||
|
assert new_nlp.meta["name"] == name
|
||||||
|
new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"])
|
||||||
|
assert not new_nlp.meta["name"] == name
|
||||||
|
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
|
||||||
|
assert not new_nlp.meta["name"] == name
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
nlp.to_bytes(meta=False)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
Language().from_bytes(nlp.to_bytes(), meta=False)
|
||||||
|
|
|
@ -55,7 +55,9 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
|
||||||
parser_d = Parser(en_vocab)
|
parser_d = Parser(en_vocab)
|
||||||
parser_d.model, _ = parser_d.Model(0)
|
parser_d.model, _ = parser_d.Model(0)
|
||||||
parser_d = parser_d.from_disk(file_path)
|
parser_d = parser_d.from_disk(file_path)
|
||||||
assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False)
|
parser_bytes = parser.to_bytes(exclude=["model"])
|
||||||
|
parser_d_bytes = parser_d.to_bytes(exclude=["model"])
|
||||||
|
assert parser_bytes == parser_d_bytes
|
||||||
|
|
||||||
|
|
||||||
def test_to_from_bytes(parser, blank_parser):
|
def test_to_from_bytes(parser, blank_parser):
|
||||||
|
@ -114,3 +116,25 @@ def test_serialize_textcat_empty(en_vocab):
|
||||||
# See issue #1105
|
# See issue #1105
|
||||||
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
|
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
|
||||||
textcat.to_bytes()
|
textcat.to_bytes()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("Parser", test_parsers)
|
||||||
|
def test_serialize_pipe_exclude(en_vocab, Parser):
|
||||||
|
def get_new_parser():
|
||||||
|
new_parser = Parser(en_vocab)
|
||||||
|
new_parser.model, _ = new_parser.Model(0)
|
||||||
|
return new_parser
|
||||||
|
|
||||||
|
parser = Parser(en_vocab)
|
||||||
|
parser.model, _ = parser.Model(0)
|
||||||
|
parser.cfg["foo"] = "bar"
|
||||||
|
new_parser = get_new_parser().from_bytes(parser.to_bytes())
|
||||||
|
assert "foo" in new_parser.cfg
|
||||||
|
new_parser = get_new_parser().from_bytes(parser.to_bytes(), exclude=["cfg"])
|
||||||
|
assert "foo" not in new_parser.cfg
|
||||||
|
new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["cfg"]))
|
||||||
|
assert "foo" not in new_parser.cfg
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
parser.to_bytes(cfg=False)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
get_new_parser().from_bytes(parser.to_bytes(), cfg=False)
|
||||||
|
|
|
@ -12,13 +12,12 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])]
|
||||||
test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
|
test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.parametrize("text", ["rat"])
|
@pytest.mark.parametrize("text", ["rat"])
|
||||||
def test_serialize_vocab(en_vocab, text):
|
def test_serialize_vocab(en_vocab, text):
|
||||||
text_hash = en_vocab.strings.add(text)
|
text_hash = en_vocab.strings.add(text)
|
||||||
vocab_bytes = en_vocab.to_bytes()
|
vocab_bytes = en_vocab.to_bytes()
|
||||||
new_vocab = Vocab().from_bytes(vocab_bytes)
|
new_vocab = Vocab().from_bytes(vocab_bytes)
|
||||||
assert new_vocab.strings(text_hash) == text
|
assert new_vocab.strings[text_hash] == text
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("strings1,strings2", test_strings)
|
@pytest.mark.parametrize("strings1,strings2", test_strings)
|
||||||
|
@ -69,6 +68,15 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr):
|
||||||
assert vocab2[strings[0]].norm_ == lex_attr
|
assert vocab2[strings[0]].norm_ == lex_attr
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||||
|
def test_deserialize_vocab_seen_entries(strings, lex_attr):
|
||||||
|
# Reported in #2153
|
||||||
|
vocab = Vocab(strings=strings)
|
||||||
|
length = len(vocab)
|
||||||
|
vocab.from_bytes(vocab.to_bytes())
|
||||||
|
assert len(vocab) == length
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||||
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
|
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
|
||||||
vocab1 = Vocab(strings=strings)
|
vocab1 = Vocab(strings=strings)
|
||||||
|
|
|
@ -1,38 +1,28 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
from spacy.cli.converters import conllu2json
|
||||||
import os
|
|
||||||
from pathlib import Path
|
|
||||||
from spacy.compat import symlink_to, symlink_remove, path2str
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
def test_cli_converters_conllu2json():
|
||||||
def target_local_path():
|
# https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu
|
||||||
return Path("./foo-target")
|
lines = [
|
||||||
|
"1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO",
|
||||||
|
"2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER",
|
||||||
@pytest.fixture
|
"3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tI-PER",
|
||||||
def link_local_path():
|
"4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO",
|
||||||
return Path("./foo-symlink")
|
]
|
||||||
|
input_data = "\n".join(lines)
|
||||||
|
converted = conllu2json(input_data, n_sents=1)
|
||||||
@pytest.fixture(scope="function")
|
assert len(converted) == 1
|
||||||
def setup_target(request, target_local_path, link_local_path):
|
assert converted[0]["id"] == 0
|
||||||
if not target_local_path.exists():
|
assert len(converted[0]["paragraphs"]) == 1
|
||||||
os.mkdir(path2str(target_local_path))
|
assert len(converted[0]["paragraphs"][0]["sentences"]) == 1
|
||||||
|
sent = converted[0]["paragraphs"][0]["sentences"][0]
|
||||||
# yield -- need to cleanup even if assertion fails
|
assert len(sent["tokens"]) == 4
|
||||||
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
|
tokens = sent["tokens"]
|
||||||
def cleanup():
|
assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår"]
|
||||||
symlink_remove(link_local_path)
|
assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"]
|
||||||
os.rmdir(path2str(target_local_path))
|
assert [t["head"] for t in tokens] == [1, 2, -1, 0]
|
||||||
|
assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"]
|
||||||
request.addfinalizer(cleanup)
|
assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"]
|
||||||
|
|
||||||
|
|
||||||
def test_create_symlink_windows(setup_target, target_local_path, link_local_path):
|
|
||||||
assert target_local_path.exists()
|
|
||||||
|
|
||||||
symlink_to(link_local_path, target_local_path)
|
|
||||||
assert link_local_path.exists()
|
|
||||||
|
|
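The last assertion in `test_cli_converters_conllu2json` expects the IOB tags from the input (`O`, `B-PER`, `I-PER`, `O`) to come back in BILUO form (`O`, `B-PER`, `L-PER`, `O`). The sketch below shows the conversion rule in plain Python. `iob_to_biluo_sketch` is a hypothetical re-implementation for illustration only; spaCy ships its own `iob_to_biluo` in `spacy.gold`. As a simplifying assumption, it does not check that consecutive tags share the same label.

```python
def iob_to_biluo_sketch(tags):
    """Convert IOB tags to BILUO (hypothetical, simplified illustration)."""
    out = []
    for i, tag in enumerate(tags):
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag.startswith("I-")
        if tag.startswith("B-"):
            # A begin tag with no continuation is a single-token (unit) entity.
            out.append(tag if continues else "U-" + tag[2:])
        elif tag.startswith("I-"):
            # An inside tag with no continuation closes (last) the entity.
            out.append(tag if continues else "L-" + tag[2:])
        else:
            out.append(tag)
    return out


print(iob_to_biluo_sketch(["O", "B-PER", "I-PER", "O"]))
```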
90
spacy/tests/test_displacy.py
Normal file
|
@ -0,0 +1,90 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy import displacy
|
||||||
|
from spacy.tokens import Span
|
||||||
|
from spacy.lang.fa import Persian
|
||||||
|
|
||||||
|
from .util import get_doc
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_parse_ents(en_vocab):
|
||||||
|
"""Test that named entities on a Doc are converted into displaCy's format."""
|
||||||
|
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||||
|
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||||
|
ents = displacy.parse_ents(doc)
|
||||||
|
assert isinstance(ents, dict)
|
||||||
|
assert ents["text"] == "But Google is starting from behind "
|
||||||
|
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_parse_deps(en_vocab):
|
||||||
|
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
||||||
|
words = ["This", "is", "a", "sentence"]
|
||||||
|
heads = [1, 0, 1, -2]
|
||||||
|
pos = ["DET", "VERB", "DET", "NOUN"]
|
||||||
|
tags = ["DT", "VBZ", "DT", "NN"]
|
||||||
|
deps = ["nsubj", "ROOT", "det", "attr"]
|
||||||
|
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
|
||||||
|
deps = displacy.parse_deps(doc)
|
||||||
|
assert isinstance(deps, dict)
|
||||||
|
assert deps["words"] == [
|
||||||
|
{"text": "This", "tag": "DET"},
|
||||||
|
{"text": "is", "tag": "VERB"},
|
||||||
|
{"text": "a", "tag": "DET"},
|
||||||
|
{"text": "sentence", "tag": "NOUN"},
|
||||||
|
]
|
||||||
|
assert deps["arcs"] == [
|
||||||
|
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
||||||
|
{"start": 2, "end": 3, "label": "det", "dir": "left"},
|
||||||
|
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_spans(en_vocab):
|
||||||
|
"""Test that displaCy can render Spans."""
|
||||||
|
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||||
|
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||||
|
html = displacy.render(doc[1:4], style="ent")
|
||||||
|
assert html.startswith("<div")
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_raises_for_wrong_type(en_vocab):
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
displacy.render("hello world")
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_rtl():
|
||||||
|
# Source: http://www.sobhe.ir/hazm/ – is this correct?
|
||||||
|
words = ["ما", "بسیار", "کتاب", "می\u200cخوانیم"]
|
||||||
|
# These are (likely) wrong, but it's just for testing
|
||||||
|
pos = ["PRO", "ADV", "N_PL", "V_SUB"] # needs to match lang.fa.tag_map
|
||||||
|
deps = ["foo", "bar", "foo", "baz"]
|
||||||
|
heads = [1, 0, 1, -2]
|
||||||
|
nlp = Persian()
|
||||||
|
doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
|
||||||
|
doc.ents = [Span(doc, 1, 3, label="TEST")]
|
||||||
|
html = displacy.render(doc, page=True, style="dep")
|
||||||
|
assert "direction: rtl" in html
|
||||||
|
assert 'direction="rtl"' in html
|
||||||
|
assert 'lang="{}"'.format(nlp.lang) in html
|
||||||
|
html = displacy.render(doc, page=True, style="ent")
|
||||||
|
assert "direction: rtl" in html
|
||||||
|
assert 'lang="{}"'.format(nlp.lang) in html
|
||||||
|
|
||||||
|
|
||||||
|
def test_displacy_render_wrapper(en_vocab):
|
||||||
|
"""Test that displaCy accepts custom rendering wrapper."""
|
||||||
|
|
||||||
|
def wrapper(html):
|
||||||
|
return "TEST" + html + "TEST"
|
||||||
|
|
||||||
|
displacy.set_render_wrapper(wrapper)
|
||||||
|
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
||||||
|
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
||||||
|
html = displacy.render(doc, style="ent")
|
||||||
|
assert html.startswith("TEST<div")
|
||||||
|
assert html.endswith("/div>TEST")
|
||||||
|
# Restore
|
||||||
|
displacy.set_render_wrapper(lambda html: html)
|
|
@ -2,14 +2,35 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
import os
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from spacy import util
|
from spacy import util
|
||||||
from spacy import displacy
|
|
||||||
from spacy import prefer_gpu, require_gpu
|
from spacy import prefer_gpu, require_gpu
|
||||||
from spacy.tokens import Span
|
from spacy.compat import symlink_to, symlink_remove, path2str
|
||||||
from spacy._ml import PrecomputableAffine
|
from spacy._ml import PrecomputableAffine
|
||||||
|
|
||||||
from .util import get_doc
|
|
||||||
|
@pytest.fixture
|
||||||
|
def symlink_target():
|
||||||
|
return Path("./foo-target")
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def symlink():
|
||||||
|
return Path("./foo-symlink")
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="function")
|
||||||
|
def symlink_setup_target(request, symlink_target, symlink):
|
||||||
|
if not symlink_target.exists():
|
||||||
|
os.mkdir(path2str(symlink_target))
|
||||||
|
# yield -- need to cleanup even if assertion fails
|
||||||
|
# https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240
|
||||||
|
def cleanup():
|
||||||
|
symlink_remove(symlink)
|
||||||
|
os.rmdir(path2str(symlink_target))
|
||||||
|
|
||||||
|
request.addfinalizer(cleanup)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("text", ["hello/world", "hello world"])
|
@pytest.mark.parametrize("text", ["hello/world", "hello world"])
|
||||||
|
@ -31,66 +52,6 @@ def test_util_get_package_path(package):
|
||||||
assert isinstance(path, Path)
|
assert isinstance(path, Path)
|
||||||
|
|
||||||
|
|
||||||
def test_displacy_parse_ents(en_vocab):
|
|
||||||
"""Test that named entities on a Doc are converted into displaCy's format."""
|
|
||||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
|
||||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
|
||||||
ents = displacy.parse_ents(doc)
|
|
||||||
assert isinstance(ents, dict)
|
|
||||||
assert ents["text"] == "But Google is starting from behind "
|
|
||||||
assert ents["ents"] == [{"start": 4, "end": 10, "label": "ORG"}]
|
|
||||||
|
|
||||||
|
|
||||||
def test_displacy_parse_deps(en_vocab):
|
|
||||||
"""Test that deps and tags on a Doc are converted into displaCy's format."""
|
|
||||||
words = ["This", "is", "a", "sentence"]
|
|
||||||
heads = [1, 0, 1, -2]
|
|
||||||
pos = ["DET", "VERB", "DET", "NOUN"]
|
|
||||||
tags = ["DT", "VBZ", "DT", "NN"]
|
|
||||||
deps = ["nsubj", "ROOT", "det", "attr"]
|
|
||||||
doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps)
|
|
||||||
deps = displacy.parse_deps(doc)
|
|
||||||
assert isinstance(deps, dict)
|
|
||||||
assert deps["words"] == [
|
|
||||||
{"text": "This", "tag": "DET"},
|
|
||||||
{"text": "is", "tag": "VERB"},
|
|
||||||
{"text": "a", "tag": "DET"},
|
|
||||||
{"text": "sentence", "tag": "NOUN"},
|
|
||||||
]
|
|
||||||
assert deps["arcs"] == [
|
|
||||||
{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
|
|
||||||
{"start": 2, "end": 3, "label": "det", "dir": "left"},
|
|
||||||
{"start": 1, "end": 3, "label": "attr", "dir": "right"},
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
def test_displacy_spans(en_vocab):
|
|
||||||
"""Test that displaCy can render Spans."""
|
|
||||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
|
||||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
|
||||||
html = displacy.render(doc[1:4], style="ent")
|
|
||||||
assert html.startswith("<div")
|
|
||||||
|
|
||||||
|
|
||||||
def test_displacy_render_wrapper(en_vocab):
|
|
||||||
"""Test that displaCy accepts custom rendering wrapper."""
|
|
||||||
|
|
||||||
def wrapper(html):
|
|
||||||
return "TEST" + html + "TEST"
|
|
||||||
|
|
||||||
displacy.set_render_wrapper(wrapper)
|
|
||||||
doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
|
|
||||||
doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])]
|
|
||||||
html = displacy.render(doc, style="ent")
|
|
||||||
assert html.startswith("TEST<div")
|
|
||||||
assert html.endswith("/div>TEST")
|
|
||||||
|
|
||||||
|
|
||||||
def test_displacy_raises_for_wrong_type(en_vocab):
|
|
||||||
with pytest.raises(ValueError):
|
|
||||||
displacy.render("hello world")
|
|
||||||
|
|
||||||
|
|
||||||
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
|
||||||
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
|
model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
|
||||||
assert model.W.shape == (nF, nO, nP, nI)
|
assert model.W.shape == (nF, nO, nP, nI)
|
||||||
|
@ -124,3 +85,9 @@ def test_prefer_gpu():
|
||||||
def test_require_gpu():
|
def test_require_gpu():
|
||||||
with pytest.raises(ValueError):
|
with pytest.raises(ValueError):
|
||||||
require_gpu()
|
require_gpu()
|
||||||
|
|
||||||
|
|
||||||
|
def test_create_symlink_windows(symlink_setup_target, symlink_target, symlink):
|
||||||
|
assert symlink_target.exists()
|
||||||
|
symlink_to(symlink, symlink_target)
|
||||||
|
assert symlink.exists()
|
||||||
|
|
|
@ -45,3 +45,8 @@ def test_vocab_api_contains(en_vocab, text):
|
||||||
_ = en_vocab[text] # noqa: F841
|
_ = en_vocab[text] # noqa: F841
|
||||||
assert text in en_vocab
|
assert text in en_vocab
|
||||||
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
|
assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
|
||||||
|
|
||||||
|
|
||||||
|
def test_vocab_writing_system(en_vocab):
|
||||||
|
assert en_vocab.writing_system["direction"] == "ltr"
|
||||||
|
assert en_vocab.writing_system["has_case"] is True
|
||||||
|
|
|
@ -125,7 +125,7 @@ cdef class Tokenizer:
|
||||||
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def pipe(self, texts, batch_size=1000, n_threads=2):
|
def pipe(self, texts, batch_size=1000, n_threads=-1):
|
||||||
"""Tokenize a stream of texts.
|
"""Tokenize a stream of texts.
|
||||||
|
|
||||||
texts: A sequence of unicode texts.
|
texts: A sequence of unicode texts.
|
||||||
|
@ -134,6 +134,8 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#pipe
|
DOCS: https://spacy.io/api/tokenizer#pipe
|
||||||
"""
|
"""
|
||||||
|
if n_threads != -1:
|
||||||
|
deprecation_warning(Warnings.W016)
|
||||||
for text in texts:
|
for text in texts:
|
||||||
yield self(text)
|
yield self(text)
|
||||||
|
|
||||||
|
@ -360,36 +362,37 @@ cdef class Tokenizer:
|
||||||
self._cache.set(key, cached)
|
self._cache.set(key, cached)
|
||||||
self._rules[string] = substrings
|
self._rules[string] = substrings
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory, which will be created if
|
path (unicode or Path): A path to a directory, which will be created if
|
||||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
it doesn't exist.
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||||
"""
|
"""
|
||||||
with path.open("wb") as file_:
|
with path.open("wb") as file_:
|
||||||
file_.write(self.to_bytes(**exclude))
|
file_.write(self.to_bytes(**kwargs))
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **kwargs):
|
||||||
"""Loads state from a directory. Modifies the object in place and
|
"""Loads state from a directory. Modifies the object in place and
|
||||||
returns it.
|
returns it.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory. Paths may be either
|
path (unicode or Path): A path to a directory.
|
||||||
strings or `Path`-like objects.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Tokenizer): The modified `Tokenizer` object.
|
RETURNS (Tokenizer): The modified `Tokenizer` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||||
"""
|
"""
|
||||||
with path.open("rb") as file_:
|
with path.open("rb") as file_:
|
||||||
bytes_data = file_.read()
|
bytes_data = file_.read()
|
||||||
self.from_bytes(bytes_data, **exclude)
|
self.from_bytes(bytes_data, **kwargs)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
"""Serialize the current state to a binary string.
|
"""Serialize the current state to a binary string.
|
||||||
|
|
||||||
**exclude: Named attributes to prevent from being serialized.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (bytes): The serialized form of the `Tokenizer` object.
|
RETURNS (bytes): The serialized form of the `Tokenizer` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
||||||
|
@ -402,13 +405,14 @@ cdef class Tokenizer:
|
||||||
("token_match", lambda: _get_regex_pattern(self.token_match)),
|
("token_match", lambda: _get_regex_pattern(self.token_match)),
|
||||||
("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
|
("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
"""Load state from a binary string.
|
"""Load state from a binary string.
|
||||||
|
|
||||||
bytes_data (bytes): The data to load from.
|
bytes_data (bytes): The data to load from.
|
||||||
**exclude: Named attributes to prevent from being loaded.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Tokenizer): The `Tokenizer` object.
|
RETURNS (Tokenizer): The `Tokenizer` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/tokenizer#from_bytes
|
DOCS: https://spacy.io/api/tokenizer#from_bytes
|
||||||
|
@ -422,6 +426,7 @@ cdef class Tokenizer:
|
||||||
("token_match", lambda b: data.setdefault("token_match", b)),
|
("token_match", lambda b: data.setdefault("token_match", b)),
|
||||||
("exceptions", lambda b: data.setdefault("rules", b))
|
("exceptions", lambda b: data.setdefault("rules", b))
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
if data.get("prefix_search"):
|
if data.get("prefix_search"):
|
||||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||||
|
|
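The `Tokenizer.to_bytes` and `Tokenizer.from_bytes` changes above replace per-field keyword flags with a single `exclude` list, while still accepting `**kwargs` for the transition. The body of `util.get_serialization_exclude` is not shown in this diff, so the following is only a minimal sketch of how such a helper could behave. `merge_exclude`, `to_bytes_sketch`, and the toy `serializers` dict are hypothetical names used for illustration, not spaCy's API.

```python
def merge_exclude(serializers, exclude=tuple(), **kwargs):
    """Hypothetical helper: merge an `exclude` list with legacy keyword flags."""
    merged = list(exclude)
    for key, value in kwargs.items():
        if key == "vocab" and value is False:
            # Assumption: the legacy `vocab=False` flag is still honoured.
            merged.append(key)
        elif key in serializers:
            # Assumption: other legacy flags are rejected with an error.
            raise ValueError("Pass exclude=['%s'] instead of a keyword argument" % key)
    return merged


def to_bytes_sketch(serializers, exclude):
    """Run every serializer callback whose name is not excluded."""
    return {name: get() for name, get in serializers.items() if name not in exclude}


# Toy stand-ins for the (name, callback) pairs built in the diff.
serializers = {
    "prefix_search": lambda: "^a",
    "token_match": lambda: None,
    "vocab": lambda: "vocab-data",
}
```

With this sketch, `merge_exclude(serializers, ["token_match"], vocab=False)` yields a list containing both names, whereas `merge_exclude(serializers, token_match=False)` raises a `ValueError`.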
|
@ -240,8 +240,18 @@ cdef class Doc:
|
||||||
for i in range(1, self.length):
|
for i in range(1, self.length):
|
||||||
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
|
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
|
||||||
return True
|
return True
|
||||||
else:
|
return False
|
||||||
return False
|
|
||||||
|
@property
|
||||||
|
def is_nered(self):
|
||||||
|
"""Check if the document has named entities set. Will return True if
|
||||||
|
*any* of the tokens has a named entity tag set (even if the others are
|
||||||
|
unknown values).
|
||||||
|
"""
|
||||||
|
for i in range(self.length):
|
||||||
|
if self.c[i].ent_iob != 0:
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
def __getitem__(self, object i):
|
def __getitem__(self, object i):
|
||||||
"""Get a `Token` or `Span` object.
|
"""Get a `Token` or `Span` object.
|
||||||
|
@ -374,7 +384,8 @@ cdef class Doc:
|
||||||
xp = get_array_module(vector)
|
xp = get_array_module(vector)
|
||||||
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
|
return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||||
|
|
||||||
property has_vector:
|
@property
|
||||||
|
def has_vector(self):
|
||||||
"""A boolean value indicating whether a word vector is associated with
|
"""A boolean value indicating whether a word vector is associated with
|
||||||
the object.
|
the object.
|
||||||
|
|
||||||
|
@ -382,15 +393,14 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#has_vector
|
DOCS: https://spacy.io/api/doc#has_vector
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "has_vector" in self.user_hooks:
|
||||||
if "has_vector" in self.user_hooks:
|
return self.user_hooks["has_vector"](self)
|
||||||
return self.user_hooks["has_vector"](self)
|
elif self.vocab.vectors.data.size:
|
||||||
elif self.vocab.vectors.data.size:
|
return True
|
||||||
return True
|
elif self.tensor.size:
|
||||||
elif self.tensor.size:
|
return True
|
||||||
return True
|
else:
|
||||||
else:
|
return False
|
||||||
return False
|
|
||||||
|
|
||||||
property vector:
|
property vector:
|
||||||
"""A real-valued meaning representation. Defaults to an average of the
|
"""A real-valued meaning representation. Defaults to an average of the
|
||||||
|
@ -443,22 +453,22 @@ cdef class Doc:
|
||||||
def __set__(self, value):
|
def __set__(self, value):
|
||||||
self._vector_norm = value
|
self._vector_norm = value
|
||||||
|
|
||||||
property text:
|
@property
|
||||||
|
def text(self):
|
||||||
"""A unicode representation of the document text.
|
"""A unicode representation of the document text.
|
||||||
|
|
||||||
RETURNS (unicode): The original verbatim text of the document.
|
RETURNS (unicode): The original verbatim text of the document.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return "".join(t.text_with_ws for t in self)
|
||||||
return "".join(t.text_with_ws for t in self)
|
|
||||||
|
|
||||||
property text_with_ws:
|
@property
|
||||||
|
def text_with_ws(self):
|
||||||
"""An alias of `Doc.text`, provided for duck-type compatibility with
|
"""An alias of `Doc.text`, provided for duck-type compatibility with
|
||||||
`Span` and `Token`.
|
`Span` and `Token`.
|
||||||
|
|
||||||
RETURNS (unicode): The original verbatim text of the document.
|
RETURNS (unicode): The original verbatim text of the document.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.text
|
||||||
return self.text
|
|
||||||
|
|
||||||
property ents:
|
property ents:
|
||||||
"""The named entities in the document. Returns a tuple of named entity
|
"""The named entities in the document. Returns a tuple of named entity
|
||||||
|
@ -535,7 +545,8 @@ cdef class Doc:
|
||||||
# Set start as B
|
# Set start as B
|
||||||
self.c[start].ent_iob = 3
|
self.c[start].ent_iob = 3
|
||||||
|
|
||||||
property noun_chunks:
|
@property
|
||||||
|
def noun_chunks(self):
|
||||||
"""Iterate over the base noun phrases in the document. Yields base
|
"""Iterate over the base noun phrases in the document. Yields base
|
||||||
noun-phrase `Span` objects, if the document has been
|
noun-phrase `Span` objects, if the document has been
|
||||||
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
||||||
|
@ -547,22 +558,22 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#noun_chunks
|
DOCS: https://spacy.io/api/doc#noun_chunks
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if not self.is_parsed:
|
||||||
if not self.is_parsed:
|
raise ValueError(Errors.E029)
|
||||||
raise ValueError(Errors.E029)
|
# Accumulate the result before beginning to iterate over it. This
|
||||||
# Accumulate the result before beginning to iterate over it. This
|
# prevents the tokenisation from being changed out from under us
|
||||||
# prevents the tokenisation from being changed out from under us
|
# during the iteration. The tricky thing here is that Span accepts
|
||||||
# during the iteration. The tricky thing here is that Span accepts
|
# its tokenisation changing, so it's okay once we have the Span
|
||||||
# its tokenisation changing, so it's okay once we have the Span
|
# objects. See Issue #375.
|
||||||
# objects. See Issue #375.
|
spans = []
|
||||||
spans = []
|
if self.noun_chunks_iterator is not None:
|
||||||
if self.noun_chunks_iterator is not None:
|
for start, end, label in self.noun_chunks_iterator(self):
|
||||||
for start, end, label in self.noun_chunks_iterator(self):
|
spans.append(Span(self, start, end, label=label))
|
||||||
spans.append(Span(self, start, end, label=label))
|
for span in spans:
|
||||||
for span in spans:
|
yield span
|
||||||
yield span
|
|
||||||
|
|
||||||
property sents:
|
@property
|
||||||
|
def sents(self):
|
||||||
"""Iterate over the sentences in the document. Yields sentence `Span`
|
"""Iterate over the sentences in the document. Yields sentence `Span`
|
||||||
objects. Sentence spans have no label. To improve accuracy on informal
|
objects. Sentence spans have no label. To improve accuracy on informal
|
||||||
texts, spaCy calculates sentence boundaries from the syntactic
|
texts, spaCy calculates sentence boundaries from the syntactic
|
||||||
|
@ -573,19 +584,28 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#sents
|
DOCS: https://spacy.io/api/doc#sents
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if not self.is_sentenced:
|
||||||
if not self.is_sentenced:
|
raise ValueError(Errors.E030)
|
||||||
raise ValueError(Errors.E030)
|
if "sents" in self.user_hooks:
|
||||||
if "sents" in self.user_hooks:
|
yield from self.user_hooks["sents"](self)
|
||||||
yield from self.user_hooks["sents"](self)
|
else:
|
||||||
else:
|
start = 0
|
||||||
start = 0
|
for i in range(1, self.length):
|
||||||
for i in range(1, self.length):
|
if self.c[i].sent_start == 1:
|
||||||
if self.c[i].sent_start == 1:
|
yield Span(self, start, i)
|
||||||
yield Span(self, start, i)
|
start = i
|
||||||
start = i
|
if start != self.length:
|
||||||
if start != self.length:
|
yield Span(self, start, self.length)
|
||||||
yield Span(self, start, self.length)
|
|
||||||
|
@property
|
||||||
|
def lang(self):
|
||||||
|
"""RETURNS (uint64): ID of the language of the doc's vocabulary."""
|
||||||
|
return self.vocab.strings[self.vocab.lang]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def lang_(self):
|
||||||
|
"""RETURNS (unicode): Language of the doc's vocabulary, e.g. 'en'."""
|
||||||
|
return self.vocab.lang
|
||||||
|
|
||||||
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
|
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
|
||||||
if self.length == 0:
|
if self.length == 0:
|
||||||
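The fallback branch of `Doc.sents` above derives sentence spans from the per-token `sent_start` flags. A minimal pure-Python sketch of that loop follows, assuming the flags are given as a plain list of integers (one per token) rather than the `TokenC` structs used in the Cython source. `iter_sentence_bounds` is a hypothetical name.

```python
def iter_sentence_bounds(sent_starts):
    """Yield (start, end) token offsets for each sentence.

    `sent_starts` holds one flag per token; 1 marks the first token
    of a sentence, mirroring the `sent_start == 1` check in the diff.
    """
    length = len(sent_starts)
    start = 0
    # Token 0 always opens the first sentence, so scanning starts at 1.
    for i in range(1, length):
        if sent_starts[i] == 1:
            yield (start, i)
            start = i
    # Emit the trailing sentence, if any tokens remain.
    if start != length:
        yield (start, length)
```

Each pair corresponds to the arguments of `Span(self, start, i)` in the original; an empty input yields nothing because `start` already equals the length.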
|
@ -727,6 +747,18 @@ cdef class Doc:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#from_array
|
DOCS: https://spacy.io/api/doc#from_array
|
||||||
"""
|
"""
|
||||||
|
# Handle scalar/list inputs of strings/ints for py_attr_ids
|
||||||
|
# See also #3064
|
||||||
|
if isinstance(attrs, basestring_):
|
||||||
|
# Handle inputs like doc.to_array('ORTH')
|
||||||
|
attrs = [attrs]
|
||||||
|
elif not hasattr(attrs, "__iter__"):
|
||||||
|
# Handle inputs like doc.to_array(ORTH)
|
||||||
|
attrs = [attrs]
|
||||||
|
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||||
|
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||||
|
for id_ in attrs]
|
||||||
|
|
||||||
if SENT_START in attrs and HEAD in attrs:
|
if SENT_START in attrs and HEAD in attrs:
|
||||||
raise ValueError(Errors.E032)
|
raise ValueError(Errors.E032)
|
||||||
cdef int i, col
|
cdef int i, col
|
||||||
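The block added to `Doc.from_array` above lets `attrs` be a single string, a single integer ID, or a list mixing both. The same normalization can be sketched in plain Python, assuming Python 3 `str` in place of `basestring_` and a hypothetical `IDS` table with placeholder values (the real IDs come from spaCy's attribute table, which is not part of this diff).

```python
# Placeholder attribute IDs for illustration only; not spaCy's real values.
IDS = {"ORTH": 1, "LEMMA": 2, "TAG": 3}


def normalize_attrs(attrs):
    """Return a list of integer attribute IDs for any accepted input shape."""
    if isinstance(attrs, str):
        # A single name such as "ORTH"
        attrs = [attrs]
    elif not hasattr(attrs, "__iter__"):
        # A single integer ID
        attrs = [attrs]
    # Names are looked up case-insensitively; integers pass through unchanged.
    return [IDS[a.upper()] if hasattr(a, "upper") else a for a in attrs]
```

The string check has to come first: a string is itself iterable, so without it a name like `"ORTH"` would be split into single characters.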
|
@ -739,17 +771,20 @@ cdef class Doc:
|
||||||
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
|
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
|
||||||
for i, attr_id in enumerate(attrs):
|
for i, attr_id in enumerate(attrs):
|
||||||
attr_ids[i] = attr_id
|
attr_ids[i] = attr_id
|
||||||
|
if len(array.shape) == 1:
|
||||||
|
array = array.reshape((array.size, 1))
|
||||||
|
# Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
|
||||||
|
if TAG in attrs:
|
||||||
|
col = attrs.index(TAG)
|
||||||
|
for i in range(length):
|
||||||
|
if array[i, col] != 0:
|
||||||
|
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
|
||||||
# Now load the data
|
# Now load the data
|
||||||
for i in range(self.length):
|
for i in range(self.length):
|
||||||
token = &self.c[i]
|
token = &self.c[i]
|
||||||
for j in range(n_attrs):
|
for j in range(n_attrs):
|
||||||
Token.set_struct_attr(token, attr_ids[j], array[i, j])
|
if attr_ids[j] != TAG:
|
||||||
# Auxiliary loading logic
|
Token.set_struct_attr(token, attr_ids[j], array[i, j])
|
||||||
for col, attr_id in enumerate(attrs):
|
|
||||||
if attr_id == TAG:
|
|
||||||
for i in range(length):
|
|
||||||
if array[i, col] != 0:
|
|
||||||
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
|
|
||||||
# Set flags
|
# Set flags
|
||||||
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
|
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
|
||||||
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
|
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
|
||||||
|
@ -770,24 +805,26 @@ cdef class Doc:
|
||||||
"""
|
"""
|
||||||
return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
|
return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory, which will be created if
|
path (unicode or Path): A path to a directory, which will be created if
|
||||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#to_disk
|
DOCS: https://spacy.io/api/doc#to_disk
|
||||||
"""
|
"""
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
with path.open("wb") as file_:
|
with path.open("wb") as file_:
|
||||||
file_.write(self.to_bytes(**exclude))
|
file_.write(self.to_bytes(**kwargs))
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **kwargs):
|
||||||
"""Loads state from a directory. Modifies the object in place and
|
"""Loads state from a directory. Modifies the object in place and
|
||||||
returns it.
|
returns it.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory. Paths may be either
|
path (unicode or Path): A path to a directory. Paths may be either
|
||||||
strings or `Path`-like objects.
|
strings or `Path`-like objects.
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Doc): The modified `Doc` object.
|
RETURNS (Doc): The modified `Doc` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#from_disk
|
DOCS: https://spacy.io/api/doc#from_disk
|
||||||
|
@ -795,11 +832,12 @@ cdef class Doc:
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
with path.open("rb") as file_:
|
with path.open("rb") as file_:
|
||||||
bytes_data = file_.read()
|
bytes_data = file_.read()
|
||||||
return self.from_bytes(bytes_data, **exclude)
|
return self.from_bytes(bytes_data, **kwargs)
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
"""Serialize, i.e. export the document contents to a binary string.
|
"""Serialize, i.e. export the document contents to a binary string.
|
||||||
|
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
|
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
|
||||||
all annotations.
|
all annotations.
|
||||||
|
|
||||||
|
@ -825,16 +863,22 @@ cdef class Doc:
|
||||||
"sentiment": lambda: self.sentiment,
|
"sentiment": lambda: self.sentiment,
|
||||||
"tensor": lambda: self.tensor,
|
"tensor": lambda: self.tensor,
|
||||||
}
|
}
|
||||||
|
for key in kwargs:
|
||||||
|
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
|
||||||
|
raise ValueError(Errors.E128.format(arg=key))
|
||||||
if "user_data" not in exclude and self.user_data:
|
if "user_data" not in exclude and self.user_data:
|
||||||
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
||||||
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
|
if "user_data_keys" not in exclude:
|
||||||
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
|
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
|
||||||
|
if "user_data_values" not in exclude:
|
||||||
|
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
"""Deserialize, i.e. import the document contents from a binary string.
|
"""Deserialize, i.e. import the document contents from a binary string.
|
||||||
|
|
||||||
bytes_data (bytes): The string to load from.
|
bytes_data (bytes): The string to load from.
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Doc): Itself.
|
RETURNS (Doc): Itself.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/doc#from_bytes
|
DOCS: https://spacy.io/api/doc#from_bytes
|
||||||
|
@ -850,6 +894,9 @@ cdef class Doc:
|
||||||
"user_data_keys": lambda b: None,
|
"user_data_keys": lambda b: None,
|
||||||
"user_data_values": lambda b: None,
|
"user_data_values": lambda b: None,
|
||||||
}
|
}
|
||||||
|
for key in kwargs:
|
||||||
|
if key in deserializers or key in ("user_data",):
|
||||||
|
raise ValueError(Errors.E128.format(arg=key))
|
||||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||||
# vexing for user data. As a best guess, we *know* that within
|
# vexing for user data. As a best guess, we *know* that within
|
||||||
|
@ -990,11 +1037,11 @@ cdef class Doc:
|
||||||
DOCS: https://spacy.io/api/doc#to_json
|
DOCS: https://spacy.io/api/doc#to_json
|
||||||
"""
|
"""
|
||||||
data = {"text": self.text}
|
data = {"text": self.text}
|
||||||
if self.ents:
|
if self.is_nered:
|
||||||
data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
|
data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
|
||||||
"label": ent.label_} for ent in self.ents]
|
"label": ent.label_} for ent in self.ents]
|
||||||
sents = list(self.sents)
|
if self.is_sentenced:
|
||||||
if sents:
|
sents = list(self.sents)
|
||||||
data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
|
data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
|
||||||
for sent in sents]
|
for sent in sents]
|
||||||
if self.cats:
|
if self.cats:
|
||||||
|
@ -1002,13 +1049,11 @@ cdef class Doc:
|
||||||
data["tokens"] = []
|
data["tokens"] = []
|
||||||
for token in self:
|
for token in self:
|
||||||
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
|
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
|
||||||
if token.pos_:
|
if self.is_tagged:
|
||||||
token_data["pos"] = token.pos_
|
token_data["pos"] = token.pos_
|
||||||
if token.tag_:
|
|
||||||
token_data["tag"] = token.tag_
|
token_data["tag"] = token.tag_
|
||||||
if token.dep_:
|
if self.is_parsed:
|
||||||
token_data["dep"] = token.dep_
|
token_data["dep"] = token.dep_
|
||||||
if token.head:
|
|
||||||
token_data["head"] = token.head.i
|
token_data["head"] = token.head.i
|
||||||
data["tokens"].append(token_data)
|
data["tokens"].append(token_data)
|
||||||
if underscore:
|
if underscore:
|
||||||
|
@ -1179,7 +1224,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
|
||||||
|
|
||||||
|
|
||||||
def pickle_doc(doc):
|
def pickle_doc(doc):
|
||||||
bytes_data = doc.to_bytes(vocab=False, user_data=False)
|
bytes_data = doc.to_bytes(exclude=["vocab", "user_data"])
|
||||||
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
|
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
|
||||||
doc.user_token_hooks)
|
doc.user_token_hooks)
|
||||||
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
|
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
|
||||||
|
@ -1188,7 +1233,7 @@ def pickle_doc(doc):
|
||||||
def unpickle_doc(vocab, hooks_and_data, bytes_data):
|
def unpickle_doc(vocab, hooks_and_data, bytes_data):
|
||||||
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
|
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
|
||||||
|
|
||||||
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude="user_data")
|
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude=["user_data"])
|
||||||
doc.user_hooks.update(doc_hooks)
|
doc.user_hooks.update(doc_hooks)
|
||||||
doc.user_span_hooks.update(span_hooks)
|
doc.user_span_hooks.update(span_hooks)
|
||||||
doc.user_token_hooks.update(token_hooks)
|
doc.user_token_hooks.update(token_hooks)
|
||||||
|
|
|
@ -322,46 +322,47 @@ cdef class Span:
|
||||||
self.start = start
|
self.start = start
|
||||||
self.end = end + 1
|
self.end = end + 1
|
||||||
|
|
||||||
property vocab:
|
@property
|
||||||
|
def vocab(self):
|
||||||
"""RETURNS (Vocab): The Span's Doc's vocab."""
|
"""RETURNS (Vocab): The Span's Doc's vocab."""
|
||||||
def __get__(self):
|
return self.doc.vocab
|
||||||
return self.doc.vocab
|
|
||||||
|
|
||||||
property sent:
|
@property
|
||||||
|
def sent(self):
|
||||||
"""RETURNS (Span): The sentence span that the span is a part of."""
|
"""RETURNS (Span): The sentence span that the span is a part of."""
|
||||||
def __get__(self):
|
if "sent" in self.doc.user_span_hooks:
|
||||||
if "sent" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["sent"](self)
|
||||||
return self.doc.user_span_hooks["sent"](self)
|
# This should raise if not parsed / no custom sentence boundaries
|
||||||
# This should raise if not parsed / no custom sentence boundaries
|
self.doc.sents
|
||||||
self.doc.sents
|
# If doc is parsed we can use the deps to find the sentence
|
||||||
# If doc is parsed we can use the deps to find the sentence
|
# otherwise we use the `sent_start` token attribute
|
||||||
# otherwise we use the `sent_start` token attribute
|
cdef int n = 0
|
||||||
cdef int n = 0
|
cdef int i
|
||||||
cdef int i
|
if self.doc.is_parsed:
|
||||||
if self.doc.is_parsed:
|
root = &self.doc.c[self.start]
|
||||||
root = &self.doc.c[self.start]
|
while root.head != 0:
|
||||||
while root.head != 0:
|
root += root.head
|
||||||
root += root.head
|
n += 1
|
||||||
n += 1
|
if n >= self.doc.length:
|
||||||
if n >= self.doc.length:
|
raise RuntimeError(Errors.E038)
|
||||||
raise RuntimeError(Errors.E038)
|
return self.doc[root.l_edge:root.r_edge + 1]
|
||||||
return self.doc[root.l_edge:root.r_edge + 1]
|
elif self.doc.is_sentenced:
|
||||||
elif self.doc.is_sentenced:
|
# Find start of the sentence
|
||||||
# Find start of the sentence
|
start = self.start
|
||||||
start = self.start
|
while self.doc.c[start].sent_start != 1 and start > 0:
|
||||||
while self.doc.c[start].sent_start != 1 and start > 0:
|
start += -1
|
||||||
start += -1
|
# Find end of the sentence
|
||||||
# Find end of the sentence
|
end = self.end
|
||||||
end = self.end
|
n = 0
|
||||||
n = 0
|
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
||||||
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
end += 1
|
||||||
end += 1
|
n += 1
|
||||||
n += 1
|
if n >= self.doc.length:
|
||||||
if n >= self.doc.length:
|
break
|
||||||
break
|
return self.doc[start:end]
|
||||||
return self.doc[start:end]
|
|
||||||
|
|
||||||
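When the document is not parsed but has sentence boundaries, `Span.sent` above walks left to the nearest token flagged as a sentence start and right to the next one. A minimal sketch of that scan over a plain list of flags follows. `enclosing_sentence` is a hypothetical name, and the defensive iteration counter `n` from the original is left out because the `end < len(...)` bound already terminates the loop.

```python
def enclosing_sentence(sent_starts, span_start, span_end):
    """Return (start, end) token offsets of the sentence containing a span.

    `sent_starts` holds one flag per token (1 = first token of a sentence);
    `span_end` is exclusive, as with `Span.end`.
    """
    # Walk left until a sentence-initial token or the document start.
    start = span_start
    while sent_starts[start] != 1 and start > 0:
        start -= 1
    # Walk right until the next sentence-initial token or the document end.
    end = span_end
    while end < len(sent_starts) and sent_starts[end] != 1:
        end += 1
    return (start, end)
```

The returned offsets play the role of `self.doc[start:end]` in the original.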
property ents:
|
@property
|
||||||
|
def ents(self):
|
||||||
"""The named entities in the span. Returns a tuple of named entity
|
"""The named entities in the span. Returns a tuple of named entity
|
||||||
`Span` objects, if the entity recognizer has been applied.
|
`Span` objects, if the entity recognizer has been applied.
|
||||||
|
|
||||||
|
@ -369,14 +370,14 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#ents
|
DOCS: https://spacy.io/api/span#ents
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
ents = []
|
||||||
ents = []
|
for ent in self.doc.ents:
|
||||||
for ent in self.doc.ents:
|
if ent.start >= self.start and ent.end <= self.end:
|
||||||
if ent.start >= self.start and ent.end <= self.end:
|
ents.append(ent)
|
||||||
ents.append(ent)
|
return ents
|
||||||
return ents
|
|
||||||
|
|
||||||
property has_vector:
|
@property
|
||||||
|
def has_vector(self):
|
||||||
"""A boolean value indicating whether a word vector is associated with
|
"""A boolean value indicating whether a word vector is associated with
|
||||||
the object.
|
the object.
|
||||||
|
|
||||||
|
@ -384,17 +385,17 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#has_vector
|
DOCS: https://spacy.io/api/span#has_vector
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "has_vector" in self.doc.user_span_hooks:
|
||||||
if "has_vector" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["has_vector"](self)
|
||||||
return self.doc.user_span_hooks["has_vector"](self)
|
elif self.vocab.vectors.data.size > 0:
|
||||||
elif self.vocab.vectors.data.size > 0:
|
return any(token.has_vector for token in self)
|
||||||
return any(token.has_vector for token in self)
|
elif self.doc.tensor.size > 0:
|
||||||
elif self.doc.tensor.size > 0:
|
return True
|
||||||
return True
|
else:
|
||||||
else:
|
return False
|
||||||
return False
|
|
||||||
|
|
||||||
property vector:
|
@property
|
||||||
|
def vector(self):
|
||||||
"""A real-valued meaning representation. Defaults to an average of the
|
"""A real-valued meaning representation. Defaults to an average of the
|
||||||
token vectors.
|
token vectors.
|
||||||
|
|
||||||
|
@ -403,61 +404,61 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#vector
|
DOCS: https://spacy.io/api/span#vector
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "vector" in self.doc.user_span_hooks:
|
||||||
if "vector" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["vector"](self)
|
||||||
return self.doc.user_span_hooks["vector"](self)
|
if self._vector is None:
|
||||||
if self._vector is None:
|
self._vector = sum(t.vector for t in self) / len(self)
|
||||||
self._vector = sum(t.vector for t in self) / len(self)
|
return self._vector
|
||||||
return self._vector
|
|
||||||
|
|
||||||
property vector_norm:
|
@property
|
||||||
|
def vector_norm(self):
|
||||||
"""The L2 norm of the span's vector representation.
|
"""The L2 norm of the span's vector representation.
|
||||||
|
|
||||||
RETURNS (float): The L2 norm of the vector representation.
|
RETURNS (float): The L2 norm of the vector representation.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#vector_norm
|
DOCS: https://spacy.io/api/span#vector_norm
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "vector_norm" in self.doc.user_span_hooks:
|
||||||
if "vector_norm" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["vector_norm"](self)
|
||||||
return self.doc.user_span_hooks["vector_norm"](self)
|
cdef float value
|
||||||
cdef float value
|
cdef double norm = 0
|
||||||
cdef double norm = 0
|
if self._vector_norm is None:
|
||||||
if self._vector_norm is None:
|
norm = 0
|
||||||
norm = 0
|
for value in self.vector:
|
||||||
for value in self.vector:
|
norm += value * value
|
||||||
norm += value * value
|
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
||||||
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
return self._vector_norm
|
||||||
return self._vector_norm
|
|
||||||
|
|
||||||
property sentiment:
|
@property
|
||||||
|
def sentiment(self):
|
||||||
"""RETURNS (float): A scalar value indicating the positivity or
|
"""RETURNS (float): A scalar value indicating the positivity or
|
||||||
negativity of the span.
|
negativity of the span.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "sentiment" in self.doc.user_span_hooks:
|
||||||
if "sentiment" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["sentiment"](self)
|
||||||
return self.doc.user_span_hooks["sentiment"](self)
|
else:
|
||||||
else:
|
return sum([token.sentiment for token in self]) / len(self)
|
||||||
return sum([token.sentiment for token in self]) / len(self)
|
|
||||||
|
|
||||||
property text:
|
@property
|
||||||
|
def text(self):
|
||||||
"""RETURNS (unicode): The original verbatim text of the span."""
|
"""RETURNS (unicode): The original verbatim text of the span."""
|
||||||
def __get__(self):
|
text = self.text_with_ws
|
||||||
text = self.text_with_ws
|
if self[-1].whitespace_:
|
||||||
if self[-1].whitespace_:
|
text = text[:-1]
|
||||||
text = text[:-1]
|
return text
|
||||||
return text
|
|
||||||
|
|
||||||
property text_with_ws:
|
@property
|
||||||
|
def text_with_ws(self):
|
||||||
"""The text content of the span with a trailing whitespace character if
|
"""The text content of the span with a trailing whitespace character if
|
||||||
the last token has one.
|
the last token has one.
|
||||||
|
|
||||||
RETURNS (unicode): The text content of the span (with trailing
|
RETURNS (unicode): The text content of the span (with trailing
|
||||||
whitespace).
|
whitespace).
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return "".join([t.text_with_ws for t in self])
|
||||||
return "".join([t.text_with_ws for t in self])
|
|
||||||
|
|
||||||
property noun_chunks:
|
@property
|
||||||
|
def noun_chunks(self):
|
||||||
"""Yields base noun-phrase `Span` objects, if the document has been
|
"""Yields base noun-phrase `Span` objects, if the document has been
|
||||||
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
|
||||||
phrase that does not permit other NPs to be nested within it – so no
|
phrase that does not permit other NPs to be nested within it – so no
|
||||||
|
@ -468,23 +469,23 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#noun_chunks
|
DOCS: https://spacy.io/api/span#noun_chunks
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if not self.doc.is_parsed:
|
||||||
if not self.doc.is_parsed:
|
raise ValueError(Errors.E029)
|
||||||
raise ValueError(Errors.E029)
|
# Accumulate the result before beginning to iterate over it. This
|
||||||
# Accumulate the result before beginning to iterate over it. This
|
# prevents the tokenisation from being changed out from under us
|
||||||
# prevents the tokenisation from being changed out from under us
|
# during the iteration. The tricky thing here is that Span accepts
|
||||||
# during the iteration. The tricky thing here is that Span accepts
|
# its tokenisation changing, so it's okay once we have the Span
|
||||||
# its tokenisation changing, so it's okay once we have the Span
|
# objects. See Issue #375
|
||||||
# objects. See Issue #375
|
spans = []
|
||||||
spans = []
|
cdef attr_t label
|
||||||
cdef attr_t label
|
if self.doc.noun_chunks_iterator is not None:
|
||||||
if self.doc.noun_chunks_iterator is not None:
|
for start, end, label in self.doc.noun_chunks_iterator(self):
|
||||||
for start, end, label in self.doc.noun_chunks_iterator(self):
|
spans.append(Span(self.doc, start, end, label=label))
|
||||||
spans.append(Span(self.doc, start, end, label=label))
|
for span in spans:
|
||||||
for span in spans:
|
yield span
|
||||||
yield span
|
|
||||||
|
|
||||||
property root:
|
@property
|
||||||
|
def root(self):
|
||||||
"""The token with the shortest path to the root of the
|
"""The token with the shortest path to the root of the
|
||||||
sentence (or the root itself). If multiple tokens are equally
|
sentence (or the root itself). If multiple tokens are equally
|
||||||
high in the tree, the first token is taken.
|
high in the tree, the first token is taken.
|
||||||
|
@ -493,41 +494,51 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#root
|
DOCS: https://spacy.io/api/span#root
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
self._recalculate_indices()
|
||||||
self._recalculate_indices()
|
if "root" in self.doc.user_span_hooks:
|
||||||
if "root" in self.doc.user_span_hooks:
|
return self.doc.user_span_hooks["root"](self)
|
||||||
return self.doc.user_span_hooks["root"](self)
|
# This should probably be called 'head', and the other one called
|
||||||
# This should probably be called 'head', and the other one called
|
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
|
||||||
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
|
cdef int i
|
||||||
cdef int i
|
# First, we scan through the Span, and check whether there's a word
|
||||||
# First, we scan through the Span, and check whether there's a word
|
# with head==0, i.e. a sentence root. If so, we can return it. The
|
||||||
# with head==0, i.e. a sentence root. If so, we can return it. The
|
# longer the span, the more likely it contains a sentence root, and
|
||||||
# longer the span, the more likely it contains a sentence root, and
|
# in this case we return in linear time.
|
||||||
# in this case we return in linear time.
|
for i in range(self.start, self.end):
|
||||||
for i in range(self.start, self.end):
|
if self.doc.c[i].head == 0:
|
||||||
if self.doc.c[i].head == 0:
|
return self.doc[i]
|
||||||
return self.doc[i]
|
# If we don't have a sentence root, we do something that's not so
|
||||||
# If we don't have a sentence root, we do something that's not so
|
# algorithmically clever, but I think should be quite fast,
|
||||||
# algorithmically clever, but I think should be quite fast,
|
# especially for short spans.
|
||||||
# especially for short spans.
|
# For each word, we count the path length, and arg min this measure.
|
||||||
# For each word, we count the path length, and arg min this measure.
|
# We could use better tree logic to save steps here...But I
|
||||||
# We could use better tree logic to save steps here...But I
|
# think this should be okay.
|
||||||
# think this should be okay.
|
cdef int current_best = self.doc.length
|
||||||
cdef int current_best = self.doc.length
|
cdef int root = -1
|
||||||
cdef int root = -1
|
for i in range(self.start, self.end):
|
||||||
for i in range(self.start, self.end):
|
if self.start <= (i+self.doc.c[i].head) < self.end:
|
||||||
if self.start <= (i+self.doc.c[i].head) < self.end:
|
continue
|
||||||
continue
|
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
|
||||||
words_to_root = _count_words_to_root(&self.doc.c[i], self.doc.length)
|
if words_to_root < current_best:
|
||||||
if words_to_root < current_best:
|
current_best = words_to_root
|
||||||
current_best = words_to_root
|
root = i
|
||||||
root = i
|
if root == -1:
|
||||||
if root == -1:
|
return self.doc[self.start]
|
||||||
return self.doc[self.start]
|
else:
|
||||||
else:
|
return self.doc[root]
|
||||||
return self.doc[root]
|
|
||||||
|
|
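The `root` lookup above can be sketched in plain Python. This is a minimal illustration, not spaCy's implementation: `span_root`, the `heads` list and the cycle guard are hypothetical names and details, with `heads[i]` standing in for the relative `head` offset that the diff reads from `self.doc.c[i]` (0 marks a sentence root).

```python
from typing import List


def span_root(heads: List[int], start: int, end: int) -> int:
    """Return the index of the root token of the span [start, end)."""
    # Fast path: if the span contains a sentence root, return it.
    for i in range(start, end):
        if heads[i] == 0:
            return i

    def words_to_root(i: int) -> int:
        # Count the arcs from token i up to its sentence root.
        steps = 0
        while heads[i] != 0:
            i += heads[i]
            steps += 1
            if steps > len(heads):  # assumption: guard against cyclic offsets
                raise ValueError("cycle in head offsets")
        return steps

    best, root = len(heads), -1
    for i in range(start, end):
        # Tokens whose head lies inside the span cannot be the span's root.
        if start <= i + heads[i] < end:
            continue
        n = words_to_root(i)
        if n < best:
            best, root = n, i
    # Fall back to the first token, as the diff does when nothing qualifies.
    return start if root == -1 else root


# "I like New York in Autumn": "like" is the sentence root.
heads = [1, 0, 1, -2, -3, -1]
print(span_root(heads, 2, 4))  # "New York" -> 3 ("York")
```

The strict `<` comparison keeps the first of several equally shallow tokens, which matches the docstring's "the first token is taken".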
||||||
property lefts:
|
@property
|
||||||
|
def conjuncts(self):
|
||||||
|
"""Tokens that are conjoined to the span's root.
|
||||||
|
|
||||||
|
RETURNS (tuple): A tuple of Token objects.
|
||||||
|
|
||||||
|
DOCS: https://spacy.io/api/span#conjuncts
|
||||||
|
"""
|
||||||
|
return self.root.conjuncts
|
||||||
|
|
||||||
|
@property
|
||||||
|
def lefts(self):
|
||||||
"""Tokens that are to the left of the span, whose head is within the
|
"""Tokens that are to the left of the span, whose head is within the
|
||||||
`Span`.
|
`Span`.
|
||||||
|
|
||||||
|
@ -535,13 +546,13 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#lefts
|
DOCS: https://spacy.io/api/span#lefts
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
for token in reversed(self): # Reverse, so we get tokens in order
|
||||||
for token in reversed(self): # Reverse, so we get tokens in order
|
for left in token.lefts:
|
||||||
for left in token.lefts:
|
if left.i < self.start:
|
||||||
if left.i < self.start:
|
yield left
|
||||||
yield left
|
|
||||||
|
|
||||||
property rights:
|
@property
|
||||||
|
def rights(self):
|
||||||
"""Tokens that are to the right of the Span, whose head is within the
|
"""Tokens that are to the right of the Span, whose head is within the
|
||||||
`Span`.
|
`Span`.
|
||||||
|
|
||||||
|
@ -549,13 +560,13 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#rights
|
DOCS: https://spacy.io/api/span#rights
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
for token in self:
|
||||||
for token in self:
|
for right in token.rights:
|
||||||
for right in token.rights:
|
if right.i >= self.end:
|
||||||
if right.i >= self.end:
|
yield right
|
||||||
yield right
|
|
||||||
|
|
||||||
property n_lefts:
|
@property
|
||||||
|
def n_lefts(self):
|
||||||
"""The number of tokens that are to the left of the span, whose
|
"""The number of tokens that are to the left of the span, whose
|
||||||
heads are within the span.
|
heads are within the span.
|
||||||
|
|
||||||
|
@ -564,10 +575,10 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#n_lefts
|
DOCS: https://spacy.io/api/span#n_lefts
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return len(list(self.lefts))
|
||||||
return len(list(self.lefts))
|
|
||||||
|
|
||||||
property n_rights:
|
@property
|
||||||
|
def n_rights(self):
|
||||||
"""The number of tokens that are to the right of the span, whose
|
"""The number of tokens that are to the right of the span, whose
|
||||||
heads are within the span.
|
heads are within the span.
|
||||||
|
|
||||||
|
@ -576,22 +587,21 @@ cdef class Span:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#n_rights
|
DOCS: https://spacy.io/api/span#n_rights
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return len(list(self.rights))
|
||||||
return len(list(self.rights))
|
|
||||||
|
|
||||||
property subtree:
|
@property
|
||||||
|
def subtree(self):
|
||||||
"""Tokens within the span and tokens which descend from them.
|
"""Tokens within the span and tokens which descend from them.
|
||||||
|
|
||||||
YIELDS (Token): A token within the span, or a descendant from it.
|
YIELDS (Token): A token within the span, or a descendant from it.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/span#subtree
|
DOCS: https://spacy.io/api/span#subtree
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
for word in self.lefts:
|
||||||
for word in self.lefts:
|
yield from word.subtree
|
||||||
yield from word.subtree
|
yield from self
|
||||||
yield from self
|
for word in self.rights:
|
||||||
for word in self.rights:
|
yield from word.subtree
|
||||||
yield from word.subtree
|
|
||||||
|
|
||||||
property ent_id:
|
property ent_id:
|
||||||
"""RETURNS (uint64): The entity ID."""
|
"""RETURNS (uint64): The entity ID."""
|
||||||
|
@ -609,33 +619,33 @@ cdef class Span:
|
||||||
def __set__(self, hash_t key):
|
def __set__(self, hash_t key):
|
||||||
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
|
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
|
||||||
|
|
||||||
property orth_:
|
@property
|
||||||
|
def orth_(self):
|
||||||
"""Verbatim text content (identical to `Span.text`). Exists mostly for
|
"""Verbatim text content (identical to `Span.text`). Exists mostly for
|
||||||
consistency with other attributes.
|
consistency with other attributes.
|
||||||
|
|
||||||
RETURNS (unicode): The span's text."""
|
RETURNS (unicode): The span's text."""
|
||||||
def __get__(self):
|
return self.text
|
||||||
return self.text
|
|
||||||
|
|
||||||
property lemma_:
|
@property
|
||||||
|
def lemma_(self):
|
||||||
"""RETURNS (unicode): The span's lemma."""
|
"""RETURNS (unicode): The span's lemma."""
|
||||||
def __get__(self):
|
return " ".join([t.lemma_ for t in self]).strip()
|
||||||
return " ".join([t.lemma_ for t in self]).strip()
|
|
||||||
|
|
||||||
property upper_:
|
@property
|
||||||
|
def upper_(self):
|
||||||
"""Deprecated. Use `Span.text.upper()` instead."""
|
"""Deprecated. Use `Span.text.upper()` instead."""
|
||||||
def __get__(self):
|
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
||||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
|
||||||
|
|
||||||
property lower_:
|
@property
|
||||||
|
def lower_(self):
|
||||||
"""Deprecated. Use `Span.text.lower()` instead."""
|
"""Deprecated. Use `Span.text.lower()` instead."""
|
||||||
def __get__(self):
|
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
||||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
|
||||||
|
|
||||||
property string:
|
@property
|
||||||
|
def string(self):
|
||||||
"""Deprecated: Use `Span.text_with_ws` instead."""
|
"""Deprecated: Use `Span.text_with_ws` instead."""
|
||||||
def __get__(self):
|
return "".join([t.text_with_ws for t in self])
|
||||||
return "".join([t.text_with_ws for t in self])
|
|
||||||
|
|
||||||
property label_:
|
property label_:
|
||||||
"""RETURNS (unicode): The span's label."""
|
"""RETURNS (unicode): The span's label."""
|
||||||
|
@ -643,7 +653,9 @@ cdef class Span:
|
||||||
return self.doc.vocab.strings[self.label]
|
return self.doc.vocab.strings[self.label]
|
||||||
|
|
||||||
def __set__(self, unicode label_):
|
def __set__(self, unicode label_):
|
||||||
self.label = self.doc.vocab.strings.add(label_)
|
if not label_:
|
||||||
|
label_ = ''
|
||||||
|
raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_))
|
||||||
|
|
||||||
|
|
||||||
cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
||||||
|
|
|
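Most hunks in this diff replace the legacy `property name:` block and its nested `__get__` with the standard `@property` decorator, while attributes that also define `__set__` keep the block form. A minimal pure-Python sketch of the resulting read-only pattern follows; `ToyToken` and its fields are hypothetical stand-ins, not spaCy classes.

```python
class ToyToken:
    """Hypothetical stand-in for a token with read-only computed attributes."""

    def __init__(self, text: str, has_trailing_space: bool):
        self._text = text
        self._spacy = has_trailing_space

    @property
    def whitespace_(self) -> str:
        # Trailing whitespace character, if present.
        return " " if self._spacy else ""

    @property
    def text_with_ws(self) -> str:
        # Text content followed by the optional trailing whitespace.
        return self._text + self.whitespace_


tok = ToyToken("Hello", True)
print(repr(tok.text_with_ws))  # 'Hello '
```

Because no setter is defined, assigning to `tok.text_with_ws` raises `AttributeError`, which is the behaviour a getter-only property block also gives.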
@ -219,115 +219,115 @@ cdef class Token:
|
||||||
xp = get_array_module(vector)
|
xp = get_array_module(vector)
|
||||||
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
|
return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm))
|
||||||
|
|
||||||
property morph:
|
@property
|
||||||
def __get__(self):
|
def morph(self):
|
||||||
return MorphAnalysis.from_id(self.vocab, self.c.morph)
|
return MorphAnalysis.from_id(self.vocab, self.c.morph)
|
||||||
|
|
||||||
property lex_id:
|
@property
|
||||||
|
def lex_id(self):
|
||||||
"""RETURNS (int): Sequential ID of the token's lexical type."""
|
"""RETURNS (int): Sequential ID of the token's lexical type."""
|
||||||
def __get__(self):
|
return self.c.lex.id
|
||||||
return self.c.lex.id
|
|
||||||
|
|
||||||
property rank:
|
@property
|
||||||
|
def rank(self):
|
||||||
"""RETURNS (int): Sequential ID of the token's lexical type, used to
|
"""RETURNS (int): Sequential ID of the token's lexical type, used to
|
||||||
index into tables, e.g. for word vectors."""
|
index into tables, e.g. for word vectors."""
|
||||||
def __get__(self):
|
return self.c.lex.id
|
||||||
return self.c.lex.id
|
|
||||||
|
|
||||||
property string:
|
@property
|
||||||
|
def string(self):
|
||||||
"""Deprecated: Use Token.text_with_ws instead."""
|
"""Deprecated: Use Token.text_with_ws instead."""
|
||||||
def __get__(self):
|
return self.text_with_ws
|
||||||
return self.text_with_ws
|
|
||||||
|
|
||||||
property text:
|
@property
|
||||||
|
def text(self):
|
||||||
"""RETURNS (unicode): The original verbatim text of the token."""
|
"""RETURNS (unicode): The original verbatim text of the token."""
|
||||||
def __get__(self):
|
return self.orth_
|
||||||
return self.orth_
|
|
||||||
|
|
||||||
property text_with_ws:
|
@property
|
||||||
|
def text_with_ws(self):
|
||||||
"""RETURNS (unicode): The text content of the token (with trailing
|
"""RETURNS (unicode): The text content of the token (with trailing
|
||||||
whitespace).
|
whitespace).
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
|
||||||
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
|
if self.c.spacy:
|
||||||
if self.c.spacy:
|
return orth + " "
|
||||||
return orth + " "
|
else:
|
||||||
else:
|
return orth
|
||||||
return orth
|
|
||||||
|
|
||||||
property prob:
|
@property
|
||||||
|
def prob(self):
|
||||||
"""RETURNS (float): Smoothed log probability estimate of token type."""
|
"""RETURNS (float): Smoothed log probability estimate of token type."""
|
||||||
def __get__(self):
|
return self.c.lex.prob
|
||||||
return self.c.lex.prob
|
|
||||||
|
|
||||||
property sentiment:
|
@property
|
||||||
|
def sentiment(self):
|
||||||
"""RETURNS (float): A scalar value indicating the positivity or
|
"""RETURNS (float): A scalar value indicating the positivity or
|
||||||
negativity of the token."""
|
negativity of the token."""
|
||||||
def __get__(self):
|
if "sentiment" in self.doc.user_token_hooks:
|
||||||
if "sentiment" in self.doc.user_token_hooks:
|
return self.doc.user_token_hooks["sentiment"](self)
|
||||||
return self.doc.user_token_hooks["sentiment"](self)
|
return self.c.lex.sentiment
|
||||||
return self.c.lex.sentiment
|
|
||||||
|
|
||||||
property lang:
|
@property
|
||||||
|
def lang(self):
|
||||||
"""RETURNS (uint64): ID of the language of the parent document's
|
"""RETURNS (uint64): ID of the language of the parent document's
|
||||||
vocabulary.
|
vocabulary.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.lex.lang
|
||||||
return self.c.lex.lang
|
|
||||||
|
|
||||||
property idx:
|
@property
|
||||||
|
def idx(self):
|
||||||
"""RETURNS (int): The character offset of the token within the parent
|
"""RETURNS (int): The character offset of the token within the parent
|
||||||
document.
|
document.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.idx
|
||||||
return self.c.idx
|
|
||||||
|
|
||||||
property cluster:
|
@property
|
||||||
|
def cluster(self):
|
||||||
"""RETURNS (int): Brown cluster ID."""
|
"""RETURNS (int): Brown cluster ID."""
|
||||||
def __get__(self):
|
return self.c.lex.cluster
|
||||||
return self.c.lex.cluster
|
|
||||||
|
|
||||||
property orth:
|
@property
|
||||||
|
def orth(self):
|
||||||
"""RETURNS (uint64): ID of the verbatim text content."""
|
"""RETURNS (uint64): ID of the verbatim text content."""
|
||||||
def __get__(self):
|
return self.c.lex.orth
|
||||||
return self.c.lex.orth
|
|
||||||
|
|
||||||
property lower:
|
@property
|
||||||
|
def lower(self):
|
||||||
"""RETURNS (uint64): ID of the lowercase token text."""
|
"""RETURNS (uint64): ID of the lowercase token text."""
|
||||||
def __get__(self):
|
return self.c.lex.lower
|
||||||
return self.c.lex.lower
|
|
||||||
|
|
||||||
property norm:
|
@property
|
||||||
|
def norm(self):
|
||||||
"""RETURNS (uint64): ID of the token's norm, i.e. a normalised form of
|
"""RETURNS (uint64): ID of the token's norm, i.e. a normalised form of
|
||||||
the token text. Usually set in the language's tokenizer exceptions
|
the token text. Usually set in the language's tokenizer exceptions
|
||||||
or norm exceptions.
|
or norm exceptions.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if self.c.norm == 0:
|
||||||
if self.c.norm == 0:
|
return self.c.lex.norm
|
||||||
return self.c.lex.norm
|
else:
|
||||||
else:
|
return self.c.norm
|
||||||
return self.c.norm
|
|
||||||
|
|
||||||
property shape:
|
@property
|
||||||
|
def shape(self):
|
||||||
"""RETURNS (uint64): ID of the token's shape, a transform of the
|
"""RETURNS (uint64): ID of the token's shape, a transform of the
|
||||||
token's string, to show orthographic features (e.g. "Xxxx", "dd").
|
token's string, to show orthographic features (e.g. "Xxxx", "dd").
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.lex.shape
|
||||||
return self.c.lex.shape
|
|
||||||
|
|
||||||
property prefix:
|
@property
|
||||||
|
def prefix(self):
|
||||||
"""RETURNS (uint64): ID of a length-N substring from the start of the
|
"""RETURNS (uint64): ID of a length-N substring from the start of the
|
||||||
token. Defaults to `N=1`.
|
token. Defaults to `N=1`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.lex.prefix
|
||||||
return self.c.lex.prefix
|
|
||||||
|
|
||||||
property suffix:
|
@property
|
||||||
|
def suffix(self):
|
||||||
"""RETURNS (uint64): ID of a length-N substring from the end of the
|
"""RETURNS (uint64): ID of a length-N substring from the end of the
|
||||||
token. Defaults to `N=3`.
|
token. Defaults to `N=3`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.lex.suffix
|
||||||
return self.c.lex.suffix
|
|
||||||
|
|
||||||
property lemma:
|
property lemma:
|
||||||
"""RETURNS (uint64): ID of the base form of the word, with no
|
"""RETURNS (uint64): ID of the base form of the word, with no
|
||||||
|
@ -367,7 +367,8 @@ cdef class Token:
|
||||||
def __set__(self, attr_t label):
|
def __set__(self, attr_t label):
|
||||||
self.c.dep = label
|
self.c.dep = label
|
||||||
|
|
||||||
property has_vector:
|
@property
|
||||||
|
def has_vector(self):
|
||||||
"""A boolean value indicating whether a word vector is associated with
|
"""A boolean value indicating whether a word vector is associated with
|
||||||
the object.
|
the object.
|
||||||
|
|
||||||
|
@ -375,14 +376,14 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#has_vector
|
DOCS: https://spacy.io/api/token#has_vector
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "has_vector" in self.doc.user_token_hooks:
|
||||||
if 'has_vector' in self.doc.user_token_hooks:
|
return self.doc.user_token_hooks["has_vector"](self)
|
||||||
return self.doc.user_token_hooks["has_vector"](self)
|
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
return True
|
||||||
return True
|
return self.vocab.has_vector(self.c.lex.orth)
|
||||||
return self.vocab.has_vector(self.c.lex.orth)
|
|
||||||
|
|
||||||
property vector:
|
@property
|
||||||
|
def vector(self):
|
||||||
"""A real-valued meaning representation.
|
"""A real-valued meaning representation.
|
||||||
|
|
||||||
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
||||||
|
@ -390,28 +391,28 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#vector
|
DOCS: https://spacy.io/api/token#vector
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "vector" in self.doc.user_token_hooks:
|
||||||
if 'vector' in self.doc.user_token_hooks:
|
return self.doc.user_token_hooks["vector"](self)
|
||||||
return self.doc.user_token_hooks["vector"](self)
|
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
return self.doc.tensor[self.i]
|
||||||
return self.doc.tensor[self.i]
|
else:
|
||||||
else:
|
return self.vocab.get_vector(self.c.lex.orth)
|
||||||
return self.vocab.get_vector(self.c.lex.orth)
|
|
||||||
|
|
||||||
property vector_norm:
|
@property
|
||||||
|
def vector_norm(self):
|
||||||
"""The L2 norm of the token's vector representation.
|
"""The L2 norm of the token's vector representation.
|
||||||
|
|
||||||
RETURNS (float): The L2 norm of the vector representation.
|
RETURNS (float): The L2 norm of the vector representation.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#vector_norm
|
DOCS: https://spacy.io/api/token#vector_norm
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
if "vector_norm" in self.doc.user_token_hooks:
|
||||||
if 'vector_norm' in self.doc.user_token_hooks:
|
return self.doc.user_token_hooks["vector_norm"](self)
|
||||||
return self.doc.user_token_hooks["vector_norm"](self)
|
vector = self.vector
|
||||||
vector = self.vector
|
return numpy.sqrt((vector ** 2).sum())
|
||||||
return numpy.sqrt((vector ** 2).sum())
|
|
||||||
|
|
||||||
property n_lefts:
|
@property
|
||||||
|
def n_lefts(self):
|
||||||
"""The number of leftward immediate children of the word, in the
|
"""The number of leftward immediate children of the word, in the
|
||||||
syntactic dependency parse.
|
syntactic dependency parse.
|
||||||
|
|
||||||
|
@ -420,10 +421,10 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#n_lefts
|
DOCS: https://spacy.io/api/token#n_lefts
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.l_kids
|
||||||
return self.c.l_kids
|
|
||||||
|
|
||||||
property n_rights:
|
@property
|
||||||
|
def n_rights(self):
|
||||||
"""The number of rightward immediate children of the word, in the
|
"""The number of rightward immediate children of the word, in the
|
||||||
syntactic dependency parse.
|
syntactic dependency parse.
|
||||||
|
|
||||||
|
@ -432,15 +433,14 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#n_rights
|
DOCS: https://spacy.io/api/token#n_rights
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.c.r_kids
|
||||||
return self.c.r_kids
|
|
||||||
|
|
||||||
property sent:
|
@property
|
||||||
|
def sent(self):
|
||||||
"""RETURNS (Span): The sentence span that the token is a part of."""
|
"""RETURNS (Span): The sentence span that the token is a part of."""
|
||||||
def __get__(self):
|
if 'sent' in self.doc.user_token_hooks:
|
||||||
if 'sent' in self.doc.user_token_hooks:
|
return self.doc.user_token_hooks["sent"](self)
|
||||||
return self.doc.user_token_hooks["sent"](self)
|
return self.doc[self.i : self.i+1].sent
|
||||||
return self.doc[self.i : self.i+1].sent
|
|
||||||
|
|
||||||
property sent_start:
|
property sent_start:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -484,7 +484,8 @@ cdef class Token:
|
||||||
else:
|
else:
|
||||||
raise ValueError(Errors.E044.format(value=value))
|
raise ValueError(Errors.E044.format(value=value))
|
||||||
|
|
||||||
property lefts:
|
@property
|
||||||
|
def lefts(self):
|
||||||
"""The leftward immediate children of the word, in the syntactic
|
"""The leftward immediate children of the word, in the syntactic
|
||||||
dependency parse.
|
dependency parse.
|
||||||
|
|
||||||
|
@ -492,19 +493,19 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#lefts
|
DOCS: https://spacy.io/api/token#lefts
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
cdef int nr_iter = 0
|
||||||
cdef int nr_iter = 0
|
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
|
||||||
cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge)
|
while ptr < self.c:
|
||||||
while ptr < self.c:
|
if ptr + ptr.head == self.c:
|
||||||
if ptr + ptr.head == self.c:
|
yield self.doc[ptr - (self.c - self.i)]
|
||||||
yield self.doc[ptr - (self.c - self.i)]
|
ptr += 1
|
||||||
ptr += 1
|
nr_iter += 1
|
||||||
nr_iter += 1
|
# This is ugly, but it's a way to guard out infinite loops
|
||||||
# This is ugly, but it's a way to guard out infinite loops
|
if nr_iter >= 10000000:
|
||||||
if nr_iter >= 10000000:
|
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
|
||||||
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
|
|
||||||
|
|
||||||
property rights:
|
@property
|
||||||
|
def rights(self):
|
||||||
"""The rightward immediate children of the word, in the syntactic
|
"""The rightward immediate children of the word, in the syntactic
|
||||||
dependency parse.
|
dependency parse.
|
||||||
|
|
||||||
|
@ -512,33 +513,33 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#rights
|
DOCS: https://spacy.io/api/token#rights
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
|
||||||
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
|
tokens = []
|
||||||
tokens = []
|
cdef int nr_iter = 0
|
||||||
cdef int nr_iter = 0
|
while ptr > self.c:
|
||||||
while ptr > self.c:
|
if ptr + ptr.head == self.c:
|
||||||
if ptr + ptr.head == self.c:
|
tokens.append(self.doc[ptr - (self.c - self.i)])
|
||||||
tokens.append(self.doc[ptr - (self.c - self.i)])
|
ptr -= 1
|
||||||
ptr -= 1
|
nr_iter += 1
|
||||||
nr_iter += 1
|
if nr_iter >= 10000000:
|
||||||
if nr_iter >= 10000000:
|
raise RuntimeError(Errors.E045.format(attr="token.rights"))
|
||||||
raise RuntimeError(Errors.E045.format(attr="token.rights"))
|
tokens.reverse()
|
||||||
tokens.reverse()
|
for t in tokens:
|
||||||
for t in tokens:
|
yield t
|
||||||
yield t
|
|
||||||
|
|
||||||
property children:
|
@property
|
||||||
|
def children(self):
|
||||||
"""A sequence of the token's immediate syntactic children.
|
"""A sequence of the token's immediate syntactic children.
|
||||||
|
|
||||||
YIELDS (Token): A child token such that `child.head==self`.
|
YIELDS (Token): A child token such that `child.head==self`.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#children
|
DOCS: https://spacy.io/api/token#children
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
yield from self.lefts
|
||||||
yield from self.lefts
|
yield from self.rights
|
||||||
yield from self.rights
|
|
||||||
|
|
||||||
property subtree:
|
@property
|
||||||
|
def subtree(self):
|
||||||
"""A sequence containing the token and all the token's syntactic
|
"""A sequence containing the token and all the token's syntactic
|
||||||
descendants.
|
descendants.
|
||||||
|
|
||||||
|
@ -547,30 +548,30 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#subtree
|
DOCS: https://spacy.io/api/token#subtree
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
for word in self.lefts:
|
||||||
for word in self.lefts:
|
yield from word.subtree
|
||||||
yield from word.subtree
|
yield self
|
||||||
yield self
|
for word in self.rights:
|
||||||
for word in self.rights:
|
yield from word.subtree
|
||||||
yield from word.subtree
|
|
||||||
|
|
||||||
property left_edge:
|
@property
|
||||||
|
def left_edge(self):
|
||||||
"""The leftmost token of this token's syntactic descendants.
|
"""The leftmost token of this token's syntactic descendants.
|
||||||
|
|
||||||
RETURNS (Token): The first token such that `self.is_ancestor(token)`.
|
RETURNS (Token): The first token such that `self.is_ancestor(token)`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.doc[self.c.l_edge]
|
||||||
return self.doc[self.c.l_edge]
|
|
||||||
|
|
||||||
property right_edge:
|
@property
|
||||||
|
def right_edge(self):
|
||||||
"""The rightmost token of this token's syntactic descendants.
|
"""The rightmost token of this token's syntactic descendants.
|
||||||
|
|
||||||
RETURNS (Token): The last token such that `self.is_ancestor(token)`.
|
RETURNS (Token): The last token such that `self.is_ancestor(token)`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.doc[self.c.r_edge]
|
||||||
return self.doc[self.c.r_edge]
|
|
||||||
|
|
||||||
property ancestors:
|
@property
|
||||||
|
def ancestors(self):
|
||||||
"""A sequence of this token's syntactic ancestors.
|
"""A sequence of this token's syntactic ancestors.
|
||||||
|
|
||||||
YIELDS (Token): A sequence of ancestor tokens such that
|
YIELDS (Token): A sequence of ancestor tokens such that
|
||||||
|
@ -578,15 +579,14 @@ cdef class Token:
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#ancestors
|
DOCS: https://spacy.io/api/token#ancestors
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
cdef const TokenC* head_ptr = self.c
|
||||||
cdef const TokenC* head_ptr = self.c
|
# Guard against infinite loop, no token can have
|
||||||
# Guard against infinite loop, no token can have
|
# more ancestors than tokens in the tree.
|
||||||
# more ancestors than tokens in the tree.
|
cdef int i = 0
|
||||||
cdef int i = 0
|
while head_ptr.head != 0 and i < self.doc.length:
|
||||||
while head_ptr.head != 0 and i < self.doc.length:
|
head_ptr += head_ptr.head
|
||||||
head_ptr += head_ptr.head
|
yield self.doc[head_ptr - (self.c - self.i)]
|
||||||
yield self.doc[head_ptr - (self.c - self.i)]
|
i += 1
|
||||||
i += 1
|
|
||||||
|
|
||||||
def is_ancestor(self, descendant):
|
def is_ancestor(self, descendant):
|
||||||
"""Check whether this token is a parent, grandparent, etc. of another
|
"""Check whether this token is a parent, grandparent, etc. of another
|
||||||
|
@ -690,23 +690,31 @@ cdef class Token:
|
||||||
# Set new head
|
# Set new head
|
||||||
self.c.head = rel_newhead_i
|
self.c.head = rel_newhead_i
|
||||||
|
|
||||||
property conjuncts:
|
@property
|
||||||
|
def conjuncts(self):
|
||||||
"""A sequence of coordinated tokens, excluding the token itself.
|
"""A sequence of coordinated tokens, excluding the token itself.
|
||||||
|
|
||||||
YIELDS (Token): A coordinated token.
|
RETURNS (tuple): The coordinated tokens.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/token#conjuncts
|
DOCS: https://spacy.io/api/token#conjuncts
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
cdef Token word, child
|
||||||
cdef Token word
|
if "conjuncts" in self.doc.user_token_hooks:
|
||||||
if "conjuncts" in self.doc.user_token_hooks:
|
return tuple(self.doc.user_token_hooks["conjuncts"](self))
|
||||||
yield from self.doc.user_token_hooks["conjuncts"](self)
|
start = self
|
||||||
|
while start.i != start.head.i:
|
||||||
|
if start.dep == conj:
|
||||||
|
start = start.head
|
||||||
else:
|
else:
|
||||||
if self.dep != conj:
|
break
|
||||||
for word in self.rights:
|
queue = [start]
|
||||||
if word.dep == conj:
|
output = [start]
|
||||||
yield word
|
for word in queue:
|
||||||
yield from word.conjuncts
|
for child in word.rights:
|
||||||
|
if child.c.dep == conj:
|
||||||
|
output.append(child)
|
||||||
|
queue.append(child)
|
||||||
|
return tuple([w for w in output if w.i != self.i])
|
||||||
|
|
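The new `conjuncts` traversal can be sketched in plain Python. This is an illustration under assumptions, not spaCy's code: `Node` is a hypothetical toy token, a root is modelled as its own head, and the `conj` label is compared as a string rather than as the symbol ID used in the diff.

```python
class Node:
    """Toy token: index, dependency label, head, and right children."""

    def __init__(self, i, dep, head=None):
        self.i = i
        self.dep = dep
        self.head = head if head is not None else self  # root is its own head
        self.rights = []
        if head is not None and head.i < i:
            head.rights.append(self)


def conjuncts(token):
    # Climb "conj" arcs to the first conjunct of the coordination.
    start = token
    while start.i != start.head.i:
        if start.dep == "conj":
            start = start.head
        else:
            break
    # Breadth-first walk over rightward "conj" children.
    queue = [start]
    output = [start]
    for word in queue:  # the queue grows while it is being iterated
        for child in word.rights:
            if child.dep == "conj":
                output.append(child)
                queue.append(child)
    # The token itself is filtered out of the result.
    return tuple(w for w in output if w.i != token.i)


# "eat apples, pears and plums" without punctuation and "and"
eat = Node(0, "ROOT")
apples = Node(1, "dobj", eat)
pears = Node(2, "conj", apples)
plums = Node(3, "conj", pears)
print([w.i for w in conjuncts(pears)])  # [1, 3]
```

Starting from the first conjunct means every member of the coordination sees the same set of siblings, regardless of which token the lookup starts from.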
||||||
property ent_type:
|
property ent_type:
|
||||||
"""RETURNS (uint64): Named entity type."""
|
"""RETURNS (uint64): Named entity type."""
|
||||||
|
@ -716,15 +724,6 @@ cdef class Token:
|
||||||
def __set__(self, ent_type):
|
def __set__(self, ent_type):
|
||||||
self.c.ent_type = ent_type
|
self.c.ent_type = ent_type
|
||||||
|
|
||||||
property ent_iob:
|
|
||||||
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
|
|
||||||
is assigned.
|
|
||||||
|
|
||||||
RETURNS (uint64): IOB code of named entity tag.
|
|
||||||
"""
|
|
||||||
def __get__(self):
|
|
||||||
return self.c.ent_iob
|
|
||||||
|
|
||||||
property ent_type_:
|
property ent_type_:
|
||||||
"""RETURNS (unicode): Named entity type."""
|
"""RETURNS (unicode): Named entity type."""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -733,16 +732,25 @@ cdef class Token:
|
||||||
def __set__(self, ent_type):
|
def __set__(self, ent_type):
|
||||||
self.c.ent_type = self.vocab.strings.add(ent_type)
|
self.c.ent_type = self.vocab.strings.add(ent_type)
|
||||||
|
|
||||||
property ent_iob_:
|
@property
|
||||||
|
def ent_iob(self):
|
||||||
|
"""IOB code of named entity tag. `1="I", 2="O", 3="B"`. 0 means no tag
|
||||||
|
is assigned.
|
||||||
|
|
||||||
|
RETURNS (uint64): IOB code of named entity tag.
|
||||||
|
"""
|
||||||
|
return self.c.ent_iob
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ent_iob_(self):
|
||||||
"""IOB code of named entity tag. "B" means the token begins an entity,
|
"""IOB code of named entity tag. "B" means the token begins an entity,
|
||||||
"I" means it is inside an entity, "O" means it is outside an entity,
|
"I" means it is inside an entity, "O" means it is outside an entity,
|
||||||
and "" means no entity tag is set.
|
and "" means no entity tag is set.
|
||||||
|
|
||||||
RETURNS (unicode): IOB code of named entity tag.
|
RETURNS (unicode): IOB code of named entity tag.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
iob_strings = ("", "I", "O", "B")
|
||||||
iob_strings = ("", "I", "O", "B")
|
return iob_strings[self.c.ent_iob]
|
||||||
return iob_strings[self.c.ent_iob]
|
|
||||||
|
|
||||||
property ent_id:
|
property ent_id:
|
||||||
"""RETURNS (uint64): ID of the entity the token is an instance of,
|
"""RETURNS (uint64): ID of the entity the token is an instance of,
|
||||||
|
@ -764,26 +772,25 @@ cdef class Token:
|
||||||
def __set__(self, name):
|
def __set__(self, name):
|
||||||
self.c.ent_id = self.vocab.strings.add(name)
|
self.c.ent_id = self.vocab.strings.add(name)
|
||||||
|
|
||||||
property whitespace_:
|
@property
|
||||||
"""RETURNS (unicode): The trailing whitespace character, if present.
|
def whitespace_(self):
|
||||||
"""
|
"""RETURNS (unicode): The trailing whitespace character, if present."""
|
||||||
def __get__(self):
|
return " " if self.c.spacy else ""
|
||||||
return " " if self.c.spacy else ""
|
|
||||||
|
|
||||||
property orth_:
|
@property
|
||||||
|
def orth_(self):
|
||||||
"""RETURNS (unicode): Verbatim text content (identical to
|
"""RETURNS (unicode): Verbatim text content (identical to
|
||||||
`Token.text`). Exists mostly for consistency with the other
|
`Token.text`). Exists mostly for consistency with the other
|
||||||
attributes.
|
attributes.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.orth]
|
||||||
return self.vocab.strings[self.c.lex.orth]
|
|
||||||
|
|
||||||
property lower_:
|
@property
|
||||||
|
def lower_(self):
|
||||||
"""RETURNS (unicode): The lowercase token text. Equivalent to
|
"""RETURNS (unicode): The lowercase token text. Equivalent to
|
||||||
`Token.text.lower()`.
|
`Token.text.lower()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.lower]
|
||||||
return self.vocab.strings[self.c.lex.lower]
|
|
||||||
|
|
||||||
property norm_:
|
property norm_:
|
||||||
"""RETURNS (unicode): The token's norm, i.e. a normalised form of the
|
"""RETURNS (unicode): The token's norm, i.e. a normalised form of the
|
||||||
|
@ -796,33 +803,33 @@ cdef class Token:
|
||||||
def __set__(self, unicode norm_):
|
def __set__(self, unicode norm_):
|
||||||
self.c.norm = self.vocab.strings.add(norm_)
|
self.c.norm = self.vocab.strings.add(norm_)
|
||||||
|
|
||||||
property shape_:
|
@property
|
||||||
|
def shape_(self):
|
||||||
"""RETURNS (unicode): Transform of the token's string, to show
|
"""RETURNS (unicode): Transform of the token's string, to show
|
||||||
orthographic features. For example, "Xxxx" or "dd".
|
orthographic features. For example, "Xxxx" or "dd".
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.shape]
|
||||||
return self.vocab.strings[self.c.lex.shape]
|
|
||||||
|
|
||||||
property prefix_:
|
@property
|
||||||
|
def prefix_(self):
|
||||||
"""RETURNS (unicode): A length-N substring from the start of the token.
|
"""RETURNS (unicode): A length-N substring from the start of the token.
|
||||||
Defaults to `N=1`.
|
Defaults to `N=1`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.prefix]
|
||||||
return self.vocab.strings[self.c.lex.prefix]
|
|
||||||
|
|
||||||
property suffix_:
|
@property
|
||||||
|
def suffix_(self):
|
||||||
"""RETURNS (unicode): A length-N substring from the end of the token.
|
"""RETURNS (unicode): A length-N substring from the end of the token.
|
||||||
Defaults to `N=3`.
|
Defaults to `N=3`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.suffix]
|
||||||
return self.vocab.strings[self.c.lex.suffix]
|
|
||||||
|
|
||||||
property lang_:
|
@property
|
||||||
|
def lang_(self):
|
||||||
"""RETURNS (unicode): Language of the parent document's vocabulary,
|
"""RETURNS (unicode): Language of the parent document's vocabulary,
|
||||||
e.g. 'en'.
|
e.g. 'en'.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return self.vocab.strings[self.c.lex.lang]
|
||||||
return self.vocab.strings[self.c.lex.lang]
|
|
||||||
|
|
||||||
property lemma_:
|
property lemma_:
|
||||||
"""RETURNS (unicode): The token lemma, i.e. the base form of the word,
|
"""RETURNS (unicode): The token lemma, i.e. the base form of the word,
|
||||||
|
@ -861,110 +868,110 @@ cdef class Token:
|
||||||
def __set__(self, unicode label):
|
def __set__(self, unicode label):
|
||||||
self.c.dep = self.vocab.strings.add(label)
|
self.c.dep = self.vocab.strings.add(label)
|
||||||
|
|
||||||
property is_oov:
|
@property
|
||||||
|
def is_oov(self):
|
||||||
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
|
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_OOV)
|
|
||||||
|
|
||||||
property is_stop:
|
@property
|
||||||
|
def is_stop(self):
|
||||||
"""RETURNS (bool): Whether the token is a stop word, i.e. part of a
|
"""RETURNS (bool): Whether the token is a stop word, i.e. part of a
|
||||||
"stop list" defined by the language data.
|
"stop list" defined by the language data.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_STOP)
|
|
||||||
|
|
||||||
property is_alpha:
|
@property
|
||||||
|
def is_alpha(self):
|
||||||
"""RETURNS (bool): Whether the token consists of alpha characters.
|
"""RETURNS (bool): Whether the token consists of alpha characters.
|
||||||
Equivalent to `token.text.isalpha()`.
|
Equivalent to `token.text.isalpha()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_ALPHA)
|
|
||||||
|
|
||||||
property is_ascii:
|
@property
|
||||||
|
def is_ascii(self):
|
||||||
"""RETURNS (bool): Whether the token consists of ASCII characters.
|
"""RETURNS (bool): Whether the token consists of ASCII characters.
|
||||||
Equivalent to `all(ord(c) < 128 for c in token.text)`.
|
Equivalent to `all(ord(c) < 128 for c in token.text)`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_ASCII)
|
|
||||||
|
|
||||||
property is_digit:
|
@property
|
||||||
|
def is_digit(self):
|
||||||
"""RETURNS (bool): Whether the token consists of digits. Equivalent to
|
"""RETURNS (bool): Whether the token consists of digits. Equivalent to
|
||||||
`token.text.isdigit()`.
|
`token.text.isdigit()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_DIGIT)
|
|
||||||
|
|
||||||
property is_lower:
|
@property
|
||||||
|
def is_lower(self):
|
||||||
"""RETURNS (bool): Whether the token is in lowercase. Equivalent to
|
"""RETURNS (bool): Whether the token is in lowercase. Equivalent to
|
||||||
`token.text.islower()`.
|
`token.text.islower()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_LOWER)
|
|
||||||
|
|
||||||
property is_upper:
|
@property
|
||||||
|
def is_upper(self):
|
||||||
"""RETURNS (bool): Whether the token is in uppercase. Equivalent to
|
"""RETURNS (bool): Whether the token is in uppercase. Equivalent to
|
||||||
`token.text.isupper()`.
|
`token.text.isupper()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_UPPER)
|
|
||||||
|
|
||||||
property is_title:
|
@property
|
||||||
|
def is_title(self):
|
||||||
"""RETURNS (bool): Whether the token is in titlecase. Equivalent to
|
"""RETURNS (bool): Whether the token is in titlecase. Equivalent to
|
||||||
`token.text.istitle()`.
|
`token.text.istitle()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_TITLE)
|
|
||||||
|
|
||||||
property is_punct:
|
@property
|
||||||
|
def is_punct(self):
|
||||||
"""RETURNS (bool): Whether the token is punctuation."""
|
"""RETURNS (bool): Whether the token is punctuation."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_PUNCT)
|
|
||||||
|
|
||||||
property is_space:
|
@property
|
||||||
|
def is_space(self):
|
||||||
"""RETURNS (bool): Whether the token consists of whitespace characters.
|
"""RETURNS (bool): Whether the token consists of whitespace characters.
|
||||||
Equivalent to `token.text.isspace()`.
|
Equivalent to `token.text.isspace()`.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_SPACE)
|
|
||||||
|
|
||||||
property is_bracket:
|
@property
|
||||||
|
def is_bracket(self):
|
||||||
"""RETURNS (bool): Whether the token is a bracket."""
|
"""RETURNS (bool): Whether the token is a bracket."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_BRACKET)
|
|
||||||
|
|
||||||
property is_quote:
|
@property
|
||||||
|
def is_quote(self):
|
||||||
"""RETURNS (bool): Whether the token is a quotation mark."""
|
"""RETURNS (bool): Whether the token is a quotation mark."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_QUOTE)
|
|
||||||
|
|
||||||
property is_left_punct:
|
@property
|
||||||
|
def is_left_punct(self):
|
||||||
"""RETURNS (bool): Whether the token is a left punctuation mark."""
|
"""RETURNS (bool): Whether the token is a left punctuation mark."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_LEFT_PUNCT)
|
|
||||||
|
|
||||||
property is_right_punct:
|
@property
|
||||||
|
def is_right_punct(self):
|
||||||
"""RETURNS (bool): Whether the token is a right punctuation mark."""
|
"""RETURNS (bool): Whether the token is a right punctuation mark."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_RIGHT_PUNCT)
|
|
||||||
|
|
||||||
property is_currency:
|
@property
|
||||||
|
def is_currency(self):
|
||||||
"""RETURNS (bool): Whether the token is a currency symbol."""
|
"""RETURNS (bool): Whether the token is a currency symbol."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
|
||||||
return Lexeme.c_check_flag(self.c.lex, IS_CURRENCY)
|
|
||||||
|
|
||||||
property like_url:
|
@property
|
||||||
|
def like_url(self):
|
||||||
"""RETURNS (bool): Whether the token resembles a URL."""
|
"""RETURNS (bool): Whether the token resembles a URL."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
|
||||||
return Lexeme.c_check_flag(self.c.lex, LIKE_URL)
|
|
||||||
|
|
||||||
property like_num:
|
@property
|
||||||
|
def like_num(self):
|
||||||
"""RETURNS (bool): Whether the token resembles a number, e.g. "10.9",
|
"""RETURNS (bool): Whether the token resembles a number, e.g. "10.9",
|
||||||
"10", "ten", etc.
|
"10", "ten", etc.
|
||||||
"""
|
"""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
|
||||||
return Lexeme.c_check_flag(self.c.lex, LIKE_NUM)
|
|
||||||
|
|
||||||
property like_email:
|
@property
|
||||||
|
def like_email(self):
|
||||||
"""RETURNS (bool): Whether the token resembles an email address."""
|
"""RETURNS (bool): Whether the token resembles an email address."""
|
||||||
def __get__(self):
|
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
|
||||||
return Lexeme.c_check_flag(self.c.lex, LIKE_EMAIL)
|
|
||||||
|
|
|
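The hunks above replace Cython `property name:` blocks and their nested `__get__` with `@property` methods. A rough plain-Python sketch of the resulting getter style follows, reusing the `ent_iob_` lookup table and the `whitespace_` expression from the diff. `FakeToken` and its `ent_iob` / `spacy` fields are hypothetical stand-ins for the C-level `self.c` struct, not spaCy's API.

```python
class FakeToken(object):
    """Hypothetical stand-in for a token; not spaCy's Token class."""

    def __init__(self, ent_iob, spacy):
        self.ent_iob = ent_iob  # integer IOB code, 0 to 3
        self.spacy = spacy      # True if the token has trailing whitespace

    @property
    def ent_iob_(self):
        # Same lookup table as in the diff: index 0 means "no tag set".
        iob_strings = ("", "I", "O", "B")
        return iob_strings[self.ent_iob]

    @property
    def whitespace_(self):
        return " " if self.spacy else ""


token = FakeToken(ent_iob=3, spacy=True)
print(token.ent_iob_)
print(repr(token.whitespace_))
```

The setter-bearing properties (`ent_id`, `norm_`, `lemma_`) keep the `property name:` form in the diff; only the read-only ones are converted.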
@ -2,11 +2,13 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import functools
|
import functools
|
||||||
|
import copy
|
||||||
|
|
||||||
from ..errors import Errors
|
from ..errors import Errors
|
||||||
|
|
||||||
|
|
||||||
class Underscore(object):
|
class Underscore(object):
|
||||||
|
mutable_types = (dict, list, set)
|
||||||
doc_extensions = {}
|
doc_extensions = {}
|
||||||
span_extensions = {}
|
span_extensions = {}
|
||||||
token_extensions = {}
|
token_extensions = {}
|
||||||
|
@ -32,7 +34,15 @@ class Underscore(object):
|
||||||
elif method is not None:
|
elif method is not None:
|
||||||
return functools.partial(method, self._obj)
|
return functools.partial(method, self._obj)
|
||||||
else:
|
else:
|
||||||
return self._doc.user_data.get(self._get_key(name), default)
|
key = self._get_key(name)
|
||||||
|
if key in self._doc.user_data:
|
||||||
|
return self._doc.user_data[key]
|
||||||
|
elif isinstance(default, self.mutable_types):
|
||||||
|
# Handle mutable default arguments (see #2581)
|
||||||
|
new_default = copy.copy(default)
|
||||||
|
self.__setattr__(name, new_default)
|
||||||
|
return new_default
|
||||||
|
return default
|
||||||
|
|
||||||
def __setattr__(self, name, value):
|
def __setattr__(self, name, value):
|
||||||
if name not in self._extensions:
|
if name not in self._extensions:
|
||||||
|
|
|
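The `Underscore` hunk above copies a mutable default before returning it, so objects no longer share one list, dict, or set. A minimal sketch of that idea, assuming a simplified holder class (`Extensions`, its `get` method and the string keys are hypothetical and only loosely mirror `Underscore`):

```python
import copy

MUTABLE_TYPES = (dict, list, set)


class Extensions(object):
    """Hypothetical, simplified holder: one shared default, per-key storage."""

    def __init__(self, default):
        self.default = default
        self.user_data = {}

    def get(self, key):
        if key in self.user_data:
            return self.user_data[key]
        if isinstance(self.default, MUTABLE_TYPES):
            # Copy the default so each key gets its own object, then store
            # the copy so later lookups return the same (mutated) value.
            new_default = copy.copy(self.default)
            self.user_data[key] = new_default
            return new_default
        return self.default


ext = Extensions(default=[])
ext.get("doc1").append("x")
print(ext.get("doc1"))  # the stored copy for "doc1"
print(ext.get("doc2"))  # a fresh copy for "doc2"
print(ext.default)      # the shared default itself is untouched
```

Note that `copy.copy` is a shallow copy, so mutable objects nested inside the default would still be shared.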
@ -25,7 +25,7 @@ except ImportError:
|
||||||
from .symbols import ORTH
|
from .symbols import ORTH
|
||||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||||
from .compat import import_file
|
from .compat import import_file
|
||||||
from .errors import Errors
|
from .errors import Errors, Warnings, deprecation_warning
|
||||||
|
|
||||||
|
|
||||||
LANGUAGES = {}
|
LANGUAGES = {}
|
||||||
|
@ -38,6 +38,18 @@ def set_env_log(value):
|
||||||
_PRINT_ENV = value
|
_PRINT_ENV = value
|
||||||
|
|
||||||
|
|
||||||
|
def lang_class_is_loaded(lang):
|
||||||
|
"""Check whether a Language class is already loaded. Language classes are
|
||||||
|
loaded lazily, to avoid expensive setup code associated with the language
|
||||||
|
data.
|
||||||
|
|
||||||
|
lang (unicode): Two-letter language code, e.g. 'en'.
|
||||||
|
RETURNS (bool): Whether a Language class has been loaded.
|
||||||
|
"""
|
||||||
|
global LANGUAGES
|
||||||
|
return lang in LANGUAGES
|
||||||
|
|
||||||
|
|
||||||
def get_lang_class(lang):
|
def get_lang_class(lang):
|
||||||
"""Import and load a Language class.
|
"""Import and load a Language class.
|
||||||
|
|
||||||
|
@ -565,7 +577,8 @@ def itershuffle(iterable, bufsize=1000):
|
||||||
def to_bytes(getters, exclude):
|
def to_bytes(getters, exclude):
|
||||||
serialized = OrderedDict()
|
serialized = OrderedDict()
|
||||||
for key, getter in getters.items():
|
for key, getter in getters.items():
|
||||||
if key not in exclude:
|
# Split to support file names like meta.json
|
||||||
|
if key.split(".")[0] not in exclude:
|
||||||
serialized[key] = getter()
|
serialized[key] = getter()
|
||||||
return srsly.msgpack_dumps(serialized)
|
return srsly.msgpack_dumps(serialized)
|
||||||
|
|
||||||
|
@ -573,7 +586,8 @@ def to_bytes(getters, exclude):
|
||||||
def from_bytes(bytes_data, setters, exclude):
|
def from_bytes(bytes_data, setters, exclude):
|
||||||
msg = srsly.msgpack_loads(bytes_data)
|
msg = srsly.msgpack_loads(bytes_data)
|
||||||
for key, setter in setters.items():
|
for key, setter in setters.items():
|
||||||
if key not in exclude and key in msg:
|
# Split to support file names like meta.json
|
||||||
|
if key.split(".")[0] not in exclude and key in msg:
|
||||||
setter(msg[key])
|
setter(msg[key])
|
||||||
return msg
|
return msg
|
||||||
|
|
||||||
|
@ -583,7 +597,8 @@ def to_disk(path, writers, exclude):
|
||||||
if not path.exists():
|
if not path.exists():
|
||||||
path.mkdir()
|
path.mkdir()
|
||||||
for key, writer in writers.items():
|
for key, writer in writers.items():
|
||||||
if key not in exclude:
|
# Split to support file names like meta.json
|
||||||
|
if key.split(".")[0] not in exclude:
|
||||||
writer(path / key)
|
writer(path / key)
|
||||||
return path
|
return path
|
||||||
|
|
||||||
|
@ -591,7 +606,8 @@ def to_disk(path, writers, exclude):
|
||||||
def from_disk(path, readers, exclude):
|
def from_disk(path, readers, exclude):
|
||||||
path = ensure_path(path)
|
path = ensure_path(path)
|
||||||
for key, reader in readers.items():
|
for key, reader in readers.items():
|
||||||
if key not in exclude:
|
# Split to support file names like meta.json
|
||||||
|
if key.split(".")[0] not in exclude:
|
||||||
reader(path / key)
|
reader(path / key)
|
||||||
return path
|
return path
|
||||||
|
|
||||||
|
@ -677,6 +693,23 @@ def validate_json(data, validator):
|
||||||
return errors
|
return errors
|
||||||
|
|
||||||
|
|
||||||
|
def get_serialization_exclude(serializers, exclude, kwargs):
|
||||||
|
"""Helper function to validate serialization args and manage transition from
|
||||||
|
keyword arguments (pre v2.1) to exclude argument.
|
||||||
|
"""
|
||||||
|
exclude = list(exclude)
|
||||||
|
# Split to support file names like meta.json
|
||||||
|
options = [name.split(".")[0] for name in serializers]
|
||||||
|
for key, value in kwargs.items():
|
||||||
|
if key in ("vocab",) and value is False:
|
||||||
|
deprecation_warning(Warnings.W015.format(arg=key))
|
||||||
|
exclude.append(key)
|
||||||
|
elif key.split(".")[0] in options:
|
||||||
|
raise ValueError(Errors.E128.format(arg=key))
|
||||||
|
# TODO: user warning?
|
||||||
|
return exclude
|
||||||
|
|
||||||
|
|
||||||
class SimpleFrozenDict(dict):
|
class SimpleFrozenDict(dict):
|
||||||
"""Simplified implementation of a frozen dict, mainly used as default
|
"""Simplified implementation of a frozen dict, mainly used as default
|
||||||
function or method argument (for arguments that should default to empty
|
function or method argument (for arguments that should default to empty
|
||||||
|
@ -696,14 +729,14 @@ class SimpleFrozenDict(dict):
|
||||||
class DummyTokenizer(object):
|
class DummyTokenizer(object):
|
||||||
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
||||||
# allow serialization (see #1557)
|
# allow serialization (see #1557)
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, **kwargs):
|
||||||
return b""
|
return b""
|
||||||
|
|
||||||
def from_bytes(self, _bytes_data, **exclude):
|
def from_bytes(self, _bytes_data, **kwargs):
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_disk(self, _path, **exclude):
|
def to_disk(self, _path, **kwargs):
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def from_disk(self, _path, **exclude):
|
def from_disk(self, _path, **kwargs):
|
||||||
return self
|
return self
|
||||||
|
|
|
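The `util` hunks above compare only the part of each serializer key before the first `.` against `exclude`, so a key such as `meta.json` is excluded by the name `meta`. A small sketch of that filter in isolation (`filter_excluded` and the sample serializers are hypothetical, not part of `spacy.util`):

```python
from collections import OrderedDict


def filter_excluded(serializers, exclude):
    """Keep the entries whose name, up to the first ".", is not excluded."""
    kept = OrderedDict()
    for key, func in serializers.items():
        # Split to support file names like meta.json
        if key.split(".")[0] not in exclude:
            kept[key] = func
    return kept


serializers = OrderedDict((
    ("meta.json", lambda: b"meta"),
    ("vocab", lambda: b"vocab"),
    ("tokenizer", lambda: b"tokenizer"),
))
kept = filter_excluded(serializers, exclude=["meta", "vocab"])
print(list(kept.keys()))
```

One consequence of this comparison is that passing the full file name (`"meta.json"`) in `exclude` does not match, because only the prefix `meta` is looked up.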
@ -377,11 +377,11 @@ cdef class Vectors:
|
||||||
self.add(key, row=i)
|
self.add(key, row=i)
|
||||||
return strings
|
return strings
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
|
||||||
path (unicode / Path): A path to a directory, which will be created if
|
path (unicode / Path): A path to a directory, which will be created if
|
||||||
it doesn't exist. Either a string or a Path-like object.
|
it doesn't exist.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vectors#to_disk
|
DOCS: https://spacy.io/api/vectors#to_disk
|
||||||
"""
|
"""
|
||||||
|
@ -394,9 +394,9 @@ cdef class Vectors:
|
||||||
("vectors", lambda p: save_array(self.data, p.open("wb"))),
|
("vectors", lambda p: save_array(self.data, p.open("wb"))),
|
||||||
("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
|
("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
|
||||||
))
|
))
|
||||||
return util.to_disk(path, serializers, exclude)
|
return util.to_disk(path, serializers, [])
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, **kwargs):
|
||||||
"""Loads state from a directory. Modifies the object in place and
|
"""Loads state from a directory. Modifies the object in place and
|
||||||
returns it.
|
returns it.
|
||||||
|
|
||||||
|
@ -428,13 +428,13 @@ cdef class Vectors:
|
||||||
("keys", load_keys),
|
("keys", load_keys),
|
||||||
("vectors", load_vectors),
|
("vectors", load_vectors),
|
||||||
))
|
))
|
||||||
util.from_disk(path, serializers, exclude)
|
util.from_disk(path, serializers, [])
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, **kwargs):
|
||||||
"""Serialize the current state to a binary string.
|
"""Serialize the current state to a binary string.
|
||||||
|
|
||||||
**exclude: Named attributes to prevent from being serialized.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (bytes): The serialized form of the `Vectors` object.
|
RETURNS (bytes): The serialized form of the `Vectors` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vectors#to_bytes
|
DOCS: https://spacy.io/api/vectors#to_bytes
|
||||||
|
@ -444,17 +444,18 @@ cdef class Vectors:
|
||||||
return self.data.to_bytes()
|
return self.data.to_bytes()
|
||||||
else:
|
else:
|
||||||
return srsly.msgpack_dumps(self.data)
|
return srsly.msgpack_dumps(self.data)
|
||||||
|
|
||||||
serializers = OrderedDict((
|
serializers = OrderedDict((
|
||||||
("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
|
("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
|
||||||
("vectors", serialize_weights)
|
("vectors", serialize_weights)
|
||||||
))
|
))
|
||||||
return util.to_bytes(serializers, exclude)
|
return util.to_bytes(serializers, [])
|
||||||
|
|
||||||
def from_bytes(self, data, **exclude):
|
def from_bytes(self, data, **kwargs):
|
||||||
"""Load state from a binary string.
|
"""Load state from a binary string.
|
||||||
|
|
||||||
data (bytes): The data to load from.
|
data (bytes): The data to load from.
|
||||||
**exclude: Named attributes to prevent from being loaded.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Vectors): The `Vectors` object.
|
RETURNS (Vectors): The `Vectors` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vectors#from_bytes
|
DOCS: https://spacy.io/api/vectors#from_bytes
|
||||||
|
@ -469,5 +470,5 @@ cdef class Vectors:
|
||||||
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
|
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
|
||||||
("vectors", deserialize_weights)
|
("vectors", deserialize_weights)
|
||||||
))
|
))
|
||||||
util.from_bytes(data, deserializers, exclude)
|
util.from_bytes(data, deserializers, [])
|
||||||
return self
|
return self
|
||||||
|
|
|
@ -1,6 +1,7 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
# cython: profile=True
|
# cython: profile=True
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
from libc.string cimport memcpy
|
||||||
|
|
||||||
import numpy
|
import numpy
|
||||||
import srsly
|
import srsly
|
||||||
|
@ -59,12 +60,23 @@ cdef class Vocab:
|
||||||
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
|
self.morphology = Morphology(self.strings, tag_map, lemmatizer)
|
||||||
self.vectors = Vectors()
|
self.vectors = Vectors()
|
||||||
|
|
||||||
property lang:
|
@property
|
||||||
|
def lang(self):
|
||||||
|
langfunc = None
|
||||||
|
if self.lex_attr_getters:
|
||||||
|
langfunc = self.lex_attr_getters.get(LANG, None)
|
||||||
|
return langfunc("_") if langfunc else ""
|
||||||
|
|
||||||
|
property writing_system:
|
||||||
|
"""A dict with information about the language's writing system. To get
|
||||||
|
the data, we use the vocab.lang property to fetch the Language class.
|
||||||
|
If the Language class is not loaded, an empty dict is returned.
|
||||||
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
langfunc = None
|
if not util.lang_class_is_loaded(self.lang):
|
||||||
if self.lex_attr_getters:
|
return {}
|
||||||
langfunc = self.lex_attr_getters.get(LANG, None)
|
lang_class = util.get_lang_class(self.lang)
|
||||||
return langfunc("_") if langfunc else ""
|
return dict(lang_class.Defaults.writing_system)
|
||||||
|
|
||||||
def __len__(self):
|
def __len__(self):
|
||||||
"""The current number of lexemes stored.
|
"""The current number of lexemes stored.
|
||||||
|
@ -396,47 +408,57 @@ cdef class Vocab:
|
||||||
orth = self.strings.add(orth)
|
orth = self.strings.add(orth)
|
||||||
return orth in self.vectors
|
return orth in self.vectors
|
||||||
|
|
||||||
def to_disk(self, path, **exclude):
|
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
"""Save the current state to a directory.
|
"""Save the current state to a directory.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory, which will be created if
|
path (unicode or Path): A path to a directory, which will be created if
|
||||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
it doesn't exist.
|
||||||
|
exclude (list): String names of serialization fields to exclude.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#to_disk
|
DOCS: https://spacy.io/api/vocab#to_disk
|
||||||
"""
|
"""
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
if not path.exists():
|
if not path.exists():
|
||||||
path.mkdir()
|
path.mkdir()
|
||||||
self.strings.to_disk(path / "strings.json")
|
setters = ["strings", "lexemes", "vectors"]
|
||||||
with (path / "lexemes.bin").open('wb') as file_:
|
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||||
file_.write(self.lexemes_to_bytes())
|
if "strings" not in exclude:
|
||||||
if self.vectors is not None:
|
self.strings.to_disk(path / "strings.json")
|
||||||
|
if "lexemes" not in exclude:
|
||||||
|
with (path / "lexemes.bin").open("wb") as file_:
|
||||||
|
file_.write(self.lexemes_to_bytes())
|
||||||
|
if "vectors" not in exclude and self.vectors is not None:
|
||||||
self.vectors.to_disk(path)
|
self.vectors.to_disk(path)
|
||||||
|
|
||||||
def from_disk(self, path, **exclude):
|
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||||
"""Loads state from a directory. Modifies the object in place and
|
"""Loads state from a directory. Modifies the object in place and
|
||||||
returns it.
|
returns it.
|
||||||
|
|
||||||
path (unicode or Path): A path to a directory. Paths may be either
|
path (unicode or Path): A path to a directory.
|
||||||
strings or `Path`-like objects.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Vocab): The modified `Vocab` object.
|
RETURNS (Vocab): The modified `Vocab` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#from_disk
|
DOCS: https://spacy.io/api/vocab#from_disk
|
||||||
"""
|
"""
|
||||||
path = util.ensure_path(path)
|
path = util.ensure_path(path)
|
||||||
self.strings.from_disk(path / "strings.json")
|
getters = ["strings", "lexemes", "vectors"]
|
||||||
with (path / "lexemes.bin").open("rb") as file_:
|
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||||
self.lexemes_from_bytes(file_.read())
|
if "strings" not in exclude:
|
||||||
if self.vectors is not None:
|
self.strings.from_disk(path / "strings.json") # TODO: add exclude?
|
||||||
self.vectors.from_disk(path, exclude="strings.json")
|
if "lexemes" not in exclude:
|
||||||
if self.vectors.name is not None:
|
with (path / "lexemes.bin").open("rb") as file_:
|
||||||
link_vectors_to_models(self)
|
self.lexemes_from_bytes(file_.read())
|
||||||
|
if "vectors" not in exclude:
|
||||||
|
if self.vectors is not None:
|
||||||
|
self.vectors.from_disk(path, exclude=["strings"])
|
||||||
|
if self.vectors.name is not None:
|
||||||
|
link_vectors_to_models(self)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self, **exclude):
|
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||||
"""Serialize the current state to a binary string.
|
"""Serialize the current state to a binary string.
|
||||||
|
|
||||||
**exclude: Named attributes to prevent from being serialized.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (bytes): The serialized form of the `Vocab` object.
|
RETURNS (bytes): The serialized form of the `Vocab` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#to_bytes
|
DOCS: https://spacy.io/api/vocab#to_bytes
|
||||||
|
@ -452,13 +474,14 @@ cdef class Vocab:
|
||||||
("lexemes", lambda: self.lexemes_to_bytes()),
|
("lexemes", lambda: self.lexemes_to_bytes()),
|
||||||
("vectors", deserialize_vectors)
|
("vectors", deserialize_vectors)
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||||
return util.to_bytes(getters, exclude)
|
return util.to_bytes(getters, exclude)
|
||||||
|
|
||||||
def from_bytes(self, bytes_data, **exclude):
|
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||||
"""Load state from a binary string.
|
"""Load state from a binary string.
|
||||||
|
|
||||||
bytes_data (bytes): The data to load from.
|
bytes_data (bytes): The data to load from.
|
||||||
**exclude: Named attributes to prevent from being loaded.
|
exclude (list): String names of serialization fields to exclude.
|
||||||
RETURNS (Vocab): The `Vocab` object.
|
RETURNS (Vocab): The `Vocab` object.
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/vocab#from_bytes
|
DOCS: https://spacy.io/api/vocab#from_bytes
|
||||||
|
@ -468,11 +491,13 @@ cdef class Vocab:
|
||||||
return None
|
return None
|
||||||
else:
|
else:
|
||||||
return self.vectors.from_bytes(b)
|
return self.vectors.from_bytes(b)
|
||||||
|
|
||||||
setters = OrderedDict((
|
setters = OrderedDict((
|
||||||
("strings", lambda b: self.strings.from_bytes(b)),
|
("strings", lambda b: self.strings.from_bytes(b)),
|
||||||
("lexemes", lambda b: self.lexemes_from_bytes(b)),
|
("lexemes", lambda b: self.lexemes_from_bytes(b)),
|
||||||
("vectors", lambda b: serialize_vectors(b))
|
("vectors", lambda b: serialize_vectors(b))
|
||||||
))
|
))
|
||||||
|
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||||
util.from_bytes(bytes_data, setters, exclude)
|
util.from_bytes(bytes_data, setters, exclude)
|
||||||
if self.vectors.name is not None:
|
if self.vectors.name is not None:
|
||||||
link_vectors_to_models(self)
|
link_vectors_to_models(self)
|
||||||
|
@ -518,7 +543,10 @@ cdef class Vocab:
|
||||||
for j in range(sizeof(lex_data.data)):
|
for j in range(sizeof(lex_data.data)):
|
||||||
lex_data.data[j] = bytes_ptr[i+j]
|
lex_data.data[j] = bytes_ptr[i+j]
|
||||||
Lexeme.c_from_bytes(lexeme, lex_data)
|
Lexeme.c_from_bytes(lexeme, lex_data)
|
||||||
|
prev_entry = self._by_orth.get(lexeme.orth)
|
||||||
|
if prev_entry != NULL:
|
||||||
|
memcpy(prev_entry, lexeme, sizeof(LexemeC))
|
||||||
|
continue
|
||||||
ptr = self.strings._map.get(lexeme.orth)
|
ptr = self.strings._map.get(lexeme.orth)
|
||||||
if ptr == NULL:
|
if ptr == NULL:
|
||||||
continue
|
continue
|
||||||
|
|
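The new `Vocab.writing_system` property above returns an empty dict unless the language class is already loaded, so reading it never triggers the lazy import. A minimal sketch of that guard, assuming a module-level cache like `LANGUAGES` in `util`. `FakeEnglish`, `load_lang_class` and the writing-system values are made up for illustration.

```python
LANGUAGES = {}  # cache of language classes loaded so far


def lang_class_is_loaded(lang):
    return lang in LANGUAGES


class FakeEnglish(object):
    """Hypothetical language class with illustrative data only."""
    writing_system = {"direction": "ltr", "has_case": True}


def load_lang_class(lang):
    # Hypothetical loader: the real import step is replaced by a fixed class.
    LANGUAGES[lang] = FakeEnglish
    return FakeEnglish


def writing_system(lang):
    if not lang_class_is_loaded(lang):
        # Do not load the class just to answer this query.
        return {}
    # Return a copy so callers cannot mutate the class-level defaults.
    return dict(LANGUAGES[lang].writing_system)


print(writing_system("en"))  # class not loaded yet
load_lang_class("en")
print(writing_system("en"))  # now backed by the loaded class
```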
27
website/.eslintrc
Normal file
|
@ -0,0 +1,27 @@
|
||||||
|
{
|
||||||
|
"extends": ["standard", "prettier"],
|
||||||
|
"plugins": ["standard", "react", "react-hooks"],
|
||||||
|
"rules": {
|
||||||
|
"no-var": "error",
|
||||||
|
"no-unused-vars": 1,
|
||||||
|
"arrow-spacing": ["error", { "before": true, "after": true }],
|
||||||
|
"indent": ["error", 4],
|
||||||
|
"semi": ["error", "never"],
|
||||||
|
"arrow-parens": ["error", "as-needed"],
|
||||||
|
"standard/object-curly-even-spacing": ["error", "either"],
|
||||||
|
"standard/array-bracket-even-spacing": ["error", "either"],
|
||||||
|
"standard/computed-property-even-spacing": ["error", "even"],
|
||||||
|
"standard/no-callback-literal": ["error", ["cb", "callback"]],
|
||||||
|
"react/jsx-uses-react": "error",
|
||||||
|
"react/jsx-uses-vars": "error",
|
||||||
|
"react-hooks/rules-of-hooks": "error",
|
||||||
|
"react-hooks/exhaustive-deps": "warn"
|
||||||
|
},
|
||||||
|
"parser": "babel-eslint",
|
||||||
|
"parserOptions": {
|
||||||
|
"ecmaVersion": 8
|
||||||
|
},
|
||||||
|
"env": {
|
||||||
|
"browser": true
|
||||||
|
}
|
||||||
|
}
|
|
@ -78,7 +78,7 @@ assigned by spaCy's [models](/models). The individual mapping is specific to the
|
||||||
training corpus and can be defined in the respective language data's
|
training corpus and can be defined in the respective language data's
|
||||||
[`tag_map.py`](/usage/adding-languages#tag-map).
|
[`tag_map.py`](/usage/adding-languages#tag-map).
|
||||||
|
|
||||||
<Accordion title="Universal Part-of-speech Tags">
|
<Accordion title="Universal Part-of-speech Tags" id="pos-universal">
|
||||||
|
|
||||||
spaCy also maps all language-specific part-of-speech tags to a small, fixed set
|
spaCy also maps all language-specific part-of-speech tags to a small, fixed set
|
||||||
of word type tags following the
|
of word type tags following the
|
||||||
|
@ -269,7 +269,7 @@ This section lists the syntactic dependency labels assigned by spaCy's
|
||||||
[models](/models). The individual labels are language-specific and depend on the
|
[models](/models). The individual labels are language-specific and depend on the
|
||||||
training corpus.
|
training corpus.
|
||||||
|
|
||||||
<Accordion title="Universal Dependency Labels">
|
<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">
|
||||||
|
|
||||||
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
|
The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
|
||||||
used in all languages trained on Universal Dependency Corpora.
|
used in all languages trained on Universal Dependency Corpora.
|
||||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
||||||
> parser.to_disk("/path/to/parser")
|
> parser.to_disk("/path/to/parser")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## DependencyParser.from_disk {#from_disk tag="method"}
|
## DependencyParser.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
||||||
|
|
||||||
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ----------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
||||||
|
|
||||||
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> parser.from_bytes(parser_bytes)
|
> parser.from_bytes(parser_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ------------------ | ---------------------------------------------- |
|
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
||||||
|
|
||||||
## DependencyParser.labels {#labels tag="property"}
|
## DependencyParser.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ----- | ---------------------------------- |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | tuple | The labels added to the component. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> parser.to_disk("/path", exclude=["vocab"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------- | -------------------------------------------------------------- |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||||
|
| `model` | The binary model data. You usually don't want to exclude this. |
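
To illustrate how an `exclude` list can filter serialization fields, here is a minimal stand-alone sketch. It is not spaCy's implementation: the `ToyPipe` class, its three plain-Python fields and the use of `json` as the storage format are all assumptions made for illustration.

```python
import json


class ToyPipe:
    """Hypothetical stand-in for a pipeline component (not a spaCy class)."""

    def __init__(self, vocab=None, cfg=None, model=None):
        self.vocab = vocab or []
        self.cfg = cfg or {}
        self.model = model or []

    def to_bytes(self, exclude=tuple()):
        # One entry per serialization field; excluded names are skipped.
        fields = {"vocab": self.vocab, "cfg": self.cfg, "model": self.model}
        kept = {name: value for name, value in fields.items() if name not in exclude}
        return json.dumps(kept).encode("utf8")

    def from_bytes(self, bytes_data, exclude=tuple()):
        # Restore every field present in the data unless it is excluded.
        data = json.loads(bytes_data.decode("utf8"))
        for name, value in data.items():
            if name not in exclude:
                setattr(self, name, value)
        return self


pipe = ToyPipe(vocab=["apple"], cfg={"beam_width": 1}, model=[0.1, 0.2])
pipe_bytes = pipe.to_bytes(exclude=["vocab"])
restored = ToyPipe().from_bytes(pipe_bytes)
```

In this sketch, `restored` keeps its empty default `vocab`, because that field was never written, while `cfg` and `model` are restored from the bytes.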
|
||||||
|
|
|
@ -237,7 +237,7 @@ attribute ID.
|
||||||
> from spacy.attrs import ORTH
|
> from spacy.attrs import ORTH
|
||||||
> doc = nlp(u"apple apple orange banana")
|
> doc = nlp(u"apple apple orange banana")
|
||||||
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
||||||
> doc.to_array([attrs.ORTH])
|
> doc.to_array([ORTH])
|
||||||
> # array([[11880], [11880], [7561], [12800]])
|
> # array([[11880], [11880], [7561], [12800]])
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
|
@ -349,11 +349,12 @@ array of attributes.
|
||||||
> assert doc[0].pos_ == doc2[0].pos_
|
> assert doc[0].pos_ == doc2[0].pos_
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | -------------------------------------- | ----------------------------- |
|
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
|
||||||
| `attrs` | list | A list of attribute ID ints. |
|
| `attrs` | list | A list of attribute ID ints. |
|
||||||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||||||
| **RETURNS** | `Doc` | Itself. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
| **RETURNS** | `Doc` | Itself. |
|
||||||
|
|
||||||
## Doc.to_disk {#to_disk tag="method" new="2"}
|
## Doc.to_disk {#to_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -365,9 +366,10 @@ Save the current state to a directory.
|
||||||
> doc.to_disk("/path/to/doc")
|
> doc.to_disk("/path/to/doc")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -384,6 +386,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||||||
|
|
||||||
## Doc.to_bytes {#to_bytes tag="method"}
|
## Doc.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -397,9 +400,10 @@ Serialize, i.e. export the document contents to a binary string.
|
||||||
> doc_bytes = doc.to_bytes()
|
> doc_bytes = doc.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||||||
|
|
||||||
## Doc.from_bytes {#from_bytes tag="method"}
|
## Doc.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -416,10 +420,11 @@ Deserialize, i.e. import the document contents from a binary string.
|
||||||
> assert doc.text == doc2.text
|
> assert doc.text == doc2.text
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ------------------------ |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `data` | bytes | The string to load from. |
|
| `data` | bytes | The string to load from. |
|
||||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
| **RETURNS** | `Doc` | The `Doc` object. |
|
||||||
|
|
||||||
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
||||||
|
|
||||||
|
@ -640,20 +645,45 @@ The L2 norm of the document's vector representation.
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `text` | unicode | A unicode representation of the document text. |
|
| `text` | unicode | A unicode representation of the document text. |
|
||||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||||
| `vocab` | `Vocab` | The store of lexical types. |
|
| `vocab` | `Vocab` | The store of lexical types. |
|
||||||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||||
| `user_data` | - | A generic storage area, for user custom data. |
|
| `user_data` | - | A generic storage area, for user custom data. |
|
||||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||||
|
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||||
|
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||||||
|
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> data = doc.to_bytes(exclude=["text", "tensor"])
|
||||||
|
> doc.from_disk("./doc.bin", exclude=["user_data"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------------------ | --------------------------------------------- |
|
||||||
|
| `text` | The value of the `Doc.text` attribute. |
|
||||||
|
| `sentiment` | The value of the `Doc.sentiment` attribute. |
|
||||||
|
| `tensor` | The value of the `Doc.tensor` attribute. |
|
||||||
|
| `user_data` | The value of the `Doc.user_data` dictionary. |
|
||||||
|
| `user_data_keys` | The keys of the `Doc.user_data` dictionary. |
|
||||||
|
| `user_data_values` | The values of the `Doc.user_data` dictionary. |
|
||||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
||||||
> ner.to_disk("/path/to/ner")
|
> ner.to_disk("/path/to/ner")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## EntityRecognizer.from_disk {#from_disk tag="method"}
|
## EntityRecognizer.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
|
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
|
||||||
|
|
||||||
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ----------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
||||||
|
|
||||||
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> ner.from_bytes(ner_bytes)
|
> ner.from_bytes(ner_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ------------------ | ---------------------------------------------- |
|
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
||||||
|
|
||||||
## EntityRecognizer.labels {#labels tag="property"}
|
## EntityRecognizer.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ----- | ---------------------------------- |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | tuple | The labels added to the component. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> ner.to_disk("/path", exclude=["vocab"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------- | -------------------------------------------------------------- |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||||
|
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||||
|
|
|
@ -91,13 +91,14 @@ multiprocessing.
|
||||||
> assert doc.is_parsed
|
> assert doc.is_parsed
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `texts` | - | A sequence of unicode objects. |
|
| `texts` | - | A sequence of unicode objects. |
|
||||||
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
|
| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. |
|
||||||
| `batch_size` | int | The number of texts to buffer. |
|
| `batch_size` | int | The number of texts to buffer. |
|
||||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||||
| **YIELDS** | `Doc` | Documents in the order of the original text. |
|
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||||
|
| **YIELDS** | `Doc` | Documents in the order of the original text. |
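
The `as_tuples` behaviour described in the table can be pictured with a small generator. This is only a sketch under stated assumptions: the `pipe` function and its `process` argument are hypothetical stand-ins for the real pipeline, and `str.split` plays the role of document creation.

```python
def pipe(texts, process, as_tuples=False):
    """Toy streaming pipe; `process` stands in for the real pipeline."""
    if as_tuples:
        # Inputs are (text, context) pairs; the context is passed through untouched.
        for text, context in texts:
            yield process(text), context
    else:
        for text in texts:
            yield process(text)


data = [("A short text.", {"id": 1}), ("Another one.", {"id": 2})]
results = list(pipe(data, process=str.split, as_tuples=True))
```

Each item in `results` pairs the processed text with the context object that was supplied alongside it, which is convenient for carrying IDs or metadata through a stream.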
|
||||||
|
|
||||||
## Language.update {#update tag="method"}
|
## Language.update {#update tag="method"}
|
||||||
|
|
||||||
|
@ -112,13 +113,14 @@ Update the models in the pipeline.
|
||||||
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
|
> nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
|
| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. |
|
||||||
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
|
| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). |
|
||||||
| `drop` | float | The dropout rate. |
|
| `drop` | float | The dropout rate. |
|
||||||
| `sgd` | callable | An optimizer. |
|
| `sgd` | callable | An optimizer. |
|
||||||
| **RETURNS** | dict | Results from the update. |
|
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||||
|
| **RETURNS** | dict | Results from the update. |
|
||||||
|
|
||||||
## Language.begin_training {#begin_training tag="method"}
|
## Language.begin_training {#begin_training tag="method"}
|
||||||
|
|
||||||
|
@ -130,11 +132,12 @@ Allocate models, pre-process training data and acquire an optimizer.
|
||||||
> optimizer = nlp.begin_training(gold_tuples)
|
> optimizer = nlp.begin_training(gold_tuples)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------- | -------- | ---------------------------- |
|
| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- |
|
||||||
| `gold_tuples` | iterable | Gold-standard training data. |
|
| `gold_tuples` | iterable | Gold-standard training data. |
|
||||||
| `**cfg` | - | Config parameters. |
|
| `component_cfg` <Tag variant="new">2.1</Tag> | dict | Config parameters for specific pipeline components, keyed by component name. |
|
||||||
| **RETURNS** | callable | An optimizer. |
|
| `**cfg` | - | Config parameters (sent to all components). |
|
||||||
|
| **RETURNS** | callable | An optimizer. |
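
One way to picture the difference between `**cfg` (sent to all components) and `component_cfg` (keyed by component name) is the toy dispatcher below. It is a sketch, not spaCy's code: the function body, the `init_component` helper and the choice to let per-component settings override the shared ones are assumptions made for illustration.

```python
def begin_training(components, component_cfg=None, **cfg):
    """Toy dispatcher: `components` maps names to callables taking keyword config."""
    component_cfg = component_cfg or {}
    results = {}
    for name, init in components.items():
        # Shared settings go to every component; per-component settings
        # are layered on top (an assumption of this sketch).
        kwargs = dict(cfg)
        kwargs.update(component_cfg.get(name, {}))
        results[name] = init(**kwargs)
    return results


def init_component(**kwargs):
    # Hypothetical component initializer that simply echoes its config.
    return kwargs


settings = begin_training(
    {"tagger": init_component, "parser": init_component},
    component_cfg={"parser": {"beam_width": 4}},
    dropout=0.2,
)
```

Here both components receive `dropout`, while only the one named `parser` also receives `beam_width`.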
|
||||||
|
|
||||||
## Language.use_params {#use_params tag="contextmanager, method"}
|
## Language.use_params {#use_params tag="contextmanager, method"}
|
||||||
|
|
||||||
|
@ -327,7 +330,7 @@ the model**.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved. |
|
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## Language.from_disk {#from_disk tag="method" new="2"}
|
## Language.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -349,22 +352,22 @@ loaded object.
|
||||||
> nlp = English().from_disk("/path/to/en_model")
|
> nlp = English().from_disk("/path/to/en_model")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
|
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||||
|
|
||||||
<Infobox title="Changed in v2.0" variant="warning">
|
<Infobox title="Changed in v2.0" variant="warning">
|
||||||
|
|
||||||
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
|
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
|
||||||
to improve consistency across classes. Pipeline components to prevent from being
|
to improve consistency across classes. Pipeline components to prevent from being
|
||||||
loaded can now be added as a list to `disable`, instead of specifying one
|
loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1),
|
||||||
keyword argument per component.
|
instead of specifying one keyword argument per component.
|
||||||
|
|
||||||
```diff
|
```diff
|
||||||
- nlp = spacy.load("en", tagger=False, entity=False)
|
- nlp = spacy.load("en", tagger=False, entity=False)
|
||||||
+ nlp = English().from_disk("/model", disable=["tagger', 'ner"])
|
+ nlp = English().from_disk("/model", exclude=["tagger", "ner"])
|
||||||
```
|
```
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
@ -379,10 +382,10 @@ Serialize the current state to a binary string.
|
||||||
> nlp_bytes = nlp.to_bytes()
|
> nlp_bytes = nlp.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- |
|
| ----------- | ----- | ----------------------------------------------------------------------------------------- |
|
||||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized. |
|
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
||||||
|
|
||||||
## Language.from_bytes {#from_bytes tag="method"}
|
## Language.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -400,20 +403,21 @@ available to the loaded object.
|
||||||
> nlp2.from_bytes(nlp_bytes)
|
> nlp2.from_bytes(nlp_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ---------- | --------------------------------------------------------------------------------- |
|
| ------------ | ---------- | ----------------------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Language` | The `Language` object. |
|
| **RETURNS** | `Language` | The `Language` object. |
|
||||||
|
|
||||||
<Infobox title="Changed in v2.0" variant="warning">
|
<Infobox title="Changed in v2.0" variant="warning">
|
||||||
|
|
||||||
Pipeline components to prevent from being loaded can now be added as a list to
|
Pipeline components to prevent from being loaded can now be added as a list to
|
||||||
`disable`, instead of specifying one keyword argument per component.
|
`disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument
|
||||||
|
per component.
|
||||||
|
|
||||||
```diff
|
```diff
|
||||||
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
|
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
|
||||||
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"])
|
+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])
|
||||||
```
|
```
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
@ -437,3 +441,23 @@ Pipeline components to prevent from being loaded can now be added as a list to
|
||||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||||
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
|
||||||
|
> nlp.from_disk("./model-data", exclude=["ner"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ----------- | -------------------------------------------------- |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `tokenizer` | Tokenization rules and exceptions. |
|
||||||
|
| `meta` | The meta data, available as `Language.meta`. |
|
||||||
|
| ... | String names of pipeline components, e.g. `"ner"`. |
|
||||||
|
|
|
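The `exclude` behaviour described under "Serialization fields" can be pictured as a filter over named fields. The following is a minimal sketch under stated assumptions, not spaCy's implementation: the helper names `to_bytes` and `from_bytes`, the getter/setter dictionaries and the toy `state` are all hypothetical, and JSON stands in for whatever binary format the library actually uses.

```python
import json


def to_bytes(getters, exclude=()):
    """Serialize every field whose name is not listed in `exclude`."""
    payload = {name: get() for name, get in getters.items() if name not in exclude}
    return json.dumps(payload, sort_keys=True).encode("utf8")


def from_bytes(bytes_data, setters, exclude=()):
    """Restore every stored field that is not listed in `exclude`."""
    payload = json.loads(bytes_data.decode("utf8"))
    for name, set_value in setters.items():
        if name in payload and name not in exclude:
            set_value(payload[name])


# Toy state standing in for the `vocab`, `tokenizer` and `meta` fields (hypothetical values).
state = {"vocab": ["apple", "orange"], "tokenizer": {"rules": 2}, "meta": {"lang": "en"}}
getters = {name: (lambda name=name: state[name]) for name in state}

# Only `meta` is written, because the other two fields are excluded.
data = to_bytes(getters, exclude=["tokenizer", "vocab"])

restored = {}
setters = {name: (lambda value, name=name: restored.__setitem__(name, value)) for name in state}
from_bytes(data, setters)
```

Because excluded fields are skipped rather than written as empty values, loading the result leaves the corresponding attributes of the target object untouched.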
@ -316,6 +316,22 @@ taken.
|
||||||
| ----------- | ------- | --------------- |
|
| ----------- | ------- | --------------- |
|
||||||
| **RETURNS** | `Token` | The root token. |
|
| **RETURNS** | `Token` | The root token. |
|
||||||
|
|
||||||
|
## Span.conjuncts {#conjuncts tag="property" model="parser"}
|
||||||
|
|
||||||
|
A tuple of tokens coordinated to `span.root`.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> doc = nlp(u"I like apples and oranges")
|
||||||
|
> apples_conjuncts = doc[2:3].conjuncts
|
||||||
|
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ----------- | ------- | ----------------------- |
|
||||||
|
| **RETURNS** | `tuple` | The coordinated tokens. |
|
||||||
|
|
||||||
## Span.lefts {#lefts tag="property" model="parser"}
|
## Span.lefts {#lefts tag="property" model="parser"}
|
||||||
|
|
||||||
Tokens that are to the left of the span, whose heads are within the span.
|
Tokens that are to the left of the span, whose heads are within the span.
|
||||||
|
|
|
@ -151,10 +151,9 @@ Serialize the current state to a binary string.
|
||||||
> store_bytes = stringstore.to_bytes()
|
> store_bytes = stringstore.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------ |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
|
||||||
|
|
||||||
## StringStore.from_bytes {#from_bytes tag="method"}
|
## StringStore.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -168,11 +167,10 @@ Load state from a binary string.
|
||||||
> new_store = StringStore().from_bytes(store_bytes)
|
> new_store = StringStore().from_bytes(store_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ------------- | ---------------------------------------------- |
|
| ------------ | ------------- | ------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
||||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
|
||||||
|
|
||||||
## Utilities {#util}
|
## Utilities {#util}
|
||||||
|
|
||||||
|
|
|
@ -244,9 +244,10 @@ Serialize the pipe to disk.
|
||||||
> tagger.to_disk("/path/to/tagger")
|
> tagger.to_disk("/path/to/tagger")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## Tagger.from_disk {#from_disk tag="method"}
|
## Tagger.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -262,6 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
||||||
|
|
||||||
## Tagger.to_bytes {#to_bytes tag="method"}
|
## Tagger.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
||||||
|
|
||||||
## Tagger.from_bytes {#from_bytes tag="method"}
|
## Tagger.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> tagger.from_bytes(tagger_bytes)
|
> tagger.from_bytes(tagger_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | -------- | ---------------------------------------------- |
|
| ------------ | -------- | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
||||||
|
|
||||||
## Tagger.labels {#labels tag="property"}
|
## Tagger.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -314,3 +316,22 @@ tags by default, e.g. `VERB`, `NOUN` and so on.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ----- | ---------------------------------- |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | tuple | The labels added to the component. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> tagger.to_disk("/path", exclude=["vocab"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| --------- | ------------------------------------------------------------------------------------------ |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||||
|
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||||
|
| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tags. |
|
||||||
|
|
|
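The tables in this diff replace the old `**exclude` keyword arguments with a single `exclude` list. One way to picture a backwards-compatible transition is a small shim that folds legacy `field=False` keywords into the list. This is a hedged sketch only: `resolve_exclude` and `TAGGER_FIELDS` are hypothetical names, and reading the legacy keywords as `field=False` is an assumption modelled on the `tagger=False` style shown in the `Language` infobox.

```python
import warnings


def resolve_exclude(fields, exclude=(), **legacy_kwargs):
    """Merge an `exclude` list with legacy `field=False` keyword arguments."""
    resolved = list(exclude)
    for name, value in legacy_kwargs.items():
        if name not in fields:
            raise ValueError("Unknown serialization field: {}".format(name))
        if value is False and name not in resolved:
            # Nudge callers towards the list-based style.
            warnings.warn(
                "Pass exclude=['{}'] instead of {}=False".format(name, name),
                DeprecationWarning,
            )
            resolved.append(name)
    return resolved


# Field names taken from the Tagger "Serialization fields" table.
TAGGER_FIELDS = ("vocab", "cfg", "model", "tag_map")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    legacy = resolve_exclude(TAGGER_FIELDS, vocab=False)  # old keyword style

current = resolve_exclude(TAGGER_FIELDS, exclude=["vocab"])  # new list style
```

Both calls resolve to the same list, so the serialization code behind the shim only ever has to deal with one representation.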
@ -260,9 +260,10 @@ Serialize the pipe to disk.
|
||||||
> textcat.to_disk("/path/to/textcat")
|
> textcat.to_disk("/path/to/textcat")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## TextCategorizer.from_disk {#from_disk tag="method"}
|
## TextCategorizer.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -278,6 +279,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----------------- | -------------------------------------------------------------------------- |
|
| ----------- | ----------------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
|
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
|
||||||
|
|
||||||
## TextCategorizer.to_bytes {#to_bytes tag="method"}
|
## TextCategorizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -291,10 +293,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the pipe to a bytestring.
|
Serialize the pipe to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ---------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
||||||
|
|
||||||
## TextCategorizer.from_bytes {#from_bytes tag="method"}
|
## TextCategorizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -308,11 +310,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
||||||
> textcat.from_bytes(textcat_bytes)
|
> textcat.from_bytes(textcat_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ----------------- | ---------------------------------------------- |
|
| ------------ | ----------------- | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
||||||
|
|
||||||
## TextCategorizer.labels {#labels tag="property"}
|
## TextCategorizer.labels {#labels tag="property"}
|
||||||
|
|
||||||
|
@ -328,3 +330,21 @@ The labels currently added to the component.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | ---------------------------------- |
|
| ----------- | ----- | ---------------------------------- |
|
||||||
| **RETURNS** | tuple | The labels added to the component. |
|
| **RETURNS** | tuple | The labels added to the component. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> textcat.to_disk("/path", exclude=["vocab"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ------- | -------------------------------------------------------------- |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||||
|
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||||
|
|
|
@ -211,7 +211,7 @@ The rightmost token of this token's syntactic descendants.
|
||||||
|
|
||||||
## Token.conjuncts {#conjuncts tag="property" model="parser"}
|
## Token.conjuncts {#conjuncts tag="property" model="parser"}
|
||||||
|
|
||||||
A sequence of coordinated tokens, including the token itself.
|
A tuple of coordinated tokens, not including the token itself.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -221,9 +221,9 @@ A sequence of coordinated tokens, including the token itself.
|
||||||
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
> assert [t.text for t in apples_conjuncts] == [u"oranges"]
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ---------- | ------- | -------------------- |
|
| ----------- | ------- | ----------------------- |
|
||||||
| **YIELDS** | `Token` | A coordinated token. |
|
| **RETURNS** | `tuple` | The coordinated tokens. |
|
||||||
|
|
||||||
## Token.children {#children tag="property" model="parser"}
|
## Token.children {#children tag="property" model="parser"}
|
||||||
|
|
||||||
|
|
|
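The changed `Token.conjuncts` description (a tuple that does not include the token itself) can be illustrated on a hand-written dependency parse. This is a rough sketch and not spaCy's algorithm: the `TOKENS` list, its head indices and labels, and the `conjuncts` helper are all assumptions made for illustration.

```python
# Toy parse of "I like apples and oranges": (text, head index, dependency label).
# Heads and labels are written by hand, not produced by a parser.
TOKENS = [
    ("I", 1, "nsubj"),
    ("like", 1, "ROOT"),
    ("apples", 1, "dobj"),
    ("and", 2, "cc"),
    ("oranges", 2, "conj"),
]


def conjuncts(tokens, i):
    """Return the indices of tokens coordinated with token `i`, excluding `i`."""
    # Walk up to the first conjunct of the coordination.
    start = i
    while tokens[start][2] == "conj":
        start = tokens[start][1]
    # Collect everything attached to it through a chain of "conj" arcs.
    found, queue = [], [start]
    while queue:
        head = queue.pop(0)
        if head != i:
            found.append(head)
        queue.extend(
            j
            for j, (_, h, dep) in enumerate(tokens)
            if h == head and dep == "conj" and j != head
        )
    return tuple(sorted(found))


apples_conjuncts = conjuncts(TOKENS, 2)
oranges_conjuncts = conjuncts(TOKENS, 4)
```

In this sketch the relation is symmetric: asking from "apples" yields "oranges" and vice versa, and neither result contains the token that was queried.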
@ -127,9 +127,10 @@ Serialize the tokenizer to disk.
|
||||||
> tokenizer.to_disk("/path/to/tokenizer")
|
> tokenizer.to_disk("/path/to/tokenizer")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## Tokenizer.from_disk {#from_disk tag="method"}
|
## Tokenizer.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -145,6 +146,7 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
||||||
|
|
||||||
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -158,10 +160,10 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
||||||
|
|
||||||
Serialize the tokenizer to a bytestring.
|
Serialize the tokenizer to a bytestring.
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
||||||
|
|
||||||
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -176,11 +178,11 @@ it.
|
||||||
> tokenizer.from_bytes(tokenizer_bytes)
|
> tokenizer.from_bytes(tokenizer_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ----------- | ---------------------------------------------- |
|
| ------------ | ----------- | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
|
@ -190,3 +192,25 @@ it.
|
||||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
|
||||||
|
> tokenizer.from_disk("./data", exclude=["token_match"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| ---------------- | --------------------------------- |
|
||||||
|
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||||
|
| `prefix_search` | The prefix rules. |
|
||||||
|
| `suffix_search` | The suffix rules. |
|
||||||
|
| `infix_finditer` | The infix rules. |
|
||||||
|
| `token_match` | The token match expression. |
|
||||||
|
| `exceptions` | The tokenizer exception rules. |
|
||||||
|
|
|
@ -351,6 +351,24 @@ the two-letter language code.
|
||||||
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||||
| `cls` | `Language` | The language class, e.g. `English`. |
|
| `cls` | `Language` | The language class, e.g. `English`. |
|
||||||
|
|
||||||
|
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
|
||||||
|
|
||||||
|
Check whether a `Language` class is already loaded. `Language` classes are
|
||||||
|
loaded lazily, to avoid expensive setup code associated with the language data.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> lang_cls = util.get_lang_class("en")
|
||||||
|
> assert util.lang_class_is_loaded("en") is True
|
||||||
|
> assert util.lang_class_is_loaded("de") is False
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Type | Description |
|
||||||
|
| ----------- | ------- | -------------------------------------- |
|
||||||
|
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||||
|
| **RETURNS** | bool | Whether the class has been loaded. |
|
||||||
|
|
||||||
### util.load_model {#util.load_model tag="function" new="2"}
|
### util.load_model {#util.load_model tag="function" new="2"}
|
||||||
|
|
||||||
Load a model from a shortcut link, package or data path. If called with a
|
Load a model from a shortcut link, package or data path. If called with a
|
||||||
|
|
|
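The lazy loading that `util.lang_class_is_loaded` reports on can be sketched with a small cache that is only filled on first request. This is an illustrative sketch, not spaCy's code: the `_LOADED` cache and the `_ENTRY_POINTS` mapping are hypothetical, and standard-library classes stand in for language classes so the example runs without spaCy installed.

```python
import importlib

_LOADED = {}  # language code -> loaded class (hypothetical cache)

# Hypothetical mapping from language code to (module, attribute).
# Standard-library classes are placeholders for real language classes.
_ENTRY_POINTS = {"en": ("collections", "OrderedDict"), "de": ("decimal", "Decimal")}


def get_lang_class(name):
    """Import the class for `name` on first use and cache it."""
    if name not in _LOADED:
        module_name, attr = _ENTRY_POINTS[name]
        _LOADED[name] = getattr(importlib.import_module(module_name), attr)
    return _LOADED[name]


def lang_class_is_loaded(name):
    """Check the cache without triggering an import."""
    return name in _LOADED


get_lang_class("en")
en_loaded = lang_class_is_loaded("en")
de_loaded = lang_class_is_loaded("de")
```

The point of the separate check is that it never imports anything itself, so callers can inspect the state without paying the setup cost of the language data.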
@ -311,10 +311,9 @@ Save the current state to a directory.
|
||||||
>
|
>
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being saved. |
|
|
||||||
|
|
||||||
## Vectors.from_disk {#from_disk tag="method"}
|
## Vectors.from_disk {#from_disk tag="method"}
|
||||||
|
|
||||||
|
@ -342,10 +341,9 @@ Serialize the current state to a binary string.
|
||||||
> vectors_bytes = vectors.to_bytes()
|
> vectors_bytes = vectors.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
| ----------- | ----- | -------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
|
||||||
|
|
||||||
## Vectors.from_bytes {#from_bytes tag="method"}
|
## Vectors.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -360,11 +358,10 @@ Load state from a binary string.
|
||||||
> new_vectors.from_bytes(vectors_bytes)
|
> new_vectors.from_bytes(vectors_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | --------- | ---------------------------------------------- |
|
| ----------- | --------- | ---------------------- |
|
||||||
| `data` | bytes | The data to load from. |
|
| `data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
||||||
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
|
|
|
@ -221,9 +221,10 @@ Save the current state to a directory.
|
||||||
> nlp.vocab.to_disk("/path/to/vocab")
|
> nlp.vocab.to_disk("/path/to/vocab")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
|
|
||||||
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
||||||
|
|
||||||
|
@ -239,6 +240,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||||
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Vocab` | The modified `Vocab` object. |
|
| **RETURNS** | `Vocab` | The modified `Vocab` object. |
|
||||||
|
|
||||||
## Vocab.to_bytes {#to_bytes tag="method"}
|
## Vocab.to_bytes {#to_bytes tag="method"}
|
||||||
|
@ -251,10 +253,10 @@ Serialize the current state to a binary string.
|
||||||
> vocab_bytes = nlp.vocab.to_bytes()
|
> vocab_bytes = nlp.vocab.to_bytes()
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ----------- | ----- | -------------------------------------------------- |
|
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
||||||
|
|
||||||
## Vocab.from_bytes {#from_bytes tag="method"}
|
## Vocab.from_bytes {#from_bytes tag="method"}
|
||||||
|
|
||||||
|
@ -269,11 +271,11 @@ Load state from a binary string.
|
||||||
> vocab.from_bytes(vocab_bytes)
|
> vocab.from_bytes(vocab_bytes)
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------ | ------- | ---------------------------------------------- |
|
| ------------ | ------- | ------------------------------------------------------------------------- |
|
||||||
| `bytes_data` | bytes | The data to load from. |
|
| `bytes_data` | bytes | The data to load from. |
|
||||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||||
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
||||||
|
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
|
@ -286,8 +288,28 @@ Load state from a binary string.
|
||||||
> assert type(PERSON) == int
|
> assert type(PERSON) == int
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| ------------------------------------ | ------------- | --------------------------------------------- |
|
| --------------------------------------------- | ------------- | ------------------------------------------------------------ |
|
||||||
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
|
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
|
||||||
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
|
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
|
||||||
| `vectors_length` | int | Number of dimensions for each word vector. |
|
| `vectors_length` | int | Number of dimensions for each word vector. |
|
||||||
|
| `writing_system` <Tag variant="new">2.1</Tag> | dict | A dict with information about the language's writing system. |
|
||||||
|
|
||||||
|
## Serialization fields {#serialization-fields}
|
||||||
|
|
||||||
|
During serialization, spaCy will export several data fields used to restore
|
||||||
|
different aspects of the object. If needed, you can exclude them from
|
||||||
|
serialization by passing in the string names via the `exclude` argument.
|
||||||
|
|
||||||
|
> #### Example
|
||||||
|
>
|
||||||
|
> ```python
|
||||||
|
> data = vocab.to_bytes(exclude=["strings", "vectors"])
|
||||||
|
> vocab.from_disk("./vocab", exclude=["strings"])
|
||||||
|
> ```
|
||||||
|
|
||||||
|
| Name | Description |
|
||||||
|
| --------- | ----------------------------------------------------- |
|
||||||
|
| `strings` | The strings in the [`StringStore`](/api/stringstore). |
|
||||||
|
| `lexemes` | The lexeme data. |
|
||||||
|
| `vectors` | The word vectors, if available. |
|
||||||
|
|
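
The `exclude` behaviour described in the serialization fields table can be pictured with a small, self-contained sketch. This is only an illustration under stated assumptions: `ToyVocab` is a hypothetical stand-in written for this example and is not spaCy's `Vocab` implementation. Only the field names (`strings`, `lexemes`, `vectors`) and the `exclude` argument are taken from the docs above.

```python
# Toy illustration only: ToyVocab is hypothetical and is NOT spaCy's Vocab.
import pickle


class ToyVocab:
    # Field names mirror the serialization fields listed in the table.
    FIELDS = ("strings", "lexemes", "vectors")

    def __init__(self, strings=None, lexemes=None, vectors=None):
        self.strings = strings or []
        self.lexemes = lexemes or {}
        self.vectors = vectors or {}

    def to_bytes(self, exclude=tuple()):
        # Serialize only the fields whose names are not in `exclude`.
        data = {
            name: getattr(self, name)
            for name in self.FIELDS
            if name not in exclude
        }
        return pickle.dumps(data)

    def from_bytes(self, bytes_data, exclude=tuple()):
        # Restore stored fields unless excluded; modifies the object in place.
        for name, value in pickle.loads(bytes_data).items():
            if name not in exclude:
                setattr(self, name, value)
        return self


vocab = ToyVocab(
    strings=["apple"],
    lexemes={"apple": 1},
    vectors={"apple": [0.1, 0.2]},
)
data = vocab.to_bytes(exclude=["vectors"])  # vectors are never written
restored = ToyVocab().from_bytes(data, exclude=["strings"])  # strings are skipped on load
```

In this sketch, a field excluded at either end simply keeps its default value on the restored object, which is the same intuition behind passing `exclude` to `to_bytes`, `from_bytes`, `to_disk` or `from_disk`.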
|
@ -39,9 +39,9 @@ together all components and creating the `Language` subclass – for example,
|
||||||
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
|
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
|
||||||
|
|
||||||
[stop_words.py]:
|
[stop_words.py]:
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/stop_words.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||||
[tokenizer_exceptions.py]:
|
[tokenizer_exceptions.py]:
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tokenizer_exceptions.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
|
||||||
[norm_exceptions.py]:
|
[norm_exceptions.py]:
|
||||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
|
||||||
[punctuation.py]:
|
[punctuation.py]:
|
||||||
|
@ -49,12 +49,12 @@ together all components and creating the `Language` subclass – for example,
|
||||||
[char_classes.py]:
|
[char_classes.py]:
|
||||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py
|
||||||
[lex_attrs.py]:
|
[lex_attrs.py]:
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lex_attrs.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
|
||||||
[syntax_iterators.py]:
|
[syntax_iterators.py]:
|
||||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
||||||
[lemmatizer.py]:
|
[lemmatizer.py]:
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/lemmatizer.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/lemmatizer.py
|
||||||
[tag_map.py]:
|
[tag_map.py]:
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language/tag_map.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
|
||||||
[morph_rules.py]:
|
[morph_rules.py]:
|
||||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
|
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
|
||||||
|
|
|
@ -33,9 +33,22 @@ list containing the component names:
|
||||||
|
|
||||||
import Accordion from 'components/accordion.js'
|
import Accordion from 'components/accordion.js'
|
||||||
|
|
||||||
<Accordion title="Does the order of pipeline components matter?">
|
<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">
|
||||||
|
|
||||||
No
|
In spaCy v2.x, the statistical components like the tagger or parser are
|
||||||
|
independent and don't share any data between themselves. For example, the named
|
||||||
|
entity recognizer doesn't use any features set by the tagger and parser, and so
|
||||||
|
on. This means that you can swap them, or remove single components from the
|
||||||
|
pipeline without affecting the others.
|
||||||
|
|
||||||
|
However, custom components may depend on annotations set by other components.
|
||||||
|
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
|
||||||
|
it'll only work if it's added after the tagger. The parser will respect
|
||||||
|
pre-defined sentence boundaries, so if a previous component in the pipeline sets
|
||||||
|
them, its dependency predictions may be different. Similarly, it matters if you
|
||||||
|
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
|
||||||
|
recognizer: if it's added before, the entity recognizer will take the existing
|
||||||
|
entities into account when making predictions.
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
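
To make the point about custom components concrete, here is a minimal sketch of why order can matter when one component reads annotations written by another. The functions below are hypothetical stand-ins invented for this example (a plain dict plays the role of the `Doc`), so this is not how spaCy's tagger or lemmatizer work internally.

```python
# Toy illustration only: these functions are hypothetical, not spaCy components.

def toy_tagger(doc):
    # Assign a coarse tag to every token.
    doc["tags"] = ["VERB" if t.endswith("ing") else "NOUN" for t in doc["tokens"]]
    return doc


def toy_lemmatizer(doc):
    # Depends on tags: only strips "ing" from tokens already tagged as verbs.
    tags = doc.get("tags") or [None] * len(doc["tokens"])
    doc["lemmas"] = [
        token[:-3] if tag == "VERB" else token
        for token, tag in zip(doc["tokens"], tags)
    ]
    return doc


def run_pipeline(text, pipeline):
    # Call each component in order on the same document.
    doc = {"tokens": text.split()}
    for component in pipeline:
        doc = component(doc)
    return doc


after = run_pipeline("dogs barking", [toy_tagger, toy_lemmatizer])
before = run_pipeline("dogs barking", [toy_lemmatizer, toy_tagger])
```

With the tagger first, the lemmatizer sees the tags and reduces "barking" to "bark"; with the order reversed, it has no tags to work with and leaves the token unchanged.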
|
||||||
|
|
||||||
|
|
|
@ -39,7 +39,7 @@ and morphological analysis.
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<Infobox title="Table of Contents">
|
<Infobox title="Table of Contents" id="toc">
|
||||||
|
|
||||||
- [Language data 101](#101)
|
- [Language data 101](#101)
|
||||||
- [The Language subclass](#language-subclass)
|
- [The Language subclass](#language-subclass)
|
||||||
|
@ -105,15 +105,15 @@ to know the language's character set. If the language you're adding uses
|
||||||
non-latin characters, you might need to define the required character classes in
|
non-latin characters, you might need to define the required character classes in
|
||||||
the global
|
the global
|
||||||
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
|
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
|
||||||
For efficiency, spaCy uses hard-coded unicode ranges to define character classes,
|
For efficiency, spaCy uses hard-coded unicode ranges to define character
|
||||||
the definitions of which can be found on [Wikipedia](https://en.wikipedia.org/wiki/Unicode_block).
|
classes, the definitions of which can be found on
|
||||||
If the language requires very specific punctuation
|
[Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language
|
||||||
rules, you should consider overwriting the default regular expressions with your
|
requires very specific punctuation rules, you should consider overwriting the
|
||||||
own in the language's `Defaults`.
|
default regular expressions with your own in the language's `Defaults`.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Creating a `Language` subclass {#language-subclass}
|
### Creating a language subclass {#language-subclass}
|
||||||
|
|
||||||
Language-specific code and resources should be organized into a sub-package of
|
Language-specific code and resources should be organized into a sub-package of
|
||||||
spaCy, named according to the language's
|
spaCy, named according to the language's
|
||||||
|
@ -121,9 +121,9 @@ spaCy, named according to the language's
|
||||||
code and resources specific to Spanish are placed into a directory
|
code and resources specific to Spanish are placed into a directory
|
||||||
`spacy/lang/es`, which can be imported as `spacy.lang.es`.
|
`spacy/lang/es`, which can be imported as `spacy.lang.es`.
|
||||||
|
|
||||||
To get started, you can use our
|
To get started, you can check out the
|
||||||
[templates](https://github.com/explosion/spacy-dev-resources/templates/new_language)
|
[existing languages](https://github.com/explosion/spacy/tree/master/spacy/lang).
|
||||||
for the most important files. Here's what the class template looks like:
|
Here's what the class could look like:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### __init__.py (excerpt)
|
### __init__.py (excerpt)
|
||||||
|
@ -614,7 +614,7 @@ require models to be trained from labeled examples. The word vectors, word
|
||||||
probabilities and word clusters also require training, although these can be
|
probabilities and word clusters also require training, although these can be
|
||||||
trained from unlabeled text, which tends to be much easier to collect.
|
trained from unlabeled text, which tends to be much easier to collect.
|
||||||
|
|
||||||
### Creating a vocabulary file
|
### Creating a vocabulary file {#vocab-file}
|
||||||
|
|
||||||
spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
|
spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
|
||||||
instance. The vocabulary caches lexical features. spaCy loads the vocabulary
|
instance. The vocabulary caches lexical features. spaCy loads the vocabulary
|
||||||
|
@ -631,20 +631,20 @@ of using deep learning for NLP with limited labeled data. The vectors are also
|
||||||
useful by themselves – they power the `.similarity` methods in spaCy. For best
|
useful by themselves – they power the `.similarity` methods in spaCy. For best
|
||||||
results, you should pre-process the text with spaCy before training the Word2vec
|
results, you should pre-process the text with spaCy before training the Word2vec
|
||||||
model. This ensures your tokenization will match. You can use our
|
model. This ensures your tokenization will match. You can use our
|
||||||
[word vectors training script](https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py),
|
[word vectors training script](https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py),
|
||||||
which pre-processes the text with your language-specific tokenizer and trains
|
which pre-processes the text with your language-specific tokenizer and trains
|
||||||
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
|
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
|
||||||
file should consist of one word and vector per line.
|
file should consist of one word and vector per line.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py
|
https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py
|
||||||
```
|
```
|
||||||
|
|
||||||
If you don't have a large sample of text available, you can also convert word
|
If you don't have a large sample of text available, you can also convert word
|
||||||
vectors produced by a variety of other tools into spaCy's format. See the docs
|
vectors produced by a variety of other tools into spaCy's format. See the docs
|
||||||
on [converting word vectors](/usage/vectors-similarity#converting) for details.
|
on [converting word vectors](/usage/vectors-similarity#converting) for details.
|
||||||
|
|
||||||
### Creating or converting a training corpus
|
### Creating or converting a training corpus {#training-corpus}
|
||||||
|
|
||||||
The easiest way to train spaCy's tagger, parser, entity recognizer or text
|
The easiest way to train spaCy's tagger, parser, entity recognizer or text
|
||||||
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
|
||||||
|
|
|
@ -29,7 +29,7 @@ Here's a quick comparison of the functionalities offered by spaCy,
|
||||||
| Entity linking | ❌ | ❌ | ❌ |
|
| Entity linking | ❌ | ❌ | ❌ |
|
||||||
| Coreference resolution | ❌ | ❌ | ✅ |
|
| Coreference resolution | ❌ | ❌ | ✅ |
|
||||||
|
|
||||||
### When should I use what?
|
### When should I use what? {#comparison-usage}
|
||||||
|
|
||||||
Natural Language Understanding is an active area of research and development, so
|
Natural Language Understanding is an active area of research and development, so
|
||||||
there are many different tools or technologies catering to different use-cases.
|
there are many different tools or technologies catering to different use-cases.
|
||||||
|
|
|
@ -28,7 +28,7 @@ import QuickstartInstall from 'widgets/quickstart-install.js'
|
||||||
|
|
||||||
## Installation instructions {#installation}
|
## Installation instructions {#installation}
|
||||||
|
|
||||||
### pip
|
### pip {#pip}
|
||||||
|
|
||||||
Using pip, spaCy releases are available as source packages and binary wheels (as
|
Using pip, spaCy releases are available as source packages and binary wheels (as
|
||||||
of v2.0.13).
|
of v2.0.13).
|
||||||
|
@ -58,7 +58,7 @@ source .env/bin/activate
|
||||||
pip install spacy
|
pip install spacy
|
||||||
```
|
```
|
||||||
|
|
||||||
### conda
|
### conda {#conda}
|
||||||
|
|
||||||
Thanks to our great community, we've been able to re-add conda support. You can
|
Thanks to our great community, we've been able to re-add conda support. You can
|
||||||
also install spaCy via `conda-forge`:
|
also install spaCy via `conda-forge`:
|
||||||
|
@ -194,7 +194,7 @@ official distributions these are:
|
||||||
| Python 3.4 | Visual Studio 2010 |
|
| Python 3.4 | Visual Studio 2010 |
|
||||||
| Python 3.5+ | Visual Studio 2015 |
|
| Python 3.5+ | Visual Studio 2015 |
|
||||||
|
|
||||||
### Run tests
|
### Run tests {#run-tests}
|
||||||
|
|
||||||
spaCy comes with an
|
spaCy comes with an
|
||||||
[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests).
|
[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests).
|
||||||
|
@ -418,7 +418,7 @@ either of these, clone your repository again.
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
## Changelog
|
## Changelog {#changelog}
|
||||||
|
|
||||||
import Changelog from 'widgets/changelog.js'
|
import Changelog from 'widgets/changelog.js'
|
||||||
|
|
||||||
|
|
|
@ -298,9 +298,9 @@ different languages, see the
|
||||||
The best way to understand spaCy's dependency parser is interactively. To make
|
The best way to understand spaCy's dependency parser is interactively. To make
|
||||||
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
|
this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc`
|
||||||
or a list of `Doc` objects to displaCy and run
|
or a list of `Doc` objects to displaCy and run
|
||||||
[`displacy.serve`](top-level#displacy.serve) to run the web server, or
|
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
|
||||||
[`displacy.render`](top-level#displacy.render) to generate the raw markup. If
|
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
|
||||||
you want to know how to write rules that hook into some type of syntactic
|
If you want to know how to write rules that hook into some type of syntactic
|
||||||
construction, just plug the sentence into the visualizer and see how spaCy
|
construction, just plug the sentence into the visualizer and see how spaCy
|
||||||
annotates it.
|
annotates it.
|
||||||
|
|
||||||
|
@ -621,7 +621,7 @@ For more details on the language-specific data, see the usage guide on
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
<Accordion title="Should I change the language data or add custom tokenizer rules?">
|
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
|
||||||
|
|
||||||
Tokenization rules that are specific to one language, but can be **generalized
|
Tokenization rules that are specific to one language, but can be **generalized
|
||||||
across that language** should ideally live in the language data in
|
across that language** should ideally live in the language data in
|
||||||
|
|
|
@ -41,7 +41,7 @@ contribute to model development.
|
||||||
> If a model is available for a language, you can download it using the
|
> If a model is available for a language, you can download it using the
|
||||||
> [`spacy download`](/api/cli#download) command. In order to use languages that
|
> [`spacy download`](/api/cli#download) command. In order to use languages that
|
||||||
> don't yet come with a model, you have to import them directly, or use
|
> don't yet come with a model, you have to import them directly, or use
|
||||||
> [`spacy.blank`](api/top-level#spacy.blank):
|
> [`spacy.blank`](/api/top-level#spacy.blank):
|
||||||
>
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.lang.fi import Finnish
|
> from spacy.lang.fi import Finnish
|
||||||
|
|
|
@ -46,7 +46,8 @@ components. spaCy then does the following:
|
||||||
3. Add each pipeline component to the pipeline in order, using
|
3. Add each pipeline component to the pipeline in order, using
|
||||||
[`add_pipe`](/api/language#add_pipe).
|
[`add_pipe`](/api/language#add_pipe).
|
||||||
4. Make the **model data** available to the `Language` class by calling
|
4. Make the **model data** available to the `Language` class by calling
|
||||||
[`from_disk`](language#from_disk) with the path to the model data directory.
|
[`from_disk`](/api/language#from_disk) with the path to the model data
|
||||||
|
directory.
|
||||||
|
|
||||||
So when you call this...
|
So when you call this...
|
||||||
|
|
||||||
|
@ -110,7 +111,7 @@ print(nlp.pipe_names)
|
||||||
# ['tagger', 'parser', 'ner']
|
# ['tagger', 'parser', 'ner']
|
||||||
```
|
```
|
||||||
|
|
||||||
### Built-in pipeline components
|
### Built-in pipeline components {#built-in}
|
||||||
|
|
||||||
spaCy ships with several built-in pipeline components that are also available in
|
spaCy ships with several built-in pipeline components that are also available in
|
||||||
the `Language.factories`. This means that you can initialize them by calling
|
the `Language.factories`. This means that you can initialize them by calling
|
||||||
|
@ -426,7 +427,7 @@ spaCy, and implement your own models trained with other machine learning
|
||||||
libraries. It also lets you take advantage of spaCy's data structures and the
|
libraries. It also lets you take advantage of spaCy's data structures and the
|
||||||
`Doc` object as the "single source of truth".
|
`Doc` object as the "single source of truth".
|
||||||
|
|
||||||
<Accordion title="Why ._ and not just a top-level attribute?">
|
<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">
|
||||||
|
|
||||||
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
|
Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
|
||||||
separation and makes it easier to ensure backwards compatibility. For example,
|
separation and makes it easier to ensure backwards compatibility. For example,
|
||||||
|
@ -437,7 +438,7 @@ immediately know what's built-in and what's custom – for example,
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
<Accordion title="How is the ._ implemented?">
|
<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">
|
||||||
|
|
||||||
Extension definitions – the defaults, methods, getters and setters you pass in
|
Extension definitions – the defaults, methods, getters and setters you pass in
|
||||||
to `set_extension` – are stored in class attributes on the `Underscore` class.
|
to `set_extension` – are stored in class attributes on the `Underscore` class.
|
||||||
|
@ -458,9 +459,7 @@ There are three main types of extensions, which can be defined using the
|
||||||
1. **Attribute extensions.** Set a default value for an attribute, which can be
|
1. **Attribute extensions.** Set a default value for an attribute, which can be
|
||||||
overwritten manually at any time. Attribute extensions work like "normal"
|
overwritten manually at any time. Attribute extensions work like "normal"
|
||||||
variables and are the quickest way to store arbitrary information on a `Doc`,
|
variables and are the quickest way to store arbitrary information on a `Doc`,
|
||||||
`Span` or `Token`. Attribute defaults behaves just like argument defaults
|
`Span` or `Token`.
|
||||||
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments),
|
|
||||||
and should not be used for mutable values like dictionaries or lists.
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
Doc.set_extension("hello", default=True)
|
Doc.set_extension("hello", default=True)
|
||||||
|
@ -527,25 +526,6 @@ Once you've registered your custom attribute, you can also use the built-in
|
||||||
especially useful if you want to pass in a string instead of calling
|
especially useful if you want to pass in a string instead of calling
|
||||||
`doc._.my_attr`.
|
`doc._.my_attr`.
|
||||||
|
|
||||||
<Infobox title="Using mutable default values" variant="danger">
|
|
||||||
|
|
||||||
When using **mutable values** like dictionaries or lists as the `default`
|
|
||||||
argument, keep in mind that they behave just like mutable default arguments
|
|
||||||
[in Python functions](http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments).
|
|
||||||
This can easily cause unintended results, like the same value being set on _all_
|
|
||||||
objects instead of only one particular instance. In most cases, it's better to
|
|
||||||
use **getters and setters**, and only set the `default` for boolean or string
|
|
||||||
values.
|
|
||||||
|
|
||||||
```diff
|
|
||||||
+ Doc.set_extension('fruits', getter=get_fruits, setter=set_fruits)
|
|
||||||
|
|
||||||
- Doc.set_extension('fruits', default={})
|
|
||||||
- doc._.fruits['apple'] = u'🍎' # all docs now have {'apple': u'🍎'}
|
|
||||||
```
|
|
||||||
|
|
||||||
</Infobox>
|
|
||||||
|
|
||||||
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
|
### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}
|
||||||
|
|
||||||
This example shows the implementation of a pipeline component that fetches
|
This example shows the implementation of a pipeline component that fetches
|
||||||
|
|
|
@ -15,7 +15,7 @@ their relationships. This means you can easily access and analyze the
|
||||||
surrounding tokens, merge spans into single tokens or add entries to the named
|
surrounding tokens, merge spans into single tokens or add entries to the named
|
||||||
entities in `doc.ents`.
|
entities in `doc.ents`.
|
||||||
|
|
||||||
<Accordion title="Should I use rules or train a model?">
|
<Accordion title="Should I use rules or train a model?" id="rules-vs-model">
|
||||||
|
|
||||||
For complex tasks, it's usually better to train a statistical entity recognition
|
For complex tasks, it's usually better to train a statistical entity recognition
|
||||||
model. However, statistical models require training data, so for many
|
model. However, statistical models require training data, so for many
|
||||||
|
@ -41,7 +41,7 @@ on [rule-based entity recognition](#entityruler).
|
||||||
|
|
||||||
</Accordion>
|
</Accordion>
|
||||||
|
|
||||||
<Accordion title="When should I use the token matcher vs. the phrase matcher?">
|
<Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">
|
||||||
|
|
||||||
The `PhraseMatcher` is useful if you already have a large terminology list or
|
The `PhraseMatcher` is useful if you already have a large terminology list or
|
||||||
gazetteer consisting of single or multi-token phrases that you want to find
|
gazetteer consisting of single or multi-token phrases that you want to find
|
||||||
|
|
|
@ -22,7 +22,7 @@ the changes, see [this table](/usage/v2#incompat) and the notes on
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Serializing the pipeline
|
### Serializing the pipeline {#pipeline}
|
||||||
|
|
||||||
When serializing the pipeline, keep in mind that this will only save out the
|
When serializing the pipeline, keep in mind that this will only save out the
|
||||||
**binary data for the individual components** to allow spaCy to restore them –
|
**binary data for the individual components** to allow spaCy to restore them –
|
||||||
|
@ -361,7 +361,7 @@ In theory, the entry point mechanism also lets you overwrite built-in factories
|
||||||
– including the tokenizer. By default, spaCy will output a warning in these
|
– including the tokenizer. By default, spaCy will output a warning in these
|
||||||
cases, to prevent accidental overwrites and unintended results.
|
cases, to prevent accidental overwrites and unintended results.
|
||||||
|
|
||||||
#### Advanced components with settings
|
#### Advanced components with settings {#advanced-cfg}
|
||||||
|
|
||||||
The `**cfg` keyword arguments that the factory receives are passed down all the
|
The `**cfg` keyword arguments that the factory receives are passed down all the
|
||||||
way from `spacy.load`. This means that the factory can respond to custom
|
way from `spacy.load`. This means that the factory can respond to custom
|
||||||
|
|
|
@ -50,7 +50,7 @@ systems, or to pre-process text for **deep learning**.
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<Infobox title="Table of contents">
|
<Infobox title="Table of contents" id="toc">
|
||||||
|
|
||||||
- [Features](#features)
|
- [Features](#features)
|
||||||
- [Linguistic annotations](#annotations)
|
- [Linguistic annotations](#annotations)
|
||||||
|
|
|
@ -14,7 +14,7 @@ faster runtime, and many bug fixes, v2.1 also introduces experimental support
|
||||||
for some exciting new NLP innovations. For the full changelog, see the
|
for some exciting new NLP innovations. For the full changelog, see the
|
||||||
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).
|
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.1.0).
|
||||||
|
|
||||||
### BERT/ULMFit/Elmo-style pre-training {tag="experimental"}
|
### BERT/ULMFit/Elmo-style pre-training {#pretraining tag="experimental"}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -39,7 +39,7 @@ it.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Extended match pattern API
|
### Extended match pattern API {#matcher-api}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -67,7 +67,7 @@ values.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Easy rule-based entity recognition
|
### Easy rule-based entity recognition {#entity-ruler}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -91,7 +91,7 @@ flexibility.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Phrase matching with other attributes
|
### Phrase matching with other attributes {#phrasematcher}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -115,7 +115,7 @@ or `POS` for finding sequences of the same part-of-speech tags.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Retokenizer for merging and splitting
|
### Retokenizer for merging and splitting {#retokenizer}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -142,7 +142,7 @@ deprecated.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Components and languages via entry points
|
### Components and languages via entry points {#entry-points}
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
@ -169,7 +169,7 @@ is required.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
### Improved documentation
|
### Improved documentation {#docs}
|
||||||
|
|
||||||
Although it looks pretty much the same, we've rebuilt the entire documentation
|
Although it looks pretty much the same, we've rebuilt the entire documentation
|
||||||
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
|
using [Gatsby](https://www.gatsbyjs.org/) and [MDX](https://mdxjs.com/). It's
|
||||||
|
@ -237,6 +237,19 @@ if all of your models are up to date, you can run the
|
||||||
+ retokenizer.merge(doc[6:8])
|
+ retokenizer.merge(doc[6:8])
|
||||||
```
|
```
|
||||||
|
|
||||||
|
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes`
|
||||||
|
now support a single `exclude` argument to provide a list of string names to
|
||||||
|
exclude. The docs have been updated to list the available serialization fields
|
||||||
|
for each class. The `disable` argument on the [`Language`](/api/language)
|
||||||
|
serialization methods has been renamed to `exclude` for consistency.
|
||||||
|
|
||||||
|
```diff
|
||||||
|
- nlp.to_disk("/path", disable=["parser", "ner"])
|
||||||
|
+ nlp.to_disk("/path", exclude=["parser", "ner"])
|
||||||
|
- data = nlp.tokenizer.to_bytes(vocab=False)
|
||||||
|
+ data = nlp.tokenizer.to_bytes(exclude=["vocab"])
|
||||||
|
```
|
||||||
|
|
||||||
- For better compatibility with the Universal Dependencies data, the lemmatizer
|
- For better compatibility with the Universal Dependencies data, the lemmatizer
|
||||||
now preserves capitalization, e.g. for proper nouns. See
|
now preserves capitalization, e.g. for proper nouns. See
|
||||||
[this issue](https://github.com/explosion/spaCy/issues/3256) for details.
|
[this issue](https://github.com/explosion/spaCy/issues/3256) for details.
|
||||||
|
|
|
@ -39,7 +39,7 @@ also add your own custom attributes, properties and methods to the `Doc`,
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<Infobox title="Table of Contents">
|
<Infobox title="Table of Contents" id="toc">
|
||||||
|
|
||||||
- [Summary](#summary)
|
- [Summary](#summary)
|
||||||
- [New features](#features)
|
- [New features](#features)
|
||||||
|
|
|
@ -75,7 +75,7 @@ arcs.
|
||||||
| `font` | unicode | Font name or font family for all text. | `"Arial"` |
|
| `font` | unicode | Font name or font family for all text. | `"Arial"` |
|
||||||
|
|
||||||
For a list of all available options, see the
|
For a list of all available options, see the
|
||||||
[`displacy` API documentation](top-level#displacy_options).
|
[`displacy` API documentation](/api/top-level#displacy_options).
|
||||||
|
|
||||||
> #### Options example
|
> #### Options example
|
||||||
>
|
>
|
||||||
|
@ -283,7 +283,7 @@ from pathlib import Path
|
||||||
nlp = spacy.load("en_core_web_sm")
|
nlp = spacy.load("en_core_web_sm")
|
||||||
sentences = [u"This is an example.", u"This is another one."]
|
sentences = [u"This is an example.", u"This is another one."]
|
||||||
for sent in sentences:
|
for sent in sentences:
|
||||||
doc = nlp(sentence)
|
doc = nlp(sent)
|
||||||
svg = displacy.render(doc, style="dep")
|
svg = displacy.render(doc, style="dep")
|
||||||
file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
|
file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
|
||||||
output_path = Path("/images/" + file_name)
|
output_path = Path("/images/" + file_name)
|
||||||
|
|
|
@ -23,6 +23,11 @@
|
||||||
"list": "89ad33e698"
|
"list": "89ad33e698"
|
||||||
},
|
},
|
||||||
"docSearch": {
|
"docSearch": {
|
||||||
|
"apiKey": "f7dbcd148fae73db20b6ad33d03cc9e8",
|
||||||
|
"indexName": "dev_spacy_netlify",
|
||||||
|
"appId": "Y7BGGRAPHC"
|
||||||
|
},
|
||||||
|
"_docSearch": {
|
||||||
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
|
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
|
||||||
"indexName": "spacy"
|
"indexName": "spacy"
|
||||||
},
|
},
|
||||||
|
|
|
@ -524,6 +524,22 @@
|
||||||
},
|
},
|
||||||
"category": ["standalone", "research"]
|
"category": ["standalone", "research"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"id": "scispacy",
|
||||||
|
"title": "scispaCy",
|
||||||
|
"slogan": "A full spaCy pipeline and models for scientific/biomedical documents",
|
||||||
|
"github": "allenai/scispacy",
|
||||||
|
"pip": "scispacy",
|
||||||
|
"thumb": "https://i.imgur.com/dJQSclW.png",
|
||||||
|
"url": "https://allenai.github.io/scispacy/",
|
||||||
|
"author": " Allen Institute for Artificial Intelligence",
|
||||||
|
"author_links": {
|
||||||
|
"github": "allenai",
|
||||||
|
"twitter": "allenai_org",
|
||||||
|
"website": "http://allenai.org"
|
||||||
|
},
|
||||||
|
"category": ["models", "research"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "textacy",
|
"id": "textacy",
|
||||||
"slogan": "NLP, before and after spaCy",
|
"slogan": "NLP, before and after spaCy",
|
||||||
|
@ -851,6 +867,22 @@
|
||||||
},
|
},
|
||||||
"category": ["courses"]
|
"category": ["courses"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"type": "education",
|
||||||
|
"id": "datacamp-advanced-nlp",
|
||||||
|
"title": "Advanced Natural Language Processing with spaCy",
|
||||||
|
"slogan": "Datacamp, 2019",
|
||||||
|
"description": "If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other? In this course, you'll learn how to use spaCy, a fast-growing industry standard library for NLP in Python, to build advanced natural language understanding systems, using both rule-based and machine learning approaches.",
|
||||||
|
"url": "https://www.datacamp.com/courses/advanced-nlp-with-spacy",
|
||||||
|
"thumb": "https://i.imgur.com/0Zks7c0.jpg",
|
||||||
|
"author": "Ines Montani",
|
||||||
|
"author_links": {
|
||||||
|
"twitter": "_inesmontani",
|
||||||
|
"github": "ines",
|
||||||
|
"website": "https://ines.io"
|
||||||
|
},
|
||||||
|
"category": ["courses"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"type": "education",
|
"type": "education",
|
||||||
"id": "learning-path-spacy",
|
"id": "learning-path-spacy",
|
||||||
|
@ -910,6 +942,7 @@
|
||||||
"description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
|
"description": "Most NLP projects rely crucially on the quality of annotations used for training and evaluating models. In this episode, Matt and Ines of Explosion AI tell us how Prodigy can improve data annotation and model development workflows. Prodigy is an annotation tool implemented as a python library, and it comes with a web application and a command line interface. A developer can define input data streams and design simple annotation interfaces. Prodigy can help break down complex annotation decisions into a series of binary decisions, and it provides easy integration with spaCy models. Developers can specify how models should be modified as new annotations come in in an active learning framework.",
|
||||||
"soundcloud": "559200912",
|
"soundcloud": "559200912",
|
||||||
"thumb": "https://i.imgur.com/hOBQEzc.jpg",
|
"thumb": "https://i.imgur.com/hOBQEzc.jpg",
|
||||||
|
"url": "https://soundcloud.com/nlp-highlights/78-where-do-corpora-come-from-with-matt-honnibal-and-ines-montani",
|
||||||
"author": "Matt Gardner, Waleed Ammar (Allen AI)",
|
"author": "Matt Gardner, Waleed Ammar (Allen AI)",
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"website": "https://soundcloud.com/nlp-highlights"
|
"website": "https://soundcloud.com/nlp-highlights"
|
||||||
|
@ -925,12 +958,28 @@
|
||||||
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
|
"iframe": "https://www.pythonpodcast.com/wp-content/plugins/podlove-podcasting-plugin-for-wordpress/lib/modules/podlove_web_player/player_v4/dist/share.html?episode=https://www.pythonpodcast.com/?podlove_player4=176",
|
||||||
"iframe_height": 200,
|
"iframe_height": 200,
|
||||||
"thumb": "https://i.imgur.com/rpo6BuY.png",
|
"thumb": "https://i.imgur.com/rpo6BuY.png",
|
||||||
|
"url": "https://www.podcastinit.com/episode-87-spacy-with-matthew-honnibal/",
|
||||||
"author": "Tobias Macey",
|
"author": "Tobias Macey",
|
||||||
"author_links": {
|
"author_links": {
|
||||||
"website": "https://www.podcastinit.com"
|
"website": "https://www.podcastinit.com"
|
||||||
},
|
},
|
||||||
"category": ["podcasts"]
|
"category": ["podcasts"]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"type": "education",
|
||||||
|
"id": "talk-python-podcast",
|
||||||
|
"title": "Talk Python 202: Building a software business",
|
||||||
|
"slogan": "March 2019",
|
||||||
|
"description": "One core question around open source is how do you fund it? Well, there is always that PayPal donate button. But that's been a tremendous failure for many projects. Often the go-to answer is consulting. But what if you don't want to trade time for money? You could take things up a notch and change the equation, exchanging value for money. That's what Ines Montani and her co-founder did when they started Explosion AI with spaCy as the foundation.",
|
||||||
|
"thumb": "https://i.imgur.com/q1twuK8.png",
|
||||||
|
"url": "https://talkpython.fm/episodes/show/202/building-a-software-business",
|
||||||
|
"soundcloud": "588364857",
|
||||||
|
"author": "Michael Kennedy",
|
||||||
|
"author_links": {
|
||||||
|
"website": "https://talkpython.fm/"
|
||||||
|
},
|
||||||
|
"category": ["podcasts"]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"id": "adam_qas",
|
"id": "adam_qas",
|
||||||
"title": "ADAM: Question Answering System",
|
"title": "ADAM: Question Answering System",
|
||||||
|
|
828
website/package-lock.json
generated
|
@ -1833,9 +1833,9 @@
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"acorn": {
|
"acorn": {
|
||||||
"version": "6.1.0",
|
"version": "6.1.1",
|
||||||
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/acorn/-/acorn-6.1.1.tgz",
|
||||||
"integrity": "sha512-MW/FjM+IvU9CgBzjO3UIPCE2pyEwUsoFl+VGdczOPEdxfGFjuKny/gN54mOuX7Qxmb9Rg9MCn2oKiSUeW+pjrw=="
|
"integrity": "sha512-jPTiwtOxaHNaAPg/dmrJ/beuzLRnXtB0kQPQ8JpotKJgTB6rX6c8mlf315941pyjBSaPg8NHXS9fhP4u17DpGA=="
|
||||||
},
|
},
|
||||||
"acorn-dynamic-import": {
|
"acorn-dynamic-import": {
|
||||||
"version": "3.0.0",
|
"version": "3.0.0",
|
||||||
|
@ -5958,9 +5958,9 @@
|
||||||
"integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ="
|
"integrity": "sha1-G2HAViGQqN/2rjuyzwIAyhMLhtQ="
|
||||||
},
|
},
|
||||||
"eslint": {
|
"eslint": {
|
||||||
"version": "5.14.1",
|
"version": "5.15.1",
|
||||||
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.14.1.tgz",
|
"resolved": "https://registry.npmjs.org/eslint/-/eslint-5.15.1.tgz",
|
||||||
"integrity": "sha512-CyUMbmsjxedx8B0mr79mNOqetvkbij/zrXnFeK2zc3pGRn3/tibjiNAv/3UxFEyfMDjh+ZqTrJrEGBFiGfD5Og==",
|
"integrity": "sha512-NTcm6vQ+PTgN3UBsALw5BMhgO6i5EpIjQF/Xb5tIh3sk9QhrFafujUOczGz4J24JBlzWclSB9Vmx8d+9Z6bFCg==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"@babel/code-frame": "^7.0.0",
|
"@babel/code-frame": "^7.0.0",
|
||||||
"ajv": "^6.9.1",
|
"ajv": "^6.9.1",
|
||||||
|
@ -5968,7 +5968,7 @@
|
||||||
"cross-spawn": "^6.0.5",
|
"cross-spawn": "^6.0.5",
|
||||||
"debug": "^4.0.1",
|
"debug": "^4.0.1",
|
||||||
"doctrine": "^3.0.0",
|
"doctrine": "^3.0.0",
|
||||||
"eslint-scope": "^4.0.0",
|
"eslint-scope": "^4.0.2",
|
||||||
"eslint-utils": "^1.3.1",
|
"eslint-utils": "^1.3.1",
|
||||||
"eslint-visitor-keys": "^1.0.0",
|
"eslint-visitor-keys": "^1.0.0",
|
||||||
"espree": "^5.0.1",
|
"espree": "^5.0.1",
|
||||||
|
@ -6001,9 +6001,9 @@
|
||||||
},
|
},
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"ajv": {
|
"ajv": {
|
||||||
"version": "6.9.2",
|
"version": "6.10.0",
|
||||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
|
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
|
||||||
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
|
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"fast-deep-equal": "^2.0.1",
|
"fast-deep-equal": "^2.0.1",
|
||||||
"fast-json-stable-stringify": "^2.0.0",
|
"fast-json-stable-stringify": "^2.0.0",
|
||||||
|
@ -6037,9 +6037,9 @@
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"eslint-scope": {
|
"eslint-scope": {
|
||||||
"version": "4.0.0",
|
"version": "4.0.2",
|
||||||
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/eslint-scope/-/eslint-scope-4.0.2.tgz",
|
||||||
"integrity": "sha512-1G6UTDi7Jc1ELFwnR58HV4fK9OQK4S6N985f166xqXxpjU6plxFISJa2Ba9KCQuFa8RCnj/lSFJbHo7UFDBnUA==",
|
"integrity": "sha512-5q1+B/ogmHl8+paxtOKx38Z8LtWkVGuNt3+GQNErqwLl6ViNp/gdJGMCjZNxZ8j/VYjDNZ2Fo+eQc1TAVPIzbg==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"esrecurse": "^4.1.0",
|
"esrecurse": "^4.1.0",
|
||||||
"estraverse": "^4.1.1"
|
"estraverse": "^4.1.1"
|
||||||
|
@ -6448,52 +6448,6 @@
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"expand-range": {
|
|
||||||
"version": "1.8.2",
|
|
||||||
"resolved": "http://registry.npmjs.org/expand-range/-/expand-range-1.8.2.tgz",
|
|
||||||
"integrity": "sha1-opnv/TNf4nIeuujiV+x5ZE/IUzc=",
|
|
||||||
"requires": {
|
|
||||||
"fill-range": "^2.1.0"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"fill-range": {
|
|
||||||
"version": "2.2.4",
|
|
||||||
"resolved": "https://registry.npmjs.org/fill-range/-/fill-range-2.2.4.tgz",
|
|
||||||
"integrity": "sha512-cnrcCbj01+j2gTG921VZPnHbjmdAf8oQV/iGeV2kZxGSyfYjjTyY79ErsK1WJWMpw6DaApEX72binqJE+/d+5Q==",
|
|
||||||
"requires": {
|
|
||||||
"is-number": "^2.1.0",
|
|
||||||
"isobject": "^2.0.0",
|
|
||||||
"randomatic": "^3.0.0",
|
|
||||||
"repeat-element": "^1.1.2",
|
|
||||||
"repeat-string": "^1.5.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"is-number": {
|
|
||||||
"version": "2.1.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-number/-/is-number-2.1.0.tgz",
|
|
||||||
"integrity": "sha1-Afy7s5NGOlSPL0ZszhbezknbkI8=",
|
|
||||||
"requires": {
|
|
||||||
"kind-of": "^3.0.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"isobject": {
|
|
||||||
"version": "2.1.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/isobject/-/isobject-2.1.0.tgz",
|
|
||||||
"integrity": "sha1-8GVWEJaj8dou9GJy+BXIQNh+DIk=",
|
|
||||||
"requires": {
|
|
||||||
"isarray": "1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"kind-of": {
|
|
||||||
"version": "3.2.2",
|
|
||||||
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
|
|
||||||
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
|
|
||||||
"requires": {
|
|
||||||
"is-buffer": "^1.1.5"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"expand-template": {
|
"expand-template": {
|
||||||
"version": "2.0.3",
|
"version": "2.0.3",
|
||||||
"resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz",
|
"resolved": "https://registry.npmjs.org/expand-template/-/expand-template-2.0.3.tgz",
|
||||||
|
@ -6818,11 +6772,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/file-uri-to-path/-/file-uri-to-path-1.0.0.tgz",
|
||||||
"integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw=="
|
"integrity": "sha512-0Zt+s3L7Vf1biwWZ29aARiVYLx7iMGnEUl9x33fbB/j3jR81u/O2LbqK+Bm1CDSNDKVtJ/YjwY7TUd5SkeLQLw=="
|
||||||
},
|
},
|
||||||
"filename-regex": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/filename-regex/-/filename-regex-2.0.1.tgz",
|
|
||||||
"integrity": "sha1-wcS5vuPglyXdsQa3XB4wH+LxiyY="
|
|
||||||
},
|
|
||||||
"filename-reserved-regex": {
|
"filename-reserved-regex": {
|
||||||
"version": "2.0.0",
|
"version": "2.0.0",
|
||||||
"resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/filename-reserved-regex/-/filename-reserved-regex-2.0.0.tgz",
|
||||||
|
@ -7130,468 +7079,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz",
|
||||||
"integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8="
|
"integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8="
|
||||||
},
|
},
|
||||||
"fsevents": {
|
|
||||||
"version": "1.2.4",
|
|
||||||
"resolved": "https://registry.npmjs.org/fsevents/-/fsevents-1.2.4.tgz",
|
|
||||||
"integrity": "sha512-z8H8/diyk76B7q5wg+Ud0+CqzcAF3mBBI/bA5ne5zrRUUIvNkJY//D3BqyH571KuAC4Nr7Rw7CjWX4r0y9DvNg==",
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"nan": "^2.9.2",
|
|
||||||
"node-pre-gyp": "^0.10.0"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"abbrev": {
|
|
||||||
"version": "1.1.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"ansi-regex": {
|
|
||||||
"version": "2.1.1",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"aproba": {
|
|
||||||
"version": "1.2.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"are-we-there-yet": {
|
|
||||||
"version": "1.1.4",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"delegates": "^1.0.0",
|
|
||||||
"readable-stream": "^2.0.6"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"balanced-match": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"brace-expansion": {
|
|
||||||
"version": "1.1.11",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"balanced-match": "^1.0.0",
|
|
||||||
"concat-map": "0.0.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"chownr": {
|
|
||||||
"version": "1.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"code-point-at": {
|
|
||||||
"version": "1.1.0",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"concat-map": {
|
|
||||||
"version": "0.0.1",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"console-control-strings": {
|
|
||||||
"version": "1.1.0",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"core-util-is": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"debug": {
|
|
||||||
"version": "2.6.9",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"ms": "2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"deep-extend": {
|
|
||||||
"version": "0.5.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"delegates": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"detect-libc": {
|
|
||||||
"version": "1.0.3",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"fs-minipass": {
|
|
||||||
"version": "1.2.5",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"minipass": "^2.2.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"fs.realpath": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"gauge": {
|
|
||||||
"version": "2.7.4",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"aproba": "^1.0.3",
|
|
||||||
"console-control-strings": "^1.0.0",
|
|
||||||
"has-unicode": "^2.0.0",
|
|
||||||
"object-assign": "^4.1.0",
|
|
||||||
"signal-exit": "^3.0.0",
|
|
||||||
"string-width": "^1.0.1",
|
|
||||||
"strip-ansi": "^3.0.1",
|
|
||||||
"wide-align": "^1.1.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"glob": {
|
|
||||||
"version": "7.1.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"fs.realpath": "^1.0.0",
|
|
||||||
"inflight": "^1.0.4",
|
|
||||||
"inherits": "2",
|
|
||||||
"minimatch": "^3.0.4",
|
|
||||||
"once": "^1.3.0",
|
|
||||||
"path-is-absolute": "^1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"has-unicode": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"iconv-lite": {
|
|
||||||
"version": "0.4.21",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"safer-buffer": "^2.1.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"ignore-walk": {
|
|
||||||
"version": "3.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"minimatch": "^3.0.4"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"inflight": {
|
|
||||||
"version": "1.0.6",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"once": "^1.3.0",
|
|
||||||
"wrappy": "1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"inherits": {
|
|
||||||
"version": "2.0.3",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"ini": {
|
|
||||||
"version": "1.3.5",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"is-fullwidth-code-point": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"number-is-nan": "^1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"isarray": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"minimatch": {
|
|
||||||
"version": "3.0.4",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"brace-expansion": "^1.1.7"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"minimist": {
|
|
||||||
"version": "0.0.8",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"minipass": {
|
|
||||||
"version": "2.2.4",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"safe-buffer": "^5.1.1",
|
|
||||||
"yallist": "^3.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"minizlib": {
|
|
||||||
"version": "1.1.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"minipass": "^2.2.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"mkdirp": {
|
|
||||||
"version": "0.5.1",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"minimist": "0.0.8"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"ms": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"needle": {
|
|
||||||
"version": "2.2.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"debug": "^2.1.2",
|
|
||||||
"iconv-lite": "^0.4.4",
|
|
||||||
"sax": "^1.2.4"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"node-pre-gyp": {
|
|
||||||
"version": "0.10.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"detect-libc": "^1.0.2",
|
|
||||||
"mkdirp": "^0.5.1",
|
|
||||||
"needle": "^2.2.0",
|
|
||||||
"nopt": "^4.0.1",
|
|
||||||
"npm-packlist": "^1.1.6",
|
|
||||||
"npmlog": "^4.0.2",
|
|
||||||
"rc": "^1.1.7",
|
|
||||||
"rimraf": "^2.6.1",
|
|
||||||
"semver": "^5.3.0",
|
|
||||||
"tar": "^4"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nopt": {
|
|
||||||
"version": "4.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"abbrev": "1",
|
|
||||||
"osenv": "^0.1.4"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"npm-bundled": {
|
|
||||||
"version": "1.0.3",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"npm-packlist": {
|
|
||||||
"version": "1.1.10",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"ignore-walk": "^3.0.1",
|
|
||||||
"npm-bundled": "^1.0.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"npmlog": {
|
|
||||||
"version": "4.1.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"are-we-there-yet": "~1.1.2",
|
|
||||||
"console-control-strings": "~1.1.0",
|
|
||||||
"gauge": "~2.7.3",
|
|
||||||
"set-blocking": "~2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"number-is-nan": {
|
|
||||||
"version": "1.0.1",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"object-assign": {
|
|
||||||
"version": "4.1.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"once": {
|
|
||||||
"version": "1.4.0",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"wrappy": "1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"os-homedir": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"os-tmpdir": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"osenv": {
|
|
||||||
"version": "0.1.5",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"os-homedir": "^1.0.0",
|
|
||||||
"os-tmpdir": "^1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"path-is-absolute": {
|
|
||||||
"version": "1.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"process-nextick-args": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"rc": {
|
|
||||||
"version": "1.2.7",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"deep-extend": "^0.5.1",
|
|
||||||
"ini": "~1.3.0",
|
|
||||||
"minimist": "^1.2.0",
|
|
||||||
"strip-json-comments": "~2.0.1"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"minimist": {
|
|
||||||
"version": "1.2.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"readable-stream": {
|
|
||||||
"version": "2.3.6",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"core-util-is": "~1.0.0",
|
|
||||||
"inherits": "~2.0.3",
|
|
||||||
"isarray": "~1.0.0",
|
|
||||||
"process-nextick-args": "~2.0.0",
|
|
||||||
"safe-buffer": "~5.1.1",
|
|
||||||
"string_decoder": "~1.1.1",
|
|
||||||
"util-deprecate": "~1.0.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"rimraf": {
|
|
||||||
"version": "2.6.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"glob": "^7.0.5"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"safe-buffer": {
|
|
||||||
"version": "5.1.1",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"safer-buffer": {
|
|
||||||
"version": "2.1.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"sax": {
|
|
||||||
"version": "1.2.4",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"semver": {
|
|
||||||
"version": "5.5.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"set-blocking": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"signal-exit": {
|
|
||||||
"version": "3.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"string-width": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"code-point-at": "^1.0.0",
|
|
||||||
"is-fullwidth-code-point": "^1.0.0",
|
|
||||||
"strip-ansi": "^3.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"string_decoder": {
|
|
||||||
"version": "1.1.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"safe-buffer": "~5.1.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"strip-ansi": {
|
|
||||||
"version": "3.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"requires": {
|
|
||||||
"ansi-regex": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"strip-json-comments": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"tar": {
|
|
||||||
"version": "4.4.1",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"chownr": "^1.0.1",
|
|
||||||
"fs-minipass": "^1.2.5",
|
|
||||||
"minipass": "^2.2.4",
|
|
||||||
"minizlib": "^1.1.0",
|
|
||||||
"mkdirp": "^0.5.0",
|
|
||||||
"safe-buffer": "^5.1.1",
|
|
||||||
"yallist": "^3.0.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"util-deprecate": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true
|
|
||||||
},
|
|
||||||
"wide-align": {
|
|
||||||
"version": "1.1.2",
|
|
||||||
"bundled": true,
|
|
||||||
"optional": true,
|
|
||||||
"requires": {
|
|
||||||
"string-width": "^1.0.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"wrappy": {
|
|
||||||
"version": "1.0.2",
|
|
||||||
"bundled": true
|
|
||||||
},
|
|
||||||
"yallist": {
|
|
||||||
"version": "3.0.2",
|
|
||||||
"bundled": true
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"fstream": {
|
"fstream": {
|
||||||
"version": "1.0.11",
|
"version": "1.0.11",
|
||||||
"resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz",
|
"resolved": "https://registry.npmjs.org/fstream/-/fstream-1.0.11.tgz",
|
||||||
|
@ -8322,14 +7809,14 @@
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"gatsby-source-filesystem": {
|
"gatsby-source-filesystem": {
|
||||||
"version": "2.0.20",
|
"version": "2.0.24",
|
||||||
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.20.tgz",
|
"resolved": "https://registry.npmjs.org/gatsby-source-filesystem/-/gatsby-source-filesystem-2.0.24.tgz",
|
||||||
"integrity": "sha512-nS2hBsqKEQIJ5Yd+g9p++FcsfmvbQmZlBUzx04VPBYZBu2LuLA/ZxQkmdiTNnbDQ18KJw0Zu2PnmUerPnEMqyg==",
|
"integrity": "sha512-KzyHzuXni9hOiZFDgeoH5ABJZqb59fSJNGr2C4U6B1AlGXFMucFK45Fh3V8axtpi833bIbCb9rGmK+tvL4Qb1w==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"@babel/runtime": "^7.0.0",
|
"@babel/runtime": "^7.0.0",
|
||||||
"better-queue": "^3.8.7",
|
"better-queue": "^3.8.7",
|
||||||
"bluebird": "^3.5.0",
|
"bluebird": "^3.5.0",
|
||||||
"chokidar": "^1.7.0",
|
"chokidar": "^2.1.2",
|
||||||
"file-type": "^10.2.0",
|
"file-type": "^10.2.0",
|
||||||
"fs-extra": "^5.0.0",
|
"fs-extra": "^5.0.0",
|
||||||
"got": "^7.1.0",
|
"got": "^7.1.0",
|
||||||
|
@ -8343,83 +7830,6 @@
|
||||||
"xstate": "^3.1.0"
|
"xstate": "^3.1.0"
|
||||||
},
|
},
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"anymatch": {
|
|
||||||
"version": "1.3.2",
|
|
||||||
"resolved": "https://registry.npmjs.org/anymatch/-/anymatch-1.3.2.tgz",
|
|
||||||
"integrity": "sha512-0XNayC8lTHQ2OI8aljNCN3sSx6hsr/1+rlcDAotXJR7C1oZZHCNsfpbKwMjRA3Uqb5tF1Rae2oloTr4xpq+WjA==",
|
|
||||||
"requires": {
|
|
||||||
"micromatch": "^2.1.5",
|
|
||||||
"normalize-path": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"arr-diff": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/arr-diff/-/arr-diff-2.0.0.tgz",
|
|
||||||
"integrity": "sha1-jzuCf5Vai9ZpaX5KQlasPOrjVs8=",
|
|
||||||
"requires": {
|
|
||||||
"arr-flatten": "^1.0.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"array-unique": {
|
|
||||||
"version": "0.2.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/array-unique/-/array-unique-0.2.1.tgz",
|
|
||||||
"integrity": "sha1-odl8yvy8JiXMcPrc6zalDFiwGlM="
|
|
||||||
},
|
|
||||||
"braces": {
|
|
||||||
"version": "1.8.5",
|
|
||||||
"resolved": "https://registry.npmjs.org/braces/-/braces-1.8.5.tgz",
|
|
||||||
"integrity": "sha1-uneWLhLf+WnWt2cR6RS3N4V79qc=",
|
|
||||||
"requires": {
|
|
||||||
"expand-range": "^1.8.1",
|
|
||||||
"preserve": "^0.2.0",
|
|
||||||
"repeat-element": "^1.1.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"chokidar": {
|
|
||||||
"version": "1.7.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/chokidar/-/chokidar-1.7.0.tgz",
|
|
||||||
"integrity": "sha1-eY5ol3gVHIB2tLNg5e3SjNortGg=",
|
|
||||||
"requires": {
|
|
||||||
"anymatch": "^1.3.0",
|
|
||||||
"async-each": "^1.0.0",
|
|
||||||
"fsevents": "^1.0.0",
|
|
||||||
"glob-parent": "^2.0.0",
|
|
||||||
"inherits": "^2.0.1",
|
|
||||||
"is-binary-path": "^1.0.0",
|
|
||||||
"is-glob": "^2.0.0",
|
|
||||||
"path-is-absolute": "^1.0.0",
|
|
||||||
"readdirp": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"expand-brackets": {
|
|
||||||
"version": "0.1.5",
|
|
||||||
"resolved": "https://registry.npmjs.org/expand-brackets/-/expand-brackets-0.1.5.tgz",
|
|
||||||
"integrity": "sha1-3wcoTjQqgHzXM6xa9yQR5YHRF3s=",
|
|
||||||
"requires": {
|
|
||||||
"is-posix-bracket": "^0.1.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"extglob": {
|
|
||||||
"version": "0.3.2",
|
|
||||||
"resolved": "https://registry.npmjs.org/extglob/-/extglob-0.3.2.tgz",
|
|
||||||
"integrity": "sha1-Lhj/PS9JqydlzskCPwEdqo2DSaE=",
|
|
||||||
"requires": {
|
|
||||||
"is-extglob": "^1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"file-type": {
|
|
||||||
"version": "10.7.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/file-type/-/file-type-10.7.1.tgz",
|
|
||||||
"integrity": "sha512-kUc4EE9q3MH6kx70KumPOvXLZLEJZzY9phEVg/bKWyGZ+OA9KoKZzFR4HS0yDmNv31sJkdf4hbTERIfplF9OxQ=="
|
|
||||||
},
|
|
||||||
"glob-parent": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
|
|
||||||
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
|
|
||||||
"requires": {
|
|
||||||
"is-glob": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"got": {
|
"got": {
|
||||||
"version": "7.1.0",
|
"version": "7.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/got/-/got-7.1.0.tgz",
|
||||||
|
@ -8441,47 +7851,6 @@
|
||||||
"url-to-options": "^1.0.1"
|
"url-to-options": "^1.0.1"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"is-extglob": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
|
||||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
|
||||||
},
|
|
||||||
"is-glob": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
|
||||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
|
||||||
"requires": {
|
|
||||||
"is-extglob": "^1.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"kind-of": {
|
|
||||||
"version": "3.2.2",
|
|
||||||
"resolved": "https://registry.npmjs.org/kind-of/-/kind-of-3.2.2.tgz",
|
|
||||||
"integrity": "sha1-MeohpzS6ubuw8yRm2JOupR5KPGQ=",
|
|
||||||
"requires": {
|
|
||||||
"is-buffer": "^1.1.5"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"micromatch": {
|
|
||||||
"version": "2.3.11",
|
|
||||||
"resolved": "https://registry.npmjs.org/micromatch/-/micromatch-2.3.11.tgz",
|
|
||||||
"integrity": "sha1-hmd8l9FyCzY0MdBNDRUpO9OMFWU=",
|
|
||||||
"requires": {
|
|
||||||
"arr-diff": "^2.0.0",
|
|
||||||
"array-unique": "^0.2.1",
|
|
||||||
"braces": "^1.8.2",
|
|
||||||
"expand-brackets": "^0.1.4",
|
|
||||||
"extglob": "^0.3.1",
|
|
||||||
"filename-regex": "^2.0.0",
|
|
||||||
"is-extglob": "^1.0.0",
|
|
||||||
"is-glob": "^2.0.1",
|
|
||||||
"kind-of": "^3.0.2",
|
|
||||||
"normalize-path": "^2.0.1",
|
|
||||||
"object.omit": "^2.0.0",
|
|
||||||
"parse-glob": "^3.0.4",
|
|
||||||
"regex-cache": "^0.4.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"pify": {
|
"pify": {
|
||||||
"version": "4.0.1",
|
"version": "4.0.1",
|
||||||
"resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz",
|
"resolved": "https://registry.npmjs.org/pify/-/pify-4.0.1.tgz",
|
||||||
|
@ -8493,12 +7862,12 @@
|
||||||
"integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74="
|
"integrity": "sha1-4mDHj2Fhzdmw5WzD4Khd4Xx6V74="
|
||||||
},
|
},
|
||||||
"read-chunk": {
|
"read-chunk": {
|
||||||
"version": "3.0.0",
|
"version": "3.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/read-chunk/-/read-chunk-3.1.0.tgz",
|
||||||
"integrity": "sha512-8lBUVPjj9TC5bKLBacB+rpexM03+LWiYbv6ma3BeWmUYXGxqA1WNNgIZHq/iIsCrbFMzPhFbkOqdsyOFRnuoXg==",
|
"integrity": "sha512-ZdiZJXXoZYE08SzZvTipHhI+ZW0FpzxmFtLI3vIeMuRN9ySbIZ+SZawKogqJ7dxW9fJ/W73BNtxu4Zu/bZp+Ng==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"pify": "^4.0.0",
|
"pify": "^4.0.1",
|
||||||
"with-open-file": "^0.1.3"
|
"with-open-file": "^0.1.5"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -8742,38 +8111,6 @@
|
||||||
"path-is-absolute": "^1.0.0"
|
"path-is-absolute": "^1.0.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"glob-base": {
|
|
||||||
"version": "0.3.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/glob-base/-/glob-base-0.3.0.tgz",
|
|
||||||
"integrity": "sha1-27Fk9iIbHAscz4Kuoyi0l98Oo8Q=",
|
|
||||||
"requires": {
|
|
||||||
"glob-parent": "^2.0.0",
|
|
||||||
"is-glob": "^2.0.0"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"glob-parent": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-2.0.0.tgz",
|
|
||||||
"integrity": "sha1-gTg9ctsFT8zPUzbaqQLxgvbtuyg=",
|
|
||||||
"requires": {
|
|
||||||
"is-glob": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"is-extglob": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
|
||||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
|
||||||
},
|
|
||||||
"is-glob": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
|
||||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
|
||||||
"requires": {
|
|
||||||
"is-extglob": "^1.0.0"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"glob-parent": {
|
"glob-parent": {
|
||||||
"version": "3.1.0",
|
"version": "3.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/glob-parent/-/glob-parent-3.1.0.tgz",
|
||||||
|
@ -10110,19 +9447,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz",
|
"resolved": "https://registry.npmjs.org/is-directory/-/is-directory-0.3.1.tgz",
|
||||||
"integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE="
|
"integrity": "sha1-YTObbyR1/Hcv2cnYP1yFddwVSuE="
|
||||||
},
|
},
|
||||||
"is-dotfile": {
|
|
||||||
"version": "1.0.3",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-dotfile/-/is-dotfile-1.0.3.tgz",
|
|
||||||
"integrity": "sha1-pqLzL/0t+wT1yiXs0Pa4PPeYoeE="
|
|
||||||
},
|
|
||||||
"is-equal-shallow": {
|
|
||||||
"version": "0.1.3",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-equal-shallow/-/is-equal-shallow-0.1.3.tgz",
|
|
||||||
"integrity": "sha1-IjgJj8Ih3gvPpdnqxMRdY4qhxTQ=",
|
|
||||||
"requires": {
|
|
||||||
"is-primitive": "^2.0.0"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"is-extendable": {
|
"is-extendable": {
|
||||||
"version": "0.1.1",
|
"version": "0.1.1",
|
||||||
"resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz",
|
"resolved": "https://registry.npmjs.org/is-extendable/-/is-extendable-0.1.1.tgz",
|
||||||
|
@ -10263,16 +9587,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/is-png/-/is-png-1.1.0.tgz",
|
||||||
"integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84="
|
"integrity": "sha1-1XSxK/J1wDUEVVcLDltXqwYgd84="
|
||||||
},
|
},
|
||||||
"is-posix-bracket": {
|
|
||||||
"version": "0.1.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-posix-bracket/-/is-posix-bracket-0.1.1.tgz",
|
|
||||||
"integrity": "sha1-MzTceXdDaOkvAW5vvAqI9c1ua8Q="
|
|
||||||
},
|
|
||||||
"is-primitive": {
|
|
||||||
"version": "2.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-primitive/-/is-primitive-2.0.0.tgz",
|
|
||||||
"integrity": "sha1-IHurkWOEmcB7Kt8kCkGochADRXU="
|
|
||||||
},
|
|
||||||
"is-promise": {
|
"is-promise": {
|
||||||
"version": "2.1.0",
|
"version": "2.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/is-promise/-/is-promise-2.1.0.tgz",
|
||||||
|
@ -11162,11 +10476,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz",
|
"resolved": "https://registry.npmjs.org/marked/-/marked-0.4.0.tgz",
|
||||||
"integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw=="
|
"integrity": "sha512-tMsdNBgOsrUophCAFQl0XPe6Zqk/uy9gnue+jIIKhykO51hxyu6uNx7zBPy0+y/WKYVZZMspV9YeXLNdKk+iYw=="
|
||||||
},
|
},
|
||||||
"math-random": {
|
|
||||||
"version": "1.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/math-random/-/math-random-1.0.1.tgz",
|
|
||||||
"integrity": "sha1-izqsWIuKZuSXXjzepn97sylgH6w="
|
|
||||||
},
|
|
||||||
"md-attr-parser": {
|
"md-attr-parser": {
|
||||||
"version": "1.2.1",
|
"version": "1.2.1",
|
||||||
"resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz",
|
"resolved": "https://registry.npmjs.org/md-attr-parser/-/md-attr-parser-1.2.1.tgz",
|
||||||
|
@ -12230,15 +11539,6 @@
|
||||||
"es-abstract": "^1.5.1"
|
"es-abstract": "^1.5.1"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"object.omit": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/object.omit/-/object.omit-2.0.1.tgz",
|
|
||||||
"integrity": "sha1-Gpx0SCnznbuFjHbKNXmuKlTr0fo=",
|
|
||||||
"requires": {
|
|
||||||
"for-own": "^0.1.4",
|
|
||||||
"is-extendable": "^0.1.1"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"object.pick": {
|
"object.pick": {
|
||||||
"version": "1.3.0",
|
"version": "1.3.0",
|
||||||
"resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz",
|
"resolved": "https://registry.npmjs.org/object.pick/-/object.pick-1.3.0.tgz",
|
||||||
|
@ -12579,32 +11879,6 @@
|
||||||
"path-root": "^0.1.1"
|
"path-root": "^0.1.1"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"parse-glob": {
|
|
||||||
"version": "3.0.4",
|
|
||||||
"resolved": "https://registry.npmjs.org/parse-glob/-/parse-glob-3.0.4.tgz",
|
|
||||||
"integrity": "sha1-ssN2z7EfNVE7rdFz7wu246OIORw=",
|
|
||||||
"requires": {
|
|
||||||
"glob-base": "^0.3.0",
|
|
||||||
"is-dotfile": "^1.0.0",
|
|
||||||
"is-extglob": "^1.0.0",
|
|
||||||
"is-glob": "^2.0.0"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"is-extglob": {
|
|
||||||
"version": "1.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-extglob/-/is-extglob-1.0.0.tgz",
|
|
||||||
"integrity": "sha1-rEaBd8SUNAWgkvyPKXYMb/xiBsA="
|
|
||||||
},
|
|
||||||
"is-glob": {
|
|
||||||
"version": "2.0.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-glob/-/is-glob-2.0.1.tgz",
|
|
||||||
"integrity": "sha1-0Jb5JqPe1WAPP9/ZEZjLCIjC2GM=",
|
|
||||||
"requires": {
|
|
||||||
"is-extglob": "^1.0.0"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"parse-headers": {
|
"parse-headers": {
|
||||||
"version": "2.0.1",
|
"version": "2.0.1",
|
||||||
"resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz",
|
"resolved": "https://registry.npmjs.org/parse-headers/-/parse-headers-2.0.1.tgz",
|
||||||
|
@ -14769,11 +14043,6 @@
|
||||||
"resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz",
|
"resolved": "https://registry.npmjs.org/prepend-http/-/prepend-http-1.0.4.tgz",
|
||||||
"integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw="
|
"integrity": "sha1-1PRWKwzjaW5BrFLQ4ALlemNdxtw="
|
||||||
},
|
},
|
||||||
"preserve": {
|
|
||||||
"version": "0.2.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/preserve/-/preserve-0.2.0.tgz",
|
|
||||||
"integrity": "sha1-gV7R9uvGWSb4ZbMQwHE7yzMVzks="
|
|
||||||
},
|
|
||||||
"prettier": {
|
"prettier": {
|
||||||
"version": "1.16.4",
|
"version": "1.16.4",
|
||||||
"resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz",
|
"resolved": "https://registry.npmjs.org/prettier/-/prettier-1.16.4.tgz",
|
||||||
|
@ -14982,23 +14251,6 @@
|
||||||
"resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz",
|
"resolved": "http://registry.npmjs.org/ramda/-/ramda-0.21.0.tgz",
|
||||||
"integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU="
|
"integrity": "sha1-oAGr7bP/YQd9T/HVd9RN536NCjU="
|
||||||
},
|
},
|
||||||
"randomatic": {
|
|
||||||
"version": "3.1.1",
|
|
||||||
"resolved": "https://registry.npmjs.org/randomatic/-/randomatic-3.1.1.tgz",
|
|
||||||
"integrity": "sha512-TuDE5KxZ0J461RVjrJZCJc+J+zCkTb1MbH9AQUq68sMhOMcy9jLcb3BrZKgp9q9Ncltdg4QVqWrH02W2EFFVYw==",
|
|
||||||
"requires": {
|
|
||||||
"is-number": "^4.0.0",
|
|
||||||
"kind-of": "^6.0.0",
|
|
||||||
"math-random": "^1.0.1"
|
|
||||||
},
|
|
||||||
"dependencies": {
|
|
||||||
"is-number": {
|
|
||||||
"version": "4.0.0",
|
|
||||||
"resolved": "https://registry.npmjs.org/is-number/-/is-number-4.0.0.tgz",
|
|
||||||
"integrity": "sha512-rSklcAIlf1OmFdyAqbnWTLVelsQ58uvZ66S/ZyawjWqIviTWCjg2PzVGw8WUA+nNuPTqb4wgA+NszrJ+08LlgQ=="
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"randombytes": {
|
"randombytes": {
|
||||||
"version": "2.1.0",
|
"version": "2.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz",
|
"resolved": "https://registry.npmjs.org/randombytes/-/randombytes-2.1.0.tgz",
|
||||||
|
@ -15458,14 +14710,6 @@
|
||||||
"private": "^0.1.6"
|
"private": "^0.1.6"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"regex-cache": {
|
|
||||||
"version": "0.4.4",
|
|
||||||
"resolved": "https://registry.npmjs.org/regex-cache/-/regex-cache-0.4.4.tgz",
|
|
||||||
"integrity": "sha512-nVIZwtCjkC9YgvWkpM55B5rBhBYRZhAaJbgcFYXXsHnbZ9UZI9nnVWYZpBlCqv9ho2eZryPnWrZGsOdPwVWXWQ==",
|
|
||||||
"requires": {
|
|
||||||
"is-equal-shallow": "^0.1.3"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"regex-not": {
|
"regex-not": {
|
||||||
"version": "1.0.2",
|
"version": "1.0.2",
|
||||||
"resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz",
|
"resolved": "https://registry.npmjs.org/regex-not/-/regex-not-1.0.2.tgz",
|
||||||
|
@ -17710,9 +16954,9 @@
|
||||||
},
|
},
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"ajv": {
|
"ajv": {
|
||||||
"version": "6.9.2",
|
"version": "6.10.0",
|
||||||
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.9.2.tgz",
|
"resolved": "https://registry.npmjs.org/ajv/-/ajv-6.10.0.tgz",
|
||||||
"integrity": "sha512-4UFy0/LgDo7Oa/+wOAlj44tp9K78u38E5/359eSrqEp1Z5PdVfimCcs7SluXMP755RUQu6d2b4AvF0R1C9RZjg==",
|
"integrity": "sha512-nffhOpkymDECQyR0mnsUtoCE8RlX38G0rYP+wgLWFyZuUyuuojSSvi/+euOiQBIn63whYwYVIIH1TvE3tu4OEg==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"fast-deep-equal": "^2.0.1",
|
"fast-deep-equal": "^2.0.1",
|
||||||
"fast-json-stable-stringify": "^2.0.0",
|
"fast-json-stable-stringify": "^2.0.0",
|
||||||
|
@ -17721,26 +16965,26 @@
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"ansi-regex": {
|
"ansi-regex": {
|
||||||
"version": "4.0.0",
|
"version": "4.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/ansi-regex/-/ansi-regex-4.1.0.tgz",
|
||||||
"integrity": "sha512-iB5Dda8t/UqpPI/IjsejXu5jOGDrzn41wJyljwPH65VCIbk6+1BzFIMJGFwTNrYXT1CrD+B4l19U7awiQ8rk7w=="
|
"integrity": "sha512-1apePfXM1UOSqw0o9IiFAovVz9M5S1Dg+4TrDwfMewQ6p/rmMueb7tWZjQ1rx4Loy1ArBggoqGpfqqdI4rondg=="
|
||||||
},
|
},
|
||||||
"string-width": {
|
"string-width": {
|
||||||
"version": "3.0.0",
|
"version": "3.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/string-width/-/string-width-3.1.0.tgz",
|
||||||
"integrity": "sha512-rr8CUxBbvOZDUvc5lNIJ+OC1nPVpz+Siw9VBtUjB9b6jZehZLFt0JMCZzShFHIsI8cbhm0EsNIfWJMFV3cu3Ew==",
|
"integrity": "sha512-vafcv6KjVZKSgz06oM/H6GDBrAtz8vdhQakGjFIvNrHA6y3HCF1CInLy+QLq8dTJPQ1b+KDUqDFctkdRW44e1w==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"emoji-regex": "^7.0.1",
|
"emoji-regex": "^7.0.1",
|
||||||
"is-fullwidth-code-point": "^2.0.0",
|
"is-fullwidth-code-point": "^2.0.0",
|
||||||
"strip-ansi": "^5.0.0"
|
"strip-ansi": "^5.1.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"strip-ansi": {
|
"strip-ansi": {
|
||||||
"version": "5.0.0",
|
"version": "5.1.0",
|
||||||
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.0.0.tgz",
|
"resolved": "https://registry.npmjs.org/strip-ansi/-/strip-ansi-5.1.0.tgz",
|
||||||
"integrity": "sha512-Uu7gQyZI7J7gn5qLn1Np3G9vcYGTVqB+lFTytnDJv83dd8T22aGH451P3jueT2/QemInJDfxHB5Tde5OzgG1Ow==",
|
"integrity": "sha512-TjxrkPONqO2Z8QDCpeE2j6n0M6EwxzyDgzEeGp+FbdvaJAt//ClYi6W5my+3ROlC/hZX2KACUwDfK49Ka5eDvg==",
|
||||||
"requires": {
|
"requires": {
|
||||||
"ansi-regex": "^4.0.0"
|
"ansi-regex": "^4.1.0"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
@ -12,7 +12,6 @@
|
||||||
"@mdx-js/tag": "^0.17.5",
|
"@mdx-js/tag": "^0.17.5",
|
||||||
"@phosphor/widgets": "^1.6.0",
|
"@phosphor/widgets": "^1.6.0",
|
||||||
"@rehooks/online-status": "^1.0.0",
|
"@rehooks/online-status": "^1.0.0",
|
||||||
"@sindresorhus/slugify": "^0.8.0",
|
|
||||||
"@svgr/webpack": "^4.1.0",
|
"@svgr/webpack": "^4.1.0",
|
||||||
"autoprefixer": "^9.4.7",
|
"autoprefixer": "^9.4.7",
|
||||||
"classnames": "^2.2.6",
|
"classnames": "^2.2.6",
|
||||||
|
@ -35,7 +34,7 @@
|
||||||
"gatsby-remark-prismjs": "^3.2.4",
|
"gatsby-remark-prismjs": "^3.2.4",
|
||||||
"gatsby-remark-smartypants": "^2.0.8",
|
"gatsby-remark-smartypants": "^2.0.8",
|
||||||
"gatsby-remark-unwrap-images": "^1.0.1",
|
"gatsby-remark-unwrap-images": "^1.0.1",
|
||||||
"gatsby-source-filesystem": "^2.0.20",
|
"gatsby-source-filesystem": "^2.0.24",
|
||||||
"gatsby-transformer-remark": "^2.2.5",
|
"gatsby-transformer-remark": "^2.2.5",
|
||||||
"gatsby-transformer-sharp": "^2.1.13",
|
"gatsby-transformer-sharp": "^2.1.13",
|
||||||
"html-to-react": "^1.3.4",
|
"html-to-react": "^1.3.4",
|
||||||
|
@ -62,7 +61,8 @@
|
||||||
"md-attr-parser": "^1.2.1",
|
"md-attr-parser": "^1.2.1",
|
||||||
"prettier": "^1.16.4",
|
"prettier": "^1.16.4",
|
||||||
"raw-loader": "^1.0.0",
|
"raw-loader": "^1.0.0",
|
||||||
"unist-util-visit": "^1.4.0"
|
"unist-util-visit": "^1.4.0",
|
||||||
|
"@sindresorhus/slugify": "^0.8.0"
|
||||||
},
|
},
|
||||||
"repository": {
|
"repository": {
|
||||||
"type": "git",
|
"type": "git",
|
||||||
|
|
|
@ -1,33 +1,38 @@
|
||||||
import React, { useState } from 'react'
|
import React, { useState, useEffect } from 'react'
|
||||||
import PropTypes from 'prop-types'
|
import PropTypes from 'prop-types'
|
||||||
import classNames from 'classnames'
|
import classNames from 'classnames'
|
||||||
import slugify from '@sindresorhus/slugify'
|
|
||||||
|
|
||||||
import Link from './link'
|
import Link from './link'
|
||||||
import classes from '../styles/accordion.module.sass'
|
import classes from '../styles/accordion.module.sass'
|
||||||
|
|
||||||
const Accordion = ({ title, id, expanded, children }) => {
|
const Accordion = ({ title, id, expanded, children }) => {
|
||||||
const anchorId = id ? id : slugify(title)
|
const [isExpanded, setIsExpanded] = useState(true)
|
||||||
const [isExpanded, setIsExpanded] = useState(expanded)
|
|
||||||
const contentClassNames = classNames(classes.content, {
|
const contentClassNames = classNames(classes.content, {
|
||||||
[classes.hidden]: !isExpanded,
|
[classes.hidden]: !isExpanded,
|
||||||
})
|
})
|
||||||
const iconClassNames = classNames({
|
const iconClassNames = classNames({
|
||||||
[classes.hidden]: isExpanded,
|
[classes.hidden]: isExpanded,
|
||||||
})
|
})
|
||||||
|
// Make sure accordion is expanded if JS is disabled
|
||||||
|
useEffect(() => setIsExpanded(expanded), [])
|
||||||
return (
|
return (
|
||||||
<section id={anchorId}>
|
<section className="accordion" id={id}>
|
||||||
<div className={classes.root}>
|
<div className={classes.root}>
|
||||||
<h3>
|
<h4>
|
||||||
<button
|
<button
|
||||||
className={classes.button}
|
className={classes.button}
|
||||||
aria-expanded={String(isExpanded)}
|
aria-expanded={String(isExpanded)}
|
||||||
onClick={() => setIsExpanded(!isExpanded)}
|
onClick={() => setIsExpanded(!isExpanded)}
|
||||||
>
|
>
|
||||||
<span>
|
<span>
|
||||||
{title}
|
<span className="heading-text">{title}</span>
|
||||||
{isExpanded && (
|
{isExpanded && !!id && (
|
||||||
<Link to={`#${anchorId}`} className={classes.anchor} hidden>
|
<Link
|
||||||
|
to={`#${id}`}
|
||||||
|
className={classes.anchor}
|
||||||
|
hidden
|
||||||
|
onClick={event => event.stopPropagation()}
|
||||||
|
>
|
||||||
¶
|
¶
|
||||||
</Link>
|
</Link>
|
||||||
)}
|
)}
|
||||||
|
@ -44,7 +49,7 @@ const Accordion = ({ title, id, expanded, children }) => {
|
||||||
<rect height={2} width={8} x={1} y={4} />
|
<rect height={2} width={8} x={1} y={4} />
|
||||||
</svg>
|
</svg>
|
||||||
</button>
|
</button>
|
||||||
</h3>
|
</h4>
|
||||||
<div className={contentClassNames}>{children}</div>
|
<div className={contentClassNames}>{children}</div>
|
||||||
</div>
|
</div>
|
||||||
</section>
|
</section>
|
||||||
|
|
|
@ -33,10 +33,11 @@ const GitHubCode = ({ url, lang, errorMsg, className }) => {
|
||||||
})
|
})
|
||||||
.catch(err => {
|
.catch(err => {
|
||||||
setCode(errorMsg)
|
setCode(errorMsg)
|
||||||
|
console.error(err)
|
||||||
})
|
})
|
||||||
setInitialized(true)
|
setInitialized(true)
|
||||||
}
|
}
|
||||||
}, [])
|
}, [initialized, rawUrl, errorMsg])
|
||||||
|
|
||||||
const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code)
|
const highlighted = lang === 'none' || !code ? code : highlightCode(lang, code)
|
||||||
|
|
||||||
|
|
|
@ -5,13 +5,13 @@ import classNames from 'classnames'
|
||||||
import Icon from './icon'
|
import Icon from './icon'
|
||||||
import classes from '../styles/infobox.module.sass'
|
import classes from '../styles/infobox.module.sass'
|
||||||
|
|
||||||
const Infobox = ({ title, variant, className, children }) => {
|
const Infobox = ({ title, id, variant, className, children }) => {
|
||||||
const infoboxClassNames = classNames(classes.root, className, {
|
const infoboxClassNames = classNames(classes.root, className, {
|
||||||
[classes.warning]: variant === 'warning',
|
[classes.warning]: variant === 'warning',
|
||||||
[classes.danger]: variant === 'danger',
|
[classes.danger]: variant === 'danger',
|
||||||
})
|
})
|
||||||
return (
|
return (
|
||||||
<aside className={infoboxClassNames}>
|
<aside className={infoboxClassNames} id={id}>
|
||||||
{title && (
|
{title && (
|
||||||
<h4 className={classes.title}>
|
<h4 className={classes.title}>
|
||||||
{variant !== 'default' && (
|
{variant !== 'default' && (
|
||||||
|
@ -31,6 +31,7 @@ Infobox.defaultProps = {
|
||||||
|
|
||||||
Infobox.propTypes = {
|
Infobox.propTypes = {
|
||||||
title: PropTypes.string,
|
title: PropTypes.string,
|
||||||
|
id: PropTypes.string,
|
||||||
variant: PropTypes.oneOf(['default', 'warning', 'danger']),
|
variant: PropTypes.oneOf(['default', 'warning', 'danger']),
|
||||||
className: PropTypes.string,
|
className: PropTypes.string,
|
||||||
children: PropTypes.node.isRequired,
|
children: PropTypes.node.isRequired,
|
||||||
|
|
|
@ -232,6 +232,7 @@ Juniper.defaultProps = {
|
||||||
theme: 'default',
|
theme: 'default',
|
||||||
isolateCells: true,
|
isolateCells: true,
|
||||||
useBinder: true,
|
useBinder: true,
|
||||||
|
storageKey: 'juniper',
|
||||||
useStorage: true,
|
useStorage: true,
|
||||||
storageExpire: 60,
|
storageExpire: 60,
|
||||||
debug: false,
|
debug: false,
|
||||||
|
|
|
@ -34,22 +34,19 @@ const Progress = () => {
|
||||||
setOffset(getOffset())
|
setOffset(getOffset())
|
||||||
}
|
}
|
||||||
|
|
||||||
useEffect(
|
useEffect(() => {
|
||||||
() => {
|
if (!initialized && progressRef.current) {
|
||||||
if (!initialized && progressRef.current) {
|
handleResize()
|
||||||
handleResize()
|
setInitialized(true)
|
||||||
setInitialized(true)
|
}
|
||||||
}
|
window.addEventListener('scroll', handleScroll)
|
||||||
window.addEventListener('scroll', handleScroll)
|
window.addEventListener('resize', handleResize)
|
||||||
window.addEventListener('resize', handleResize)
|
|
||||||
|
|
||||||
return () => {
|
return () => {
|
||||||
window.removeEventListener('scroll', handleScroll)
|
window.removeEventListener('scroll', handleScroll)
|
||||||
window.removeEventListener('resize', handleResize)
|
window.removeEventListener('resize', handleResize)
|
||||||
}
|
}
|
||||||
},
|
}, [initialized, progressRef])
|
||||||
[progressRef]
|
|
||||||
)
|
|
||||||
|
|
||||||
const { height, vh } = offset
|
const { height, vh } = offset
|
||||||
const total = 100 - ((height - scrollY - vh) / height) * 100
|
const total = 100 - ((height - scrollY - vh) / height) * 100
|
||||||
|
|
|
@ -8,6 +8,12 @@ import Icon from './icon'
|
||||||
import { H2 } from './typography'
|
import { H2 } from './typography'
|
||||||
import classes from '../styles/quickstart.module.sass'
|
import classes from '../styles/quickstart.module.sass'
|
||||||
|
|
||||||
|
function getNewChecked(optionId, checkedForId, multiple) {
|
||||||
|
if (!multiple) return [optionId]
|
||||||
|
if (checkedForId.includes(optionId)) return checkedForId.filter(opt => opt !== optionId)
|
||||||
|
return [...checkedForId, optionId]
|
||||||
|
}
|
||||||
|
|
||||||
const Quickstart = ({ data, title, description, id, children }) => {
|
const Quickstart = ({ data, title, description, id, children }) => {
|
||||||
const [styles, setStyles] = useState({})
|
const [styles, setStyles] = useState({})
|
||||||
const [checked, setChecked] = useState({})
|
const [checked, setChecked] = useState({})
|
||||||
|
@ -38,13 +44,13 @@ const Quickstart = ({ data, title, description, id, children }) => {
|
||||||
setStyles(initialStyles)
|
setStyles(initialStyles)
|
||||||
setInitialized(true)
|
setInitialized(true)
|
||||||
}
|
}
|
||||||
})
|
}, [data, initialized])
|
||||||
|
|
||||||
return !data.length ? null : (
|
return !data.length ? null : (
|
||||||
<Section id={id}>
|
<Section id={id}>
|
||||||
<div className={classes.root}>
|
<div className={classes.root}>
|
||||||
{title && (
|
{title && (
|
||||||
<H2 className={classes.title}>
|
<H2 className={classes.title} name={id}>
|
||||||
<a href={`#${id}`}>{title}</a>
|
<a href={`#${id}`}>{title}</a>
|
||||||
</H2>
|
</H2>
|
||||||
)}
|
)}
|
||||||
|
@ -76,13 +82,11 @@ const Quickstart = ({ data, title, description, id, children }) => {
|
||||||
onChange={() => {
|
onChange={() => {
|
||||||
const newChecked = {
|
const newChecked = {
|
||||||
...checked,
|
...checked,
|
||||||
[id]: !multiple
|
[id]: getNewChecked(
|
||||||
? [option.id]
|
option.id,
|
||||||
: checkedForId.includes(option.id)
|
checkedForId,
|
||||||
? checkedForId.filter(
|
multiple
|
||||||
opt => opt !== option.id
|
),
|
||||||
)
|
|
||||||
: [...checkedForId, option.id],
|
|
||||||
}
|
}
|
||||||
setChecked(newChecked)
|
setChecked(newChecked)
|
||||||
setStyles({
|
setStyles({
|
||||||
|
|
|
@ -6,19 +6,20 @@ import Icon from './icon'
|
||||||
import classes from '../styles/search.module.sass'
|
import classes from '../styles/search.module.sass'
|
||||||
|
|
||||||
const Search = ({ id, placeholder, settings }) => {
|
const Search = ({ id, placeholder, settings }) => {
|
||||||
const { apiKey, indexName } = settings
|
const { apiKey, indexName, appId } = settings
|
||||||
const [isInitialized, setIsInitialized] = useState(false)
|
const [initialized, setInitialized] = useState(false)
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
if (!isInitialized) {
|
if (!initialized) {
|
||||||
setIsInitialized(true)
|
setInitialized(true)
|
||||||
window.docsearch({
|
window.docsearch({
|
||||||
|
appId,
|
||||||
apiKey,
|
apiKey,
|
||||||
indexName,
|
indexName,
|
||||||
inputSelector: `#${id}`,
|
inputSelector: `#${id}`,
|
||||||
debug: false,
|
debug: false,
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
}, window.docsearch)
|
}, [initialized, apiKey, indexName, id])
|
||||||
return (
|
return (
|
||||||
<form className={classes.root}>
|
<form className={classes.root}>
|
||||||
<label htmlFor={id} className={classes.icon}>
|
<label htmlFor={id} className={classes.icon}>
|
||||||
|
|
Some files were not shown because too many files have changed in this diff