Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-29 18:54:07 +03:00)

Commit 353f8486f5: Merge branch 'master' into spacy.io
106
.github/contributors/dhpollack.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | David Pollack |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | Mar 5. 2020 |
|
||||
| GitHub username | dhpollack |
|
||||
| Website (optional) | |
|
89
.github/contributors/mabraham.md
vendored
Normal file
|
@ -0,0 +1,89 @@
|
|||
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | |
|
||||
| GitHub username | |
|
||||
| Website (optional) | |
|
7
.gitignore
vendored
|
@ -5,6 +5,11 @@ corpora/
|
|||
keys/
|
||||
*.json.gz
|
||||
|
||||
# Tests
|
||||
spacy/tests/package/setup.cfg
|
||||
spacy/tests/package/pyproject.toml
|
||||
spacy/tests/package/requirements.txt
|
||||
|
||||
# Website
|
||||
website/.cache/
|
||||
website/public/
|
||||
|
@ -40,6 +45,7 @@ __pycache__/
|
|||
.~env/
|
||||
.venv
|
||||
venv/
|
||||
env3.*/
|
||||
.dev
|
||||
.denv
|
||||
.pypyenv
|
||||
|
@ -56,6 +62,7 @@ lib64/
|
|||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheelhouse/
|
||||
*.egg-info/
|
||||
pip-wheel-metadata/
|
||||
Pipfile.lock
|
||||
|
|
47
Makefile
|
@ -1,28 +1,37 @@
|
|||
SHELL := /bin/bash
|
||||
sha = $(shell "git" "rev-parse" "--short" "HEAD")
|
||||
version = $(shell "bin/get-version.sh")
|
||||
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
|
||||
PYVER := 3.6
|
||||
VENV := ./env$(PYVER)
|
||||
|
||||
dist/spacy.pex : dist/spacy-$(sha).pex
|
||||
cp dist/spacy-$(sha).pex dist/spacy.pex
|
||||
chmod a+rx dist/spacy.pex
|
||||
version := $(shell "bin/get-version.sh")
|
||||
|
||||
dist/spacy-$(sha).pex : dist/$(wheel)
|
||||
env3.6/bin/python -m pip install pex==1.5.3
|
||||
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
|
||||
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
|
||||
chmod a+rx $@
|
||||
|
||||
dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
|
||||
python3.6 -m venv env3.6
|
||||
source env3.6/bin/activate
|
||||
env3.6/bin/pip install wheel
|
||||
env3.6/bin/pip install -r requirements.txt --no-cache-dir
|
||||
env3.6/bin/python setup.py build_ext --inplace
|
||||
env3.6/bin/python setup.py sdist
|
||||
env3.6/bin/python setup.py bdist_wheel
|
||||
dist/pytest.pex : wheelhouse/pytest-*.whl
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
|
||||
chmod a+rx $@
|
||||
|
||||
.PHONY : clean
|
||||
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
||||
$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
|
||||
touch $@
|
||||
|
||||
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
||||
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
|
||||
|
||||
$(VENV)/bin/pex :
|
||||
python$(PYVER) -m venv $(VENV)
|
||||
$(VENV)/bin/pip install -U pip setuptools pex wheel
|
||||
|
||||
.PHONY : clean test
|
||||
|
||||
test : dist/spacy-$(version).pex dist/pytest.pex
|
||||
( . $(VENV)/bin/activate ; \
|
||||
PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
|
||||
|
||||
clean : setup.py
|
||||
source env3.6/bin/activate
|
||||
rm -rf dist/*
|
||||
rm -rf ./wheelhouse
|
||||
rm -rf $(VENV)
|
||||
python setup.py clean --all
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
### Step 1: Create a Knowledge Base (KB) and training data
|
||||
|
||||
Run `wikipedia_pretrain_kb.py`
|
||||
Run `wikidata_pretrain_kb.py`
|
||||
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
|
||||
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
||||
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
||||
|
|
|
@ -1,3 +1,11 @@
|
|||
[build-system]
|
||||
requires = ["setuptools"]
|
||||
requires = [
|
||||
"setuptools",
|
||||
"wheel",
|
||||
"cython>=0.25",
|
||||
"cymem>=2.0.2,<2.1.0",
|
||||
"preshed>=3.0.2,<3.1.0",
|
||||
"murmurhash>=0.28.0,<1.1.0",
|
||||
"thinc==7.4.0",
|
||||
]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
|
|
@ -1,11 +1,11 @@
|
|||
# Our libraries
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
blis>=0.4.0,<0.5.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=1.0.2,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
|
|
|
@ -38,16 +38,16 @@ setup_requires =
|
|||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
install_requires =
|
||||
# Our libraries
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
blis>=0.4.0,<0.5.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=1.0.2,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third-party dependencies
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
|
@ -59,7 +59,7 @@ install_requires =
|
|||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
spacy_lookups_data>=0.0.5<0.2.0
|
||||
spacy_lookups_data>=0.0.5,<0.2.0
|
||||
cuda =
|
||||
cupy>=5.0.0b4
|
||||
cuda80 =
|
||||
|
|
|
@ -296,8 +296,7 @@ def link_vectors_to_models(vocab):
|
|||
key = (ops.device, vectors.name)
|
||||
if key in thinc.extra.load_nlp.VECTORS:
|
||||
if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
|
||||
# This is a hack to avoid the problem in #3853. Maybe we should
|
||||
# print a warning as well?
|
||||
# This is a hack to avoid the problem in #3853.
|
||||
old_name = vectors.name
|
||||
new_name = vectors.name + "_%d" % data.shape[0]
|
||||
user_warning(Warnings.W019.format(old=old_name, new=new_name))
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "2.2.3"
|
||||
__version__ = "2.2.4.dev0"
|
||||
__release__ = True
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
|
|
|
@ -23,24 +23,23 @@ BLANK_MODEL_THRESHOLD = 2000
|
|||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
lang=("model language", "positional", None, str),
|
||||
train_path=("location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("location of JSON-formatted development data", "positional", None, Path),
|
||||
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
|
||||
base_model=("name of model to update (optional)", "option", "b", str),
|
||||
pipeline=(
|
||||
"Comma-separated names of pipeline components to train",
|
||||
"option",
|
||||
"p",
|
||||
str,
|
||||
),
|
||||
pipeline=("Comma-separated names of pipeline components to train", "option", "p", str),
|
||||
ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool),
|
||||
verbose=("Print additional information and explanations", "flag", "V", bool),
|
||||
no_format=("Don't pretty-print the results", "flag", "NF", bool),
|
||||
# fmt: on
|
||||
)
|
||||
def debug_data(
|
||||
lang,
|
||||
train_path,
|
||||
dev_path,
|
||||
tag_map_path=None,
|
||||
base_model=None,
|
||||
pipeline="tagger,parser,ner",
|
||||
ignore_warnings=False,
|
||||
|
@ -60,6 +59,10 @@ def debug_data(
|
|||
if not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
|
||||
# Initialize the model and pipeline
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
if base_model:
|
||||
|
@ -67,6 +70,8 @@ def debug_data(
|
|||
else:
|
||||
lang_cls = get_lang_class(lang)
|
||||
nlp = lang_cls()
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
msg.divider("Data format validation")
|
||||
|
||||
|
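The new `tag_map_path` option shared by `debug_data` and `train` reads a JSON file with srsly and merges it into the language defaults via `nlp.vocab.morphology.tag_map.update(...)`, so corpus-specific tags can be mapped without patching the language class. A minimal sketch of the idea; the file name and the example entry are hypothetical, not part of this commit:

import srsly
import spacy

# Hypothetical file: a JSON object mapping corpus tag strings to attribute
# dicts. The entry shown here in the comment, {"NNML": {"pos": "NOUN"}}, is
# only an illustration of the shape, not a value used anywhere in spaCy.
tag_map = srsly.read_json("tag_map.json")

nlp = spacy.blank("en")
# Mirrors what debug_data/train now do after building the pipeline:
nlp.vocab.morphology.tag_map.update(tag_map)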
@ -227,13 +232,17 @@ def debug_data(
|
|||
|
||||
if gold_train_data["ws_ents"]:
|
||||
msg.fail(
|
||||
"{} invalid whitespace entity span(s)".format(gold_train_data["ws_ents"])
|
||||
"{} invalid whitespace entity span(s)".format(
|
||||
gold_train_data["ws_ents"]
|
||||
)
|
||||
)
|
||||
has_ws_ents_error = True
|
||||
|
||||
if gold_train_data["punct_ents"]:
|
||||
msg.warn(
|
||||
"{} entity span(s) with punctuation".format(gold_train_data["punct_ents"])
|
||||
"{} entity span(s) with punctuation".format(
|
||||
gold_train_data["punct_ents"]
|
||||
)
|
||||
)
|
||||
has_punct_ents_warning = True
|
||||
|
||||
|
@ -344,7 +353,7 @@ def debug_data(
|
|||
if "tagger" in pipeline:
|
||||
msg.divider("Part-of-speech Tagging")
|
||||
labels = [label for label in gold_train_data["tags"]]
|
||||
tag_map = nlp.Defaults.tag_map
|
||||
tag_map = nlp.vocab.morphology.tag_map
|
||||
msg.info(
|
||||
"{} {} in data ({} {} in tag map)".format(
|
||||
len(labels),
|
||||
|
@ -584,7 +593,13 @@ def _compile_gold(train_docs, pipeline):
|
|||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||
# "Illegal" whitespace entity
|
||||
data["ws_ents"] += 1
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [".", "'", "!", "?", ","]:
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
|
||||
".",
|
||||
"'",
|
||||
"!",
|
||||
"?",
|
||||
",",
|
||||
]:
|
||||
# punctuation entity: could be replaced by whitespace when training with noise,
|
||||
# so add a warning to alert the user to this unexpected side effect.
|
||||
data["punct_ents"] += 1
|
||||
|
|
|
@ -57,6 +57,7 @@ from .. import about
|
|||
textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool),
|
||||
textcat_arch=("Textcat model architecture", "option", "ta", str),
|
||||
textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
|
||||
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
|
||||
verbose=("Display more information for debug", "flag", "VV", bool),
|
||||
debug=("Run data diagnostics before training", "flag", "D", bool),
|
||||
# fmt: on
|
||||
|
@ -95,6 +96,7 @@ def train(
|
|||
textcat_multilabel=False,
|
||||
textcat_arch="bow",
|
||||
textcat_positive_label=None,
|
||||
tag_map_path=None,
|
||||
verbose=False,
|
||||
debug=False,
|
||||
):
|
||||
|
@ -132,6 +134,9 @@ def train(
|
|||
output_path.mkdir()
|
||||
msg.good("Created output directory: {}".format(output_path))
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
# Take dropout and batch size as generators of values -- dropout
|
||||
# starts high and decays sharply, to force the optimizer to explore.
|
||||
# Batch size starts at 1 and grows, so that we make updates quickly
|
||||
|
@ -238,6 +243,9 @@ def train(
|
|||
pipe_cfg = {}
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
if vectors:
|
||||
msg.text("Loading vector from model '{}'".format(vectors))
|
||||
_load_vectors(nlp, vectors)
|
||||
|
@ -546,7 +554,30 @@ def train(
|
|||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||
meta_loc = output_path / "model-final" / "meta.json"
|
||||
final_meta = srsly.read_json(meta_loc)
|
||||
final_meta.setdefault("accuracy", {})
|
||||
final_meta["accuracy"].update(meta.get("accuracy", {}))
|
||||
final_meta.setdefault("speed", {})
|
||||
final_meta["speed"].setdefault("cpu", None)
|
||||
final_meta["speed"].setdefault("gpu", None)
|
||||
# combine cpu and gpu speeds with the base model speeds
|
||||
if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
|
||||
speed = _get_total_speed([final_meta["speed"]["cpu"], meta["speed"]["cpu"]])
|
||||
final_meta["speed"]["cpu"] = speed
|
||||
if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
|
||||
speed = _get_total_speed([final_meta["speed"]["gpu"], meta["speed"]["gpu"]])
|
||||
final_meta["speed"]["gpu"] = speed
|
||||
# if there were no speeds to update, overwrite with meta
|
||||
if final_meta["speed"]["cpu"] is None and final_meta["speed"]["gpu"] is None:
|
||||
final_meta["speed"].update(meta["speed"])
|
||||
# note: beam speeds are not combined with the base model
|
||||
if has_beam_widths:
|
||||
final_meta.setdefault("beam_accuracy", {})
|
||||
final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
|
||||
final_meta.setdefault("beam_speed", {})
|
||||
final_meta["beam_speed"].update(meta.get("beam_speed", {}))
|
||||
srsly.write_json(meta_loc, final_meta)
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
with msg.loading("Creating best model..."):
|
||||
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||
|
@ -641,11 +672,11 @@ def _get_metrics(component):
|
|||
if component == "parser":
|
||||
return ("las", "uas", "las_per_type", "token_acc")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
return ("tags_acc", "token_acc")
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("textcat_score", "token_acc")
|
||||
return ("token_acc",)
|
||||
|
||||
|
||||
|
@ -701,3 +732,12 @@ def _get_progress(
|
|||
if beam_width is not None:
|
||||
result.insert(1, beam_width)
|
||||
return result
|
||||
|
||||
|
||||
def _get_total_speed(speeds):
|
||||
seconds_per_word = 0.0
|
||||
for words_per_second in speeds:
|
||||
if words_per_second is None:
|
||||
return None
|
||||
seconds_per_word += 1.0 / words_per_second
|
||||
return 1.0 / seconds_per_word
|
||||
|
|
|
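`_get_total_speed` combines words-per-second figures by adding the implied seconds per word and inverting the sum, so the result behaves like the combined throughput over both runs rather than an arithmetic mean. A quick sanity check of that arithmetic (the numbers are arbitrary):

# combining a base-model speed of 2000 words/s with a new run at 1000 words/s
combined = 1.0 / (1.0 / 2000 + 1.0 / 1000)
assert round(combined, 1) == 666.7  # not the arithmetic mean of 1500
# as in _get_total_speed, a missing measurement (None) would propagate as None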
@ -107,6 +107,9 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||
"but is expecting one of type 'uint64' instead. This may result "
|
||||
"in problems with the vocab further on in the pipeline.")
|
||||
|
||||
|
||||
|
||||
|
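W028 fires when the array passed to `Doc.from_array` does not hold the 64-bit unsigned hashes produced by the string store. A minimal sketch of the intended round trip, loosely following the existing `test_issue2203` pattern (words and tags are arbitrary):

import numpy
from spacy.lang.en import English
from spacy.tokens import Doc

vocab = English().vocab
words = ["A", "short", "example"]
tags = ["DT", "JJ", "NN"]
# hash the strings first; these uint64 IDs are what from_array expects
tag_ids = [vocab.strings.add(tag) for tag in tags]

doc = Doc(vocab, words=words)
# building the array with dtype="uint64" avoids W028
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags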
@ -541,6 +544,14 @@ class Errors(object):
|
|||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
"index: '{rel_head_index}'). The head indices should be relative "
|
||||
"to the current token index rather than absolute indices in the "
|
||||
"array.")
|
||||
E191 = ("Invalid head: the head token must be from the same doc as the "
|
||||
"token itself.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
|
|
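E190 spells out that the HEAD column handled by `Doc.from_array` stores offsets relative to each token (head index minus token index), not absolute positions. A short sketch of the valid case, mirroring the bounds test added later in this commit:

from spacy.lang.en import English
from spacy.tokens import Doc

vocab = English().vocab
words = ["This", "is", "a", "sentence", "."]
doc = Doc(vocab, words=words)
for token in doc:
    token.head = doc[0]           # attach everything to the first token

arr = doc.to_array(["HEAD"])      # each cell is head.i - token.i, a relative offset
doc2 = Doc(vocab, words=words)
doc2.from_array(["HEAD"], arr)    # fine: every offset resolves inside the doc
assert all(t.head.i == 0 for t in doc2)
# an offset that resolves outside the doc (e.g. arr[0] = 5) now raises
# ValueError with E190 instead of silently producing a corrupt parse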
@ -151,6 +151,8 @@ def align(tokens_a, tokens_b):
|
|||
cost = 0
|
||||
a2b = numpy.empty(len(tokens_a), dtype="i")
|
||||
b2a = numpy.empty(len(tokens_b), dtype="i")
|
||||
a2b.fill(-1)
|
||||
b2a.fill(-1)
|
||||
a2b_multi = {}
|
||||
b2a_multi = {}
|
||||
i = 0
|
||||
|
@ -160,7 +162,6 @@ def align(tokens_a, tokens_b):
|
|||
while i < len(tokens_a) and j < len(tokens_b):
|
||||
a = tokens_a[i][offset_a:]
|
||||
b = tokens_b[j][offset_b:]
|
||||
a2b[i] = b2a[j] = -1
|
||||
if a == b:
|
||||
if offset_a == offset_b == 0:
|
||||
a2b[i] = j
|
||||
|
|
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
|
@ -22,6 +23,8 @@ class GermanDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,10 +1,32 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import LIST_CURRENCY, CURRENCY, UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import _prefixes, _suffixes
|
||||
|
||||
|
||||
_prefixes = ["``",] + list(_prefixes)
|
||||
|
||||
_suffixes = (
|
||||
["''", "/"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
|
||||
_infixes = (
|
||||
|
@ -15,6 +37,7 @@ _infixes = (
|
|||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9])-(?=[0-9])",
|
||||
|
@ -22,4 +45,6 @@ _infixes = (
|
|||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -160,6 +160,8 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"``",
|
||||
"''",
|
||||
"A.C.",
|
||||
"a.D.",
|
||||
"A.D.",
|
||||
|
@ -175,10 +177,13 @@ for orth in [
|
|||
"biol.",
|
||||
"Biol.",
|
||||
"ca.",
|
||||
"CDU/CSU",
|
||||
"Chr.",
|
||||
"Cie.",
|
||||
"c/o",
|
||||
"co.",
|
||||
"Co.",
|
||||
"d'",
|
||||
"D.C.",
|
||||
"Dipl.-Ing.",
|
||||
"Dipl.",
|
||||
|
@ -203,12 +208,18 @@ for orth in [
|
|||
"i.G.",
|
||||
"i.Tr.",
|
||||
"i.V.",
|
||||
"I.",
|
||||
"II.",
|
||||
"III.",
|
||||
"IV.",
|
||||
"Inc.",
|
||||
"Ing.",
|
||||
"jr.",
|
||||
"Jr.",
|
||||
"jun.",
|
||||
"jur.",
|
||||
"K.O.",
|
||||
"L'",
|
||||
"L.A.",
|
||||
"lat.",
|
||||
"M.A.",
|
||||
|
|
30
spacy/lang/eu/__init__.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class BasqueDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "eu"
|
||||
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
class Basque(Language):
|
||||
lang = "eu"
|
||||
Defaults = BasqueDefaults
|
||||
|
||||
|
||||
__all__ = ["Basque"]
|
14
spacy/lang/eu/examples.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.eu.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
sentences = [
|
||||
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
|
||||
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira"
|
||||
]
|
80
spacy/lang/eu/lex_attrs.py
Normal file
|
@ -0,0 +1,80 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
# Source http://mylanguages.org/basque_numbers.php
|
||||
|
||||
|
||||
_num_words = """
|
||||
bat
|
||||
bi
|
||||
hiru
|
||||
lau
|
||||
bost
|
||||
sei
|
||||
zazpi
|
||||
zortzi
|
||||
bederatzi
|
||||
hamar
|
||||
hamaika
|
||||
hamabi
|
||||
hamahiru
|
||||
hamalau
|
||||
hamabost
|
||||
hamasei
|
||||
hamazazpi
|
||||
Hemezortzi
|
||||
hemeretzi
|
||||
hogei
|
||||
ehun
|
||||
mila
|
||||
milioi
|
||||
""".split()
|
||||
|
||||
# source https://www.google.com/intl/ur/inputtools/try/
|
||||
|
||||
_ordinal_words = """
|
||||
lehen
|
||||
bigarren
|
||||
hirugarren
|
||||
laugarren
|
||||
bosgarren
|
||||
seigarren
|
||||
zazpigarren
|
||||
zortzigarren
|
||||
bederatzigarren
|
||||
hamargarren
|
||||
hamaikagarren
|
||||
hamabigarren
|
||||
hamahirugarren
|
||||
hamalaugarren
|
||||
hamabosgarren
|
||||
hamaseigarren
|
||||
hamazazpigarren
|
||||
hamazortzigarren
|
||||
hemeretzigarren
|
||||
hogeigarren
|
||||
behin
|
||||
""".split()
|
||||
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text in _num_words:
|
||||
return True
|
||||
if text in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
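With the Basque defaults in place, `like_num` covers digit strings, simple fractions, and the number and ordinal words listed above. A quick sketch of how this surfaces on tokens, assuming a spaCy build that includes this commit:

import spacy

nlp = spacy.blank("eu")                     # uses the new BasqueDefaults
doc = nlp("hamar etxe 10 km")
# "hamar" (ten) and "10" are number-like, the other two tokens are not
assert [t.like_num for t in doc] == [True, False, True, False]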
7
spacy/lang/eu/punctuation.py
Normal file
|
@ -0,0 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_suffixes = TOKENIZER_SUFFIXES
|
108
spacy/lang/eu/stop_words.py
Normal file
|
@ -0,0 +1,108 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-eu
|
||||
# https://www.ranks.nl/stopwords/basque
|
||||
# https://www.mustgo.com/worldlanguages/basque/
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
al
|
||||
anitz
|
||||
arabera
|
||||
asko
|
||||
baina
|
||||
bat
|
||||
batean
|
||||
batek
|
||||
bati
|
||||
batzuei
|
||||
batzuek
|
||||
batzuetan
|
||||
batzuk
|
||||
bera
|
||||
beraiek
|
||||
berau
|
||||
berauek
|
||||
bere
|
||||
berori
|
||||
beroriek
|
||||
beste
|
||||
bezala
|
||||
da
|
||||
dago
|
||||
dira
|
||||
ditu
|
||||
du
|
||||
dute
|
||||
edo
|
||||
egin
|
||||
ere
|
||||
eta
|
||||
eurak
|
||||
ez
|
||||
gainera
|
||||
gu
|
||||
gutxi
|
||||
guzti
|
||||
haiei
|
||||
haiek
|
||||
haietan
|
||||
hainbeste
|
||||
hala
|
||||
han
|
||||
handik
|
||||
hango
|
||||
hara
|
||||
hari
|
||||
hark
|
||||
hartan
|
||||
hau
|
||||
hauei
|
||||
hauek
|
||||
hauetan
|
||||
hemen
|
||||
hemendik
|
||||
hemengo
|
||||
hi
|
||||
hona
|
||||
honek
|
||||
honela
|
||||
honetan
|
||||
honi
|
||||
hor
|
||||
hori
|
||||
horiei
|
||||
horiek
|
||||
horietan
|
||||
horko
|
||||
horra
|
||||
horrek
|
||||
horrela
|
||||
horretan
|
||||
horri
|
||||
hortik
|
||||
hura
|
||||
izan
|
||||
ni
|
||||
noiz
|
||||
nola
|
||||
non
|
||||
nondik
|
||||
nongo
|
||||
nor
|
||||
nora
|
||||
ze
|
||||
zein
|
||||
zen
|
||||
zenbait
|
||||
zenbat
|
||||
zer
|
||||
zergatik
|
||||
ziren
|
||||
zituen
|
||||
zu
|
||||
zuek
|
||||
zuen
|
||||
zuten
|
||||
""".split()
|
||||
)
|
71
spacy/lang/eu/tag_map.py
Normal file
|
@ -0,0 +1,71 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||
|
||||
TAG_MAP = {
|
||||
".": {POS: PUNCT, "PunctType": "peri"},
|
||||
",": {POS: PUNCT, "PunctType": "comm"},
|
||||
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
":": {POS: PUNCT},
|
||||
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||
"CD": {POS: NUM, "NumType": "card"},
|
||||
"DT": {POS: DET},
|
||||
"EX": {POS: ADV, "AdvType": "ex"},
|
||||
"FW": {POS: X, "Foreign": "yes"},
|
||||
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||
"IN": {POS: ADP},
|
||||
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||
"MD": {POS: VERB, "VerbType": "mod"},
|
||||
"NIL": {POS: ""},
|
||||
"NN": {POS: NOUN, "Number": "sing"},
|
||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
"NNS": {POS: NOUN, "Number": "plur"},
|
||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
"RP": {POS: PART},
|
||||
"SP": {POS: SPACE},
|
||||
"SYM": {POS: SYM},
|
||||
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||
"UH": {POS: INTJ},
|
||||
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||
"VBZ": {
|
||||
POS: VERB,
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": 3,
|
||||
},
|
||||
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||
"ADD": {POS: X},
|
||||
"NFP": {POS: PUNCT},
|
||||
"GW": {POS: X},
|
||||
"XX": {POS: X},
|
||||
"BES": {POS: VERB},
|
||||
"HVS": {POS: VERB},
|
||||
"_SP": {POS: SPACE},
|
||||
}
|
|
@ -5,11 +5,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
|
|||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
|
||||
abbrev = ("d", "D")
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
|
||||
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
|
|
|
@ -10,6 +10,8 @@ _exc = {}
|
|||
|
||||
# translate / delete what is not necessary
|
||||
for exc_data in [
|
||||
{ORTH: "’t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "’T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
||||
|
|
|
@ -3,6 +3,9 @@ from __future__ import absolute_import, unicode_literals
|
|||
|
||||
import random
|
||||
import itertools
|
||||
|
||||
from thinc.extra import load_nlp
|
||||
|
||||
from spacy.util import minibatch
|
||||
import weakref
|
||||
import functools
|
||||
|
@ -15,6 +18,7 @@ import multiprocessing as mp
|
|||
from itertools import chain, cycle
|
||||
|
||||
from .tokenizer import Tokenizer
|
||||
from .tokens.underscore import Underscore
|
||||
from .vocab import Vocab
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .lookups import Lookups
|
||||
|
@ -753,8 +757,6 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#pipe
|
||||
"""
|
||||
# raw_texts will be used later to stop iterator.
|
||||
texts, raw_texts = itertools.tee(texts)
|
||||
if is_python2 and n_process != 1:
|
||||
user_warning(Warnings.W023)
|
||||
n_process = 1
|
||||
|
@ -853,7 +855,10 @@ class Language(object):
|
|||
sender.send()
|
||||
|
||||
procs = [
|
||||
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
|
||||
mp.Process(
|
||||
target=_apply_pipes,
|
||||
args=(self.make_doc, pipes, rch, sch, Underscore.get_state(), load_nlp.VECTORS),
|
||||
)
|
||||
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
||||
]
|
||||
for proc in procs:
|
||||
|
@ -1108,16 +1113,20 @@ def _pipe(docs, proc, kwargs):
|
|||
yield doc
|
||||
|
||||
|
||||
def _apply_pipes(make_doc, pipes, reciever, sender):
|
||||
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
|
||||
"""Worker for Language.pipe
|
||||
|
||||
receiver (multiprocessing.Connection): Pipe to receive text. Usually
|
||||
created by `multiprocessing.Pipe()`
|
||||
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
|
||||
`multiprocessing.Pipe()`
|
||||
underscore_state (tuple): The data in the Underscore class of the parent
|
||||
vectors (dict): The global vectors data, copied from the parent
|
||||
"""
|
||||
Underscore.load_state(underscore_state)
|
||||
load_nlp.VECTORS = vectors
|
||||
while True:
|
||||
texts = reciever.get()
|
||||
texts = receiver.get()
|
||||
docs = (make_doc(text) for text in texts)
|
||||
for pipe in pipes:
|
||||
docs = pipe(docs)
|
||||
|
|
|
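The extra `underscore_state` and `vectors` arguments exist because worker processes started by `Language.pipe(..., n_process=...)` do not automatically see the parent's custom `Doc._` extensions or the module-level vectors cache in `thinc.extra.load_nlp`, so both are now copied over explicitly (this is what the new regression tests for #4903 and #4725 exercise). A minimal sketch of the usage this enables; the extension name and texts are illustrative:

from spacy.lang.en import English
from spacy.tokens import Doc

Doc.set_extension("source", default="example", force=True)

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
texts = ["I like bananas.", "Do you like them?"] * 5

if __name__ == "__main__":
    # each worker now receives the Underscore state and the vectors cache
    for doc in nlp.pipe(texts, n_process=2):
        assert doc._.source == "example"

The `__main__` guard matters because the workers are plain `multiprocessing.Process` instances, which re-import the calling module on platforms that spawn rather than fork.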
@ -170,6 +170,10 @@ TOKEN_PATTERN_SCHEMA = {
|
|||
"title": "Token is the first in a sentence",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
},
|
||||
"SENT_START": {
|
||||
"title": "Token is the first in a sentence",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
},
|
||||
"LIKE_NUM": {
|
||||
"title": "Token resembles a number",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
|
|
|
@ -670,6 +670,8 @@ def _get_attr_values(spec, string_store):
|
|||
continue
|
||||
if attr == "TEXT":
|
||||
attr = "ORTH"
|
||||
if attr == "IS_SENT_START":
|
||||
attr = "SENT_START"
|
||||
if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]:
|
||||
raise ValueError(Errors.E152.format(attr=attr))
|
||||
attr = IDS.get(attr)
|
||||
|
|
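With `SENT_START` added to the pattern schema and `IS_SENT_START` rewritten to it here, token patterns can anchor on sentence boundaries. A small sketch of a pattern that should now validate and match on this branch (text and match key are made up):

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
matcher = Matcher(nlp.vocab)
# IS_SENT_START is mapped to SENT_START internally, so either spelling works
matcher.add("SENT_INITIAL_NO", None, [{"IS_SENT_START": True, "LOWER": "no"}])

doc = nlp("Yes. No, I prefer wasabi.")
matches = matcher(doc)
assert [doc[start:end].text for _, start, end in matches] == ["No"]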
|
@ -367,7 +367,7 @@ class Tensorizer(Pipe):
|
|||
return sgd
|
||||
|
||||
|
||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
||||
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
|
||||
class Tagger(Pipe):
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
|
|
|
@ -606,7 +606,6 @@ cdef class Parser:
|
|||
if not hasattr(get_gold_tuples, '__call__'):
|
||||
gold_tuples = get_gold_tuples
|
||||
get_gold_tuples = lambda: gold_tuples
|
||||
cfg.setdefault('min_action_freq', 30)
|
||||
actions = self.moves.get_actions(gold_parses=get_gold_tuples(),
|
||||
min_freq=cfg.get('min_action_freq', 30),
|
||||
learn_tokens=self.cfg.get("learn_tokens", False))
|
||||
|
@ -616,8 +615,9 @@ cdef class Parser:
|
|||
if label not in actions[action]:
|
||||
actions[action][label] = freq
|
||||
self.moves.initialize_actions(actions)
|
||||
cfg.setdefault('token_vector_width', 96)
|
||||
if self.model is True:
|
||||
cfg.setdefault('min_action_freq', 30)
|
||||
cfg.setdefault('token_vector_width', 96)
|
||||
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
|
||||
if sgd is None:
|
||||
sgd = self.create_optimizer()
|
||||
|
@ -633,11 +633,11 @@ cdef class Parser:
|
|||
if pipeline is not None:
|
||||
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
|
||||
link_vectors_to_models(self.vocab)
|
||||
self.cfg.update(cfg)
|
||||
else:
|
||||
if sgd is None:
|
||||
sgd = self.create_optimizer()
|
||||
self.model.begin_training([])
|
||||
self.cfg.update(cfg)
|
||||
return sgd
|
||||
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
|
|
|
@ -83,6 +83,11 @@ def es_tokenizer():
|
|||
return get_lang_class("es").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def fi_tokenizer():
|
||||
return get_lang_class("fi").Defaults.create_tokenizer()
|
||||
|
|
|
@ -77,3 +77,30 @@ def test_doc_array_idx(en_vocab):
|
|||
assert offsets[0] == 0
|
||||
assert offsets[1] == 3
|
||||
assert offsets[2] == 11
|
||||
|
||||
|
||||
def test_doc_from_array_heads_in_bounds(en_vocab):
|
||||
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
|
||||
words = ["This", "is", "a", "sentence", "."]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for token in doc:
|
||||
token.head = doc[0]
|
||||
|
||||
# correct
|
||||
arr = doc.to_array(["HEAD"])
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head before start
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = -1
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head after end
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = 5
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
|
|
@ -150,10 +150,9 @@ def test_doc_api_runtime_error(en_tokenizer):
|
|||
# Example that caused run-time error while parsing Reddit
|
||||
# fmt: off
|
||||
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
||||
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
|
||||
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
|
||||
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
|
||||
"ROOT", "amod", "dobj"]
|
||||
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
|
||||
"amod", "pobj", "acl", "prep", "prep", "pobj",
|
||||
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
|
||||
# fmt: on
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
|
@ -277,7 +276,9 @@ def test_doc_is_nered(en_vocab):
|
|||
def test_doc_from_array_sent_starts(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||
# fmt: off
|
||||
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||
doc[i].dep_ = dep
|
||||
|
|
|
@ -167,6 +167,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
assert doc[4].left_edge.i == 0
|
||||
assert doc[2].left_edge.i == 0
|
||||
|
||||
# head token must be from the same document
|
||||
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
with pytest.raises(ValueError):
|
||||
doc[0].head = doc2[0]
|
||||
|
||||
|
||||
def test_is_sent_start(en_tokenizer):
|
||||
doc = en_tokenizer("This is a sentence. This is another.")
|
||||
|
@ -214,7 +219,7 @@ def test_token_api_conjuncts_chain(en_vocab):
|
|||
def test_token_api_conjuncts_simple(en_vocab):
|
||||
words = "They came and went .".split()
|
||||
heads = [1, 0, -1, -2, -1]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||
|
|
|
@ -7,6 +7,15 @@ from spacy.tokens import Doc, Span, Token
|
|||
from spacy.tokens.underscore import Underscore
|
||||
|
||||
|
||||
@pytest.fixture(scope="function", autouse=True)
|
||||
def clean_underscore():
|
||||
# reset the Underscore object after the test, to avoid having state copied across tests
|
||||
yield
|
||||
Underscore.doc_extensions = {}
|
||||
Underscore.span_extensions = {}
|
||||
Underscore.token_extensions = {}
|
||||
|
||||
|
||||
def test_create_doc_underscore():
|
||||
doc = Mock()
|
||||
doc.doc = doc
|
||||
|
|
16
spacy/tests/lang/eu/test_text.py
Normal file
|
@ -0,0 +1,16 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_eu_tokenizer_handles_long_text(eu_tokenizer):
|
||||
text = """ta nere guitarra estrenatu ondoren"""
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == 5
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,length", [("milesker ederra joan zen hitzaldia plazer hutsa", 7), ("astelehen guztia sofan pasau biot", 5)])
|
||||
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == length
|
|
@ -6,6 +6,7 @@ import re
|
|||
from mock import Mock
|
||||
from spacy.matcher import Matcher, DependencyMatcher
|
||||
from spacy.tokens import Doc, Token
|
||||
from ..doc.test_underscore import clean_underscore
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -200,6 +201,7 @@ def test_matcher_any_token_operator(en_vocab):
|
|||
assert matches[2] == "test hello world"
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_matcher_extension_attribute(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||
|
|
|
@ -34,6 +34,8 @@ TEST_PATTERNS = [
|
|||
([{"LOWER": {"REGEX": "^X", "NOT_IN": ["XXX", "XY"]}}], 0, 0),
|
||||
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
||||
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
||||
([{"IS_SENT_START": True}], 0, 0),
|
||||
([{"SENT_START": True}], 0, 0),
|
||||
]
|
||||
|
||||
XFAIL_TEST_PATTERNS = [([{"orth": "foo"}], 0, 0)]
|
||||
|
|
|
@ -34,23 +34,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
|
|||
@pytest.fixture
|
||||
def heads():
|
||||
# fmt: off
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
|
||||
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
|
||||
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
|
||||
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
|
||||
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
||||
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
|
||||
-1, -8, -9, -1]
|
||||
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
|
||||
-1, 0, -1, -1]
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
|
@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
|
|||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corrpution problem and set lemmas after tags
|
||||
# Work around lemma corruption problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
|
|
|
@ -124,7 +124,7 @@ def test_issue2772(en_vocab):
|
|||
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
||||
# A tree with a non-projective (i.e. crossing) arc
|
||||
# The arcs (0, 4) and (2, 9) cross.
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
|
||||
deps = ["dep"] * len(heads)
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert doc[1].is_sent_start is None
|
||||
|
|
|
@ -27,7 +27,7 @@ def test_issue4590(en_vocab):
|
|||
|
||||
text = "The quick brown fox jumped over the lazy fox"
|
||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
|
||||
|
||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||
|
||||
|
|
26
spacy/tests/regression/test_issue4725.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
|
||||
def test_issue4725():
|
||||
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
|
||||
nlp = English(vocab=vocab)
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
nlp.begin_training()
|
||||
docs = ["Kurt is in London."] * 10
|
||||
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
|
||||
pass
|
||||
|
43
spacy/tests/regression/test_issue4903.py
Normal file
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
|
||||
class CustomPipe:
|
||||
name = "my_pipe"
|
||||
|
||||
def __init__(self):
|
||||
Span.set_extension("my_ext", getter=self._get_my_ext)
|
||||
Doc.set_extension("my_ext", default=None)
|
||||
|
||||
def __call__(self, doc):
|
||||
gathered_ext = []
|
||||
for sent in doc.sents:
|
||||
sent_ext = self._get_my_ext(sent)
|
||||
sent._.set("my_ext", sent_ext)
|
||||
gathered_ext.append(sent_ext)
|
||||
|
||||
doc._.set("my_ext", "\n".join(gathered_ext))
|
||||
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def _get_my_ext(span):
|
||||
return str(span.end)
|
||||
|
||||
|
||||
def test_issue4903():
|
||||
# ensures that this runs correctly and doesn't hang or crash on Windows / macOS
|
||||
|
||||
nlp = English()
|
||||
custom_component = CustomPipe()
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
nlp.add_pipe(custom_component, after="sentencizer")
|
||||
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
|
@ -11,6 +11,6 @@ def nlp():
|
|||
return spacy.blank("en")
|
||||
|
||||
|
||||
def test_evaluate(nlp):
|
||||
def test_issue4924(nlp):
|
||||
docs_golds = [("", {})]
|
||||
nlp.evaluate(docs_golds)
|
||||
|
|
35
spacy/tests/regression/test_issue5048.py
Normal file
|
@ -0,0 +1,35 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import DEP, POS, TAG
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_issue5048(en_vocab):
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
pos_s = ["DET", "VERB", "DET", "NOUN"]
|
||||
spaces = [" ", " ", " ", ""]
|
||||
deps_s = ["dep", "adj", "nn", "atm"]
|
||||
tags_s = ["DT", "VBZ", "DT", "NN"]
|
||||
|
||||
strings = en_vocab.strings
|
||||
|
||||
for w in words:
|
||||
strings.add(w)
|
||||
deps = [strings.add(d) for d in deps_s]
|
||||
pos = [strings.add(p) for p in pos_s]
|
||||
tags = [strings.add(t) for t in tags_s]
|
||||
|
||||
attrs = [POS, DEP, TAG]
|
||||
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
|
||||
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
doc.from_array(attrs, array)
|
||||
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
|
||||
|
||||
doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
|
||||
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
|
||||
assert v1 == v2
|
spacy/tests/regression/test_issue5082.py (new file, 46 lines)
@@ -0,0 +1,46 @@
# coding: utf8
from __future__ import unicode_literals

import numpy as np
from spacy.lang.en import English
from spacy.pipeline import EntityRuler


def test_issue5082():
    # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
    nlp = English()
    vocab = nlp.vocab
    array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
    array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
    array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
    array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
    array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)

    vocab.set_vector("I", array1)
    vocab.set_vector("like", array2)
    vocab.set_vector("David", array3)
    vocab.set_vector("Bowie", array4)

    text = "I like David Bowie"
    ruler = EntityRuler(nlp)
    patterns = [
        {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    parsed_vectors_1 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_1) == 4
    np.testing.assert_array_equal(parsed_vectors_1[0], array1)
    np.testing.assert_array_equal(parsed_vectors_1[1], array2)
    np.testing.assert_array_equal(parsed_vectors_1[2], array3)
    np.testing.assert_array_equal(parsed_vectors_1[3], array4)

    merge_ents = nlp.create_pipe("merge_entities")
    nlp.add_pipe(merge_ents)

    parsed_vectors_2 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_2) == 3
    np.testing.assert_array_equal(parsed_vectors_2[0], array1)
    np.testing.assert_array_equal(parsed_vectors_2[1], array2)
    np.testing.assert_array_equal(parsed_vectors_2[2], array34)
@@ -15,12 +15,19 @@ def load_tokenizer(b):


def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
    """Test that custom tokenizer with not all functions defined can be
    serialized and deserialized correctly (see #2494)."""
    """Test that custom tokenizer with not all functions defined or empty
    properties can be serialized and deserialized correctly (see #2494,
    #4991)."""
    tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
    tokenizer_bytes = tokenizer.to_bytes()
    Tokenizer(en_vocab).from_bytes(tokenizer_bytes)

    tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
    tokenizer.rules = {}
    tokenizer_bytes = tokenizer.to_bytes()
    tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
    assert tokenizer_reloaded.rules == {}


@pytest.mark.skip(reason="Currently unreliable across platforms")
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
@@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
    deps = displacy.parse_deps(doc)
    assert isinstance(deps, dict)
    assert deps["words"] == [
        {"lemma": None, "text": "This", "tag": "DET"},
        {"lemma": None, "text": "is", "tag": "AUX"},
        {"lemma": None, "text": "a", "tag": "DET"},
        {"lemma": None, "text": "sentence", "tag": "NOUN"},
        {"lemma": None, "text": words[0], "tag": pos[0]},
        {"lemma": None, "text": words[1], "tag": pos[1]},
        {"lemma": None, "text": words[2], "tag": pos[2]},
        {"lemma": None, "text": words[3], "tag": pos[3]},
    ]
    assert deps["arcs"] == [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
@@ -75,7 +75,7 @@ def test_displacy_rtl():
    deps = ["foo", "bar", "foo", "baz"]
    heads = [1, 0, 1, -2]
    nlp = Persian()
    doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
    doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
    doc.ents = [Span(doc, 1, 3, label="TEST")]
    html = displacy.render(doc, page=True, style="dep")
    assert "direction: rtl" in html
@@ -7,9 +7,10 @@ import shutil
import contextlib
import srsly
from pathlib import Path

from spacy import Errors
from spacy.tokens import Doc, Span
from spacy.attrs import POS, HEAD, DEP
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
from spacy.compat import path2str
@@ -26,30 +28,54 @@ def make_tempdir():
    shutil.rmtree(path2str(d))


def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None):
    """Create Doc object from given vocab, words and annotations."""
    pos = pos or [""] * len(words)
    tags = tags or [""] * len(words)
    heads = heads or [0] * len(words)
    deps = deps or [""] * len(words)
    for value in deps + tags + pos:
    if deps and not heads:
        heads = [0] * len(deps)
    headings = []
    values = []
    annotations = [pos, heads, deps, lemmas, tags]
    possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
    for a, annot in enumerate(annotations):
        if annot is not None:
            if len(annot) != len(words):
                raise ValueError(Errors.E189)
            headings.append(possible_headings[a])
            if annot is not heads:
                values.extend(annot)
    for value in values:
        vocab.strings.add(value)

    doc = Doc(vocab, words=words)
    attrs = doc.to_array([POS, HEAD, DEP])
    for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
        attrs[i, 0] = doc.vocab.strings[p]
        attrs[i, 1] = head
        attrs[i, 2] = doc.vocab.strings[dep]
    doc.from_array([POS, HEAD, DEP], attrs)

    # if there are any other annotations, set them
    if headings:
        attrs = doc.to_array(headings)

        j = 0
        for annot in annotations:
            if annot:
                if annot is heads:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = heads[i]
                        else:
                            attrs[i, j] = heads[i]
                else:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = doc.vocab.strings[annot[i]]
                        else:
                            attrs[i, j] = doc.vocab.strings[annot[i]]
                j += 1
        doc.from_array(headings, attrs)

    # finally, set the entities
    if ents:
        doc.ents = [
            Span(doc, start, end, label=doc.vocab.strings[label])
            for start, end, label in ents
        ]
    if tags:
        for token in doc:
            token.tag_ = tags[token.i]
    return doc
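For reference, a minimal sketch of how the extended `get_doc` helper can be called, assuming the `en_vocab` fixture used elsewhere in these tests; the words, tags, heads and lemmas below are invented for illustration, and heads are per-token offsets as in the other tests:

```python
# Hypothetical usage of the extended helper; all values are illustrative only.
words = ["Anna", "visited", "London"]
doc = get_doc(
    en_vocab,
    words=words,
    tags=["NNP", "VBD", "NNP"],
    pos=["PROPN", "VERB", "PROPN"],
    heads=[1, 0, -1],  # offsets relative to each token's own index
    deps=["nsubj", "ROOT", "dobj"],
    lemmas=["Anna", "visit", "London"],
)
assert [t.lemma_ for t in doc] == ["Anna", "visit", "London"]
assert [t.head.i for t in doc] == [1, 1, 1]
```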
@@ -14,7 +14,7 @@ import re
from .tokens.doc cimport Doc
from .strings cimport hash_string
from .compat import unescape_unicode
from .compat import unescape_unicode, basestring_
from .attrs import intify_attrs
from .symbols import ORTH
@@ -508,6 +508,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#to_disk
        """
        path = util.ensure_path(path)
        with path.open("wb") as file_:
            file_.write(self.to_bytes(**kwargs))
@@ -521,6 +522,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#from_disk
        """
        path = util.ensure_path(path)
        with path.open("rb") as file_:
            bytes_data = file_.read()
        self.from_bytes(bytes_data, **kwargs)
@@ -568,22 +570,22 @@
        for key in ["prefix_search", "suffix_search", "infix_finditer"]:
            if key in data:
                data[key] = unescape_unicode(data[key])
        if data.get("prefix_search"):
        if "prefix_search" in data and isinstance(data["prefix_search"], basestring_):
            self.prefix_search = re.compile(data["prefix_search"]).search
        if data.get("suffix_search"):
        if "suffix_search" in data and isinstance(data["suffix_search"], basestring_):
            self.suffix_search = re.compile(data["suffix_search"]).search
        if data.get("infix_finditer"):
        if "infix_finditer" in data and isinstance(data["infix_finditer"], basestring_):
            self.infix_finditer = re.compile(data["infix_finditer"]).finditer
        if data.get("token_match"):
        if "token_match" in data and isinstance(data["token_match"], basestring_):
            self.token_match = re.compile(data["token_match"]).match
        if data.get("rules"):
        if "rules" in data and isinstance(data["rules"], dict):
            # make sure to hard reset the cache to remove data from the default exceptions
            self._rules = {}
            self._reset_cache([key for key in self._cache])
            self._reset_specials()
            self._cache = PreshMap()
            self._specials = PreshMap()
            self._load_special_tokenization(data.get("rules", {}))
            self._load_special_tokenization(data["rules"])

        return self
@@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
        new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
        if spans[token_index][-1].whitespace_:
            new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
        # add the vector of the (merged) entity to the vocab
        if not doc.vocab.get_vector(new_orth).any():
            if doc.vocab.vectors_length > 0:
                doc.vocab.set_vector(new_orth, span.vector)
        token = tokens[token_index]
        lex = doc.vocab.get(doc.mem, new_orth)
        token.lex = lex
@@ -785,10 +785,12 @@ cdef class Doc:
        # Allow strings, e.g. 'lemma' or 'LEMMA'
        attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
                 for id_ in attrs]
        if array.dtype != numpy.uint64:
            user_warning(Warnings.W028.format(type=array.dtype))

        if SENT_START in attrs and HEAD in attrs:
            raise ValueError(Errors.E032)
        cdef int i, col
        cdef int i, col, abs_head_index
        cdef attr_id_t attr_id
        cdef TokenC* tokens = self.c
        cdef int length = len(array)
@@ -802,6 +804,14 @@
            attr_ids[i] = attr_id
        if len(array.shape) == 1:
            array = array.reshape((array.size, 1))
        # Check that all heads are within the document bounds
        if HEAD in attrs:
            col = attrs.index(HEAD)
            for i in range(length):
                # cast index to signed int
                abs_head_index = numpy.int32(array[i, col]) + i
                if abs_head_index < 0 or abs_head_index >= length:
                    raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
        # Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
        if TAG in attrs:
            col = attrs.index(TAG)
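A minimal sketch of what the bounds check above implies for callers of `Doc.from_array`: a `HEAD` offset pointing outside the document should now raise a `ValueError` built from `Errors.E190` instead of silently corrupting the parse. The example array below is invented for illustration:

```python
# Sketch only: feed Doc.from_array a HEAD column whose offset points past the
# end of the document and expect a ValueError from the new bounds check.
import numpy
from spacy.lang.en import English
from spacy.attrs import HEAD, DEP

nlp = English()
doc = nlp("This is a sentence")
dep = doc.vocab.strings.add("dep")
# a head offset of +10 from token 0 lands outside the 4-token document
arr = numpy.array([[10, dep], [0, dep], [0, dep], [0, dep]], dtype="uint64")
try:
    doc.from_array([HEAD, DEP], arr)
except ValueError as err:
    print(err)  # expected to report the out-of-bounds head for token 0
```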
@@ -872,7 +882,7 @@

        DOCS: https://spacy.io/api/doc#to_bytes
        """
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID]  # TODO: ENT_KB_ID ?
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM]  # TODO: ENT_KB_ID ?
        if self.is_tagged:
            array_head.extend([TAG, POS])
        # If doc parsed add head and dep attribute
@@ -1173,6 +1183,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
        heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
        if loop_count > 10:
            user_warning(Warnings.W026)
            break
        loop_count += 1
    # Set sentence starts
    for i in range(length):
@@ -623,6 +623,9 @@ cdef class Token:
        # This function sets the head of self to new_head and updates the
        # counters for left/right dependents and left/right corner for the
        # new and the old head
        # Check that token is from the same document
        if self.doc != new_head.doc:
            raise ValueError(Errors.E191)
        # Do nothing if old head is new head
        if self.i + self.c.head == new_head.i:
            return
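A minimal sketch of the behaviour the check above introduces: assigning a head token that belongs to a different `Doc` is expected to raise a `ValueError`. The sentences below are toy examples on a blank English pipeline:

```python
# Sketch only: with the document check above, a cross-document head assignment
# is expected to raise instead of being applied.
from spacy.lang.en import English

nlp = English()
doc1 = nlp("I like bananas")
doc2 = nlp("I prefer wasabi")
try:
    doc1[0].head = doc2[1]  # head token belongs to a different Doc
except ValueError as err:
    print(err)
```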
@@ -79,6 +79,14 @@ class Underscore(object):
    def _get_key(self, name):
        return ("._.", name, self._start, self._end)

    @classmethod
    def get_state(cls):
        return cls.token_extensions, cls.span_extensions, cls.doc_extensions

    @classmethod
    def load_state(cls, state):
        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state


def get_ext_args(**kwargs):
    """Validate and convert arguments. Reused in Doc, Token and Span."""
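A minimal sketch of what the new `get_state` / `load_state` helpers provide: a snapshot of the registered custom extensions that can be re-applied elsewhere, for example in a worker process started by `nlp.pipe(..., n_process=...)`. The extension name below is invented:

```python
# Sketch only: capture and restore the extension registry via the new helpers.
from spacy.tokens import Doc
from spacy.tokens.underscore import Underscore

Doc.set_extension("note", default=None, force=True)  # hypothetical extension
state = Underscore.get_state()   # (token_extensions, span_extensions, doc_extensions)
Underscore.load_state(state)     # re-register the same extensions, e.g. in a child process
assert Doc.has_extension("note")
```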
@@ -184,16 +184,17 @@ low data labels and more.
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
```

| Argument | Type | Description |
| -------------------------- | ---------- | --------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
| Argument | Type | Description |
| ------------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |

<Accordion title="Example output">
@@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--replace-components`, `-R` | flag | Replace components from the base model. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
@@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
@@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | model, pickle | A spaCy model on each epoch. |
@@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx

A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
named entities, export annotations to numpy arrays, losslessly serialize to
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
The Python-level `Token` and [`Span`](/api/span) objects are views of this
array, i.e. they don't own the data themselves.
compressed binary strings. The `Doc` object holds an array of
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
data themselves.
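A minimal sketch of the point that `Token` and `Span` are views of the `Doc`'s data rather than copies, using a blank English pipeline:

```python
# Sketch only: slicing a Doc returns a Span that refers to the same underlying data.
from spacy.lang.en import English

nlp = English()
doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span.doc is doc             # the span is a view of the same Doc
assert span[0].text == doc[1].text
```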

## Doc.\_\_init\_\_ {#init tag="method"}
@@ -197,13 +198,14 @@ the character indices don't map to a valid span.
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
| Name | Type | Description |
| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Doc.similarity {#similarity tag="method" model="vectors"}
@@ -172,6 +172,28 @@ Remove a previously registered extension.
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |

## Span.char_span {#char_span tag="method" new="2.2.4"}

Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
the character indices don't map to a valid span.

> #### Example
>
> ```python
> doc = nlp("I like New York")
> span = doc[1:4].char_span(5, 13, label="GPE")
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Span.similarity {#similarity tag="method" model="vectors"}

Make a semantic similarity estimate. The default estimate is cosine similarity
@@ -293,10 +315,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> assert doc2.text == "New York"
> ```

| Name | Type | Description |
| ----------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
| Name | Type | Description |
| ---------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |

## Span.root {#root tag="property" model="parser"}
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
@@ -236,22 +236,22 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="dep", options=options)
> ```

| Name | Type | Description | Default |
| ------------------ | ------- | ----------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` | bool | Print the lemma's in a separate row below the token texts in the `dep` visualisation. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
| Name | Type | Description | Default |
| ------------------------------------------ | ------- | ----------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

#### Named Entity Visualizer options {#displacy_options-ent}
@@ -95,6 +95,8 @@
      "has_examples": true
    },
    { "code": "hr", "name": "Croatian", "has_examples": true },
    { "code": "eu", "name": "Basque", "has_examples": true },
    { "code": "yo", "name": "Yoruba", "has_examples": true },
    { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
    { "code": "ca", "name": "Catalan", "example": "Això és una frase.", "has_examples": true },
    { "code": "he", "name": "Hebrew", "example": "זהו משפט.", "has_examples": true },
@@ -1,6 +1,6 @@
<svg xmlns="http://www.w3.org/2000/svg" width="220" height="37" viewBox="0 0 610 103">
  <defs>
    <radialGradient id="gradient_allenai1 "cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
    <radialGradient id="gradient_allenai1" cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
      <stop offset=".3" stop-color="#FDEA65" />
      <stop offset="1" stop-color="#FCB431" />
    </radialGradient>
(SVG image: 9.6 KiB before, 9.6 KiB after)