Mirror of https://github.com/explosion/spaCy.git (synced 2025-03-27 21:34:12 +03:00)

Commit 37cb09b90a: Merge branch 'develop' into spacy.io
@@ -1,21 +0,0 @@
environment:
  matrix:
    - PYTHON: "C:\\Python35-x64"
    - PYTHON: "C:\\Python36-x64"
    - PYTHON: "C:\\Python37-x64"
install:
  # We need wheel installed to build wheels
  - "%PYTHON%\\python.exe -m pip install wheel"
  - "%PYTHON%\\python.exe -m pip install cython"
  - "%PYTHON%\\python.exe -m pip install -r requirements.txt"
  - "%PYTHON%\\python.exe -m pip install -e ."
build: off
test_script:
  - "%PYTHON%\\python.exe -m pytest spacy/ --no-print-logs"
after_test:
  - "%PYTHON%\\python.exe setup.py bdist_wheel"
artifacts:
  - path: dist\*
branches:
  except:
    - spacy.io
.github/contributors/adrienball.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                           |
|------------------------------- | ------------------------------- |
| Name                           | Adrien Ball                     |
| Company name (if applicable)   |                                 |
| Title or role (if applicable)  | Machine Learning Engineer       |
| Date                           | 2019-03-07                      |
| GitHub username                | adrienball                      |
| Website (optional)             | https://medium.com/@adrien_ball |
.travis.yml (15 lines changed)

@@ -5,23 +5,16 @@ dist: trusty
group: edge
python:
  - "2.7"
  - "3.5"
  - "3.6"
os:
  - linux
env:
  - VIA=compile
  - VIA=flake8
install:
  - "./travis.sh"
  - pip install flake8
  - "pip install -r requirements.txt"
  - "python setup.py build_ext --inplace"
  - "pip install -e ."
script:
  - "cat /proc/cpuinfo | grep flags | head -n 1"
  - "pip install pytest pytest-timeout"
  - if [[ "${VIA}" == "compile" ]]; then python -m pytest --tb=native spacy; fi
  - if [[ "${VIA}" == "flake8" ]]; then flake8 . --count --exclude=spacy/compat.py,spacy/lang --select=E901,E999,F821,F822,F823 --show-source --statistics; fi
  - if [[ "${VIA}" == "pypi_nightly" ]]; then python -m pytest --tb=native --models --en `python -c "import os.path; import spacy; print(os.path.abspath(os.path.dirname(spacy.__file__)))"`; fi
  - if [[ "${VIA}" == "sdist" ]]; then python -m pytest --tb=native `python -c "import os.path; import spacy; print(os.path.abspath(os.path.dirname(spacy.__file__)))"`; fi
  - "python -m pytest --tb=native spacy"
branches:
  except:
    - spacy.io
@@ -14,8 +14,8 @@ released under the MIT license.

💫 **Version 2.1 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases)

[Azure Pipelines](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[Travis CI](https://travis-ci.org/explosion/spaCy)
[Appveyor](https://ci.appveyor.com/project/explosion/spaCy)
[Current release](https://github.com/explosion/spaCy/releases)
[PyPI](https://pypi.python.org/pypi/spacy)
[conda-forge](https://anaconda.org/conda-forge/spacy)
azure-pipelines.yml (new file, 92 lines)

@@ -0,0 +1,92 @@
trigger:
  batch: true
  branches:
    include:
      - '*'
    exclude:
      - 'spacy.io'
  paths:
    exclude:
      - 'website/*'
      - '*.md'

jobs:

  # Perform basic checks for most important errors (syntax etc.) Uses the config
  # defined in .flake8 and overwrites the selected codes.
  - job: 'Validate'
    pool:
      vmImage: 'ubuntu-16.04'
    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: '3.7'
      - script: |
          pip install flake8
          python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
        displayName: 'flake8'

  - job: 'Test'
    dependsOn: 'Validate'
    strategy:
      matrix:
        # Python 2.7 currently doesn't work because it seems to be a narrow
        # unicode build, which causes problems with the regular expressions

        # Python27Linux:
        #   imageName: 'ubuntu-16.04'
        #   python.version: '2.7'
        # Python27Mac:
        #   imageName: 'macos-10.13'
        #   python.version: '2.7'
        Python35Linux:
          imageName: 'ubuntu-16.04'
          python.version: '3.5'
        Python35Windows:
          imageName: 'vs2017-win2016'
          python.version: '3.5'
        Python35Mac:
          imageName: 'macos-10.13'
          python.version: '3.5'
        Python36Linux:
          imageName: 'ubuntu-16.04'
          python.version: '3.6'
        Python36Windows:
          imageName: 'vs2017-win2016'
          python.version: '3.6'
        Python36Mac:
          imageName: 'macos-10.13'
          python.version: '3.6'
        Python37Linux:
          imageName: 'ubuntu-16.04'
          python.version: '3.7'
        Python37Windows:
          imageName: 'vs2017-win2016'
          python.version: '3.7'
        Python37Mac:
          imageName: 'macos-10.13'
          python.version: '3.7'
      maxParallel: 4
    pool:
      vmImage: $(imageName)

    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: '$(python.version)'
          architecture: 'x64'

      # Downgrading pip is necessary to prevent a wheel version incompatiblity.
      # Might be fixed in the future or some other way, so investigate again.
      - script: |
          python -m pip install --upgrade pip==18.1
          pip install -r requirements.txt
        displayName: 'Install dependencies'

      - script: |
          python setup.py build_ext --inplace
          pip install -e .
        displayName: 'Build and install'

      - script: python -m pytest --tb=native spacy
        displayName: 'Run tests'
spacy/_ml.py (51 lines changed)

@@ -84,16 +84,54 @@ def _zero_init(model):
@layerize
def _preprocess_doc(docs, drop=0.0):
    keys = [doc.to_array(LOWER) for doc in docs]
    ops = Model.ops
    # The dtype here matches what thinc is expecting -- which differs per
    # platform (by int definition). This should be fixed once the problem
    # is fixed on Thinc's side.
    lengths = ops.asarray([arr.shape[0] for arr in keys], dtype=numpy.int_)
    keys = ops.xp.concatenate(keys)
    vals = ops.allocate(keys.shape) + 1.0
    lengths = numpy.array([arr.shape[0] for arr in keys], dtype=numpy.int_)
    keys = numpy.concatenate(keys)
    vals = numpy.zeros(keys.shape, dtype='f')
    return (keys, vals, lengths), None


def with_cpu(ops, model):
    """Wrap a model that should run on CPU, transferring inputs and outputs
    as necessary."""
    model.to_cpu()
    def with_cpu_forward(inputs, drop=0.):
        cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop)
        gpu_outputs = _to_device(ops, cpu_outputs)

        def with_cpu_backprop(d_outputs, sgd=None):
            cpu_d_outputs = _to_cpu(d_outputs)
            return backprop(cpu_d_outputs, sgd=sgd)

        return gpu_outputs, with_cpu_backprop

    return wrap(with_cpu_forward, model)


def _to_cpu(X):
    if isinstance(X, numpy.ndarray):
        return X
    elif isinstance(X, tuple):
        return tuple([_to_cpu(x) for x in X])
    elif isinstance(X, list):
        return [_to_cpu(x) for x in X]
    elif hasattr(X, 'get'):
        return X.get()
    else:
        return X


def _to_device(ops, X):
    if isinstance(X, tuple):
        return tuple([_to_device(ops, x) for x in X])
    elif isinstance(X, list):
        return [_to_device(ops, x) for x in X]
    else:
        return ops.asarray(X)


@layerize
def _preprocess_doc_bigrams(docs, drop=0.0):
    unigrams = [doc.to_array(LOWER) for doc in docs]

@@ -563,7 +601,10 @@ def build_text_classifier(nr_class, width=64, **cfg):
        >> zero_init(Affine(nr_class, width, drop_factor=0.0))
    )

    linear_model = _preprocess_doc >> LinearModel(nr_class)
    linear_model = (
        _preprocess_doc
        >> with_cpu(Model.ops, LinearModel(nr_class))
    )
    if cfg.get('exclusive_classes'):
        output_layer = Softmax(nr_class, nr_class * 2)
    else:
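The `_to_cpu` and `_to_device` helpers above recursively walk tuples and lists and move each array to the right device, which is what lets `with_cpu` keep the wrapped linear model on CPU inside an otherwise GPU-bound pipeline. A self-contained sketch of that recursion, using a hypothetical `FakeGpuArray` class as a stand-in for a device array with a `.get()` method (as cupy arrays have):

import numpy


class FakeGpuArray(object):
    """Stand-in for a device array: exposes .get() like cupy.ndarray."""

    def __init__(self, data):
        self.data = numpy.asarray(data)

    def get(self):
        return self.data


def to_cpu(X):
    # Same recursion as _to_cpu above: walk tuples and lists, call .get()
    # on device arrays, and pass numpy arrays (or anything else) through.
    if isinstance(X, numpy.ndarray):
        return X
    elif isinstance(X, tuple):
        return tuple(to_cpu(x) for x in X)
    elif isinstance(X, list):
        return [to_cpu(x) for x in X]
    elif hasattr(X, "get"):
        return X.get()
    return X


nested = ([FakeGpuArray([1, 2])], (numpy.array([3.0]), "meta"))
print(to_cpu(nested))  # ([array([1, 2])], (array([3.]), 'meta'))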
@@ -23,15 +23,16 @@ CONVERTERS = {
}

# File types
FILE_TYPES = ("json", "jsonl")
FILE_TYPES = ("json", "jsonl", "msg")
FILE_TYPES_STDOUT = ("json", "jsonl")


@plac.annotations(
    input_file=("Input file", "positional", None, str),
    output_dir=("Output directory for converted file", "positional", None, str),
    file_type=("Type of data to produce: 'jsonl' or 'json'", "option", "t", str),
    output_dir=("Output directory. '-' for stdout.", "positional", None, str),
    file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str),
    n_sents=("Number of sentences per doc", "option", "n", int),
    converter=("Name of converter (auto, iob, conllu or ner)", "option", "c", str),
    converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str),
    lang=("Language (if tokenizer required)", "option", "l", str),
    morphology=("Enable appending morphology to tags", "flag", "m", bool),
)

@@ -58,6 +59,13 @@ def convert(
            "Supported file types: '{}'".format(", ".join(FILE_TYPES)),
            exits=1,
        )
    if file_type not in FILE_TYPES_STDOUT and output_dir == "-":
        # TODO: support msgpack via stdout in srsly?
        msg.fail(
            "Can't write .{} data to stdout.".format(file_type),
            "Please specify an output directory.",
            exits=1,
        )
    if not input_path.exists():
        msg.fail("Input file not found", input_path, exits=1)
    if output_dir != "-" and not Path(output_dir).exists():

@@ -78,6 +86,8 @@ def convert(
            srsly.write_json(output_file, data)
        elif file_type == "jsonl":
            srsly.write_jsonl(output_file, data)
        elif file_type == "msg":
            srsly.write_msgpack(output_file, data)
        msg.good("Generated output file ({} documents)".format(len(data)), output_file)
    else:
        # Print to stdout
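The new "msg" file type hands the converted documents to `srsly.write_msgpack`. A small standalone sketch of that output path, with a made-up output location and dummy data standing in for the converted corpus:

import srsly

# Dummy converted "documents"; the structure here is illustrative only.
data = [{"id": 0, "paragraphs": []}, {"id": 1, "paragraphs": []}]

output_file = "/tmp/converted.msg"  # hypothetical output location
srsly.write_msgpack(output_file, data)

# srsly can read the file back, so the conversion round-trips.
assert srsly.read_msgpack(output_file) == data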
@@ -27,35 +27,46 @@ def download(model, direct=False, *pip_args):
    can be shortcut, model name or, if --direct flag is set, full model name
    with version. For direct downloads, the compatibility check will be skipped.
    """
    dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
    if direct:
        dl = download_model("{m}/{m}.tar.gz#egg={m}".format(m=model), pip_args)
        components = model.split("-")
        model_name = "".join(components[:-1])
        version = components[-1]
        dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
    else:
        shortcuts = get_json(about.__shortcuts__, "available shortcuts")
        model_name = shortcuts.get(model, model)
        compatibility = get_compatibility()
        version = get_version(model_name, compatibility)
        dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
        dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args)
    if dl != 0:  # if download subprocess doesn't return 0, exit
        sys.exit(dl)
    try:
        # Get package path here because link uses
        # pip.get_installed_distributions() to check if model is a
        # package, which fails if model was just installed via
        # subprocess
        package_path = get_package_path(model_name)
        link(model_name, model, force=True, model_path=package_path)
    except:  # noqa: E722
        # Dirty, but since spacy.download and the auto-linking is
        # mostly a convenience wrapper, it's best to show a success
        # message and loading instructions, even if linking fails.
        msg.warn(
            "Download successful but linking failed",
            "Creating a shortcut link for 'en' didn't work (maybe you "
            "don't have admin permissions?), but you can still load the "
            "model via its full package name: "
            "nlp = spacy.load('{}')".format(model_name),
        )
    msg.good(
        "Download and installation successful",
        "You can now load the model via spacy.load('{}')".format(model_name),
    )
    # Only create symlink if the model is installed via a shortcut like 'en'.
    # There's no real advantage over an additional symlink for en_core_web_sm
    # and if anything, it's more error prone and causes more confusion.
    if model in shortcuts:
        try:
            # Get package path here because link uses
            # pip.get_installed_distributions() to check if model is a
            # package, which fails if model was just installed via
            # subprocess
            package_path = get_package_path(model_name)
            link(model_name, model, force=True, model_path=package_path)
        except:  # noqa: E722
            # Dirty, but since spacy.download and the auto-linking is
            # mostly a convenience wrapper, it's best to show a success
            # message and loading instructions, even if linking fails.
            msg.warn(
                "Download successful but linking failed",
                "Creating a shortcut link for '{}' didn't work (maybe you "
                "don't have admin permissions?), but you can still load "
                "the model via its full package name: "
                "nlp = spacy.load('{}')".format(model, model_name),
            )


def get_json(url, desc):
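In the direct-download branch above, a pinned name such as `en_core_web_sm-2.1.0` is split into package name and version and filled into the URL template. A quick standalone check of that string handling (the model name and version are illustrative):

# Example pinned model name passed with --direct (illustrative values).
model = "en_core_web_sm-2.1.0"

dl_tpl = "{m}-{v}/{m}-{v}.tar.gz#egg={m}=={v}"
components = model.split("-")
model_name = "".join(components[:-1])
version = components[-1]

print(model_name)  # en_core_web_sm
print(version)     # 2.1.0
print(dl_tpl.format(m=model_name, v=version))
# en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0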
@@ -1,4 +1,11 @@
# coding: utf8
"""
Helpers for Python and platform compatibility. To distinguish them from
the builtin functions, replacement functions are suffixed with an underscore,
e.g. `unicode_`.

DOCS: https://spacy.io/api/top-level#compat
"""
from __future__ import unicode_literals

import os

@@ -64,19 +71,23 @@ elif is_python3:


def b_to_str(b_str):
    """Convert a bytes object to a string.

    b_str (bytes): The object to convert.
    RETURNS (unicode): The converted string.
    """
    if is_python2:
        return b_str
    # important: if no encoding is set, string becomes "b'...'"
    # Important: if no encoding is set, string becomes "b'...'"
    return str(b_str, encoding="utf8")


def getattr_(obj, name, *default):
    if is_python3 and isinstance(name, bytes):
        name = name.decode("utf8")
    return getattr(obj, name, *default)


def symlink_to(orig, dest):
    """Create a symlink. Used for model shortcut links.

    orig (unicode / Path): The origin path.
    dest (unicode / Path): The destination path of the symlink.
    """
    if is_windows:
        import subprocess

@@ -86,6 +97,10 @@ def symlink_to(orig, dest):


def symlink_remove(link):
    """Remove a symlink. Used for model shortcut links.

    link (unicode / Path): The path to the symlink.
    """
    # https://stackoverflow.com/q/26554135/6400719
    if os.path.isdir(path2str(link)) and is_windows:
        # this should only be on Py2.7 and windows

@@ -95,6 +110,18 @@ def symlink_remove(link):


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Check if a specific configuration of Python version and operating system
    matches the user's setup. Mostly used to display targeted error messages.

    python2 (bool): spaCy is executed with Python 2.x.
    python3 (bool): spaCy is executed with Python 3.x.
    windows (bool): spaCy is executed on Windows.
    linux (bool): spaCy is executed on Linux.
    osx (bool): spaCy is executed on OS X or macOS.
    RETURNS (bool): Whether the configuration matches the user's platform.

    DOCS: https://spacy.io/api/top-level#compat.is_config
    """
    return (
        python2 in (None, is_python2)
        and python3 in (None, is_python3)

@@ -104,19 +131,14 @@ def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    )


def normalize_string_keys(old):
    """Given a dictionary, make sure keys are unicode strings, not bytes."""
    new = {}
    for key, value in old.items():
        if isinstance(key, bytes_):
            new[key.decode("utf8")] = value
        else:
            new[key] = value
    return new


def import_file(name, loc):
    loc = str(loc)
    """Import module from a file. Used to load models from a directory.

    name (unicode): Name of module to load.
    loc (unicode / Path): Path to the file.
    RETURNS: The loaded module.
    """
    loc = path2str(loc)
    if is_python_pre_3_5:
        import imp
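The `is_config` docstring above describes a check where `None` means "don't care" and any non-`None` constraint must match the current interpreter and platform. A minimal standalone sketch of that matching logic (the platform flags here are simplified approximations of spaCy's own module-level constants):

import sys

# Simplified platform flags; assumptions standing in for spaCy's compat constants.
is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith("win")
is_linux = sys.platform.startswith("linux")
is_osx = sys.platform == "darwin"


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    # None means "don't care"; every explicit constraint must match.
    return (
        python2 in (None, is_python2)
        and python3 in (None, is_python3)
        and windows in (None, is_windows)
        and linux in (None, is_linux)
        and osx in (None, is_osx)
    )


print(is_config(python3=True))                # True on any Python 3 setup
print(is_config(python2=True, windows=True))  # True only on Python 2 + Windows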
@@ -1,4 +1,10 @@
# coding: utf8
"""
spaCy's built in visualization suite for dependencies and named entities.

DOCS: https://spacy.io/api/top-level#displacy
USAGE: https://spacy.io/usage/visualizers
"""
from __future__ import unicode_literals

from .render import DependencyRenderer, EntityRenderer

@@ -25,6 +31,9 @@ def render(
    options (dict): Visualiser-specific options, e.g. colors.
    manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
    RETURNS (unicode): Rendered HTML markup.

    DOCS: https://spacy.io/api/top-level#displacy.render
    USAGE: https://spacy.io/usage/visualizers
    """
    factories = {
        "dep": (DependencyRenderer, parse_deps),

@@ -71,6 +80,9 @@ def serve(
    manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
    port (int): Port to serve visualisation.
    host (unicode): Host to serve visualisation.

    DOCS: https://spacy.io/api/top-level#displacy.serve
    USAGE: https://spacy.io/usage/visualizers
    """
    from wsgiref import simple_server
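The `manual` flag documented above lets callers pass pre-computed data instead of a `Doc`. A short usage sketch that renders a dependency visualisation from a plain dict, assuming spaCy is installed (the words and arcs are made up):

from spacy import displacy

# Minimal hand-built parse in the "manual" format: a dict of words and arcs.
parsed = {
    "words": [{"text": "This", "tag": "DT"}, {"text": "works", "tag": "VBZ"}],
    "arcs": [{"start": 0, "end": 1, "label": "nsubj", "dir": "left"}],
}

# manual=True tells displacy not to parse a Doc and to use the dict as-is.
html = displacy.render(parsed, style="dep", manual=True)
print(html[:60])  # rendered markup as a unicode string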
@@ -70,6 +70,12 @@ class Warnings(object):
    W013 = ("As of v2.1.0, {obj}.merge is deprecated. Please use the more "
            "efficient and less error-prone Doc.retokenize context manager "
            "instead.")
    W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization "
            "methods is deprecated and should be replaced with `exclude`. This "
            "makes it consistent with the other serializable objects.")
    W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from "
            "being serialized or deserialized is deprecated. Please use the "
            "`exclude` argument instead. For example: exclude=['{arg}'].")


@add_codes

@@ -338,6 +344,20 @@ class Errors(object):
            "or with a getter AND setter.")
    E120 = ("Can't set custom extension attributes during retokenization. "
            "Expected dict mapping attribute names to values, but got: {value}")
    E121 = ("Can't bulk merge spans. Attribute length {attr_len} should be "
            "equal to span length ({span_len}).")
    E122 = ("Cannot find token to be split. Did it get merged?")
    E123 = ("Cannot find head of token to be split. Did it get merged?")
    E124 = ("Cannot read from file: {path}. Supported formats: {formats}")
    E125 = ("Unexpected value: {value}")
    E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. "
            "This is likely a bug in spaCy, so feel free to open an issue.")
    E127 = ("Cannot create phrase pattern representation for length 0. This "
            "is likely a bug in spaCy.")
    E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword "
            "arguments to exclude fields from being serialized or deserialized "
            "is now deprecated. Please use the `exclude` argument instead. "
            "For example: exclude=['{arg}'].")


@add_codes
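W013 points users from the deprecated `merge` methods to the `Doc.retokenize` context manager. A short usage sketch of that replacement, assuming spaCy is installed and using a blank English pipeline (no statistical model needed):

import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")

# Doc.retokenize is what W013 recommends instead of doc.merge()/span.merge():
# merge the span "New York" into a single token in place.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([t.text for t in doc])  # ['New York', 'is', 'busy']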
spacy/gold.pyx (199 lines changed)

@@ -14,34 +14,38 @@ from . import _align
from .syntax import nonproj
from .tokens import Doc, Span
from .errors import Errors
from .compat import path2str
from . import util
from .util import minibatch, itershuffle

from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek


punct_re = re.compile(r"\W")


def tags_to_entities(tags):
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag is None:
            continue
        if tag.startswith('O'):
        if tag.startswith("O"):
            # TODO: We shouldn't be getting these malformed inputs. Fix this.
            if start is not None:
                start = None
            continue
        elif tag == '-':
        elif tag == "-":
            continue
        elif tag.startswith('I'):
        elif tag.startswith("I"):
            if start is None:
                raise ValueError(Errors.E067.format(tags=tags[:i+1]))
                raise ValueError(Errors.E067.format(tags=tags[:i + 1]))
            continue
        if tag.startswith('U'):
        if tag.startswith("U"):
            entities.append((tag[2:], i, i))
        elif tag.startswith('B'):
        elif tag.startswith("B"):
            start = i
        elif tag.startswith('L'):
        elif tag.startswith("L"):
            entities.append((tag[2:], start, i))
            start = None
        else:

@@ -60,19 +64,18 @@ def merge_sents(sents):
        m_deps[3].extend(head + i for head in heads)
        m_deps[4].extend(labels)
        m_deps[5].extend(ner)
        m_brackets.extend((b['first'] + i, b['last'] + i, b['label'])
        m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
                          for b in brackets)
        i += len(ids)
    return [(m_deps, m_brackets)]


punct_re = re.compile(r'\W')
def align(cand_words, gold_words):
    if cand_words == gold_words:
        alignment = numpy.arange(len(cand_words))
        return 0, alignment, alignment, {}, {}
    cand_words = [w.replace(' ', '').lower() for w in cand_words]
    gold_words = [w.replace(' ', '').lower() for w in gold_words]
    cand_words = [w.replace(" ", "").lower() for w in cand_words]
    gold_words = [w.replace(" ", "").lower() for w in gold_words]
    cost, i2j, j2i, matrix = _align.align(cand_words, gold_words)
    i2j_multi, j2i_multi = _align.multi_align(i2j, j2i, [len(w) for w in cand_words],
                                              [len(w) for w in gold_words])

@@ -89,7 +92,10 @@ def align(cand_words, gold_words):

class GoldCorpus(object):
    """An annotated corpus, using the JSON file format. Manages
    annotations for tagging, dependency parsing and NER."""
    annotations for tagging, dependency parsing and NER.

    DOCS: https://spacy.io/api/goldcorpus
    """
    def __init__(self, train, dev, gold_preproc=False, limit=None):
        """Create a GoldCorpus.

@@ -101,12 +107,10 @@ class GoldCorpus(object):
        if isinstance(train, str) or isinstance(train, Path):
            train = self.read_tuples(self.walk_corpus(train))
            dev = self.read_tuples(self.walk_corpus(dev))

        # Write temp directory with one doc per file, so we can shuffle
        # and stream
        # Write temp directory with one doc per file, so we can shuffle and stream
        self.tmp_dir = Path(tempfile.mkdtemp())
        self.write_msgpack(self.tmp_dir / 'train', train, limit=self.limit)
        self.write_msgpack(self.tmp_dir / 'dev', dev, limit=self.limit)
        self.write_msgpack(self.tmp_dir / "train", train, limit=self.limit)
        self.write_msgpack(self.tmp_dir / "dev", dev, limit=self.limit)

    def __del__(self):
        shutil.rmtree(self.tmp_dir)

@@ -117,7 +121,7 @@ class GoldCorpus(object):
        directory.mkdir()
        n = 0
        for i, doc_tuple in enumerate(doc_tuples):
            srsly.write_msgpack(directory / '{}.msg'.format(i), [doc_tuple])
            srsly.write_msgpack(directory / "{}.msg".format(i), [doc_tuple])
            n += len(doc_tuple[1])
            if limit and n >= limit:
                break

@@ -134,11 +138,11 @@ class GoldCorpus(object):
            if str(path) in seen:
                continue
            seen.add(str(path))
            if path.parts[-1].startswith('.'):
            if path.parts[-1].startswith("."):
                continue
            elif path.is_dir():
                paths.extend(path.iterdir())
            elif path.parts[-1].endswith('.json'):
            elif path.parts[-1].endswith(".json"):
                locs.append(path)
        return locs

@@ -147,13 +151,15 @@ class GoldCorpus(object):
        i = 0
        for loc in locs:
            loc = util.ensure_path(loc)
            if loc.parts[-1].endswith('json'):
            if loc.parts[-1].endswith("json"):
                gold_tuples = read_json_file(loc)
            elif loc.parts[-1].endswith('msg'):
            elif loc.parts[-1].endswith("jsonl"):
                gold_tuples = srsly.read_jsonl(loc)
            elif loc.parts[-1].endswith("msg"):
                gold_tuples = srsly.read_msgpack(loc)
            else:
                msg = "Cannot read from file: %s. Supported formats: .json, .msg"
                raise ValueError(msg % loc)
                supported = ("json", "jsonl", "msg")
                raise ValueError(Errors.E124.format(path=path2str(loc), formats=supported))
            for item in gold_tuples:
                yield item
                i += len(item[1])

@@ -162,12 +168,12 @@ class GoldCorpus(object):

    @property
    def dev_tuples(self):
        locs = (self.tmp_dir / 'dev').iterdir()
        locs = (self.tmp_dir / "dev").iterdir()
        yield from self.read_tuples(locs, limit=self.limit)

    @property
    def train_tuples(self):
        locs = (self.tmp_dir / 'train').iterdir()
        locs = (self.tmp_dir / "train").iterdir()
        yield from self.read_tuples(locs, limit=self.limit)

    def count_train(self):

@@ -193,8 +199,7 @@ class GoldCorpus(object):
        yield from gold_docs

    def dev_docs(self, nlp, gold_preproc=False):
        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples,
                                        gold_preproc=gold_preproc)
        gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc)
        yield from gold_docs

    @classmethod

@@ -205,32 +210,29 @@ class GoldCorpus(object):
                raw_text = None
            else:
                paragraph_tuples = merge_sents(paragraph_tuples)
            docs = cls._make_docs(nlp, raw_text, paragraph_tuples,
                                  gold_preproc, noise_level=noise_level)
            docs = cls._make_docs(nlp, raw_text, paragraph_tuples, gold_preproc,
                                  noise_level=noise_level)
            golds = cls._make_golds(docs, paragraph_tuples, make_projective)
            for doc, gold in zip(docs, golds):
                if (not max_length) or len(doc) < max_length:
                    yield doc, gold

    @classmethod
    def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc,
                   noise_level=0.0):
    def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0):
        if raw_text is not None:
            raw_text = add_noise(raw_text, noise_level)
            return [nlp.make_doc(raw_text)]
        else:
            return [Doc(nlp.vocab,
                        words=add_noise(sent_tuples[1], noise_level))
            return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level))
                    for (sent_tuples, brackets) in paragraph_tuples]

    @classmethod
    def _make_golds(cls, docs, paragraph_tuples, make_projective):
        if len(docs) != len(paragraph_tuples):
            raise ValueError(Errors.E070.format(n_docs=len(docs),
                                                n_annots=len(paragraph_tuples)))
            n_annots = len(paragraph_tuples)
            raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
        if len(docs) == 1:
            return [GoldParse.from_annot_tuples(docs[0],
                                                paragraph_tuples[0][0],
            return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0],
                                                make_projective=make_projective)]
        else:
            return [GoldParse.from_annot_tuples(doc, sent_tuples,

@@ -247,18 +249,18 @@ def add_noise(orig, noise_level):
        corrupted = [w for w in corrupted if w]
        return corrupted
    else:
        return ''.join(_corrupt(c, noise_level) for c in orig)
        return "".join(_corrupt(c, noise_level) for c in orig)


def _corrupt(c, noise_level):
    if random.random() >= noise_level:
        return c
    elif c == ' ':
        return '\n'
    elif c == '\n':
        return ' '
    elif c in ['.', "'", "!", "?", ',']:
        return ''
    elif c == " ":
        return "\n"
    elif c == "\n":
        return " "
    elif c in [".", "'", "!", "?", ","]:
        return ""
    else:
        return c.lower()

@@ -284,30 +286,30 @@ def json_to_tuple(doc):
    YIELDS (tuple): The reformatted data.
    """
    paragraphs = []
    for paragraph in doc['paragraphs']:
    for paragraph in doc["paragraphs"]:
        sents = []
        for sent in paragraph['sentences']:
        for sent in paragraph["sentences"]:
            words = []
            ids = []
            tags = []
            heads = []
            labels = []
            ner = []
            for i, token in enumerate(sent['tokens']):
                words.append(token['orth'])
            for i, token in enumerate(sent["tokens"]):
                words.append(token["orth"])
                ids.append(i)
                tags.append(token.get('tag', '-'))
                heads.append(token.get('head', 0) + i)
                labels.append(token.get('dep', ''))
                tags.append(token.get('tag', "-"))
                heads.append(token.get("head", 0) + i)
                labels.append(token.get("dep", ""))
                # Ensure ROOT label is case-insensitive
                if labels[-1].lower() == 'root':
                    labels[-1] = 'ROOT'
                ner.append(token.get('ner', '-'))
                if labels[-1].lower() == "root":
                    labels[-1] = "ROOT"
                ner.append(token.get("ner", "-"))
            sents.append([
                [ids, words, tags, heads, labels, ner],
                sent.get('brackets', [])])
                sent.get("brackets", [])])
        if sents:
            yield [paragraph.get('raw', None), sents]
            yield [paragraph.get("raw", None), sents]


def read_json_file(loc, docs_filter=None, limit=None):

@@ -329,7 +331,7 @@ def _json_iterate(loc):
    # It's okay to read in the whole file -- just don't parse it into JSON.
    cdef bytes py_raw
    loc = util.ensure_path(loc)
    with loc.open('rb') as file_:
    with loc.open("rb") as file_:
        py_raw = file_.read()
    raw = <char*>py_raw
    cdef int square_depth = 0

@@ -339,11 +341,11 @@ def _json_iterate(loc):
    cdef int start = -1
    cdef char c
    cdef char quote = ord('"')
    cdef char backslash = ord('\\')
    cdef char open_square = ord('[')
    cdef char close_square = ord(']')
    cdef char open_curly = ord('{')
    cdef char close_curly = ord('}')
    cdef char backslash = ord("\\")
    cdef char open_square = ord("[")
    cdef char close_square = ord("]")
    cdef char open_curly = ord("{")
    cdef char close_curly = ord("}")
    for i in range(len(py_raw)):
        c = raw[i]
        if escape:

@@ -368,7 +370,7 @@ def _json_iterate(loc):
        elif c == close_curly:
            curly_depth -= 1
        if square_depth == 1 and curly_depth == 0:
            py_str = py_raw[start : i+1].decode('utf8')
            py_str = py_raw[start : i + 1].decode("utf8")
            try:
                yield srsly.json_loads(py_str)
            except Exception:

@@ -388,7 +390,7 @@ def iob_to_biluo(tags):


def _consume_os(tags):
    while tags and tags[0] == 'O':
    while tags and tags[0] == "O":
        yield tags.pop(0)

@@ -396,24 +398,27 @@ def _consume_ent(tags):
    if not tags:
        return []
    tag = tags.pop(0)
    target_in = 'I' + tag[1:]
    target_last = 'L' + tag[1:]
    target_in = "I" + tag[1:]
    target_last = "L" + tag[1:]
    length = 1
    while tags and tags[0] in {target_in, target_last}:
        length += 1
        tags.pop(0)
    label = tag[2:]
    if length == 1:
        return ['U-' + label]
        return ["U-" + label]
    else:
        start = 'B-' + label
        end = 'L-' + label
        middle = ['I-%s' % label for _ in range(1, length - 1)]
        start = "B-" + label
        end = "L-" + label
        middle = ["I-%s" % label for _ in range(1, length - 1)]
        return [start] + middle + [end]


cdef class GoldParse:
    """Collection for training annotations."""
    """Collection for training annotations.

    DOCS: https://spacy.io/api/goldparse
    """
    @classmethod
    def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
        _, words, tags, heads, deps, entities = annot_tuples
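`_consume_os` and `_consume_ent` are the helpers behind `iob_to_biluo`, which rewrites IOB entity tags into the BILUO scheme used internally: single-token entities become `U-*`, longer ones become `B-* ... L-*`. A quick usage sketch with a made-up tag sequence, assuming spaCy is installed:

from spacy.gold import iob_to_biluo

# A single-token entity becomes U-*; a multi-token entity becomes B-*/L-*
# (with I-* in between for longer spans); O tags pass through unchanged.
tags = ["I-PER", "O", "B-LOC", "I-LOC"]
print(iob_to_biluo(tags))  # ['U-PER', 'O', 'B-LOC', 'L-LOC']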
@@ -456,13 +461,13 @@ cdef class GoldParse:
        if deps is None:
            deps = [None for _ in doc]
        if entities is None:
            entities = ['-' for _ in doc]
            entities = ["-" for _ in doc]
        elif len(entities) == 0:
            entities = ['O' for _ in doc]
            entities = ["O" for _ in doc]
        else:
            # Translate the None values to '-', to make processing easier.
            # See Issue #2603
            entities = [(ent if ent is not None else '-') for ent in entities]
            entities = [(ent if ent is not None else "-") for ent in entities]
            if not isinstance(entities[0], basestring):
                # Assume we have entities specified by character offset.
                entities = biluo_tags_from_offsets(doc, entities)

@@ -508,10 +513,10 @@ cdef class GoldParse:
        for i, gold_i in enumerate(self.cand_to_gold):
            if doc[i].text.isspace():
                self.words[i] = doc[i].text
                self.tags[i] = '_SP'
                self.tags[i] = "_SP"
                self.heads[i] = None
                self.labels[i] = None
                self.ner[i] = 'O'
                self.ner[i] = "O"
            if gold_i is None:
                if i in i2j_multi:
                    self.words[i] = words[i2j_multi[i]]

@@ -521,7 +526,7 @@ cdef class GoldParse:
                    # Set next word in multi-token span as head, until last
                    if not is_last:
                        self.heads[i] = i+1
                        self.labels[i] = 'subtok'
                        self.labels[i] = "subtok"
                    else:
                        self.heads[i] = self.gold_to_cand[heads[i2j_multi[i]]]
                        self.labels[i] = deps[i2j_multi[i]]

@@ -530,24 +535,24 @@ cdef class GoldParse:
                    # BILOU tags. We can't have BB or LL etc.
                    # Case 1: O -- easy.
                    ner_tag = entities[i2j_multi[i]]
                    if ner_tag == 'O':
                        self.ner[i] = 'O'
                    if ner_tag == "O":
                        self.ner[i] = "O"
                    # Case 2: U. This has to become a B I* L sequence.
                    elif ner_tag.startswith('U-'):
                    elif ner_tag.startswith("U-"):
                        if is_first:
                            self.ner[i] = ner_tag.replace('U-', 'B-', 1)
                            self.ner[i] = ner_tag.replace("U-", "B-", 1)
                        elif is_last:
                            self.ner[i] = ner_tag.replace('U-', 'L-', 1)
                            self.ner[i] = ner_tag.replace("U-", "L-", 1)
                        else:
                            self.ner[i] = ner_tag.replace('U-', 'I-', 1)
                            self.ner[i] = ner_tag.replace("U-", "I-", 1)
                    # Case 3: L. If not last, change to I.
                    elif ner_tag.startswith('L-'):
                    elif ner_tag.startswith("L-"):
                        if is_last:
                            self.ner[i] = ner_tag
                        else:
                            self.ner[i] = ner_tag.replace('L-', 'I-', 1)
                            self.ner[i] = ner_tag.replace("L-", "I-", 1)
                    # Case 4: I. Stays correct
                    elif ner_tag.startswith('I-'):
                    elif ner_tag.startswith("I-"):
                        self.ner[i] = ner_tag
            else:
                self.words[i] = words[gold_i]

@@ -608,7 +613,7 @@ def docs_to_json(docs, underscore=None):
    return [doc.to_json(underscore=underscore) for doc in docs]


def biluo_tags_from_offsets(doc, entities, missing='O'):
def biluo_tags_from_offsets(doc, entities, missing="O"):
    """Encode labelled spans into per-token tags, using the
    Begin/In/Last/Unit/Out scheme (BILUO).

@@ -631,11 +636,11 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
        >>> entities = [(len('I like '), len('I like London'), 'LOC')]
        >>> doc = nlp.tokenizer(text)
        >>> tags = biluo_tags_from_offsets(doc, entities)
        >>> assert tags == ['O', 'O', 'U-LOC', 'O']
        >>> assert tags == ["O", "O", 'U-LOC', "O"]
    """
    starts = {token.idx: token.i for token in doc}
    ends = {token.idx+len(token): token.i for token in doc}
    biluo = ['-' for _ in doc]
    ends = {token.idx + len(token): token.i for token in doc}
    biluo = ["-" for _ in doc]
    # Handle entity cases
    for start_char, end_char, label in entities:
        start_token = starts.get(start_char)

@@ -643,19 +648,19 @@ def biluo_tags_from_offsets(doc, entities, missing='O'):
        # Only interested if the tokenization is correct
        if start_token is not None and end_token is not None:
            if start_token == end_token:
                biluo[start_token] = 'U-%s' % label
                biluo[start_token] = "U-%s" % label
            else:
                biluo[start_token] = 'B-%s' % label
                biluo[start_token] = "B-%s" % label
                for i in range(start_token+1, end_token):
                    biluo[i] = 'I-%s' % label
                biluo[end_token] = 'L-%s' % label
                    biluo[i] = "I-%s" % label
                biluo[end_token] = "L-%s" % label
    # Now distinguish the O cases from ones where we miss the tokenization
    entity_chars = set()
    for start_char, end_char, label in entities:
        for i in range(start_char, end_char):
            entity_chars.add(i)
    for token in doc:
        for i in range(token.idx, token.idx+len(token)):
        for i in range(token.idx, token.idx + len(token)):
            if i in entity_chars:
                break
        else:
@@ -697,4 +702,4 @@ def offsets_from_biluo_tags(doc, tags):


def is_punct_label(label):
    return label == 'P' or label.lower() == 'punct'
    return label == "P" or label.lower() == "punct"
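`offsets_from_biluo_tags` is the inverse of `biluo_tags_from_offsets` shown above: it turns per-token BILUO tags back into character offsets. A short roundtrip sketch, assuming spaCy is installed and using a blank English pipeline:

import spacy
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I like London")

# Character offsets -> per-token BILUO tags ...
tags = biluo_tags_from_offsets(doc, [(7, 13, "LOC")])
print(tags)  # ['O', 'O', 'U-LOC']

# ... and back again to (start_char, end_char, label) offsets.
print(offsets_from_biluo_tags(doc, tags))  # [(7, 13, 'LOC')]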
@@ -21,7 +21,9 @@ _suffixes = (
        r"(?<=[0-9])%",  # 4% -> ["4", "%"]
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9])(?:{u})".format(u=UNITS),
        r"(?<=[0-9{al}{e}(?:{q})])\.".format(al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES),
        r"(?<=[0-9{al}{e}(?:{q})])\.".format(
            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES
        ),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)

@@ -379,7 +379,7 @@ _regular_exp = [
_regular_exp += [
    "^{prefix}[{hyphen}][{al}][{hyphen}{al}{elision}]*$".format(
        prefix=p,
        hyphen=HYPHENS, # putting the - first in the [] range avoids having to use a backslash
        hyphen=HYPHENS,  # putting the - first in the [] range avoids having to use a backslash
        elision=ELISION,
        al=ALPHA_LOWER,
    )

@@ -423,5 +423,5 @@ _regular_exp.append(URL_PATTERN)

TOKENIZER_EXCEPTIONS = _exc
TOKEN_MATCH = re.compile(
    "|".join("(?:{})".format(m) for m in _regular_exp), re.IGNORECASE
    "|".join("(?:{})".format(m) for m in _regular_exp), re.IGNORECASE | re.UNICODE
).match
@@ -28,7 +28,7 @@ from .lang.punctuation import TOKENIZER_INFIXES
from .lang.tokenizer_exceptions import TOKEN_MATCH
from .lang.tag_map import TAG_MAP
from .lang.lex_attrs import LEX_ATTRS, is_stop
from .errors import Errors
from .errors import Errors, Warnings, deprecation_warning
from . import util
from . import about

@@ -103,8 +103,9 @@ class Language(object):
    Defaults (class): Settings, data and factory methods for creating the `nlp`
        object and processing pipeline.
    lang (unicode): Two-letter language ID, i.e. ISO code.
    """

    DOCS: https://spacy.io/api/language
    """
    Defaults = BaseDefaults
    lang = None

@@ -698,124 +699,114 @@ class Language(object):
            self.tokenizer._reset_cache(keys)
        nr_seen = 0

    def to_disk(self, path, disable=tuple()):
    def to_disk(self, path, exclude=tuple(), disable=None):
        """Save the current state to a directory. If a model is loaded, this
        will include the model.

        path (unicode or Path): A path to a directory, which will be created if
            it doesn't exist. Paths may be strings or `Path`-like objects.
        disable (list): Names of pipeline components to disable and prevent
            from being saved.
        path (unicode or Path): Path to a directory, which will be created if
            it doesn't exist.
        exclude (list): Names of components or serialization fields to exclude.

        EXAMPLE:
            >>> nlp.to_disk('/path/to/models')
        DOCS: https://spacy.io/api/language#to_disk
        """
        if disable is not None:
            deprecation_warning(Warnings.W014)
            exclude = disable
        path = util.ensure_path(path)
        serializers = OrderedDict(
            (
                ("tokenizer", lambda p: self.tokenizer.to_disk(p, vocab=False)),
                ("meta.json", lambda p: p.open("w").write(srsly.json_dumps(self.meta))),
            )
        )
        serializers = OrderedDict()
        serializers["tokenizer"] = lambda p: self.tokenizer.to_disk(p, exclude=["vocab"])
        serializers["meta.json"] = lambda p: p.open("w").write(srsly.json_dumps(self.meta))
        for name, proc in self.pipeline:
            if not hasattr(proc, "name"):
                continue
            if name in disable:
            if name in exclude:
                continue
            if not hasattr(proc, "to_disk"):
                continue
            serializers[name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
            serializers[name] = lambda p, proc=proc: proc.to_disk(p, exclude=["vocab"])
        serializers["vocab"] = lambda p: self.vocab.to_disk(p)
        util.to_disk(path, serializers, {p: False for p in disable})
        util.to_disk(path, serializers, exclude)

    def from_disk(self, path, disable=tuple()):
    def from_disk(self, path, exclude=tuple(), disable=None):
        """Loads state from a directory. Modifies the object in place and
        returns it. If the saved `Language` object contains a model, the
        model will be loaded.

        path (unicode or Path): A path to a directory. Paths may be either
            strings or `Path`-like objects.
        disable (list): Names of the pipeline components to disable.
        path (unicode or Path): A path to a directory.
        exclude (list): Names of components or serialization fields to exclude.
        RETURNS (Language): The modified `Language` object.

        EXAMPLE:
            >>> from spacy.language import Language
            >>> nlp = Language().from_disk('/path/to/models')
        DOCS: https://spacy.io/api/language#from_disk
        """
        if disable is not None:
            deprecation_warning(Warnings.W014)
            exclude = disable
        path = util.ensure_path(path)
        deserializers = OrderedDict(
            (
                ("meta.json", lambda p: self.meta.update(srsly.read_json(p))),
                (
                    "vocab",
                    lambda p: (
                        self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
                    ),
                ),
                ("tokenizer", lambda p: self.tokenizer.from_disk(p, vocab=False)),
            )
        )
        deserializers = OrderedDict()
        deserializers["meta.json"] = lambda p: self.meta.update(srsly.read_json(p))
        deserializers["vocab"] = lambda p: self.vocab.from_disk(p) and _fix_pretrained_vectors_name(self)
        deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
        for name, proc in self.pipeline:
            if name in disable:
            if name in exclude:
                continue
            if not hasattr(proc, "from_disk"):
                continue
            deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
        exclude = {p: False for p in disable}
        if not (path / "vocab").exists():
            exclude["vocab"] = True
            deserializers[name] = lambda p, proc=proc: proc.from_disk(p, exclude=["vocab"])
        if not (path / "vocab").exists() and "vocab" not in exclude:
            # Convert to list here in case exclude is (default) tuple
            exclude = list(exclude) + ["vocab"]
        util.from_disk(path, deserializers, exclude)
        self._path = path
        return self

    def to_bytes(self, disable=[], **exclude):
    def to_bytes(self, exclude=tuple(), disable=None, **kwargs):
        """Serialize the current state to a binary string.

        disable (list): Nameds of pipeline components to disable and prevent
            from being serialized.
        exclude (list): Names of components or serialization fields to exclude.
        RETURNS (bytes): The serialized form of the `Language` object.

        DOCS: https://spacy.io/api/language#to_bytes
        """
        serializers = OrderedDict(
            (
                ("vocab", lambda: self.vocab.to_bytes()),
                ("tokenizer", lambda: self.tokenizer.to_bytes(vocab=False)),
                ("meta", lambda: srsly.json_dumps(self.meta)),
            )
        )
        for i, (name, proc) in enumerate(self.pipeline):
            if name in disable:
        if disable is not None:
            deprecation_warning(Warnings.W014)
            exclude = disable
        serializers = OrderedDict()
        serializers["vocab"] = lambda: self.vocab.to_bytes()
        serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"])
        serializers["meta.json"] = lambda: srsly.json_dumps(self.meta)
        for name, proc in self.pipeline:
            if name in exclude:
                continue
            if not hasattr(proc, "to_bytes"):
                continue
            serializers[i] = lambda proc=proc: proc.to_bytes(vocab=False)
            serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"])
        exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
        return util.to_bytes(serializers, exclude)

    def from_bytes(self, bytes_data, disable=[]):
    def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs):
        """Load state from a binary string.

        bytes_data (bytes): The data to load from.
        disable (list): Names of the pipeline components to disable.
        exclude (list): Names of components or serialization fields to exclude.
        RETURNS (Language): The `Language` object.

        DOCS: https://spacy.io/api/language#from_bytes
        """
        deserializers = OrderedDict(
            (
                ("meta", lambda b: self.meta.update(srsly.json_loads(b))),
                (
                    "vocab",
                    lambda b: (
                        self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
                    ),
                ),
                ("tokenizer", lambda b: self.tokenizer.from_bytes(b, vocab=False)),
            )
        )
        for i, (name, proc) in enumerate(self.pipeline):
            if name in disable:
        if disable is not None:
            deprecation_warning(Warnings.W014)
            exclude = disable
        deserializers = OrderedDict()
        deserializers["meta.json"] = lambda b: self.meta.update(srsly.json_loads(b))
        deserializers["vocab"] = lambda b: self.vocab.from_bytes(b) and _fix_pretrained_vectors_name(self)
        deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes(b, exclude=["vocab"])
        for name, proc in self.pipeline:
            if name in exclude:
                continue
            if not hasattr(proc, "from_bytes"):
                continue
            deserializers[i] = lambda b, proc=proc: proc.from_bytes(b, vocab=False)
            util.from_bytes(bytes_data, deserializers, {})
            deserializers[name] = lambda b, proc=proc: proc.from_bytes(b, exclude=["vocab"])
        exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
        util.from_bytes(bytes_data, deserializers, exclude)
        return self

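With the new signatures, callers pass `exclude` instead of the deprecated `disable` keyword when serializing or deserializing a pipeline. A short usage sketch, assuming spaCy is installed; the excluded field name is just an example:

import spacy
from spacy.language import Language

nlp = spacy.blank("en")

# Serialize without the vocab, then restore into a fresh Language object,
# excluding the same field on the way back in.
data = nlp.to_bytes(exclude=["vocab"])
nlp2 = Language().from_bytes(data, exclude=["vocab"])

# The old keyword still works for now, but emits deprecation warning W014:
# nlp.to_bytes(disable=["vocab"])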
@@ -6,6 +6,13 @@ from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos


class Lemmatizer(object):
    """
    The Lemmatizer supports simple part-of-speech-sensitive suffix rules and
    lookup tables.

    DOCS: https://spacy.io/api/lemmatizer
    """

    @classmethod
    def load(cls, path, index=None, exc=None, rules=None, lookup=None):
        return cls(index, exc, rules, lookup)
@@ -4,17 +4,19 @@ from __future__ import unicode_literals, print_function

# Compiler crashes on memory view coercion without this. Should report bug.
from cython.view cimport array as cvarray
from libc.string cimport memset
cimport numpy as np
np.import_array()
from libc.string cimport memset

import numpy
from thinc.neural.util import get_array_module

from .typedefs cimport attr_t, flags_t
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT, IS_CURRENCY, IS_OOV
from .attrs cimport PROB
from .attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
from .attrs cimport IS_CURRENCY, IS_OOV, PROB

from .attrs import intify_attrs
from .errors import Errors, Warnings, user_warning

@@ -27,6 +29,8 @@ cdef class Lexeme:
    word-type, as opposed to a word token. It therefore has no part-of-speech
    tag, dependency parse, or lemma (lemmatization depends on the
    part-of-speech tag).

    DOCS: https://spacy.io/api/lexeme
    """
    def __init__(self, Vocab vocab, attr_t orth):
        """Create a Lexeme object.

@@ -115,15 +119,15 @@ cdef class Lexeme:
        RETURNS (float): A scalar similarity score. Higher is more similar.
        """
        # Return 1.0 similarity for matches
        if hasattr(other, 'orth'):
        if hasattr(other, "orth"):
            if self.c.orth == other.orth:
                return 1.0
        elif hasattr(other, '__len__') and len(other) == 1 \
        and hasattr(other[0], 'orth'):
        elif hasattr(other, "__len__") and len(other) == 1 \
        and hasattr(other[0], "orth"):
            if self.c.orth == other[0].orth:
                return 1.0
        if self.vector_norm == 0 or other.vector_norm == 0:
            user_warning(Warnings.W008.format(obj='Lexeme'))
            user_warning(Warnings.W008.format(obj="Lexeme"))
            return 0.0
        vector = self.vector
        xp = get_array_module(vector)

@@ -136,7 +140,7 @@ cdef class Lexeme:
        if (end-start) != sizeof(lex_data.data):
            raise ValueError(Errors.E072.format(length=end-start,
                             bad_length=sizeof(lex_data.data)))
        byte_string = b'\0' * sizeof(lex_data.data)
        byte_string = b"\0" * sizeof(lex_data.data)
        byte_chars = <char*>byte_string
        for i in range(sizeof(lex_data.data)):
            byte_chars[i] = lex_data.data[i]
@@ -1,6 +1,8 @@
# coding: utf8
from __future__ import unicode_literals

from .matcher import Matcher  # noqa: F401
from .phrasematcher import PhraseMatcher  # noqa: F401
from .dependencymatcher import DependencyTreeMatcher  # noqa: F401
from .matcher import Matcher
from .phrasematcher import PhraseMatcher
from .dependencymatcher import DependencyTreeMatcher

__all__ = ["Matcher", "PhraseMatcher", "DependencyTreeMatcher"]
@@ -13,7 +13,7 @@ from .matcher import unpickle_matcher
 from ..errors import Errors


-DELIMITER = '||'
+DELIMITER = "||"
 INDEX_HEAD = 1
 INDEX_RELOP = 0

@@ -55,7 +55,8 @@ cdef class DependencyTreeMatcher:
         return (unpickle_matcher, data, None, None)

     def __len__(self):
-        """Get the number of rules, which are edges ,added to the dependency tree matcher.
+        """Get the number of rules, which are edges, added to the dependency
+        tree matcher.

         RETURNS (int): The number of rules.
         """

@ -73,19 +74,30 @@ cdef class DependencyTreeMatcher:
|
|||
idx = 0
|
||||
visitedNodes = {}
|
||||
for relation in pattern:
|
||||
if 'PATTERN' not in relation or 'SPEC' not in relation:
|
||||
if "PATTERN" not in relation or "SPEC" not in relation:
|
||||
raise ValueError(Errors.E098.format(key=key))
|
||||
if idx == 0:
|
||||
if not('NODE_NAME' in relation['SPEC'] and 'NBOR_RELOP' not in relation['SPEC'] and 'NBOR_NAME' not in relation['SPEC']):
|
||||
if not(
|
||||
"NODE_NAME" in relation["SPEC"]
|
||||
and "NBOR_RELOP" not in relation["SPEC"]
|
||||
and "NBOR_NAME" not in relation["SPEC"]
|
||||
):
|
||||
raise ValueError(Errors.E099.format(key=key))
|
||||
visitedNodes[relation['SPEC']['NODE_NAME']] = True
|
||||
visitedNodes[relation["SPEC"]["NODE_NAME"]] = True
|
||||
else:
|
||||
if not('NODE_NAME' in relation['SPEC'] and 'NBOR_RELOP' in relation['SPEC'] and 'NBOR_NAME' in relation['SPEC']):
|
||||
if not(
|
||||
"NODE_NAME" in relation["SPEC"]
|
||||
and "NBOR_RELOP" in relation["SPEC"]
|
||||
and "NBOR_NAME" in relation["SPEC"]
|
||||
):
|
||||
raise ValueError(Errors.E100.format(key=key))
|
||||
if relation['SPEC']['NODE_NAME'] in visitedNodes or relation['SPEC']['NBOR_NAME'] not in visitedNodes:
|
||||
if (
|
||||
relation["SPEC"]["NODE_NAME"] in visitedNodes
|
||||
or relation["SPEC"]["NBOR_NAME"] not in visitedNodes
|
||||
):
|
||||
raise ValueError(Errors.E101.format(key=key))
|
||||
visitedNodes[relation['SPEC']['NODE_NAME']] = True
|
||||
visitedNodes[relation['SPEC']['NBOR_NAME']] = True
|
||||
visitedNodes[relation["SPEC"]["NODE_NAME"]] = True
|
||||
visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True
|
||||
idx = idx + 1
|
||||
|
||||
def add(self, key, on_match, *patterns):
|
||||
|
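The validation above defines the expected pattern shape: every relation needs "PATTERN" and "SPEC", the first node's "SPEC" carries only "NODE_NAME", and later nodes also need "NBOR_RELOP" and "NBOR_NAME" pointing at an already-declared node. The following is a schematic sketch only; the token attributes, node names and the ">" operator (one of the semgrex-style operators mapped further down in this file) are illustrative, and nlp/doc are assumed to be a loaded pipeline and a parsed document:

    from spacy.matcher import DependencyTreeMatcher

    matcher = DependencyTreeMatcher(nlp.vocab)
    pattern = [
        # First node: only NODE_NAME in SPEC, no neighbour relation.
        {"SPEC": {"NODE_NAME": "anchor"}, "PATTERN": {"ORTH": "founded"}},
        # Later nodes: NODE_NAME plus NBOR_RELOP/NBOR_NAME referring back
        # to a node declared earlier in the pattern.
        {"SPEC": {"NODE_NAME": "subject", "NBOR_RELOP": ">", "NBOR_NAME": "anchor"},
         "PATTERN": {"DEP": "nsubj"}},
    ]
    matcher.add("FOUNDED", None, pattern)
    matches = matcher(doc)  # list of (key, [matched token indices]) tuples
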
@ -93,55 +105,46 @@ cdef class DependencyTreeMatcher:
|
|||
if len(pattern) == 0:
|
||||
raise ValueError(Errors.E012.format(key=key))
|
||||
self.validateInput(pattern,key)
|
||||
|
||||
key = self._normalize_key(key)
|
||||
|
||||
_patterns = []
|
||||
for pattern in patterns:
|
||||
token_patterns = []
|
||||
for i in range(len(pattern)):
|
||||
token_pattern = [pattern[i]['PATTERN']]
|
||||
token_pattern = [pattern[i]["PATTERN"]]
|
||||
token_patterns.append(token_pattern)
|
||||
# self.patterns.append(token_patterns)
|
||||
_patterns.append(token_patterns)
|
||||
|
||||
self._patterns.setdefault(key, [])
|
||||
self._callbacks[key] = on_match
|
||||
self._patterns[key].extend(_patterns)
|
||||
|
||||
# Add each node pattern of all the input patterns individually to the matcher.
|
||||
# This enables only a single instance of Matcher to be used.
|
||||
# Add each node pattern of all the input patterns individually to the
|
||||
# matcher. This enables only a single instance of Matcher to be used.
|
||||
# Multiple adds are required to track each node pattern.
|
||||
_keys_to_token_list = []
|
||||
for i in range(len(_patterns)):
|
||||
_keys_to_token = {}
|
||||
# TODO : Better ways to hash edges in pattern?
|
||||
# TODO: Better ways to hash edges in pattern?
|
||||
for j in range(len(_patterns[i])):
|
||||
k = self._normalize_key(unicode(key)+DELIMITER+unicode(i)+DELIMITER+unicode(j))
|
||||
self.token_matcher.add(k,None,_patterns[i][j])
|
||||
k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j))
|
||||
self.token_matcher.add(k, None, _patterns[i][j])
|
||||
_keys_to_token[k] = j
|
||||
_keys_to_token_list.append(_keys_to_token)
|
||||
|
||||
self._keys_to_token.setdefault(key, [])
|
||||
self._keys_to_token[key].extend(_keys_to_token_list)
|
||||
|
||||
_nodes_list = []
|
||||
for pattern in patterns:
|
||||
nodes = {}
|
||||
for i in range(len(pattern)):
|
||||
nodes[pattern[i]['SPEC']['NODE_NAME']]=i
|
||||
nodes[pattern[i]["SPEC"]["NODE_NAME"]] = i
|
||||
_nodes_list.append(nodes)
|
||||
|
||||
self._nodes.setdefault(key, [])
|
||||
self._nodes[key].extend(_nodes_list)
|
||||
# Create an object tree to traverse later on. This data structure
|
||||
# enables easy tree pattern match. Doc-Token based tree cannot be
|
||||
# reused since it is memory-heavy and tightly coupled with the Doc.
|
||||
self.retrieve_tree(patterns, _nodes_list,key)
|
||||
|
||||
# Create an object tree to traverse later on.
|
||||
# This datastructure enable easy tree pattern match.
|
||||
# Doc-Token based tree cannot be reused since it is memory heavy and
|
||||
# tightly coupled with doc
|
||||
self.retrieve_tree(patterns,_nodes_list,key)
|
||||
|
||||
def retrieve_tree(self,patterns,_nodes_list,key):
|
||||
def retrieve_tree(self, patterns, _nodes_list, key):
|
||||
_heads_list = []
|
||||
_root_list = []
|
||||
for i in range(len(patterns)):
|
||||
|
@ -149,31 +152,29 @@ cdef class DependencyTreeMatcher:
|
|||
root = -1
|
||||
for j in range(len(patterns[i])):
|
||||
token_pattern = patterns[i][j]
|
||||
if('NBOR_RELOP' not in token_pattern['SPEC']):
|
||||
heads[j] = ('root',j)
|
||||
if ("NBOR_RELOP" not in token_pattern["SPEC"]):
|
||||
heads[j] = ('root', j)
|
||||
root = j
|
||||
else:
|
||||
heads[j] = (token_pattern['SPEC']['NBOR_RELOP'],_nodes_list[i][token_pattern['SPEC']['NBOR_NAME']])
|
||||
|
||||
heads[j] = (
|
||||
token_pattern["SPEC"]["NBOR_RELOP"],
|
||||
_nodes_list[i][token_pattern["SPEC"]["NBOR_NAME"]]
|
||||
)
|
||||
_heads_list.append(heads)
|
||||
_root_list.append(root)
|
||||
|
||||
_tree_list = []
|
||||
for i in range(len(patterns)):
|
||||
tree = {}
|
||||
for j in range(len(patterns[i])):
|
||||
if(_heads_list[i][j][INDEX_HEAD] == j):
|
||||
continue
|
||||
|
||||
head = _heads_list[i][j][INDEX_HEAD]
|
||||
if(head not in tree):
|
||||
tree[head] = []
|
||||
tree[head].append( (_heads_list[i][j][INDEX_RELOP],j) )
|
||||
tree[head].append((_heads_list[i][j][INDEX_RELOP], j))
|
||||
_tree_list.append(tree)
|
||||
|
||||
self._tree.setdefault(key, [])
|
||||
self._tree[key].extend(_tree_list)
|
||||
|
||||
self._root.setdefault(key, [])
|
||||
self._root[key].extend(_root_list)
|
||||
|
||||
|
@ -199,7 +200,6 @@ cdef class DependencyTreeMatcher:
|
|||
|
||||
def __call__(self, Doc doc):
|
||||
matched_trees = []
|
||||
|
||||
matches = self.token_matcher(doc)
|
||||
for key in list(self._patterns.keys()):
|
||||
_patterns_list = self._patterns[key]
|
||||
|
@ -216,39 +216,51 @@ cdef class DependencyTreeMatcher:
|
|||
id_to_position = {}
|
||||
for i in range(len(_nodes)):
|
||||
id_to_position[i]=[]
|
||||
|
||||
# This could be taken outside to improve running time..?
|
||||
# TODO: This could be taken outside to improve running time..?
|
||||
for match_id, start, end in matches:
|
||||
if match_id in _keys_to_token:
|
||||
id_to_position[_keys_to_token[match_id]].append(start)
|
||||
|
||||
_node_operator_map = self.get_node_operator_map(doc,_tree,id_to_position,_nodes,_root)
|
||||
_node_operator_map = self.get_node_operator_map(
|
||||
doc,
|
||||
_tree,
|
||||
id_to_position,
|
||||
_nodes,_root
|
||||
)
|
||||
length = len(_nodes)
|
||||
if _root in id_to_position:
|
||||
candidates = id_to_position[_root]
|
||||
for candidate in candidates:
|
||||
isVisited = {}
|
||||
self.dfs(candidate,_root,_tree,id_to_position,doc,isVisited,_node_operator_map)
|
||||
# To check if the subtree pattern is completely identified. This is a heuristic.
|
||||
# This is done to reduce the complexity of exponential unordered subtree matching.
|
||||
# Will give approximate matches in some cases.
|
||||
self.dfs(
|
||||
candidate,
|
||||
_root,_tree,
|
||||
id_to_position,
|
||||
doc,
|
||||
isVisited,
|
||||
_node_operator_map
|
||||
)
|
||||
# To check if the subtree pattern is completely
|
||||
# identified. This is a heuristic. This is done to
|
||||
# reduce the complexity of exponential unordered subtree
|
||||
# matching. Will give approximate matches in some cases.
|
||||
if(len(isVisited) == length):
|
||||
matched_trees.append((key,list(isVisited)))
|
||||
|
||||
for i, (ent_id, nodes) in enumerate(matched_trees):
|
||||
on_match = self._callbacks.get(ent_id)
|
||||
if on_match is not None:
|
||||
on_match(self, doc, i, matches)
|
||||
|
||||
return matched_trees
|
||||
|
||||
def dfs(self,candidate,root,tree,id_to_position,doc,isVisited,_node_operator_map):
|
||||
if(root in id_to_position and candidate in id_to_position[root]):
|
||||
# color the node since it is valid
|
||||
if (root in id_to_position and candidate in id_to_position[root]):
|
||||
# Color the node since it is valid
|
||||
isVisited[candidate] = True
|
||||
if root in tree:
|
||||
for root_child in tree[root]:
|
||||
if candidate in _node_operator_map and root_child[INDEX_RELOP] in _node_operator_map[candidate]:
|
||||
if (
|
||||
candidate in _node_operator_map
|
||||
and root_child[INDEX_RELOP] in _node_operator_map[candidate]
|
||||
):
|
||||
candidate_children = _node_operator_map[candidate][root_child[INDEX_RELOP]]
|
||||
for candidate_child in candidate_children:
|
||||
result = self.dfs(
|
||||
|
@ -275,72 +287,68 @@ cdef class DependencyTreeMatcher:
|
|||
for child in tree[node]:
|
||||
all_operators.append(child[INDEX_RELOP])
|
||||
all_operators = list(set(all_operators))
|
||||
|
||||
all_nodes = []
|
||||
for node in all_node_indices:
|
||||
all_nodes = all_nodes + id_to_position[node]
|
||||
all_nodes = list(set(all_nodes))
|
||||
|
||||
for node in all_nodes:
|
||||
_node_operator_map[node] = {}
|
||||
for operator in all_operators:
|
||||
_node_operator_map[node][operator] = []
|
||||
|
||||
# Used to invoke methods for each operator
|
||||
switcher = {
|
||||
'<':self.dep,
|
||||
'>':self.gov,
|
||||
'>>':self.dep_chain,
|
||||
'<<':self.gov_chain,
|
||||
'.':self.imm_precede,
|
||||
'$+':self.imm_right_sib,
|
||||
'$-':self.imm_left_sib,
|
||||
'$++':self.right_sib,
|
||||
'$--':self.left_sib
|
||||
"<": self.dep,
|
||||
">": self.gov,
|
||||
">>": self.dep_chain,
|
||||
"<<": self.gov_chain,
|
||||
".": self.imm_precede,
|
||||
"$+": self.imm_right_sib,
|
||||
"$-": self.imm_left_sib,
|
||||
"$++": self.right_sib,
|
||||
"$--": self.left_sib
|
||||
}
|
||||
for operator in all_operators:
|
||||
for node in all_nodes:
|
||||
_node_operator_map[node][operator] = switcher.get(operator)(doc,node)
|
||||
|
||||
return _node_operator_map
|
||||
|
||||
def dep(self,doc,node):
|
||||
def dep(self, doc, node):
|
||||
return list(doc[node].head)
|
||||
|
||||
def gov(self,doc,node):
|
||||
return list(doc[node].children)
|
||||
|
||||
def dep_chain(self,doc,node):
|
||||
def dep_chain(self, doc, node):
|
||||
return list(doc[node].ancestors)
|
||||
|
||||
def gov_chain(self,doc,node):
|
||||
def gov_chain(self, doc, node):
|
||||
return list(doc[node].subtree)
|
||||
|
||||
def imm_precede(self,doc,node):
|
||||
if node>0:
|
||||
return [doc[node-1]]
|
||||
def imm_precede(self, doc, node):
|
||||
if node > 0:
|
||||
return [doc[node - 1]]
|
||||
return []
|
||||
|
||||
def imm_right_sib(self,doc,node):
|
||||
def imm_right_sib(self, doc, node):
|
||||
for idx in range(list(doc[node].head.children)):
|
||||
if idx == node-1:
|
||||
if idx == node - 1:
|
||||
return [doc[idx]]
|
||||
return []
|
||||
|
||||
def imm_left_sib(self,doc,node):
|
||||
def imm_left_sib(self, doc, node):
|
||||
for idx in range(list(doc[node].head.children)):
|
||||
if idx == node+1:
|
||||
if idx == node + 1:
|
||||
return [doc[idx]]
|
||||
return []
|
||||
|
||||
def right_sib(self,doc,node):
|
||||
def right_sib(self, doc, node):
|
||||
candidate_children = []
|
||||
for idx in range(list(doc[node].head.children)):
|
||||
if idx < node:
|
||||
candidate_children.append(doc[idx])
|
||||
return candidate_children
|
||||
|
||||
def left_sib(self,doc,node):
|
||||
def left_sib(self, doc, node):
|
||||
candidate_children = []
|
||||
for idx in range(list(doc[node].head.children)):
|
||||
if idx > node:
|
||||
|
|
File diff suppressed because it is too large
|
@ -12,7 +12,7 @@ from ..vocab cimport Vocab
|
|||
from ..tokens.doc cimport Doc, get_token_attr
|
||||
from ..typedefs cimport attr_t, hash_t
|
||||
|
||||
from ..errors import Warnings, deprecation_warning, user_warning
|
||||
from ..errors import Errors, Warnings, deprecation_warning, user_warning
|
||||
from ..attrs import FLAG61 as U_ENT
|
||||
from ..attrs import FLAG60 as B2_ENT
|
||||
from ..attrs import FLAG59 as B3_ENT
|
||||
|
@@ -25,6 +25,13 @@ from ..attrs import FLAG41 as I4_ENT


 cdef class PhraseMatcher:
+    """Efficiently match large terminology lists. While the `Matcher` matches
+    sequences based on lists of token descriptions, the `PhraseMatcher` accepts
+    match patterns in the form of `Doc` objects.
+
+    DOCS: https://spacy.io/api/phrasematcher
+    USAGE: https://spacy.io/usage/rule-based-matching#phrasematcher
+    """
     cdef Pool mem
     cdef Vocab vocab
     cdef Matcher matcher

@ -36,7 +43,16 @@ cdef class PhraseMatcher:
|
|||
cdef public object _docs
|
||||
cdef public object _validate
|
||||
|
||||
def __init__(self, Vocab vocab, max_length=0, attr='ORTH', validate=False):
|
||||
def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False):
|
||||
"""Initialize the PhraseMatcher.
|
||||
|
||||
vocab (Vocab): The shared vocabulary.
|
||||
attr (int / unicode): Token attribute to match on.
|
||||
validate (bool): Perform additional validation when patterns are added.
|
||||
RETURNS (PhraseMatcher): The newly constructed object.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#init
|
||||
"""
|
||||
if max_length != 0:
|
||||
deprecation_warning(Warnings.W010)
|
||||
self.mem = Pool()
|
||||
|
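A short usage sketch of the constructor and attribute matching documented above; the model name and terminology list are illustrative:

    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    # Match patterns are Doc objects, so they can be created with nlp.pipe.
    patterns = list(nlp.pipe(["machine learning", "deep learning"]))
    matcher.add("ML_TERMS", None, *patterns)

    doc = nlp("Machine learning is a subfield of AI.")
    for match_id, start, end in matcher(doc):
        print(doc.vocab.strings[match_id], doc[start:end].text)
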
@ -54,7 +70,7 @@ cdef class PhraseMatcher:
|
|||
[{B3_ENT: True}, {I3_ENT: True}, {L3_ENT: True}],
|
||||
[{B4_ENT: True}, {I4_ENT: True}, {I4_ENT: True, "OP": "+"}, {L4_ENT: True}],
|
||||
]
|
||||
self.matcher.add('Candidate', None, *abstract_patterns)
|
||||
self.matcher.add("Candidate", None, *abstract_patterns)
|
||||
self._callbacks = {}
|
||||
self._docs = {}
|
||||
self._validate = validate
|
||||
|
@ -65,6 +81,8 @@ cdef class PhraseMatcher:
|
|||
number of individual patterns.
|
||||
|
||||
RETURNS (int): The number of rules.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#len
|
||||
"""
|
||||
return len(self._docs)
|
||||
|
||||
|
@ -73,6 +91,8 @@ cdef class PhraseMatcher:
|
|||
|
||||
key (unicode): The match ID.
|
||||
RETURNS (bool): Whether the matcher contains rules for this match ID.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#contains
|
||||
"""
|
||||
cdef hash_t ent_id = self.matcher._normalize_key(key)
|
||||
return ent_id in self._callbacks
|
||||
|
@ -88,6 +108,8 @@ cdef class PhraseMatcher:
|
|||
key (unicode): The match ID.
|
||||
on_match (callable): Callback executed on match.
|
||||
*docs (Doc): `Doc` objects representing match patterns.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#add
|
||||
"""
|
||||
cdef Doc doc
|
||||
cdef hash_t ent_id = self.matcher._normalize_key(key)
|
||||
|
@ -112,8 +134,7 @@ cdef class PhraseMatcher:
|
|||
lexeme = self.vocab[attr_value]
|
||||
lexeme.set_flag(tag, True)
|
||||
phrase_key[i] = lexeme.orth
|
||||
phrase_hash = hash64(phrase_key,
|
||||
length * sizeof(attr_t), 0)
|
||||
phrase_hash = hash64(phrase_key, length * sizeof(attr_t), 0)
|
||||
self.phrase_ids.set(phrase_hash, <void*>ent_id)
|
||||
|
||||
def __call__(self, Doc doc):
|
||||
|
@ -123,6 +144,8 @@ cdef class PhraseMatcher:
|
|||
RETURNS (list): A list of `(key, start, end)` tuples,
|
||||
describing the matches. A match tuple describes a span
|
||||
`doc[start:end]`. The `label_id` and `key` are both integers.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#call
|
||||
"""
|
||||
matches = []
|
||||
if self.attr == ORTH:
|
||||
|
@ -158,6 +181,8 @@ cdef class PhraseMatcher:
|
|||
If both return_matches and as_tuples are True, the output will
|
||||
be a sequence of ((doc, matches), context) tuples.
|
||||
YIELDS (Doc): Documents, in order.
|
||||
|
||||
DOCS: https://spacy.io/api/phrasematcher#pipe
|
||||
"""
|
||||
if as_tuples:
|
||||
for doc, context in stream:
|
||||
|
@ -180,8 +205,7 @@ cdef class PhraseMatcher:
|
|||
phrase_key = <attr_t*>mem.alloc(end-start, sizeof(attr_t))
|
||||
for i, j in enumerate(range(start, end)):
|
||||
phrase_key[i] = doc.c[j].lex.orth
|
||||
cdef hash_t key = hash64(phrase_key,
|
||||
(end-start) * sizeof(attr_t), 0)
|
||||
cdef hash_t key = hash64(phrase_key, (end-start) * sizeof(attr_t), 0)
|
||||
ent_id = <hash_t>self.phrase_ids.get(key)
|
||||
if ent_id == 0:
|
||||
return None
|
||||
|
@ -203,12 +227,12 @@ cdef class PhraseMatcher:
|
|||
# Concatenate the attr name and value to not pollute lexeme space
|
||||
# e.g. 'POS-VERB' instead of just 'VERB', which could otherwise
|
||||
# create false positive matches
|
||||
return 'matcher:{}-{}'.format(string_attr_name, string_attr_value)
|
||||
return "matcher:{}-{}".format(string_attr_name, string_attr_value)
|
||||
|
||||
|
||||
def get_bilou(length):
|
||||
if length == 0:
|
||||
raise ValueError("Length must be >= 1")
|
||||
raise ValueError(Errors.E127)
|
||||
elif length == 1:
|
||||
return [U_ENT]
|
||||
elif length == 2:
|
||||
|
|
|
@@ -1,8 +1,23 @@
 # coding: utf8
 from __future__ import unicode_literals

-from .pipes import Tagger, DependencyParser, EntityRecognizer  # noqa
-from .pipes import TextCategorizer, Tensorizer, Pipe  # noqa
-from .entityruler import EntityRuler  # noqa
-from .hooks import SentenceSegmenter, SimilarityHook  # noqa
-from .functions import merge_entities, merge_noun_chunks, merge_subtokens  # noqa
+from .pipes import Tagger, DependencyParser, EntityRecognizer
+from .pipes import TextCategorizer, Tensorizer, Pipe
+from .entityruler import EntityRuler
+from .hooks import SentenceSegmenter, SimilarityHook
+from .functions import merge_entities, merge_noun_chunks, merge_subtokens
+
+__all__ = [
+    "Tagger",
+    "DependencyParser",
+    "EntityRecognizer",
+    "TextCategorizer",
+    "Tensorizer",
+    "Pipe",
+    "EntityRuler",
+    "SentenceSegmenter",
+    "SimilarityHook",
+    "merge_entities",
+    "merge_noun_chunks",
+    "merge_subtokens",
+]

@@ -12,10 +12,20 @@ from ..matcher import Matcher, PhraseMatcher


 class EntityRuler(object):
+    """The EntityRuler lets you add spans to the `Doc.ents` using token-based
+    rules or exact phrase matches. It can be combined with the statistical
+    `EntityRecognizer` to boost accuracy, or used on its own to implement a
+    purely rule-based entity recognition system. After initialization, the
+    component is typically added to the pipeline using `nlp.add_pipe`.
+
+    DOCS: https://spacy.io/api/entityruler
+    USAGE: https://spacy.io/usage/rule-based-matching#entityruler
+    """
+
     name = "entity_ruler"

     def __init__(self, nlp, **cfg):
-        """Initialise the entitiy ruler. If patterns are supplied here, they
+        """Initialize the entitiy ruler. If patterns are supplied here, they
         need to be a list of dictionaries with a `"label"` and `"pattern"`
         key. A pattern can either be a token pattern (list) or a phrase pattern
         (string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`.

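A brief usage sketch following the pattern format described in the docstring above; the labels, patterns and model name are illustrative:

    import spacy
    from spacy.pipeline import EntityRuler

    nlp = spacy.load("en_core_web_sm")
    ruler = EntityRuler(nlp)
    ruler.add_patterns([
        {"label": "ORG", "pattern": "Apple"},  # phrase pattern (string)
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},  # token pattern
    ])
    nlp.add_pipe(ruler, before="ner")

    doc = nlp("Apple is opening an office in San Francisco.")
    print([(ent.text, ent.label_) for ent in doc.ents])
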
@ -29,6 +39,8 @@ class EntityRuler(object):
|
|||
of a model pipeline, this will include all keyword arguments passed
|
||||
to `spacy.load`.
|
||||
RETURNS (EntityRuler): The newly constructed object.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#init
|
||||
"""
|
||||
self.nlp = nlp
|
||||
self.overwrite = cfg.get("overwrite_ents", False)
|
||||
|
@ -55,6 +67,8 @@ class EntityRuler(object):
|
|||
|
||||
doc (Doc): The Doc object in the pipeline.
|
||||
RETURNS (Doc): The Doc with added entities, if available.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#call
|
||||
"""
|
||||
matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc))
|
||||
matches = set(
|
||||
|
@ -83,6 +97,8 @@ class EntityRuler(object):
|
|||
"""All labels present in the match patterns.
|
||||
|
||||
RETURNS (set): The string labels.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#labels
|
||||
"""
|
||||
all_labels = set(self.token_patterns.keys())
|
||||
all_labels.update(self.phrase_patterns.keys())
|
||||
|
@ -93,6 +109,8 @@ class EntityRuler(object):
|
|||
"""Get all patterns that were added to the entity ruler.
|
||||
|
||||
RETURNS (list): The original patterns, one dictionary per pattern.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#patterns
|
||||
"""
|
||||
all_patterns = []
|
||||
for label, patterns in self.token_patterns.items():
|
||||
|
@ -110,6 +128,8 @@ class EntityRuler(object):
|
|||
{'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]}
|
||||
|
||||
patterns (list): The patterns to add.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#add_patterns
|
||||
"""
|
||||
for entry in patterns:
|
||||
label = entry["label"]
|
||||
|
@ -131,6 +151,8 @@ class EntityRuler(object):
|
|||
patterns_bytes (bytes): The bytestring to load.
|
||||
**kwargs: Other config paramters, mostly for consistency.
|
||||
RETURNS (EntityRuler): The loaded entity ruler.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#from_bytes
|
||||
"""
|
||||
patterns = srsly.msgpack_loads(patterns_bytes)
|
||||
self.add_patterns(patterns)
|
||||
|
@ -140,6 +162,8 @@ class EntityRuler(object):
|
|||
"""Serialize the entity ruler patterns to a bytestring.
|
||||
|
||||
RETURNS (bytes): The serialized patterns.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#to_bytes
|
||||
"""
|
||||
return srsly.msgpack_dumps(self.patterns)
|
||||
|
||||
|
@ -150,6 +174,8 @@ class EntityRuler(object):
|
|||
path (unicode / Path): The JSONL file to load.
|
||||
**kwargs: Other config paramters, mostly for consistency.
|
||||
RETURNS (EntityRuler): The loaded entity ruler.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler#from_disk
|
||||
"""
|
||||
path = ensure_path(path)
|
||||
path = path.with_suffix(".jsonl")
|
||||
|
@ -164,6 +190,8 @@ class EntityRuler(object):
|
|||
path (unicode / Path): The JSONL file to load.
|
||||
**kwargs: Other config paramters, mostly for consistency.
|
||||
RETURNS (EntityRuler): The loaded entity ruler.
|
||||
|
||||
DOCS: https://spacy.io/api/entityruler
|
||||
"""
|
||||
path = ensure_path(path)
|
||||
path = path.with_suffix(".jsonl")
|
||||
|
|
|
@ -9,6 +9,8 @@ def merge_noun_chunks(doc):
|
|||
|
||||
doc (Doc): The Doc object.
|
||||
RETURNS (Doc): The Doc object with merged noun chunks.
|
||||
|
||||
DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks
|
||||
"""
|
||||
if not doc.is_parsed:
|
||||
return doc
|
||||
|
@ -23,7 +25,9 @@ def merge_entities(doc):
|
|||
"""Merge entities into a single token.
|
||||
|
||||
doc (Doc): The Doc object.
|
||||
RETURNS (Doc): The Doc object with merged noun entities.
|
||||
RETURNS (Doc): The Doc object with merged entities.
|
||||
|
||||
DOCS: https://spacy.io/api/pipeline-functions#merge_entities
|
||||
"""
|
||||
with doc.retokenize() as retokenizer:
|
||||
for ent in doc.ents:
|
||||
|
@ -33,6 +37,14 @@ def merge_entities(doc):
|
|||
|
||||
|
||||
def merge_subtokens(doc, label="subtok"):
|
||||
"""Merge subtokens into a single token.
|
||||
|
||||
doc (Doc): The Doc object.
|
||||
label (unicode): The subtoken dependency label.
|
||||
RETURNS (Doc): The Doc object with merged subtokens.
|
||||
|
||||
DOCS: https://spacy.io/api/pipeline-functions#merge_subtokens
|
||||
"""
|
||||
merger = Matcher(doc.vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}])
|
||||
matches = merger(doc)
|
||||
|
|
|
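These functions are plain pipeline components, so they are wired in with nlp.add_pipe. A minimal sketch, with the model name and example text as assumptions:

    import spacy
    from spacy.pipeline import merge_entities

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe(merge_entities)  # retokenizes each entity into a single token
    # merge_noun_chunks and merge_subtokens are added the same way.

    doc = nlp("Angela Merkel visited the United Nations headquarters.")
    print([token.text for token in doc])  # "Angela Merkel" is now one token
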
@ -15,6 +15,8 @@ class SentenceSegmenter(object):
|
|||
initialization, or assign a new strategy to the .strategy attribute.
|
||||
Sentence detection strategies should be generators that take `Doc` objects
|
||||
and yield `Span` objects for each sentence.
|
||||
|
||||
DOCS: https://spacy.io/api/sentencesegmenter
|
||||
"""
|
||||
|
||||
name = "sentencizer"
|
||||
|
|
|
@ -6,9 +6,8 @@ from __future__ import unicode_literals
|
|||
cimport numpy as np
|
||||
|
||||
import numpy
|
||||
from collections import OrderedDict
|
||||
import srsly
|
||||
|
||||
from collections import OrderedDict
|
||||
from thinc.api import chain
|
||||
from thinc.v2v import Affine, Maxout, Softmax
|
||||
from thinc.misc import LayerNorm
|
||||
|
@ -142,16 +141,21 @@ class Pipe(object):
|
|||
with self.model.use_params(params):
|
||||
yield
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
"""Serialize the pipe to a bytestring."""
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the pipe to a bytestring.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized object.
|
||||
"""
|
||||
serialize = OrderedDict()
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
if self.model not in (True, False, None):
|
||||
serialize["model"] = self.model.to_bytes
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load the pipe from a bytestring."""
|
||||
|
||||
def load_model(b):
|
||||
|
@ -162,26 +166,25 @@ class Pipe(object):
|
|||
self.model = self.Model(**self.cfg)
|
||||
self.model.from_bytes(b)
|
||||
|
||||
deserialize = OrderedDict(
|
||||
(
|
||||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
("vocab", lambda b: self.vocab.from_bytes(b)),
|
||||
("model", load_model),
|
||||
)
|
||||
)
|
||||
deserialize = OrderedDict()
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b)
|
||||
deserialize["model"] = load_model
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Serialize the pipe to disk."""
|
||||
serialize = OrderedDict()
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p)
|
||||
if self.model not in (None, True, False):
|
||||
serialize["model"] = lambda p: p.open("wb").write(self.model.to_bytes())
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Load the pipe from disk."""
|
||||
|
||||
def load_model(p):
|
||||
|
@ -192,13 +195,11 @@ class Pipe(object):
|
|||
self.model = self.Model(**self.cfg)
|
||||
self.model.from_bytes(p.open("rb").read())
|
||||
|
||||
deserialize = OrderedDict(
|
||||
(
|
||||
("cfg", lambda p: self.cfg.update(_load_cfg(p))),
|
||||
("vocab", lambda p: self.vocab.from_disk(p)),
|
||||
("model", load_model),
|
||||
)
|
||||
)
|
||||
deserialize = OrderedDict()
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p)
|
||||
deserialize["model"] = load_model
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
||||
|
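The new exclude argument takes string names of serialization fields rather than keyword flags. A sketch of the calling convention; the pipe name assumes a standard English model:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    tagger = nlp.get_pipe("tagger")

    # Serialize without the shared vocab, then restore into a fresh pipe.
    tagger_bytes = tagger.to_bytes(exclude=["vocab"])
    new_tagger = nlp.create_pipe("tagger")
    new_tagger.from_bytes(tagger_bytes, exclude=["vocab"])
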
@ -284,9 +285,7 @@ class Tensorizer(Pipe):
|
|||
"""
|
||||
for doc, tensor in zip(docs, tensors):
|
||||
if tensor.shape[0] != len(doc):
|
||||
raise ValueError(
|
||||
Errors.E076.format(rows=tensor.shape[0], words=len(doc))
|
||||
)
|
||||
raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc)))
|
||||
doc.tensor = tensor
|
||||
|
||||
def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None):
|
||||
|
@ -346,14 +345,19 @@ class Tensorizer(Pipe):
|
|||
|
||||
|
||||
class Tagger(Pipe):
|
||||
name = 'tagger'
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
DOCS: https://spacy.io/api/tagger
|
||||
"""
|
||||
|
||||
name = "tagger"
|
||||
|
||||
def __init__(self, vocab, model=True, **cfg):
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
self._rehearsal_model = None
|
||||
self.cfg = OrderedDict(sorted(cfg.items()))
|
||||
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
||||
self.cfg.setdefault("cnn_maxout_pieces", 2)
|
||||
|
||||
@property
|
||||
def labels(self):
|
||||
|
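Like the other statistical pipes, the tagger is normally used through a loaded pipeline rather than constructed directly; a minimal sketch with the model name assumed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")
    print([(token.text, token.tag_) for token in doc])

    tagger = nlp.get_pipe("tagger")  # the Tagger instance documented above
    print(tagger.labels[:5])         # fine-grained tag names from the tag map
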
@ -404,7 +408,7 @@ class Tagger(Pipe):
|
|||
cdef Vocab vocab = self.vocab
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tag_ids = batch_tag_ids[i]
|
||||
if hasattr(doc_tag_ids, 'get'):
|
||||
if hasattr(doc_tag_ids, "get"):
|
||||
doc_tag_ids = doc_tag_ids.get()
|
||||
for j, tag_id in enumerate(doc_tag_ids):
|
||||
# Don't clobber preset POS tags
|
||||
|
@ -453,9 +457,9 @@ class Tagger(Pipe):
|
|||
scores = self.model.ops.flatten(scores)
|
||||
tag_index = {tag: i for i, tag in enumerate(self.labels)}
|
||||
cdef int idx = 0
|
||||
correct = numpy.zeros((scores.shape[0],), dtype='i')
|
||||
correct = numpy.zeros((scores.shape[0],), dtype="i")
|
||||
guesses = scores.argmax(axis=1)
|
||||
known_labels = numpy.ones((scores.shape[0], 1), dtype='f')
|
||||
known_labels = numpy.ones((scores.shape[0], 1), dtype="f")
|
||||
for gold in golds:
|
||||
for tag in gold.tags:
|
||||
if tag is None:
|
||||
|
@ -466,7 +470,7 @@ class Tagger(Pipe):
|
|||
correct[idx] = 0
|
||||
known_labels[idx] = 0.
|
||||
idx += 1
|
||||
correct = self.model.ops.xp.array(correct, dtype='i')
|
||||
correct = self.model.ops.xp.array(correct, dtype="i")
|
||||
d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
|
||||
d_scores *= self.model.ops.asarray(known_labels)
|
||||
loss = (d_scores**2).sum()
|
||||
|
@ -490,9 +494,9 @@ class Tagger(Pipe):
|
|||
vocab.morphology = Morphology(vocab.strings, new_tag_map,
|
||||
vocab.morphology.lemmatizer,
|
||||
exc=vocab.morphology.exc)
|
||||
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
|
||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||
if self.model is True:
|
||||
for hp in ['token_vector_width', 'conv_depth']:
|
||||
for hp in ["token_vector_width", "conv_depth"]:
|
||||
if hp in kwargs:
|
||||
self.cfg[hp] = kwargs[hp]
|
||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||
|
@ -503,7 +507,7 @@ class Tagger(Pipe):
|
|||
|
||||
@classmethod
|
||||
def Model(cls, n_tags, **cfg):
|
||||
if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'):
|
||||
if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"):
|
||||
raise ValueError(TempErrors.T008)
|
||||
return build_tagger_model(n_tags, **cfg)
|
||||
|
||||
|
@ -535,28 +539,27 @@ class Tagger(Pipe):
|
|||
with self.model.use_params(params):
|
||||
yield
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
serialize = OrderedDict()
|
||||
if self.model not in (None, True, False):
|
||||
serialize['model'] = self.model.to_bytes
|
||||
serialize['vocab'] = self.vocab.to_bytes
|
||||
serialize['cfg'] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["model"] = self.model.to_bytes
|
||||
serialize["vocab"] = self.vocab.to_bytes
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||
serialize['tag_map'] = lambda: srsly.msgpack_dumps(tag_map)
|
||||
serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map)
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
def load_model(b):
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||
|
||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||
self.cfg["pretrained_vectors"] = self.vocab.vectors.name
|
||||
if self.model is True:
|
||||
token_vector_width = util.env_opt(
|
||||
'token_vector_width',
|
||||
self.cfg.get('token_vector_width', 96))
|
||||
self.model = self.Model(self.vocab.morphology.n_tags,
|
||||
**self.cfg)
|
||||
"token_vector_width",
|
||||
self.cfg.get("token_vector_width", 96))
|
||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||
self.model.from_bytes(b)
|
||||
|
||||
def load_tag_map(b):
|
||||
|
@ -567,32 +570,34 @@ class Tagger(Pipe):
|
|||
exc=self.vocab.morphology.exc)
|
||||
|
||||
deserialize = OrderedDict((
|
||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||
('tag_map', load_tag_map),
|
||||
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
('model', lambda b: load_model(b)),
|
||||
("vocab", lambda b: self.vocab.from_bytes(b)),
|
||||
("tag_map", load_tag_map),
|
||||
("cfg", lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
("model", lambda b: load_model(b)),
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items()))
|
||||
serialize = OrderedDict((
|
||||
('vocab', lambda p: self.vocab.to_disk(p)),
|
||||
('tag_map', lambda p: srsly.write_msgpack(p, tag_map)),
|
||||
('model', lambda p: p.open('wb').write(self.model.to_bytes())),
|
||||
('cfg', lambda p: srsly.write_json(p, self.cfg))
|
||||
("vocab", lambda p: self.vocab.to_disk(p)),
|
||||
("tag_map", lambda p: srsly.write_msgpack(p, tag_map)),
|
||||
("model", lambda p: p.open("wb").write(self.model.to_bytes())),
|
||||
("cfg", lambda p: srsly.write_json(p, self.cfg))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serialize, exclude, kwargs)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
def load_model(p):
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg:
|
||||
self.cfg['pretrained_vectors'] = self.vocab.vectors.name
|
||||
if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg:
|
||||
self.cfg["pretrained_vectors"] = self.vocab.vectors.name
|
||||
if self.model is True:
|
||||
self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg)
|
||||
with p.open('rb') as file_:
|
||||
with p.open("rb") as file_:
|
||||
self.model.from_bytes(file_.read())
|
||||
|
||||
def load_tag_map(p):
|
||||
|
@ -603,11 +608,12 @@ class Tagger(Pipe):
|
|||
exc=self.vocab.morphology.exc)
|
||||
|
||||
deserialize = OrderedDict((
|
||||
('cfg', lambda p: self.cfg.update(_load_cfg(p))),
|
||||
('vocab', lambda p: self.vocab.from_disk(p)),
|
||||
('tag_map', load_tag_map),
|
||||
('model', load_model),
|
||||
("cfg", lambda p: self.cfg.update(_load_cfg(p))),
|
||||
("vocab", lambda p: self.vocab.from_disk(p)),
|
||||
("tag_map", load_tag_map),
|
||||
("model", load_model),
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserialize, exclude, kwargs)
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
||||
|
@ -616,37 +622,38 @@ class MultitaskObjective(Tagger):
|
|||
"""Experimental: Assist training of a parser or tagger, by training a
|
||||
side-objective.
|
||||
"""
|
||||
name = 'nn_labeller'
|
||||
|
||||
name = "nn_labeller"
|
||||
|
||||
def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg):
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
if target == 'dep':
|
||||
if target == "dep":
|
||||
self.make_label = self.make_dep
|
||||
elif target == 'tag':
|
||||
elif target == "tag":
|
||||
self.make_label = self.make_tag
|
||||
elif target == 'ent':
|
||||
elif target == "ent":
|
||||
self.make_label = self.make_ent
|
||||
elif target == 'dep_tag_offset':
|
||||
elif target == "dep_tag_offset":
|
||||
self.make_label = self.make_dep_tag_offset
|
||||
elif target == 'ent_tag':
|
||||
elif target == "ent_tag":
|
||||
self.make_label = self.make_ent_tag
|
||||
elif target == 'sent_start':
|
||||
elif target == "sent_start":
|
||||
self.make_label = self.make_sent_start
|
||||
elif hasattr(target, '__call__'):
|
||||
elif hasattr(target, "__call__"):
|
||||
self.make_label = target
|
||||
else:
|
||||
raise ValueError(Errors.E016)
|
||||
self.cfg = dict(cfg)
|
||||
self.cfg.setdefault('cnn_maxout_pieces', 2)
|
||||
self.cfg.setdefault("cnn_maxout_pieces", 2)
|
||||
|
||||
@property
|
||||
def labels(self):
|
||||
return self.cfg.setdefault('labels', {})
|
||||
return self.cfg.setdefault("labels", {})
|
||||
|
||||
@labels.setter
|
||||
def labels(self, value):
|
||||
self.cfg['labels'] = value
|
||||
self.cfg["labels"] = value
|
||||
|
||||
def set_annotations(self, docs, dep_ids, tensors=None):
|
||||
pass
|
||||
|
@ -662,7 +669,7 @@ class MultitaskObjective(Tagger):
|
|||
if label is not None and label not in self.labels:
|
||||
self.labels[label] = len(self.labels)
|
||||
if self.model is True:
|
||||
token_vector_width = util.env_opt('token_vector_width')
|
||||
token_vector_width = util.env_opt("token_vector_width")
|
||||
self.model = self.Model(len(self.labels), tok2vec=tok2vec)
|
||||
link_vectors_to_models(self.vocab)
|
||||
if sgd is None:
|
||||
|
@ -671,7 +678,7 @@ class MultitaskObjective(Tagger):
|
|||
|
||||
@classmethod
|
||||
def Model(cls, n_tags, tok2vec=None, **cfg):
|
||||
token_vector_width = util.env_opt('token_vector_width', 96)
|
||||
token_vector_width = util.env_opt("token_vector_width", 96)
|
||||
softmax = Softmax(n_tags, token_vector_width*2)
|
||||
model = chain(
|
||||
tok2vec,
|
||||
|
@ -690,10 +697,10 @@ class MultitaskObjective(Tagger):
|
|||
|
||||
def get_loss(self, docs, golds, scores):
|
||||
if len(docs) != len(golds):
|
||||
raise ValueError(Errors.E077.format(value='loss', n_docs=len(docs),
|
||||
raise ValueError(Errors.E077.format(value="loss", n_docs=len(docs),
|
||||
n_golds=len(golds)))
|
||||
cdef int idx = 0
|
||||
correct = numpy.zeros((scores.shape[0],), dtype='i')
|
||||
correct = numpy.zeros((scores.shape[0],), dtype="i")
|
||||
guesses = scores.argmax(axis=1)
|
||||
for i, gold in enumerate(golds):
|
||||
for j in range(len(docs[i])):
|
||||
|
@ -705,7 +712,7 @@ class MultitaskObjective(Tagger):
|
|||
else:
|
||||
correct[idx] = self.labels[label]
|
||||
idx += 1
|
||||
correct = self.model.ops.xp.array(correct, dtype='i')
|
||||
correct = self.model.ops.xp.array(correct, dtype="i")
|
||||
d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1])
|
||||
loss = (d_scores**2).sum()
|
||||
return float(loss), d_scores
|
||||
|
@ -733,25 +740,25 @@ class MultitaskObjective(Tagger):
|
|||
offset = heads[i] - i
|
||||
offset = min(offset, 2)
|
||||
offset = max(offset, -2)
|
||||
return '%s-%s:%d' % (deps[i], tags[i], offset)
|
||||
return "%s-%s:%d" % (deps[i], tags[i], offset)
|
||||
|
||||
@staticmethod
|
||||
def make_ent_tag(i, words, tags, heads, deps, ents):
|
||||
if ents is None or ents[i] is None:
|
||||
return None
|
||||
else:
|
||||
return '%s-%s' % (tags[i], ents[i])
|
||||
return "%s-%s" % (tags[i], ents[i])
|
||||
|
||||
@staticmethod
|
||||
def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}):
|
||||
'''A multi-task objective for representing sentence boundaries,
|
||||
"""A multi-task objective for representing sentence boundaries,
|
||||
using BILU scheme. (O is impossible)
|
||||
|
||||
The implementation of this method uses an internal cache that relies
|
||||
on the identity of the heads array, to avoid requiring a new piece
|
||||
of gold data. You can pass cache=False if you know the cache will
|
||||
do the wrong thing.
|
||||
'''
|
||||
"""
|
||||
assert len(words) == len(heads)
|
||||
assert target < len(words), (target, len(words))
|
||||
if cache:
|
||||
|
@ -760,10 +767,10 @@ class MultitaskObjective(Tagger):
|
|||
else:
|
||||
for key in list(_cache.keys()):
|
||||
_cache.pop(key)
|
||||
sent_tags = ['I-SENT'] * len(words)
|
||||
sent_tags = ["I-SENT"] * len(words)
|
||||
_cache[id(heads)] = sent_tags
|
||||
else:
|
||||
sent_tags = ['I-SENT'] * len(words)
|
||||
sent_tags = ["I-SENT"] * len(words)
|
||||
|
||||
def _find_root(child):
|
||||
seen = set([child])
|
||||
|
@ -781,10 +788,10 @@ class MultitaskObjective(Tagger):
|
|||
sentences.setdefault(root, []).append(i)
|
||||
for root, span in sorted(sentences.items()):
|
||||
if len(span) == 1:
|
||||
sent_tags[span[0]] = 'U-SENT'
|
||||
sent_tags[span[0]] = "U-SENT"
|
||||
else:
|
||||
sent_tags[span[0]] = 'B-SENT'
|
||||
sent_tags[span[-1]] = 'L-SENT'
|
||||
sent_tags[span[0]] = "B-SENT"
|
||||
sent_tags[span[-1]] = "L-SENT"
|
||||
return sent_tags[target]
|
||||
|
||||
|
||||
|
@ -854,6 +861,10 @@ class ClozeMultitask(Pipe):
|
|||
|
||||
|
||||
class TextCategorizer(Pipe):
|
||||
"""Pipeline component for text classification.
|
||||
|
||||
DOCS: https://spacy.io/api/textcategorizer
|
||||
"""
|
||||
name = 'textcat'
|
||||
|
||||
@classmethod
|
||||
|
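A short sketch of adding the component to a pipeline; the label names are illustrative and the training loop is omitted:

    import spacy

    nlp = spacy.blank("en")
    textcat = nlp.create_pipe("textcat")  # builds the TextCategorizer above
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")
    nlp.add_pipe(textcat, last=True)

    optimizer = nlp.begin_training()
    # ... train with nlp.update(texts, annotations, sgd=optimizer) ...
    doc = nlp("This is terrific.")
    print(doc.cats)  # per-label scores; only meaningful after training
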
@ -863,7 +874,7 @@ class TextCategorizer(Pipe):
|
|||
token_vector_width = cfg["token_vector_width"]
|
||||
else:
|
||||
token_vector_width = util.env_opt("token_vector_width", 96)
|
||||
if cfg.get('architecture') == 'simple_cnn':
|
||||
if cfg.get("architecture") == "simple_cnn":
|
||||
tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg)
|
||||
return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg)
|
||||
else:
|
||||
|
@ -884,11 +895,11 @@ class TextCategorizer(Pipe):
|
|||
|
||||
@property
|
||||
def labels(self):
|
||||
return tuple(self.cfg.setdefault('labels', []))
|
||||
return tuple(self.cfg.setdefault("labels", []))
|
||||
|
||||
@labels.setter
|
||||
def labels(self, value):
|
||||
self.cfg['labels'] = tuple(value)
|
||||
self.cfg["labels"] = tuple(value)
|
||||
|
||||
def __call__(self, doc):
|
||||
scores, tensors = self.predict([doc])
|
||||
|
@ -934,8 +945,8 @@ class TextCategorizer(Pipe):
|
|||
losses[self.name] += (gradient**2).sum()
|
||||
|
||||
def get_loss(self, docs, golds, scores):
|
||||
truths = numpy.zeros((len(golds), len(self.labels)), dtype='f')
|
||||
not_missing = numpy.ones((len(golds), len(self.labels)), dtype='f')
|
||||
truths = numpy.zeros((len(golds), len(self.labels)), dtype="f")
|
||||
not_missing = numpy.ones((len(golds), len(self.labels)), dtype="f")
|
||||
for i, gold in enumerate(golds):
|
||||
for j, label in enumerate(self.labels):
|
||||
if label in gold.cats:
|
||||
|
@ -956,20 +967,19 @@ class TextCategorizer(Pipe):
|
|||
# This functionality was available previously, but was broken.
|
||||
# The problem is that we resize the last layer, but the last layer
|
||||
# is actually just an ensemble. We're not resizing the child layers
|
||||
# -- a huge problem.
|
||||
# - a huge problem.
|
||||
raise ValueError(Errors.E116)
|
||||
#smaller = self.model._layers[-1]
|
||||
#larger = Affine(len(self.labels)+1, smaller.nI)
|
||||
#copy_array(larger.W[:smaller.nO], smaller.W)
|
||||
#copy_array(larger.b[:smaller.nO], smaller.b)
|
||||
#self.model._layers[-1] = larger
|
||||
# smaller = self.model._layers[-1]
|
||||
# larger = Affine(len(self.labels)+1, smaller.nI)
|
||||
# copy_array(larger.W[:smaller.nO], smaller.W)
|
||||
# copy_array(larger.b[:smaller.nO], smaller.b)
|
||||
# self.model._layers[-1] = larger
|
||||
self.labels = tuple(list(self.labels) + [label])
|
||||
return 1
|
||||
|
||||
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None,
|
||||
**kwargs):
|
||||
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
|
||||
if self.model is True:
|
||||
self.cfg['pretrained_vectors'] = kwargs.get('pretrained_vectors')
|
||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||
self.model = self.Model(len(self.labels), **self.cfg)
|
||||
link_vectors_to_models(self.vocab)
|
||||
if sgd is None:
|
||||
|
@ -978,7 +988,12 @@ class TextCategorizer(Pipe):
|
|||
|
||||
|
||||
cdef class DependencyParser(Parser):
|
||||
name = 'parser'
|
||||
"""Pipeline component for dependency parsing.
|
||||
|
||||
DOCS: https://spacy.io/api/dependencyparser
|
||||
"""
|
||||
|
||||
name = "parser"
|
||||
TransitionSystem = ArcEager
|
||||
|
||||
@property
|
||||
|
@ -986,7 +1001,7 @@ cdef class DependencyParser(Parser):
|
|||
return [nonproj.deprojectivize]
|
||||
|
||||
def add_multitask_objective(self, target):
|
||||
if target == 'cloze':
|
||||
if target == "cloze":
|
||||
cloze = ClozeMultitask(self.vocab)
|
||||
self._multitasks.append(cloze)
|
||||
else:
|
||||
|
@ -1000,8 +1015,7 @@ cdef class DependencyParser(Parser):
|
|||
tok2vec=tok2vec, sgd=sgd)
|
||||
|
||||
def __reduce__(self):
|
||||
return (DependencyParser, (self.vocab, self.moves, self.model),
|
||||
None, None)
|
||||
return (DependencyParser, (self.vocab, self.moves, self.model), None, None)
|
||||
|
||||
@property
|
||||
def labels(self):
|
||||
|
@ -1010,6 +1024,11 @@ cdef class DependencyParser(Parser):
|
|||
|
||||
|
||||
cdef class EntityRecognizer(Parser):
|
||||
"""Pipeline component for named entity recognition.
|
||||
|
||||
DOCS: https://spacy.io/api/entityrecognizer
|
||||
"""
|
||||
|
||||
name = "ner"
|
||||
TransitionSystem = BiluoPushDown
|
||||
nr_feature = 6
|
||||
|
@ -1040,4 +1059,4 @@ cdef class EntityRecognizer(Parser):
|
|||
if move[0] in ("B", "I", "L", "U")))
|
||||
|
||||
|
||||
__all__ = ['Tagger', 'DependencyParser', 'EntityRecognizer', 'Tensorizer', 'TextCategorizer']
|
||||
__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer"]
|
||||
|
|
|
@ -20,7 +20,7 @@ from . import util
|
|||
def get_string_id(key):
|
||||
"""Get a string ID, handling the reserved symbols correctly. If the key is
|
||||
already an ID, return it.
|
||||
|
||||
|
||||
This function optimises for convenience over performance, so shouldn't be
|
||||
used in tight loops.
|
||||
"""
|
||||
|
@ -31,12 +31,12 @@ def get_string_id(key):
|
|||
elif not key:
|
||||
return 0
|
||||
else:
|
||||
chars = key.encode('utf8')
|
||||
chars = key.encode("utf8")
|
||||
return hash_utf8(chars, len(chars))
|
||||
|
||||
|
||||
cpdef hash_t hash_string(unicode string) except 0:
|
||||
chars = string.encode('utf8')
|
||||
chars = string.encode("utf8")
|
||||
return hash_utf8(chars, len(chars))
|
||||
|
||||
|
||||
|
@ -51,9 +51,9 @@ cdef uint32_t hash32_utf8(char* utf8_string, int length) nogil:
|
|||
cdef unicode decode_Utf8Str(const Utf8Str* string):
|
||||
cdef int i, length
|
||||
if string.s[0] < sizeof(string.s) and string.s[0] != 0:
|
||||
return string.s[1:string.s[0]+1].decode('utf8')
|
||||
return string.s[1:string.s[0]+1].decode("utf8")
|
||||
elif string.p[0] < 255:
|
||||
return string.p[1:string.p[0]+1].decode('utf8')
|
||||
return string.p[1:string.p[0]+1].decode("utf8")
|
||||
else:
|
||||
i = 0
|
||||
length = 0
|
||||
|
@ -62,7 +62,7 @@ cdef unicode decode_Utf8Str(const Utf8Str* string):
|
|||
length += 255
|
||||
length += string.p[i]
|
||||
i += 1
|
||||
return string.p[i:length + i].decode('utf8')
|
||||
return string.p[i:length + i].decode("utf8")
|
||||
|
||||
|
||||
cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) except *:
|
||||
|
@ -91,7 +91,10 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e
|
|||
|
||||
|
||||
cdef class StringStore:
|
||||
"""Look up strings by 64-bit hashes."""
|
||||
"""Look up strings by 64-bit hashes.
|
||||
|
||||
DOCS: https://spacy.io/api/stringstore
|
||||
"""
|
||||
def __init__(self, strings=None, freeze=False):
|
||||
"""Create the StringStore.
|
||||
|
||||
|
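A small usage sketch of the hash-based lookup described in the docstring above:

    from spacy.strings import StringStore

    stringstore = StringStore(["apple", "orange"])
    apple_hash = stringstore["apple"]        # 64-bit hash of the string
    assert stringstore[apple_hash] == "apple"
    banana_hash = stringstore.add("banana")  # add returns the new key
    print(len(stringstore))                  # 3
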
@ -113,7 +116,7 @@ cdef class StringStore:
|
|||
if isinstance(string_or_id, basestring) and len(string_or_id) == 0:
|
||||
return 0
|
||||
elif string_or_id == 0:
|
||||
return u''
|
||||
return ""
|
||||
elif string_or_id in SYMBOLS_BY_STR:
|
||||
return SYMBOLS_BY_STR[string_or_id]
|
||||
|
||||
|
@ -181,7 +184,7 @@ cdef class StringStore:
|
|||
elif isinstance(string, unicode):
|
||||
key = hash_string(string)
|
||||
else:
|
||||
string = string.encode('utf8')
|
||||
string = string.encode("utf8")
|
||||
key = hash_utf8(string, len(string))
|
||||
if key < len(SYMBOLS_BY_INT):
|
||||
return True
|
||||
|
@ -233,19 +236,17 @@ cdef class StringStore:
|
|||
self.add(word)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
RETURNS (bytes): The serialized form of the `StringStore` object.
|
||||
"""
|
||||
return srsly.json_dumps(list(self))
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
RETURNS (StringStore): The `StringStore` object.
|
||||
"""
|
||||
strings = srsly.json_loads(bytes_data)
|
||||
|
@ -296,7 +297,7 @@ cdef class StringStore:
|
|||
|
||||
cdef const Utf8Str* intern_unicode(self, unicode py_string):
|
||||
# 0 means missing, but we don't bother offsetting the index.
|
||||
cdef bytes byte_string = py_string.encode('utf8')
|
||||
cdef bytes byte_string = py_string.encode("utf8")
|
||||
return self._intern_utf8(byte_string, len(byte_string))
|
||||
|
||||
@cython.final
|
||||
|
|
|
@ -157,6 +157,10 @@ cdef void cpu_log_loss(float* d_scores,
|
|||
cdef double max_, gmax, Z, gZ
|
||||
best = arg_max_if_gold(scores, costs, is_valid, O)
|
||||
guess = arg_max_if_valid(scores, is_valid, O)
|
||||
if best == -1 or guess == -1:
|
||||
# These shouldn't happen, but if they do, we want to make sure we don't
|
||||
# cause an OOB access.
|
||||
return
|
||||
Z = 1e-10
|
||||
gZ = 1e-10
|
||||
max_ = scores[guess]
|
||||
|
|
|
@ -323,6 +323,12 @@ cdef cppclass StateC:
|
|||
if this._s_i >= 1:
|
||||
this._s_i -= 1
|
||||
|
||||
void force_final() nogil:
|
||||
# This should only be used in desperate situations, as it may leave
|
||||
# the analysis in an unexpected state.
|
||||
this._s_i = 0
|
||||
this._b_i = this.length
|
||||
|
||||
void unshift() nogil:
|
||||
this._b_i -= 1
|
||||
this._buffer[this._b_i] = this.S(0)
|
||||
|
|
|
@ -257,30 +257,42 @@ cdef class Missing:
|
|||
cdef class Begin:
|
||||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
# Ensure we don't clobber preset entities. If no entity preset,
|
||||
# ent_iob is 0
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 1:
|
||||
cdef int preset_ent_label = st.B_(0).ent_type
|
||||
# If we're the last token of the input, we can't B -- must U or O.
|
||||
if st.B(1) == -1:
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
elif st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
||||
elif label == 0:
|
||||
return False
|
||||
# If the next word is B or O, we can't B now
|
||||
elif preset_ent_iob == 1 or preset_ent_iob == 2:
|
||||
# Ensure we don't clobber preset entities. If no entity preset,
|
||||
# ent_iob is 0
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
# Okay, we're in a preset entity.
|
||||
if label != preset_ent_label:
|
||||
# If label isn't right, reject
|
||||
return False
|
||||
elif st.B_(1).ent_iob != 1:
|
||||
# If next token isn't marked I, we need to make U, not B.
|
||||
return False
|
||||
else:
|
||||
# Otherwise, force acceptance, even if we're across a sentence
|
||||
# boundary or the token is whitespace.
|
||||
return True
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If the next word is B or O, we can't B now
|
||||
return False
|
||||
# If the current word is B, and the next word isn't I, the current word
|
||||
# is really U
|
||||
elif preset_ent_iob == 3 and st.B_(1).ent_iob != 1:
|
||||
return False
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
elif st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
return False
|
||||
# Don't allow entities to start on whitespace
|
||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||
return False
|
||||
else:
|
||||
return label != 0 and not st.entity_is_open()
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -314,18 +326,27 @@ cdef class In:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 2:
|
||||
if label == 0:
|
||||
return False
|
||||
elif st.E_(0).ent_type != label:
|
||||
return False
|
||||
elif not st.entity_is_open():
|
||||
return False
|
||||
elif st.B(1) == -1:
|
||||
# If we're at the end, we can't I.
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
return False
|
||||
# TODO: Is this quite right? I think it's supposed to be ensuring the
|
||||
# gazetteer matches are maintained
|
||||
elif st.B(1) != -1 and st.B_(1).ent_iob != preset_ent_iob:
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If we know the next word is B or O, we can't be I (must be L)
|
||||
return False
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
return False
|
||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -370,9 +391,17 @@ cdef class In:
|
|||
cdef class Last:
|
||||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
if st.B_(1).ent_iob == 1:
|
||||
if label == 0:
|
||||
return False
|
||||
return st.entity_is_open() and label != 0 and st.E_(0).ent_type == label
|
||||
elif not st.entity_is_open():
|
||||
return False
|
||||
elif st.E_(0).ent_type != label:
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
# If a preset entity has I next, we can't L here.
|
||||
return False
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -416,17 +445,29 @@ cdef class Unit:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 2:
|
||||
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||
if label == 0:
|
||||
return False
|
||||
elif preset_ent_iob == 1:
|
||||
elif st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3 and st.B_(0).ent_type != label:
|
||||
elif preset_ent_iob == 2:
|
||||
# Don't clobber preset O
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
# If next token is In, we can't be Unit -- must be Begin
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
# Okay, there's a preset entity here
|
||||
if label != preset_ent_label:
|
||||
# Require labels to match
|
||||
return False
|
||||
else:
|
||||
# Otherwise return True, ignoring the whitespace constraint.
|
||||
return True
|
||||
elif Lexeme.get_struct_attr(st.B_(0).lex, IS_SPACE):
|
||||
return False
|
||||
return label != 0 and not st.entity_is_open()
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
@ -461,11 +502,14 @@ cdef class Out:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
if preset_ent_iob == 3:
|
||||
if st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
return False
|
||||
elif preset_ent_iob == 1:
|
||||
return False
|
||||
return not st.entity_is_open()
|
||||
else:
|
||||
return True
|
||||
|
||||
@staticmethod
|
||||
cdef int transition(StateC* st, attr_t label) nogil:
|
||||
|
|
|
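The Begin/In/Last/Unit/Out validity checks above all read the preset ent_iob/ent_type of the buffer tokens so that entities annotated ahead of time (e.g. by the EntityRuler) are never clobbered by the statistical model. A simplified pure-Python sketch of the Unit case shown above; the function and variable names are illustrative, not the actual Cython transition system:

    # IOB codes as stored on the state: 0 = unset, 1 = I, 2 = O, 3 = B
    def unit_is_valid(label, entity_open, first_iob, first_label, next_iob):
        if label == 0:
            return False          # U- actions always carry a label
        if entity_open:
            return False          # a unit entity cannot start inside another entity
        if first_iob == 2:
            return False          # don't clobber a preset O
        if next_iob == 1:
            return False          # next token is preset I, so this must be B, not U
        if first_iob == 3 and label != first_label:
            return False          # a preset B-LABEL must keep its label
        return True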
@ -221,14 +221,14 @@ cdef class Parser:
|
|||
for batch in util.minibatch(docs, size=batch_size):
|
||||
batch_in_order = list(batch)
|
||||
by_length = sorted(batch_in_order, key=lambda doc: len(doc))
|
||||
for subbatch in util.minibatch(by_length, size=batch_size//4):
|
||||
for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)):
|
||||
subbatch = list(subbatch)
|
||||
parse_states = self.predict(subbatch, beam_width=beam_width,
|
||||
beam_density=beam_density)
|
||||
self.set_annotations(subbatch, parse_states, tensors=None)
|
||||
for doc in batch_in_order:
|
||||
yield doc
|
||||
|
||||
|
||||
def require_model(self):
|
||||
"""Raise an error if the component's model is not initialized."""
|
||||
if getattr(self, 'model', None) in (None, True, False):
|
||||
|
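The pipe() change above sorts each batch by document length and parses sub-batches of at least two documents, then yields the docs back in their original order. A minimal sketch of that pattern, assuming the public spacy.util.minibatch helper; predict and set_annotations stand in for the parser's own methods:

    from spacy import util

    def parse_in_subbatches(docs, predict, set_annotations, batch_size=256):
        for batch in util.minibatch(docs, size=batch_size):
            batch_in_order = list(batch)
            # Sorting by length makes the sub-batches more uniform to parse
            by_length = sorted(batch_in_order, key=len)
            for subbatch in util.minibatch(by_length, size=max(batch_size // 4, 2)):
                subbatch = list(subbatch)
                set_annotations(subbatch, predict(subbatch))
            # Yield in the caller's original order, not in sorted order
            for doc in batch_in_order:
                yield doc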
@ -272,7 +272,7 @@ cdef class Parser:
|
|||
beams = self.moves.init_beams(docs, beam_width, beam_density=beam_density)
|
||||
# This is pretty dirty, but the NER can resize itself in init_batch,
|
||||
# if labels are missing. We therefore have to check whether we need to
|
||||
# expand our model output.
|
||||
# expand our model output.
|
||||
self.model.resize_output(self.moves.n_moves)
|
||||
model = self.model(docs)
|
||||
token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
|
||||
|
@ -363,9 +363,14 @@ cdef class Parser:
|
|||
for i in range(batch_size):
|
||||
self.moves.set_valid(is_valid, states[i])
|
||||
guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
|
||||
action = self.moves.c[guess]
|
||||
action.do(states[i], action.label)
|
||||
states[i].push_hist(guess)
|
||||
if guess == -1:
|
||||
# This shouldn't happen, but it's hard to raise an error here,
|
||||
# and we don't want to infinite loop. So, force to end state.
|
||||
states[i].force_final()
|
||||
else:
|
||||
action = self.moves.c[guess]
|
||||
action.do(states[i], action.label)
|
||||
states[i].push_hist(guess)
|
||||
free(is_valid)
|
||||
|
||||
def transition_beams(self, beams, float[:, ::1] scores):
|
||||
|
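The new guess == -1 branch covers the case where no action is valid for a state: rather than looping forever, the state is forced to a final state. A small sketch of an argmax restricted to valid actions that returns -1 when nothing is valid (a plain-Python stand-in for the C helper, names illustrative):

    def arg_max_if_valid(scores, is_valid):
        # scores and is_valid are parallel sequences over the action space
        best, best_score = -1, None
        for i, score in enumerate(scores):
            if is_valid[i] and (best == -1 or score > best_score):
                best, best_score = i, score
        return best  # -1 means "no valid action": caller should force a final state

    assert arg_max_if_valid([0.2, 0.9], [False, False]) == -1
    assert arg_max_if_valid([0.2, 0.9], [True, False]) == 0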
@ -437,7 +442,7 @@ cdef class Parser:
|
|||
if self._rehearsal_model is None:
|
||||
return None
|
||||
losses.setdefault(self.name, 0.)
|
||||
|
||||
|
||||
states = self.moves.init_batch(docs)
|
||||
# This is pretty dirty, but the NER can resize itself in init_batch,
|
||||
# if labels are missing. We therefore have to check whether we need to
|
||||
|
@ -598,22 +603,24 @@ cdef class Parser:
|
|||
self.cfg.update(cfg)
|
||||
return sgd
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
serializers = {
|
||||
'model': lambda p: (self.model.to_disk(p) if self.model is not True else True),
|
||||
'vocab': lambda p: self.vocab.to_disk(p),
|
||||
'moves': lambda p: self.moves.to_disk(p, strings=False),
|
||||
'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]),
|
||||
'cfg': lambda p: srsly.write_json(p, self.cfg)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
deserializers = {
|
||||
'vocab': lambda p: self.vocab.from_disk(p),
|
||||
'moves': lambda p: self.moves.from_disk(p, strings=False),
|
||||
'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]),
|
||||
'cfg': lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
'model': lambda p: None
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
path = util.ensure_path(path)
|
||||
|
@ -627,22 +634,24 @@ cdef class Parser:
|
|||
self.cfg.update(cfg)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
serializers = OrderedDict((
|
||||
('model', lambda: (self.model.to_bytes() if self.model is not True else True)),
|
||||
('vocab', lambda: self.vocab.to_bytes()),
|
||||
('moves', lambda: self.moves.to_bytes(strings=False)),
|
||||
('moves', lambda: self.moves.to_bytes(exclude=["strings"])),
|
||||
('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
deserializers = OrderedDict((
|
||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||
('moves', lambda b: self.moves.from_bytes(b, strings=False)),
|
||||
('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])),
|
||||
('cfg', lambda b: self.cfg.update(srsly.json_loads(b))),
|
||||
('model', lambda b: None)
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if 'model' not in exclude:
|
||||
# TODO: Remove this once we don't have to handle previous models
|
||||
|
|
|
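These hunks switch serialization from boolean keyword flags (e.g. to_bytes(moves=False)) to an explicit exclude list of field names, with unexpected keyword arguments now rejected. A hedged usage sketch; parser is assumed to be an initialized pipeline component:

    # New style: name the serialization fields to skip
    data = parser.to_bytes(exclude=["vocab", "model"])
    parser.from_bytes(data, exclude=["vocab", "model"])

    # Old-style flags such as parser.to_bytes(model=False) are expected to
    # raise a ValueError (see the serialization tests further down)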
@ -94,6 +94,13 @@ cdef class TransitionSystem:
|
|||
raise ValueError(Errors.E024)
|
||||
return history
|
||||
|
||||
def apply_transition(self, StateClass state, name):
|
||||
if not self.is_valid(state, name):
|
||||
raise ValueError(
|
||||
"Cannot apply transition {name}: invalid for the current state.".format(name=name))
|
||||
action = self.lookup_transition(name)
|
||||
action.do(state.c, action.label)
|
||||
|
||||
cdef int initialize_state(self, StateC* state) nogil:
|
||||
pass
|
||||
|
||||
|
@ -201,30 +208,32 @@ cdef class TransitionSystem:
|
|||
self.labels[action][label_name] = new_freq-1
|
||||
return 1
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
with path.open('wb') as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
with path.open('rb') as file_:
|
||||
byte_data = file_.read()
|
||||
self.from_bytes(byte_data, **exclude)
|
||||
self.from_bytes(byte_data, **kwargs)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
transitions = []
|
||||
serializers = {
|
||||
'moves': lambda: srsly.json_dumps(self.labels),
|
||||
'strings': lambda: self.strings.to_bytes()
|
||||
}
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
labels = {}
|
||||
deserializers = {
|
||||
'moves': lambda b: labels.update(srsly.json_loads(b)),
|
||||
'strings': lambda b: self.strings.from_bytes(b)
|
||||
}
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
self.initialize_actions(labels)
|
||||
return self
|
||||
|
|
|
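to_bytes/from_bytes here follow the same pattern as the other components: build a dict of per-field serializer callables, drop the excluded names, and pack the rest. A simplified sketch of that helper pattern using srsly's msgpack functions; the real util.to_bytes/from_bytes may differ in detail:

    import srsly

    def to_bytes_sketch(serializers, exclude):
        msg = {key: get() for key, get in serializers.items() if key not in exclude}
        return srsly.msgpack_dumps(msg)

    def from_bytes_sketch(bytes_data, deserializers, exclude):
        msg = srsly.msgpack_loads(bytes_data)
        for key, setter in deserializers.items():
            if key not in exclude and key in msg:
                setter(msg[key])
        return msg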
@ -1,46 +1,44 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_array_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
||||
feats_array = doc.to_array((ORTH, SHAPE))
|
||||
assert feats_array[0][0] != feats_array[0][1]
|
||||
assert feats_array[0][0] != feats_array[0][1]
|
||||
|
||||
|
||||
def test_doc_stringy_array_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_stringy_array_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array((ORTH, SHAPE))
|
||||
feats_array_stringy = tokens.to_array(("ORTH", "SHAPE"))
|
||||
feats_array = doc.to_array((ORTH, SHAPE))
|
||||
feats_array_stringy = doc.to_array(("ORTH", "SHAPE"))
|
||||
assert feats_array_stringy[0][0] == feats_array[0][0]
|
||||
assert feats_array_stringy[0][1] == feats_array[0][1]
|
||||
|
||||
|
||||
def test_doc_scalar_attr_of_token(en_tokenizer, en_vocab):
|
||||
text = "An example sentence"
|
||||
tokens = en_tokenizer(text)
|
||||
example = tokens.vocab["example"]
|
||||
def test_doc_scalar_attr_of_token(en_vocab):
|
||||
doc = Doc(en_vocab, words=["An", "example", "sentence"])
|
||||
example = doc.vocab["example"]
|
||||
assert example.orth != example.shape
|
||||
feats_array = tokens.to_array(ORTH)
|
||||
feats_array = doc.to_array(ORTH)
|
||||
assert feats_array.shape == (3,)
|
||||
|
||||
|
||||
def test_doc_array_tag(en_tokenizer):
|
||||
text = "A nice sentence."
|
||||
def test_doc_array_tag(en_vocab):
|
||||
words = ["A", "nice", "sentence", "."]
|
||||
pos = ["DET", "ADJ", "NOUN", "PUNCT"]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
|
||||
doc = get_doc(en_vocab, words=words, pos=pos)
|
||||
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
||||
feats_array = doc.to_array((ORTH, POS))
|
||||
assert feats_array[0][1] == doc[0].pos
|
||||
|
@ -49,13 +47,22 @@ def test_doc_array_tag(en_tokenizer):
|
|||
assert feats_array[3][1] == doc[3].pos
|
||||
|
||||
|
||||
def test_doc_array_dep(en_tokenizer):
|
||||
text = "A nice sentence."
|
||||
def test_doc_array_dep(en_vocab):
|
||||
words = ["A", "nice", "sentence", "."]
|
||||
deps = ["det", "amod", "ROOT", "punct"]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
doc = get_doc(en_vocab, words=words, deps=deps)
|
||||
feats_array = doc.to_array((ORTH, DEP))
|
||||
assert feats_array[0][1] == doc[0].dep
|
||||
assert feats_array[1][1] == doc[1].dep
|
||||
assert feats_array[2][1] == doc[2].dep
|
||||
assert feats_array[3][1] == doc[3].dep
|
||||
|
||||
|
||||
@pytest.mark.parametrize("attrs", [["ORTH", "SHAPE"], "IS_ALPHA"])
|
||||
def test_doc_array_to_from_string_attrs(en_vocab, attrs):
|
||||
"""Test that both Doc.to_array and Doc.from_array accept string attrs,
|
||||
as well as single attrs and sequences of attrs.
|
||||
"""
|
||||
words = ["An", "example", "sentence"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
Doc(en_vocab, words=words).from_array(attrs, doc.to_array(attrs))
|
||||
|
|
|
@ -4,9 +4,10 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.errors import ModelsWarning
|
||||
from spacy.attrs import ENT_TYPE, ENT_IOB
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
@ -112,14 +113,14 @@ def test_doc_api_serialize(en_tokenizer, text):
|
|||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(tensor=False), tensor=False
|
||||
tokens.to_bytes(exclude=["tensor"]), exclude=["tensor"]
|
||||
)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(sentiment=False), sentiment=False
|
||||
tokens.to_bytes(exclude=["sentiment"]), exclude=["sentiment"]
|
||||
)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
|
@ -256,3 +257,18 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix):
|
|||
assert lca[1, 1] == 1
|
||||
assert lca[0, 1] == 2
|
||||
assert lca[1, 2] == 2
|
||||
|
||||
|
||||
def test_doc_is_nered(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York"]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
assert not doc.is_nered
|
||||
doc.ents = [Span(doc, 3, 5, label="GPE")]
|
||||
assert doc.is_nered
|
||||
# Test creating doc from array with unknown values
|
||||
arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64")
|
||||
doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr)
|
||||
assert doc.is_nered
|
||||
# Test serialization
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||
assert new_doc.is_nered
|
||||
|
|
|
@ -69,7 +69,6 @@ def test_doc_retokenize_retokenizer_attrs(en_tokenizer):
|
|||
assert doc[4].ent_type_ == "ORG"
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_doc_retokenize_lex_attrs(en_tokenizer):
|
||||
"""Test that lexical attributes can be changed (see #2390)."""
|
||||
doc = en_tokenizer("WKRO played beach boys songs")
|
||||
|
|
|
@ -18,8 +18,8 @@ LANGUAGES = ["af", "ar", "bg", "bn", "ca", "cs", "da", "de", "el", "en", "es",
|
|||
@pytest.mark.parametrize("lang", LANGUAGES)
|
||||
def test_lang_initialize(lang, capfd):
|
||||
"""Test that languages can be initialized."""
|
||||
nlp = get_lang_class(lang)() # noqa: F841
|
||||
nlp = get_lang_class(lang)()
|
||||
# Check for stray print statements (see #3342)
|
||||
doc = nlp("test")
|
||||
doc = nlp("test") # noqa: F841
|
||||
captured = capfd.readouterr()
|
||||
assert not captured.out
|
||||
|
|
spacy/tests/regression/test_issue3345.py (new file, 26 lines)
|
@ -0,0 +1,26 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline import EntityRuler, EntityRecognizer
|
||||
|
||||
|
||||
def test_issue3345():
|
||||
"""Test case where preset entity crosses sentence boundary."""
|
||||
nlp = English()
|
||||
doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"])
|
||||
doc[4].is_sent_start = True
|
||||
ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}])
|
||||
ner = EntityRecognizer(doc.vocab)
|
||||
# Add the OUT action. I wouldn't have thought this would be necessary...
|
||||
ner.moves.add_action(5, "")
|
||||
ner.add_label("GPE")
|
||||
doc = ruler(doc)
|
||||
# Get into the state just before "New"
|
||||
state = ner.moves.init_batch([doc])[0]
|
||||
ner.moves.apply_transition(state, "O")
|
||||
ner.moves.apply_transition(state, "O")
|
||||
ner.moves.apply_transition(state, "O")
|
||||
# Check that B-GPE is valid.
|
||||
assert ner.moves.is_valid(state, "B-GPE")
|
|
@ -1,6 +1,7 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
from spacy.compat import path2str
|
||||
|
||||
|
@ -41,3 +42,18 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
|
|||
doc.to_disk(file_path)
|
||||
doc_d = Doc(en_vocab).from_disk(file_path)
|
||||
assert doc.to_bytes() == doc_d.to_bytes()
|
||||
|
||||
|
||||
def test_serialize_doc_exclude(en_vocab):
|
||||
doc = Doc(en_vocab, words=["hello", "world"])
|
||||
doc.user_data["foo"] = "bar"
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||
assert new_doc.user_data["foo"] == "bar"
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(), exclude=["user_data"])
|
||||
assert not new_doc.user_data
|
||||
new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
|
||||
assert not new_doc.user_data
|
||||
with pytest.raises(ValueError):
|
||||
doc.to_bytes(user_data=False)
|
||||
with pytest.raises(ValueError):
|
||||
Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
|
||||
|
|
|
@ -52,3 +52,19 @@ def test_serialize_with_custom_tokenizer():
|
|||
nlp.tokenizer = custom_tokenizer(nlp)
|
||||
with make_tempdir() as d:
|
||||
nlp.to_disk(d)
|
||||
|
||||
|
||||
def test_serialize_language_exclude(meta_data):
|
||||
name = "name-in-fixture"
|
||||
nlp = Language(meta=meta_data)
|
||||
assert nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes())
|
||||
assert nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"])
|
||||
assert not new_nlp.meta["name"] == name
|
||||
new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"]))
|
||||
assert not new_nlp.meta["name"] == name
|
||||
with pytest.raises(ValueError):
|
||||
nlp.to_bytes(meta=False)
|
||||
with pytest.raises(ValueError):
|
||||
Language().from_bytes(nlp.to_bytes(), meta=False)
|
||||
|
|
|
@ -55,7 +55,9 @@ def test_serialize_parser_roundtrip_disk(en_vocab, Parser):
|
|||
parser_d = Parser(en_vocab)
|
||||
parser_d.model, _ = parser_d.Model(0)
|
||||
parser_d = parser_d.from_disk(file_path)
|
||||
assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False)
|
||||
parser_bytes = parser.to_bytes(exclude=["model"])
|
||||
parser_d_bytes = parser_d.to_bytes(exclude=["model"])
|
||||
assert parser_bytes == parser_d_bytes
|
||||
|
||||
|
||||
def test_to_from_bytes(parser, blank_parser):
|
||||
|
@ -114,3 +116,25 @@ def test_serialize_textcat_empty(en_vocab):
|
|||
# See issue #1105
|
||||
textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"])
|
||||
textcat.to_bytes()
|
||||
|
||||
|
||||
@pytest.mark.parametrize("Parser", test_parsers)
|
||||
def test_serialize_pipe_exclude(en_vocab, Parser):
|
||||
def get_new_parser():
|
||||
new_parser = Parser(en_vocab)
|
||||
new_parser.model, _ = new_parser.Model(0)
|
||||
return new_parser
|
||||
|
||||
parser = Parser(en_vocab)
|
||||
parser.model, _ = parser.Model(0)
|
||||
parser.cfg["foo"] = "bar"
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes())
|
||||
assert "foo" in new_parser.cfg
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes(), exclude=["cfg"])
|
||||
assert "foo" not in new_parser.cfg
|
||||
new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["cfg"]))
|
||||
assert "foo" not in new_parser.cfg
|
||||
with pytest.raises(ValueError):
|
||||
parser.to_bytes(cfg=False)
|
||||
with pytest.raises(ValueError):
|
||||
get_new_parser().from_bytes(parser.to_bytes(), cfg=False)
|
||||
|
|
|
@ -12,13 +12,12 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])]
|
|||
test_strings_attrs = [(["rats", "are", "cute"], "Hello")]
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize("text", ["rat"])
|
||||
def test_serialize_vocab(en_vocab, text):
|
||||
text_hash = en_vocab.strings.add(text)
|
||||
vocab_bytes = en_vocab.to_bytes()
|
||||
new_vocab = Vocab().from_bytes(vocab_bytes)
|
||||
assert new_vocab.strings(text_hash) == text
|
||||
assert new_vocab.strings[text_hash] == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings1,strings2", test_strings)
|
||||
|
@ -69,6 +68,15 @@ def test_serialize_vocab_lex_attrs_bytes(strings, lex_attr):
|
|||
assert vocab2[strings[0]].norm_ == lex_attr
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||
def test_deserialize_vocab_seen_entries(strings, lex_attr):
|
||||
# Reported in #2153
|
||||
vocab = Vocab(strings=strings)
|
||||
length = len(vocab)
|
||||
vocab.from_bytes(vocab.to_bytes())
|
||||
assert len(vocab) == length
|
||||
|
||||
|
||||
@pytest.mark.parametrize("strings,lex_attr", test_strings_attrs)
|
||||
def test_serialize_vocab_lex_attrs_disk(strings, lex_attr):
|
||||
vocab1 = Vocab(strings=strings)
|
||||
|
|
|
@ -3,16 +3,18 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from collections import OrderedDict
|
||||
from cython.operator cimport dereference as deref
|
||||
from cython.operator cimport preincrement as preinc
|
||||
from cymem.cymem cimport Pool
|
||||
from preshed.maps cimport PreshMap
|
||||
import re
|
||||
cimport cython
|
||||
|
||||
from collections import OrderedDict
|
||||
import re
|
||||
|
||||
from .tokens.doc cimport Doc
|
||||
from .strings cimport hash_string
|
||||
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
from . import util
|
||||
|
||||
|
@ -20,6 +22,8 @@ from . import util
|
|||
cdef class Tokenizer:
|
||||
"""Segment text, and create Doc objects with the discovered segment
|
||||
boundaries.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer
|
||||
"""
|
||||
def __init__(self, Vocab vocab, rules=None, prefix_search=None,
|
||||
suffix_search=None, infix_finditer=None, token_match=None):
|
||||
|
@ -40,6 +44,8 @@ cdef class Tokenizer:
|
|||
EXAMPLE:
|
||||
>>> tokenizer = Tokenizer(nlp.vocab)
|
||||
>>> tokenizer = English().Defaults.create_tokenizer(nlp)
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#init
|
||||
"""
|
||||
self.mem = Pool()
|
||||
self._cache = PreshMap()
|
||||
|
@ -73,6 +79,8 @@ cdef class Tokenizer:
|
|||
|
||||
string (unicode): The string to tokenize.
|
||||
RETURNS (Doc): A container for linguistic annotations.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#call
|
||||
"""
|
||||
if len(string) >= (2 ** 30):
|
||||
raise ValueError(Errors.E025.format(length=len(string)))
|
||||
|
@ -114,7 +122,7 @@ cdef class Tokenizer:
|
|||
cache_hit = self._try_cache(key, doc)
|
||||
if not cache_hit:
|
||||
self._tokenize(doc, span, key)
|
||||
doc.c[doc.length - 1].spacy = string[-1] == ' ' and not in_ws
|
||||
doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws
|
||||
return doc
|
||||
|
||||
def pipe(self, texts, batch_size=1000, n_threads=2):
|
||||
|
@ -122,9 +130,9 @@ cdef class Tokenizer:
|
|||
|
||||
texts: A sequence of unicode texts.
|
||||
batch_size (int): Number of texts to accumulate in an internal buffer.
|
||||
n_threads (int): Number of threads to use, if the implementation
|
||||
supports multi-threading. The default tokenizer is single-threaded.
|
||||
YIELDS (Doc): A sequence of Doc objects, in order.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#pipe
|
||||
"""
|
||||
for text in texts:
|
||||
yield self(text)
|
||||
|
@ -235,7 +243,7 @@ cdef class Tokenizer:
|
|||
if not matches:
|
||||
tokens.push_back(self.vocab.get(tokens.mem, string), False)
|
||||
else:
|
||||
# let's say we have dyn-o-mite-dave - the regex finds the
|
||||
# Let's say we have dyn-o-mite-dave - the regex finds the
|
||||
# start and end positions of the hyphens
|
||||
start = 0
|
||||
start_before_infixes = start
|
||||
|
@ -257,7 +265,6 @@ cdef class Tokenizer:
|
|||
# https://github.com/explosion/spaCy/issues/768)
|
||||
infix_span = string[infix_start:infix_end]
|
||||
tokens.push_back(self.vocab.get(tokens.mem, infix_span), False)
|
||||
|
||||
start = infix_end
|
||||
span = string[start:]
|
||||
if span:
|
||||
|
@ -274,7 +281,7 @@ cdef class Tokenizer:
|
|||
for i in range(n):
|
||||
if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL:
|
||||
return 0
|
||||
# See https://github.com/explosion/spaCy/issues/1250
|
||||
# See #1250
|
||||
if has_special:
|
||||
return 0
|
||||
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
||||
|
@ -293,6 +300,8 @@ cdef class Tokenizer:
|
|||
RETURNS (list): A list of `re.MatchObject` objects that have `.start()`
|
||||
and `.end()` methods, denoting the placement of internal segment
|
||||
separators, e.g. hyphens.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#find_infix
|
||||
"""
|
||||
if self.infix_finditer is None:
|
||||
return 0
|
||||
|
@ -304,6 +313,8 @@ cdef class Tokenizer:
|
|||
|
||||
string (unicode): The string to segment.
|
||||
RETURNS (int): The length of the prefix if present, otherwise `None`.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#find_prefix
|
||||
"""
|
||||
if self.prefix_search is None:
|
||||
return 0
|
||||
|
@ -316,6 +327,8 @@ cdef class Tokenizer:
|
|||
|
||||
string (unicode): The string to segment.
|
||||
RETURNS (int): The length of the suffix if present, otherwise `None`.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#find_suffix
|
||||
"""
|
||||
if self.suffix_search is None:
|
||||
return 0
|
||||
|
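find_prefix, find_suffix and find_infix simply call whichever search/finditer callables the tokenizer was constructed with, so they can be customized by assigning newly compiled regexes. A short sketch, assuming an already-created nlp object and spaCy's public regex helpers:

    import re
    from spacy.util import compile_prefix_regex, compile_suffix_regex

    # Add a custom infix pattern while keeping the default prefixes/suffixes
    nlp.tokenizer.infix_finditer = re.compile(r"[-~]").finditer
    nlp.tokenizer.prefix_search = compile_prefix_regex(nlp.Defaults.prefixes).search
    nlp.tokenizer.suffix_search = compile_suffix_regex(nlp.Defaults.suffixes).search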
@ -334,6 +347,8 @@ cdef class Tokenizer:
|
|||
token_attrs (iterable): A sequence of dicts, where each dict describes
|
||||
a token and its attributes. The `ORTH` fields of the attributes
|
||||
must exactly match the string when they are concatenated.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#add_special_case
|
||||
"""
|
||||
substrings = list(substrings)
|
||||
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
||||
|
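add_special_case registers a fixed tokenization for an exact string; the ORTH values of the supplied token dicts must concatenate back to that string. A brief usage sketch with a blank English pipeline:

    from spacy.attrs import ORTH
    from spacy.lang.en import English

    nlp = English()
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
    assert [t.text for t in nlp("gimme that")] == ["gim", "me", "that"]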
@ -345,70 +360,81 @@ cdef class Tokenizer:
|
|||
self._cache.set(key, cached)
|
||||
self._rules[string] = substrings
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
"""
|
||||
with path.open('wb') as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
it doesn't exist.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
DOCS: https://spacy.io/api/tokenizer#to_disk
|
||||
"""
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
path (unicode or Path): A path to a directory.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Tokenizer): The modified `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_disk
|
||||
"""
|
||||
with path.open('rb') as file_:
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
self.from_bytes(bytes_data, **exclude)
|
||||
self.from_bytes(bytes_data, **kwargs)
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#to_bytes
|
||||
"""
|
||||
serializers = OrderedDict((
|
||||
('vocab', lambda: self.vocab.to_bytes()),
|
||||
('prefix_search', lambda: _get_regex_pattern(self.prefix_search)),
|
||||
('suffix_search', lambda: _get_regex_pattern(self.suffix_search)),
|
||||
('infix_finditer', lambda: _get_regex_pattern(self.infix_finditer)),
|
||||
('token_match', lambda: _get_regex_pattern(self.token_match)),
|
||||
('exceptions', lambda: OrderedDict(sorted(self._rules.items())))
|
||||
("vocab", lambda: self.vocab.to_bytes()),
|
||||
("prefix_search", lambda: _get_regex_pattern(self.prefix_search)),
|
||||
("suffix_search", lambda: _get_regex_pattern(self.suffix_search)),
|
||||
("infix_finditer", lambda: _get_regex_pattern(self.infix_finditer)),
|
||||
("token_match", lambda: _get_regex_pattern(self.token_match)),
|
||||
("exceptions", lambda: OrderedDict(sorted(self._rules.items())))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(serializers, exclude, kwargs)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Tokenizer): The `Tokenizer` object.
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#from_bytes
|
||||
"""
|
||||
data = OrderedDict()
|
||||
deserializers = OrderedDict((
|
||||
('vocab', lambda b: self.vocab.from_bytes(b)),
|
||||
('prefix_search', lambda b: data.setdefault('prefix_search', b)),
|
||||
('suffix_search', lambda b: data.setdefault('suffix_search', b)),
|
||||
('infix_finditer', lambda b: data.setdefault('infix_finditer', b)),
|
||||
('token_match', lambda b: data.setdefault('token_match', b)),
|
||||
('exceptions', lambda b: data.setdefault('rules', b))
|
||||
("vocab", lambda b: self.vocab.from_bytes(b)),
|
||||
("prefix_search", lambda b: data.setdefault("prefix_search", b)),
|
||||
("suffix_search", lambda b: data.setdefault("suffix_search", b)),
|
||||
("infix_finditer", lambda b: data.setdefault("infix_finditer", b)),
|
||||
("token_match", lambda b: data.setdefault("token_match", b)),
|
||||
("exceptions", lambda b: data.setdefault("rules", b))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(deserializers, exclude, kwargs)
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
if data.get('prefix_search'):
|
||||
self.prefix_search = re.compile(data['prefix_search']).search
|
||||
if data.get('suffix_search'):
|
||||
self.suffix_search = re.compile(data['suffix_search']).search
|
||||
if data.get('infix_finditer'):
|
||||
self.infix_finditer = re.compile(data['infix_finditer']).finditer
|
||||
if data.get('token_match'):
|
||||
self.token_match = re.compile(data['token_match']).match
|
||||
for string, substrings in data.get('rules', {}).items():
|
||||
if data.get("prefix_search"):
|
||||
self.prefix_search = re.compile(data["prefix_search"]).search
|
||||
if data.get("suffix_search"):
|
||||
self.suffix_search = re.compile(data["suffix_search"]).search
|
||||
if data.get("infix_finditer"):
|
||||
self.infix_finditer = re.compile(data["infix_finditer"]).finditer
|
||||
if data.get("token_match"):
|
||||
self.token_match = re.compile(data["token_match"]).match
|
||||
for string, substrings in data.get("rules", {}).items():
|
||||
self.add_special_case(string, substrings)
|
||||
return self
|
||||
|
||||
|
|
|
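Tokenizer serialization stores the regex patterns as strings plus the exception rules, so a bytes round-trip rebuilds an equivalent tokenizer. A hedged round-trip sketch sharing one vocab and using the exclude list introduced here:

    from spacy.lang.en import English
    from spacy.tokenizer import Tokenizer

    nlp = English()
    tok_bytes = nlp.tokenizer.to_bytes(exclude=["vocab"])
    new_tokenizer = Tokenizer(nlp.vocab)
    new_tokenizer.from_bytes(tok_bytes, exclude=["vocab"])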
@ -1,5 +1,8 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .doc import Doc
|
||||
from .token import Token
|
||||
from .span import Span
|
||||
|
||||
__all__ = ['Doc', 'Token', 'Span']
|
||||
__all__ = ["Doc", "Token", "Span"]
|
||||
|
|
|
@ -6,11 +6,11 @@ from __future__ import unicode_literals
|
|||
|
||||
from libc.string cimport memcpy, memset
|
||||
from libc.stdlib cimport malloc, free
|
||||
|
||||
import numpy
|
||||
from cymem.cymem cimport Pool
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
import numpy
|
||||
|
||||
from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
|
||||
from .span cimport Span
|
||||
from .token cimport Token
|
||||
|
@ -26,11 +26,16 @@ from ..strings import get_string_id
|
|||
|
||||
|
||||
cdef class Retokenizer:
|
||||
"""Helper class for doc.retokenize() context manager."""
|
||||
"""Helper class for doc.retokenize() context manager.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#retokenize
|
||||
USAGE: https://spacy.io/usage/linguistic-features#retokenization
|
||||
"""
|
||||
cdef Doc doc
|
||||
cdef list merges
|
||||
cdef list splits
|
||||
cdef set tokens_to_merge
|
||||
|
||||
def __init__(self, doc):
|
||||
self.doc = doc
|
||||
self.merges = []
|
||||
|
@ -40,6 +45,11 @@ cdef class Retokenizer:
|
|||
def merge(self, Span span, attrs=SimpleFrozenDict()):
|
||||
"""Mark a span for merging. The attrs will be applied to the resulting
|
||||
token.
|
||||
|
||||
span (Span): The span to merge.
|
||||
attrs (dict): Attributes to set on the merged token.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#retokenizer.merge
|
||||
"""
|
||||
for token in span:
|
||||
if token.i in self.tokens_to_merge:
|
||||
|
@ -58,6 +68,16 @@ cdef class Retokenizer:
|
|||
def split(self, Token token, orths, heads, attrs=SimpleFrozenDict()):
|
||||
"""Mark a Token for splitting, into the specified orths. The attrs
|
||||
will be applied to each subtoken.
|
||||
|
||||
token (Token): The token to split.
|
||||
orths (list): The verbatim text of the split tokens. Needs to match the
|
||||
text of the original token.
|
||||
heads (list): List of token or `(token, subtoken)` tuples specifying the
|
||||
tokens to attach the newly split subtokens to.
|
||||
attrs (dict): Attributes to set on all split tokens. Attribute names
|
||||
mapped to list of per-token attribute values.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#retokenizer.split
|
||||
"""
|
||||
if ''.join(orths) != token.text:
|
||||
raise ValueError(Errors.E117.format(new=''.join(orths), old=token.text))
|
||||
|
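split() takes the new orths (which must concatenate to the original token text, otherwise E117 is raised), the heads to attach the pieces to, and optional per-subtoken attrs. A usage sketch close to the documented pattern; the example doc and head indices are illustrative:

    from spacy.vocab import Vocab
    from spacy.tokens import Doc

    doc = Doc(Vocab(), words=["I", "live", "in", "NewYork"])
    with doc.retokenize() as retokenizer:
        heads = [(doc[3], 1), doc[2]]  # "New" attaches to "York", "York" to "in"
        retokenizer.split(doc[3], ["New", "York"], heads=heads)
    assert [t.text for t in doc] == ["I", "live", "in", "New", "York"]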
@ -104,14 +124,12 @@ cdef class Retokenizer:
|
|||
# referred to in the splits. If we merged these tokens previously, we
|
||||
# have to raise an error
|
||||
if token_index == -1:
|
||||
raise IndexError(
|
||||
"Cannot find token to be split. Did it get merged?")
|
||||
raise IndexError(Errors.E122)
|
||||
head_indices = []
|
||||
for head_char, subtoken in heads:
|
||||
head_index = token_by_start(self.doc.c, self.doc.length, head_char)
|
||||
if head_index == -1:
|
||||
raise IndexError(
|
||||
"Cannot find head of token to be split. Did it get merged?")
|
||||
raise IndexError(Errors.E123)
|
||||
# We want to refer to the token index of the head *after* the
|
||||
# mergery. We need to account for the extra tokens introduced.
|
||||
# e.g., let's say we have [ab, c] and we want a and b to depend
|
||||
|
@ -206,7 +224,6 @@ def _merge(Doc doc, int start, int end, attributes):
|
|||
doc.c[i].head -= i
|
||||
# Set the left/right children, left/right edges
|
||||
set_children_from_heads(doc.c, doc.length)
|
||||
# Clear the cached Python objects
|
||||
# Return the merged Python object
|
||||
return doc[start]
|
||||
|
||||
|
@ -336,7 +353,7 @@ def _bulk_merge(Doc doc, merges):
|
|||
# Make sure ent_iob remains consistent
|
||||
for (span, _) in merges:
|
||||
if(span.end < len(offsets)):
|
||||
#if it's not the last span
|
||||
# If it's not the last span
|
||||
token_after_span_position = offsets[span.end]
|
||||
if doc.c[token_after_span_position].ent_iob == 1\
|
||||
and doc.c[token_after_span_position - 1].ent_iob in (0, 2):
|
||||
|
|
|
@ -1,3 +1,4 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
|
@ -16,9 +17,8 @@ class Binder(object):
|
|||
def __init__(self, attrs=None):
|
||||
"""Create a Binder object, to hold serialized annotations.
|
||||
|
||||
attrs (list):
|
||||
List of attributes to serialize. 'orth' and 'spacy' are always
|
||||
serialized, so they're not required. Defaults to None.
|
||||
attrs (list): List of attributes to serialize. 'orth' and 'spacy' are
|
||||
always serialized, so they're not required. Defaults to None.
|
||||
"""
|
||||
attrs = attrs or []
|
||||
self.attrs = list(attrs)
|
||||
|
|
|
@ -7,28 +7,25 @@ from __future__ import unicode_literals
|
|||
|
||||
cimport cython
|
||||
cimport numpy as np
|
||||
from libc.string cimport memcpy, memset
|
||||
from libc.math cimport sqrt
|
||||
|
||||
import numpy
|
||||
import numpy.linalg
|
||||
import struct
|
||||
import srsly
|
||||
from thinc.neural.util import get_array_module, copy_array
|
||||
import srsly
|
||||
|
||||
from libc.string cimport memcpy, memset
|
||||
from libc.math cimport sqrt
|
||||
|
||||
from .span cimport Span
|
||||
from .token cimport Token
|
||||
from .span cimport Span
|
||||
from .token cimport Token
|
||||
from ..lexeme cimport Lexeme, EMPTY_LEXEME
|
||||
from ..typedefs cimport attr_t, flags_t
|
||||
from ..attrs import intify_attrs, IDS
|
||||
from ..attrs cimport attr_id_t
|
||||
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
|
||||
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
|
||||
from ..attrs cimport ENT_TYPE, SENT_START
|
||||
from ..attrs cimport ENT_TYPE, SENT_START, attr_id_t
|
||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
|
||||
|
||||
from ..attrs import intify_attrs, IDS
|
||||
from ..util import normalize_slice
|
||||
from ..compat import is_config, copy_reg, pickle, basestring_
|
||||
from ..errors import deprecation_warning, models_warning, user_warning
|
||||
|
@ -37,6 +34,7 @@ from .. import util
|
|||
from .underscore import Underscore, get_ext_args
|
||||
from ._retokenize import Retokenizer
|
||||
|
||||
|
||||
DEF PADDING = 5
|
||||
|
||||
|
||||
|
@ -77,7 +75,7 @@ def _get_chunker(lang):
|
|||
return None
|
||||
except KeyError:
|
||||
return None
|
||||
return cls.Defaults.syntax_iterators.get(u'noun_chunks')
|
||||
return cls.Defaults.syntax_iterators.get("noun_chunks")
|
||||
|
||||
|
||||
cdef class Doc:
|
||||
|
@ -94,23 +92,60 @@ cdef class Doc:
|
|||
>>> from spacy.tokens import Doc
|
||||
>>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
|
||||
spaces=[True, False, False])
|
||||
|
||||
DOCS: https://spacy.io/api/doc
|
||||
"""
|
||||
|
||||
@classmethod
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Doc'))
|
||||
"""Define a custom attribute which becomes available as `Doc._`.
|
||||
|
||||
name (unicode): Name of the attribute to set.
|
||||
default: Optional default value of the attribute.
|
||||
getter (callable): Optional getter function.
|
||||
setter (callable): Optional setter function.
|
||||
method (callable): Optional method for method extension.
|
||||
force (bool): Force overwriting existing attribute.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#set_extension
|
||||
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
|
||||
"""
|
||||
if cls.has_extension(name) and not kwargs.get("force", False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj="Doc"))
|
||||
Underscore.doc_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
"""Look up a previously registered extension by name.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#get_extension
|
||||
"""
|
||||
return Underscore.doc_extensions.get(name)
|
||||
|
||||
@classmethod
|
||||
def has_extension(cls, name):
|
||||
"""Check whether an extension has been registered.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (bool): Whether the extension has been registered.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#has_extension
|
||||
"""
|
||||
return name in Underscore.doc_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
"""Remove a previously registered extension.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
|
||||
removed extension.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#remove_extension
|
||||
"""
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.doc_extensions.pop(name)
|
||||
|
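The expanded docstrings above cover the custom-extension API: set_extension registers an attribute under Doc._ with a default, getter/setter or method, and re-registering the same name requires force=True. A short usage sketch:

    from spacy.vocab import Vocab
    from spacy.tokens import Doc

    Doc.set_extension("is_greeting", default=False)
    doc = Doc(Vocab(), words=["hello", "world"])
    doc._.is_greeting = True
    assert Doc.has_extension("is_greeting")
    Doc.remove_extension("is_greeting")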
@ -128,6 +163,8 @@ cdef class Doc:
|
|||
it is not. If `None`, defaults to `[True]*len(words)`
|
||||
user_data (dict or None): Optional extra data to attach to the Doc.
|
||||
RETURNS (Doc): The newly constructed object.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#init
|
||||
"""
|
||||
self.vocab = vocab
|
||||
size = 20
|
||||
|
@ -151,7 +188,7 @@ cdef class Doc:
|
|||
self.user_hooks = {}
|
||||
self.user_token_hooks = {}
|
||||
self.user_span_hooks = {}
|
||||
self.tensor = numpy.zeros((0,), dtype='float32')
|
||||
self.tensor = numpy.zeros((0,), dtype="float32")
|
||||
self.user_data = {} if user_data is None else user_data
|
||||
self._vector = None
|
||||
self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
|
||||
|
@ -184,6 +221,7 @@ cdef class Doc:
|
|||
|
||||
@property
|
||||
def _(self):
|
||||
"""Custom extension attributes registered via `set_extension`."""
|
||||
return Underscore(Underscore.doc_extensions, self)
|
||||
|
||||
@property
|
||||
|
@ -195,15 +233,25 @@ cdef class Doc:
|
|||
b) sent.is_parsed is set to True;
|
||||
c) At least one token other than the first where sent_start is not None.
|
||||
"""
|
||||
if 'sents' in self.user_hooks:
|
||||
if "sents" in self.user_hooks:
|
||||
return True
|
||||
if self.is_parsed:
|
||||
return True
|
||||
for i in range(1, self.length):
|
||||
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
return False
|
||||
|
||||
@property
|
||||
def is_nered(self):
|
||||
"""Check if the document has named entities set. Will return True if
|
||||
*any* of the tokens has a named entity tag set (even if the others are
|
||||
unknown values).
|
||||
"""
|
||||
for i in range(self.length):
|
||||
if self.c[i].ent_iob != 0:
|
||||
return True
|
||||
return False
|
||||
|
||||
def __getitem__(self, object i):
|
||||
"""Get a `Token` or `Span` object.
|
||||
|
@ -227,11 +275,12 @@ cdef class Doc:
|
|||
supported, as `Span` objects must be contiguous (cannot have gaps).
|
||||
You can use negative indices and open-ended ranges, which have
|
||||
their normal Python semantics.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#getitem
|
||||
"""
|
||||
if isinstance(i, slice):
|
||||
start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
|
||||
return Span(self, start, stop, label=0)
|
||||
|
||||
if i < 0:
|
||||
i = self.length + i
|
||||
bounds_check(i, self.length, PADDING)
|
||||
|
@ -244,8 +293,7 @@ cdef class Doc:
|
|||
than-Python speeds are required, you can instead access the annotations
|
||||
as a numpy array, or access the underlying C data directly from Cython.
|
||||
|
||||
EXAMPLE:
|
||||
>>> for token in doc
|
||||
DOCS: https://spacy.io/api/doc#iter
|
||||
"""
|
||||
cdef int i
|
||||
for i in range(self.length):
|
||||
|
@ -256,16 +304,15 @@ cdef class Doc:
|
|||
|
||||
RETURNS (int): The number of tokens in the document.
|
||||
|
||||
EXAMPLE:
|
||||
>>> len(doc)
|
||||
DOCS: https://spacy.io/api/doc#len
|
||||
"""
|
||||
return self.length
|
||||
|
||||
def __unicode__(self):
|
||||
return u''.join([t.text_with_ws for t in self])
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
def __bytes__(self):
|
||||
return u''.join([t.text_with_ws for t in self]).encode('utf-8')
|
||||
return "".join([t.text_with_ws for t in self]).encode("utf-8")
|
||||
|
||||
def __str__(self):
|
||||
if is_config(python3=True):
|
||||
|
@ -290,6 +337,8 @@ cdef class Doc:
|
|||
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
|
||||
the span.
|
||||
RETURNS (Span): The newly constructed object.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#char_span
|
||||
"""
|
||||
if not isinstance(label, int):
|
||||
label = self.vocab.strings.add(label)
|
||||
|
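char_span builds a Span from character offsets into doc.text and returns None when the offsets don't line up with token boundaries; thanks to the strings.add call above, the label may now be passed as a string. A brief sketch, assuming an nlp object with an English tokenizer:

    doc = nlp("I like New York")
    span = doc.char_span(7, 15, label="GPE")   # "New York"
    assert span is not None and span.text == "New York"
    assert doc.char_span(8, 15) is None        # does not start on a token boundary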
@ -311,9 +360,11 @@ cdef class Doc:
|
|||
other (object): The object to compare with. By default, accepts `Doc`,
|
||||
`Span`, `Token` and `Lexeme` objects.
|
||||
RETURNS (float): A scalar similarity score. Higher is more similar.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#similarity
|
||||
"""
|
||||
if 'similarity' in self.user_hooks:
|
||||
return self.user_hooks['similarity'](self, other)
|
||||
if "similarity" in self.user_hooks:
|
||||
return self.user_hooks["similarity"](self, other)
|
||||
if isinstance(other, (Lexeme, Token)) and self.length == 1:
|
||||
if self.c[0].lex.orth == other.orth:
|
||||
return 1.0
|
||||
|
@ -325,9 +376,9 @@ cdef class Doc:
|
|||
else:
|
||||
return 1.0
|
||||
if self.vocab.vectors.n_keys == 0:
|
||||
models_warning(Warnings.W007.format(obj='Doc'))
|
||||
models_warning(Warnings.W007.format(obj="Doc"))
|
||||
if self.vector_norm == 0 or other.vector_norm == 0:
|
||||
user_warning(Warnings.W008.format(obj='Doc'))
|
||||
user_warning(Warnings.W008.format(obj="Doc"))
|
||||
return 0.0
|
||||
vector = self.vector
|
||||
xp = get_array_module(vector)
|
||||
|
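similarity defers to a user hook if one is registered and otherwise compares averaged vectors, emitting the W007/W008 warnings when no real word vectors are available. A minimal sketch, assuming a vectors model such as en_core_web_md is installed:

    import spacy

    nlp = spacy.load("en_core_web_md")  # assumed model with word vectors
    doc1 = nlp("I like apples")
    doc2 = nlp("I like oranges")
    print(doc1.similarity(doc2))  # cosine similarity of the averaged vectors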
@ -338,10 +389,12 @@ cdef class Doc:
|
|||
the object.
|
||||
|
||||
RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'has_vector' in self.user_hooks:
|
||||
return self.user_hooks['has_vector'](self)
|
||||
if "has_vector" in self.user_hooks:
|
||||
return self.user_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size:
|
||||
return True
|
||||
elif self.tensor.size:
|
||||
|
@ -355,15 +408,16 @@ cdef class Doc:
|
|||
|
||||
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
||||
representing the document's semantics.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector' in self.user_hooks:
|
||||
return self.user_hooks['vector'](self)
|
||||
if "vector" in self.user_hooks:
|
||||
return self.user_hooks["vector"](self)
|
||||
if self._vector is not None:
|
||||
return self._vector
|
||||
elif not len(self):
|
||||
self._vector = numpy.zeros((self.vocab.vectors_length,),
|
||||
dtype='f')
|
||||
self._vector = numpy.zeros((self.vocab.vectors_length,), dtype="f")
|
||||
return self._vector
|
||||
elif self.vocab.vectors.data.size > 0:
|
||||
self._vector = sum(t.vector for t in self) / len(self)
|
||||
|
@ -372,8 +426,7 @@ cdef class Doc:
|
|||
self._vector = self.tensor.mean(axis=0)
|
||||
return self._vector
|
||||
else:
|
||||
return numpy.zeros((self.vocab.vectors_length,),
|
||||
dtype='float32')
|
||||
return numpy.zeros((self.vocab.vectors_length,), dtype="float32")
|
||||
|
||||
def __set__(self, value):
|
||||
self._vector = value
|
||||
|
@ -382,10 +435,12 @@ cdef class Doc:
|
|||
"""The L2 norm of the document's vector representation.
|
||||
|
||||
RETURNS (float): The L2 norm of the vector representation.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#vector_norm
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector_norm' in self.user_hooks:
|
||||
return self.user_hooks['vector_norm'](self)
|
||||
if "vector_norm" in self.user_hooks:
|
||||
return self.user_hooks["vector_norm"](self)
|
||||
cdef float value
|
||||
cdef double norm = 0
|
||||
if self._vector_norm is None:
|
||||
|
@ -404,7 +459,7 @@ cdef class Doc:
|
|||
RETURNS (unicode): The original verbatim text of the document.
|
||||
"""
|
||||
def __get__(self):
|
||||
return u''.join(t.text_with_ws for t in self)
|
||||
return "".join(t.text_with_ws for t in self)
|
||||
|
||||
property text_with_ws:
|
||||
"""An alias of `Doc.text`, provided for duck-type compatibility with
|
||||
|
@ -416,21 +471,12 @@ cdef class Doc:
|
|||
return self.text
|
||||
|
||||
property ents:
|
||||
"""Iterate over the entities in the document. Yields named-entity
|
||||
`Span` objects, if the entity recognizer has been applied to the
|
||||
document.
|
||||
"""The named entities in the document. Returns a tuple of named entity
|
||||
`Span` objects, if the entity recognizer has been applied.
|
||||
|
||||
YIELDS (Span): Entities in the document.
|
||||
RETURNS (tuple): Entities in the document, one `Span` per entity.
|
||||
|
||||
EXAMPLE: Iterate over the span to get individual Token objects,
|
||||
or access the label:
|
||||
|
||||
>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
|
||||
>>> ents = list(tokens.ents)
|
||||
>>> assert ents[0].label == 346
|
||||
>>> assert ents[0].label_ == 'PERSON'
|
||||
>>> assert ents[0].orth_ == 'Best'
|
||||
>>> assert ents[0].text == 'Mr. Best'
|
||||
DOCS: https://spacy.io/api/doc#ents
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef int i
|
||||
|
@ -442,8 +488,8 @@ cdef class Doc:
|
|||
token = &self.c[i]
|
||||
if token.ent_iob == 1:
|
||||
if start == -1:
|
||||
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
|
||||
raise ValueError(Errors.E093.format(seq=' '.join(seq)))
|
||||
seq = ["%s|%s" % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
|
||||
raise ValueError(Errors.E093.format(seq=" ".join(seq)))
|
||||
elif token.ent_iob == 2 or token.ent_iob == 0:
|
||||
if start != -1:
|
||||
output.append(Span(self, start, i, label=label))
|
||||
|
@ -465,7 +511,6 @@ cdef class Doc:
|
|||
# prediction
|
||||
# 3. Test basic data-driven ORTH gazetteer
|
||||
# 4. Test more nuanced date and currency regex
|
||||
|
||||
tokens_in_ents = {}
|
||||
cdef attr_t entity_type
|
||||
cdef int ent_start, ent_end
|
||||
|
@ -479,7 +524,6 @@ cdef class Doc:
|
|||
self.vocab.strings[tokens_in_ents[token_index][2]]),
|
||||
span2=(ent_start, ent_end, self.vocab.strings[entity_type])))
|
||||
tokens_in_ents[token_index] = (ent_start, ent_end, entity_type)
|
||||
|
||||
cdef int i
|
||||
for i in range(self.length):
|
||||
self.c[i].ent_type = 0
|
||||
|
@ -510,6 +554,8 @@ cdef class Doc:
|
|||
clauses.
|
||||
|
||||
YIELDS (Span): Noun chunks in the document.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#noun_chunks
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.is_parsed:
|
||||
|
@ -533,15 +579,15 @@ cdef class Doc:
|
|||
dependency parse. If the parser is disabled, the `sents` iterator will
|
||||
be unavailable.
|
||||
|
||||
EXAMPLE:
|
||||
>>> doc = nlp("This is a sentence. Here's another...")
|
||||
>>> assert [s.root.text for s in doc.sents] == ["is", "'s"]
|
||||
YIELDS (Span): Sentences in the document.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#sents
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.is_sentenced:
|
||||
raise ValueError(Errors.E030)
|
||||
if 'sents' in self.user_hooks:
|
||||
yield from self.user_hooks['sents'](self)
|
||||
if "sents" in self.user_hooks:
|
||||
yield from self.user_hooks["sents"](self)
|
||||
else:
|
||||
start = 0
|
||||
for i in range(1, self.length):
|
||||
|
@ -606,17 +652,16 @@ cdef class Doc:
|
|||
if isinstance(py_attr_ids, basestring_):
|
||||
# Handle inputs like doc.to_array('ORTH')
|
||||
py_attr_ids = [py_attr_ids]
|
||||
elif not hasattr(py_attr_ids, '__iter__'):
|
||||
elif not hasattr(py_attr_ids, "__iter__"):
|
||||
# Handle inputs like doc.to_array(ORTH)
|
||||
py_attr_ids = [py_attr_ids]
|
||||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||
py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, 'upper') else id_)
|
||||
py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in py_attr_ids]
|
||||
# Make an array from the attributes --- otherwise our inner loop is
|
||||
# Python dict iteration.
|
||||
cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype='i')
|
||||
output = numpy.ndarray(shape=(self.length, len(attr_ids)),
|
||||
dtype=numpy.uint64)
|
||||
cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype="i")
|
||||
output = numpy.ndarray(shape=(self.length, len(attr_ids)), dtype=numpy.uint64)
|
||||
c_output = <attr_t*>output.data
|
||||
c_attr_ids = <attr_id_t*>attr_ids.data
|
||||
cdef TokenC* token
|
||||
|
@ -628,8 +673,7 @@ cdef class Doc:
|
|||
# Handle 1d case
|
||||
return output if len(attr_ids) >= 2 else output.reshape((self.length,))
|
||||
|
||||
def count_by(self, attr_id_t attr_id, exclude=None,
|
||||
PreshCounter counts=None):
|
||||
def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
|
||||
"""Count the frequencies of a given attribute. Produces a dict of
|
||||
`{attribute (int): count (ints)}` frequencies, keyed by the values of
|
||||
the given attribute ID.
|
||||
|
@ -637,13 +681,7 @@ cdef class Doc:
|
|||
attr_id (int): The attribute ID to key the counts.
|
||||
RETURNS (dict): A dictionary mapping attributes to integer counts.
|
||||
|
||||
EXAMPLE:
|
||||
>>> from spacy import attrs
|
||||
>>> doc = nlp(u'apple apple orange banana')
|
||||
>>> tokens.count_by(attrs.ORTH)
|
||||
{12800L: 1, 11880L: 2, 7561L: 1}
|
||||
>>> tokens.to_array([attrs.ORTH])
|
||||
array([[11880], [11880], [7561], [12800]])
|
||||
DOCS: https://spacy.io/api/doc#count_by
|
||||
"""
|
||||
cdef int i
|
||||
cdef attr_t attr
|
||||
|
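count_by counts how often each value of the given attribute occurs, keyed by the attribute value (a hash for string attributes); this stands in for the interactive example removed from the docstring. A short sketch, assuming an nlp object:

    from spacy.attrs import ORTH

    doc = nlp("apple apple orange banana")
    counts = doc.count_by(ORTH)  # {orth_id: count, ...}
    readable = {doc.vocab.strings[k]: v for k, v in counts.items()}
    assert readable["apple"] == 2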
@ -684,13 +722,33 @@ cdef class Doc:
|
|||
cdef void set_parse(self, const TokenC* parsed) nogil:
|
||||
# TODO: This method is fairly misleading atm. It's used by Parser
|
||||
# to actually apply the parse calculated. Need to rethink this.
|
||||
|
||||
# Probably we should use from_array?
|
||||
self.is_parsed = True
|
||||
for i in range(self.length):
|
||||
self.c[i] = parsed[i]
|
||||
|
||||
def from_array(self, attrs, array):
|
||||
"""Load attributes from a numpy array. Write to a `Doc` object, from an
|
||||
`(M, N)` array of attributes.
|
||||
|
||||
attrs (list): A list of attribute ID ints.
|
||||
array (numpy.ndarray[ndim=2, dtype='int32']): The attribute values.
|
||||
RETURNS (Doc): Itself.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_array
|
||||
"""
|
||||
# Handle scalar/list inputs of strings/ints for py_attr_ids
|
||||
# See also #3064
|
||||
if isinstance(attrs, basestring_):
|
||||
# Handle inputs like doc.to_array('ORTH')
|
||||
attrs = [attrs]
|
||||
elif not hasattr(attrs, "__iter__"):
|
||||
# Handle inputs like doc.to_array(ORTH)
|
||||
attrs = [attrs]
|
||||
# Allow strings, e.g. 'lemma' or 'LEMMA'
|
||||
attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
|
||||
for id_ in attrs]
|
||||
|
||||
if SENT_START in attrs and HEAD in attrs:
|
||||
raise ValueError(Errors.E032)
|
||||
cdef int i, col
|
||||
|
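With the handling added above, from_array accepts the same flexible attribute specs as to_array (a single string, a single ID, or a sequence), so annotation round-trips can be written with string names. A hedged sketch on a blank vocab:

    from spacy.vocab import Vocab
    from spacy.tokens import Doc

    vocab = Vocab()
    doc = Doc(vocab, words=["An", "example", "sentence"])
    arr = doc.to_array(["LEMMA", "POS"])
    new_doc = Doc(vocab, words=["An", "example", "sentence"])
    new_doc.from_array(["LEMMA", "POS"], arr)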
@ -703,6 +761,8 @@ cdef class Doc:
|
|||
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
|
||||
for i, attr_id in enumerate(attrs):
|
||||
attr_ids[i] = attr_id
|
||||
if len(array.shape) == 1:
|
||||
array = array.reshape((array.size, 1))
|
||||
# Now load the data
|
||||
for i in range(self.length):
|
||||
token = &self.c[i]
|
||||
|
@ -714,10 +774,10 @@ cdef class Doc:
|
|||
for i in range(length):
|
||||
if array[i, col] != 0:
|
||||
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
|
||||
# set flags
|
||||
# Set flags
|
||||
self.is_parsed = bool(self.is_parsed or HEAD in attrs or DEP in attrs)
|
||||
self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs)
|
||||
# if document is parsed, set children
|
||||
# If document is parsed, set children
|
||||
if self.is_parsed:
|
||||
set_children_from_heads(self.c, self.length)
|
||||
return self
|
||||
|
@ -729,46 +789,56 @@ cdef class Doc:
|
|||
|
||||
RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape
|
||||
(n, n), where n = len(self).
|
||||
|
||||
DOCS: https://spacy.io/api/doc#get_lca_matrix
|
||||
"""
|
||||
return numpy.asarray(_get_lca_matrix(self, 0, len(self)))
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open('wb') as file_:
|
||||
file_.write(self.to_bytes(**exclude))
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes(**kwargs))
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Doc): The modified `Doc` object.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
with path.open('rb') as file_:
|
||||
with path.open("rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
return self.from_bytes(bytes_data, **exclude)
|
||||
return self.from_bytes(bytes_data, **kwargs)
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize, i.e. export the document contents to a binary string.
|
||||
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
|
||||
all annotations.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#to_bytes
|
||||
"""
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
|
||||
|
||||
if self.is_tagged:
|
||||
array_head.append(TAG)
|
||||
# if doc parsed add head and dep attribute
|
||||
# If doc parsed add head and dep attribute
|
||||
if self.is_parsed:
|
||||
array_head.extend([HEAD, DEP])
|
||||
# otherwise add sent_start
|
||||
# Otherwise add sent_start
|
||||
else:
|
||||
array_head.append(SENT_START)
|
||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||
|
@ -776,60 +846,66 @@ cdef class Doc:
|
|||
# keys, we must have tuples. In values we just have to hope
|
||||
# users don't mind getting a list instead of a tuple.
|
||||
serializers = {
|
||||
'text': lambda: self.text,
|
||||
'array_head': lambda: array_head,
|
||||
'array_body': lambda: self.to_array(array_head),
|
||||
'sentiment': lambda: self.sentiment,
|
||||
'tensor': lambda: self.tensor,
|
||||
"text": lambda: self.text,
|
||||
"array_head": lambda: array_head,
|
||||
"array_body": lambda: self.to_array(array_head),
|
||||
"sentiment": lambda: self.sentiment,
|
||||
"tensor": lambda: self.tensor,
|
||||
}
|
||||
if 'user_data' not in exclude and self.user_data:
|
||||
for key in kwargs:
|
||||
if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
if "user_data" not in exclude and self.user_data:
|
||||
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
|
||||
serializers['user_data_keys'] = lambda: srsly.msgpack_dumps(user_data_keys)
|
||||
serializers['user_data_values'] = lambda: srsly.msgpack_dumps(user_data_values)
|
||||
|
||||
if "user_data_keys" not in exclude:
|
||||
serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys)
|
||||
if "user_data_values" not in exclude:
|
||||
serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values)
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Deserialize, i.e. import the document contents from a binary string.
|
||||
|
||||
data (bytes): The string to load from.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Doc): Itself.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#from_bytes
|
||||
"""
|
||||
if self.length != 0:
|
||||
raise ValueError(Errors.E033.format(length=self.length))
|
||||
deserializers = {
|
||||
'text': lambda b: None,
|
||||
'array_head': lambda b: None,
|
||||
'array_body': lambda b: None,
|
||||
'sentiment': lambda b: None,
|
||||
'tensor': lambda b: None,
|
||||
'user_data_keys': lambda b: None,
|
||||
'user_data_values': lambda b: None,
|
||||
"text": lambda b: None,
|
||||
"array_head": lambda b: None,
|
||||
"array_body": lambda b: None,
|
||||
"sentiment": lambda b: None,
|
||||
"tensor": lambda b: None,
|
||||
"user_data_keys": lambda b: None,
|
||||
"user_data_values": lambda b: None,
|
||||
}
|
||||
|
||||
for key in kwargs:
|
||||
if key in deserializers or key in ("user_data",):
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
msg = util.from_bytes(bytes_data, deserializers, exclude)
|
||||
# Msgpack doesn't distinguish between lists and tuples, which is
|
||||
# vexing for user data. As a best guess, we *know* that within
|
||||
# keys, we must have tuples. In values we just have to hope
|
||||
# users don't mind getting a list instead of a tuple.
|
||||
if 'user_data' not in exclude and 'user_data_keys' in msg:
|
||||
user_data_keys = srsly.msgpack_loads(msg['user_data_keys'], use_list=False)
|
||||
user_data_values = srsly.msgpack_loads(msg['user_data_values'])
|
||||
if "user_data" not in exclude and "user_data_keys" in msg:
|
||||
user_data_keys = srsly.msgpack_loads(msg["user_data_keys"], use_list=False)
|
||||
user_data_values = srsly.msgpack_loads(msg["user_data_values"])
|
||||
for key, value in zip(user_data_keys, user_data_values):
|
||||
self.user_data[key] = value
|
||||
|
||||
cdef int i, start, end, has_space
|
||||
|
||||
if 'sentiment' not in exclude and 'sentiment' in msg:
|
||||
self.sentiment = msg['sentiment']
|
||||
if 'tensor' not in exclude and 'tensor' in msg:
|
||||
self.tensor = msg['tensor']
|
||||
|
||||
if "sentiment" not in exclude and "sentiment" in msg:
|
||||
self.sentiment = msg["sentiment"]
|
||||
if "tensor" not in exclude and "tensor" in msg:
|
||||
self.tensor = msg["tensor"]
|
||||
start = 0
|
||||
cdef const LexemeC* lex
|
||||
cdef unicode orth_
|
||||
text = msg['text']
|
||||
attrs = msg['array_body']
|
||||
text = msg["text"]
|
||||
attrs = msg["array_body"]
|
||||
for i in range(attrs.shape[0]):
|
||||
end = start + attrs[i, 0]
|
||||
has_space = attrs[i, 1]
|
||||
|
@ -837,11 +913,11 @@ cdef class Doc:
|
|||
lex = self.vocab.get(self.mem, orth_)
|
||||
self.push_back(lex, has_space)
|
||||
start = end + has_space
|
||||
self.from_array(msg['array_head'][2:], attrs[:, 2:])
|
||||
self.from_array(msg["array_head"][2:], attrs[:, 2:])
|
||||
return self
|
||||
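The byte-level counterpart, using the new `exclude` list introduced here and reusing `doc` from the earlier sketch:

    data = doc.to_bytes(exclude=["tensor", "user_data"])
    doc_from_bytes = Doc(doc.vocab).from_bytes(data)
    assert doc_from_bytes.text == doc.text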
|
||||
def extend_tensor(self, tensor):
|
||||
'''Concatenate a new tensor onto the doc.tensor object.
|
||||
"""Concatenate a new tensor onto the doc.tensor object.
|
||||
|
||||
The doc.tensor attribute holds dense feature vectors
|
||||
computed by the models in the pipeline. Let's say a
|
||||
|
@ -849,7 +925,7 @@ cdef class Doc:
|
|||
per word. doc.tensor.shape will be (30, 128). After
|
||||
calling doc.extend_tensor with an array of shape (30, 64),
|
||||
doc.tensor == (30, 192).
|
||||
'''
|
||||
"""
|
||||
xp = get_array_module(self.tensor)
|
||||
if self.tensor.size == 0:
|
||||
self.tensor.resize(tensor.shape, refcheck=False)
|
||||
|
@ -858,7 +934,7 @@ cdef class Doc:
|
|||
self.tensor = xp.hstack((self.tensor, tensor))
|
||||
|
||||
def retokenize(self):
|
||||
'''Context manager to handle retokenization of the Doc.
|
||||
"""Context manager to handle retokenization of the Doc.
|
||||
Modifications to the Doc's tokenization are stored, and then
|
||||
made all at once when the context manager exits. This is
|
||||
much more efficient, and less error-prone.
|
||||
|
@ -866,7 +942,10 @@ cdef class Doc:
|
|||
All views of the Doc (Span and Token) created before the
|
||||
retokenization are invalidated, although they may accidentally
|
||||
continue to work.
|
||||
'''
|
||||
|
||||
DOCS: https://spacy.io/api/doc#retokenize
|
||||
USAGE: https://spacy.io/usage/linguistic-features#retokenization
|
||||
"""
|
||||
return Retokenizer(self)
|
||||
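A short sketch of the retokenization context manager in use, assuming the `nlp` pipeline from the earlier examples:

    doc = nlp(u"I like New York in Autumn.")
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[2:4], attrs={"LEMMA": u"New York"})
    assert len(doc) == 6 and doc[2].text == u"New York"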
|
||||
def _bulk_merge(self, spans, attributes):
|
||||
|
@ -882,9 +961,10 @@ cdef class Doc:
|
|||
RETURNS (Token): The first newly merged token.
|
||||
"""
|
||||
cdef unicode tag, lemma, ent_type
|
||||
|
||||
assert len(attributes) == len(spans), "attribute length should be equal to span length" + str(len(attributes)) +\
|
||||
str(len(spans))
|
||||
attr_len = len(attributes)
|
||||
span_len = len(spans)
|
||||
if not attr_len == span_len:
|
||||
raise ValueError(Errors.E121.format(attr_len=attr_len, span_len=span_len))
|
||||
with self.retokenize() as retokenizer:
|
||||
for i, span in enumerate(spans):
|
||||
fix_attributes(self, attributes[i])
|
||||
|
@ -915,13 +995,10 @@ cdef class Doc:
|
|||
elif not args:
|
||||
fix_attributes(self, attributes)
|
||||
elif args:
|
||||
raise ValueError(Errors.E034.format(n_args=len(args),
|
||||
args=repr(args),
|
||||
raise ValueError(Errors.E034.format(n_args=len(args), args=repr(args),
|
||||
kwargs=repr(attributes)))
|
||||
remove_label_if_necessary(attributes)
|
||||
|
||||
attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
|
||||
|
||||
cdef int start = token_by_start(self.c, self.length, start_idx)
|
||||
if start == -1:
|
||||
return None
|
||||
|
@ -938,44 +1015,45 @@ cdef class Doc:
|
|||
raise ValueError(Errors.E105)
|
||||
|
||||
def to_json(self, underscore=None):
|
||||
"""Convert a Doc to JSON. Produces the same format used by the spacy
|
||||
train command.
|
||||
"""Convert a Doc to JSON. The format it produces will be the new format
|
||||
for the `spacy train` command (not implemented yet).
|
||||
|
||||
underscore (list): Optional list of string names of custom doc._.
|
||||
attributes. Attribute values need to be JSON-serializable. Values will
|
||||
be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.
|
||||
RETURNS (dict): The data in spaCy's JSON format.
|
||||
|
||||
DOCS: https://spacy.io/api/doc#to_json
|
||||
"""
|
||||
data = {'text': self.text}
|
||||
data['ents'] = [{'start': ent.start_char, 'end': ent.end_char,
|
||||
'label': ent.label_} for ent in self.ents]
|
||||
sents = list(self.sents)
|
||||
if sents:
|
||||
data['sents'] = [{'start': sent.start_char, 'end': sent.end_char}
|
||||
data = {"text": self.text}
|
||||
if self.is_nered:
|
||||
data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
|
||||
"label": ent.label_} for ent in self.ents]
|
||||
if self.is_sentenced:
|
||||
sents = list(self.sents)
|
||||
data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
|
||||
for sent in sents]
|
||||
if self.cats:
|
||||
data['cats'] = self.cats
|
||||
data['tokens'] = []
|
||||
data["cats"] = self.cats
|
||||
data["tokens"] = []
|
||||
for token in self:
|
||||
token_data = {'id': token.i, 'start': token.idx, 'end': token.idx + len(token)}
|
||||
if token.pos_:
|
||||
token_data['pos'] = token.pos_
|
||||
if token.tag_:
|
||||
token_data['tag'] = token.tag_
|
||||
if token.dep_:
|
||||
token_data['dep'] = token.dep_
|
||||
if token.head:
|
||||
token_data['head'] = token.head.i
|
||||
data['tokens'].append(token_data)
|
||||
token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)}
|
||||
if self.is_tagged:
|
||||
token_data["pos"] = token.pos_
|
||||
token_data["tag"] = token.tag_
|
||||
if self.is_parsed:
|
||||
token_data["dep"] = token.dep_
|
||||
token_data["head"] = token.head.i
|
||||
data["tokens"].append(token_data)
|
||||
if underscore:
|
||||
data['_'] = {}
|
||||
data["_"] = {}
|
||||
for attr in underscore:
|
||||
if not self.has_extension(attr):
|
||||
raise ValueError(Errors.E106.format(attr=attr, opts=underscore))
|
||||
value = self._.get(attr)
|
||||
if not srsly.is_json_serializable(value):
|
||||
raise ValueError(Errors.E107.format(attr=attr, value=repr(value)))
|
||||
data['_'][attr] = value
|
||||
data["_"][attr] = value
|
||||
return data
|
||||
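A sketch of `to_json` together with a custom underscore attribute; the extension name `review_id` is made up for illustration, and `nlp` is assumed from the earlier sketches.

    from spacy.tokens import Doc

    Doc.set_extension("review_id", default=None, force=True)
    doc = nlp(u"This is a test.")
    doc._.review_id = 12345
    json_data = doc.to_json(underscore=["review_id"])
    assert json_data["text"] == doc.text
    assert json_data["_"]["review_id"] == 12345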
|
||||
|
||||
|
@ -1007,9 +1085,8 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
|
|||
tokens[i].r_kids = 0
|
||||
tokens[i].l_edge = i
|
||||
tokens[i].r_edge = i
|
||||
# Three times, for non-projectivity
|
||||
# See issue #3170. This isn't a very satisfying fix, but I think it's
|
||||
# sufficient.
|
||||
# Three times, for non-projectivity. See issue #3170. This isn't a very
|
||||
# satisfying fix, but I think it's sufficient.
|
||||
for loop_count in range(3):
|
||||
# Set left edges
|
||||
for i in range(length):
|
||||
|
@ -1021,7 +1098,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
|
|||
head.l_edge = child.l_edge
|
||||
if child.r_edge > head.r_edge:
|
||||
head.r_edge = child.r_edge
|
||||
# Set right edges --- same as above, but iterate in reverse
|
||||
# Set right edges - same as above, but iterate in reverse
|
||||
for i in range(length-1, -1, -1):
|
||||
child = &tokens[i]
|
||||
head = &tokens[i + child.head]
|
||||
|
@ -1052,20 +1129,14 @@ cdef int _get_tokens_lca(Token token_j, Token token_k):
|
|||
return token_k.i
|
||||
elif token_k.head == token_j:
|
||||
return token_j.i
|
||||
|
||||
token_j_ancestors = set(token_j.ancestors)
|
||||
|
||||
if token_k in token_j_ancestors:
|
||||
return token_k.i
|
||||
|
||||
for token_k_ancestor in token_k.ancestors:
|
||||
|
||||
if token_k_ancestor == token_j:
|
||||
return token_j.i
|
||||
|
||||
if token_k_ancestor in token_j_ancestors:
|
||||
return token_k_ancestor.i
|
||||
|
||||
return -1
|
||||
|
||||
|
||||
|
@ -1083,12 +1154,10 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
|
|||
with shape (n, n), where n = len(doc).
|
||||
"""
|
||||
cdef int [:,:] lca_matrix
|
||||
|
||||
n_tokens = end - start
|
||||
lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32)
|
||||
lca_mat.fill(-1)
|
||||
lca_matrix = lca_mat
|
||||
|
||||
for j in range(n_tokens):
|
||||
token_j = doc[start + j]
|
||||
# the common ancestor of token and itself is itself:
|
||||
|
@ -1109,12 +1178,11 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end):
|
|||
else:
|
||||
lca_matrix[j, k] = lca - start
|
||||
lca_matrix[k, j] = lca - start
|
||||
|
||||
return lca_matrix
|
||||
|
||||
|
||||
def pickle_doc(doc):
|
||||
bytes_data = doc.to_bytes(vocab=False, user_data=False)
|
||||
bytes_data = doc.to_bytes(exclude=["vocab", "user_data"])
|
||||
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
|
||||
doc.user_token_hooks)
|
||||
return (unpickle_doc, (doc.vocab, srsly.pickle_dumps(hooks_and_data), bytes_data))
|
||||
|
@ -1123,8 +1191,7 @@ def pickle_doc(doc):
|
|||
def unpickle_doc(vocab, hooks_and_data, bytes_data):
|
||||
user_data, doc_hooks, span_hooks, token_hooks = srsly.pickle_loads(hooks_and_data)
|
||||
|
||||
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data,
|
||||
exclude='user_data')
|
||||
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data, exclude=["user_data"])
|
||||
doc.user_hooks.update(doc_hooks)
|
||||
doc.user_span_hooks.update(span_hooks)
|
||||
doc.user_token_hooks.update(token_hooks)
|
||||
|
@ -1133,19 +1200,22 @@ def unpickle_doc(vocab, hooks_and_data, bytes_data):
|
|||
|
||||
copy_reg.pickle(Doc, pickle_doc, unpickle_doc)
|
||||
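Because `pickle_doc`/`unpickle_doc` are registered with `copy_reg` above, a `Doc` can be pickled directly; the vocab and user data are excluded from the byte payload and re-attached on load. A minimal sketch:

    import pickle

    restored = pickle.loads(pickle.dumps(doc))
    assert restored.text == doc.text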
|
||||
|
||||
def remove_label_if_necessary(attributes):
|
||||
# More deprecated attribute handling =/
|
||||
if 'label' in attributes:
|
||||
attributes['ent_type'] = attributes.pop('label')
|
||||
if "label" in attributes:
|
||||
attributes["ent_type"] = attributes.pop("label")
|
||||
|
||||
|
||||
def fix_attributes(doc, attributes):
|
||||
if 'label' in attributes and 'ent_type' not in attributes:
|
||||
if isinstance(attributes['label'], int):
|
||||
attributes[ENT_TYPE] = attributes['label']
|
||||
if "label" in attributes and "ent_type" not in attributes:
|
||||
if isinstance(attributes["label"], int):
|
||||
attributes[ENT_TYPE] = attributes["label"]
|
||||
else:
|
||||
attributes[ENT_TYPE] = doc.vocab.strings[attributes['label']]
|
||||
if 'ent_type' in attributes:
|
||||
attributes[ENT_TYPE] = attributes['ent_type']
|
||||
attributes[ENT_TYPE] = doc.vocab.strings[attributes["label"]]
|
||||
if "ent_type" in attributes:
|
||||
attributes[ENT_TYPE] = attributes["ent_type"]
|
||||
|
||||
|
||||
def get_entity_info(ent_info):
|
||||
if isinstance(ent_info, Span):
|
||||
|
|
|
@ -1,12 +1,13 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from collections import defaultdict
|
||||
|
||||
cimport numpy as np
|
||||
from libc.math cimport sqrt
|
||||
|
||||
import numpy
|
||||
import numpy.linalg
|
||||
from libc.math cimport sqrt
|
||||
from thinc.neural.util import get_array_module
|
||||
from collections import defaultdict
|
||||
|
||||
from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix
|
||||
from .token cimport TokenC
|
||||
|
@ -14,9 +15,10 @@ from ..structs cimport TokenC, LexemeC
|
|||
from ..typedefs cimport flags_t, attr_t, hash_t
|
||||
from ..attrs cimport attr_id_t
|
||||
from ..parts_of_speech cimport univ_pos_t
|
||||
from ..util import normalize_slice
|
||||
from ..attrs cimport *
|
||||
from ..lexeme cimport Lexeme
|
||||
|
||||
from ..util import normalize_slice
|
||||
from ..compat import is_config, basestring_
|
||||
from ..errors import Errors, TempErrors, Warnings, user_warning, models_warning
|
||||
from ..errors import deprecation_warning
|
||||
|
@ -24,29 +26,66 @@ from .underscore import Underscore, get_ext_args
|
|||
|
||||
|
||||
cdef class Span:
|
||||
"""A slice from a Doc object."""
|
||||
"""A slice from a Doc object.
|
||||
|
||||
DOCS: https://spacy.io/api/span
|
||||
"""
|
||||
@classmethod
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Span'))
|
||||
"""Define a custom attribute which becomes available as `Span._`.
|
||||
|
||||
name (unicode): Name of the attribute to set.
|
||||
default: Optional default value of the attribute.
|
||||
getter (callable): Optional getter function.
|
||||
setter (callable): Optional setter function.
|
||||
method (callable): Optional method for method extension.
|
||||
force (bool): Force overwriting existing attribute.
|
||||
|
||||
DOCS: https://spacy.io/api/span#set_extension
|
||||
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
|
||||
"""
|
||||
if cls.has_extension(name) and not kwargs.get("force", False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj="Span"))
|
||||
Underscore.span_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
"""Look up a previously registered extension by name.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
|
||||
|
||||
DOCS: https://spacy.io/api/span#get_extension
|
||||
"""
|
||||
return Underscore.span_extensions.get(name)
|
||||
|
||||
@classmethod
|
||||
def has_extension(cls, name):
|
||||
"""Check whether an extension has been registered.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (bool): Whether the extension has been registered.
|
||||
|
||||
DOCS: https://spacy.io/api/span#has_extension
|
||||
"""
|
||||
return name in Underscore.span_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
"""Remove a previously registered extension.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
|
||||
removed extension.
|
||||
|
||||
DOCS: https://spacy.io/api/span#remove_extension
|
||||
"""
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.span_extensions.pop(name)
|
||||
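A usage sketch for the extension methods above; the attribute name `has_currency` is hypothetical, and the `nlp` pipeline from the earlier examples is assumed.

    from spacy.tokens import Span

    Span.set_extension("has_currency",
                       getter=lambda span: any(t.is_currency for t in span),
                       force=True)
    doc = nlp(u"The bill came to 20 € in the end.")
    assert doc[3:6]._.has_currency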
|
||||
def __cinit__(self, Doc doc, int start, int end, label=0,
|
||||
vector=None, vector_norm=None):
|
||||
def __cinit__(self, Doc doc, int start, int end, label=0, vector=None,
|
||||
vector_norm=None):
|
||||
"""Create a `Span` object from the slice `doc[start : end]`.
|
||||
|
||||
doc (Doc): The parent document.
|
||||
|
@ -56,6 +95,8 @@ cdef class Span:
|
|||
vector (ndarray[ndim=1, dtype='float32']): A meaning representation
|
||||
of the span.
|
||||
RETURNS (Span): The newly constructed object.
|
||||
|
||||
DOCS: https://spacy.io/api/span#init
|
||||
"""
|
||||
if not (0 <= start <= end <= len(doc)):
|
||||
raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc)))
|
||||
|
@ -102,6 +143,8 @@ cdef class Span:
|
|||
"""Get the number of tokens in the span.
|
||||
|
||||
RETURNS (int): The number of tokens in the span.
|
||||
|
||||
DOCS: https://spacy.io/api/span#len
|
||||
"""
|
||||
self._recalculate_indices()
|
||||
if self.end < self.start:
|
||||
|
@ -111,7 +154,7 @@ cdef class Span:
|
|||
def __repr__(self):
|
||||
if is_config(python3=True):
|
||||
return self.text
|
||||
return self.text.encode('utf-8')
|
||||
return self.text.encode("utf-8")
|
||||
|
||||
def __getitem__(self, object i):
|
||||
"""Get a `Token` or a `Span` object
|
||||
|
@ -120,9 +163,7 @@ cdef class Span:
|
|||
the span to get.
|
||||
RETURNS (Token or Span): The token at `span[i]`.
|
||||
|
||||
EXAMPLE:
|
||||
>>> span[0]
|
||||
>>> span[1:3]
|
||||
DOCS: https://spacy.io/api/span#getitem
|
||||
"""
|
||||
self._recalculate_indices()
|
||||
if isinstance(i, slice):
|
||||
|
@ -138,6 +179,8 @@ cdef class Span:
|
|||
"""Iterate over `Token` objects.
|
||||
|
||||
YIELDS (Token): A `Token` object.
|
||||
|
||||
DOCS: https://spacy.io/api/span#iter
|
||||
"""
|
||||
self._recalculate_indices()
|
||||
for i in range(self.start, self.end):
|
||||
|
@ -148,31 +191,32 @@ cdef class Span:
|
|||
|
||||
@property
|
||||
def _(self):
|
||||
"""User space for adding custom attribute extensions."""
|
||||
"""Custom extension attributes registered via `set_extension`."""
|
||||
return Underscore(Underscore.span_extensions, self,
|
||||
start=self.start_char, end=self.end_char)
|
||||
|
||||
def as_doc(self):
|
||||
# TODO: fix
|
||||
"""Create a `Doc` object with a copy of the Span's data.
|
||||
"""Create a `Doc` object with a copy of the `Span`'s data.
|
||||
|
||||
RETURNS (Doc): The `Doc` copy of the span.
|
||||
|
||||
DOCS: https://spacy.io/api/span#as_doc
|
||||
"""
|
||||
cdef Doc doc = Doc(self.doc.vocab,
|
||||
words=[t.text for t in self],
|
||||
spaces=[bool(t.whitespace_) for t in self])
|
||||
# TODO: Fix!
|
||||
words = [t.text for t in self]
|
||||
spaces = [bool(t.whitespace_) for t in self]
|
||||
cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces)
|
||||
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
|
||||
if self.doc.is_tagged:
|
||||
array_head.append(TAG)
|
||||
# if doc parsed add head and dep attribute
|
||||
# If doc parsed add head and dep attribute
|
||||
if self.doc.is_parsed:
|
||||
array_head.extend([HEAD, DEP])
|
||||
# otherwise add sent_start
|
||||
# Otherwise add sent_start
|
||||
else:
|
||||
array_head.append(SENT_START)
|
||||
array = self.doc.to_array(array_head)
|
||||
doc.from_array(array_head, array[self.start : self.end])
|
||||
|
||||
doc.noun_chunks_iterator = self.doc.noun_chunks_iterator
|
||||
doc.user_hooks = self.doc.user_hooks
|
||||
doc.user_span_hooks = self.doc.user_span_hooks
|
||||
|
@ -181,7 +225,7 @@ cdef class Span:
|
|||
doc.vector_norm = self.vector_norm
|
||||
doc.tensor = self.doc.tensor[self.start : self.end]
|
||||
for key, value in self.doc.cats.items():
|
||||
if hasattr(key, '__len__') and len(key) == 3:
|
||||
if hasattr(key, "__len__") and len(key) == 3:
|
||||
cat_start, cat_end, cat_label = key
|
||||
if cat_start == self.start_char and cat_end == self.end_char:
|
||||
doc.cats[cat_label] = value
|
||||
|
@ -207,6 +251,8 @@ cdef class Span:
|
|||
|
||||
RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape
|
||||
(n, n), where n = len(self).
|
||||
|
||||
DOCS: https://spacy.io/api/span#get_lca_matrix
|
||||
"""
|
||||
return numpy.asarray(_get_lca_matrix(self.doc, self.start, self.end))
|
||||
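A small sketch of the LCA matrix on a span: entry `[i, j]` holds the span-internal index of the lowest common ancestor of tokens `i` and `j`, or -1 if that ancestor falls outside the span. The sentence is an example only.

    doc = nlp(u"I like New York in Autumn.")
    span = doc[1:5]                 # "like New York in"
    lca = span.get_lca_matrix()
    assert lca.shape == (4, 4)
    assert lca[0, 0] == 0           # each token is its own lowest common ancestor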
|
||||
|
@ -217,22 +263,24 @@ cdef class Span:
|
|||
other (object): The object to compare with. By default, accepts `Doc`,
|
||||
`Span`, `Token` and `Lexeme` objects.
|
||||
RETURNS (float): A scalar similarity score. Higher is more similar.
|
||||
|
||||
DOCS: https://spacy.io/api/span#similarity
|
||||
"""
|
||||
if 'similarity' in self.doc.user_span_hooks:
|
||||
self.doc.user_span_hooks['similarity'](self, other)
|
||||
if len(self) == 1 and hasattr(other, 'orth'):
|
||||
if "similarity" in self.doc.user_span_hooks:
|
||||
self.doc.user_span_hooks["similarity"](self, other)
|
||||
if len(self) == 1 and hasattr(other, "orth"):
|
||||
if self[0].orth == other.orth:
|
||||
return 1.0
|
||||
elif hasattr(other, '__len__') and len(self) == len(other):
|
||||
elif hasattr(other, "__len__") and len(self) == len(other):
|
||||
for i in range(len(self)):
|
||||
if self[i].orth != getattr(other[i], 'orth', None):
|
||||
if self[i].orth != getattr(other[i], "orth", None):
|
||||
break
|
||||
else:
|
||||
return 1.0
|
||||
if self.vocab.vectors.n_keys == 0:
|
||||
models_warning(Warnings.W007.format(obj='Span'))
|
||||
models_warning(Warnings.W007.format(obj="Span"))
|
||||
if self.vector_norm == 0.0 or other.vector_norm == 0.0:
|
||||
user_warning(Warnings.W008.format(obj='Span'))
|
||||
user_warning(Warnings.W008.format(obj="Span"))
|
||||
return 0.0
|
||||
vector = self.vector
|
||||
xp = get_array_module(vector)
|
||||
|
@ -251,8 +299,8 @@ cdef class Span:
|
|||
cdef int i, j
|
||||
cdef attr_id_t feature
|
||||
cdef np.ndarray[attr_t, ndim=2] output
|
||||
# Make an array from the attributes --- otherwise our inner loop is Python
|
||||
# dict iteration.
|
||||
# Make an array from the attributes - otherwise our inner loop is Python
|
||||
# dict iteration
|
||||
cdef np.ndarray[attr_t, ndim=1] attr_ids = numpy.asarray(py_attr_ids, dtype=numpy.uint64)
|
||||
cdef int length = self.end - self.start
|
||||
output = numpy.ndarray(shape=(length, len(attr_ids)), dtype=numpy.uint64)
|
||||
|
@ -282,12 +330,11 @@ cdef class Span:
|
|||
property sent:
|
||||
"""RETURNS (Span): The sentence span that the span is a part of."""
|
||||
def __get__(self):
|
||||
if 'sent' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['sent'](self)
|
||||
# This should raise if we're not parsed
|
||||
# or doesn't have any sbd component :)
|
||||
if "sent" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sent"](self)
|
||||
# This should raise if not parsed / no custom sentence boundaries
|
||||
self.doc.sents
|
||||
# if doc is parsed we can use the deps to find the sentence
|
||||
# If doc is parsed we can use the deps to find the sentence
|
||||
# otherwise we use the `sent_start` token attribute
|
||||
cdef int n = 0
|
||||
cdef int i
|
||||
|
@ -300,11 +347,11 @@ cdef class Span:
|
|||
raise RuntimeError(Errors.E038)
|
||||
return self.doc[root.l_edge:root.r_edge + 1]
|
||||
elif self.doc.is_sentenced:
|
||||
# find start of the sentence
|
||||
# Find start of the sentence
|
||||
start = self.start
|
||||
while self.doc.c[start].sent_start != 1 and start > 0:
|
||||
start += -1
|
||||
# find end of the sentence
|
||||
# Find end of the sentence
|
||||
end = self.end
|
||||
n = 0
|
||||
while end < self.doc.length and self.doc.c[end].sent_start != 1:
|
||||
|
@ -315,7 +362,13 @@ cdef class Span:
|
|||
return self.doc[start:end]
|
||||
|
||||
property ents:
|
||||
"""RETURNS (list): A list of tokens that belong to the current span."""
|
||||
"""The named entities in the span. Returns a tuple of named entity
|
||||
`Span` objects, if the entity recognizer has been applied.
|
||||
|
||||
RETURNS (tuple): Entities in the span, one `Span` per entity.
|
||||
|
||||
DOCS: https://spacy.io/api/span#ents
|
||||
"""
|
||||
def __get__(self):
|
||||
ents = []
|
||||
for ent in self.doc.ents:
|
||||
|
@ -324,11 +377,16 @@ cdef class Span:
|
|||
return ents
|
||||
|
||||
property has_vector:
|
||||
"""RETURNS (bool): Whether a word vector is associated with the object.
|
||||
"""A boolean value indicating whether a word vector is associated with
|
||||
the object.
|
||||
|
||||
RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
||||
DOCS: https://spacy.io/api/span#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'has_vector' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['has_vector'](self)
|
||||
if "has_vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["has_vector"](self)
|
||||
elif self.vocab.vectors.data.size > 0:
|
||||
return any(token.has_vector for token in self)
|
||||
elif self.doc.tensor.size > 0:
|
||||
|
@ -342,19 +400,26 @@ cdef class Span:
|
|||
|
||||
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
||||
representing the span's semantics.
|
||||
|
||||
DOCS: https://spacy.io/api/span#vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['vector'](self)
|
||||
if "vector" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
if self._vector is None:
|
||||
self._vector = sum(t.vector for t in self) / len(self)
|
||||
return self._vector
|
||||
|
||||
property vector_norm:
|
||||
"""RETURNS (float): The L2 norm of the vector representation."""
|
||||
"""The L2 norm of the span's vector representation.
|
||||
|
||||
RETURNS (float): The L2 norm of the vector representation.
|
||||
|
||||
DOCS: https://spacy.io/api/span#vector_norm
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector_norm' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['vector'](self)
|
||||
if "vector_norm" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["vector"](self)
|
||||
cdef float value
|
||||
cdef double norm = 0
|
||||
if self._vector_norm is None:
|
||||
|
@ -369,8 +434,8 @@ cdef class Span:
|
|||
negativity of the span.
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'sentiment' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['sentiment'](self)
|
||||
if "sentiment" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["sentiment"](self)
|
||||
else:
|
||||
return sum([token.sentiment for token in self]) / len(self)
|
||||
|
||||
|
@ -390,7 +455,7 @@ cdef class Span:
|
|||
whitespace).
|
||||
"""
|
||||
def __get__(self):
|
||||
return u''.join([t.text_with_ws for t in self])
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
property noun_chunks:
|
||||
"""Yields base noun-phrase `Span` objects, if the document has been
|
||||
|
@ -399,7 +464,9 @@ cdef class Span:
|
|||
NP-level coordination, no prepositional phrases, and no relative
|
||||
clauses.
|
||||
|
||||
YIELDS (Span): Base noun-phrase `Span` objects
|
||||
YIELDS (Span): Base noun-phrase `Span` objects.
|
||||
|
||||
DOCS: https://spacy.io/api/span#noun_chunks
|
||||
"""
|
||||
def __get__(self):
|
||||
if not self.doc.is_parsed:
|
||||
|
@ -418,52 +485,18 @@ cdef class Span:
|
|||
yield span
|
||||
|
||||
property root:
|
||||
"""The token within the span that's highest in the parse tree.
|
||||
If there's a tie, the earliest is prefered.
|
||||
"""The token with the shortest path to the root of the
|
||||
sentence (or the root itself). If multiple tokens are equally
|
||||
high in the tree, the first token is taken.
|
||||
|
||||
RETURNS (Token): The root token.
|
||||
|
||||
EXAMPLE: The root token has the shortest path to the root of the
|
||||
sentence (or is the root itself). If multiple words are equally
|
||||
high in the tree, the first word is taken. For example:
|
||||
|
||||
>>> toks = nlp(u'I like New York in Autumn.')
|
||||
|
||||
Let's name the indices – easier than writing `toks[4]` etc.
|
||||
|
||||
>>> i, like, new, york, in_, autumn, dot = range(len(toks))
|
||||
|
||||
The head of 'new' is 'York', and the head of "York" is "like"
|
||||
|
||||
>>> toks[new].head.text
|
||||
'York'
|
||||
>>> toks[york].head.text
|
||||
'like'
|
||||
|
||||
Create a span for "New York". Its root is "York".
|
||||
|
||||
>>> new_york = toks[new:york+1]
|
||||
>>> new_york.root.text
|
||||
'York'
|
||||
|
||||
Here's a more complicated case, raised by issue #214:
|
||||
|
||||
>>> toks = nlp(u'to, north and south carolina')
|
||||
>>> to, north, and_, south, carolina = toks
|
||||
>>> south.head.text, carolina.head.text
|
||||
('north', 'to')
|
||||
|
||||
Here "south" is a child of "north", which is a child of "carolina".
|
||||
Carolina is the root of the span:
|
||||
|
||||
>>> south_carolina = toks[-2:]
|
||||
>>> south_carolina.root.text
|
||||
'carolina'
|
||||
DOCS: https://spacy.io/api/span#root
|
||||
"""
|
||||
def __get__(self):
|
||||
self._recalculate_indices()
|
||||
if 'root' in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks['root'](self)
|
||||
if "root" in self.doc.user_span_hooks:
|
||||
return self.doc.user_span_hooks["root"](self)
|
||||
# This should probably be called 'head', and the other one called
|
||||
# 'gov'. But we went with 'head' elsewhere, and now we're stuck =/
|
||||
cdef int i
|
||||
|
@ -495,10 +528,12 @@ cdef class Span:
|
|||
return self.doc[root]
|
||||
|
||||
property lefts:
|
||||
""" Tokens that are to the left of the span, whose head is within the
|
||||
"""Tokens that are to the left of the span, whose head is within the
|
||||
`Span`.
|
||||
|
||||
YIELDS (Token):A left-child of a token of the span.
|
||||
|
||||
DOCS: https://spacy.io/api/span#lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
for token in reversed(self): # Reverse, so we get tokens in order
|
||||
|
@ -511,6 +546,8 @@ cdef class Span:
|
|||
`Span`.
|
||||
|
||||
YIELDS (Token): A right-child of a token of the span.
|
||||
|
||||
DOCS: https://spacy.io/api/span#rights
|
||||
"""
|
||||
def __get__(self):
|
||||
for token in self:
|
||||
|
@ -519,15 +556,25 @@ cdef class Span:
|
|||
yield right
|
||||
|
||||
property n_lefts:
|
||||
"""RETURNS (int): The number of leftward immediate children of the
|
||||
"""The number of tokens that are to the left of the span, whose
|
||||
heads are within the span.
|
||||
|
||||
RETURNS (int): The number of leftward immediate children of the
|
||||
span, in the syntactic dependency parse.
|
||||
|
||||
DOCS: https://spacy.io/api/span#n_lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
return len(list(self.lefts))
|
||||
|
||||
property n_rights:
|
||||
"""RETURNS (int): The number of rightward immediate children of the
|
||||
"""The number of tokens that are to the right of the span, whose
|
||||
heads are within the span.
|
||||
|
||||
RETURNS (int): The number of rightward immediate children of the
|
||||
span, in the syntactic dependency parse.
|
||||
|
||||
DOCS: https://spacy.io/api/span#n_rights
|
||||
"""
|
||||
def __get__(self):
|
||||
return len(list(self.rights))
|
||||
|
@ -536,6 +583,8 @@ cdef class Span:
|
|||
"""Tokens within the span and tokens which descend from them.
|
||||
|
||||
YIELDS (Token): A token within the span, or a descendant from it.
|
||||
|
||||
DOCS: https://spacy.io/api/span#subtree
|
||||
"""
|
||||
def __get__(self):
|
||||
for word in self.lefts:
|
||||
|
@ -550,7 +599,7 @@ cdef class Span:
|
|||
return self.root.ent_id
|
||||
|
||||
def __set__(self, hash_t key):
|
||||
raise NotImplementedError(TempErrors.T007.format(attr='ent_id'))
|
||||
raise NotImplementedError(TempErrors.T007.format(attr="ent_id"))
|
||||
|
||||
property ent_id_:
|
||||
"""RETURNS (unicode): The (string) entity ID."""
|
||||
|
@ -558,10 +607,10 @@ cdef class Span:
|
|||
return self.root.ent_id_
|
||||
|
||||
def __set__(self, hash_t key):
|
||||
raise NotImplementedError(TempErrors.T007.format(attr='ent_id_'))
|
||||
raise NotImplementedError(TempErrors.T007.format(attr="ent_id_"))
|
||||
|
||||
property orth_:
|
||||
"""Verbatim text content (identical to Span.text). Exists mostly for
|
||||
"""Verbatim text content (identical to `Span.text`). Exists mostly for
|
||||
consistency with other attributes.
|
||||
|
||||
RETURNS (unicode): The span's text."""
|
||||
|
@ -571,27 +620,28 @@ cdef class Span:
|
|||
property lemma_:
|
||||
"""RETURNS (unicode): The span's lemma."""
|
||||
def __get__(self):
|
||||
return ' '.join([t.lemma_ for t in self]).strip()
|
||||
return " ".join([t.lemma_ for t in self]).strip()
|
||||
|
||||
property upper_:
|
||||
"""Deprecated. Use Span.text.upper() instead."""
|
||||
"""Deprecated. Use `Span.text.upper()` instead."""
|
||||
def __get__(self):
|
||||
return ''.join([t.text_with_ws.upper() for t in self]).strip()
|
||||
return "".join([t.text_with_ws.upper() for t in self]).strip()
|
||||
|
||||
property lower_:
|
||||
"""Deprecated. Use Span.text.lower() instead."""
|
||||
"""Deprecated. Use `Span.text.lower()` instead."""
|
||||
def __get__(self):
|
||||
return ''.join([t.text_with_ws.lower() for t in self]).strip()
|
||||
return "".join([t.text_with_ws.lower() for t in self]).strip()
|
||||
|
||||
property string:
|
||||
"""Deprecated: Use Span.text_with_ws instead."""
|
||||
"""Deprecated: Use `Span.text_with_ws` instead."""
|
||||
def __get__(self):
|
||||
return ''.join([t.text_with_ws for t in self])
|
||||
return "".join([t.text_with_ws for t in self])
|
||||
|
||||
property label_:
|
||||
"""RETURNS (unicode): The span's label."""
|
||||
def __get__(self):
|
||||
return self.doc.vocab.strings[self.label]
|
||||
|
||||
def __set__(self, unicode label_):
|
||||
self.label = self.doc.vocab.strings.add(label_)
|
||||
|
||||
|
|
|
@ -8,42 +8,82 @@ from cpython.mem cimport PyMem_Malloc, PyMem_Free
|
|||
from cython.view cimport array as cvarray
|
||||
cimport numpy as np
|
||||
np.import_array()
|
||||
|
||||
import numpy
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from ..typedefs cimport hash_t
|
||||
from ..lexeme cimport Lexeme
|
||||
from .. import parts_of_speech
|
||||
from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
||||
from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT
|
||||
from ..attrs cimport IS_OOV, IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL
|
||||
from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX
|
||||
from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP
|
||||
from ..symbols cimport conj
|
||||
|
||||
from .. import parts_of_speech
|
||||
from .. import util
|
||||
from ..compat import is_config
|
||||
from ..errors import Errors, Warnings, user_warning, models_warning
|
||||
from .. import util
|
||||
from .underscore import Underscore, get_ext_args
|
||||
|
||||
|
||||
cdef class Token:
|
||||
"""An individual token – i.e. a word, punctuation symbol, whitespace,
|
||||
etc."""
|
||||
etc.
|
||||
|
||||
DOCS: https://spacy.io/api/token
|
||||
"""
|
||||
@classmethod
|
||||
def set_extension(cls, name, **kwargs):
|
||||
if cls.has_extension(name) and not kwargs.get('force', False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj='Token'))
|
||||
"""Define a custom attribute which becomes available as `Token._`.
|
||||
|
||||
name (unicode): Name of the attribute to set.
|
||||
default: Optional default value of the attribute.
|
||||
getter (callable): Optional getter function.
|
||||
setter (callable): Optional setter function.
|
||||
method (callable): Optional method for method extension.
|
||||
force (bool): Force overwriting existing attribute.
|
||||
|
||||
DOCS: https://spacy.io/api/token#set_extension
|
||||
USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes
|
||||
"""
|
||||
if cls.has_extension(name) and not kwargs.get("force", False):
|
||||
raise ValueError(Errors.E090.format(name=name, obj="Token"))
|
||||
Underscore.token_extensions[name] = get_ext_args(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def get_extension(cls, name):
|
||||
"""Look up a previously registered extension by name.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple.
|
||||
|
||||
DOCS: https://spacy.io/api/token#get_extension
|
||||
"""
|
||||
return Underscore.token_extensions.get(name)
|
||||
|
||||
@classmethod
|
||||
def has_extension(cls, name):
|
||||
"""Check whether an extension has been registered.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (bool): Whether the extension has been registered.
|
||||
|
||||
DOCS: https://spacy.io/api/token#has_extension
|
||||
"""
|
||||
return name in Underscore.token_extensions
|
||||
|
||||
@classmethod
|
||||
def remove_extension(cls, name):
|
||||
"""Remove a previously registered extension.
|
||||
|
||||
name (unicode): Name of the extension.
|
||||
RETURNS (tuple): A `(default, method, getter, setter)` tuple of the
|
||||
removed extension.
|
||||
|
||||
DOCS: https://spacy.io/api/token#remove_extension
|
||||
"""
|
||||
if not cls.has_extension(name):
|
||||
raise ValueError(Errors.E046.format(name=name))
|
||||
return Underscore.token_extensions.pop(name)
|
||||
|
@ -54,6 +94,8 @@ cdef class Token:
|
|||
vocab (Vocab): A storage container for lexical types.
|
||||
doc (Doc): The parent document.
|
||||
offset (int): The index of the token within the document.
|
||||
|
||||
DOCS: https://spacy.io/api/token#init
|
||||
"""
|
||||
self.vocab = vocab
|
||||
self.doc = doc
|
||||
|
@ -67,6 +109,8 @@ cdef class Token:
|
|||
"""The number of unicode characters in the token, i.e. `token.text`.
|
||||
|
||||
RETURNS (int): The number of unicode characters in the token.
|
||||
|
||||
DOCS: https://spacy.io/api/token#len
|
||||
"""
|
||||
return self.c.lex.length
|
||||
|
||||
|
@ -121,6 +165,7 @@ cdef class Token:
|
|||
|
||||
@property
|
||||
def _(self):
|
||||
"""Custom extension attributes registered via `set_extension`."""
|
||||
return Underscore(Underscore.token_extensions, self,
|
||||
start=self.idx, end=None)
|
||||
|
||||
|
@ -130,12 +175,7 @@ cdef class Token:
|
|||
flag_id (int): The ID of the flag attribute.
|
||||
RETURNS (bool): Whether the flag is set.
|
||||
|
||||
EXAMPLE:
|
||||
>>> from spacy.attrs import IS_TITLE
|
||||
>>> doc = nlp(u'Give it back! He pleaded.')
|
||||
>>> token = doc[0]
|
||||
>>> token.check_flag(IS_TITLE)
|
||||
True
|
||||
DOCS: https://spacy.io/api/token#check_flag
|
||||
"""
|
||||
return Lexeme.c_check_flag(self.c.lex, flag_id)
|
||||
|
||||
|
@ -144,6 +184,8 @@ cdef class Token:
|
|||
|
||||
i (int): The relative position of the token to get. Defaults to 1.
|
||||
RETURNS (Token): The token at position `self.doc[self.i+i]`.
|
||||
|
||||
DOCS: https://spacy.io/api/token#nbor
|
||||
"""
|
||||
if self.i+i < 0 or (self.i+i >= len(self.doc)):
|
||||
raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc)))
|
||||
|
@ -156,19 +198,21 @@ cdef class Token:
|
|||
other (object): The object to compare with. By default, accepts `Doc`,
|
||||
`Span`, `Token` and `Lexeme` objects.
|
||||
RETURNS (float): A scalar similarity score. Higher is more similar.
|
||||
|
||||
DOCS: https://spacy.io/api/token#similarity
|
||||
"""
|
||||
if 'similarity' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['similarity'](self)
|
||||
if hasattr(other, '__len__') and len(other) == 1 and hasattr(other, "__getitem__"):
|
||||
if self.c.lex.orth == getattr(other[0], 'orth', None):
|
||||
if "similarity" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["similarity"](self)
|
||||
if hasattr(other, "__len__") and len(other) == 1 and hasattr(other, "__getitem__"):
|
||||
if self.c.lex.orth == getattr(other[0], "orth", None):
|
||||
return 1.0
|
||||
elif hasattr(other, 'orth'):
|
||||
elif hasattr(other, "orth"):
|
||||
if self.c.lex.orth == other.orth:
|
||||
return 1.0
|
||||
if self.vocab.vectors.n_keys == 0:
|
||||
models_warning(Warnings.W007.format(obj='Token'))
|
||||
models_warning(Warnings.W007.format(obj="Token"))
|
||||
if self.vector_norm == 0 or other.vector_norm == 0:
|
||||
user_warning(Warnings.W008.format(obj='Token'))
|
||||
user_warning(Warnings.W008.format(obj="Token"))
|
||||
return 0.0
|
||||
vector = self.vector
|
||||
xp = get_array_module(vector)
|
||||
|
@ -202,7 +246,7 @@ cdef class Token:
|
|||
def __get__(self):
|
||||
cdef unicode orth = self.vocab.strings[self.c.lex.orth]
|
||||
if self.c.spacy:
|
||||
return orth + u' '
|
||||
return orth + " "
|
||||
else:
|
||||
return orth
|
||||
|
||||
|
@ -215,8 +259,8 @@ cdef class Token:
|
|||
"""RETURNS (float): A scalar value indicating the positivity or
|
||||
negativity of the token."""
|
||||
def __get__(self):
|
||||
if 'sentiment' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['sentiment'](self)
|
||||
if "sentiment" in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks["sentiment"](self)
|
||||
return self.c.lex.sentiment
|
||||
|
||||
property lang:
|
||||
|
@ -298,6 +342,7 @@ cdef class Token:
|
|||
"""RETURNS (uint64): ID of coarse-grained part-of-speech tag."""
|
||||
def __get__(self):
|
||||
return self.c.pos
|
||||
|
||||
def __set__(self, pos):
|
||||
self.c.pos = pos
|
||||
|
||||
|
@ -322,10 +367,12 @@ cdef class Token:
|
|||
the object.
|
||||
|
||||
RETURNS (bool): Whether a word vector is associated with the object.
|
||||
|
||||
DOCS: https://spacy.io/api/token#has_vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'has_vector' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['has_vector'](self)
|
||||
return self.doc.user_token_hooks["has_vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return True
|
||||
return self.vocab.has_vector(self.c.lex.orth)
|
||||
|
@ -335,10 +382,12 @@ cdef class Token:
|
|||
|
||||
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
|
||||
representing the token's semantics.
|
||||
|
||||
DOCS: https://spacy.io/api/token#vector
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['vector'](self)
|
||||
return self.doc.user_token_hooks["vector"](self)
|
||||
if self.vocab.vectors.size == 0 and self.doc.tensor.size != 0:
|
||||
return self.doc.tensor[self.i]
|
||||
else:
|
||||
|
@ -348,23 +397,35 @@ cdef class Token:
|
|||
"""The L2 norm of the token's vector representation.
|
||||
|
||||
RETURNS (float): The L2 norm of the vector representation.
|
||||
|
||||
DOCS: https://spacy.io/api/token#vector_norm
|
||||
"""
|
||||
def __get__(self):
|
||||
if 'vector_norm' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['vector_norm'](self)
|
||||
return self.doc.user_token_hooks["vector_norm"](self)
|
||||
vector = self.vector
|
||||
return numpy.sqrt((vector ** 2).sum())
|
||||
|
||||
property n_lefts:
|
||||
"""RETURNS (int): The number of leftward immediate children of the
|
||||
"""The number of leftward immediate children of the word, in the
|
||||
syntactic dependency parse.
|
||||
|
||||
RETURNS (int): The number of leftward immediate children of the
|
||||
word, in the syntactic dependency parse.
|
||||
|
||||
DOCS: https://spacy.io/api/token#n_lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.l_kids
|
||||
|
||||
property n_rights:
|
||||
"""RETURNS (int): The number of rightward immediate children of the
|
||||
"""The number of rightward immediate children of the word, in the
|
||||
syntactic dependency parse.
|
||||
|
||||
RETURNS (int): The number of rightward immediate children of the
|
||||
word, in the syntactic dependency parse.
|
||||
|
||||
DOCS: https://spacy.io/api/token#n_rights
|
||||
"""
|
||||
def __get__(self):
|
||||
return self.c.r_kids
|
||||
|
@ -373,7 +434,7 @@ cdef class Token:
|
|||
"""RETURNS (Span): The sentence span that the token is a part of."""
|
||||
def __get__(self):
|
||||
if 'sent' in self.doc.user_token_hooks:
|
||||
return self.doc.user_token_hooks['sent'](self)
|
||||
return self.doc.user_token_hooks["sent"](self)
|
||||
return self.doc[self.i : self.i+1].sent
|
||||
|
||||
property sent_start:
|
||||
|
@ -390,8 +451,13 @@ cdef class Token:
|
|||
self.is_sent_start = value
|
||||
|
||||
property is_sent_start:
|
||||
"""RETURNS (bool / None): Whether the token starts a sentence.
|
||||
"""A boolean value indicating whether the token starts a sentence.
|
||||
`None` if unknown. Defaults to `True` for the first token in the `Doc`.
|
||||
|
||||
RETURNS (bool / None): Whether the token starts a sentence.
|
||||
None if unknown.
|
||||
|
||||
DOCS: https://spacy.io/api/token#is_sent_start
|
||||
"""
|
||||
def __get__(self):
|
||||
if self.c.sent_start == 0:
|
||||
|
@ -418,6 +484,8 @@ cdef class Token:
|
|||
dependency parse.
|
||||
|
||||
YIELDS (Token): A left-child of the token.
|
||||
|
||||
DOCS: https://spacy.io/api/token#lefts
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef int nr_iter = 0
|
||||
|
@ -429,13 +497,15 @@ cdef class Token:
|
|||
nr_iter += 1
|
||||
# This is ugly, but it's a way to guard against infinite loops
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr='token.lefts'))
|
||||
raise RuntimeError(Errors.E045.format(attr="token.lefts"))
|
||||
|
||||
property rights:
|
||||
"""The rightward immediate children of the word, in the syntactic
|
||||
dependency parse.
|
||||
|
||||
YIELDS (Token): A right-child of the token.
|
||||
|
||||
DOCS: https://spacy.io/api/token#rights
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
|
||||
|
@ -447,7 +517,7 @@ cdef class Token:
|
|||
ptr -= 1
|
||||
nr_iter += 1
|
||||
if nr_iter >= 10000000:
|
||||
raise RuntimeError(Errors.E045.format(attr='token.rights'))
|
||||
raise RuntimeError(Errors.E045.format(attr="token.rights"))
|
||||
tokens.reverse()
|
||||
for t in tokens:
|
||||
yield t
|
||||
|
@ -455,7 +525,9 @@ cdef class Token:
|
|||
property children:
|
||||
"""A sequence of the token's immediate syntactic children.
|
||||
|
||||
YIELDS (Token): A child token such that child.head==self
|
||||
YIELDS (Token): A child token such that `child.head==self`.
|
||||
|
||||
DOCS: https://spacy.io/api/token#children
|
||||
"""
|
||||
def __get__(self):
|
||||
yield from self.lefts
|
||||
|
@ -467,6 +539,8 @@ cdef class Token:
|
|||
|
||||
YIELDS (Token): A descendent token such that
|
||||
`self.is_ancestor(descendent) or token == self`.
|
||||
|
||||
DOCS: https://spacy.io/api/token#subtree
|
||||
"""
|
||||
def __get__(self):
|
||||
for word in self.lefts:
|
||||
|
@ -496,11 +570,13 @@ cdef class Token:
|
|||
|
||||
YIELDS (Token): A sequence of ancestor tokens such that
|
||||
`ancestor.is_ancestor(self)`.
|
||||
|
||||
DOCS: https://spacy.io/api/token#ancestors
|
||||
"""
|
||||
def __get__(self):
|
||||
cdef const TokenC* head_ptr = self.c
|
||||
# guard against infinite loop, no token can have
|
||||
# more ancestors than tokens in the tree
|
||||
# Guard against infinite loop, no token can have
|
||||
# more ancestors than tokens in the tree.
|
||||
cdef int i = 0
|
||||
while head_ptr.head != 0 and i < self.doc.length:
|
||||
head_ptr += head_ptr.head
|
||||
|
@ -513,6 +589,8 @@ cdef class Token:
|
|||
|
||||
descendant (Token): Another token.
|
||||
RETURNS (bool): Whether this token is the ancestor of the descendant.
|
||||
|
||||
DOCS: https://spacy.io/api/token#is_ancestor
|
||||
"""
|
||||
if self.doc is not descendant.doc:
|
||||
return False
|
||||
|
@ -528,34 +606,28 @@ cdef class Token:
|
|||
return self.doc[self.i + self.c.head]
|
||||
|
||||
def __set__(self, Token new_head):
|
||||
# this function sets the head of self to new_head
|
||||
# and updates the counters for left/right dependents
|
||||
# and left/right corner for the new and the old head
|
||||
|
||||
# do nothing if old head is new head
|
||||
# This function sets the head of self to new_head and updates the
|
||||
# counters for left/right dependents and left/right corner for the
|
||||
# new and the old head
|
||||
# Do nothing if old head is new head
|
||||
if self.i + self.c.head == new_head.i:
|
||||
return
|
||||
|
||||
cdef Token old_head = self.head
|
||||
cdef int rel_newhead_i = new_head.i - self.i
|
||||
|
||||
# is the new head a descendant of the old head
|
||||
# Is the new head a descendant of the old head
|
||||
cdef bint is_desc = old_head.is_ancestor(new_head)
|
||||
|
||||
cdef int new_edge
|
||||
cdef Token anc, child
|
||||
|
||||
# update number of deps of old head
|
||||
# Update number of deps of old head
|
||||
if self.c.head > 0: # left dependent
|
||||
old_head.c.l_kids -= 1
|
||||
if self.c.l_edge == old_head.c.l_edge:
|
||||
# the token dominates the left edge so the left edge of
|
||||
# the head may change when the token is reattached, it may
|
||||
# The token dominates the left edge so the left edge of
|
||||
# the head may change when the token is reattached, it may
|
||||
# not change if the new head is a descendant of the current
|
||||
# head
|
||||
|
||||
# head.
|
||||
new_edge = self.c.l_edge
|
||||
# the new l_edge is the left-most l_edge on any of the
|
||||
# The new l_edge is the left-most l_edge on any of the
|
||||
# other dependents where the l_edge is left of the head,
|
||||
# otherwise it is the head
|
||||
if not is_desc:
|
||||
|
@ -566,21 +638,18 @@ cdef class Token:
|
|||
if child.c.l_edge < new_edge:
|
||||
new_edge = child.c.l_edge
|
||||
old_head.c.l_edge = new_edge
|
||||
|
||||
# walk up the tree from old_head and assign new l_edge to
|
||||
# Walk up the tree from old_head and assign new l_edge to
|
||||
# ancestors until an ancestor already has an l_edge that's
|
||||
# further left
|
||||
for anc in old_head.ancestors:
|
||||
if anc.c.l_edge <= new_edge:
|
||||
break
|
||||
anc.c.l_edge = new_edge
|
||||
|
||||
elif self.c.head < 0: # right dependent
|
||||
old_head.c.r_kids -= 1
|
||||
# do the same thing as for l_edge
|
||||
# Do the same thing as for l_edge
|
||||
if self.c.r_edge == old_head.c.r_edge:
|
||||
new_edge = self.c.r_edge
|
||||
|
||||
if not is_desc:
|
||||
new_edge = old_head.i
|
||||
for child in old_head.children:
|
||||
|
@ -589,16 +658,14 @@ cdef class Token:
|
|||
if child.c.r_edge > new_edge:
|
||||
new_edge = child.c.r_edge
|
||||
old_head.c.r_edge = new_edge
|
||||
|
||||
for anc in old_head.ancestors:
|
||||
if anc.c.r_edge >= new_edge:
|
||||
break
|
||||
anc.c.r_edge = new_edge
|
||||
|
||||
# update number of deps of new head
|
||||
# Update number of deps of new head
|
||||
if rel_newhead_i > 0: # left dependent
|
||||
new_head.c.l_kids += 1
|
||||
# walk up the tree from new head and set l_edge to self.l_edge
|
||||
# Walk up the tree from new head and set l_edge to self.l_edge
|
||||
# until you hit a token with an l_edge further to the left
|
||||
if self.c.l_edge < new_head.c.l_edge:
|
||||
new_head.c.l_edge = self.c.l_edge
|
||||
|
@ -606,34 +673,33 @@ cdef class Token:
|
|||
if anc.c.l_edge <= self.c.l_edge:
|
||||
break
|
||||
anc.c.l_edge = self.c.l_edge
|
||||
|
||||
elif rel_newhead_i < 0: # right dependent
|
||||
new_head.c.r_kids += 1
|
||||
# do the same as for l_edge
|
||||
# Do the same as for l_edge
|
||||
if self.c.r_edge > new_head.c.r_edge:
|
||||
new_head.c.r_edge = self.c.r_edge
|
||||
for anc in new_head.ancestors:
|
||||
if anc.c.r_edge >= self.c.r_edge:
|
||||
break
|
||||
anc.c.r_edge = self.c.r_edge
|
||||
|
||||
# set new head
|
||||
# Set new head
|
||||
self.c.head = rel_newhead_i
|
||||
|
||||
property conjuncts:
|
||||
"""A sequence of coordinated tokens, including the token itself.
|
||||
|
||||
YIELDS (Token): A coordinated token.
|
||||
|
||||
DOCS: https://spacy.io/api/token#conjuncts
|
||||
"""
|
||||
def __get__(self):
|
||||
"""Get a list of conjoined words."""
|
||||
cdef Token word
|
||||
if 'conjuncts' in self.doc.user_token_hooks:
|
||||
yield from self.doc.user_token_hooks['conjuncts'](self)
|
||||
if "conjuncts" in self.doc.user_token_hooks:
|
||||
yield from self.doc.user_token_hooks["conjuncts"](self)
|
||||
else:
|
||||
if self.dep_ != 'conj':
|
||||
if self.dep != conj:
|
||||
for word in self.rights:
|
||||
if word.dep_ == 'conj':
|
||||
if word.dep == conj:
|
||||
yield word
|
||||
yield from word.conjuncts
|
||||
|
||||
|
@ -670,7 +736,7 @@ cdef class Token:
|
|||
RETURNS (unicode): IOB code of named entity tag.
|
||||
"""
|
||||
def __get__(self):
|
||||
iob_strings = ('', 'I', 'O', 'B')
|
||||
iob_strings = ("", "I", "O", "B")
|
||||
return iob_strings[self.c.ent_iob]
|
||||
|
||||
property ent_id:
|
||||
|
@ -697,7 +763,7 @@ cdef class Token:
|
|||
"""RETURNS (unicode): The trailing whitespace character, if present.
|
||||
"""
|
||||
def __get__(self):
|
||||
return ' ' if self.c.spacy else ''
|
||||
return " " if self.c.spacy else ""
|
||||
|
||||
property orth_:
|
||||
"""RETURNS (unicode): Verbatim text content (identical to
|
||||
|
@ -770,6 +836,7 @@ cdef class Token:
|
|||
"""RETURNS (unicode): Coarse-grained part-of-speech tag."""
|
||||
def __get__(self):
|
||||
return parts_of_speech.NAMES[self.c.pos]
|
||||
|
||||
def __set__(self, pos_name):
|
||||
self.c.pos = parts_of_speech.IDS[pos_name]
|
||||
|
||||
|
|
|
@ -25,7 +25,7 @@ except ImportError:
|
|||
from .symbols import ORTH
|
||||
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
|
||||
from .compat import import_file
|
||||
from .errors import Errors
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
|
||||
|
||||
LANGUAGES = {}
|
||||
|
@ -565,7 +565,8 @@ def itershuffle(iterable, bufsize=1000):
|
|||
def to_bytes(getters, exclude):
|
||||
serialized = OrderedDict()
|
||||
for key, getter in getters.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
serialized[key] = getter()
|
||||
return srsly.msgpack_dumps(serialized)
|
||||
|
||||
|
@ -573,7 +574,8 @@ def to_bytes(getters, exclude):
|
|||
def from_bytes(bytes_data, setters, exclude):
|
||||
msg = srsly.msgpack_loads(bytes_data)
|
||||
for key, setter in setters.items():
|
||||
if key not in exclude and key in msg:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude and key in msg:
|
||||
setter(msg[key])
|
||||
return msg
|
||||
|
||||
|
@ -583,7 +585,8 @@ def to_disk(path, writers, exclude):
|
|||
if not path.exists():
|
||||
path.mkdir()
|
||||
for key, writer in writers.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
writer(path / key)
|
||||
return path
|
||||
|
||||
|
@ -591,7 +594,8 @@ def to_disk(path, writers, exclude):
|
|||
def from_disk(path, readers, exclude):
|
||||
path = ensure_path(path)
|
||||
for key, reader in readers.items():
|
||||
if key not in exclude:
|
||||
# Split to support file names like meta.json
|
||||
if key.split(".")[0] not in exclude:
|
||||
reader(path / key)
|
||||
return path
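
The `key.split(".")[0]` check above lets an exclude entry match a serialization key even when the key carries a file extension; a minimal sketch with made-up names:

```python
exclude = ["meta"]
key = "meta.json"
# "meta.json" is skipped because its base name "meta" is excluded
assert key.split(".")[0] in exclude
```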
|
||||
|
||||
|
@ -677,6 +681,23 @@ def validate_json(data, validator):
|
|||
return errors
|
||||
|
||||
|
||||
def get_serialization_exclude(serializers, exclude, kwargs):
|
||||
"""Helper function to validate serialization args and manage transition from
|
||||
keyword arguments (pre v2.1) to exclude argument.
|
||||
"""
|
||||
exclude = list(exclude)
|
||||
# Split to support file names like meta.json
|
||||
options = [name.split(".")[0] for name in serializers]
|
||||
for key, value in kwargs.items():
|
||||
if key in ("vocab",) and value is False:
|
||||
deprecation_warning(Warnings.W015.format(arg=key))
|
||||
exclude.append(key)
|
||||
elif key.split(".")[0] in options:
|
||||
raise ValueError(Errors.E128.format(arg=key))
|
||||
# TODO: user warning?
|
||||
return exclude
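
A hedged sketch of how the helper above translates the old keyword-argument style into the new `exclude` list, assuming it is importable from `spacy.util`; the serializer names are illustrative:

```python
from collections import OrderedDict

from spacy import util

serializers = OrderedDict([("vocab", None), ("meta.json", None), ("model", None)])
# Pre-v2.1 style: vocab=False triggers a deprecation warning and becomes an exclude entry
exclude = util.get_serialization_exclude(serializers, [], {"vocab": False})
assert exclude == ["vocab"]
# A keyword that clashes with a serialization field raises E128 instead:
# util.get_serialization_exclude(serializers, [], {"meta": "something"})  # ValueError
```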
|
||||
|
||||
|
||||
class SimpleFrozenDict(dict):
|
||||
"""Simplified implementation of a frozen dict, mainly used as default
|
||||
function or method argument (for arguments that should default to empty
|
||||
|
@ -696,14 +717,14 @@ class SimpleFrozenDict(dict):
|
|||
class DummyTokenizer(object):
|
||||
# add dummy methods for to_bytes, from_bytes, to_disk and from_disk to
|
||||
# allow serialization (see #1557)
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
return b""
|
||||
|
||||
def from_bytes(self, _bytes_data, **exclude):
|
||||
def from_bytes(self, _bytes_data, **kwargs):
|
||||
return self
|
||||
|
||||
def to_disk(self, _path, **exclude):
|
||||
def to_disk(self, _path, **kwargs):
|
||||
return None
|
||||
|
||||
def from_disk(self, _path, **exclude):
|
||||
def from_disk(self, _path, **kwargs):
|
||||
return self
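
A hedged sketch of the intended use: a custom tokenizer can inherit these no-op hooks so that `nlp.to_disk()` and `nlp.from_disk()` don't choke on it. The class below is illustrative, not part of the diff.

```python
from spacy.tokens import Doc
from spacy.util import DummyTokenizer


class WhitespaceTokenizer(DummyTokenizer):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Naive whitespace tokenization; serialization is handled by the base class
        return Doc(self.vocab, words=text.split(" "))
```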
|
||||
|
|
|
@ -1,30 +1,31 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
cimport numpy as np
|
||||
from cython.operator cimport dereference as deref
|
||||
from libcpp.set cimport set as cppset
|
||||
|
||||
import functools
|
||||
import numpy
|
||||
from collections import OrderedDict
|
||||
import srsly
|
||||
|
||||
cimport numpy as np
|
||||
from thinc.neural.util import get_array_module
|
||||
from thinc.neural._classes.model import Model
|
||||
|
||||
from .strings cimport StringStore
|
||||
|
||||
from .strings import get_string_id
|
||||
from .compat import basestring_, path2str
|
||||
from .errors import Errors
|
||||
from . import util
|
||||
|
||||
from cython.operator cimport dereference as deref
|
||||
from libcpp.set cimport set as cppset
|
||||
|
||||
def unpickle_vectors(bytes_data):
|
||||
return Vectors().from_bytes(bytes_data)
|
||||
|
||||
|
||||
class GlobalRegistry(object):
|
||||
'''Global store of vectors, to avoid repeatedly loading the data.'''
|
||||
"""Global store of vectors, to avoid repeatedly loading the data."""
|
||||
data = {}
|
||||
|
||||
@classmethod
|
||||
|
@ -46,8 +47,10 @@ cdef class Vectors:
|
|||
rows in the vectors.data table.
|
||||
|
||||
Multiple keys can be mapped to the same vector, and not all of the rows in
|
||||
the table need to be assigned --- so len(list(vectors.keys())) may be
|
||||
the table need to be assigned - so len(list(vectors.keys())) may be
|
||||
greater or smaller than vectors.shape[0].
|
||||
|
||||
DOCS: https://spacy.io/api/vectors
|
||||
"""
|
||||
cdef public object name
|
||||
cdef public object data
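
A small sketch of the point made in the docstring: several keys can share one row, so `n_keys` and `shape[0]` need not match. The values below are made up.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((2, 5), dtype="f"))
vectors.add("cat", row=0)
vectors.add("feline", row=0)  # alias mapped to the same row as "cat"
vectors.add("dog", row=1)
assert vectors.n_keys == 3 and vectors.shape[0] == 2
```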
|
||||
|
@ -62,12 +65,14 @@ cdef class Vectors:
|
|||
keys (iterable): A sequence of keys, aligned with the data.
|
||||
name (string): A name to identify the vectors table.
|
||||
RETURNS (Vectors): The newly created object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#init
|
||||
"""
|
||||
self.name = name
|
||||
if data is None:
|
||||
if shape is None:
|
||||
shape = (0,0)
|
||||
data = numpy.zeros(shape, dtype='f')
|
||||
data = numpy.zeros(shape, dtype="f")
|
||||
self.data = data
|
||||
self.key2row = OrderedDict()
|
||||
if self.data is not None:
|
||||
|
@ -84,23 +89,40 @@ cdef class Vectors:
|
|||
in the vector table.
|
||||
|
||||
RETURNS (tuple): A `(rows, dims)` pair.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#shape
|
||||
"""
|
||||
return self.data.shape
|
||||
|
||||
@property
|
||||
def size(self):
|
||||
"""RETURNS (int): rows*dims"""
|
||||
"""The vector size i,e. rows * dims.
|
||||
|
||||
RETURNS (int): The vector size.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#size
|
||||
"""
|
||||
return self.data.shape[0] * self.data.shape[1]
|
||||
|
||||
@property
|
||||
def is_full(self):
|
||||
"""RETURNS (bool): `True` if no slots are available for new keys."""
|
||||
"""Whether the vectors table is full.
|
||||
|
||||
RETURNS (bool): `True` if no slots are available for new keys.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#is_full
|
||||
"""
|
||||
return self._unset.size() == 0
|
||||
|
||||
@property
|
||||
def n_keys(self):
|
||||
"""RETURNS (int) The number of keys in the table. Note that this is the
|
||||
number of all keys, not just unique vectors."""
|
||||
"""Get the number of keys in the table. Note that this is the number
|
||||
of all keys, not just unique vectors.
|
||||
|
||||
RETURNS (int): The number of keys in the table.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#n_keys
|
||||
"""
|
||||
return len(self.key2row)
|
||||
|
||||
def __reduce__(self):
|
||||
|
@ -111,6 +133,8 @@ cdef class Vectors:
|
|||
|
||||
key (int): The key to get the vector for.
|
||||
RETURNS (ndarray): The vector for the key.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#getitem
|
||||
"""
|
||||
i = self.key2row[key]
|
||||
if i is None:
|
||||
|
@ -123,6 +147,8 @@ cdef class Vectors:
|
|||
|
||||
key (int): The key to set the vector for.
|
||||
vector (ndarray): The vector to set.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#setitem
|
||||
"""
|
||||
i = self.key2row[key]
|
||||
self.data[i] = vector
|
||||
|
@ -133,6 +159,8 @@ cdef class Vectors:
|
|||
"""Iterate over the keys in the table.
|
||||
|
||||
YIELDS (int): A key in the table.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#iter
|
||||
"""
|
||||
yield from self.key2row
|
||||
|
||||
|
@ -140,6 +168,8 @@ cdef class Vectors:
|
|||
"""Return the number of vectors in the table.
|
||||
|
||||
RETURNS (int): The number of vectors in the data.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#len
|
||||
"""
|
||||
return self.data.shape[0]
|
||||
|
||||
|
@ -148,6 +178,8 @@ cdef class Vectors:
|
|||
|
||||
key (int): The key to check.
|
||||
RETURNS (bool): Whether the key has a vector entry.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#contains
|
||||
"""
|
||||
return key in self.key2row
|
||||
|
||||
|
@ -159,6 +191,12 @@ cdef class Vectors:
|
|||
If the number of vectors is reduced, keys mapped to rows that have been
|
||||
deleted are removed. These removed items are returned as a list of
|
||||
`(key, row)` tuples.
|
||||
|
||||
shape (tuple): A `(rows, dims)` tuple.
|
||||
inplace (bool): Reallocate the memory.
|
||||
RETURNS (list): The removed items as a list of `(key, row)` tuples.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#resize
|
||||
"""
|
||||
if inplace:
|
||||
self.data.resize(shape, refcheck=False)
|
||||
|
@ -175,10 +213,7 @@ cdef class Vectors:
|
|||
return removed_items
|
||||
|
||||
def keys(self):
|
||||
"""A sequence of the keys in the table.
|
||||
|
||||
RETURNS (iterable): The keys.
|
||||
"""
|
||||
"""RETURNS (iterable): A sequence of keys in the table."""
|
||||
return self.key2row.keys()
|
||||
|
||||
def values(self):
|
||||
|
@ -188,6 +223,8 @@ cdef class Vectors:
|
|||
returned may be less than the length of the vectors table.
|
||||
|
||||
YIELDS (ndarray): A vector in the table.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#values
|
||||
"""
|
||||
for row, vector in enumerate(range(self.data.shape[0])):
|
||||
if not self._unset.count(row):
|
||||
|
@ -197,6 +234,8 @@ cdef class Vectors:
|
|||
"""Iterate over `(key, vector)` pairs.
|
||||
|
||||
YIELDS (tuple): A key/vector pair.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#items
|
||||
"""
|
||||
for key, row in self.key2row.items():
|
||||
yield key, self.data[row]
|
||||
|
@ -215,7 +254,7 @@ cdef class Vectors:
|
|||
RETURNS: The requested key, keys, row or rows.
|
||||
"""
|
||||
if sum(arg is None for arg in (key, keys, row, rows)) != 3:
|
||||
bad_kwargs = {'key': key, 'keys': keys, 'row': row, 'rows': rows}
|
||||
bad_kwargs = {"key": key, "keys": keys, "row": row, "rows": rows}
|
||||
raise ValueError(Errors.E059.format(kwargs=bad_kwargs))
|
||||
xp = get_array_module(self.data)
|
||||
if key is not None:
|
||||
|
@ -224,7 +263,7 @@ cdef class Vectors:
|
|||
elif keys is not None:
|
||||
keys = [get_string_id(key) for key in keys]
|
||||
rows = [self.key2row.get(key, -1.) for key in keys]
|
||||
return xp.asarray(rows, dtype='i')
|
||||
return xp.asarray(rows, dtype="i")
|
||||
else:
|
||||
targets = set()
|
||||
if row is not None:
|
||||
|
@ -236,7 +275,7 @@ cdef class Vectors:
|
|||
if row in targets:
|
||||
results.append(key)
|
||||
targets.remove(row)
|
||||
return xp.asarray(results, dtype='uint64')
|
||||
return xp.asarray(results, dtype="uint64")
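
A hedged usage sketch of the `find()` lookups above; keys and values are illustrative:

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((2, 5), dtype="f"))
vectors.add("cat", row=0)
vectors.add("dog", row=1)
row = vectors.find(key="cat")             # row number for a single key
rows = vectors.find(keys=["cat", "dog"])  # int array of rows for several keys
keys = vectors.find(row=1)                # uint64 array of keys mapped to row 1
```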
|
||||
|
||||
def add(self, key, *, vector=None, row=None):
|
||||
"""Add a key to the table. Keys can be mapped to an existing vector
|
||||
|
@ -246,6 +285,8 @@ cdef class Vectors:
|
|||
vector (ndarray / None): A vector to add for the key.
|
||||
row (int / None): The row number of a vector to map the key to.
|
||||
RETURNS (int): The row the vector was added to.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#add
|
||||
"""
|
||||
key = get_string_id(key)
|
||||
if row is None and key in self.key2row:
|
||||
|
@ -292,11 +333,10 @@ cdef class Vectors:
|
|||
sims = xp.dot(batch, vectors.T)
|
||||
best_rows[i:i+batch_size] = sims.argmax(axis=1)
|
||||
scores[i:i+batch_size] = sims.max(axis=1)
|
||||
|
||||
xp = get_array_module(self.data)
|
||||
row2key = {row: key for key, row in self.key2row.items()}
|
||||
keys = xp.asarray(
|
||||
[row2key[row] for row in best_rows if row in row2key], dtype='uint64')
|
||||
[row2key[row] for row in best_rows if row in row2key], dtype="uint64")
|
||||
return (keys, best_rows, scores)
|
||||
|
||||
def from_glove(self, path):
|
||||
|
@ -308,58 +348,62 @@ cdef class Vectors:
|
|||
|
||||
path (unicode / Path): The path to load the GloVe vectors from.
|
||||
RETURNS: A `StringStore` object, holding the key-to-string mapping.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_glove
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
width = None
|
||||
for name in path.iterdir():
|
||||
if name.parts[-1].startswith('vectors'):
|
||||
if name.parts[-1].startswith("vectors"):
|
||||
_, dims, dtype, _2 = name.parts[-1].split('.')
|
||||
width = int(dims)
|
||||
break
|
||||
else:
|
||||
raise IOError(Errors.E061.format(filename=path))
|
||||
bin_loc = path / 'vectors.{dims}.{dtype}.bin'.format(dims=dims,
|
||||
dtype=dtype)
|
||||
bin_loc = path / "vectors.{dims}.{dtype}.bin".format(dims=dims, dtype=dtype)
|
||||
xp = get_array_module(self.data)
|
||||
self.data = None
|
||||
with bin_loc.open('rb') as file_:
|
||||
with bin_loc.open("rb") as file_:
|
||||
self.data = xp.fromfile(file_, dtype=dtype)
|
||||
if dtype != 'float32':
|
||||
self.data = xp.ascontiguousarray(self.data, dtype='float32')
|
||||
if dtype != "float32":
|
||||
self.data = xp.ascontiguousarray(self.data, dtype="float32")
|
||||
if self.data.ndim == 1:
|
||||
self.data = self.data.reshape((self.data.size//width, width))
|
||||
n = 0
|
||||
strings = StringStore()
|
||||
with (path / 'vocab.txt').open('r') as file_:
|
||||
with (path / "vocab.txt").open("r") as file_:
|
||||
for i, line in enumerate(file_):
|
||||
key = strings.add(line.strip())
|
||||
self.add(key, row=i)
|
||||
return strings
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode / Path): A path to a directory, which will be created if
|
||||
it doesn't exists. Either a string or a Path-like object.
|
||||
it doesn't exists.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#to_disk
|
||||
"""
|
||||
xp = get_array_module(self.data)
|
||||
if xp is numpy:
|
||||
save_array = lambda arr, file_: xp.save(file_, arr,
|
||||
allow_pickle=False)
|
||||
save_array = lambda arr, file_: xp.save(file_, arr, allow_pickle=False)
|
||||
else:
|
||||
save_array = lambda arr, file_: xp.save(file_, arr)
|
||||
serializers = OrderedDict((
|
||||
('vectors', lambda p: save_array(self.data, p.open('wb'))),
|
||||
('key2row', lambda p: srsly.write_msgpack(p, self.key2row))
|
||||
("vectors", lambda p: save_array(self.data, p.open("wb"))),
|
||||
("key2row", lambda p: srsly.write_msgpack(p, self.key2row))
|
||||
))
|
||||
return util.to_disk(path, serializers, exclude)
|
||||
return util.to_disk(path, serializers, [])
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode / Path): Directory path, string or Path-like object.
|
||||
RETURNS (Vectors): The modified object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_disk
|
||||
"""
|
||||
def load_key2row(path):
|
||||
if path.exists():
|
||||
|
@ -380,46 +424,51 @@ cdef class Vectors:
|
|||
self.data = xp.load(str(path))
|
||||
|
||||
serializers = OrderedDict((
|
||||
('key2row', load_key2row),
|
||||
('keys', load_keys),
|
||||
('vectors', load_vectors),
|
||||
("key2row", load_key2row),
|
||||
("keys", load_keys),
|
||||
("vectors", load_vectors),
|
||||
))
|
||||
util.from_disk(path, serializers, exclude)
|
||||
util.from_disk(path, serializers, [])
|
||||
return self
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Vectors` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#to_bytes
|
||||
"""
|
||||
def serialize_weights():
|
||||
if hasattr(self.data, 'to_bytes'):
|
||||
if hasattr(self.data, "to_bytes"):
|
||||
return self.data.to_bytes()
|
||||
else:
|
||||
return srsly.msgpack_dumps(self.data)
|
||||
serializers = OrderedDict((
|
||||
('key2row', lambda: srsly.msgpack_dumps(self.key2row)),
|
||||
('vectors', serialize_weights)
|
||||
))
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def from_bytes(self, data, **exclude):
|
||||
serializers = OrderedDict((
|
||||
("key2row", lambda: srsly.msgpack_dumps(self.key2row)),
|
||||
("vectors", serialize_weights)
|
||||
))
|
||||
return util.to_bytes(serializers, [])
|
||||
|
||||
def from_bytes(self, data, **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vectors): The `Vectors` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vectors#from_bytes
|
||||
"""
|
||||
def deserialize_weights(b):
|
||||
if hasattr(self.data, 'from_bytes'):
|
||||
if hasattr(self.data, "from_bytes"):
|
||||
self.data.from_bytes()
|
||||
else:
|
||||
self.data = srsly.msgpack_loads(b)
|
||||
|
||||
deserializers = OrderedDict((
|
||||
('key2row', lambda b: self.key2row.update(srsly.msgpack_loads(b))),
|
||||
('vectors', deserialize_weights)
|
||||
("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))),
|
||||
("vectors", deserialize_weights)
|
||||
))
|
||||
util.from_bytes(data, deserializers, exclude)
|
||||
util.from_bytes(data, deserializers, [])
|
||||
return self
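
A minimal round-trip sketch for the byte serialization above; the data values are made up.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.ones((2, 5), dtype="f"))
vectors.add("cat", row=0)
vectors2 = Vectors().from_bytes(vectors.to_bytes())
assert vectors2.shape == (2, 5)
assert vectors2.find(key="cat") == 0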
|
||||
|
|
142 spacy/vocab.pyx
|
@ -1,12 +1,13 @@
|
|||
# coding: utf8
|
||||
# cython: profile=True
|
||||
from __future__ import unicode_literals
|
||||
from libc.string cimport memcpy
|
||||
|
||||
import numpy
|
||||
import srsly
|
||||
|
||||
from collections import OrderedDict
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from .lexeme cimport EMPTY_LEXEME
|
||||
from .lexeme cimport Lexeme
|
||||
from .typedefs cimport attr_t
|
||||
|
@ -27,6 +28,8 @@ cdef class Vocab:
|
|||
"""A look-up table that allows you to access `Lexeme` objects. The `Vocab`
|
||||
instance also provides access to the `StringStore`, and owns underlying
|
||||
C-data that is shared between `Doc` objects.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab
|
||||
"""
|
||||
def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None,
|
||||
strings=tuple(), oov_prob=-20., **deprecated_kwargs):
|
||||
|
@ -62,7 +65,7 @@ cdef class Vocab:
|
|||
langfunc = None
|
||||
if self.lex_attr_getters:
|
||||
langfunc = self.lex_attr_getters.get(LANG, None)
|
||||
return langfunc('_') if langfunc else ''
|
||||
return langfunc("_") if langfunc else ""
|
||||
|
||||
def __len__(self):
|
||||
"""The current number of lexemes stored.
|
||||
|
@ -87,11 +90,7 @@ cdef class Vocab:
|
|||
available bit will be chosen.
|
||||
RETURNS (int): The integer ID by which the flag value can be checked.
|
||||
|
||||
EXAMPLE:
|
||||
>>> my_product_getter = lambda text: text in ['spaCy', 'dislaCy']
|
||||
>>> MY_PRODUCT = nlp.vocab.add_flag(my_product_getter)
|
||||
>>> doc = nlp(u'I like spaCy')
|
||||
>>> assert doc[2].check_flag(MY_PRODUCT) == True
|
||||
DOCS: https://spacy.io/api/vocab#add_flag
|
||||
"""
|
||||
if flag_id == -1:
|
||||
for bit in range(1, 64):
|
||||
|
@ -112,7 +111,7 @@ cdef class Vocab:
|
|||
`Lexeme` if necessary using memory acquired from the given pool. If the
|
||||
pool is the lexicon's own memory, the lexeme is saved in the lexicon.
|
||||
"""
|
||||
if string == u'':
|
||||
if string == "":
|
||||
return &EMPTY_LEXEME
|
||||
cdef LexemeC* lex
|
||||
cdef hash_t key = self.strings[string]
|
||||
|
@ -176,10 +175,12 @@ cdef class Vocab:
|
|||
|
||||
string (unicode): The ID string.
|
||||
RETURNS (bool) Whether the string has an entry in the vocabulary.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#contains
|
||||
"""
|
||||
cdef hash_t int_key
|
||||
if isinstance(key, bytes):
|
||||
int_key = self.strings[key.decode('utf8')]
|
||||
int_key = self.strings[key.decode("utf8")]
|
||||
elif isinstance(key, unicode):
|
||||
int_key = self.strings[key]
|
||||
else:
|
||||
|
@ -191,6 +192,8 @@ cdef class Vocab:
|
|||
"""Iterate over the lexemes in the vocabulary.
|
||||
|
||||
YIELDS (Lexeme): An entry in the vocabulary.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#iter
|
||||
"""
|
||||
cdef attr_t key
|
||||
cdef size_t addr
|
||||
|
@ -210,8 +213,10 @@ cdef class Vocab:
|
|||
RETURNS (Lexeme): The lexeme indicated by the given ID.
|
||||
|
||||
EXAMPLE:
|
||||
>>> apple = nlp.vocab.strings['apple']
|
||||
>>> assert nlp.vocab[apple] == nlp.vocab[u'apple']
|
||||
>>> apple = nlp.vocab.strings["apple"]
|
||||
>>> assert nlp.vocab[apple] == nlp.vocab[u"apple"]
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#getitem
|
||||
"""
|
||||
cdef attr_t orth
|
||||
if isinstance(id_or_string, unicode):
|
||||
|
@ -284,6 +289,8 @@ cdef class Vocab:
|
|||
`(string, score)` tuples, where `string` is the entry the removed
|
||||
word was mapped to, and `score` the similarity score between the
|
||||
two words.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#prune_vectors
|
||||
"""
|
||||
xp = get_array_module(self.vectors.data)
|
||||
# Make prob negative so it sorts by rank ascending
|
||||
|
@ -291,16 +298,12 @@ cdef class Vocab:
|
|||
priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
|
||||
for lex in self if lex.orth in self.vectors.key2row]
|
||||
priority.sort()
|
||||
indices = xp.asarray([i for (prob, i, key) in priority], dtype='i')
|
||||
keys = xp.asarray([key for (prob, i, key) in priority], dtype='uint64')
|
||||
|
||||
indices = xp.asarray([i for (prob, i, key) in priority], dtype="i")
|
||||
keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")
|
||||
keep = xp.ascontiguousarray(self.vectors.data[indices[:nr_row]])
|
||||
toss = xp.ascontiguousarray(self.vectors.data[indices[nr_row:]])
|
||||
|
||||
self.vectors = Vectors(data=keep, keys=keys)
|
||||
|
||||
syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
|
||||
|
||||
remap = {}
|
||||
for i, key in enumerate(keys[nr_row:]):
|
||||
self.vectors.add(key, row=syn_rows[i])
|
||||
|
@ -319,21 +322,22 @@ cdef class Vocab:
|
|||
RETURNS (numpy.ndarray): A word vector. Size
|
||||
and shape determined by the `vocab.vectors` instance. Usually, a
|
||||
numpy ndarray of shape (300,) and dtype float32.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#get_vector
|
||||
"""
|
||||
if isinstance(orth, basestring_):
|
||||
orth = self.strings.add(orth)
|
||||
word = self[orth].orth_
|
||||
if orth in self.vectors.key2row:
|
||||
return self.vectors[orth]
|
||||
|
||||
# Default the ngram limits minn and maxn to the length of the word.
|
||||
if minn is None:
|
||||
minn = len(word)
|
||||
if maxn is None:
|
||||
maxn = len(word)
|
||||
vectors = numpy.zeros((self.vectors_length,), dtype='f')
|
||||
|
||||
# Fasttext's ngram computation taken from https://github.com/facebookresearch/fastText
|
||||
vectors = numpy.zeros((self.vectors_length,), dtype="f")
|
||||
# Fasttext's ngram computation taken from
|
||||
# https://github.com/facebookresearch/fastText
|
||||
ngrams_size = 0
|
||||
for i in range(len(word)):
|
||||
ngram = ""
|
||||
|
@ -356,12 +360,16 @@ cdef class Vocab:
|
|||
n = n + 1
|
||||
if ngrams_size > 0:
|
||||
vectors = vectors * (1.0/ngrams_size)
|
||||
|
||||
return vectors
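
A hedged sketch of the subword fallback above: for an out-of-vocabulary word, the vectors of its character n-grams between `minn` and `maxn` are averaged. The model name is illustrative; any pipeline with word vectors works.

```python
import spacy

nlp = spacy.load("en_core_web_md")
vec = nlp.vocab.get_vector("unseenword", minn=3, maxn=5)
assert vec.shape == (nlp.vocab.vectors_length,)
```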
|
||||
|
||||
def set_vector(self, orth, vector):
|
||||
"""Set a vector for a word in the vocabulary. Words can be referenced
|
||||
by string or int ID.
|
||||
|
||||
orth (int / unicode): The word.
|
||||
vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#set_vector
|
||||
"""
|
||||
if isinstance(orth, basestring_):
|
||||
orth = self.strings.add(orth)
|
||||
|
@ -372,55 +380,77 @@ cdef class Vocab:
|
|||
else:
|
||||
width = self.vectors.shape[1]
|
||||
self.vectors.resize((new_rows, width))
|
||||
lex = self[orth] # Adds worse to vocab
|
||||
lex = self[orth] # Adds words to vocab
|
||||
self.vectors.add(orth, vector=vector)
|
||||
self.vectors.add(orth, vector=vector)
|
||||
|
||||
def has_vector(self, orth):
|
||||
"""Check whether a word has a vector. Returns False if no vectors have
|
||||
been loaded. Words can be looked up by string or int ID."""
|
||||
been loaded. Words can be looked up by string or int ID.
|
||||
|
||||
orth (int / unicode): The word.
|
||||
RETURNS (bool): Whether the word has a vector.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#has_vector
|
||||
"""
|
||||
if isinstance(orth, basestring_):
|
||||
orth = self.strings.add(orth)
|
||||
return orth in self.vectors
|
||||
|
||||
def to_disk(self, path, **exclude):
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
path (unicode or Path): A path to a directory, which will be created if
|
||||
it doesn't exist. Paths may be either strings or Path-like objects.
|
||||
it doesn't exist.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
self.strings.to_disk(path / 'strings.json')
|
||||
with (path / 'lexemes.bin').open('wb') as file_:
|
||||
file_.write(self.lexemes_to_bytes())
|
||||
if self.vectors is not None:
|
||||
setters = ["strings", "lexemes", "vectors"]
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.to_disk(path / "strings.json")
|
||||
if "lexemes" not in exclude:
|
||||
with (path / "lexemes.bin").open("wb") as file_:
|
||||
file_.write(self.lexemes_to_bytes())
|
||||
if "vectors" not in "exclude" and self.vectors is not None:
|
||||
self.vectors.to_disk(path)
|
||||
|
||||
def from_disk(self, path, **exclude):
|
||||
def from_disk(self, path, exclude=tuple(), **kwargs):
|
||||
"""Loads state from a directory. Modifies the object in place and
|
||||
returns it.
|
||||
|
||||
path (unicode or Path): A path to a directory. Paths may be either
|
||||
strings or `Path`-like objects.
|
||||
path (unicode or Path): A path to a directory.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vocab): The modified `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_disk
|
||||
"""
|
||||
path = util.ensure_path(path)
|
||||
self.strings.from_disk(path / 'strings.json')
|
||||
with (path / 'lexemes.bin').open('rb') as file_:
|
||||
self.lexemes_from_bytes(file_.read())
|
||||
if self.vectors is not None:
|
||||
self.vectors.from_disk(path, exclude='strings.json')
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
getters = ["strings", "lexemes", "vectors"]
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
if "strings" not in exclude:
|
||||
self.strings.from_disk(path / "strings.json") # TODO: add exclude?
|
||||
if "lexemes" not in exclude:
|
||||
with (path / "lexemes.bin").open("rb") as file_:
|
||||
self.lexemes_from_bytes(file_.read())
|
||||
if "vectors" not in exclude:
|
||||
if self.vectors is not None:
|
||||
self.vectors.from_disk(path, exclude=["strings"])
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
return self
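
A minimal sketch of the new `exclude` argument on the `Vocab` serialization methods above; the path is illustrative.

```python
from spacy.vocab import Vocab

vocab = Vocab()
vocab.to_disk("/tmp/vocab", exclude=["vectors"])
vocab2 = Vocab().from_disk("/tmp/vocab", exclude=["vectors"])
```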
|
||||
|
||||
def to_bytes(self, **exclude):
|
||||
def to_bytes(self, exclude=tuple(), **kwargs):
|
||||
"""Serialize the current state to a binary string.
|
||||
|
||||
**exclude: Named attributes to prevent from being serialized.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized form of the `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#to_bytes
|
||||
"""
|
||||
def deserialize_vectors():
|
||||
if self.vectors is None:
|
||||
|
@ -429,29 +459,34 @@ cdef class Vocab:
|
|||
return self.vectors.to_bytes()
|
||||
|
||||
getters = OrderedDict((
|
||||
('strings', lambda: self.strings.to_bytes()),
|
||||
('lexemes', lambda: self.lexemes_to_bytes()),
|
||||
('vectors', deserialize_vectors)
|
||||
("strings", lambda: self.strings.to_bytes()),
|
||||
("lexemes", lambda: self.lexemes_to_bytes()),
|
||||
("vectors", deserialize_vectors)
|
||||
))
|
||||
exclude = util.get_serialization_exclude(getters, exclude, kwargs)
|
||||
return util.to_bytes(getters, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, **exclude):
|
||||
def from_bytes(self, bytes_data, exclude=tuple(), **kwargs):
|
||||
"""Load state from a binary string.
|
||||
|
||||
bytes_data (bytes): The data to load from.
|
||||
**exclude: Named attributes to prevent from being loaded.
|
||||
exclude (list): String names of serialization fields to exclude.
|
||||
RETURNS (Vocab): The `Vocab` object.
|
||||
|
||||
DOCS: https://spacy.io/api/vocab#from_bytes
|
||||
"""
|
||||
def serialize_vectors(b):
|
||||
if self.vectors is None:
|
||||
return None
|
||||
else:
|
||||
return self.vectors.from_bytes(b)
|
||||
|
||||
setters = OrderedDict((
|
||||
('strings', lambda b: self.strings.from_bytes(b)),
|
||||
('lexemes', lambda b: self.lexemes_from_bytes(b)),
|
||||
('vectors', lambda b: serialize_vectors(b))
|
||||
("strings", lambda b: self.strings.from_bytes(b)),
|
||||
("lexemes", lambda b: self.lexemes_from_bytes(b)),
|
||||
("vectors", lambda b: serialize_vectors(b))
|
||||
))
|
||||
exclude = util.get_serialization_exclude(setters, exclude, kwargs)
|
||||
util.from_bytes(bytes_data, setters, exclude)
|
||||
if self.vectors.name is not None:
|
||||
link_vectors_to_models(self)
|
||||
|
@ -467,7 +502,7 @@ cdef class Vocab:
|
|||
if addr == 0:
|
||||
continue
|
||||
size += sizeof(lex_data.data)
|
||||
byte_string = b'\0' * size
|
||||
byte_string = b"\0" * size
|
||||
byte_ptr = <unsigned char*>byte_string
|
||||
cdef int j
|
||||
cdef int i = 0
|
||||
|
@ -497,7 +532,10 @@ cdef class Vocab:
|
|||
for j in range(sizeof(lex_data.data)):
|
||||
lex_data.data[j] = bytes_ptr[i+j]
|
||||
Lexeme.c_from_bytes(lexeme, lex_data)
|
||||
|
||||
prev_entry = self._by_orth.get(lexeme.orth)
|
||||
if prev_entry != NULL:
|
||||
memcpy(prev_entry, lexeme, sizeof(LexemeC))
|
||||
continue
|
||||
ptr = self.strings._map.get(lexeme.orth)
|
||||
if ptr == NULL:
|
||||
continue
|
||||
|
|
32 travis.sh
|
@ -1,32 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
if [ "${VIA}" == "pypi" ]; then
|
||||
rm -rf *
|
||||
pip install spacy-nightly
|
||||
python -m spacy download en
|
||||
fi
|
||||
|
||||
if [[ "${VIA}" == "sdist" && "${TRAVIS_PULL_REQUEST}" == "false" ]]; then
|
||||
rm -rf *
|
||||
pip uninstall spacy
|
||||
wget https://api.explosion.ai/build/spacy/sdist/$TRAVIS_COMMIT
|
||||
mv $TRAVIS_COMMIT sdist.tgz
|
||||
pip install -U sdist.tgz
|
||||
fi
|
||||
|
||||
|
||||
if [ "${VIA}" == "compile" ]; then
|
||||
pip install -r requirements.txt
|
||||
python setup.py build_ext --inplace
|
||||
pip install -e .
|
||||
fi
|
||||
|
||||
# mkdir -p corpora/en
|
||||
# cd corpora/en
|
||||
# wget --no-check-certificate http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz
|
||||
# tar -xzf WordNet-3.0.tar.gz
|
||||
# mv WordNet-3.0 wordnet
|
||||
# cd ../../
|
||||
# mkdir models/
|
||||
# python bin/init_model.py en lang_data/ corpora/ models/en
|
||||
#fi
|
|
@ -134,28 +134,50 @@ converter can be specified on the command line, or chosen based on the file
|
|||
extension of the input file.
|
||||
|
||||
```bash
|
||||
$ python -m spacy convert [input_file] [output_dir] [--converter] [--n-sents]
|
||||
[--morphology]
|
||||
$ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
|
||||
[--n-sents] [--morphology] [--lang]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| -------------------------------------------- | ---------- | ---------------------------------------------------------- |
|
||||
| `input_file` | positional | Input file. |
|
||||
| `output_dir` | positional | Output directory for converted JSON file. |
|
||||
| `converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
||||
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
||||
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | JSON | Data in spaCy's [JSON format](/api/annotation#json-input). |
|
||||
| Argument | Type | Description |
|
||||
| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------- |
|
||||
| `input_file` | positional | Input file. |
|
||||
| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. |
|
||||
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | option | Type of file to create (see below). |
|
||||
| `--converter`, `-c` <Tag variant="new">2</Tag> | option | Name of converter to use (see below). |
|
||||
| `--n-sents`, `-n` | option | Number of sentences per document. |
|
||||
| `--morphology`, `-m` | option | Enable appending morphology to tags. |
|
||||
| `--lang`, `-l` <Tag variant="new">2.1</Tag> | option | Language code (if tokenizer required). |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | JSON | Data in spaCy's [JSON format](/api/annotation#json-input). |
|
||||
|
||||
The following file format converters are available:
|
||||
### Output file types {new="2.1"}
|
||||
|
||||
| ID | Description |
|
||||
| ----------------- | --------------------------------------------------------------- |
|
||||
| `auto` | Automatically pick converter based on file extension (default). |
|
||||
| `conllu`, `conll` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||
| `ner` | Tab-based named entity recognition format. |
|
||||
| `iob` | IOB or IOB2 named entity recognition format. |
|
||||
> #### Which format should I choose?
|
||||
>
|
||||
> If you're not sure, go with the default `jsonl`. Newline-delimited JSON means
|
||||
> that there's one JSON object per line. Unlike a regular JSON file, it can also
|
||||
> be read in line-by-line and you won't have to parse the _entire file_ first.
|
||||
> This makes it a very convenient format for larger corpora.
|
||||
|
||||
All output files generated by this command are compatible with
|
||||
[`spacy train`](/api/cli#train).
|
||||
|
||||
| ID | Description |
|
||||
| ------- | --------------------------------- |
|
||||
| `jsonl` | Newline-delimited JSON (default). |
|
||||
| `json` | Regular JSON. |
|
||||
| `msg` | Binary MessagePack format. |
|
||||
|
||||
### Converter options
|
||||
|
||||
<!-- TODO: document jsonl option – maybe update it? -->
|
||||
|
||||
| ID | Description |
|
||||
| ------------------------------ | --------------------------------------------------------------- |
|
||||
| `auto` | Automatically pick converter based on file extension (default). |
|
||||
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||
| `ner` | Tab-based named entity recognition format. |
|
||||
| `iob` | IOB or IOB2 named entity recognition format. |
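
For example, a hedged invocation combining the new options; the file names are illustrative:

```bash
$ python -m spacy convert ./train.conllu ./corpus --file-type jsonl --converter conllu --n-sents 10 --lang en
```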
|
||||
|
||||
## Train {#train}
|
||||
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: DependencyParser
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/pipes.pyx
|
||||
---
|
||||
|
||||
This class is a subclass of `Pipe` and follows the same API. The pipeline
|
||||
|
@ -211,7 +211,7 @@ Modify the pipe's model, to use the given parameter values.
|
|||
> ```python
|
||||
> parser = DependencyParser(nlp.vocab)
|
||||
> with parser.use_params():
|
||||
> parser.to_disk('/best_model')
|
||||
> parser.to_disk("/best_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -226,7 +226,7 @@ Add a new label to the pipe.
|
|||
>
|
||||
> ```python
|
||||
> parser = DependencyParser(nlp.vocab)
|
||||
> parser.add_label('MY_LABEL')
|
||||
> parser.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -241,12 +241,13 @@ Serialize the pipe to disk.
|
|||
>
|
||||
> ```python
|
||||
> parser = DependencyParser(nlp.vocab)
|
||||
> parser.to_disk('/path/to/parser')
|
||||
> parser.to_disk("/path/to/parser")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## DependencyParser.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -256,17 +257,18 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
>
|
||||
> ```python
|
||||
> parser = DependencyParser(nlp.vocab)
|
||||
> parser.from_disk('/path/to/parser')
|
||||
> parser.from_disk("/path/to/parser")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
> #### example
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> parser = DependencyParser(nlp.vocab)
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> parser.from_bytes(parser_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `DependencyParser` | The `DependencyParser` object. |
|
||||
|
||||
## DependencyParser.labels {#labels tag="property"}
|
||||
|
||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = parser.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
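
A hedged byte-level round trip using the `exclude` argument; the `nlp` object is assumed to exist.

```python
parser = DependencyParser(nlp.vocab)
parser_bytes = parser.to_bytes(exclude=["cfg"])
new_parser = DependencyParser(nlp.vocab).from_bytes(parser_bytes, exclude=["vocab"])
```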
|
||||
|
|
|
@ -127,6 +127,7 @@ details, see the documentation on
|
|||
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
|
||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||||
| `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. |
|
||||
| `force` | bool | Force overwriting existing attribute. |
|
||||
|
||||
## Doc.get_extension {#get_extension tag="classmethod" new="2"}
|
||||
|
||||
|
@ -236,7 +237,7 @@ attribute ID.
|
|||
> from spacy.attrs import ORTH
|
||||
> doc = nlp(u"apple apple orange banana")
|
||||
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
||||
> doc.to_array([attrs.ORTH])
|
||||
> doc.to_array([ORTH])
|
||||
> # array([[11880], [11880], [7561], [12800]])
|
||||
> ```
|
||||
|
||||
|
@ -263,6 +264,46 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
|
|||
| ----------- | -------------------------------------- | ----------------------------------------------- |
|
||||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
|
||||
|
||||
## Doc.to_json {#to_json tag="method" new="2.1"}
|
||||
|
||||
Convert a Doc to JSON. The format it produces will be the new format for the
|
||||
[`spacy train`](/api/cli#train) command (not implemented yet). If custom
|
||||
underscore attributes are specified, their values need to be JSON-serializable.
|
||||
They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> doc = nlp(u"Hello")
|
||||
> json_doc = doc.to_json()
|
||||
> ```
|
||||
>
|
||||
> #### Result
|
||||
>
|
||||
> ```python
|
||||
> {
|
||||
> "text": "Hello",
|
||||
> "ents": [],
|
||||
> "sents": [{"start": 0, "end": 5}],
|
||||
> "tokens": [{"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0}
|
||||
> ]
|
||||
> }
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---- | ------------------------------------------------------------------------------ |
|
||||
| `underscore` | list | Optional list of string names of custom JSON-serializable `doc._.` attributes. |
|
||||
| **RETURNS** | dict | The JSON-formatted data. |
|
||||
|
||||
<Infobox title="Deprecation note" variant="warning">
|
||||
|
||||
spaCy previously implemented a `Doc.print_tree` method that returned a similar
|
||||
JSON-formatted representation of a `Doc`. As of v2.1, this method is deprecated
|
||||
in favor of `Doc.to_json`. If you need more complex nested representations, you
|
||||
might want to write your own function to extract the data.
|
||||
|
||||
</Infobox>
|
||||
|
||||
## Doc.to_array {#to_array tag="method"}
|
||||
|
||||
Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence
|
||||
|
@ -308,11 +349,12 @@ array of attributes.
|
|||
> assert doc[0].pos_ == doc2[0].pos_
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------------------------------- | ----------------------------- |
|
||||
| `attrs` | ints | A list of attribute ID ints. |
|
||||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||||
| **RETURNS** | `Doc` | Itself. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------------------------------- | ------------------------------------------------------------------------- |
|
||||
| `attrs` | list | A list of attribute ID ints. |
|
||||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | Itself. |
|
||||
|
||||
## Doc.to_disk {#to_disk tag="method" new="2"}
|
||||
|
||||
|
@ -324,9 +366,10 @@ Save the current state to a directory.
|
|||
> doc.to_disk("/path/to/doc")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -343,6 +386,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||||
|
||||
## Doc.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -356,9 +400,10 @@ Serialize, i.e. export the document contents to a binary string.
|
|||
> doc_bytes = doc.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||||
|
||||
## Doc.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -375,10 +420,11 @@ Deserialize, i.e. import the document contents from a binary string.
|
|||
> assert doc.text == doc2.text
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------ |
|
||||
| `data` | bytes | The string to load from. |
|
||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `data` | bytes | The string to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The `Doc` object. |
|
||||
|
||||
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
||||
|
||||
|
@ -429,14 +475,16 @@ to specify how the new subtokens should be integrated into the dependency tree.
|
|||
The list of per-token heads can either be a token in the original document, e.g.
|
||||
`doc[2]`, or a tuple consisting of the token in the original document and its
|
||||
subtoken index. For example, `(doc[3], 1)` will attach the subtoken to the
|
||||
second subtoken of `doc[3]`. This mechanism allows attaching subtokens to other
|
||||
newly created subtokens, without having to keep track of the changing token
|
||||
indices. If the specified head token will be split within the retokenizer block
|
||||
and no subtoken index is specified, it will default to `0`. Attributes to set on
|
||||
subtokens can be provided as a list of values. They'll be applied to the
|
||||
resulting token (if they're context-dependent token attributes like `LEMMA` or
|
||||
`DEP`) or to the underlying lexeme (if they're context-independent lexical
|
||||
attributes like `LOWER` or `IS_STOP`).
|
||||
second subtoken of `doc[3]`.
|
||||
|
||||
This mechanism allows attaching subtokens to other newly created subtokens,
|
||||
without having to keep track of the changing token indices. If the specified
|
||||
head token will be split within the retokenizer block and no subtoken index is
|
||||
specified, it will default to `0`. Attributes to set on subtokens can be
|
||||
provided as a list of values. They'll be applied to the resulting token (if
|
||||
they're context-dependent token attributes like `LEMMA` or `DEP`) or to the
|
||||
underlying lexeme (if they're context-independent lexical attributes like
|
||||
`LOWER` or `IS_STOP`).
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -487,8 +535,8 @@ and end token boundaries, the document remains unchanged.
|
|||
|
||||
## Doc.ents {#ents tag="property" model="NER"}
|
||||
|
||||
Iterate over the entities in the document. Yields named-entity `Span` objects,
|
||||
if the entity recognizer has been applied to the document.
|
||||
The named entities in the document. Returns a tuple of named entity `Span`
|
||||
objects, if the entity recognizer has been applied.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -500,9 +548,9 @@ if the entity recognizer has been applied to the document.
|
|||
> assert ents[0].text == u"Mr. Best"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------ | ------------------------- |
|
||||
| **YIELDS** | `Span` | Entities in the document. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------ |
|
||||
| **RETURNS** | tuple | Entities in the document, one `Span` per entity. |
|
||||
|
||||
## Doc.noun_chunks {#noun_chunks tag="property" model="parser"}
|
||||
|
||||
|
@ -541,9 +589,9 @@ will be unavailable.
|
|||
> assert [s.root.text for s in sents] == [u"is", u"'s"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---------------------------------- | ----------- |
|
||||
| **YIELDS** | `Span | Sentences in the document. |
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------ | -------------------------- |
|
||||
| **YIELDS** | `Span` | Sentences in the document. |
|
||||
|
||||
## Doc.has_vector {#has_vector tag="property" model="vectors"}
|
||||
|
||||
|
@ -597,20 +645,43 @@ The L2 norm of the document's vector representation.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `text` | unicode | A unicode representation of the document text. |
|
||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||
| `vocab` | `Vocab` | The store of lexical types. |
|
||||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
| Name | Type | Description |
|
||||
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `text` | unicode | A unicode representation of the document text. |
|
||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||
| `vocab` | `Vocab` | The store of lexical types. |
|
||||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||||
| `is_nered` <Tag variant="new">2.1</Tag> | bool | A flag indicating that named entities have been set. Will return `True` if _any_ of the tokens has an entity tag set, even if the others are unknown. |
|
||||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = doc.to_bytes(exclude=["text", "tensor"])
|
||||
> doc.from_disk("./doc.bin", exclude=["user_data"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------------------ | --------------------------------------------- |
|
||||
| `text` | The value of the `Doc.text` attribute. |
|
||||
| `sentiment` | The value of the `Doc.sentiment` attribute. |
|
||||
| `tensor` | The value of the `Doc.tensor` attribute. |
|
||||
| `user_data` | The value of the `Doc.user_data` dictionary. |
|
||||
| `user_data_keys` | The keys of the `Doc.user_data` dictionary. |
|
||||
| `user_data_values` | The values of the `Doc.user_data` dictionary. |
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: EntityRecognizer
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/pipes.pyx
|
||||
---
|
||||
|
||||
This class is a subclass of `Pipe` and follows the same API. The pipeline
|
||||
|
@ -211,7 +211,7 @@ Modify the pipe's model, to use the given parameter values.
|
|||
> ```python
|
||||
> ner = EntityRecognizer(nlp.vocab)
|
||||
> with ner.use_params():
|
||||
> ner.to_disk('/best_model')
|
||||
> ner.to_disk("/best_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -226,7 +226,7 @@ Add a new label to the pipe.
|
|||
>
|
||||
> ```python
|
||||
> ner = EntityRecognizer(nlp.vocab)
|
||||
> ner.add_label('MY_LABEL')
|
||||
> ner.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -241,12 +241,13 @@ Serialize the pipe to disk.
|
|||
>
|
||||
> ```python
|
||||
> ner = EntityRecognizer(nlp.vocab)
|
||||
> ner.to_disk('/path/to/ner')
|
||||
> ner.to_disk("/path/to/ner")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## EntityRecognizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -256,17 +257,18 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
>
|
||||
> ```python
|
||||
> ner = EntityRecognizer(nlp.vocab)
|
||||
> ner.from_disk('/path/to/ner')
|
||||
> ner.from_disk("/path/to/ner")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
> #### example
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> ner = EntityRecognizer(nlp.vocab)
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> ner.from_bytes(ner_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------------ | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. |
|
||||
|
||||
## EntityRecognizer.labels {#labels tag="property"}
|
||||
|
||||
|
@ -312,3 +314,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = ner.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: EntityRuler
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/entityruler.py
|
||||
new: 2.1
|
||||
---
|
||||
|
||||
|
@ -128,7 +128,7 @@ newline-delimited JSON (JSONL).
|
|||
>
|
||||
> ```python
|
||||
> ruler = EntityRuler(nlp)
|
||||
> ruler.to_disk('/path/to/rules.jsonl')
|
||||
> ruler.to_disk("/path/to/rules.jsonl")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -144,7 +144,7 @@ JSON (JSONL) with one entry per line.
|
|||
>
|
||||
> ```python
|
||||
> ruler = EntityRuler(nlp)
|
||||
> ruler.from_disk('/path/to/rules.jsonl')
|
||||
> ruler.from_disk("/path/to/rules.jsonl")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
|
|
@ -327,7 +327,7 @@ the model**.
|
|||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being saved. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Language.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -349,22 +349,22 @@ loaded object.
|
|||
> nlp = English().from_disk("/path/to/en_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`,
|
||||
to improve consistency across classes. Pipeline components to prevent from being
|
||||
loaded can now be added as a list to `disable`, instead of specifying one
|
||||
keyword argument per component.
|
||||
loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1),
|
||||
instead of specifying one keyword argument per component.
|
||||
|
||||
```diff
|
||||
- nlp = spacy.load("en", tagger=False, entity=False)
|
||||
+ nlp = English().from_disk("/model", disable=["tagger', 'ner"])
|
||||
+ nlp = English().from_disk("/model", exclude=["tagger", "ner"])
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
@ -379,10 +379,10 @@ Serialize the current state to a binary string.
|
|||
> nlp_bytes = nlp.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------- |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling) and prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------------------------------------------- |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Language` object. |
|
||||
|
||||
## Language.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -400,20 +400,21 @@ available to the loaded object.
|
|||
> nlp2.from_bytes(nlp_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------- | --------------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | The `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ---------- | ----------------------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The `Language` object. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
Pipeline components to prevent from being loaded can now be added as a list to
|
||||
`disable`, instead of specifying one keyword argument per component.
|
||||
`disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument
|
||||
per component.
|
||||
|
||||
```diff
|
||||
- nlp = English().from_bytes(bytes, tagger=False, entity=False)
|
||||
+ nlp = English().from_bytes(bytes, disable=["tagger", "ner"])
|
||||
+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"])
|
||||
```
|
||||
|
||||
</Infobox>
|
||||
|
@ -437,3 +438,23 @@ Pipeline components to prevent from being loaded can now be added as a list to
|
|||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = nlp.to_bytes(exclude=["tokenizer", "vocab"])
|
||||
> nlp.from_disk("./model-data", exclude=["ner"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | -------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `tokenizer` | Tokenization rules and exceptions. |
|
||||
| `meta` | The meta data, available as `Language.meta`. |
|
||||
| ... | String names of pipeline components, e.g. `"ner"`. |
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: Pipeline Functions
|
||||
teaser: Other built-in pipeline components and helpers
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/functions.py
|
||||
menu:
|
||||
- ['merge_noun_chunks', 'merge_noun_chunks']
|
||||
- ['merge_entities', 'merge_entities']
|
||||
|
@ -73,10 +73,10 @@ components to the end of the pipeline and after all other components.
|
|||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` with merged entities. |
|
||||
|
||||
## merge_subtokens {#merge_entities tag="function" new="2.1"}
|
||||
## merge_subtokens {#merge_subtokens tag="function" new="2.1"}
|
||||
|
||||
Merge subtokens into a single token. Also available via the string name
|
||||
`"merge_entities"`. After initialization, the component is typically added to
|
||||
`"merge_subtokens"`. After initialization, the component is typically added to
|
||||
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
As of v2.1, the parser is able to predict "subtokens" that should be merged into
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: SentenceSegmenter
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/hooks.py
|
||||
---
|
||||
|
||||
A simple spaCy hook, to allow custom sentence boundary detection logic that
|
||||
|
|
|
@ -260,8 +260,8 @@ Retokenize the document, such that the span is merged into a single token.
|
|||
|
||||
## Span.ents {#ents tag="property" new="2.0.12" model="ner"}
|
||||
|
||||
Iterate over the entities in the span. Yields named-entity `Span` objects, if
|
||||
the entity recognizer has been applied to the parent document.
|
||||
The named entities in the span. Returns a tuple of named entity `Span` objects,
|
||||
if the entity recognizer has been applied.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -274,9 +274,9 @@ the entity recognizer has been applied to the parent document.
|
|||
> assert ents[0].text == u"Mr. Best"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------ | ------------------------- |
|
||||
| **YIELDS** | `Span` | Entities in the document. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------- |
|
||||
| **RETURNS** | tuple | Entities in the span, one `Span` per entity. |
|
||||
|
||||
## Span.as_doc {#as_doc tag="method"}
|
||||
|
||||
|
@ -297,8 +297,9 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
|
|||
|
||||
## Span.root {#root tag="property" model="parser"}
|
||||
|
||||
The token within the span that's highest in the parse tree. If there's a tie,
|
||||
the earliest is preferred.
|
||||
The token with the shortest path to the root of the sentence (or the root
|
||||
itself). If multiple tokens are equally high in the tree, the first token is
|
||||
taken.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -151,10 +151,9 @@ Serialize the current state to a binary string.
|
|||
> store_bytes = stringstore.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------ |
|
||||
| **RETURNS** | bytes | The serialized form of the `StringStore` object. |
|
||||
|
||||
## StringStore.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -168,11 +167,10 @@ Load state from a binary string.
|
|||
> new_store = StringStore().from_bytes(store_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | ------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| **RETURNS** | `StringStore` | The `StringStore` object. |
|
||||
|
||||
## Utilities {#util}
|
||||
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: Tagger
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/pipes.pyx
|
||||
---
|
||||
|
||||
This class is a subclass of `Pipe` and follows the same API. The pipeline
|
||||
|
@ -209,7 +209,7 @@ Modify the pipe's model, to use the given parameter values.
|
|||
> ```python
|
||||
> tagger = Tagger(nlp.vocab)
|
||||
> with tagger.use_params():
|
||||
> tagger.to_disk('/best_model')
|
||||
> tagger.to_disk("/best_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -225,7 +225,7 @@ Add a new label to the pipe.
|
|||
> ```python
|
||||
> from spacy.symbols import POS
|
||||
> tagger = Tagger(nlp.vocab)
|
||||
> tagger.add_label('MY_LABEL', {POS: 'NOUN'})
|
||||
> tagger.add_label("MY_LABEL", {POS: 'NOUN'})
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -241,12 +241,13 @@ Serialize the pipe to disk.
|
|||
>
|
||||
> ```python
|
||||
> tagger = Tagger(nlp.vocab)
|
||||
> tagger.to_disk('/path/to/tagger')
|
||||
> tagger.to_disk("/path/to/tagger")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tagger.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -256,17 +257,18 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
>
|
||||
> ```python
|
||||
> tagger = Tagger(nlp.vocab)
|
||||
> tagger.from_disk('/path/to/tagger')
|
||||
> tagger.from_disk("/path/to/tagger")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
||||
|
||||
## Tagger.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
> #### example
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tagger = Tagger(nlp.vocab)
|
||||
|
@ -275,10 +277,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tagger` object. |
|
||||
|
||||
## Tagger.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -292,11 +294,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> tagger.from_bytes(tagger_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | -------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | -------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The `Tagger` object. |
|
||||
|
||||
## Tagger.labels {#labels tag="property"}
|
||||
|
||||
|
@ -314,3 +316,22 @@ tags by default, e.g. `VERB`, `NOUN` and so on.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = tagger.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------ |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tag. |
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: TextCategorizer
|
||||
tag: class
|
||||
source: spacy/pipeline.pyx
|
||||
source: spacy/pipeline/pipes.pyx
|
||||
new: 2
|
||||
---
|
||||
|
||||
|
@ -227,7 +227,7 @@ Modify the pipe's model, to use the given parameter values.
|
|||
> ```python
|
||||
> textcat = TextCategorizer(nlp.vocab)
|
||||
> with textcat.use_params():
|
||||
> textcat.to_disk('/best_model')
|
||||
> textcat.to_disk("/best_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -242,7 +242,7 @@ Add a new label to the pipe.
|
|||
>
|
||||
> ```python
|
||||
> textcat = TextCategorizer(nlp.vocab)
|
||||
> textcat.add_label('MY_LABEL')
|
||||
> textcat.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -257,12 +257,13 @@ Serialize the pipe to disk.
|
|||
>
|
||||
> ```python
|
||||
> textcat = TextCategorizer(nlp.vocab)
|
||||
> textcat.to_disk('/path/to/textcat')
|
||||
> textcat.to_disk("/path/to/textcat")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## TextCategorizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -272,17 +273,18 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
>
|
||||
> ```python
|
||||
> textcat = TextCategorizer(nlp.vocab)
|
||||
> textcat.from_disk('/path/to/textcat')
|
||||
> textcat.from_disk("/path/to/textcat")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
> #### example
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> textcat = TextCategorizer(nlp.vocab)
|
||||
|
@ -291,10 +293,10 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
Serialize the pipe to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -308,11 +310,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it.
|
|||
> textcat.from_bytes(textcat_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. |
|
||||
|
||||
## TextCategorizer.labels {#labels tag="property"}
|
||||
|
||||
|
@ -328,3 +330,21 @@ The labels currently added to the component.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ----- | ---------------------------------- |
|
||||
| **RETURNS** | tuple | The labels added to the component. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = textcat.to_disk("/path", exclude=["vocab"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------- | -------------------------------------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `cfg` | The config file. You usually don't want to exclude this. |
|
||||
| `model` | The binary model data. You usually don't want to exclude this. |
|
||||
|
|
|
@ -324,7 +324,7 @@ A sequence containing the token and all the token's syntactic descendants.
|
|||
## Token.is_sent_start {#is_sent_start tag="property" new="2"}
|
||||
|
||||
A boolean value indicating whether the token starts a sentence. `None` if
|
||||
unknown. Defaults to `True` for the first token in the `doc`.
|
||||
unknown. Defaults to `True` for the first token in the `Doc`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -116,6 +116,74 @@ details and examples.
|
|||
| `string` | unicode | The string to specially tokenize. |
|
||||
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
|
||||
|
||||
## Tokenizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
Serialize the tokenizer to disk.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tokenizer = Tokenizer(nlp.vocab)
|
||||
> tokenizer.to_disk("/path/to/tokenizer")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tokenizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
Load the tokenizer from disk. Modifies the object in place and returns it.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tokenizer = Tokenizer(nlp.vocab)
|
||||
> tokenizer.from_disk("/path/to/tokenizer")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
||||
|
||||
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tokenizer = tokenizer(nlp.vocab)
|
||||
> tokenizer_bytes = tokenizer.to_bytes()
|
||||
> ```
|
||||
|
||||
Serialize the tokenizer to a bytestring.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. |
|
||||
|
||||
## Tokenizer.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
Load the tokenizer from a bytestring. Modifies the object in place and returns
|
||||
it.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tokenizer_bytes = tokenizer.to_bytes()
|
||||
> tokenizer = Tokenizer(nlp.vocab)
|
||||
> tokenizer.from_bytes(tokenizer_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ----------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The `Tokenizer` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -124,3 +192,25 @@ details and examples.
|
|||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = tokenizer.to_bytes(exclude=["vocab", "exceptions"])
|
||||
> tokenizer.from_disk("./data", exclude=["token_match"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ---------------- | --------------------------------- |
|
||||
| `vocab` | The shared [`Vocab`](/api/vocab). |
|
||||
| `prefix_search` | The prefix rules. |
|
||||
| `suffix_search` | The suffix rules. |
|
||||
| `infix_finditer` | The infix rules. |
|
||||
| `token_match` | The token match expression. |
|
||||
| `exceptions` | The tokenizer exception rules. |
|
||||
|
|
|
@ -642,7 +642,7 @@ All Python code is written in an **intersection of Python 2 and Python 3**. This
|
|||
is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or
|
||||
platform compatibility only lives in `spacy.compat`. To distinguish them from
|
||||
the builtin functions, replacement functions are suffixed with an underscore,
|
||||
e.e `unicode_`.
|
||||
e.g. `unicode_`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -660,7 +660,7 @@ e.e `unicode_`.
|
|||
| `compat.input_` | `raw_input` | `input` |
|
||||
| `compat.path2str` | `str(path)` with `.decode('utf8')` | `str(path)` |
|
||||
|
||||
### compat.is_config {#is_config tag="function"}
|
||||
### compat.is_config {#compat.is_config tag="function"}
|
||||
|
||||
Check if a specific configuration of Python version and operating system matches
|
||||
the user's setup. Mostly used to display targeted error messages.
|
||||
|
|
|
@ -311,10 +311,9 @@ Save the current state to a directory.
|
|||
>
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `**exclude` | - | Named attributes to prevent from being saved. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## Vectors.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -342,10 +341,9 @@ Serialize the current state to a binary string.
|
|||
> vectors_bytes = vectors.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------- |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vectors` object. |
|
||||
|
||||
## Vectors.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -360,11 +358,10 @@ Load state from a binary string.
|
|||
> new_vectors.from_bytes(vectors_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------- | ---------------------------------------------- |
|
||||
| `data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------- | ---------------------- |
|
||||
| `data` | bytes | The data to load from. |
|
||||
| **RETURNS** | `Vectors` | The `Vectors` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
|
|
@ -221,9 +221,10 @@ Save the current state to a directory.
|
|||
> nlp.vocab.to_disk("/path/to/vocab")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Vocab.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -239,6 +240,7 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Vocab` | The modified `Vocab` object. |
|
||||
|
||||
## Vocab.to_bytes {#to_bytes tag="method"}
|
||||
|
@ -251,10 +253,10 @@ Serialize the current state to a binary string.
|
|||
> vocab_bytes = nlp.vocab.to_bytes()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------- |
|
||||
| `**exclude` | - | Named attributes to prevent from being serialized. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------------------- |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | bytes | The serialized form of the `Vocab` object. |
|
||||
|
||||
## Vocab.from_bytes {#from_bytes tag="method"}
|
||||
|
||||
|
@ -269,11 +271,11 @@ Load state from a binary string.
|
|||
> vocab.from_bytes(vocab_bytes)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------- | ---------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `**exclude` | - | Named attributes to prevent from being loaded. |
|
||||
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------- | ------------------------------------------------------------------------- |
|
||||
| `bytes_data` | bytes | The data to load from. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Vocab` | The `Vocab` object. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
@ -291,3 +293,22 @@ Load state from a binary string.
|
|||
| `strings` | `StringStore` | A table managing the string-to-int mapping. |
|
||||
| `vectors` <Tag variant="new">2</Tag> | `Vectors` | A table associating word IDs to word vectors. |
|
||||
| `vectors_length` | int | Number of dimensions for each word vector. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
During serialization, spaCy will export several data fields used to restore
|
||||
different aspects of the object. If needed, you can exclude them from
|
||||
serialization by passing in the string names via the `exclude` argument.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> data = vocab.to_bytes(exclude=["strings", "vectors"])
|
||||
> vocab.from_disk("./vocab", exclude=["strings"])
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------- | ----------------------------------------------------- |
|
||||
| `strings` | The strings in the [`StringStore`](/api/stringstore). |
|
||||
| `lexemes` | The lexeme data. |
|
||||
| `vectors` | The word vectors, if available. |
|
||||
|
|
|
@ -424,7 +424,7 @@ take a path to a JSON file containing the patterns. This lets you reuse the
|
|||
component with different patterns, depending on your application:
|
||||
|
||||
```python
|
||||
html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
|
||||
html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json")
|
||||
```
|
||||
|
||||
<Infobox title="📖 Processing pipelines">
|
||||
|
|
|
@ -237,6 +237,19 @@ if all of your models are up to date, you can run the
|
|||
+ retokenizer.merge(doc[6:8])
|
||||
```
|
||||
|
||||
- The serialization methods `to_disk`, `from_disk`, `to_bytes` and `from_bytes`
|
||||
now support a single `exclude` argument to provide a list of string names to
|
||||
exclude. The docs have been updated to list the available serialization fields
|
||||
for each class. The `disable` argument on the [`Language`](/api/language)
|
||||
serialization methods has been renamed to `exclude` for consistency.
|
||||
|
||||
```diff
|
||||
- nlp.to_disk("/path", disable=["parser", "ner"])
|
||||
+ nlp.to_disk("/path", exclude=["parser", "ner"])
|
||||
- data = nlp.tokenizer.to_bytes(vocab=False)
|
||||
+ data = nlp.tokenizer.to_bytes(exclude=["vocab"])
|
||||
```
|
||||
|
||||
- For better compatibility with the Universal Dependencies data, the lemmatizer
|
||||
now preserves capitalization, e.g. for proper nouns. See
|
||||
[this issue](https://github.com/explosion/spaCy/issues/3256) for details.
|
||||
|
|
Loading…
Reference in New Issue
Block a user