Merge branch 'develop' into master-tmp

Ines Montani 2020-09-04 13:15:36 +02:00
commit 864a697e63
1028 changed files with 79514 additions and 102996 deletions

@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test sdist"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.tar.gz"
  - wait
  - trigger: "spacy-sdist-against-models"
    label: ":dizzy: :hammer:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"

@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test wheel"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.whl"
  - wait
  - trigger: "spacy-train-from-wheel"
    label: ":dizzy: :train:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"

106
.github/contributors/tiangolo.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Sebastián Ramírez |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-07-01 |
| GitHub username | tiangolo |
| Website (optional) | |

8
.gitignore vendored

@ -18,8 +18,7 @@ website/.npm
website/logs
*.log
npm-debug.log*
website/www/
website/_deploy.sh
quickstart-training-generator.js
# Cython / C extensions
cythonize.json
@ -44,12 +43,14 @@ __pycache__/
.env*
.~env/
.venv
env3.6/
venv/
env3.*/
.dev
.denv
.pypyenv
.pytest_cache/
.mypy_cache/
# Distribution / packaging
env/
@ -119,3 +120,6 @@ Desktop.ini
# Pycharm project files
*.idea
# IPython
.ipynb_checkpoints/

@ -1,23 +0,0 @@
language: python
sudo: false
cache: pip
dist: trusty
group: edge
python:
- "2.7"
os:
- linux
install:
- "pip install -r requirements.txt"
- "python setup.py build_ext --inplace"
- "pip install -e ."
script:
- "cat /proc/cpuinfo | grep flags | head -n 1"
- "python -m pytest --tb=native spacy"
branches:
  except:
    - spacy.io
notifications:
  slack:
    secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ=
  email: false

@ -5,7 +5,7 @@
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
overview of how things are organized and most importantly, how to get involved.
## Table of contents
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
@ -216,7 +216,7 @@ list of available editor integrations.
#### Disabling formatting
There are a few cases where auto-formatting doesn't improve readability, for
example, in some of the the language data files like the `tag_map.py`, or in
example, in some of the language data files like the `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:
@ -280,29 +280,13 @@ except: # noqa: E722
### Python conventions
All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()`
helper function.
```python
from .compat import unicode_, is_config
compatible_unicode = unicode_('hello world')
if is_config(windows=True, python2=True):
print("You are using Python 2 on Windows.")
```
All Python code must be written **compatible with Python 3.6+**.
Code that interacts with the file-system should accept objects that follow the
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
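
The path and file-like conventions described here could look roughly like the following. This is a minimal sketch, not code from the repository: `to_path` and `write_lines` are hypothetical helper names chosen for illustration.

```python
import io
from pathlib import Path


def to_path(path):
    """Convert strings to Path; pass anything else through unchanged."""
    # Only strings are converted, so objects that merely follow the
    # pathlib.Path API (without inheriting from it) are accepted as-is.
    if isinstance(path, str):
        return Path(path)
    return path


def write_lines(file_, lines):
    """Serialize to any file-like object that has a write() method."""
    for line in lines:
        file_.write(line + "\n")


# An in-memory buffer stands in for a real file, which keeps the
# serialization code IO-agnostic and easy to test.
buffer = io.StringIO()
write_lines(buffer, ["hello", "world"])
model_dir = to_path("data/model")
print(model_dir.name, buffer.getvalue().splitlines())
```

Because `write_lines` only relies on `write()`, the same function works with an open file, a socket wrapper, or a buffer.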
@ -400,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
@ -416,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@ -428,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

@ -1,9 +1,9 @@
recursive-include include *.h
recursive-include spacy *.txt *.pyx *.pxd
recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
include LICENSE
include README.md
include bin/spacy
include pyproject.toml
recursive-exclude spacy/lang *.json
recursive-include spacy/lang *.json.gz
recursive-include spacy/cli *.json *.yml
recursive-include licenses *

@ -1,29 +1,57 @@
SHELL := /bin/bash
PYVER := 3.6
ifndef SPACY_EXTRAS
override SPACY_EXTRAS = spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
endif
ifndef PYVER
override PYVER = 3.6
endif
VENV := ./env$(PYVER)
version := $(shell "bin/get-version.sh")
package := $(shell "bin/get-package.sh")
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
ifndef SPACY_BIN
override SPACY_BIN = $(package)-$(version).pex
endif
ifndef WHEELHOUSE
override WHEELHOUSE = "./wheelhouse"
endif
dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp
	$(VENV)/bin/pex \
		-f $(WHEELHOUSE) \
		--no-index \
		--disable-cache \
		-m spacy \
		-o $@ \
		$(package)==$(version) \
		$(SPACY_EXTRAS)
	chmod a+rx $@
	cp $@ dist/spacy.pex
dist/pytest.pex : wheelhouse/pytest-*.whl
	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl
	$(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
	chmod a+rx $@
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
	$(VENV)/bin/pip wheel . -w ./wheelhouse
	$(VENV)/bin/pip wheel jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
$(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
	$(VENV)/bin/pip wheel . -w $(WHEELHOUSE)
	$(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE)
	touch $@
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
$(WHEELHOUSE)/pytest-%.whl : $(VENV)/bin/pex
	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w $(WHEELHOUSE)
$(VENV)/bin/pex :
	python$(PYVER) -m venv $(VENV)
	$(VENV)/bin/pip install -U pip setuptools pex wheel
	$(VENV)/bin/pip install numpy
.PHONY : clean test
@ -33,6 +61,6 @@ test : dist/spacy-$(version).pex dist/pytest.pex
clean : setup.py
	rm -rf dist/*
	rm -rf ./wheelhouse
	rm -rf $(WHEELHOUSE)/*
	rm -rf $(VENV)
	python setup.py clean --all

@ -15,7 +15,6 @@ It's commercial open-source software, released under the MIT license.
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
[![Travis Build Status](<https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis-ci&logoColor=white&label=build+(2.7)>)](https://travis-ci.org/explosion/spaCy)
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
@ -50,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
## 💬 Where to ask questions
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
[@ines](https://github.com/ines), along with core contributors
[@svlandeg](https://github.com/svlandeg) and
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from
@ -98,12 +96,19 @@ For detailed installation instructions, see the
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
Studio)
- **Python version**: Python 2.7, 3.5+ (only 64 bit)
- **Python version**: Python 3.6+ (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)
[pip]: https://pypi.org/project/spacy/
[conda]: https://anaconda.org/conda-forge/spacy
> ⚠️ **Important note for Python 3.8:** We can't yet ship pre-compiled binary
> wheels for spaCy that work on Python 3.8, as we're still waiting for our CI
> providers and other tooling to support it. This means that in order to run
> spaCy on Python 3.8, you'll need [a compiler installed](#source) and compile
> the library and its Cython dependencies locally. If this is causing problems
> for you, the easiest solution is to **use Python 3.7** in the meantime.
### pip
Using pip, spaCy releases are available as source packages and binary wheels (as
@ -188,7 +193,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_
### Loading and using models
To load a model, use `spacy.load()` with the model name, a shortcut link or a
To load a model, use `spacy.load()` with the model name or a
path to the model data directory.
```python
@ -263,9 +268,7 @@ and git preinstalled.
Install a version of the
[Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that
matches the version that was used to compile your Python interpreter. For
official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and
VS 2015 (Python 3.5).
matches the version that was used to compile your Python interpreter.
## Run tests

@ -27,7 +27,7 @@ jobs:
inputs:
versionSpec: '3.7'
- script: |
pip install flake8
pip install flake8==3.5.0
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
displayName: 'flake8'
@ -35,12 +35,6 @@ jobs:
dependsOn: 'Validate'
strategy:
matrix:
Python35Linux:
imageName: 'ubuntu-16.04'
python.version: '3.5'
Python35Windows:
imageName: 'vs2017-win2016'
python.version: '3.5'
Python36Linux:
imageName: 'ubuntu-16.04'
python.version: '3.6'
@ -58,7 +52,7 @@ jobs:
# imageName: 'vs2017-win2016'
# python.version: '3.7'
# Python37Mac:
# imageName: 'macos-10.13'
# imageName: 'macos-10.14'
# python.version: '3.7'
Python38Linux:
imageName: 'ubuntu-16.04'

@ -1,169 +0,0 @@
#!/usr/bin/env python
""" cythonize.py
Cythonize pyx files into C++ files as needed.
Usage: cythonize.py [root]
Checks pyx files to see if they have been changed relative to their
corresponding C++ files. If they have, then runs cython on these files to
recreate the C++ files.
Additionally, checks pxd files and setup.py if they have been changed. If
they have, rebuilds everything.
Change detection based on file hashes stored in JSON format.
For now, this script should be run by developers when changing Cython files
and the resulting C++ files checked in, so that end-users (and Python-only
developers) do not get the Cython dependencies.
Based upon:
https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py
https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py
Note: this script does not check any of the dependent C++ libraries.
"""
from __future__ import print_function
import os
import sys
import json
import hashlib
import subprocess
import argparse
HASH_FILE = "cythonize.json"
def process_pyx(fromfile, tofile, language_level="-2"):
    print("Processing %s" % fromfile)
    try:
        from Cython.Compiler.Version import version as cython_version
        from distutils.version import LooseVersion

        if LooseVersion(cython_version) < LooseVersion("0.19"):
            raise Exception("Require Cython >= 0.19")
    except ImportError:
        pass
    flags = ["--fast-fail", language_level]
    if tofile.endswith(".cpp"):
        flags += ["--cplus"]
    try:
        try:
            r = subprocess.call(
                ["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
            )  # See Issue #791
            if r != 0:
                raise Exception("Cython failed")
        except OSError:
            # There are ways of installing Cython that don't result in a cython
            # executable on the path, see gh-2397.
            r = subprocess.call(
                [
                    sys.executable,
                    "-c",
                    "import sys; from Cython.Compiler.Main import "
                    "setuptools_main as main; sys.exit(main())",
                ]
                + flags
                + ["-o", tofile, fromfile]
            )
            if r != 0:
                raise Exception("Cython failed")
    except OSError:
        raise OSError("Cython needs to be installed")


def preserve_cwd(path, func, *args):
    orig_cwd = os.getcwd()
    try:
        os.chdir(path)
        func(*args)
    finally:
        os.chdir(orig_cwd)


def load_hashes(filename):
    try:
        return json.load(open(filename))
    except (ValueError, IOError):
        return {}


def save_hashes(hash_db, filename):
    with open(filename, "w") as f:
        f.write(json.dumps(hash_db))


def get_hash(path):
    return hashlib.md5(open(path, "rb").read()).hexdigest()


def hash_changed(base, path, db):
    full_path = os.path.normpath(os.path.join(base, path))
    return not get_hash(full_path) == db.get(full_path)


def hash_add(base, path, db):
    full_path = os.path.normpath(os.path.join(base, path))
    db[full_path] = get_hash(full_path)


def process(base, filename, db):
    root, ext = os.path.splitext(filename)
    if ext in [".pyx", ".cpp"]:
        if hash_changed(base, filename, db) or not os.path.isfile(
            os.path.join(base, root + ".cpp")
        ):
            preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
            hash_add(base, root + ".cpp", db)
            hash_add(base, root + ".pyx", db)


def check_changes(root, db):
    res = False
    new_db = {}
    setup_filename = "setup.py"
    hash_add(".", setup_filename, new_db)
    if hash_changed(".", setup_filename, db):
        res = True
    for base, _, files in os.walk(root):
        for filename in files:
            if filename.endswith(".pxd"):
                hash_add(base, filename, new_db)
                if hash_changed(base, filename, db):
                    res = True
    if res:
        db.clear()
        db.update(new_db)
    return res


def run(root):
    db = load_hashes(HASH_FILE)
    try:
        check_changes(root, db)
        for base, _, files in os.walk(root):
            for filename in files:
                process(base, filename, db)
    finally:
        save_hashes(db, HASH_FILE)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Cythonize pyx files into C++ files as needed"
    )
    parser.add_argument("root", help="root directory")
    args = parser.parse_args()
    run(args.root)
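
The hash-based change detection used by the removed script can be illustrated in isolation. The following is a simplified sketch rather than the script itself: `file_hash`, `has_changed` and `record` are hypothetical names, the database is a plain in-memory dict instead of `cythonize.json`, and a temporary file stands in for a real `.pyx` source.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    # Hash the raw bytes, closing the file handle when done.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def has_changed(path, db):
    # A file counts as changed if it is unknown or its hash differs.
    return file_hash(path) != db.get(os.path.normpath(path))


def record(path, db):
    db[os.path.normpath(path)] = file_hash(path)


with tempfile.TemporaryDirectory() as tmp:
    pyx = os.path.join(tmp, "example.pyx")
    with open(pyx, "w") as f:
        f.write("cdef int x = 1\n")
    db = {}
    first = has_changed(pyx, db)   # not recorded yet
    record(pyx, db)
    second = has_changed(pyx, db)  # recorded and unmodified
    with open(pyx, "w") as f:
        f.write("cdef int x = 2\n")
    third = has_changed(pyx, db)   # contents differ from recorded hash

print(first, second, third)
```

Comparing content hashes instead of modification times means a file that is touched but not edited does not trigger a rebuild.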

12
bin/get-package.sh Executable file

@ -0,0 +1,12 @@
#!/usr/bin/env bash
set -e
version=$(grep "__title__ = " spacy/about.py)
version=${version/__title__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}
echo $version
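
For readers less familiar with Bash parameter expansion, the extraction above can be approximated in Python. This is a rough sketch under stated assumptions: `parse_title` is a hypothetical helper, the sample text is made up rather than taken from `spacy/about.py`, and unlike the shell script it only looks at the first matching line and strips every quote character.

```python
def parse_title(about_text):
    """Return the value assigned to __title__, without quotes."""
    for line in about_text.splitlines():
        if "__title__ = " in line:
            # Drop the assignment prefix, then the surrounding quotes.
            value = line.replace("__title__ = ", "", 1)
            return value.replace("'", "").replace('"', "").strip()
    return None


# Made-up stand-in for the contents of an about.py file.
sample = '__title__ = "example-package"\n__version__ = "0.0.1"\n'
title = parse_title(sample)
print(title)
```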

@ -1,97 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import bz2
import re
import srsly
import sys
import random
import datetime
import plac
from pathlib import Path
_unset = object()
class Reddit(object):
    """Stream cleaned comments from Reddit."""

    pre_format_re = re.compile(r"^[`*~]")
    post_format_re = re.compile(r"[`*~]$")
    url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
    link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

    def __init__(self, file_path, meta_keys={"subreddit": "section"}):
        """
        file_path (unicode / Path): Path to archive or directory of archives.
        meta_keys (dict): Meta data key included in the Reddit corpus, mapped
            to display name in Prodigy meta.
        RETURNS (Reddit): The Reddit loader.
        """
        self.meta = meta_keys
        file_path = Path(file_path)
        if not file_path.exists():
            raise IOError("Can't find file path: {}".format(file_path))
        if not file_path.is_dir():
            self.files = [file_path]
        else:
            self.files = list(file_path.iterdir())

    def __iter__(self):
        for file_path in self.iter_files():
            with bz2.open(str(file_path)) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    comment = srsly.json_loads(line)
                    if self.is_valid(comment):
                        text = self.strip_tags(comment["body"])
                        yield {"text": text}

    def get_meta(self, item):
        return {name: item.get(key, "n/a") for key, name in self.meta.items()}

    def iter_files(self):
        for file_path in self.files:
            yield file_path

    def strip_tags(self, text):
        text = self.link_re.sub(r"\1", text)
        text = text.replace("&gt;", ">").replace("&lt;", "<")
        text = self.pre_format_re.sub("", text)
        text = self.post_format_re.sub("", text)
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    def is_valid(self, comment):
        return (
            comment["body"] is not None
            and comment["body"] != "[deleted]"
            and comment["body"] != "[removed]"
        )


def main(path):
    reddit = Reddit(path)
    for comment in reddit:
        print(srsly.json_dumps(comment))


if __name__ == "__main__":
    import socket

    try:
        BrokenPipeError
    except NameError:
        BrokenPipeError = socket.error
    try:
        plac.call(main)
    except BrokenPipeError:
        import os, sys

        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE
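
To see what the cleaning step in `strip_tags` does to a single comment, the same regular expressions can be run on their own. This is an illustrative sketch: the function is lifted out of the class for convenience, and the input string is invented rather than taken from the Reddit corpus.

```python
import re

# Same patterns as in the removed script.
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
pre_format_re = re.compile(r"^[`*~]")
post_format_re = re.compile(r"[`*~]$")


def strip_tags(text):
    text = link_re.sub(r"\1", text)  # keep link text, drop the URL
    text = text.replace("&gt;", ">").replace("&lt;", "<")  # unescape
    text = pre_format_re.sub("", text)  # one leading format character
    text = post_format_re.sub("", text)  # one trailing format character
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()


cleaned = strip_tags("*See [the docs](https://example.com/docs)   for 1 &lt; 2*")
print(cleaned)  # See the docs for 1 < 2
```

Note that the two format patterns each remove at most one character, so doubled markers such as `**bold**` are only partly stripped.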

@ -1,2 +0,0 @@
#! /bin/sh
python -m spacy "$@"

@ -1,81 +0,0 @@
#!/usr/bin/env python
from __future__ import print_function, unicode_literals, division
import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
import plac
import spacy
logger = logging.getLogger(__name__)
class Corpus(object):
    def __init__(self, directory, nlp):
        self.directory = directory
        self.nlp = nlp

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with text_loc.open("r", encoding="utf-8") as file_:
                text = file_.read()
            # This is to keep the input to the blank model (which doesn't
            # sentencize) from being too long. It works particularly well with
            # the output of [WikiExtractor](https://github.com/attardi/wikiextractor)
            paragraphs = text.split('\n\n')
            for par in paragraphs:
                yield [word.orth_ for word in self.nlp(par)]


def iter_dir(loc):
    dir_path = Path(loc)
    for fn_path in dir_path.iterdir():
        if fn_path.is_dir():
            for sub_path in fn_path.iterdir():
                yield sub_path
        else:
            yield fn_path


@plac.annotations(
    lang=("ISO language code"),
    in_dir=("Location of input directory"),
    out_loc=("Location of output file"),
    n_workers=("Number of workers", "option", "n", int),
    size=("Dimension of the word vectors", "option", "d", int),
    window=("Context window size", "option", "w", int),
    min_count=("Min count", "option", "m", int),
    negative=("Number of negative samples", "option", "g", int),
    nr_iter=("Number of iterations", "option", "i", int),
)
def main(
    lang,
    in_dir,
    out_loc,
    negative=5,
    n_workers=4,
    window=5,
    size=128,
    min_count=10,
    nr_iter=5,
):
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
    )
    nlp = spacy.blank(lang)
    corpus = Corpus(in_dir, nlp)
    model = Word2Vec(
        sentences=corpus,
        size=size,
        window=window,
        min_count=min_count,
        workers=n_workers,
        sample=1e-5,
        negative=negative,
    )
    model.save(out_loc)


if __name__ == "__main__":
    plac.call(main)

@ -1,2 +0,0 @@
from .conll17_ud_eval import main as ud_evaluate # noqa: F401
from .ud_train import main as ud_train # noqa: F401

@ -1,614 +0,0 @@
#!/usr/bin/env python
# flake8: noqa
# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
# Compare HEADs correctly using aligned words
# Allow evaluation with erroneous spaces in forms
# Compare forms in LCS case insensitively
# Detect cycles and multiple root nodes
# Compute AlignedAccuracy
# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metrics
# is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
# and in case the metric is computed on aligned words also accuracy on these):
# - Tokens: how well do the gold tokens match system tokens
# - Sentences: how well do the gold sentences match system sentences
# - Words: how well can the gold words be aligned to system words
# - UPOS: using aligned words, how well does UPOS match
# - XPOS: using aligned words, how well does XPOS match
# - Feats: using aligned words, how well does FEATS match
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
# - Lemmas: using aligned words, how well does LEMMA match
# - UAS: using aligned words, how well does HEAD match
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
# one more metric is shown:
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
# API usage
# ---------
# - load_conllu(file)
# - loads CoNLL-U file from given file object to an internal representation
# - the file object should return str on both Python 2 and Python 3
# - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
# - raises UDError if the concatenated tokens of gold and system file do not match
# - returns a dictionary with the metrics described above, each metrics having
# four fields: precision, recall, f1 and aligned_accuracy (when using aligned
# words, otherwise this is None)
# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.
# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# Multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
# are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.
from __future__ import division
from __future__ import print_function
import argparse
import io
import sys
import unittest
# CoNLL-U column names
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
# UD Error is used when raising exceptions in this module
class UDError(Exception):
pass
# Load given CoNLL-U file into internal representation
def load_conllu(file, check_parse=True):
# Internal representation classes
class UDRepresentation:
def __init__(self):
# Characters of all the tokens in the whole file.
# Whitespace between tokens is not included.
self.characters = []
# List of UDSpan instances with start&end indices into `characters`.
self.tokens = []
# List of UDWord instances.
self.words = []
# List of UDSpan instances with start&end indices into `characters`.
self.sentences = []
class UDSpan:
def __init__(self, start, end, characters):
self.start = start
# Note that self.end marks the first position **after the end** of span,
# so we can use characters[start:end] or range(start, end).
self.end = end
self.characters = characters
@property
def text(self):
return ''.join(self.characters[self.start:self.end])
def __str__(self):
return self.text
def __repr__(self):
return self.text
class UDWord:
def __init__(self, span, columns, is_multiword):
# Span of this word (or MWT, see below) within ud_representation.characters.
self.span = span
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
self.columns = columns
# is_multiword==True means that this word is part of a multi-word token.
# In that case, self.span marks the span of the whole multi-word token.
self.is_multiword = is_multiword
# Reference to the UDWord instance representing the HEAD (or None if root).
self.parent = None
# Let's ignore language-specific deprel subtypes.
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
ud = UDRepresentation()
# Load the CoNLL-U file
index, sentence_start = 0, None
linenum = 0
while True:
line = file.readline()
linenum += 1
if not line:
break
line = line.rstrip("\r\n")
# Handle sentence start boundaries
if sentence_start is None:
# Skip comments
if line.startswith("#"):
continue
# Start a new sentence
ud.sentences.append(UDSpan(index, 0, ud.characters))
sentence_start = len(ud.words)
if not line:
# Add parent UDWord links and check there are no cycles
def process_word(word):
if word.parent == "remapping":
raise UDError("There is a cycle in a sentence")
if word.parent is None:
head = int(word.columns[HEAD])
if head > len(ud.words) - sentence_start:
raise UDError("Line {}: HEAD '{}' points outside of the sentence".format(
linenum, word.columns[HEAD]))
if head:
parent = ud.words[sentence_start + head - 1]
word.parent = "remapping"
process_word(parent)
word.parent = parent
for word in ud.words[sentence_start:]:
process_word(word)
# Check there is a single root node
if check_parse:
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
raise UDError("There must be exactly one root in a sentence")
# End the sentence
ud.sentences[-1].end = index
sentence_start = None
continue
# Read next token/word
columns = line.split("\t")
if len(columns) != 10:
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
# Skip empty nodes
if "." in columns[ID]:
continue
# Delete spaces from FORM so gold.characters == system.characters
# even if one of them tokenizes the space.
columns[FORM] = columns[FORM].replace(" ", "")
if not columns[FORM]:
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
# Save token
ud.characters.extend(columns[FORM])
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
index += len(columns[FORM])
# Handle multi-word tokens to save word(s)
if "-" in columns[ID]:
try:
start, end = map(int, columns[ID].split("-"))
except ValueError:
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
for _ in range(start, end + 1):
word_line = file.readline().rstrip("\r\n")
word_columns = word_line.split("\t")
if len(word_columns) != 10:
print(columns)
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
# Basic tokens/words
else:
try:
word_id = int(columns[ID])
except ValueError:
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
if word_id != len(ud.words) - sentence_start + 1:
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
try:
head_id = int(columns[HEAD])
except ValueError:
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
if head_id < 0:
raise UDError("HEAD cannot be negative")
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
if sentence_start is not None:
raise UDError("The CoNLL-U file does not end with empty line")
return ud
# Evaluate the gold and system treebanks (loaded using load_conllu).
def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
class Score:
def __init__(self, gold_total, system_total, correct, aligned_total=None, undersegmented=None, oversegmented=None):
self.precision = correct / system_total if system_total else 0.0
self.recall = correct / gold_total if gold_total else 0.0
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
self.undersegmented = undersegmented
self.oversegmented = oversegmented
self.under_perc = len(undersegmented) / gold_total if gold_total and undersegmented else 0.0
self.over_perc = len(oversegmented) / gold_total if gold_total and oversegmented else 0.0
class AlignmentWord:
def __init__(self, gold_word, system_word):
self.gold_word = gold_word
self.system_word = system_word
self.gold_parent = None
self.system_parent_gold_aligned = None
class Alignment:
def __init__(self, gold_words, system_words):
self.gold_words = gold_words
self.system_words = system_words
self.matched_words = []
self.matched_words_map = {}
def append_aligned_words(self, gold_word, system_word):
self.matched_words.append(AlignmentWord(gold_word, system_word))
self.matched_words_map[system_word] = gold_word
def fill_parents(self):
# We represent root parents in both gold and system data by '0'.
# For gold data, we represent non-root parent by corresponding gold word.
# For system data, we represent a non-root parent either by the gold word aligned
# to the parent system node, or by None if no gold word is aligned to the parent.
for words in self.matched_words:
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
if words.system_word.parent is not None else 0
def lower(text):
if sys.version_info < (3, 0) and isinstance(text, str):
return text.decode("utf-8").lower()
return text.lower()
def spans_score(gold_spans, system_spans):
correct, gi, si = 0, 0, 0
undersegmented = []
oversegmented = []
combo = 0
previous_end_si_earlier = False
previous_end_gi_earlier = False
while gi < len(gold_spans) and si < len(system_spans):
previous_si = system_spans[si-1] if si > 0 else None
previous_gi = gold_spans[gi-1] if gi > 0 else None
if system_spans[si].start < gold_spans[gi].start:
# avoid counting the same mistake twice
if not previous_end_si_earlier:
combo += 1
oversegmented.append(str(previous_gi).strip())
si += 1
elif gold_spans[gi].start < system_spans[si].start:
# avoid counting the same mistake twice
if not previous_end_gi_earlier:
combo += 1
undersegmented.append(str(previous_si).strip())
gi += 1
else:
correct += gold_spans[gi].end == system_spans[si].end
if gold_spans[gi].end < system_spans[si].end:
undersegmented.append(str(system_spans[si]).strip())
previous_end_gi_earlier = True
previous_end_si_earlier = False
elif gold_spans[gi].end > system_spans[si].end:
oversegmented.append(str(gold_spans[gi]).strip())
previous_end_si_earlier = True
previous_end_gi_earlier = False
else:
previous_end_gi_earlier = False
previous_end_si_earlier = False
si += 1
gi += 1
return Score(len(gold_spans), len(system_spans), correct, None, undersegmented, oversegmented)
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
gold, system, aligned, correct = 0, 0, 0, 0
for word in alignment.gold_words:
gold += weight_fn(word)
for word in alignment.system_words:
system += weight_fn(word)
for words in alignment.matched_words:
aligned += weight_fn(words.gold_word)
if key_fn is None:
# Return score for whole aligned words
return Score(gold, system, aligned)
for words in alignment.matched_words:
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
correct += weight_fn(words.gold_word)
return Score(gold, system, correct, aligned)
def beyond_end(words, i, multiword_span_end):
if i >= len(words):
return True
if words[i].is_multiword:
return words[i].span.start >= multiword_span_end
return words[i].span.end > multiword_span_end
def extend_end(word, multiword_span_end):
if word.is_multiword and word.span.end > multiword_span_end:
return word.span.end
return multiword_span_end
def find_multiword_span(gold_words, system_words, gi, si):
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
# Initialize multiword_span_end characters index.
if gold_words[gi].is_multiword:
multiword_span_end = gold_words[gi].span.end
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
si += 1
else: # if system_words[si].is_multiword
multiword_span_end = system_words[si].span.end
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
gi += 1
gs, ss = gi, si
# Find the end of the multiword span
# (so both gi and si are pointing to the word following the multiword span end).
while not beyond_end(gold_words, gi, multiword_span_end) or \
not beyond_end(system_words, si, multiword_span_end):
if gi < len(gold_words) and (si >= len(system_words) or
gold_words[gi].span.start <= system_words[si].span.start):
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
gi += 1
else:
multiword_span_end = extend_end(system_words[si], multiword_span_end)
si += 1
return gs, ss, gi, si
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
lcs = [[0] * (si - ss) for i in range(gi - gs)]
for g in reversed(range(gi - gs)):
for s in reversed(range(si - ss)):
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
return lcs
def align_words(gold_words, system_words):
alignment = Alignment(gold_words, system_words)
gi, si = 0, 0
while gi < len(gold_words) and si < len(system_words):
if gold_words[gi].is_multiword or system_words[si].is_multiword:
# A: Multi-word tokens => align via LCS within the whole "multiword span".
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
if si > ss and gi > gs:
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
# Store aligned words
s, g = 0, 0
while g < gi - gs and s < si - ss:
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
g += 1
s += 1
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
g += 1
else:
s += 1
else:
# B: No multi-word token => align according to spans.
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
alignment.append_aligned_words(gold_words[gi], system_words[si])
gi += 1
si += 1
elif gold_words[gi].span.start <= system_words[si].span.start:
gi += 1
else:
si += 1
alignment.fill_parents()
return alignment
# Check that underlying character sequences do match
if gold_ud.characters != system_ud.characters:
index = 0
while index < min(len(gold_ud.characters), len(system_ud.characters)) and gold_ud.characters[index] == system_ud.characters[index]:
index += 1
raise UDError(
"The concatenation of tokens in gold file and in system file differ!\n" +
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
"".join(gold_ud.characters[index:index + 20]),
"".join(system_ud.characters[index:index + 20])
)
)
# Align words
alignment = align_words(gold_ud.words, system_ud.words)
# Compute the F1-scores
if check_parse:
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
"UAS": alignment_score(alignment, lambda w, parent: parent),
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
}
else:
result = {
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
"Words": alignment_score(alignment, None),
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
}
# Add WeightedLAS if weights are given
if deprel_weights is not None:
def weighted_las(word):
return deprel_weights.get(word.columns[DEPREL], 1.0)
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
return result
def load_deprel_weights(weights_file):
    if weights_file is None:
        return None
    deprel_weights = {}
    for line in weights_file:
        # Ignore comments and empty lines
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\r\n").split()
        if len(columns) != 2:
            raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
        deprel_weights[columns[0]] = float(columns[1])
    return deprel_weights
def load_conllu_file(path):
    # load_conllu reads the whole file, so the handle can be closed on return
    with open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {})) as _file:
        return load_conllu(_file)
def evaluate_wrapper(args):
    # Load CoNLL-U files
    gold_ud = load_conllu_file(args.gold_file)
    system_ud = load_conllu_file(args.system_file)
    # Load weights if requested
    deprel_weights = load_deprel_weights(args.weights)
    return evaluate(gold_ud, system_ud, deprel_weights)
def main():
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("gold_file", type=str,
help="Name of the CoNLL-U file with the gold data.")
parser.add_argument("system_file", type=str,
help="Name of the CoNLL-U file with the predicted data.")
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
metavar="deprel_weights_file",
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
parser.add_argument("--verbose", "-v", default=0, action="count",
help="Print all metrics.")
args = parser.parse_args()
# Use verbose if weights are supplied
if args.weights is not None and not args.verbose:
args.verbose = 1
# Evaluate
evaluation = evaluate_wrapper(args)
# Print the evaluation
if not args.verbose:
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
else:
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
if args.weights is not None:
metrics.append("WeightedLAS")
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
print("-----------+-----------+-----------+-----------+-----------")
for metric in metrics:
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
metric,
100 * evaluation[metric].precision,
100 * evaluation[metric].recall,
100 * evaluation[metric].f1,
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
))
if __name__ == "__main__":
main()
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
class TestAlignment(unittest.TestCase):
@staticmethod
def _load_words(words):
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
lines, num_words = [], 0
for w in words:
parts = w.split(" ")
if len(parts) == 1:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
else:
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
for part in parts[1:]:
num_words += 1
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
def _test_exception(self, gold, system):
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
def _test_ok(self, gold, system, correct):
metrics = evaluate(self._load_words(gold), self._load_words(system))
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
def test_exception(self):
self._test_exception(["a"], ["b"])
def test_equal(self):
self._test_ok(["a"], ["a"], 1)
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
def test_equal_with_multiword(self):
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
def test_alignment(self):
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)

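The span metrics reported for "Tokens" and "Sentences" above boil down to counting spans whose start and end both match. The following is a minimal sketch of that idea, not the scorer's actual code path: `span_prf` is a hypothetical helper, spans are assumed to be plain `(start, end)` tuples instead of `UDSpan` objects, and a set intersection stands in for the two-pointer walk in `spans_score` (equivalent only because spans within one file do not overlap).

```python
def span_prf(gold_spans, system_spans):
    """Precision, recall and F1 over exact (start, end) span matches.

    Hypothetical helper for illustration; mirrors the formulas in Score.
    """
    correct = len(set(gold_spans) & set(system_spans))
    precision = correct / len(system_spans) if system_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = len(gold_spans) + len(system_spans)
    f1 = 2 * correct / total if total else 0.0
    return precision, recall, f1


# Gold tokens "a", "bc", "d" versus system tokens "a", "b", "c", "d"
gold = [(0, 1), (1, 3), (3, 4)]
system = [(0, 1), (1, 2), (2, 3), (3, 4)]
precision, recall, f1 = span_prf(gold, system)
print(precision, recall, f1)
```

Two of the four system spans match gold exactly, so precision is 2/4, recall is 2/3 and F1 is 4/7, the same figures that `test_alignment` expects for `["a", "bc", "d"]` against `["a", "b", "c", "d"]`.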

@ -1,293 +0,0 @@
import spacy
import time
import re
import plac
import operator
import datetime
from pathlib import Path
import xml.etree.ElementTree as ET
import conll17_ud_eval
from ud_train import write_conllu
from spacy.lang.lex_attrs import word_shape
from spacy.util import get_lang_class
# All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb')
ALL_LANGUAGES = ("af, ar, bg, bn, ca, cs, da, de, el, en, es, et, fa, fi, fr,"
"ga, he, hi, hr, hu, id, is, it, ja, kn, ko, lt, lv, mr, no,"
"nl, pl, pt, ro, ru, si, sk, sl, sq, sr, sv, ta, te, th, tl,"
"tr, tt, uk, ur, vi, zh")
# Non-parsing tasks that will be evaluated (works for default models)
EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats']
# Tasks that will be evaluated if check_parse=True (does not work for default models)
EVAL_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats', 'UPOS', 'XPOS', 'AllTags', 'UAS', 'LAS']
# Minimum frequency an error should have to be printed
PRINT_FREQ = 20
# Maximum number of errors printed per category
PRINT_TOTAL = 10
space_re = re.compile(r"\s+")
def load_model(modelname, add_sentencizer=False):
    """ Load a specific spaCy model """
    loading_start = time.time()
    nlp = spacy.load(modelname)
    if add_sentencizer:
        nlp.add_pipe(nlp.create_pipe('sentencizer'))
    loading_end = time.time()
    loading_time = loading_end - loading_start
    if add_sentencizer:
        return nlp, loading_time, modelname + '_sentencizer'
    return nlp, loading_time, modelname
def load_default_model_sentencizer(lang):
    """ Load a generic spaCy model and add the sentencizer for sentence tokenization"""
    loading_start = time.time()
    lang_class = get_lang_class(lang)
    nlp = lang_class()
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    loading_end = time.time()
    loading_time = loading_end - loading_start
    return nlp, loading_time, lang + "_default_" + 'sentencizer'
def split_text(text):
    return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
def get_freq_tuples(my_list, print_total_threshold):
    """ Turn a list of errors into frequency-sorted tuples thresholded by a certain total number """
    d = {}
    for token in my_list:
        d.setdefault(token, 0)
        d[token] += 1
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)[:print_total_threshold]
def _contains_blinded_text(stats_xml):
    """ Heuristic to determine whether the treebank has blinded texts or not """
    tree = ET.parse(stats_xml)
    root = tree.getroot()
    total_tokens = int(root.find('size/total/tokens').text)
    unique_forms = int(root.find('forms').get('unique'))
    # assume the corpus is largely blinded when fewer than 1% of its tokens are unique forms
    return (unique_forms / total_tokens) < 0.01
def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language):
""" Fetch the txt files for all treebanks for a given set of languages """
all_treebanks = dict()
treebank_size = dict()
for l in languages:
all_treebanks[l] = []
treebank_size[l] = 0
for treebank_dir in ud_dir.iterdir():
if treebank_dir.is_dir():
for txt_path in treebank_dir.iterdir():
if txt_path.name.endswith('-ud-' + corpus + '.txt'):
file_lang = txt_path.name.split('_')[0]
if file_lang in languages:
gold_path = treebank_dir / txt_path.name.replace('.txt', '.conllu')
stats_xml = treebank_dir / "stats.xml"
# ignore treebanks where the texts are not publicly available
if not _contains_blinded_text(stats_xml):
if not best_per_language:
all_treebanks[file_lang].append(txt_path)
# check the tokens in the gold annotation to keep only the biggest treebank per language
else:
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
gold_tokens = len(gold_ud.tokens)
if treebank_size[file_lang] < gold_tokens:
all_treebanks[file_lang] = [txt_path]
treebank_size[file_lang] = gold_tokens
return all_treebanks
def run_single_eval(nlp, loading_time, print_name, text_path, gold_ud, tmp_output_path, out_file, print_header,
check_parse, print_freq_tasks):
""" Run an evaluation of a model nlp on a certain specified treebank """
with text_path.open(mode='r', encoding='utf-8') as f:
flat_text = f.read()
# STEP 1: tokenize text
tokenization_start = time.time()
texts = split_text(flat_text)
docs = list(nlp.pipe(texts))
tokenization_end = time.time()
tokenization_time = tokenization_end - tokenization_start
# STEP 2: record stats and timings
tokens_per_s = int(len(gold_ud.tokens) / tokenization_time)
print_header_1 = ['date', 'text_path', 'gold_tokens', 'model', 'loading_time', 'tokenization_time', 'tokens_per_s']
print_string_1 = [str(datetime.date.today()), text_path.name, len(gold_ud.tokens),
print_name, "%.2f" % loading_time, "%.2f" % tokenization_time, tokens_per_s]
# STEP 3: evaluate predicted tokens and features
with tmp_output_path.open(mode="w", encoding="utf8") as tmp_out_file:
write_conllu(docs, tmp_out_file)
with tmp_output_path.open(mode="r", encoding="utf8") as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file, check_parse=check_parse)
tmp_output_path.unlink()
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud, check_parse=check_parse)
# STEP 4: format the scoring results
eval_headers = EVAL_PARSE
if not check_parse:
eval_headers = EVAL_NO_PARSE
for score_name in eval_headers:
score = scores[score_name]
print_string_1.extend(["%.2f" % score.precision,
"%.2f" % score.recall,
"%.2f" % score.f1])
print_string_1.append("-" if score.aligned_accuracy is None else "%.2f" % score.aligned_accuracy)
print_string_1.append("-" if score.undersegmented is None else "%.4f" % score.under_perc)
print_string_1.append("-" if score.oversegmented is None else "%.4f" % score.over_perc)
print_header_1.extend([score_name + '_p', score_name + '_r', score_name + '_F', score_name + '_acc',
score_name + '_under', score_name + '_over'])
if score_name in print_freq_tasks:
print_header_1.extend([score_name + '_word_under_ex', score_name + '_shape_under_ex',
score_name + '_word_over_ex', score_name + '_shape_over_ex'])
d_under_words = get_freq_tuples(score.undersegmented, PRINT_TOTAL)
d_under_shapes = get_freq_tuples([word_shape(x) for x in score.undersegmented], PRINT_TOTAL)
d_over_words = get_freq_tuples(score.oversegmented, PRINT_TOTAL)
d_over_shapes = get_freq_tuples([word_shape(x) for x in score.oversegmented], PRINT_TOTAL)
# the CSV uses ; as its separator, so mask any ; in the example output
print_string_1.append(
str({k: v for k, v in d_under_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
print_string_1.append(
str({k: v for k, v in d_under_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
print_string_1.append(
str({k: v for k, v in d_over_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
print_string_1.append(
str({k: v for k, v in d_over_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
# STEP 5: print the formatted results to CSV
if print_header:
out_file.write(';'.join(map(str, print_header_1)) + '\n')
out_file.write(';'.join(map(str, print_string_1)) + '\n')
def run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks):
""" Run an evaluation for each language with its specified models and treebanks """
print_header = True
for tb_lang, treebank_list in treebanks.items():
print()
print("Language", tb_lang)
for text_path in treebank_list:
print(" Evaluating on", text_path)
gold_path = text_path.parent / (text_path.stem + '.conllu')
print(" Gold data from ", gold_path)
# nested try blocks to ensure the code can continue with the next iteration after a failure
try:
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
for nlp, nlp_loading_time, nlp_name in models[tb_lang]:
try:
print(" Benchmarking", nlp_name)
tmp_output_path = text_path.parent / str('tmp_' + nlp_name + '.conllu')
run_single_eval(nlp, nlp_loading_time, nlp_name, text_path, gold_ud, tmp_output_path, out_file,
print_header, check_parse, print_freq_tasks)
print_header = False
except Exception as e:
print(" Ran into trouble: ", str(e))
except Exception as e:
print(" Ran into trouble: ", str(e))
@plac.annotations(
out_path=("Path to output CSV file", "positional", None, Path),
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
check_parse=("Set flag to evaluate parsing performance", "flag", "p", bool),
langs=("Enumeration of languages to evaluate (default: all)", "option", "l", str),
exclude_trained_models=("Set flag to exclude trained models", "flag", "t", bool),
exclude_multi=("Set flag to exclude the multi-language model as default baseline", "flag", "m", bool),
hide_freq=("Set flag to avoid printing out more detailed high-freq tokenization errors", "flag", "f", bool),
corpus=("Whether to run on train, dev or test", "option", "c", str),
best_per_language=("Set flag to only keep the largest treebank for each language", "flag", "b", bool)
)
def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_trained_models=False, exclude_multi=False,
hide_freq=False, corpus='train', best_per_language=False):
"""
Assemble all treebanks and models to run evaluations with.
When setting check_parse to True, the default models will not be evaluated as they don't have parsing functionality
"""
languages = [lang.strip() for lang in langs.split(",")]
print_freq_tasks = []
if not hide_freq:
print_freq_tasks = ['Tokens']
# fetch all relevant treebanks from the directory
treebanks = fetch_all_treebanks(ud_dir, languages, corpus, best_per_language)
print()
print("Loading all relevant models for", languages)
models = dict()
# multi-lang model
multi = None
if not exclude_multi and not check_parse:
multi = load_model('xx_ent_wiki_sm', add_sentencizer=True)
# initialize all models with the multi-lang model
for lang in languages:
models[lang] = [multi] if multi else []
# add default models if we don't want to evaluate parsing info
if not check_parse:
# Norwegian is 'nb' in spaCy but 'no' in the UD corpora
if lang == 'no':
models['no'].append(load_default_model_sentencizer('nb'))
else:
models[lang].append(load_default_model_sentencizer(lang))
# language-specific trained models
if not exclude_trained_models:
if 'de' in models:
models['de'].append(load_model('de_core_news_sm'))
models['de'].append(load_model('de_core_news_md'))
if 'el' in models:
models['el'].append(load_model('el_core_news_sm'))
models['el'].append(load_model('el_core_news_md'))
if 'en' in models:
models['en'].append(load_model('en_core_web_sm'))
models['en'].append(load_model('en_core_web_md'))
models['en'].append(load_model('en_core_web_lg'))
if 'es' in models:
models['es'].append(load_model('es_core_news_sm'))
models['es'].append(load_model('es_core_news_md'))
if 'fr' in models:
models['fr'].append(load_model('fr_core_news_sm'))
models['fr'].append(load_model('fr_core_news_md'))
if 'it' in models:
models['it'].append(load_model('it_core_news_sm'))
if 'nl' in models:
models['nl'].append(load_model('nl_core_news_sm'))
if 'pt' in models:
models['pt'].append(load_model('pt_core_news_sm'))
with out_path.open(mode='w', encoding='utf-8') as out_file:
run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks)
if __name__ == "__main__":
plac.call(main)

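The frequency table built by `get_freq_tuples` above can also be expressed with the standard library's `collections.Counter`. This is a sketch of an equivalent formulation, not the script's code: `top_errors` is a hypothetical name, and the sample error strings are made up for illustration.

```python
from collections import Counter


def top_errors(errors, limit):
    """Return the `limit` most frequent items as (item, count) tuples.

    Hypothetical stand-in for get_freq_tuples; most_common() sorts by
    descending count and keeps first-seen order for ties.
    """
    return Counter(errors).most_common(limit)


# Made-up segmentation errors, for illustration only
errors = ["don't", "U.S.", "don't", "e.g.", "don't", "U.S."]
result = top_errors(errors, 2)
print(result)
```

With `limit=2` the rarest item (`"e.g."`) is dropped, which matches how `PRINT_TOTAL` caps the number of errors reported per category.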

@ -1,335 +0,0 @@
# flake8: noqa
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
"""
from __future__ import unicode_literals
import plac
from pathlib import Path
import re
import sys
import srsly
import spacy
import spacy.util
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse
from spacy.util import compounding, minibatch_by_words
from spacy.syntax.nonproj import projectivize
from spacy.matcher import Matcher
# from spacy.morphology import Fused_begin, Fused_inside
from spacy import displacy
from collections import defaultdict, Counter
from timeit import default_timer as timer
Fused_begin = None
Fused_inside = None
import itertools
import random
import numpy.random
from . import conll17_ud_eval
from spacy import lang
from spacy.lang import zh
from spacy.lang import ja
from spacy.lang import ru
################
# Data reading #
################
space_re = re.compile(r"\s+")
def split_text(text):
    return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
##############
# Evaluation #
##############
def read_conllu(file_):
    docs = []
    sent = []
    doc = []
    for line in file_:
        if line.startswith("# newdoc"):
            if doc:
                docs.append(doc)
            doc = []
        elif line.startswith("#"):
            continue
        elif not line.strip():
            if sent:
                doc.append(sent)
            sent = []
        else:
            sent.append(list(line.strip().split("\t")))
            if len(sent[-1]) != 10:
                print(repr(line))
                raise ValueError
    if sent:
        doc.append(sent)
    if doc:
        docs.append(doc)
    return docs
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith(".conllu"):
docs = []
with text_loc.open(encoding="utf8") as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]
docs.append(Doc(nlp.vocab, words=words))
for name, component in nlp.pipeline:
docs = list(component.pipe(docs))
else:
with text_loc.open("r", encoding="utf8") as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open("w", encoding="utf8") as out_file:
write_conllu(docs, out_file)
with gold_loc.open("r", encoding="utf8") as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open("r", encoding="utf8") as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return docs, scores
def write_conllu(docs, file_):
merger = Matcher(docs[0].vocab)
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
for i, doc in enumerate(docs):
matches = []
if doc.is_parsed:
matches = merger(doc)
spans = [doc[start : end + 1] for _, start, end in matches]
with doc.retokenize() as retokenizer:
for span in spans:
retokenizer.merge(span)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
file_.write(_get_token_conllu(token, k, len(sent)) + "\n")
file_.write("\n")
for word in sent:
if word.head.i == word.i and word.dep_ == "ROOT":
break
else:
print("Rootless sentence!")
print(sent)
print(i)
for w in sent:
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
raise ValueError
def _get_token_conllu(token, k, sent_len):
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
n = 1
text = [token.text]
while token.nbor(n).check_morph(Fused_inside):
text.append(token.nbor(n).text)
n += 1
id_ = "%d-%d" % (k + 1, (k + n))
fields = [id_, "".join(text)] + ["_"] * 8
lines = ["\t".join(fields)]
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = k + (token.head.i - token.i) + 1
fields = [
str(k + 1),
token.text,
token.lemma_,
token.pos_,
token.tag_,
"_",
str(head),
token.dep_.lower(),
"_",
"_",
]
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
if k == 0:
fields[1] = token.norm_[0].upper() + token.norm_[1:]
else:
fields[1] = token.norm_
elif token.check_morph(Fused_inside):
fields[1] = token.norm_
elif token._.split_start is not None:
split_start = token._.split_start
split_end = token._.split_end
split_len = (split_end.i - split_start.i) + 1
n_in_split = token.i - split_start.i
subtokens = guess_fused_orths(split_start.text, [""] * split_len)
fields[1] = subtokens[n_in_split]
lines.append("\t".join(fields))
return "\n".join(lines)
def guess_fused_orths(word, ud_forms):
"""The UD data 'fused tokens' don't necessarily expand to keys that match
the form. We need orths that exactly match the string. Here we make a best
effort to divide up the word."""
if word == "".join(ud_forms):
# Happy case: we get a perfect split, with each letter accounted for.
return ud_forms
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
# Not ideal, but at least the lengths match.
output = []
remain = word
for subtoken in ud_forms:
assert len(subtoken) >= 1
output.append(remain[: len(subtoken)])
remain = remain[len(subtoken) :]
assert len(remain) == 0, (word, ud_forms, remain)
return output
else:
# Let's say word is 6 long, and there are three subtokens. The orths
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
first = word[: len(word) - (len(ud_forms) - 1)]
output = [first]
remain = word[len(first) :]
for i in range(1, len(ud_forms)):
assert remain
output.append(remain[:1])
remain = remain[1:]
assert len(remain) == 0, (word, output, remain)
return output
def print_results(name, ud_scores):
fields = {}
if ud_scores is not None:
fields.update(
{
"words": ud_scores["Words"].f1 * 100,
"sents": ud_scores["Sentences"].f1 * 100,
"tags": ud_scores["XPOS"].f1 * 100,
"uas": ud_scores["UAS"].f1 * 100,
"las": ud_scores["LAS"].f1 * 100,
}
)
else:
fields.update({"words": 0.0, "sents": 0.0, "tags": 0.0, "uas": 0.0, "las": 0.0})
tpl = "\t".join(
(name, "{las:.1f}", "{uas:.1f}", "{tags:.1f}", "{sents:.1f}", "{words:.1f}")
)
print(tpl.format(**fields))
return fields
def get_token_split_start(token):
if token.text == "":
assert token.i != 0
i = -1
while token.nbor(i).text == "":
i -= 1
return token.nbor(i)
elif (token.i + 1) < len(token.doc) and token.nbor(1).text == "":
return token
else:
return None
def get_token_split_end(token):
if (token.i + 1) == len(token.doc):
return token if token.text == "" else None
elif token.text != "" and token.nbor(1).text != "":
return None
i = 1
while (token.i + i) < len(token.doc) and token.nbor(i).text == "":
i += 1
return token.nbor(i - 1)
##################
# Initialization #
##################
def load_nlp(experiments_dir, corpus):
nlp = spacy.load(experiments_dir / corpus / "best-model")
return nlp
def initialize_pipeline(nlp, docs, golds, config, device):
nlp.add_pipe(nlp.create_pipe("parser"))
return nlp
@plac.annotations(
test_data_dir=(
"Path to Universal Dependencies test data",
"positional",
None,
Path,
),
experiment_dir=("Parent directory with output model", "positional", None, Path),
corpus=(
"UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc",
"positional",
None,
str,
),
)
def main(test_data_dir, experiment_dir, corpus):
Token.set_extension("split_start", getter=get_token_split_start)
Token.set_extension("split_end", getter=get_token_split_end)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
lang.ru.Russian.Defaults.use_pymorphy2 = False
nlp = load_nlp(experiment_dir, corpus)
treebank_code = nlp.meta["treebank"]
for section in ("test", "dev"):
if section == "dev":
section_dir = "conll17-ud-development-2017-03-19"
else:
section_dir = "conll17-ud-test-2017-05-09"
text_path = test_data_dir / "input" / section_dir / (treebank_code + ".txt")
udpipe_path = (
test_data_dir / "input" / section_dir / (treebank_code + "-udpipe.conllu")
)
gold_path = test_data_dir / "gold" / section_dir / (treebank_code + ".conllu")
header = [section, "LAS", "UAS", "TAG", "SENT", "WORD"]
print("\t".join(header))
inputs = {"gold": gold_path, "udp": udpipe_path, "raw": text_path}
for input_type in ("udp", "raw"):
input_path = inputs[input_type]
output_path = (
experiment_dir / corpus / "{section}.conllu".format(section=section)
)
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
accuracy = print_results(input_type, test_scores)
acc_path = (
experiment_dir
/ corpus
/ "{section}-accuracy.json".format(section=section)
)
srsly.write_json(acc_path, accuracy)
if __name__ == "__main__":
plac.call(main)
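
The fallback logic in `guess_fused_orths` is easiest to follow on concrete strings. The block below is a minimal standalone sketch of the same three cases (exact match, equal total length, arbitrary split). The helper name `split_fused_word` is hypothetical and used only for illustration; it is not part of the script.

```python
def split_fused_word(word, ud_forms):
    """Divide `word` into len(ud_forms) pieces whose concatenation is `word`."""
    # Case 1: the UD forms already spell out the word exactly.
    if word == "".join(ud_forms):
        return list(ud_forms)
    # Case 2: the characters differ but the lengths line up, so cut the
    # surface string at the same offsets as the UD forms.
    if len(word) == sum(len(form) for form in ud_forms):
        pieces, start = [], 0
        for form in ud_forms:
            pieces.append(word[start : start + len(form)])
            start += len(form)
        return pieces
    # Case 3: arbitrary split. Every piece after the first gets exactly one
    # character and the first piece takes whatever is left.
    n_tail = len(ud_forms) - 1
    if len(word) <= n_tail:
        raise ValueError("word is too short to split: %r" % word)
    head = word[: len(word) - n_tail]
    return [head] + list(word[len(word) - n_tail :])


print(split_fused_word("cannot", ["can", "not"]))    # ['can', 'not']
print(split_fused_word("abcd", ["xb", "cd"]))        # ['ab', 'cd']
print(split_fused_word("abcdef", ["x", "y", "z"]))   # ['abcd', 'e', 'f']
```

The last call reproduces the `[4, 1, 1]` split described in the comment inside `guess_fused_orths`. Unlike the original, this sketch raises `ValueError` rather than relying on `assert` when the word is too short.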

@ -1,570 +0,0 @@
# flake8: noqa
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
"""
from __future__ import unicode_literals
import plac
from pathlib import Path
import re
import json
import tqdm
import spacy
import spacy.util
from bin.ud import conll17_ud_eval
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse
from spacy.util import compounding, minibatch, minibatch_by_words
from spacy.syntax.nonproj import projectivize
from spacy.matcher import Matcher
from spacy import displacy
from collections import defaultdict
import random
from spacy import lang
from spacy.lang import zh
from spacy.lang import ja
try:
import torch
except ImportError:
torch = None
################
# Data reading #
################
space_re = re.compile(r"\s+")
def split_text(text):
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
def read_data(
nlp,
conllu_file,
text_file,
raw_text=True,
oracle_segments=False,
max_doc_length=None,
limit=None,
):
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
include Doc objects created using nlp.make_doc and then aligned against
the gold-standard sequences. If oracle_segments=True, include Doc objects
created from the gold-standard segments. At least one must be True."""
if not raw_text and not oracle_segments:
raise ValueError("At least one of raw_text or oracle_segments must be True")
paragraphs = split_text(text_file.read())
conllu = read_conllu(conllu_file)
# sd is spacy doc; cd is conllu doc
# cs is conllu sent, ct is conllu token
docs = []
golds = []
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
sent_annots = []
for cs in cd:
sent = defaultdict(list)
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
if "." in id_:
continue
if "-" in id_:
continue
id_ = int(id_) - 1
head = int(head) - 1 if head != "0" else id_
sent["words"].append(word)
sent["tags"].append(tag)
sent["morphology"].append(_parse_morph_string(morph))
sent["morphology"][-1].add("POS_%s" % pos)
sent["heads"].append(head)
sent["deps"].append("ROOT" if dep == "root" else dep)
sent["spaces"].append(space_after == "_")
sent["entities"] = ["-"] * len(sent["words"])
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
if oracle_segments:
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
golds.append(GoldParse(docs[-1], **sent))
assert golds[-1].morphology is not None
sent_annots.append(sent)
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
doc, gold = _make_gold(nlp, None, sent_annots)
assert gold.morphology is not None
sent_annots = []
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
if raw_text and sent_annots:
doc, gold = _make_gold(nlp, None, sent_annots)
docs.append(doc)
golds.append(gold)
if limit and len(docs) >= limit:
return docs, golds
return docs, golds
def _parse_morph_string(morph_string):
if morph_string == '_':
return set()
output = []
replacements = {'1': 'one', '2': 'two', '3': 'three'}
for feature in morph_string.split('|'):
key, value = feature.split('=')
value = replacements.get(value, value)
value = value.split(',')[0]
output.append('%s_%s' % (key, value.lower()))
return set(output)
def read_conllu(file_):
docs = []
sent = []
doc = []
for line in file_:
if line.startswith("# newdoc"):
if doc:
docs.append(doc)
doc = []
elif line.startswith("#"):
continue
elif not line.strip():
if sent:
doc.append(sent)
sent = []
else:
sent.append(list(line.strip().split("\t")))
if len(sent[-1]) != 10:
print(repr(line))
raise ValueError
if sent:
doc.append(sent)
if doc:
docs.append(doc)
return docs
def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
# Flatten the conll annotations, and adjust the head indices
flat = defaultdict(list)
sent_starts = []
for sent in sent_annots:
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"])
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]:
flat[field].extend(sent[field])
sent_starts.append(True)
sent_starts.extend([False] * (len(sent["words"]) - 1))
# Construct text if necessary
assert len(flat["words"]) == len(flat["spaces"])
if text is None:
text = "".join(
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
)
doc = nlp.make_doc(text)
flat.pop("spaces")
gold = GoldParse(doc, **flat)
gold.sent_starts = sent_starts
for i in range(len(gold.heads)):
if random.random() < drop_deps:
gold.heads[i] = None
gold.labels[i] = None
return doc, gold
#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
"""Get out the annoying 'tuples' format used by begin_training, given the
GoldParse objects."""
tuples = []
for doc, gold in zip(docs, golds):
text = doc.text
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
sents = [((ids, words, tags, heads, labels, iob), [])]
tuples.append((text, sents))
return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
if text_loc.parts[-1].endswith(".conllu"):
docs = []
with text_loc.open(encoding="utf8") as file_:
for conllu_doc in read_conllu(file_):
for conllu_sent in conllu_doc:
words = [line[1] for line in conllu_sent]
docs.append(Doc(nlp.vocab, words=words))
for name, component in nlp.pipeline:
docs = list(component.pipe(docs))
else:
with text_loc.open("r", encoding="utf8") as text_file:
texts = split_text(text_file.read())
docs = list(nlp.pipe(texts))
with sys_loc.open("w", encoding="utf8") as out_file:
write_conllu(docs, out_file)
with gold_loc.open("r", encoding="utf8") as gold_file:
gold_ud = conll17_ud_eval.load_conllu(gold_file)
with sys_loc.open("r", encoding="utf8") as sys_file:
sys_ud = conll17_ud_eval.load_conllu(sys_file)
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
return docs, scores
def write_conllu(docs, file_):
if not Token.has_extension("get_conllu_lines"):
Token.set_extension("get_conllu_lines", method=get_token_conllu)
if not Token.has_extension("begins_fused"):
Token.set_extension("begins_fused", default=False)
if not Token.has_extension("inside_fused"):
Token.set_extension("inside_fused", default=False)
merger = Matcher(docs[0].vocab)
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
for i, doc in enumerate(docs):
matches = []
if doc.is_parsed:
matches = merger(doc)
spans = [doc[start : end + 1] for _, start, end in matches]
seen_tokens = set()
with doc.retokenize() as retokenizer:
for span in spans:
span_tokens = set(range(span.start, span.end))
if not span_tokens.intersection(seen_tokens):
retokenizer.merge(span)
seen_tokens.update(span_tokens)
file_.write("# newdoc id = {i}\n".format(i=i))
for j, sent in enumerate(doc.sents):
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
file_.write("# text = {text}\n".format(text=sent.text))
for k, token in enumerate(sent):
if token.head.i > sent[-1].i or token.head.i < sent[0].i:
for word in doc[sent[0].i - 10 : sent[0].i]:
print(word.i, word.head.i, word.text, word.dep_)
for word in sent:
print(word.i, word.head.i, word.text, word.dep_)
for word in doc[sent[-1].i : sent[-1].i + 10]:
print(word.i, word.head.i, word.text, word.dep_)
raise ValueError(
"Invalid parse: head outside sentence (%s)" % token.text
)
file_.write(token._.get_conllu_lines(k) + "\n")
file_.write("\n")
def print_progress(itn, losses, ud_scores):
fields = {
"dep_loss": losses.get("parser", 0.0),
"morph_loss": losses.get("morphologizer", 0.0),
"tag_loss": losses.get("tagger", 0.0),
"words": ud_scores["Words"].f1 * 100,
"sents": ud_scores["Sentences"].f1 * 100,
"tags": ud_scores["XPOS"].f1 * 100,
"uas": ud_scores["UAS"].f1 * 100,
"las": ud_scores["LAS"].f1 * 100,
"morph": ud_scores["Feats"].f1 * 100,
}
header = ["Epoch", "P.Loss", "M.Loss", "LAS", "UAS", "TAG", "MORPH", "SENT", "WORD"]
if itn == 0:
print("\t".join(header))
tpl = "\t".join((
"{:d}",
"{dep_loss:.1f}",
"{morph_loss:.1f}",
"{las:.1f}",
"{uas:.1f}",
"{tags:.1f}",
"{morph:.1f}",
"{sents:.1f}",
"{words:.1f}",
))
print(tpl.format(itn, **fields))
# def get_sent_conllu(sent, sent_id):
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
def get_token_conllu(token, i):
if token._.begins_fused:
n = 1
while token.nbor(n)._.inside_fused:
n += 1
# CoNLL-U IDs are 1-based; the range line must be one tab-separated string.
id_ = "%d-%d" % (i + 1, i + n)
lines = ["\t".join([id_, token.text] + ["_"] * 8)]
else:
lines = []
if token.head.i == token.i:
head = 0
else:
head = i + (token.head.i - token.i) + 1
features = list(token.morph)
feat_str = []
replacements = {"one": "1", "two": "2", "three": "3"}
for feat in features:
if not feat.startswith("begin") and not feat.startswith("end"):
key, value = feat.split("_", 1)
value = replacements.get(value, value)
feat_str.append("%s=%s" % (key, value.title()))
if not feat_str:
feat_str = "_"
else:
feat_str = "|".join(feat_str)
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, feat_str,
str(head), token.dep_.lower(), "_", "_"]
lines.append("\t".join(fields))
return "\n".join(lines)
##################
# Initialization #
##################
def load_nlp(corpus, config, vectors=None):
lang = corpus.split("_")[0]
nlp = spacy.blank(lang)
if config.vectors:
if not vectors:
raise ValueError(
"config asks for vectors, but no vectors "
"directory set on command line (use -v)"
)
if (Path(vectors) / corpus).exists():
nlp.vocab.from_disk(Path(vectors) / corpus / "vocab")
nlp.meta["treebank"] = corpus
return nlp
def initialize_pipeline(nlp, docs, golds, config, device):
nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False}))
nlp.add_pipe(nlp.create_pipe("morphologizer"))
nlp.add_pipe(nlp.create_pipe("parser"))
if config.multitask_tag:
nlp.parser.add_multitask_objective("tag")
if config.multitask_sent:
nlp.parser.add_multitask_objective("sent_start")
for gold in golds:
for tag in gold.tags:
if tag is not None:
nlp.tagger.add_label(tag)
if torch is not None and device != -1:
torch.set_default_tensor_type("torch.cuda.FloatTensor")
optimizer = nlp.begin_training(
lambda: golds_to_gold_tuples(docs, golds),
device=device,
subword_features=config.subword_features,
conv_depth=config.conv_depth,
bilstm_depth=config.bilstm_depth,
)
if config.pretrained_tok2vec:
_load_pretrained_tok2vec(nlp, config.pretrained_tok2vec)
return optimizer
def _load_pretrained_tok2vec(nlp, loc):
"""Load pretrained weights for the 'token-to-vector' part of the component
models, which is typically a CNN. See 'spacy pretrain'. Experimental.
"""
with Path(loc).open("rb") as file_:
weights_data = file_.read()
loaded = []
for name, component in nlp.pipeline:
if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
component.tok2vec.from_bytes(weights_data)
loaded.append(name)
return loaded
########################
# Command line helpers #
########################
class Config(object):
def __init__(
self,
vectors=None,
max_doc_length=10,
multitask_tag=False,
multitask_sent=False,
multitask_dep=False,
multitask_vectors=None,
bilstm_depth=0,
nr_epoch=30,
min_batch_size=100,
max_batch_size=1000,
batch_by_words=True,
dropout=0.2,
conv_depth=4,
subword_features=True,
vectors_dir=None,
pretrained_tok2vec=None,
):
if vectors_dir is not None:
if vectors is None:
vectors = True
if multitask_vectors is None:
multitask_vectors = True
for key, value in locals().items():
setattr(self, key, value)
@classmethod
def load(cls, loc, vectors_dir=None):
with Path(loc).open("r", encoding="utf8") as file_:
cfg = json.load(file_)
if vectors_dir is not None:
cfg["vectors_dir"] = vectors_dir
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith("conllu"):
self.conllu = file_path
elif section in name and name.endswith("txt"):
self.text = file_path
if self.conllu is None:
msg = "Could not find .conllu file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, "train")
self.dev = Dataset(ud_path / treebank, "dev")
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
parses_dir=("Directory to write the development parses", "positional", None, Path),
corpus=(
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
"positional",
None,
str,
),
config=("Path to json formatted config file", "option", "C", Path),
limit=("Size limit", "option", "n", int),
gpu_device=("Use GPU", "option", "g", int),
use_oracle_segments=("Use oracle segments", "flag", "G", int),
vectors_dir=(
"Path to directory with pretrained vectors, named e.g. en/",
"option",
"v",
Path,
),
)
def main(
ud_dir,
parses_dir,
corpus,
config=None,
limit=0,
gpu_device=-1,
vectors_dir=None,
use_oracle_segments=False,
):
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
spacy.util.fix_random_seed()
lang.zh.Chinese.Defaults.use_jieba = False
lang.ja.Japanese.Defaults.use_janome = False
if config is not None:
config = Config.load(config, vectors_dir=vectors_dir)
else:
config = Config(vectors_dir=vectors_dir)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
docs, golds = read_data(
nlp,
paths.train.conllu.open(encoding="utf8"),
paths.train.text.open(encoding="utf8"),
max_doc_length=config.max_doc_length,
limit=limit,
)
optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device)
batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001)
beam_prob = compounding(0.2, 0.8, 1.001)
for i in range(config.nr_epoch):
docs, golds = read_data(
nlp,
paths.train.conllu.open(encoding="utf8"),
paths.train.text.open(encoding="utf8"),
max_doc_length=config.max_doc_length,
limit=limit,
oracle_segments=use_oracle_segments,
raw_text=not use_oracle_segments,
)
Xs = list(zip(docs, golds))
random.shuffle(Xs)
if config.batch_by_words:
batches = minibatch_by_words(Xs, size=batch_sizes)
else:
batches = minibatch(Xs, size=batch_sizes)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
nlp.update(
batch_docs,
batch_gold,
sgd=optimizer,
drop=config.dropout,
losses=losses,
)
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
with nlp.use_params(optimizer.averages):
if use_oracle_segments:
parsed_docs, scores = evaluate(nlp, paths.dev.conllu,
paths.dev.conllu, out_path)
else:
parsed_docs, scores = evaluate(nlp, paths.dev.text,
paths.dev.conllu, out_path)
print_progress(i, losses, scores)
def _render_parses(i, to_render):
to_render[0].user_data["title"] = "Batch %d" % i
with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
html = displacy.render(to_render[:5], style="dep", page=True)
file_.write(html)
if __name__ == "__main__":
plac.call(main)
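
The script encodes CoNLL-U morphological features in `_parse_morph_string` and decodes them again in `get_token_conllu`. The relationship between the two is not obvious from reading them separately, so here is a minimal standalone sketch of the round trip. The names `encode_feats` and `decode_feats` are hypothetical, and sorting the decoded features is an addition made here so the output is deterministic.

```python
# Digits are spelled out in the internal form and restored on output.
TO_NAME = {"1": "one", "2": "two", "3": "three"}
TO_DIGIT = {name: digit for digit, name in TO_NAME.items()}


def encode_feats(morph_string):
    """'Number=Sing|Person=3' -> {'Number_sing', 'Person_three'}"""
    if morph_string == "_":
        return set()
    encoded = set()
    for feature in morph_string.split("|"):
        key, value = feature.split("=")
        # Only the first value of a multi-valued feature is kept.
        value = TO_NAME.get(value, value).split(",")[0]
        encoded.add("%s_%s" % (key, value.lower()))
    return encoded


def decode_feats(encoded):
    """{'Number_sing', 'Person_three'} -> 'Number=Sing|Person=3'"""
    parts = []
    for feat in sorted(encoded):
        key, value = feat.split("_", 1)
        value = TO_DIGIT.get(value, value)
        parts.append("%s=%s" % (key, value.title()))
    return "|".join(parts) if parts else "_"


print(decode_feats(encode_feats("Number=Sing|Person=3")))  # Number=Sing|Person=3
print(encode_feats("Case=Acc,Dat"))                        # {'Case_acc'}
```

The second call shows that the encoding is lossy: a multi-valued feature such as `Case=Acc,Dat` comes back as `Case=Acc`.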

@ -1,19 +0,0 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# spaCy examples
The examples are Python scripts with well-behaved command line interfaces. For
more detailed usage guides, see the [documentation](https://spacy.io/usage/).
To see the available arguments, you can use the `--help` or `-h` flag:
```bash
$ python examples/training/train_ner.py --help
```
While we try to keep the examples up to date, they are not currently exercised
by the test suite, as some of them require significant data downloads or take
time to train. If you find that an example is no longer running,
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
nothing worse than trying to figure out what you're doing wrong, only to find
that your code was never the problem.

@ -1,267 +0,0 @@
"""
This example shows how to use an LSTM sentiment classification model trained
using Keras in spaCy. spaCy splits the document into sentences, and each
sentence is classified using the LSTM. The scores for the sentences are then
aggregated to give the document score. This kind of hierarchical model is quite
difficult in "pure" Keras or TensorFlow, but it's very effective. The Keras
example on this dataset performs quite poorly, because it cuts off the documents
so that they're a fixed size. This hurts review accuracy a lot, because people
often summarise their rating in the final sentence.
Prerequisites:
spacy download en_vectors_web_lg
pip install keras==2.0.9
Compatible with: spaCy v2.0.0+
"""
import plac
import random
import pathlib
import cytoolz
import numpy
from keras.models import Sequential, model_from_json
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.layers import TimeDistributed
from keras.optimizers import Adam
import thinc.extra.datasets
from spacy.compat import pickle
import spacy
class SentimentAnalyser(object):
@classmethod
def load(cls, path, nlp, max_length=100):
with (path / "config.json").open() as file_:
model = model_from_json(file_.read())
with (path / "model").open("rb") as file_:
lstm_weights = pickle.load(file_)
embeddings = get_embeddings(nlp.vocab)
model.set_weights([embeddings] + lstm_weights)
return cls(model, max_length=max_length)
def __init__(self, model, max_length=100):
self._model = model
self.max_length = max_length
def __call__(self, doc):
X = get_features([doc], self.max_length)
y = self._model.predict(X)
self.set_sentiment(doc, y)
def pipe(self, docs, batch_size=1000):
for minibatch in cytoolz.partition_all(batch_size, docs):
minibatch = list(minibatch)
sentences = []
for doc in minibatch:
sentences.extend(doc.sents)
Xs = get_features(sentences, self.max_length)
ys = self._model.predict(Xs)
for sent, label in zip(sentences, ys):
sent.doc.sentiment += label - 0.5
for doc in minibatch:
yield doc
def set_sentiment(self, doc, y):
doc.sentiment = float(y[0])
# Sentiment has a native slot for a single float.
# For arbitrary data storage, there's:
# doc.user_data['my_data'] = y
def get_labelled_sentences(docs, doc_labels):
labels = []
sentences = []
for doc, y in zip(docs, doc_labels):
for sent in doc.sents:
sentences.append(sent)
labels.append(y)
return sentences, numpy.asarray(labels, dtype="int32")
def get_features(docs, max_length):
docs = list(docs)
Xs = numpy.zeros((len(docs), max_length), dtype="int32")
for i, doc in enumerate(docs):
j = 0
for token in doc:
vector_id = token.vocab.vectors.find(key=token.orth)
if vector_id >= 0:
Xs[i, j] = vector_id
else:
Xs[i, j] = 0
j += 1
if j >= max_length:
break
return Xs
def train(
train_texts,
train_labels,
dev_texts,
dev_labels,
lstm_shape,
lstm_settings,
lstm_optimizer,
batch_size=100,
nb_epoch=5,
by_sentence=True,
):
print("Loading spaCy")
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
embeddings = get_embeddings(nlp.vocab)
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
print("Parsing texts...")
train_docs = list(nlp.pipe(train_texts))
dev_docs = list(nlp.pipe(dev_texts))
if by_sentence:
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
train_X = get_features(train_docs, lstm_shape["max_length"])
dev_X = get_features(dev_docs, lstm_shape["max_length"])
model.fit(
train_X,
train_labels,
validation_data=(dev_X, dev_labels),
epochs=nb_epoch,
batch_size=batch_size,
)
return model
def compile_lstm(embeddings, shape, settings):
model = Sequential()
model.add(
Embedding(
embeddings.shape[0],
embeddings.shape[1],
input_length=shape["max_length"],
trainable=False,
weights=[embeddings],
mask_zero=True,
)
)
model.add(TimeDistributed(Dense(shape["nr_hidden"], use_bias=False)))
model.add(
Bidirectional(
LSTM(
shape["nr_hidden"],
recurrent_dropout=settings["dropout"],
dropout=settings["dropout"],
)
)
)
model.add(Dense(shape["nr_class"], activation="sigmoid"))
model.compile(
optimizer=Adam(lr=settings["lr"]),
loss="binary_crossentropy",
metrics=["accuracy"],
)
return model
def get_embeddings(vocab):
return vocab.vectors.data
def evaluate(model_dir, texts, labels, max_length=100):
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
correct = 0
i = 0
for doc in nlp.pipe(texts, batch_size=1000):
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
i += 1
return float(correct) / i
def read_data(data_dir, limit=0):
examples = []
for subdir, label in (("pos", 1), ("neg", 0)):
for filename in (data_dir / subdir).iterdir():
with filename.open() as file_:
text = file_.read()
examples.append((text, label))
random.shuffle(examples)
if limit >= 1:
examples = examples[:limit]
return zip(*examples) # Unzips into a (texts, labels) pair of tuples
@plac.annotations(
train_dir=("Location of training file or directory"),
dev_dir=("Location of development file or directory"),
model_dir=("Location of output model directory",),
is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
nr_hidden=("Number of hidden units", "option", "H", int),
max_length=("Maximum sentence length", "option", "L", int),
dropout=("Dropout", "option", "d", float),
learn_rate=("Learn rate", "option", "e", float),
nb_epoch=("Number of training epochs", "option", "i", int),
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
nr_examples=("Limit to N examples", "option", "n", int),
)
def main(
model_dir=None,
train_dir=None,
dev_dir=None,
is_runtime=False,
nr_hidden=64,
max_length=100, # Shape
dropout=0.5,
learn_rate=0.001, # General NN config
nb_epoch=5,
batch_size=256,
nr_examples=-1,
): # Training params
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if train_dir is None or dev_dir is None:
imdb_data = thinc.extra.datasets.imdb()
if is_runtime:
if dev_dir is None:
dev_texts, dev_labels = zip(*imdb_data[1])
else:
dev_texts, dev_labels = read_data(dev_dir)
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
print(acc)
else:
if train_dir is None:
train_texts, train_labels = zip(*imdb_data[0])
else:
print("Read data")
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
if dev_dir is None:
dev_texts, dev_labels = zip(*imdb_data[1])
else:
dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
train_labels = numpy.asarray(train_labels, dtype="int32")
dev_labels = numpy.asarray(dev_labels, dtype="int32")
lstm = train(
train_texts,
train_labels,
dev_texts,
dev_labels,
{"nr_hidden": nr_hidden, "max_length": max_length, "nr_class": 1},
{"dropout": dropout, "lr": learn_rate},
{},
nb_epoch=nb_epoch,
batch_size=batch_size,
)
weights = lstm.get_weights()
if model_dir is not None:
with (model_dir / "model").open("wb") as file_:
pickle.dump(weights[1:], file_)
with (model_dir / "config.json").open("w") as file_:
file_.write(lstm.to_json())
if __name__ == "__main__":
plac.call(main)
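
Two details of this example are easy to miss. `get_features` truncates or zero-pads every sentence to `max_length` and maps unknown words to row 0, and `SentimentAnalyser.pipe` adds `score - 0.5` to `doc.sentiment` for each sentence. The following is a pure-Python sketch of both steps. It uses plain lists instead of NumPy arrays, and the helper names `to_fixed_length` and `aggregate_sentence_scores` are hypothetical.

```python
def to_fixed_length(vector_ids, max_length):
    """Pad with 0 or truncate so that every row has exactly max_length ids."""
    row = [0] * max_length
    for j, vector_id in enumerate(vector_ids[:max_length]):
        # A negative id means the word has no vector; it shares row 0 with padding.
        row[j] = vector_id if vector_id >= 0 else 0
    return row


def aggregate_sentence_scores(sentence_scores):
    """Sum the per-sentence scores after centring each one on 0."""
    total = 0.0
    for score in sentence_scores:
        total += score - 0.5
    return total


print(to_fixed_length([5, -1, 7], 5))   # [5, 0, 7, 0, 0]
print(to_fixed_length([1, 2, 3, 4], 2)) # [1, 2]
# Two positive sentences and one negative one give a positive document score.
print(aggregate_sentence_scores([0.9, 0.8, 0.2]) > 0)  # True
```

Because each sentence score is centred on 0 before summing, the aggregated value is positive when positive sentences outweigh negative ones, and a document with many sentences can have a score well outside the range 0 to 1.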

@ -1,82 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""A simple example of extracting relations between phrases and entities using
spaCy's named entity recognizer and the dependency parse. Here, we extract
money and currency values (entities labelled as MONEY) and then check the
dependency tree to find the noun phrase they are referring to, for example:
$9.4 million --> Net income.
Compatible with: spaCy v2.0.0+
Last tested with: v2.2.1
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
TEXTS = [
"Net income was $9.4 million compared to the prior year of $2.7 million.",
"Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
@plac.annotations(
model=("Model to load (needs parser and NER)", "positional", None, str)
)
def main(model="en_core_web_sm"):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
print("Processing %d texts" % len(TEXTS))
for text in TEXTS:
doc = nlp(text)
relations = extract_currency_relations(doc)
for r1, r2 in relations:
print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))
def filter_spans(spans):
# Filter a sequence of spans so they don't contain overlaps
# For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
get_sort_key = lambda span: (span.end - span.start, -span.start)
sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
result = []
seen_tokens = set()
for span in sorted_spans:
# Check span.end - 1 because span.end is exclusive: the last token is at end - 1
if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
result.append(span)
seen_tokens.update(range(span.start, span.end))
result = sorted(result, key=lambda span: span.start)
return result
def extract_currency_relations(doc):
# Merge entities and noun chunks into one token
spans = list(doc.ents) + list(doc.noun_chunks)
spans = filter_spans(spans)
with doc.retokenize() as retokenizer:
for span in spans:
retokenizer.merge(span)
relations = []
for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
if money.dep_ in ("attr", "dobj"):
subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
if subject:
subject = subject[0]
relations.append((subject, money))
elif money.dep_ == "pobj" and money.head.dep_ == "prep":
relations.append((money.head.head, money))
return relations
if __name__ == "__main__":
plac.call(main)
# Expected output:
# Net income MONEY $9.4 million
# the prior year MONEY $2.7 million
# Revenue MONEY twelve billion dollars
# a loss MONEY 1b
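
The overlap filtering in `filter_spans` does not depend on spaCy objects, so it can be tried on plain `(start, end)` offsets. This is a minimal sketch under that assumption: the helper name `filter_offsets` is hypothetical, and `end` is exclusive, as with spaCy spans.

```python
def filter_offsets(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest spans first; among equal lengths, the earlier start wins.
    def sort_key(span):
        start, end = span
        return (end - start, -start)

    result = []
    seen_tokens = set()
    for start, end in sorted(spans, key=sort_key, reverse=True):
        # Every accepted span is at least as long as this one, so any overlap
        # must cover this span's first or last token.
        if start not in seen_tokens and end - 1 not in seen_tokens:
            result.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(result)


print(filter_offsets([(0, 2), (1, 4), (5, 6)]))  # [(1, 4), (5, 6)]
print(filter_offsets([(0, 2), (1, 3)]))          # [(0, 2)]
```

Checking only the two end tokens is sufficient because spans are visited from longest to shortest: a shorter span cannot contain an already accepted longer one, so any overlap must touch one of its own ends.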

@ -1,67 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""This example shows how to navigate the parse tree including subtrees
attached to a word.
Based on issue #252:
"In the documents and tutorials the main thing I haven't found is
examples on how to break sentences down into small sub thoughts/chunks. The
noun_chunks is handy, but having examples on using the token.head to find small
(near-complete) sentence chunks would be neat. Lets take the example sentence:
"displaCy uses CSS and JavaScript to show you how computers understand language"
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
[displaCy] uses CSS and Javascript [to + show]
show you how computers understand [language]
I'm assuming that we can use the token.head to build these groups."
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function
import plac
import spacy
@plac.annotations(model=("Model to load", "positional", None, str))
def main(model="en_core_web_sm"):
nlp = spacy.load(model)
print("Loaded model '%s'" % model)
doc = nlp(
"displaCy uses CSS and JavaScript to show you how computers "
"understand language"
)
# The easiest way is to find the head of the subtree you want, and then use
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
# is the one that does what you're asking for most directly:
for word in doc:
if word.dep_ in ("xcomp", "ccomp"):
print("".join(w.text_with_ws for w in word.subtree))
# It'd probably be better for `word.subtree` to return a `Span` object
# instead of a generator over the tokens. If you want the `Span` you can
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
# object is nice because you can easily get a vector, merge it, etc.
for word in doc:
if word.dep_ in ("xcomp", "ccomp"):
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
print(subtree_span.text, "|", subtree_span.root.text)
# You might also want to select a head, and then select a start and end
# position by walking along its children. You could then take the
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
# a span.
if __name__ == "__main__":
plac.call(main)
# Expected output:
# to show you how computers understand language
# how computers understand language
# to show you how computers understand language | show
# how computers understand language | understand


@ -1,112 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Match a large set of multi-word expressions in O(1) time.
The idea is to associate each word in the vocabulary with a tag, noting whether
they begin, end, or are inside at least one pattern. An additional tag is used
for single-word patterns. Complete patterns are also stored in a hash set.
When we process a document, we look up the words in the vocabulary, to
associate the words with the tags. We then search for tag-sequences that
correspond to valid candidates. Finally, we look up the candidates in the hash
set.
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
the I tag, and Obama and Clinton with the L tag.
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
candidate is in the phrase dictionary, so only one is returned as a match.
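As a rough illustration of the tagging scheme described above, here is a
simplified pure-Python sketch. It is not how spaCy's PhraseMatcher is
implemented: it splits on whitespace instead of using a tokenizer, the helper
names build_index and find_matches are made up for this example, and a U tag
is assumed for the "additional tag" used by single-word patterns.

```python
from collections import defaultdict


def build_index(phrases):
    """Tag each word as B (begin), I (inside), L (last) or U (single word)."""
    tags = defaultdict(set)
    phrase_set = set()
    for phrase in phrases:
        words = tuple(phrase.split())
        phrase_set.add(words)
        if len(words) == 1:
            tags[words[0]].add("U")
            continue
        tags[words[0]].add("B")
        for word in words[1:-1]:
            tags[word].add("I")
        tags[words[-1]].add("L")
    return tags, phrase_set


def find_matches(text, tags, phrase_set):
    """Find B I* L tag sequences, then confirm each candidate in the set."""
    words = text.split()
    word_tags = [tags.get(word, set()) for word in words]
    matches = []
    for start, start_tags in enumerate(word_tags):
        if "U" in start_tags and (words[start],) in phrase_set:
            matches.append(words[start])
        if "B" not in start_tags:
            continue
        for end in range(start + 1, len(words)):
            if "L" in word_tags[end]:
                candidate = tuple(words[start : end + 1])
                if candidate in phrase_set:
                    matches.append(" ".join(candidate))
            if "I" not in word_tags[end]:
                break
    return matches


tags, phrase_set = build_index(["Barack Hussein Obama", "Hilary Clinton"])
found = find_matches("Barack Clinton and Hilary Clinton", tags, phrase_set)
print(found)  # ['Hilary Clinton']
```

"Barack Clinton" produces a valid B, L tag sequence, but the hash-set lookup
rejects it, which is the filtering step the paragraph above describes.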
The algorithm is O(n) at run-time for a document of length n because we're only
ever matching over the tag patterns. So no matter how many phrases we're
looking for, our pattern set stays very small (exact size depends on the
maximum length we're looking for, as the query language currently has no
quantifiers).
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
formatted in jsonl as a sequence of entries like this:
{"text":"Anchorage"}
{"text":"Angola"}
{"text":"Ann Arbor"}
{"text":"Annapolis"}
{"text":"Appalachia"}
{"text":"Argentina"}
Reddit comments corpus:
* https://files.pushshift.io/reddit/
* https://archive.org/details/2015_reddit_comments_corpus
Compatible with: spaCy v2.0.0+
"""
from __future__ import print_function, unicode_literals, division
from bz2 import BZ2File
import time
import plac
import json
from spacy.matcher import PhraseMatcher
import spacy
@plac.annotations(
patterns_loc=("Path to gazetteer", "positional", None, str),
text_loc=("Path to Reddit corpus file", "positional", None, str),
n=("Number of texts to read", "option", "n", int),
lang=("Language class to initialise", "option", "l", str),
)
def main(patterns_loc, text_loc, n=10000, lang="en"):
nlp = spacy.blank(lang)
nlp.vocab.lex_attr_getters = {}
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
count = 0
t1 = time.time()
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
count += 1
t2 = time.time()
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
def read_gazetteer(tokenizer, loc, n=-1):
    for i, line in enumerate(open(loc)):
        data = json.loads(line.strip())
        phrase = tokenizer(data["text"])
        for w in phrase:
            _ = tokenizer.vocab[w.text]
        if len(phrase) >= 2:
            yield phrase
def read_text(bz2_loc, n=10000):
    with BZ2File(bz2_loc) as file_:
        for i, line in enumerate(file_):
            data = json.loads(line)
            yield data["body"]
            if i >= n:
                break
def get_matches(tokenizer, phrases, texts):
    matcher = PhraseMatcher(tokenizer.vocab)
    matcher.add("Phrase", None, *phrases)
    for text in texts:
        doc = tokenizer(text)
        for w in doc:
            _ = doc.vocab[w.text]
        matches = matcher(doc)
        for ent_id, start, end in matches:
            yield (ent_id, doc[start:end].text)
if __name__ == "__main__":
if False:
import cProfile
import pstats
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
else:
plac.call(main)


@ -1,114 +0,0 @@
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# A decomposable attention model for Natural Language Inference
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
This directory contains an implementation of the entailment prediction model described
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
for its competitive performance with very few parameters.
The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io).
Keras is used to build and train the network. spaCy is used to load
the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the
feature extraction, and help you apply the model at run-time. The following
demo code shows how the entailment model can be used at runtime, once the
hook is installed to customise the `.similarity()` method of spaCy's `Doc`
and `Span` objects:
```python
def demo(shape):
    nlp = spacy.load('en_vectors_web_lg')
    nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
    doc1 = nlp(u'The king of France is bald.')
    doc2 = nlp(u'France has no king.')
    print("Sentence 1:", doc1)
    print("Sentence 2:", doc2)
    entailment_type, confidence = doc1.similarity(doc2)
    print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
```
Which gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
the system has definite opinions about Bertrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
I'm working on a blog post to explain Parikh et al.'s model in more detail.
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
I think it is a very interesting example of the attention mechanism, which
I didn't understand very well before working through this paper. There are
lots of ways to extend the model.
## What's where
| File | Description |
| --- | --- |
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
| `keras_decomposable_attention.py` | Defines the neural network model. |
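
To make the hook pattern from the `spacy_hook.py` row more concrete, here's a minimal sketch that runs without spaCy or Keras. `StandInDoc` is a hypothetical stand-in that only mimics the way a `similarity` entry in `user_hooks` takes over the `.similarity()` call, and `word_overlap` is a toy function playing the role of `your_model`. Neither is part of this example's code.

```python
class StandInDoc(object):
    """Hypothetical stand-in for a spaCy Doc: text, user_hooks, similarity()."""

    def __init__(self, text):
        self.text = text
        self.user_hooks = {}

    def similarity(self, other):
        hook = self.user_hooks.get("similarity")
        if hook is None:
            raise NotImplementedError("no similarity hook installed")
        return hook(self, other)


class SimilarityShim(object):
    """Pipeline-style component that installs an arbitrary function as the hook."""

    def __init__(self, model):
        self.model = model

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.predict
        return doc

    def predict(self, doc1, doc2):
        return self.model(doc1, doc2)


def word_overlap(doc1, doc2):
    """Toy 'model': Jaccard overlap of lowercased whitespace tokens."""
    a = set(doc1.text.lower().split())
    b = set(doc2.text.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shim = SimilarityShim(word_overlap)
doc1 = shim(StandInDoc("France has a king"))
doc2 = shim(StandInDoc("France has no king"))
print(doc1.similarity(doc2))  # 0.6
```

The real `KerasSimilarityShim` follows the same shape, except that `predict` extracts word IDs from both docs and calls the trained Keras model.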
## Setting up
First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy
English models (about 1GB of data):
```bash
pip install keras
pip install spacy
python -m spacy download en_vectors_web_lg
```
You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
This will depend on your setup, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
Once you've installed the dependencies, you can run a small preliminary test of
the Keras model:
```bash
py.test keras_parikh_entailment/keras_decomposable_attention.py
```
This compiles the model and fits it with some dummy data. You should see that
both tests passed.
Finally, download the [Stanford Natural Language Inference corpus](http://nlp.stanford.edu/projects/snli/).
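
If you want to check the download before training, here's a minimal sketch of reading the SNLI JSONL format with only the standard library. It uses the same `gold_label`, `sentence1` and `sentence2` fields as `__main__.py` and skips entries labelled `-`. The function name `read_snli_lines` and the three sample records are made up for illustration, and the labels are returned as plain integers rather than one-hot arrays.

```python
import io
import json

LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}


def read_snli_lines(lines):
    """Collect sentence pairs and integer labels, skipping '-' entries."""
    texts1, texts2, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        label = eg["gold_label"]
        if label == "-":  # no gold label for this pair
            continue
        texts1.append(eg["sentence1"])
        texts2.append(eg["sentence2"])
        labels.append(LABELS[label])
    return texts1, texts2, labels


# Made-up records in the SNLI field layout, standing in for a real file.
sample_records = [
    {"gold_label": "entailment", "sentence1": "A man plays a guitar.",
     "sentence2": "A person makes music."},
    {"gold_label": "-", "sentence1": "A dog runs.",
     "sentence2": "An animal is outside."},
    {"gold_label": "contradiction", "sentence1": "A woman is asleep.",
     "sentence2": "A woman is running."},
]
sample_file = io.StringIO("\n".join(json.dumps(r) for r in sample_records))
texts1, texts2, labels = read_snli_lines(sample_file)
print(len(texts1), labels)  # 2 [0, 1]
```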
## Running the example
You can run the `keras_parikh_entailment/` directory as a script, which executes the file
[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments
the usage is shown. Running it with `-h` explains the command line arguments.
The first thing you'll want to do is train the model:
```bash
python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
```
Training takes about 300 epochs for full accuracy, and I haven't rerun the full
experiment since refactoring things to publish this example — please let me
know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs.
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
by `.fit()` methods. I never really feel confident until I've run a new process that loads
the model and starts making predictions, without access to the gold labels. I've therefore
included an `evaluate` mode.
```bash
python keras_parikh_entailment/ evaluate -s <path to SNLI dev JSON>
```
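
The scoring in `evaluate` mode boils down to comparing the highest-scoring class with the gold label. Here's a small sketch of that comparison, assuming the label order `entailment`, `contradiction`, `neutral` used in `spacy_hook.py`. The helper names and the scores below are made up, and plain lists stand in for the NumPy arrays the real script works with.

```python
ENTAILMENT_TYPES = ["entailment", "contradiction", "neutral"]


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def count_correct(predicted_scores, gold_one_hot):
    """Return (correct, total) for per-class scores against one-hot gold labels."""
    correct = 0
    total = 0
    for scores, gold in zip(predicted_scores, gold_one_hot):
        predicted = ENTAILMENT_TYPES[argmax(scores)]
        expected = ENTAILMENT_TYPES[argmax(gold)]
        if predicted == expected:
            correct += 1
        total += 1
    return correct, total


# Made-up scores and labels, only to exercise the function.
scores = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2]]
gold = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
correct, total = count_correct(scores, gold)
print(correct, "/", total, correct / total)
```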
Finally, there's also a little demo, which mostly exists to show
you how run-time usage will eventually look.
```bash
python keras_parikh_entailment/ demo
```
## Getting updates
We should have the blog post explaining the model ready before the end of the week. To get
notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal)
or subscribe to our [mailing list](http://eepurl.com/ckUpQ5).


@ -1,207 +0,0 @@
import numpy as np
import json
from keras.utils import to_categorical
import plac
import sys
from keras_decomposable_attention import build_model
from spacy_hook import get_embeddings, KerasSimilarityShim
try:
import cPickle as pickle
except ImportError:
import pickle
import spacy
# workaround for keras/tensorflow bug
# see https://github.com/tensorflow/tensorflow/issues/3388
import os
import importlib
from keras import backend as K
def set_keras_backend(backend):
if K.backend() != backend:
os.environ["KERAS_BACKEND"] = backend
importlib.reload(K)
assert K.backend() == backend
if backend == "tensorflow":
K.get_session().close()
cfg = K.tf.ConfigProto()
cfg.gpu_options.allow_growth = True
K.set_session(K.tf.Session(config=cfg))
K.clear_session()
set_keras_backend("tensorflow")
def train(train_loc, dev_loc, shape, settings):
train_texts1, train_texts2, train_labels = read_snli(train_loc)
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
print("Loading spaCy")
nlp = spacy.load("en_vectors_web_lg")
assert nlp.path is not None
print("Processing texts...")
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
print("Compiling network")
model = build_model(get_embeddings(nlp.vocab), shape, settings)
print(settings)
model.fit(
train_X,
train_labels,
validation_data=(dev_X, dev_labels),
epochs=settings["nr_epoch"],
batch_size=settings["batch_size"],
)
if not (nlp.path / "similarity").exists():
(nlp.path / "similarity").mkdir()
print("Saving to", nlp.path / "similarity")
weights = model.get_weights()
# remove the embedding matrix. We can reconstruct it.
del weights[1]
with (nlp.path / "similarity" / "model").open("wb") as file_:
pickle.dump(weights, file_)
with (nlp.path / "similarity" / "config.json").open("w") as file_:
file_.write(model.to_json())
def evaluate(dev_loc, shape):
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
total = 0.0
correct = 0.0
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
doc1 = nlp(text1)
doc2 = nlp(text2)
sim, _ = doc1.similarity(doc2)
if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
correct += 1
total += 1
return correct, total
def demo(shape):
nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
doc1 = nlp("The king of France is bald.")
doc2 = nlp("France has no king.")
print("Sentence 1:", doc1)
print("Sentence 2:", doc2)
entailment_type, confidence = doc1.similarity(doc2)
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
def read_snli(path):
texts1 = []
texts2 = []
labels = []
with open(path, "r") as file_:
for line in file_:
eg = json.loads(line)
label = eg["gold_label"]
if label == "-": # per Parikh, ignore - SNLI entries
continue
texts1.append(eg["sentence1"])
texts2.append(eg["sentence2"])
labels.append(LABELS[label])
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
sents = texts + hypotheses
sents_as_ids = []
for sent in sents:
doc = nlp(sent)
word_ids = []
for i, token in enumerate(doc):
# skip odd spaces from tokenizer
if token.has_vector and token.vector_norm == 0:
continue
if i > max_length:
break
if token.has_vector:
word_ids.append(token.rank + num_unk + 1)
else:
# if we don't have a vector, pick an OOV entry
word_ids.append(token.rank % num_unk + 1)
# there must be a simpler way of generating padded arrays from lists...
word_id_vec = np.zeros((max_length), dtype="int")
clipped_len = min(max_length, len(word_ids))
word_id_vec[:clipped_len] = word_ids[:clipped_len]
sents_as_ids.append(word_id_vec)
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
@plac.annotations(
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
train_loc=("Path to training data", "option", "t", str),
dev_loc=("Path to development or test data", "option", "s", str),
max_length=("Length to truncate sentences", "option", "L", int),
nr_hidden=("Number of hidden units", "option", "H", int),
dropout=("Dropout level", "option", "d", float),
learn_rate=("Learning rate", "option", "r", float),
batch_size=("Batch size for neural network training", "option", "b", int),
nr_epoch=("Number of training epochs", "option", "e", int),
entail_dir=(
"Direction of entailment",
"option",
"D",
str,
["both", "left", "right"],
),
)
def main(
mode,
train_loc,
dev_loc,
max_length=50,
nr_hidden=200,
dropout=0.2,
learn_rate=0.001,
batch_size=1024,
nr_epoch=10,
entail_dir="both",
):
shape = (max_length, nr_hidden, 3)
settings = {
"lr": learn_rate,
"dropout": dropout,
"batch_size": batch_size,
"nr_epoch": nr_epoch,
"entail_dir": entail_dir,
}
if mode == "train":
if train_loc is None or dev_loc is None:
print("Train mode requires paths to training and development data sets.")
sys.exit(1)
train(train_loc, dev_loc, shape, settings)
elif mode == "evaluate":
if dev_loc is None:
print("Evaluate mode requires a path to the test data set.")
sys.exit(1)
correct, total = evaluate(dev_loc, shape)
print(correct, "/", total, correct / total)
else:
demo(shape)
if __name__ == "__main__":
plac.call(main)


@ -1,152 +0,0 @@
# Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
# Practical state-of-the-art textual entailment with spaCy and Keras
import numpy as np
from keras import layers, Model, models, optimizers
from keras import backend as K
def build_model(vectors, shape, settings):
max_length, nr_hidden, nr_class = shape
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
# embeddings (projected)
embed = create_embedding(vectors, max_length, nr_hidden)
a = embed(input1)
b = embed(input2)
# step 1: attend
F = create_feedforward(nr_hidden)
att_weights = layers.dot([F(a), F(b)], axes=-1)
G = create_feedforward(nr_hidden)
if settings["entail_dir"] == "both":
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
beta = layers.dot([norm_weights_b, b], axes=1)
# step 2: compare
comp1 = layers.concatenate([a, beta])
comp2 = layers.concatenate([b, alpha])
v1 = layers.TimeDistributed(G)(comp1)
v2 = layers.TimeDistributed(G)(comp2)
# step 3: aggregate
v1_sum = layers.Lambda(sum_word)(v1)
v2_sum = layers.Lambda(sum_word)(v2)
concat = layers.concatenate([v1_sum, v2_sum])
elif settings["entail_dir"] == "left":
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
alpha = layers.dot([norm_weights_a, a], axes=1)
comp2 = layers.concatenate([b, alpha])
v2 = layers.TimeDistributed(G)(comp2)
v2_sum = layers.Lambda(sum_word)(v2)
concat = v2_sum
else:
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
beta = layers.dot([norm_weights_b, b], axes=1)
comp1 = layers.concatenate([a, beta])
v1 = layers.TimeDistributed(G)(comp1)
v1_sum = layers.Lambda(sum_word)(v1)
concat = v1_sum
H = create_feedforward(nr_hidden)
out = H(concat)
out = layers.Dense(nr_class, activation="softmax")(out)
model = Model([input1, input2], out)
model.compile(
optimizer=optimizers.Adam(lr=settings["lr"]),
loss="categorical_crossentropy",
metrics=["accuracy"],
)
return model
def create_embedding(vectors, max_length, projected_dim):
return models.Sequential(
[
layers.Embedding(
vectors.shape[0],
vectors.shape[1],
input_length=max_length,
weights=[vectors],
trainable=False,
),
layers.TimeDistributed(
layers.Dense(projected_dim, activation=None, use_bias=False)
),
]
)
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
return models.Sequential(
[
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate),
layers.Dense(num_units, activation=activation),
layers.Dropout(dropout_rate),
]
)
def normalizer(axis):
    def _normalize(att_weights):
        exp_weights = K.exp(att_weights)
        sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
        return exp_weights / sum_weights
    return _normalize
def sum_word(x):
    return K.sum(x, axis=1)
def test_build_model():
vectors = np.ndarray((100, 8), dtype="float32")
shape = (10, 16, 3)
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
model = build_model(vectors, shape, settings)
def test_fit_model():
def _generate_X(nr_example, length, nr_vector):
X1 = np.ndarray((nr_example, length), dtype="int32")
X1 *= X1 < nr_vector
X1 *= 0 <= X1
X2 = np.ndarray((nr_example, length), dtype="int32")
X2 *= X2 < nr_vector
X2 *= 0 <= X2
return [X1, X2]
def _generate_Y(nr_example, nr_class):
ys = np.zeros((nr_example, nr_class), dtype="int32")
for i in range(nr_example):
ys[i, i % nr_class] = 1
return ys
vectors = np.ndarray((100, 8), dtype="float32")
shape = (10, 16, 3)
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
model = build_model(vectors, shape, settings)
train_X = _generate_X(20, shape[0], vectors.shape[0])
train_Y = _generate_Y(20, shape[2])
dev_X = _generate_X(15, shape[0], vectors.shape[0])
dev_Y = _generate_Y(15, shape[2])
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
__all__ = ["build_model"]


@ -1,77 +0,0 @@
import numpy as np
from keras.models import model_from_json
try:
import cPickle as pickle
except ImportError:
import pickle
class KerasSimilarityShim(object):
entailment_types = ["entailment", "contradiction", "neutral"]
@classmethod
def load(cls, path, nlp, max_length=100, get_features=None):
if get_features is None:
get_features = get_word_ids
with (path / "config.json").open() as file_:
model = model_from_json(file_.read())
with (path / "model").open("rb") as file_:
weights = pickle.load(file_)
embeddings = get_embeddings(nlp.vocab)
weights.insert(1, embeddings)
model.set_weights(weights)
return cls(model, get_features=get_features, max_length=max_length)
def __init__(self, model, get_features=None, max_length=100):
self.model = model
self.get_features = get_features
self.max_length = max_length
def __call__(self, doc):
doc.user_hooks["similarity"] = self.predict
doc.user_span_hooks["similarity"] = self.predict
return doc
def predict(self, doc1, doc2):
x1 = self.get_features([doc1], max_length=self.max_length)
x2 = self.get_features([doc2], max_length=self.max_length)
scores = self.model.predict([x1, x2])
return self.entailment_types[scores.argmax()], scores.max()
def get_embeddings(vocab, nr_unk=100):
    # the extra +1 is for a zero vector representing sentence-final padding
    num_vectors = max(lex.rank for lex in vocab) + 2
    # create random vectors for OOV tokens
    oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
    oov = oov / oov.sum(axis=1, keepdims=True)
    vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype="float32")
    vectors[1 : (nr_unk + 1),] = oov
    for lex in vocab:
        if lex.has_vector and lex.vector_norm > 0:
            vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
    return vectors
def get_word_ids(docs, max_length=100, nr_unk=100):
Xs = np.zeros((len(docs), max_length), dtype="int32")
for i, doc in enumerate(docs):
for j, token in enumerate(doc):
if j == max_length:
break
if token.has_vector:
Xs[i, j] = token.rank + nr_unk + 1
else:
Xs[i, j] = token.rank % nr_unk + 1
return Xs


@ -1,45 +0,0 @@
# coding: utf-8
"""
Example of loading previously parsed text using spaCy's DocBin class. The example
performs an entity count to show that the annotations are available.
For more details, see https://spacy.io/usage/saving-loading#docs
Installation:
python -m spacy download en_core_web_lg
Usage:
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
"""
from __future__ import unicode_literals
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer
from collections import Counter
EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
nlp = spacy.load(model)
print("Reading data from {}".format(docbin_path))
with open(docbin_path, "rb") as file_:
bytes_data = file_.read()
nr_word = 0
start_time = timer()
entities = Counter()
docbin = DocBin().from_bytes(bytes_data)
for doc in docbin.get_docs(nlp.vocab):
nr_word += len(doc)
entities.update((e.label_, e.text) for e in doc.ents)
end_time = timer()
msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
wps = nr_word / (end_time - start_time)
print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
print("Most common entities:")
for (label, entity), freq in entities.most_common(30):
print(freq, entity, label)
if __name__ == "__main__":
import plac
plac.call(main)


@ -1,955 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural language inference using spaCy and Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constructing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_vectors_web_lg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import json\n",
"from keras.utils import to_categorical\n",
"\n",
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
"def read_snli(path):\n",
" texts1 = []\n",
" texts2 = []\n",
" labels = []\n",
" with open(path, 'r') as file_:\n",
" for line in file_:\n",
" eg = json.loads(line)\n",
" label = eg['gold_label']\n",
" if label == '-': # per Parikh, ignore - SNLI entries\n",
" continue\n",
" texts1.append(eg['sentence1'])\n",
" texts2.append(eg['sentence2'])\n",
" labels.append(LABELS[label])\n",
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
" sents = texts + hypotheses\n",
" \n",
" # the extra +1 is for a zero vector represting NULL for padding\n",
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
" \n",
" # create random vectors for OOV tokens\n",
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
" \n",
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
" vectors[num_vectors:, ] = oov\n",
" for lex in nlp.vocab:\n",
" if lex.has_vector and lex.vector_norm > 0:\n",
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n",
" \n",
" sents_as_ids = []\n",
" for sent in sents:\n",
" doc = nlp(sent)\n",
" word_ids = []\n",
" \n",
" for i, token in enumerate(doc):\n",
" # skip odd spaces from tokenizer\n",
" if token.has_vector and token.vector_norm == 0:\n",
" continue\n",
" \n",
" if i > max_length:\n",
" break\n",
" \n",
" if token.has_vector:\n",
" word_ids.append(token.rank + 1)\n",
" else:\n",
" # if we don't have a vector, pick an OOV entry\n",
" word_ids.append(token.rank % num_oov + num_vectors) \n",
" \n",
" # there must be a simpler way of generating padded arrays from lists...\n",
" word_id_vec = np.zeros((max_length), dtype='int')\n",
" clipped_len = min(max_length, len(word_ids))\n",
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
" sents_as_ids.append(word_id_vec)\n",
" \n",
" \n",
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
"\n",
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we will clip sentences to 50 words maximum."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from keras import layers, Model, models\n",
"from keras import backend as K"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def create_embedding(vectors, max_length, projected_dim):\n",
" return models.Sequential([\n",
" layers.Embedding(\n",
" vectors.shape[0],\n",
" vectors.shape[1],\n",
" input_length=max_length,\n",
" weights=[vectors],\n",
" trainable=False),\n",
" \n",
" layers.TimeDistributed(\n",
" layers.Dense(projected_dim,\n",
" activation=None,\n",
" use_bias=False))\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
" return models.Sequential([\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate),\n",
" layers.Dense(num_units, activation=activation),\n",
" layers.Dropout(dropout_rate)\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basic idea of the (Parikh et al, 2016) model is to:\n",
"\n",
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
"\n",
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
]
},
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalizer(axis):\n",
    "    # returns a function that applies a softmax along the given axis\n",
    "    def _normalize(att_weights):\n",
    "        exp_weights = K.exp(att_weights)\n",
    "        sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
    "        return exp_weights/sum_weights\n",
    "    return _normalize\n",
    "\n",
    "def sum_word(x):\n",
    "    # sum over the word (time) axis\n",
    "    return K.sum(x, axis=1)\n"
]
},
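  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the normalization explicit: `normalizer(axis)` applies a softmax along `axis` of the attention matrix. Writing that matrix as $e$ with word indices $i$ (axis 1) and $j$ (axis 2), which is our own notation, the two cases used in this notebook are\n",
    "\n",
    "$$w^{(1)}_{ij} = \\frac{\\exp(e_{ij})}{\\sum_k \\exp(e_{kj})}, \\qquad w^{(2)}_{ij} = \\frac{\\exp(e_{ij})}{\\sum_k \\exp(e_{ik})}$$\n",
    "\n",
    "A common numerically safer variant, which the code above does not use, subtracts a constant $m_i$ (for example $\\max_k e_{ik}$) before exponentiating. The result is unchanged because the factor $\\exp(-m_i)$ cancels:\n",
    "\n",
    "$$\\frac{\\exp(e_{ij} - m_i)}{\\sum_k \\exp(e_{ik} - m_i)} = \\frac{\\exp(e_{ij})}{\\sum_k \\exp(e_{ik})}$$"
   ]
  },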
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
    "    input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
    "    input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
    "\n",
    "    # embeddings (projected)\n",
    "    embed = create_embedding(vectors, max_length, projected_dim)\n",
    "\n",
    "    a = embed(input1)\n",
    "    b = embed(input2)\n",
    "\n",
    "    # step 1: attend\n",
    "    F = create_feedforward(num_hidden)\n",
    "    att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
    "\n",
    "    G = create_feedforward(num_hidden)\n",
    "\n",
    "    if entail_dir == 'both':\n",
    "        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
    "        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
    "        alpha = layers.dot([norm_weights_a, a], axes=1)\n",
    "        beta = layers.dot([norm_weights_b, b], axes=1)\n",
    "\n",
    "        # step 2: compare\n",
    "        comp1 = layers.concatenate([a, beta])\n",
    "        comp2 = layers.concatenate([b, alpha])\n",
    "        v1 = layers.TimeDistributed(G)(comp1)\n",
    "        v2 = layers.TimeDistributed(G)(comp2)\n",
    "\n",
    "        # step 3: aggregate\n",
    "        v1_sum = layers.Lambda(sum_word)(v1)\n",
    "        v2_sum = layers.Lambda(sum_word)(v2)\n",
    "        concat = layers.concatenate([v1_sum, v2_sum])\n",
    "    elif entail_dir == 'left':\n",
    "        # asymmetric model: keep only the hypothesis-side comparison vector\n",
    "        norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
    "        alpha = layers.dot([norm_weights_a, a], axes=1)\n",
    "        comp2 = layers.concatenate([b, alpha])\n",
    "        v2 = layers.TimeDistributed(G)(comp2)\n",
    "        v2_sum = layers.Lambda(sum_word)(v2)\n",
    "        concat = v2_sum\n",
    "    else:\n",
    "        # any other value (e.g. 'right'): keep only the text-side comparison vector\n",
    "        norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
    "        beta = layers.dot([norm_weights_b, b], axes=1)\n",
    "        comp1 = layers.concatenate([a, beta])\n",
    "        v1 = layers.TimeDistributed(G)(comp1)\n",
    "        v1_sum = layers.Lambda(sum_word)(v1)\n",
    "        concat = v1_sum\n",
    "\n",
    "    # final feedforward block and softmax classifier\n",
    "    H = create_feedforward(num_hidden)\n",
    "    out = H(concat)\n",
    "    out = layers.Dense(num_classes, activation='softmax')(out)\n",
    "\n",
    "    model = Model([input1, input2], out)\n",
    "\n",
    "    model.compile(optimizer='adam',\n",
    "                  loss='categorical_crossentropy',\n",
    "                  metrics=['accuracy'])\n",
    "    return model"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
" sequential_2[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
" sequential_1[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
" sequential_1[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
" dot_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
" dot_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
" lambda_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
"==================================================================================================\n",
"Total params: 321,703,403\n",
"Trainable params: 381,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"K.clear_session()\n",
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
"m.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5c9f49c438>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted for by differences in `max_length` (here set at 50) and in the training regime. Note that the validation figures above are computed on the held-out `*_test` arrays passed as `validation_data` (9,824 samples), not on a split of the training set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Experiment: the asymmetric model"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association of the hypothesis to the text is all that is needed for classifying the entailment.\n",
    "\n",
    "The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the parameter count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
" sequential_5[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
" sequential_6[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
" sequential_5[1][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
" dot_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
"m1.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
]
},
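  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic can be checked by hand. As an assumption inferred from the summaries (since `create_feedforward` is defined earlier in the notebook), suppose each feedforward block consists of two dense layers of 200 units with biases. For $H$ with the concatenated 400-dimensional input of the symmetric model:\n",
    "\n",
    "$$(400 \\cdot 200 + 200) + (200 \\cdot 200 + 200) = 80{,}200 + 40{,}200 = 120{,}400$$\n",
    "\n",
    "For $H$ with the single 200-dimensional input of the asymmetric model:\n",
    "\n",
    "$$(200 \\cdot 200 + 200) + (200 \\cdot 200 + 200) = 40{,}200 + 40{,}200 = 80{,}400$$\n",
    "\n",
    "The difference, $120{,}400 - 80{,}400 = 40{,}000 = 200 \\cdot 200$, is the weight matrix for the 200 input dimensions that are no longer fed to the first layer of $H$, which matches the drop in trainable parameters from 381,803 to 341,803."
   ]
  },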
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 549367 samples, validate on 9824 samples\n",
"Epoch 1/50\n",
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
"Epoch 2/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
"Epoch 3/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
"Epoch 4/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
"Epoch 5/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
"Epoch 6/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
"Epoch 7/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
"Epoch 8/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
"Epoch 9/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
"Epoch 10/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
"Epoch 11/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
"Epoch 12/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
"Epoch 13/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
"Epoch 14/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
"Epoch 15/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
"Epoch 16/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
"Epoch 17/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
"Epoch 18/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
"Epoch 19/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
"Epoch 20/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
"Epoch 21/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
"Epoch 22/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
"Epoch 23/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
"Epoch 24/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
"Epoch 25/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
"Epoch 26/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
"Epoch 27/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
"Epoch 28/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
"Epoch 29/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
"Epoch 30/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
"Epoch 31/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
"Epoch 32/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
"Epoch 33/50\n",
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
"Epoch 34/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
"Epoch 35/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
"Epoch 36/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
"Epoch 37/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
"Epoch 38/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
"Epoch 39/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
"Epoch 40/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
"Epoch 41/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
"Epoch 42/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
"Epoch 43/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
"Epoch 44/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
"Epoch 45/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
"Epoch 46/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
"Epoch 47/50\n",
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
"Epoch 48/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
"Epoch 49/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
"Epoch 50/50\n",
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This model performs about as well as the slightly more complex model that evaluates alignments in both directions (final validation accuracy of 0.8601 versus 0.8585). Note also that processing time is improved, from about 60 down to about 45 microseconds per step.\n",
"\n",
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
"\n",
"We'll just use 10 epochs for expediency."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"__________________________________________________________________________________________________\n",
"Layer (type) Output Shape Param # Connected to \n",
"==================================================================================================\n",
"words1 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"words2 (InputLayer) (None, 50) 0 \n",
"__________________________________________________________________________________________________\n",
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
" words2[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
" sequential_14[2][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
"__________________________________________________________________________________________________\n",
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
" sequential_13[2][0] \n",
"__________________________________________________________________________________________________\n",
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
" dot_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
"__________________________________________________________________________________________________\n",
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
"__________________________________________________________________________________________________\n",
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
"__________________________________________________________________________________________________\n",
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
"==================================================================================================\n",
"Total params: 321,663,403\n",
"Trainable params: 341,803\n",
"Non-trainable params: 321,321,600\n",
"__________________________________________________________________________________________________\n"
]
}
],
"source": [
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
"m2.summary()"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 455226 samples, validate on 113807 samples\n",
"Epoch 1/10\n",
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
"Epoch 2/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
"Epoch 3/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
"Epoch 4/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
"Epoch 5/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
"Epoch 6/10\n",
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
"Epoch 7/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
"Epoch 8/10\n",
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
"Epoch 9/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
"Epoch 10/10\n",
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
]
},
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x7fa6850cf080>"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n",
"\n",
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}


@ -1,78 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
"""This example contains several snippets of methods that can be set via custom
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
they're "bound" to the object and are partially applied, i.e. the object
they're called on is passed in as the first argument.

* Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
from spacy.lang.en import English
from spacy.tokens import Doc, Span
from spacy import displacy
from pathlib import Path


@plac.annotations(
    output_dir=("Output directory for saved HTML", "positional", None, Path)
)
def main(output_dir=None):
    nlp = English()  # start off with blank English class

    Doc.set_extension("overlap", method=overlap_tokens)
    doc1 = nlp("Peach emoji is where it has always been.")
    doc2 = nlp("Peach is the superior emoji.")
    print("Text 1:", doc1.text)
    print("Text 2:", doc2.text)
    print("Overlapping tokens:", doc1._.overlap(doc2))

    Doc.set_extension("to_html", method=to_html)
    doc = nlp("This is a sentence about Apple.")
    # add entity manually for demo purposes, to make it work without a model
    doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings["ORG"])]
    print("Text:", doc.text)
    doc._.to_html(output=output_dir, style="ent")


def to_html(doc, output="/tmp", style="dep"):
    """Doc method extension for saving the current state as a displaCy
    visualization.
    """
    # generate filename from first six non-punct tokens
    file_name = "-".join([w.text for w in doc[:6] if not w.is_punct]) + ".html"
    html = displacy.render(doc, style=style, page=True)  # render markup
    if output is not None:
        output_path = Path(output)
        if not output_path.exists():
            output_path.mkdir()
        output_file = Path(output) / file_name
        # save to file; the context manager closes the handle afterwards
        with output_file.open("w", encoding="utf-8") as f:
            f.write(html)
        print("Saved HTML to {}".format(output_file))
    else:
        print(html)


def overlap_tokens(doc, other_doc):
    """Get the tokens from the original Doc that are also in the comparison Doc.
    """
    overlap = []
    other_tokens = [token.text for token in other_doc]
    for token in doc:
        if token.text in other_tokens:
            overlap.append(token)
    return overlap


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# Text 1: Peach emoji is where it has always been.
# Text 2: Peach is the superior emoji.
# Overlapping tokens: [Peach, emoji, is, .]
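
The "bound" behaviour described in the docstring above can be tried without spaCy. What follows is a minimal sketch and not spaCy's actual implementation: `set_extension`, `ExtensionHolder`, `FakeDoc`, `overlap_words` and the `_methods` registry are hypothetical names used only for illustration. It shows how a registered function can be partially applied so that the owning object arrives as its first argument.

```python
from functools import partial

# Hypothetical registry of extension methods (not spaCy's real internals)
_methods = {}


def set_extension(name, method):
    """Register a plain function under an extension name."""
    _methods[name] = method


class ExtensionHolder(object):
    """Stand-in for the `._` attribute: looks up a registered function and
    partially applies it to the owning object."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only called for names that are not regular attributes
        if name not in _methods:
            raise AttributeError(name)
        # The owner becomes the first positional argument
        return partial(_methods[name], self._obj)


class FakeDoc(object):
    """Tiny stand-in for a Doc: just a list of token strings."""

    def __init__(self, words):
        self.words = words
        self.ext = ExtensionHolder(self)


def overlap_words(doc, other_doc):
    """Words of `doc` that also occur in `other_doc`, in document order."""
    other = set(other_doc.words)
    return [w for w in doc.words if w in other]


set_extension("overlap", overlap_words)
doc1 = FakeDoc("Peach emoji is where it has always been .".split())
doc2 = FakeDoc("Peach is the superior emoji .".split())
print(doc1.ext.overlap(doc2))  # ['Peach', 'emoji', 'is', '.']
```

The caller only passes `doc2`; `doc1` is supplied by `partial`, which is the same calling convention as `doc1._.overlap(doc2)` in the script.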


@ -1,130 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens, e.g. the capital and lat/lng
coordinates. Can be extended with more details from the API.

* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
* Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
Prerequisites: pip install requests
"""
from __future__ import unicode_literals, print_function

import requests
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token


def main():
    # For simplicity, we start off with only the blank English Language class
    # and no model or pre-defined pipeline loaded.
    nlp = English()
    rest_countries = RESTCountriesComponent(nlp)  # initialise component
    nlp.add_pipe(rest_countries)  # add it to the pipeline
    doc = nlp("Some text about Colombia and the Czech Republic")
    print("Pipeline", nlp.pipe_names)  # pipeline contains component name
    print("Doc has countries", doc._.has_country)  # Doc contains countries
    for token in doc:
        if token._.is_country:
            print(
                token.text,
                token._.country_capital,
                token._.country_latlng,
                token._.country_flag,
            )  # country data
    print("Entities", [(e.text, e.label_) for e in doc.ents])  # entities


class RESTCountriesComponent(object):
    """spaCy v2.0 pipeline component that requests all countries via
    the REST Countries API, merges country names into one token, assigns entity
    labels and sets attributes on country tokens.
    """

    name = "rest_countries"  # component name, will show up in the pipeline

    def __init__(self, nlp, label="GPE"):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        # Make request once on initialisation and store the data
        r = requests.get("https://restcountries.eu/rest/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()

        # Convert API response to dict keyed by country name for easy lookup
        # This could also be extended using the alternative and foreign language
        # names provided by the API
        self.countries = {c["name"]: c for c in countries}
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher with Doc patterns for each country name
        patterns = [nlp(c) for c in self.countries.keys()]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        # If no default value is set, it defaults to None.
        Token.set_extension("is_country", default=False)
        Token.set_extension("country_capital", default=False)
        Token.set_extension("country_latlng", default=False)
        Token.set_extension("country_flag", default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_country == True.
        Doc.set_extension("has_country", getter=self.has_country)
        Span.set_extension("has_country", getter=self.has_country)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            # Can be extended with other data returned by the API, like
            # currencies, country code, flag, calling code etc.
            for token in entity:
                token._.set("is_country", True)
                token._.set("country_capital", self.countries[entity.text]["capital"])
                token._.set("country_latlng", self.countries[entity.text]["latlng"])
                token._.set("country_flag", self.countries[entity.text]["flag"])
            # Overwrite doc.ents and add the entity; be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities; otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def has_country(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is a country. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_country' attribute here,
        which is already set in the processing step."""
        return any([t._.get("is_country") for t in tokens])


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# Pipeline ['rest_countries']
# Doc has countries True
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]
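
The lookup-table step in `__init__` can be tried offline. This is a minimal sketch, assuming a response that contains only the four fields the component reads (`name`, `capital`, `latlng`, `flag`). `sample_response` is hand-written from the expected output above rather than fetched from the API, and `country_attrs` is a hypothetical helper introduced for illustration.

```python
# Hand-written stand-in for the parsed JSON response (assumption: real
# responses carry many more fields per country)
sample_response = [
    {
        "name": "Colombia",
        "capital": "Bogotá",
        "latlng": [4.0, -72.0],
        "flag": "https://restcountries.eu/data/col.svg",
    },
    {
        "name": "Czech Republic",
        "capital": "Prague",
        "latlng": [49.75, 15.5],
        "flag": "https://restcountries.eu/data/cze.svg",
    },
]

# Same dict comprehension as in the component: key each record by its name
countries = {c["name"]: c for c in sample_response}


def country_attrs(entity_text):
    """Values the component would set on every token of a matched span."""
    data = countries[entity_text]
    return {
        "is_country": True,
        "country_capital": data["capital"],
        "country_latlng": data["latlng"],
        "country_flag": data["flag"],
    }


print(country_attrs("Czech Republic")["country_capital"])  # Prague
```

Keying by the full entity text is why a multi-word name such as "Czech Republic" resolves with a single dictionary lookup.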


@ -1,115 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
based on list of single or multiple-word company names. Companies are
labelled as ORG and their spans are merged into one token. Additionally,
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
respectively.

* Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token


@plac.annotations(
    text=("Text to process", "positional", None, str),
    companies=("Names of technology companies", "positional", None, str),
)
def main(text="Alphabet Inc. is the company behind Google.", *companies):
    # For simplicity, we start off with only the blank English Language class
    # and no model or pre-defined pipeline loaded.
    nlp = English()
    if not companies:  # set default companies if none are set via args
        companies = ["Alphabet Inc.", "Google", "Netflix", "Apple"]  # etc.
    component = TechCompanyRecognizer(nlp, companies)  # initialise component
    nlp.add_pipe(component, last=True)  # add last to the pipeline

    doc = nlp(text)
    print("Pipeline", nlp.pipe_names)  # pipeline contains component name
    print("Tokens", [t.text for t in doc])  # company names from the list are merged
    print("Doc has_tech_org", doc._.has_tech_org)  # Doc contains tech orgs
    print("Token 0 is_tech_org", doc[0]._.is_tech_org)  # "Alphabet Inc." is a tech org
    print("Token 1 is_tech_org", doc[1]._.is_tech_org)  # "is" is not
    print("Entities", [(e.text, e.label_) for e in doc.ents])  # all orgs are entities


class TechCompanyRecognizer(object):
    """Example of a spaCy v2.0 pipeline component that sets entity annotations
    based on list of single or multiple-word company names. Companies are
    labelled as ORG and their spans are merged into one token. Additionally,
    ._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
    respectively."""

    name = "tech_companies"  # component name, will show up in the pipeline

    def __init__(self, nlp, companies=tuple(), label="ORG"):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher. It can now take Doc objects as patterns,
        # so even if the list of companies is long, it's very efficient
        patterns = [nlp(org) for org in companies]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("TECH_ORGS", None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension("is_tech_org", default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_tech_org == True.
        Doc.set_extension("has_tech_org", getter=self.has_tech_org)
        Span.set_extension("has_tech_org", getter=self.has_tech_org)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set("is_tech_org", True)
            # Overwrite doc.ents and add the entity; be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities; otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def has_tech_org(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is a tech org. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_tech_org' attribute here,
        which is already set in the processing step."""
        return any([t._.get("is_tech_org") for t in tokens])


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# Pipeline ['tech_companies']
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
# Doc has_tech_org True
# Token 0 is_tech_org True
# Token 1 is_tech_org False
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]
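
The match-then-merge order used in `__call__` can be illustrated on plain lists of strings. This is a naive sketch under stated assumptions, not how `PhraseMatcher` or `span.merge()` work internally: `find_phrases` and `merge_spans` are hypothetical helpers, the tokenization is written by hand, and matched spans are assumed not to overlap.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end) token offsets for every phrase occurrence.
    A simple scan, used here only to produce offsets to merge."""
    matches = []
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == phrase:
                matches.append((start, start + n))
    return sorted(matches)


def merge_spans(tokens, spans):
    """Merge each (start, end) span into a single token. Spans are processed
    right to left so that the offsets of earlier spans stay valid."""
    merged = list(tokens)
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [" ".join(merged[start:end])]
    return merged


tokens = ["Alphabet", "Inc.", "is", "the", "company", "behind", "Google", "."]
phrases = [["Alphabet", "Inc."], ["Google"]]
spans = find_phrases(tokens, phrases)
print(spans)  # [(0, 2), (6, 7)]
print(merge_spans(tokens, spans))
```

All offsets are collected before anything is merged, for the reason the component's comment gives: merging changes the token count, so offsets computed afterwards would no longer line up.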


@ -1,61 +0,0 @@
"""Example of adding a pipeline component to prohibit sentence boundaries
before certain tokens.
What we do is write to the token.is_sent_start attribute, which
takes values in {True, False, None}. The default value None allows the parser
to predict sentence segments. The value False prohibits the parser from inserting
a sentence boundary before that token. Note that fixing the sentence segmentation
should also improve the parse quality.
The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627
Other versions of the model may not make the original mistake, so the specific
example might not be apt for future versions.
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
import plac
import spacy


def prevent_sentence_boundaries(doc):
    for token in doc:
        if not can_be_sentence_start(token):
            token.is_sent_start = False
    return doc


def can_be_sentence_start(token):
    if token.i == 0:
        return True
    # We're not checking for is_title here to ignore arbitrary titlecased
    # tokens within sentences
    # elif token.is_title:
    #    return True
    elif token.nbor(-1).is_punct:
        return True
    elif token.nbor(-1).is_space:
        return True
    else:
        return False


@plac.annotations(
    text=("The raw text to process", "positional", None, str),
    spacy_model=("spaCy model to use (with a parser)", "option", "m", str),
)
def main(text="Been here And I'm loving it.", spacy_model="en_core_web_lg"):
    print("Using spaCy model '{}'".format(spacy_model))
    print("Processing text '{}'".format(text))
    nlp = spacy.load(spacy_model)
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    print("Before:", sentences)
    nlp.add_pipe(prevent_sentence_boundaries, before="parser")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    print("After:", sentences)


if __name__ == "__main__":
    plac.call(main)
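
The rule in `can_be_sentence_start` can be checked on plain strings without loading a model. This is a rough sketch, assuming hand-written tokens and approximating `is_punct` with `string.punctuation`, which is narrower than spaCy's own punctuation check. `can_start_sentence` is a hypothetical stand-in.

```python
import string


def can_start_sentence(words, i):
    """A token may start a sentence only if it is the first token or if the
    previous token is punctuation or whitespace."""
    if i == 0:
        return True
    prev = words[i - 1]
    # Approximation of token.nbor(-1).is_punct
    if prev and all(ch in string.punctuation for ch in prev):
        return True
    # Approximation of token.nbor(-1).is_space
    if prev.isspace():
        return True
    return False


words = ["Hi", "!", "Been", "here", "And", "it", "."]
allowed = [i for i in range(len(words)) if can_start_sentence(words, i)]
print(allowed)  # [0, 2]
```

Every index not in `allowed` corresponds to a token for which the component would set `is_sent_start = False`, so "And" can no longer open a new sentence.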


@ -1,37 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Demonstrate adding a rule-based component that forces some tokens to not
be entities, before the NER tagger is applied. This is used to hotfix the issue
in https://github.com/explosion/spaCy/issues/2870, present as of spaCy v2.0.16.
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals

import spacy
from spacy.attrs import ENT_IOB


def fix_space_tags(doc):
    ent_iobs = doc.to_array([ENT_IOB])
    for i, token in enumerate(doc):
        if token.is_space:
            # Sets 'O' tag (0 is None, so I is 1, O is 2)
            ent_iobs[i] = 2
    doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
    return doc


def main():
    nlp = spacy.load("en_core_web_sm")
    text = "This is some crazy test where I dont need an Apple Watch to make things bug"
    doc = nlp(text)
    print("Before", doc.ents)
    nlp.add_pipe(fix_space_tags, name="fix-ner", before="ner")
    doc = nlp(text)
    print("After", doc.ents)


if __name__ == "__main__":
    main()
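
The integer codes mentioned in the comment of `fix_space_tags` (0 is unset, 1 is I, 2 is O, and by the same scheme 3 is B) can be shown with plain lists. This is a small sketch, assuming hand-written tokens and no tags assigned yet, as is the case before the `ner` component runs. `IOB_CODES` and `force_outside` are hypothetical names.

```python
# Integer codes of the ENT_IOB attribute; 0 means no tag has been set
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def force_outside(tokens, ent_iobs):
    """Return a copy of `ent_iobs` in which whitespace tokens are set to
    'O' (code 2); all other codes are left untouched."""
    return [2 if tok.isspace() else code for tok, code in zip(tokens, ent_iobs)]


tokens = ["need", "an", "Apple", " ", "Watch"]
ent_iobs = [0, 0, 0, 0, 0]  # nothing assigned yet
fixed = force_outside(tokens, ent_iobs)
print(fixed)  # [0, 0, 0, 2, 0]
print([IOB_CODES[code] for code in fixed])  # ['', '', '', 'O', '']
```

Only the whitespace token receives a preset tag; the remaining zeros leave the entity recognizer free to decide.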


@ -1,84 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of multi-processing with Joblib. Here, we're exporting
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
each "sentence" on a newline, and spaces between tokens. Data is loaded from
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
built-in dataset loader.
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
Prerequisites: pip install joblib
"""
from __future__ import print_function, unicode_literals

from pathlib import Path
from joblib import Parallel, delayed
from functools import partial
import thinc.extra.datasets
import plac
import spacy
from spacy.util import minibatch


@plac.annotations(
    output_dir=("Output directory", "positional", None, Path),
    model=("Model name (needs tagger)", "positional", None, str),
    n_jobs=("Number of workers", "option", "n", int),
    batch_size=("Batch-size for each process", "option", "b", int),
    limit=("Limit of entries from the dataset", "option", "l", int),
)
def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10000):
    nlp = spacy.load(model)  # load spaCy model
    print("Loaded model '%s'" % model)
    if not output_dir.exists():
        output_dir.mkdir()
    # load and pre-process the IMDB dataset
    print("Loading IMDB data...")
    data, _ = thinc.extra.datasets.imdb()
    texts, _ = zip(*data[-limit:])
    print("Processing texts...")
    partitions = minibatch(texts, size=batch_size)
    executor = Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes")
    do = delayed(partial(transform_texts, nlp))
    tasks = (do(i, batch, output_dir) for i, batch in enumerate(partitions))
    executor(tasks)


def transform_texts(nlp, batch_id, texts, output_dir):
    print(nlp.pipe_names)
    out_path = Path(output_dir) / ("%d.txt" % batch_id)
    if out_path.exists():  # return None in case same batch is called again
        return None
    print("Processing batch", batch_id)
    with out_path.open("w", encoding="utf8") as f:
        for doc in nlp.pipe(texts):
            f.write(" ".join(represent_word(w) for w in doc if not w.is_space))
            f.write("\n")
    print("Saved {} texts to {}.txt".format(len(texts), batch_id))


def represent_word(word):
    text = word.text
    # True-case, i.e. try to normalize sentence-initial capitals.
    # Only do this if the lower-cased form is more probable.
    if (
        text.istitle()
        and is_sent_begin(word)
        and word.prob < word.doc.vocab[text.lower()].prob
    ):
        text = text.lower()
    return text + "|" + word.tag_


def is_sent_begin(word):
    if word.i == 0:
        return True
    elif word.i >= 2 and word.nbor(-1).text in (".", "!", "?", "..."):
        return True
    else:
        return False


if __name__ == "__main__":
    plac.call(main)
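
The partitioning step, where each batch becomes one task and one output file named after its batch id, can be sketched with the standard library alone. The `minibatch` below is a hypothetical stand-in written with `itertools.islice`; it is not the implementation from `spacy.util`, and the sample texts are invented.

```python
from itertools import islice


def minibatch(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


texts = ["review %d" % i for i in range(7)]
# One (file name, batch) pair per task, mirroring "%d.txt" % batch_id
tasks = [("%d.txt" % i, batch) for i, batch in enumerate(minibatch(texts, 3))]
print([(name, len(batch)) for name, batch in tasks])
# [('0.txt', 3), ('1.txt', 3), ('2.txt', 1)]
```

Because the file name depends only on the batch index, a worker can skip a batch whose output file already exists, which is what the `out_path.exists()` check in `transform_texts` relies on.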


@ -1,153 +0,0 @@
# coding: utf-8
"""
Example of a Streamlit app for an interactive spaCy model visualizer. You can
either download the script, or point streamlit run to the raw URL of this
file. For more details, see https://streamlit.io.
Installation:
pip install streamlit
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
python -m spacy download de_core_news_sm
Usage:
streamlit run streamlit_spacy.py
"""
from __future__ import unicode_literals

import streamlit as st
import spacy
from spacy import displacy
import pandas as pd


SPACY_MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "de_core_news_sm"]
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""


@st.cache(allow_output_mutation=True)
def load_model(name):
    return spacy.load(name)


@st.cache(allow_output_mutation=True)
def process_text(model_name, text):
    nlp = load_model(model_name)
    return nlp(text)


st.sidebar.title("Interactive spaCy visualizer")
st.sidebar.markdown(
    """
Process text with [spaCy](https://spacy.io) models and visualize named entities,
dependencies and more. Uses spaCy's built-in
[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood.
"""
)

spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
model_load_state = st.info(f"Loading model '{spacy_model}'...")
nlp = load_model(spacy_model)
model_load_state.empty()

text = st.text_area("Text to analyze", DEFAULT_TEXT)
doc = process_text(spacy_model, text)

if "parser" in nlp.pipe_names:
    st.header("Dependency Parse & Part-of-speech tags")
    st.sidebar.header("Dependency Parse")
    split_sents = st.sidebar.checkbox("Split sentences", value=True)
    collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
    collapse_phrases = st.sidebar.checkbox("Collapse phrases")
    compact = st.sidebar.checkbox("Compact mode")
    options = {
        "collapse_punct": collapse_punct,
        "collapse_phrases": collapse_phrases,
        "compact": compact,
    }
    docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
    for sent in docs:
        html = displacy.render(sent, options=options)
        # Double newlines seem to mess with the rendering
        html = html.replace("\n\n", "\n")
        if split_sents and len(docs) > 1:
            st.markdown(f"> {sent.text}")
        st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)

if "ner" in nlp.pipe_names:
    st.header("Named Entities")
    st.sidebar.header("Named Entities")
    label_set = nlp.get_pipe("ner").labels
    labels = st.sidebar.multiselect(
        "Entity labels", options=label_set, default=list(label_set)
    )
    html = displacy.render(doc, style="ent", options={"ents": labels})
    # Newlines seem to mess with the rendering
    html = html.replace("\n", " ")
    st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
    attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
    if "entity_linker" in nlp.pipe_names:
        attrs.append("kb_id_")
    data = [
        [str(getattr(ent, attr)) for attr in attrs]
        for ent in doc.ents
        if ent.label_ in labels
    ]
    df = pd.DataFrame(data, columns=attrs)
    st.dataframe(df)

if "textcat" in nlp.pipe_names:
    st.header("Text Classification")
    st.markdown(f"> {text}")
    df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
    st.dataframe(df)

vector_size = nlp.meta.get("vectors", {}).get("width", 0)
if vector_size:
    st.header("Vectors & Similarity")
    st.code(nlp.meta["vectors"])
    text1 = st.text_input("Text or word 1", "apple")
    text2 = st.text_input("Text or word 2", "orange")
    doc1 = process_text(spacy_model, text1)
    doc2 = process_text(spacy_model, text2)
    similarity = doc1.similarity(doc2)
    if similarity > 0.5:
        st.success(similarity)
    else:
        st.error(similarity)

st.header("Token attributes")

if st.button("Show token attributes"):
    attrs = [
        "idx",
        "text",
        "lemma_",
        "pos_",
        "tag_",
        "dep_",
        "head",
        "ent_type_",
        "ent_iob_",
        "shape_",
        "is_alpha",
        "is_ascii",
        "is_digit",
        "is_punct",
        "like_num",
    ]
    data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
    df = pd.DataFrame(data, columns=attrs)
    st.dataframe(df)

st.header("JSON Doc")
if st.button("Show JSON Doc"):
    st.json(doc.to_json())

st.header("JSON model meta")
if st.button("Show JSON model meta"):
    st.json(nlp.meta)


@ -1 +0,0 @@
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}
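
The keys in this JSON file match the fields of the `Config` class in the training script that follows, so it is presumably one of its config files; that pairing is an assumption based on the names alone. Below is a minimal sketch of how such a file overrides defaults. It uses a standard-library dataclass (`TrainConfig`, a hypothetical stand-in) instead of the `attr`-based class in the script, with the same default values.

```python
import json
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Stand-in for the script's Config class; defaults copied from it."""

    vectors: object = None
    max_doc_length: int = 10
    multitask_tag: bool = True
    multitask_sent: bool = True
    nr_epoch: int = 30
    batch_size: int = 1000
    dropout: float = 0.2


raw = (
    '{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, '
    '"multitask_tag": 0, "multitask_sent": 0}'
)
# Keys present in the JSON override the defaults; missing keys keep them
cfg = TrainConfig(**json.loads(raw))
print(cfg.nr_epoch, cfg.batch_size, cfg.max_doc_length, bool(cfg.multitask_tag))
# 3 24 10 False
```

The integer `0` values are stored as given and are only falsy when tested, which is how the script's `if config.vectors:` and `if config.multitask_tag:` checks would treat them.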


@ -1,434 +0,0 @@
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
.conllu format for development data, allowing the official scorer to be used.
"""
from __future__ import unicode_literals
import plac
import attr
from pathlib import Path
import re
import json
import tqdm
import spacy
import spacy.util
from spacy.tokens import Token, Doc
from spacy.gold import GoldParse
from spacy.syntax.nonproj import projectivize
from collections import defaultdict
from spacy.matcher import Matcher
import itertools
import random
import numpy.random
from bin.ud import conll17_ud_eval
import spacy.lang.zh
import spacy.lang.ja
spacy.lang.zh.Chinese.Defaults.use_jieba = False
spacy.lang.ja.Japanese.Defaults.use_janome = False
random.seed(0)
numpy.random.seed(0)
def minibatch_by_words(items, size=5000):
    random.shuffle(items)
    if isinstance(size, int):
        size_ = itertools.repeat(size)
    else:
        size_ = size
    items = iter(items)
    while True:
        batch_size = next(size_)
        batch = []
        while batch_size >= 0:
            try:
                doc, gold = next(items)
            except StopIteration:
                if batch:
                    yield batch
                return
            batch_size -= len(doc)
            batch.append((doc, gold))
        if batch:
            yield batch
        else:
            break
################
# Data reading #
################
space_re = re.compile(r"\s+")  # raw string avoids an invalid escape sequence


def split_text(text):
    return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]


def read_data(
    nlp,
    conllu_file,
    text_file,
    raw_text=True,
    oracle_segments=False,
    max_doc_length=None,
    limit=None,
):
    """Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
    include Doc objects created using nlp.make_doc and then aligned against
    the gold-standard sequences. If oracle_segments=True, include Doc objects
    created from the gold-standard segments. At least one must be True."""
    if not raw_text and not oracle_segments:
        raise ValueError("At least one of raw_text or oracle_segments must be True")
    paragraphs = split_text(text_file.read())
    conllu = read_conllu(conllu_file)
    # sd is spacy doc; cd is conllu doc
    # cs is conllu sent, ct is conllu token
    docs = []
    golds = []
    for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
        sent_annots = []
        for cs in cd:
            sent = defaultdict(list)
            for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
                if "." in id_:
                    continue
                if "-" in id_:
                    continue
                id_ = int(id_) - 1
                head = int(head) - 1 if head != "0" else id_
                sent["words"].append(word)
                sent["tags"].append(tag)
                sent["heads"].append(head)
                sent["deps"].append("ROOT" if dep == "root" else dep)
                sent["spaces"].append(space_after == "_")
            sent["entities"] = ["-"] * len(sent["words"])
            sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
            if oracle_segments:
                docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
                golds.append(GoldParse(docs[-1], **sent))

            sent_annots.append(sent)
            if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
                doc, gold = _make_gold(nlp, None, sent_annots)
                sent_annots = []
                docs.append(doc)
                golds.append(gold)
                if limit and len(docs) >= limit:
                    return docs, golds

        if raw_text and sent_annots:
            doc, gold = _make_gold(nlp, None, sent_annots)
            docs.append(doc)
            golds.append(gold)
        if limit and len(docs) >= limit:
            return docs, golds
    return docs, golds


def read_conllu(file_):
    docs = []
    sent = []
    doc = []
    for line in file_:
        if line.startswith("# newdoc"):
            if doc:
                docs.append(doc)
            doc = []
        elif line.startswith("#"):
            continue
        elif not line.strip():
            if sent:
                doc.append(sent)
            sent = []
        else:
            sent.append(list(line.strip().split("\t")))
            if len(sent[-1]) != 10:
                print(repr(line))
                raise ValueError
    if sent:
        doc.append(sent)
    if doc:
        docs.append(doc)
    return docs


def _make_gold(nlp, text, sent_annots):
    # Flatten the conll annotations, and adjust the head indices
    flat = defaultdict(list)
    for sent in sent_annots:
        flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
        for field in ["words", "tags", "deps", "entities", "spaces"]:
            flat[field].extend(sent[field])
    # Construct text if necessary
    assert len(flat["words"]) == len(flat["spaces"])
    if text is None:
        text = "".join(
            word + " " * space for word, space in zip(flat["words"], flat["spaces"])
        )
    doc = nlp.make_doc(text)
    flat.pop("spaces")
    gold = GoldParse(doc, **flat)
    return doc, gold


#############################
# Data transforms for spaCy #
#############################
def golds_to_gold_tuples(docs, golds):
    """Get out the annoying 'tuples' format used by begin_training, given the
    GoldParse objects."""
    tuples = []
    for doc, gold in zip(docs, golds):
        text = doc.text
        ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
        sents = [((ids, words, tags, heads, labels, iob), [])]
        tuples.append((text, sents))
    return tuples
##############
# Evaluation #
##############
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
    with text_loc.open("r", encoding="utf8") as text_file:
        texts = split_text(text_file.read())
        docs = list(nlp.pipe(texts))
    with sys_loc.open("w", encoding="utf8") as out_file:
        write_conllu(docs, out_file)
    with gold_loc.open("r", encoding="utf8") as gold_file:
        gold_ud = conll17_ud_eval.load_conllu(gold_file)
        with sys_loc.open("r", encoding="utf8") as sys_file:
            sys_ud = conll17_ud_eval.load_conllu(sys_file)
        scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
    return scores


def write_conllu(docs, file_):
    merger = Matcher(docs[0].vocab)
    merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
    for i, doc in enumerate(docs):
        matches = merger(doc)
        spans = [doc[start : end + 1] for _, start, end in matches]
        offsets = [(span.start_char, span.end_char) for span in spans]
        for start_char, end_char in offsets:
            doc.merge(start_char, end_char)
        file_.write("# newdoc id = {i}\n".format(i=i))
        for j, sent in enumerate(doc.sents):
            file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
            file_.write("# text = {text}\n".format(text=sent.text))
            for k, token in enumerate(sent):
                file_.write(token._.get_conllu_lines(k) + "\n")
            file_.write("\n")


def print_progress(itn, losses, ud_scores):
    fields = {
        "dep_loss": losses.get("parser", 0.0),
        "tag_loss": losses.get("tagger", 0.0),
        "words": ud_scores["Words"].f1 * 100,
        "sents": ud_scores["Sentences"].f1 * 100,
        "tags": ud_scores["XPOS"].f1 * 100,
        "uas": ud_scores["UAS"].f1 * 100,
        "las": ud_scores["LAS"].f1 * 100,
    }
    header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"]
    if itn == 0:
        print("\t".join(header))
    tpl = "\t".join(
        (
            "{:d}",
            "{dep_loss:.1f}",
            "{las:.1f}",
            "{uas:.1f}",
            "{tags:.1f}",
            "{sents:.1f}",
            "{words:.1f}",
        )
    )
    print(tpl.format(itn, **fields))


# def get_sent_conllu(sent, sent_id):
#    lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]


def get_token_conllu(token, i):
    if token._.begins_fused:
        n = 1
        while token.nbor(n)._.inside_fused:
            n += 1
        id_ = "%d-%d" % (i, i + n)
        lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
    else:
        lines = []
    if token.head.i == token.i:
        head = 0
    else:
        head = i + (token.head.i - token.i) + 1
    fields = [
        str(i + 1),
        token.text,
        token.lemma_,
        token.pos_,
        token.tag_,
        "_",
        str(head),
        token.dep_.lower(),
        "_",
        "_",
    ]
    lines.append("\t".join(fields))
    return "\n".join(lines)


##################
# Initialization #
##################
def load_nlp(corpus, config):
    lang = corpus.split("_")[0]
    nlp = spacy.blank(lang)
    if config.vectors:
        nlp.vocab.from_disk(config.vectors / "vocab")
    return nlp


def initialize_pipeline(nlp, docs, golds, config):
    nlp.add_pipe(nlp.create_pipe("parser"))
    if config.multitask_tag:
        nlp.parser.add_multitask_objective("tag")
    if config.multitask_sent:
        nlp.parser.add_multitask_objective("sent_start")
    nlp.parser.moves.add_action(2, "subtok")
    nlp.add_pipe(nlp.create_pipe("tagger"))
    for gold in golds:
        for tag in gold.tags:
            if tag is not None:
                nlp.tagger.add_label(tag)
    # Replace labels that didn't make the frequency cutoff
    actions = set(nlp.parser.labels)
    label_set = set([act.split("-")[1] for act in actions if "-" in act])
    for gold in golds:
        for i, label in enumerate(gold.labels):
            if label is not None and label not in label_set:
                gold.labels[i] = label.split("||")[0]
    return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))


########################
# Command line helpers #
########################
@attr.s
class Config(object):
vectors = attr.ib(default=None)
max_doc_length = attr.ib(default=10)
multitask_tag = attr.ib(default=True)
multitask_sent = attr.ib(default=True)
nr_epoch = attr.ib(default=30)
batch_size = attr.ib(default=1000)
dropout = attr.ib(default=0.2)
@classmethod
def load(cls, loc):
with Path(loc).open("r", encoding="utf8") as file_:
cfg = json.load(file_)
return cls(**cfg)
class Dataset(object):
def __init__(self, path, section):
self.path = path
self.section = section
self.conllu = None
self.text = None
for file_path in self.path.iterdir():
name = file_path.parts[-1]
if section in name and name.endswith("conllu"):
self.conllu = file_path
elif section in name and name.endswith("txt"):
self.text = file_path
if self.conllu is None:
msg = "Could not find .conllu file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
if self.text is None:
msg = "Could not find .txt file in {path} for {section}"
raise IOError(msg.format(section=section, path=path))
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
class TreebankPaths(object):
def __init__(self, ud_path, treebank, **cfg):
self.train = Dataset(ud_path / treebank, "train")
self.dev = Dataset(ud_path / treebank, "dev")
self.lang = self.train.lang
@plac.annotations(
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
parses_dir=("Directory to write the development parses", "positional", None, Path),
config=("Path to json formatted config file", "positional", None, Config.load),
corpus=(
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
"positional",
None,
str,
),
limit=("Size limit", "option", "n", int),
)
def main(ud_dir, parses_dir, config, corpus, limit=0):
Token.set_extension("get_conllu_lines", method=get_token_conllu)
Token.set_extension("begins_fused", default=False)
Token.set_extension("inside_fused", default=False)
paths = TreebankPaths(ud_dir, corpus)
if not (parses_dir / corpus).exists():
(parses_dir / corpus).mkdir()
print("Train and evaluate", corpus, "using lang", paths.lang)
nlp = load_nlp(paths.lang, config)
docs, golds = read_data(
nlp,
paths.train.conllu.open(encoding="utf8"),
paths.train.text.open(encoding="utf8"),
max_doc_length=config.max_doc_length,
limit=limit,
)
optimizer = initialize_pipeline(nlp, docs, golds, config)
for i in range(config.nr_epoch):
docs = [nlp.make_doc(doc.text) for doc in docs]
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
losses = {}
n_train_words = sum(len(doc) for doc in docs)
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
for batch in batches:
batch_docs, batch_gold = zip(*batch)
pbar.update(sum(len(doc) for doc in batch_docs))
nlp.update(
batch_docs,
batch_gold,
sgd=optimizer,
drop=config.dropout,
losses=losses,
)
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
with nlp.use_params(optimizer.averages):
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
print_progress(i, losses, scores)
if __name__ == "__main__":
plac.call(main)
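
A minimal, spaCy-free sketch of the CoNLL-U line that the `fields` list near the top of this file builds. `SimpleToken` and `token_to_conllu` are hypothetical names introduced only for illustration, and the root case (`head = 0` when a token is its own head) is an assumption, since only the `else` branch is visible in this part of the diff. The ten tab-separated columns follow the CoNLL-U order: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token; not part of the original script.
SimpleToken = namedtuple("SimpleToken", "i text lemma pos tag head_i dep")


def token_to_conllu(i, token):
    """Format one token as a CoNLL-U line; `i` is its 0-based index in the sentence."""
    if token.head_i == token.i:
        head = 0  # assumption: the root is attached to the artificial ID 0
    else:
        head = i + (token.head_i - token.i) + 1  # same arithmetic as the script
    fields = [
        str(i + 1),         # ID (1-based)
        token.text,         # FORM
        token.lemma,        # LEMMA
        token.pos,          # UPOS
        token.tag,          # XPOS
        "_",                # FEATS (left empty)
        str(head),          # HEAD
        token.dep.lower(),  # DEPREL
        "_",                # DEPS (left empty)
        "_",                # MISC (left empty)
    ]
    return "\t".join(fields)


sentence = [
    SimpleToken(0, "Dogs", "dog", "NOUN", "NNS", 1, "nsubj"),
    SimpleToken(1, "bark", "bark", "VERB", "VBP", 1, "ROOT"),
]
lines = [token_to_conllu(i, tok) for i, tok in enumerate(sentence)]
print("\n".join(lines))
```

The HEAD column is 1-based, which is why the script adds 1 to the relative offset.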

View File

@ -1,114 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of defining a knowledge base in spaCy,
which is needed to implement entity linking functionality.
For more details, see the documentation:
* Knowledge base: https://spacy.io/api/kb
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking
Compatible with: spaCy v2.2.4
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function
import plac
from pathlib import Path
from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase
# Q2146908 (Russ Cochran): American golfer
# Q7381115 (Russ Cochran): publisher
ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
@plac.annotations(
model=("Model name, should have pretrained word embeddings", "positional", None, str),
output_dir=("Optional output directory", "option", "o", Path),
)
def main(model=None, output_dir=None):
"""Load the model and create the KB with pre-defined entity encodings.
If an output_dir is provided, the KB will be stored there in a file 'kb'.
The updated vocab will also be written to a directory in the output_dir."""
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
# check the length of the nlp vectors
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
raise ValueError(
"The `nlp` object should have access to pretrained word vectors, "
"cf. https://spacy.io/usage/models#languages."
)
# You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality.
# For simplicity, we'll just use the original vector dimension here instead.
vectors_dim = nlp.vocab.vectors.shape[1]
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)
# set up the data
entity_ids = []
descr_embeddings = []
freqs = []
for key, value in ENTITIES.items():
desc, freq = value
entity_ids.append(key)
descr_embeddings.append(nlp(desc).vector)
freqs.append(freq)
# set the entities, can also be done by calling `kb.add_entity` for each entity
kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=descr_embeddings)
# adding aliases, the entities need to be defined in the KB beforehand
kb.add_alias(
alias="Russ Cochran",
entities=["Q2146908", "Q7381115"],
probabilities=[0.24, 0.7], # the sum of these probabilities should not exceed 1
)
# test the trained model
print()
_print_kb(kb)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
kb_path = str(output_dir / "kb")
kb.dump(kb_path)
print()
print("Saved KB to", kb_path)
vocab_path = output_dir / "vocab"
kb.vocab.to_disk(vocab_path)
print("Saved vocab to", vocab_path)
print()
# test the saved model
# always reload a knowledge base with the same vocab instance!
print("Loading vocab from", vocab_path)
print("Loading KB from", kb_path)
vocab2 = Vocab().from_disk(vocab_path)
kb2 = KnowledgeBase(vocab=vocab2)
kb2.load_bulk(kb_path)
print()
_print_kb(kb2)
def _print_kb(kb):
print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())
if __name__ == "__main__":
plac.call(main)
# Expected output:
# 2 kb entities: ['Q2146908', 'Q7381115']
# 1 kb aliases: ['Russ Cochran']
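
The comments around `kb.add_alias` state two constraints: the entities must already be in the KB, and the prior probabilities should not sum to more than 1. A minimal sketch of those checks in plain Python follows; `check_alias` is a hypothetical helper written for this illustration and is not part of the spaCy API.

```python
def check_alias(entities, probabilities, known_entities, tolerance=1e-8):
    """Validate alias data before handing it to a knowledge base (illustrative only)."""
    if len(entities) != len(probabilities):
        raise ValueError("Need exactly one prior probability per entity")
    unknown = [ent for ent in entities if ent not in known_entities]
    if unknown:
        raise ValueError("Entities not defined in the KB: {}".format(unknown))
    total = sum(probabilities)
    if total > 1.0 + tolerance:
        raise ValueError("Prior probabilities sum to {}, which exceeds 1".format(total))
    return total


known = {"Q2146908", "Q7381115"}
total = check_alias(["Q2146908", "Q7381115"], [0.24, 0.7], known)
print("Sum of priors:", round(total, 2))
```

The remaining probability mass (about 0.06 here) is left for the case where the alias refers to none of the listed entities.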

View File

@ -1,89 +0,0 @@
"""This example shows how to add a multi-task objective that is trained
alongside the entity recognizer. This is an alternative to adding features
to the model.
The multi-task idea is to train an auxiliary model to predict some attribute,
with weights shared between the auxiliary model and the main model. In this
example, we're predicting the position of the word in the document.
The model that predicts the position of the word encourages the convolutional
layers to include the position information in their representation. The
information is then available to the main model, as a feature.
The overall idea is that we might know something about what sort of features
we'd like the CNN to extract. The multi-task objectives can encourage the
extraction of this type of feature. The multi-task objective is only used
during training. We discard the auxiliary model before run-time.
The specific example here is not necessarily a good idea --- but it shows
how an arbitrary objective function for some word can be used.
Developed and tested for spaCy 2.0.6. Updated for v2.2.2
"""
import random
import plac
import spacy
import os.path
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse
random.seed(0)
PWD = os.path.dirname(__file__)
TRAIN_DATA = list(read_json_file(
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
def get_position_label(i, words, tags, heads, labels, ents):
"""Return labels indicating the position of the word in the document.
"""
if len(words) < 20:
return "short-doc"
elif i == 0:
return "first-word"
elif i < 10:
return "early-word"
elif i < 20:
return "mid-word"
elif i == len(words) - 1:
return "last-word"
else:
return "late-word"
def main(n_iter=10):
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
ner.add_multitask_objective(get_position_label)
nlp.add_pipe(ner)
print(nlp.pipeline)
print("Create data", len(TRAIN_DATA))
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annot_brackets in TRAIN_DATA:
for annotations, _ in annot_brackets:
doc = Doc(nlp.vocab, words=annotations[1])
gold = GoldParse.from_annot_tuples(doc, annotations)
nlp.update(
[doc], # batch of texts
[gold], # batch of annotations
drop=0.2, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses,
)
print(losses.get("nn_labeller", 0.0), losses["ner"])
# test the trained model
for text, _ in TRAIN_DATA:
if text is not None:
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == "__main__":
plac.call(main)
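
To see which auxiliary labels the position objective produces, the branching of `get_position_label` can be run on its own. This is a sketch that keeps only the two values the function actually uses (the token index and the document length); `position_label` is a hypothetical reduced version, not the function from the script.

```python
from collections import Counter


def position_label(i, n_words):
    """Same branching as get_position_label, reduced to index and length."""
    if n_words < 20:
        return "short-doc"
    elif i == 0:
        return "first-word"
    elif i < 10:
        return "early-word"
    elif i < 20:
        return "mid-word"
    elif i == n_words - 1:
        return "last-word"
    else:
        return "late-word"


# Label distribution for a 25-word document
counts_25 = Counter(position_label(i, 25) for i in range(25))
print(dict(counts_25))

# Edge case: in a document of exactly 20 words, the final token has i == 19,
# so the "mid-word" branch fires before the "last-word" check is reached.
print(position_label(19, 20))
```

Because the branches are checked in order, "last-word" can only be assigned in documents longer than 20 words.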

View File

@ -1,217 +0,0 @@
"""This script is experimental.
Try pre-training the CNN component of the text categorizer using a cheap
language modelling-like objective. Specifically, we load pretrained vectors
(from something like word2vec, GloVe, FastText etc), and use the CNN to
predict the tokens' pretrained vectors. This isn't as easy as it sounds:
we're not merely doing compression here, because heavy dropout is applied,
including over the input words. This means the model must often (50% of the time)
use the context in order to predict the word.
To evaluate the technique, we're pre-training with the 50k texts from the IMDB
corpus, and then training with only 100 labels. Note that it's a bit dirty to
pre-train with the development data, but also not *so* terrible: we're not using
the development labels, after all --- only the unlabelled text.
"""
import plac
import tqdm
import random
import spacy
import thinc.extra.datasets
from spacy.util import minibatch, use_gpu, compounding
from spacy._ml import Tok2Vec
from spacy.pipeline import TextCategorizer
import numpy
def load_texts(limit=0):
train, dev = thinc.extra.datasets.imdb()
train_texts, train_labels = zip(*train)
dev_texts, dev_labels = zip(*dev)
train_texts = list(train_texts)
dev_texts = list(dev_texts)
random.shuffle(train_texts)
random.shuffle(dev_texts)
if limit >= 1:
return train_texts[:limit]
else:
return list(train_texts) + list(dev_texts)
def load_textcat_data(limit=0):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, eval_data = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
eval_texts, eval_labels = zip(*eval_data)
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels]
return (texts, cats), (eval_texts, eval_cats)
def prefer_gpu():
used = spacy.util.use_gpu(0)
if used is None:
return False
else:
import cupy.random
cupy.random.seed(0)
return True
def build_textcat_model(tok2vec, nr_class, width):
from thinc.v2v import Model, Softmax, Maxout
from thinc.api import flatten_add_lengths, chain
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
from thinc.misc import Residual, LayerNorm
from spacy._ml import logistic, zero_init
with Model.define_operators({">>": chain}):
model = (
tok2vec
>> flatten_add_lengths
>> Pooling(mean_pool)
>> Softmax(nr_class, width)
)
model.tok2vec = tok2vec
return model
def block_gradients(model):
from thinc.api import wrap
def forward(X, drop=0.0):
Y, _ = model.begin_update(X, drop=drop)
return Y, None
return wrap(forward, model)
def create_pipeline(width, embed_size, vectors_model):
print("Load vectors")
nlp = spacy.load(vectors_model)
print("Start training")
textcat = TextCategorizer(
nlp.vocab,
labels=["POSITIVE", "NEGATIVE"],
model=build_textcat_model(
Tok2Vec(width=width, embed_size=embed_size), 2, width
),
)
nlp.add_pipe(textcat)
return nlp
def train_tensorizer(nlp, texts, dropout, n_iter):
tensorizer = nlp.create_pipe("tensorizer")
nlp.add_pipe(tensorizer)
optimizer = nlp.begin_training()
for i in range(n_iter):
losses = {}
for batch in minibatch(tqdm.tqdm(texts)):
docs = [nlp.make_doc(text) for text in batch]
tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
print(losses)
return optimizer
def train_textcat(nlp, n_texts, n_iter=10):
textcat = nlp.get_pipe("textcat")
tok2vec_weights = textcat.model.tok2vec.to_bytes()
(train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts)
print(
"Using {} examples ({} training, {} evaluation)".format(
n_texts, len(train_texts), len(dev_texts)
)
)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
# get names of other pipes to disable them during training
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training()
textcat.model.tok2vec.from_bytes(tok2vec_weights)
print("Training the model...")
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
for i in range(n_iter):
losses = {"textcat": 0.0}
# batch up the examples using spaCy's minibatch
batches = minibatch(tqdm.tqdm(train_data), size=2)
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data returned by load_textcat_data()
scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats)
print(
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
losses["textcat"],
scores["textcat_p"],
scores["textcat_r"],
scores["textcat_f"],
)
)
def evaluate_textcat(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 1e-8
fp = 1e-8
tn = 1e-8
fn = 1e-8
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
if label not in gold:
continue
if score >= 0.5 and gold[label] >= 0.5:
tp += 1.0
elif score >= 0.5 and gold[label] < 0.5:
fp += 1.0
elif score < 0.5 and gold[label] < 0.5:
tn += 1
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * (precision * recall) / (precision + recall)
return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
@plac.annotations(
width=("Width of CNN layers", "positional", None, int),
embed_size=("Embedding rows", "positional", None, int),
pretrain_iters=("Number of iterations to pretrain", "option", "pn", int),
train_iters=("Number of iterations to train", "option", "tn", int),
train_examples=("Number of labelled examples", "option", "eg", int),
vectors_model=("Name or path to vectors model to learn from"),
)
def main(
width,
embed_size,
vectors_model,
pretrain_iters=30,
train_iters=30,
train_examples=1000,
):
random.seed(0)
numpy.random.seed(0)
use_gpu = prefer_gpu()
print("Using GPU?", use_gpu)
nlp = create_pipeline(width, embed_size, vectors_model)
print("Load data")
texts = load_texts(limit=0)
print("Train tensorizer")
optimizer = train_tensorizer(nlp, texts, dropout=0.2, n_iter=pretrain_iters)
print("Train textcat")
train_textcat(nlp, train_examples, n_iter=train_iters)
if __name__ == "__main__":
plac.call(main)
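
The scoring in `evaluate_textcat` can be followed on a handful of numbers. The sketch below applies the same 0.5 threshold and the same small epsilon (which avoids division by zero) to one label; `prf` is a hypothetical helper and the scores are made-up illustration values, not results from the IMDB experiment.

```python
def prf(scores, gold, threshold=0.5, eps=1e-8):
    """Precision, recall and F-score for one label, counted as in evaluate_textcat."""
    tp = fp = fn = eps  # epsilon start values keep the divisions defined
    for score, is_positive in zip(scores, gold):
        if score >= threshold and is_positive:
            tp += 1.0
        elif score >= threshold and not is_positive:
            fp += 1.0
        elif score < threshold and is_positive:
            fn += 1.0
        # true negatives do not enter precision or recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# One true positive, one false positive, one false negative, one true negative
precision, recall, f_score = prf([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
print("{:.3f}\t{:.3f}\t{:.3f}".format(precision, recall, f_score))
```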

View File

@ -1,97 +0,0 @@
"""Prevent catastrophic forgetting with rehearsal updates."""
import plac
import random
import warnings
import srsly
import spacy
from spacy.gold import GoldParse
from spacy.util import minibatch, compounding
LABEL = "ANIMAL"
TRAIN_DATA = [
(
"Horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, "ANIMAL")]},
),
("Do they bite?", {"entities": []}),
(
"horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, "ANIMAL")]},
),
("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
(
"they pretend to care about your feelings, those horses",
{"entities": [(48, 54, "ANIMAL")]},
),
("horses?", {"entities": [(0, 6, "ANIMAL")]}),
]
def read_raw_data(nlp, jsonl_loc):
for json_obj in srsly.read_jsonl(jsonl_loc):
if json_obj["text"].strip():
doc = nlp.make_doc(json_obj["text"])
yield doc
def read_gold_data(nlp, gold_loc):
docs = []
golds = []
for json_obj in srsly.read_jsonl(gold_loc):
doc = nlp.make_doc(json_obj["text"])
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
gold = GoldParse(doc, entities=ents)
docs.append(doc)
golds.append(gold)
return list(zip(docs, golds))
def main(model_name, unlabelled_loc):
n_iter = 10
dropout = 0.2
batch_size = 4
nlp = spacy.load(model_name)
nlp.get_pipe("ner").add_label(LABEL)
raw_docs = list(read_raw_data(nlp, unlabelled_loc))
optimizer = nlp.resume_training()
# Avoid use of Adam when resuming training. I don't understand this well
# yet, but I'm getting weird results from Adam. Try commenting out the
# nlp.update(), and using Adam -- you'll find the models drift apart.
# I guess Adam is losing precision, introducing gradient noise?
optimizer.alpha = 0.1
optimizer.b1 = 0.0
optimizer.b2 = 0.0
# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
sizes = compounding(1.0, 4.0, 1.001)
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module='spacy')
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
random.shuffle(raw_docs)
losses = {}
r_losses = {}
# batch up the examples using spaCy's minibatch
raw_batches = minibatch(raw_docs, size=4)
for batch in minibatch(TRAIN_DATA, size=sizes):
docs, golds = zip(*batch)
nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses)
raw_batch = list(next(raw_batches))
nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
print("Losses", losses)
print("R. Losses", r_losses)
print(nlp.get_pipe("ner").model.unseen_classes)
test_text = "Do you like horses?"
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
if __name__ == "__main__":
plac.call(main)
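
The script draws its batch sizes from `compounding(1.0, 4.0, 1.001)`. As a rough illustration of what such a schedule does, here is a plain-Python re-implementation; `compounding_sketch` and `minibatch_sketch` are hypothetical names, written under the assumption that the schedule grows (start below stop), and they are not the functions from `spacy.util`. A larger factor of 1.5 is used so the growth is visible within a few steps.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... capped at stop (assumes start <= stop)."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into batches whose lengths follow the sizes generator."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


sizes_preview = list(islice(compounding_sketch(1.0, 4.0, 1.5), 6))
batch_lengths = [
    len(batch)
    for batch in minibatch_sketch(range(6), compounding_sketch(1.0, 4.0, 1.5))
]
print(sizes_preview)
print(batch_lengths)
```

With the script's factor of 1.001 the batch size stays at 1 for several hundred steps, so on six training examples every batch holds a single example.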

View File

@ -1,177 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy's entity linker, starting off with a predefined
knowledge base and corresponding vocab, and a blank English model.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking
Compatible with: spaCy v2.2.4
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase
from spacy.pipeline import EntityRuler
from spacy.util import minibatch, compounding
def sample_train_data():
train_data = []
# Q2146908 (Russ Cochran): American golfer
# Q7381115 (Russ Cochran): publisher
text_1 = "Russ Cochran his reprints include EC Comics."
dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
train_data.append((text_1, {"links": dict_1}))
text_2 = "Russ Cochran has been publishing comic art."
dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
train_data.append((text_2, {"links": dict_2}))
text_3 = "Russ Cochran captured his first major title with his son as caddie."
dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
train_data.append((text_3, {"links": dict_3}))
text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
train_data.append((text_4, {"links": dict_4}))
return train_data
# training data
TRAIN_DATA = sample_train_data()
@plac.annotations(
kb_path=("Path to the knowledge base", "positional", None, Path),
vocab_path=("Path to the vocab for the kb", "positional", None, Path),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(kb_path, vocab_path, output_dir=None, n_iter=50):
"""Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
The `vocab` should be the one used during creation of the KB."""
# create blank English model with correct vocab
nlp = spacy.blank("en")
nlp.vocab.from_disk(vocab_path)
nlp.vocab.vectors.name = "spacy_pretrained_vectors"
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
nlp.add_pipe(nlp.create_pipe('sentencizer'))
# Add a custom component to recognize "Russ Cochran" as an entity for the example training data.
# Note that in a realistic application, an actual NER algorithm should be used instead.
ruler = EntityRuler(nlp)
patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
# Create the Entity Linker component and add it to the pipeline.
if "entity_linker" not in nlp.pipe_names:
# use only the predicted EL score and not the prior probability (for demo purposes)
cfg = {"incl_prior": False}
entity_linker = nlp.create_pipe("entity_linker", cfg)
kb = KnowledgeBase(vocab=nlp.vocab)
kb.load_bulk(kb_path)
print("Loaded Knowledge Base from '%s'" % kb_path)
entity_linker.set_kb(kb)
nlp.add_pipe(entity_linker, last=True)
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
TRAIN_DOCS = []
for text, annotation in TRAIN_DATA:
with nlp.disable_pipes("entity_linker"):
doc = nlp(text)
annotation_clean = annotation
for offset, kb_id_dict in annotation["links"].items():
new_dict = {}
for kb_id, value in kb_id_dict.items():
if kb_id in kb_ids:
new_dict[kb_id] = value
else:
print(
"Removed", kb_id, "from training because it is not in the KB."
)
annotation_clean["links"][offset] = new_dict
TRAIN_DOCS.append((doc, annotation_clean))
# get names of other pipes to disable them during training
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train entity linker
# reset and initialize the weights randomly
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DOCS)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(
texts, # batch of texts
annotations, # batch of annotations
drop=0.2, # dropout - make it harder to memorise data
losses=losses,
sgd=optimizer,
)
print(itn, "Losses", losses)
# test the trained model
_apply_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print()
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
_apply_model(nlp2)
def _apply_model(nlp):
for text, annotation in TRAIN_DATA:
# apply the entity linker which will now make predictions for the 'Russ Cochran' entities
doc = nlp(text)
print()
print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc])
if __name__ == "__main__":
plac.call(main)
# Expected output (can be shuffled):
# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('his', '', ''), ('reprints', '', ''), ('include', '', ''), ('EC', '', ''), ('Comics', '', ''), ('.', '', '')]
# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('has', '', ''), ('been', '', ''), ('publishing', '', ''), ('comic', '', ''), ('art', '', ''), ('.', '', '')]
# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('captured', '', ''), ('his', '', ''), ('first', '', ''), ('major', '', ''), ('title', '', ''), ('with', '', ''), ('his', '', ''), ('son', '', ''), ('as', '', ''), ('caddie', '', ''), ('.', '', '')]
# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('was', '', ''), ('a', '', ''), ('member', '', ''), ('of', '', ''), ('University', '', ''), ('of', '', ''), ('Kentucky', '', ''), ("'s", '', ''), ('golf', '', ''), ('team', '', ''), ('.', '', '')]
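
Before training, the script drops link annotations whose identifiers are not in the knowledge base. A minimal sketch of that filtering step follows, written so that it returns a new dictionary instead of modifying the training data in place; `clean_links` is a hypothetical helper, and the reduced KB with a single identifier is an assumption made only to show an entry being removed.

```python
def clean_links(annotation, kb_ids):
    """Return a copy of the annotation that keeps only candidates known to the KB."""
    cleaned = {}
    for offset, candidates in annotation["links"].items():
        cleaned[offset] = {
            kb_id: value for kb_id, value in candidates.items() if kb_id in kb_ids
        }
    return {"links": cleaned}


annotation = {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}}
kb_ids = {"Q2146908"}  # assumption: a KB that only knows the golfer
filtered = clean_links(annotation, kb_ids)
print(filtered)
print(annotation)  # the original annotation is left untouched
```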

View File

@ -1,195 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# training data: texts, heads and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
(
"find a cafe with great wifi",
{
"heads": [0, 2, 0, 5, 5, 2], # index of token head
"deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
},
),
(
"find a hotel near the beach",
{
"heads": [0, 2, 0, 5, 5, 2],
"deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
},
),
(
"find me the closest gym that's open late",
{
"heads": [0, 0, 4, 4, 0, 6, 4, 6, 6],
"deps": [
"ROOT",
"-",
"-",
"QUALITY",
"PLACE",
"-",
"-",
"ATTRIBUTE",
"TIME",
],
},
),
(
"show me the cheapest store that sells flowers",
{
"heads": [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"],
},
),
(
"find a nice restaurant in london",
{
"heads": [0, 3, 3, 0, 3, 3],
"deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
},
),
(
"show me the coolest hostel in berlin",
{
"heads": [0, 0, 4, 4, 0, 4, 4],
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"],
},
),
(
"find a good italian restaurant near work",
{
"heads": [0, 4, 4, 4, 0, 4, 5],
"deps": [
"ROOT",
"-",
"QUALITY",
"ATTRIBUTE",
"PLACE",
"ATTRIBUTE",
"LOCATION",
],
},
),
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=15):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# We'll use the built-in dependency parser class, but we want to create a
# fresh instance just in case.
if "parser" in nlp.pipe_names:
nlp.remove_pipe("parser")
parser = nlp.create_pipe("parser")
nlp.add_pipe(parser, first=True)
for text, annotations in TRAIN_DATA:
for dep in annotations.get("deps", []):
parser.add_label(dep)
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
print("Losses", losses)
# test the trained model
test_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)
def test_model(nlp):
texts = [
"find a hotel with good wifi",
"find me the cheapest gym near work",
"show me the best hotel in berlin",
]
docs = nlp.pipe(texts)
for doc in docs:
print(doc.text)
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])
if __name__ == "__main__":
plac.call(main)
# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find'),
# ('near', 'ATTRIBUTE', 'gym'),
# ('work', 'LOCATION', 'near')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]
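
The `heads` and `deps` lists in `TRAIN_DATA` can be decoded into the same (word, relation, head word) triples that `test_model` prints. The sketch below does this for the first training example; `relations` is a hypothetical helper, and splitting on whitespace is an assumption that only holds for sentences such as this one, where whitespace tokens and spaCy tokens coincide (it would not hold for "that's").

```python
def relations(text, heads, deps, skip="-"):
    """Turn per-token head indices and labels into (word, label, head word) triples."""
    words = text.split()  # assumption: whitespace tokenization is sufficient here
    if not (len(words) == len(heads) == len(deps)):
        raise ValueError("heads and deps need exactly one entry per token")
    return [
        (word, dep, words[head])
        for word, head, dep in zip(words, heads, deps)
        if dep != skip
    ]


triples = relations(
    "find a cafe with great wifi",
    heads=[0, 2, 0, 5, 5, 2],
    deps=["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
)
for triple in triples:
    print(triple)
```

The length check is a cheap way to catch annotation mistakes before they reach the parser.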

View File

@ -1,117 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.0.0+
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function
import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# training data
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
ner = nlp.get_pipe("ner")
# add labels
for _, annotations in TRAIN_DATA:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
# only train NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module='spacy')
# reset and initialize the weights randomly but only if we're
# training a new model
if model is None:
nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(
texts, # batch of texts
annotations, # batch of annotations
drop=0.5, # dropout - make it harder to memorise data
losses=losses,
)
print("Losses", losses)
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == "__main__":
plac.call(main)
# Expected output:
# Entities [('Shaka Khan', 'PERSON')]
# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
# ('Khan', 'PERSON', 1), ('?', '', 2)]
# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
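
The training loop in this script draws its batches from `minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))`. As a rough illustration of what that schedule does, here is a minimal pure-Python sketch. `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins written for this note, not spaCy's own implementations. They assume the batch size starts at `start`, is multiplied by `compound` after every batch, and is capped at `stop`.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop.

    Hypothetical stand-in for spacy.util.compounding (assumes start <= stop).
    """
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, size):
    """Group items into lists whose lengths follow the `size` iterator.

    Hypothetical stand-in for spacy.util.minibatch.
    """
    items = iter(items)
    for batch_size in size:
        batch = list(itertools.islice(items, int(batch_size)))
        if not batch:
            return
        yield batch


# Ten dummy examples: the size is still close to `start`, so it truncates to 4.
batches = list(minibatch_sketch(range(10), compounding_sketch(4.0, 32.0, 1.001)))
batch_lengths = [len(batch) for batch in batches]
```

With a compound factor of 1.001 the size grows very slowly, so on a dataset as small as the one in this script every batch effectively has the starting size.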

View File

@ -1,144 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training an additional entity type
This script shows how to add a new entity type to an existing pretrained NER
model. To keep the example short and simple, only a handful of sentences are
provided as examples. In practice, you'll need many more: a few hundred would
be a good start. You will also likely need to mix in examples of other entity
types, which might be obtained by running the entity recognizer over unlabelled
sentences, and adding their annotations to the training set.
The actual training is performed by looping over the examples, and calling
`nlp.update()`. The `update()` method steps through the words of the
input. At each word, it makes a prediction. It then consults the gold-standard
annotations provided with each example, to see whether it was right. If it was
wrong, it adjusts its weights so that the correct action will score higher
next time.
After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.1.0+
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function
import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# new entity label
LABEL = "ANIMAL"
# training data
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
TRAIN_DATA = [
(
"Horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, LABEL)]},
),
("Do they bite?", {"entities": []}),
(
"horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, LABEL)]},
),
("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
(
"they pretend to care about your feelings, those horses",
{"entities": [(48, 54, LABEL)]},
),
("horses?", {"entities": [(0, 6, LABEL)]}),
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
new_model_name=("New model name for model meta.", "option", "nm", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
"""Set up the pipeline and entity recognizer, and train the new entity."""
random.seed(0)
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
ner = nlp.get_pipe("ner")
ner.add_label(LABEL) # add new entity label to entity recognizer
# Adding extraneous labels shouldn't mess anything up
ner.add_label("VEGETABLE")
if model is None:
optimizer = nlp.begin_training()
else:
optimizer = nlp.resume_training()
move_names = list(ner.move_names)
# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
# only train NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
# show warnings for misaligned entity spans once
warnings.filterwarnings("once", category=UserWarning, module='spacy')
sizes = compounding(1.0, 4.0, 1.001)
# batch up the examples using spaCy's minibatch
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
batches = minibatch(TRAIN_DATA, size=sizes)
losses = {}
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
print("Losses", losses)
# test the trained model
test_text = "Do you like horses?"
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.meta["name"] = new_model_name # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
# Check the classes have loaded back consistently
assert nlp2.get_pipe("ner").move_names == move_names
doc2 = nlp2(test_text)
for ent in doc2.ents:
print(ent.label_, ent.text)
if __name__ == "__main__":
plac.call(main)
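
The `entities` annotations in `TRAIN_DATA` are `(start, end, label)` character offsets into the text, so an off-by-one offset silently labels the wrong substring. A minimal sketch for checking the offsets before training follows. `entity_texts` is a hypothetical helper introduced for this note, and it only checks string boundaries, not alignment with spaCy's tokenization.

```python
LABEL = "ANIMAL"

EXAMPLES = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
]


def entity_texts(text, annotations):
    """Return (substring, label) pairs for each character-offset span.

    Raises ValueError if a span is empty or falls outside the text.
    """
    spans = []
    for start, end, label in annotations.get("entities", []):
        if not 0 <= start < end <= len(text):
            raise ValueError("bad span (%d, %d) for %r" % (start, end, text))
        spans.append((text[start:end], label))
    return spans


for text, annotations in EXAMPLES:
    print(entity_texts(text, annotations))
```

Printing the extracted substrings makes it easy to eyeball whether each span covers exactly the intended word.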

View File

@ -1,111 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy dependency parser, starting off with an existing
model or a blank model. For more details, see the documentation:
* Training: https://spacy.io/usage/training
* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# training data
TRAIN_DATA = [
(
"They trade mortgage-backed securities.",
{
"heads": [1, 1, 4, 4, 5, 1, 1],
"deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
},
),
(
"I like London and Berlin.",
{
"heads": [1, 1, 1, 2, 2, 1],
"deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
},
),
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=15):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if "parser" not in nlp.pipe_names:
parser = nlp.create_pipe("parser")
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe("parser")
# add labels to the parser
for _, annotations in TRAIN_DATA:
for dep in annotations.get("deps", []):
parser.add_label(dep)
# get names of other pipes to disable them during training
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
print("Losses", losses)
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
if __name__ == "__main__":
plac.call(main)
# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]
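
The `heads` lists in `TRAIN_DATA` hold absolute token indices, and the token labelled `ROOT` points at itself. The sketch below turns one training example into `(token, dep, head)` triples of the same shape as the expected result. The token list is written by hand and is an assumption about how the sentence is tokenized. `dependency_triples` and `root_indices` are hypothetical helpers, not part of spaCy.

```python
def dependency_triples(tokens, heads, deps):
    """Pair each token with its dependency label and the text of its head."""
    if not (len(tokens) == len(heads) == len(deps)):
        raise ValueError("tokens, heads and deps must have the same length")
    return [(tokens[i], deps[i], tokens[heads[i]]) for i in range(len(tokens))]


def root_indices(heads):
    """Return the indices of tokens that are their own head (the roots)."""
    return [i for i, head in enumerate(heads) if head == i]


# Assumed tokenization of "I like London and Berlin."
tokens = ["I", "like", "London", "and", "Berlin", "."]
heads = [1, 1, 1, 2, 2, 1]
deps = ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"]

triples = dependency_triples(tokens, heads, deps)
```

A length mismatch between the three lists is the most common annotation mistake in this format, which is why the helper checks it first.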

View File

@ -1,101 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults. For more details, see
the documentation:
* Training: https://spacy.io/usage/training
* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging
Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
# See here for the Universal Tag Set:
# http://universaldependencies.github.io/docs/u/pos/index.html
# You may also specify morphological features for your tags, from the universal
# scheme.
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
# Usually you'll read this in, of course. Data formats vary. Ensure your
# strings are unicode and that the number of tags assigned matches spaCy's
# tokenization. If not, you can always add a 'words' key to the annotations
# that specifies the gold-standard tokenization, e.g.:
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
TRAIN_DATA = [
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
("Eat blue ham", {"tags": ["V", "J", "N"]}),
]
@plac.annotations(
lang=("ISO Code of language to use", "option", "l", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int),
)
def main(lang="en", output_dir=None, n_iter=25):
"""Create a new model, set up the pipeline and train the tagger. In order to
train the tagger with a custom tag map, we're creating a new Language
instance with a custom vocab.
"""
nlp = spacy.blank(lang)
# add the tagger to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe("tagger")
# Add the tags. This needs to be done before you start training.
for tag, values in TAG_MAP.items():
tagger.add_label(tag, values)
nlp.add_pipe(tagger)
optimizer = nlp.begin_training()
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
print("Losses", losses)
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == "__main__":
plac.call(main)
# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]
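
The comments in this script note that the number of tags must match the tokenization and that a `words` key can supply the gold-standard tokens. A minimal sketch of that consistency check follows. `check_example` is a hypothetical helper, and the whitespace split it falls back to is an assumption made for this note, since spaCy's tokenizer may split differently.

```python
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}

TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]


def check_example(text, annotations, tag_map):
    """Return (word, tag, pos) triples, or raise if the example is inconsistent.

    Uses annotations["words"] when present, otherwise a plain whitespace split.
    """
    words = annotations.get("words", text.split())
    tags = annotations["tags"]
    if len(words) != len(tags):
        raise ValueError("%d words but %d tags in %r" % (len(words), len(tags), text))
    unknown = sorted(set(tags) - set(tag_map))
    if unknown:
        raise ValueError("tags missing from tag map: %s" % unknown)
    return [(word, tag, tag_map[tag]["pos"]) for word, tag in zip(words, tags)]


checked = [check_example(text, annotations, TAG_MAP) for text, annotations in TRAIN_DATA]
```

Running a check like this over the whole training set before calling `nlp.begin_training()` surfaces misaligned examples early.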

View File

@ -1,160 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
see the documentation:
* Training: https://spacy.io/usage/training
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets
import spacy
from spacy.util import minibatch, compounding
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_texts=("Number of texts to train from", "option", "t", int),
n_iter=("Number of training iterations", "option", "n", int),
init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path),
)
def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None):
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
print("Created blank 'en' model")
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if "textcat" not in nlp.pipe_names:
textcat = nlp.create_pipe(
"textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
)
nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
textcat = nlp.get_pipe("textcat")
# add label to text classifier
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
# load the IMDB dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data()
train_texts = train_texts[:n_texts]
train_cats = train_cats[:n_texts]
print(
"Using {} examples ({} training, {} evaluation)".format(
n_texts, len(train_texts), len(dev_texts)
)
)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
# get names of other pipes to disable them during training
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training()
if init_tok2vec is not None:
with init_tok2vec.open("rb") as file_:
textcat.model.tok2vec.from_bytes(file_.read())
print("Training the model...")
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
batch_sizes = compounding(4.0, 32.0, 1.001)
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
random.shuffle(train_data)
batches = minibatch(train_data, size=batch_sizes)
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print(
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
losses["textcat"],
scores["textcat_p"],
scores["textcat_r"],
scores["textcat_f"],
)
)
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)
if output_dir is not None:
with nlp.use_params(optimizer.averages):
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
print(test_text, doc2.cats)
def load_data(limit=0, split=0.8):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
split = int(len(train_data) * split)
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
def evaluate(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 0.0 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 0.0 # True negatives
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
if label not in gold:
continue
if label == "NEGATIVE":
continue
if score >= 0.5 and gold[label] >= 0.5:
tp += 1.0
elif score >= 0.5 and gold[label] < 0.5:
fp += 1.0
elif score < 0.5 and gold[label] < 0.5:
tn += 1
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
if (precision + recall) == 0:
f_score = 0.0
else:
f_score = 2 * (precision * recall) / (precision + recall)
return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
if __name__ == "__main__":
plac.call(main)
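
The `evaluate()` function in this script counts true and false positives for the `POSITIVE` label at a 0.5 threshold and derives precision, recall and F-score from them. The sketch below isolates that arithmetic for a single binary label. `precision_recall_f` is a hypothetical helper, and unlike the script it guards against division by zero explicitly instead of seeding the counts with `1e-8`.

```python
def precision_recall_f(scores, gold, threshold=0.5):
    """Compute precision, recall and F1 for one binary label.

    scores: predicted probabilities for the positive label.
    gold:   matching list of booleans (True = positive).
    """
    tp = fp = fn = 0
    for score, is_positive in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted and not is_positive:
            fp += 1
        elif not predicted and is_positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f_score}


# Two true positives, one false positive, one false negative.
result = precision_recall_f([0.9, 0.8, 0.3, 0.6], [True, False, True, True])
```

True negatives do not enter any of the three metrics, which is why the helper does not count them.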

View File

@ -1,49 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language
@plac.annotations(
vectors_loc=("Path to .vec file", "positional", None, str),
lang=(
"Optional language ID. If not set, blank Language() will be used.",
"positional",
None,
str,
),
)
def main(vectors_loc, lang=None):
if lang is None:
nlp = Language()
else:
# Create an empty language class. This is required if you're planning to
# save the model to disk and load it back later (models always need a
# "lang" setting). Use 'xx' for a blank multi-language class.
nlp = spacy.blank(lang)
with open(vectors_loc, "rb") as file_:
header = file_.readline()
nr_row, nr_dim = header.split()
nlp.vocab.reset_vectors(width=int(nr_dim))
for line in file_:
line = line.rstrip().decode("utf8")
pieces = line.rsplit(" ", int(nr_dim))
word = pieces[0]
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
# test the vectors and similarity
text = "class colspan"
doc = nlp(text)
print(text, doc[0].similarity(doc[1]))
if __name__ == "__main__":
plac.call(main)
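
The loading loop in this script relies on the `.vec` layout: a header line with the row count and the vector width, followed by one word and its values per line. The sketch below parses the same layout from an in-memory stream using only the standard library. `read_vec` is a hypothetical helper, the sample data is made up, and plain lists stand in for the numpy arrays the script builds.

```python
import io


def read_vec(file_):
    """Parse a fastText-style .vec stream opened in binary mode.

    The first line holds "<rows> <dims>"; every other line is a word followed
    by <dims> space-separated floats. Returns (dims, {word: [floats]}).
    """
    header = file_.readline()
    nr_row, nr_dim = (int(value) for value in header.split())
    vectors = {}
    for line in file_:
        line = line.rstrip().decode("utf8")
        # Split from the right so a key that itself contains spaces stays whole.
        pieces = line.rsplit(" ", nr_dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_dim, vectors


# Tiny made-up sample standing in for a real .vec file.
sample = io.BytesIO(b"2 3\nclass 0.1 0.2 0.3\nnew york 1.0 2.0 3.0\n")
dims, vectors = read_vec(sample)
```

Splitting with `rsplit` and a fixed count is the detail worth noticing: it takes exactly `dims` values from the end of the line and leaves everything before them as the key.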

View File

@ -1,105 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Visualize spaCy word vectors in Tensorboard.
Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
"""
from __future__ import unicode_literals
from os import path
import tqdm
import math
import numpy
import plac
import spacy
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins.projector import (
visualize_embeddings,
ProjectorConfig,
)
@plac.annotations(
vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
out_loc=(
"Path to output folder for tensorboard session data",
"positional",
None,
str,
),
name=(
"Human readable name for tsv file and vectors tensor",
"positional",
None,
str,
),
)
def main(vectors_loc, out_loc, name="spaCy_vectors"):
meta_file = "{}.tsv".format(name)
out_meta_file = path.join(out_loc, meta_file)
print("Loading spaCy vectors model: {}".format(vectors_loc))
model = spacy.load(vectors_loc)
print("Finding lexemes with vectors attached: {}".format(vectors_loc))
strings_stream = tqdm.tqdm(
model.vocab.strings, total=len(model.vocab.strings), leave=False
)
queries = [w for w in strings_stream if model.vocab.has_vector(w)]
vector_count = len(queries)
print(
"Building Tensorboard Projector metadata for ({}) vectors: {}".format(
vector_count, out_meta_file
)
)
# Store vector data in a tensorflow variable
tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1]))
# Write a tab-separated file that contains information about the vectors for visualization
#
# Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
with open(out_meta_file, "wb") as file_metadata:
# Define columns in the first row
file_metadata.write("Text\tFrequency\n".encode("utf-8"))
# Write out a row for each vector that we add to the tensorflow variable we created
vec_index = 0
for text in tqdm.tqdm(queries, total=len(queries), leave=False):
# https://github.com/tensorflow/tensorflow/issues/9094
text = "<Space>" if text.lstrip() == "" else text
lex = model.vocab[text]
# Store vector data and metadata
tf_vectors_variable[vec_index] = model.vocab.get_vector(text)
file_metadata.write(
"{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode(
"utf-8"
)
)
vec_index += 1
print("Running Tensorflow Session...")
sess = tf.InteractiveSession()
tf.Variable(tf_vectors_variable, trainable=False, name=name)
tf.global_variables_initializer().run()
saver = tf.train.Saver()
writer = tf.summary.FileWriter(out_loc, sess.graph)
# Link the embeddings into the config
config = ProjectorConfig()
embed = config.embeddings.add()
embed.tensor_name = name
embed.metadata_path = meta_file
# Tell the projector about the configured embeddings and metadata file
visualize_embeddings(writer, config)
# Save session and print run command to the output
print("Saving Tensorboard Session...")
saver.save(sess, path.join(out_loc, "{}.ckpt".format(name)))
print("Done. Run `tensorboard --logdir={0}` to view in Tensorboard".format(out_loc))
if __name__ == "__main__":
plac.call(main)
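
The metadata file written by this script is a two-column tab-separated table whose rows must line up with the rows of the vector matrix. The sketch below reproduces only that table-writing step. `write_metadata` is a hypothetical helper that writes text to any file-like object, whereas the script encodes each row to UTF-8 bytes, and the log-probabilities in the sample are made up.

```python
import io
import math


def write_metadata(file_, entries):
    """Write a "Text<TAB>Frequency" table for (text, log_prob) entries.

    Whitespace-only texts are written as "<Space>" so the row is not blank.
    Frequency is exp(log_prob) scaled by the number of entries.
    """
    file_.write("Text\tFrequency\n")
    for text, log_prob in entries:
        label = "<Space>" if text.lstrip() == "" else text
        file_.write("{}\t{}\n".format(label, math.exp(log_prob) * len(entries)))


buffer = io.StringIO()
write_metadata(buffer, [("horse", 0.0), (" ", 0.0)])
lines = buffer.getvalue().splitlines()
```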

View File

@ -5,16 +5,17 @@ from spacy.gold import docs_to_json
import srsly
import sys
@plac.annotations(
model=("Model name. Defaults to 'en'.", "option", "m", str),
input_file=("Input file (jsonl)", "positional", None, Path),
output_dir=("Output directory", "positional", None, Path),
n_texts=("Number of texts to convert", "option", "t", int),
)
-def convert(model='en', input_file=None, output_dir=None, n_texts=0):
+def convert(model="en", input_file=None, output_dir=None, n_texts=0):
# Load model with tokenizer + sentencizer only
nlp = spacy.load(model)
-nlp.disable_pipes(*nlp.pipe_names)
+nlp.select_pipes(disable=nlp.pipe_names)
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer, first=True)
@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0):
srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])
if __name__ == "__main__":
plac.call(convert)

View File

@ -0,0 +1,133 @@
[paths]
train = ""
dev = ""
raw = null
init_tok2vec = null
[system]
seed = 0
use_pytorch_for_gpu_memory = false
[training]
seed = ${system:seed}
dropout = 0.1
init_tok2vec = ${paths:init_tok2vec}
vectors = null
accumulate_gradient = 1
max_steps = 0
max_epochs = 0
patience = 10000
eval_frequency = 200
score_weights = {"dep_las": 0.4, "ents_f": 0.4, "tag_acc": 0.2}
frozen_components = []
[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = true
max_length = 0
limit = 0
[training.dev_corpus]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
gold_preproc = ${training.train_corpus:gold_preproc}
max_length = 0
limit = 0
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 1e-8
learn_rate = 0.001
[nlp]
lang = "en"
load_vocab_data = false
pipeline = ["tok2vec", "ner", "tagger", "parser"]
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
[nlp.lemmatizer]
@lemmatizers = "spacy.Lemmatizer.v1"
[components]
[components.tok2vec]
factory = "tok2vec"
[components.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
[components.tagger]
factory = "tagger"
[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 8
hidden_width = 128
maxout_pieces = 2
use_upper = true
[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 2
use_upper = true
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
rows = 2000
also_embed_subwords = true
also_use_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
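
Values such as `${system:seed}` and `${components.tok2vec.model.encode:width}` in this config refer to a key in another section, so a single setting can be shared by several blocks. The sketch below shows one way such references could be followed over a nested dictionary. `resolve` is a hypothetical helper written for this note, not the config parser that spaCy uses, and the dictionary is hand-built from a few entries of the file.

```python
import re

REFERENCE = re.compile(r"^\$\{([\w.]+):(\w+)\}$")


def resolve(config, value):
    """Follow "${section.path:key}" references until a plain value is reached."""
    while isinstance(value, str):
        match = REFERENCE.match(value)
        if match is None:
            break
        section_path, key = match.groups()
        section = config
        for name in section_path.split("."):
            section = section[name]  # KeyError if the section does not exist
        value = section[key]
    return value


# Hand-built nested dict mirroring a few entries of the config.
config = {
    "system": {"seed": 0},
    "training": {"seed": "${system:seed}"},
    "components": {
        "tok2vec": {"model": {"encode": {"width": 96}}},
        "tagger": {
            "model": {
                "tok2vec": {"width": "${components.tok2vec.model.encode:width}"}
            }
        },
    },
}

seed = resolve(config, config["training"]["seed"])
width = resolve(config, config["components"]["tagger"]["model"]["tok2vec"]["width"])
```

A reference to a section name that does not exist fails with a `KeyError` in this sketch, which is the kind of mistake to look for when sections are renamed.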

View File

@ -0,0 +1,152 @@
# Training hyper-parameters and additional features.
[training]
# Whether to train on sequences with 'gold standard' sentence boundaries
# and tokens. If you set this to true, take care to ensure your run-time
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 0
limit = 0
# Data augmentation
orth_variant_level = 0.0
dropout = 0.1
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 400
# Other settings
seed = 0
accumulate_gradient = 1
use_pytorch_for_gpu_memory = false
# Control how scores are printed and checkpoints are evaluated.
scores = ["speed", "tags_acc", "uas", "las", "ents_f"]
score_weights = {"las": 0.4, "ents_f": 0.4, "tags_acc": 0.2}
# These settings are invalid for the transformer models.
init_tok2vec = null
discard_oversize = false
omit_extra_lookups = false
batch_by = "words"
use_gpu = -1
raw_text = null
tag_map = null
[training.batch_size]
@schedules = "compounding.v1"
start = 1000
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
[pretraining]
max_epochs = 1000
min_length = 5
max_length = 500
dropout = 0.2
n_save_every = null
batch_size = 3000
seed = ${training:seed}
use_pytorch_for_gpu_memory = ${training:use_pytorch_for_gpu_memory}
tok2vec_model = "nlp.pipeline.tok2vec.model"
[pretraining.objective]
type = "characters"
n_characters = 4
[pretraining.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
[nlp]
lang = "en"
vectors = null
base_model = null
[nlp.pipeline]
[nlp.pipeline.tok2vec]
factory = "tok2vec"
[nlp.pipeline.senter]
factory = "senter"
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.tagger]
factory = "tagger"
[nlp.pipeline.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.senter.model]
@architectures = "spacy.Tagger.v1"
[nlp.pipeline.senter.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.tagger.model]
@architectures = "spacy.Tagger.v1"
[nlp.pipeline.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 8
hidden_width = 128
maxout_pieces = 3
use_upper = false
[nlp.pipeline.parser.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 128
maxout_pieces = 3
use_upper = false
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.tok2vec.model]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 256
depth = 6
window_size = 1
embed_size = 10000
maxout_pieces = 3
subword_features = true
dropout = null
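
The `patience`, `max_steps` and `eval_frequency` settings in this config jointly decide when training ends, with 0 meaning unlimited. The sketch below simulates that stopping rule over a list of evaluation scores. `steps_until_stop` is a hypothetical helper, and the exact rule used by spaCy's training loop is an assumption here: training stops once `patience` steps have passed without a new best score, or when `max_steps` is reached.

```python
def steps_until_stop(scores, eval_frequency, patience, max_steps=0):
    """Return the step at which training would stop for a series of dev scores.

    scores[i] is the score measured at step (i + 1) * eval_frequency.
    A patience or max_steps value of 0 or below disables that limit.
    """
    best_score = None
    best_step = 0
    step = 0
    for score in scores:
        step += eval_frequency
        if best_score is None or score > best_score:
            best_score, best_step = score, step
        if patience > 0 and step - best_step >= patience:
            return step
        if max_steps > 0 and step >= max_steps:
            return step
    return step


# Made-up scores: the best value is reached at step 800 and never improved.
dev_scores = [0.5, 0.6, 0.6, 0.59, 0.58, 0.57]
stopped_at = steps_until_stop(dev_scores, eval_frequency=400, patience=1600)
```

With `eval_frequency = 400` and `patience = 1600`, the run tolerates four evaluations without improvement before it stops.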

View File

@ -0,0 +1,73 @@
# Training hyper-parameters and additional features.
[training]
# Whether to train on sequences with 'gold standard' sentence boundaries
# and tokens. If you set this to true, take care to ensure your run-time
# data is passed in sentence-by-sentence via some prior preprocessing.
gold_preproc = false
# Limitations on training document length or number of examples.
max_length = 3000
limit = 0
# Data augmentation
orth_variant_level = 0.0
dropout = 0.1
# Controls early-stopping. 0 or -1 mean unlimited.
patience = 100000
max_epochs = 0
max_steps = 0
eval_frequency = 1000
# Other settings
seed = 0
accumulate_gradient = 1
use_pytorch_for_gpu_memory = false
# Control how scores are printed and checkpoints are evaluated.
scores = ["speed", "ents_p", "ents_r", "ents_f"]
score_weights = {"ents_f": 1.0}
# These settings are invalid for the transformer models.
init_tok2vec = null
discard_oversize = false
omit_extra_lookups = false
batch_by = "words"
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 1e-8
learn_rate = 0.001
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.ner]
factory = "ner"
learn_tokens = false
min_action_freq = 1
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 3
hidden_width = 64
maxout_pieces = 2
use_upper = true
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = ${nlp:vectors}
width = 96
depth = 4
window_size = 1
embed_size = 2000
maxout_pieces = 3
subword_features = true
dropout = ${training:dropout}
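
The `score_weights` setting controls how the individual metrics are combined when checkpoints are compared. The sketch below assumes the combined score is a plain weighted sum of the listed metrics. `weighted_score` is a hypothetical helper, and the metric values are made up for illustration.

```python
def weighted_score(scores, score_weights):
    """Combine per-metric scores into one number using the given weights.

    Metrics missing from `scores` count as 0.0; metrics without a weight
    (such as "speed") are ignored.
    """
    return sum(
        scores.get(name, 0.0) * weight for name, weight in score_weights.items()
    )


# Made-up evaluation results.
scores = {"ents_p": 0.82, "ents_r": 0.78, "ents_f": 0.8, "speed": 12000.0}

# Weights from this config: only the entity F-score matters.
ner_only = weighted_score(scores, {"ents_f": 1.0})

# A mixed weighting over several metrics, some of them absent from `scores`.
mixed = weighted_score(
    {"las": 0.9, "ents_f": 0.8, "tags_acc": 0.95},
    {"las": 0.4, "ents_f": 0.4, "tags_acc": 0.2},
)
```

Keeping the weights summing to 1.0 keeps the combined score on the same scale as the individual metrics.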

View File

@ -0,0 +1,73 @@
[training]
patience = 10000
eval_frequency = 200
dropout = 0.2
init_tok2vec = null
vectors = null
max_epochs = 100
orth_variant_level = 0.0
gold_preproc = true
max_length = 0
use_gpu = 0
scores = ["tags_acc", "uas", "las"]
score_weights = {"las": 0.8, "tags_acc": 0.2}
limit = 0
seed = 0
accumulate_gradient = 2
discard_oversize = false
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
beta1 = 0.9
beta2 = 0.999
[nlp]
lang = "en"
vectors = ${training:vectors}
[nlp.pipeline.tok2vec]
factory = "tok2vec"
[nlp.pipeline.tagger]
factory = "tagger"
[nlp.pipeline.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 1
beam_width = 1
beam_update_prob = 1.0
[nlp.pipeline.tagger.model]
@architectures = "spacy.Tagger.v1"
[nlp.pipeline.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 8
hidden_width = 64
maxout_pieces = 3
[nlp.pipeline.parser.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model:width}
[nlp.pipeline.tok2vec.model]
@architectures = "spacy.HashEmbedBiLSTM.v1"
pretrained_vectors = ${nlp:vectors}
width = 96
depth = 4
embed_size = 2000
subword_features = true
maxout_pieces = 3
dropout = null

View File

@ -0,0 +1,110 @@
[paths]
train = ""
dev = ""
raw = null
init_tok2vec = null
[system]
seed = 0
use_pytorch_for_gpu_memory = false
[training]
seed = ${system:seed}
dropout = 0.2
init_tok2vec = ${paths:init_tok2vec}
vectors = null
accumulate_gradient = 1
max_steps = 0
max_epochs = 0
patience = 10000
eval_frequency = 200
score_weights = {"dep_las": 0.8, "tag_acc": 0.2}
[training.read_train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = true
max_length = 0
limit = 0
[training.read_dev]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
gold_preproc = ${training.read_train:gold_preproc}
max_length = 0
limit = 0
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
beta1 = 0.9
beta2 = 0.999
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger", "parser"]
load_vocab_data = false
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
[nlp.lemmatizer]
@lemmatizers = "spacy.Lemmatizer.v1"
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tagger]
factory = "tagger"
[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 1
[components.tagger.model]
@architectures = "spacy.Tagger.v1"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 8
hidden_width = 64
maxout_pieces = 3
[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.tok2vec.model.encode:width}
rows = 2000
also_embed_subwords = true
also_use_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3


@ -0,0 +1,69 @@
[training]
use_gpu = -1
limit = 0
dropout = 0.2
patience = 10000
eval_frequency = 200
scores = ["ents_f"]
score_weights = {"ents_f": 1}
orth_variant_level = 0.0
gold_preproc = true
max_length = 0
batch_size = 25
seed = 0
accumulate_gradient = 2
discard_oversize = false
[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
beta1 = 0.9
beta2 = 0.999
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.tok2vec]
factory = "tok2vec"
[nlp.pipeline.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"
[nlp.pipeline.tok2vec.model.extract]
@architectures = "spacy.CharacterEmbed.v1"
width = 96
nM = 64
nC = 8
rows = 2000
columns = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]
dropout = null
[nlp.pipeline.tok2vec.model.extract.features]
@architectures = "spacy.Doc2Feats.v1"
columns = ${nlp.pipeline.tok2vec.model.extract:columns}
[nlp.pipeline.tok2vec.model.embed]
@architectures = "spacy.LayerNormalizedMaxout.v1"
width = ${nlp.pipeline.tok2vec.model.extract:width}
maxout_pieces = 4
[nlp.pipeline.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = ${nlp.pipeline.tok2vec.model.extract:width}
window_size = 1
maxout_pieces = 2
depth = 2
[nlp.pipeline.ner]
factory = "ner"
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
nr_feature_tokens = 6
hidden_width = 64
maxout_pieces = 2
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.Tok2VecTensors.v1"
width = ${nlp.pipeline.tok2vec.model.extract:width}


@ -0,0 +1,48 @@
[training]
use_gpu = -1
limit = 0
dropout = 0.2
patience = 10000
eval_frequency = 200
scores = ["ents_p", "ents_r", "ents_f"]
score_weights = {"ents_f": 1}
orth_variant_level = 0.0
gold_preproc = true
max_length = 0
seed = 0
accumulate_gradient = 2
discard_oversize = false
[training.batch_size]
@schedules = "compounding.v1"
start = 3000
stop = 3000
compound = 1.001
[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
beta1 = 0.9
beta2 = 0.999
[nlp]
lang = "en"
vectors = null
[nlp.pipeline.ner]
factory = "simple_ner"
[nlp.pipeline.ner.model]
@architectures = "spacy.BiluoTagger.v1"
[nlp.pipeline.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
width = 128
depth = 4
embed_size = 7000
maxout_pieces = 3
window_size = 1
subword_features = true
pretrained_vectors = null
dropout = null

fabfile.py

@ -1,154 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals, print_function

import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from os import path, environ
import shutil
import sys


PWD = path.dirname(__file__)
ENV = environ["VENV_DIR"] if "VENV_DIR" in environ else ".env"
VENV_DIR = Path(PWD) / ENV


@contextlib.contextmanager
def virtualenv(name, create=False, python="/usr/bin/python3.6"):
    python = Path(python).resolve()
    env_path = VENV_DIR
    if create:
        if env_path.exists():
            shutil.rmtree(str(env_path))
        local("{python} -m venv {env_path}".format(python=python, env_path=VENV_DIR))

    def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
        return local(
            "source {}/bin/activate && {}".format(env_path, cmd),
            shell="/bin/bash",
            capture=False,
        )

    yield wrapped_local


def env(lang="python3.6"):
    if VENV_DIR.exists():
        local("rm -rf {env}".format(env=VENV_DIR))
    if lang.startswith("python3"):
        local("{lang} -m venv {env}".format(lang=lang, env=VENV_DIR))
    else:
        local("{lang} -m pip install virtualenv --no-cache-dir".format(lang=lang))
        local(
            "{lang} -m virtualenv {env} --no-cache-dir".format(lang=lang, env=VENV_DIR)
        )
    with virtualenv(VENV_DIR) as venv_local:
        print(venv_local("python --version", capture=True))
        venv_local("pip install --upgrade setuptools --no-cache-dir")
        venv_local("pip install pytest --no-cache-dir")
        venv_local("pip install wheel --no-cache-dir")
        venv_local("pip install -r requirements.txt --no-cache-dir")
        venv_local("pip install pex --no-cache-dir")


def install():
    with virtualenv(VENV_DIR) as venv_local:
        venv_local("pip install dist/*.tar.gz")


def make():
    with lcd(path.dirname(__file__)):
        local(
            "export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace",
            shell="/bin/bash",
        )


def sdist():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("python -m pip install -U setuptools srsly")
            venv_local("python setup.py sdist")


def wheel():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("python setup.py bdist_wheel")


def pex():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            sha = local("git rev-parse --short HEAD", capture=True)
            venv_local(
                "pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True
            )


def clean():
    with lcd(path.dirname(__file__)):
        local("rm -f dist/*.whl")
        local("rm -f dist/*.pex")
        with virtualenv(VENV_DIR) as venv_local:
            venv_local("python setup.py clean --all")


def test():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("pytest -x spacy/tests")


def train():
    args = environ.get("SPACY_TRAIN_ARGS", "")
    with virtualenv(VENV_DIR) as venv_local:
        venv_local("spacy train {args}".format(args=args))


def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=""):
    is_not_clean = local("git status --porcelain", capture=True)
    if is_not_clean:
        print("Repository is not clean")
        print(is_not_clean)
        sys.exit(1)
    git_sha = local("git rev-parse --short HEAD", capture=True)
    config_checksum = local("sha256sum {config}".format(config=config), capture=True)
    experiment_dir = Path(experiment_dir) / "{}--{}".format(
        config_checksum[:6], git_sha
    )
    if not experiment_dir.exists():
        experiment_dir.mkdir()
    test_data_dir = Path(treebank_dir) / "ud-test-v2.0-conll2017"
    assert test_data_dir.exists()
    assert test_data_dir.is_dir()
    if corpus:
        corpora = [corpus]
    else:
        corpora = ["UD_English", "UD_Chinese", "UD_Japanese", "UD_Vietnamese"]

    local(
        "cp {config} {experiment_dir}/config.json".format(
            config=config, experiment_dir=experiment_dir
        )
    )
    with virtualenv(VENV_DIR) as venv_local:
        for corpus in corpora:
            venv_local(
                "spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}".format(
                    treebank_dir=treebank_dir,
                    experiment_dir=experiment_dir,
                    config=config,
                    corpus=corpus,
                    vectors_dir=vectors_dir,
                )
            )
            venv_local(
                "spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}".format(
                    test_data_dir=test_data_dir,
                    experiment_dir=experiment_dir,
                    config=config,
                    corpus=corpus,
                )
            )


@ -1,259 +0,0 @@
// ISO C9x compliant stdint.h for Microsoft Visual Studio
// Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
//
// Copyright (c) 2006-2013 Alexander Chemeris
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice,
// this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the product nor the names of its contributors may
// be used to endorse or promote products derived from this software
// without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
// EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
// WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
// OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
// ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
///////////////////////////////////////////////////////////////////////////////
#ifndef _MSC_VER // [
#error "Use this header only with Microsoft Visual C++ compilers!"
#endif // _MSC_VER ]
#ifndef _MSC_STDINT_H_ // [
#define _MSC_STDINT_H_
#if _MSC_VER > 1000
#pragma once
#endif
#if _MSC_VER >= 1600 // [
#include <stdint.h>
#else // ] _MSC_VER >= 1600 [
#include <limits.h>
// For Visual Studio 6 in C++ mode and for many Visual Studio versions when
// compiling for ARM we should wrap <wchar.h> include with 'extern "C++" {}'
// or compiler give many errors like this:
// error C2733: second C linkage of overloaded function 'wmemchr' not allowed
#ifdef __cplusplus
extern "C" {
#endif
# include <wchar.h>
#ifdef __cplusplus
}
#endif
// Define _W64 macros to mark types changing their size, like intptr_t.
#ifndef _W64
# if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300
# define _W64 __w64
# else
# define _W64
# endif
#endif
// 7.18.1 Integer types
// 7.18.1.1 Exact-width integer types
// Visual Studio 6 and Embedded Visual C++ 4 doesn't
// realize that, e.g. char has the same size as __int8
// so we give up on __intX for them.
#if (_MSC_VER < 1300)
typedef signed char int8_t;
typedef signed short int16_t;
typedef signed int int32_t;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;
#else
typedef signed __int8 int8_t;
typedef signed __int16 int16_t;
typedef signed __int32 int32_t;
typedef unsigned __int8 uint8_t;
typedef unsigned __int16 uint16_t;
typedef unsigned __int32 uint32_t;
#endif
typedef signed __int64 int64_t;
typedef unsigned __int64 uint64_t;
// 7.18.1.2 Minimum-width integer types
typedef int8_t int_least8_t;
typedef int16_t int_least16_t;
typedef int32_t int_least32_t;
typedef int64_t int_least64_t;
typedef uint8_t uint_least8_t;
typedef uint16_t uint_least16_t;
typedef uint32_t uint_least32_t;
typedef uint64_t uint_least64_t;
// 7.18.1.3 Fastest minimum-width integer types
typedef int8_t int_fast8_t;
typedef int16_t int_fast16_t;
typedef int32_t int_fast32_t;
typedef int64_t int_fast64_t;
typedef uint8_t uint_fast8_t;
typedef uint16_t uint_fast16_t;
typedef uint32_t uint_fast32_t;
typedef uint64_t uint_fast64_t;
// 7.18.1.4 Integer types capable of holding object pointers
#ifdef _WIN64 // [
typedef signed __int64 intptr_t;
typedef unsigned __int64 uintptr_t;
#else // _WIN64 ][
typedef _W64 signed int intptr_t;
typedef _W64 unsigned int uintptr_t;
#endif // _WIN64 ]
// 7.18.1.5 Greatest-width integer types
typedef int64_t intmax_t;
typedef uint64_t uintmax_t;
// 7.18.2 Limits of specified-width integer types
#if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259
// 7.18.2.1 Limits of exact-width integer types
#define INT8_MIN ((int8_t)_I8_MIN)
#define INT8_MAX _I8_MAX
#define INT16_MIN ((int16_t)_I16_MIN)
#define INT16_MAX _I16_MAX
#define INT32_MIN ((int32_t)_I32_MIN)
#define INT32_MAX _I32_MAX
#define INT64_MIN ((int64_t)_I64_MIN)
#define INT64_MAX _I64_MAX
#define UINT8_MAX _UI8_MAX
#define UINT16_MAX _UI16_MAX
#define UINT32_MAX _UI32_MAX
#define UINT64_MAX _UI64_MAX
// 7.18.2.2 Limits of minimum-width integer types
#define INT_LEAST8_MIN INT8_MIN
#define INT_LEAST8_MAX INT8_MAX
#define INT_LEAST16_MIN INT16_MIN
#define INT_LEAST16_MAX INT16_MAX
#define INT_LEAST32_MIN INT32_MIN
#define INT_LEAST32_MAX INT32_MAX
#define INT_LEAST64_MIN INT64_MIN
#define INT_LEAST64_MAX INT64_MAX
#define UINT_LEAST8_MAX UINT8_MAX
#define UINT_LEAST16_MAX UINT16_MAX
#define UINT_LEAST32_MAX UINT32_MAX
#define UINT_LEAST64_MAX UINT64_MAX
// 7.18.2.3 Limits of fastest minimum-width integer types
#define INT_FAST8_MIN INT8_MIN
#define INT_FAST8_MAX INT8_MAX
#define INT_FAST16_MIN INT16_MIN
#define INT_FAST16_MAX INT16_MAX
#define INT_FAST32_MIN INT32_MIN
#define INT_FAST32_MAX INT32_MAX
#define INT_FAST64_MIN INT64_MIN
#define INT_FAST64_MAX INT64_MAX
#define UINT_FAST8_MAX UINT8_MAX
#define UINT_FAST16_MAX UINT16_MAX
#define UINT_FAST32_MAX UINT32_MAX
#define UINT_FAST64_MAX UINT64_MAX
// 7.18.2.4 Limits of integer types capable of holding object pointers
#ifdef _WIN64 // [
# define INTPTR_MIN INT64_MIN
# define INTPTR_MAX INT64_MAX
# define UINTPTR_MAX UINT64_MAX
#else // _WIN64 ][
# define INTPTR_MIN INT32_MIN
# define INTPTR_MAX INT32_MAX
# define UINTPTR_MAX UINT32_MAX
#endif // _WIN64 ]
// 7.18.2.5 Limits of greatest-width integer types
#define INTMAX_MIN INT64_MIN
#define INTMAX_MAX INT64_MAX
#define UINTMAX_MAX UINT64_MAX
// 7.18.3 Limits of other integer types
#ifdef _WIN64 // [
# define PTRDIFF_MIN _I64_MIN
# define PTRDIFF_MAX _I64_MAX
#else // _WIN64 ][
# define PTRDIFF_MIN _I32_MIN
# define PTRDIFF_MAX _I32_MAX
#endif // _WIN64 ]
#define SIG_ATOMIC_MIN INT_MIN
#define SIG_ATOMIC_MAX INT_MAX
#ifndef SIZE_MAX // [
# ifdef _WIN64 // [
# define SIZE_MAX _UI64_MAX
# else // _WIN64 ][
# define SIZE_MAX _UI32_MAX
# endif // _WIN64 ]
#endif // SIZE_MAX ]
// WCHAR_MIN and WCHAR_MAX are also defined in <wchar.h>
#ifndef WCHAR_MIN // [
# define WCHAR_MIN 0
#endif // WCHAR_MIN ]
#ifndef WCHAR_MAX // [
# define WCHAR_MAX _UI16_MAX
#endif // WCHAR_MAX ]
#define WINT_MIN 0
#define WINT_MAX _UI16_MAX
#endif // __STDC_LIMIT_MACROS ]
// 7.18.4 Limits of other integer types
#if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260
// 7.18.4.1 Macros for minimum-width integer constants
#define INT8_C(val) val##i8
#define INT16_C(val) val##i16
#define INT32_C(val) val##i32
#define INT64_C(val) val##i64
#define UINT8_C(val) val##ui8
#define UINT16_C(val) val##ui16
#define UINT32_C(val) val##ui32
#define UINT64_C(val) val##ui64
// 7.18.4.2 Macros for greatest-width integer constants
// These #ifndef's are needed to prevent collisions with <boost/cstdint.hpp>.
// Check out Issue 9 for the details.
#ifndef INTMAX_C // [
# define INTMAX_C INT64_C
#endif // INTMAX_C ]
#ifndef UINTMAX_C // [
# define UINTMAX_C UINT64_C
#endif // UINTMAX_C ]
#endif // __STDC_CONSTANT_MACROS ]
#endif // _MSC_VER >= 1600 ]
#endif // _MSC_STDINT_H_ ]


@ -1,22 +0,0 @@
//-----------------------------------------------------------------------------
// MurmurHash2 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
#ifndef _MURMURHASH2_H_
#define _MURMURHASH2_H_
#include <stdint.h>
//-----------------------------------------------------------------------------
uint32_t MurmurHash2 ( const void * key, int len, uint32_t seed );
uint64_t MurmurHash64A ( const void * key, int len, uint64_t seed );
uint64_t MurmurHash64B ( const void * key, int len, uint64_t seed );
uint32_t MurmurHash2A ( const void * key, int len, uint32_t seed );
uint32_t MurmurHashNeutral2 ( const void * key, int len, uint32_t seed );
uint32_t MurmurHashAligned2 ( const void * key, int len, uint32_t seed );
//-----------------------------------------------------------------------------
#endif // _MURMURHASH2_H_


@ -1,28 +0,0 @@
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
#ifndef _MURMURHASH3_H_
#define _MURMURHASH3_H_
#include <stdint.h>
//-----------------------------------------------------------------------------
#ifdef __cplusplus
extern "C" {
#endif
void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out );
void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );
void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );
#ifdef __cplusplus
}
#endif
//-----------------------------------------------------------------------------
#endif // _MURMURHASH3_H_

File diff suppressed because it is too large


@ -1,323 +0,0 @@
#ifdef _UMATHMODULE
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
#else
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
#endif
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
#else
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
#endif
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndData \
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int);
NPY_NO_EXPORT int PyUFunc_RegisterLoopForType \
(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *);
NPY_NO_EXPORT int PyUFunc_GenericFunction \
(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **);
NPY_NO_EXPORT void PyUFunc_f_f_As_d_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_d_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_f_f \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_g_g \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_F_F_As_D_D \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_F_F \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_D_D \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_G_G \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_O_O \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_ff_f_As_dd_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_ff_f \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_dd_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_gg_g \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_FF_F_As_DD_D \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_DD_D \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_FF_F \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_GG_G \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_OO_O \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_O_O_method \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_OO_O_method \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_On_Om \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT int PyUFunc_GetPyValues \
(char *, int *, int *, PyObject **);
NPY_NO_EXPORT int PyUFunc_checkfperr \
(int, PyObject *, int *);
NPY_NO_EXPORT void PyUFunc_clearfperr \
(void);
NPY_NO_EXPORT int PyUFunc_getfperr \
(void);
NPY_NO_EXPORT int PyUFunc_handlefperr \
(int, PyObject *, int, int *);
NPY_NO_EXPORT int PyUFunc_ReplaceLoopBySignature \
(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *);
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndDataAndSignature \
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *);
NPY_NO_EXPORT int PyUFunc_SetUsesArraysAsData \
(void **, size_t);
NPY_NO_EXPORT void PyUFunc_e_e \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_e_e_As_f_f \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_e_e_As_d_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_ee_e \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_ee_e_As_ff_f \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT void PyUFunc_ee_e_As_dd_d \
(char **, npy_intp *, npy_intp *, void *);
NPY_NO_EXPORT int PyUFunc_DefaultTypeResolver \
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **);
NPY_NO_EXPORT int PyUFunc_ValidateCasting \
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **);
#else
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
#define PyUFunc_API PY_UFUNC_UNIQUE_SYMBOL
#endif
#if defined(NO_IMPORT) || defined(NO_IMPORT_UFUNC)
extern void **PyUFunc_API;
#else
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
void **PyUFunc_API;
#else
static void **PyUFunc_API=NULL;
#endif
#endif
#define PyUFunc_Type (*(PyTypeObject *)PyUFunc_API[0])
#define PyUFunc_FromFuncAndData \
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int)) \
PyUFunc_API[1])
#define PyUFunc_RegisterLoopForType \
(*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *)) \
PyUFunc_API[2])
#define PyUFunc_GenericFunction \
(*(int (*)(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **)) \
PyUFunc_API[3])
#define PyUFunc_f_f_As_d_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[4])
#define PyUFunc_d_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[5])
#define PyUFunc_f_f \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[6])
#define PyUFunc_g_g \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[7])
#define PyUFunc_F_F_As_D_D \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[8])
#define PyUFunc_F_F \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[9])
#define PyUFunc_D_D \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[10])
#define PyUFunc_G_G \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[11])
#define PyUFunc_O_O \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[12])
#define PyUFunc_ff_f_As_dd_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[13])
#define PyUFunc_ff_f \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[14])
#define PyUFunc_dd_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[15])
#define PyUFunc_gg_g \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[16])
#define PyUFunc_FF_F_As_DD_D \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[17])
#define PyUFunc_DD_D \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[18])
#define PyUFunc_FF_F \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[19])
#define PyUFunc_GG_G \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[20])
#define PyUFunc_OO_O \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[21])
#define PyUFunc_O_O_method \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[22])
#define PyUFunc_OO_O_method \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[23])
#define PyUFunc_On_Om \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[24])
#define PyUFunc_GetPyValues \
(*(int (*)(char *, int *, int *, PyObject **)) \
PyUFunc_API[25])
#define PyUFunc_checkfperr \
(*(int (*)(int, PyObject *, int *)) \
PyUFunc_API[26])
#define PyUFunc_clearfperr \
(*(void (*)(void)) \
PyUFunc_API[27])
#define PyUFunc_getfperr \
(*(int (*)(void)) \
PyUFunc_API[28])
#define PyUFunc_handlefperr \
(*(int (*)(int, PyObject *, int, int *)) \
PyUFunc_API[29])
#define PyUFunc_ReplaceLoopBySignature \
(*(int (*)(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *)) \
PyUFunc_API[30])
#define PyUFunc_FromFuncAndDataAndSignature \
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *)) \
PyUFunc_API[31])
#define PyUFunc_SetUsesArraysAsData \
(*(int (*)(void **, size_t)) \
PyUFunc_API[32])
#define PyUFunc_e_e \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[33])
#define PyUFunc_e_e_As_f_f \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[34])
#define PyUFunc_e_e_As_d_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[35])
#define PyUFunc_ee_e \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[36])
#define PyUFunc_ee_e_As_ff_f \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[37])
#define PyUFunc_ee_e_As_dd_d \
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
PyUFunc_API[38])
#define PyUFunc_DefaultTypeResolver \
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **)) \
PyUFunc_API[39])
#define PyUFunc_ValidateCasting \
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **)) \
PyUFunc_API[40])
static int
_import_umath(void)
{
    PyObject *numpy = PyImport_ImportModule("numpy.core.umath");
    PyObject *c_api = NULL;

    if (numpy == NULL) {
        PyErr_SetString(PyExc_ImportError, "numpy.core.umath failed to import");
        return -1;
    }
    c_api = PyObject_GetAttrString(numpy, "_UFUNC_API");
    Py_DECREF(numpy);
    if (c_api == NULL) {
        PyErr_SetString(PyExc_AttributeError, "_UFUNC_API not found");
        return -1;
    }

#if PY_VERSION_HEX >= 0x03000000
    if (!PyCapsule_CheckExact(c_api)) {
        PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCapsule object");
        Py_DECREF(c_api);
        return -1;
    }
    PyUFunc_API = (void **)PyCapsule_GetPointer(c_api, NULL);
#else
    if (!PyCObject_Check(c_api)) {
        PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCObject object");
        Py_DECREF(c_api);
        return -1;
    }
    PyUFunc_API = (void **)PyCObject_AsVoidPtr(c_api);
#endif
    Py_DECREF(c_api);
    if (PyUFunc_API == NULL) {
        PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is NULL pointer");
        return -1;
    }
    return 0;
}

#if PY_VERSION_HEX >= 0x03000000
#define NUMPY_IMPORT_UMATH_RETVAL NULL
#else
#define NUMPY_IMPORT_UMATH_RETVAL
#endif

#define import_umath() \
    do {\
        UFUNC_NOFPE\
        if (_import_umath() < 0) {\
            PyErr_Print();\
            PyErr_SetString(PyExc_ImportError,\
                    "numpy.core.umath failed to import");\
            return NUMPY_IMPORT_UMATH_RETVAL;\
        }\
    } while(0)

#define import_umath1(ret) \
    do {\
        UFUNC_NOFPE\
        if (_import_umath() < 0) {\
            PyErr_Print();\
            PyErr_SetString(PyExc_ImportError,\
                    "numpy.core.umath failed to import");\
            return ret;\
        }\
    } while(0)

#define import_umath2(ret, msg) \
    do {\
        UFUNC_NOFPE\
        if (_import_umath() < 0) {\
            PyErr_Print();\
            PyErr_SetString(PyExc_ImportError, msg);\
            return ret;\
        }\
    } while(0)

#define import_ufunc() \
    do {\
        UFUNC_NOFPE\
        if (_import_umath() < 0) {\
            PyErr_Print();\
            PyErr_SetString(PyExc_ImportError,\
                    "numpy.core.umath failed to import");\
        }\
    } while(0)
#endif


@ -1,90 +0,0 @@
#ifndef _NPY_INCLUDE_NEIGHBORHOOD_IMP
#error You should not include this header directly
#endif
/*
 * Private API (here for inline)
 */
static NPY_INLINE int
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter);

/*
 * Update to next item of the iterator
 *
 * Note: this simply increment the coordinates vector, last dimension
 * incremented first , i.e, for dimension 3
 * ...
 * -1, -1, -1
 * -1, -1,  0
 * -1, -1,  1
 *  ....
 * -1,  0, -1
 * -1,  0,  0
 *  ....
 *  0, -1, -1
 *  0, -1,  0
 *  ....
 */
#define _UPDATE_COORD_ITER(c) \
    wb = iter->coordinates[c] < iter->bounds[c][1]; \
    if (wb) { \
        iter->coordinates[c] += 1; \
        return 0; \
    } \
    else { \
        iter->coordinates[c] = iter->bounds[c][0]; \
    }

static NPY_INLINE int
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter)
{
    npy_intp i, wb;

    for (i = iter->nd - 1; i >= 0; --i) {
        _UPDATE_COORD_ITER(i)
    }

    return 0;
}

/*
 * Version optimized for 2d arrays, manual loop unrolling
 */
static NPY_INLINE int
_PyArrayNeighborhoodIter_IncrCoord2D(PyArrayNeighborhoodIterObject* iter)
{
    npy_intp wb;

    _UPDATE_COORD_ITER(1)
    _UPDATE_COORD_ITER(0)

    return 0;
}
#undef _UPDATE_COORD_ITER

/*
 * Advance to the next neighbour
 */
static NPY_INLINE int
PyArrayNeighborhoodIter_Next(PyArrayNeighborhoodIterObject* iter)
{
    _PyArrayNeighborhoodIter_IncrCoord (iter);
    iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);

    return 0;
}

/*
 * Reset functions
 */
static NPY_INLINE int
PyArrayNeighborhoodIter_Reset(PyArrayNeighborhoodIterObject* iter)
{
    npy_intp i;

    for (i = 0; i < iter->nd; ++i) {
        iter->coordinates[i] = iter->bounds[i][0];
    }
    iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);

    return 0;
}
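
The coordinate update described in the header comment above (last dimension incremented first, wrapping to the lower bound and carrying into the previous dimension) can be sketched in a few lines of Python. This is an illustrative model only: `incr_coord`, `bounds` and `coords` are hypothetical names, and the sketch ignores the data-pointer translation that the C functions also perform.

```python
def incr_coord(coordinates, bounds):
    """Advance `coordinates` in place, last dimension first."""
    for axis in range(len(coordinates) - 1, -1, -1):
        low, high = bounds[axis]
        if coordinates[axis] < high:
            coordinates[axis] += 1
            return
        # This axis is exhausted: wrap it and carry into the previous axis.
        coordinates[axis] = low


bounds = [(-1, 1), (-1, 1), (-1, 1)]
coords = [low for low, _ in bounds]  # same start state as the reset function
visited = []
for _ in range(4):
    visited.append(tuple(coords))
    incr_coord(coords, bounds)
print(visited)
```

Once every axis has reached its upper bound, the next call wraps all coordinates back to their lower bounds, which mirrors the fall-through path of the C loop.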


@ -1,29 +0,0 @@
#define NPY_SIZEOF_SHORT SIZEOF_SHORT
#define NPY_SIZEOF_INT SIZEOF_INT
#define NPY_SIZEOF_LONG SIZEOF_LONG
#define NPY_SIZEOF_FLOAT 4
#define NPY_SIZEOF_COMPLEX_FLOAT 8
#define NPY_SIZEOF_DOUBLE 8
#define NPY_SIZEOF_COMPLEX_DOUBLE 16
#define NPY_SIZEOF_LONGDOUBLE 16
#define NPY_SIZEOF_COMPLEX_LONGDOUBLE 32
#define NPY_SIZEOF_PY_INTPTR_T 8
#define NPY_SIZEOF_PY_LONG_LONG 8
#define NPY_SIZEOF_LONGLONG 8
#define NPY_NO_SMP 0
#define NPY_HAVE_DECL_ISNAN
#define NPY_HAVE_DECL_ISINF
#define NPY_HAVE_DECL_ISFINITE
#define NPY_HAVE_DECL_SIGNBIT
#define NPY_USE_C99_COMPLEX 1
#define NPY_HAVE_COMPLEX_DOUBLE 1
#define NPY_HAVE_COMPLEX_FLOAT 1
#define NPY_HAVE_COMPLEX_LONG_DOUBLE 1
#define NPY_USE_C99_FORMATS 1
#define NPY_VISIBILITY_HIDDEN __attribute__((visibility("hidden")))
#define NPY_ABI_VERSION 0x01000009
#define NPY_API_VERSION 0x00000007
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS 1
#endif


@ -1,22 +0,0 @@
/* This expects the following variables to be defined (besides
the usual ones from pyconfig.h
SIZEOF_LONG_DOUBLE -- sizeof(long double) or sizeof(double) if no
long double is present on platform.
CHAR_BIT -- number of bits in a char (usually 8)
(should be in limits.h)
*/
#ifndef Py_ARRAYOBJECT_H
#define Py_ARRAYOBJECT_H
#include "ndarrayobject.h"
#include "npy_interrupt.h"
#ifdef NPY_NO_PREFIX
#include "noprefix.h"
#endif
#endif


@ -1,175 +0,0 @@
#ifndef _NPY_ARRAYSCALARS_H_
#define _NPY_ARRAYSCALARS_H_
#ifndef _MULTIARRAYMODULE
typedef struct {
PyObject_HEAD
npy_bool obval;
} PyBoolScalarObject;
#endif
typedef struct {
PyObject_HEAD
signed char obval;
} PyByteScalarObject;
typedef struct {
PyObject_HEAD
short obval;
} PyShortScalarObject;
typedef struct {
PyObject_HEAD
int obval;
} PyIntScalarObject;
typedef struct {
PyObject_HEAD
long obval;
} PyLongScalarObject;
typedef struct {
PyObject_HEAD
npy_longlong obval;
} PyLongLongScalarObject;
typedef struct {
PyObject_HEAD
unsigned char obval;
} PyUByteScalarObject;
typedef struct {
PyObject_HEAD
unsigned short obval;
} PyUShortScalarObject;
typedef struct {
PyObject_HEAD
unsigned int obval;
} PyUIntScalarObject;
typedef struct {
PyObject_HEAD
unsigned long obval;
} PyULongScalarObject;
typedef struct {
PyObject_HEAD
npy_ulonglong obval;
} PyULongLongScalarObject;
typedef struct {
PyObject_HEAD
npy_half obval;
} PyHalfScalarObject;
typedef struct {
PyObject_HEAD
float obval;
} PyFloatScalarObject;
typedef struct {
PyObject_HEAD
double obval;
} PyDoubleScalarObject;
typedef struct {
PyObject_HEAD
npy_longdouble obval;
} PyLongDoubleScalarObject;
typedef struct {
PyObject_HEAD
npy_cfloat obval;
} PyCFloatScalarObject;
typedef struct {
PyObject_HEAD
npy_cdouble obval;
} PyCDoubleScalarObject;
typedef struct {
PyObject_HEAD
npy_clongdouble obval;
} PyCLongDoubleScalarObject;
typedef struct {
PyObject_HEAD
PyObject * obval;
} PyObjectScalarObject;
typedef struct {
PyObject_HEAD
npy_datetime obval;
PyArray_DatetimeMetaData obmeta;
} PyDatetimeScalarObject;
typedef struct {
PyObject_HEAD
npy_timedelta obval;
PyArray_DatetimeMetaData obmeta;
} PyTimedeltaScalarObject;
typedef struct {
PyObject_HEAD
char obval;
} PyScalarObject;
#define PyStringScalarObject PyStringObject
#define PyUnicodeScalarObject PyUnicodeObject
typedef struct {
PyObject_VAR_HEAD
char *obval;
PyArray_Descr *descr;
int flags;
PyObject *base;
} PyVoidScalarObject;
/* Macros
Py<Cls><bitsize>ScalarObject
Py<Cls><bitsize>ArrType_Type
are defined in ndarrayobject.h
*/
#define PyArrayScalar_False ((PyObject *)(&(_PyArrayScalar_BoolValues[0])))
#define PyArrayScalar_True ((PyObject *)(&(_PyArrayScalar_BoolValues[1])))
#define PyArrayScalar_FromLong(i) \
((PyObject *)(&(_PyArrayScalar_BoolValues[((i)!=0)])))
#define PyArrayScalar_RETURN_BOOL_FROM_LONG(i) \
return Py_INCREF(PyArrayScalar_FromLong(i)), \
PyArrayScalar_FromLong(i)
#define PyArrayScalar_RETURN_FALSE \
return Py_INCREF(PyArrayScalar_False), \
PyArrayScalar_False
#define PyArrayScalar_RETURN_TRUE \
return Py_INCREF(PyArrayScalar_True), \
PyArrayScalar_True
#define PyArrayScalar_New(cls) \
Py##cls##ArrType_Type.tp_alloc(&Py##cls##ArrType_Type, 0)
#define PyArrayScalar_VAL(obj, cls) \
((Py##cls##ScalarObject *)obj)->obval
#define PyArrayScalar_ASSIGN(obj, cls, val) \
PyArrayScalar_VAL(obj, cls) = val
#endif
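
The `PyArrayScalar_VAL(obj, cls)` and `PyArrayScalar_ASSIGN` macros above pick the scalar struct by pasting the `cls` argument into the type name. The following is a minimal standalone sketch of that token-pasting pattern. The `Mock*` types, `MockScalar_VAL`, `MockScalar_ASSIGN` and the two helper functions are hypothetical stand-ins, introduced only so the snippet compiles without the Python or NumPy headers; they are not part of either API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyObject_HEAD; not the real CPython layout. */
typedef struct {
    long refcount;
} MockHead;

/* Two scalar wrappers that share the same "head + obval" shape. */
typedef struct {
    MockHead head;
    int obval;
} MockIntScalarObject;

typedef struct {
    MockHead head;
    double obval;
} MockDoubleScalarObject;

/* Mock##cls##ScalarObject expands to, e.g., MockDoubleScalarObject. */
#define MockScalar_VAL(obj, cls) \
    (((Mock##cls##ScalarObject *)(obj))->obval)

/* The accessor yields an lvalue, so assignment can reuse it. */
#define MockScalar_ASSIGN(obj, cls, val) \
    (MockScalar_VAL(obj, cls) = (val))

/* Read the payload of a double scalar through the pasted type name. */
static double mock_read_double(void *obj)
{
    assert(obj != NULL);
    return MockScalar_VAL(obj, Double);
}

/* Write the payload of an int scalar through the pasted type name. */
static void mock_write_int(void *obj, int value)
{
    assert(obj != NULL);
    MockScalar_ASSIGN(obj, Int, value);
}
```

Note that the sketch parenthesizes both `obj` and the whole expansion, which keeps the macro safe when the argument is an expression such as `p + 1`.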

@ -1,69 +0,0 @@
#ifndef __NPY_HALFFLOAT_H__
#define __NPY_HALFFLOAT_H__
#include <Python.h>
#include <numpy/npy_math.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* Half-precision routines
*/
/* Conversions */
float npy_half_to_float(npy_half h);
double npy_half_to_double(npy_half h);
npy_half npy_float_to_half(float f);
npy_half npy_double_to_half(double d);
/* Comparisons */
int npy_half_eq(npy_half h1, npy_half h2);
int npy_half_ne(npy_half h1, npy_half h2);
int npy_half_le(npy_half h1, npy_half h2);
int npy_half_lt(npy_half h1, npy_half h2);
int npy_half_ge(npy_half h1, npy_half h2);
int npy_half_gt(npy_half h1, npy_half h2);
/* faster *_nonan variants for when you know h1 and h2 are not NaN */
int npy_half_eq_nonan(npy_half h1, npy_half h2);
int npy_half_lt_nonan(npy_half h1, npy_half h2);
int npy_half_le_nonan(npy_half h1, npy_half h2);
/* Miscellaneous functions */
int npy_half_iszero(npy_half h);
int npy_half_isnan(npy_half h);
int npy_half_isinf(npy_half h);
int npy_half_isfinite(npy_half h);
int npy_half_signbit(npy_half h);
npy_half npy_half_copysign(npy_half x, npy_half y);
npy_half npy_half_spacing(npy_half h);
npy_half npy_half_nextafter(npy_half x, npy_half y);
/*
* Half-precision constants
*/
#define NPY_HALF_ZERO (0x0000u)
#define NPY_HALF_PZERO (0x0000u)
#define NPY_HALF_NZERO (0x8000u)
#define NPY_HALF_ONE (0x3c00u)
#define NPY_HALF_NEGONE (0xbc00u)
#define NPY_HALF_PINF (0x7c00u)
#define NPY_HALF_NINF (0xfc00u)
#define NPY_HALF_NAN (0x7e00u)
#define NPY_MAX_HALF (0x7bffu)
/*
* Bit-level conversions
*/
npy_uint16 npy_floatbits_to_halfbits(npy_uint32 f);
npy_uint16 npy_doublebits_to_halfbits(npy_uint64 d);
npy_uint32 npy_halfbits_to_floatbits(npy_uint16 h);
npy_uint64 npy_halfbits_to_doublebits(npy_uint16 h);
#ifdef __cplusplus
}
#endif
#endif
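
The constants above (`NPY_HALF_ONE`, `NPY_HALF_PINF`, `NPY_MAX_HALF`, and so on) are raw IEEE 754 binary16 bit patterns: 1 sign bit, 5 exponent bits, 10 fraction bits. The following sketch shows one way such bits can be widened to binary32. It is an illustration under that assumption, not NumPy's implementation of `npy_halfbits_to_floatbits`; the `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Expand IEEE 754 binary16 bits to binary32 bits. */
static uint32_t sketch_halfbits_to_floatbits(uint16_t h)
{
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    int exponent = (h >> 10) & 0x1f;      /* 5-bit biased exponent */
    uint32_t mantissa = h & 0x3ffu;       /* 10-bit fraction */

    if (exponent == 0x1f) {
        /* Inf (fraction == 0) or NaN (fraction != 0). */
        return sign | 0x7f800000u | (mantissa << 13);
    }
    if (exponent == 0) {
        if (mantissa == 0) {
            return sign;                  /* signed zero */
        }
        /* Subnormal: shift until the implicit leading bit appears. */
        exponent = 1;
        while ((mantissa & 0x400u) == 0) {
            mantissa <<= 1;
            exponent--;
        }
        mantissa &= 0x3ffu;
    }
    /* Rebias from 15 to 127 (difference 112) and widen the fraction. */
    return sign | ((uint32_t)(exponent + 112) << 23) | (mantissa << 13);
}

/* Reinterpret the widened bits as a float without aliasing violations. */
static float sketch_half_to_float(uint16_t h)
{
    uint32_t bits = sketch_halfbits_to_floatbits(h);
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With this sketch, the pattern `0x3c00u` decodes to `1.0f` and `0x7bffu` to `65504.0f`, the largest finite half-precision value.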

File diff suppressed because it is too large

@ -1,244 +0,0 @@
/*
* DON'T INCLUDE THIS DIRECTLY.
*/
#ifndef NPY_NDARRAYOBJECT_H
#define NPY_NDARRAYOBJECT_H
#ifdef __cplusplus
#define CONFUSE_EMACS {
#define CONFUSE_EMACS2 }
extern "C" CONFUSE_EMACS
#undef CONFUSE_EMACS
#undef CONFUSE_EMACS2
/* ... otherwise a semi-smart indenter (like emacs) tries to indent
everything when you're typing */
#endif
#include "ndarraytypes.h"
/* Includes the "function" C-API -- these are all stored in a
list of pointers --- one for each file.
The two lists are concatenated into one in multiarray.
They are made available by import_array()
*/
#include "__multiarray_api.h"
/* C-API that requires the previous API to be defined */
#define PyArray_DescrCheck(op) (((PyObject*)(op))->ob_type==&PyArrayDescr_Type)
#define PyArray_Check(op) PyObject_TypeCheck(op, &PyArray_Type)
#define PyArray_CheckExact(op) (((PyObject*)(op))->ob_type == &PyArray_Type)
#define PyArray_HasArrayInterfaceType(op, type, context, out) \
((((out)=PyArray_FromStructInterface(op)) != Py_NotImplemented) || \
(((out)=PyArray_FromInterface(op)) != Py_NotImplemented) || \
(((out)=PyArray_FromArrayAttr(op, type, context)) != \
Py_NotImplemented))
#define PyArray_HasArrayInterface(op, out) \
PyArray_HasArrayInterfaceType(op, NULL, NULL, out)
#define PyArray_IsZeroDim(op) (PyArray_Check(op) && \
(PyArray_NDIM((PyArrayObject *)op) == 0))
#define PyArray_IsScalar(obj, cls) \
(PyObject_TypeCheck(obj, &Py##cls##ArrType_Type))
#define PyArray_CheckScalar(m) (PyArray_IsScalar(m, Generic) || \
PyArray_IsZeroDim(m))
#define PyArray_IsPythonNumber(obj) \
(PyInt_Check(obj) || PyFloat_Check(obj) || PyComplex_Check(obj) || \
PyLong_Check(obj) || PyBool_Check(obj))
#define PyArray_IsPythonScalar(obj) \
(PyArray_IsPythonNumber(obj) || PyString_Check(obj) || \
PyUnicode_Check(obj))
#define PyArray_IsAnyScalar(obj) \
(PyArray_IsScalar(obj, Generic) || PyArray_IsPythonScalar(obj))
#define PyArray_CheckAnyScalar(obj) (PyArray_IsPythonScalar(obj) || \
PyArray_CheckScalar(obj))
#define PyArray_IsIntegerScalar(obj) (PyInt_Check(obj) \
|| PyLong_Check(obj) \
|| PyArray_IsScalar((obj), Integer))
#define PyArray_GETCONTIGUOUS(m) (PyArray_ISCONTIGUOUS(m) ? \
Py_INCREF(m), (m) : \
(PyArrayObject *)(PyArray_Copy(m)))
#define PyArray_SAMESHAPE(a1,a2) ((PyArray_NDIM(a1) == PyArray_NDIM(a2)) && \
PyArray_CompareLists(PyArray_DIMS(a1), \
PyArray_DIMS(a2), \
PyArray_NDIM(a1)))
#define PyArray_SIZE(m) PyArray_MultiplyList(PyArray_DIMS(m), PyArray_NDIM(m))
#define PyArray_NBYTES(m) (PyArray_ITEMSIZE(m) * PyArray_SIZE(m))
#define PyArray_FROM_O(m) PyArray_FromAny(m, NULL, 0, 0, 0, NULL)
#define PyArray_FROM_OF(m,flags) PyArray_CheckFromAny(m, NULL, 0, 0, flags, \
NULL)
#define PyArray_FROM_OT(m,type) PyArray_FromAny(m, \
PyArray_DescrFromType(type), 0, 0, 0, NULL)
#define PyArray_FROM_OTF(m, type, flags) \
PyArray_FromAny(m, PyArray_DescrFromType(type), 0, 0, \
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
((flags) | NPY_ARRAY_DEFAULT) : (flags)), NULL)
#define PyArray_FROMANY(m, type, min, max, flags) \
PyArray_FromAny(m, PyArray_DescrFromType(type), min, max, \
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
(flags) | NPY_ARRAY_DEFAULT : (flags)), NULL)
#define PyArray_ZEROS(m, dims, type, is_f_order) \
PyArray_Zeros(m, dims, PyArray_DescrFromType(type), is_f_order)
#define PyArray_EMPTY(m, dims, type, is_f_order) \
PyArray_Empty(m, dims, PyArray_DescrFromType(type), is_f_order)
#define PyArray_FILLWBYTE(obj, val) memset(PyArray_DATA(obj), val, \
PyArray_NBYTES(obj))
#define PyArray_REFCOUNT(obj) (((PyObject *)(obj))->ob_refcnt)
#define NPY_REFCOUNT PyArray_REFCOUNT
#define NPY_MAX_ELSIZE (2 * NPY_SIZEOF_LONGDOUBLE)
#define PyArray_ContiguousFromAny(op, type, min_depth, max_depth) \
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
max_depth, NPY_ARRAY_DEFAULT, NULL)
#define PyArray_EquivArrTypes(a1, a2) \
PyArray_EquivTypes(PyArray_DESCR(a1), PyArray_DESCR(a2))
#define PyArray_EquivByteorders(b1, b2) \
(((b1) == (b2)) || (PyArray_ISNBO(b1) == PyArray_ISNBO(b2)))
#define PyArray_SimpleNew(nd, dims, typenum) \
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
#define PyArray_SimpleNewFromData(nd, dims, typenum, data) \
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, \
data, 0, NPY_ARRAY_CARRAY, NULL)
#define PyArray_SimpleNewFromDescr(nd, dims, descr) \
PyArray_NewFromDescr(&PyArray_Type, descr, nd, dims, \
NULL, NULL, 0, NULL)
#define PyArray_ToScalar(data, arr) \
PyArray_Scalar(data, PyArray_DESCR(arr), (PyObject *)arr)
/* These might be faster without the dereferencing of obj
going on inside -- of course an optimizing compiler should
inline the constants inside a for loop making it a moot point
*/
#define PyArray_GETPTR1(obj, i) ((void *)(PyArray_BYTES(obj) + \
(i)*PyArray_STRIDES(obj)[0]))
#define PyArray_GETPTR2(obj, i, j) ((void *)(PyArray_BYTES(obj) + \
(i)*PyArray_STRIDES(obj)[0] + \
(j)*PyArray_STRIDES(obj)[1]))
#define PyArray_GETPTR3(obj, i, j, k) ((void *)(PyArray_BYTES(obj) + \
(i)*PyArray_STRIDES(obj)[0] + \
(j)*PyArray_STRIDES(obj)[1] + \
(k)*PyArray_STRIDES(obj)[2]))
#define PyArray_GETPTR4(obj, i, j, k, l) ((void *)(PyArray_BYTES(obj) + \
(i)*PyArray_STRIDES(obj)[0] + \
(j)*PyArray_STRIDES(obj)[1] + \
(k)*PyArray_STRIDES(obj)[2] + \
(l)*PyArray_STRIDES(obj)[3]))
static NPY_INLINE void
PyArray_XDECREF_ERR(PyArrayObject *arr)
{
if (arr != NULL) {
if (PyArray_FLAGS(arr) & NPY_ARRAY_UPDATEIFCOPY) {
PyArrayObject *base = (PyArrayObject *)PyArray_BASE(arr);
PyArray_ENABLEFLAGS(base, NPY_ARRAY_WRITEABLE);
PyArray_CLEARFLAGS(arr, NPY_ARRAY_UPDATEIFCOPY);
}
Py_DECREF(arr);
}
}
#define PyArray_DESCR_REPLACE(descr) do { \
PyArray_Descr *_new_; \
_new_ = PyArray_DescrNew(descr); \
Py_XDECREF(descr); \
descr = _new_; \
} while(0)
/* Copy should always return contiguous array */
#define PyArray_Copy(obj) PyArray_NewCopy(obj, NPY_CORDER)
#define PyArray_FromObject(op, type, min_depth, max_depth) \
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
max_depth, NPY_ARRAY_BEHAVED | \
NPY_ARRAY_ENSUREARRAY, NULL)
#define PyArray_ContiguousFromObject(op, type, min_depth, max_depth) \
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
max_depth, NPY_ARRAY_DEFAULT | \
NPY_ARRAY_ENSUREARRAY, NULL)
#define PyArray_CopyFromObject(op, type, min_depth, max_depth) \
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
max_depth, NPY_ARRAY_ENSURECOPY | \
NPY_ARRAY_DEFAULT | \
NPY_ARRAY_ENSUREARRAY, NULL)
#define PyArray_Cast(mp, type_num) \
PyArray_CastToType(mp, PyArray_DescrFromType(type_num), 0)
#define PyArray_Take(ap, items, axis) \
PyArray_TakeFrom(ap, items, axis, NULL, NPY_RAISE)
#define PyArray_Put(ap, items, values) \
PyArray_PutTo(ap, items, values, NPY_RAISE)
/* Compatibility with old Numeric stuff -- don't use in new code */
#define PyArray_FromDimsAndData(nd, d, type, data) \
PyArray_FromDimsAndDataAndDescr(nd, d, PyArray_DescrFromType(type), \
data)
/*
Check to see if this key in the dictionary is the "title"
entry of the tuple (i.e. a duplicate dictionary entry in the fields
dict).
*/
#define NPY_TITLE_KEY(key, value) ((PyTuple_GET_SIZE((value))==3) && \
(PyTuple_GET_ITEM((value), 2) == (key)))
/* Define Python-version-independent deprecation macros */
#if PY_VERSION_HEX >= 0x02050000
#define DEPRECATE(msg) PyErr_WarnEx(PyExc_DeprecationWarning,msg,1)
#define DEPRECATE_FUTUREWARNING(msg) PyErr_WarnEx(PyExc_FutureWarning,msg,1)
#else
#define DEPRECATE(msg) PyErr_Warn(PyExc_DeprecationWarning,msg)
#define DEPRECATE_FUTUREWARNING(msg) PyErr_Warn(PyExc_FutureWarning,msg)
#endif
#ifdef __cplusplus
}
#endif
#endif /* NPY_NDARRAYOBJECT_H */
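
The `PyArray_GETPTR1` to `PyArray_GETPTR4` macros above locate an element by adding index-times-stride byte offsets to the array's base pointer. A minimal standalone sketch of the two-dimensional case follows. It assumes strides are given in bytes, as with `PyArray_STRIDES`; `sketch_getptr2` and `sketch_sum2` are hypothetical helpers that work on plain C buffers, not on `PyArrayObject`.

```c
#include <assert.h>
#include <stddef.h>

/* Byte address of element (i, j); strides are expressed in bytes. */
static void *sketch_getptr2(void *data, const ptrdiff_t strides[2],
                            ptrdiff_t i, ptrdiff_t j)
{
    assert(data != NULL && strides != NULL);
    return (void *)((char *)data + i * strides[0] + j * strides[1]);
}

/* Sum a 2-D view of doubles by walking it through its strides. */
static double sketch_sum2(void *data, const ptrdiff_t shape[2],
                          const ptrdiff_t strides[2])
{
    double total = 0.0;
    ptrdiff_t i, j;

    for (i = 0; i < shape[0]; i++) {
        for (j = 0; j < shape[1]; j++) {
            total += *(double *)sketch_getptr2(data, strides, i, j);
        }
    }
    return total;
}
```

Because only the strides describe the layout, swapping the two stride entries gives a transposed view of the same buffer without copying any data.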

File diff suppressed because it is too large

@ -1,209 +0,0 @@
#ifndef NPY_NOPREFIX_H
#define NPY_NOPREFIX_H
/*
* You can directly include noprefix.h as a backward
* compatibility measure
*/
#ifndef NPY_NO_PREFIX
#include "ndarrayobject.h"
#include "npy_interrupt.h"
#endif
#define SIGSETJMP NPY_SIGSETJMP
#define SIGLONGJMP NPY_SIGLONGJMP
#define SIGJMP_BUF NPY_SIGJMP_BUF
#define MAX_DIMS NPY_MAXDIMS
#define longlong npy_longlong
#define ulonglong npy_ulonglong
#define Bool npy_bool
#define longdouble npy_longdouble
#define byte npy_byte
#ifndef _BSD_SOURCE
#define ushort npy_ushort
#define uint npy_uint
#define ulong npy_ulong
#endif
#define ubyte npy_ubyte
#define ushort npy_ushort
#define uint npy_uint
#define ulong npy_ulong
#define cfloat npy_cfloat
#define cdouble npy_cdouble
#define clongdouble npy_clongdouble
#define Int8 npy_int8
#define UInt8 npy_uint8
#define Int16 npy_int16
#define UInt16 npy_uint16
#define Int32 npy_int32
#define UInt32 npy_uint32
#define Int64 npy_int64
#define UInt64 npy_uint64
#define Int128 npy_int128
#define UInt128 npy_uint128
#define Int256 npy_int256
#define UInt256 npy_uint256
#define Float16 npy_float16
#define Complex32 npy_complex32
#define Float32 npy_float32
#define Complex64 npy_complex64
#define Float64 npy_float64
#define Complex128 npy_complex128
#define Float80 npy_float80
#define Complex160 npy_complex160
#define Float96 npy_float96
#define Complex192 npy_complex192
#define Float128 npy_float128
#define Complex256 npy_complex256
#define intp npy_intp
#define uintp npy_uintp
#define datetime npy_datetime
#define timedelta npy_timedelta
#define SIZEOF_INTP NPY_SIZEOF_INTP
#define SIZEOF_UINTP NPY_SIZEOF_UINTP
#define SIZEOF_DATETIME NPY_SIZEOF_DATETIME
#define SIZEOF_TIMEDELTA NPY_SIZEOF_TIMEDELTA
#define LONGLONG_FMT NPY_LONGLONG_FMT
#define ULONGLONG_FMT NPY_ULONGLONG_FMT
#define LONGLONG_SUFFIX NPY_LONGLONG_SUFFIX
#define ULONGLONG_SUFFIX NPY_ULONGLONG_SUFFIX
#define MAX_INT8 127
#define MIN_INT8 -128
#define MAX_UINT8 255
#define MAX_INT16 32767
#define MIN_INT16 -32768
#define MAX_UINT16 65535
#define MAX_INT32 2147483647
#define MIN_INT32 (-MAX_INT32 - 1)
#define MAX_UINT32 4294967295U
#define MAX_INT64 LONGLONG_SUFFIX(9223372036854775807)
#define MIN_INT64 (-MAX_INT64 - LONGLONG_SUFFIX(1))
#define MAX_UINT64 ULONGLONG_SUFFIX(18446744073709551615)
#define MAX_INT128 LONGLONG_SUFFIX(170141183460469231731687303715884105727)
#define MIN_INT128 (-MAX_INT128 - LONGLONG_SUFFIX(1))
#define MAX_UINT128 ULONGLONG_SUFFIX(340282366920938463463374607431768211455)
#define MAX_INT256 LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
#define MIN_INT256 (-MAX_INT256 - LONGLONG_SUFFIX(1))
#define MAX_UINT256 ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
#define MAX_BYTE NPY_MAX_BYTE
#define MIN_BYTE NPY_MIN_BYTE
#define MAX_UBYTE NPY_MAX_UBYTE
#define MAX_SHORT NPY_MAX_SHORT
#define MIN_SHORT NPY_MIN_SHORT
#define MAX_USHORT NPY_MAX_USHORT
#define MAX_INT NPY_MAX_INT
#define MIN_INT NPY_MIN_INT
#define MAX_UINT NPY_MAX_UINT
#define MAX_LONG NPY_MAX_LONG
#define MIN_LONG NPY_MIN_LONG
#define MAX_ULONG NPY_MAX_ULONG
#define MAX_LONGLONG NPY_MAX_LONGLONG
#define MIN_LONGLONG NPY_MIN_LONGLONG
#define MAX_ULONGLONG NPY_MAX_ULONGLONG
#define MIN_DATETIME NPY_MIN_DATETIME
#define MAX_DATETIME NPY_MAX_DATETIME
#define MIN_TIMEDELTA NPY_MIN_TIMEDELTA
#define MAX_TIMEDELTA NPY_MAX_TIMEDELTA
#define SIZEOF_LONGDOUBLE NPY_SIZEOF_LONGDOUBLE
#define SIZEOF_LONGLONG NPY_SIZEOF_LONGLONG
#define SIZEOF_HALF NPY_SIZEOF_HALF
#define BITSOF_BOOL NPY_BITSOF_BOOL
#define BITSOF_CHAR NPY_BITSOF_CHAR
#define BITSOF_SHORT NPY_BITSOF_SHORT
#define BITSOF_INT NPY_BITSOF_INT
#define BITSOF_LONG NPY_BITSOF_LONG
#define BITSOF_LONGLONG NPY_BITSOF_LONGLONG
#define BITSOF_HALF NPY_BITSOF_HALF
#define BITSOF_FLOAT NPY_BITSOF_FLOAT
#define BITSOF_DOUBLE NPY_BITSOF_DOUBLE
#define BITSOF_LONGDOUBLE NPY_BITSOF_LONGDOUBLE
#define BITSOF_DATETIME NPY_BITSOF_DATETIME
#define BITSOF_TIMEDELTA NPY_BITSOF_TIMEDELTA
#define _pya_malloc PyArray_malloc
#define _pya_free PyArray_free
#define _pya_realloc PyArray_realloc
#define BEGIN_THREADS_DEF NPY_BEGIN_THREADS_DEF
#define BEGIN_THREADS NPY_BEGIN_THREADS
#define END_THREADS NPY_END_THREADS
#define ALLOW_C_API_DEF NPY_ALLOW_C_API_DEF
#define ALLOW_C_API NPY_ALLOW_C_API
#define DISABLE_C_API NPY_DISABLE_C_API
#define PY_FAIL NPY_FAIL
#define PY_SUCCEED NPY_SUCCEED
#ifndef TRUE
#define TRUE NPY_TRUE
#endif
#ifndef FALSE
#define FALSE NPY_FALSE
#endif
#define LONGDOUBLE_FMT NPY_LONGDOUBLE_FMT
#define CONTIGUOUS NPY_CONTIGUOUS
#define C_CONTIGUOUS NPY_C_CONTIGUOUS
#define FORTRAN NPY_FORTRAN
#define F_CONTIGUOUS NPY_F_CONTIGUOUS
#define OWNDATA NPY_OWNDATA
#define FORCECAST NPY_FORCECAST
#define ENSURECOPY NPY_ENSURECOPY
#define ENSUREARRAY NPY_ENSUREARRAY
#define ELEMENTSTRIDES NPY_ELEMENTSTRIDES
#define ALIGNED NPY_ALIGNED
#define NOTSWAPPED NPY_NOTSWAPPED
#define WRITEABLE NPY_WRITEABLE
#define UPDATEIFCOPY NPY_UPDATEIFCOPY
#define ARR_HAS_DESCR NPY_ARR_HAS_DESCR
#define BEHAVED NPY_BEHAVED
#define BEHAVED_NS NPY_BEHAVED_NS
#define CARRAY NPY_CARRAY
#define CARRAY_RO NPY_CARRAY_RO
#define FARRAY NPY_FARRAY
#define FARRAY_RO NPY_FARRAY_RO
#define DEFAULT NPY_DEFAULT
#define IN_ARRAY NPY_IN_ARRAY
#define OUT_ARRAY NPY_OUT_ARRAY
#define INOUT_ARRAY NPY_INOUT_ARRAY
#define IN_FARRAY NPY_IN_FARRAY
#define OUT_FARRAY NPY_OUT_FARRAY
#define INOUT_FARRAY NPY_INOUT_FARRAY
#define UPDATE_ALL NPY_UPDATE_ALL
#define OWN_DATA NPY_OWNDATA
#define BEHAVED_FLAGS NPY_BEHAVED
#define BEHAVED_FLAGS_NS NPY_BEHAVED_NS
#define CARRAY_FLAGS_RO NPY_CARRAY_RO
#define CARRAY_FLAGS NPY_CARRAY
#define FARRAY_FLAGS NPY_FARRAY
#define FARRAY_FLAGS_RO NPY_FARRAY_RO
#define DEFAULT_FLAGS NPY_DEFAULT
#define UPDATE_ALL_FLAGS NPY_UPDATE_ALL_FLAGS
#ifndef MIN
#define MIN PyArray_MIN
#endif
#ifndef MAX
#define MAX PyArray_MAX
#endif
#define MAX_INTP NPY_MAX_INTP
#define MIN_INTP NPY_MIN_INTP
#define MAX_UINTP NPY_MAX_UINTP
#define INTP_FMT NPY_INTP_FMT
#define REFCOUNT PyArray_REFCOUNT
#define MAX_ELSIZE NPY_MAX_ELSIZE
#endif

@ -1,417 +0,0 @@
/*
* This is a convenience header file providing compatibility utilities
* for supporting Python 2 and Python 3 in the same code base.
*
* If you want to use this for your own projects, it's recommended to make a
* copy of it. Although the stuff below is unlikely to change, we don't provide
* strong backwards compatibility guarantees at the moment.
*/
#ifndef _NPY_3KCOMPAT_H_
#define _NPY_3KCOMPAT_H_
#include <Python.h>
#include <stdio.h>
#if PY_VERSION_HEX >= 0x03000000
#ifndef NPY_PY3K
#define NPY_PY3K 1
#endif
#endif
#include "numpy/npy_common.h"
#include "numpy/ndarrayobject.h"
#ifdef __cplusplus
extern "C" {
#endif
/*
* PyInt -> PyLong
*/
#if defined(NPY_PY3K)
/* Return True only if the long fits in a C long */
static NPY_INLINE int PyInt_Check(PyObject *op) {
int overflow = 0;
if (!PyLong_Check(op)) {
return 0;
}
PyLong_AsLongAndOverflow(op, &overflow);
return (overflow == 0);
}
#define PyInt_FromLong PyLong_FromLong
#define PyInt_AsLong PyLong_AsLong
#define PyInt_AS_LONG PyLong_AsLong
#define PyInt_AsSsize_t PyLong_AsSsize_t
/* NOTE:
*
* Since the PyLong type is very different from the fixed-range PyInt,
* we don't define PyInt_Type -> PyLong_Type.
*/
#endif /* NPY_PY3K */
/*
* PyString -> PyBytes
*/
#if defined(NPY_PY3K)
#define PyString_Type PyBytes_Type
#define PyString_Check PyBytes_Check
#define PyStringObject PyBytesObject
#define PyString_FromString PyBytes_FromString
#define PyString_FromStringAndSize PyBytes_FromStringAndSize
#define PyString_AS_STRING PyBytes_AS_STRING
#define PyString_AsStringAndSize PyBytes_AsStringAndSize
#define PyString_FromFormat PyBytes_FromFormat
#define PyString_Concat PyBytes_Concat
#define PyString_ConcatAndDel PyBytes_ConcatAndDel
#define PyString_AsString PyBytes_AsString
#define PyString_GET_SIZE PyBytes_GET_SIZE
#define PyString_Size PyBytes_Size
#define PyUString_Type PyUnicode_Type
#define PyUString_Check PyUnicode_Check
#define PyUStringObject PyUnicodeObject
#define PyUString_FromString PyUnicode_FromString
#define PyUString_FromStringAndSize PyUnicode_FromStringAndSize
#define PyUString_FromFormat PyUnicode_FromFormat
#define PyUString_Concat PyUnicode_Concat2
#define PyUString_ConcatAndDel PyUnicode_ConcatAndDel
#define PyUString_GET_SIZE PyUnicode_GET_SIZE
#define PyUString_Size PyUnicode_Size
#define PyUString_InternFromString PyUnicode_InternFromString
#define PyUString_Format PyUnicode_Format
#else
#define PyBytes_Type PyString_Type
#define PyBytes_Check PyString_Check
#define PyBytesObject PyStringObject
#define PyBytes_FromString PyString_FromString
#define PyBytes_FromStringAndSize PyString_FromStringAndSize
#define PyBytes_AS_STRING PyString_AS_STRING
#define PyBytes_AsStringAndSize PyString_AsStringAndSize
#define PyBytes_FromFormat PyString_FromFormat
#define PyBytes_Concat PyString_Concat
#define PyBytes_ConcatAndDel PyString_ConcatAndDel
#define PyBytes_AsString PyString_AsString
#define PyBytes_GET_SIZE PyString_GET_SIZE
#define PyBytes_Size PyString_Size
#define PyUString_Type PyString_Type
#define PyUString_Check PyString_Check
#define PyUStringObject PyStringObject
#define PyUString_FromString PyString_FromString
#define PyUString_FromStringAndSize PyString_FromStringAndSize
#define PyUString_FromFormat PyString_FromFormat
#define PyUString_Concat PyString_Concat
#define PyUString_ConcatAndDel PyString_ConcatAndDel
#define PyUString_GET_SIZE PyString_GET_SIZE
#define PyUString_Size PyString_Size
#define PyUString_InternFromString PyString_InternFromString
#define PyUString_Format PyString_Format
#endif /* NPY_PY3K */
static NPY_INLINE void
PyUnicode_ConcatAndDel(PyObject **left, PyObject *right)
{
PyObject *newobj;
newobj = PyUnicode_Concat(*left, right);
Py_DECREF(*left);
Py_DECREF(right);
*left = newobj;
}
static NPY_INLINE void
PyUnicode_Concat2(PyObject **left, PyObject *right)
{
PyObject *newobj;
newobj = PyUnicode_Concat(*left, right);
Py_DECREF(*left);
*left = newobj;
}
/*
* PyFile_* compatibility
*/
#if defined(NPY_PY3K)
/*
* Get a FILE* handle to the file represented by the Python object
*/
static NPY_INLINE FILE*
npy_PyFile_Dup(PyObject *file, char *mode)
{
int fd, fd2;
PyObject *ret, *os;
Py_ssize_t pos;
FILE *handle;
/* Flush first to ensure things end up in the file in the correct order */
ret = PyObject_CallMethod(file, "flush", "");
if (ret == NULL) {
return NULL;
}
Py_DECREF(ret);
fd = PyObject_AsFileDescriptor(file);
if (fd == -1) {
return NULL;
}
os = PyImport_ImportModule("os");
if (os == NULL) {
return NULL;
}
ret = PyObject_CallMethod(os, "dup", "i", fd);
Py_DECREF(os);
if (ret == NULL) {
return NULL;
}
fd2 = PyNumber_AsSsize_t(ret, NULL);
Py_DECREF(ret);
#ifdef _WIN32
handle = _fdopen(fd2, mode);
#else
handle = fdopen(fd2, mode);
#endif
if (handle == NULL) {
PyErr_SetString(PyExc_IOError,
"Getting a FILE* from a Python file object failed");
/* Bail out here; the code below must not touch a NULL handle */
return NULL;
}
ret = PyObject_CallMethod(file, "tell", "");
if (ret == NULL) {
fclose(handle);
return NULL;
}
pos = PyNumber_AsSsize_t(ret, PyExc_OverflowError);
Py_DECREF(ret);
if (PyErr_Occurred()) {
fclose(handle);
return NULL;
}
npy_fseek(handle, pos, SEEK_SET);
return handle;
}
/*
* Close the dup-ed file handle, and seek the Python one to the current position
*/
static NPY_INLINE int
npy_PyFile_DupClose(PyObject *file, FILE* handle)
{
PyObject *ret;
Py_ssize_t position;
position = npy_ftell(handle);
fclose(handle);
ret = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", position, 0);
if (ret == NULL) {
return -1;
}
Py_DECREF(ret);
return 0;
}
static NPY_INLINE int
npy_PyFile_Check(PyObject *file)
{
int fd;
fd = PyObject_AsFileDescriptor(file);
if (fd == -1) {
PyErr_Clear();
return 0;
}
return 1;
}
#else
#define npy_PyFile_Dup(file, mode) PyFile_AsFile(file)
#define npy_PyFile_DupClose(file, handle) (0)
#define npy_PyFile_Check PyFile_Check
#endif
static NPY_INLINE PyObject*
npy_PyFile_OpenFile(PyObject *filename, const char *mode)
{
PyObject *open;
open = PyDict_GetItemString(PyEval_GetBuiltins(), "open");
if (open == NULL) {
return NULL;
}
return PyObject_CallFunction(open, "Os", filename, mode);
}
static NPY_INLINE int
npy_PyFile_CloseFile(PyObject *file)
{
PyObject *ret;
ret = PyObject_CallMethod(file, "close", NULL);
if (ret == NULL) {
return -1;
}
Py_DECREF(ret);
return 0;
}
/*
* PyObject_Cmp
*/
#if defined(NPY_PY3K)
static NPY_INLINE int
PyObject_Cmp(PyObject *i1, PyObject *i2, int *cmp)
{
int v;
v = PyObject_RichCompareBool(i1, i2, Py_LT);
if (v == 1) {
*cmp = -1;
return 1;
}
else if (v == -1) {
return -1;
}
v = PyObject_RichCompareBool(i1, i2, Py_GT);
if (v == 1) {
*cmp = 1;
return 1;
}
else if (v == -1) {
return -1;
}
v = PyObject_RichCompareBool(i1, i2, Py_EQ);
if (v == 1) {
*cmp = 0;
return 1;
}
else {
*cmp = 0;
return -1;
}
}
#endif
/*
* PyCObject functions adapted to PyCapsules.
*
* The main job here is to get rid of the improved error handling
* of PyCapsules. It's a shame...
*/
#if PY_VERSION_HEX >= 0x03000000
static NPY_INLINE PyObject *
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(PyObject *))
{
PyObject *ret = PyCapsule_New(ptr, NULL, dtor);
if (ret == NULL) {
PyErr_Clear();
}
return ret;
}
static NPY_INLINE PyObject *
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context, void (*dtor)(PyObject *))
{
PyObject *ret = NpyCapsule_FromVoidPtr(ptr, dtor);
if (ret != NULL && PyCapsule_SetContext(ret, context) != 0) {
PyErr_Clear();
Py_DECREF(ret);
ret = NULL;
}
return ret;
}
static NPY_INLINE void *
NpyCapsule_AsVoidPtr(PyObject *obj)
{
void *ret = PyCapsule_GetPointer(obj, NULL);
if (ret == NULL) {
PyErr_Clear();
}
return ret;
}
static NPY_INLINE void *
NpyCapsule_GetDesc(PyObject *obj)
{
return PyCapsule_GetContext(obj);
}
static NPY_INLINE int
NpyCapsule_Check(PyObject *ptr)
{
return PyCapsule_CheckExact(ptr);
}
static NPY_INLINE void
simple_capsule_dtor(PyObject *cap)
{
PyArray_free(PyCapsule_GetPointer(cap, NULL));
}
#else
static NPY_INLINE PyObject *
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(void *))
{
return PyCObject_FromVoidPtr(ptr, dtor);
}
static NPY_INLINE PyObject *
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context,
void (*dtor)(void *, void *))
{
return PyCObject_FromVoidPtrAndDesc(ptr, context, dtor);
}
static NPY_INLINE void *
NpyCapsule_AsVoidPtr(PyObject *ptr)
{
return PyCObject_AsVoidPtr(ptr);
}
static NPY_INLINE void *
NpyCapsule_GetDesc(PyObject *obj)
{
return PyCObject_GetDesc(obj);
}
static NPY_INLINE int
NpyCapsule_Check(PyObject *ptr)
{
return PyCObject_Check(ptr);
}
static NPY_INLINE void
simple_capsule_dtor(void *ptr)
{
PyArray_free(ptr);
}
#endif
/*
* Hash value compatibility.
* As of Python 3.2 hash values are of type Py_hash_t.
* Previous versions use C long.
*/
#if PY_VERSION_HEX < 0x03020000
typedef long npy_hash_t;
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_LONG
#else
typedef Py_hash_t npy_hash_t;
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_INTP
#endif
#ifdef __cplusplus
}
#endif
#endif /* _NPY_3KCOMPAT_H_ */
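
The `PyObject_Cmp` shim above is meant to recover a three-way result from boolean rich comparisons, where `PyObject_RichCompareBool` returns 1 when the relation holds, 0 when it does not, and -1 on error. A minimal standalone sketch of that idea follows. The `sketch_*` names and the callback type are hypothetical and stand in for the CPython calls so the snippet compiles without `Python.h`.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_op { SKETCH_LT, SKETCH_GT, SKETCH_EQ };

/* Hypothetical callback: 1 if the relation holds, 0 if not, -1 on error. */
typedef int (*sketch_richcmp)(const void *a, const void *b, enum sketch_op op);

/*
 * Try <, then >, then ==. On success return 1 and store -1, 1 or 0 in *cmp.
 * Return -1 if the callback reports an error or no relation holds.
 */
static int sketch_three_way(const void *a, const void *b,
                            sketch_richcmp test, int *cmp)
{
    static const enum sketch_op ops[3] = { SKETCH_LT, SKETCH_GT, SKETCH_EQ };
    static const int results[3] = { -1, 1, 0 };
    int k;

    assert(test != NULL && cmp != NULL);
    *cmp = 0;
    for (k = 0; k < 3; k++) {
        int v = test(a, b, ops[k]);
        if (v < 0) {
            return -1;              /* comparison itself failed */
        }
        if (v == 1) {
            *cmp = results[k];      /* first relation that holds wins */
            return 1;
        }
    }
    return -1;                      /* unordered: none of <, >, == held */
}

/* Example callback that compares two ints. */
static int sketch_int_richcmp(const void *a, const void *b, enum sketch_op op)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    switch (op) {
    case SKETCH_LT: return x < y;
    case SKETCH_GT: return x > y;
    case SKETCH_EQ: return x == y;
    }
    return -1;
}
```

The important detail is that the branch is taken when the callback returns 1, not 0; testing for 0 would report "less than" exactly when the first operand is not less than the second.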

@ -1,930 +0,0 @@
#ifndef _NPY_COMMON_H_
#define _NPY_COMMON_H_
/* numpyconfig.h is auto-generated */
#include "numpyconfig.h"
#if defined(_MSC_VER)
#define NPY_INLINE __inline
#elif defined(__GNUC__)
#if defined(__STRICT_ANSI__)
#define NPY_INLINE __inline__
#else
#define NPY_INLINE inline
#endif
#else
#define NPY_INLINE
#endif
/* Enable 64 bit file position support on win-amd64. Ticket #1660 */
#if defined(_MSC_VER) && defined(_WIN64) && (_MSC_VER > 1400)
#define npy_fseek _fseeki64
#define npy_ftell _ftelli64
#else
#define npy_fseek fseek
#define npy_ftell ftell
#endif
/* enums for detected endianness */
enum {
NPY_CPU_UNKNOWN_ENDIAN,
NPY_CPU_LITTLE,
NPY_CPU_BIG
};
/*
* This is to typedef npy_intp to the appropriate pointer size for
* this platform. Py_intptr_t, Py_uintptr_t are defined in pyport.h.
*/
typedef Py_intptr_t npy_intp;
typedef Py_uintptr_t npy_uintp;
#define NPY_SIZEOF_CHAR 1
#define NPY_SIZEOF_BYTE 1
#define NPY_SIZEOF_INTP NPY_SIZEOF_PY_INTPTR_T
#define NPY_SIZEOF_UINTP NPY_SIZEOF_PY_INTPTR_T
#define NPY_SIZEOF_CFLOAT NPY_SIZEOF_COMPLEX_FLOAT
#define NPY_SIZEOF_CDOUBLE NPY_SIZEOF_COMPLEX_DOUBLE
#define NPY_SIZEOF_CLONGDOUBLE NPY_SIZEOF_COMPLEX_LONGDOUBLE
#ifdef constchar
#undef constchar
#endif
#if (PY_VERSION_HEX < 0x02050000)
#ifndef PY_SSIZE_T_MIN
typedef int Py_ssize_t;
#define PY_SSIZE_T_MAX INT_MAX
#define PY_SSIZE_T_MIN INT_MIN
#endif
#define NPY_SSIZE_T_PYFMT "i"
#define constchar const char
#else
#define NPY_SSIZE_T_PYFMT "n"
#define constchar char
#endif
/* NPY_INTP_FMT Note:
* Unlike the other NPY_*_FMT macros which are used with
* PyOS_snprintf, NPY_INTP_FMT is used with PyErr_Format and
* PyString_Format. These functions use different formatting
* codes which are portably specified according to the Python
* documentation. See ticket #1795.
*
* On Windows x64, the LONGLONG formatter should be used, but
* in Python 2.6 the %lld formatter is not supported. In this
* case we work around the problem by using the %zd formatter.
*/
#if NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_INT
#define NPY_INTP NPY_INT
#define NPY_UINTP NPY_UINT
#define PyIntpArrType_Type PyIntArrType_Type
#define PyUIntpArrType_Type PyUIntArrType_Type
#define NPY_MAX_INTP NPY_MAX_INT
#define NPY_MIN_INTP NPY_MIN_INT
#define NPY_MAX_UINTP NPY_MAX_UINT
#define NPY_INTP_FMT "d"
#elif NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONG
#define NPY_INTP NPY_LONG
#define NPY_UINTP NPY_ULONG
#define PyIntpArrType_Type PyLongArrType_Type
#define PyUIntpArrType_Type PyULongArrType_Type
#define NPY_MAX_INTP NPY_MAX_LONG
#define NPY_MIN_INTP NPY_MIN_LONG
#define NPY_MAX_UINTP NPY_MAX_ULONG
#define NPY_INTP_FMT "ld"
#elif defined(PY_LONG_LONG) && (NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONGLONG)
#define NPY_INTP NPY_LONGLONG
#define NPY_UINTP NPY_ULONGLONG
#define PyIntpArrType_Type PyLongLongArrType_Type
#define PyUIntpArrType_Type PyULongLongArrType_Type
#define NPY_MAX_INTP NPY_MAX_LONGLONG
#define NPY_MIN_INTP NPY_MIN_LONGLONG
#define NPY_MAX_UINTP NPY_MAX_ULONGLONG
#if (PY_VERSION_HEX >= 0x02070000)
#define NPY_INTP_FMT "lld"
#else
#define NPY_INTP_FMT "zd"
#endif
#endif
/*
* We can only use C99 formats for npy_intp if it is the same as
* intptr_t, hence the condition on HAVE_UINTPTR_T
*/
#if (NPY_USE_C99_FORMATS) == 1 \
&& (defined HAVE_UINTPTR_T) \
&& (defined HAVE_INTTYPES_H)
#include <inttypes.h>
#undef NPY_INTP_FMT
#define NPY_INTP_FMT PRIdPTR
#endif
/*
* Some platforms don't define bool, long long, or long double.
* Handle that here.
*/
#define NPY_BYTE_FMT "hhd"
#define NPY_UBYTE_FMT "hhu"
#define NPY_SHORT_FMT "hd"
#define NPY_USHORT_FMT "hu"
#define NPY_INT_FMT "d"
#define NPY_UINT_FMT "u"
#define NPY_LONG_FMT "ld"
#define NPY_ULONG_FMT "lu"
#define NPY_HALF_FMT "g"
#define NPY_FLOAT_FMT "g"
#define NPY_DOUBLE_FMT "g"
#ifdef PY_LONG_LONG
typedef PY_LONG_LONG npy_longlong;
typedef unsigned PY_LONG_LONG npy_ulonglong;
# ifdef _MSC_VER
# define NPY_LONGLONG_FMT "I64d"
# define NPY_ULONGLONG_FMT "I64u"
# elif defined(__APPLE__) || defined(__FreeBSD__)
/* "%Ld" only parses 4 bytes -- "L" is floating modifier on MacOS X/BSD */
# define NPY_LONGLONG_FMT "lld"
# define NPY_ULONGLONG_FMT "llu"
/*
another possible variant -- *quad_t works on *BSD, but is deprecated:
#define LONGLONG_FMT "qd"
#define ULONGLONG_FMT "qu"
*/
# else
# define NPY_LONGLONG_FMT "Ld"
# define NPY_ULONGLONG_FMT "Lu"
# endif
# ifdef _MSC_VER
# define NPY_LONGLONG_SUFFIX(x) (x##i64)
# define NPY_ULONGLONG_SUFFIX(x) (x##Ui64)
# else
# define NPY_LONGLONG_SUFFIX(x) (x##LL)
# define NPY_ULONGLONG_SUFFIX(x) (x##ULL)
# endif
#else
typedef long npy_longlong;
typedef unsigned long npy_ulonglong;
# define NPY_LONGLONG_SUFFIX(x) (x##L)
# define NPY_ULONGLONG_SUFFIX(x) (x##UL)
#endif
typedef unsigned char npy_bool;
#define NPY_FALSE 0
#define NPY_TRUE 1
#if NPY_SIZEOF_LONGDOUBLE == NPY_SIZEOF_DOUBLE
typedef double npy_longdouble;
#define NPY_LONGDOUBLE_FMT "g"
#else
typedef long double npy_longdouble;
#define NPY_LONGDOUBLE_FMT "Lg"
#endif
#ifndef Py_USING_UNICODE
#error Must use Python with unicode enabled.
#endif
typedef signed char npy_byte;
typedef unsigned char npy_ubyte;
typedef unsigned short npy_ushort;
typedef unsigned int npy_uint;
typedef unsigned long npy_ulong;
/* These are for completeness */
typedef char npy_char;
typedef short npy_short;
typedef int npy_int;
typedef long npy_long;
typedef float npy_float;
typedef double npy_double;
/*
* Disabling C99 complex usage: a lot of C code in numpy/scipy relies on being
* able to do .real/.imag. Will have to convert code first.
*/
#if 0
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_DOUBLE)
typedef complex npy_cdouble;
#else
typedef struct { double real, imag; } npy_cdouble;
#endif
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_FLOAT)
typedef complex float npy_cfloat;
#else
typedef struct { float real, imag; } npy_cfloat;
#endif
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_LONG_DOUBLE)
typedef complex long double npy_clongdouble;
#else
typedef struct {npy_longdouble real, imag;} npy_clongdouble;
#endif
#endif
#if NPY_SIZEOF_COMPLEX_DOUBLE != 2 * NPY_SIZEOF_DOUBLE
#error npy_cdouble definition is not compatible with C99 complex definition ! \
Please contact Numpy maintainers and give detailed information about your \
compiler and platform
#endif
typedef struct { double real, imag; } npy_cdouble;
#if NPY_SIZEOF_COMPLEX_FLOAT != 2 * NPY_SIZEOF_FLOAT
#error npy_cfloat definition is not compatible with C99 complex definition ! \
Please contact Numpy maintainers and give detailed information about your \
compiler and platform
#endif
typedef struct { float real, imag; } npy_cfloat;
#if NPY_SIZEOF_COMPLEX_LONGDOUBLE != 2 * NPY_SIZEOF_LONGDOUBLE
#error npy_clongdouble definition is not compatible with C99 complex definition ! \
Please contact Numpy maintainers and give detailed information about your \
compiler and platform
#endif
typedef struct { npy_longdouble real, imag; } npy_clongdouble;
/*
* numarray-style bit-width typedefs
*/
#define NPY_MAX_INT8 127
#define NPY_MIN_INT8 -128
#define NPY_MAX_UINT8 255
#define NPY_MAX_INT16 32767
#define NPY_MIN_INT16 -32768
#define NPY_MAX_UINT16 65535
#define NPY_MAX_INT32 2147483647
#define NPY_MIN_INT32 (-NPY_MAX_INT32 - 1)
#define NPY_MAX_UINT32 4294967295U
#define NPY_MAX_INT64 NPY_LONGLONG_SUFFIX(9223372036854775807)
#define NPY_MIN_INT64 (-NPY_MAX_INT64 - NPY_LONGLONG_SUFFIX(1))
#define NPY_MAX_UINT64 NPY_ULONGLONG_SUFFIX(18446744073709551615)
#define NPY_MAX_INT128 NPY_LONGLONG_SUFFIX(170141183460469231731687303715884105727)
#define NPY_MIN_INT128 (-NPY_MAX_INT128 - NPY_LONGLONG_SUFFIX(1))
#define NPY_MAX_UINT128 NPY_ULONGLONG_SUFFIX(340282366920938463463374607431768211455)
#define NPY_MAX_INT256 NPY_LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
#define NPY_MIN_INT256 (-NPY_MAX_INT256 - NPY_LONGLONG_SUFFIX(1))
#define NPY_MAX_UINT256 NPY_ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
#define NPY_MIN_DATETIME NPY_MIN_INT64
#define NPY_MAX_DATETIME NPY_MAX_INT64
#define NPY_MIN_TIMEDELTA NPY_MIN_INT64
#define NPY_MAX_TIMEDELTA NPY_MAX_INT64
/* Need to find the number of bits for each type and
make definitions accordingly.
C states that sizeof(char) == 1 by definition,
so just using the sizeof keyword won't help.
It also looks like Python itself uses sizeof(char) quite a
bit, which by definition should be 1 all the time.
Idea: make use of CHAR_BIT, which tells us how many
bits there are per char.
*/
/* Include platform definitions -- These are in the C89/90 standard */
#include <limits.h>
#define NPY_MAX_BYTE SCHAR_MAX
#define NPY_MIN_BYTE SCHAR_MIN
#define NPY_MAX_UBYTE UCHAR_MAX
#define NPY_MAX_SHORT SHRT_MAX
#define NPY_MIN_SHORT SHRT_MIN
#define NPY_MAX_USHORT USHRT_MAX
#define NPY_MAX_INT INT_MAX
#ifndef INT_MIN
#define INT_MIN (-INT_MAX - 1)
#endif
#define NPY_MIN_INT INT_MIN
#define NPY_MAX_UINT UINT_MAX
#define NPY_MAX_LONG LONG_MAX
#define NPY_MIN_LONG LONG_MIN
#define NPY_MAX_ULONG ULONG_MAX
#define NPY_SIZEOF_HALF 2
#define NPY_SIZEOF_DATETIME 8
#define NPY_SIZEOF_TIMEDELTA 8
#define NPY_BITSOF_BOOL (sizeof(npy_bool) * CHAR_BIT)
#define NPY_BITSOF_CHAR CHAR_BIT
#define NPY_BITSOF_BYTE (NPY_SIZEOF_BYTE * CHAR_BIT)
#define NPY_BITSOF_SHORT (NPY_SIZEOF_SHORT * CHAR_BIT)
#define NPY_BITSOF_INT (NPY_SIZEOF_INT * CHAR_BIT)
#define NPY_BITSOF_LONG (NPY_SIZEOF_LONG * CHAR_BIT)
#define NPY_BITSOF_LONGLONG (NPY_SIZEOF_LONGLONG * CHAR_BIT)
#define NPY_BITSOF_INTP (NPY_SIZEOF_INTP * CHAR_BIT)
#define NPY_BITSOF_HALF (NPY_SIZEOF_HALF * CHAR_BIT)
#define NPY_BITSOF_FLOAT (NPY_SIZEOF_FLOAT * CHAR_BIT)
#define NPY_BITSOF_DOUBLE (NPY_SIZEOF_DOUBLE * CHAR_BIT)
#define NPY_BITSOF_LONGDOUBLE (NPY_SIZEOF_LONGDOUBLE * CHAR_BIT)
#define NPY_BITSOF_CFLOAT (NPY_SIZEOF_CFLOAT * CHAR_BIT)
#define NPY_BITSOF_CDOUBLE (NPY_SIZEOF_CDOUBLE * CHAR_BIT)
#define NPY_BITSOF_CLONGDOUBLE (NPY_SIZEOF_CLONGDOUBLE * CHAR_BIT)
#define NPY_BITSOF_DATETIME (NPY_SIZEOF_DATETIME * CHAR_BIT)
#define NPY_BITSOF_TIMEDELTA (NPY_SIZEOF_TIMEDELTA * CHAR_BIT)
#if NPY_BITSOF_LONG == 8
#define NPY_INT8 NPY_LONG
#define NPY_UINT8 NPY_ULONG
typedef long npy_int8;
typedef unsigned long npy_uint8;
#define PyInt8ScalarObject PyLongScalarObject
#define PyInt8ArrType_Type PyLongArrType_Type
#define PyUInt8ScalarObject PyULongScalarObject
#define PyUInt8ArrType_Type PyULongArrType_Type
#define NPY_INT8_FMT NPY_LONG_FMT
#define NPY_UINT8_FMT NPY_ULONG_FMT
#elif NPY_BITSOF_LONG == 16
#define NPY_INT16 NPY_LONG
#define NPY_UINT16 NPY_ULONG
typedef long npy_int16;
typedef unsigned long npy_uint16;
#define PyInt16ScalarObject PyLongScalarObject
#define PyInt16ArrType_Type PyLongArrType_Type
#define PyUInt16ScalarObject PyULongScalarObject
#define PyUInt16ArrType_Type PyULongArrType_Type
#define NPY_INT16_FMT NPY_LONG_FMT
#define NPY_UINT16_FMT NPY_ULONG_FMT
#elif NPY_BITSOF_LONG == 32
#define NPY_INT32 NPY_LONG
#define NPY_UINT32 NPY_ULONG
typedef long npy_int32;
typedef unsigned long npy_uint32;
typedef unsigned long npy_ucs4;
#define PyInt32ScalarObject PyLongScalarObject
#define PyInt32ArrType_Type PyLongArrType_Type
#define PyUInt32ScalarObject PyULongScalarObject
#define PyUInt32ArrType_Type PyULongArrType_Type
#define NPY_INT32_FMT NPY_LONG_FMT
#define NPY_UINT32_FMT NPY_ULONG_FMT
#elif NPY_BITSOF_LONG == 64
#define NPY_INT64 NPY_LONG
#define NPY_UINT64 NPY_ULONG
typedef long npy_int64;
typedef unsigned long npy_uint64;
#define PyInt64ScalarObject PyLongScalarObject
#define PyInt64ArrType_Type PyLongArrType_Type
#define PyUInt64ScalarObject PyULongScalarObject
#define PyUInt64ArrType_Type PyULongArrType_Type
#define NPY_INT64_FMT NPY_LONG_FMT
#define NPY_UINT64_FMT NPY_ULONG_FMT
#define MyPyLong_FromInt64 PyLong_FromLong
#define MyPyLong_AsInt64 PyLong_AsLong
#elif NPY_BITSOF_LONG == 128
#define NPY_INT128 NPY_LONG
#define NPY_UINT128 NPY_ULONG
typedef long npy_int128;
typedef unsigned long npy_uint128;
#define PyInt128ScalarObject PyLongScalarObject
#define PyInt128ArrType_Type PyLongArrType_Type
#define PyUInt128ScalarObject PyULongScalarObject
#define PyUInt128ArrType_Type PyULongArrType_Type
#define NPY_INT128_FMT NPY_LONG_FMT
#define NPY_UINT128_FMT NPY_ULONG_FMT
#endif
#if NPY_BITSOF_LONGLONG == 8
# ifndef NPY_INT8
# define NPY_INT8 NPY_LONGLONG
# define NPY_UINT8 NPY_ULONGLONG
typedef npy_longlong npy_int8;
typedef npy_ulonglong npy_uint8;
# define PyInt8ScalarObject PyLongLongScalarObject
# define PyInt8ArrType_Type PyLongLongArrType_Type
# define PyUInt8ScalarObject PyULongLongScalarObject
# define PyUInt8ArrType_Type PyULongLongArrType_Type
#define NPY_INT8_FMT NPY_LONGLONG_FMT
#define NPY_UINT8_FMT NPY_ULONGLONG_FMT
# endif
# define NPY_MAX_LONGLONG NPY_MAX_INT8
# define NPY_MIN_LONGLONG NPY_MIN_INT8
# define NPY_MAX_ULONGLONG NPY_MAX_UINT8
#elif NPY_BITSOF_LONGLONG == 16
# ifndef NPY_INT16
# define NPY_INT16 NPY_LONGLONG
# define NPY_UINT16 NPY_ULONGLONG
typedef npy_longlong npy_int16;
typedef npy_ulonglong npy_uint16;
# define PyInt16ScalarObject PyLongLongScalarObject
# define PyInt16ArrType_Type PyLongLongArrType_Type
# define PyUInt16ScalarObject PyULongLongScalarObject
# define PyUInt16ArrType_Type PyULongLongArrType_Type
#define NPY_INT16_FMT NPY_LONGLONG_FMT
#define NPY_UINT16_FMT NPY_ULONGLONG_FMT
# endif
# define NPY_MAX_LONGLONG NPY_MAX_INT16
# define NPY_MIN_LONGLONG NPY_MIN_INT16
# define NPY_MAX_ULONGLONG NPY_MAX_UINT16
#elif NPY_BITSOF_LONGLONG == 32
# ifndef NPY_INT32
# define NPY_INT32 NPY_LONGLONG
# define NPY_UINT32 NPY_ULONGLONG
typedef npy_longlong npy_int32;
typedef npy_ulonglong npy_uint32;
typedef npy_ulonglong npy_ucs4;
# define PyInt32ScalarObject PyLongLongScalarObject
# define PyInt32ArrType_Type PyLongLongArrType_Type
# define PyUInt32ScalarObject PyULongLongScalarObject
# define PyUInt32ArrType_Type PyULongLongArrType_Type
#define NPY_INT32_FMT NPY_LONGLONG_FMT
#define NPY_UINT32_FMT NPY_ULONGLONG_FMT
# endif
# define NPY_MAX_LONGLONG NPY_MAX_INT32
# define NPY_MIN_LONGLONG NPY_MIN_INT32
# define NPY_MAX_ULONGLONG NPY_MAX_UINT32
#elif NPY_BITSOF_LONGLONG == 64
# ifndef NPY_INT64
# define NPY_INT64 NPY_LONGLONG
# define NPY_UINT64 NPY_ULONGLONG
typedef npy_longlong npy_int64;
typedef npy_ulonglong npy_uint64;
# define PyInt64ScalarObject PyLongLongScalarObject
# define PyInt64ArrType_Type PyLongLongArrType_Type
# define PyUInt64ScalarObject PyULongLongScalarObject
# define PyUInt64ArrType_Type PyULongLongArrType_Type
#define NPY_INT64_FMT NPY_LONGLONG_FMT
#define NPY_UINT64_FMT NPY_ULONGLONG_FMT
# define MyPyLong_FromInt64 PyLong_FromLongLong
# define MyPyLong_AsInt64 PyLong_AsLongLong
# endif
# define NPY_MAX_LONGLONG NPY_MAX_INT64
# define NPY_MIN_LONGLONG NPY_MIN_INT64
# define NPY_MAX_ULONGLONG NPY_MAX_UINT64
#elif NPY_BITSOF_LONGLONG == 128
# ifndef NPY_INT128
# define NPY_INT128 NPY_LONGLONG
# define NPY_UINT128 NPY_ULONGLONG
typedef npy_longlong npy_int128;
typedef npy_ulonglong npy_uint128;
# define PyInt128ScalarObject PyLongLongScalarObject
# define PyInt128ArrType_Type PyLongLongArrType_Type
# define PyUInt128ScalarObject PyULongLongScalarObject
# define PyUInt128ArrType_Type PyULongLongArrType_Type
#define NPY_INT128_FMT NPY_LONGLONG_FMT
#define NPY_UINT128_FMT NPY_ULONGLONG_FMT
# endif
# define NPY_MAX_LONGLONG NPY_MAX_INT128
# define NPY_MIN_LONGLONG NPY_MIN_INT128
# define NPY_MAX_ULONGLONG NPY_MAX_UINT128
#elif NPY_BITSOF_LONGLONG == 256
# define NPY_INT256 NPY_LONGLONG
# define NPY_UINT256 NPY_ULONGLONG
typedef npy_longlong npy_int256;
typedef npy_ulonglong npy_uint256;
# define PyInt256ScalarObject PyLongLongScalarObject
# define PyInt256ArrType_Type PyLongLongArrType_Type
# define PyUInt256ScalarObject PyULongLongScalarObject
# define PyUInt256ArrType_Type PyULongLongArrType_Type
#define NPY_INT256_FMT NPY_LONGLONG_FMT
#define NPY_UINT256_FMT NPY_ULONGLONG_FMT
# define NPY_MAX_LONGLONG NPY_MAX_INT256
# define NPY_MIN_LONGLONG NPY_MIN_INT256
# define NPY_MAX_ULONGLONG NPY_MAX_UINT256
#endif
#if NPY_BITSOF_INT == 8
#ifndef NPY_INT8
#define NPY_INT8 NPY_INT
#define NPY_UINT8 NPY_UINT
typedef int npy_int8;
typedef unsigned int npy_uint8;
# define PyInt8ScalarObject PyIntScalarObject
# define PyInt8ArrType_Type PyIntArrType_Type
# define PyUInt8ScalarObject PyUIntScalarObject
# define PyUInt8ArrType_Type PyUIntArrType_Type
#define NPY_INT8_FMT NPY_INT_FMT
#define NPY_UINT8_FMT NPY_UINT_FMT
#endif
#elif NPY_BITSOF_INT == 16
#ifndef NPY_INT16
#define NPY_INT16 NPY_INT
#define NPY_UINT16 NPY_UINT
typedef int npy_int16;
typedef unsigned int npy_uint16;
# define PyInt16ScalarObject PyIntScalarObject
# define PyInt16ArrType_Type PyIntArrType_Type
# define PyUInt16ScalarObject PyUIntScalarObject
# define PyUInt16ArrType_Type PyUIntArrType_Type
#define NPY_INT16_FMT NPY_INT_FMT
#define NPY_UINT16_FMT NPY_UINT_FMT
#endif
#elif NPY_BITSOF_INT == 32
#ifndef NPY_INT32
#define NPY_INT32 NPY_INT
#define NPY_UINT32 NPY_UINT
typedef int npy_int32;
typedef unsigned int npy_uint32;
typedef unsigned int npy_ucs4;
# define PyInt32ScalarObject PyIntScalarObject
# define PyInt32ArrType_Type PyIntArrType_Type
# define PyUInt32ScalarObject PyUIntScalarObject
# define PyUInt32ArrType_Type PyUIntArrType_Type
#define NPY_INT32_FMT NPY_INT_FMT
#define NPY_UINT32_FMT NPY_UINT_FMT
#endif
#elif NPY_BITSOF_INT == 64
#ifndef NPY_INT64
#define NPY_INT64 NPY_INT
#define NPY_UINT64 NPY_UINT
typedef int npy_int64;
typedef unsigned int npy_uint64;
# define PyInt64ScalarObject PyIntScalarObject
# define PyInt64ArrType_Type PyIntArrType_Type
# define PyUInt64ScalarObject PyUIntScalarObject
# define PyUInt64ArrType_Type PyUIntArrType_Type
#define NPY_INT64_FMT NPY_INT_FMT
#define NPY_UINT64_FMT NPY_UINT_FMT
# define MyPyLong_FromInt64 PyLong_FromLong
# define MyPyLong_AsInt64 PyLong_AsLong
#endif
#elif NPY_BITSOF_INT == 128
#ifndef NPY_INT128
#define NPY_INT128 NPY_INT
#define NPY_UINT128 NPY_UINT
typedef int npy_int128;
typedef unsigned int npy_uint128;
# define PyInt128ScalarObject PyIntScalarObject
# define PyInt128ArrType_Type PyIntArrType_Type
# define PyUInt128ScalarObject PyUIntScalarObject
# define PyUInt128ArrType_Type PyUIntArrType_Type
#define NPY_INT128_FMT NPY_INT_FMT
#define NPY_UINT128_FMT NPY_UINT_FMT
#endif
#endif
#if NPY_BITSOF_SHORT == 8
#ifndef NPY_INT8
#define NPY_INT8 NPY_SHORT
#define NPY_UINT8 NPY_USHORT
typedef short npy_int8;
typedef unsigned short npy_uint8;
# define PyInt8ScalarObject PyShortScalarObject
# define PyInt8ArrType_Type PyShortArrType_Type
# define PyUInt8ScalarObject PyUShortScalarObject
# define PyUInt8ArrType_Type PyUShortArrType_Type
#define NPY_INT8_FMT NPY_SHORT_FMT
#define NPY_UINT8_FMT NPY_USHORT_FMT
#endif
#elif NPY_BITSOF_SHORT == 16
#ifndef NPY_INT16
#define NPY_INT16 NPY_SHORT
#define NPY_UINT16 NPY_USHORT
typedef short npy_int16;
typedef unsigned short npy_uint16;
# define PyInt16ScalarObject PyShortScalarObject
# define PyInt16ArrType_Type PyShortArrType_Type
# define PyUInt16ScalarObject PyUShortScalarObject
# define PyUInt16ArrType_Type PyUShortArrType_Type
#define NPY_INT16_FMT NPY_SHORT_FMT
#define NPY_UINT16_FMT NPY_USHORT_FMT
#endif
#elif NPY_BITSOF_SHORT == 32
#ifndef NPY_INT32
#define NPY_INT32 NPY_SHORT
#define NPY_UINT32 NPY_USHORT
typedef short npy_int32;
typedef unsigned short npy_uint32;
typedef unsigned short npy_ucs4;
# define PyInt32ScalarObject PyShortScalarObject
# define PyInt32ArrType_Type PyShortArrType_Type
# define PyUInt32ScalarObject PyUShortScalarObject
# define PyUInt32ArrType_Type PyUShortArrType_Type
#define NPY_INT32_FMT NPY_SHORT_FMT
#define NPY_UINT32_FMT NPY_USHORT_FMT
#endif
#elif NPY_BITSOF_SHORT == 64
#ifndef NPY_INT64
#define NPY_INT64 NPY_SHORT
#define NPY_UINT64 NPY_USHORT
typedef short npy_int64;
typedef unsigned short npy_uint64;
# define PyInt64ScalarObject PyShortScalarObject
# define PyInt64ArrType_Type PyShortArrType_Type
# define PyUInt64ScalarObject PyUShortScalarObject
# define PyUInt64ArrType_Type PyUShortArrType_Type
#define NPY_INT64_FMT NPY_SHORT_FMT
#define NPY_UINT64_FMT NPY_USHORT_FMT
# define MyPyLong_FromInt64 PyLong_FromLong
# define MyPyLong_AsInt64 PyLong_AsLong
#endif
#elif NPY_BITSOF_SHORT == 128
#ifndef NPY_INT128
#define NPY_INT128 NPY_SHORT
#define NPY_UINT128 NPY_USHORT
typedef short npy_int128;
typedef unsigned short npy_uint128;
# define PyInt128ScalarObject PyShortScalarObject
# define PyInt128ArrType_Type PyShortArrType_Type
# define PyUInt128ScalarObject PyUShortScalarObject
# define PyUInt128ArrType_Type PyUShortArrType_Type
#define NPY_INT128_FMT NPY_SHORT_FMT
#define NPY_UINT128_FMT NPY_USHORT_FMT
#endif
#endif
#if NPY_BITSOF_CHAR == 8
#ifndef NPY_INT8
#define NPY_INT8 NPY_BYTE
#define NPY_UINT8 NPY_UBYTE
typedef signed char npy_int8;
typedef unsigned char npy_uint8;
# define PyInt8ScalarObject PyByteScalarObject
# define PyInt8ArrType_Type PyByteArrType_Type
# define PyUInt8ScalarObject PyUByteScalarObject
# define PyUInt8ArrType_Type PyUByteArrType_Type
#define NPY_INT8_FMT NPY_BYTE_FMT
#define NPY_UINT8_FMT NPY_UBYTE_FMT
#endif
#elif NPY_BITSOF_CHAR == 16
#ifndef NPY_INT16
#define NPY_INT16 NPY_BYTE
#define NPY_UINT16 NPY_UBYTE
typedef signed char npy_int16;
typedef unsigned char npy_uint16;
# define PyInt16ScalarObject PyByteScalarObject
# define PyInt16ArrType_Type PyByteArrType_Type
# define PyUInt16ScalarObject PyUByteScalarObject
# define PyUInt16ArrType_Type PyUByteArrType_Type
#define NPY_INT16_FMT NPY_BYTE_FMT
#define NPY_UINT16_FMT NPY_UBYTE_FMT
#endif
#elif NPY_BITSOF_CHAR == 32
#ifndef NPY_INT32
#define NPY_INT32 NPY_BYTE
#define NPY_UINT32 NPY_UBYTE
typedef signed char npy_int32;
typedef unsigned char npy_uint32;
typedef unsigned char npy_ucs4;
# define PyInt32ScalarObject PyByteScalarObject
# define PyInt32ArrType_Type PyByteArrType_Type
# define PyUInt32ScalarObject PyUByteScalarObject
# define PyUInt32ArrType_Type PyUByteArrType_Type
#define NPY_INT32_FMT NPY_BYTE_FMT
#define NPY_UINT32_FMT NPY_UBYTE_FMT
#endif
#elif NPY_BITSOF_CHAR == 64
#ifndef NPY_INT64
#define NPY_INT64 NPY_BYTE
#define NPY_UINT64 NPY_UBYTE
typedef signed char npy_int64;
typedef unsigned char npy_uint64;
# define PyInt64ScalarObject PyByteScalarObject
# define PyInt64ArrType_Type PyByteArrType_Type
# define PyUInt64ScalarObject PyUByteScalarObject
# define PyUInt64ArrType_Type PyUByteArrType_Type
#define NPY_INT64_FMT NPY_BYTE_FMT
#define NPY_UINT64_FMT NPY_UBYTE_FMT
# define MyPyLong_FromInt64 PyLong_FromLong
# define MyPyLong_AsInt64 PyLong_AsLong
#endif
#elif NPY_BITSOF_CHAR == 128
#ifndef NPY_INT128
#define NPY_INT128 NPY_BYTE
#define NPY_UINT128 NPY_UBYTE
typedef signed char npy_int128;
typedef unsigned char npy_uint128;
# define PyInt128ScalarObject PyByteScalarObject
# define PyInt128ArrType_Type PyByteArrType_Type
# define PyUInt128ScalarObject PyUByteScalarObject
# define PyUInt128ArrType_Type PyUByteArrType_Type
#define NPY_INT128_FMT NPY_BYTE_FMT
#define NPY_UINT128_FMT NPY_UBYTE_FMT
#endif
#endif
#if NPY_BITSOF_DOUBLE == 32
#ifndef NPY_FLOAT32
#define NPY_FLOAT32 NPY_DOUBLE
#define NPY_COMPLEX64 NPY_CDOUBLE
typedef double npy_float32;
typedef npy_cdouble npy_complex64;
# define PyFloat32ScalarObject PyDoubleScalarObject
# define PyComplex64ScalarObject PyCDoubleScalarObject
# define PyFloat32ArrType_Type PyDoubleArrType_Type
# define PyComplex64ArrType_Type PyCDoubleArrType_Type
#define NPY_FLOAT32_FMT NPY_DOUBLE_FMT
#define NPY_COMPLEX64_FMT NPY_CDOUBLE_FMT
#endif
#elif NPY_BITSOF_DOUBLE == 64
#ifndef NPY_FLOAT64
#define NPY_FLOAT64 NPY_DOUBLE
#define NPY_COMPLEX128 NPY_CDOUBLE
typedef double npy_float64;
typedef npy_cdouble npy_complex128;
# define PyFloat64ScalarObject PyDoubleScalarObject
# define PyComplex128ScalarObject PyCDoubleScalarObject
# define PyFloat64ArrType_Type PyDoubleArrType_Type
# define PyComplex128ArrType_Type PyCDoubleArrType_Type
#define NPY_FLOAT64_FMT NPY_DOUBLE_FMT
#define NPY_COMPLEX128_FMT NPY_CDOUBLE_FMT
#endif
#elif NPY_BITSOF_DOUBLE == 80
#ifndef NPY_FLOAT80
#define NPY_FLOAT80 NPY_DOUBLE
#define NPY_COMPLEX160 NPY_CDOUBLE
typedef double npy_float80;
typedef npy_cdouble npy_complex160;
# define PyFloat80ScalarObject PyDoubleScalarObject
# define PyComplex160ScalarObject PyCDoubleScalarObject
# define PyFloat80ArrType_Type PyDoubleArrType_Type
# define PyComplex160ArrType_Type PyCDoubleArrType_Type
#define NPY_FLOAT80_FMT NPY_DOUBLE_FMT
#define NPY_COMPLEX160_FMT NPY_CDOUBLE_FMT
#endif
#elif NPY_BITSOF_DOUBLE == 96
#ifndef NPY_FLOAT96
#define NPY_FLOAT96 NPY_DOUBLE
#define NPY_COMPLEX192 NPY_CDOUBLE
typedef double npy_float96;
typedef npy_cdouble npy_complex192;
# define PyFloat96ScalarObject PyDoubleScalarObject
# define PyComplex192ScalarObject PyCDoubleScalarObject
# define PyFloat96ArrType_Type PyDoubleArrType_Type
# define PyComplex192ArrType_Type PyCDoubleArrType_Type
#define NPY_FLOAT96_FMT NPY_DOUBLE_FMT
#define NPY_COMPLEX192_FMT NPY_CDOUBLE_FMT
#endif
#elif NPY_BITSOF_DOUBLE == 128
#ifndef NPY_FLOAT128
#define NPY_FLOAT128 NPY_DOUBLE
#define NPY_COMPLEX256 NPY_CDOUBLE
typedef double npy_float128;
typedef npy_cdouble npy_complex256;
# define PyFloat128ScalarObject PyDoubleScalarObject
# define PyComplex256ScalarObject PyCDoubleScalarObject
# define PyFloat128ArrType_Type PyDoubleArrType_Type
# define PyComplex256ArrType_Type PyCDoubleArrType_Type
#define NPY_FLOAT128_FMT NPY_DOUBLE_FMT
#define NPY_COMPLEX256_FMT NPY_CDOUBLE_FMT
#endif
#endif
#if NPY_BITSOF_FLOAT == 32
#ifndef NPY_FLOAT32
#define NPY_FLOAT32 NPY_FLOAT
#define NPY_COMPLEX64 NPY_CFLOAT
typedef float npy_float32;
typedef npy_cfloat npy_complex64;
# define PyFloat32ScalarObject PyFloatScalarObject
# define PyComplex64ScalarObject PyCFloatScalarObject
# define PyFloat32ArrType_Type PyFloatArrType_Type
# define PyComplex64ArrType_Type PyCFloatArrType_Type
#define NPY_FLOAT32_FMT NPY_FLOAT_FMT
#define NPY_COMPLEX64_FMT NPY_CFLOAT_FMT
#endif
#elif NPY_BITSOF_FLOAT == 64
#ifndef NPY_FLOAT64
#define NPY_FLOAT64 NPY_FLOAT
#define NPY_COMPLEX128 NPY_CFLOAT
typedef float npy_float64;
typedef npy_cfloat npy_complex128;
# define PyFloat64ScalarObject PyFloatScalarObject
# define PyComplex128ScalarObject PyCFloatScalarObject
# define PyFloat64ArrType_Type PyFloatArrType_Type
# define PyComplex128ArrType_Type PyCFloatArrType_Type
#define NPY_FLOAT64_FMT NPY_FLOAT_FMT
#define NPY_COMPLEX128_FMT NPY_CFLOAT_FMT
#endif
#elif NPY_BITSOF_FLOAT == 80
#ifndef NPY_FLOAT80
#define NPY_FLOAT80 NPY_FLOAT
#define NPY_COMPLEX160 NPY_CFLOAT
typedef float npy_float80;
typedef npy_cfloat npy_complex160;
# define PyFloat80ScalarObject PyFloatScalarObject
# define PyComplex160ScalarObject PyCFloatScalarObject
# define PyFloat80ArrType_Type PyFloatArrType_Type
# define PyComplex160ArrType_Type PyCFloatArrType_Type
#define NPY_FLOAT80_FMT NPY_FLOAT_FMT
#define NPY_COMPLEX160_FMT NPY_CFLOAT_FMT
#endif
#elif NPY_BITSOF_FLOAT == 96
#ifndef NPY_FLOAT96
#define NPY_FLOAT96 NPY_FLOAT
#define NPY_COMPLEX192 NPY_CFLOAT
typedef float npy_float96;
typedef npy_cfloat npy_complex192;
# define PyFloat96ScalarObject PyFloatScalarObject
# define PyComplex192ScalarObject PyCFloatScalarObject
# define PyFloat96ArrType_Type PyFloatArrType_Type
# define PyComplex192ArrType_Type PyCFloatArrType_Type
#define NPY_FLOAT96_FMT NPY_FLOAT_FMT
#define NPY_COMPLEX192_FMT NPY_CFLOAT_FMT
#endif
#elif NPY_BITSOF_FLOAT == 128
#ifndef NPY_FLOAT128
#define NPY_FLOAT128 NPY_FLOAT
#define NPY_COMPLEX256 NPY_CFLOAT
typedef float npy_float128;
typedef npy_cfloat npy_complex256;
# define PyFloat128ScalarObject PyFloatScalarObject
# define PyComplex256ScalarObject PyCFloatScalarObject
# define PyFloat128ArrType_Type PyFloatArrType_Type
# define PyComplex256ArrType_Type PyCFloatArrType_Type
#define NPY_FLOAT128_FMT NPY_FLOAT_FMT
#define NPY_COMPLEX256_FMT NPY_CFLOAT_FMT
#endif
#endif
/* half/float16 isn't a floating-point type in C */
#define NPY_FLOAT16 NPY_HALF
typedef npy_uint16 npy_half;
typedef npy_half npy_float16;
#if NPY_BITSOF_LONGDOUBLE == 32
#ifndef NPY_FLOAT32
#define NPY_FLOAT32 NPY_LONGDOUBLE
#define NPY_COMPLEX64 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float32;
typedef npy_clongdouble npy_complex64;
# define PyFloat32ScalarObject PyLongDoubleScalarObject
# define PyComplex64ScalarObject PyCLongDoubleScalarObject
# define PyFloat32ArrType_Type PyLongDoubleArrType_Type
# define PyComplex64ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT32_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX64_FMT NPY_CLONGDOUBLE_FMT
#endif
#elif NPY_BITSOF_LONGDOUBLE == 64
#ifndef NPY_FLOAT64
#define NPY_FLOAT64 NPY_LONGDOUBLE
#define NPY_COMPLEX128 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float64;
typedef npy_clongdouble npy_complex128;
# define PyFloat64ScalarObject PyLongDoubleScalarObject
# define PyComplex128ScalarObject PyCLongDoubleScalarObject
# define PyFloat64ArrType_Type PyLongDoubleArrType_Type
# define PyComplex128ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT64_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX128_FMT NPY_CLONGDOUBLE_FMT
#endif
#elif NPY_BITSOF_LONGDOUBLE == 80
#ifndef NPY_FLOAT80
#define NPY_FLOAT80 NPY_LONGDOUBLE
#define NPY_COMPLEX160 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float80;
typedef npy_clongdouble npy_complex160;
# define PyFloat80ScalarObject PyLongDoubleScalarObject
# define PyComplex160ScalarObject PyCLongDoubleScalarObject
# define PyFloat80ArrType_Type PyLongDoubleArrType_Type
# define PyComplex160ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT80_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX160_FMT NPY_CLONGDOUBLE_FMT
#endif
#elif NPY_BITSOF_LONGDOUBLE == 96
#ifndef NPY_FLOAT96
#define NPY_FLOAT96 NPY_LONGDOUBLE
#define NPY_COMPLEX192 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float96;
typedef npy_clongdouble npy_complex192;
# define PyFloat96ScalarObject PyLongDoubleScalarObject
# define PyComplex192ScalarObject PyCLongDoubleScalarObject
# define PyFloat96ArrType_Type PyLongDoubleArrType_Type
# define PyComplex192ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT96_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX192_FMT NPY_CLONGDOUBLE_FMT
#endif
#elif NPY_BITSOF_LONGDOUBLE == 128
#ifndef NPY_FLOAT128
#define NPY_FLOAT128 NPY_LONGDOUBLE
#define NPY_COMPLEX256 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float128;
typedef npy_clongdouble npy_complex256;
# define PyFloat128ScalarObject PyLongDoubleScalarObject
# define PyComplex256ScalarObject PyCLongDoubleScalarObject
# define PyFloat128ArrType_Type PyLongDoubleArrType_Type
# define PyComplex256ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT128_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX256_FMT NPY_CLONGDOUBLE_FMT
#endif
#elif NPY_BITSOF_LONGDOUBLE == 256
#define NPY_FLOAT256 NPY_LONGDOUBLE
#define NPY_COMPLEX512 NPY_CLONGDOUBLE
typedef npy_longdouble npy_float256;
typedef npy_clongdouble npy_complex512;
# define PyFloat256ScalarObject PyLongDoubleScalarObject
# define PyComplex512ScalarObject PyCLongDoubleScalarObject
# define PyFloat256ArrType_Type PyLongDoubleArrType_Type
# define PyComplex512ArrType_Type PyCLongDoubleArrType_Type
#define NPY_FLOAT256_FMT NPY_LONGDOUBLE_FMT
#define NPY_COMPLEX512_FMT NPY_CLONGDOUBLE_FMT
#endif
/* datetime typedefs */
typedef npy_int64 npy_timedelta;
typedef npy_int64 npy_datetime;
#define NPY_DATETIME_FMT NPY_INT64_FMT
#define NPY_TIMEDELTA_FMT NPY_INT64_FMT
/* End of typedefs for numarray style bit-width names */
#endif
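
The bit-width typedefs above all follow one pattern: find the first C type with the wanted width, then bind a typedef and a printf format string to it. A minimal sketch of that pattern for 32 bits is given here. The `demo_*` and `DEMO_*` names are hypothetical, and the sketch compares `<limits.h>` values instead of the configure-time `NPY_SIZEOF_*` macros the header relies on, so it approximates the idea rather than reproducing NumPy's own logic.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical stand-ins for NPY_INT32 / NPY_INT32_FMT. As in the header,
 * long is tried before int. */
#if LONG_MAX == 2147483647L
typedef long demo_int32;
#define DEMO_INT32_FMT "ld"
#elif INT_MAX == 2147483647
typedef int demo_int32;
#define DEMO_INT32_FMT "d"
#else
#error "this sketch found no 32-bit integer type"
#endif

#define DEMO_MAX_INT32 2147483647
#define DEMO_MIN_INT32 (-DEMO_MAX_INT32 - 1)

/* Format a demo_int32 into buf; the format macro is spliced into the
 * format string by string-literal concatenation. Returns what snprintf
 * returns. */
static int demo_format_int32(char *buf, size_t size, demo_int32 value)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%" DEMO_INT32_FMT, value);
}
```

Writing the minimum as `(-MAX - 1)`, as the header does for `NPY_MIN_INT32`, avoids spelling a literal that does not fit in the signed type.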


@ -1,109 +0,0 @@
/*
* This sets (target) CPU-specific macros:
* - Possible values:
* NPY_CPU_X86
* NPY_CPU_AMD64
* NPY_CPU_PPC
* NPY_CPU_PPC64
* NPY_CPU_SPARC
* NPY_CPU_S390
* NPY_CPU_IA64
* NPY_CPU_HPPA
* NPY_CPU_ALPHA
* NPY_CPU_ARMEL
* NPY_CPU_ARMEB
* NPY_CPU_SH_LE
* NPY_CPU_SH_BE
* NPY_CPU_MIPSEL
* NPY_CPU_MIPSEB
* NPY_CPU_AARCH64
*/
#ifndef _NPY_CPUARCH_H_
#define _NPY_CPUARCH_H_
#include "numpyconfig.h"
#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
/*
* __i386__ is defined by gcc and Intel compiler on Linux,
* _M_IX86 by VS compiler,
* i386 by Sun compilers on opensolaris at least
*/
#define NPY_CPU_X86
#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
/*
* both __x86_64__ and __amd64__ are defined by gcc
* __x86_64 defined by sun compiler on opensolaris at least
* _M_AMD64 defined by MS compiler
*/
#define NPY_CPU_AMD64
#elif defined(__ppc__) || defined(__powerpc__) || defined(_ARCH_PPC)
/*
* __ppc__ is defined by gcc, I remember having seen __powerpc__ once,
* but can't find it ATM
* _ARCH_PPC is used by at least gcc on AIX
*/
#define NPY_CPU_PPC
#elif defined(__ppc64__)
#define NPY_CPU_PPC64
#elif defined(__sparc__) || defined(__sparc)
/* __sparc__ is defined by gcc and Forte (e.g. Sun) compilers */
#define NPY_CPU_SPARC
#elif defined(__s390__)
#define NPY_CPU_S390
#elif defined(__ia64)
#define NPY_CPU_IA64
#elif defined(__hppa)
#define NPY_CPU_HPPA
#elif defined(__alpha__)
#define NPY_CPU_ALPHA
#elif defined(__arm__) && defined(__ARMEL__)
#define NPY_CPU_ARMEL
#elif defined(__arm__) && defined(__ARMEB__)
#define NPY_CPU_ARMEB
#elif defined(__sh__) && defined(__LITTLE_ENDIAN__)
#define NPY_CPU_SH_LE
#elif defined(__sh__) && defined(__BIG_ENDIAN__)
#define NPY_CPU_SH_BE
#elif defined(__MIPSEL__)
#define NPY_CPU_MIPSEL
#elif defined(__MIPSEB__)
#define NPY_CPU_MIPSEB
#elif defined(__aarch64__)
#define NPY_CPU_AARCH64
#else
#error Unknown CPU, please report this to numpy maintainers with \
information about your platform (OS, CPU and compiler)
#endif
/*
This "white-lists" the architectures that we know don't require
pointer alignment. We white-list, since the memcpy version will
work everywhere, whereas assignment will only work where pointer
dereferencing doesn't require alignment.
TODO: There may be more architectures we can white list.
*/
#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64)
#define NPY_COPY_PYOBJECT_PTR(dst, src) (*((PyObject **)(dst)) = *((PyObject **)(src)))
#else
#if NPY_SIZEOF_PY_INTPTR_T == 4
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
((char*)(dst))[0] = ((char*)(src))[0]; \
((char*)(dst))[1] = ((char*)(src))[1]; \
((char*)(dst))[2] = ((char*)(src))[2]; \
((char*)(dst))[3] = ((char*)(src))[3];
#elif NPY_SIZEOF_PY_INTPTR_T == 8
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
((char*)(dst))[0] = ((char*)(src))[0]; \
((char*)(dst))[1] = ((char*)(src))[1]; \
((char*)(dst))[2] = ((char*)(src))[2]; \
((char*)(dst))[3] = ((char*)(src))[3]; \
((char*)(dst))[4] = ((char*)(src))[4]; \
((char*)(dst))[5] = ((char*)(src))[5]; \
((char*)(dst))[6] = ((char*)(src))[6]; \
((char*)(dst))[7] = ((char*)(src))[7];
#else
#error Unknown architecture, please report this to numpy maintainers with \
information about your platform (OS, CPU and compiler)
#endif
#endif
#endif
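
The comment in this header notes that a memcpy-based copy works on every architecture, while direct pointer assignment needs aligned addresses. The fallback `NPY_COPY_PYOBJECT_PTR` definitions also expand to several statements, so they are not safe as the body of an unbraced `if`. The sketch below shows one possible memcpy-based alternative. `demo_copy_ptr` and `DEMO_COPY_PTR` are hypothetical names, not NumPy API, and the sketch assumes an object pointer has the same size as `void *`.

```c
#include <assert.h>
#include <string.h>

/* Copy one pointer-sized value between locations that may be unaligned.
 * memcpy makes no alignment assumption about either address. */
static void demo_copy_ptr(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(void *));
}

/* Macro form wrapped in do { } while (0) so it behaves as one statement. */
#define DEMO_COPY_PTR(dst, src) \
    do { memcpy((dst), (src), sizeof(void *)); } while (0)
```

Compilers commonly turn a fixed-size memcpy like this into a plain load and store where the target allows it, so the portable form need not cost more than the assignment.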


@ -1,129 +0,0 @@
#ifndef _NPY_DEPRECATED_API_H
#define _NPY_DEPRECATED_API_H
#if defined(_WIN32)
#define _WARN___STR2__(x) #x
#define _WARN___STR1__(x) _WARN___STR2__(x)
#define _WARN___LOC__ __FILE__ "(" _WARN___STR1__(__LINE__) ") : Warning Msg: "
#pragma message(_WARN___LOC__"Using deprecated NumPy API, disable it by " \
"#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
#elif defined(__GNUC__)
#warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
#endif
/* TODO: How to do this warning message for other compilers? */
/*
* This header exists to collect all dangerous/deprecated NumPy API.
*
* This is an attempt to remove bad API, the proliferation of macros,
* and namespace pollution currently produced by the NumPy headers.
*/
#if defined(NPY_NO_DEPRECATED_API)
#error Should never include npy_deprecated_api directly.
#endif
/* These array flags are deprecated as of NumPy 1.7 */
#define NPY_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
#define NPY_FORTRAN NPY_ARRAY_F_CONTIGUOUS
/*
* The consistent NPY_ARRAY_* names which don't pollute the NPY_*
* namespace were added in NumPy 1.7.
*
* These versions of the carray flags are deprecated, but
* probably should only be removed after two releases instead of one.
*/
#define NPY_C_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
#define NPY_F_CONTIGUOUS NPY_ARRAY_F_CONTIGUOUS
#define NPY_OWNDATA NPY_ARRAY_OWNDATA
#define NPY_FORCECAST NPY_ARRAY_FORCECAST
#define NPY_ENSURECOPY NPY_ARRAY_ENSURECOPY
#define NPY_ENSUREARRAY NPY_ARRAY_ENSUREARRAY
#define NPY_ELEMENTSTRIDES NPY_ARRAY_ELEMENTSTRIDES
#define NPY_ALIGNED NPY_ARRAY_ALIGNED
#define NPY_NOTSWAPPED NPY_ARRAY_NOTSWAPPED
#define NPY_WRITEABLE NPY_ARRAY_WRITEABLE
#define NPY_UPDATEIFCOPY NPY_ARRAY_UPDATEIFCOPY
#define NPY_BEHAVED NPY_ARRAY_BEHAVED
#define NPY_BEHAVED_NS NPY_ARRAY_BEHAVED_NS
#define NPY_CARRAY NPY_ARRAY_CARRAY
#define NPY_CARRAY_RO NPY_ARRAY_CARRAY_RO
#define NPY_FARRAY NPY_ARRAY_FARRAY
#define NPY_FARRAY_RO NPY_ARRAY_FARRAY_RO
#define NPY_DEFAULT NPY_ARRAY_DEFAULT
#define NPY_IN_ARRAY NPY_ARRAY_IN_ARRAY
#define NPY_OUT_ARRAY NPY_ARRAY_OUT_ARRAY
#define NPY_INOUT_ARRAY NPY_ARRAY_INOUT_ARRAY
#define NPY_IN_FARRAY NPY_ARRAY_IN_FARRAY
#define NPY_OUT_FARRAY NPY_ARRAY_OUT_FARRAY
#define NPY_INOUT_FARRAY NPY_ARRAY_INOUT_FARRAY
#define NPY_UPDATE_ALL NPY_ARRAY_UPDATE_ALL
/* This way of accessing the default type is deprecated as of NumPy 1.7 */
#define PyArray_DEFAULT NPY_DEFAULT_TYPE
/* These DATETIME bits aren't used internally */
#if PY_VERSION_HEX >= 0x03000000
#define PyDataType_GetDatetimeMetaData(descr) \
    ((descr->metadata == NULL) ? NULL : \
        ((PyArray_DatetimeMetaData *)(PyCapsule_GetPointer( \
                PyDict_GetItemString( \
                    descr->metadata, NPY_METADATA_DTSTR), NULL))))
#else
#define PyDataType_GetDatetimeMetaData(descr) \
    ((descr->metadata == NULL) ? NULL : \
        ((PyArray_DatetimeMetaData *)(PyCObject_AsVoidPtr( \
                PyDict_GetItemString(descr->metadata, NPY_METADATA_DTSTR)))))
#endif
/*
* Deprecated as of NumPy 1.7, this kind of shortcut doesn't
* belong in the public API.
*/
#define NPY_AO PyArrayObject
/*
* Deprecated as of NumPy 1.7, an all-lowercase macro doesn't
* belong in the public API.
*/
#define fortran fortran_
/*
* Deprecated as of NumPy 1.7, as it is a namespace-polluting
* macro.
*/
#define FORTRAN_IF PyArray_FORTRAN_IF
/* Deprecated as of NumPy 1.7, datetime64 uses c_metadata instead */
#define NPY_METADATA_DTSTR "__timeunit__"
/*
* Deprecated as of NumPy 1.7.
* The reasoning:
* - These are for datetime, but there's no datetime "namespace".
* - They just turn NPY_STR_<x> into "<x>", which is just
* making something simple be indirected.
*/
#define NPY_STR_Y "Y"
#define NPY_STR_M "M"
#define NPY_STR_W "W"
#define NPY_STR_D "D"
#define NPY_STR_h "h"
#define NPY_STR_m "m"
#define NPY_STR_s "s"
#define NPY_STR_ms "ms"
#define NPY_STR_us "us"
#define NPY_STR_ns "ns"
#define NPY_STR_ps "ps"
#define NPY_STR_fs "fs"
#define NPY_STR_as "as"
/*
* The macros in old_defines.h are Deprecated as of NumPy 1.7 and will be
* removed in the next major release.
*/
#include "old_defines.h"
#endif

View File

@ -1,46 +0,0 @@
#ifndef _NPY_ENDIAN_H_
#define _NPY_ENDIAN_H_
/*
* NPY_BYTE_ORDER is set to the same value as BYTE_ORDER set by glibc in
* endian.h
*/
#ifdef NPY_HAVE_ENDIAN_H
    /* Use endian.h if available */
    #include <endian.h>
    #define NPY_BYTE_ORDER __BYTE_ORDER
    #define NPY_LITTLE_ENDIAN __LITTLE_ENDIAN
    #define NPY_BIG_ENDIAN __BIG_ENDIAN
#else
    /* Set endianness info using target CPU */
    #include "npy_cpu.h"
    #define NPY_LITTLE_ENDIAN 1234
    #define NPY_BIG_ENDIAN 4321
    #if defined(NPY_CPU_X86) \
            || defined(NPY_CPU_AMD64) \
            || defined(NPY_CPU_IA64) \
            || defined(NPY_CPU_ALPHA) \
            || defined(NPY_CPU_ARMEL) \
            || defined(NPY_CPU_AARCH64) \
            || defined(NPY_CPU_SH_LE) \
            || defined(NPY_CPU_MIPSEL)
        #define NPY_BYTE_ORDER NPY_LITTLE_ENDIAN
    #elif defined(NPY_CPU_PPC) \
            || defined(NPY_CPU_SPARC) \
            || defined(NPY_CPU_S390) \
            || defined(NPY_CPU_HPPA) \
            || defined(NPY_CPU_PPC64) \
            || defined(NPY_CPU_ARMEB) \
            || defined(NPY_CPU_SH_BE) \
            || defined(NPY_CPU_MIPSEB)
        #define NPY_BYTE_ORDER NPY_BIG_ENDIAN
    #else
        #error Unknown CPU: can not set endianness
    #endif
#endif
#endif
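
The header above fixes the byte order at compile time, either from glibc's `endian.h` or from the target CPU macros. As a rough illustration only (it is not part of the deleted file), the same 1234/4321 convention can be checked at run time by inspecting the first byte of a known integer. The `DEMO_*` names and `demo_runtime_byte_order` are hypothetical and chosen here so they do not collide with the `NPY_*` macros.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the fallback values used in the header. */
#define DEMO_LITTLE_ENDIAN 1234
#define DEMO_BIG_ENDIAN 4321

/*
 * Detect the host byte order at run time.
 * Returns DEMO_LITTLE_ENDIAN, DEMO_BIG_ENDIAN, or 0 for any other layout.
 */
static int demo_runtime_byte_order(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* memcpy avoids aliasing problems when reading the object representation. */
    memcpy(bytes, &probe, sizeof probe);

    if (bytes[0] == 0x04) {
        return DEMO_LITTLE_ENDIAN; /* least significant byte stored first */
    }
    if (bytes[0] == 0x01) {
        return DEMO_BIG_ENDIAN;    /* most significant byte stored first */
    }
    return 0;                      /* mixed or unknown byte order */
}
```

On a build where the compile-time macros are available, one would expect this function to agree with `NPY_BYTE_ORDER`; the run-time probe is only a sanity check, since preprocessor conditionals need the compile-time constant.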

Some files were not shown because too many files have changed in this diff