mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Merge branch 'develop' into master-tmp
commit 59deeb7da6
|
@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test sdist"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.tar.gz"
  - wait
  - trigger: "spacy-sdist-against-models"
    label: ":dizzy: :hammer:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
|
@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test wheel"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.whl"
  - wait
  - trigger: "spacy-train-from-wheel"
    label: ":dizzy: :train:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
|
106
.github/contributors/tiangolo.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [ ] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Sebastián Ramírez    |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 2020-07-01           |
| GitHub username                | tiangolo             |
| Website (optional)             |                      |
|
8
.gitignore
vendored
|
@ -18,8 +18,7 @@ website/.npm
|
|||
website/logs
|
||||
*.log
|
||||
npm-debug.log*
|
||||
website/www/
|
||||
website/_deploy.sh
|
||||
quickstart-training-generator.js
|
||||
|
||||
# Cython / C extensions
|
||||
cythonize.json
|
||||
|
@ -44,12 +43,14 @@ __pycache__/
|
|||
.env*
|
||||
.~env/
|
||||
.venv
|
||||
env3.6/
|
||||
venv/
|
||||
env3.*/
|
||||
.dev
|
||||
.denv
|
||||
.pypyenv
|
||||
.pytest_cache/
|
||||
.mypy_cache/
|
||||
|
||||
# Distribution / packaging
|
||||
env/
|
||||
|
@ -119,3 +120,6 @@ Desktop.ini
|
|||
|
||||
# Pycharm project files
|
||||
*.idea
|
||||
|
||||
# IPython
|
||||
.ipynb_checkpoints/
|
||||
|
|
23
.travis.yml
|
@ -1,23 +0,0 @@
language: python
sudo: false
cache: pip
dist: trusty
group: edge
python:
  - "2.7"
os:
  - linux
install:
  - "pip install -r requirements.txt"
  - "python setup.py build_ext --inplace"
  - "pip install -e ."
script:
  - "cat /proc/cpuinfo | grep flags | head -n 1"
  - "python -m pytest --tb=native spacy"
branches:
  except:
    - spacy.io
notifications:
  slack:
    secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ=
  email: false
|
188
CONTRIBUTING.md
|
@ -5,7 +5,7 @@
|
|||
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
|
||||
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
|
||||
and we'll do our best to help you get started. This page will give you a quick
|
||||
overview of how things are organised and most importantly, how to get involved.
|
||||
overview of how things are organized and most importantly, how to get involved.
|
||||
|
||||
## Table of contents
|
||||
|
||||
|
@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
|
|||
opening an issue to report the bug, simply refer to your pull request in the
|
||||
issue body. A few more tips:
|
||||
|
||||
- **Describing your issue:** Try to provide as many details as possible. What
|
||||
exactly goes wrong? _How_ is it failing? Is there an error?
|
||||
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
|
||||
remember to include the code you ran and if possible, extract only the relevant
|
||||
parts and don't just dump your entire script. This will make it easier for us to
|
||||
reproduce the error.
|
||||
- **Describing your issue:** Try to provide as many details as possible. What
|
||||
exactly goes wrong? _How_ is it failing? Is there an error?
|
||||
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
|
||||
remember to include the code you ran and if possible, extract only the relevant
|
||||
parts and don't just dump your entire script. This will make it easier for us to
|
||||
reproduce the error.
|
||||
|
||||
- **Getting info about your spaCy installation and environment:** If you're
|
||||
using spaCy v1.7+, you can use the command line interface to print details and
|
||||
even format them as Markdown to copy-paste into GitHub issues:
|
||||
`python -m spacy info --markdown`.
|
||||
- **Getting info about your spaCy installation and environment:** If you're
|
||||
using spaCy v1.7+, you can use the command line interface to print details and
|
||||
even format them as Markdown to copy-paste into GitHub issues:
|
||||
`python -m spacy info --markdown`.
|
||||
|
||||
- **Checking the model compatibility:** If you're having problems with a
|
||||
[statistical model](https://spacy.io/models), it may be because the
|
||||
model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
|
||||
this on the command line by running `python -m spacy validate`.
|
||||
- **Checking the model compatibility:** If you're having problems with a
|
||||
[statistical model](https://spacy.io/models), it may be because the
|
||||
model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
|
||||
this on the command line by running `python -m spacy validate`.
|
||||
|
||||
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
|
||||
comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
|
||||
you can run from within your script or a Jupyter notebook. For some issues, it's
|
||||
helpful to **include a screenshot** of the visualization. You can simply drag and
|
||||
drop the image into GitHub's editor and it will be uploaded and included.
|
||||
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
|
||||
comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
|
||||
you can run from within your script or a Jupyter notebook. For some issues, it's
|
||||
helpful to **include a screenshot** of the visualization. You can simply drag and
|
||||
drop the image into GitHub's editor and it will be uploaded and included.
|
||||
|
||||
- **Sharing long blocks of code or logs:** If you need to include long code,
|
||||
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
|
||||
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
|
||||
so it only becomes visible on click, making the issue easier to read and follow.
|
||||
- **Sharing long blocks of code or logs:** If you need to include long code,
|
||||
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
|
||||
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
|
||||
so it only becomes visible on click, making the issue easier to read and follow.
|
||||
|
||||
### Issue labels
|
||||
|
||||
|
@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our
|
|||
philosophy is to prefer a smaller core library. We generally ask the following
|
||||
questions:
|
||||
|
||||
- **What would this feature look like if implemented in a separate package?**
|
||||
Some features would be very difficult to implement externally – for example,
|
||||
changes to spaCy's built-in methods. In contrast, a library of word
|
||||
alignment functions could easily live as a separate package that depended on
|
||||
spaCy — there's little difference between writing `import word_aligner` and
|
||||
`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
|
||||
[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
|
||||
and add your own attributes, properties and methods to the `Doc`, `Token` and
|
||||
`Span`. If you're looking to implement a new spaCy feature, starting with a
|
||||
custom component package is usually the best strategy. You won't have to worry
|
||||
about spaCy's internals and you can test your module in an isolated
|
||||
environment. And if it works well, we can always integrate it into the core
|
||||
library later.
|
||||
- **What would this feature look like if implemented in a separate package?**
|
||||
Some features would be very difficult to implement externally – for example,
|
||||
changes to spaCy's built-in methods. In contrast, a library of word
|
||||
alignment functions could easily live as a separate package that depended on
|
||||
spaCy — there's little difference between writing `import word_aligner` and
|
||||
`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
|
||||
[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
|
||||
and add your own attributes, properties and methods to the `Doc`, `Token` and
|
||||
`Span`. If you're looking to implement a new spaCy feature, starting with a
|
||||
custom component package is usually the best strategy. You won't have to worry
|
||||
about spaCy's internals and you can test your module in an isolated
|
||||
environment. And if it works well, we can always integrate it into the core
|
||||
library later.
|
||||
|
||||
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
|
||||
Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
|
||||
TensorFlow/Keras do lots of useful things — but we don't want to have them as
|
||||
dependencies. If the feature requires functionality in one of these libraries,
|
||||
it's probably better to break it out into a different package.
|
||||
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
|
||||
Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
|
||||
TensorFlow/Keras do lots of useful things — but we don't want to have them as
|
||||
dependencies. If the feature requires functionality in one of these libraries,
|
||||
it's probably better to break it out into a different package.
|
||||
|
||||
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
|
||||
spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
|
||||
As better techniques are developed, we prefer to drop support for "the old way".
|
||||
However, it's rare that one approach _entirely_ dominates another. It's very
|
||||
common that there's still a use-case for the "obsolete" approach. For instance,
|
||||
[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
|
||||
vectors are better for most use-cases, and the two approaches to lexical
|
||||
semantics do a lot of the same things. spaCy therefore only supports word
|
||||
vectors, and support for WordNet is currently left for other packages.
|
||||
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
|
||||
spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
|
||||
As better techniques are developed, we prefer to drop support for "the old way".
|
||||
However, it's rare that one approach _entirely_ dominates another. It's very
|
||||
common that there's still a use-case for the "obsolete" approach. For instance,
|
||||
[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
|
||||
vectors are better for most use-cases, and the two approaches to lexical
|
||||
semantics do a lot of the same things. spaCy therefore only supports word
|
||||
vectors, and support for WordNet is currently left for other packages.
|
||||
|
||||
- **Do you need the feature to get basic things done?** We do want spaCy to be
|
||||
at least somewhat self-contained. If we keep needing some feature in our
|
||||
recipes, that does provide some argument for bringing it "in house".
|
||||
- **Do you need the feature to get basic things done?** We do want spaCy to be
|
||||
at least somewhat self-contained. If we keep needing some feature in our
|
||||
recipes, that does provide some argument for bringing it "in house".
|
||||
|
||||
### Getting started
|
||||
|
||||
|
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
|
|||
### Code formatting
|
||||
|
||||
[`black`](https://github.com/ambv/black) is an opinionated Python code
|
||||
formatter, optimised to produce readable code and small diffs. You can run
|
||||
formatter, optimized to produce readable code and small diffs. You can run
|
||||
`black` from the command-line, or via your code editor. For example, if you're
|
||||
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
|
||||
following to your `settings.json` to use `black` for formatting and auto-format
|
||||
|
@ -203,10 +203,10 @@ your files on save:
|
|||
|
||||
```json
|
||||
{
|
||||
"python.formatting.provider": "black",
|
||||
"[python]": {
|
||||
"editor.formatOnSave": true
|
||||
}
|
||||
"python.formatting.provider": "black",
|
||||
"[python]": {
|
||||
"editor.formatOnSave": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
@ -216,7 +216,7 @@ list of available editor integrations.
|
|||
#### Disabling formatting
|
||||
|
||||
There are a few cases where auto-formatting doesn't improve readability – for
|
||||
example, in some of the the language data files like the `tag_map.py`, or in
|
||||
example, in some of the language data files like the `tag_map.py`, or in
|
||||
the tests that construct `Doc` objects from lists of words and other labels.
|
||||
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
|
||||
for that particular code. Here's an example:
|
||||
|
@ -224,7 +224,7 @@ for that particular code. Here's an example:
|
|||
```python
|
||||
# fmt: off
|
||||
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
|
||||
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
|
||||
heads = [1, 1, 1, 1, 3, 4, 1, 6, 11, 11, 11, 11, 14, 14, 11, 16, 17, 14, 11]
|
||||
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
|
||||
"nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
|
||||
"poss", "nsubj", "ccomp", "punct"]
|
||||
|
@ -280,29 +280,13 @@ except: # noqa: E722
|
|||
|
||||
### Python conventions
|
||||
|
||||
All Python code must be written in an **intersection of Python 2 and Python 3**.
|
||||
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
|
||||
Python or platform compatibility should only live in
|
||||
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
|
||||
functions, replacement functions are suffixed with an underscore, for example
|
||||
`unicode_`. If you need to access the user's version or platform information,
|
||||
for example to show more specific error messages, you can use the `is_config()`
|
||||
helper function.
|
||||
|
||||
```python
|
||||
from .compat import unicode_, is_config
|
||||
|
||||
compatible_unicode = unicode_('hello world')
|
||||
if is_config(windows=True, python2=True):
|
||||
print("You are using Python 2 on Windows.")
|
||||
```
|
||||
|
||||
All Python code must be written **compatible with Python 3.6+**.
|
||||
Code that interacts with the file-system should accept objects that follow the
|
||||
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
|
||||
If the function is user-facing and takes a path as an argument, it should check
|
||||
whether the path is provided as a string. Strings should be converted to
|
||||
`pathlib.Path` objects. Serialization and deserialization functions should always
|
||||
accept **file-like objects**, as it makes the library io-agnostic. Working on
|
||||
accept **file-like objects**, as it makes the library IO-agnostic. Working on
|
||||
buffers makes the code more general, easier to test, and compatible with Python
|
||||
3's asynchronous IO.
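
The path and file-object conventions above could look roughly like the following. This is a minimal sketch using only the standard library, not spaCy's own code: the names `ensure_path`, `write_json`, `read_json` and `save_json` are chosen here for illustration and are assumptions, as is the use of JSON as the serialization format.

```python
import io
import json
from pathlib import Path


def ensure_path(path):
    """Convert a string to a pathlib.Path; return any other object unchanged.

    Non-string objects are passed through so that anything following the
    pathlib.Path API is accepted, without requiring it to inherit from Path.
    """
    if isinstance(path, str):
        return Path(path)
    return path


def write_json(data, file_):
    """Serialize data to an already-open, text-mode file-like object."""
    file_.write(json.dumps(data))


def read_json(file_):
    """Deserialize data from an already-open, text-mode file-like object."""
    return json.loads(file_.read())


def save_json(data, path):
    """User-facing wrapper: accepts a string or a Path-like object."""
    path = ensure_path(path)
    # Only the .open() method is used, so duck-typed path objects work too.
    with path.open("w", encoding="utf8") as file_:
        write_json(data, file_)


# Because the core functions work on file-like objects, they can be
# exercised with an in-memory buffer instead of touching the disk.
buffer = io.StringIO()
write_json({"lang": "en"}, buffer)
buffer.seek(0)
roundtrip = read_json(buffer)
```

Keeping the string-to-path conversion in the user-facing wrapper means the serialization functions themselves never need to know where the bytes end up, which is what makes them straightforward to test with buffers.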
|
||||
|
||||
|
@ -400,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
|
|||
many "traps for new players". Working in Cython is very rewarding once you're
|
||||
over the initial learning curve. As with C and C++, the first way you write
|
||||
something in Cython will often be the performance-optimal approach. In contrast,
|
||||
Python optimisation generally requires a lot of experimentation. Is it faster to
|
||||
Python optimization generally requires a lot of experimentation. Is it faster to
|
||||
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
|
||||
Does this numpy operation create a copy? There's no way to guess the answers to
|
||||
these questions, and you'll usually be dissatisfied with your results — so
|
||||
|
@ -413,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython.
|
|||
|
||||
### Resources to get you started
|
||||
|
||||
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
|
||||
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
|
||||
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
|
||||
- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
|
||||
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
|
||||
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
|
||||
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
|
||||
- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
|
||||
|
||||
## Adding tests
|
||||
|
||||
|
@ -428,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
|
|||
all test files and test functions need to be prefixed with `test_`.
|
||||
|
||||
When adding tests, make sure to use descriptive names, keep the code short and
|
||||
concise and only test for one behaviour at a time. Try to `parametrize` test
|
||||
concise and only test for one behavior at a time. Try to `parametrize` test
|
||||
cases wherever possible, use our pre-defined fixtures for spaCy components and
|
||||
avoid unnecessary imports.
|
||||
|
||||
|
@ -437,7 +421,7 @@ Tests that require the model to be loaded should be marked with
|
|||
`@pytest.mark.models`. Loading the models is expensive and not necessary if
|
||||
you're not actually testing the model performance. If all you need is a `Doc`
|
||||
object with annotations like heads, POS tags or the dependency parse, you can
|
||||
use the `get_doc()` utility function to construct it manually.
|
||||
use the `Doc` constructor to construct it manually.
|
||||
|
||||
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
|
||||
|
||||
|
@ -456,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page.
|
|||
We're very excited about all the new possibilities for **community extensions**
|
||||
and plugins in spaCy v2.0, and we can't wait to see what you build with it!
|
||||
|
||||
- An extension or plugin should add substantial functionality, be
|
||||
**well-documented** and **open-source**. It should be available for users to download
|
||||
and install as a Python package – for example via [PyPi](http://pypi.python.org).
|
||||
- An extension or plugin should add substantial functionality, be
|
||||
**well-documented** and **open-source**. It should be available for users to download
|
||||
and install as a Python package – for example via [PyPi](http://pypi.python.org).
|
||||
|
||||
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
|
||||
as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
|
||||
that users can **add to their processing pipeline** using `nlp.add_pipe()`.
|
||||
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
|
||||
as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
|
||||
that users can **add to their processing pipeline** using `nlp.add_pipe()`.
|
||||
|
||||
- When publishing your extension on GitHub, **tag it** with the topics
|
||||
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
|
||||
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
|
||||
to make it easier to find. Those are also the topics we're linking to from the
|
||||
spaCy website. If you're sharing your project on Twitter, feel free to tag
|
||||
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
|
||||
- When publishing your extension on GitHub, **tag it** with the topics
|
||||
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
|
||||
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
|
||||
to make it easier to find. Those are also the topics we're linking to from the
|
||||
spaCy website. If you're sharing your project on Twitter, feel free to tag
|
||||
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
|
||||
|
||||
- Once your extension is published, you can open an issue on the
|
||||
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
|
||||
[resources directory](https://spacy.io/usage/resources#extensions) on the
|
||||
website.
|
||||
- Once your extension is published, you can open an issue on the
|
||||
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
|
||||
[resources directory](https://spacy.io/usage/resources#extensions) on the
|
||||
website.
|
||||
|
||||
📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**
|
||||
|
||||
|
|
|
@ -1,9 +1,9 @@
|
|||
recursive-include include *.h
|
||||
recursive-include spacy *.txt *.pyx *.pxd
|
||||
recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
|
||||
include LICENSE
|
||||
include README.md
|
||||
include bin/spacy
|
||||
include pyproject.toml
|
||||
recursive-exclude spacy/lang *.json
|
||||
recursive-include spacy/lang *.json.gz
|
||||
recursive-include spacy/cli *.json *.yml
|
||||
recursive-include licenses *
|
||||
|
|
48
Makefile
|
@ -1,29 +1,55 @@
|
|||
SHELL := /bin/bash
|
||||
PYVER := 3.6
|
||||
|
||||
ifndef SPACY_EXTRAS
|
||||
override SPACY_EXTRAS = spacy-lookups-data==1.0.0rc0 jieba pkuseg==0.0.25 pickle5 sudachipy sudachidict_core
|
||||
endif
|
||||
|
||||
ifndef PYVER
|
||||
override PYVER = 3.6
|
||||
endif
|
||||
|
||||
VENV := ./env$(PYVER)
|
||||
|
||||
version := $(shell "bin/get-version.sh")
|
||||
package := $(shell "bin/get-package.sh")
|
||||
|
||||
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
|
||||
ifndef SPACY_BIN
|
||||
override SPACY_BIN = $(package)-$(version).pex
|
||||
endif
|
||||
|
||||
ifndef WHEELHOUSE
|
||||
override WHEELHOUSE = "./wheelhouse"
|
||||
endif
|
||||
|
||||
|
||||
dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp
|
||||
$(VENV)/bin/pex \
|
||||
-f $(WHEELHOUSE) \
|
||||
--no-index \
|
||||
--disable-cache \
|
||||
-o $@ \
|
||||
$(package)==$(version) \
|
||||
$(SPACY_EXTRAS)
|
||||
chmod a+rx $@
|
||||
cp $@ dist/spacy.pex
|
||||
|
||||
dist/pytest.pex : wheelhouse/pytest-*.whl
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
|
||||
dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl
|
||||
$(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
|
||||
chmod a+rx $@
|
||||
|
||||
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
||||
$(VENV)/bin/pip wheel jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
|
||||
$(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||
$(VENV)/bin/pip wheel . -w $(WHEELHOUSE)
|
||||
$(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE)
|
||||
|
||||
touch $@
|
||||
|
||||
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
||||
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
|
||||
$(WHEELHOUSE)/pytest-%.whl : $(VENV)/bin/pex
|
||||
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w $(WHEELHOUSE)
|
||||
|
||||
$(VENV)/bin/pex :
|
||||
python$(PYVER) -m venv $(VENV)
|
||||
$(VENV)/bin/pip install -U pip setuptools pex wheel
|
||||
$(VENV)/bin/pip install numpy
|
||||
|
||||
.PHONY : clean test
|
||||
|
||||
|
@ -33,6 +59,6 @@ test : dist/spacy-$(version).pex dist/pytest.pex
|
|||
|
||||
clean : setup.py
|
||||
rm -rf dist/*
|
||||
rm -rf ./wheelhouse
|
||||
rm -rf $(WHEELHOUSE)/*
|
||||
rm -rf $(VENV)
|
||||
python setup.py clean --all
|
||||
|
|
105
README.md
|
@ -4,18 +4,19 @@
|
|||
|
||||
spaCy is a library for advanced Natural Language Processing in Python and
|
||||
Cython. It's built on the very latest research, and was designed from day one to
|
||||
be used in real products. spaCy comes with
|
||||
[pretrained statistical models](https://spacy.io/models) and word vectors, and
|
||||
be used in real products.
|
||||
|
||||
spaCy comes with
|
||||
[pretrained pipelines](https://spacy.io/models) and vectors, and
|
||||
currently supports tokenization for **60+ languages**. It features
|
||||
state-of-the-art speed, convolutional **neural network models** for tagging,
|
||||
parsing and **named entity recognition** and easy **deep learning** integration.
|
||||
It's commercial open-source software, released under the MIT license.
|
||||
parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
|
||||
spaCy is commercial open-source software, released under the MIT license.
|
||||
|
||||
💫 **Version 2.3 out now!**
|
||||
💫 **Version 3.0 out now!**
|
||||
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
|
||||
|
||||
[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
|
||||
[![Travis Build Status](<https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis-ci&logoColor=white&label=build+(2.7)>)](https://travis-ci.org/explosion/spaCy)
|
||||
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
|
||||
[![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
|
||||
[![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
|
||||
[![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
|
||||
|
@ -28,64 +29,60 @@ It's commercial open-source software, released under the MIT license.
|
|||
|
||||
## 📖 Documentation
|
||||
|
||||
| Documentation | |
|
||||
| --------------- | -------------------------------------------------------------- |
|
||||
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
|
||||
| [Usage Guides] | How to use spaCy and its features. |
|
||||
| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
|
||||
| [API Reference] | The detailed reference for spaCy's API. |
|
||||
| [Models] | Download statistical language models for spaCy. |
|
||||
| [Universe] | Libraries, extensions, demos, books and courses. |
|
||||
| [Changelog] | Changes and version history. |
|
||||
| [Contribute] | How to contribute to the spaCy project and code base. |
|
||||
| Documentation | |
|
||||
| ------------------- | -------------------------------------------------------------- |
|
||||
| [spaCy 101] | New to spaCy? Here's everything you need to know! |
|
||||
| [Usage Guides] | How to use spaCy and its features. |
|
||||
| [New in v3.0] | New features, backwards incompatibilities and migration guide. |
|
||||
| [Project Templates] | End-to-end workflows you can clone, modify and run. |
|
||||
| [API Reference] | The detailed reference for spaCy's API. |
|
||||
| [Models] | Download statistical language models for spaCy. |
|
||||
| [Universe] | Libraries, extensions, demos, books and courses. |
|
||||
| [Changelog] | Changes and version history. |
|
||||
| [Contribute] | How to contribute to the spaCy project and code base. |
|
||||
|
||||
[spacy 101]: https://spacy.io/usage/spacy-101
|
||||
[new in v2.3]: https://spacy.io/usage/v2-3
|
||||
[new in v3.0]: https://spacy.io/usage/v3
|
||||
[usage guides]: https://spacy.io/usage/
|
||||
[api reference]: https://spacy.io/api/
|
||||
[models]: https://spacy.io/models
|
||||
[universe]: https://spacy.io/universe
|
||||
[project templates]: https://github.com/explosion/projects
|
||||
[changelog]: https://spacy.io/usage#changelog
|
||||
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
|
||||
|
||||
## 💬 Where to ask questions
|
||||
|
||||
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
|
||||
[@ines](https://github.com/ines), along with core contributors
|
||||
[@svlandeg](https://github.com/svlandeg) and
|
||||
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
|
||||
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
|
||||
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
|
||||
be able to provide individual support via email. We also believe that help is
|
||||
much more valuable if it's shared publicly, so that more people can benefit from
|
||||
it.
|
||||
|
||||
| Type | Platforms |
|
||||
| ------------------------ | ------------------------------------------------------ |
|
||||
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
|
||||
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
|
||||
| 👩💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
|
||||
| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
|
||||
| Type | Platforms |
|
||||
| ----------------------- | ---------------------- |
|
||||
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
|
||||
| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
|
||||
| 👩💻 **Usage Questions** | [Stack Overflow] |
|
||||
|
||||
[github issue tracker]: https://github.com/explosion/spaCy/issues
|
||||
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
|
||||
[gitter chat]: https://gitter.im/explosion/spaCy
|
||||
[reddit user group]: https://www.reddit.com/r/spacynlp
|
||||
|
||||
## Features
|
||||
|
||||
- Non-destructive **tokenization**
|
||||
- **Named entity** recognition
|
||||
- Support for **50+ languages**
|
||||
- pretrained [statistical models](https://spacy.io/models) and word vectors
|
||||
- Support for **60+ languages**
|
||||
- **Trained pipelines**
|
||||
- Multi-task learning with pretrained **transformers** like BERT
|
||||
- Pretrained **word vectors**
|
||||
- State-of-the-art speed
|
||||
- Easy **deep learning** integration
|
||||
- Part-of-speech tagging
|
||||
- Labelled dependency parsing
|
||||
- Syntax-driven sentence segmentation
|
||||
- Production-ready **training system**
|
||||
- Linguistically-motivated **tokenization**
|
||||
- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
|
||||
- Easily extensible with **custom components** and attributes
|
||||
- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
|
||||
- Built in **visualizers** for syntax and NER
|
||||
- Convenient string-to-hash mapping
|
||||
- Export to numpy data arrays
|
||||
- Efficient binary serialization
|
||||
- Easy **model packaging** and deployment
|
||||
- Easy **model packaging**, deployment and workflow management
|
||||
- Robust, rigorously evaluated accuracy
|
||||
|
||||
📖 **For more details, see the
|
||||
|
@ -98,7 +95,7 @@ For detailed installation instructions, see the
|
|||
|
||||
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
|
||||
Studio)
|
||||
- **Python version**: Python 2.7, 3.5+ (only 64 bit)
|
||||
- **Python version**: Python 3.6+ (only 64 bit)
|
||||
- **Package managers**: [pip] · [conda] (via `conda-forge`)
|
||||
|
||||
[pip]: https://pypi.org/project/spacy/
|
||||
|
@ -159,26 +156,26 @@ If you've trained your own models, keep in mind that your training and runtime
|
|||
inputs must match. After updating spaCy, we recommend **retraining your models**
|
||||
with the new version.
|
||||
|
||||
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
|
||||
[migration guide](https://spacy.io/usage/v2#migrating).**
|
||||
📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the
|
||||
[migration guide](https://spacy.io/usage/v3#migrating).**
|
||||
|
||||
## Download models
|
||||
|
||||
As of v1.7.0, models for spaCy can be installed as **Python packages**. This
|
||||
Trained pipelines for spaCy can be installed as **Python packages**. This
|
||||
means that they're a component of your application, just like any other module.
|
||||
Models can be installed using spaCy's `download` command, or manually by
|
||||
pointing pip to a path or URL.
|
||||
|
||||
| Documentation | |
|
||||
| ---------------------- | ------------------------------------------------------------- |
|
||||
| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
|
||||
| [Models Documentation] | Detailed usage instructions. |
|
||||
| Documentation | |
|
||||
| ---------------------- | ---------------------------------------------------------------- |
|
||||
| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. |
|
||||
| [Models Documentation] | Detailed usage instructions. |
|
||||
|
||||
[available models]: https://spacy.io/models
|
||||
[available pipelines]: https://spacy.io/models
|
||||
[models documentation]: https://spacy.io/docs/usage/models
|
||||
|
||||
```bash
|
||||
# download best-matching version of specific model for your spaCy installation
|
||||
# Download best-matching version of specific model for your spaCy installation
|
||||
python -m spacy download en_core_web_sm
|
||||
|
||||
# pip install .tar.gz archive from path or URL
|
||||
|
@ -188,7 +185,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_
|
|||
|
||||
### Loading and using models
|
||||
|
||||
To load a model, use `spacy.load()` with the model name, a shortcut link or a
|
||||
To load a model, use `spacy.load()` with the model name or a
|
||||
path to the model data directory.
|
||||
|
||||
```python
|
||||
|
@ -263,9 +260,7 @@ and git preinstalled.
|
|||
Install a version of the
|
||||
[Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
|
||||
or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that
|
||||
matches the version that was used to compile your Python interpreter. For
|
||||
official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and
|
||||
VS 2015 (Python 3.5).
|
||||
matches the version that was used to compile your Python interpreter.
|
||||
|
||||
## Run tests
|
||||
|
||||
|
|
|
@ -27,7 +27,7 @@ jobs:
|
|||
inputs:
|
||||
versionSpec: '3.7'
|
||||
- script: |
|
||||
pip install flake8
|
||||
pip install flake8==3.5.0
|
||||
python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
|
||||
displayName: 'flake8'
|
||||
|
||||
|
@ -35,12 +35,6 @@ jobs:
|
|||
dependsOn: 'Validate'
|
||||
strategy:
|
||||
matrix:
|
||||
Python35Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.5'
|
||||
Python35Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.5'
|
||||
Python36Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.6'
|
||||
|
@ -58,7 +52,7 @@ jobs:
|
|||
# imageName: 'vs2017-win2016'
|
||||
# python.version: '3.7'
|
||||
# Python37Mac:
|
||||
# imageName: 'macos-10.13'
|
||||
# imageName: 'macos-10.14'
|
||||
# python.version: '3.7'
|
||||
Python38Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
|
|
169
bin/cythonize.py
|
@ -1,169 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
""" cythonize.py
|
||||
|
||||
Cythonize pyx files into C++ files as needed.
|
||||
|
||||
Usage: cythonize.py [root]
|
||||
|
||||
Checks pyx files to see if they have been changed relative to their
|
||||
corresponding C++ files. If they have, then runs cython on these files to
|
||||
recreate the C++ files.
|
||||
|
||||
Additionally, checks pxd files and setup.py if they have been changed. If
|
||||
they have, rebuilds everything.
|
||||
|
||||
Change detection based on file hashes stored in JSON format.
|
||||
|
||||
For now, this script should be run by developers when changing Cython files
|
||||
and the resulting C++ files checked in, so that end-users (and Python-only
|
||||
developers) do not get the Cython dependencies.
|
||||
|
||||
Based upon:
|
||||
|
||||
https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py
|
||||
https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py
|
||||
|
||||
Note: this script does not check any of the dependent C++ libraries.
|
||||
"""
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import hashlib
|
||||
import subprocess
|
||||
import argparse
|
||||
|
||||
|
||||
HASH_FILE = "cythonize.json"
|
||||
|
||||
|
||||
def process_pyx(fromfile, tofile, language_level="-2"):
|
||||
print("Processing %s" % fromfile)
|
||||
try:
|
||||
from Cython.Compiler.Version import version as cython_version
|
||||
from distutils.version import LooseVersion
|
||||
|
||||
if LooseVersion(cython_version) < LooseVersion("0.19"):
|
||||
raise Exception("Require Cython >= 0.19")
|
||||
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
flags = ["--fast-fail", language_level]
|
||||
if tofile.endswith(".cpp"):
|
||||
flags += ["--cplus"]
|
||||
|
||||
try:
|
||||
try:
|
||||
r = subprocess.call(
|
||||
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
|
||||
) # See Issue #791
|
||||
if r != 0:
|
||||
raise Exception("Cython failed")
|
||||
except OSError:
|
||||
# There are ways of installing Cython that don't result in a cython
|
||||
# executable on the path, see gh-2397.
|
||||
r = subprocess.call(
|
||||
[
|
||||
sys.executable,
|
||||
"-c",
|
||||
"import sys; from Cython.Compiler.Main import "
|
||||
"setuptools_main as main; sys.exit(main())",
|
||||
]
|
||||
+ flags
|
||||
+ ["-o", tofile, fromfile]
|
||||
)
|
||||
if r != 0:
|
||||
raise Exception("Cython failed")
|
||||
except OSError:
|
||||
raise OSError("Cython needs to be installed")
|
||||
|
||||
|
||||
def preserve_cwd(path, func, *args):
|
||||
orig_cwd = os.getcwd()
|
||||
try:
|
||||
os.chdir(path)
|
||||
func(*args)
|
||||
finally:
|
||||
os.chdir(orig_cwd)
|
||||
|
||||
|
||||
def load_hashes(filename):
|
||||
try:
|
||||
return json.load(open(filename))
|
||||
except (ValueError, IOError):
|
||||
return {}
|
||||
|
||||
|
||||
def save_hashes(hash_db, filename):
|
||||
with open(filename, "w") as f:
|
||||
f.write(json.dumps(hash_db))
|
||||
|
||||
|
||||
def get_hash(path):
|
||||
return hashlib.md5(open(path, "rb").read()).hexdigest()
|
||||
|
||||
|
||||
def hash_changed(base, path, db):
|
||||
full_path = os.path.normpath(os.path.join(base, path))
|
||||
return not get_hash(full_path) == db.get(full_path)
|
||||
|
||||
|
||||
def hash_add(base, path, db):
|
||||
full_path = os.path.normpath(os.path.join(base, path))
|
||||
db[full_path] = get_hash(full_path)
|
||||
|
||||
|
||||
def process(base, filename, db):
|
||||
root, ext = os.path.splitext(filename)
|
||||
if ext in [".pyx", ".cpp"]:
|
||||
if hash_changed(base, filename, db) or not os.path.isfile(
|
||||
os.path.join(base, root + ".cpp")
|
||||
):
|
||||
preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
|
||||
hash_add(base, root + ".cpp", db)
|
||||
hash_add(base, root + ".pyx", db)
|
||||
|
||||
|
||||
def check_changes(root, db):
|
||||
res = False
|
||||
new_db = {}
|
||||
|
||||
setup_filename = "setup.py"
|
||||
hash_add(".", setup_filename, new_db)
|
||||
if hash_changed(".", setup_filename, db):
|
||||
res = True
|
||||
|
||||
for base, _, files in os.walk(root):
|
||||
for filename in files:
|
||||
if filename.endswith(".pxd"):
|
||||
hash_add(base, filename, new_db)
|
||||
if hash_changed(base, filename, db):
|
||||
res = True
|
||||
|
||||
if res:
|
||||
db.clear()
|
||||
db.update(new_db)
|
||||
return res
|
||||
|
||||
|
||||
def run(root):
|
||||
db = load_hashes(HASH_FILE)
|
||||
|
||||
try:
|
||||
check_changes(root, db)
|
||||
for base, _, files in os.walk(root):
|
||||
for filename in files:
|
||||
process(base, filename, db)
|
||||
finally:
|
||||
save_hashes(db, HASH_FILE)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Cythonize pyx files into C++ files as needed"
|
||||
)
|
||||
parser.add_argument("root", help="root directory")
|
||||
args = parser.parse_args()
|
||||
run(args.root)
|
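Review note: the change detection performed by the removed `bin/cythonize.py` (hash each file, compare against a stored table, rebuild on mismatch) can be sketched in a few lines. This is only an illustration under stated assumptions, not the deleted script: `file_digest` and `changed_files` are hypothetical names, and SHA-256 is swapped in where the original used MD5.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the hex digest of a file's bytes (SHA-256 here; the removed script used MD5)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, db):
    """Return the paths whose digest differs from the one stored in db.

    db maps a normalized path string to the last seen digest and is
    updated in place, so a second call with unchanged files returns [].
    """
    changed = []
    for path in paths:
        key = str(Path(path))
        digest = file_digest(key)
        if db.get(key) != digest:
            changed.append(key)
            db[key] = digest
    return changed
```

In this sketch the table can be persisted with `json.dump(db, f)` between runs, which mirrors the JSON hash file the original kept.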
12
bin/get-package.sh
Executable file
|
@ -0,0 +1,12 @@
|
|||
#!/usr/bin/env bash
|
||||
|
||||
set -e
|
||||
|
||||
version=$(grep "__title__ = " spacy/about.py)
|
||||
version=${version/__title__ = }
|
||||
version=${version/\'/}
|
||||
version=${version/\'/}
|
||||
version=${version/\"/}
|
||||
version=${version/\"/}
|
||||
|
||||
echo $version
|
|
@ -1,97 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import bz2
|
||||
import re
|
||||
import srsly
|
||||
import sys
|
||||
import random
|
||||
import datetime
|
||||
import plac
|
||||
from pathlib import Path
|
||||
|
||||
_unset = object()
|
||||
|
||||
|
||||
class Reddit(object):
|
||||
"""Stream cleaned comments from Reddit."""
|
||||
|
||||
pre_format_re = re.compile(r"^[`*~]")
|
||||
post_format_re = re.compile(r"[`*~]$")
|
||||
url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
|
||||
link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
|
||||
|
||||
def __init__(self, file_path, meta_keys={"subreddit": "section"}):
|
||||
"""
|
||||
file_path (unicode / Path): Path to archive or directory of archives.
|
||||
meta_keys (dict): Meta data key included in the Reddit corpus, mapped
|
||||
to display name in Prodigy meta.
|
||||
RETURNS (Reddit): The Reddit loader.
|
||||
"""
|
||||
self.meta = meta_keys
|
||||
file_path = Path(file_path)
|
||||
if not file_path.exists():
|
||||
raise IOError("Can't find file path: {}".format(file_path))
|
||||
if not file_path.is_dir():
|
||||
self.files = [file_path]
|
||||
else:
|
||||
self.files = list(file_path.iterdir())
|
||||
|
||||
def __iter__(self):
|
||||
for file_path in self.iter_files():
|
||||
with bz2.open(str(file_path)) as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
comment = srsly.json_loads(line)
|
||||
if self.is_valid(comment):
|
||||
text = self.strip_tags(comment["body"])
|
||||
yield {"text": text}
|
||||
|
||||
def get_meta(self, item):
|
||||
return {name: item.get(key, "n/a") for key, name in self.meta.items()}
|
||||
|
||||
def iter_files(self):
|
||||
for file_path in self.files:
|
||||
yield file_path
|
||||
|
||||
def strip_tags(self, text):
|
||||
text = self.link_re.sub(r"\1", text)
|
||||
text = text.replace("&gt;", ">").replace("&lt;", "<")
|
||||
text = self.pre_format_re.sub("", text)
|
||||
text = self.post_format_re.sub("", text)
|
||||
text = re.sub(r"\s+", " ", text)
|
||||
return text.strip()
|
||||
|
||||
def is_valid(self, comment):
|
||||
return (
|
||||
comment["body"] is not None
|
||||
and comment["body"] != "[deleted]"
|
||||
and comment["body"] != "[removed]"
|
||||
)
|
||||
|
||||
|
||||
def main(path):
|
||||
reddit = Reddit(path)
|
||||
for comment in reddit:
|
||||
print(srsly.json_dumps(comment))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import socket
|
||||
|
||||
try:
|
||||
BrokenPipeError
|
||||
except NameError:
|
||||
BrokenPipeError = socket.error
|
||||
try:
|
||||
plac.call(main)
|
||||
except BrokenPipeError:
|
||||
import os, sys
|
||||
|
||||
# Python flushes standard streams on exit; redirect remaining output
|
||||
# to devnull to avoid another BrokenPipeError at shutdown
|
||||
devnull = os.open(os.devnull, os.O_WRONLY)
|
||||
os.dup2(devnull, sys.stdout.fileno())
|
||||
sys.exit(1) # Python exits with error code 1 on EPIPE
|
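Review note: the cleaning step of the removed Reddit loader is easier to follow in isolation. The sketch below assumes that comment bodies contain Markdown links and HTML-escaped angle brackets (`&gt;`, `&lt;`); `clean_comment` is a hypothetical standalone name, not a function from the deleted file.

```python
import re

# Markdown link: [text](http://...) or [text](https://...)
LINK_RE = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")
# A single formatting character at the very start or end of the text
PRE_FORMAT_RE = re.compile(r"^[`*~]")
POST_FORMAT_RE = re.compile(r"[`*~]$")


def clean_comment(text):
    """Strip simple Markdown and escaping from a comment body."""
    # Replace Markdown links with their link text.
    text = LINK_RE.sub(r"\1", text)
    # Undo HTML escaping of angle brackets (assumed input convention).
    text = text.replace("&gt;", ">").replace("&lt;", "<")
    # Drop one leading and one trailing formatting character, if present.
    text = PRE_FORMAT_RE.sub("", text)
    text = POST_FORMAT_RE.sub("", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Note that only one formatting character is removed at each end, so bold text wrapped in `**` would keep one asterisk on each side.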
|
@ -1,81 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
from gensim.models import Word2Vec
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Corpus(object):
|
||||
def __init__(self, directory, nlp):
|
||||
self.directory = directory
|
||||
self.nlp = nlp
|
||||
|
||||
def __iter__(self):
|
||||
for text_loc in iter_dir(self.directory):
|
||||
with text_loc.open("r", encoding="utf-8") as file_:
|
||||
text = file_.read()
|
||||
|
||||
# This is to keep the input to the blank model (which doesn't
|
||||
# sentencize) from being too long. It works particularly well with
|
||||
# the output of [WikiExtractor](https://github.com/attardi/wikiextractor)
|
||||
paragraphs = text.split('\n\n')
|
||||
for par in paragraphs:
|
||||
yield [word.orth_ for word in self.nlp(par)]
|
||||
|
||||
|
||||
def iter_dir(loc):
|
||||
dir_path = Path(loc)
|
||||
for fn_path in dir_path.iterdir():
|
||||
if fn_path.is_dir():
|
||||
for sub_path in fn_path.iterdir():
|
||||
yield sub_path
|
||||
else:
|
||||
yield fn_path
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("ISO language code"),
|
||||
in_dir=("Location of input directory"),
|
||||
out_loc=("Location of output file"),
|
||||
n_workers=("Number of workers", "option", "n", int),
|
||||
size=("Dimension of the word vectors", "option", "d", int),
|
||||
window=("Context window size", "option", "w", int),
|
||||
min_count=("Min count", "option", "m", int),
|
||||
negative=("Number of negative samples", "option", "g", int),
|
||||
nr_iter=("Number of iterations", "option", "i", int),
|
||||
)
|
||||
def main(
|
||||
lang,
|
||||
in_dir,
|
||||
out_loc,
|
||||
negative=5,
|
||||
n_workers=4,
|
||||
window=5,
|
||||
size=128,
|
||||
min_count=10,
|
||||
nr_iter=5,
|
||||
):
|
||||
logging.basicConfig(
|
||||
format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
|
||||
)
|
||||
nlp = spacy.blank(lang)
|
||||
corpus = Corpus(in_dir, nlp)
|
||||
model = Word2Vec(
|
||||
sentences=corpus,
|
||||
size=size,
|
||||
window=window,
|
||||
min_count=min_count,
|
||||
workers=n_workers,
|
||||
sample=1e-5,
|
||||
negative=negative,
|
||||
)
|
||||
model.save(out_loc)
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
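Review note: the corpus iterator in the removed word2vec script walks a directory one level deep and yields one token list per paragraph. A minimal sketch of that pattern follows, with the tokenizer passed in as a plain callable so that it runs without spaCy or gensim. `iter_files` and `iter_paragraph_tokens` are hypothetical names. Unlike the original, the sketch sorts directory entries and skips empty paragraphs.

```python
from pathlib import Path


def iter_files(loc):
    """Yield files in loc and in its immediate subdirectories (one level deep)."""
    for path in sorted(Path(loc).iterdir()):
        if path.is_dir():
            yield from sorted(path.iterdir())
        else:
            yield path


def iter_paragraph_tokens(loc, tokenize):
    """Yield one token list per blank-line-separated paragraph.

    tokenize is any callable mapping a string to a list of tokens; the
    removed script used a blank spaCy pipeline for this step.
    """
    for path in iter_files(loc):
        text = path.read_text(encoding="utf-8")
        for par in text.split("\n\n"):
            if par.strip():
                yield tokenize(par)
```

Splitting on blank lines keeps each input to the tokenizer short, which was the stated reason for the paragraph split in the original comments.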
|
@ -1,2 +0,0 @@
|
|||
from .conll17_ud_eval import main as ud_evaluate # noqa: F401
|
||||
from .ud_train import main as ud_train # noqa: F401
|
|
@ -1,614 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# flake8: noqa
|
||||
|
||||
# CoNLL 2017 UD Parsing evaluation script.
|
||||
#
|
||||
# Compatible with Python 2.7 and 3.2+, can be used either as a module
|
||||
# or a standalone executable.
|
||||
#
|
||||
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
|
||||
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
|
||||
#
|
||||
# Changelog:
|
||||
# - [02 Jan 2017] Version 0.9: Initial release
|
||||
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
|
||||
# - [10 Mar 2017] Version 1.0: Add documentation and test
|
||||
# Compare HEADs correctly using aligned words
|
||||
# Allow evaluation with erroneous spaces in forms
|
||||
# Compare forms in LCS case insensitively
|
||||
# Detect cycles and multiple root nodes
|
||||
# Compute AlignedAccuracy
|
||||
|
||||
# Command line usage
|
||||
# ------------------
|
||||
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
|
||||
#
|
||||
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metrics
|
||||
# is printed
|
||||
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
|
||||
# and in case the metric is computed on aligned words also accuracy on these):
|
||||
# - Tokens: how well do the gold tokens match system tokens
|
||||
# - Sentences: how well do the gold sentences match system sentences
|
||||
# - Words: how well can the gold words be aligned to system words
|
||||
# - UPOS: using aligned words, how well does UPOS match
|
||||
# - XPOS: using aligned words, how well does XPOS match
|
||||
# - Feats: using aligned words, how well does FEATS match
|
||||
# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
|
||||
# - Lemmas: using aligned words, how well does LEMMA match
|
||||
# - UAS: using aligned words, how well does HEAD match
|
||||
# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match
|
||||
# - if weights_file is given (with lines containing deprel-weight pairs),
|
||||
# one more metric is shown:
|
||||
# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight
|
||||
|
||||
# API usage
|
||||
# ---------
|
||||
# - load_conllu(file)
|
||||
# - loads CoNLL-U file from given file object to an internal representation
|
||||
# - the file object should return str on both Python 2 and Python 3
|
||||
# - raises UDError exception if the given file cannot be loaded
|
||||
# - evaluate(gold_ud, system_ud)
|
||||
# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
|
||||
# - raises UDError if the concatenated tokens of gold and system file do not match
|
||||
# - returns a dictionary with the metrics described above, each metrics having
|
||||
# four fields: precision, recall, f1 and aligned_accuracy (when using aligned
|
||||
# words, otherwise this is None)
|
||||
|
||||
# Description of token matching
|
||||
# -----------------------------
|
||||
# In order to match tokens of gold file and system file, we consider the text
|
||||
# resulting from concatenation of gold tokens and text resulting from
|
||||
# concatenation of system tokens. These texts should match -- if they do not,
|
||||
# the evaluation fails.
|
||||
#
|
||||
# If the texts do match, every token is represented as a range in this original
|
||||
# text, and tokens are equal only if their range is the same.
|
||||
|
||||
# Description of word matching
|
||||
# ----------------------------
|
||||
# When matching words of gold file and system file, we first match the tokens.
|
||||
# The words which are also tokens are matched as tokens, but words in multi-word
|
||||
# tokens have to be handled differently.
|
||||
#
|
||||
# To handle multi-word tokens, we start by finding "multi-word spans".
|
||||
# Multi-word span is a span in the original text such that
|
||||
# - it contains at least one multi-word token
|
||||
# - all multi-word tokens in the span (considering both gold and system ones)
|
||||
# are completely inside the span (i.e., they do not "stick out")
|
||||
# - the multi-word span is as small as possible
|
||||
#
|
||||
# For every multi-word span, we align the gold and system words completely
|
||||
# inside this span using LCS on their FORMs. The words not intersecting
|
||||
# (even partially) any multi-word span are then aligned as tokens.
|
||||
|
||||
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import argparse
|
||||
import io
|
||||
import sys
|
||||
import unittest
|
||||
|
||||
# CoNLL-U column names
|
||||
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
|
||||
|
||||
# UD Error is used when raising exceptions in this module
|
||||
class UDError(Exception):
|
||||
pass
|
||||
|
||||
# Load given CoNLL-U file into internal representation
|
||||
def load_conllu(file, check_parse=True):
|
||||
# Internal representation classes
|
||||
class UDRepresentation:
|
||||
def __init__(self):
|
||||
# Characters of all the tokens in the whole file.
|
||||
# Whitespace between tokens is not included.
|
||||
self.characters = []
|
||||
# List of UDSpan instances with start&end indices into `characters`.
|
||||
self.tokens = []
|
||||
# List of UDWord instances.
|
||||
self.words = []
|
||||
# List of UDSpan instances with start&end indices into `characters`.
|
||||
self.sentences = []
|
||||
class UDSpan:
|
||||
def __init__(self, start, end, characters):
|
||||
self.start = start
|
||||
# Note that self.end marks the first position **after the end** of span,
|
||||
# so we can use characters[start:end] or range(start, end).
|
||||
self.end = end
|
||||
self.characters = characters
|
||||
|
||||
@property
|
||||
def text(self):
|
||||
return ''.join(self.characters[self.start:self.end])
|
||||
|
||||
def __str__(self):
|
||||
return self.text
|
||||
|
||||
def __repr__(self):
|
||||
return self.text
|
||||
class UDWord:
|
||||
def __init__(self, span, columns, is_multiword):
|
||||
# Span of this word (or MWT, see below) within ud_representation.characters.
|
||||
self.span = span
|
||||
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
|
||||
self.columns = columns
|
||||
# is_multiword==True means that this word is part of a multi-word token.
|
||||
# In that case, self.span marks the span of the whole multi-word token.
|
||||
self.is_multiword = is_multiword
|
||||
# Reference to the UDWord instance representing the HEAD (or None if root).
|
||||
self.parent = None
|
||||
# Let's ignore language-specific deprel subtypes.
|
||||
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
|
||||
|
||||
ud = UDRepresentation()
|
||||
|
||||
# Load the CoNLL-U file
|
||||
index, sentence_start = 0, None
|
||||
linenum = 0
|
||||
while True:
|
||||
line = file.readline()
|
||||
linenum += 1
|
||||
if not line:
|
||||
break
|
||||
line = line.rstrip("\r\n")
|
||||
|
||||
# Handle sentence start boundaries
|
||||
if sentence_start is None:
|
||||
# Skip comments
|
||||
if line.startswith("#"):
|
||||
continue
|
||||
# Start a new sentence
|
||||
ud.sentences.append(UDSpan(index, 0, ud.characters))
|
||||
sentence_start = len(ud.words)
|
||||
if not line:
|
||||
# Add parent UDWord links and check there are no cycles
|
||||
def process_word(word):
|
||||
if word.parent == "remapping":
|
||||
raise UDError("There is a cycle in a sentence")
|
||||
if word.parent is None:
|
||||
head = int(word.columns[HEAD])
|
||||
if head > len(ud.words) - sentence_start:
|
||||
raise UDError("Line {}: HEAD '{}' points outside of the sentence".format(
|
||||
linenum, word.columns[HEAD]))
|
||||
if head:
|
||||
parent = ud.words[sentence_start + head - 1]
|
||||
word.parent = "remapping"
|
||||
process_word(parent)
|
||||
word.parent = parent
|
||||
|
||||
for word in ud.words[sentence_start:]:
|
||||
process_word(word)
|
||||
|
||||
# Check there is a single root node
|
||||
if check_parse:
|
||||
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
|
||||
raise UDError("There are multiple roots in a sentence")
|
||||
|
||||
# End the sentence
|
||||
ud.sentences[-1].end = index
|
||||
sentence_start = None
|
||||
continue
|
||||
|
||||
# Read next token/word
|
||||
columns = line.split("\t")
|
||||
if len(columns) != 10:
|
||||
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
|
||||
|
||||
# Skip empty nodes
|
||||
if "." in columns[ID]:
|
||||
continue
|
||||
|
||||
# Delete spaces from FORM so gold.characters == system.characters
|
||||
# even if one of them tokenizes the space.
|
||||
columns[FORM] = columns[FORM].replace(" ", "")
|
||||
if not columns[FORM]:
|
||||
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
|
||||
|
||||
# Save token
|
||||
ud.characters.extend(columns[FORM])
|
||||
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
|
||||
index += len(columns[FORM])
|
||||
|
||||
# Handle multi-word tokens to save word(s)
|
||||
if "-" in columns[ID]:
|
||||
try:
|
||||
start, end = map(int, columns[ID].split("-"))
|
||||
except:
|
||||
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
|
||||
|
||||
for _ in range(start, end + 1):
|
||||
word_line = file.readline().rstrip("\r\n")
|
||||
word_columns = word_line.split("\t")
|
||||
if len(word_columns) != 10:
|
||||
print(columns)
|
||||
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
|
||||
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
|
||||
# Basic tokens/words
|
||||
else:
|
||||
try:
|
||||
word_id = int(columns[ID])
|
||||
except:
|
||||
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
|
||||
if word_id != len(ud.words) - sentence_start + 1:
|
||||
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
|
||||
|
||||
try:
|
||||
head_id = int(columns[HEAD])
|
||||
except:
|
||||
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
|
||||
if head_id < 0:
|
||||
raise UDError("HEAD cannot be negative")
|
||||
|
||||
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
|
||||
|
||||
if sentence_start is not None:
|
||||
raise UDError("The CoNLL-U file does not end with empty line")
|
||||
|
||||
return ud
|
||||
|
||||
# Evaluate the gold and system treebanks (loaded using load_conllu).
|
||||
def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
|
||||
class Score:
|
||||
def __init__(self, gold_total, system_total, correct, aligned_total=None, undersegmented=None, oversegmented=None):
|
||||
self.precision = correct / system_total if system_total else 0.0
|
||||
self.recall = correct / gold_total if gold_total else 0.0
|
||||
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
|
||||
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
|
||||
self.undersegmented = undersegmented
|
||||
self.oversegmented = oversegmented
|
||||
self.under_perc = len(undersegmented) / gold_total if gold_total and undersegmented else 0.0
|
||||
self.over_perc = len(oversegmented) / gold_total if gold_total and oversegmented else 0.0
|
||||
class AlignmentWord:
|
||||
def __init__(self, gold_word, system_word):
|
||||
self.gold_word = gold_word
|
||||
self.system_word = system_word
|
||||
self.gold_parent = None
|
||||
self.system_parent_gold_aligned = None
|
||||
class Alignment:
|
||||
def __init__(self, gold_words, system_words):
|
||||
self.gold_words = gold_words
|
||||
self.system_words = system_words
|
||||
self.matched_words = []
|
||||
self.matched_words_map = {}
|
||||
def append_aligned_words(self, gold_word, system_word):
|
||||
self.matched_words.append(AlignmentWord(gold_word, system_word))
|
||||
self.matched_words_map[system_word] = gold_word
|
||||
def fill_parents(self):
|
||||
# We represent root parents in both gold and system data by '0'.
|
||||
# For gold data, we represent a non-root parent by the corresponding gold word.
|
||||
# For system data, we represent a non-root parent by either the gold word aligned
|
||||
# to the parent system node, or by None if no gold word is aligned to the parent.
|
||||
for words in self.matched_words:
|
||||
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
|
||||
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
|
||||
if words.system_word.parent is not None else 0
|
||||
|
||||
def lower(text):
|
||||
if sys.version_info < (3, 0) and isinstance(text, str):
|
||||
return text.decode("utf-8").lower()
|
||||
return text.lower()
|
||||
|
||||
def spans_score(gold_spans, system_spans):
|
||||
correct, gi, si = 0, 0, 0
|
||||
undersegmented = []
|
||||
oversegmented = []
|
||||
combo = 0
|
||||
previous_end_si_earlier = False
|
||||
previous_end_gi_earlier = False
|
||||
while gi < len(gold_spans) and si < len(system_spans):
|
||||
previous_si = system_spans[si-1] if si > 0 else None
|
||||
previous_gi = gold_spans[gi-1] if gi > 0 else None
|
||||
if system_spans[si].start < gold_spans[gi].start:
|
||||
# avoid counting the same mistake twice
|
||||
if not previous_end_si_earlier:
|
||||
combo += 1
|
||||
oversegmented.append(str(previous_gi).strip())
|
||||
si += 1
|
||||
elif gold_spans[gi].start < system_spans[si].start:
|
||||
# avoid counting the same mistake twice
|
||||
if not previous_end_gi_earlier:
|
||||
combo += 1
|
||||
undersegmented.append(str(previous_si).strip())
|
||||
gi += 1
|
||||
else:
|
||||
correct += gold_spans[gi].end == system_spans[si].end
|
||||
if gold_spans[gi].end < system_spans[si].end:
|
||||
undersegmented.append(str(system_spans[si]).strip())
|
||||
previous_end_gi_earlier = True
|
||||
previous_end_si_earlier = False
|
||||
elif gold_spans[gi].end > system_spans[si].end:
|
||||
oversegmented.append(str(gold_spans[gi]).strip())
|
||||
previous_end_si_earlier = True
|
||||
previous_end_gi_earlier = False
|
||||
else:
|
||||
previous_end_gi_earlier = False
|
||||
previous_end_si_earlier = False
|
||||
si += 1
|
||||
gi += 1
|
||||
|
||||
return Score(len(gold_spans), len(system_spans), correct, None, undersegmented, oversegmented)
|
||||
|
||||
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
|
||||
gold, system, aligned, correct = 0, 0, 0, 0
|
||||
|
||||
for word in alignment.gold_words:
|
||||
gold += weight_fn(word)
|
||||
|
||||
for word in alignment.system_words:
|
||||
system += weight_fn(word)
|
||||
|
||||
for words in alignment.matched_words:
|
||||
aligned += weight_fn(words.gold_word)
|
||||
|
||||
if key_fn is None:
|
||||
# Return score for whole aligned words
|
||||
return Score(gold, system, aligned)
|
||||
|
||||
for words in alignment.matched_words:
|
||||
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
|
||||
correct += weight_fn(words.gold_word)
|
||||
|
||||
return Score(gold, system, correct, aligned)
|
||||
|
||||
def beyond_end(words, i, multiword_span_end):
|
||||
if i >= len(words):
|
||||
return True
|
||||
if words[i].is_multiword:
|
||||
return words[i].span.start >= multiword_span_end
|
||||
return words[i].span.end > multiword_span_end
|
||||
|
||||
def extend_end(word, multiword_span_end):
|
||||
if word.is_multiword and word.span.end > multiword_span_end:
|
||||
return word.span.end
|
||||
return multiword_span_end
|
||||
|
||||
def find_multiword_span(gold_words, system_words, gi, si):
|
||||
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
|
||||
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
|
||||
# Initialize multiword_span_end characters index.
|
||||
if gold_words[gi].is_multiword:
|
||||
multiword_span_end = gold_words[gi].span.end
|
||||
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
|
||||
si += 1
|
||||
else: # if system_words[si].is_multiword
|
||||
multiword_span_end = system_words[si].span.end
|
||||
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
|
||||
gi += 1
|
||||
gs, ss = gi, si
|
||||
|
||||
# Find the end of the multiword span
|
||||
# (so both gi and si are pointing to the word following the multiword span end).
|
||||
while not beyond_end(gold_words, gi, multiword_span_end) or \
|
||||
not beyond_end(system_words, si, multiword_span_end):
|
||||
if gi < len(gold_words) and (si >= len(system_words) or
|
||||
gold_words[gi].span.start <= system_words[si].span.start):
|
||||
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
|
||||
gi += 1
|
||||
else:
|
||||
multiword_span_end = extend_end(system_words[si], multiword_span_end)
|
||||
si += 1
|
||||
return gs, ss, gi, si
|
||||
|
||||
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
|
||||
lcs = [[0] * (si - ss) for i in range(gi - gs)]
|
||||
for g in reversed(range(gi - gs)):
|
||||
for s in reversed(range(si - ss)):
|
||||
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
||||
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
|
||||
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
|
||||
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
|
||||
return lcs
|
||||
|
||||
def align_words(gold_words, system_words):
|
||||
alignment = Alignment(gold_words, system_words)
|
||||
|
||||
gi, si = 0, 0
|
||||
while gi < len(gold_words) and si < len(system_words):
|
||||
if gold_words[gi].is_multiword or system_words[si].is_multiword:
|
||||
# A: Multi-word tokens => align via LCS within the whole "multiword span".
|
||||
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
|
||||
|
||||
if si > ss and gi > gs:
|
||||
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
|
||||
|
||||
# Store aligned words
|
||||
s, g = 0, 0
|
||||
while g < gi - gs and s < si - ss:
|
||||
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
||||
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
|
||||
g += 1
|
||||
s += 1
|
||||
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
|
||||
g += 1
|
||||
else:
|
||||
s += 1
|
||||
else:
|
||||
# B: No multi-word token => align according to spans.
|
||||
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
|
||||
alignment.append_aligned_words(gold_words[gi], system_words[si])
|
||||
gi += 1
|
||||
si += 1
|
||||
elif gold_words[gi].span.start <= system_words[si].span.start:
|
||||
gi += 1
|
||||
else:
|
||||
si += 1
|
||||
|
||||
alignment.fill_parents()
|
||||
|
||||
return alignment
|
||||
|
||||
# Check that underlying character sequences do match
|
||||
if gold_ud.characters != system_ud.characters:
|
||||
index = 0
|
||||
while gold_ud.characters[index] == system_ud.characters[index]:
|
||||
index += 1
|
||||
|
||||
raise UDError(
|
||||
"The concatenation of tokens in gold file and in system file differ!\n" +
|
||||
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
|
||||
"".join(gold_ud.characters[index:index + 20]),
|
||||
"".join(system_ud.characters[index:index + 20])
|
||||
)
|
||||
)
|
||||
|
||||
# Align words
|
||||
alignment = align_words(gold_ud.words, system_ud.words)
|
||||
|
||||
# Compute the F1-scores
|
||||
if check_parse:
|
||||
result = {
|
||||
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
||||
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
||||
"Words": alignment_score(alignment, None),
|
||||
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
|
||||
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
|
||||
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
||||
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
|
||||
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
||||
"UAS": alignment_score(alignment, lambda w, parent: parent),
|
||||
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
|
||||
}
|
||||
else:
|
||||
result = {
|
||||
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
||||
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
||||
"Words": alignment_score(alignment, None),
|
||||
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
||||
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
||||
}
|
||||
|
||||
|
||||
# Add WeightedLAS if weights are given
|
||||
if deprel_weights is not None:
|
||||
def weighted_las(word):
|
||||
return deprel_weights.get(word.columns[DEPREL], 1.0)
|
||||
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
|
||||
|
||||
return result
|
||||
|
||||
def load_deprel_weights(weights_file):
|
||||
if weights_file is None:
|
||||
return None
|
||||
|
||||
deprel_weights = {}
|
||||
for line in weights_file:
|
||||
# Ignore comments and empty lines
|
||||
if line.startswith("#") or not line.strip():
|
||||
continue
|
||||
|
||||
columns = line.rstrip("\r\n").split()
|
||||
if len(columns) != 2:
|
||||
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
|
||||
|
||||
deprel_weights[columns[0]] = float(columns[1])
|
||||
|
||||
return deprel_weights
|
||||
|
||||
def load_conllu_file(path):
|
||||
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
|
||||
return load_conllu(_file)
|
||||
|
||||
def evaluate_wrapper(args):
|
||||
# Load CoNLL-U files
|
||||
gold_ud = load_conllu_file(args.gold_file)
|
||||
system_ud = load_conllu_file(args.system_file)
|
||||
|
||||
# Load weights if requested
|
||||
deprel_weights = load_deprel_weights(args.weights)
|
||||
|
||||
return evaluate(gold_ud, system_ud, deprel_weights)
|
||||
|
||||
def main():
|
||||
# Parse arguments
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("gold_file", type=str,
|
||||
help="Name of the CoNLL-U file with the gold data.")
|
||||
parser.add_argument("system_file", type=str,
|
||||
help="Name of the CoNLL-U file with the predicted data.")
|
||||
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
|
||||
metavar="deprel_weights_file",
|
||||
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
|
||||
parser.add_argument("--verbose", "-v", default=0, action="count",
|
||||
help="Print all metrics.")
|
||||
args = parser.parse_args()
|
||||
|
||||
# Use verbose if weights are supplied
|
||||
if args.weights is not None and not args.verbose:
|
||||
args.verbose = 1
|
||||
|
||||
# Evaluate
|
||||
evaluation = evaluate_wrapper(args)
|
||||
|
||||
# Print the evaluation
|
||||
if not args.verbose:
|
||||
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
|
||||
else:
|
||||
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
|
||||
if args.weights is not None:
|
||||
metrics.append("WeightedLAS")
|
||||
|
||||
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
|
||||
print("-----------+-----------+-----------+-----------+-----------")
|
||||
for metric in metrics:
|
||||
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
|
||||
metric,
|
||||
100 * evaluation[metric].precision,
|
||||
100 * evaluation[metric].recall,
|
||||
100 * evaluation[metric].f1,
|
||||
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
|
||||
))
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
|
||||
class TestAlignment(unittest.TestCase):
|
||||
@staticmethod
|
||||
def _load_words(words):
|
||||
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
|
||||
lines, num_words = [], 0
|
||||
for w in words:
|
||||
parts = w.split(" ")
|
||||
if len(parts) == 1:
|
||||
num_words += 1
|
||||
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
|
||||
else:
|
||||
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
|
||||
for part in parts[1:]:
|
||||
num_words += 1
|
||||
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
|
||||
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
|
||||
|
||||
def _test_exception(self, gold, system):
|
||||
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
|
||||
|
||||
def _test_ok(self, gold, system, correct):
|
||||
metrics = evaluate(self._load_words(gold), self._load_words(system))
|
||||
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
|
||||
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
|
||||
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
|
||||
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
|
||||
|
||||
def test_exception(self):
|
||||
self._test_exception(["a"], ["b"])
|
||||
|
||||
def test_equal(self):
|
||||
self._test_ok(["a"], ["a"], 1)
|
||||
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
|
||||
|
||||
def test_equal_with_multiword(self):
|
||||
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
|
||||
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
|
||||
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
|
||||
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
|
||||
|
||||
def test_alignment(self):
|
||||
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
|
||||
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
|
||||
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
|
||||
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
|
||||
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
|
||||
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
|
||||
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)
|
|
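The precision, recall and F1 arithmetic in the `Score` class of the removed `conll17_ud_eval` module can be sketched in isolation. This is a minimal illustration under the assumption of Python 3 true division; the function name `prf` is hypothetical and is not part of the deleted file. The sample counts follow the `test_alignment` case `["a", "bc b c", "d"]` against `["a", "b", "cd"]`: four gold words, three system words, two aligned.

```python
def prf(gold_total, system_total, correct):
    """Return (precision, recall, f1) from raw counts.

    Hypothetical helper: it repeats the guards used by Score so that
    empty inputs yield 0.0 instead of raising ZeroDivisionError.
    """
    precision = correct / system_total if system_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = system_total + gold_total
    f1 = 2 * correct / denom if denom else 0.0
    return precision, recall, f1


# Four gold words, three system words, two of them aligned.
precision, recall, f1 = prf(gold_total=4, system_total=3, correct=2)
print("P={:.2f} R={:.2f} F1={:.2f}".format(precision, recall, f1))
```

F1 is computed here as `2 * correct / (system_total + gold_total)`, which is algebraically the harmonic mean of precision and recall but avoids dividing by a zero precision or recall.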
@ -1,293 +0,0 @@
|
|||
import spacy
|
||||
import time
|
||||
import re
|
||||
import plac
|
||||
import operator
|
||||
import datetime
|
||||
from pathlib import Path
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
import conll17_ud_eval
|
||||
from ud_train import write_conllu
|
||||
from spacy.lang.lex_attrs import word_shape
|
||||
from spacy.util import get_lang_class
|
||||
|
||||
# All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb')
|
||||
ALL_LANGUAGES = ("af, ar, bg, bn, ca, cs, da, de, el, en, es, et, fa, fi, fr,"
|
||||
"ga, he, hi, hr, hu, id, is, it, ja, kn, ko, lt, lv, mr, no,"
|
||||
"nl, pl, pt, ro, ru, si, sk, sl, sq, sr, sv, ta, te, th, tl,"
|
||||
"tr, tt, uk, ur, vi, zh")
|
||||
|
||||
# Non-parsing tasks that will be evaluated (works for default models)
|
||||
EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats']
|
||||
|
||||
# Tasks that will be evaluated if check_parse=True (does not work for default models)
|
||||
EVAL_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats', 'UPOS', 'XPOS', 'AllTags', 'UAS', 'LAS']
|
||||
|
||||
# Minimum frequency an error should have to be printed
|
||||
PRINT_FREQ = 20
|
||||
|
||||
# Maximum number of errors printed per category
|
||||
PRINT_TOTAL = 10
|
||||
|
||||
space_re = re.compile(r"\s+")
|
||||
|
||||
|
||||
def load_model(modelname, add_sentencizer=False):
|
||||
""" Load a specific spaCy model """
|
||||
loading_start = time.time()
|
||||
nlp = spacy.load(modelname)
|
||||
if add_sentencizer:
|
||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||
loading_end = time.time()
|
||||
loading_time = loading_end - loading_start
|
||||
if add_sentencizer:
|
||||
return nlp, loading_time, modelname + '_sentencizer'
|
||||
return nlp, loading_time, modelname
|
||||
|
||||
|
||||
def load_default_model_sentencizer(lang):
|
||||
""" Load a generic spaCy model and add the sentencizer for sentence tokenization"""
|
||||
loading_start = time.time()
|
||||
lang_class = get_lang_class(lang)
|
||||
nlp = lang_class()
|
||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||
loading_end = time.time()
|
||||
loading_time = loading_end - loading_start
|
||||
return nlp, loading_time, lang + "_default_" + 'sentencizer'
|
||||
|
||||
|
||||
def split_text(text):
|
||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||
|
||||
|
||||
def get_freq_tuples(my_list, print_total_threshold):
|
||||
""" Turn a list of errors into frequency-sorted tuples thresholded by a certain total number """
|
||||
d = {}
|
||||
for token in my_list:
|
||||
d.setdefault(token, 0)
|
||||
d[token] += 1
|
||||
return sorted(d.items(), key=operator.itemgetter(1), reverse=True)[:print_total_threshold]
|
||||
|
||||
|
||||
def _contains_blinded_text(stats_xml):
|
||||
""" Heuristic to determine whether the treebank has blinded texts or not """
|
||||
tree = ET.parse(stats_xml)
|
||||
root = tree.getroot()
|
||||
total_tokens = int(root.find('size/total/tokens').text)
|
||||
unique_forms = int(root.find('forms').get('unique'))
|
||||
|
||||
# assume the corpus is largely blinded when fewer than 1% of the tokens are unique forms
|
||||
return (unique_forms / total_tokens) < 0.01
|
||||
|
||||
|
||||
def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language):
|
||||
""" Fetch the txt files for all treebanks for a given set of languages """
|
||||
all_treebanks = dict()
|
||||
treebank_size = dict()
|
||||
for l in languages:
|
||||
all_treebanks[l] = []
|
||||
treebank_size[l] = 0
|
||||
|
||||
for treebank_dir in ud_dir.iterdir():
|
||||
if treebank_dir.is_dir():
|
||||
for txt_path in treebank_dir.iterdir():
|
||||
if txt_path.name.endswith('-ud-' + corpus + '.txt'):
|
||||
file_lang = txt_path.name.split('_')[0]
|
||||
if file_lang in languages:
|
||||
gold_path = treebank_dir / txt_path.name.replace('.txt', '.conllu')
|
||||
stats_xml = treebank_dir / "stats.xml"
|
||||
# ignore treebanks where the texts are not publicly available
|
||||
if not _contains_blinded_text(stats_xml):
|
||||
if not best_per_language:
|
||||
all_treebanks[file_lang].append(txt_path)
|
||||
# check the tokens in the gold annotation to keep only the biggest treebank per language
|
||||
else:
|
||||
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
gold_tokens = len(gold_ud.tokens)
|
||||
if treebank_size[file_lang] < gold_tokens:
|
||||
all_treebanks[file_lang] = [txt_path]
|
||||
treebank_size[file_lang] = gold_tokens
|
||||
|
||||
return all_treebanks
|
||||
|
||||
|
||||
def run_single_eval(nlp, loading_time, print_name, text_path, gold_ud, tmp_output_path, out_file, print_header,
|
||||
check_parse, print_freq_tasks):
|
||||
""" Run an evaluation of a model nlp on a certain specified treebank """
|
||||
with text_path.open(mode='r', encoding='utf-8') as f:
|
||||
flat_text = f.read()
|
||||
|
||||
# STEP 1: tokenize text
|
||||
tokenization_start = time.time()
|
||||
texts = split_text(flat_text)
|
||||
docs = list(nlp.pipe(texts))
|
||||
tokenization_end = time.time()
|
||||
tokenization_time = tokenization_end - tokenization_start
|
||||
|
||||
# STEP 2: record stats and timings
|
||||
tokens_per_s = int(len(gold_ud.tokens) / tokenization_time)
|
||||
|
||||
print_header_1 = ['date', 'text_path', 'gold_tokens', 'model', 'loading_time', 'tokenization_time', 'tokens_per_s']
|
||||
print_string_1 = [str(datetime.date.today()), text_path.name, len(gold_ud.tokens),
|
||||
print_name, "%.2f" % loading_time, "%.2f" % tokenization_time, tokens_per_s]
|
||||
|
||||
# STEP 3: evaluate predicted tokens and features
|
||||
with tmp_output_path.open(mode="w", encoding="utf8") as tmp_out_file:
|
||||
write_conllu(docs, tmp_out_file)
|
||||
with tmp_output_path.open(mode="r", encoding="utf8") as sys_file:
|
||||
sys_ud = conll17_ud_eval.load_conllu(sys_file, check_parse=check_parse)
|
||||
tmp_output_path.unlink()
|
||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud, check_parse=check_parse)
|
||||
|
||||
# STEP 4: format the scoring results
|
||||
eval_headers = EVAL_PARSE
|
||||
if not check_parse:
|
||||
eval_headers = EVAL_NO_PARSE
|
||||
|
||||
for score_name in eval_headers:
|
||||
score = scores[score_name]
|
||||
print_string_1.extend(["%.2f" % score.precision,
|
||||
"%.2f" % score.recall,
|
||||
"%.2f" % score.f1])
|
||||
print_string_1.append("-" if score.aligned_accuracy is None else "%.2f" % score.aligned_accuracy)
|
||||
print_string_1.append("-" if score.undersegmented is None else "%.4f" % score.under_perc)
|
||||
print_string_1.append("-" if score.oversegmented is None else "%.4f" % score.over_perc)
|
||||
|
||||
print_header_1.extend([score_name + '_p', score_name + '_r', score_name + '_F', score_name + '_acc',
|
||||
score_name + '_under', score_name + '_over'])
|
||||
|
||||
if score_name in print_freq_tasks:
|
||||
print_header_1.extend([score_name + '_word_under_ex', score_name + '_shape_under_ex',
|
||||
score_name + '_word_over_ex', score_name + '_shape_over_ex'])
|
||||
|
||||
d_under_words = get_freq_tuples(score.undersegmented, PRINT_TOTAL)
|
||||
d_under_shapes = get_freq_tuples([word_shape(x) for x in score.undersegmented], PRINT_TOTAL)
|
||||
d_over_words = get_freq_tuples(score.oversegmented, PRINT_TOTAL)
|
||||
d_over_shapes = get_freq_tuples([word_shape(x) for x in score.oversegmented], PRINT_TOTAL)
|
||||
|
||||
# saving to CSV with a ';' separator, so any ';' in the example output is masked
|
||||
print_string_1.append(
|
||||
str({k: v for k, v in d_under_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||
print_string_1.append(
|
||||
str({k: v for k, v in d_under_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||
print_string_1.append(
|
||||
str({k: v for k, v in d_over_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||
print_string_1.append(
|
||||
str({k: v for k, v in d_over_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
||||
|
||||
# STEP 5: print the formatted results to CSV
|
||||
if print_header:
|
||||
out_file.write(';'.join(map(str, print_header_1)) + '\n')
|
||||
out_file.write(';'.join(map(str, print_string_1)) + '\n')
|
||||
|
||||
|
||||
def run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks):
|
||||
""" Run an evaluation for each language with its specified models and treebanks """
|
||||
print_header = True
|
||||
|
||||
for tb_lang, treebank_list in treebanks.items():
|
||||
print()
|
||||
print("Language", tb_lang)
|
||||
for text_path in treebank_list:
|
||||
print(" Evaluating on", text_path)
|
||||
|
||||
gold_path = text_path.parent / (text_path.stem + '.conllu')
|
||||
print(" Gold data from ", gold_path)
|
||||
|
||||
# nested try blocks to ensure the code can continue with the next iteration after a failure
|
||||
try:
|
||||
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
|
||||
for nlp, nlp_loading_time, nlp_name in models[tb_lang]:
|
||||
try:
|
||||
print(" Benchmarking", nlp_name)
|
||||
tmp_output_path = text_path.parent / str('tmp_' + nlp_name + '.conllu')
|
||||
run_single_eval(nlp, nlp_loading_time, nlp_name, text_path, gold_ud, tmp_output_path, out_file,
|
||||
print_header, check_parse, print_freq_tasks)
|
||||
print_header = False
|
||||
except Exception as e:
|
||||
print(" Ran into trouble: ", str(e))
|
||||
except Exception as e:
|
||||
print(" Ran into trouble: ", str(e))
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
out_path=("Path to output CSV file", "positional", None, Path),
|
||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
||||
check_parse=("Set flag to evaluate parsing performance", "flag", "p", bool),
|
||||
langs=("Enumeration of languages to evaluate (default: all)", "option", "l", str),
|
||||
exclude_trained_models=("Set flag to exclude trained models", "flag", "t", bool),
|
||||
exclude_multi=("Set flag to exclude the multi-language model as default baseline", "flag", "m", bool),
|
||||
hide_freq=("Set flag to avoid printing out more detailed high-freq tokenization errors", "flag", "f", bool),
|
||||
corpus=("Whether to run on train, dev or test", "option", "c", str),
|
||||
best_per_language=("Set flag to only keep the largest treebank for each language", "flag", "b", bool)
|
||||
)
|
||||
def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_trained_models=False, exclude_multi=False,
|
||||
hide_freq=False, corpus='train', best_per_language=False):
|
||||
"""
|
||||
Assemble all treebanks and models to run evaluations with.
|
||||
When setting check_parse to True, the default models will not be evaluated as they don't have parsing functionality
|
||||
"""
|
||||
languages = [lang.strip() for lang in langs.split(",")]
|
||||
|
||||
print_freq_tasks = []
|
||||
if not hide_freq:
|
||||
print_freq_tasks = ['Tokens']
|
||||
|
||||
# fetch all relevant treebanks from the directory
|
||||
treebanks = fetch_all_treebanks(ud_dir, languages, corpus, best_per_language)
|
||||
|
||||
print()
|
||||
print("Loading all relevant models for", languages)
|
||||
models = dict()
|
||||
|
||||
# multi-lang model
|
||||
multi = None
|
||||
if not exclude_multi and not check_parse:
|
||||
multi = load_model('xx_ent_wiki_sm', add_sentencizer=True)
|
||||
|
||||
# initialize all models with the multi-lang model
|
||||
for lang in languages:
|
||||
models[lang] = [multi] if multi else []
|
||||
# add default models if we don't want to evaluate parsing info
|
||||
if not check_parse:
|
||||
# Norwegian is 'nb' in spaCy but 'no' in the UD corpora
|
||||
if lang == 'no':
|
||||
models['no'].append(load_default_model_sentencizer('nb'))
|
||||
else:
|
||||
models[lang].append(load_default_model_sentencizer(lang))
|
||||
|
||||
# language-specific trained models
|
||||
if not exclude_trained_models:
|
||||
if 'de' in models:
|
||||
models['de'].append(load_model('de_core_news_sm'))
|
||||
models['de'].append(load_model('de_core_news_md'))
|
||||
if 'el' in models:
|
||||
models['el'].append(load_model('el_core_news_sm'))
|
||||
models['el'].append(load_model('el_core_news_md'))
|
||||
if 'en' in models:
|
||||
models['en'].append(load_model('en_core_web_sm'))
|
||||
models['en'].append(load_model('en_core_web_md'))
|
||||
models['en'].append(load_model('en_core_web_lg'))
|
||||
if 'es' in models:
|
||||
models['es'].append(load_model('es_core_news_sm'))
|
||||
models['es'].append(load_model('es_core_news_md'))
|
||||
if 'fr' in models:
|
||||
models['fr'].append(load_model('fr_core_news_sm'))
|
||||
models['fr'].append(load_model('fr_core_news_md'))
|
||||
if 'it' in models:
|
||||
models['it'].append(load_model('it_core_news_sm'))
|
||||
if 'nl' in models:
|
||||
models['nl'].append(load_model('nl_core_news_sm'))
|
||||
if 'pt' in models:
|
||||
models['pt'].append(load_model('pt_core_news_sm'))
|
||||
|
||||
with out_path.open(mode='w', encoding='utf-8') as out_file:
|
||||
run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
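The frequency ranking done by `get_freq_tuples` in the removed evaluation script can also be expressed with the standard library's `collections.Counter`. The sketch below is an assumed equivalent, not code from the deleted file, and the name `top_errors` is hypothetical.

```python
from collections import Counter


def top_errors(errors, limit):
    """Return up to `limit` (item, count) pairs, most frequent first.

    Hypothetical stand-in for get_freq_tuples: Counter does the counting
    and most_common applies both the sort and the cut-off.
    """
    return Counter(errors).most_common(limit)


# Made-up tokenization errors, used only to exercise the helper.
sample = ["can't", "U.S.", "can't", "e.g.", "can't", "U.S."]
print(top_errors(sample, 2))
```

As in the original function, the cut-off limits how many distinct items are returned; the separate `PRINT_FREQ` threshold on the counts themselves would still be applied by the caller.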
@ -1,335 +0,0 @@
|
|||
# flake8: noqa
|
||||
"""Train for CoNLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||
.conllu format for development data, allowing the official scorer to be used.
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
import re
|
||||
import sys
|
||||
import srsly
|
||||
|
||||
import spacy
|
||||
import spacy.util
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.util import compounding, minibatch_by_words
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.matcher import Matcher
|
||||
|
||||
# from spacy.morphology import Fused_begin, Fused_inside
|
||||
from spacy import displacy
|
||||
from collections import defaultdict, Counter
|
||||
from timeit import default_timer as timer
|
||||
|
||||
Fused_begin = None
|
||||
Fused_inside = None
|
||||
|
||||
import itertools
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from . import conll17_ud_eval
|
||||
|
||||
from spacy import lang
|
||||
from spacy.lang import zh
|
||||
from spacy.lang import ja
|
||||
from spacy.lang import ru
|
||||
|
||||
|
||||
################
|
||||
# Data reading #
|
||||
################
|
||||
|
||||
space_re = re.compile(r"\s+")
|
||||
|
||||
|
||||
def split_text(text):
|
||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||
|
||||
|
||||
##############
|
||||
# Evaluation #
|
||||
##############
|
||||
|
||||
|
||||
def read_conllu(file_):
|
||||
docs = []
|
||||
sent = []
|
||||
doc = []
|
||||
for line in file_:
|
||||
if line.startswith("# newdoc"):
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
doc = []
|
||||
elif line.startswith("#"):
|
||||
continue
|
||||
elif not line.strip():
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
sent = []
|
||||
else:
|
||||
sent.append(list(line.strip().split("\t")))
|
||||
if len(sent[-1]) != 10:
|
||||
print(repr(line))
|
||||
raise ValueError
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
||||
|
||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||
if text_loc.parts[-1].endswith(".conllu"):
|
||||
docs = []
|
||||
with text_loc.open(encoding="utf8") as file_:
|
||||
for conllu_doc in read_conllu(file_):
|
||||
for conllu_sent in conllu_doc:
|
||||
words = [line[1] for line in conllu_sent]
|
||||
docs.append(Doc(nlp.vocab, words=words))
|
||||
for name, component in nlp.pipeline:
|
||||
docs = list(component.pipe(docs))
|
||||
else:
|
||||
with text_loc.open("r", encoding="utf8") as text_file:
|
||||
texts = split_text(text_file.read())
|
||||
docs = list(nlp.pipe(texts))
|
||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
||||
write_conllu(docs, out_file)
|
||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||
return docs, scores
|
||||
|
||||
|
||||
def write_conllu(docs, file_):
|
||||
merger = Matcher(docs[0].vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||
for i, doc in enumerate(docs):
|
||||
matches = []
|
||||
if doc.is_parsed:
|
||||
matches = merger(doc)
|
||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in spans:
|
||||
retokenizer.merge(span)
|
||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||
for j, sent in enumerate(doc.sents):
|
||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||
file_.write("# text = {text}\n".format(text=sent.text))
|
||||
for k, token in enumerate(sent):
|
||||
file_.write(_get_token_conllu(token, k, len(sent)) + "\n")
|
||||
file_.write("\n")
|
||||
for word in sent:
|
||||
if word.head.i == word.i and word.dep_ == "ROOT":
|
||||
break
|
||||
else:
|
||||
print("Rootless sentence!")
|
||||
print(sent)
|
||||
print(i)
|
||||
for w in sent:
|
||||
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
|
||||
raise ValueError
|
||||
|
||||
|
||||
def _get_token_conllu(token, k, sent_len):
|
||||
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
||||
n = 1
|
||||
text = [token.text]
|
||||
while token.nbor(n).check_morph(Fused_inside):
|
||||
text.append(token.nbor(n).text)
|
||||
n += 1
|
||||
id_ = "%d-%d" % (k + 1, (k + n))
|
||||
fields = [id_, "".join(text)] + ["_"] * 8
|
||||
lines = ["\t".join(fields)]
|
||||
else:
|
||||
lines = []
|
||||
if token.head.i == token.i:
|
||||
head = 0
|
||||
else:
|
||||
head = k + (token.head.i - token.i) + 1
|
||||
fields = [
|
||||
str(k + 1),
|
||||
token.text,
|
||||
token.lemma_,
|
||||
token.pos_,
|
||||
token.tag_,
|
||||
"_",
|
||||
str(head),
|
||||
token.dep_.lower(),
|
||||
"_",
|
||||
"_",
|
||||
]
|
||||
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
||||
if k == 0:
|
||||
fields[1] = token.norm_[0].upper() + token.norm_[1:]
|
||||
else:
|
||||
fields[1] = token.norm_
|
||||
elif token.check_morph(Fused_inside):
|
||||
fields[1] = token.norm_
|
||||
elif token._.split_start is not None:
|
||||
split_start = token._.split_start
|
||||
split_end = token._.split_end
|
||||
split_len = (split_end.i - split_start.i) + 1
|
||||
n_in_split = token.i - split_start.i
|
||||
subtokens = guess_fused_orths(split_start.text, [""] * split_len)
|
||||
fields[1] = subtokens[n_in_split]
|
||||
|
||||
lines.append("\t".join(fields))
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def guess_fused_orths(word, ud_forms):
|
||||
"""The UD data 'fused tokens' don't necessarily expand to keys that match
|
||||
the form. We need orths that exactly match the string. Here we make a best
|
||||
effort to divide up the word."""
|
||||
if word == "".join(ud_forms):
|
||||
# Happy case: we get a perfect split, with each letter accounted for.
|
||||
return ud_forms
|
||||
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
|
||||
# Not ideal, but at least the lengths match.
|
||||
output = []
|
||||
remain = word
|
||||
for subtoken in ud_forms:
|
||||
assert len(subtoken) >= 1
|
||||
output.append(remain[: len(subtoken)])
|
||||
remain = remain[len(subtoken) :]
|
||||
assert len(remain) == 0, (word, ud_forms, remain)
|
||||
return output
|
||||
else:
|
||||
# Let's say word is 6 long, and there are three subtokens. The orths
|
||||
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
|
||||
first = word[: len(word) - (len(ud_forms) - 1)]
|
||||
output = [first]
|
||||
remain = word[len(first) :]
|
||||
for i in range(1, len(ud_forms)):
|
||||
assert remain
|
||||
output.append(remain[:1])
|
||||
remain = remain[1:]
|
||||
assert len(remain) == 0, (word, output, remain)
|
||||
return output
|
||||
|
||||
|
||||
def print_results(name, ud_scores):
|
||||
fields = {}
|
||||
if ud_scores is not None:
|
||||
fields.update(
|
||||
{
|
||||
"words": ud_scores["Words"].f1 * 100,
|
||||
"sents": ud_scores["Sentences"].f1 * 100,
|
||||
"tags": ud_scores["XPOS"].f1 * 100,
|
||||
"uas": ud_scores["UAS"].f1 * 100,
|
||||
"las": ud_scores["LAS"].f1 * 100,
|
||||
}
|
||||
)
|
||||
else:
|
||||
fields.update({"words": 0.0, "sents": 0.0, "tags": 0.0, "uas": 0.0, "las": 0.0})
|
||||
tpl = "\t".join(
|
||||
(name, "{las:.1f}", "{uas:.1f}", "{tags:.1f}", "{sents:.1f}", "{words:.1f}")
|
||||
)
|
||||
print(tpl.format(**fields))
|
||||
return fields
|
||||
|
||||
|
||||
def get_token_split_start(token):
|
||||
if token.text == "":
|
||||
assert token.i != 0
|
||||
i = -1
|
||||
while token.nbor(i).text == "":
|
||||
i -= 1
|
||||
return token.nbor(i)
|
||||
elif (token.i + 1) < len(token.doc) and token.nbor(1).text == "":
|
||||
return token
|
||||
else:
|
||||
return None
|
||||
|
||||
|
||||
def get_token_split_end(token):
|
||||
if (token.i + 1) == len(token.doc):
|
||||
return token if token.text == "" else None
|
||||
elif token.text != "" and token.nbor(1).text != "":
|
||||
return None
|
||||
i = 1
|
||||
while (token.i + i) < len(token.doc) and token.nbor(i).text == "":
|
||||
i += 1
|
||||
return token.nbor(i - 1)
|
||||
|
||||
|
||||
##################
|
||||
# Initialization #
|
||||
##################
|
||||
|
||||
|
||||
def load_nlp(experiments_dir, corpus):
|
||||
nlp = spacy.load(experiments_dir / corpus / "best-model")
|
||||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
return nlp
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
test_data_dir=(
|
||||
"Path to Universal Dependencies test data",
|
||||
"positional",
|
||||
None,
|
||||
Path,
|
||||
),
|
||||
experiment_dir=("Parent directory with output model", "positional", None, Path),
|
||||
corpus=(
|
||||
"UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
)
|
||||
def main(test_data_dir, experiment_dir, corpus):
|
||||
Token.set_extension("split_start", getter=get_token_split_start)
|
||||
Token.set_extension("split_end", getter=get_token_split_end)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
lang.ru.Russian.Defaults.use_pymorphy2 = False
|
||||
|
||||
nlp = load_nlp(experiment_dir, corpus)
|
||||
|
||||
treebank_code = nlp.meta["treebank"]
|
||||
for section in ("test", "dev"):
|
||||
if section == "dev":
|
||||
section_dir = "conll17-ud-development-2017-03-19"
|
||||
else:
|
||||
section_dir = "conll17-ud-test-2017-05-09"
|
||||
text_path = test_data_dir / "input" / section_dir / (treebank_code + ".txt")
|
||||
udpipe_path = (
|
||||
test_data_dir / "input" / section_dir / (treebank_code + "-udpipe.conllu")
|
||||
)
|
||||
gold_path = test_data_dir / "gold" / section_dir / (treebank_code + ".conllu")
|
||||
|
||||
header = [section, "LAS", "UAS", "TAG", "SENT", "WORD"]
|
||||
print("\t".join(header))
|
||||
inputs = {"gold": gold_path, "udp": udpipe_path, "raw": text_path}
|
||||
for input_type in ("udp", "raw"):
|
||||
input_path = inputs[input_type]
|
||||
output_path = (
|
||||
experiment_dir / corpus / "{section}.conllu".format(section=section)
|
||||
)
|
||||
|
||||
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
|
||||
|
||||
accuracy = print_results(input_type, test_scores)
|
||||
acc_path = (
|
||||
experiment_dir
|
||||
/ corpus
|
||||
/ "{section}-accuracy.json".format(section=section)
|
||||
)
|
||||
srsly.write_json(acc_path, accuracy)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -1,570 +0,0 @@
|
|||
# flake8: noqa
|
||||
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||
.conllu format for development data, allowing the official scorer to be used.
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
import re
|
||||
import json
|
||||
import tqdm
|
||||
|
||||
import spacy
|
||||
import spacy.util
|
||||
from bin.ud import conll17_ud_eval
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.util import compounding, minibatch, minibatch_by_words
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.matcher import Matcher
|
||||
from spacy import displacy
|
||||
from collections import defaultdict
|
||||
|
||||
import random
|
||||
|
||||
from spacy import lang
|
||||
from spacy.lang import zh
|
||||
from spacy.lang import ja
|
||||
|
||||
try:
|
||||
import torch
|
||||
except ImportError:
|
||||
torch = None
|
||||
|
||||
|
||||
################
|
||||
# Data reading #
|
||||
################
|
||||
|
||||
space_re = re.compile(r"\s+")
|
||||
|
||||
|
||||
def split_text(text):
|
||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||
|
||||
|
||||
def read_data(
|
||||
nlp,
|
||||
conllu_file,
|
||||
text_file,
|
||||
raw_text=True,
|
||||
oracle_segments=False,
|
||||
max_doc_length=None,
|
||||
limit=None,
|
||||
):
|
||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||
include Doc objects created using nlp.make_doc and then aligned against
|
||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||
created from the gold-standard segments. At least one must be True."""
|
||||
if not raw_text and not oracle_segments:
|
||||
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
||||
paragraphs = split_text(text_file.read())
|
||||
conllu = read_conllu(conllu_file)
|
||||
# sd is spacy doc; cd is conllu doc
|
||||
# cs is conllu sent, ct is conllu token
|
||||
docs = []
|
||||
golds = []
|
||||
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
||||
sent_annots = []
|
||||
for cs in cd:
|
||||
sent = defaultdict(list)
|
||||
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
||||
if "." in id_:
|
||||
continue
|
||||
if "-" in id_:
|
||||
continue
|
||||
id_ = int(id_) - 1
|
||||
head = int(head) - 1 if head != "0" else id_
|
||||
sent["words"].append(word)
|
||||
sent["tags"].append(tag)
|
||||
sent["morphology"].append(_parse_morph_string(morph))
|
||||
sent["morphology"][-1].add("POS_%s" % pos)
|
||||
sent["heads"].append(head)
|
||||
sent["deps"].append("ROOT" if dep == "root" else dep)
|
||||
sent["spaces"].append(space_after == "_")
|
||||
sent["entities"] = ["-"] * len(sent["words"])
|
||||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||
if oracle_segments:
|
||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||
golds.append(GoldParse(docs[-1], **sent))
|
||||
assert golds[-1].morphology is not None
|
||||
|
||||
sent_annots.append(sent)
|
||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
assert gold.morphology is not None
|
||||
sent_annots = []
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
|
||||
if raw_text and sent_annots:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return docs, golds
|
||||
|
||||
def _parse_morph_string(morph_string):
|
||||
if morph_string == '_':
|
||||
return set()
|
||||
output = []
|
||||
replacements = {'1': 'one', '2': 'two', '3': 'three'}
|
||||
for feature in morph_string.split('|'):
|
||||
key, value = feature.split('=')
|
||||
value = replacements.get(value, value)
|
||||
value = value.split(',')[0]
|
||||
output.append('%s_%s' % (key, value.lower()))
|
||||
return set(output)
|
||||
|
||||
def read_conllu(file_):
|
||||
docs = []
|
||||
sent = []
|
||||
doc = []
|
||||
for line in file_:
|
||||
if line.startswith("# newdoc"):
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
doc = []
|
||||
elif line.startswith("#"):
|
||||
continue
|
||||
elif not line.strip():
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
sent = []
|
||||
else:
|
||||
sent.append(list(line.strip().split("\t")))
|
||||
if len(sent[-1]) != 10:
|
||||
print(repr(line))
|
||||
raise ValueError
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
||||
|
||||
def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
|
||||
# Flatten the conll annotations, and adjust the head indices
|
||||
flat = defaultdict(list)
|
||||
sent_starts = []
|
||||
for sent in sent_annots:
|
||||
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"])
|
||||
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]:
|
||||
flat[field].extend(sent[field])
|
||||
sent_starts.append(True)
|
||||
sent_starts.extend([False] * (len(sent["words"]) - 1))
|
||||
# Construct text if necessary
|
||||
assert len(flat["words"]) == len(flat["spaces"])
|
||||
if text is None:
|
||||
text = "".join(
|
||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
||||
)
|
||||
doc = nlp.make_doc(text)
|
||||
flat.pop("spaces")
|
||||
gold = GoldParse(doc, **flat)
|
||||
gold.sent_starts = sent_starts
|
||||
for i in range(len(gold.heads)):
|
||||
if random.random() < drop_deps:
|
||||
gold.heads[i] = None
|
||||
gold.labels[i] = None
|
||||
|
||||
return doc, gold
|
||||
|
||||
|
||||
#############################
|
||||
# Data transforms for spaCy #
|
||||
#############################
|
||||
|
||||
|
||||
def golds_to_gold_tuples(docs, golds):
|
||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
tuples = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
text = doc.text
|
||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
||||
tuples.append((text, sents))
|
||||
return tuples
|
||||
|
||||
|
||||
##############
|
||||
# Evaluation #
|
||||
##############
|
||||
|
||||
|
||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||
if text_loc.parts[-1].endswith(".conllu"):
|
||||
docs = []
|
||||
with text_loc.open(encoding="utf8") as file_:
|
||||
for conllu_doc in read_conllu(file_):
|
||||
for conllu_sent in conllu_doc:
|
||||
words = [line[1] for line in conllu_sent]
|
||||
docs.append(Doc(nlp.vocab, words=words))
|
||||
for name, component in nlp.pipeline:
|
||||
docs = list(component.pipe(docs))
|
||||
else:
|
||||
with text_loc.open("r", encoding="utf8") as text_file:
|
||||
texts = split_text(text_file.read())
|
||||
docs = list(nlp.pipe(texts))
|
||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
||||
write_conllu(docs, out_file)
|
||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||
return docs, scores
|
||||
|
||||
|
||||
def write_conllu(docs, file_):
|
||||
if not Token.has_extension("get_conllu_lines"):
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
if not Token.has_extension("begins_fused"):
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
if not Token.has_extension("inside_fused"):
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
merger = Matcher(docs[0].vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||
for i, doc in enumerate(docs):
|
||||
matches = []
|
||||
if doc.is_parsed:
|
||||
matches = merger(doc)
|
||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||
seen_tokens = set()
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in spans:
|
||||
span_tokens = set(range(span.start, span.end))
|
||||
if not span_tokens.intersection(seen_tokens):
|
||||
retokenizer.merge(span)
|
||||
seen_tokens.update(span_tokens)
|
||||
|
||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||
for j, sent in enumerate(doc.sents):
|
||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||
file_.write("# text = {text}\n".format(text=sent.text))
|
||||
for k, token in enumerate(sent):
|
||||
if token.head.i > sent[-1].i or token.head.i < sent[0].i:
|
||||
for word in doc[sent[0].i - 10 : sent[0].i]:
|
||||
print(word.i, word.head.i, word.text, word.dep_)
|
||||
for word in sent:
|
||||
print(word.i, word.head.i, word.text, word.dep_)
|
||||
for word in doc[sent[-1].i : sent[-1].i + 10]:
|
||||
print(word.i, word.head.i, word.text, word.dep_)
|
||||
raise ValueError(
|
||||
"Invalid parse: head outside sentence (%s)" % token.text
|
||||
)
|
||||
file_.write(token._.get_conllu_lines(k) + "\n")
|
||||
file_.write("\n")
|
||||
|
||||
|
||||
def print_progress(itn, losses, ud_scores):
|
||||
fields = {
|
||||
"dep_loss": losses.get("parser", 0.0),
|
||||
"morph_loss": losses.get("morphologizer", 0.0),
|
||||
"tag_loss": losses.get("tagger", 0.0),
|
||||
"words": ud_scores["Words"].f1 * 100,
|
||||
"sents": ud_scores["Sentences"].f1 * 100,
|
||||
"tags": ud_scores["XPOS"].f1 * 100,
|
||||
"uas": ud_scores["UAS"].f1 * 100,
|
||||
"las": ud_scores["LAS"].f1 * 100,
|
||||
"morph": ud_scores["Feats"].f1 * 100,
|
||||
}
|
||||
header = ["Epoch", "P.Loss", "M.Loss", "LAS", "UAS", "TAG", "MORPH", "SENT", "WORD"]
|
||||
if itn == 0:
|
||||
print("\t".join(header))
|
||||
tpl = "\t".join((
|
||||
"{:d}",
|
||||
"{dep_loss:.1f}",
|
||||
"{morph_loss:.1f}",
|
||||
"{las:.1f}",
|
||||
"{uas:.1f}",
|
||||
"{tags:.1f}",
|
||||
"{morph:.1f}",
|
||||
"{sents:.1f}",
|
||||
"{words:.1f}",
|
||||
))
|
||||
print(tpl.format(itn, **fields))
|
||||
|
||||
|
||||
# def get_sent_conllu(sent, sent_id):
|
||||
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
|
||||
|
||||
|
||||
def get_token_conllu(token, i):
|
||||
if token._.begins_fused:
|
||||
n = 1
|
||||
while token.nbor(n)._.inside_fused:
|
||||
n += 1
|
||||
id_ = "%d-%d" % (i + 1, i + n)
|
||||
lines = ["\t".join([id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"])]
|
||||
else:
|
||||
lines = []
|
||||
if token.head.i == token.i:
|
||||
head = 0
|
||||
else:
|
||||
head = i + (token.head.i - token.i) + 1
|
||||
features = list(token.morph)
|
||||
feat_str = []
|
||||
replacements = {"one": "1", "two": "2", "three": "3"}
|
||||
for feat in features:
|
||||
if not feat.startswith("begin") and not feat.startswith("end"):
|
||||
key, value = feat.split("_", 1)
|
||||
value = replacements.get(value, value)
|
||||
feat_str.append("%s=%s" % (key, value.title()))
|
||||
if not feat_str:
|
||||
feat_str = "_"
|
||||
else:
|
||||
feat_str = "|".join(feat_str)
|
||||
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, feat_str,
|
||||
str(head), token.dep_.lower(), "_", "_"]
|
||||
lines.append("\t".join(fields))
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
|
||||
##################
|
||||
# Initialization #
|
||||
##################
|
||||
|
||||
|
||||
def load_nlp(corpus, config, vectors=None):
|
||||
lang = corpus.split("_")[0]
|
||||
nlp = spacy.blank(lang)
|
||||
if config.vectors:
|
||||
if not vectors:
|
||||
raise ValueError(
|
||||
"config asks for vectors, but no vectors "
|
||||
"directory set on command line (use -v)"
|
||||
)
|
||||
if (Path(vectors) / corpus).exists():
|
||||
nlp.vocab.from_disk(Path(vectors) / corpus / "vocab")
|
||||
nlp.meta["treebank"] = corpus
|
||||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
||||
nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False}))
|
||||
nlp.add_pipe(nlp.create_pipe("morphologizer"))
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
if config.multitask_tag:
|
||||
nlp.parser.add_multitask_objective("tag")
|
||||
if config.multitask_sent:
|
||||
nlp.parser.add_multitask_objective("sent_start")
|
||||
for gold in golds:
|
||||
for tag in gold.tags:
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
if torch is not None and device != -1:
|
||||
torch.set_default_tensor_type("torch.cuda.FloatTensor")
|
||||
optimizer = nlp.begin_training(
|
||||
lambda: golds_to_gold_tuples(docs, golds),
|
||||
device=device,
|
||||
subword_features=config.subword_features,
|
||||
conv_depth=config.conv_depth,
|
||||
bilstm_depth=config.bilstm_depth,
|
||||
)
|
||||
if config.pretrained_tok2vec:
|
||||
_load_pretrained_tok2vec(nlp, config.pretrained_tok2vec)
|
||||
return optimizer
|
||||
|
||||
|
||||
def _load_pretrained_tok2vec(nlp, loc):
|
||||
"""Load pretrained weights for the 'token-to-vector' part of the component
|
||||
models, which is typically a CNN. See 'spacy pretrain'. Experimental.
|
||||
"""
|
||||
with Path(loc).open("rb") as file_:
|
||||
weights_data = file_.read()
|
||||
loaded = []
|
||||
for name, component in nlp.pipeline:
|
||||
if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
|
||||
component.tok2vec.from_bytes(weights_data)
|
||||
loaded.append(name)
|
||||
return loaded
|
||||
|
||||
|
||||
########################
|
||||
# Command line helpers #
|
||||
########################
|
||||
|
||||
|
||||
class Config(object):
|
||||
def __init__(
|
||||
self,
|
||||
vectors=None,
|
||||
max_doc_length=10,
|
||||
multitask_tag=False,
|
||||
multitask_sent=False,
|
||||
multitask_dep=False,
|
||||
multitask_vectors=None,
|
||||
bilstm_depth=0,
|
||||
nr_epoch=30,
|
||||
min_batch_size=100,
|
||||
max_batch_size=1000,
|
||||
batch_by_words=True,
|
||||
dropout=0.2,
|
||||
conv_depth=4,
|
||||
subword_features=True,
|
||||
vectors_dir=None,
|
||||
pretrained_tok2vec=None,
|
||||
):
|
||||
if vectors_dir is not None:
|
||||
if vectors is None:
|
||||
vectors = True
|
||||
if multitask_vectors is None:
|
||||
multitask_vectors = True
|
||||
for key, value in locals().items():
|
||||
setattr(self, key, value)
|
||||
|
||||
@classmethod
|
||||
def load(cls, loc, vectors_dir=None):
|
||||
with Path(loc).open("r", encoding="utf8") as file_:
|
||||
cfg = json.load(file_)
|
||||
if vectors_dir is not None:
|
||||
cfg["vectors_dir"] = vectors_dir
|
||||
return cls(**cfg)
|
||||
|
||||
|
||||
class Dataset(object):
|
||||
def __init__(self, path, section):
|
||||
self.path = path
|
||||
self.section = section
|
||||
self.conllu = None
|
||||
self.text = None
|
||||
for file_path in self.path.iterdir():
|
||||
name = file_path.parts[-1]
|
||||
if section in name and name.endswith("conllu"):
|
||||
self.conllu = file_path
|
||||
elif section in name and name.endswith("txt"):
|
||||
self.text = file_path
|
||||
if self.conllu is None:
|
||||
msg = "Could not find .conllu file in {path} for {section}"
|
||||
raise IOError(msg.format(section=section, path=path))
|
||||
if self.text is None:
|
||||
raise IOError("Could not find .txt file in {path} for {section}".format(section=section, path=path))
|
||||
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
|
||||
|
||||
|
||||
class TreebankPaths(object):
|
||||
def __init__(self, ud_path, treebank, **cfg):
|
||||
self.train = Dataset(ud_path / treebank, "train")
|
||||
self.dev = Dataset(ud_path / treebank, "dev")
|
||||
self.lang = self.train.lang
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
||||
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
||||
corpus=(
|
||||
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
config=("Path to json formatted config file", "option", "C", Path),
|
||||
limit=("Size limit", "option", "n", int),
|
||||
gpu_device=("Use GPU", "option", "g", int),
|
||||
use_oracle_segments=("Use oracle segments", "flag", "G", int),
|
||||
vectors_dir=(
|
||||
"Path to directory with pretrained vectors, named e.g. en/",
|
||||
"option",
|
||||
"v",
|
||||
Path,
|
||||
),
|
||||
)
|
||||
def main(
|
||||
ud_dir,
|
||||
parses_dir,
|
||||
corpus,
|
||||
config=None,
|
||||
limit=0,
|
||||
gpu_device=-1,
|
||||
vectors_dir=None,
|
||||
use_oracle_segments=False,
|
||||
):
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
spacy.util.fix_random_seed()
|
||||
lang.zh.Chinese.Defaults.use_jieba = False
|
||||
lang.ja.Japanese.Defaults.use_janome = False
|
||||
|
||||
if config is not None:
|
||||
config = Config.load(config, vectors_dir=vectors_dir)
|
||||
else:
|
||||
config = Config(vectors_dir=vectors_dir)
|
||||
paths = TreebankPaths(ud_dir, corpus)
|
||||
if not (parses_dir / corpus).exists():
|
||||
(parses_dir / corpus).mkdir()
|
||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
|
||||
|
||||
docs, golds = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
max_doc_length=config.max_doc_length,
|
||||
limit=limit,
|
||||
)
|
||||
|
||||
optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device)
|
||||
|
||||
batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001)
|
||||
beam_prob = compounding(0.2, 0.8, 1.001)
|
||||
for i in range(config.nr_epoch):
|
||||
docs, golds = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
max_doc_length=config.max_doc_length,
|
||||
limit=limit,
|
||||
oracle_segments=use_oracle_segments,
|
||||
raw_text=not use_oracle_segments,
|
||||
)
|
||||
Xs = list(zip(docs, golds))
|
||||
random.shuffle(Xs)
|
||||
if config.batch_by_words:
|
||||
batches = minibatch_by_words(Xs, size=batch_sizes)
|
||||
else:
|
||||
batches = minibatch(Xs, size=batch_sizes)
|
||||
losses = {}
|
||||
n_train_words = sum(len(doc) for doc in docs)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
batch_docs, batch_gold = zip(*batch)
|
||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
||||
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
|
||||
nlp.update(
|
||||
batch_docs,
|
||||
batch_gold,
|
||||
sgd=optimizer,
|
||||
drop=config.dropout,
|
||||
losses=losses,
|
||||
)
|
||||
|
||||
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
if use_oracle_segments:
|
||||
parsed_docs, scores = evaluate(nlp, paths.dev.conllu,
|
||||
paths.dev.conllu, out_path)
|
||||
else:
|
||||
parsed_docs, scores = evaluate(nlp, paths.dev.text,
|
||||
paths.dev.conllu, out_path)
|
||||
print_progress(i, losses, scores)
|
||||
|
||||
|
||||
def _render_parses(i, to_render):
|
||||
to_render[0].user_data["title"] = "Batch %d" % i
|
||||
with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
|
||||
html = displacy.render(to_render[:5], style="dep", page=True)
|
||||
file_.write(html)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -1,19 +0,0 @@
|
|||
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
|
||||
|
||||
# spaCy examples
|
||||
|
||||
The examples are Python scripts with well-behaved command line interfaces. For
|
||||
more detailed usage guides, see the [documentation](https://spacy.io/usage/).
|
||||
|
||||
To see the available arguments, you can use the `--help` or `-h` flag:
|
||||
|
||||
```bash
|
||||
$ python examples/training/train_ner.py --help
|
||||
```
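
As a rough illustration of where a `--help` listing comes from, here is a minimal sketch of a script with a small command line interface. It uses the standard library's `argparse` rather than whatever the individual example scripts use, and the argument names (`output_dir`, `--n-iter`) are hypothetical, chosen only for this sketch:

```python
import argparse


def build_parser():
    # Hypothetical arguments, for illustration only.
    parser = argparse.ArgumentParser(
        description="Train a toy model (illustration only)."
    )
    parser.add_argument("output_dir", help="Directory to write the model to")
    parser.add_argument(
        "-n", "--n-iter", type=int, default=10, help="Number of training iterations"
    )
    return parser


def main(argv=None):
    # argparse adds the -h/--help flag automatically; passing argv=None
    # makes it read the arguments from sys.argv.
    args = build_parser().parse_args(argv)
    print("Would train for", args.n_iter, "iterations into", args.output_dir)
    return args
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard, and running the script with `--help` would print the description and the per-argument help strings defined above.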
|
||||
|
||||
While we try to keep the examples up to date, they are not currently exercised
|
||||
by the test suite, as some of them require significant data downloads or take
|
||||
time to train. If you find that an example is no longer running,
|
||||
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
|
||||
nothing worse than trying to figure out what you're doing wrong, and it turns
|
||||
out your code was never the problem.
|
|
@ -1,267 +0,0 @@
|
|||
"""
|
||||
This example shows how to use an LSTM sentiment classification model trained
|
||||
using Keras in spaCy. spaCy splits the document into sentences, and each
|
||||
sentence is classified using the LSTM. The scores for the sentences are then
|
||||
aggregated to give the document score. This kind of hierarchical model is quite
|
||||
difficult in "pure" Keras or TensorFlow, but it's very effective. The Keras
|
||||
example on this dataset performs quite poorly, because it cuts off the documents
|
||||
so that they're a fixed size. This hurts review accuracy a lot, because people
|
||||
often summarise their rating in the final sentence.
|
||||
|
||||
Prerequisites:
|
||||
spacy download en_vectors_web_lg
|
||||
pip install keras==2.0.9
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
|
||||
import plac
|
||||
import random
|
||||
import pathlib
|
||||
import cytoolz
|
||||
import numpy
|
||||
from keras.models import Sequential, model_from_json
|
||||
from keras.layers import LSTM, Dense, Embedding, Bidirectional
|
||||
from keras.layers import TimeDistributed
|
||||
from keras.optimizers import Adam
|
||||
import thinc.extra.datasets
|
||||
from spacy.compat import pickle
|
||||
import spacy
|
||||
|
||||
|
||||
class SentimentAnalyser(object):
|
||||
@classmethod
|
||||
def load(cls, path, nlp, max_length=100):
|
||||
with (path / "config.json").open() as file_:
|
||||
model = model_from_json(file_.read())
|
||||
with (path / "model").open("rb") as file_:
|
||||
lstm_weights = pickle.load(file_)
|
||||
embeddings = get_embeddings(nlp.vocab)
|
||||
model.set_weights([embeddings] + lstm_weights)
|
||||
return cls(model, max_length=max_length)
|
||||
|
||||
def __init__(self, model, max_length=100):
|
||||
self._model = model
|
||||
self.max_length = max_length
|
||||
|
||||
def __call__(self, doc):
|
||||
X = get_features([doc], self.max_length)
|
||||
y = self._model.predict(X)
|
||||
self.set_sentiment(doc, y)
|
||||
|
||||
def pipe(self, docs, batch_size=1000):
|
||||
for minibatch in cytoolz.partition_all(batch_size, docs):
|
||||
minibatch = list(minibatch)
|
||||
sentences = []
|
||||
for doc in minibatch:
|
||||
sentences.extend(doc.sents)
|
||||
Xs = get_features(sentences, self.max_length)
|
||||
ys = self._model.predict(Xs)
|
||||
for sent, label in zip(sentences, ys):
|
||||
sent.doc.sentiment += label - 0.5
|
||||
for doc in minibatch:
|
||||
yield doc
|
||||
|
||||
def set_sentiment(self, doc, y):
|
||||
doc.sentiment = float(y[0])
|
||||
# Sentiment has a native slot for a single float.
|
||||
# For arbitrary data storage, there's:
|
||||
# doc.user_data['my_data'] = y
|
||||
|
||||
|
||||
def get_labelled_sentences(docs, doc_labels):
|
||||
labels = []
|
||||
sentences = []
|
||||
for doc, y in zip(docs, doc_labels):
|
||||
for sent in doc.sents:
|
||||
sentences.append(sent)
|
||||
labels.append(y)
|
||||
return sentences, numpy.asarray(labels, dtype="int32")
|
||||
|
||||
|
||||
def get_features(docs, max_length):
|
||||
docs = list(docs)
|
||||
Xs = numpy.zeros((len(docs), max_length), dtype="int32")
|
||||
for i, doc in enumerate(docs):
|
||||
j = 0
|
||||
for token in doc:
|
||||
vector_id = token.vocab.vectors.find(key=token.orth)
|
||||
if vector_id >= 0:
|
||||
Xs[i, j] = vector_id
|
||||
else:
|
||||
Xs[i, j] = 0
|
||||
j += 1
|
||||
if j >= max_length:
|
||||
break
|
||||
return Xs
|
||||
|
||||
|
||||
def train(
|
||||
train_texts,
|
||||
train_labels,
|
||||
dev_texts,
|
||||
dev_labels,
|
||||
lstm_shape,
|
||||
lstm_settings,
|
||||
lstm_optimizer,
|
||||
batch_size=100,
|
||||
nb_epoch=5,
|
||||
by_sentence=True,
|
||||
):
|
||||
|
||||
print("Loading spaCy")
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
embeddings = get_embeddings(nlp.vocab)
|
||||
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
|
||||
|
||||
print("Parsing texts...")
|
||||
train_docs = list(nlp.pipe(train_texts))
|
||||
dev_docs = list(nlp.pipe(dev_texts))
|
||||
if by_sentence:
|
||||
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
|
||||
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
|
||||
|
||||
train_X = get_features(train_docs, lstm_shape["max_length"])
|
||||
dev_X = get_features(dev_docs, lstm_shape["max_length"])
|
||||
model.fit(
|
||||
train_X,
|
||||
train_labels,
|
||||
validation_data=(dev_X, dev_labels),
|
||||
epochs=nb_epoch,
|
||||
batch_size=batch_size,
|
||||
)
|
||||
return model
|
||||
|
||||
|
||||
def compile_lstm(embeddings, shape, settings):
|
||||
model = Sequential()
|
||||
model.add(
|
||||
Embedding(
|
||||
embeddings.shape[0],
|
||||
embeddings.shape[1],
|
||||
input_length=shape["max_length"],
|
||||
trainable=False,
|
||||
weights=[embeddings],
|
||||
mask_zero=True,
|
||||
)
|
||||
)
|
||||
model.add(TimeDistributed(Dense(shape["nr_hidden"], use_bias=False)))
|
||||
model.add(
|
||||
Bidirectional(
|
||||
LSTM(
|
||||
shape["nr_hidden"],
|
||||
recurrent_dropout=settings["dropout"],
|
||||
dropout=settings["dropout"],
|
||||
)
|
||||
)
|
||||
)
|
||||
model.add(Dense(shape["nr_class"], activation="sigmoid"))
|
||||
model.compile(
|
||||
optimizer=Adam(lr=settings["lr"]),
|
||||
loss="binary_crossentropy",
|
||||
metrics=["accuracy"],
|
||||
)
|
||||
return model
|
||||
|
||||
|
||||
def get_embeddings(vocab):
|
||||
return vocab.vectors.data
|
||||
|
||||
|
||||
def evaluate(model_dir, texts, labels, max_length=100):
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
|
||||
|
||||
correct = 0
|
||||
i = 0
|
||||
for doc in nlp.pipe(texts, batch_size=1000):
|
||||
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
||||
i += 1
|
||||
return float(correct) / i
|
||||
|
||||
|
||||
def read_data(data_dir, limit=0):
|
||||
examples = []
|
||||
for subdir, label in (("pos", 1), ("neg", 0)):
|
||||
for filename in (pathlib.Path(data_dir) / subdir).iterdir():
|
||||
with filename.open() as file_:
|
||||
text = file_.read()
|
||||
examples.append((text, label))
|
||||
random.shuffle(examples)
|
||||
if limit >= 1:
|
||||
examples = examples[:limit]
|
||||
return zip(*examples) # Unzips into two lists
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
train_dir=("Location of training file or directory"),
|
||||
dev_dir=("Location of development file or directory"),
|
||||
model_dir=("Location of output model directory",),
|
||||
is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
|
||||
nr_hidden=("Number of hidden units", "option", "H", int),
|
||||
max_length=("Maximum sentence length", "option", "L", int),
|
||||
dropout=("Dropout", "option", "d", float),
|
||||
learn_rate=("Learn rate", "option", "e", float),
|
||||
nb_epoch=("Number of training epochs", "option", "i", int),
|
||||
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
|
||||
nr_examples=("Limit to N examples", "option", "n", int),
|
||||
)
|
||||
def main(
|
||||
model_dir=None,
|
||||
train_dir=None,
|
||||
dev_dir=None,
|
||||
is_runtime=False,
|
||||
nr_hidden=64,
|
||||
max_length=100, # Shape
|
||||
dropout=0.5,
|
||||
learn_rate=0.001, # General NN config
|
||||
nb_epoch=5,
|
||||
batch_size=256,
|
||||
nr_examples=-1,
|
||||
): # Training params
|
||||
if model_dir is not None:
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if train_dir is None or dev_dir is None:
|
||||
imdb_data = thinc.extra.datasets.imdb()
|
||||
if is_runtime:
|
||||
if dev_dir is None:
|
||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||
else:
|
||||
dev_texts, dev_labels = read_data(dev_dir)
|
||||
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
|
||||
print(acc)
|
||||
else:
|
||||
if train_dir is None:
|
||||
train_texts, train_labels = zip(*imdb_data[0])
|
||||
else:
|
||||
print("Read data")
|
||||
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
|
||||
if dev_dir is None:
|
||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||
else:
|
||||
dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
|
||||
train_labels = numpy.asarray(train_labels, dtype="int32")
|
||||
dev_labels = numpy.asarray(dev_labels, dtype="int32")
|
||||
lstm = train(
|
||||
train_texts,
|
||||
train_labels,
|
||||
dev_texts,
|
||||
dev_labels,
|
||||
{"nr_hidden": nr_hidden, "max_length": max_length, "nr_class": 1},
|
||||
{"dropout": dropout, "lr": learn_rate},
|
||||
{},
|
||||
nb_epoch=nb_epoch,
|
||||
batch_size=batch_size,
|
||||
)
|
||||
weights = lstm.get_weights()
|
||||
if model_dir is not None:
|
||||
with (model_dir / "model").open("wb") as file_:
|
||||
pickle.dump(weights[1:], file_)
|
||||
with (model_dir / "config.json").open("w") as file_:
|
||||
file_.write(lstm.to_json())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -1,82 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""A simple example of extracting relations between phrases and entities using
|
||||
spaCy's named entity recognizer and the dependency parse. Here, we extract
|
||||
money and currency values (entities labelled as MONEY) and then check the
|
||||
dependency tree to find the noun phrase they are referring to – for example:
|
||||
$9.4 million --> Net income.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.2.1
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
TEXTS = [
|
||||
"Net income was $9.4 million compared to the prior year of $2.7 million.",
|
||||
"Revenue exceeded twelve billion dollars, with a loss of $1b.",
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model to load (needs parser and NER)", "positional", None, str)
|
||||
)
|
||||
def main(model="en_core_web_sm"):
|
||||
nlp = spacy.load(model)
|
||||
print("Loaded model '%s'" % model)
|
||||
print("Processing %d texts" % len(TEXTS))
|
||||
|
||||
for text in TEXTS:
|
||||
doc = nlp(text)
|
||||
relations = extract_currency_relations(doc)
|
||||
for r1, r2 in relations:
|
||||
print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))
|
||||
|
||||
|
||||
def filter_spans(spans):
|
||||
# Filter a sequence of spans so they don't contain overlaps
|
||||
# For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
|
||||
get_sort_key = lambda span: (span.end - span.start, -span.start)
|
||||
sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
|
||||
result = []
|
||||
seen_tokens = set()
|
||||
for span in sorted_spans:
|
||||
# Check for end - 1 here because boundaries are inclusive
|
||||
if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
|
||||
result.append(span)
|
||||
seen_tokens.update(range(span.start, span.end))
|
||||
result = sorted(result, key=lambda span: span.start)
|
||||
return result
|
||||
|
||||
|
||||
def extract_currency_relations(doc):
|
||||
# Merge entities and noun chunks into one token
|
||||
spans = list(doc.ents) + list(doc.noun_chunks)
|
||||
spans = filter_spans(spans)
|
||||
with doc.retokenize() as retokenizer:
|
||||
for span in spans:
|
||||
retokenizer.merge(span)
|
||||
|
||||
relations = []
|
||||
for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
|
||||
if money.dep_ in ("attr", "dobj"):
|
||||
subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
|
||||
if subject:
|
||||
subject = subject[0]
|
||||
relations.append((subject, money))
|
||||
elif money.dep_ == "pobj" and money.head.dep_ == "prep":
|
||||
relations.append((money.head.head, money))
|
||||
return relations
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Net income MONEY $9.4 million
|
||||
# the prior year MONEY $2.7 million
|
||||
# Revenue MONEY twelve billion dollars
|
||||
# a loss MONEY 1b
|
|
@ -1,67 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""This example shows how to navigate the parse tree including subtrees
|
||||
attached to a word.
|
||||
|
||||
Based on issue #252:
|
||||
"In the documents and tutorials the main thing I haven't found is
|
||||
examples on how to break sentences down into small sub thoughts/chunks. The
|
||||
noun_chunks is handy, but having examples on using the token.head to find small
|
||||
(near-complete) sentence chunks would be neat. Lets take the example sentence:
|
||||
"displaCy uses CSS and JavaScript to show you how computers understand language"
|
||||
|
||||
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
|
||||
[displaCy] uses CSS and Javascript [to + show]
|
||||
show you how computers understand [language]
|
||||
|
||||
I'm assuming that we can use the token.head to build these groups."
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
@plac.annotations(model=("Model to load", "positional", None, str))
|
||||
def main(model="en_core_web_sm"):
|
||||
nlp = spacy.load(model)
|
||||
print("Loaded model '%s'" % model)
|
||||
|
||||
doc = nlp(
|
||||
"displaCy uses CSS and JavaScript to show you how computers "
|
||||
"understand language"
|
||||
)
|
||||
|
||||
# The easiest way is to find the head of the subtree you want, and then use
|
||||
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
|
||||
# is the one that does what you're asking for most directly:
|
||||
for word in doc:
|
||||
if word.dep_ in ("xcomp", "ccomp"):
|
||||
print("".join(w.text_with_ws for w in word.subtree))
|
||||
|
||||
# It'd probably be better for `word.subtree` to return a `Span` object
|
||||
# instead of a generator over the tokens. If you want the `Span` you can
|
||||
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
|
||||
# object is nice because you can easily get a vector, merge it, etc.
|
||||
for word in doc:
|
||||
if word.dep_ in ("xcomp", "ccomp"):
|
||||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
||||
print(subtree_span.text, "|", subtree_span.root.text)
|
||||
|
||||
# You might also want to select a head, and then select a start and end
|
||||
# position by walking along its children. You could then take the
|
||||
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
|
||||
# a span.
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# to show you how computers understand language
|
||||
# how computers understand language
|
||||
# to show you how computers understand language | show
|
||||
# how computers understand language | understand
|
|
@ -1,112 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Match a large set of multi-word expressions in time independent of the number of patterns.
|
||||
|
||||
The idea is to associate each word in the vocabulary with a tag, noting whether
|
||||
they begin, end, or are inside at least one pattern. An additional tag is used
|
||||
for single-word patterns. Complete patterns are also stored in a hash set.
|
||||
When we process a document, we look up the words in the vocabulary, to
|
||||
associate the words with the tags. We then search for tag-sequences that
|
||||
correspond to valid candidates. Finally, we look up the candidates in the hash
|
||||
set.
|
||||
|
||||
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
|
||||
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
|
||||
the I tag, and Obama and Clinton with the L tag.
|
||||
|
||||
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
|
||||
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
|
||||
candidate is in the phrase dictionary, so only one is returned as a match.
|
||||
|
||||
The algorithm is O(n) at run-time for a document of length n because we're only
|
||||
ever matching over the tag patterns. So no matter how many phrases we're
|
||||
looking for, our pattern set stays very small (exact size depends on the
|
||||
maximum length we're looking for, as the query language currently has no
|
||||
quantifiers).
|
||||
|
||||
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
|
||||
formatted in jsonl as a sequence of entries like this:
|
||||
|
||||
{"text":"Anchorage"}
|
||||
{"text":"Angola"}
|
||||
{"text":"Ann Arbor"}
|
||||
{"text":"Annapolis"}
|
||||
{"text":"Appalachia"}
|
||||
{"text":"Argentina"}
|
||||
|
||||
Reddit comments corpus:
|
||||
* https://files.pushshift.io/reddit/
|
||||
* https://archive.org/details/2015_reddit_comments_corpus
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
|
||||
from bz2 import BZ2File
|
||||
import time
|
||||
import plac
|
||||
import json
|
||||
|
||||
from spacy.matcher import PhraseMatcher
|
||||
import spacy
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
patterns_loc=("Path to gazetteer", "positional", None, str),
|
||||
text_loc=("Path to Reddit corpus file", "positional", None, str),
|
||||
n=("Number of texts to read", "option", "n", int),
|
||||
lang=("Language class to initialise", "option", "l", str),
|
||||
)
|
||||
def main(patterns_loc, text_loc, n=10000, lang="en"):
|
||||
nlp = spacy.blank(lang)
|
||||
nlp.vocab.lex_attr_getters = {}
|
||||
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
|
||||
count = 0
|
||||
t1 = time.time()
|
||||
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
|
||||
count += 1
|
||||
t2 = time.time()
|
||||
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
|
||||
|
||||
|
||||
def read_gazetteer(tokenizer, loc, n=-1):
|
||||
for i, line in enumerate(open(loc)):
|
||||
data = json.loads(line.strip())
|
||||
phrase = tokenizer(data["text"])
|
||||
for w in phrase:
|
||||
_ = tokenizer.vocab[w.text]
|
||||
if len(phrase) >= 2:
|
||||
yield phrase
|
||||
|
||||
|
||||
def read_text(bz2_loc, n=10000):
|
||||
with BZ2File(bz2_loc) as file_:
|
||||
for i, line in enumerate(file_):
|
||||
data = json.loads(line)
|
||||
yield data["body"]
|
||||
if i >= n:
|
||||
break
|
||||
|
||||
|
||||
def get_matches(tokenizer, phrases, texts):
|
||||
matcher = PhraseMatcher(tokenizer.vocab)
|
||||
matcher.add("Phrase", None, *phrases)
|
||||
for text in texts:
|
||||
doc = tokenizer(text)
|
||||
for w in doc:
|
||||
_ = doc.vocab[w.text]
|
||||
matches = matcher(doc)
|
||||
for ent_id, start, end in matches:
|
||||
yield (ent_id, doc[start:end].text)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if False:
|
||||
import cProfile
|
||||
import pstats
|
||||
|
||||
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
||||
s = pstats.Stats("Profile.prof")
|
||||
s.strip_dirs().sort_stats("time").print_stats()
|
||||
else:
|
||||
plac.call(main)
|
|
@ -1,114 +0,0 @@
|
|||
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
|
||||
|
||||
# A decomposable attention model for Natural Language Inference
|
||||
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
|
||||
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
|
||||
|
||||
This directory contains an implementation of the entailment prediction model described
|
||||
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
|
||||
for its competitive performance with very few parameters.
|
||||
|
||||
The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io).
|
||||
Keras is used to build and train the network. spaCy is used to load
|
||||
the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the
|
||||
feature extraction, and help you apply the model at run-time. The following
|
||||
demo code shows how the entailment model can be used at runtime, once the
|
||||
hook is installed to customise the `.similarity()` method of spaCy's `Doc`
|
||||
and `Span` objects:
|
||||
|
||||
```python
|
||||
def demo(shape):
|
||||
nlp = spacy.load('en_vectors_web_lg')
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
||||
|
||||
doc1 = nlp(u'The king of France is bald.')
|
||||
doc2 = nlp(u'France has no king.')
|
||||
|
||||
print("Sentence 1:", doc1)
|
||||
print("Sentence 2:", doc2)
|
||||
|
||||
entailment_type, confidence = doc1.similarity(doc2)
|
||||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
||||
```
|
||||
|
||||
This gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
|
||||
the system has definite opinions about Bertrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
|
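As a rough illustration of where the `(entailment_type, confidence)` pair comes from, here is a minimal sketch that picks the highest-scoring class from one row of softmax output. The label order is an assumption taken from the `LABELS` mapping in `__main__.py`; the helper name `decode_prediction` and the scores are made up for the example and are not part of the shim.

```python
# Label order assumed from the LABELS mapping in __main__.py:
# {"entailment": 0, "contradiction": 1, "neutral": 2}
ENTAILMENT_TYPES = ["entailment", "contradiction", "neutral"]


def decode_prediction(probabilities):
    """Return (label, confidence) for one row of softmax output."""
    # Index of the largest probability (an argmax without numpy).
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ENTAILMENT_TYPES[best], probabilities[best]


# Hypothetical scores, chosen only to show the shape of the result.
label, confidence = decode_prediction([0.15, 0.60, 0.25])
print("Entailment type:", label, "(Confidence:", confidence, ")")
```

The confidence is simply the probability of the winning class, so it says nothing about how close the runner-up was.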
||||
|
||||
I'm working on a blog post to explain Parikh et al.'s model in more detail.
|
||||
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
|
||||
I think it is a very interesting example of the attention mechanism, which
|
||||
I didn't understand very well before working through this paper. There are
|
||||
lots of ways to extend the model.
|
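To make the attention step a little more concrete, here is a simplified pure-Python sketch of the soft alignment idea: each word vector in one sentence gets a softmax-weighted average of the word vectors in the other sentence. It is an illustration only, under a few assumptions: the feed-forward transform applied before the dot product is left out, the vectors are toy values, and the softmax subtracts the maximum score for numerical stability.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating to avoid overflow.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def attend(a, b):
    """Soft-align each word vector in `a` with the word vectors in `b`."""
    aligned = []
    for a_i in a:
        # One attention weight per word in `b`; the weights sum to 1.
        weights = softmax([dot(a_i, b_j) for b_j in b])
        # Weighted average of b's vectors, one dimension at a time.
        aligned.append(
            [sum(w * b_j[d] for w, b_j in zip(weights, b)) for d in range(len(b[0]))]
        )
    return aligned


# Toy 2-dimensional "word vectors" for two short sentences.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
beta = attend(a, b)
print(beta)
```

Each row of `beta` is dominated by the vector in `b` that points the same way as the corresponding vector in `a`, which is the "alignment" the compare step then works with.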
||||
|
||||
## What's where
|
||||
|
||||
| File | Description |
|
||||
| --- | --- |
|
||||
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
|
||||
| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
|
||||
| `keras_decomposable_attention.py` | Defines the neural network model. |
|
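The hook mechanism can be sketched without spaCy or Keras. The example below assumes the shim works by registering a function under the `"similarity"` key of the doc's `user_hooks`, which `similarity()` then consults. `StandInDoc`, `SimilarityShim` and `toy_model` are hypothetical stand-ins written for this illustration; they are not the classes in `spacy_hook.py`.

```python
class StandInDoc:
    """Hypothetical stand-in for spaCy's Doc, for illustration only."""

    def __init__(self, text):
        self.text = text
        self.user_hooks = {}

    def similarity(self, other):
        # Defer to a user-supplied hook when one is installed.
        if "similarity" in self.user_hooks:
            return self.user_hooks["similarity"](self, other)
        raise NotImplementedError("default similarity omitted in this sketch")


class SimilarityShim:
    """Pipeline-style component that installs a custom similarity function."""

    def __init__(self, predict):
        self.predict = predict

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.predict
        return doc


def toy_model(doc1, doc2):
    # Placeholder for your_model(doc1, doc2): Jaccard word overlap.
    words1 = set(doc1.text.lower().split())
    words2 = set(doc2.text.lower().split())
    return len(words1 & words2) / len(words1 | words2)


shim = SimilarityShim(toy_model)
doc1 = shim(StandInDoc("France has a king"))
doc2 = shim(StandInDoc("France has no king"))
print(doc1.similarity(doc2))
```

Because the hook is just a callable taking two docs, swapping in a trained model only changes what is passed to the shim, not the calling code.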
||||
|
||||
## Setting up
|
||||
|
||||
First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy
|
||||
English models (about 1GB of data):
|
||||
|
||||
```bash
|
||||
pip install keras
|
||||
pip install spacy
|
||||
python -m spacy download en_vectors_web_lg
|
||||
```
|
||||
|
||||
You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
|
||||
This will depend on your setup, so you're mostly on your own for this step. If you're using AWS, try the
|
||||
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
|
||||
|
||||
Once you've installed the dependencies, you can run a small preliminary test of
|
||||
the Keras model:
|
||||
|
||||
```bash
|
||||
py.test keras_parikh_entailment/keras_decomposable_attention.py
|
||||
```
|
||||
|
||||
This compiles the model and fits it with some dummy data. You should see that
|
||||
both tests passed.
|
||||
|
||||
Finally, download the [Stanford Natural Language Inference corpus](http://nlp.stanford.edu/projects/snli/).
|
||||
|
||||
## Running the example
|
||||
|
||||
You can run the `keras_parikh_entailment/` directory as a script, which executes the file
|
||||
[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments
|
||||
the usage is shown. Running it with `-h` explains the command line arguments.
|
||||
|
||||
The first thing you'll want to do is train the model:
|
||||
|
||||
```bash
|
||||
python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
|
||||
```
|
||||
|
||||
Training takes about 300 epochs for full accuracy, and I haven't rerun the full
|
||||
experiment since refactoring things to publish this example — please let me
|
||||
know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs.
|
||||
|
||||
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
|
||||
by `.fit()` methods. I never really feel confident until I've run a new process that loads
|
||||
the model and starts making predictions, without access to the gold labels. I've therefore
|
||||
included an `evaluate` mode.
|
||||
|
||||
```bash
|
||||
python keras_parikh_entailment/ evaluate -s <path to SNLI dev JSON>
|
||||
```
|
||||
|
||||
Finally, there's also a little demo, which mostly exists to show
|
||||
you how run-time usage will eventually look.
|
||||
|
||||
```bash
|
||||
python keras_parikh_entailment/ demo
|
||||
```
|
||||
|
||||
## Getting updates
|
||||
|
||||
We should have the blog post explaining the model ready before the end of the week. To get
|
||||
notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal)
|
||||
or subscribe to our [mailing list](http://eepurl.com/ckUpQ5).
|
|
@ -1,207 +0,0 @@
|
|||
import numpy as np
|
||||
import json
|
||||
from keras.utils import to_categorical
|
||||
import plac
|
||||
import sys
|
||||
|
||||
from keras_decomposable_attention import build_model
|
||||
from spacy_hook import get_embeddings, KerasSimilarityShim
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
except ImportError:
|
||||
import pickle
|
||||
|
||||
import spacy
|
||||
|
||||
# workaround for keras/tensorflow bug
|
||||
# see https://github.com/tensorflow/tensorflow/issues/3388
|
||||
import os
|
||||
import importlib
|
||||
from keras import backend as K
|
||||
|
||||
|
||||
def set_keras_backend(backend):
|
||||
if K.backend() != backend:
|
||||
os.environ["KERAS_BACKEND"] = backend
|
||||
importlib.reload(K)
|
||||
assert K.backend() == backend
|
||||
if backend == "tensorflow":
|
||||
K.get_session().close()
|
||||
cfg = K.tf.ConfigProto()
|
||||
cfg.gpu_options.allow_growth = True
|
||||
K.set_session(K.tf.Session(config=cfg))
|
||||
K.clear_session()
|
||||
|
||||
|
||||
set_keras_backend("tensorflow")
|
||||
|
||||
|
||||
def train(train_loc, dev_loc, shape, settings):
|
||||
train_texts1, train_texts2, train_labels = read_snli(train_loc)
|
||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
|
||||
print("Loading spaCy")
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
assert nlp.path is not None
|
||||
print("Processing texts...")
|
||||
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
|
||||
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
|
||||
|
||||
print("Compiling network")
|
||||
model = build_model(get_embeddings(nlp.vocab), shape, settings)
|
||||
|
||||
print(settings)
|
||||
model.fit(
|
||||
train_X,
|
||||
train_labels,
|
||||
validation_data=(dev_X, dev_labels),
|
||||
epochs=settings["nr_epoch"],
|
||||
batch_size=settings["batch_size"],
|
||||
)
|
||||
if not (nlp.path / "similarity").exists():
|
||||
(nlp.path / "similarity").mkdir()
|
||||
print("Saving to", nlp.path / "similarity")
|
||||
weights = model.get_weights()
|
||||
# remove the embedding matrix. We can reconstruct it.
|
||||
del weights[1]
|
||||
with (nlp.path / "similarity" / "model").open("wb") as file_:
|
||||
pickle.dump(weights, file_)
|
||||
with (nlp.path / "similarity" / "config.json").open("w") as file_:
|
||||
file_.write(model.to_json())
|
||||
|
||||
|
||||
def evaluate(dev_loc, shape):
|
||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||
total = 0.0
|
||||
correct = 0.0
|
||||
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
||||
doc1 = nlp(text1)
|
||||
doc2 = nlp(text2)
|
||||
sim, _ = doc1.similarity(doc2)
|
||||
if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
|
||||
correct += 1
|
||||
total += 1
|
||||
return correct, total
|
||||
|
||||
|
||||
def demo(shape):
|
||||
nlp = spacy.load("en_vectors_web_lg")
|
||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
||||
|
||||
doc1 = nlp("The king of France is bald.")
|
||||
doc2 = nlp("France has no king.")
|
||||
|
||||
print("Sentence 1:", doc1)
|
||||
print("Sentence 2:", doc2)
|
||||
|
||||
entailment_type, confidence = doc1.similarity(doc2)
|
||||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
||||
|
||||
|
||||
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
|
||||
|
||||
|
||||
def read_snli(path):
|
||||
texts1 = []
|
||||
texts2 = []
|
||||
labels = []
|
||||
with open(path, "r") as file_:
|
||||
for line in file_:
|
||||
eg = json.loads(line)
|
||||
label = eg["gold_label"]
|
||||
if label == "-": # per Parikh, ignore - SNLI entries
|
||||
continue
|
||||
texts1.append(eg["sentence1"])
|
||||
texts2.append(eg["sentence2"])
|
||||
labels.append(LABELS[label])
|
||||
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
|
||||
|
||||
|
||||
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
||||
sents = texts + hypotheses
|
||||
sents_as_ids = []
|
||||
for sent in sents:
|
||||
doc = nlp(sent)
|
||||
word_ids = []
|
||||
for i, token in enumerate(doc):
|
||||
# skip odd spaces from tokenizer
|
||||
if token.has_vector and token.vector_norm == 0:
|
||||
continue
|
||||
|
||||
if i > max_length:
|
||||
break
|
||||
|
||||
if token.has_vector:
|
||||
word_ids.append(token.rank + num_unk + 1)
|
||||
else:
|
||||
# if we don't have a vector, pick an OOV entry
|
||||
word_ids.append(token.rank % num_unk + 1)
|
||||
|
||||
# there must be a simpler way of generating padded arrays from lists...
|
||||
word_id_vec = np.zeros((max_length), dtype="int")
|
||||
clipped_len = min(max_length, len(word_ids))
|
||||
word_id_vec[:clipped_len] = word_ids[:clipped_len]
|
||||
sents_as_ids.append(word_id_vec)
|
||||
|
||||
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
|
||||
train_loc=("Path to training data", "option", "t", str),
|
||||
dev_loc=("Path to development or test data", "option", "s", str),
|
||||
max_length=("Length to truncate sentences", "option", "L", int),
|
||||
nr_hidden=("Number of hidden units", "option", "H", int),
|
||||
dropout=("Dropout level", "option", "d", float),
|
||||
learn_rate=("Learning rate", "option", "r", float),
|
||||
batch_size=("Batch size for neural network training", "option", "b", int),
|
||||
nr_epoch=("Number of training epochs", "option", "e", int),
|
||||
entail_dir=(
|
||||
"Direction of entailment",
|
||||
"option",
|
||||
"D",
|
||||
str,
|
||||
["both", "left", "right"],
|
||||
),
|
||||
)
|
||||
def main(
|
||||
mode,
|
||||
train_loc,
|
||||
dev_loc,
|
||||
max_length=50,
|
||||
nr_hidden=200,
|
||||
dropout=0.2,
|
||||
learn_rate=0.001,
|
||||
batch_size=1024,
|
||||
nr_epoch=10,
|
||||
entail_dir="both",
|
||||
):
|
||||
shape = (max_length, nr_hidden, 3)
|
||||
settings = {
|
||||
"lr": learn_rate,
|
||||
"dropout": dropout,
|
||||
"batch_size": batch_size,
|
||||
"nr_epoch": nr_epoch,
|
||||
"entail_dir": entail_dir,
|
||||
}
|
||||
|
||||
if mode == "train":
|
||||
if train_loc is None or dev_loc is None:
|
||||
print("Train mode requires paths to training and development data sets.")
|
||||
sys.exit(1)
|
||||
train(train_loc, dev_loc, shape, settings)
|
||||
elif mode == "evaluate":
|
||||
if dev_loc is None:
|
||||
print("Evaluate mode requires paths to test data set.")
|
||||
sys.exit(1)
|
||||
correct, total = evaluate(dev_loc, shape)
|
||||
print(correct, "/", total, correct / total)
|
||||
else:
|
||||
demo(shape)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -1,152 +0,0 @@
|
|||
# Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
|
||||
# Practical state-of-the-art textual entailment with spaCy and Keras
|
||||
|
||||
import numpy as np
|
||||
from keras import layers, Model, models, optimizers
|
||||
from keras import backend as K
|
||||
|
||||
|
||||
def build_model(vectors, shape, settings):
|
||||
max_length, nr_hidden, nr_class = shape
|
||||
|
||||
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
|
||||
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
|
||||
|
||||
# embeddings (projected)
|
||||
embed = create_embedding(vectors, max_length, nr_hidden)
|
||||
|
||||
a = embed(input1)
|
||||
b = embed(input2)
|
||||
|
||||
# step 1: attend
|
||||
F = create_feedforward(nr_hidden)
|
||||
att_weights = layers.dot([F(a), F(b)], axes=-1)
|
||||
|
||||
G = create_feedforward(nr_hidden)
|
||||
|
||||
if settings["entail_dir"] == "both":
|
||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
|
||||
# step 2: compare
|
||||
comp1 = layers.concatenate([a, beta])
|
||||
comp2 = layers.concatenate([b, alpha])
|
||||
v1 = layers.TimeDistributed(G)(comp1)
|
||||
v2 = layers.TimeDistributed(G)(comp2)
|
||||
|
||||
# step 3: aggregate
|
||||
v1_sum = layers.Lambda(sum_word)(v1)
|
||||
v2_sum = layers.Lambda(sum_word)(v2)
|
||||
concat = layers.concatenate([v1_sum, v2_sum])
|
||||
|
||||
elif settings["entail_dir"] == "left":
|
||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
||||
comp2 = layers.concatenate([b, alpha])
|
||||
v2 = layers.TimeDistributed(G)(comp2)
|
||||
v2_sum = layers.Lambda(sum_word)(v2)
|
||||
concat = v2_sum
|
||||
|
||||
else:
|
||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
||||
comp1 = layers.concatenate([a, beta])
|
||||
v1 = layers.TimeDistributed(G)(comp1)
|
||||
v1_sum = layers.Lambda(sum_word)(v1)
|
||||
concat = v1_sum
|
||||
|
||||
H = create_feedforward(nr_hidden)
|
||||
out = H(concat)
|
||||
out = layers.Dense(nr_class, activation="softmax")(out)
|
||||
|
||||
model = Model([input1, input2], out)
|
||||
|
||||
model.compile(
|
||||
optimizer=optimizers.Adam(lr=settings["lr"]),
|
||||
loss="categorical_crossentropy",
|
||||
metrics=["accuracy"],
|
||||
)
|
||||
|
||||
return model
|
||||
|
||||
|
||||
def create_embedding(vectors, max_length, projected_dim):
|
||||
return models.Sequential(
|
||||
[
|
||||
layers.Embedding(
|
||||
vectors.shape[0],
|
||||
vectors.shape[1],
|
||||
input_length=max_length,
|
||||
weights=[vectors],
|
||||
trainable=False,
|
||||
),
|
||||
layers.TimeDistributed(
|
||||
layers.Dense(projected_dim, activation=None, use_bias=False)
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
|
||||
return models.Sequential(
|
||||
[
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate),
|
||||
layers.Dense(num_units, activation=activation),
|
||||
layers.Dropout(dropout_rate),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def normalizer(axis):
|
||||
def _normalize(att_weights):
|
||||
exp_weights = K.exp(att_weights)
|
||||
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
|
||||
return exp_weights / sum_weights
|
||||
|
||||
return _normalize
|
||||
|
||||
|
||||
def sum_word(x):
|
||||
return K.sum(x, axis=1)
|
||||
|
||||
|
||||
def test_build_model():
|
||||
vectors = np.ndarray((100, 8), dtype="float32")
|
||||
shape = (10, 16, 3)
|
||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||
model = build_model(vectors, shape, settings)
|
||||
|
||||
|
||||
def test_fit_model():
|
||||
def _generate_X(nr_example, length, nr_vector):
|
||||
X1 = np.ndarray((nr_example, length), dtype="int32")
|
||||
X1 *= X1 < nr_vector
|
||||
X1 *= 0 <= X1
|
||||
X2 = np.ndarray((nr_example, length), dtype="int32")
|
||||
X2 *= X2 < nr_vector
|
||||
X2 *= 0 <= X2
|
||||
return [X1, X2]
|
||||
|
||||
def _generate_Y(nr_example, nr_class):
|
||||
ys = np.zeros((nr_example, nr_class), dtype="int32")
|
||||
for i in range(nr_example):
|
||||
ys[i, i % nr_class] = 1
|
||||
return ys
|
||||
|
||||
vectors = np.ndarray((100, 8), dtype="float32")
|
||||
shape = (10, 16, 3)
|
||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
||||
model = build_model(vectors, shape, settings)
|
||||
|
||||
train_X = _generate_X(20, shape[0], vectors.shape[0])
|
||||
train_Y = _generate_Y(20, shape[2])
|
||||
dev_X = _generate_X(15, shape[0], vectors.shape[0])
|
||||
dev_Y = _generate_Y(15, shape[2])
|
||||
|
||||
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
|
||||
|
||||
|
||||
__all__ = ["build_model"]
|
|
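The `normalizer(axis)` helper in the file above exponentiates the attention scores and divides by their sum along one axis, which is a softmax over that axis. Below is a minimal pure-Python sketch of the same arithmetic on a single 2x3 score matrix. It assumes no batch dimension, so axes 0 and 1 here play the role of axes 1 and 2 in the Keras code, and the name `softmax_along` is hypothetical rather than part of the removed example. Like the original, it does not subtract the maximum score for numerical stability.

```python
import math


def softmax_along(scores, axis):
    """Softmax of a 2-D list of floats along axis 0 (columns) or 1 (rows)."""
    exp = [[math.exp(x) for x in row] for row in scores]
    if axis == 1:
        # Normalize each row so that it sums to 1.
        return [[x / sum(row) for x in row] for row in exp]
    # axis == 0: normalize each column so that it sums to 1.
    col_sums = [sum(col) for col in zip(*exp)]
    return [[x / s for x, s in zip(row, col_sums)] for row in exp]


scores = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
by_row = softmax_along(scores, axis=1)  # each row is a distribution
by_col = softmax_along(scores, axis=0)  # each column is a distribution
```

Normalizing the same score matrix along both axes is what yields the two soft alignments (text to hypothesis and hypothesis to text) used in the compare step.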
@ -1,77 +0,0 @@
|
|||
import numpy as np
|
||||
from keras.models import model_from_json
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
except ImportError:
|
||||
import pickle
|
||||
|
||||
|
||||
class KerasSimilarityShim(object):
|
||||
entailment_types = ["entailment", "contradiction", "neutral"]
|
||||
|
||||
@classmethod
|
||||
def load(cls, path, nlp, max_length=100, get_features=None):
|
||||
|
||||
if get_features is None:
|
||||
get_features = get_word_ids
|
||||
|
||||
with (path / "config.json").open() as file_:
|
||||
model = model_from_json(file_.read())
|
||||
with (path / "model").open("rb") as file_:
|
||||
weights = pickle.load(file_)
|
||||
|
||||
embeddings = get_embeddings(nlp.vocab)
|
||||
weights.insert(1, embeddings)
|
||||
model.set_weights(weights)
|
||||
|
||||
return cls(model, get_features=get_features, max_length=max_length)
|
||||
|
||||
def __init__(self, model, get_features=None, max_length=100):
|
||||
self.model = model
|
||||
self.get_features = get_features
|
||||
self.max_length = max_length
|
||||
|
||||
def __call__(self, doc):
|
||||
doc.user_hooks["similarity"] = self.predict
|
||||
doc.user_span_hooks["similarity"] = self.predict
|
||||
|
||||
return doc
|
||||
|
||||
def predict(self, doc1, doc2):
|
||||
x1 = self.get_features([doc1], max_length=self.max_length)
|
||||
x2 = self.get_features([doc2], max_length=self.max_length)
|
||||
scores = self.model.predict([x1, x2])
|
||||
|
||||
return self.entailment_types[scores.argmax()], scores.max()
|
||||
|
||||
|
||||
def get_embeddings(vocab, nr_unk=100):
|
||||
# the extra +1 is for a zero vector representing sentence-final padding
|
||||
num_vectors = max(lex.rank for lex in vocab) + 2
|
||||
|
||||
# create random vectors for OOV tokens
|
||||
oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
|
||||
oov = oov / np.linalg.norm(oov, axis=1, keepdims=True)  # unit-normalize, like the lexeme vectors below
|
||||
|
||||
vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype="float32")
|
||||
vectors[1 : (nr_unk + 1),] = oov
|
||||
for lex in vocab:
|
||||
if lex.has_vector and lex.vector_norm > 0:
|
||||
vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
|
||||
|
||||
return vectors
|
||||
|
||||
|
||||
def get_word_ids(docs, max_length=100, nr_unk=100):
|
||||
Xs = np.zeros((len(docs), max_length), dtype="int32")
|
||||
|
||||
for i, doc in enumerate(docs):
|
||||
for j, token in enumerate(doc):
|
||||
if j == max_length:
|
||||
break
|
||||
if token.has_vector:
|
||||
Xs[i, j] = token.rank + nr_unk + 1
|
||||
else:
|
||||
Xs[i, j] = token.rank % nr_unk + 1
|
||||
return Xs
|
|
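The row layout shared by `get_embeddings` and `get_word_ids` in the file above can be easy to misread: row 0 is the zero padding vector, rows 1 to `nr_unk` hold the random OOV vectors, and a lexeme with a vector sits at row `rank + nr_unk + 1`. The sketch below reproduces only that index arithmetic in plain Python. The helper names `word_id` and `pad_ids` and the sample ranks are made up for illustration.

```python
def word_id(rank, has_vector, nr_unk=100):
    """Map a token's vocabulary rank to a row index in the embedding table."""
    if has_vector:
        # Rows after the OOV block hold the real word vectors.
        return rank + nr_unk + 1
    # Tokens without a vector share one of the nr_unk OOV rows (1..nr_unk).
    return rank % nr_unk + 1


def pad_ids(ids, max_length):
    """Truncate a list of ids, or right-pad it with 0 (the padding row)."""
    clipped = ids[:max_length]
    return clipped + [0] * (max_length - len(clipped))


# (rank, has_vector) pairs standing in for spaCy tokens; values are invented.
tokens = [(5, True), (1234, False), (0, True)]
ids = pad_ids([word_id(rank, has_vec) for rank, has_vec in tokens], max_length=5)
```

Because OOV tokens are bucketed by `rank % nr_unk`, the same unknown word always maps to the same random vector within a run.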
@ -1,45 +0,0 @@
|
|||
# coding: utf-8
|
||||
"""
|
||||
Example of loading previously parsed text using spaCy's DocBin class. The example
|
||||
performs an entity count to show that the annotations are available.
|
||||
For more details, see https://spacy.io/usage/saving-loading#docs
|
||||
Installation:
|
||||
python -m spacy download en_core_web_lg
|
||||
Usage:
|
||||
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import spacy
|
||||
from spacy.tokens import DocBin
|
||||
from timeit import default_timer as timer
|
||||
from collections import Counter
|
||||
|
||||
EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
|
||||
|
||||
|
||||
def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
|
||||
nlp = spacy.load(model)
|
||||
print("Reading data from {}".format(docbin_path))
|
||||
with open(docbin_path, "rb") as file_:
|
||||
bytes_data = file_.read()
|
||||
nr_word = 0
|
||||
start_time = timer()
|
||||
entities = Counter()
|
||||
docbin = DocBin().from_bytes(bytes_data)
|
||||
for doc in docbin.get_docs(nlp.vocab):
|
||||
nr_word += len(doc)
|
||||
entities.update((e.label_, e.text) for e in doc.ents)
|
||||
end_time = timer()
|
||||
msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
|
||||
wps = nr_word / (end_time - start_time)
|
||||
print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
|
||||
print("Most common entities:")
|
||||
for (label, entity), freq in entities.most_common(30):
|
||||
print(freq, entity, label)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import plac
|
||||
|
||||
plac.call(main)
|
|
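The entity count in the script above relies on `Counter.update` accepting an iterable of hashable items, here `(label, text)` tuples, so the same text under two different labels is tallied separately. A small stdlib-only sketch of that tallying step follows. The entity pairs are invented stand-ins for `doc.ents`, not output from a real model.

```python
from collections import Counter

# Hypothetical (label, text) pairs standing in for doc.ents of three documents.
docs_ents = [
    [("ORG", "Google"), ("GPE", "Berlin")],
    [("ORG", "Google"), ("PERSON", "Ada Lovelace")],
    [("GPE", "Berlin"), ("ORG", "Google")],
]

entities = Counter()
for ents in docs_ents:
    # Each tuple is one key; update() adds 1 per occurrence.
    entities.update(ents)

# most_common(n) returns ((label, text), freq) pairs, highest count first.
top = entities.most_common(2)
```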
@ -1,955 +0,0 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Natural language inference using spaCy and Keras"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Introduction"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of parameters *and hyperparameters* it specifies, while still yielding good performance."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Constructing the dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import spacy\n",
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"nlp = spacy.load('en_vectors_web_lg')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Function to load the SNLI dataset. The categories are converted to a one-hot representation. The function comes from an example in spaCy."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
|
||||
" from ._conv import register_converters as _register_converters\n",
|
||||
"Using TensorFlow backend.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import json\n",
|
||||
"from keras.utils import to_categorical\n",
|
||||
"\n",
|
||||
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
|
||||
"def read_snli(path):\n",
|
||||
" texts1 = []\n",
|
||||
" texts2 = []\n",
|
||||
" labels = []\n",
|
||||
" with open(path, 'r') as file_:\n",
|
||||
" for line in file_:\n",
|
||||
" eg = json.loads(line)\n",
|
||||
" label = eg['gold_label']\n",
|
||||
" if label == '-': # per Parikh, ignore - SNLI entries\n",
|
||||
" continue\n",
|
||||
" texts1.append(eg['sentence1'])\n",
|
||||
" texts2.append(eg['sentence2'])\n",
|
||||
" labels.append(LABELS[label])\n",
|
||||
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
|
||||
" sents = texts + hypotheses\n",
|
||||
" \n",
|
||||
" # the extra +1 is for a zero vector representing NULL for padding\n",
|
||||
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
|
||||
" \n",
|
||||
" # create random vectors for OOV tokens\n",
|
||||
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
|
||||
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
|
||||
" \n",
|
||||
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
|
||||
" vectors[num_vectors:, ] = oov\n",
|
||||
" for lex in nlp.vocab:\n",
|
||||
" if lex.has_vector and lex.vector_norm > 0:\n",
|
||||
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors else lex.vector\n",
|
||||
" \n",
|
||||
" sents_as_ids = []\n",
|
||||
" for sent in sents:\n",
|
||||
" doc = nlp(sent)\n",
|
||||
" word_ids = []\n",
|
||||
" \n",
|
||||
" for i, token in enumerate(doc):\n",
|
||||
" # skip odd spaces from tokenizer\n",
|
||||
" if token.has_vector and token.vector_norm == 0:\n",
|
||||
" continue\n",
|
||||
" \n",
|
||||
" if i > max_length:\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" if token.has_vector:\n",
|
||||
" word_ids.append(token.rank + 1)\n",
|
||||
" else:\n",
|
||||
" # if we don't have a vector, pick an OOV entry\n",
|
||||
" word_ids.append(token.rank % num_oov + num_vectors) \n",
|
||||
" \n",
|
||||
" # there must be a simpler way of generating padded arrays from lists...\n",
|
||||
" word_id_vec = np.zeros((max_length), dtype='int')\n",
|
||||
" clipped_len = min(max_length, len(word_ids))\n",
|
||||
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
|
||||
" sents_as_ids.append(word_id_vec)\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
|
||||
"\n",
|
||||
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Note that we will clip sentences to 50 words maximum."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from keras import layers, Model, models\n",
|
||||
"from keras import backend as K"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Building the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def create_embedding(vectors, max_length, projected_dim):\n",
|
||||
" return models.Sequential([\n",
|
||||
" layers.Embedding(\n",
|
||||
" vectors.shape[0],\n",
|
||||
" vectors.shape[1],\n",
|
||||
" input_length=max_length,\n",
|
||||
" weights=[vectors],\n",
|
||||
" trainable=False),\n",
|
||||
" \n",
|
||||
" layers.TimeDistributed(\n",
|
||||
" layers.Dense(projected_dim,\n",
|
||||
" activation=None,\n",
|
||||
" use_bias=False))\n",
|
||||
" ])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
|
||||
" return models.Sequential([\n",
|
||||
" layers.Dense(num_units, activation=activation),\n",
|
||||
" layers.Dropout(dropout_rate),\n",
|
||||
" layers.Dense(num_units, activation=activation),\n",
|
||||
" layers.Dropout(dropout_rate)\n",
|
||||
" ])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The basic idea of the (Parikh et al, 2016) model is to:\n",
|
||||
"\n",
|
||||
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
|
||||
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
|
||||
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
|
||||
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
|
||||
"\n",
|
||||
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def normalizer(axis):\n",
|
||||
" def _normalize(att_weights):\n",
|
||||
" exp_weights = K.exp(att_weights)\n",
|
||||
" sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
|
||||
" return exp_weights/sum_weights\n",
|
||||
" return _normalize\n",
|
||||
"\n",
|
||||
"def sum_word(x):\n",
|
||||
" return K.sum(x, axis=1)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
|
||||
" input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
|
||||
" input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
|
||||
" \n",
|
||||
" # embeddings (projected)\n",
|
||||
" embed = create_embedding(vectors, max_length, projected_dim)\n",
|
||||
" \n",
|
||||
" a = embed(input1)\n",
|
||||
" b = embed(input2)\n",
|
||||
" \n",
|
||||
" # step 1: attend\n",
|
||||
" F = create_feedforward(num_hidden)\n",
|
||||
" att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
|
||||
" \n",
|
||||
" G = create_feedforward(num_hidden)\n",
|
||||
" \n",
|
||||
" if entail_dir == 'both':\n",
|
||||
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
|
||||
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
|
||||
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
|
||||
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
|
||||
"\n",
|
||||
" # step 2: compare\n",
|
||||
" comp1 = layers.concatenate([a, beta])\n",
|
||||
" comp2 = layers.concatenate([b, alpha])\n",
|
||||
" v1 = layers.TimeDistributed(G)(comp1)\n",
|
||||
" v2 = layers.TimeDistributed(G)(comp2)\n",
|
||||
"\n",
|
||||
" # step 3: aggregate\n",
|
||||
" v1_sum = layers.Lambda(sum_word)(v1)\n",
|
||||
" v2_sum = layers.Lambda(sum_word)(v2)\n",
|
||||
" concat = layers.concatenate([v1_sum, v2_sum])\n",
|
||||
" elif entail_dir == 'left':\n",
|
||||
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
|
||||
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
|
||||
" comp2 = layers.concatenate([b, alpha])\n",
|
||||
" v2 = layers.TimeDistributed(G)(comp2)\n",
|
||||
" v2_sum = layers.Lambda(sum_word)(v2)\n",
|
||||
" concat = v2_sum\n",
|
||||
" else:\n",
|
||||
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
|
||||
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
|
||||
" comp1 = layers.concatenate([a, beta])\n",
|
||||
" v1 = layers.TimeDistributed(G)(comp1)\n",
|
||||
" v1_sum = layers.Lambda(sum_word)(v1)\n",
|
||||
" concat = v1_sum\n",
|
||||
" \n",
|
||||
" H = create_feedforward(num_hidden)\n",
|
||||
" out = H(concat)\n",
|
||||
" out = layers.Dense(num_classes, activation='softmax')(out)\n",
|
||||
" \n",
|
||||
" model = Model([input1, input2], out)\n",
|
||||
" \n",
|
||||
" model.compile(optimizer='adam',\n",
|
||||
" loss='categorical_crossentropy',\n",
|
||||
" metrics=['accuracy'])\n",
|
||||
" return model\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"Layer (type) Output Shape Param # Connected to \n",
|
||||
"==================================================================================================\n",
|
||||
"words1 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"words2 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
||||
" words2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
|
||||
" sequential_1[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
|
||||
" sequential_2[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
|
||||
" sequential_1[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
|
||||
" sequential_1[1][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
|
||||
" dot_3[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
|
||||
" dot_2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
|
||||
" lambda_4[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
|
||||
"==================================================================================================\n",
|
||||
"Total params: 321,703,403\n",
|
||||
"Trainable params: 381,803\n",
|
||||
"Non-trainable params: 321,321,600\n",
|
||||
"__________________________________________________________________________________________________\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"K.clear_session()\n",
|
||||
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
|
||||
"m.summary()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Training the model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Parikh et al use tiny batches of 4, training for 50 million batches, which amounts to around 500 epochs. Here we'll use larger batches to make better use of the GPU and train for fewer epochs, which is enough for the purposes of this experiment."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Train on 549367 samples, validate on 9824 samples\n",
|
||||
"Epoch 1/50\n",
|
||||
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
|
||||
"Epoch 2/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
|
||||
"Epoch 3/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
|
||||
"Epoch 4/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
|
||||
"Epoch 5/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
|
||||
"Epoch 6/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
|
||||
"Epoch 7/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
|
||||
"Epoch 8/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
|
||||
"Epoch 9/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
|
||||
"Epoch 10/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
|
||||
"Epoch 11/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
|
||||
"Epoch 12/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
|
||||
"Epoch 13/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
|
||||
"Epoch 14/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
|
||||
"Epoch 15/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
|
||||
"Epoch 16/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
|
||||
"Epoch 17/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
|
||||
"Epoch 18/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
|
||||
"Epoch 19/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
|
||||
"Epoch 20/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
|
||||
"Epoch 21/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
|
||||
"Epoch 22/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
|
||||
"Epoch 23/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
|
||||
"Epoch 24/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
|
||||
"Epoch 25/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
|
||||
"Epoch 26/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
|
||||
"Epoch 27/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
|
||||
"Epoch 28/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
|
||||
"Epoch 29/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
|
||||
"Epoch 30/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
|
||||
"Epoch 31/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
|
||||
"Epoch 32/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
|
||||
"Epoch 33/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
|
||||
"Epoch 34/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
|
||||
"Epoch 35/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
|
||||
"Epoch 36/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
|
||||
"Epoch 37/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
|
||||
"Epoch 38/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
|
||||
"Epoch 39/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
|
||||
"Epoch 40/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
|
||||
"Epoch 41/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
|
||||
"Epoch 42/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
|
||||
"Epoch 43/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
|
||||
"Epoch 44/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
|
||||
"Epoch 45/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
|
||||
"Epoch 46/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
|
||||
"Epoch 47/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
|
||||
"Epoch 48/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
|
||||
"Epoch 49/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
|
||||
"Epoch 50/50\n",
|
||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<keras.callbacks.History at 0x7f5c9f49c438>"
|
||||
]
|
||||
},
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The result is broadly in the region reported by Parikh et al.: ~86% vs. 86.3%. The small difference might be accounted for by differences in `max_length` (here set at 50), by differences in the training regime, and by the fact that here we use Keras' built-in validation splitting rather than the SNLI test set."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Experiment: the asymmetric model"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association from the hypothesis to the text is all that is needed for classifying the entailment.\n",
|
||||
"\n",
|
||||
"The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the parameter count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"Layer (type) Output Shape Param # Connected to \n",
|
||||
"==================================================================================================\n",
|
||||
"words2 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"words1 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
||||
" words2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
|
||||
" sequential_5[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
|
||||
" sequential_6[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
|
||||
" sequential_5[1][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
|
||||
" dot_5[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
|
||||
"==================================================================================================\n",
|
||||
"Total params: 321,663,403\n",
|
||||
"Trainable params: 341,803\n",
|
||||
"Non-trainable params: 321,321,600\n",
|
||||
"__________________________________________________________________________________________________\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
|
||||
"m1.summary()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Train on 549367 samples, validate on 9824 samples\n",
|
||||
"Epoch 1/50\n",
|
||||
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
|
||||
"Epoch 2/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
|
||||
"Epoch 3/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
|
||||
"Epoch 4/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
|
||||
"Epoch 5/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
|
||||
"Epoch 6/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
|
||||
"Epoch 7/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
|
||||
"Epoch 8/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
|
||||
"Epoch 9/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
|
||||
"Epoch 10/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
|
||||
"Epoch 11/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
|
||||
"Epoch 12/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
|
||||
"Epoch 13/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
|
||||
"Epoch 14/50\n",
|
||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
|
||||
"Epoch 15/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
|
||||
"Epoch 16/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
|
||||
"Epoch 17/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
|
||||
"Epoch 18/50\n",
|
||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
|
||||
"Epoch 19/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
|
||||
"Epoch 20/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
|
||||
"Epoch 21/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
|
||||
"Epoch 22/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
|
||||
"Epoch 23/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
|
||||
"Epoch 24/50\n",
|
||||
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
|
||||
"Epoch 25/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
|
||||
"Epoch 26/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
|
||||
"Epoch 27/50\n",
|
||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
|
||||
"Epoch 28/50\n",
|
||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
|
||||
"Epoch 29/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
|
||||
"Epoch 30/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
|
||||
"Epoch 31/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
|
||||
"Epoch 32/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
|
||||
"Epoch 33/50\n",
|
||||
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
|
||||
"Epoch 34/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
|
||||
"Epoch 35/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
|
||||
"Epoch 36/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
|
||||
"Epoch 37/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
|
||||
"Epoch 38/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
|
||||
"Epoch 39/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
|
||||
"Epoch 40/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
|
||||
"Epoch 41/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
|
||||
"Epoch 42/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
|
||||
"Epoch 43/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
|
||||
"Epoch 44/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
|
||||
"Epoch 45/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
|
||||
"Epoch 46/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
|
||||
"Epoch 47/50\n",
|
||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
|
||||
"Epoch 48/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
|
||||
"Epoch 49/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
|
||||
"Epoch 50/50\n",
|
||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
|
||||
]
|
||||
},
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This model performs about the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from about 60 down to about 45 microseconds per step in the training logs above.\n",
|
||||
"\n",
|
||||
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
|
||||
"\n",
|
||||
"We'll just use 10 epochs for expediency."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 96,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"Layer (type) Output Shape Param # Connected to \n",
|
||||
"==================================================================================================\n",
|
||||
"words1 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"words2 (InputLayer) (None, 50) 0 \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
||||
" words2[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
|
||||
" sequential_13[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
|
||||
" sequential_14[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
|
||||
" sequential_13[2][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
|
||||
" dot_9[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
|
||||
"__________________________________________________________________________________________________\n",
|
||||
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
|
||||
"==================================================================================================\n",
|
||||
"Total params: 321,663,403\n",
|
||||
"Trainable params: 341,803\n",
|
||||
"Non-trainable params: 321,321,600\n",
|
||||
"__________________________________________________________________________________________________\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
|
||||
"m2.summary()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 97,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Train on 455226 samples, validate on 113807 samples\n",
|
||||
"Epoch 1/10\n",
|
||||
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
|
||||
"Epoch 2/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
|
||||
"Epoch 3/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
|
||||
"Epoch 4/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
|
||||
"Epoch 5/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
|
||||
"Epoch 6/10\n",
|
||||
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
|
||||
"Epoch 7/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
|
||||
"Epoch 8/10\n",
|
||||
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
|
||||
"Epoch 9/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
|
||||
"Epoch 10/10\n",
|
||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<keras.callbacks.History at 0x7fa6850cf080>"
|
||||
]
|
||||
},
|
||||
"execution_count": 97,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10 percentage points lower (about 0.75 versus about 0.85).\n",
|
||||
"\n",
|
||||
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.2"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
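The asymmetric experiment in the notebook above can be illustrated outside Keras. What follows is a minimal pure-Python sketch, not the notebook's `build_model`: the helper names `softmax`, `dot` and `align` and the toy 2-dimensional vectors are assumptions made for illustration, and the learned projection, comparison and aggregation layers are left out. It shows only the one-directional soft alignment that the layer summary for `m1` suggests, where each hypothesis word receives a softmax-weighted average of the text word vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def align(queries, keys):
    """For each query vector, return a softmax-weighted average of `keys`."""
    dim = len(keys[0])
    aligned = []
    for q in queries:
        # Attention weights of this query over every key vector
        weights = softmax([dot(q, k) for k in keys])
        aligned.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return aligned


# Toy word vectors (hypothetical values, two dimensions per word)
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hypothesis = [[1.0, 0.0], [0.5, 0.5]]

# One direction only: each hypothesis word attends over the text words.
# The symmetric model would also compute align(text, hypothesis).
hyp_aligned = align(hypothesis, text)
print(hyp_aligned)
```

Dropping the second call to `align` is what removes one branch of the computation; in the notebook this is also what shrinks the input to the final dense layers.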
|
|
@ -1,78 +0,0 @@
|
|||
#!/usr/bin/env python
# coding: utf-8
"""This example contains several snippets of methods that can be set via custom
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
they're "bound" to the object and are partially applied – i.e. the object
they're called on is passed in as the first argument.

* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
from spacy.lang.en import English
from spacy.tokens import Doc, Span
from spacy import displacy
from pathlib import Path


@plac.annotations(
    output_dir=("Output directory for saved HTML", "positional", None, Path)
)
def main(output_dir=None):
    nlp = English()  # start off with blank English class

    Doc.set_extension("overlap", method=overlap_tokens)
    doc1 = nlp("Peach emoji is where it has always been.")
    doc2 = nlp("Peach is the superior emoji.")
    print("Text 1:", doc1.text)
    print("Text 2:", doc2.text)
    print("Overlapping tokens:", doc1._.overlap(doc2))

    Doc.set_extension("to_html", method=to_html)
    doc = nlp("This is a sentence about Apple.")
    # add entity manually for demo purposes, to make it work without a model
    doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings["ORG"])]
    print("Text:", doc.text)
    doc._.to_html(output=output_dir, style="ent")


def to_html(doc, output="/tmp", style="dep"):
    """Doc method extension for saving the current state as a displaCy
    visualization.
    """
    # generate filename from first six non-punct tokens
    file_name = "-".join([w.text for w in doc[:6] if not w.is_punct]) + ".html"
    html = displacy.render(doc, style=style, page=True)  # render markup
    if output is not None:
        output_path = Path(output)
        if not output_path.exists():
            output_path.mkdir()
        output_file = Path(output) / file_name
        output_file.open("w", encoding="utf-8").write(html)  # save to file
        print("Saved HTML to {}".format(output_file))
    else:
        print(html)


def overlap_tokens(doc, other_doc):
    """Get the tokens from the original Doc that are also in the comparison Doc.
    """
    overlap = []
    other_tokens = [token.text for token in other_doc]
    for token in doc:
        if token.text in other_tokens:
            overlap.append(token)
    return overlap


if __name__ == "__main__":
    plac.call(main)

    # Expected output:
    # Text 1: Peach emoji is where it has always been.
    # Text 2: Peach is the superior emoji.
    # Overlapping tokens: [Peach, emoji, is, .]
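The docstring of this example says that method extensions behave as if they were bound to the object and partially applied. The sketch below illustrates that idea with the standard library only. It is not spaCy's implementation: `Underscore`, `ToyDoc` and `overlap_words` are hypothetical stand-ins invented for this illustration, and words are plain strings rather than tokens.

```python
from functools import partial


class Underscore:
    """Hypothetical stand-in for a `._` namespace (not spaCy's real class)."""

    methods = {}  # registry shared by all instances: name -> function

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        try:
            func = Underscore.methods[name]
        except KeyError:
            raise AttributeError(name)
        # Partially apply: the owning object becomes the first argument
        return partial(func, self._obj)


class ToyDoc:
    """Toy document holding a list of word strings."""

    def __init__(self, words):
        self.words = list(words)

    @property
    def _(self):
        return Underscore(self)


def overlap_words(doc, other_doc):
    """Return the words of `doc` that also occur in `other_doc`."""
    other = set(other_doc.words)
    return [w for w in doc.words if w in other]


# Register the function under a name, similar in spirit to set_extension
Underscore.methods["overlap"] = overlap_words

doc1 = ToyDoc("Peach emoji is where it has always been .".split())
doc2 = ToyDoc("Peach is the superior emoji .".split())
result = doc1._.overlap(doc2)  # doc1 is passed in implicitly
print(result)
```

The caller supplies only `doc2`; `functools.partial` fills in `doc1`, which mirrors how `doc1._.overlap(doc2)` reaches `overlap_tokens(doc, other_doc)` in the example above.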
|
|
@ -1,130 +0,0 @@
|
|||
#!/usr/bin/env python
# coding: utf8
"""Example of a spaCy v2.0 pipeline component that requests all countries via
the REST Countries API, merges country names into one token, assigns entity
labels and sets attributes on country tokens, e.g. the capital and lat/lng
coordinates. Can be extended with more details from the API.

* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
Prerequisites: pip install requests
"""
from __future__ import unicode_literals, print_function

import requests
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token


def main():
    # For simplicity, we start off with only the blank English Language class
    # and no model or pre-defined pipeline loaded.
    nlp = English()
    rest_countries = RESTCountriesComponent(nlp)  # initialise component
    nlp.add_pipe(rest_countries)  # add it to the pipeline
    doc = nlp("Some text about Colombia and the Czech Republic")
    print("Pipeline", nlp.pipe_names)  # pipeline contains component name
    print("Doc has countries", doc._.has_country)  # Doc contains countries
    for token in doc:
        if token._.is_country:
            print(
                token.text,
                token._.country_capital,
                token._.country_latlng,
                token._.country_flag,
            )  # country data
    print("Entities", [(e.text, e.label_) for e in doc.ents])  # entities


class RESTCountriesComponent(object):
    """spaCy v2.0 pipeline component that requests all countries via
    the REST Countries API, merges country names into one token, assigns entity
    labels and sets attributes on country tokens.
    """

    name = "rest_countries"  # component name, will show up in the pipeline

    def __init__(self, nlp, label="GPE"):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        # Make request once on initialisation and store the data
        r = requests.get("https://restcountries.eu/rest/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()

        # Convert API response to dict keyed by country name for easy lookup
        # This could also be extended using the alternative and foreign language
        # names provided by the API
        self.countries = {c["name"]: c for c in countries}
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher with Doc patterns for each country name
        patterns = [nlp(c) for c in self.countries.keys()]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        # If no default value is set, it defaults to None.
        Token.set_extension("is_country", default=False)
        Token.set_extension("country_capital", default=False)
        Token.set_extension("country_latlng", default=False)
        Token.set_extension("country_flag", default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_country == True.
        Doc.set_extension("has_country", getter=self.has_country)
        Span.set_extension("has_country", getter=self.has_country)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            # Can be extended with other data returned by the API, like
            # currencies, country code, flag, calling code etc.
            for token in entity:
                token._.set("is_country", True)
                token._.set("country_capital", self.countries[entity.text]["capital"])
                token._.set("country_latlng", self.countries[entity.text]["latlng"])
                token._.set("country_flag", self.countries[entity.text]["flag"])
            # Overwrite doc.ents and add entity – be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities – otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def has_country(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is a country. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_country' attribute here,
        which is already set in the processing step."""
        return any([t._.get("is_country") for t in tokens])


if __name__ == "__main__":
    plac.call(main)

    # Expected output:
    # Pipeline ['rest_countries']
    # Doc has countries True
    # Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
    # Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
    # Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]
|
|
@ -1,115 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
||||
based on list of single or multiple-word company names. Companies are
|
||||
labelled as ORG and their spans are merged into one token. Additionally,
|
||||
._.has_tech_org and ._.is_tech_org are set on the Doc/Span and Token
|
||||
respectively.
|
||||
|
||||
* Custom pipeline components: https://spacy.io/usage/processing-pipelines#custom-components
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
from spacy.lang.en import English
|
||||
from spacy.matcher import PhraseMatcher
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
text=("Text to process", "positional", None, str),
|
||||
companies=("Names of technology companies", "positional", None, str),
|
||||
)
|
||||
def main(text="Alphabet Inc. is the company behind Google.", *companies):
|
||||
# For simplicity, we start off with only the blank English Language class
|
||||
# and no model or pre-defined pipeline loaded.
|
||||
nlp = English()
|
||||
if not companies: # set default companies if none are set via args
|
||||
companies = ["Alphabet Inc.", "Google", "Netflix", "Apple"] # etc.
|
||||
component = TechCompanyRecognizer(nlp, companies) # initialise component
|
||||
nlp.add_pipe(component, last=True) # add last to the pipeline
|
||||
|
||||
doc = nlp(text)
|
||||
print("Pipeline", nlp.pipe_names) # pipeline contains component name
|
||||
print("Tokens", [t.text for t in doc]) # company names from the list are merged
|
||||
print("Doc has_tech_org", doc._.has_tech_org) # Doc contains tech orgs
|
||||
print("Token 0 is_tech_org", doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
|
||||
print("Token 1 is_tech_org", doc[1]._.is_tech_org) # "is" is not
|
||||
print("Entities", [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
|
||||
|
||||
|
||||
class TechCompanyRecognizer(object):
|
||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
||||
based on list of single or multiple-word company names. Companies are
|
||||
labelled as ORG and their spans are merged into one token. Additionally,
|
||||
._.has_tech_org and ._.is_tech_org are set on the Doc/Span and Token
|
||||
respectively."""
|
||||
|
||||
name = "tech_companies" # component name, will show up in the pipeline
|
||||
|
||||
def __init__(self, nlp, companies=tuple(), label="ORG"):
|
||||
"""Initialise the pipeline component. The shared nlp instance is used
|
||||
to initialise the matcher with the shared vocab, get the label ID and
|
||||
generate Doc objects as phrase match patterns.
|
||||
"""
|
||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
||||
|
||||
# Set up the PhraseMatcher – it can now take Doc objects as patterns,
|
||||
# so even if the list of companies is long, it's very efficient
|
||||
patterns = [nlp(org) for org in companies]
|
||||
self.matcher = PhraseMatcher(nlp.vocab)
|
||||
self.matcher.add("TECH_ORGS", None, *patterns)
|
||||
|
||||
# Register attribute on the Token. We'll be overwriting this based on
|
||||
# the matches, so we're only setting a default value, not a getter.
|
||||
Token.set_extension("is_tech_org", default=False)
|
||||
|
||||
# Register attributes on Doc and Span via a getter that checks if one of
|
||||
# the contained tokens is set to is_tech_org == True.
|
||||
Doc.set_extension("has_tech_org", getter=self.has_tech_org)
|
||||
Span.set_extension("has_tech_org", getter=self.has_tech_org)
|
||||
|
||||
def __call__(self, doc):
|
||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
||||
are found. Return the Doc, so it can be processed by the next component
|
||||
in the pipeline, if available.
|
||||
"""
|
||||
matches = self.matcher(doc)
|
||||
spans = [] # keep the spans for later so we can merge them afterwards
|
||||
for _, start, end in matches:
|
||||
# Generate Span representing the entity & set label
|
||||
entity = Span(doc, start, end, label=self.label)
|
||||
spans.append(entity)
|
||||
# Set custom attribute on each token of the entity
|
||||
for token in entity:
|
||||
token._.set("is_tech_org", True)
|
||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
||||
doc.ents = list(doc.ents) + [entity]
|
||||
for span in spans:
|
||||
# Iterate over all spans and merge them into one token. This is done
|
||||
# after setting the entities – otherwise, it would cause mismatched
|
||||
# indices!
|
||||
span.merge()
|
||||
return doc # don't forget to return the Doc!
|
||||
|
||||
def has_tech_org(self, tokens):
|
||||
"""Getter for Doc and Span attributes. Returns True if one of the tokens
|
||||
is a tech org. Since the getter is only called when we access the
|
||||
attribute, we can refer to the Token's 'is_tech_org' attribute here,
|
||||
which is already set in the processing step."""
|
||||
return any([t._.get("is_tech_org") for t in tokens])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Pipeline ['tech_companies']
|
||||
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
|
||||
# Doc has_tech_org True
|
||||
# Token 0 is_tech_org True
|
||||
# Token 1 is_tech_org False
|
||||
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]
|
|
@ -1,61 +0,0 @@
|
|||
"""Example of adding a pipeline component to prohibit sentence boundaries
|
||||
before certain tokens.
|
||||
|
||||
What we do is write to the token.is_sent_start attribute, which
|
||||
takes values in {True, False, None}. The default value None allows the parser
|
||||
to predict sentence segments. The value False prohibits the parser from inserting
|
||||
a sentence boundary before that token. Note that fixing the sentence segmentation
|
||||
should also improve the parse quality.
|
||||
|
||||
The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627
|
||||
Other versions of the model may not make the original mistake, so the specific
|
||||
example might not be apt for future versions.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
def prevent_sentence_boundaries(doc):
|
||||
for token in doc:
|
||||
if not can_be_sentence_start(token):
|
||||
token.is_sent_start = False
|
||||
return doc
|
||||
|
||||
|
||||
def can_be_sentence_start(token):
|
||||
if token.i == 0:
|
||||
return True
|
||||
# We're not checking for is_title here to ignore arbitrary titlecased
|
||||
# tokens within sentences
|
||||
# elif token.is_title:
|
||||
# return True
|
||||
elif token.nbor(-1).is_punct:
|
||||
return True
|
||||
elif token.nbor(-1).is_space:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
text=("The raw text to process", "positional", None, str),
|
||||
spacy_model=("spaCy model to use (with a parser)", "option", "m", str),
|
||||
)
|
||||
def main(text="Been here And I'm loving it.", spacy_model="en_core_web_lg"):
|
||||
print("Using spaCy model '{}'".format(spacy_model))
|
||||
print("Processing text '{}'".format(text))
|
||||
nlp = spacy.load(spacy_model)
|
||||
doc = nlp(text)
|
||||
sentences = [sent.text.strip() for sent in doc.sents]
|
||||
print("Before:", sentences)
|
||||
nlp.add_pipe(prevent_sentence_boundaries, before="parser")
|
||||
doc = nlp(text)
|
||||
sentences = [sent.text.strip() for sent in doc.sents]
|
||||
print("After:", sentences)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
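The boundary rule used by can_be_sentence_start above can be sketched without spaCy. This is only an illustration under stated assumptions: tokens are plain strings, and ASCII punctuation from the standard library stands in for spaCy's token.is_punct, which covers more characters. The names can_start_sentence and blocked_positions are hypothetical and not part of the example script.

```python
import string


def can_start_sentence(tokens, i):
    """Return True if tokens[i] may open a sentence under the rule above."""
    if i == 0:
        return True  # the first token can always start a sentence
    prev = tokens[i - 1]
    if prev.isspace():
        return True  # previous token is whitespace
    # Assumption: "punctuation" means every character is ASCII punctuation.
    return bool(prev) and all(ch in string.punctuation for ch in prev)


def blocked_positions(tokens):
    """Indices where a sentence boundary would be prohibited."""
    return [i for i in range(len(tokens)) if not can_start_sentence(tokens, i)]


tokens = ["Been", "here", "And", "I", "'m", "loving", "it", "."]
print(blocked_positions(tokens))
```

In the real component, each blocked position corresponds to setting token.is_sent_start = False before the parser runs.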
|
|
@ -1,37 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Demonstrate adding a rule-based component that forces some tokens to not
|
||||
be entities, before the NER tagger is applied. This is used to hotfix the issue
|
||||
in https://github.com/explosion/spaCy/issues/2870, present as of spaCy v2.0.16.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import spacy
|
||||
from spacy.attrs import ENT_IOB
|
||||
|
||||
|
||||
def fix_space_tags(doc):
|
||||
ent_iobs = doc.to_array([ENT_IOB])
|
||||
for i, token in enumerate(doc):
|
||||
if token.is_space:
|
||||
# Sets 'O' tag (ENT_IOB codes: 0 is unset, 1 is I, 2 is O, 3 is B)
|
||||
ent_iobs[i] = 2
|
||||
doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
|
||||
return doc
|
||||
|
||||
|
||||
def main():
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
text = "This is some crazy test where I dont need an Apple Watch to make things bug"
|
||||
doc = nlp(text)
|
||||
print("Before", doc.ents)
|
||||
nlp.add_pipe(fix_space_tags, name="fix-ner", before="ner")
|
||||
doc = nlp(text)
|
||||
print("After", doc.ents)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
|
@ -1,84 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of multi-processing with Joblib. Here, we're exporting
|
||||
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
|
||||
each "sentence" on a newline, and spaces between tokens. Data is loaded from
|
||||
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
|
||||
built-in dataset loader.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
Prerequisites: pip install joblib
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals
|
||||
|
||||
from pathlib import Path
|
||||
from joblib import Parallel, delayed
|
||||
from functools import partial
|
||||
import thinc.extra.datasets
|
||||
import plac
|
||||
import spacy
|
||||
from spacy.util import minibatch
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
output_dir=("Output directory", "positional", None, Path),
|
||||
model=("Model name (needs tagger)", "positional", None, str),
|
||||
n_jobs=("Number of workers", "option", "n", int),
|
||||
batch_size=("Batch-size for each process", "option", "b", int),
|
||||
limit=("Limit of entries from the dataset", "option", "l", int),
|
||||
)
|
||||
def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10000):
|
||||
nlp = spacy.load(model) # load spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
# load and pre-process the IMDB dataset
|
||||
print("Loading IMDB data...")
|
||||
data, _ = thinc.extra.datasets.imdb()
|
||||
texts, _ = zip(*data[-limit:])
|
||||
print("Processing texts...")
|
||||
partitions = minibatch(texts, size=batch_size)
|
||||
executor = Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes")
|
||||
do = delayed(partial(transform_texts, nlp))
|
||||
tasks = (do(i, batch, output_dir) for i, batch in enumerate(partitions))
|
||||
executor(tasks)
|
||||
|
||||
|
||||
def transform_texts(nlp, batch_id, texts, output_dir):
|
||||
print(nlp.pipe_names)
|
||||
out_path = Path(output_dir) / ("%d.txt" % batch_id)
|
||||
if out_path.exists(): # return None in case same batch is called again
|
||||
return None
|
||||
print("Processing batch", batch_id)
|
||||
with out_path.open("w", encoding="utf8") as f:
|
||||
for doc in nlp.pipe(texts):
|
||||
f.write(" ".join(represent_word(w) for w in doc if not w.is_space))
|
||||
f.write("\n")
|
||||
print("Saved {} texts to {}.txt".format(len(texts), batch_id))
|
||||
|
||||
|
||||
def represent_word(word):
|
||||
text = word.text
|
||||
# True-case, i.e. try to normalize sentence-initial capitals.
|
||||
# Only do this if the lower-cased form is more probable.
|
||||
if (
|
||||
text.istitle()
|
||||
and is_sent_begin(word)
|
||||
and word.prob < word.doc.vocab[text.lower()].prob
|
||||
):
|
||||
text = text.lower()
|
||||
return text + "|" + word.tag_
|
||||
|
||||
|
||||
def is_sent_begin(word):
|
||||
if word.i == 0:
|
||||
return True
|
||||
elif word.i >= 2 and word.nbor(-1).text in (".", "!", "?", "..."):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
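The task-building pattern in main above (split the texts into batches, bind the shared first argument with functools.partial, then number the batches with enumerate) can be tried without joblib or a model. This is a minimal sketch that runs sequentially. The minibatch helper here is a simplified stand-in for spacy.util.minibatch, and transform is a hypothetical placeholder for transform_texts.

```python
from functools import partial
from itertools import islice


def minibatch(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            break
        yield batch


def transform(prefix, batch_id, texts):
    """Placeholder worker: report which batch it received."""
    return "%s-%d: %d texts" % (prefix, batch_id, len(texts))


# Bind the shared first argument once, as the script does with `nlp`.
do = partial(transform, "job")
results = [do(i, batch) for i, batch in enumerate(minibatch(range(5), 2))]
print(results)
```

With joblib, the same generator of calls would be wrapped in delayed(...) and handed to a Parallel executor, as in the script.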
|
|
@ -1,153 +0,0 @@
|
|||
# coding: utf-8
|
||||
"""
|
||||
Example of a Streamlit app for an interactive spaCy model visualizer. You can
|
||||
either download the script, or point streamlit run to the raw URL of this
|
||||
file. For more details, see https://streamlit.io.
|
||||
|
||||
Installation:
|
||||
pip install streamlit
|
||||
python -m spacy download en_core_web_sm
|
||||
python -m spacy download en_core_web_md
|
||||
python -m spacy download de_core_news_sm
|
||||
|
||||
Usage:
|
||||
streamlit run streamlit_spacy.py
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import streamlit as st
|
||||
import spacy
|
||||
from spacy import displacy
|
||||
import pandas as pd
|
||||
|
||||
|
||||
SPACY_MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "de_core_news_sm"]
|
||||
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
||||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
||||
|
||||
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def load_model(name):
|
||||
return spacy.load(name)
|
||||
|
||||
|
||||
@st.cache(allow_output_mutation=True)
|
||||
def process_text(model_name, text):
|
||||
nlp = load_model(model_name)
|
||||
return nlp(text)
|
||||
|
||||
|
||||
st.sidebar.title("Interactive spaCy visualizer")
|
||||
st.sidebar.markdown(
|
||||
"""
|
||||
Process text with [spaCy](https://spacy.io) models and visualize named entities,
|
||||
dependencies and more. Uses spaCy's built-in
|
||||
[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood.
|
||||
"""
|
||||
)
|
||||
|
||||
spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
|
||||
model_load_state = st.info(f"Loading model '{spacy_model}'...")
|
||||
nlp = load_model(spacy_model)
|
||||
model_load_state.empty()
|
||||
|
||||
text = st.text_area("Text to analyze", DEFAULT_TEXT)
|
||||
doc = process_text(spacy_model, text)
|
||||
|
||||
if "parser" in nlp.pipe_names:
|
||||
st.header("Dependency Parse & Part-of-speech tags")
|
||||
st.sidebar.header("Dependency Parse")
|
||||
split_sents = st.sidebar.checkbox("Split sentences", value=True)
|
||||
collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
|
||||
collapse_phrases = st.sidebar.checkbox("Collapse phrases")
|
||||
compact = st.sidebar.checkbox("Compact mode")
|
||||
options = {
|
||||
"collapse_punct": collapse_punct,
|
||||
"collapse_phrases": collapse_phrases,
|
||||
"compact": compact,
|
||||
}
|
||||
docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
|
||||
for sent in docs:
|
||||
html = displacy.render(sent, options=options)
|
||||
# Double newlines seem to mess with the rendering
|
||||
html = html.replace("\n\n", "\n")
|
||||
if split_sents and len(docs) > 1:
|
||||
st.markdown(f"> {sent.text}")
|
||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
||||
|
||||
if "ner" in nlp.pipe_names:
|
||||
st.header("Named Entities")
|
||||
st.sidebar.header("Named Entities")
|
||||
label_set = nlp.get_pipe("ner").labels
|
||||
labels = st.sidebar.multiselect(
|
||||
"Entity labels", options=label_set, default=list(label_set)
|
||||
)
|
||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
||||
# Newlines seem to mess with the rendering
|
||||
html = html.replace("\n", " ")
|
||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
||||
attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
|
||||
if "entity_linker" in nlp.pipe_names:
|
||||
attrs.append("kb_id_")
|
||||
data = [
|
||||
[str(getattr(ent, attr)) for attr in attrs]
|
||||
for ent in doc.ents
|
||||
if ent.label_ in labels
|
||||
]
|
||||
df = pd.DataFrame(data, columns=attrs)
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
if "textcat" in nlp.pipe_names:
|
||||
st.header("Text Classification")
|
||||
st.markdown(f"> {text}")
|
||||
df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
vector_size = nlp.meta.get("vectors", {}).get("width", 0)
|
||||
if vector_size:
|
||||
st.header("Vectors & Similarity")
|
||||
st.code(nlp.meta["vectors"])
|
||||
text1 = st.text_input("Text or word 1", "apple")
|
||||
text2 = st.text_input("Text or word 2", "orange")
|
||||
doc1 = process_text(spacy_model, text1)
|
||||
doc2 = process_text(spacy_model, text2)
|
||||
similarity = doc1.similarity(doc2)
|
||||
if similarity > 0.5:
|
||||
st.success(similarity)
|
||||
else:
|
||||
st.error(similarity)
|
||||
|
||||
st.header("Token attributes")
|
||||
|
||||
if st.button("Show token attributes"):
|
||||
attrs = [
|
||||
"idx",
|
||||
"text",
|
||||
"lemma_",
|
||||
"pos_",
|
||||
"tag_",
|
||||
"dep_",
|
||||
"head",
|
||||
"ent_type_",
|
||||
"ent_iob_",
|
||||
"shape_",
|
||||
"is_alpha",
|
||||
"is_ascii",
|
||||
"is_digit",
|
||||
"is_punct",
|
||||
"like_num",
|
||||
]
|
||||
data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
|
||||
df = pd.DataFrame(data, columns=attrs)
|
||||
st.dataframe(df)
|
||||
|
||||
|
||||
st.header("JSON Doc")
|
||||
if st.button("Show JSON Doc"):
|
||||
st.json(doc.to_json())
|
||||
|
||||
st.header("JSON model meta")
|
||||
if st.button("Show JSON model meta"):
|
||||
st.json(nlp.meta)
|
|
@ -1 +0,0 @@
|
|||
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}
|
|
@ -1,434 +0,0 @@
|
|||
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
||||
.conllu format for development data, allowing the official scorer to be used.
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import attr
|
||||
from pathlib import Path
|
||||
import re
|
||||
import json
|
||||
import tqdm
|
||||
|
||||
import spacy
|
||||
import spacy.util
|
||||
from spacy.tokens import Token, Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from collections import defaultdict
|
||||
from spacy.matcher import Matcher
|
||||
|
||||
import itertools
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from bin.ud import conll17_ud_eval
|
||||
|
||||
import spacy.lang.zh
|
||||
import spacy.lang.ja
|
||||
|
||||
spacy.lang.zh.Chinese.Defaults.use_jieba = False
|
||||
spacy.lang.ja.Japanese.Defaults.use_janome = False
|
||||
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
||||
def minibatch_by_words(items, size=5000):
|
||||
random.shuffle(items)
|
||||
if isinstance(size, int):
|
||||
size_ = itertools.repeat(size)
|
||||
else:
|
||||
size_ = size
|
||||
items = iter(items)
|
||||
while True:
|
||||
batch_size = next(size_)
|
||||
batch = []
|
||||
while batch_size >= 0:
|
||||
try:
|
||||
doc, gold = next(items)
|
||||
except StopIteration:
|
||||
if batch:
|
||||
yield batch
|
||||
return
|
||||
batch_size -= len(doc)
|
||||
batch.append((doc, gold))
|
||||
if batch:
|
||||
yield batch
|
||||
else:
|
||||
break
|
||||
|
||||
|
||||
################
|
||||
# Data reading #
|
||||
################
|
||||
|
||||
space_re = re.compile(r"\s+")
|
||||
|
||||
|
||||
def split_text(text):
|
||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
||||
|
||||
|
||||
def read_data(
|
||||
nlp,
|
||||
conllu_file,
|
||||
text_file,
|
||||
raw_text=True,
|
||||
oracle_segments=False,
|
||||
max_doc_length=None,
|
||||
limit=None,
|
||||
):
|
||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
||||
include Doc objects created using nlp.make_doc and then aligned against
|
||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
||||
created from the gold-standard segments. At least one must be True."""
|
||||
if not raw_text and not oracle_segments:
|
||||
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
||||
paragraphs = split_text(text_file.read())
|
||||
conllu = read_conllu(conllu_file)
|
||||
# sd is spacy doc; cd is conllu doc
|
||||
# cs is conllu sent, ct is conllu token
|
||||
docs = []
|
||||
golds = []
|
||||
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
||||
sent_annots = []
|
||||
for cs in cd:
|
||||
sent = defaultdict(list)
|
||||
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
||||
if "." in id_:
|
||||
continue
|
||||
if "-" in id_:
|
||||
continue
|
||||
id_ = int(id_) - 1
|
||||
head = int(head) - 1 if head != "0" else id_
|
||||
sent["words"].append(word)
|
||||
sent["tags"].append(tag)
|
||||
sent["heads"].append(head)
|
||||
sent["deps"].append("ROOT" if dep == "root" else dep)
|
||||
sent["spaces"].append(space_after == "_")
|
||||
sent["entities"] = ["-"] * len(sent["words"])
|
||||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
||||
if oracle_segments:
|
||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
||||
golds.append(GoldParse(docs[-1], **sent))
|
||||
|
||||
sent_annots.append(sent)
|
||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
sent_annots = []
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
|
||||
if raw_text and sent_annots:
|
||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
if limit and len(docs) >= limit:
|
||||
return docs, golds
|
||||
return docs, golds
|
||||
|
||||
|
||||
def read_conllu(file_):
|
||||
docs = []
|
||||
sent = []
|
||||
doc = []
|
||||
for line in file_:
|
||||
if line.startswith("# newdoc"):
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
doc = []
|
||||
elif line.startswith("#"):
|
||||
continue
|
||||
elif not line.strip():
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
sent = []
|
||||
else:
|
||||
sent.append(list(line.strip().split("\t")))
|
||||
if len(sent[-1]) != 10:
|
||||
print(repr(line))
|
||||
raise ValueError
|
||||
if sent:
|
||||
doc.append(sent)
|
||||
if doc:
|
||||
docs.append(doc)
|
||||
return docs
|
||||
|
||||
|
||||
def _make_gold(nlp, text, sent_annots):
|
||||
# Flatten the conll annotations, and adjust the head indices
|
||||
flat = defaultdict(list)
|
||||
for sent in sent_annots:
|
||||
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
|
||||
for field in ["words", "tags", "deps", "entities", "spaces"]:
|
||||
flat[field].extend(sent[field])
|
||||
# Construct text if necessary
|
||||
assert len(flat["words"]) == len(flat["spaces"])
|
||||
if text is None:
|
||||
text = "".join(
|
||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
||||
)
|
||||
doc = nlp.make_doc(text)
|
||||
flat.pop("spaces")
|
||||
gold = GoldParse(doc, **flat)
|
||||
return doc, gold
|
||||
|
||||
|
||||
#############################
|
||||
# Data transforms for spaCy #
|
||||
#############################
|
||||
|
||||
|
||||
def golds_to_gold_tuples(docs, golds):
|
||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
||||
GoldParse objects."""
|
||||
tuples = []
|
||||
for doc, gold in zip(docs, golds):
|
||||
text = doc.text
|
||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
||||
tuples.append((text, sents))
|
||||
return tuples
|
||||
|
||||
|
||||
##############
|
||||
# Evaluation #
|
||||
##############
|
||||
|
||||
|
||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
||||
with text_loc.open("r", encoding="utf8") as text_file:
|
||||
texts = split_text(text_file.read())
|
||||
docs = list(nlp.pipe(texts))
|
||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
||||
write_conllu(docs, out_file)
|
||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
||||
return scores
|
||||
|
||||
|
||||
def write_conllu(docs, file_):
|
||||
merger = Matcher(docs[0].vocab)
|
||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
||||
for i, doc in enumerate(docs):
|
||||
matches = merger(doc)
|
||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
||||
offsets = [(span.start_char, span.end_char) for span in spans]
|
||||
for start_char, end_char in offsets:
|
||||
doc.merge(start_char, end_char)
|
||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
||||
for j, sent in enumerate(doc.sents):
|
||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
||||
file_.write("# text = {text}\n".format(text=sent.text))
|
||||
for k, token in enumerate(sent):
|
||||
file_.write(token._.get_conllu_lines(k) + "\n")
|
||||
file_.write("\n")
|
||||
|
||||
|
||||
def print_progress(itn, losses, ud_scores):
|
||||
fields = {
|
||||
"dep_loss": losses.get("parser", 0.0),
|
||||
"tag_loss": losses.get("tagger", 0.0),
|
||||
"words": ud_scores["Words"].f1 * 100,
|
||||
"sents": ud_scores["Sentences"].f1 * 100,
|
||||
"tags": ud_scores["XPOS"].f1 * 100,
|
||||
"uas": ud_scores["UAS"].f1 * 100,
|
||||
"las": ud_scores["LAS"].f1 * 100,
|
||||
}
|
||||
header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"]
|
||||
if itn == 0:
|
||||
print("\t".join(header))
|
||||
tpl = "\t".join(
|
||||
(
|
||||
"{:d}",
|
||||
"{dep_loss:.1f}",
|
||||
"{las:.1f}",
|
||||
"{uas:.1f}",
|
||||
"{tags:.1f}",
|
||||
"{sents:.1f}",
|
||||
"{words:.1f}",
|
||||
)
|
||||
)
|
||||
print(tpl.format(itn, **fields))
|
||||
|
||||
|
||||
# def get_sent_conllu(sent, sent_id):
|
||||
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
|
||||
|
||||
|
||||
def get_token_conllu(token, i):
|
||||
if token._.begins_fused:
|
||||
n = 1
|
||||
while token.nbor(n)._.inside_fused:
|
||||
n += 1
|
||||
id_ = "%d-%d" % (i, i + n)
|
||||
lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
|
||||
else:
|
||||
lines = []
|
||||
if token.head.i == token.i:
|
||||
head = 0
|
||||
else:
|
||||
head = i + (token.head.i - token.i) + 1
|
||||
fields = [
|
||||
str(i + 1),
|
||||
token.text,
|
||||
token.lemma_,
|
||||
token.pos_,
|
||||
token.tag_,
|
||||
"_",
|
||||
str(head),
|
||||
token.dep_.lower(),
|
||||
"_",
|
||||
"_",
|
||||
]
|
||||
lines.append("\t".join(fields))
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
##################
|
||||
# Initialization #
|
||||
##################
|
||||
|
||||
|
||||
def load_nlp(corpus, config):
|
||||
lang = corpus.split("_")[0]
|
||||
nlp = spacy.blank(lang)
|
||||
if config.vectors:
|
||||
nlp.vocab.from_disk(config.vectors / "vocab")
|
||||
return nlp
|
||||
|
||||
|
||||
def initialize_pipeline(nlp, docs, golds, config):
|
||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
||||
if config.multitask_tag:
|
||||
nlp.parser.add_multitask_objective("tag")
|
||||
if config.multitask_sent:
|
||||
nlp.parser.add_multitask_objective("sent_start")
|
||||
nlp.parser.moves.add_action(2, "subtok")
|
||||
nlp.add_pipe(nlp.create_pipe("tagger"))
|
||||
for gold in golds:
|
||||
for tag in gold.tags:
|
||||
if tag is not None:
|
||||
nlp.tagger.add_label(tag)
|
||||
# Replace labels that didn't make the frequency cutoff
|
||||
actions = set(nlp.parser.labels)
|
||||
label_set = set([act.split("-")[1] for act in actions if "-" in act])
|
||||
for gold in golds:
|
||||
for i, label in enumerate(gold.labels):
|
||||
if label is not None and label not in label_set:
|
||||
gold.labels[i] = label.split("||")[0]
|
||||
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
|
||||
|
||||
|
||||
########################
|
||||
# Command line helpers #
|
||||
########################
|
||||
|
||||
|
||||
@attr.s
|
||||
class Config(object):
|
||||
vectors = attr.ib(default=None)
|
||||
max_doc_length = attr.ib(default=10)
|
||||
multitask_tag = attr.ib(default=True)
|
||||
multitask_sent = attr.ib(default=True)
|
||||
nr_epoch = attr.ib(default=30)
|
||||
batch_size = attr.ib(default=1000)
|
||||
dropout = attr.ib(default=0.2)
|
||||
|
||||
@classmethod
|
||||
def load(cls, loc):
|
||||
with Path(loc).open("r", encoding="utf8") as file_:
|
||||
cfg = json.load(file_)
|
||||
return cls(**cfg)
|
||||
|
||||
|
||||
class Dataset(object):
|
||||
def __init__(self, path, section):
|
||||
self.path = path
|
||||
self.section = section
|
||||
self.conllu = None
|
||||
self.text = None
|
||||
for file_path in self.path.iterdir():
|
||||
name = file_path.parts[-1]
|
||||
if section in name and name.endswith("conllu"):
|
||||
self.conllu = file_path
|
||||
elif section in name and name.endswith("txt"):
|
||||
self.text = file_path
|
||||
if self.conllu is None:
|
||||
msg = "Could not find .conllu file in {path} for {section}"
|
||||
raise IOError(msg.format(section=section, path=path))
|
||||
if self.text is None:
|
||||
msg = "Could not find .txt file in {path} for {section}"
|
||||
raise IOError(msg.format(section=section, path=path))
|
||||
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
|
||||
|
||||
|
||||
class TreebankPaths(object):
|
||||
def __init__(self, ud_path, treebank, **cfg):
|
||||
self.train = Dataset(ud_path / treebank, "train")
|
||||
self.dev = Dataset(ud_path / treebank, "dev")
|
||||
self.lang = self.train.lang
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
||||
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
||||
config=("Path to json formatted config file", "positional", None, Config.load),
|
||||
corpus=(
|
||||
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
limit=("Size limit", "option", "n", int),
|
||||
)
|
||||
def main(ud_dir, parses_dir, config, corpus, limit=0):
|
||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
||||
Token.set_extension("begins_fused", default=False)
|
||||
Token.set_extension("inside_fused", default=False)
|
||||
|
||||
paths = TreebankPaths(ud_dir, corpus)
|
||||
if not (parses_dir / corpus).exists():
|
||||
(parses_dir / corpus).mkdir()
|
||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
||||
nlp = load_nlp(paths.lang, config)
|
||||
|
||||
docs, golds = read_data(
|
||||
nlp,
|
||||
paths.train.conllu.open(encoding="utf8"),
|
||||
paths.train.text.open(encoding="utf8"),
|
||||
max_doc_length=config.max_doc_length,
|
||||
limit=limit,
|
||||
)
|
||||
|
||||
optimizer = initialize_pipeline(nlp, docs, golds, config)
|
||||
|
||||
for i in range(config.nr_epoch):
|
||||
docs = [nlp.make_doc(doc.text) for doc in docs]
|
||||
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
|
||||
losses = {}
|
||||
n_train_words = sum(len(doc) for doc in docs)
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
for batch in batches:
|
||||
batch_docs, batch_gold = zip(*batch)
|
||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
||||
nlp.update(
|
||||
batch_docs,
|
||||
batch_gold,
|
||||
sgd=optimizer,
|
||||
drop=config.dropout,
|
||||
losses=losses,
|
||||
)
|
||||
|
||||
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
||||
with nlp.use_params(optimizer.averages):
|
||||
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
|
||||
print_progress(i, losses, scores)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
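The minibatch_by_words helper in the training script above groups examples by a word budget rather than by a fixed number of documents. The sketch below mirrors that loop on plain lists. It assumes each item is a (words, label) pair and an integer size, and it leaves out the random shuffle so the result is reproducible. The name batch_by_words is hypothetical.

```python
def batch_by_words(items, size=5000):
    """Group (words, label) pairs into batches of roughly `size` words."""
    items = iter(items)
    while True:
        budget = size
        batch = []
        # Keep adding items until the word budget is used up. The item that
        # crosses the limit is still included, so batches can exceed `size`.
        while budget >= 0:
            try:
                words, label = next(items)
            except StopIteration:
                if batch:
                    yield batch
                return
            budget -= len(words)
            batch.append((words, label))
        if batch:
            yield batch
        else:
            return  # a negative size would otherwise loop forever


items = [(["a"] * 4, 0), (["b"] * 4, 1), (["c"] * 4, 2), (["d"] * 4, 3)]
for batch in batch_by_words(items, size=6):
    print([label for _, label in batch], sum(len(w) for w, _ in batch))
```

Because the budget check happens before each item is added, a batch stops growing only after it has passed the limit, which is why the batch sizes are approximate.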
|
|
@ -1,114 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
|
||||
"""Example of defining a knowledge base in spaCy,
|
||||
which is needed to implement entity linking functionality.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Knowledge base: https://spacy.io/api/kb
|
||||
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking
|
||||
|
||||
Compatible with: spaCy v2.2.4
|
||||
Last tested with: v2.2.4
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
import spacy
|
||||
from spacy.kb import KnowledgeBase
|
||||
|
||||
|
||||
# Q2146908 (Russ Cochran): American golfer
|
||||
# Q7381115 (Russ Cochran): publisher
|
||||
ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name, should have pretrained word embeddings", "positional", None, str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
)
|
||||
def main(model=None, output_dir=None):
|
||||
"""Load the model and create the KB with pre-defined entity encodings.
|
||||
If an output_dir is provided, the KB will be stored there in a file 'kb'.
|
||||
The updated vocab will also be written to a directory in the output_dir."""
|
||||
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
|
||||
# check the length of the nlp vectors
|
||||
if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
|
||||
raise ValueError(
|
||||
"The `nlp` object should have access to pretrained word vectors, "
|
||||
"cf. https://spacy.io/usage/models#languages."
|
||||
)
|
||||
|
||||
# You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality.
|
||||
# For simplicity, we'll just use the original vector dimension here instead.
|
||||
vectors_dim = nlp.vocab.vectors.shape[1]
|
||||
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)
|
||||
|
||||
# set up the data
|
||||
entity_ids = []
|
||||
descr_embeddings = []
|
||||
freqs = []
|
||||
for key, value in ENTITIES.items():
|
||||
desc, freq = value
|
||||
entity_ids.append(key)
|
||||
descr_embeddings.append(nlp(desc).vector)
|
||||
freqs.append(freq)
|
||||
|
||||
# set the entities, can also be done by calling `kb.add_entity` for each entity
|
||||
kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=descr_embeddings)
|
||||
|
||||
# adding aliases, the entities need to be defined in the KB beforehand
|
||||
kb.add_alias(
|
||||
alias="Russ Cochran",
|
||||
entities=["Q2146908", "Q7381115"],
|
||||
probabilities=[0.24, 0.7], # the sum of these probabilities should not exceed 1
|
||||
)
|
||||
|
||||
# print the entities and aliases now stored in the KB
|
||||
print()
|
||||
_print_kb(kb)
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
kb_path = str(output_dir / "kb")
|
||||
kb.dump(kb_path)
|
||||
print()
|
||||
print("Saved KB to", kb_path)
|
||||
|
||||
vocab_path = output_dir / "vocab"
|
||||
kb.vocab.to_disk(vocab_path)
|
||||
print("Saved vocab to", vocab_path)
|
||||
|
||||
print()
|
||||
|
||||
# test the saved model
|
||||
# always reload a knowledge base with the same vocab instance!
|
||||
print("Loading vocab from", vocab_path)
|
||||
print("Loading KB from", kb_path)
|
||||
vocab2 = Vocab().from_disk(vocab_path)
|
||||
kb2 = KnowledgeBase(vocab=vocab2)
|
||||
kb2.load_bulk(kb_path)
|
||||
print()
|
||||
_print_kb(kb2)
|
||||
|
||||
|
||||
def _print_kb(kb):
|
||||
print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
|
||||
print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# 2 kb entities: ['Q2146908', 'Q7381115']
|
||||
# 1 kb aliases: ['Russ Cochran']
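The `add_alias` call above notes that the prior probabilities for one alias should not sum to more than 1, and that the entities have to be defined in the KB beforehand. A minimal sketch of that sanity check in plain Python follows. `check_alias_priors` is a hypothetical helper written for illustration, not a spaCy function, and it assumes the KB's entity IDs are available as a plain collection of strings.

```python
def check_alias_priors(entities, probabilities, known_entities):
    """Hypothetical pre-check before calling kb.add_alias (not part of spaCy)."""
    if len(entities) != len(probabilities):
        raise ValueError("entities and probabilities must have the same length")
    unknown = [ent for ent in entities if ent not in known_entities]
    if unknown:
        raise ValueError("entities missing from the KB: %s" % unknown)
    total = sum(probabilities)
    # allow a tiny tolerance for floating-point rounding
    if total > 1.0 + 1e-9:
        raise ValueError("prior probabilities sum to %.3f, which exceeds 1" % total)
    return total


# values taken from the example script above
total = check_alias_priors(
    ["Q2146908", "Q7381115"], [0.24, 0.7], {"Q2146908", "Q7381115"}
)
print("Sum of priors:", round(total, 2))
```

The priors may sum to less than 1, which presumably leaves some probability mass for entities that are not in the KB.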
|
|
@ -1,89 +0,0 @@
|
|||
"""This example shows how to add a multi-task objective that is trained
|
||||
alongside the entity recognizer. This is an alternative to adding features
|
||||
to the model.
|
||||
|
||||
The multi-task idea is to train an auxiliary model to predict some attribute,
|
||||
with weights shared between the auxiliary model and the main model. In this
|
||||
example, we're predicting the position of the word in the document.
|
||||
|
||||
The model that predicts the position of the word encourages the convolutional
|
||||
layers to include the position information in their representation. The
|
||||
information is then available to the main model, as a feature.
|
||||
|
||||
The overall idea is that we might know something about what sort of features
|
||||
we'd like the CNN to extract. The multi-task objectives can encourage the
|
||||
extraction of this type of feature. The multi-task objective is only used
|
||||
during training. We discard the auxiliary model before run-time.
|
||||
|
||||
The specific example here is not necessarily a good idea --- but it shows
|
||||
how an arbitrary objective function for some word can be used.
|
||||
|
||||
Developed and tested for spaCy 2.0.6. Updated for v2.2.2
|
||||
"""
|
||||
import random
|
||||
import plac
|
||||
import spacy
|
||||
import os.path
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import read_json_file, GoldParse
|
||||
|
||||
random.seed(0)
|
||||
|
||||
PWD = os.path.dirname(__file__)
|
||||
|
||||
TRAIN_DATA = list(read_json_file(
|
||||
os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))
|
||||
|
||||
|
||||
def get_position_label(i, words, tags, heads, labels, ents):
|
||||
"""Return labels indicating the position of the word in the document.
|
||||
"""
|
||||
if len(words) < 20:
|
||||
return "short-doc"
|
||||
elif i == 0:
|
||||
return "first-word"
|
||||
elif i < 10:
|
||||
return "early-word"
|
||||
elif i < 20:
|
||||
return "mid-word"
|
||||
elif i == len(words) - 1:
|
||||
return "last-word"
|
||||
else:
|
||||
return "late-word"
|
||||
|
||||
|
||||
def main(n_iter=10):
|
||||
nlp = spacy.blank("en")
|
||||
ner = nlp.create_pipe("ner")
|
||||
ner.add_multitask_objective(get_position_label)
|
||||
nlp.add_pipe(ner)
|
||||
print(nlp.pipeline)
|
||||
|
||||
print("Create data", len(TRAIN_DATA))
|
||||
optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annot_brackets in TRAIN_DATA:
|
||||
for annotations, _ in annot_brackets:
|
||||
doc = Doc(nlp.vocab, words=annotations[1])
|
||||
gold = GoldParse.from_annot_tuples(doc, annotations)
|
||||
nlp.update(
|
||||
[doc], # batch of texts
|
||||
[gold], # batch of annotations
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses,
|
||||
)
|
||||
print(losses.get("nn_labeller", 0.0), losses["ner"])
|
||||
|
||||
# test the trained model
|
||||
for text, _ in TRAIN_DATA:
|
||||
if text is not None:
|
||||
doc = nlp(text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
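To see what the auxiliary objective is actually asked to predict, the labelling function can be run on its own. The sketch below repeats the branching of `get_position_label` with only the two arguments it uses; `position_label` and the 25-token dummy document are made up for illustration.

```python
from collections import Counter


def position_label(i, words):
    """Same branching as get_position_label above, minus the unused arguments."""
    if len(words) < 20:
        return "short-doc"
    elif i == 0:
        return "first-word"
    elif i < 10:
        return "early-word"
    elif i < 20:
        return "mid-word"
    elif i == len(words) - 1:
        return "last-word"
    else:
        return "late-word"


words = ["w%d" % n for n in range(25)]  # made-up 25-token document
labels = [position_label(i, words) for i in range(len(words))]
counts = Counter(labels)
print(counts)
```

Any document shorter than 20 tokens gets the single label "short-doc" for every token, so the position classes only carry signal on longer inputs.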
|
|
@ -1,217 +0,0 @@
|
|||
"""This script is experimental.
|
||||
|
||||
Try pre-training the CNN component of the text categorizer using a cheap
|
||||
language modelling-like objective. Specifically, we load pretrained vectors
|
||||
(from something like word2vec, GloVe, FastText etc), and use the CNN to
|
||||
predict the tokens' pretrained vectors. This isn't as easy as it sounds:
|
||||
we're not merely doing compression here, because heavy dropout is applied,
|
||||
including over the input words. This means the model must often (50% of the time)
|
||||
use the context in order to predict the word.
|
||||
|
||||
To evaluate the technique, we're pre-training with the 50k texts from the IMDB
|
||||
corpus, and then training with only 100 labels. Note that it's a bit dirty to
|
||||
pre-train with the development data, but also not *so* terrible: we're not using
|
||||
the development labels, after all --- only the unlabelled text.
|
||||
"""
|
||||
import plac
|
||||
import tqdm
|
||||
import random
|
||||
import spacy
|
||||
import thinc.extra.datasets
|
||||
from spacy.util import minibatch, use_gpu, compounding
|
||||
from spacy._ml import Tok2Vec
|
||||
from spacy.pipeline import TextCategorizer
|
||||
import numpy
|
||||
|
||||
|
||||
def load_texts(limit=0):
|
||||
train, dev = thinc.extra.datasets.imdb()
|
||||
train_texts, train_labels = zip(*train)
|
||||
dev_texts, dev_labels = zip(*dev)
|
||||
train_texts = list(train_texts)
|
||||
dev_texts = list(dev_texts)
|
||||
random.shuffle(train_texts)
|
||||
random.shuffle(dev_texts)
|
||||
if limit >= 1:
|
||||
return train_texts[:limit]
|
||||
else:
|
||||
return list(train_texts) + list(dev_texts)
|
||||
|
||||
|
||||
def load_textcat_data(limit=0):
|
||||
"""Load data from the IMDB dataset."""
|
||||
# Train on the last `limit` examples of the train split; evaluate on the dev split
|
||||
train_data, eval_data = thinc.extra.datasets.imdb()
|
||||
random.shuffle(train_data)
|
||||
train_data = train_data[-limit:]
|
||||
texts, labels = zip(*train_data)
|
||||
eval_texts, eval_labels = zip(*eval_data)
|
||||
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
|
||||
eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels]
|
||||
return (texts, cats), (eval_texts, eval_cats)
|
||||
|
||||
|
||||
def prefer_gpu():
|
||||
used = spacy.util.use_gpu(0)
|
||||
if used is None:
|
||||
return False
|
||||
else:
|
||||
import cupy.random
|
||||
|
||||
cupy.random.seed(0)
|
||||
return True
|
||||
|
||||
|
||||
def build_textcat_model(tok2vec, nr_class, width):
|
||||
from thinc.v2v import Model, Softmax, Maxout
|
||||
from thinc.api import flatten_add_lengths, chain
|
||||
from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
|
||||
from thinc.misc import Residual, LayerNorm
|
||||
from spacy._ml import logistic, zero_init
|
||||
|
||||
with Model.define_operators({">>": chain}):
|
||||
model = (
|
||||
tok2vec
|
||||
>> flatten_add_lengths
|
||||
>> Pooling(mean_pool)
|
||||
>> Softmax(nr_class, width)
|
||||
)
|
||||
model.tok2vec = tok2vec
|
||||
return model
|
||||
|
||||
|
||||
def block_gradients(model):
|
||||
from thinc.api import wrap
|
||||
|
||||
def forward(X, drop=0.0):
|
||||
Y, _ = model.begin_update(X, drop=drop)
|
||||
return Y, None
|
||||
|
||||
return wrap(forward, model)
|
||||
|
||||
|
||||
def create_pipeline(width, embed_size, vectors_model):
|
||||
print("Load vectors")
|
||||
nlp = spacy.load(vectors_model)
|
||||
print("Start training")
|
||||
textcat = TextCategorizer(
|
||||
nlp.vocab,
|
||||
labels=["POSITIVE", "NEGATIVE"],
|
||||
model=build_textcat_model(
|
||||
Tok2Vec(width=width, embed_size=embed_size), 2, width
|
||||
),
|
||||
)
|
||||
|
||||
nlp.add_pipe(textcat)
|
||||
return nlp
|
||||
|
||||
|
||||
def train_tensorizer(nlp, texts, dropout, n_iter):
|
||||
tensorizer = nlp.create_pipe("tensorizer")
|
||||
nlp.add_pipe(tensorizer)
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(n_iter):
|
||||
losses = {}
|
||||
for batch in minibatch(tqdm.tqdm(texts)):
|
||||
docs = [nlp.make_doc(text) for text in batch]
|
||||
tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
|
||||
print(losses)
|
||||
return optimizer
|
||||
|
||||
|
||||
def train_textcat(nlp, n_texts, n_iter=10):
|
||||
textcat = nlp.get_pipe("textcat")
|
||||
tok2vec_weights = textcat.model.tok2vec.to_bytes()
|
||||
(train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts)
|
||||
print(
|
||||
"Using {} examples ({} training, {} evaluation)".format(
|
||||
n_texts, len(train_texts), len(dev_texts)
|
||||
)
|
||||
)
|
||||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
textcat.model.tok2vec.from_bytes(tok2vec_weights)
|
||||
print("Training the model...")
|
||||
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
|
||||
for i in range(n_iter):
|
||||
losses = {"textcat": 0.0}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(tqdm.tqdm(train_data), size=2)
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||
with textcat.model.use_params(optimizer.averages):
|
||||
# evaluate on the dev data returned by load_textcat_data()
|
||||
scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||
print(
|
||||
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
|
||||
losses["textcat"],
|
||||
scores["textcat_p"],
|
||||
scores["textcat_r"],
|
||||
scores["textcat_f"],
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def evaluate_textcat(tokenizer, textcat, texts, cats):
|
||||
docs = (tokenizer(text) for text in texts)
|
||||
tp = 1e-8
|
||||
fp = 1e-8
|
||||
tn = 1e-8
|
||||
fn = 1e-8
|
||||
for i, doc in enumerate(textcat.pipe(docs)):
|
||||
gold = cats[i]
|
||||
for label, score in doc.cats.items():
|
||||
if label not in gold:
|
||||
continue
|
||||
if score >= 0.5 and gold[label] >= 0.5:
|
||||
tp += 1.0
|
||||
elif score >= 0.5 and gold[label] < 0.5:
|
||||
fp += 1.0
|
||||
elif score < 0.5 and gold[label] < 0.5:
|
||||
tn += 1
|
||||
elif score < 0.5 and gold[label] >= 0.5:
|
||||
fn += 1
|
||||
precision = tp / (tp + fp)
|
||||
recall = tp / (tp + fn)
|
||||
f_score = 2 * (precision * recall) / (precision + recall)
|
||||
return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
width=("Width of CNN layers", "positional", None, int),
|
||||
embed_size=("Embedding rows", "positional", None, int),
|
||||
pretrain_iters=("Number of iterations to pretrain", "option", "pn", int),
|
||||
train_iters=("Number of iterations to train", "option", "tn", int),
|
||||
train_examples=("Number of labelled examples", "option", "eg", int),
|
||||
vectors_model=("Name or path to vectors model to learn from"),
|
||||
)
|
||||
def main(
|
||||
width,
|
||||
embed_size,
|
||||
vectors_model,
|
||||
pretrain_iters=30,
|
||||
train_iters=30,
|
||||
train_examples=1000,
|
||||
):
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
use_gpu = prefer_gpu()
|
||||
print("Using GPU?", use_gpu)
|
||||
|
||||
nlp = create_pipeline(width, embed_size, vectors_model)
|
||||
print("Load data")
|
||||
texts = load_texts(limit=0)
|
||||
print("Train tensorizer")
|
||||
optimizer = train_tensorizer(nlp, texts, dropout=0.2, n_iter=pretrain_iters)
|
||||
print("Train textcat")
|
||||
train_textcat(nlp, train_examples, n_iter=train_iters)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
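The scoring in `evaluate_textcat` can be tried without a trained model. The sketch below applies the same 0.5 threshold and the same `1e-8` smoothing to plain dictionaries of scores; `prf_from_scores` and the two toy documents are illustrative assumptions, not part of the script.

```python
def prf_from_scores(predicted, gold, threshold=0.5):
    """Precision, recall and F-score over per-label scores (illustrative helper)."""
    # same smoothing as evaluate_textcat: avoids division by zero
    tp = fp = fn = 1e-8
    for doc_scores, doc_gold in zip(predicted, gold):
        for label, score in doc_scores.items():
            if label not in doc_gold:
                continue
            is_pred = score >= threshold
            is_gold = doc_gold[label] >= threshold
            if is_pred and is_gold:
                tp += 1.0
            elif is_pred and not is_gold:
                fp += 1.0
            elif not is_pred and is_gold:
                fn += 1.0
            # true negatives do not enter precision or recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


# toy scores: the first document is classified correctly, the second is not
predicted = [
    {"POSITIVE": 0.9, "NEGATIVE": 0.1},
    {"POSITIVE": 0.4, "NEGATIVE": 0.6},
]
gold = [
    {"POSITIVE": True, "NEGATIVE": False},
    {"POSITIVE": True, "NEGATIVE": False},
]
scores = prf_from_scores(predicted, gold)
print(scores)
```

Because both labels of each document are scored, one misclassified document contributes a false positive and a false negative at the same time.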
|
|
@ -1,97 +0,0 @@
|
|||
"""Prevent catastrophic forgetting with rehearsal updates."""
|
||||
import plac
|
||||
import random
|
||||
import warnings
|
||||
import srsly
|
||||
import spacy
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
LABEL = "ANIMAL"
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"Horses are too tall and they pretend to care about your feelings",
|
||||
{"entities": [(0, 6, "ANIMAL")]},
|
||||
),
|
||||
("Do they bite?", {"entities": []}),
|
||||
(
|
||||
"horses are too tall and they pretend to care about your feelings",
|
||||
{"entities": [(0, 6, "ANIMAL")]},
|
||||
),
|
||||
("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
|
||||
(
|
||||
"they pretend to care about your feelings, those horses",
|
||||
{"entities": [(48, 54, "ANIMAL")]},
|
||||
),
|
||||
("horses?", {"entities": [(0, 6, "ANIMAL")]}),
|
||||
]
|
||||
|
||||
|
||||
def read_raw_data(nlp, jsonl_loc):
|
||||
for json_obj in srsly.read_jsonl(jsonl_loc):
|
||||
if json_obj["text"].strip():
|
||||
doc = nlp.make_doc(json_obj["text"])
|
||||
yield doc
|
||||
|
||||
|
||||
def read_gold_data(nlp, gold_loc):
|
||||
docs = []
|
||||
golds = []
|
||||
for json_obj in srsly.read_jsonl(gold_loc):
|
||||
doc = nlp.make_doc(json_obj["text"])
|
||||
ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
|
||||
gold = GoldParse(doc, entities=ents)
|
||||
docs.append(doc)
|
||||
golds.append(gold)
|
||||
return list(zip(docs, golds))
|
||||
|
||||
|
||||
def main(model_name, unlabelled_loc):
|
||||
n_iter = 10
|
||||
dropout = 0.2
|
||||
batch_size = 4
|
||||
nlp = spacy.load(model_name)
|
||||
nlp.get_pipe("ner").add_label(LABEL)
|
||||
raw_docs = list(read_raw_data(nlp, unlabelled_loc))
|
||||
optimizer = nlp.resume_training()
|
||||
# Avoid use of Adam when resuming training. I don't understand this well
|
||||
# yet, but I'm getting weird results from Adam. Try commenting out the
|
||||
# nlp.update(), and using Adam -- you'll find the models drift apart.
|
||||
# I guess Adam is losing precision, introducing gradient noise?
|
||||
optimizer.alpha = 0.1
|
||||
optimizer.b1 = 0.0
|
||||
optimizer.b2 = 0.0
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
sizes = compounding(1.0, 4.0, 1.001)
|
||||
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
|
||||
# show warnings for misaligned entity spans once
|
||||
warnings.filterwarnings("once", category=UserWarning, module='spacy')
|
||||
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
random.shuffle(raw_docs)
|
||||
losses = {}
|
||||
r_losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
raw_batches = minibatch(raw_docs, size=4)
|
||||
for batch in minibatch(TRAIN_DATA, size=sizes):
|
||||
docs, golds = zip(*batch)
|
||||
nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses)
|
||||
raw_batch = list(next(raw_batches))
|
||||
nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
|
||||
print("Losses", losses)
|
||||
print("R. Losses", r_losses)
|
||||
print(nlp.get_pipe("ner").model.unseen_classes)
|
||||
test_text = "Do you like horses?"
|
||||
doc = nlp(test_text)
|
||||
print("Entities in '%s'" % test_text)
|
||||
for ent in doc.ents:
|
||||
print(ent.label_, ent.text)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
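The training loop above passes `compounding(1.0, 4.0, 1.001)` as the batch size. A rough sketch of how a compounding size schedule feeds a batcher is shown below. Both functions are illustrative re-implementations written for this note and may differ from spaCy's own `compounding` and `minibatch` in detail; the growth factor is set to 2.0 here only so the increase is visible within a few batches.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... clipped at stop (illustrative only)."""
    curr = float(start)
    while True:
        yield min(curr, stop) if start <= stop else max(curr, stop)
        curr *= compound


def minibatch_sketch(items, sizes):
    """Cut items into batches whose sizes come from the `sizes` generator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


# 10 dummy items, batch size doubling from 1 up to a cap of 4
batches = list(minibatch_sketch(range(10), compounding_sketch(1.0, 4.0, 2.0)))
print([len(batch) for batch in batches])
```

With a factor as small as 1.001 the size is truncated to 1 for many steps, so on a handful of examples the schedule mostly produces single-example batches.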
|
|
@ -1,177 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
|
||||
"""Example of training spaCy's entity linker, starting off with a predefined
|
||||
knowledge base and corresponding vocab, and a blank English model.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking
|
||||
|
||||
Compatible with: spaCy v2.2.4
|
||||
Last tested with: v2.2.4
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
import spacy
|
||||
from spacy.kb import KnowledgeBase
|
||||
from spacy.pipeline import EntityRuler
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
def sample_train_data():
|
||||
train_data = []
|
||||
|
||||
# Q2146908 (Russ Cochran): American golfer
|
||||
# Q7381115 (Russ Cochran): publisher
|
||||
|
||||
text_1 = "Russ Cochran his reprints include EC Comics."
|
||||
dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
|
||||
train_data.append((text_1, {"links": dict_1}))
|
||||
|
||||
text_2 = "Russ Cochran has been publishing comic art."
|
||||
dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
|
||||
train_data.append((text_2, {"links": dict_2}))
|
||||
|
||||
text_3 = "Russ Cochran captured his first major title with his son as caddie."
|
||||
dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
|
||||
train_data.append((text_3, {"links": dict_3}))
|
||||
|
||||
text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
|
||||
dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
|
||||
train_data.append((text_4, {"links": dict_4}))
|
||||
|
||||
return train_data
|
||||
|
||||
|
||||
# training data
|
||||
TRAIN_DATA = sample_train_data()
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
kb_path=("Path to the knowledge base", "positional", None, Path),
|
||||
vocab_path=("Path to the vocab for the kb", "positional", None, Path),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(kb_path, vocab_path, output_dir=None, n_iter=50):
|
||||
"""Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
|
||||
The `vocab` should be the one used during creation of the KB."""
|
||||
# create blank English model with correct vocab
|
||||
nlp = spacy.blank("en")
|
||||
nlp.vocab.from_disk(vocab_path)
|
||||
nlp.vocab.vectors.name = "spacy_pretrained_vectors"
|
||||
print("Created blank 'en' model with vocab from '%s'" % vocab_path)
|
||||
|
||||
# Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
|
||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||
|
||||
# Add a custom component to recognize "Russ Cochran" as an entity for the example training data.
|
||||
# Note that in a realistic application, an actual NER algorithm should be used instead.
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
# Create the Entity Linker component and add it to the pipeline.
|
||||
if "entity_linker" not in nlp.pipe_names:
|
||||
# use only the predicted EL score and not the prior probability (for demo purposes)
|
||||
cfg = {"incl_prior": False}
|
||||
entity_linker = nlp.create_pipe("entity_linker", cfg)
|
||||
kb = KnowledgeBase(vocab=nlp.vocab)
|
||||
kb.load_bulk(kb_path)
|
||||
print("Loaded Knowledge Base from '%s'" % kb_path)
|
||||
entity_linker.set_kb(kb)
|
||||
nlp.add_pipe(entity_linker, last=True)
|
||||
|
||||
# Convert the texts to docs to make sure we have doc.ents set for the training examples.
|
||||
# Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
|
||||
kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
|
||||
TRAIN_DOCS = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
with nlp.disable_pipes("entity_linker"):
|
||||
doc = nlp(text)
|
||||
annotation_clean = annotation
|
||||
for offset, kb_id_dict in annotation["links"].items():
|
||||
new_dict = {}
|
||||
for kb_id, value in kb_id_dict.items():
|
||||
if kb_id in kb_ids:
|
||||
new_dict[kb_id] = value
|
||||
else:
|
||||
print(
|
||||
"Removed", kb_id, "from training because it is not in the KB."
|
||||
)
|
||||
annotation_clean["links"][offset] = new_dict
|
||||
TRAIN_DOCS.append((doc, annotation_clean))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train entity linker
|
||||
# reset and initialize the weights randomly
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DOCS)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(
|
||||
texts, # batch of texts
|
||||
annotations, # batch of annotations
|
||||
drop=0.2, # dropout - make it harder to memorise data
|
||||
losses=losses,
|
||||
sgd=optimizer,
|
||||
)
|
||||
print(itn, "Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
_apply_model(nlp)
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print()
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
_apply_model(nlp2)
|
||||
|
||||
|
||||
def _apply_model(nlp):
|
||||
for text, annotation in TRAIN_DATA:
|
||||
# apply the entity linker which will now make predictions for the 'Russ Cochran' entities
|
||||
doc = nlp(text)
|
||||
print()
|
||||
print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output (can be shuffled):
|
||||
|
||||
# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
|
||||
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('his', '', ''), ('reprints', '', ''), ('include', '', ''), ('EC', '', ''), ('Comics', '', ''), ('.', '', '')]
|
||||
|
||||
# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
|
||||
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('has', '', ''), ('been', '', ''), ('publishing', '', ''), ('comic', '', ''), ('art', '', ''), ('.', '', '')]
|
||||
|
||||
# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
|
||||
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('captured', '', ''), ('his', '', ''), ('first', '', ''), ('major', '', ''), ('title', '', ''), ('with', '', ''), ('his', '', ''), ('son', '', ''), ('as', '', ''), ('caddie', '', ''), ('.', '', '')]
|
||||
|
||||
# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
|
||||
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('was', '', ''), ('a', '', ''), ('member', '', ''), ('of', '', ''), ('University', '', ''), ('of', '', ''), ('Kentucky', '', ''), ("'s", '', ''), ('golf', '', ''), ('team', '', ''), ('.', '', '')]
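The clean-up step in `main` drops annotated KB identifiers that the knowledge base does not contain. The sketch below isolates that filtering in plain Python. `clean_links` is a hypothetical helper and `Q0000000` is a made-up identifier added to show the filtering; unlike the loop in the script, this version returns a new dictionary and leaves its input untouched.

```python
def clean_links(links, kb_ids):
    """Keep only KB IDs known to the KB (hypothetical helper, not a spaCy API)."""
    known = set(kb_ids)
    cleaned = {}
    for offset, kb_id_dict in links.items():
        cleaned[offset] = {
            kb_id: value for kb_id, value in kb_id_dict.items() if kb_id in known
        }
    return cleaned


# Q0000000 is a made-up ID that is not in the KB
links = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0, "Q0000000": 0.0}}
cleaned = clean_links(links, ["Q2146908", "Q7381115"])
print(cleaned)
```

Returning a copy keeps the module-level `TRAIN_DATA` unchanged, which matters if the same examples are reused after training.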
|
|
@ -1,195 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
"""Using the parser to recognise your own semantics
|
||||
|
||||
spaCy's parser component can be trained to predict any type of tree
|
||||
structure over your input text. You can also predict trees over whole documents
|
||||
or chat logs, with connections between the sentence-roots used to annotate
|
||||
discourse structure. In this example, we'll build a message parser for a common
|
||||
"chat intent": finding local businesses. Our message semantics will have the
|
||||
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
|
||||
|
||||
"show me the best hotel in berlin"
|
||||
('show', 'ROOT', 'show')
|
||||
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
|
||||
('hotel', 'PLACE', 'show') --> show PLACE hotel
|
||||
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
# training data: texts, heads and dependency labels
|
||||
# for no relation, we simply choose an arbitrary dependency label, e.g. '-'
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"find a cafe with great wifi",
|
||||
{
|
||||
"heads": [0, 2, 0, 5, 5, 2], # index of token head
|
||||
"deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"find a hotel near the beach",
|
||||
{
|
||||
"heads": [0, 2, 0, 5, 5, 2],
|
||||
"deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"find me the closest gym that's open late",
|
||||
{
|
||||
"heads": [0, 0, 4, 4, 0, 6, 4, 6, 6],
|
||||
"deps": [
|
||||
"ROOT",
|
||||
"-",
|
||||
"-",
|
||||
"QUALITY",
|
||||
"PLACE",
|
||||
"-",
|
||||
"-",
|
||||
"ATTRIBUTE",
|
||||
"TIME",
|
||||
],
|
||||
},
|
||||
),
|
||||
(
|
||||
"show me the cheapest store that sells flowers",
|
||||
{
|
||||
"heads": [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
|
||||
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"find a nice restaurant in london",
|
||||
{
|
||||
"heads": [0, 3, 3, 0, 3, 3],
|
||||
"deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"show me the coolest hostel in berlin",
|
||||
{
|
||||
"heads": [0, 0, 4, 4, 0, 4, 4],
|
||||
"deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"find a good italian restaurant near work",
|
||||
{
|
||||
"heads": [0, 4, 4, 4, 0, 4, 5],
|
||||
"deps": [
|
||||
"ROOT",
|
||||
"-",
|
||||
"QUALITY",
|
||||
"ATTRIBUTE",
|
||||
"PLACE",
|
||||
"ATTRIBUTE",
|
||||
"LOCATION",
|
||||
],
|
||||
},
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(model=None, output_dir=None, n_iter=15):
|
||||
"""Load the model, set up the pipeline and train the parser."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank("en") # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# We'll use the built-in dependency parser class, but we want to create a
|
||||
# fresh instance – just in case.
|
||||
if "parser" in nlp.pipe_names:
|
||||
nlp.remove_pipe("parser")
|
||||
parser = nlp.create_pipe("parser")
|
||||
nlp.add_pipe(parser, first=True)
|
||||
|
||||
for text, annotations in TRAIN_DATA:
|
||||
for dep in annotations.get("deps", []):
|
||||
parser.add_label(dep)
|
||||
|
||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
test_model(nlp)
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
test_model(nlp2)
|
||||
|
||||
|
||||
def test_model(nlp):
    texts = [
        "find a hotel with good wifi",
        "find me the cheapest gym near work",
        "show me the best hotel in berlin",
    ]
    docs = nlp.pipe(texts)
    for doc in docs:
        print(doc.text)
        print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# find a hotel with good wifi
|
||||
# [
|
||||
# ('find', 'ROOT', 'find'),
|
||||
# ('hotel', 'PLACE', 'find'),
|
||||
# ('good', 'QUALITY', 'wifi'),
|
||||
# ('wifi', 'ATTRIBUTE', 'hotel')
|
||||
# ]
|
||||
# find me the cheapest gym near work
|
||||
# [
|
||||
# ('find', 'ROOT', 'find'),
|
||||
# ('cheapest', 'QUALITY', 'gym'),
|
||||
# ('gym', 'PLACE', 'find'),
|
||||
# ('near', 'ATTRIBUTE', 'gym'),
|
||||
# ('work', 'LOCATION', 'near')
|
||||
# ]
|
||||
# show me the best hotel in berlin
|
||||
# [
|
||||
# ('show', 'ROOT', 'show'),
|
||||
# ('best', 'QUALITY', 'hotel'),
|
||||
# ('hotel', 'PLACE', 'show'),
|
||||
# ('berlin', 'LOCATION', 'hotel')
|
||||
# ]
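The training loop in this script draws its batch sizes from `compounding(4.0, 32.0, 1.001)` and passes them to `minibatch`. The following is a rough, standard-library-only sketch of that behaviour, not spaCy's own implementation: the names `compounding_sizes` and `make_batches` are hypothetical, and the sketch assumes `start <= stop`, so each size is multiplied by the compound factor and capped at the stop value.

```python
import itertools


def compounding_sizes(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound


def make_batches(items, sizes):
    """Split items into lists whose lengths follow the sizes iterator."""
    iterator = iter(items)
    for size in sizes:
        # Sizes are floats, so truncate to an integer item count.
        batch = list(itertools.islice(iterator, int(size)))
        if not batch:
            break
        yield batch


if __name__ == "__main__":
    sizes = compounding_sizes(4.0, 32.0, 1.001)
    for batch in make_batches(range(10), sizes):
        print(batch)
```

With a factor as small as 1.001 the batch size grows very slowly, so on a tiny training set every batch effectively has the starting size.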
|
|
@ -1,117 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of training spaCy's named entity recognizer, starting off with an
|
||||
existing model or a blank model.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* NER: https://spacy.io/usage/linguistic-features#named-entities
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.2.4
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
import warnings
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
# training data
|
||||
TRAIN_DATA = [
|
||||
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
|
||||
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(model=None, output_dir=None, n_iter=100):
|
||||
"""Load the model, set up the pipeline and train the entity recognizer."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank("en") # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# create the built-in pipeline components and add them to the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if "ner" not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner, last=True)
|
||||
# otherwise, get it so we can add labels
|
||||
else:
|
||||
ner = nlp.get_pipe("ner")
|
||||
|
||||
# add labels
|
||||
for _, annotations in TRAIN_DATA:
|
||||
for ent in annotations.get("entities"):
|
||||
ner.add_label(ent[2])
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
# only train NER
|
||||
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
|
||||
# show warnings for misaligned entity spans once
|
||||
warnings.filterwarnings("once", category=UserWarning, module='spacy')
|
||||
|
||||
# reset and initialize the weights randomly – but only if we're
|
||||
# training a new model
|
||||
if model is None:
|
||||
nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(
|
||||
texts, # batch of texts
|
||||
annotations, # batch of annotations
|
||||
drop=0.5, # dropout - make it harder to memorise data
|
||||
losses=losses,
|
||||
)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
for text, _ in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
for text, _ in TRAIN_DATA:
|
||||
doc = nlp2(text)
|
||||
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Entities [('Shaka Khan', 'PERSON')]
|
||||
# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
|
||||
# ('Khan', 'PERSON', 1), ('?', '', 2)]
|
||||
# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
|
||||
# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
|
||||
# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
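The `entities` annotations above are `(start, end, label)` character offsets into the raw text, with an exclusive end offset. A minimal sketch for sanity-checking such offsets before training, using only the standard library (the helper name `entity_texts` is hypothetical and not part of spaCy):

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(text, annotations):
    """Return (surface text, label) for each (start, end, label) span."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


if __name__ == "__main__":
    for text, annotations in TRAIN_DATA:
        print(entity_texts(text, annotations))
```

Printing the sliced text makes off-by-one offsets easy to spot, since a misaligned span shows up as a clipped or padded string.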
|
|
@ -1,144 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of training an additional entity type
|
||||
|
||||
This script shows how to add a new entity type to an existing pretrained NER
|
||||
model. To keep the example short and simple, only four sentences are provided
|
||||
as examples. In practice, you'll need many more — a few hundred would be a
|
||||
good start. You will also likely need to mix in examples of other entity
|
||||
types, which might be obtained by running the entity recognizer over unlabelled
|
||||
sentences, and adding their annotations to the training set.
|
||||
|
||||
The actual training is performed by looping over the examples, and calling
|
||||
`nlp.update()`. The `update()` method steps through the words of the
|
||||
input. At each word, it makes a prediction. It then consults the annotations
|
||||
provided on the GoldParse instance, to see whether it was right. If it was
|
||||
wrong, it adjusts its weights so that the correct action will score higher
|
||||
next time.
|
||||
|
||||
After training your model, you can save it to a directory. We recommend
|
||||
wrapping models as Python packages, for ease of deployment.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* NER: https://spacy.io/usage/linguistic-features#named-entities
|
||||
|
||||
Compatible with: spaCy v2.1.0+
|
||||
Last tested with: v2.2.4
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
import warnings
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
# new entity label
|
||||
LABEL = "ANIMAL"
|
||||
|
||||
# training data
|
||||
# Note: If you're using an existing model, make sure to mix in examples of
|
||||
# other entity types that spaCy correctly recognized before. Otherwise, your
|
||||
# model might learn the new type, but "forget" what it previously knew.
|
||||
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"Horses are too tall and they pretend to care about your feelings",
|
||||
{"entities": [(0, 6, LABEL)]},
|
||||
),
|
||||
("Do they bite?", {"entities": []}),
|
||||
(
|
||||
"horses are too tall and they pretend to care about your feelings",
|
||||
{"entities": [(0, 6, LABEL)]},
|
||||
),
|
||||
("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
|
||||
(
|
||||
"they pretend to care about your feelings, those horses",
|
||||
{"entities": [(48, 54, LABEL)]},
|
||||
),
|
||||
("horses?", {"entities": [(0, 6, LABEL)]}),
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
new_model_name=("New model name for model meta.", "option", "nm", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
|
||||
"""Set up the pipeline and entity recognizer, and train the new entity."""
|
||||
random.seed(0)
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank("en") # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
# Add entity recognizer to model if it's not in the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if "ner" not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
ner = nlp.get_pipe("ner")
|
||||
|
||||
ner.add_label(LABEL) # add new entity label to entity recognizer
|
||||
# Adding extraneous labels shouldn't mess anything up
|
||||
ner.add_label("VEGETABLE")
|
||||
if model is None:
|
||||
optimizer = nlp.begin_training()
|
||||
else:
|
||||
optimizer = nlp.resume_training()
|
||||
move_names = list(ner.move_names)
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
# only train NER
|
||||
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
|
||||
# show warnings for misaligned entity spans once
|
||||
warnings.filterwarnings("once", category=UserWarning, module='spacy')
|
||||
|
||||
sizes = compounding(1.0, 4.0, 1.001)
|
||||
# batch up the examples using spaCy's minibatch
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
batches = minibatch(TRAIN_DATA, size=sizes)
|
||||
losses = {}
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
test_text = "Do you like horses?"
|
||||
doc = nlp(test_text)
|
||||
print("Entities in '%s'" % test_text)
|
||||
for ent in doc.ents:
|
||||
print(ent.label_, ent.text)
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.meta["name"] = new_model_name # rename model
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
# Check the classes have loaded back consistently
|
||||
assert nlp2.get_pipe("ner").move_names == move_names
|
||||
doc2 = nlp2(test_text)
|
||||
for ent in doc2.ents:
|
||||
print(ent.label_, ent.text)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
|
@ -1,111 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of training spaCy dependency parser, starting off with an existing
|
||||
model or a blank model. For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
# training data
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"They trade mortgage-backed securities.",
|
||||
{
|
||||
"heads": [1, 1, 4, 4, 5, 1, 1],
|
||||
"deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
|
||||
},
|
||||
),
|
||||
(
|
||||
"I like London and Berlin.",
|
||||
{
|
||||
"heads": [1, 1, 1, 2, 2, 1],
|
||||
"deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
|
||||
},
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(model=None, output_dir=None, n_iter=15):
|
||||
"""Load the model, set up the pipeline and train the parser."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank("en") # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# add the parser to the pipeline if it doesn't exist
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if "parser" not in nlp.pipe_names:
|
||||
parser = nlp.create_pipe("parser")
|
||||
nlp.add_pipe(parser, first=True)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
parser = nlp.get_pipe("parser")
|
||||
|
||||
# add labels to the parser
|
||||
for _, annotations in TRAIN_DATA:
|
||||
for dep in annotations.get("deps", []):
|
||||
parser.add_label(dep)
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
test_text = "I like securities."
|
||||
doc = nlp(test_text)
|
||||
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc = nlp2(test_text)
|
||||
print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# expected result:
|
||||
# [
|
||||
# ('I', 'nsubj', 'like'),
|
||||
# ('like', 'ROOT', 'like'),
|
||||
# ('securities', 'dobj', 'like'),
|
||||
# ('.', 'punct', 'like')
|
||||
# ]
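In the parser training data above, `heads` holds one absolute token index per token and `deps` holds the matching label, with the root token pointing at itself. The sketch below turns such an annotation into `(token, dep, head token)` triples. It makes two assumptions: the `words` list is a hand-written tokenization rather than spaCy's tokenizer output, and the helper name `dependency_triples` is hypothetical.

```python
def dependency_triples(words, annotations):
    """Pair each word with its dependency label and the text of its head."""
    heads = annotations["heads"]
    deps = annotations["deps"]
    if not (len(words) == len(heads) == len(deps)):
        raise ValueError("words, heads and deps must have the same length")
    return [(word, dep, words[head]) for word, head, dep in zip(words, heads, deps)]


if __name__ == "__main__":
    words = ["I", "like", "London", "and", "Berlin", "."]  # assumed tokenization
    annotations = {
        "heads": [1, 1, 1, 2, 2, 1],
        "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
    }
    for triple in dependency_triples(words, annotations):
        print(triple)
```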
|
|
@ -1,101 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""
|
||||
A simple example for training a part-of-speech tagger with a custom tag map.
|
||||
To allow us to update the tag map with our custom one, this example starts off
|
||||
with a blank Language class and modifies its defaults. For more details, see
|
||||
the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
Last tested with: v2.1.0
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
# You need to define a mapping from your data's part-of-speech tag names to the
|
||||
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
|
||||
# See here for the Universal Tag Set:
|
||||
# http://universaldependencies.github.io/docs/u/pos/index.html
|
||||
# You may also specify morphological features for your tags, from the universal
|
||||
# scheme.
|
||||
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
|
||||
|
||||
# Usually you'll read this in, of course. Data formats vary. Ensure your
|
||||
# strings are unicode and that the number of tags assigned matches spaCy's
|
||||
# tokenization. If not, you can always add a 'words' key to the annotations
|
||||
# that specifies the gold-standard tokenization, e.g.:
|
||||
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
|
||||
TRAIN_DATA = [
|
||||
("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
|
||||
("Eat blue ham", {"tags": ["V", "J", "N"]}),
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("ISO Code of language to use", "option", "l", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
)
|
||||
def main(lang="en", output_dir=None, n_iter=25):
|
||||
"""Create a new model, set up the pipeline and train the tagger. In order to
|
||||
train the tagger with a custom tag map, we're creating a new Language
|
||||
instance with a custom vocab.
|
||||
"""
|
||||
nlp = spacy.blank(lang)
|
||||
# add the tagger to the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
tagger = nlp.create_pipe("tagger")
|
||||
# Add the tags. This needs to be done before you start training.
|
||||
for tag, values in TAG_MAP.items():
|
||||
tagger.add_label(tag, values)
|
||||
nlp.add_pipe(tagger)
|
||||
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, losses=losses)
|
||||
print("Losses", losses)
|
||||
|
||||
# test the trained model
|
||||
test_text = "I like blue eggs"
|
||||
doc = nlp(test_text)
|
||||
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc = nlp2(test_text)
|
||||
print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# [
|
||||
# ('I', 'N', 'NOUN'),
|
||||
# ('like', 'V', 'VERB'),
|
||||
# ('blue', 'J', 'ADJ'),
|
||||
# ('eggs', 'N', 'NOUN')
|
||||
# ]
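The comments in this script ask for one tag per token and a mapping from each custom tag to a Universal POS tag. A small sketch of that bookkeeping, assuming plain whitespace tokenization instead of spaCy's tokenizer (the helper name `coarse_tags` is hypothetical):

```python
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}


def coarse_tags(text, tags, tag_map=TAG_MAP):
    """Return (word, tag, pos) triples, checking that tags line up with words."""
    words = text.split()  # assumption: whitespace tokenization
    if len(words) != len(tags):
        raise ValueError(
            "expected {} tags, got {}".format(len(words), len(tags))
        )
    return [(word, tag, tag_map[tag]["pos"]) for word, tag in zip(words, tags)]


if __name__ == "__main__":
    print(coarse_tags("I like green eggs", ["N", "V", "J", "N"]))
    print(coarse_tags("Eat blue ham", ["V", "J", "N"]))
```

A length mismatch raises early here, which is the situation where the script's comment suggests adding a `words` key with the gold-standard tokenization.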
|
|
@ -1,160 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Train a convolutional neural network text classifier on the
|
||||
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
|
||||
automatically via Thinc's built-in dataset loader. The model is added to
|
||||
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
|
||||
see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import thinc.extra.datasets
|
||||
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_texts=("Number of texts to train from", "option", "t", int),
|
||||
n_iter=("Number of training iterations", "option", "n", int),
|
||||
init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path),
|
||||
)
|
||||
def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None):
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank("en") # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# add the text classifier to the pipeline if it doesn't exist
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if "textcat" not in nlp.pipe_names:
|
||||
textcat = nlp.create_pipe(
|
||||
"textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
|
||||
)
|
||||
nlp.add_pipe(textcat, last=True)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
textcat = nlp.get_pipe("textcat")
|
||||
|
||||
# add label to text classifier
|
||||
textcat.add_label("POSITIVE")
|
||||
textcat.add_label("NEGATIVE")
|
||||
|
||||
# load the IMDB dataset
|
||||
print("Loading IMDB data...")
|
||||
(train_texts, train_cats), (dev_texts, dev_cats) = load_data()
|
||||
train_texts = train_texts[:n_texts]
|
||||
train_cats = train_cats[:n_texts]
|
||||
print(
|
||||
"Using {} examples ({} training, {} evaluation)".format(
|
||||
n_texts, len(train_texts), len(dev_texts)
|
||||
)
|
||||
)
|
||||
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
|
||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
if init_tok2vec is not None:
|
||||
with init_tok2vec.open("rb") as file_:
|
||||
textcat.model.tok2vec.from_bytes(file_.read())
|
||||
print("Training the model...")
|
||||
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
|
||||
batch_sizes = compounding(4.0, 32.0, 1.001)
|
||||
for i in range(n_iter):
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
random.shuffle(train_data)
|
||||
batches = minibatch(train_data, size=batch_sizes)
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
|
||||
with textcat.model.use_params(optimizer.averages):
|
||||
# evaluate on the dev data split off in load_data()
|
||||
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||
print(
|
||||
"{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table
|
||||
losses["textcat"],
|
||||
scores["textcat_p"],
|
||||
scores["textcat_r"],
|
||||
scores["textcat_f"],
|
||||
)
|
||||
)
|
||||
|
||||
# test the trained model
|
||||
test_text = "This movie sucked"
|
||||
doc = nlp(test_text)
|
||||
print(test_text, doc.cats)
|
||||
|
||||
if output_dir is not None:
|
||||
with nlp.use_params(optimizer.averages):
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
print(test_text, doc2.cats)
|
||||
|
||||
|
||||
def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])
|
||||
|
||||
|
||||
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "NEGATIVE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
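The `evaluate` function in this script scores the `POSITIVE` label against a 0.5 threshold and reports precision, recall and F-score. The same arithmetic can be sketched without spaCy on plain lists of scores and gold booleans. This version is an illustration only: the name `prf_scores` is hypothetical, and it guards against division by zero explicitly instead of seeding the counters with `1e-8` as the script does.

```python
def prf_scores(scores, gold, threshold=0.5):
    """Compute precision, recall and F1 for one label from scores and gold flags."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted and not is_positive:
            fp += 1
        elif not predicted and is_positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f_score}


if __name__ == "__main__":
    # Two true positives, one false negative, one false positive, one true negative.
    print(prf_scores([0.9, 0.8, 0.3, 0.6, 0.2], [True, True, True, False, False]))
```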
|
|
@ -1,49 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Load vectors for a language trained using fastText
|
||||
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import numpy
|
||||
|
||||
import spacy
|
||||
from spacy.language import Language
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
vectors_loc=("Path to .vec file", "positional", None, str),
|
||||
lang=(
|
||||
"Optional language ID. If not set, blank Language() will be used.",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
)
|
||||
def main(vectors_loc, lang=None):
    if lang is None:
        nlp = Language()
    else:
        # create empty language class – this is required if you're planning to
        # save the model to disk and load it back later (models always need a
        # "lang" setting). Use 'xx' for blank multi-language class.
        nlp = spacy.blank(lang)
    with open(vectors_loc, "rb") as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.rstrip().decode("utf8")
            pieces = line.rsplit(" ", int(nr_dim))
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    # test the vectors and similarity
    text = "class colspan"
    doc = nlp(text)
    print(text, doc[0].similarity(doc[1]))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
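The loader above reads a header line with the row count and vector width, then splits each following line from the right so that entries containing spaces stay intact. A self-contained sketch of just the parsing step, using an in-memory byte stream in place of a real `.vec` file (the function name `read_vec` and the sample data are made up for illustration):

```python
import io


def read_vec(stream):
    """Parse a fastText-style text vectors stream opened in binary mode."""
    header = stream.readline().decode("utf8")
    nr_row, nr_dim = (int(value) for value in header.split())
    vectors = {}
    for line in stream:
        line = line.rstrip().decode("utf8")
        # Split from the right: the last nr_dim fields are the vector values.
        pieces = line.rsplit(" ", nr_dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_row, nr_dim, vectors


if __name__ == "__main__":
    sample = io.BytesIO(b"2 3\nhello 0.1 0.2 0.3\nnew york 1 2 3\n")
    nr_row, nr_dim, vectors = read_vec(sample)
    print(nr_row, nr_dim)
    print(vectors)
```

Splitting with `rsplit` and a maximum of `nr_dim` splits is what keeps an entry such as `new york` from being broken at its internal space.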
|
|
@ -1,105 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Visualize spaCy word vectors in Tensorboard.
|
||||
|
||||
Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from os import path
|
||||
|
||||
import tqdm
|
||||
import math
|
||||
import numpy
|
||||
import plac
|
||||
import spacy
|
||||
import tensorflow as tf
|
||||
from tensorflow.contrib.tensorboard.plugins.projector import (
|
||||
visualize_embeddings,
|
||||
ProjectorConfig,
|
||||
)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
|
||||
out_loc=(
|
||||
"Path to output folder for tensorboard session data",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
name=(
|
||||
"Human readable name for tsv file and vectors tensor",
|
||||
"positional",
|
||||
None,
|
||||
str,
|
||||
),
|
||||
)
|
||||
def main(vectors_loc, out_loc, name="spaCy_vectors"):
|
||||
meta_file = "{}.tsv".format(name)
|
||||
out_meta_file = path.join(out_loc, meta_file)
|
||||
|
||||
print("Loading spaCy vectors model: {}".format(vectors_loc))
|
||||
model = spacy.load(vectors_loc)
|
||||
print("Finding lexemes with vectors attached: {}".format(vectors_loc))
|
||||
strings_stream = tqdm.tqdm(
|
||||
model.vocab.strings, total=len(model.vocab.strings), leave=False
|
||||
)
|
||||
queries = [w for w in strings_stream if model.vocab.has_vector(w)]
|
||||
vector_count = len(queries)
|
||||
|
||||
print(
|
||||
"Building Tensorboard Projector metadata for ({}) vectors: {}".format(
|
||||
vector_count, out_meta_file
|
||||
)
|
||||
)
|
||||
|
||||
# Store vector data in a tensorflow variable
|
||||
tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1]))
|
||||
|
||||
# Write a tab-separated file that contains information about the vectors for visualization
|
||||
#
|
||||
# Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
|
||||
with open(out_meta_file, "wb") as file_metadata:
|
||||
# Define columns in the first row
|
||||
file_metadata.write("Text\tFrequency\n".encode("utf-8"))
|
||||
# Write out a row for each vector that we add to the tensorflow variable we created
|
||||
vec_index = 0
|
||||
for text in tqdm.tqdm(queries, total=len(queries), leave=False):
|
||||
# https://github.com/tensorflow/tensorflow/issues/9094
|
||||
text = "<Space>" if text.lstrip() == "" else text
|
||||
lex = model.vocab[text]
|
||||
|
||||
# Store vector data and metadata
|
||||
tf_vectors_variable[vec_index] = model.vocab.get_vector(text)
|
||||
file_metadata.write(
|
||||
"{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode(
|
||||
"utf-8"
|
||||
)
|
||||
)
|
||||
vec_index += 1
|
||||
|
||||
print("Running Tensorflow Session...")
|
||||
sess = tf.InteractiveSession()
|
||||
tf.Variable(tf_vectors_variable, trainable=False, name=name)
|
||||
tf.global_variables_initializer().run()
|
||||
saver = tf.train.Saver()
|
||||
writer = tf.summary.FileWriter(out_loc, sess.graph)
|
||||
|
||||
# Link the embeddings into the config
|
||||
config = ProjectorConfig()
|
||||
embed = config.embeddings.add()
|
||||
embed.tensor_name = name
|
||||
embed.metadata_path = meta_file
|
||||
|
||||
# Tell the projector about the configured embeddings and metadata file
|
||||
visualize_embeddings(writer, config)
|
||||
|
||||
# Save session and print run command to the output
|
||||
print("Saving Tensorboard Session...")
|
||||
saver.save(sess, path.join(out_loc, "{}.ckpt".format(name)))
|
||||
print("Done. Run `tensorboard --logdir={0}` to view in Tensorboard".format(out_loc))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(main)
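The metadata file written by this script is a tab-separated table with a `Text\tFrequency` header, where whitespace-only entries are replaced by `<Space>` and the frequency is derived from a log probability. The sketch below reproduces that file layout with the standard library only. The helper name `write_metadata` and the `(text, log_prob)` input pairs are assumptions for illustration and stand in for the lexeme lookups in the script.

```python
import io
import math


def write_metadata(stream, entries):
    """Write a Text/Frequency TSV table from (text, log_prob) pairs."""
    entries = list(entries)
    stream.write("Text\tFrequency\n")
    for text, log_prob in entries:
        # Whitespace-only labels are replaced, as in the script above.
        label = "<Space>" if text.lstrip() == "" else text
        frequency = math.exp(log_prob) * len(entries)
        stream.write("{}\t{}\n".format(label, frequency))


if __name__ == "__main__":
    buffer = io.StringIO()
    write_metadata(buffer, [("cat", 0.0), (" ", 0.0)])
    print(buffer.getvalue())
```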
|
|
@ -1,20 +1,21 @@
|
|||
from pathlib import Path
|
||||
import plac
|
||||
import spacy
|
||||
-from spacy.gold import docs_to_json
+from spacy.training import docs_to_json
|
||||
import srsly
|
||||
import sys
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to 'en'.", "option", "m", str),
|
||||
input_file=("Input file (jsonl)", "positional", None, Path),
|
||||
output_dir=("Output directory", "positional", None, Path),
|
||||
n_texts=("Number of texts to convert", "option", "t", int),
|
||||
)
|
||||
-def convert(model='en', input_file=None, output_dir=None, n_texts=0):
+def convert(model="en", input_file=None, output_dir=None, n_texts=0):
|
||||
# Load model with tokenizer + sentencizer only
|
||||
nlp = spacy.load(model)
|
||||
-nlp.disable_pipes(*nlp.pipe_names)
+nlp.select_pipes(disable=nlp.pipe_names)
|
||||
sentencizer = nlp.create_pipe("sentencizer")
|
||||
nlp.add_pipe(sentencizer, first=True)
|
||||
|
||||
|
@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0):
|
|||
|
||||
srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(convert)
|
154
fabfile.py
vendored
|
@ -1,154 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import contextlib
|
||||
from pathlib import Path
|
||||
from fabric.api import local, lcd, env, settings, prefix
|
||||
from os import path, environ
|
||||
import shutil
|
||||
import sys
|
||||
|
||||
|
||||
PWD = path.dirname(__file__)
|
||||
ENV = environ["VENV_DIR"] if "VENV_DIR" in environ else ".env"
|
||||
VENV_DIR = Path(PWD) / ENV
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def virtualenv(name, create=False, python="/usr/bin/python3.6"):
|
||||
python = Path(python).resolve()
|
||||
env_path = VENV_DIR
|
||||
if create:
|
||||
if env_path.exists():
|
||||
shutil.rmtree(str(env_path))
|
||||
local("{python} -m venv {env_path}".format(python=python, env_path=VENV_DIR))
|
||||
|
||||
def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
|
||||
return local(
|
||||
"source {}/bin/activate && {}".format(env_path, cmd),
|
||||
shell="/bin/bash",
|
||||
capture=False,
|
||||
)
|
||||
|
||||
yield wrapped_local
|
||||
|
||||
|
||||
def env(lang="python3.6"):
|
||||
if VENV_DIR.exists():
|
||||
local("rm -rf {env}".format(env=VENV_DIR))
|
||||
if lang.startswith("python3"):
|
||||
local("{lang} -m venv {env}".format(lang=lang, env=VENV_DIR))
|
||||
else:
|
||||
local("{lang} -m pip install virtualenv --no-cache-dir".format(lang=lang))
|
||||
local(
|
||||
"{lang} -m virtualenv {env} --no-cache-dir".format(lang=lang, env=VENV_DIR)
|
||||
)
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
print(venv_local("python --version", capture=True))
|
||||
venv_local("pip install --upgrade setuptools --no-cache-dir")
|
||||
venv_local("pip install pytest --no-cache-dir")
|
||||
venv_local("pip install wheel --no-cache-dir")
|
||||
venv_local("pip install -r requirements.txt --no-cache-dir")
|
||||
venv_local("pip install pex --no-cache-dir")
|
||||
|
||||
|
||||
def install():
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
venv_local("pip install dist/*.tar.gz")
|
||||
|
||||
|
||||
def make():
|
||||
with lcd(path.dirname(__file__)):
|
||||
local(
|
||||
"export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace",
|
||||
shell="/bin/bash",
|
||||
)
|
||||
|
||||
|
||||
def sdist():
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
with lcd(path.dirname(__file__)):
|
||||
venv_local("python -m pip install -U setuptools srsly")
|
||||
venv_local("python setup.py sdist")
|
||||
|
||||
|
||||
def wheel():
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
with lcd(path.dirname(__file__)):
|
||||
venv_local("python setup.py bdist_wheel")
|
||||
|
||||
|
||||
def pex():
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
with lcd(path.dirname(__file__)):
|
||||
sha = local("git rev-parse --short HEAD", capture=True)
|
||||
venv_local(
|
||||
"pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True
|
||||
)
|
||||
|
||||
|
||||
def clean():
|
||||
with lcd(path.dirname(__file__)):
|
||||
local("rm -f dist/*.whl")
|
||||
local("rm -f dist/*.pex")
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
venv_local("python setup.py clean --all")
|
||||
|
||||
|
||||
def test():
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
with lcd(path.dirname(__file__)):
|
||||
venv_local("pytest -x spacy/tests")
|
||||
|
||||
|
||||
def train():
|
||||
args = environ.get("SPACY_TRAIN_ARGS", "")
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
venv_local("spacy train {args}".format(args=args))
|
||||
|
||||
|
||||
def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=""):
|
||||
is_not_clean = local("git status --porcelain", capture=True)
|
||||
if is_not_clean:
|
||||
print("Repository is not clean")
|
||||
print(is_not_clean)
|
||||
sys.exit(1)
|
||||
git_sha = local("git rev-parse --short HEAD", capture=True)
|
||||
config_checksum = local("sha256sum {config}".format(config=config), capture=True)
|
||||
experiment_dir = Path(experiment_dir) / "{}--{}".format(
|
||||
config_checksum[:6], git_sha
|
||||
)
|
||||
if not experiment_dir.exists():
|
||||
experiment_dir.mkdir()
|
||||
test_data_dir = Path(treebank_dir) / "ud-test-v2.0-conll2017"
|
||||
assert test_data_dir.exists()
|
||||
assert test_data_dir.is_dir()
|
||||
if corpus:
|
||||
corpora = [corpus]
|
||||
else:
|
||||
corpora = ["UD_English", "UD_Chinese", "UD_Japanese", "UD_Vietnamese"]
|
||||
|
||||
local(
|
||||
"cp {config} {experiment_dir}/config.json".format(
|
||||
config=config, experiment_dir=experiment_dir
|
||||
)
|
||||
)
|
||||
with virtualenv(VENV_DIR) as venv_local:
|
||||
for corpus in corpora:
|
||||
venv_local(
|
||||
"spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}".format(
|
||||
treebank_dir=treebank_dir,
|
||||
experiment_dir=experiment_dir,
|
||||
config=config,
|
||||
corpus=corpus,
|
||||
vectors_dir=vectors_dir,
|
||||
)
|
||||
)
|
||||
venv_local(
|
||||
"spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}".format(
|
||||
test_data_dir=test_data_dir,
|
||||
experiment_dir=experiment_dir,
|
||||
config=config,
|
||||
corpus=corpus,
|
||||
)
|
||||
)
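The removed `conll17` task names each experiment directory from the first six characters of the config checksum plus the short git SHA. A minimal sketch of that naming scheme follows, assuming `hashlib` in place of the `sha256sum` shell call used in the fabfile; the function names `experiment_dir_name` and `experiment_dir` are hypothetical and not part of spaCy.

```python
import hashlib
from pathlib import Path


def experiment_dir_name(config_bytes, git_sha):
    """Return '<first 6 hex chars of sha256(config)>--<git sha>'.

    Hypothetical helper: the fabfile takes the first six characters of the
    `sha256sum` output, which begins with the hex digest, so slicing the
    hexdigest here gives the same prefix.
    """
    checksum = hashlib.sha256(config_bytes).hexdigest()
    return "{}--{}".format(checksum[:6], git_sha)


def experiment_dir(base_dir, config_path, git_sha):
    """Join the base directory with the derived experiment name."""
    config_bytes = Path(config_path).read_bytes()
    return Path(base_dir) / experiment_dir_name(config_bytes, git_sha)
```

Because the name depends only on the config contents and the commit, rerunning with an unchanged config on the same commit resolves to the same directory.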
|
|
@ -1,259 +0,0 @@
|
|||
// ISO C9x compliant stdint.h for Microsoft Visual Studio
|
||||
// Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
|
||||
//
|
||||
// Copyright (c) 2006-2013 Alexander Chemeris
|
||||
//
|
||||
// Redistribution and use in source and binary forms, with or without
|
||||
// modification, are permitted provided that the following conditions are met:
|
||||
//
|
||||
// 1. Redistributions of source code must retain the above copyright notice,
|
||||
// this list of conditions and the following disclaimer.
|
||||
//
|
||||
// 2. Redistributions in binary form must reproduce the above copyright
|
||||
// notice, this list of conditions and the following disclaimer in the
|
||||
// documentation and/or other materials provided with the distribution.
|
||||
//
|
||||
// 3. Neither the name of the product nor the names of its contributors may
|
||||
// be used to endorse or promote products derived from this software
|
||||
// without specific prior written permission.
|
||||
//
|
||||
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
|
||||
// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
||||
// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
|
||||
// EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
||||
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
||||
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
|
||||
// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
|
||||
// WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
|
||||
// OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
|
||||
// ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
//
|
||||
///////////////////////////////////////////////////////////////////////////////
|
||||
|
||||
#ifndef _MSC_VER // [
|
||||
#error "Use this header only with Microsoft Visual C++ compilers!"
|
||||
#endif // _MSC_VER ]
|
||||
|
||||
#ifndef _MSC_STDINT_H_ // [
|
||||
#define _MSC_STDINT_H_
|
||||
|
||||
#if _MSC_VER > 1000
|
||||
#pragma once
|
||||
#endif
|
||||
|
||||
#if _MSC_VER >= 1600 // [
|
||||
#include <stdint.h>
|
||||
#else // ] _MSC_VER >= 1600 [
|
||||
|
||||
#include <limits.h>
|
||||
|
||||
// For Visual Studio 6 in C++ mode and for many Visual Studio versions when
|
||||
// compiling for ARM we should wrap <wchar.h> include with 'extern "C++" {}'
|
||||
// or compiler give many errors like this:
|
||||
// error C2733: second C linkage of overloaded function 'wmemchr' not allowed
|
||||
#ifdef __cplusplus
|
||||
extern "C" {
|
||||
#endif
|
||||
# include <wchar.h>
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
// Define _W64 macros to mark types changing their size, like intptr_t.
|
||||
#ifndef _W64
|
||||
# if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300
|
||||
# define _W64 __w64
|
||||
# else
|
||||
# define _W64
|
||||
# endif
|
||||
#endif
|
||||
|
||||
|
||||
// 7.18.1 Integer types
|
||||
|
||||
// 7.18.1.1 Exact-width integer types
|
||||
|
||||
// Visual Studio 6 and Embedded Visual C++ 4 doesn't
|
||||
// realize that, e.g. char has the same size as __int8
|
||||
// so we give up on __intX for them.
|
||||
#if (_MSC_VER < 1300)
|
||||
typedef signed char int8_t;
|
||||
typedef signed short int16_t;
|
||||
typedef signed int int32_t;
|
||||
typedef unsigned char uint8_t;
|
||||
typedef unsigned short uint16_t;
|
||||
typedef unsigned int uint32_t;
|
||||
#else
|
||||
typedef signed __int8 int8_t;
|
||||
typedef signed __int16 int16_t;
|
||||
typedef signed __int32 int32_t;
|
||||
typedef unsigned __int8 uint8_t;
|
||||
typedef unsigned __int16 uint16_t;
|
||||
typedef unsigned __int32 uint32_t;
|
||||
#endif
|
||||
typedef signed __int64 int64_t;
|
||||
typedef unsigned __int64 uint64_t;
|
||||
|
||||
|
||||
// 7.18.1.2 Minimum-width integer types
|
||||
typedef int8_t int_least8_t;
|
||||
typedef int16_t int_least16_t;
|
||||
typedef int32_t int_least32_t;
|
||||
typedef int64_t int_least64_t;
|
||||
typedef uint8_t uint_least8_t;
|
||||
typedef uint16_t uint_least16_t;
|
||||
typedef uint32_t uint_least32_t;
|
||||
typedef uint64_t uint_least64_t;
|
||||
|
||||
// 7.18.1.3 Fastest minimum-width integer types
|
||||
typedef int8_t int_fast8_t;
|
||||
typedef int16_t int_fast16_t;
|
||||
typedef int32_t int_fast32_t;
|
||||
typedef int64_t int_fast64_t;
|
||||
typedef uint8_t uint_fast8_t;
|
||||
typedef uint16_t uint_fast16_t;
|
||||
typedef uint32_t uint_fast32_t;
|
||||
typedef uint64_t uint_fast64_t;
|
||||
|
||||
// 7.18.1.4 Integer types capable of holding object pointers
|
||||
#ifdef _WIN64 // [
|
||||
typedef signed __int64 intptr_t;
|
||||
typedef unsigned __int64 uintptr_t;
|
||||
#else // _WIN64 ][
|
||||
typedef _W64 signed int intptr_t;
|
||||
typedef _W64 unsigned int uintptr_t;
|
||||
#endif // _WIN64 ]
|
||||
|
||||
// 7.18.1.5 Greatest-width integer types
|
||||
typedef int64_t intmax_t;
|
||||
typedef uint64_t uintmax_t;
|
||||
|
||||
|
||||
// 7.18.2 Limits of specified-width integer types
|
||||
|
||||
#if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259
|
||||
|
||||
// 7.18.2.1 Limits of exact-width integer types
|
||||
#define INT8_MIN ((int8_t)_I8_MIN)
|
||||
#define INT8_MAX _I8_MAX
|
||||
#define INT16_MIN ((int16_t)_I16_MIN)
|
||||
#define INT16_MAX _I16_MAX
|
||||
#define INT32_MIN ((int32_t)_I32_MIN)
|
||||
#define INT32_MAX _I32_MAX
|
||||
#define INT64_MIN ((int64_t)_I64_MIN)
|
||||
#define INT64_MAX _I64_MAX
|
||||
#define UINT8_MAX _UI8_MAX
|
||||
#define UINT16_MAX _UI16_MAX
|
||||
#define UINT32_MAX _UI32_MAX
|
||||
#define UINT64_MAX _UI64_MAX
|
||||
|
||||
// 7.18.2.2 Limits of minimum-width integer types
|
||||
#define INT_LEAST8_MIN INT8_MIN
|
||||
#define INT_LEAST8_MAX INT8_MAX
|
||||
#define INT_LEAST16_MIN INT16_MIN
|
||||
#define INT_LEAST16_MAX INT16_MAX
|
||||
#define INT_LEAST32_MIN INT32_MIN
|
||||
#define INT_LEAST32_MAX INT32_MAX
|
||||
#define INT_LEAST64_MIN INT64_MIN
|
||||
#define INT_LEAST64_MAX INT64_MAX
|
||||
#define UINT_LEAST8_MAX UINT8_MAX
|
||||
#define UINT_LEAST16_MAX UINT16_MAX
|
||||
#define UINT_LEAST32_MAX UINT32_MAX
|
||||
#define UINT_LEAST64_MAX UINT64_MAX
|
||||
|
||||
// 7.18.2.3 Limits of fastest minimum-width integer types
|
||||
#define INT_FAST8_MIN INT8_MIN
|
||||
#define INT_FAST8_MAX INT8_MAX
|
||||
#define INT_FAST16_MIN INT16_MIN
|
||||
#define INT_FAST16_MAX INT16_MAX
|
||||
#define INT_FAST32_MIN INT32_MIN
|
||||
#define INT_FAST32_MAX INT32_MAX
|
||||
#define INT_FAST64_MIN INT64_MIN
|
||||
#define INT_FAST64_MAX INT64_MAX
|
||||
#define UINT_FAST8_MAX UINT8_MAX
|
||||
#define UINT_FAST16_MAX UINT16_MAX
|
||||
#define UINT_FAST32_MAX UINT32_MAX
|
||||
#define UINT_FAST64_MAX UINT64_MAX
|
||||
|
||||
// 7.18.2.4 Limits of integer types capable of holding object pointers
|
||||
#ifdef _WIN64 // [
|
||||
# define INTPTR_MIN INT64_MIN
|
||||
# define INTPTR_MAX INT64_MAX
|
||||
# define UINTPTR_MAX UINT64_MAX
|
||||
#else // _WIN64 ][
|
||||
# define INTPTR_MIN INT32_MIN
|
||||
# define INTPTR_MAX INT32_MAX
|
||||
# define UINTPTR_MAX UINT32_MAX
|
||||
#endif // _WIN64 ]
|
||||
|
||||
// 7.18.2.5 Limits of greatest-width integer types
|
||||
#define INTMAX_MIN INT64_MIN
|
||||
#define INTMAX_MAX INT64_MAX
|
||||
#define UINTMAX_MAX UINT64_MAX
|
||||
|
||||
// 7.18.3 Limits of other integer types
|
||||
|
||||
#ifdef _WIN64 // [
|
||||
# define PTRDIFF_MIN _I64_MIN
|
||||
# define PTRDIFF_MAX _I64_MAX
|
||||
#else // _WIN64 ][
|
||||
# define PTRDIFF_MIN _I32_MIN
|
||||
# define PTRDIFF_MAX _I32_MAX
|
||||
#endif // _WIN64 ]
|
||||
|
||||
#define SIG_ATOMIC_MIN INT_MIN
|
||||
#define SIG_ATOMIC_MAX INT_MAX
|
||||
|
||||
#ifndef SIZE_MAX // [
|
||||
# ifdef _WIN64 // [
|
||||
# define SIZE_MAX _UI64_MAX
|
||||
# else // _WIN64 ][
|
||||
# define SIZE_MAX _UI32_MAX
|
||||
# endif // _WIN64 ]
|
||||
#endif // SIZE_MAX ]
|
||||
|
||||
// WCHAR_MIN and WCHAR_MAX are also defined in <wchar.h>
|
||||
#ifndef WCHAR_MIN // [
|
||||
# define WCHAR_MIN 0
|
||||
#endif // WCHAR_MIN ]
|
||||
#ifndef WCHAR_MAX // [
|
||||
# define WCHAR_MAX _UI16_MAX
|
||||
#endif // WCHAR_MAX ]
|
||||
|
||||
#define WINT_MIN 0
|
||||
#define WINT_MAX _UI16_MAX
|
||||
|
||||
#endif // __STDC_LIMIT_MACROS ]
|
||||
|
||||
|
||||
// 7.18.4 Limits of other integer types
|
||||
|
||||
#if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260
|
||||
|
||||
// 7.18.4.1 Macros for minimum-width integer constants
|
||||
|
||||
#define INT8_C(val) val##i8
|
||||
#define INT16_C(val) val##i16
|
||||
#define INT32_C(val) val##i32
|
||||
#define INT64_C(val) val##i64
|
||||
|
||||
#define UINT8_C(val) val##ui8
|
||||
#define UINT16_C(val) val##ui16
|
||||
#define UINT32_C(val) val##ui32
|
||||
#define UINT64_C(val) val##ui64
|
||||
|
||||
// 7.18.4.2 Macros for greatest-width integer constants
|
||||
// These #ifndef's are needed to prevent collisions with <boost/cstdint.hpp>.
|
||||
// Check out Issue 9 for the details.
|
||||
#ifndef INTMAX_C // [
|
||||
# define INTMAX_C INT64_C
|
||||
#endif // INTMAX_C ]
|
||||
#ifndef UINTMAX_C // [
|
||||
# define UINTMAX_C UINT64_C
|
||||
#endif // UINTMAX_C ]
|
||||
|
||||
#endif // __STDC_CONSTANT_MACROS ]
|
||||
|
||||
#endif // _MSC_VER >= 1600 ]
|
||||
|
||||
#endif // _MSC_STDINT_H_ ]
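The limit macros in the removed header map onto compiler constants such as `_I8_MIN` and `_UI64_MAX`. As a rough sketch of the values behind them, assuming two's-complement integers, the limits for a given bit width can be computed as below; `int_limits` is a hypothetical helper, not something defined in the header.

```python
def int_limits(bits):
    """Return (signed_min, signed_max, unsigned_max) for a bit width.

    Assumes two's-complement representation, which is what the
    exact-width types in the header rely on.
    """
    signed_min = -(1 << (bits - 1))
    signed_max = (1 << (bits - 1)) - 1
    unsigned_max = (1 << bits) - 1
    return signed_min, signed_max, unsigned_max
```

For example, `int_limits(8)` corresponds to `INT8_MIN`, `INT8_MAX` and `UINT8_MAX`.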
|
|
@ -1,22 +0,0 @@
|
|||
//-----------------------------------------------------------------------------
|
||||
// MurmurHash2 was written by Austin Appleby, and is placed in the public
|
||||
// domain. The author hereby disclaims copyright to this source code.
|
||||
|
||||
#ifndef _MURMURHASH2_H_
|
||||
#define _MURMURHASH2_H_
|
||||
|
||||
#include <stdint.h>
|
||||
|
||||
//-----------------------------------------------------------------------------
|
||||
|
||||
uint32_t MurmurHash2 ( const void * key, int len, uint32_t seed );
|
||||
uint64_t MurmurHash64A ( const void * key, int len, uint64_t seed );
|
||||
uint64_t MurmurHash64B ( const void * key, int len, uint64_t seed );
|
||||
uint32_t MurmurHash2A ( const void * key, int len, uint32_t seed );
|
||||
uint32_t MurmurHashNeutral2 ( const void * key, int len, uint32_t seed );
|
||||
uint32_t MurmurHashAligned2 ( const void * key, int len, uint32_t seed );
|
||||
|
||||
//-----------------------------------------------------------------------------
|
||||
|
||||
#endif // _MURMURHASH2_H_
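The removed header only declares the hash functions; the implementation is not part of this diff. For orientation, here is an illustrative Python sketch of the 32-bit `MurmurHash2` routine, written from the public-domain reference algorithm and assuming little-endian reads of each 4-byte block. Treat it as a sketch and check it against the reference C source before relying on specific hash values.

```python
import struct

MASK32 = 0xFFFFFFFF  # keep arithmetic within 32 bits, as uint32_t does in C


def murmurhash2(key, seed=0):
    """Illustrative 32-bit MurmurHash2 over a bytes object."""
    m = 0x5BD1E995
    r = 24
    length = len(key)
    h = (seed ^ length) & MASK32

    # Mix four bytes at a time into the hash.
    offset = 0
    while length - offset >= 4:
        (k,) = struct.unpack_from("<I", key, offset)  # little-endian read
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = (h * m) & MASK32
        h ^= k
        offset += 4

    # Handle the last one to three bytes (fall-through switch in the C code).
    tail = key[offset:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & MASK32

    # Final avalanche so the last bytes are well mixed.
    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15
    return h
```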
|
||||
|
|
@ -1,28 +0,0 @@
|
|||
//-----------------------------------------------------------------------------
|
||||
// MurmurHash3 was written by Austin Appleby, and is placed in the public
|
||||
// domain. The author hereby disclaims copyright to this source code.
|
||||
|
||||
#ifndef _MURMURHASH3_H_
|
||||
#define _MURMURHASH3_H_
|
||||
|
||||
#include <stdint.h>
|
||||
|
||||
//-----------------------------------------------------------------------------
|
||||
#ifdef __cplusplus
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
|
||||
void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out );
|
||||
|
||||
void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );
|
||||
|
||||
void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
//-----------------------------------------------------------------------------
|
||||
|
||||
#endif // _MURMURHASH3_H_
|
File diff suppressed because it is too large
|
@ -1,323 +0,0 @@
|
|||
|
||||
#ifdef _UMATHMODULE
|
||||
|
||||
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
|
||||
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
||||
#else
|
||||
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
||||
#endif
|
||||
|
||||
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
|
||||
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
||||
#else
|
||||
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
||||
#endif
|
||||
|
||||
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndData \
|
||||
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int);
|
||||
NPY_NO_EXPORT int PyUFunc_RegisterLoopForType \
|
||||
(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *);
|
||||
NPY_NO_EXPORT int PyUFunc_GenericFunction \
|
||||
(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **);
|
||||
NPY_NO_EXPORT void PyUFunc_f_f_As_d_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_d_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_f_f \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_g_g \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_F_F_As_D_D \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_F_F \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_D_D \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_G_G \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_O_O \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_ff_f_As_dd_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_ff_f \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_dd_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_gg_g \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_FF_F_As_DD_D \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_DD_D \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_FF_F \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_GG_G \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_OO_O \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_O_O_method \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_OO_O_method \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_On_Om \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT int PyUFunc_GetPyValues \
|
||||
(char *, int *, int *, PyObject **);
|
||||
NPY_NO_EXPORT int PyUFunc_checkfperr \
|
||||
(int, PyObject *, int *);
|
||||
NPY_NO_EXPORT void PyUFunc_clearfperr \
|
||||
(void);
|
||||
NPY_NO_EXPORT int PyUFunc_getfperr \
|
||||
(void);
|
||||
NPY_NO_EXPORT int PyUFunc_handlefperr \
|
||||
(int, PyObject *, int, int *);
|
||||
NPY_NO_EXPORT int PyUFunc_ReplaceLoopBySignature \
|
||||
(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *);
|
||||
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndDataAndSignature \
|
||||
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *);
|
||||
NPY_NO_EXPORT int PyUFunc_SetUsesArraysAsData \
|
||||
(void **, size_t);
|
||||
NPY_NO_EXPORT void PyUFunc_e_e \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_e_e_As_f_f \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_e_e_As_d_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_ee_e \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_ee_e_As_ff_f \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT void PyUFunc_ee_e_As_dd_d \
|
||||
(char **, npy_intp *, npy_intp *, void *);
|
||||
NPY_NO_EXPORT int PyUFunc_DefaultTypeResolver \
|
||||
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **);
|
||||
NPY_NO_EXPORT int PyUFunc_ValidateCasting \
|
||||
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **);
|
||||
|
||||
#else
|
||||
|
||||
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
|
||||
#define PyUFunc_API PY_UFUNC_UNIQUE_SYMBOL
|
||||
#endif
|
||||
|
||||
#if defined(NO_IMPORT) || defined(NO_IMPORT_UFUNC)
|
||||
extern void **PyUFunc_API;
|
||||
#else
|
||||
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
|
||||
void **PyUFunc_API;
|
||||
#else
|
||||
static void **PyUFunc_API=NULL;
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#define PyUFunc_Type (*(PyTypeObject *)PyUFunc_API[0])
|
||||
#define PyUFunc_FromFuncAndData \
|
||||
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int)) \
|
||||
PyUFunc_API[1])
|
||||
#define PyUFunc_RegisterLoopForType \
|
||||
(*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *)) \
|
||||
PyUFunc_API[2])
|
||||
#define PyUFunc_GenericFunction \
|
||||
(*(int (*)(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **)) \
|
||||
PyUFunc_API[3])
|
||||
#define PyUFunc_f_f_As_d_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[4])
|
||||
#define PyUFunc_d_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[5])
|
||||
#define PyUFunc_f_f \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[6])
|
||||
#define PyUFunc_g_g \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[7])
|
||||
#define PyUFunc_F_F_As_D_D \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[8])
|
||||
#define PyUFunc_F_F \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[9])
|
||||
#define PyUFunc_D_D \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[10])
|
||||
#define PyUFunc_G_G \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[11])
|
||||
#define PyUFunc_O_O \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[12])
|
||||
#define PyUFunc_ff_f_As_dd_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[13])
|
||||
#define PyUFunc_ff_f \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[14])
|
||||
#define PyUFunc_dd_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[15])
|
||||
#define PyUFunc_gg_g \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[16])
|
||||
#define PyUFunc_FF_F_As_DD_D \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[17])
|
||||
#define PyUFunc_DD_D \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[18])
|
||||
#define PyUFunc_FF_F \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[19])
|
||||
#define PyUFunc_GG_G \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[20])
|
||||
#define PyUFunc_OO_O \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[21])
|
||||
#define PyUFunc_O_O_method \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[22])
|
||||
#define PyUFunc_OO_O_method \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[23])
|
||||
#define PyUFunc_On_Om \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[24])
|
||||
#define PyUFunc_GetPyValues \
|
||||
(*(int (*)(char *, int *, int *, PyObject **)) \
|
||||
PyUFunc_API[25])
|
||||
#define PyUFunc_checkfperr \
|
||||
(*(int (*)(int, PyObject *, int *)) \
|
||||
PyUFunc_API[26])
|
||||
#define PyUFunc_clearfperr \
|
||||
(*(void (*)(void)) \
|
||||
PyUFunc_API[27])
|
||||
#define PyUFunc_getfperr \
|
||||
(*(int (*)(void)) \
|
||||
PyUFunc_API[28])
|
||||
#define PyUFunc_handlefperr \
|
||||
(*(int (*)(int, PyObject *, int, int *)) \
|
||||
PyUFunc_API[29])
|
||||
#define PyUFunc_ReplaceLoopBySignature \
|
||||
(*(int (*)(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *)) \
|
||||
PyUFunc_API[30])
|
||||
#define PyUFunc_FromFuncAndDataAndSignature \
|
||||
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *)) \
|
||||
PyUFunc_API[31])
|
||||
#define PyUFunc_SetUsesArraysAsData \
|
||||
(*(int (*)(void **, size_t)) \
|
||||
PyUFunc_API[32])
|
||||
#define PyUFunc_e_e \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[33])
|
||||
#define PyUFunc_e_e_As_f_f \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[34])
|
||||
#define PyUFunc_e_e_As_d_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[35])
|
||||
#define PyUFunc_ee_e \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[36])
|
||||
#define PyUFunc_ee_e_As_ff_f \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[37])
|
||||
#define PyUFunc_ee_e_As_dd_d \
|
||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
||||
PyUFunc_API[38])
|
||||
#define PyUFunc_DefaultTypeResolver \
|
||||
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **)) \
|
||||
PyUFunc_API[39])
|
||||
#define PyUFunc_ValidateCasting \
|
||||
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **)) \
|
||||
PyUFunc_API[40])
|
||||
|
||||
static int
|
||||
_import_umath(void)
|
||||
{
|
||||
PyObject *numpy = PyImport_ImportModule("numpy.core.umath");
|
||||
PyObject *c_api = NULL;
|
||||
|
||||
if (numpy == NULL) {
|
||||
PyErr_SetString(PyExc_ImportError, "numpy.core.umath failed to import");
|
||||
return -1;
|
||||
}
|
||||
c_api = PyObject_GetAttrString(numpy, "_UFUNC_API");
|
||||
Py_DECREF(numpy);
|
||||
if (c_api == NULL) {
|
||||
PyErr_SetString(PyExc_AttributeError, "_UFUNC_API not found");
|
||||
return -1;
|
||||
}
|
||||
|
||||
#if PY_VERSION_HEX >= 0x03000000
|
||||
if (!PyCapsule_CheckExact(c_api)) {
|
||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCapsule object");
|
||||
Py_DECREF(c_api);
|
||||
return -1;
|
||||
}
|
||||
PyUFunc_API = (void **)PyCapsule_GetPointer(c_api, NULL);
|
||||
#else
|
||||
if (!PyCObject_Check(c_api)) {
|
||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCObject object");
|
||||
Py_DECREF(c_api);
|
||||
return -1;
|
||||
}
|
||||
PyUFunc_API = (void **)PyCObject_AsVoidPtr(c_api);
|
||||
#endif
|
||||
Py_DECREF(c_api);
|
||||
if (PyUFunc_API == NULL) {
|
||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is NULL pointer");
|
||||
return -1;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
#if PY_VERSION_HEX >= 0x03000000
|
||||
#define NUMPY_IMPORT_UMATH_RETVAL NULL
|
||||
#else
|
||||
#define NUMPY_IMPORT_UMATH_RETVAL
|
||||
#endif
|
||||
|
||||
#define import_umath() \
|
||||
do {\
|
||||
UFUNC_NOFPE\
|
||||
if (_import_umath() < 0) {\
|
||||
PyErr_Print();\
|
||||
PyErr_SetString(PyExc_ImportError,\
|
||||
"numpy.core.umath failed to import");\
|
||||
return NUMPY_IMPORT_UMATH_RETVAL;\
|
||||
}\
|
||||
} while(0)
|
||||
|
||||
#define import_umath1(ret) \
|
||||
do {\
|
||||
UFUNC_NOFPE\
|
||||
if (_import_umath() < 0) {\
|
||||
PyErr_Print();\
|
||||
PyErr_SetString(PyExc_ImportError,\
|
||||
"numpy.core.umath failed to import");\
|
||||
return ret;\
|
||||
}\
|
||||
} while(0)
|
||||
|
||||
#define import_umath2(ret, msg) \
|
||||
do {\
|
||||
UFUNC_NOFPE\
|
||||
if (_import_umath() < 0) {\
|
||||
PyErr_Print();\
|
||||
PyErr_SetString(PyExc_ImportError, msg);\
|
||||
return ret;\
|
||||
}\
|
||||
} while(0)
|
||||
|
||||
#define import_ufunc() \
|
||||
do {\
|
||||
UFUNC_NOFPE\
|
||||
if (_import_umath() < 0) {\
|
||||
PyErr_Print();\
|
||||
PyErr_SetString(PyExc_ImportError,\
|
||||
"numpy.core.umath failed to import");\
|
||||
}\
|
||||
} while(0)
|
||||
|
||||
#endif
|
|
@ -1,90 +0,0 @@
|
|||
#ifndef _NPY_INCLUDE_NEIGHBORHOOD_IMP
|
||||
#error You should not include this header directly
|
||||
#endif
|
||||
/*
|
||||
* Private API (here for inline)
|
||||
*/
|
||||
static NPY_INLINE int
|
||||
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter);
|
||||
|
||||
/*
|
||||
* Update to next item of the iterator
|
||||
*
|
||||
* Note: this simply increment the coordinates vector, last dimension
|
||||
* incremented first , i.e, for dimension 3
|
||||
* ...
|
||||
* -1, -1, -1
|
||||
* -1, -1, 0
|
||||
* -1, -1, 1
|
||||
* ....
|
||||
* -1, 0, -1
|
||||
* -1, 0, 0
|
||||
* ....
|
||||
* 0, -1, -1
|
||||
* 0, -1, 0
|
||||
* ....
|
||||
*/
|
||||
#define _UPDATE_COORD_ITER(c) \
|
||||
wb = iter->coordinates[c] < iter->bounds[c][1]; \
|
||||
if (wb) { \
|
||||
iter->coordinates[c] += 1; \
|
||||
return 0; \
|
||||
} \
|
||||
else { \
|
||||
iter->coordinates[c] = iter->bounds[c][0]; \
|
||||
}
|
||||
|
||||
static NPY_INLINE int
|
||||
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter)
|
||||
{
|
||||
npy_intp i, wb;
|
||||
|
||||
for (i = iter->nd - 1; i >= 0; --i) {
|
||||
_UPDATE_COORD_ITER(i)
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Version optimized for 2d arrays, manual loop unrolling
|
||||
*/
|
||||
static NPY_INLINE int
|
||||
_PyArrayNeighborhoodIter_IncrCoord2D(PyArrayNeighborhoodIterObject* iter)
|
||||
{
|
||||
npy_intp wb;
|
||||
|
||||
_UPDATE_COORD_ITER(1)
|
||||
_UPDATE_COORD_ITER(0)
|
||||
|
||||
return 0;
|
||||
}
|
||||
#undef _UPDATE_COORD_ITER
|
||||
|
||||
/*
|
||||
* Advance to the next neighbour
|
||||
*/
|
||||
static NPY_INLINE int
|
||||
PyArrayNeighborhoodIter_Next(PyArrayNeighborhoodIterObject* iter)
|
||||
{
|
||||
_PyArrayNeighborhoodIter_IncrCoord (iter);
|
||||
iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Reset functions
|
||||
*/
|
||||
static NPY_INLINE int
|
||||
PyArrayNeighborhoodIter_Reset(PyArrayNeighborhoodIterObject* iter)
|
||||
{
|
||||
npy_intp i;
|
||||
|
||||
for (i = 0; i < iter->nd; ++i) {
|
||||
iter->coordinates[i] = iter->bounds[i][0];
|
||||
}
|
||||
iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);
|
||||
|
||||
return 0;
|
||||
}
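The neighborhood iterator above advances its coordinate vector like an odometer: the last dimension is incremented first, and a dimension that reaches its upper bound resets to its lower bound and carries into the previous one. A small Python sketch of that stepping rule follows; `incr_coord` and `iterate_neighborhood` are hypothetical names, and `bounds` is assumed to be a list of inclusive `(low, high)` pairs mirroring `iter->bounds[c][0]` and `iter->bounds[c][1]`.

```python
def incr_coord(coordinates, bounds):
    """Advance `coordinates` in place, last dimension first.

    When every dimension is at its upper bound, all of them reset to
    their lower bounds, so the vector wraps to the start.
    """
    for c in range(len(coordinates) - 1, -1, -1):
        low, high = bounds[c]
        if coordinates[c] < high:
            coordinates[c] += 1
            return
        coordinates[c] = low


def iterate_neighborhood(bounds):
    """Yield every coordinate tuple in the neighborhood, in iterator order."""
    coords = [low for low, _ in bounds]
    total = 1
    for low, high in bounds:
        total *= high - low + 1
    for _ in range(total):
        yield tuple(coords)
        incr_coord(coords, bounds)
```

The 2D variant in the header unrolls the same loop by hand, updating dimension 1 and then dimension 0.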
|
|
@ -1,29 +0,0 @@
|
|||
#define NPY_SIZEOF_SHORT SIZEOF_SHORT
|
||||
#define NPY_SIZEOF_INT SIZEOF_INT
|
||||
#define NPY_SIZEOF_LONG SIZEOF_LONG
|
||||
#define NPY_SIZEOF_FLOAT 4
|
||||
#define NPY_SIZEOF_COMPLEX_FLOAT 8
|
||||
#define NPY_SIZEOF_DOUBLE 8
|
||||
#define NPY_SIZEOF_COMPLEX_DOUBLE 16
|
||||
#define NPY_SIZEOF_LONGDOUBLE 16
|
||||
#define NPY_SIZEOF_COMPLEX_LONGDOUBLE 32
|
||||
#define NPY_SIZEOF_PY_INTPTR_T 8
|
||||
#define NPY_SIZEOF_PY_LONG_LONG 8
|
||||
#define NPY_SIZEOF_LONGLONG 8
|
||||
#define NPY_NO_SMP 0
|
||||
#define NPY_HAVE_DECL_ISNAN
|
||||
#define NPY_HAVE_DECL_ISINF
|
||||
#define NPY_HAVE_DECL_ISFINITE
|
||||
#define NPY_HAVE_DECL_SIGNBIT
|
||||
#define NPY_USE_C99_COMPLEX 1
|
||||
#define NPY_HAVE_COMPLEX_DOUBLE 1
|
||||
#define NPY_HAVE_COMPLEX_FLOAT 1
|
||||
#define NPY_HAVE_COMPLEX_LONG_DOUBLE 1
|
||||
#define NPY_USE_C99_FORMATS 1
|
||||
#define NPY_VISIBILITY_HIDDEN __attribute__((visibility("hidden")))
|
||||
#define NPY_ABI_VERSION 0x01000009
|
||||
#define NPY_API_VERSION 0x00000007
|
||||
|
||||
#ifndef __STDC_FORMAT_MACROS
|
||||
#define __STDC_FORMAT_MACROS 1
|
||||
#endif
|
|
@ -1,22 +0,0 @@
|
|||
|
||||
/* This expects the following variables to be defined (besides
|
||||
the usual ones from pyconfig.h):
|
||||
|
||||
SIZEOF_LONG_DOUBLE -- sizeof(long double) or sizeof(double) if no
|
||||
long double is present on platform.
|
||||
CHAR_BIT -- number of bits in a char (usually 8)
|
||||
(should be in limits.h)
|
||||
|
||||
*/
|
||||
|
||||
#ifndef Py_ARRAYOBJECT_H
|
||||
#define Py_ARRAYOBJECT_H
|
||||
|
||||
#include "ndarrayobject.h"
|
||||
#include "npy_interrupt.h"
|
||||
|
||||
#ifdef NPY_NO_PREFIX
|
||||
#include "noprefix.h"
|
||||
#endif
|
||||
|
||||
#endif
|
|
@ -1,175 +0,0 @@
|
|||
#ifndef _NPY_ARRAYSCALARS_H_
|
||||
#define _NPY_ARRAYSCALARS_H_
|
||||
|
||||
#ifndef _MULTIARRAYMODULE
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_bool obval;
|
||||
} PyBoolScalarObject;
|
||||
#endif
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
signed char obval;
|
||||
} PyByteScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
short obval;
|
||||
} PyShortScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
int obval;
|
||||
} PyIntScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
long obval;
|
||||
} PyLongScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_longlong obval;
|
||||
} PyLongLongScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
unsigned char obval;
|
||||
} PyUByteScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
unsigned short obval;
|
||||
} PyUShortScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
unsigned int obval;
|
||||
} PyUIntScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
unsigned long obval;
|
||||
} PyULongScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_ulonglong obval;
|
||||
} PyULongLongScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_half obval;
|
||||
} PyHalfScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
float obval;
|
||||
} PyFloatScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
double obval;
|
||||
} PyDoubleScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_longdouble obval;
|
||||
} PyLongDoubleScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_cfloat obval;
|
||||
} PyCFloatScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_cdouble obval;
|
||||
} PyCDoubleScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_clongdouble obval;
|
||||
} PyCLongDoubleScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
PyObject * obval;
|
||||
} PyObjectScalarObject;
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_datetime obval;
|
||||
PyArray_DatetimeMetaData obmeta;
|
||||
} PyDatetimeScalarObject;
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
npy_timedelta obval;
|
||||
PyArray_DatetimeMetaData obmeta;
|
||||
} PyTimedeltaScalarObject;
|
||||
|
||||
|
||||
typedef struct {
|
||||
PyObject_HEAD
|
||||
char obval;
|
||||
} PyScalarObject;
|
||||
|
||||
#define PyStringScalarObject PyStringObject
|
||||
#define PyUnicodeScalarObject PyUnicodeObject
|
||||
|
||||
typedef struct {
|
||||
PyObject_VAR_HEAD
|
||||
char *obval;
|
||||
PyArray_Descr *descr;
|
||||
int flags;
|
||||
PyObject *base;
|
||||
} PyVoidScalarObject;
|
||||
|
||||
/* Macros
|
||||
Py<Cls><bitsize>ScalarObject
|
||||
Py<Cls><bitsize>ArrType_Type
|
||||
are defined in ndarrayobject.h
|
||||
*/
|
||||
|
||||
#define PyArrayScalar_False ((PyObject *)(&(_PyArrayScalar_BoolValues[0])))
|
||||
#define PyArrayScalar_True ((PyObject *)(&(_PyArrayScalar_BoolValues[1])))
|
||||
#define PyArrayScalar_FromLong(i) \
|
||||
((PyObject *)(&(_PyArrayScalar_BoolValues[((i)!=0)])))
|
||||
#define PyArrayScalar_RETURN_BOOL_FROM_LONG(i) \
|
||||
return Py_INCREF(PyArrayScalar_FromLong(i)), \
|
||||
PyArrayScalar_FromLong(i)
|
||||
#define PyArrayScalar_RETURN_FALSE \
|
||||
return Py_INCREF(PyArrayScalar_False), \
|
||||
PyArrayScalar_False
|
||||
#define PyArrayScalar_RETURN_TRUE \
|
||||
return Py_INCREF(PyArrayScalar_True), \
|
||||
PyArrayScalar_True
|
||||
|
||||
#define PyArrayScalar_New(cls) \
|
||||
Py##cls##ArrType_Type.tp_alloc(&Py##cls##ArrType_Type, 0)
|
||||
#define PyArrayScalar_VAL(obj, cls) \
|
||||
((Py##cls##ScalarObject *)obj)->obval
|
||||
#define PyArrayScalar_ASSIGN(obj, cls, val) \
|
||||
PyArrayScalar_VAL(obj, cls) = val
|
||||
|
||||
#endif
|
|
@ -1,69 +0,0 @@
|
|||
#ifndef __NPY_HALFFLOAT_H__
|
||||
#define __NPY_HALFFLOAT_H__
|
||||
|
||||
#include <Python.h>
|
||||
#include <numpy/npy_math.h>
|
||||
|
||||
#ifdef __cplusplus
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Half-precision routines
|
||||
*/
|
||||
|
||||
/* Conversions */
|
||||
float npy_half_to_float(npy_half h);
|
||||
double npy_half_to_double(npy_half h);
|
||||
npy_half npy_float_to_half(float f);
|
||||
npy_half npy_double_to_half(double d);
|
||||
/* Comparisons */
|
||||
int npy_half_eq(npy_half h1, npy_half h2);
|
||||
int npy_half_ne(npy_half h1, npy_half h2);
|
||||
int npy_half_le(npy_half h1, npy_half h2);
|
||||
int npy_half_lt(npy_half h1, npy_half h2);
|
||||
int npy_half_ge(npy_half h1, npy_half h2);
|
||||
int npy_half_gt(npy_half h1, npy_half h2);
|
||||
/* faster *_nonan variants for when you know h1 and h2 are not NaN */
|
||||
int npy_half_eq_nonan(npy_half h1, npy_half h2);
|
||||
int npy_half_lt_nonan(npy_half h1, npy_half h2);
|
||||
int npy_half_le_nonan(npy_half h1, npy_half h2);
|
||||
/* Miscellaneous functions */
|
||||
int npy_half_iszero(npy_half h);
|
||||
int npy_half_isnan(npy_half h);
|
||||
int npy_half_isinf(npy_half h);
|
||||
int npy_half_isfinite(npy_half h);
|
||||
int npy_half_signbit(npy_half h);
|
||||
npy_half npy_half_copysign(npy_half x, npy_half y);
|
||||
npy_half npy_half_spacing(npy_half h);
|
||||
npy_half npy_half_nextafter(npy_half x, npy_half y);
|
||||
|
||||
/*
|
||||
* Half-precision constants
|
||||
*/
|
||||
|
||||
#define NPY_HALF_ZERO (0x0000u)
|
||||
#define NPY_HALF_PZERO (0x0000u)
|
||||
#define NPY_HALF_NZERO (0x8000u)
|
||||
#define NPY_HALF_ONE (0x3c00u)
|
||||
#define NPY_HALF_NEGONE (0xbc00u)
|
||||
#define NPY_HALF_PINF (0x7c00u)
|
||||
#define NPY_HALF_NINF (0xfc00u)
|
||||
#define NPY_HALF_NAN (0x7e00u)
|
||||
|
||||
#define NPY_MAX_HALF (0x7bffu)
|
||||
|
||||
/*
|
||||
* Bit-level conversions
|
||||
*/
|
||||
|
||||
npy_uint16 npy_floatbits_to_halfbits(npy_uint32 f);
|
||||
npy_uint16 npy_doublebits_to_halfbits(npy_uint64 d);
|
||||
npy_uint32 npy_halfbits_to_floatbits(npy_uint16 h);
|
||||
npy_uint64 npy_halfbits_to_doublebits(npy_uint16 h);
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif
|
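The `NPY_HALF_*` constants in the removed header above are raw 16-bit patterns in the IEEE 754 binary16 layout: 1 sign bit, 5 exponent bits with a bias of 15, and 10 mantissa bits. The sketch below decodes such a pattern by hand to show why `0x3c00` is 1.0 and `0x7bff` is the largest finite value. It is an illustration only, not NumPy's `npy_half_to_float`; the helper names `sketch_scale_pow2` and `sketch_half_to_float` are hypothetical.

```c
#include <math.h>    /* INFINITY and NAN macros only */
#include <stdint.h>

/* Multiply x by 2^e using exact float multiplications. */
static float sketch_scale_pow2(float x, int e)
{
    while (e > 0) { x *= 2.0f; --e; }
    while (e < 0) { x *= 0.5f; ++e; }
    return x;
}

/* Decode an IEEE 754 binary16 bit pattern into a float. */
static float sketch_half_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1f;
    int mantissa = h & 0x3ff;
    float value;

    if (exponent == 0) {
        /* Zero or subnormal: mantissa * 2^-24. */
        value = sketch_scale_pow2((float)mantissa, -24);
    }
    else if (exponent == 0x1f) {
        /* All exponent bits set: infinity or NaN. */
        value = (mantissa == 0) ? INFINITY : NAN;
    }
    else {
        /* Normal: (1024 + mantissa) * 2^(exponent - 25). */
        value = sketch_scale_pow2((float)(mantissa + 1024), exponent - 25);
    }
    return sign ? -value : value;
}
```

Every binary16 value is exactly representable as a float, so this decoding loses no precision.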
File diff suppressed because it is too large
|
@ -1,244 +0,0 @@
|
|||
/*
|
||||
* DON'T INCLUDE THIS DIRECTLY.
|
||||
*/
|
||||
|
||||
#ifndef NPY_NDARRAYOBJECT_H
|
||||
#define NPY_NDARRAYOBJECT_H
|
||||
#ifdef __cplusplus
|
||||
#define CONFUSE_EMACS {
|
||||
#define CONFUSE_EMACS2 }
|
||||
extern "C" CONFUSE_EMACS
|
||||
#undef CONFUSE_EMACS
|
||||
#undef CONFUSE_EMACS2
|
||||
/* ... otherwise a semi-smart indenter (like emacs) tries to indent
|
||||
everything when you're typing */
|
||||
#endif
|
||||
|
||||
#include "ndarraytypes.h"
|
||||
|
||||
/* Includes the "function" C-API -- these are all stored in a
|
||||
list of pointers --- one for each file
|
||||
The two lists are concatenated into one in multiarray.
|
||||
|
||||
They are available as import_array()
|
||||
*/
|
||||
|
||||
#include "__multiarray_api.h"
|
||||
|
||||
|
||||
/* C-API that requires previous API to be defined */
|
||||
|
||||
#define PyArray_DescrCheck(op) (((PyObject*)(op))->ob_type==&PyArrayDescr_Type)
|
||||
|
||||
#define PyArray_Check(op) PyObject_TypeCheck(op, &PyArray_Type)
|
||||
#define PyArray_CheckExact(op) (((PyObject*)(op))->ob_type == &PyArray_Type)
|
||||
|
||||
#define PyArray_HasArrayInterfaceType(op, type, context, out) \
|
||||
((((out)=PyArray_FromStructInterface(op)) != Py_NotImplemented) || \
|
||||
(((out)=PyArray_FromInterface(op)) != Py_NotImplemented) || \
|
||||
(((out)=PyArray_FromArrayAttr(op, type, context)) != \
|
||||
Py_NotImplemented))
|
||||
|
||||
#define PyArray_HasArrayInterface(op, out) \
|
||||
PyArray_HasArrayInterfaceType(op, NULL, NULL, out)
|
||||
|
||||
#define PyArray_IsZeroDim(op) (PyArray_Check(op) && \
|
||||
(PyArray_NDIM((PyArrayObject *)op) == 0))
|
||||
|
||||
#define PyArray_IsScalar(obj, cls) \
|
||||
(PyObject_TypeCheck(obj, &Py##cls##ArrType_Type))
|
||||
|
||||
#define PyArray_CheckScalar(m) (PyArray_IsScalar(m, Generic) || \
|
||||
PyArray_IsZeroDim(m))
|
||||
|
||||
#define PyArray_IsPythonNumber(obj) \
|
||||
(PyInt_Check(obj) || PyFloat_Check(obj) || PyComplex_Check(obj) || \
|
||||
PyLong_Check(obj) || PyBool_Check(obj))
|
||||
|
||||
#define PyArray_IsPythonScalar(obj) \
|
||||
(PyArray_IsPythonNumber(obj) || PyString_Check(obj) || \
|
||||
PyUnicode_Check(obj))
|
||||
|
||||
#define PyArray_IsAnyScalar(obj) \
|
||||
(PyArray_IsScalar(obj, Generic) || PyArray_IsPythonScalar(obj))
|
||||
|
||||
#define PyArray_CheckAnyScalar(obj) (PyArray_IsPythonScalar(obj) || \
|
||||
PyArray_CheckScalar(obj))
|
||||
|
||||
#define PyArray_IsIntegerScalar(obj) (PyInt_Check(obj) \
|
||||
|| PyLong_Check(obj) \
|
||||
|| PyArray_IsScalar((obj), Integer))
|
||||
|
||||
|
||||
#define PyArray_GETCONTIGUOUS(m) (PyArray_ISCONTIGUOUS(m) ? \
|
||||
Py_INCREF(m), (m) : \
|
||||
(PyArrayObject *)(PyArray_Copy(m)))
|
||||
|
||||
#define PyArray_SAMESHAPE(a1,a2) ((PyArray_NDIM(a1) == PyArray_NDIM(a2)) && \
|
||||
PyArray_CompareLists(PyArray_DIMS(a1), \
|
||||
PyArray_DIMS(a2), \
|
||||
PyArray_NDIM(a1)))
|
||||
|
||||
#define PyArray_SIZE(m) PyArray_MultiplyList(PyArray_DIMS(m), PyArray_NDIM(m))
|
||||
#define PyArray_NBYTES(m) (PyArray_ITEMSIZE(m) * PyArray_SIZE(m))
|
||||
#define PyArray_FROM_O(m) PyArray_FromAny(m, NULL, 0, 0, 0, NULL)
|
||||
|
||||
#define PyArray_FROM_OF(m,flags) PyArray_CheckFromAny(m, NULL, 0, 0, flags, \
|
||||
NULL)
|
||||
|
||||
#define PyArray_FROM_OT(m,type) PyArray_FromAny(m, \
|
||||
PyArray_DescrFromType(type), 0, 0, 0, NULL)
|
||||
|
||||
#define PyArray_FROM_OTF(m, type, flags) \
|
||||
PyArray_FromAny(m, PyArray_DescrFromType(type), 0, 0, \
|
||||
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
|
||||
((flags) | NPY_ARRAY_DEFAULT) : (flags)), NULL)
|
||||
|
||||
#define PyArray_FROMANY(m, type, min, max, flags) \
|
||||
PyArray_FromAny(m, PyArray_DescrFromType(type), min, max, \
|
||||
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
|
||||
(flags) | NPY_ARRAY_DEFAULT : (flags)), NULL)
|
||||
|
||||
#define PyArray_ZEROS(m, dims, type, is_f_order) \
|
||||
PyArray_Zeros(m, dims, PyArray_DescrFromType(type), is_f_order)
|
||||
|
||||
#define PyArray_EMPTY(m, dims, type, is_f_order) \
|
||||
PyArray_Empty(m, dims, PyArray_DescrFromType(type), is_f_order)
|
||||
|
||||
#define PyArray_FILLWBYTE(obj, val) memset(PyArray_DATA(obj), val, \
|
||||
PyArray_NBYTES(obj))
|
||||
|
||||
#define PyArray_REFCOUNT(obj) (((PyObject *)(obj))->ob_refcnt)
|
||||
#define NPY_REFCOUNT PyArray_REFCOUNT
|
||||
#define NPY_MAX_ELSIZE (2 * NPY_SIZEOF_LONGDOUBLE)
|
||||
|
||||
#define PyArray_ContiguousFromAny(op, type, min_depth, max_depth) \
|
||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
||||
max_depth, NPY_ARRAY_DEFAULT, NULL)
|
||||
|
||||
#define PyArray_EquivArrTypes(a1, a2) \
|
||||
PyArray_EquivTypes(PyArray_DESCR(a1), PyArray_DESCR(a2))
|
||||
|
||||
#define PyArray_EquivByteorders(b1, b2) \
|
||||
(((b1) == (b2)) || (PyArray_ISNBO(b1) == PyArray_ISNBO(b2)))
|
||||
|
||||
#define PyArray_SimpleNew(nd, dims, typenum) \
|
||||
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
|
||||
|
||||
#define PyArray_SimpleNewFromData(nd, dims, typenum, data) \
|
||||
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, \
|
||||
data, 0, NPY_ARRAY_CARRAY, NULL)
|
||||
|
||||
#define PyArray_SimpleNewFromDescr(nd, dims, descr) \
|
||||
PyArray_NewFromDescr(&PyArray_Type, descr, nd, dims, \
|
||||
NULL, NULL, 0, NULL)
|
||||
|
||||
#define PyArray_ToScalar(data, arr) \
|
||||
PyArray_Scalar(data, PyArray_DESCR(arr), (PyObject *)arr)
|
||||
|
||||
|
||||
/* These might be faster without the dereferencing of obj
|
||||
going on inside -- of course an optimizing compiler should
|
||||
inline the constants inside a for loop making it a moot point
|
||||
*/
|
||||
|
||||
#define PyArray_GETPTR1(obj, i) ((void *)(PyArray_BYTES(obj) + \
|
||||
(i)*PyArray_STRIDES(obj)[0]))
|
||||
|
||||
#define PyArray_GETPTR2(obj, i, j) ((void *)(PyArray_BYTES(obj) + \
|
||||
(i)*PyArray_STRIDES(obj)[0] + \
|
||||
(j)*PyArray_STRIDES(obj)[1]))
|
||||
|
||||
#define PyArray_GETPTR3(obj, i, j, k) ((void *)(PyArray_BYTES(obj) + \
|
||||
(i)*PyArray_STRIDES(obj)[0] + \
|
||||
(j)*PyArray_STRIDES(obj)[1] + \
|
||||
(k)*PyArray_STRIDES(obj)[2]))
|
||||
|
||||
#define PyArray_GETPTR4(obj, i, j, k, l) ((void *)(PyArray_BYTES(obj) + \
|
||||
(i)*PyArray_STRIDES(obj)[0] + \
|
||||
(j)*PyArray_STRIDES(obj)[1] + \
|
||||
(k)*PyArray_STRIDES(obj)[2] + \
|
||||
(l)*PyArray_STRIDES(obj)[3]))
|
||||
|
||||
static NPY_INLINE void
|
||||
PyArray_XDECREF_ERR(PyArrayObject *arr)
|
||||
{
|
||||
if (arr != NULL) {
|
||||
if (PyArray_FLAGS(arr) & NPY_ARRAY_UPDATEIFCOPY) {
|
||||
PyArrayObject *base = (PyArrayObject *)PyArray_BASE(arr);
|
||||
PyArray_ENABLEFLAGS(base, NPY_ARRAY_WRITEABLE);
|
||||
PyArray_CLEARFLAGS(arr, NPY_ARRAY_UPDATEIFCOPY);
|
||||
}
|
||||
Py_DECREF(arr);
|
||||
}
|
||||
}
|
||||
|
||||
#define PyArray_DESCR_REPLACE(descr) do { \
|
||||
PyArray_Descr *_new_; \
|
||||
_new_ = PyArray_DescrNew(descr); \
|
||||
Py_XDECREF(descr); \
|
||||
descr = _new_; \
|
||||
} while(0)
|
||||
|
||||
/* Copy should always return contiguous array */
|
||||
#define PyArray_Copy(obj) PyArray_NewCopy(obj, NPY_CORDER)
|
||||
|
||||
#define PyArray_FromObject(op, type, min_depth, max_depth) \
|
||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
||||
max_depth, NPY_ARRAY_BEHAVED | \
|
||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
||||
|
||||
#define PyArray_ContiguousFromObject(op, type, min_depth, max_depth) \
|
||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
||||
max_depth, NPY_ARRAY_DEFAULT | \
|
||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
||||
|
||||
#define PyArray_CopyFromObject(op, type, min_depth, max_depth) \
|
||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
||||
max_depth, NPY_ARRAY_ENSURECOPY | \
|
||||
NPY_ARRAY_DEFAULT | \
|
||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
||||
|
||||
#define PyArray_Cast(mp, type_num) \
|
||||
PyArray_CastToType(mp, PyArray_DescrFromType(type_num), 0)
|
||||
|
||||
#define PyArray_Take(ap, items, axis) \
|
||||
PyArray_TakeFrom(ap, items, axis, NULL, NPY_RAISE)
|
||||
|
||||
#define PyArray_Put(ap, items, values) \
|
||||
PyArray_PutTo(ap, items, values, NPY_RAISE)
|
||||
|
||||
/* Compatibility with old Numeric stuff -- don't use in new code */
|
||||
|
||||
#define PyArray_FromDimsAndData(nd, d, type, data) \
|
||||
PyArray_FromDimsAndDataAndDescr(nd, d, PyArray_DescrFromType(type), \
|
||||
data)
|
||||
|
||||
|
||||
/*
|
||||
Check to see if this key in the dictionary is the "title"
|
||||
entry of the tuple (i.e. a duplicate dictionary entry in the fields
|
||||
dict).
|
||||
*/
|
||||
|
||||
#define NPY_TITLE_KEY(key, value) ((PyTuple_GET_SIZE((value))==3) && \
|
||||
(PyTuple_GET_ITEM((value), 2) == (key)))
|
||||
|
||||
|
||||
/* Define python version independent deprecation macro */
|
||||
|
||||
#if PY_VERSION_HEX >= 0x02050000
|
||||
#define DEPRECATE(msg) PyErr_WarnEx(PyExc_DeprecationWarning,msg,1)
|
||||
#define DEPRECATE_FUTUREWARNING(msg) PyErr_WarnEx(PyExc_FutureWarning,msg,1)
|
||||
#else
|
||||
#define DEPRECATE(msg) PyErr_Warn(PyExc_DeprecationWarning,msg)
|
||||
#define DEPRECATE_FUTUREWARNING(msg) PyErr_Warn(PyExc_FutureWarning,msg)
|
||||
#endif
|
||||
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
|
||||
#endif /* NPY_NDARRAYOBJECT_H */
|
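The `PyArray_GETPTR1` through `PyArray_GETPTR4` macros in the removed header above all do the same thing: start from the base byte pointer and add each index multiplied by the byte stride of its axis. A standalone sketch of the two-dimensional case, without any Python or NumPy headers, could look like this. The function name `sketch_getptr2` is hypothetical.

```c
#include <stddef.h>

/* Address of element (i, j) in a 2-D buffer described by byte strides.
 * This mirrors the arithmetic of the PyArray_GETPTR2 macro but takes the
 * base pointer and strides directly instead of reading them from an
 * array object. */
static void *sketch_getptr2(char *base, const ptrdiff_t strides[2],
                            ptrdiff_t i, ptrdiff_t j)
{
    return (void *)(base + i * strides[0] + j * strides[1]);
}
```

Because only the strides decide the layout, the same buffer can be read as a transposed view by swapping the two stride values, with no data copied.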
File diff suppressed because it is too large
|
@ -1,209 +0,0 @@
|
|||
#ifndef NPY_NOPREFIX_H
|
||||
#define NPY_NOPREFIX_H
|
||||
|
||||
/*
|
||||
* You can directly include noprefix.h as a backward
|
||||
* compatibility measure
|
||||
*/
|
||||
#ifndef NPY_NO_PREFIX
|
||||
#include "ndarrayobject.h"
|
||||
#include "npy_interrupt.h"
|
||||
#endif
|
||||
|
||||
#define SIGSETJMP NPY_SIGSETJMP
|
||||
#define SIGLONGJMP NPY_SIGLONGJMP
|
||||
#define SIGJMP_BUF NPY_SIGJMP_BUF
|
||||
|
||||
#define MAX_DIMS NPY_MAXDIMS
|
||||
|
||||
#define longlong npy_longlong
|
||||
#define ulonglong npy_ulonglong
|
||||
#define Bool npy_bool
|
||||
#define longdouble npy_longdouble
|
||||
#define byte npy_byte
|
||||
|
||||
#ifndef _BSD_SOURCE
|
||||
#define ushort npy_ushort
|
||||
#define uint npy_uint
|
||||
#define ulong npy_ulong
|
||||
#endif
|
||||
|
||||
#define ubyte npy_ubyte
|
||||
#define ushort npy_ushort
|
||||
#define uint npy_uint
|
||||
#define ulong npy_ulong
|
||||
#define cfloat npy_cfloat
|
||||
#define cdouble npy_cdouble
|
||||
#define clongdouble npy_clongdouble
|
||||
#define Int8 npy_int8
|
||||
#define UInt8 npy_uint8
|
||||
#define Int16 npy_int16
|
||||
#define UInt16 npy_uint16
|
||||
#define Int32 npy_int32
|
||||
#define UInt32 npy_uint32
|
||||
#define Int64 npy_int64
|
||||
#define UInt64 npy_uint64
|
||||
#define Int128 npy_int128
|
||||
#define UInt128 npy_uint128
|
||||
#define Int256 npy_int256
|
||||
#define UInt256 npy_uint256
|
||||
#define Float16 npy_float16
|
||||
#define Complex32 npy_complex32
|
||||
#define Float32 npy_float32
|
||||
#define Complex64 npy_complex64
|
||||
#define Float64 npy_float64
|
||||
#define Complex128 npy_complex128
|
||||
#define Float80 npy_float80
|
||||
#define Complex160 npy_complex160
|
||||
#define Float96 npy_float96
|
||||
#define Complex192 npy_complex192
|
||||
#define Float128 npy_float128
|
||||
#define Complex256 npy_complex256
|
||||
#define intp npy_intp
|
||||
#define uintp npy_uintp
|
||||
#define datetime npy_datetime
|
||||
#define timedelta npy_timedelta
|
||||
|
||||
#define SIZEOF_INTP NPY_SIZEOF_INTP
|
||||
#define SIZEOF_UINTP NPY_SIZEOF_UINTP
|
||||
#define SIZEOF_DATETIME NPY_SIZEOF_DATETIME
|
||||
#define SIZEOF_TIMEDELTA NPY_SIZEOF_TIMEDELTA
|
||||
|
||||
#define LONGLONG_FMT NPY_LONGLONG_FMT
|
||||
#define ULONGLONG_FMT NPY_ULONGLONG_FMT
|
||||
#define LONGLONG_SUFFIX NPY_LONGLONG_SUFFIX
|
||||
#define ULONGLONG_SUFFIX NPY_ULONGLONG_SUFFIX
|
||||
|
||||
#define MAX_INT8 127
|
||||
#define MIN_INT8 -128
|
||||
#define MAX_UINT8 255
|
||||
#define MAX_INT16 32767
|
||||
#define MIN_INT16 -32768
|
||||
#define MAX_UINT16 65535
|
||||
#define MAX_INT32 2147483647
|
||||
#define MIN_INT32 (-MAX_INT32 - 1)
|
||||
#define MAX_UINT32 4294967295U
|
||||
#define MAX_INT64 LONGLONG_SUFFIX(9223372036854775807)
|
||||
#define MIN_INT64 (-MAX_INT64 - LONGLONG_SUFFIX(1))
|
||||
#define MAX_UINT64 ULONGLONG_SUFFIX(18446744073709551615)
|
||||
#define MAX_INT128 LONGLONG_SUFFIX(85070591730234615865843651857942052864)
|
||||
#define MIN_INT128 (-MAX_INT128 - LONGLONG_SUFFIX(1))
|
||||
#define MAX_UINT128 ULONGLONG_SUFFIX(170141183460469231731687303715884105728)
|
||||
#define MAX_INT256 LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
|
||||
#define MIN_INT256 (-MAX_INT256 - LONGLONG_SUFFIX(1))
|
||||
#define MAX_UINT256 ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
|
||||
|
||||
#define MAX_BYTE NPY_MAX_BYTE
|
||||
#define MIN_BYTE NPY_MIN_BYTE
|
||||
#define MAX_UBYTE NPY_MAX_UBYTE
|
||||
#define MAX_SHORT NPY_MAX_SHORT
|
||||
#define MIN_SHORT NPY_MIN_SHORT
|
||||
#define MAX_USHORT NPY_MAX_USHORT
|
||||
#define MAX_INT NPY_MAX_INT
|
||||
#define MIN_INT NPY_MIN_INT
|
||||
#define MAX_UINT NPY_MAX_UINT
|
||||
#define MAX_LONG NPY_MAX_LONG
|
||||
#define MIN_LONG NPY_MIN_LONG
|
||||
#define MAX_ULONG NPY_MAX_ULONG
|
||||
#define MAX_LONGLONG NPY_MAX_LONGLONG
|
||||
#define MIN_LONGLONG NPY_MIN_LONGLONG
|
||||
#define MAX_ULONGLONG NPY_MAX_ULONGLONG
|
||||
#define MIN_DATETIME NPY_MIN_DATETIME
|
||||
#define MAX_DATETIME NPY_MAX_DATETIME
|
||||
#define MIN_TIMEDELTA NPY_MIN_TIMEDELTA
|
||||
#define MAX_TIMEDELTA NPY_MAX_TIMEDELTA
|
||||
|
||||
#define SIZEOF_LONGDOUBLE NPY_SIZEOF_LONGDOUBLE
|
||||
#define SIZEOF_LONGLONG NPY_SIZEOF_LONGLONG
|
||||
#define SIZEOF_HALF NPY_SIZEOF_HALF
|
||||
#define BITSOF_BOOL NPY_BITSOF_BOOL
|
||||
#define BITSOF_CHAR NPY_BITSOF_CHAR
|
||||
#define BITSOF_SHORT NPY_BITSOF_SHORT
|
||||
#define BITSOF_INT NPY_BITSOF_INT
|
||||
#define BITSOF_LONG NPY_BITSOF_LONG
|
||||
#define BITSOF_LONGLONG NPY_BITSOF_LONGLONG
|
||||
#define BITSOF_HALF NPY_BITSOF_HALF
|
||||
#define BITSOF_FLOAT NPY_BITSOF_FLOAT
|
||||
#define BITSOF_DOUBLE NPY_BITSOF_DOUBLE
|
||||
#define BITSOF_LONGDOUBLE NPY_BITSOF_LONGDOUBLE
|
||||
#define BITSOF_DATETIME NPY_BITSOF_DATETIME
|
||||
#define BITSOF_TIMEDELTA NPY_BITSOF_TIMEDELTA
|
||||
|
||||
#define _pya_malloc PyArray_malloc
|
||||
#define _pya_free PyArray_free
|
||||
#define _pya_realloc PyArray_realloc
|
||||
|
||||
#define BEGIN_THREADS_DEF NPY_BEGIN_THREADS_DEF
|
||||
#define BEGIN_THREADS NPY_BEGIN_THREADS
|
||||
#define END_THREADS NPY_END_THREADS
|
||||
#define ALLOW_C_API_DEF NPY_ALLOW_C_API_DEF
|
||||
#define ALLOW_C_API NPY_ALLOW_C_API
|
||||
#define DISABLE_C_API NPY_DISABLE_C_API
|
||||
|
||||
#define PY_FAIL NPY_FAIL
|
||||
#define PY_SUCCEED NPY_SUCCEED
|
||||
|
||||
#ifndef TRUE
|
||||
#define TRUE NPY_TRUE
|
||||
#endif
|
||||
|
||||
#ifndef FALSE
|
||||
#define FALSE NPY_FALSE
|
||||
#endif
|
||||
|
||||
#define LONGDOUBLE_FMT NPY_LONGDOUBLE_FMT
|
||||
|
||||
#define CONTIGUOUS NPY_CONTIGUOUS
|
||||
#define C_CONTIGUOUS NPY_C_CONTIGUOUS
|
||||
#define FORTRAN NPY_FORTRAN
|
||||
#define F_CONTIGUOUS NPY_F_CONTIGUOUS
|
||||
#define OWNDATA NPY_OWNDATA
|
||||
#define FORCECAST NPY_FORCECAST
|
||||
#define ENSURECOPY NPY_ENSURECOPY
|
||||
#define ENSUREARRAY NPY_ENSUREARRAY
|
||||
#define ELEMENTSTRIDES NPY_ELEMENTSTRIDES
|
||||
#define ALIGNED NPY_ALIGNED
|
||||
#define NOTSWAPPED NPY_NOTSWAPPED
|
||||
#define WRITEABLE NPY_WRITEABLE
|
||||
#define UPDATEIFCOPY NPY_UPDATEIFCOPY
|
||||
#define ARR_HAS_DESCR NPY_ARR_HAS_DESCR
|
||||
#define BEHAVED NPY_BEHAVED
|
||||
#define BEHAVED_NS NPY_BEHAVED_NS
|
||||
#define CARRAY NPY_CARRAY
|
||||
#define CARRAY_RO NPY_CARRAY_RO
|
||||
#define FARRAY NPY_FARRAY
|
||||
#define FARRAY_RO NPY_FARRAY_RO
|
||||
#define DEFAULT NPY_DEFAULT
|
||||
#define IN_ARRAY NPY_IN_ARRAY
|
||||
#define OUT_ARRAY NPY_OUT_ARRAY
|
||||
#define INOUT_ARRAY NPY_INOUT_ARRAY
|
||||
#define IN_FARRAY NPY_IN_FARRAY
|
||||
#define OUT_FARRAY NPY_OUT_FARRAY
|
||||
#define INOUT_FARRAY NPY_INOUT_FARRAY
|
||||
#define UPDATE_ALL NPY_UPDATE_ALL
|
||||
|
||||
#define OWN_DATA NPY_OWNDATA
|
||||
#define BEHAVED_FLAGS NPY_BEHAVED
|
||||
#define BEHAVED_FLAGS_NS NPY_BEHAVED_NS
|
||||
#define CARRAY_FLAGS_RO NPY_CARRAY_RO
|
||||
#define CARRAY_FLAGS NPY_CARRAY
|
||||
#define FARRAY_FLAGS NPY_FARRAY
|
||||
#define FARRAY_FLAGS_RO NPY_FARRAY_RO
|
||||
#define DEFAULT_FLAGS NPY_DEFAULT
|
||||
#define UPDATE_ALL_FLAGS NPY_UPDATE_ALL_FLAGS
|
||||
|
||||
#ifndef MIN
|
||||
#define MIN PyArray_MIN
|
||||
#endif
|
||||
#ifndef MAX
|
||||
#define MAX PyArray_MAX
|
||||
#endif
|
||||
#define MAX_INTP NPY_MAX_INTP
|
||||
#define MIN_INTP NPY_MIN_INTP
|
||||
#define MAX_UINTP NPY_MAX_UINTP
|
||||
#define INTP_FMT NPY_INTP_FMT
|
||||
|
||||
#define REFCOUNT PyArray_REFCOUNT
|
||||
#define MAX_ELSIZE NPY_MAX_ELSIZE
|
||||
|
||||
#endif
|
|
@ -1,417 +0,0 @@
|
|||
/*
|
||||
* This is a convenience header file providing compatibility utilities
|
||||
* for supporting Python 2 and Python 3 in the same code base.
|
||||
*
|
||||
* If you want to use this for your own projects, it's recommended to make a
|
||||
* copy of it. Although the stuff below is unlikely to change, we don't provide
|
||||
* strong backwards compatibility guarantees at the moment.
|
||||
*/
|
||||
|
||||
#ifndef _NPY_3KCOMPAT_H_
|
||||
#define _NPY_3KCOMPAT_H_
|
||||
|
||||
#include <Python.h>
|
||||
#include <stdio.h>
|
||||
|
||||
#if PY_VERSION_HEX >= 0x03000000
|
||||
#ifndef NPY_PY3K
|
||||
#define NPY_PY3K 1
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#include "numpy/npy_common.h"
|
||||
#include "numpy/ndarrayobject.h"
|
||||
|
||||
#ifdef __cplusplus
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
/*
|
||||
* PyInt -> PyLong
|
||||
*/
|
||||
|
||||
#if defined(NPY_PY3K)
|
||||
/* Return True only if the long fits in a C long */
|
||||
static NPY_INLINE int PyInt_Check(PyObject *op) {
|
||||
int overflow = 0;
|
||||
if (!PyLong_Check(op)) {
|
||||
return 0;
|
||||
}
|
||||
PyLong_AsLongAndOverflow(op, &overflow);
|
||||
return (overflow == 0);
|
||||
}
|
||||
|
||||
#define PyInt_FromLong PyLong_FromLong
|
||||
#define PyInt_AsLong PyLong_AsLong
|
||||
#define PyInt_AS_LONG PyLong_AsLong
|
||||
#define PyInt_AsSsize_t PyLong_AsSsize_t
|
||||
|
||||
/* NOTE:
|
||||
*
|
||||
* Since the PyLong type is very different from the fixed-range PyInt,
|
||||
* we don't define PyInt_Type -> PyLong_Type.
|
||||
*/
|
||||
#endif /* NPY_PY3K */
|
||||
|
||||
/*
|
||||
* PyString -> PyBytes
|
||||
*/
|
||||
|
||||
#if defined(NPY_PY3K)
|
||||
|
||||
#define PyString_Type PyBytes_Type
|
||||
#define PyString_Check PyBytes_Check
|
||||
#define PyStringObject PyBytesObject
|
||||
#define PyString_FromString PyBytes_FromString
|
||||
#define PyString_FromStringAndSize PyBytes_FromStringAndSize
|
||||
#define PyString_AS_STRING PyBytes_AS_STRING
|
||||
#define PyString_AsStringAndSize PyBytes_AsStringAndSize
|
||||
#define PyString_FromFormat PyBytes_FromFormat
|
||||
#define PyString_Concat PyBytes_Concat
|
||||
#define PyString_ConcatAndDel PyBytes_ConcatAndDel
|
||||
#define PyString_AsString PyBytes_AsString
|
||||
#define PyString_GET_SIZE PyBytes_GET_SIZE
|
||||
#define PyString_Size PyBytes_Size
|
||||
|
||||
#define PyUString_Type PyUnicode_Type
|
||||
#define PyUString_Check PyUnicode_Check
|
||||
#define PyUStringObject PyUnicodeObject
|
||||
#define PyUString_FromString PyUnicode_FromString
|
||||
#define PyUString_FromStringAndSize PyUnicode_FromStringAndSize
|
||||
#define PyUString_FromFormat PyUnicode_FromFormat
|
||||
#define PyUString_Concat PyUnicode_Concat2
|
||||
#define PyUString_ConcatAndDel PyUnicode_ConcatAndDel
|
||||
#define PyUString_GET_SIZE PyUnicode_GET_SIZE
|
||||
#define PyUString_Size PyUnicode_Size
|
||||
#define PyUString_InternFromString PyUnicode_InternFromString
|
||||
#define PyUString_Format PyUnicode_Format
|
||||
|
||||
#else
|
||||
|
||||
#define PyBytes_Type PyString_Type
|
||||
#define PyBytes_Check PyString_Check
|
||||
#define PyBytesObject PyStringObject
|
||||
#define PyBytes_FromString PyString_FromString
|
||||
#define PyBytes_FromStringAndSize PyString_FromStringAndSize
|
||||
#define PyBytes_AS_STRING PyString_AS_STRING
|
||||
#define PyBytes_AsStringAndSize PyString_AsStringAndSize
|
||||
#define PyBytes_FromFormat PyString_FromFormat
|
||||
#define PyBytes_Concat PyString_Concat
|
||||
#define PyBytes_ConcatAndDel PyString_ConcatAndDel
|
||||
#define PyBytes_AsString PyString_AsString
|
||||
#define PyBytes_GET_SIZE PyString_GET_SIZE
|
||||
#define PyBytes_Size PyString_Size
|
||||
|
||||
#define PyUString_Type PyString_Type
|
||||
#define PyUString_Check PyString_Check
|
||||
#define PyUStringObject PyStringObject
|
||||
#define PyUString_FromString PyString_FromString
|
||||
#define PyUString_FromStringAndSize PyString_FromStringAndSize
|
||||
#define PyUString_FromFormat PyString_FromFormat
|
||||
#define PyUString_Concat PyString_Concat
|
||||
#define PyUString_ConcatAndDel PyString_ConcatAndDel
|
||||
#define PyUString_GET_SIZE PyString_GET_SIZE
|
||||
#define PyUString_Size PyString_Size
|
||||
#define PyUString_InternFromString PyString_InternFromString
|
||||
#define PyUString_Format PyString_Format
|
||||
|
||||
#endif /* NPY_PY3K */
|
||||
|
||||
|
||||
static NPY_INLINE void
|
||||
PyUnicode_ConcatAndDel(PyObject **left, PyObject *right)
|
||||
{
|
||||
PyObject *newobj;
|
||||
newobj = PyUnicode_Concat(*left, right);
|
||||
Py_DECREF(*left);
|
||||
Py_DECREF(right);
|
||||
*left = newobj;
|
||||
}
|
||||
|
||||
static NPY_INLINE void
|
||||
PyUnicode_Concat2(PyObject **left, PyObject *right)
|
||||
{
|
||||
PyObject *newobj;
|
||||
newobj = PyUnicode_Concat(*left, right);
|
||||
Py_DECREF(*left);
|
||||
*left = newobj;
|
||||
}
|
||||
|
||||
/*
|
||||
* PyFile_* compatibility
|
||||
*/
|
||||
#if defined(NPY_PY3K)
|
||||
|
||||
/*
|
||||
* Get a FILE* handle to the file represented by the Python object
|
||||
*/
|
||||
static NPY_INLINE FILE*
|
||||
npy_PyFile_Dup(PyObject *file, char *mode)
|
||||
{
|
||||
int fd, fd2;
|
||||
PyObject *ret, *os;
|
||||
Py_ssize_t pos;
|
||||
FILE *handle;
|
||||
/* Flush first to ensure things end up in the file in the correct order */
|
||||
ret = PyObject_CallMethod(file, "flush", "");
|
||||
if (ret == NULL) {
|
||||
return NULL;
|
||||
}
|
||||
Py_DECREF(ret);
|
||||
fd = PyObject_AsFileDescriptor(file);
|
||||
if (fd == -1) {
|
||||
return NULL;
|
||||
}
|
||||
os = PyImport_ImportModule("os");
|
||||
if (os == NULL) {
|
||||
return NULL;
|
||||
}
|
||||
ret = PyObject_CallMethod(os, "dup", "i", fd);
|
||||
Py_DECREF(os);
|
||||
if (ret == NULL) {
|
||||
return NULL;
|
||||
}
|
||||
fd2 = PyNumber_AsSsize_t(ret, NULL);
|
||||
Py_DECREF(ret);
|
||||
#ifdef _WIN32
|
||||
handle = _fdopen(fd2, mode);
|
||||
#else
|
||||
handle = fdopen(fd2, mode);
|
||||
#endif
|
||||
if (handle == NULL) {
|
||||
PyErr_SetString(PyExc_IOError,
|
||||
"Getting a FILE* from a Python file object failed");
|
||||
}
|
||||
ret = PyObject_CallMethod(file, "tell", "");
|
||||
if (ret == NULL) {
|
||||
fclose(handle);
|
||||
return NULL;
|
||||
}
|
||||
pos = PyNumber_AsSsize_t(ret, PyExc_OverflowError);
|
||||
Py_DECREF(ret);
|
||||
if (PyErr_Occurred()) {
|
||||
fclose(handle);
|
||||
return NULL;
|
||||
}
|
||||
npy_fseek(handle, pos, SEEK_SET);
|
||||
return handle;
|
||||
}
|
||||
|
||||
/*
|
||||
* Close the dup-ed file handle, and seek the Python one to the current position
|
||||
*/
|
||||
static NPY_INLINE int
|
||||
npy_PyFile_DupClose(PyObject *file, FILE* handle)
|
||||
{
|
||||
PyObject *ret;
|
||||
Py_ssize_t position;
|
||||
position = npy_ftell(handle);
|
||||
fclose(handle);
|
||||
|
||||
ret = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", position, 0);
|
||||
if (ret == NULL) {
|
||||
return -1;
|
||||
}
|
||||
Py_DECREF(ret);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static NPY_INLINE int
|
||||
npy_PyFile_Check(PyObject *file)
|
||||
{
|
||||
int fd;
|
||||
fd = PyObject_AsFileDescriptor(file);
|
||||
if (fd == -1) {
|
||||
PyErr_Clear();
|
||||
return 0;
|
||||
}
|
||||
return 1;
|
||||
}
|
||||
|
||||
#else
|
||||
|
||||
#define npy_PyFile_Dup(file, mode) PyFile_AsFile(file)
|
||||
#define npy_PyFile_DupClose(file, handle) (0)
|
||||
#define npy_PyFile_Check PyFile_Check
|
||||
|
||||
#endif
|
||||
|
||||
static NPY_INLINE PyObject*
|
||||
npy_PyFile_OpenFile(PyObject *filename, const char *mode)
|
||||
{
|
||||
PyObject *open;
|
||||
open = PyDict_GetItemString(PyEval_GetBuiltins(), "open");
|
||||
if (open == NULL) {
|
||||
return NULL;
|
||||
}
|
||||
return PyObject_CallFunction(open, "Os", filename, mode);
|
||||
}
|
||||
|
||||
static NPY_INLINE int
|
||||
npy_PyFile_CloseFile(PyObject *file)
|
||||
{
|
||||
PyObject *ret;
|
||||
|
||||
ret = PyObject_CallMethod(file, "close", NULL);
|
||||
if (ret == NULL) {
|
||||
return -1;
|
||||
}
|
||||
Py_DECREF(ret);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* PyObject_Cmp
|
||||
*/
|
||||
#if defined(NPY_PY3K)
|
||||
static NPY_INLINE int
|
||||
PyObject_Cmp(PyObject *i1, PyObject *i2, int *cmp)
|
||||
{
|
||||
int v;
|
||||
v = PyObject_RichCompareBool(i1, i2, Py_LT);
|
||||
if (v == 1) {
|
||||
*cmp = -1;
|
||||
return 1;
|
||||
}
|
||||
else if (v == -1) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
v = PyObject_RichCompareBool(i1, i2, Py_GT);
|
||||
if (v == 1) {
|
||||
*cmp = 1;
|
||||
return 1;
|
||||
}
|
||||
else if (v == -1) {
|
||||
return -1;
|
||||
}
|
||||
|
||||
v = PyObject_RichCompareBool(i1, i2, Py_EQ);
|
||||
if (v == 1) {
|
||||
*cmp = 0;
|
||||
return 1;
|
||||
}
|
||||
else {
|
||||
*cmp = 0;
|
||||
return -1;
|
||||
}
|
||||
}
|
||||
#endif
|
||||
|
||||
/*
|
||||
* PyCObject functions adapted to PyCapsules.
|
||||
*
|
||||
* The main job here is to get rid of the improved error handling
|
||||
* of PyCapsules. It's a shame...
|
||||
*/
|
||||
#if PY_VERSION_HEX >= 0x03000000
|
||||
|
||||
static NPY_INLINE PyObject *
|
||||
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(PyObject *))
|
||||
{
|
||||
PyObject *ret = PyCapsule_New(ptr, NULL, dtor);
|
||||
if (ret == NULL) {
|
||||
PyErr_Clear();
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
static NPY_INLINE PyObject *
|
||||
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context, void (*dtor)(PyObject *))
|
||||
{
|
||||
PyObject *ret = NpyCapsule_FromVoidPtr(ptr, dtor);
|
||||
if (ret != NULL && PyCapsule_SetContext(ret, context) != 0) {
|
||||
PyErr_Clear();
|
||||
Py_DECREF(ret);
|
||||
ret = NULL;
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
static NPY_INLINE void *
|
||||
NpyCapsule_AsVoidPtr(PyObject *obj)
|
||||
{
|
||||
void *ret = PyCapsule_GetPointer(obj, NULL);
|
||||
if (ret == NULL) {
|
||||
PyErr_Clear();
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
static NPY_INLINE void *
|
||||
NpyCapsule_GetDesc(PyObject *obj)
|
||||
{
|
||||
return PyCapsule_GetContext(obj);
|
||||
}
|
||||
|
||||
static NPY_INLINE int
|
||||
NpyCapsule_Check(PyObject *ptr)
|
||||
{
|
||||
return PyCapsule_CheckExact(ptr);
|
||||
}
|
||||
|
||||
static NPY_INLINE void
|
||||
simple_capsule_dtor(PyObject *cap)
|
||||
{
|
||||
PyArray_free(PyCapsule_GetPointer(cap, NULL));
|
||||
}
|
||||
|
||||
#else
|
||||
|
||||
static NPY_INLINE PyObject *
|
||||
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(void *))
|
||||
{
|
||||
return PyCObject_FromVoidPtr(ptr, dtor);
|
||||
}
|
||||
|
||||
static NPY_INLINE PyObject *
|
||||
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context,
|
||||
void (*dtor)(void *, void *))
|
||||
{
|
||||
return PyCObject_FromVoidPtrAndDesc(ptr, context, dtor);
|
||||
}
|
||||
|
||||
static NPY_INLINE void *
|
||||
NpyCapsule_AsVoidPtr(PyObject *ptr)
|
||||
{
|
||||
return PyCObject_AsVoidPtr(ptr);
|
||||
}
|
||||
|
||||
static NPY_INLINE void *
|
||||
NpyCapsule_GetDesc(PyObject *obj)
|
||||
{
|
||||
return PyCObject_GetDesc(obj);
|
||||
}
|
||||
|
||||
static NPY_INLINE int
|
||||
NpyCapsule_Check(PyObject *ptr)
|
||||
{
|
||||
return PyCObject_Check(ptr);
|
||||
}
|
||||
|
||||
static NPY_INLINE void
|
||||
simple_capsule_dtor(void *ptr)
|
||||
{
|
||||
PyArray_free(ptr);
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Hash value compatibility.
|
||||
* As of Python 3.2 hash values are of type Py_hash_t.
|
||||
* Previous versions use C long.
|
||||
*/
|
||||
#if PY_VERSION_HEX < 0x03020000
|
||||
typedef long npy_hash_t;
|
||||
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_LONG
|
||||
#else
|
||||
typedef Py_hash_t npy_hash_t;
|
||||
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_INTP
|
||||
#endif
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif /* _NPY_3KCOMPAT_H_ */
|
|
@ -1,930 +0,0 @@
|
|||
#ifndef _NPY_COMMON_H_
|
||||
#define _NPY_COMMON_H_
|
||||
|
||||
/* numpyconfig.h is auto-generated */
|
||||
#include "numpyconfig.h"
|
||||
|
||||
#if defined(_MSC_VER)
|
||||
#define NPY_INLINE __inline
|
||||
#elif defined(__GNUC__)
|
||||
#if defined(__STRICT_ANSI__)
|
||||
#define NPY_INLINE __inline__
|
||||
#else
|
||||
#define NPY_INLINE inline
|
||||
#endif
|
||||
#else
|
||||
#define NPY_INLINE
|
||||
#endif
|
||||
|
||||
/* Enable 64 bit file position support on win-amd64. Ticket #1660 */
|
||||
#if defined(_MSC_VER) && defined(_WIN64) && (_MSC_VER > 1400)
|
||||
#define npy_fseek _fseeki64
|
||||
#define npy_ftell _ftelli64
|
||||
#else
|
||||
#define npy_fseek fseek
|
||||
#define npy_ftell ftell
|
||||
#endif
|
||||
|
||||
/* enums for detected endianness */
|
||||
enum {
|
||||
NPY_CPU_UNKNOWN_ENDIAN,
|
||||
NPY_CPU_LITTLE,
|
||||
NPY_CPU_BIG
|
||||
};
|
||||
|
||||
/*
|
||||
* This is to typedef npy_intp to the appropriate pointer size for
|
||||
* this platform. Py_intptr_t, Py_uintptr_t are defined in pyport.h.
|
||||
*/
|
||||
typedef Py_intptr_t npy_intp;
|
||||
typedef Py_uintptr_t npy_uintp;
|
||||
#define NPY_SIZEOF_CHAR 1
|
||||
#define NPY_SIZEOF_BYTE 1
|
||||
#define NPY_SIZEOF_INTP NPY_SIZEOF_PY_INTPTR_T
|
||||
#define NPY_SIZEOF_UINTP NPY_SIZEOF_PY_INTPTR_T
|
||||
#define NPY_SIZEOF_CFLOAT NPY_SIZEOF_COMPLEX_FLOAT
|
||||
#define NPY_SIZEOF_CDOUBLE NPY_SIZEOF_COMPLEX_DOUBLE
|
||||
#define NPY_SIZEOF_CLONGDOUBLE NPY_SIZEOF_COMPLEX_LONGDOUBLE
|
||||
|
||||
#ifdef constchar
|
||||
#undef constchar
|
||||
#endif
|
||||
|
||||
#if (PY_VERSION_HEX < 0x02050000)
|
||||
#ifndef PY_SSIZE_T_MIN
|
||||
typedef int Py_ssize_t;
|
||||
#define PY_SSIZE_T_MAX INT_MAX
|
||||
#define PY_SSIZE_T_MIN INT_MIN
|
||||
#endif
|
||||
#define NPY_SSIZE_T_PYFMT "i"
|
||||
#define constchar const char
|
||||
#else
|
||||
#define NPY_SSIZE_T_PYFMT "n"
|
||||
#define constchar char
|
||||
#endif
|
||||
|
||||
/* NPY_INTP_FMT Note:
|
||||
* Unlike the other NPY_*_FMT macros which are used with
|
||||
* PyOS_snprintf, NPY_INTP_FMT is used with PyErr_Format and
|
||||
* PyString_Format. These functions use different formatting
|
||||
* codes which are portably specified according to the Python
|
||||
* documentation. See ticket #1795.
|
||||
*
|
||||
* On Windows x64, the LONGLONG formatter should be used, but
|
||||
* in Python 2.6 the %lld formatter is not supported. In this
|
||||
* case we work around the problem by using the %zd formatter.
|
||||
*/
|
||||
#if NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_INT
|
||||
#define NPY_INTP NPY_INT
|
||||
#define NPY_UINTP NPY_UINT
|
||||
#define PyIntpArrType_Type PyIntArrType_Type
|
||||
#define PyUIntpArrType_Type PyUIntArrType_Type
|
||||
#define NPY_MAX_INTP NPY_MAX_INT
|
||||
#define NPY_MIN_INTP NPY_MIN_INT
|
||||
#define NPY_MAX_UINTP NPY_MAX_UINT
|
||||
#define NPY_INTP_FMT "d"
|
||||
#elif NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONG
|
||||
#define NPY_INTP NPY_LONG
|
||||
#define NPY_UINTP NPY_ULONG
|
||||
#define PyIntpArrType_Type PyLongArrType_Type
|
||||
#define PyUIntpArrType_Type PyULongArrType_Type
|
||||
#define NPY_MAX_INTP NPY_MAX_LONG
|
||||
#define NPY_MIN_INTP NPY_MIN_LONG
|
||||
#define NPY_MAX_UINTP NPY_MAX_ULONG
|
||||
#define NPY_INTP_FMT "ld"
|
||||
#elif defined(PY_LONG_LONG) && (NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONGLONG)
|
||||
#define NPY_INTP NPY_LONGLONG
|
||||
#define NPY_UINTP NPY_ULONGLONG
|
||||
#define PyIntpArrType_Type PyLongLongArrType_Type
|
||||
#define PyUIntpArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_MAX_INTP NPY_MAX_LONGLONG
|
||||
#define NPY_MIN_INTP NPY_MIN_LONGLONG
|
||||
#define NPY_MAX_UINTP NPY_MAX_ULONGLONG
|
||||
#if (PY_VERSION_HEX >= 0x02070000)
|
||||
#define NPY_INTP_FMT "lld"
|
||||
#else
|
||||
#define NPY_INTP_FMT "zd"
|
||||
#endif
|
||||
#endif
|
||||
|
||||
/*
|
||||
* We can only use C99 formats for npy_intp if it is the same as
|
||||
* intptr_t, hence the condition on HAVE_UINTPTR_T
|
||||
*/
|
||||
#if (NPY_USE_C99_FORMATS) == 1 \
|
||||
&& (defined HAVE_UINTPTR_T) \
|
||||
&& (defined HAVE_INTTYPES_H)
|
||||
#include <inttypes.h>
|
||||
#undef NPY_INTP_FMT
|
||||
#define NPY_INTP_FMT PRIdPTR
|
||||
#endif
|
||||
|
||||
|
||||
/*
|
||||
* Some platforms don't define bool, long long, or long double.
|
||||
* Handle that here.
|
||||
*/
|
||||
#define NPY_BYTE_FMT "hhd"
|
||||
#define NPY_UBYTE_FMT "hhu"
|
||||
#define NPY_SHORT_FMT "hd"
|
||||
#define NPY_USHORT_FMT "hu"
|
||||
#define NPY_INT_FMT "d"
|
||||
#define NPY_UINT_FMT "u"
|
||||
#define NPY_LONG_FMT "ld"
|
||||
#define NPY_ULONG_FMT "lu"
|
||||
#define NPY_HALF_FMT "g"
|
||||
#define NPY_FLOAT_FMT "g"
|
||||
#define NPY_DOUBLE_FMT "g"
|
||||
|
||||
|
||||
#ifdef PY_LONG_LONG
|
||||
typedef PY_LONG_LONG npy_longlong;
|
||||
typedef unsigned PY_LONG_LONG npy_ulonglong;
|
||||
# ifdef _MSC_VER
|
||||
# define NPY_LONGLONG_FMT "I64d"
|
||||
# define NPY_ULONGLONG_FMT "I64u"
|
||||
# elif defined(__APPLE__) || defined(__FreeBSD__)
|
||||
/* "%Ld" only parses 4 bytes -- "L" is floating modifier on MacOS X/BSD */
|
||||
# define NPY_LONGLONG_FMT "lld"
|
||||
# define NPY_ULONGLONG_FMT "llu"
|
||||
/*
|
||||
another possible variant -- *quad_t works on *BSD, but is deprecated:
|
||||
#define LONGLONG_FMT "qd"
|
||||
#define ULONGLONG_FMT "qu"
|
||||
*/
|
||||
# else
|
||||
# define NPY_LONGLONG_FMT "Ld"
|
||||
# define NPY_ULONGLONG_FMT "Lu"
|
||||
# endif
|
||||
# ifdef _MSC_VER
|
||||
# define NPY_LONGLONG_SUFFIX(x) (x##i64)
|
||||
# define NPY_ULONGLONG_SUFFIX(x) (x##Ui64)
|
||||
# else
|
||||
# define NPY_LONGLONG_SUFFIX(x) (x##LL)
|
||||
# define NPY_ULONGLONG_SUFFIX(x) (x##ULL)
|
||||
# endif
|
||||
#else
|
||||
typedef long npy_longlong;
|
||||
typedef unsigned long npy_ulonglong;
|
||||
# define NPY_LONGLONG_SUFFIX(x) (x##L)
|
||||
# define NPY_ULONGLONG_SUFFIX(x) (x##UL)
|
||||
#endif
|
||||
|
||||
|
||||
typedef unsigned char npy_bool;
|
||||
#define NPY_FALSE 0
|
||||
#define NPY_TRUE 1
|
||||
|
||||
|
||||
#if NPY_SIZEOF_LONGDOUBLE == NPY_SIZEOF_DOUBLE
|
||||
typedef double npy_longdouble;
|
||||
#define NPY_LONGDOUBLE_FMT "g"
|
||||
#else
|
||||
typedef long double npy_longdouble;
|
||||
#define NPY_LONGDOUBLE_FMT "Lg"
|
||||
#endif
|
||||
|
||||
#ifndef Py_USING_UNICODE
|
||||
#error Must use Python with unicode enabled.
|
||||
#endif
|
||||
|
||||
|
||||
typedef signed char npy_byte;
|
||||
typedef unsigned char npy_ubyte;
|
||||
typedef unsigned short npy_ushort;
|
||||
typedef unsigned int npy_uint;
|
||||
typedef unsigned long npy_ulong;
|
||||
|
||||
/* These are for completeness */
|
||||
typedef char npy_char;
|
||||
typedef short npy_short;
|
||||
typedef int npy_int;
|
||||
typedef long npy_long;
|
||||
typedef float npy_float;
|
||||
typedef double npy_double;
|
||||
|
||||
/*
|
||||
* Disabling C99 complex usage: a lot of C code in numpy/scipy relies on being
|
||||
* able to do .real/.imag. Will have to convert code first.
|
||||
*/
|
||||
#if 0
|
||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_DOUBLE)
|
||||
typedef complex npy_cdouble;
|
||||
#else
|
||||
typedef struct { double real, imag; } npy_cdouble;
|
||||
#endif
|
||||
|
||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_FLOAT)
|
||||
typedef complex float npy_cfloat;
|
||||
#else
|
||||
typedef struct { float real, imag; } npy_cfloat;
|
||||
#endif
|
||||
|
||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_LONG_DOUBLE)
|
||||
typedef complex long double npy_clongdouble;
|
||||
#else
|
||||
typedef struct {npy_longdouble real, imag;} npy_clongdouble;
|
||||
#endif
|
||||
#endif
|
||||
#if NPY_SIZEOF_COMPLEX_DOUBLE != 2 * NPY_SIZEOF_DOUBLE
|
||||
#error npy_cdouble definition is not compatible with C99 complex definition ! \
|
||||
Please contact Numpy maintainers and give detailed information about your \
|
||||
compiler and platform
|
||||
#endif
|
||||
typedef struct { double real, imag; } npy_cdouble;
|
||||
|
||||
#if NPY_SIZEOF_COMPLEX_FLOAT != 2 * NPY_SIZEOF_FLOAT
|
||||
#error npy_cfloat definition is not compatible with C99 complex definition ! \
|
||||
Please contact Numpy maintainers and give detailed information about your \
|
||||
compiler and platform
|
||||
#endif
|
||||
typedef struct { float real, imag; } npy_cfloat;
|
||||
|
||||
#if NPY_SIZEOF_COMPLEX_LONGDOUBLE != 2 * NPY_SIZEOF_LONGDOUBLE
|
||||
#error npy_clongdouble definition is not compatible with C99 complex definition ! \
|
||||
Please contact Numpy maintainers and give detailed information about your \
|
||||
compiler and platform
|
||||
#endif
|
||||
typedef struct { npy_longdouble real, imag; } npy_clongdouble;
|
||||
|
||||
/*
|
||||
* numarray-style bit-width typedefs
|
||||
*/
|
||||
#define NPY_MAX_INT8 127
|
||||
#define NPY_MIN_INT8 -128
|
||||
#define NPY_MAX_UINT8 255
|
||||
#define NPY_MAX_INT16 32767
|
||||
#define NPY_MIN_INT16 -32768
|
||||
#define NPY_MAX_UINT16 65535
|
||||
#define NPY_MAX_INT32 2147483647
|
||||
#define NPY_MIN_INT32 (-NPY_MAX_INT32 - 1)
|
||||
#define NPY_MAX_UINT32 4294967295U
|
||||
#define NPY_MAX_INT64 NPY_LONGLONG_SUFFIX(9223372036854775807)
|
||||
#define NPY_MIN_INT64 (-NPY_MAX_INT64 - NPY_LONGLONG_SUFFIX(1))
|
||||
#define NPY_MAX_UINT64 NPY_ULONGLONG_SUFFIX(18446744073709551615)
|
||||
#define NPY_MAX_INT128 NPY_LONGLONG_SUFFIX(170141183460469231731687303715884105727)
|
||||
#define NPY_MIN_INT128 (-NPY_MAX_INT128 - NPY_LONGLONG_SUFFIX(1))
|
||||
#define NPY_MAX_UINT128 NPY_ULONGLONG_SUFFIX(340282366920938463463374607431768211455)
|
||||
#define NPY_MAX_INT256 NPY_LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
|
||||
#define NPY_MIN_INT256 (-NPY_MAX_INT256 - NPY_LONGLONG_SUFFIX(1))
|
||||
#define NPY_MAX_UINT256 NPY_ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
|
||||
#define NPY_MIN_DATETIME NPY_MIN_INT64
|
||||
#define NPY_MAX_DATETIME NPY_MAX_INT64
|
||||
#define NPY_MIN_TIMEDELTA NPY_MIN_INT64
|
||||
#define NPY_MAX_TIMEDELTA NPY_MAX_INT64
|
||||
|
||||
/* Need to find the number of bits for each type and
|
||||
make definitions accordingly.
|
||||
|
||||
C states that sizeof(char) == 1 by definition
|
||||
|
||||
So, just using the sizeof keyword won't help.
|
||||
|
||||
It also looks like Python itself uses sizeof(char) quite a
|
||||
bit, which by definition should be 1 all the time.
|
||||
|
||||
Idea: make use of CHAR_BIT, which tells us how many
|
||||
bits there are per char
|
||||
*/
|
||||
|
||||
/* Include platform definitions -- These are in the C89/90 standard */
|
||||
#include <limits.h>
|
||||
#define NPY_MAX_BYTE SCHAR_MAX
|
||||
#define NPY_MIN_BYTE SCHAR_MIN
|
||||
#define NPY_MAX_UBYTE UCHAR_MAX
|
||||
#define NPY_MAX_SHORT SHRT_MAX
|
||||
#define NPY_MIN_SHORT SHRT_MIN
|
||||
#define NPY_MAX_USHORT USHRT_MAX
|
||||
#define NPY_MAX_INT INT_MAX
|
||||
#ifndef INT_MIN
|
||||
#define INT_MIN (-INT_MAX - 1)
|
||||
#endif
|
||||
#define NPY_MIN_INT INT_MIN
|
||||
#define NPY_MAX_UINT UINT_MAX
|
||||
#define NPY_MAX_LONG LONG_MAX
|
||||
#define NPY_MIN_LONG LONG_MIN
|
||||
#define NPY_MAX_ULONG ULONG_MAX
|
||||
|
||||
#define NPY_SIZEOF_HALF 2
|
||||
#define NPY_SIZEOF_DATETIME 8
|
||||
#define NPY_SIZEOF_TIMEDELTA 8
|
||||
|
||||
#define NPY_BITSOF_BOOL (sizeof(npy_bool) * CHAR_BIT)
|
||||
#define NPY_BITSOF_CHAR CHAR_BIT
|
||||
#define NPY_BITSOF_BYTE (NPY_SIZEOF_BYTE * CHAR_BIT)
|
||||
#define NPY_BITSOF_SHORT (NPY_SIZEOF_SHORT * CHAR_BIT)
|
||||
#define NPY_BITSOF_INT (NPY_SIZEOF_INT * CHAR_BIT)
|
||||
#define NPY_BITSOF_LONG (NPY_SIZEOF_LONG * CHAR_BIT)
|
||||
#define NPY_BITSOF_LONGLONG (NPY_SIZEOF_LONGLONG * CHAR_BIT)
|
||||
#define NPY_BITSOF_INTP (NPY_SIZEOF_INTP * CHAR_BIT)
|
||||
#define NPY_BITSOF_HALF (NPY_SIZEOF_HALF * CHAR_BIT)
|
||||
#define NPY_BITSOF_FLOAT (NPY_SIZEOF_FLOAT * CHAR_BIT)
|
||||
#define NPY_BITSOF_DOUBLE (NPY_SIZEOF_DOUBLE * CHAR_BIT)
|
||||
#define NPY_BITSOF_LONGDOUBLE (NPY_SIZEOF_LONGDOUBLE * CHAR_BIT)
|
||||
#define NPY_BITSOF_CFLOAT (NPY_SIZEOF_CFLOAT * CHAR_BIT)
|
||||
#define NPY_BITSOF_CDOUBLE (NPY_SIZEOF_CDOUBLE * CHAR_BIT)
|
||||
#define NPY_BITSOF_CLONGDOUBLE (NPY_SIZEOF_CLONGDOUBLE * CHAR_BIT)
|
||||
#define NPY_BITSOF_DATETIME (NPY_SIZEOF_DATETIME * CHAR_BIT)
|
||||
#define NPY_BITSOF_TIMEDELTA (NPY_SIZEOF_TIMEDELTA * CHAR_BIT)
|
||||
|
||||
#if NPY_BITSOF_LONG == 8
|
||||
#define NPY_INT8 NPY_LONG
|
||||
#define NPY_UINT8 NPY_ULONG
|
||||
typedef long npy_int8;
|
||||
typedef unsigned long npy_uint8;
|
||||
#define PyInt8ScalarObject PyLongScalarObject
|
||||
#define PyInt8ArrType_Type PyLongArrType_Type
|
||||
#define PyUInt8ScalarObject PyULongScalarObject
|
||||
#define PyUInt8ArrType_Type PyULongArrType_Type
|
||||
#define NPY_INT8_FMT NPY_LONG_FMT
|
||||
#define NPY_UINT8_FMT NPY_ULONG_FMT
|
||||
#elif NPY_BITSOF_LONG == 16
|
||||
#define NPY_INT16 NPY_LONG
|
||||
#define NPY_UINT16 NPY_ULONG
|
||||
typedef long npy_int16;
|
||||
typedef unsigned long npy_uint16;
|
||||
#define PyInt16ScalarObject PyLongScalarObject
|
||||
#define PyInt16ArrType_Type PyLongArrType_Type
|
||||
#define PyUInt16ScalarObject PyULongScalarObject
|
||||
#define PyUInt16ArrType_Type PyULongArrType_Type
|
||||
#define NPY_INT16_FMT NPY_LONG_FMT
|
||||
#define NPY_UINT16_FMT NPY_ULONG_FMT
|
||||
#elif NPY_BITSOF_LONG == 32
|
||||
#define NPY_INT32 NPY_LONG
|
||||
#define NPY_UINT32 NPY_ULONG
|
||||
typedef long npy_int32;
|
||||
typedef unsigned long npy_uint32;
|
||||
typedef unsigned long npy_ucs4;
|
||||
#define PyInt32ScalarObject PyLongScalarObject
|
||||
#define PyInt32ArrType_Type PyLongArrType_Type
|
||||
#define PyUInt32ScalarObject PyULongScalarObject
|
||||
#define PyUInt32ArrType_Type PyULongArrType_Type
|
||||
#define NPY_INT32_FMT NPY_LONG_FMT
|
||||
#define NPY_UINT32_FMT NPY_ULONG_FMT
|
||||
#elif NPY_BITSOF_LONG == 64
|
||||
#define NPY_INT64 NPY_LONG
|
||||
#define NPY_UINT64 NPY_ULONG
|
||||
typedef long npy_int64;
|
||||
typedef unsigned long npy_uint64;
|
||||
#define PyInt64ScalarObject PyLongScalarObject
|
||||
#define PyInt64ArrType_Type PyLongArrType_Type
|
||||
#define PyUInt64ScalarObject PyULongScalarObject
|
||||
#define PyUInt64ArrType_Type PyULongArrType_Type
|
||||
#define NPY_INT64_FMT NPY_LONG_FMT
|
||||
#define NPY_UINT64_FMT NPY_ULONG_FMT
|
||||
#define MyPyLong_FromInt64 PyLong_FromLong
|
||||
#define MyPyLong_AsInt64 PyLong_AsLong
|
||||
#elif NPY_BITSOF_LONG == 128
|
||||
#define NPY_INT128 NPY_LONG
|
||||
#define NPY_UINT128 NPY_ULONG
|
||||
typedef long npy_int128;
|
||||
typedef unsigned long npy_uint128;
|
||||
#define PyInt128ScalarObject PyLongScalarObject
|
||||
#define PyInt128ArrType_Type PyLongArrType_Type
|
||||
#define PyUInt128ScalarObject PyULongScalarObject
|
||||
#define PyUInt128ArrType_Type PyULongArrType_Type
|
||||
#define NPY_INT128_FMT NPY_LONG_FMT
|
||||
#define NPY_UINT128_FMT NPY_ULONG_FMT
|
||||
#endif
|
||||
|
||||
#if NPY_BITSOF_LONGLONG == 8
|
||||
# ifndef NPY_INT8
|
||||
# define NPY_INT8 NPY_LONGLONG
|
||||
# define NPY_UINT8 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int8;
|
||||
typedef npy_ulonglong npy_uint8;
|
||||
# define PyInt8ScalarObject PyLongLongScalarObject
|
||||
# define PyInt8ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt8ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt8ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT8_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT8_FMT NPY_ULONGLONG_FMT
|
||||
# endif
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT8
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT8
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT8
|
||||
#elif NPY_BITSOF_LONGLONG == 16
|
||||
# ifndef NPY_INT16
|
||||
# define NPY_INT16 NPY_LONGLONG
|
||||
# define NPY_UINT16 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int16;
|
||||
typedef npy_ulonglong npy_uint16;
|
||||
# define PyInt16ScalarObject PyLongLongScalarObject
|
||||
# define PyInt16ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt16ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt16ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT16_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT16_FMT NPY_ULONGLONG_FMT
|
||||
# endif
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT16
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT16
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT16
|
||||
#elif NPY_BITSOF_LONGLONG == 32
|
||||
# ifndef NPY_INT32
|
||||
# define NPY_INT32 NPY_LONGLONG
|
||||
# define NPY_UINT32 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int32;
|
||||
typedef npy_ulonglong npy_uint32;
|
||||
typedef npy_ulonglong npy_ucs4;
|
||||
# define PyInt32ScalarObject PyLongLongScalarObject
|
||||
# define PyInt32ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt32ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt32ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT32_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT32_FMT NPY_ULONGLONG_FMT
|
||||
# endif
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT32
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT32
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT32
|
||||
#elif NPY_BITSOF_LONGLONG == 64
|
||||
# ifndef NPY_INT64
|
||||
# define NPY_INT64 NPY_LONGLONG
|
||||
# define NPY_UINT64 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int64;
|
||||
typedef npy_ulonglong npy_uint64;
|
||||
# define PyInt64ScalarObject PyLongLongScalarObject
|
||||
# define PyInt64ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt64ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt64ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT64_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT64_FMT NPY_ULONGLONG_FMT
|
||||
# define MyPyLong_FromInt64 PyLong_FromLongLong
|
||||
# define MyPyLong_AsInt64 PyLong_AsLongLong
|
||||
# endif
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT64
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT64
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT64
|
||||
#elif NPY_BITSOF_LONGLONG == 128
|
||||
# ifndef NPY_INT128
|
||||
# define NPY_INT128 NPY_LONGLONG
|
||||
# define NPY_UINT128 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int128;
|
||||
typedef npy_ulonglong npy_uint128;
|
||||
# define PyInt128ScalarObject PyLongLongScalarObject
|
||||
# define PyInt128ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt128ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt128ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT128_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT128_FMT NPY_ULONGLONG_FMT
|
||||
# endif
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT128
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT128
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT128
|
||||
#elif NPY_BITSOF_LONGLONG == 256
|
||||
# define NPY_INT256 NPY_LONGLONG
|
||||
# define NPY_UINT256 NPY_ULONGLONG
|
||||
typedef npy_longlong npy_int256;
|
||||
typedef npy_ulonglong npy_uint256;
|
||||
# define PyInt256ScalarObject PyLongLongScalarObject
|
||||
# define PyInt256ArrType_Type PyLongLongArrType_Type
|
||||
# define PyUInt256ScalarObject PyULongLongScalarObject
|
||||
# define PyUInt256ArrType_Type PyULongLongArrType_Type
|
||||
#define NPY_INT256_FMT NPY_LONGLONG_FMT
|
||||
#define NPY_UINT256_FMT NPY_ULONGLONG_FMT
|
||||
# define NPY_MAX_LONGLONG NPY_MAX_INT256
|
||||
# define NPY_MIN_LONGLONG NPY_MIN_INT256
|
||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT256
|
||||
#endif
|
||||
|
||||
#if NPY_BITSOF_INT == 8
|
||||
#ifndef NPY_INT8
|
||||
#define NPY_INT8 NPY_INT
|
||||
#define NPY_UINT8 NPY_UINT
|
||||
typedef int npy_int8;
|
||||
typedef unsigned int npy_uint8;
|
||||
# define PyInt8ScalarObject PyIntScalarObject
|
||||
# define PyInt8ArrType_Type PyIntArrType_Type
|
||||
# define PyUInt8ScalarObject PyUIntScalarObject
|
||||
# define PyUInt8ArrType_Type PyUIntArrType_Type
|
||||
#define NPY_INT8_FMT NPY_INT_FMT
|
||||
#define NPY_UINT8_FMT NPY_UINT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_INT == 16
|
||||
#ifndef NPY_INT16
|
||||
#define NPY_INT16 NPY_INT
|
||||
#define NPY_UINT16 NPY_UINT
|
||||
typedef int npy_int16;
|
||||
typedef unsigned int npy_uint16;
|
||||
# define PyInt16ScalarObject PyIntScalarObject
|
||||
# define PyInt16ArrType_Type PyIntArrType_Type
|
||||
# define PyUInt16ScalarObject PyUIntScalarObject
|
||||
# define PyUInt16ArrType_Type PyUIntArrType_Type
|
||||
#define NPY_INT16_FMT NPY_INT_FMT
|
||||
#define NPY_UINT16_FMT NPY_UINT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_INT == 32
|
||||
#ifndef NPY_INT32
|
||||
#define NPY_INT32 NPY_INT
|
||||
#define NPY_UINT32 NPY_UINT
|
||||
typedef int npy_int32;
|
||||
typedef unsigned int npy_uint32;
|
||||
typedef unsigned int npy_ucs4;
|
||||
# define PyInt32ScalarObject PyIntScalarObject
|
||||
# define PyInt32ArrType_Type PyIntArrType_Type
|
||||
# define PyUInt32ScalarObject PyUIntScalarObject
|
||||
# define PyUInt32ArrType_Type PyUIntArrType_Type
|
||||
#define NPY_INT32_FMT NPY_INT_FMT
|
||||
#define NPY_UINT32_FMT NPY_UINT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_INT == 64
|
||||
#ifndef NPY_INT64
|
||||
#define NPY_INT64 NPY_INT
|
||||
#define NPY_UINT64 NPY_UINT
|
||||
typedef int npy_int64;
|
||||
typedef unsigned int npy_uint64;
|
||||
# define PyInt64ScalarObject PyIntScalarObject
|
||||
# define PyInt64ArrType_Type PyIntArrType_Type
|
||||
# define PyUInt64ScalarObject PyUIntScalarObject
|
||||
# define PyUInt64ArrType_Type PyUIntArrType_Type
|
||||
#define NPY_INT64_FMT NPY_INT_FMT
|
||||
#define NPY_UINT64_FMT NPY_UINT_FMT
|
||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
||||
#endif
|
||||
#elif NPY_BITSOF_INT == 128
|
||||
#ifndef NPY_INT128
|
||||
#define NPY_INT128 NPY_INT
|
||||
#define NPY_UINT128 NPY_UINT
|
||||
typedef int npy_int128;
|
||||
typedef unsigned int npy_uint128;
|
||||
# define PyInt128ScalarObject PyIntScalarObject
|
||||
# define PyInt128ArrType_Type PyIntArrType_Type
|
||||
# define PyUInt128ScalarObject PyUIntScalarObject
|
||||
# define PyUInt128ArrType_Type PyUIntArrType_Type
|
||||
#define NPY_INT128_FMT NPY_INT_FMT
|
||||
#define NPY_UINT128_FMT NPY_UINT_FMT
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#if NPY_BITSOF_SHORT == 8
|
||||
#ifndef NPY_INT8
|
||||
#define NPY_INT8 NPY_SHORT
|
||||
#define NPY_UINT8 NPY_USHORT
|
||||
typedef short npy_int8;
|
||||
typedef unsigned short npy_uint8;
|
||||
# define PyInt8ScalarObject PyShortScalarObject
|
||||
# define PyInt8ArrType_Type PyShortArrType_Type
|
||||
# define PyUInt8ScalarObject PyUShortScalarObject
|
||||
# define PyUInt8ArrType_Type PyUShortArrType_Type
|
||||
#define NPY_INT8_FMT NPY_SHORT_FMT
|
||||
#define NPY_UINT8_FMT NPY_USHORT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_SHORT == 16
|
||||
#ifndef NPY_INT16
|
||||
#define NPY_INT16 NPY_SHORT
|
||||
#define NPY_UINT16 NPY_USHORT
|
||||
typedef short npy_int16;
|
||||
typedef unsigned short npy_uint16;
|
||||
# define PyInt16ScalarObject PyShortScalarObject
|
||||
# define PyInt16ArrType_Type PyShortArrType_Type
|
||||
# define PyUInt16ScalarObject PyUShortScalarObject
|
||||
# define PyUInt16ArrType_Type PyUShortArrType_Type
|
||||
#define NPY_INT16_FMT NPY_SHORT_FMT
|
||||
#define NPY_UINT16_FMT NPY_USHORT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_SHORT == 32
|
||||
#ifndef NPY_INT32
|
||||
#define NPY_INT32 NPY_SHORT
|
||||
#define NPY_UINT32 NPY_USHORT
|
||||
typedef short npy_int32;
|
||||
typedef unsigned short npy_uint32;
|
||||
typedef unsigned short npy_ucs4;
|
||||
# define PyInt32ScalarObject PyShortScalarObject
|
||||
# define PyInt32ArrType_Type PyShortArrType_Type
|
||||
# define PyUInt32ScalarObject PyUShortScalarObject
|
||||
# define PyUInt32ArrType_Type PyUShortArrType_Type
|
||||
#define NPY_INT32_FMT NPY_SHORT_FMT
|
||||
#define NPY_UINT32_FMT NPY_USHORT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_SHORT == 64
|
||||
#ifndef NPY_INT64
|
||||
#define NPY_INT64 NPY_SHORT
|
||||
#define NPY_UINT64 NPY_USHORT
|
||||
typedef short npy_int64;
|
||||
typedef unsigned short npy_uint64;
|
||||
# define PyInt64ScalarObject PyShortScalarObject
|
||||
# define PyInt64ArrType_Type PyShortArrType_Type
|
||||
# define PyUInt64ScalarObject PyUShortScalarObject
|
||||
# define PyUInt64ArrType_Type PyUShortArrType_Type
|
||||
#define NPY_INT64_FMT NPY_SHORT_FMT
|
||||
#define NPY_UINT64_FMT NPY_USHORT_FMT
|
||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
||||
#endif
|
||||
#elif NPY_BITSOF_SHORT == 128
|
||||
#ifndef NPY_INT128
|
||||
#define NPY_INT128 NPY_SHORT
|
||||
#define NPY_UINT128 NPY_USHORT
|
||||
typedef short npy_int128;
|
||||
typedef unsigned short npy_uint128;
|
||||
# define PyInt128ScalarObject PyShortScalarObject
|
||||
# define PyInt128ArrType_Type PyShortArrType_Type
|
||||
# define PyUInt128ScalarObject PyUShortScalarObject
|
||||
# define PyUInt128ArrType_Type PyUShortArrType_Type
|
||||
#define NPY_INT128_FMT NPY_SHORT_FMT
|
||||
#define NPY_UINT128_FMT NPY_USHORT_FMT
|
||||
#endif
|
||||
#endif
|
||||
|
||||
|
||||
#if NPY_BITSOF_CHAR == 8
|
||||
#ifndef NPY_INT8
|
||||
#define NPY_INT8 NPY_BYTE
|
||||
#define NPY_UINT8 NPY_UBYTE
|
||||
typedef signed char npy_int8;
|
||||
typedef unsigned char npy_uint8;
|
||||
# define PyInt8ScalarObject PyByteScalarObject
|
||||
# define PyInt8ArrType_Type PyByteArrType_Type
|
||||
# define PyUInt8ScalarObject PyUByteScalarObject
|
||||
# define PyUInt8ArrType_Type PyUByteArrType_Type
|
||||
#define NPY_INT8_FMT NPY_BYTE_FMT
|
||||
#define NPY_UINT8_FMT NPY_UBYTE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_CHAR == 16
|
||||
#ifndef NPY_INT16
|
||||
#define NPY_INT16 NPY_BYTE
|
||||
#define NPY_UINT16 NPY_UBYTE
|
||||
typedef signed char npy_int16;
|
||||
typedef unsigned char npy_uint16;
|
||||
# define PyInt16ScalarObject PyByteScalarObject
|
||||
# define PyInt16ArrType_Type PyByteArrType_Type
|
||||
# define PyUInt16ScalarObject PyUByteScalarObject
|
||||
# define PyUInt16ArrType_Type PyUByteArrType_Type
|
||||
#define NPY_INT16_FMT NPY_BYTE_FMT
|
||||
#define NPY_UINT16_FMT NPY_UBYTE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_CHAR == 32
|
||||
#ifndef NPY_INT32
|
||||
#define NPY_INT32 NPY_BYTE
|
||||
#define NPY_UINT32 NPY_UBYTE
|
||||
typedef signed char npy_int32;
|
||||
typedef unsigned char npy_uint32;
|
||||
typedef unsigned char npy_ucs4;
|
||||
# define PyInt32ScalarObject PyByteScalarObject
|
||||
# define PyInt32ArrType_Type PyByteArrType_Type
|
||||
# define PyUInt32ScalarObject PyUByteScalarObject
|
||||
# define PyUInt32ArrType_Type PyUByteArrType_Type
|
||||
#define NPY_INT32_FMT NPY_BYTE_FMT
|
||||
#define NPY_UINT32_FMT NPY_UBYTE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_CHAR == 64
|
||||
#ifndef NPY_INT64
|
||||
#define NPY_INT64 NPY_BYTE
|
||||
#define NPY_UINT64 NPY_UBYTE
|
||||
typedef signed char npy_int64;
|
||||
typedef unsigned char npy_uint64;
|
||||
# define PyInt64ScalarObject PyByteScalarObject
|
||||
# define PyInt64ArrType_Type PyByteArrType_Type
|
||||
# define PyUInt64ScalarObject PyUByteScalarObject
|
||||
# define PyUInt64ArrType_Type PyUByteArrType_Type
|
||||
#define NPY_INT64_FMT NPY_BYTE_FMT
|
||||
#define NPY_UINT64_FMT NPY_UBYTE_FMT
|
||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
||||
#endif
|
||||
#elif NPY_BITSOF_CHAR == 128
|
||||
#ifndef NPY_INT128
|
||||
#define NPY_INT128 NPY_BYTE
|
||||
#define NPY_UINT128 NPY_UBYTE
|
||||
typedef signed char npy_int128;
|
||||
typedef unsigned char npy_uint128;
|
||||
# define PyInt128ScalarObject PyByteScalarObject
|
||||
# define PyInt128ArrType_Type PyByteArrType_Type
|
||||
# define PyUInt128ScalarObject PyUByteScalarObject
|
||||
# define PyUInt128ArrType_Type PyUByteArrType_Type
|
||||
#define NPY_INT128_FMT NPY_BYTE_FMT
|
||||
#define NPY_UINT128_FMT NPY_UBYTE_FMT
|
||||
#endif
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#if NPY_BITSOF_DOUBLE == 32
|
||||
#ifndef NPY_FLOAT32
|
||||
#define NPY_FLOAT32 NPY_DOUBLE
|
||||
#define NPY_COMPLEX64 NPY_CDOUBLE
|
||||
typedef double npy_float32;
|
||||
typedef npy_cdouble npy_complex64;
|
||||
# define PyFloat32ScalarObject PyDoubleScalarObject
|
||||
# define PyComplex64ScalarObject PyCDoubleScalarObject
|
||||
# define PyFloat32ArrType_Type PyDoubleArrType_Type
|
||||
# define PyComplex64ArrType_Type PyCDoubleArrType_Type
|
||||
#define NPY_FLOAT32_FMT NPY_DOUBLE_FMT
|
||||
#define NPY_COMPLEX64_FMT NPY_CDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_DOUBLE == 64
|
||||
#ifndef NPY_FLOAT64
|
||||
#define NPY_FLOAT64 NPY_DOUBLE
|
||||
#define NPY_COMPLEX128 NPY_CDOUBLE
|
||||
typedef double npy_float64;
|
||||
typedef npy_cdouble npy_complex128;
|
||||
# define PyFloat64ScalarObject PyDoubleScalarObject
|
||||
# define PyComplex128ScalarObject PyCDoubleScalarObject
|
||||
# define PyFloat64ArrType_Type PyDoubleArrType_Type
|
||||
# define PyComplex128ArrType_Type PyCDoubleArrType_Type
|
||||
#define NPY_FLOAT64_FMT NPY_DOUBLE_FMT
|
||||
#define NPY_COMPLEX128_FMT NPY_CDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_DOUBLE == 80
|
||||
#ifndef NPY_FLOAT80
|
||||
#define NPY_FLOAT80 NPY_DOUBLE
|
||||
#define NPY_COMPLEX160 NPY_CDOUBLE
|
||||
typedef double npy_float80;
|
||||
typedef npy_cdouble npy_complex160;
|
||||
# define PyFloat80ScalarObject PyDoubleScalarObject
|
||||
# define PyComplex160ScalarObject PyCDoubleScalarObject
|
||||
# define PyFloat80ArrType_Type PyDoubleArrType_Type
|
||||
# define PyComplex160ArrType_Type PyCDoubleArrType_Type
|
||||
#define NPY_FLOAT80_FMT NPY_DOUBLE_FMT
|
||||
#define NPY_COMPLEX160_FMT NPY_CDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_DOUBLE == 96
|
||||
#ifndef NPY_FLOAT96
|
||||
#define NPY_FLOAT96 NPY_DOUBLE
|
||||
#define NPY_COMPLEX192 NPY_CDOUBLE
|
||||
typedef double npy_float96;
|
||||
typedef npy_cdouble npy_complex192;
|
||||
# define PyFloat96ScalarObject PyDoubleScalarObject
|
||||
# define PyComplex192ScalarObject PyCDoubleScalarObject
|
||||
# define PyFloat96ArrType_Type PyDoubleArrType_Type
|
||||
# define PyComplex192ArrType_Type PyCDoubleArrType_Type
|
||||
#define NPY_FLOAT96_FMT NPY_DOUBLE_FMT
|
||||
#define NPY_COMPLEX192_FMT NPY_CDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_DOUBLE == 128
|
||||
#ifndef NPY_FLOAT128
|
||||
#define NPY_FLOAT128 NPY_DOUBLE
|
||||
#define NPY_COMPLEX256 NPY_CDOUBLE
|
||||
typedef double npy_float128;
|
||||
typedef npy_cdouble npy_complex256;
|
||||
# define PyFloat128ScalarObject PyDoubleScalarObject
|
||||
# define PyComplex256ScalarObject PyCDoubleScalarObject
|
||||
# define PyFloat128ArrType_Type PyDoubleArrType_Type
|
||||
# define PyComplex256ArrType_Type PyCDoubleArrType_Type
|
||||
#define NPY_FLOAT128_FMT NPY_DOUBLE_FMT
|
||||
#define NPY_COMPLEX256_FMT NPY_CDOUBLE_FMT
|
||||
#endif
|
||||
#endif
|
||||
|
||||
|
||||
|
||||
#if NPY_BITSOF_FLOAT == 32
|
||||
#ifndef NPY_FLOAT32
|
||||
#define NPY_FLOAT32 NPY_FLOAT
|
||||
#define NPY_COMPLEX64 NPY_CFLOAT
|
||||
typedef float npy_float32;
|
||||
typedef npy_cfloat npy_complex64;
|
||||
# define PyFloat32ScalarObject PyFloatScalarObject
|
||||
# define PyComplex64ScalarObject PyCFloatScalarObject
|
||||
# define PyFloat32ArrType_Type PyFloatArrType_Type
|
||||
# define PyComplex64ArrType_Type PyCFloatArrType_Type
|
||||
#define NPY_FLOAT32_FMT NPY_FLOAT_FMT
|
||||
#define NPY_COMPLEX64_FMT NPY_CFLOAT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_FLOAT == 64
|
||||
#ifndef NPY_FLOAT64
|
||||
#define NPY_FLOAT64 NPY_FLOAT
|
||||
#define NPY_COMPLEX128 NPY_CFLOAT
|
||||
typedef float npy_float64;
|
||||
typedef npy_cfloat npy_complex128;
|
||||
# define PyFloat64ScalarObject PyFloatScalarObject
|
||||
# define PyComplex128ScalarObject PyCFloatScalarObject
|
||||
# define PyFloat64ArrType_Type PyFloatArrType_Type
|
||||
# define PyComplex128ArrType_Type PyCFloatArrType_Type
|
||||
#define NPY_FLOAT64_FMT NPY_FLOAT_FMT
|
||||
#define NPY_COMPLEX128_FMT NPY_CFLOAT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_FLOAT == 80
|
||||
#ifndef NPY_FLOAT80
|
||||
#define NPY_FLOAT80 NPY_FLOAT
|
||||
#define NPY_COMPLEX160 NPY_CFLOAT
|
||||
typedef float npy_float80;
|
||||
typedef npy_cfloat npy_complex160;
|
||||
# define PyFloat80ScalarObject PyFloatScalarObject
|
||||
# define PyComplex160ScalarObject PyCFloatScalarObject
|
||||
# define PyFloat80ArrType_Type PyFloatArrType_Type
|
||||
# define PyComplex160ArrType_Type PyCFloatArrType_Type
|
||||
#define NPY_FLOAT80_FMT NPY_FLOAT_FMT
|
||||
#define NPY_COMPLEX160_FMT NPY_CFLOAT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_FLOAT == 96
|
||||
#ifndef NPY_FLOAT96
|
||||
#define NPY_FLOAT96 NPY_FLOAT
|
||||
#define NPY_COMPLEX192 NPY_CFLOAT
|
||||
typedef float npy_float96;
|
||||
typedef npy_cfloat npy_complex192;
|
||||
# define PyFloat96ScalarObject PyFloatScalarObject
|
||||
# define PyComplex192ScalarObject PyCFloatScalarObject
|
||||
# define PyFloat96ArrType_Type PyFloatArrType_Type
|
||||
# define PyComplex192ArrType_Type PyCFloatArrType_Type
|
||||
#define NPY_FLOAT96_FMT NPY_FLOAT_FMT
|
||||
#define NPY_COMPLEX192_FMT NPY_CFLOAT_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_FLOAT == 128
|
||||
#ifndef NPY_FLOAT128
|
||||
#define NPY_FLOAT128 NPY_FLOAT
|
||||
#define NPY_COMPLEX256 NPY_CFLOAT
|
||||
typedef float npy_float128;
|
||||
typedef npy_cfloat npy_complex256;
|
||||
# define PyFloat128ScalarObject PyFloatScalarObject
|
||||
# define PyComplex256ScalarObject PyCFloatScalarObject
|
||||
# define PyFloat128ArrType_Type PyFloatArrType_Type
|
||||
# define PyComplex256ArrType_Type PyCFloatArrType_Type
|
||||
#define NPY_FLOAT128_FMT NPY_FLOAT_FMT
|
||||
#define NPY_COMPLEX256_FMT NPY_CFLOAT_FMT
|
||||
#endif
|
||||
#endif
|
||||
|
||||
/* half/float16 isn't a floating-point type in C */
|
||||
#define NPY_FLOAT16 NPY_HALF
|
||||
typedef npy_uint16 npy_half;
|
||||
typedef npy_half npy_float16;
|
||||
|
||||
#if NPY_BITSOF_LONGDOUBLE == 32
|
||||
#ifndef NPY_FLOAT32
|
||||
#define NPY_FLOAT32 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX64 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float32;
|
||||
typedef npy_clongdouble npy_complex64;
|
||||
# define PyFloat32ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex64ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat32ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex64ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT32_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX64_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_LONGDOUBLE == 64
|
||||
#ifndef NPY_FLOAT64
|
||||
#define NPY_FLOAT64 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX128 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float64;
|
||||
typedef npy_clongdouble npy_complex128;
|
||||
# define PyFloat64ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex128ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat64ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex128ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT64_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX128_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_LONGDOUBLE == 80
|
||||
#ifndef NPY_FLOAT80
|
||||
#define NPY_FLOAT80 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX160 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float80;
|
||||
typedef npy_clongdouble npy_complex160;
|
||||
# define PyFloat80ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex160ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat80ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex160ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT80_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX160_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_LONGDOUBLE == 96
|
||||
#ifndef NPY_FLOAT96
|
||||
#define NPY_FLOAT96 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX192 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float96;
|
||||
typedef npy_clongdouble npy_complex192;
|
||||
# define PyFloat96ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex192ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat96ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex192ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT96_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX192_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_LONGDOUBLE == 128
|
||||
#ifndef NPY_FLOAT128
|
||||
#define NPY_FLOAT128 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX256 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float128;
|
||||
typedef npy_clongdouble npy_complex256;
|
||||
# define PyFloat128ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex256ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat128ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex256ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT128_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX256_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
#elif NPY_BITSOF_LONGDOUBLE == 256
|
||||
#define NPY_FLOAT256 NPY_LONGDOUBLE
|
||||
#define NPY_COMPLEX512 NPY_CLONGDOUBLE
|
||||
typedef npy_longdouble npy_float256;
|
||||
typedef npy_clongdouble npy_complex512;
|
||||
# define PyFloat256ScalarObject PyLongDoubleScalarObject
|
||||
# define PyComplex512ScalarObject PyCLongDoubleScalarObject
|
||||
# define PyFloat256ArrType_Type PyLongDoubleArrType_Type
|
||||
# define PyComplex512ArrType_Type PyCLongDoubleArrType_Type
|
||||
#define NPY_FLOAT256_FMT NPY_LONGDOUBLE_FMT
|
||||
#define NPY_COMPLEX512_FMT NPY_CLONGDOUBLE_FMT
|
||||
#endif
|
||||
|
||||
/* datetime typedefs */
|
||||
typedef npy_int64 npy_timedelta;
|
||||
typedef npy_int64 npy_datetime;
|
||||
#define NPY_DATETIME_FMT NPY_INT64_FMT
|
||||
#define NPY_TIMEDELTA_FMT NPY_INT64_FMT
|
||||
|
||||
/* End of typedefs for numarray style bit-width names */
|
||||
|
||||
#endif
|
||||
|
|
@ -1,109 +0,0 @@
|
|||
/*
|
||||
* This sets (target) CPU-specific macros:
|
||||
* - Possible values:
|
||||
* NPY_CPU_X86
|
||||
* NPY_CPU_AMD64
|
||||
* NPY_CPU_PPC
|
||||
* NPY_CPU_PPC64
|
||||
* NPY_CPU_SPARC
|
||||
* NPY_CPU_S390
|
||||
* NPY_CPU_IA64
|
||||
* NPY_CPU_HPPA
|
||||
* NPY_CPU_ALPHA
|
||||
* NPY_CPU_ARMEL
|
||||
* NPY_CPU_ARMEB
|
||||
* NPY_CPU_SH_LE
|
||||
* NPY_CPU_SH_BE
|
||||
*/
|
||||
#ifndef _NPY_CPUARCH_H_
|
||||
#define _NPY_CPUARCH_H_
|
||||
|
||||
#include "numpyconfig.h"
|
||||
|
||||
#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
|
||||
/*
|
||||
* __i386__ is defined by gcc and Intel compiler on Linux,
|
||||
* _M_IX86 by VS compiler,
|
||||
* i386 by Sun compilers on opensolaris at least
|
||||
*/
|
||||
#define NPY_CPU_X86
|
||||
#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
|
||||
/*
|
||||
* both __x86_64__ and __amd64__ are defined by gcc
|
||||
* __x86_64 defined by sun compiler on opensolaris at least
|
||||
* _M_AMD64 defined by MS compiler
|
||||
*/
|
||||
#define NPY_CPU_AMD64
|
||||
#elif defined(__ppc__) || defined(__powerpc__) || defined(_ARCH_PPC)
|
||||
/*
|
||||
* __ppc__ is defined by gcc, I remember having seen __powerpc__ once,
|
||||
* but can't find it ATM
|
||||
* _ARCH_PPC is used by at least gcc on AIX
|
||||
*/
|
||||
#define NPY_CPU_PPC
|
||||
#elif defined(__ppc64__)
|
||||
#define NPY_CPU_PPC64
|
||||
#elif defined(__sparc__) || defined(__sparc)
|
||||
/* __sparc__ is defined by gcc and Forte (e.g. Sun) compilers */
|
||||
#define NPY_CPU_SPARC
|
||||
#elif defined(__s390__)
|
||||
#define NPY_CPU_S390
|
||||
#elif defined(__ia64)
|
||||
#define NPY_CPU_IA64
|
||||
#elif defined(__hppa)
|
||||
#define NPY_CPU_HPPA
|
||||
#elif defined(__alpha__)
|
||||
#define NPY_CPU_ALPHA
|
||||
#elif defined(__arm__) && defined(__ARMEL__)
|
||||
#define NPY_CPU_ARMEL
|
||||
#elif defined(__arm__) && defined(__ARMEB__)
|
||||
#define NPY_CPU_ARMEB
|
||||
#elif defined(__sh__) && defined(__LITTLE_ENDIAN__)
|
||||
#define NPY_CPU_SH_LE
|
||||
#elif defined(__sh__) && defined(__BIG_ENDIAN__)
|
||||
#define NPY_CPU_SH_BE
|
||||
#elif defined(__MIPSEL__)
|
||||
#define NPY_CPU_MIPSEL
|
||||
#elif defined(__MIPSEB__)
|
||||
#define NPY_CPU_MIPSEB
|
||||
#elif defined(__aarch64__)
|
||||
#define NPY_CPU_AARCH64
|
||||
#else
|
||||
#error Unknown CPU, please report this to numpy maintainers with \
|
||||
information about your platform (OS, CPU and compiler)
|
||||
#endif
|
||||
|
||||
/*
|
||||
This "white-lists" the architectures that we know don't require
|
||||
pointer alignment. We white-list, since the memcpy version will
|
||||
work everywhere, whereas assignment will only work where pointer
|
||||
dereferencing doesn't require alignment.
|
||||
|
||||
TODO: There may be more architectures we can white list.
|
||||
*/
|
||||
#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64)
|
||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) (*((PyObject **)(dst)) = *((PyObject **)(src)))
|
||||
#else
|
||||
#if NPY_SIZEOF_PY_INTPTR_T == 4
|
||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
|
||||
((char*)(dst))[0] = ((char*)(src))[0]; \
|
||||
((char*)(dst))[1] = ((char*)(src))[1]; \
|
||||
((char*)(dst))[2] = ((char*)(src))[2]; \
|
||||
((char*)(dst))[3] = ((char*)(src))[3];
|
||||
#elif NPY_SIZEOF_PY_INTPTR_T == 8
|
||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
|
||||
((char*)(dst))[0] = ((char*)(src))[0]; \
|
||||
((char*)(dst))[1] = ((char*)(src))[1]; \
|
||||
((char*)(dst))[2] = ((char*)(src))[2]; \
|
||||
((char*)(dst))[3] = ((char*)(src))[3]; \
|
||||
((char*)(dst))[4] = ((char*)(src))[4]; \
|
||||
((char*)(dst))[5] = ((char*)(src))[5]; \
|
||||
((char*)(dst))[6] = ((char*)(src))[6]; \
|
||||
((char*)(dst))[7] = ((char*)(src))[7];
|
||||
#else
|
||||
#error Unknown architecture, please report this to numpy maintainers with \
|
||||
information about your platform (OS, CPU and compiler)
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#endif
|
|
@ -1,129 +0,0 @@
|
|||
#ifndef _NPY_DEPRECATED_API_H
|
||||
#define _NPY_DEPRECATED_API_H
|
||||
|
||||
#if defined(_WIN32)
|
||||
#define _WARN___STR2__(x) #x
|
||||
#define _WARN___STR1__(x) _WARN___STR2__(x)
|
||||
#define _WARN___LOC__ __FILE__ "(" _WARN___STR1__(__LINE__) ") : Warning Msg: "
|
||||
#pragma message(_WARN___LOC__"Using deprecated NumPy API, disable it by " \
|
||||
"#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
|
||||
#elif defined(__GNUC__)
|
||||
#warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
|
||||
#endif
|
||||
/* TODO: How to do this warning message for other compilers? */
|
||||
|
||||
/*
|
||||
* This header exists to collect all dangerous/deprecated NumPy API.
|
||||
*
|
||||
* This is an attempt to remove bad API, the proliferation of macros,
|
||||
* and namespace pollution currently produced by the NumPy headers.
|
||||
*/
|
||||
|
||||
#if defined(NPY_NO_DEPRECATED_API)
|
||||
#error Should never include npy_deprecated_api directly.
|
||||
#endif
|
||||
|
||||
/* These array flags are deprecated as of NumPy 1.7 */
|
||||
#define NPY_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
|
||||
#define NPY_FORTRAN NPY_ARRAY_F_CONTIGUOUS
|
||||
|
||||
/*
|
||||
* The consistent NPY_ARRAY_* names which don't pollute the NPY_*
|
||||
* namespace were added in NumPy 1.7.
|
||||
*
|
||||
* These versions of the carray flags are deprecated, but
|
||||
* probably should only be removed after two releases instead of one.
|
||||
*/
|
||||
#define NPY_C_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
|
||||
#define NPY_F_CONTIGUOUS NPY_ARRAY_F_CONTIGUOUS
|
||||
#define NPY_OWNDATA NPY_ARRAY_OWNDATA
|
||||
#define NPY_FORCECAST NPY_ARRAY_FORCECAST
|
||||
#define NPY_ENSURECOPY NPY_ARRAY_ENSURECOPY
|
||||
#define NPY_ENSUREARRAY NPY_ARRAY_ENSUREARRAY
|
||||
#define NPY_ELEMENTSTRIDES NPY_ARRAY_ELEMENTSTRIDES
|
||||
#define NPY_ALIGNED NPY_ARRAY_ALIGNED
|
||||
#define NPY_NOTSWAPPED NPY_ARRAY_NOTSWAPPED
|
||||
#define NPY_WRITEABLE NPY_ARRAY_WRITEABLE
|
||||
#define NPY_UPDATEIFCOPY NPY_ARRAY_UPDATEIFCOPY
|
||||
#define NPY_BEHAVED NPY_ARRAY_BEHAVED
|
||||
#define NPY_BEHAVED_NS NPY_ARRAY_BEHAVED_NS
|
||||
#define NPY_CARRAY NPY_ARRAY_CARRAY
|
||||
#define NPY_CARRAY_RO NPY_ARRAY_CARRAY_RO
|
||||
#define NPY_FARRAY NPY_ARRAY_FARRAY
|
||||
#define NPY_FARRAY_RO NPY_ARRAY_FARRAY_RO
|
||||
#define NPY_DEFAULT NPY_ARRAY_DEFAULT
|
||||
#define NPY_IN_ARRAY NPY_ARRAY_IN_ARRAY
|
||||
#define NPY_OUT_ARRAY NPY_ARRAY_OUT_ARRAY
|
||||
#define NPY_INOUT_ARRAY NPY_ARRAY_INOUT_ARRAY
|
||||
#define NPY_IN_FARRAY NPY_ARRAY_IN_FARRAY
|
||||
#define NPY_OUT_FARRAY NPY_ARRAY_OUT_FARRAY
|
||||
#define NPY_INOUT_FARRAY NPY_ARRAY_INOUT_FARRAY
|
||||
#define NPY_UPDATE_ALL NPY_ARRAY_UPDATE_ALL
|
||||
|
||||
/* This way of accessing the default type is deprecated as of NumPy 1.7 */
|
||||
#define PyArray_DEFAULT NPY_DEFAULT_TYPE
|
||||
|
||||
/* These DATETIME bits aren't used internally */
|
||||
#if PY_VERSION_HEX >= 0x03000000
|
||||
#define PyDataType_GetDatetimeMetaData(descr) \
|
||||
((descr->metadata == NULL) ? NULL : \
|
||||
((PyArray_DatetimeMetaData *)(PyCapsule_GetPointer( \
|
||||
PyDict_GetItemString( \
|
||||
descr->metadata, NPY_METADATA_DTSTR), NULL))))
|
||||
#else
|
||||
#define PyDataType_GetDatetimeMetaData(descr) \
|
||||
((descr->metadata == NULL) ? NULL : \
|
||||
((PyArray_DatetimeMetaData *)(PyCObject_AsVoidPtr( \
|
||||
PyDict_GetItemString(descr->metadata, NPY_METADATA_DTSTR)))))
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Deprecated as of NumPy 1.7, this kind of shortcut doesn't
|
||||
* belong in the public API.
|
||||
*/
|
||||
#define NPY_AO PyArrayObject
|
||||
|
||||
/*
|
||||
* Deprecated as of NumPy 1.7, an all-lowercase macro doesn't
|
||||
* belong in the public API.
|
||||
*/
|
||||
#define fortran fortran_
|
||||
|
||||
/*
|
||||
* Deprecated as of NumPy 1.7, as it is a namespace-polluting
|
||||
* macro.
|
||||
*/
|
||||
#define FORTRAN_IF PyArray_FORTRAN_IF
|
||||
|
||||
/* Deprecated as of NumPy 1.7, datetime64 uses c_metadata instead */
|
||||
#define NPY_METADATA_DTSTR "__timeunit__"
|
||||
|
||||
/*
|
||||
* Deprecated as of NumPy 1.7.
|
||||
* The reasoning:
|
||||
* - These are for datetime, but there's no datetime "namespace".
|
||||
* - They just turn NPY_STR_<x> into "<x>", which is just
|
||||
* making something simple be indirected.
|
||||
*/
|
||||
#define NPY_STR_Y "Y"
|
||||
#define NPY_STR_M "M"
|
||||
#define NPY_STR_W "W"
|
||||
#define NPY_STR_D "D"
|
||||
#define NPY_STR_h "h"
|
||||
#define NPY_STR_m "m"
|
||||
#define NPY_STR_s "s"
|
||||
#define NPY_STR_ms "ms"
|
||||
#define NPY_STR_us "us"
|
||||
#define NPY_STR_ns "ns"
|
||||
#define NPY_STR_ps "ps"
|
||||
#define NPY_STR_fs "fs"
|
||||
#define NPY_STR_as "as"
|
||||
|
||||
/*
|
||||
* The macros in old_defines.h are deprecated as of NumPy 1.7 and will be
|
||||
* removed in the next major release.
|
||||
*/
|
||||
#include "old_defines.h"
|
||||
|
||||
|
||||
#endif
|
|
@ -1,46 +0,0 @@
|
|||
#ifndef _NPY_ENDIAN_H_
|
||||
#define _NPY_ENDIAN_H_
|
||||
|
||||
/*
|
||||
* NPY_BYTE_ORDER is set to the same value as BYTE_ORDER set by glibc in
|
||||
* endian.h
|
||||
*/
|
||||
|
||||
#ifdef NPY_HAVE_ENDIAN_H
|
||||
/* Use endian.h if available */
|
||||
#include <endian.h>
|
||||
|
||||
#define NPY_BYTE_ORDER __BYTE_ORDER
|
||||
#define NPY_LITTLE_ENDIAN __LITTLE_ENDIAN
|
||||
#define NPY_BIG_ENDIAN __BIG_ENDIAN
|
||||
#else
|
||||
/* Set endianness info using target CPU */
|
||||
#include "npy_cpu.h"
|
||||
|
||||
#define NPY_LITTLE_ENDIAN 1234
|
||||
#define NPY_BIG_ENDIAN 4321
|
||||
|
||||
#if defined(NPY_CPU_X86) \
|
||||
|| defined(NPY_CPU_AMD64) \
|
||||
|| defined(NPY_CPU_IA64) \
|
||||
|| defined(NPY_CPU_ALPHA) \
|
||||
|| defined(NPY_CPU_ARMEL) \
|
||||
|| defined(NPY_CPU_AARCH64) \
|
||||
|| defined(NPY_CPU_SH_LE) \
|
||||
|| defined(NPY_CPU_MIPSEL)
|
||||
#define NPY_BYTE_ORDER NPY_LITTLE_ENDIAN
|
||||
#elif defined(NPY_CPU_PPC) \
|
||||
|| defined(NPY_CPU_SPARC) \
|
||||
|| defined(NPY_CPU_S390) \
|
||||
|| defined(NPY_CPU_HPPA) \
|
||||
|| defined(NPY_CPU_PPC64) \
|
||||
|| defined(NPY_CPU_ARMEB) \
|
||||
|| defined(NPY_CPU_SH_BE) \
|
||||
|| defined(NPY_CPU_MIPSEB)
|
||||
#define NPY_BYTE_ORDER NPY_BIG_ENDIAN
|
||||
#else
|
||||
#error Unknown CPU: can not set endianness
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#endif
|
|
@ -1,117 +0,0 @@
|
|||
|
||||
/* Signal handling:
|
||||
|
||||
This header file defines macros that allow your code to handle
|
||||
interrupts received during processing. Interrupts that
|
||||
could reasonably be handled:
|
||||
|
||||
SIGINT, SIGABRT, SIGALRM, SIGSEGV
|
||||
|
||||
****Warning***************
|
||||
|
||||
Do not allow code that creates temporary memory or increases reference
|
||||
counts of Python objects to be interrupted unless you handle it
|
||||
differently.
|
||||
|
||||
**************************
|
||||
|
||||
The mechanism for handling interrupts is conceptually simple:
|
||||
|
||||
- replace the signal handler with our own home-grown version
|
||||
and store the old one.
|
||||
- run the code to be interrupted -- if an interrupt occurs
|
||||
the handler should basically just cause a return to the
|
||||
calling function to finish its work.
|
||||
- restore the old signal handler
|
||||
|
||||
Of course, any code that allows interrupts must account for
|
||||
returning via the interrupt and handle clean-up correctly. But,
|
||||
even still, the simple paradigm is complicated by at least three
|
||||
factors.
|
||||
|
||||
1) platform portability (i.e. Microsoft says not to use longjmp
|
||||
to return from signal handling. They have a __try and __except
|
||||
extension to C instead but what about mingw?).
|
||||
|
||||
2) how to handle threads: apparently whether signals are delivered to
|
||||
every thread of the process or the "invoking" thread is platform
|
||||
dependent. --- we don't handle threads for now.
|
||||
|
||||
3) do we need to worry about re-entrance? For now, assume the
|
||||
code will not call-back into itself.
|
||||
|
||||
Ideas:
|
||||
|
||||
1) Start by implementing an approach that works on platforms that
|
||||
can use setjmp and longjmp functionality and does nothing
|
||||
on other platforms.
|
||||
|
||||
2) Ignore threads --- i.e. do not mix interrupt handling and threads
|
||||
|
||||
3) Add a default signal_handler function to the C-API but have the rest
|
||||
use macros.
|
||||
|
||||
|
||||
Simple Interface:
|
||||
|
||||
|
||||
In your C-extension: around a block of code you want to be interruptible
|
||||
with a SIGINT
|
||||
|
||||
NPY_SIGINT_ON
|
||||
[code]
|
||||
NPY_SIGINT_OFF
|
||||
|
||||
In order for this to work correctly, the
|
||||
[code] block must not allocate any memory or alter the reference count of any
|
||||
Python objects. In other words [code] must be interruptible so that continuation
|
||||
after NPY_SIGINT_OFF will only be "missing some computations"
|
||||
|
||||
Interrupt handling does not work well with threads.
|
||||
|
||||
*/
|
||||
|
||||
/* Add signal handling macros
|
||||
Make the global variable and signal handler part of the C-API
|
||||
*/
|
||||
|
||||
#ifndef NPY_INTERRUPT_H
|
||||
#define NPY_INTERRUPT_H
|
||||
|
||||
#ifndef NPY_NO_SIGNAL
|
||||
|
||||
#include <setjmp.h>
|
||||
#include <signal.h>
|
||||
|
||||
#ifndef sigsetjmp
|
||||
|
||||
#define NPY_SIGSETJMP(arg1, arg2) setjmp(arg1)
|
||||
#define NPY_SIGLONGJMP(arg1, arg2) longjmp(arg1, arg2)
|
||||
#define NPY_SIGJMP_BUF jmp_buf
|
||||
|
||||
#else
|
||||
|
||||
#define NPY_SIGSETJMP(arg1, arg2) sigsetjmp(arg1, arg2)
|
||||
#define NPY_SIGLONGJMP(arg1, arg2) siglongjmp(arg1, arg2)
|
||||
#define NPY_SIGJMP_BUF sigjmp_buf
|
||||
|
||||
#endif
|
||||
|
||||
# define NPY_SIGINT_ON { \
|
||||
PyOS_sighandler_t _npy_sig_save; \
|
||||
_npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler); \
|
||||
if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()), \
|
||||
1) == 0) { \
|
||||
|
||||
# define NPY_SIGINT_OFF } \
|
||||
PyOS_setsig(SIGINT, _npy_sig_save); \
|
||||
}
|
||||
|
||||
#else /* NPY_NO_SIGNAL */
|
||||
|
||||
#define NPY_SIGINT_ON
|
||||
#define NPY_SIGINT_OFF
|
||||
|
||||
#endif /* NPY_NO_SIGNAL */
|
||||
|
||||
#endif /* NPY_INTERRUPT_H */
|
|
@ -1,438 +0,0 @@
|
|||
#ifndef __NPY_MATH_C99_H_
|
||||
#define __NPY_MATH_C99_H_
|
||||
|
||||
#include <math.h>
|
||||
#ifdef __SUNPRO_CC
|
||||
#include <sunmath.h>
|
||||
#endif
|
||||
#include <numpy/npy_common.h>
|
||||
|
||||
/*
|
||||
* NAN and INFINITY like macros (same behavior as glibc for NAN, same as C99
|
||||
* for INFINITY)
|
||||
*
|
||||
* XXX: I should test whether INFINITY and NAN are available on the platform
|
||||
*/
|
||||
NPY_INLINE static float __npy_inff(void)
|
||||
{
|
||||
const union { npy_uint32 __i; float __f;} __bint = {0x7f800000UL};
|
||||
return __bint.__f;
|
||||
}
|
||||
|
||||
NPY_INLINE static float __npy_nanf(void)
|
||||
{
|
||||
const union { npy_uint32 __i; float __f;} __bint = {0x7fc00000UL};
|
||||
return __bint.__f;
|
||||
}
|
||||
|
||||
NPY_INLINE static float __npy_pzerof(void)
|
||||
{
|
||||
const union { npy_uint32 __i; float __f;} __bint = {0x00000000UL};
|
||||
return __bint.__f;
|
||||
}
|
||||
|
||||
NPY_INLINE static float __npy_nzerof(void)
|
||||
{
|
||||
const union { npy_uint32 __i; float __f;} __bint = {0x80000000UL};
|
||||
return __bint.__f;
|
||||
}
|
||||
|
||||
#define NPY_INFINITYF __npy_inff()
|
||||
#define NPY_NANF __npy_nanf()
|
||||
#define NPY_PZEROF __npy_pzerof()
|
||||
#define NPY_NZEROF __npy_nzerof()
|
||||
|
||||
#define NPY_INFINITY ((npy_double)NPY_INFINITYF)
|
||||
#define NPY_NAN ((npy_double)NPY_NANF)
|
||||
#define NPY_PZERO ((npy_double)NPY_PZEROF)
|
||||
#define NPY_NZERO ((npy_double)NPY_NZEROF)
|
||||
|
||||
#define NPY_INFINITYL ((npy_longdouble)NPY_INFINITYF)
|
||||
#define NPY_NANL ((npy_longdouble)NPY_NANF)
|
||||
#define NPY_PZEROL ((npy_longdouble)NPY_PZEROF)
|
||||
#define NPY_NZEROL ((npy_longdouble)NPY_NZEROF)
|
||||
|
||||
/*
|
||||
* Useful constants
|
||||
*/
|
||||
#define NPY_E 2.718281828459045235360287471352662498 /* e */
|
||||
#define NPY_LOG2E 1.442695040888963407359924681001892137 /* log_2 e */
|
||||
#define NPY_LOG10E 0.434294481903251827651128918916605082 /* log_10 e */
|
||||
#define NPY_LOGE2 0.693147180559945309417232121458176568 /* log_e 2 */
|
||||
#define NPY_LOGE10 2.302585092994045684017991454684364208 /* log_e 10 */
|
||||
#define NPY_PI 3.141592653589793238462643383279502884 /* pi */
|
||||
#define NPY_PI_2 1.570796326794896619231321691639751442 /* pi/2 */
|
||||
#define NPY_PI_4 0.785398163397448309615660845819875721 /* pi/4 */
|
||||
#define NPY_1_PI 0.318309886183790671537767526745028724 /* 1/pi */
|
||||
#define NPY_2_PI 0.636619772367581343075535053490057448 /* 2/pi */
|
||||
#define NPY_EULER 0.577215664901532860606512090082402431 /* Euler constant */
|
||||
#define NPY_SQRT2 1.414213562373095048801688724209698079 /* sqrt(2) */
|
||||
#define NPY_SQRT1_2 0.707106781186547524400844362104849039 /* 1/sqrt(2) */
|
||||
|
||||
#define NPY_Ef 2.718281828459045235360287471352662498F /* e */
|
||||
#define NPY_LOG2Ef 1.442695040888963407359924681001892137F /* log_2 e */
|
||||
#define NPY_LOG10Ef 0.434294481903251827651128918916605082F /* log_10 e */
|
||||
#define NPY_LOGE2f 0.693147180559945309417232121458176568F /* log_e 2 */
|
||||
#define NPY_LOGE10f 2.302585092994045684017991454684364208F /* log_e 10 */
|
||||
#define NPY_PIf 3.141592653589793238462643383279502884F /* pi */
|
||||
#define NPY_PI_2f 1.570796326794896619231321691639751442F /* pi/2 */
|
||||
#define NPY_PI_4f 0.785398163397448309615660845819875721F /* pi/4 */
|
||||
#define NPY_1_PIf 0.318309886183790671537767526745028724F /* 1/pi */
|
||||
#define NPY_2_PIf 0.636619772367581343075535053490057448F /* 2/pi */
|
||||
#define NPY_EULERf 0.577215664901532860606512090082402431F /* Euler constant */
|
||||
#define NPY_SQRT2f 1.414213562373095048801688724209698079F /* sqrt(2) */
|
||||
#define NPY_SQRT1_2f 0.707106781186547524400844362104849039F /* 1/sqrt(2) */
|
||||
|
||||
#define NPY_El 2.718281828459045235360287471352662498L /* e */
|
||||
#define NPY_LOG2El 1.442695040888963407359924681001892137L /* log_2 e */
|
||||
#define NPY_LOG10El 0.434294481903251827651128918916605082L /* log_10 e */
|
||||
#define NPY_LOGE2l 0.693147180559945309417232121458176568L /* log_e 2 */
|
||||
#define NPY_LOGE10l 2.302585092994045684017991454684364208L /* log_e 10 */
|
||||
#define NPY_PIl 3.141592653589793238462643383279502884L /* pi */
|
||||
#define NPY_PI_2l 1.570796326794896619231321691639751442L /* pi/2 */
|
||||
#define NPY_PI_4l 0.785398163397448309615660845819875721L /* pi/4 */
|
||||
#define NPY_1_PIl 0.318309886183790671537767526745028724L /* 1/pi */
|
||||
#define NPY_2_PIl 0.636619772367581343075535053490057448L /* 2/pi */
|
||||
#define NPY_EULERl 0.577215664901532860606512090082402431L /* Euler constant */
|
||||
#define NPY_SQRT2l 1.414213562373095048801688724209698079L /* sqrt(2) */
|
||||
#define NPY_SQRT1_2l 0.707106781186547524400844362104849039L /* 1/sqrt(2) */
|
||||
|
||||
/*
|
||||
* C99 double math funcs
|
||||
*/
|
||||
double npy_sin(double x);
|
||||
double npy_cos(double x);
|
||||
double npy_tan(double x);
|
||||
double npy_sinh(double x);
|
||||
double npy_cosh(double x);
|
||||
double npy_tanh(double x);
|
||||
|
||||
double npy_asin(double x);
|
||||
double npy_acos(double x);
|
||||
double npy_atan(double x);
|
||||
double npy_aexp(double x);
|
||||
double npy_alog(double x);
|
||||
double npy_asqrt(double x);
|
||||
double npy_afabs(double x);
|
||||
|
||||
double npy_log(double x);
|
||||
double npy_log10(double x);
|
||||
double npy_exp(double x);
|
||||
double npy_sqrt(double x);
|
||||
|
||||
double npy_fabs(double x);
|
||||
double npy_ceil(double x);
|
||||
double npy_fmod(double x, double y);
|
||||
double npy_floor(double x);
|
||||
|
||||
double npy_expm1(double x);
|
||||
double npy_log1p(double x);
|
||||
double npy_hypot(double x, double y);
|
||||
double npy_acosh(double x);
|
||||
double npy_asinh(double xx);
|
||||
double npy_atanh(double x);
|
||||
double npy_rint(double x);
|
||||
double npy_trunc(double x);
|
||||
double npy_exp2(double x);
|
||||
double npy_log2(double x);
|
||||
|
||||
double npy_atan2(double x, double y);
|
||||
double npy_pow(double x, double y);
|
||||
double npy_modf(double x, double* y);
|
||||
|
||||
double npy_copysign(double x, double y);
|
||||
double npy_nextafter(double x, double y);
|
||||
double npy_spacing(double x);
|
||||
|
||||
/*
|
||||
* IEEE 754 FPU handling. These are guaranteed to be macros.
|
||||
*/
|
||||
#ifndef NPY_HAVE_DECL_ISNAN
|
||||
#define npy_isnan(x) ((x) != (x))
|
||||
#else
|
||||
#ifdef _MSC_VER
|
||||
#define npy_isnan(x) _isnan((x))
|
||||
#else
|
||||
#define npy_isnan(x) isnan((x))
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifndef NPY_HAVE_DECL_ISFINITE
|
||||
#ifdef _MSC_VER
|
||||
#define npy_isfinite(x) _finite((x))
|
||||
#else
|
||||
#define npy_isfinite(x) (!npy_isnan((x) + (-(x))))
|
||||
#endif
|
||||
#else
|
||||
#define npy_isfinite(x) isfinite((x))
|
||||
#endif
|
||||
|
||||
#ifndef NPY_HAVE_DECL_ISINF
|
||||
#define npy_isinf(x) (!npy_isfinite(x) && !npy_isnan(x))
|
||||
#else
|
||||
#ifdef _MSC_VER
|
||||
#define npy_isinf(x) (!_finite((x)) && !_isnan((x)))
|
||||
#else
|
||||
#define npy_isinf(x) isinf((x))
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifndef NPY_HAVE_DECL_SIGNBIT
|
||||
int _npy_signbit_f(float x);
|
||||
int _npy_signbit_d(double x);
|
||||
int _npy_signbit_ld(long double x);
|
||||
#define npy_signbit(x) \
|
||||
(sizeof (x) == sizeof (long double) ? _npy_signbit_ld (x) \
|
||||
: sizeof (x) == sizeof (double) ? _npy_signbit_d (x) \
|
||||
: _npy_signbit_f (x))
|
||||
#else
|
||||
#define npy_signbit(x) signbit((x))
|
||||
#endif
|
||||
|
||||
/*
|
||||
* float C99 math functions
|
||||
*/
|
||||
|
||||
float npy_sinf(float x);
|
||||
float npy_cosf(float x);
|
||||
float npy_tanf(float x);
|
||||
float npy_sinhf(float x);
|
||||
float npy_coshf(float x);
|
||||
float npy_tanhf(float x);
|
||||
float npy_fabsf(float x);
|
||||
float npy_floorf(float x);
|
||||
float npy_ceilf(float x);
|
||||
float npy_rintf(float x);
|
||||
float npy_truncf(float x);
|
||||
float npy_sqrtf(float x);
|
||||
float npy_log10f(float x);
|
||||
float npy_logf(float x);
|
||||
float npy_expf(float x);
|
||||
float npy_expm1f(float x);
|
||||
float npy_asinf(float x);
|
||||
float npy_acosf(float x);
|
||||
float npy_atanf(float x);
|
||||
float npy_asinhf(float x);
|
||||
float npy_acoshf(float x);
|
||||
float npy_atanhf(float x);
|
||||
float npy_log1pf(float x);
|
||||
float npy_exp2f(float x);
|
||||
float npy_log2f(float x);
|
||||
|
||||
float npy_atan2f(float x, float y);
|
||||
float npy_hypotf(float x, float y);
|
||||
float npy_powf(float x, float y);
|
||||
float npy_fmodf(float x, float y);
|
||||
|
||||
float npy_modff(float x, float* y);
|
||||
|
||||
float npy_copysignf(float x, float y);
|
||||
float npy_nextafterf(float x, float y);
|
||||
float npy_spacingf(float x);
|
||||
|
||||
/*
|
||||
* long double C99 math functions
|
||||
*/
|
||||
|
||||
npy_longdouble npy_sinl(npy_longdouble x);
|
||||
npy_longdouble npy_cosl(npy_longdouble x);
|
||||
npy_longdouble npy_tanl(npy_longdouble x);
|
||||
npy_longdouble npy_sinhl(npy_longdouble x);
|
||||
npy_longdouble npy_coshl(npy_longdouble x);
|
||||
npy_longdouble npy_tanhl(npy_longdouble x);
|
||||
npy_longdouble npy_fabsl(npy_longdouble x);
|
||||
npy_longdouble npy_floorl(npy_longdouble x);
|
||||
npy_longdouble npy_ceill(npy_longdouble x);
|
||||
npy_longdouble npy_rintl(npy_longdouble x);
|
||||
npy_longdouble npy_truncl(npy_longdouble x);
|
||||
npy_longdouble npy_sqrtl(npy_longdouble x);
|
||||
npy_longdouble npy_log10l(npy_longdouble x);
|
||||
npy_longdouble npy_logl(npy_longdouble x);
|
||||
npy_longdouble npy_expl(npy_longdouble x);
|
||||
npy_longdouble npy_expm1l(npy_longdouble x);
|
||||
npy_longdouble npy_asinl(npy_longdouble x);
|
||||
npy_longdouble npy_acosl(npy_longdouble x);
|
||||
npy_longdouble npy_atanl(npy_longdouble x);
|
||||
npy_longdouble npy_asinhl(npy_longdouble x);
|
||||
npy_longdouble npy_acoshl(npy_longdouble x);
|
||||
npy_longdouble npy_atanhl(npy_longdouble x);
|
||||
npy_longdouble npy_log1pl(npy_longdouble x);
|
||||
npy_longdouble npy_exp2l(npy_longdouble x);
|
||||
npy_longdouble npy_log2l(npy_longdouble x);
|
||||
|
||||
npy_longdouble npy_atan2l(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_hypotl(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_powl(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_fmodl(npy_longdouble x, npy_longdouble y);
|
||||
|
||||
npy_longdouble npy_modfl(npy_longdouble x, npy_longdouble* y);
|
||||
|
||||
npy_longdouble npy_copysignl(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_nextafterl(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_spacingl(npy_longdouble x);
|
||||
|
||||
/*
|
||||
* Non standard functions
|
||||
*/
|
||||
double npy_deg2rad(double x);
|
||||
double npy_rad2deg(double x);
|
||||
double npy_logaddexp(double x, double y);
|
||||
double npy_logaddexp2(double x, double y);
|
||||
|
||||
float npy_deg2radf(float x);
|
||||
float npy_rad2degf(float x);
|
||||
float npy_logaddexpf(float x, float y);
|
||||
float npy_logaddexp2f(float x, float y);
|
||||
|
||||
npy_longdouble npy_deg2radl(npy_longdouble x);
|
||||
npy_longdouble npy_rad2degl(npy_longdouble x);
|
||||
npy_longdouble npy_logaddexpl(npy_longdouble x, npy_longdouble y);
|
||||
npy_longdouble npy_logaddexp2l(npy_longdouble x, npy_longdouble y);
|
||||
|
||||
#define npy_degrees npy_rad2deg
|
||||
#define npy_degreesf npy_rad2degf
|
||||
#define npy_degreesl npy_rad2degl
|
||||
|
||||
#define npy_radians npy_deg2rad
|
||||
#define npy_radiansf npy_deg2radf
|
||||
#define npy_radiansl npy_deg2radl
|
||||
|
||||
/*
|
||||
* Complex declarations
|
||||
*/
|
||||
|
||||
/*
|
||||
* C99 specifies that complex numbers have the same representation as
|
||||
* an array of two elements, where the first element is the real part
|
||||
* and the second element is the imaginary part.
|
||||
*/
|
||||
#define __NPY_CPACK_IMP(x, y, type, ctype) \
|
||||
union { \
|
||||
ctype z; \
|
||||
type a[2]; \
|
||||
} z1; \
|
||||
\
|
||||
z1.a[0] = (x); \
|
||||
z1.a[1] = (y); \
|
||||
\
|
||||
return z1.z;
|
||||
|
||||
static NPY_INLINE npy_cdouble npy_cpack(double x, double y)
|
||||
{
|
||||
__NPY_CPACK_IMP(x, y, double, npy_cdouble);
|
||||
}
|
||||
|
||||
static NPY_INLINE npy_cfloat npy_cpackf(float x, float y)
|
||||
{
|
||||
__NPY_CPACK_IMP(x, y, float, npy_cfloat);
|
||||
}
|
||||
|
||||
static NPY_INLINE npy_clongdouble npy_cpackl(npy_longdouble x, npy_longdouble y)
|
||||
{
|
||||
__NPY_CPACK_IMP(x, y, npy_longdouble, npy_clongdouble);
|
||||
}
|
||||
#undef __NPY_CPACK_IMP
|
||||
|
||||
/*
|
||||
* Same remark as above, but in the other direction: extract first/second
|
||||
* member of complex number, assuming a C99-compatible representation
|
||||
*
|
||||
* These are defined as static inline, so a reasonable compiler would
|
||||
* most likely compile this to one or two instructions (on CISC at least)
|
||||
*/
|
||||
#define __NPY_CEXTRACT_IMP(z, index, type, ctype) \
|
||||
union { \
|
||||
ctype z; \
|
||||
type a[2]; \
|
||||
} __z_repr; \
|
||||
__z_repr.z = z; \
|
||||
\
|
||||
return __z_repr.a[index];
|
||||
|
||||
static NPY_INLINE double npy_creal(npy_cdouble z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 0, double, npy_cdouble);
|
||||
}
|
||||
|
||||
static NPY_INLINE double npy_cimag(npy_cdouble z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 1, double, npy_cdouble);
|
||||
}
|
||||
|
||||
static NPY_INLINE float npy_crealf(npy_cfloat z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 0, float, npy_cfloat);
|
||||
}
|
||||
|
||||
static NPY_INLINE float npy_cimagf(npy_cfloat z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 1, float, npy_cfloat);
|
||||
}
|
||||
|
||||
static NPY_INLINE npy_longdouble npy_creall(npy_clongdouble z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 0, npy_longdouble, npy_clongdouble);
|
||||
}
|
||||
|
||||
static NPY_INLINE npy_longdouble npy_cimagl(npy_clongdouble z)
|
||||
{
|
||||
__NPY_CEXTRACT_IMP(z, 1, npy_longdouble, npy_clongdouble);
|
||||
}
|
||||
#undef __NPY_CEXTRACT_IMP
|
||||
|
||||
/*
|
||||
* Double precision complex functions
|
||||
*/
|
||||
double npy_cabs(npy_cdouble z);
|
||||
double npy_carg(npy_cdouble z);
|
||||
|
||||
npy_cdouble npy_cexp(npy_cdouble z);
|
||||
npy_cdouble npy_clog(npy_cdouble z);
|
||||
npy_cdouble npy_cpow(npy_cdouble x, npy_cdouble y);
|
||||
|
||||
npy_cdouble npy_csqrt(npy_cdouble z);
|
||||
|
||||
npy_cdouble npy_ccos(npy_cdouble z);
|
||||
npy_cdouble npy_csin(npy_cdouble z);
|
||||
|
||||
/*
|
||||
* Single precision complex functions
|
||||
*/
|
||||
float npy_cabsf(npy_cfloat z);
|
||||
float npy_cargf(npy_cfloat z);
|
||||
|
||||
npy_cfloat npy_cexpf(npy_cfloat z);
|
||||
npy_cfloat npy_clogf(npy_cfloat z);
|
||||
npy_cfloat npy_cpowf(npy_cfloat x, npy_cfloat y);
|
||||
|
||||
npy_cfloat npy_csqrtf(npy_cfloat z);
|
||||
|
||||
npy_cfloat npy_ccosf(npy_cfloat z);
|
||||
npy_cfloat npy_csinf(npy_cfloat z);
|
||||
|
||||
/*
|
||||
* Extended precision complex functions
|
||||
*/
|
||||
npy_longdouble npy_cabsl(npy_clongdouble z);
|
||||
npy_longdouble npy_cargl(npy_clongdouble z);
|
||||
|
||||
npy_clongdouble npy_cexpl(npy_clongdouble z);
|
||||
npy_clongdouble npy_clogl(npy_clongdouble z);
|
||||
npy_clongdouble npy_cpowl(npy_clongdouble x, npy_clongdouble y);
|
||||
|
||||
npy_clongdouble npy_csqrtl(npy_clongdouble z);
|
||||
|
||||
npy_clongdouble npy_ccosl(npy_clongdouble z);
|
||||
npy_clongdouble npy_csinl(npy_clongdouble z);
|
||||
|
||||
/*
|
||||
* Functions that set the floating point error
|
||||
* status word.
|
||||
*/
|
||||
|
||||
void npy_set_floatstatus_divbyzero(void);
|
||||
void npy_set_floatstatus_overflow(void);
|
||||
void npy_set_floatstatus_underflow(void);
|
||||
void npy_set_floatstatus_invalid(void);
|
||||
|
||||
#endif
|
|
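The union-based packing and extraction used by `npy_cpack` and `npy_creal` above can be tried outside NumPy. The following is a minimal sketch, assuming a C99 compiler that provides `<complex.h>`; `demo_cpack` and `demo_part` are hypothetical names chosen for illustration and are not part of the NumPy API.

```c
#include <assert.h>
#include <complex.h>

/*
 * Hypothetical helper (not NumPy API): build a C99 complex value by writing
 * the real and imaginary parts through a two-element array that shares
 * storage with the complex object.
 */
static double complex demo_cpack(double re, double im)
{
    union {
        double complex z;
        double a[2];
    } u;

    u.a[0] = re;   /* first element: real part */
    u.a[1] = im;   /* second element: imaginary part */
    return u.z;
}

/*
 * Hypothetical helper (not NumPy API): read back one part of a complex
 * value. index 0 selects the real part, index 1 the imaginary part.
 */
static double demo_part(double complex z, int index)
{
    union {
        double complex z;
        double a[2];
    } u;

    assert(index == 0 || index == 1);
    u.z = z;
    return u.a[index];
}
```

A caller could check the round trip with `assert(demo_part(demo_cpack(1.5, -2.0), 1) == -2.0);`. The sketch relies only on the C99 layout guarantee quoted in the header comment, so it should not depend on how a particular libm implements `creal` or `cimag`.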
@ -1,19 +0,0 @@
|
|||
/*
|
||||
* This include file is provided for inclusion in Cython *.pxd files where
|
||||
* one would like to define the NPY_NO_DEPRECATED_API macro. It can be
|
||||
* included by
|
||||
*
|
||||
* cdef extern from "npy_no_deprecated_api.h": pass
|
||||
*
|
||||
*/
|
||||
#ifndef NPY_NO_DEPRECATED_API
|
||||
|
||||
/* put this check here since there may be multiple includes in C extensions. */
|
||||
#if defined(NDARRAYTYPES_H) || defined(_NPY_DEPRECATED_API_H) || \
|
||||
defined(OLD_DEFINES_H)
|
||||
#error "npy_no_deprecated_api.h" must be first among numpy includes.
|
||||
#else
|
||||
#define NPY_NO_DEPRECATED_API NPY_API_VERSION
|
||||
#endif
|
||||
|
||||
#endif
|
|
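The include-order guard in the header above, together with the version check that `old_defines.h` performs, can be illustrated without the NumPy headers. This is a minimal sketch under stated assumptions: every `DEMO_*` macro and the function `demo_deprecated_api_excluded` are hypothetical stand-ins, and `DEMO_TYPES_H` plays the role of an include guard from a header that was pulled in too early.

```c
#include <assert.h>  /* for the check shown in the usage note */

/* Hypothetical stand-ins for NPY_1_7_API_VERSION and NPY_API_VERSION. */
#define DEMO_1_7_API_VERSION 0x00000007
#define DEMO_API_VERSION     0x00000007

/* Uncommenting the next line simulates an earlier include and trips #error. */
/* #define DEMO_TYPES_H */

#ifndef DEMO_NO_DEPRECATED_API
#if defined(DEMO_TYPES_H)
#error "the opt-out header must come first among the demo includes"
#else
#define DEMO_NO_DEPRECATED_API DEMO_API_VERSION
#endif
#endif

/* Returns 1 when everything deprecated as of the 1.7 level is excluded. */
static int demo_deprecated_api_excluded(void)
{
#if defined(DEMO_NO_DEPRECATED_API) && DEMO_NO_DEPRECATED_API >= DEMO_1_7_API_VERSION
    return 1;
#else
    return 0;
#endif
}
```

With the simulated early include left commented out, `assert(demo_deprecated_api_excluded() == 1);` should hold. The point of the pattern is that the opt-out macro only has an effect if it is defined before any header that consults it.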
@ -1,30 +0,0 @@
|
|||
#ifndef _NPY_OS_H_
|
||||
#define _NPY_OS_H_
|
||||
|
||||
#if defined(linux) || defined(__linux) || defined(__linux__)
|
||||
#define NPY_OS_LINUX
|
||||
#elif defined(__FreeBSD__) || defined(__NetBSD__) || \
|
||||
defined(__OpenBSD__) || defined(__DragonFly__)
|
||||
#define NPY_OS_BSD
|
||||
#ifdef __FreeBSD__
|
||||
#define NPY_OS_FREEBSD
|
||||
#elif defined(__NetBSD__)
|
||||
#define NPY_OS_NETBSD
|
||||
#elif defined(__OpenBSD__)
|
||||
#define NPY_OS_OPENBSD
|
||||
#elif defined(__DragonFly__)
|
||||
#define NPY_OS_DRAGONFLY
|
||||
#endif
|
||||
#elif defined(sun) || defined(__sun)
|
||||
#define NPY_OS_SOLARIS
|
||||
#elif defined(__CYGWIN__)
|
||||
#define NPY_OS_CYGWIN
|
||||
#elif defined(_WIN32) || defined(__WIN32__) || defined(WIN32)
|
||||
#define NPY_OS_WIN32
|
||||
#elif defined(__APPLE__)
|
||||
#define NPY_OS_DARWIN
|
||||
#else
|
||||
#define NPY_OS_UNKNOWN
|
||||
#endif
|
||||
|
||||
#endif
|
|
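The same chain of predefined compiler macros can be used to pick a value instead of only defining a flag. The sketch below is an assumption-laden illustration, not NumPy code: `DEMO_OS_NAME` and `demo_os_name` are hypothetical, and the branch order simply mirrors the header above (with the BSD variants collapsed into one name).

```c
#include <assert.h>
#include <string.h>

/* Same test order as the header above; the first matching branch wins. */
#if defined(linux) || defined(__linux) || defined(__linux__)
#define DEMO_OS_NAME "linux"
#elif defined(__FreeBSD__) || defined(__NetBSD__) || \
      defined(__OpenBSD__) || defined(__DragonFly__)
#define DEMO_OS_NAME "bsd"
#elif defined(sun) || defined(__sun)
#define DEMO_OS_NAME "solaris"
#elif defined(__CYGWIN__)
#define DEMO_OS_NAME "cygwin"
#elif defined(_WIN32) || defined(__WIN32__) || defined(WIN32)
#define DEMO_OS_NAME "win32"
#elif defined(__APPLE__)
#define DEMO_OS_NAME "darwin"
#else
#define DEMO_OS_NAME "unknown"
#endif

/* Hypothetical helper: report which branch the preprocessor selected. */
static const char *demo_os_name(void)
{
    const char *name = DEMO_OS_NAME;

    assert(strlen(name) > 0);  /* every branch defines a non-empty name */
    return name;
}
```

Note that `__CYGWIN__` is tested before the Win32 macros, as in the header, so a Cygwin build is not misreported as plain Win32.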
@ -1,33 +0,0 @@
|
|||
#ifndef _NPY_NUMPYCONFIG_H_
|
||||
#define _NPY_NUMPYCONFIG_H_
|
||||
|
||||
#include "_numpyconfig.h"
|
||||
|
||||
/*
|
||||
* On Mac OS X, because there is only one configuration stage for all the archs
|
||||
* in universal builds, any macro which depends on the arch needs to be
|
||||
* hardcoded
|
||||
*/
|
||||
#ifdef __APPLE__
|
||||
#undef NPY_SIZEOF_LONG
|
||||
#undef NPY_SIZEOF_PY_INTPTR_T
|
||||
|
||||
#ifdef __LP64__
|
||||
#define NPY_SIZEOF_LONG 8
|
||||
#define NPY_SIZEOF_PY_INTPTR_T 8
|
||||
#else
|
||||
#define NPY_SIZEOF_LONG 4
|
||||
#define NPY_SIZEOF_PY_INTPTR_T 4
|
||||
#endif
|
||||
#endif
|
||||
|
||||
/**
|
||||
* To help with the NPY_NO_DEPRECATED_API macro, we include API version
|
||||
* numbers for specific versions of NumPy. To exclude all API that was
|
||||
* deprecated as of 1.7, add the following before #including any NumPy
|
||||
* headers:
|
||||
* #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
|
||||
*/
|
||||
#define NPY_1_7_API_VERSION 0x00000007
|
||||
|
||||
#endif
|
|
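The `__LP64__` switch above hardcodes sizes because a universal build is configured only once for several architectures. A minimal sketch of the same idea, with a sanity check against the compiler, could look as follows; `DEMO_SIZEOF_LONG` and `demo_sizeof_long` are hypothetical names, and the check assumes one of the common LP64, ILP32, or LLP64 data models.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the size per architecture at compile time, as the header does. */
#ifdef __LP64__
#define DEMO_SIZEOF_LONG 8
#else
#define DEMO_SIZEOF_LONG 4
#endif

/* Hypothetical helper: return the hardcoded size after verifying it. */
static size_t demo_sizeof_long(void)
{
    /* The hardcoded value is only useful if it agrees with the compiler. */
    assert((size_t)DEMO_SIZEOF_LONG == sizeof(long));
    return (size_t)DEMO_SIZEOF_LONG;
}
```

Because the macro is resolved separately for each architecture slice, the value follows the slice being compiled rather than the machine that ran the configuration step.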
@ -1,187 +0,0 @@
|
|||
/* This header is deprecated as of NumPy 1.7 */
|
||||
#ifndef OLD_DEFINES_H
|
||||
#define OLD_DEFINES_H
|
||||
|
||||
#if defined(NPY_NO_DEPRECATED_API) && NPY_NO_DEPRECATED_API >= NPY_1_7_API_VERSION
|
||||
#error The header "old_defines.h" is deprecated as of NumPy 1.7.
|
||||
#endif
|
||||
|
||||
#define NDARRAY_VERSION NPY_VERSION
|
||||
|
||||
#define PyArray_MIN_BUFSIZE NPY_MIN_BUFSIZE
|
||||
#define PyArray_MAX_BUFSIZE NPY_MAX_BUFSIZE
|
||||
#define PyArray_BUFSIZE NPY_BUFSIZE
|
||||
|
||||
#define PyArray_PRIORITY NPY_PRIORITY
|
||||
#define PyArray_SUBTYPE_PRIORITY NPY_PRIORITY
|
||||
#define PyArray_NUM_FLOATTYPE NPY_NUM_FLOATTYPE
|
||||
|
||||
#define NPY_MAX PyArray_MAX
|
||||
#define NPY_MIN PyArray_MIN
|
||||
|
||||
#define PyArray_TYPES NPY_TYPES
|
||||
#define PyArray_BOOL NPY_BOOL
|
||||
#define PyArray_BYTE NPY_BYTE
|
||||
#define PyArray_UBYTE NPY_UBYTE
|
||||
#define PyArray_SHORT NPY_SHORT
|
||||
#define PyArray_USHORT NPY_USHORT
|
||||
#define PyArray_INT NPY_INT
|
||||
#define PyArray_UINT NPY_UINT
|
||||
#define PyArray_LONG NPY_LONG
|
||||
#define PyArray_ULONG NPY_ULONG
|
||||
#define PyArray_LONGLONG NPY_LONGLONG
|
||||
#define PyArray_ULONGLONG NPY_ULONGLONG
|
||||
#define PyArray_HALF NPY_HALF
|
||||
#define PyArray_FLOAT NPY_FLOAT
|
||||
#define PyArray_DOUBLE NPY_DOUBLE
|
||||
#define PyArray_LONGDOUBLE NPY_LONGDOUBLE
|
||||
#define PyArray_CFLOAT NPY_CFLOAT
|
||||
#define PyArray_CDOUBLE NPY_CDOUBLE
|
||||
#define PyArray_CLONGDOUBLE NPY_CLONGDOUBLE
|
||||
#define PyArray_OBJECT NPY_OBJECT
|
||||
#define PyArray_STRING NPY_STRING
|
||||
#define PyArray_UNICODE NPY_UNICODE
|
||||
#define PyArray_VOID NPY_VOID
|
||||
#define PyArray_DATETIME NPY_DATETIME
|
||||
#define PyArray_TIMEDELTA NPY_TIMEDELTA
|
||||
#define PyArray_NTYPES NPY_NTYPES
|
||||
#define PyArray_NOTYPE NPY_NOTYPE
|
||||
#define PyArray_CHAR NPY_CHAR
|
||||
#define PyArray_USERDEF NPY_USERDEF
|
||||
#define PyArray_NUMUSERTYPES NPY_NUMUSERTYPES
|
||||
|
||||
#define PyArray_INTP NPY_INTP
|
||||
#define PyArray_UINTP NPY_UINTP
|
||||
|
||||
#define PyArray_INT8 NPY_INT8
|
||||
#define PyArray_UINT8 NPY_UINT8
|
||||
#define PyArray_INT16 NPY_INT16
|
||||
#define PyArray_UINT16 NPY_UINT16
|
||||
#define PyArray_INT32 NPY_INT32
|
||||
#define PyArray_UINT32 NPY_UINT32
|
||||
|
||||
#ifdef NPY_INT64
|
||||
#define PyArray_INT64 NPY_INT64
|
||||
#define PyArray_UINT64 NPY_UINT64
|
||||
#endif
|
||||
|
||||
#ifdef NPY_INT128
|
||||
#define PyArray_INT128 NPY_INT128
|
||||
#define PyArray_UINT128 NPY_UINT128
|
||||
#endif
|
||||
|
||||
#ifdef NPY_FLOAT16
|
||||
#define PyArray_FLOAT16 NPY_FLOAT16
|
||||
#define PyArray_COMPLEX32 NPY_COMPLEX32
|
||||
#endif
|
||||
|
||||
#ifdef NPY_FLOAT80
|
||||
#define PyArray_FLOAT80 NPY_FLOAT80
|
||||
#define PyArray_COMPLEX160 NPY_COMPLEX160
|
||||
#endif
|
||||
|
||||
#ifdef NPY_FLOAT96
|
||||
#define PyArray_FLOAT96 NPY_FLOAT96
|
||||
#define PyArray_COMPLEX192 NPY_COMPLEX192
|
||||
#endif
|
||||
|
||||
#ifdef NPY_FLOAT128
|
||||
#define PyArray_FLOAT128 NPY_FLOAT128
|
||||
#define PyArray_COMPLEX256 NPY_COMPLEX256
|
||||
#endif
|
||||
|
||||
#define PyArray_FLOAT32 NPY_FLOAT32
|
||||
#define PyArray_COMPLEX64 NPY_COMPLEX64
|
||||
#define PyArray_FLOAT64 NPY_FLOAT64
|
||||
#define PyArray_COMPLEX128 NPY_COMPLEX128
|
||||
|
||||
|
||||
#define PyArray_TYPECHAR NPY_TYPECHAR
|
||||
#define PyArray_BOOLLTR NPY_BOOLLTR
|
||||
#define PyArray_BYTELTR NPY_BYTELTR
|
||||
#define PyArray_UBYTELTR NPY_UBYTELTR
|
||||
#define PyArray_SHORTLTR NPY_SHORTLTR
|
||||
#define PyArray_USHORTLTR NPY_USHORTLTR
|
||||
#define PyArray_INTLTR NPY_INTLTR
|
||||
#define PyArray_UINTLTR NPY_UINTLTR
|
||||
#define PyArray_LONGLTR NPY_LONGLTR
|
||||
#define PyArray_ULONGLTR NPY_ULONGLTR
|
||||
#define PyArray_LONGLONGLTR NPY_LONGLONGLTR
|
||||
#define PyArray_ULONGLONGLTR NPY_ULONGLONGLTR
|
||||
#define PyArray_HALFLTR NPY_HALFLTR
|
||||
#define PyArray_FLOATLTR NPY_FLOATLTR
|
||||
#define PyArray_DOUBLELTR NPY_DOUBLELTR
|
||||
#define PyArray_LONGDOUBLELTR NPY_LONGDOUBLELTR
|
||||
#define PyArray_CFLOATLTR NPY_CFLOATLTR
|
||||
#define PyArray_CDOUBLELTR NPY_CDOUBLELTR
|
||||
#define PyArray_CLONGDOUBLELTR NPY_CLONGDOUBLELTR
|
||||
#define PyArray_OBJECTLTR NPY_OBJECTLTR
|
||||
#define PyArray_STRINGLTR NPY_STRINGLTR
|
||||
#define PyArray_STRINGLTR2 NPY_STRINGLTR2
|
||||
#define PyArray_UNICODELTR NPY_UNICODELTR
|
||||
#define PyArray_VOIDLTR NPY_VOIDLTR
|
||||
#define PyArray_DATETIMELTR NPY_DATETIMELTR
|
||||
#define PyArray_TIMEDELTALTR NPY_TIMEDELTALTR
|
||||
#define PyArray_CHARLTR NPY_CHARLTR
|
||||
#define PyArray_INTPLTR NPY_INTPLTR
|
||||
#define PyArray_UINTPLTR NPY_UINTPLTR
|
||||
#define PyArray_GENBOOLLTR NPY_GENBOOLLTR
|
||||
#define PyArray_SIGNEDLTR NPY_SIGNEDLTR
|
||||
#define PyArray_UNSIGNEDLTR NPY_UNSIGNEDLTR
|
||||
#define PyArray_FLOATINGLTR NPY_FLOATINGLTR
|
||||
#define PyArray_COMPLEXLTR NPY_COMPLEXLTR
|
||||
|
||||
#define PyArray_QUICKSORT NPY_QUICKSORT
|
||||
#define PyArray_HEAPSORT NPY_HEAPSORT
|
||||
#define PyArray_MERGESORT NPY_MERGESORT
|
||||
#define PyArray_SORTKIND NPY_SORTKIND
|
||||
#define PyArray_NSORTS NPY_NSORTS
|
||||
|
||||
#define PyArray_NOSCALAR NPY_NOSCALAR
|
||||
#define PyArray_BOOL_SCALAR NPY_BOOL_SCALAR
|
||||
#define PyArray_INTPOS_SCALAR NPY_INTPOS_SCALAR
|
||||
#define PyArray_INTNEG_SCALAR NPY_INTNEG_SCALAR
|
||||
#define PyArray_FLOAT_SCALAR NPY_FLOAT_SCALAR
|
||||
#define PyArray_COMPLEX_SCALAR NPY_COMPLEX_SCALAR
|
||||
#define PyArray_OBJECT_SCALAR NPY_OBJECT_SCALAR
|
||||
#define PyArray_SCALARKIND NPY_SCALARKIND
|
||||
#define PyArray_NSCALARKINDS NPY_NSCALARKINDS
|
||||
|
||||
#define PyArray_ANYORDER NPY_ANYORDER
|
||||
#define PyArray_CORDER NPY_CORDER
|
||||
#define PyArray_FORTRANORDER NPY_FORTRANORDER
|
||||
#define PyArray_ORDER NPY_ORDER
|
||||
|
||||
#define PyDescr_ISBOOL PyDataType_ISBOOL
|
||||
#define PyDescr_ISUNSIGNED PyDataType_ISUNSIGNED
|
||||
#define PyDescr_ISSIGNED PyDataType_ISSIGNED
|
||||
#define PyDescr_ISINTEGER PyDataType_ISINTEGER
|
||||
#define PyDescr_ISFLOAT PyDataType_ISFLOAT
|
||||
#define PyDescr_ISNUMBER PyDataType_ISNUMBER
|
||||
#define PyDescr_ISSTRING PyDataType_ISSTRING
|
||||
#define PyDescr_ISCOMPLEX PyDataType_ISCOMPLEX
|
||||
#define PyDescr_ISPYTHON PyDataType_ISPYTHON
|
||||
#define PyDescr_ISFLEXIBLE PyDataType_ISFLEXIBLE
|
||||
#define PyDescr_ISUSERDEF PyDataType_ISUSERDEF
|
||||
#define PyDescr_ISEXTENDED PyDataType_ISEXTENDED
|
||||
#define PyDescr_ISOBJECT PyDataType_ISOBJECT
|
||||
#define PyDescr_HASFIELDS PyDataType_HASFIELDS
|
||||
|
||||
#define PyArray_LITTLE NPY_LITTLE
|
||||
#define PyArray_BIG NPY_BIG
|
||||
#define PyArray_NATIVE NPY_NATIVE
|
||||
#define PyArray_SWAP NPY_SWAP
|
||||
#define PyArray_IGNORE NPY_IGNORE
|
||||
|
||||
#define PyArray_NATBYTE NPY_NATBYTE
|
||||
#define PyArray_OPPBYTE NPY_OPPBYTE
|
||||
|
||||
#define PyArray_MAX_ELSIZE NPY_MAX_ELSIZE
|
||||
|
||||
#define PyArray_USE_PYMEM NPY_USE_PYMEM
|
||||
|
||||
#define PyArray_RemoveLargest PyArray_RemoveSmallest
|
||||
|
||||
#define PyArray_UCS4 npy_ucs4
|
||||
|
||||
#endif
|
|
@ -1,23 +0,0 @@
|
|||
#include "arrayobject.h"
|
||||
|
||||
#ifndef REFCOUNT
|
||||
# define REFCOUNT NPY_REFCOUNT
|
||||
# define MAX_ELSIZE 16
|
||||
#endif
|
||||
|
||||
#define PyArray_UNSIGNED_TYPES
|
||||
#define PyArray_SBYTE NPY_BYTE
|
||||
#define PyArray_CopyArray PyArray_CopyInto
|
||||
#define _PyArray_multiply_list PyArray_MultiplyIntList
|
||||
#define PyArray_ISSPACESAVER(m) NPY_FALSE
|
||||
#define PyScalarArray_Check PyArray_CheckScalar
|
||||
|
||||
#define CONTIGUOUS NPY_CONTIGUOUS
|
||||
#define OWN_DIMENSIONS 0
|
||||
#define OWN_STRIDES 0
|
||||
#define OWN_DATA NPY_OWNDATA
|
||||
#define SAVESPACE 0
|
||||
#define SAVESPACEBIT 0
|
||||
|
||||
#undef import_array
|
||||
#define import_array() { if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); } }
|
Some files were not shown because too many files have changed in this diff