mirror of https://github.com/explosion/spaCy.git
synced 2024-12-25 09:26:27 +03:00

Merge branch 'develop' into master-tmp

This commit is contained in:
commit 59deeb7da6
@@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test sdist"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.tar.gz"
  - wait
  - trigger: "spacy-sdist-against-models"
    label: ":dizzy: :hammer:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
@@ -1,11 +0,0 @@
steps:
  -
    command: "fab env clean make test wheel"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.whl"
  - wait
  - trigger: "spacy-train-from-wheel"
    label: ":dizzy: :train:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"
106  .github/contributors/tiangolo.md  (vendored, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [ ] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry             |
| ------------------------------ | ----------------- |
| Name                           | Sebastián Ramírez |
| Company name (if applicable)   |                   |
| Title or role (if applicable)  |                   |
| Date                           | 2020-07-01        |
| GitHub username                | tiangolo          |
| Website (optional)             |                   |
8  .gitignore  (vendored)
@@ -18,8 +18,7 @@ website/.npm
 website/logs
 *.log
 npm-debug.log*
-website/www/
-website/_deploy.sh
+quickstart-training-generator.js
 
 # Cython / C extensions
 cythonize.json
@@ -44,12 +43,14 @@ __pycache__/
 .env*
 .~env/
 .venv
+env3.6/
 venv/
 env3.*/
 .dev
 .denv
 .pypyenv
 .pytest_cache/
+.mypy_cache/
 
 # Distribution / packaging
 env/
@@ -119,3 +120,6 @@ Desktop.ini
 
 # Pycharm project files
 *.idea
+
+# IPython
+.ipynb_checkpoints/
23  .travis.yml  (deleted)
@@ -1,23 +0,0 @@
language: python
sudo: false
cache: pip
dist: trusty
group: edge
python:
  - "2.7"
os:
  - linux
install:
  - "pip install -r requirements.txt"
  - "python setup.py build_ext --inplace"
  - "pip install -e ."
script:
  - "cat /proc/cpuinfo | grep flags | head -n 1"
  - "python -m pytest --tb=native spacy"
branches:
  except:
    - spacy.io
notifications:
  slack:
    secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ=
  email: false
188  CONTRIBUTING.md
@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.
 
 ## Table of contents
 
@@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
 opening an issue to report the bug, simply refer to your pull request in the
 issue body. A few more tips:
 
 - **Describing your issue:** Try to provide as many details as possible. What
   exactly goes wrong? _How_ is it failing? Is there an error?
   "XY doesn't work" usually isn't that helpful for tracking down problems. Always
   remember to include the code you ran and if possible, extract only the relevant
   parts and don't just dump your entire script. This will make it easier for us to
   reproduce the error.
 
 - **Getting info about your spaCy installation and environment:** If you're
   using spaCy v1.7+, you can use the command line interface to print details and
   even format them as Markdown to copy-paste into GitHub issues:
   `python -m spacy info --markdown`.
 
 - **Checking the model compatibility:** If you're having problems with a
   [statistical model](https://spacy.io/models), it may be because the
   model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
   this on the command line by running `python -m spacy validate`.
 
 - **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
   comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
   you can run from within your script or a Jupyter notebook. For some issues, it's
   helpful to **include a screenshot** of the visualization. You can simply drag and
   drop the image into GitHub's editor and it will be uploaded and included.
 
 - **Sharing long blocks of code or logs:** If you need to include long code,
   logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
   [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
   so it only becomes visible on click, making the issue easier to read and follow.
 
 ### Issue labels
 
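As a minimal sketch of the visualizer workflow mentioned in the bullets above (assuming the small English pipeline `en_core_web_sm` is installed; the example text is illustrative, not part of the original guidelines):

```python
import spacy
from spacy import displacy

# Load an installed pipeline and visualize its entity predictions.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# In a Jupyter notebook this renders inline; from a plain script,
# displacy.serve(doc, style="ent") starts a local web server instead.
displacy.render(doc, style="ent")
```

A screenshot of this output is usually enough context for a bug report about entity or dependency predictions.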
@@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our
 philosophy is to prefer a smaller core library. We generally ask the following
 questions:
 
 - **What would this feature look like if implemented in a separate package?**
   Some features would be very difficult to implement externally – for example,
   changes to spaCy's built-in methods. In contrast, a library of word
   alignment functions could easily live as a separate package that depended on
   spaCy — there's little difference between writing `import word_aligner` and
   `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
   [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
   and add your own attributes, properties and methods to the `Doc`, `Token` and
   `Span`. If you're looking to implement a new spaCy feature, starting with a
   custom component package is usually the best strategy. You won't have to worry
   about spaCy's internals and you can test your module in an isolated
   environment. And if it works well, we can always integrate it into the core
   library later.
 
 - **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
   Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
   TensorFlow/Keras do lots of useful things — but we don't want to have them as
   dependencies. If the feature requires functionality in one of these libraries,
   it's probably better to break it out into a different package.
 
 - **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
   spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
   As better techniques are developed, we prefer to drop support for "the old way".
   However, it's rare that one approach _entirely_ dominates another. It's very
   common that there's still a use-case for the "obsolete" approach. For instance,
   [WordNet](https://wordnet.princeton.edu/) is still very useful — but word
   vectors are better for most use-cases, and the two approaches to lexical
   semantics do a lot of the same things. spaCy therefore only supports word
   vectors, and support for WordNet is currently left for other packages.
 
 - **Do you need the feature to get basic things done?** We do want spaCy to be
   at least somewhat self-contained. If we keep needing some feature in our
   recipes, that does provide some argument for bringing it "in house".
 
 ### Getting started
 
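A minimal sketch of the custom pipeline component approach described in the first bullet of the hunk above (spaCy v3-style registration; the component name and example text are illustrative):

```python
import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    # A trivial stateless component: inspect the Doc, then pass it along.
    print("Tokens in doc:", len(doc))
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("token_counter", last=True)
doc = nlp("This component could live in a separate package.")
```

Because the component is registered by name, a third-party package only has to expose the registration for users to `add_pipe` it into their own pipelines.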
@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting
 
 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format
@@ -203,10 +203,10 @@ your files on save:
 
 ```json
 {
     "python.formatting.provider": "black",
     "[python]": {
         "editor.formatOnSave": true
     }
 }
 ```
 
@@ -216,7 +216,7 @@ list of available editor integrations.
 #### Disabling formatting
 
 There are a few cases where auto-formatting doesn't improve readability – for
-example, in some of the the language data files like the `tag_map.py`, or in
+example, in some of the language data files like the `tag_map.py`, or in
 the tests that construct `Doc` objects from lists of words and other labels.
 Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
 for that particular code. Here's an example:
@@ -224,7 +224,7 @@ for that particular code. Here's an example:
 ```python
 # fmt: off
 text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
-heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
+heads = [1, 1, 1, 1, 3, 4, 1, 6, 11, 11, 11, 11, 14, 14, 11, 16, 17, 14, 11]
 deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
         "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
         "poss", "nsubj", "ccomp", "punct"]
@@ -280,29 +280,13 @@ except: # noqa: E722
 
 ### Python conventions
 
-All Python code must be written in an **intersection of Python 2 and Python 3**.
-This is easy in Cython, but somewhat ugly in Python. Logic that deals with
-Python or platform compatibility should only live in
-[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
-functions, replacement functions are suffixed with an underscore, for example
-`unicode_`. If you need to access the user's version or platform information,
-for example to show more specific error messages, you can use the `is_config()`
-helper function.
-
-```python
-from .compat import unicode_, is_config
-
-compatible_unicode = unicode_('hello world')
-if is_config(windows=True, python2=True):
-    print("You are using Python 2 on Windows.")
-```
-
+All Python code must be written **compatible with Python 3.6+**.
 Code that interacts with the file-system should accept objects that follow the
 `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.
 
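To illustrate the path and file-like conventions above, a minimal sketch (the function names are illustrative, not spaCy's actual API):

```python
from pathlib import Path

def ensure_path(path):
    # Accept either a plain string or any Path-like object; only convert strings.
    return Path(path) if isinstance(path, str) else path

def to_bytes_io(file_, data: bytes) -> None:
    # Serialization helpers operate on file-like objects, keeping the code IO-agnostic.
    file_.write(data)

def to_disk(path, data: bytes) -> None:
    # User-facing function: accepts str or Path and delegates to the file-like helper.
    path = ensure_path(path)
    with path.open("wb") as file_:
        to_bytes_io(file_, data)
```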
@@ -400,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so
@@ -413,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython.
 
 ### Resources to get you started
 
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
 
 ## Adding tests
 
@@ -428,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.
 
 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.
 
@@ -437,7 +421,7 @@ Tests that require the model to be loaded should be marked with
 `@pytest.mark.models`. Loading the models is expensive and not necessary if
 you're not actually testing the model performance. If all you need is a `Doc`
 object with annotations like heads, POS tags or the dependency parse, you can
-use the `get_doc()` utility function to construct it manually.
+use the `Doc` constructor to construct it manually.
 
 📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
 
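A minimal sketch of a parametrized test that constructs a `Doc` directly instead of loading a trained pipeline, as the two hunks above describe (the test name and inputs are illustrative):

```python
import pytest
from spacy.tokens import Doc
from spacy.vocab import Vocab

@pytest.mark.parametrize("words", [["Hello", "world"], ["This", "is", "a", "test", "."]])
def test_doc_length_matches_words(words):
    # No model loading needed: build the Doc manually from a list of words.
    doc = Doc(Vocab(), words=words)
    assert len(doc) == len(words)
```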
@@ -456,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page.
 We're very excited about all the new possibilities for **community extensions**
 and plugins in spaCy v2.0, and we can't wait to see what you build with it!
 
 - An extension or plugin should add substantial functionality, be
   **well-documented** and **open-source**. It should be available for users to download
   and install as a Python package – for example via [PyPi](http://pypi.python.org).
 
 - Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
   as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
   that users can **add to their processing pipeline** using `nlp.add_pipe()`.
 
 - When publishing your extension on GitHub, **tag it** with the topics
   [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
   [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
   to make it easier to find. Those are also the topics we're linking to from the
   spaCy website. If you're sharing your project on Twitter, feel free to tag
   [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
 
 - Once your extension is published, you can open an issue on the
   [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
   [resources directory](https://spacy.io/usage/resources#extensions) on the
   website.
 
 📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**
 
@@ -1,9 +1,9 @@
 recursive-include include *.h
-recursive-include spacy *.txt *.pyx *.pxd
+recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja
 include LICENSE
 include README.md
-include bin/spacy
 include pyproject.toml
 recursive-exclude spacy/lang *.json
 recursive-include spacy/lang *.json.gz
+recursive-include spacy/cli *.json *.yml
 recursive-include licenses *
48  Makefile
@@ -1,29 +1,55 @@
 SHELL := /bin/bash
-PYVER := 3.6
+
+ifndef SPACY_EXTRAS
+override SPACY_EXTRAS = spacy-lookups-data==1.0.0rc0 jieba pkuseg==0.0.25 pickle5 sudachipy sudachidict_core
+endif
+
+ifndef PYVER
+override PYVER = 3.6
+endif
+
 VENV := ./env$(PYVER)
 
 version := $(shell "bin/get-version.sh")
+package := $(shell "bin/get-package.sh")
 
-dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core
+ifndef SPACY_BIN
+override SPACY_BIN = $(package)-$(version).pex
+endif
+
+ifndef WHEELHOUSE
+override WHEELHOUSE = "./wheelhouse"
+endif
+
+
+dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp
+	$(VENV)/bin/pex \
+		-f $(WHEELHOUSE) \
+		--no-index \
+		--disable-cache \
+		-o $@ \
+		$(package)==$(version) \
+		$(SPACY_EXTRAS)
 	chmod a+rx $@
 	cp $@ dist/spacy.pex
 
-dist/pytest.pex : wheelhouse/pytest-*.whl
-	$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
+dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl
+	$(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
 	chmod a+rx $@
 
-wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
-	$(VENV)/bin/pip wheel . -w ./wheelhouse
-	$(VENV)/bin/pip wheel jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse
+$(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
+	$(VENV)/bin/pip wheel . -w $(WHEELHOUSE)
+	$(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE)
 
 	touch $@
 
-wheelhouse/pytest-%.whl : $(VENV)/bin/pex
-	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
+$(WHEELHOUSE)/pytest-%.whl : $(VENV)/bin/pex
+	$(VENV)/bin/pip wheel pytest pytest-timeout mock -w $(WHEELHOUSE)
 
 $(VENV)/bin/pex :
 	python$(PYVER) -m venv $(VENV)
 	$(VENV)/bin/pip install -U pip setuptools pex wheel
+	$(VENV)/bin/pip install numpy
 
 .PHONY : clean test
 
@@ -33,6 +59,6 @@ test : dist/spacy-$(version).pex dist/pytest.pex
 
 clean : setup.py
 	rm -rf dist/*
-	rm -rf ./wheelhouse
+	rm -rf $(WHEELHOUSE)/*
 	rm -rf $(VENV)
 	python setup.py clean --all
105  README.md
@@ -4,18 +4,19 @@
 
 spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
-be used in real products. spaCy comes with
-[pretrained statistical models](https://spacy.io/models) and word vectors, and
+be used in real products.
+spaCy comes with
+[pretrained pipelines](https://spacy.io/models) and vectors, and
 currently supports tokenization for **60+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
-parsing and **named entity recognition** and easy **deep learning** integration.
-It's commercial open-source software, released under the MIT license.
+parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management.
+spaCy is commercial open-source software, released under the MIT license.
 
-💫 **Version 2.3 out now!**
+💫 **Version 3.0 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 
-[![Azure Pipelines](<https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build+(3.x)>)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
-[![Travis Build Status](<https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square&logo=travis-ci&logoColor=white&label=build+(2.7)>)](https://travis-ci.org/explosion/spaCy)
+[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
 [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases)
 [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/)
 [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy)
@@ -28,64 +29,60 @@ It's commercial open-source software, released under the MIT license.
 
 ## 📖 Documentation
 
 | Documentation | |
-| --------------- | -------------------------------------------------------------- |
+| ------------------- | -------------------------------------------------------------- |
 | [spaCy 101] | New to spaCy? Here's everything you need to know! |
 | [Usage Guides] | How to use spaCy and its features. |
-| [New in v2.3] | New features, backwards incompatibilities and migration guide. |
-| [API Reference] | The detailed reference for spaCy's API. |
-| [Models] | Download statistical language models for spaCy. |
-| [Universe] | Libraries, extensions, demos, books and courses. |
-| [Changelog] | Changes and version history. |
-| [Contribute] | How to contribute to the spaCy project and code base. |
+| [New in v3.0] | New features, backwards incompatibilities and migration guide. |
+| [Project Templates] | End-to-end workflows you can clone, modify and run. |
+| [API Reference] | The detailed reference for spaCy's API. |
+| [Models] | Download statistical language models for spaCy. |
+| [Universe] | Libraries, extensions, demos, books and courses. |
+| [Changelog] | Changes and version history. |
+| [Contribute] | How to contribute to the spaCy project and code base. |
 
 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.3]: https://spacy.io/usage/v2-3
+[new in v3.0]: https://spacy.io/usage/v3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
 [universe]: https://spacy.io/universe
+[project templates]: https://github.com/explosion/projects
 [changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
 
 ## 💬 Where to ask questions
 
-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from
 it.
 
 | Type | Platforms |
-| ------------------------ | ------------------------------------------------------ |
-| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
-| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
-| 👩‍💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] |
-| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] |
+| ----------------------- | ---------------------- |
+| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
+| 🎁 **Feature Requests** | [GitHub Issue Tracker] |
+| 👩‍💻 **Usage Questions** | [Stack Overflow] |
 
 [github issue tracker]: https://github.com/explosion/spaCy/issues
 [stack overflow]: https://stackoverflow.com/questions/tagged/spacy
-[gitter chat]: https://gitter.im/explosion/spaCy
-[reddit user group]: https://www.reddit.com/r/spacynlp
 
 ## Features
 
-- Non-destructive **tokenization**
-- **Named entity** recognition
-- Support for **50+ languages**
-- pretrained [statistical models](https://spacy.io/models) and word vectors
+- Support for **60+ languages**
+- **Trained pipelines**
+- Multi-task learning with pretrained **transformers** like BERT
+- Pretrained **word vectors**
 - State-of-the-art speed
-- Easy **deep learning** integration
-- Part-of-speech tagging
-- Labelled dependency parsing
-- Syntax-driven sentence segmentation
+- Production-ready **training system**
+- Linguistically-motivated **tokenization**
+- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
+- Easily extensible with **custom components** and attributes
+- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
 - Built in **visualizers** for syntax and NER
-- Convenient string-to-hash mapping
-- Export to numpy data arrays
-- Efficient binary serialization
-- Easy **model packaging** and deployment
+- Easy **model packaging**, deployment and workflow management
 - Robust, rigorously evaluated accuracy
 
 📖 **For more details, see the
@@ -98,7 +95,7 @@ For detailed installation instructions, see the
 
 - **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual
   Studio)
-- **Python version**: Python 2.7, 3.5+ (only 64 bit)
+- **Python version**: Python 3.6+ (only 64 bit)
 - **Package managers**: [pip] · [conda] (via `conda-forge`)
 
 [pip]: https://pypi.org/project/spacy/
@@ -159,26 +156,26 @@ If you've trained your own models, keep in mind that your training and runtime
 inputs must match. After updating spaCy, we recommend **retraining your models**
 with the new version.
 
-📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the
-[migration guide](https://spacy.io/usage/v2#migrating).**
+📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the
+[migration guide](https://spacy.io/usage/v3#migrating).**
 
 ## Download models
 
-As of v1.7.0, models for spaCy can be installed as **Python packages**. This
+Trained pipelines for spaCy can be installed as **Python packages**. This
 means that they're a component of your application, just like any other module.
 Models can be installed using spaCy's `download` command, or manually by
 pointing pip to a path or URL.
 
 | Documentation | |
-| ---------------------- | ------------------------------------------------------------- |
-| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. |
+| ---------------------- | ---------------------------------------------------------------- |
+| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. |
 | [Models Documentation] | Detailed usage instructions. |
 
-[available models]: https://spacy.io/models
+[available pipelines]: https://spacy.io/models
 [models documentation]: https://spacy.io/docs/usage/models
 
 ```bash
-# download best-matching version of specific model for your spaCy installation
+# Download best-matching version of specific model for your spaCy installation
 python -m spacy download en_core_web_sm
 
 # pip install .tar.gz archive from path or URL
@@ -188,7 +185,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_
 
 ### Loading and using models
 
-To load a model, use `spacy.load()` with the model name, a shortcut link or a
+To load a model, use `spacy.load()` with the model name or a
 path to the model data directory.
 
 ```python
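# Editor's illustrative sketch: the diff context cuts off inside this code
# block, so this is not necessarily the README's exact snippet. Load an
# installed pipeline package by name, or point spacy.load() at a data directory.
import spacy

nlp = spacy.load("en_core_web_sm")              # by installed package name
doc = nlp("This is a sentence.")
# nlp = spacy.load("/path/to/en_core_web_sm")   # or from a pipeline data directory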
@@ -263,9 +260,7 @@ and git preinstalled.
 Install a version of the
 [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
 or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that
-matches the version that was used to compile your Python interpreter. For
-official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and
-VS 2015 (Python 3.5).
+matches the version that was used to compile your Python interpreter.
 
 ## Run tests
 
@@ -27,7 +27,7 @@ jobs:
       inputs:
         versionSpec: '3.7'
     - script: |
-        pip install flake8
+        pip install flake8==3.5.0
        python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics
       displayName: 'flake8'
 
@@ -35,12 +35,6 @@ jobs:
   dependsOn: 'Validate'
   strategy:
     matrix:
-      Python35Linux:
-        imageName: 'ubuntu-16.04'
-        python.version: '3.5'
-      Python35Windows:
-        imageName: 'vs2017-win2016'
-        python.version: '3.5'
       Python36Linux:
         imageName: 'ubuntu-16.04'
         python.version: '3.6'
@@ -58,7 +52,7 @@ jobs:
       # imageName: 'vs2017-win2016'
       # python.version: '3.7'
       # Python37Mac:
-      #   imageName: 'macos-10.13'
+      #   imageName: 'macos-10.14'
       # python.version: '3.7'
       Python38Linux:
         imageName: 'ubuntu-16.04'
169  bin/cythonize.py  (deleted)
@ -1,169 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
""" cythonize.py
|
|
||||||
|
|
||||||
Cythonize pyx files into C++ files as needed.
|
|
||||||
|
|
||||||
Usage: cythonize.py [root]
|
|
||||||
|
|
||||||
Checks pyx files to see if they have been changed relative to their
|
|
||||||
corresponding C++ files. If they have, then runs cython on these files to
|
|
||||||
recreate the C++ files.
|
|
||||||
|
|
||||||
Additionally, checks pxd files and setup.py if they have been changed. If
|
|
||||||
they have, rebuilds everything.
|
|
||||||
|
|
||||||
Change detection based on file hashes stored in JSON format.
|
|
||||||
|
|
||||||
For now, this script should be run by developers when changing Cython files
|
|
||||||
and the resulting C++ files checked in, so that end-users (and Python-only
|
|
||||||
developers) do not get the Cython dependencies.
|
|
||||||
|
|
||||||
Based upon:
|
|
||||||
|
|
||||||
https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py
|
|
||||||
https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py
|
|
||||||
|
|
||||||
Note: this script does not check any of the dependent C++ libraries.
|
|
||||||
"""
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import os
|
|
||||||
import sys
|
|
||||||
import json
|
|
||||||
import hashlib
|
|
||||||
import subprocess
|
|
||||||
import argparse
|
|
||||||
|
|
||||||
|
|
||||||
HASH_FILE = "cythonize.json"
|
|
||||||
|
|
||||||
|
|
||||||
def process_pyx(fromfile, tofile, language_level="-2"):
|
|
||||||
print("Processing %s" % fromfile)
|
|
||||||
try:
|
|
||||||
from Cython.Compiler.Version import version as cython_version
|
|
||||||
from distutils.version import LooseVersion
|
|
||||||
|
|
||||||
if LooseVersion(cython_version) < LooseVersion("0.19"):
|
|
||||||
raise Exception("Require Cython >= 0.19")
|
|
||||||
|
|
||||||
except ImportError:
|
|
||||||
pass
|
|
||||||
|
|
||||||
flags = ["--fast-fail", language_level]
|
|
||||||
if tofile.endswith(".cpp"):
|
|
||||||
flags += ["--cplus"]
|
|
||||||
|
|
||||||
try:
|
|
||||||
try:
|
|
||||||
r = subprocess.call(
|
|
||||||
["cython"] + flags + ["-o", tofile, fromfile], env=os.environ
|
|
||||||
) # See Issue #791
|
|
||||||
if r != 0:
|
|
||||||
raise Exception("Cython failed")
|
|
||||||
except OSError:
|
|
||||||
# There are ways of installing Cython that don't result in a cython
|
|
||||||
# executable on the path, see gh-2397.
|
|
||||||
r = subprocess.call(
|
|
||||||
[
|
|
||||||
sys.executable,
|
|
||||||
"-c",
|
|
||||||
"import sys; from Cython.Compiler.Main import "
|
|
||||||
"setuptools_main as main; sys.exit(main())",
|
|
||||||
]
|
|
||||||
+ flags
|
|
||||||
+ ["-o", tofile, fromfile]
|
|
||||||
)
|
|
||||||
if r != 0:
|
|
||||||
raise Exception("Cython failed")
|
|
||||||
except OSError:
|
|
||||||
raise OSError("Cython needs to be installed")
|
|
||||||
|
|
||||||
|
|
||||||
def preserve_cwd(path, func, *args):
|
|
||||||
orig_cwd = os.getcwd()
|
|
||||||
try:
|
|
||||||
os.chdir(path)
|
|
||||||
func(*args)
|
|
||||||
finally:
|
|
||||||
os.chdir(orig_cwd)
|
|
||||||
|
|
||||||
|
|
||||||
def load_hashes(filename):
|
|
||||||
try:
|
|
||||||
return json.load(open(filename))
|
|
||||||
except (ValueError, IOError):
|
|
||||||
return {}
|
|
||||||
|
|
||||||
|
|
||||||
def save_hashes(hash_db, filename):
|
|
||||||
with open(filename, "w") as f:
|
|
||||||
f.write(json.dumps(hash_db))
|
|
||||||
|
|
||||||
|
|
||||||
def get_hash(path):
|
|
||||||
return hashlib.md5(open(path, "rb").read()).hexdigest()
|
|
||||||
|
|
||||||
|
|
||||||
def hash_changed(base, path, db):
|
|
||||||
full_path = os.path.normpath(os.path.join(base, path))
|
|
||||||
return not get_hash(full_path) == db.get(full_path)
|
|
||||||
|
|
||||||
|
|
||||||
def hash_add(base, path, db):
|
|
||||||
full_path = os.path.normpath(os.path.join(base, path))
|
|
||||||
db[full_path] = get_hash(full_path)
|
|
||||||
|
|
||||||
|
|
||||||

def process(base, filename, db):
    root, ext = os.path.splitext(filename)
    if ext in [".pyx", ".cpp"]:
        if hash_changed(base, filename, db) or not os.path.isfile(
            os.path.join(base, root + ".cpp")
        ):
            preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
            hash_add(base, root + ".cpp", db)
            hash_add(base, root + ".pyx", db)


def check_changes(root, db):
    res = False
    new_db = {}

    setup_filename = "setup.py"
    hash_add(".", setup_filename, new_db)
    if hash_changed(".", setup_filename, db):
        res = True

    for base, _, files in os.walk(root):
        for filename in files:
            if filename.endswith(".pxd"):
                hash_add(base, filename, new_db)
                if hash_changed(base, filename, db):
                    res = True

    if res:
        db.clear()
        db.update(new_db)
    return res


def run(root):
    db = load_hashes(HASH_FILE)

    try:
        check_changes(root, db)
        for base, _, files in os.walk(root):
            for filename in files:
                process(base, filename, db)
    finally:
        save_hashes(db, HASH_FILE)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Cythonize pyx files into C++ files as needed"
    )
    parser.add_argument("root", help="root directory")
    args = parser.parse_args()
    run(args.root)
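The script above only re-runs Cython when file hashes change. A minimal sketch of that caching contract follows; the file name and hash-DB path are illustrative assumptions, not part of the original script.

import hashlib
import json
import os

def is_stale(path, db_file="cythonize.json"):
    # Compare the file's current MD5 digest against the stored one; a missing
    # or different entry means the .pyx file would be re-cythonized.
    db = json.load(open(db_file)) if os.path.exists(db_file) else {}
    digest = hashlib.md5(open(path, "rb").read()).hexdigest()
    return db.get(os.path.normpath(path)) != digest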
bin/get-package.sh (new executable file, 12 lines)
@ -0,0 +1,12 @@
#!/usr/bin/env bash

set -e

version=$(grep "__title__ = " spacy/about.py)
version=${version/__title__ = }
version=${version/\'/}
version=${version/\'/}
version=${version/\"/}
version=${version/\"/}

echo $version
@ -1,97 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

import bz2
import re
import srsly
import sys
import random
import datetime
import plac
from pathlib import Path

_unset = object()


class Reddit(object):
    """Stream cleaned comments from Reddit."""

    pre_format_re = re.compile(r"^[`*~]")
    post_format_re = re.compile(r"[`*~]$")
    url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
    link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

    def __init__(self, file_path, meta_keys={"subreddit": "section"}):
        """
        file_path (unicode / Path): Path to archive or directory of archives.
        meta_keys (dict): Meta data key included in the Reddit corpus, mapped
            to display name in Prodigy meta.
        RETURNS (Reddit): The Reddit loader.
        """
        self.meta = meta_keys
        file_path = Path(file_path)
        if not file_path.exists():
            raise IOError("Can't find file path: {}".format(file_path))
        if not file_path.is_dir():
            self.files = [file_path]
        else:
            self.files = list(file_path.iterdir())

    def __iter__(self):
        for file_path in self.iter_files():
            with bz2.open(str(file_path)) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    comment = srsly.json_loads(line)
                    if self.is_valid(comment):
                        text = self.strip_tags(comment["body"])
                        yield {"text": text}

    def get_meta(self, item):
        return {name: item.get(key, "n/a") for key, name in self.meta.items()}

    def iter_files(self):
        for file_path in self.files:
            yield file_path

    def strip_tags(self, text):
        text = self.link_re.sub(r"\1", text)
        text = text.replace("&gt;", ">").replace("&lt;", "<")
        text = self.pre_format_re.sub("", text)
        text = self.post_format_re.sub("", text)
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    def is_valid(self, comment):
        return (
            comment["body"] is not None
            and comment["body"] != "[deleted]"
            and comment["body"] != "[removed]"
        )


def main(path):
    reddit = Reddit(path)
    for comment in reddit:
        print(srsly.json_dumps(comment))


if __name__ == "__main__":
    import socket

    try:
        BrokenPipeError
    except NameError:
        BrokenPipeError = socket.error
    try:
        plac.call(main)
    except BrokenPipeError:
        import os, sys

        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE
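A hedged usage sketch for the loader above; the dump path is an assumption, and each yielded item is a dict with a cleaned "text" field.

reddit = Reddit("/data/reddit/RC_2019-01.bz2")  # hypothetical dump path
for i, example in enumerate(reddit):
    print(example["text"])
    if i >= 2:  # only peek at the first few comments
        break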
@ -1,81 +0,0 @@
#!/usr/bin/env python
from __future__ import print_function, unicode_literals, division

import logging
from pathlib import Path
from collections import defaultdict
from gensim.models import Word2Vec
import plac
import spacy

logger = logging.getLogger(__name__)


class Corpus(object):
    def __init__(self, directory, nlp):
        self.directory = directory
        self.nlp = nlp

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with text_loc.open("r", encoding="utf-8") as file_:
                text = file_.read()

            # This is to keep the input to the blank model (which doesn't
            # sentencize) from being too long. It works particularly well with
            # the output of [WikiExtractor](https://github.com/attardi/wikiextractor)
            paragraphs = text.split("\n\n")
            for par in paragraphs:
                yield [word.orth_ for word in self.nlp(par)]


def iter_dir(loc):
    dir_path = Path(loc)
    for fn_path in dir_path.iterdir():
        if fn_path.is_dir():
            for sub_path in fn_path.iterdir():
                yield sub_path
        else:
            yield fn_path


@plac.annotations(
    lang=("ISO language code"),
    in_dir=("Location of input directory"),
    out_loc=("Location of output file"),
    n_workers=("Number of workers", "option", "n", int),
    size=("Dimension of the word vectors", "option", "d", int),
    window=("Context window size", "option", "w", int),
    min_count=("Min count", "option", "m", int),
    negative=("Number of negative samples", "option", "g", int),
    nr_iter=("Number of iterations", "option", "i", int),
)
def main(
    lang,
    in_dir,
    out_loc,
    negative=5,
    n_workers=4,
    window=5,
    size=128,
    min_count=10,
    nr_iter=5,
):
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
    )
    nlp = spacy.blank(lang)
    corpus = Corpus(in_dir, nlp)
    model = Word2Vec(
        sentences=corpus,
        size=size,
        window=window,
        min_count=min_count,
        workers=n_workers,
        sample=1e-5,
        negative=negative,
    )
    model.save(out_loc)


if __name__ == "__main__":
    plac.call(main)
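A hedged example of calling the entry point above directly; the directory and output paths are assumptions. Note that the script as written accepts nr_iter but never passes it on to Word2Vec, and that gensim's size keyword was renamed to vector_size in gensim 4.x.

# Train 300-dimensional vectors on a directory of plain-text files.
main(
    "en",                           # ISO language code for spacy.blank()
    "/data/wiki_paragraphs",        # hypothetical input directory
    "/models/wiki.word2vec.model",  # hypothetical output location
    size=300,
    n_workers=8,
)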
@ -1,2 +0,0 @@
from .conll17_ud_eval import main as ud_evaluate  # noqa: F401
from .ud_train import main as ud_train  # noqa: F401
@ -1,614 +0,0 @@
#!/usr/bin/env python
# flake8: noqa

# CoNLL 2017 UD Parsing evaluation script.
#
# Compatible with Python 2.7 and 3.2+, can be used either as a module
# or a standalone executable.
#
# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL),
# Faculty of Mathematics and Physics, Charles University, Czech Republic.
#
# Changelog:
# - [02 Jan 2017] Version 0.9: Initial release
# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation
# - [10 Mar 2017] Version 1.0: Add documentation and test
#                              Compare HEADs correctly using aligned words
#                              Allow evaluation with erroneous spaces in forms
#                              Compare forms in LCS case insensitively
#                              Detect cycles and multiple root nodes
#                              Compute AlignedAccuracy

# Command line usage
# ------------------
# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file
#
# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metric
#   is printed
# - if -v is given, several metrics are printed (as precision, recall, F1 score,
#   and in case the metric is computed on aligned words also accuracy on these):
#   - Tokens: how well do the gold tokens match system tokens
#   - Sentences: how well do the gold sentences match system sentences
#   - Words: how well can the gold words be aligned to system words
#   - UPOS: using aligned words, how well does UPOS match
#   - XPOS: using aligned words, how well does XPOS match
#   - Feats: using aligned words, how well does FEATS match
#   - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match
#   - Lemmas: using aligned words, how well does LEMMA match
#   - UAS: using aligned words, how well does HEAD match
#   - LAS: using aligned words, how well does HEAD+DEPREL (ignoring subtypes) match
# - if weights_file is given (with lines containing deprel-weight pairs),
#   one more metric is shown:
#   - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has a different weight

# API usage
# ---------
# - load_conllu(file)
#   - loads CoNLL-U file from given file object to an internal representation
#   - the file object should return str on both Python 2 and Python 3
#   - raises UDError exception if the given file cannot be loaded
# - evaluate(gold_ud, system_ud)
#   - evaluate the given gold and system CoNLL-U files (loaded with load_conllu)
#   - raises UDError if the concatenated tokens of gold and system file do not match
#   - returns a dictionary with the metrics described above, each metric having
#     four fields: precision, recall, f1 and aligned_accuracy (when using aligned
#     words, otherwise this is None)

# Description of token matching
# -----------------------------
# In order to match tokens of gold file and system file, we consider the text
# resulting from concatenation of gold tokens and text resulting from
# concatenation of system tokens. These texts should match -- if they do not,
# the evaluation fails.
#
# If the texts do match, every token is represented as a range in this original
# text, and tokens are equal only if their range is the same.

# Description of word matching
# ----------------------------
# When matching words of gold file and system file, we first match the tokens.
# The words which are also tokens are matched as tokens, but words in multi-word
# tokens have to be handled differently.
#
# To handle multi-word tokens, we start by finding "multi-word spans".
# A multi-word span is a span in the original text such that
# - it contains at least one multi-word token
# - all multi-word tokens in the span (considering both gold and system ones)
#   are completely inside the span (i.e., they do not "stick out")
# - the multi-word span is as small as possible
#
# For every multi-word span, we align the gold and system words completely
# inside this span using LCS on their FORMs. The words not intersecting
# (even partially) any multi-word span are then aligned as tokens.

from __future__ import division
|
|
||||||
from __future__ import print_function
|
|
||||||
|
|
||||||
import argparse
|
|
||||||
import io
|
|
||||||
import sys
|
|
||||||
import unittest
|
|
||||||
|
|
||||||
# CoNLL-U column names
|
|
||||||
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)
|
|
||||||
|
|
||||||
# UD Error is used when raising exceptions in this module
|
|
||||||
class UDError(Exception):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Load given CoNLL-U file into internal representation
|
|
||||||
def load_conllu(file, check_parse=True):
|
|
||||||
# Internal representation classes
|
|
||||||
class UDRepresentation:
|
|
||||||
def __init__(self):
|
|
||||||
# Characters of all the tokens in the whole file.
|
|
||||||
# Whitespace between tokens is not included.
|
|
||||||
self.characters = []
|
|
||||||
# List of UDSpan instances with start&end indices into `characters`.
|
|
||||||
self.tokens = []
|
|
||||||
# List of UDWord instances.
|
|
||||||
self.words = []
|
|
||||||
# List of UDSpan instances with start&end indices into `characters`.
|
|
||||||
self.sentences = []
|
|
||||||
class UDSpan:
|
|
||||||
def __init__(self, start, end, characters):
|
|
||||||
self.start = start
|
|
||||||
# Note that self.end marks the first position **after the end** of span,
|
|
||||||
# so we can use characters[start:end] or range(start, end).
|
|
||||||
self.end = end
|
|
||||||
self.characters = characters
|
|
||||||
|
|
||||||
@property
|
|
||||||
def text(self):
|
|
||||||
return ''.join(self.characters[self.start:self.end])
|
|
||||||
|
|
||||||
def __str__(self):
|
|
||||||
return self.text
|
|
||||||
|
|
||||||
def __repr__(self):
|
|
||||||
return self.text
|
|
||||||
class UDWord:
|
|
||||||
def __init__(self, span, columns, is_multiword):
|
|
||||||
# Span of this word (or MWT, see below) within ud_representation.characters.
|
|
||||||
self.span = span
|
|
||||||
# 10 columns of the CoNLL-U file: ID, FORM, LEMMA,...
|
|
||||||
self.columns = columns
|
|
||||||
# is_multiword==True means that this word is part of a multi-word token.
|
|
||||||
# In that case, self.span marks the span of the whole multi-word token.
|
|
||||||
self.is_multiword = is_multiword
|
|
||||||
# Reference to the UDWord instance representing the HEAD (or None if root).
|
|
||||||
self.parent = None
|
|
||||||
# Let's ignore language-specific deprel subtypes.
|
|
||||||
self.columns[DEPREL] = columns[DEPREL].split(':')[0]
|
|
||||||
|
|
||||||
ud = UDRepresentation()
|
|
||||||
|
|
||||||
# Load the CoNLL-U file
|
|
||||||
index, sentence_start = 0, None
|
|
||||||
linenum = 0
|
|
||||||
while True:
|
|
||||||
line = file.readline()
|
|
||||||
linenum += 1
|
|
||||||
if not line:
|
|
||||||
break
|
|
||||||
line = line.rstrip("\r\n")
|
|
||||||
|
|
||||||
# Handle sentence start boundaries
|
|
||||||
if sentence_start is None:
|
|
||||||
# Skip comments
|
|
||||||
if line.startswith("#"):
|
|
||||||
continue
|
|
||||||
# Start a new sentence
|
|
||||||
ud.sentences.append(UDSpan(index, 0, ud.characters))
|
|
||||||
sentence_start = len(ud.words)
|
|
||||||
if not line:
|
|
||||||
# Add parent UDWord links and check there are no cycles
|
|
||||||
def process_word(word):
|
|
||||||
if word.parent == "remapping":
|
|
||||||
raise UDError("There is a cycle in a sentence")
|
|
||||||
if word.parent is None:
|
|
||||||
head = int(word.columns[HEAD])
|
|
||||||
if head > len(ud.words) - sentence_start:
|
|
||||||
raise UDError("Line {}: HEAD '{}' points outside of the sentence".format(
|
|
||||||
linenum, word.columns[HEAD]))
|
|
||||||
if head:
|
|
||||||
parent = ud.words[sentence_start + head - 1]
|
|
||||||
word.parent = "remapping"
|
|
||||||
process_word(parent)
|
|
||||||
word.parent = parent
|
|
||||||
|
|
||||||
for word in ud.words[sentence_start:]:
|
|
||||||
process_word(word)
|
|
||||||
|
|
||||||
# Check there is a single root node
|
|
||||||
if check_parse:
|
|
||||||
if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1:
|
|
||||||
raise UDError("There are multiple roots in a sentence")
|
|
||||||
|
|
||||||
# End the sentence
|
|
||||||
ud.sentences[-1].end = index
|
|
||||||
sentence_start = None
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Read next token/word
|
|
||||||
columns = line.split("\t")
|
|
||||||
if len(columns) != 10:
|
|
||||||
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line))
|
|
||||||
|
|
||||||
# Skip empty nodes
|
|
||||||
if "." in columns[ID]:
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Delete spaces from FORM so gold.characters == system.characters
|
|
||||||
# even if one of them tokenizes the space.
|
|
||||||
columns[FORM] = columns[FORM].replace(" ", "")
|
|
||||||
if not columns[FORM]:
|
|
||||||
raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum)
|
|
||||||
|
|
||||||
# Save token
|
|
||||||
ud.characters.extend(columns[FORM])
|
|
||||||
ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters))
|
|
||||||
index += len(columns[FORM])
|
|
||||||
|
|
||||||
# Handle multi-word tokens to save word(s)
|
|
||||||
if "-" in columns[ID]:
|
|
||||||
try:
|
|
||||||
start, end = map(int, columns[ID].split("-"))
|
|
||||||
except:
|
|
||||||
raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID]))
|
|
||||||
|
|
||||||
for _ in range(start, end + 1):
|
|
||||||
word_line = file.readline().rstrip("\r\n")
|
|
||||||
word_columns = word_line.split("\t")
|
|
||||||
if len(word_columns) != 10:
|
|
||||||
print(columns)
|
|
||||||
raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line))
|
|
||||||
ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True))
|
|
||||||
# Basic tokens/words
|
|
||||||
else:
|
|
||||||
try:
|
|
||||||
word_id = int(columns[ID])
|
|
||||||
except:
|
|
||||||
raise UDError("Cannot parse word ID '{}'".format(columns[ID]))
|
|
||||||
if word_id != len(ud.words) - sentence_start + 1:
|
|
||||||
raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1))
|
|
||||||
|
|
||||||
try:
|
|
||||||
head_id = int(columns[HEAD])
|
|
||||||
except:
|
|
||||||
raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD]))
|
|
||||||
if head_id < 0:
|
|
||||||
raise UDError("HEAD cannot be negative")
|
|
||||||
|
|
||||||
ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False))
|
|
||||||
|
|
||||||
if sentence_start is not None:
|
|
||||||
raise UDError("The CoNLL-U file does not end with empty line")
|
|
||||||
|
|
||||||
return ud
|
|
||||||
|
|
||||||
# Evaluate the gold and system treebanks (loaded using load_conllu).
|
|
||||||
def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True):
|
|
||||||
class Score:
|
|
||||||
def __init__(self, gold_total, system_total, correct, aligned_total=None, undersegmented=None, oversegmented=None):
|
|
||||||
self.precision = correct / system_total if system_total else 0.0
|
|
||||||
self.recall = correct / gold_total if gold_total else 0.0
|
|
||||||
self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0
|
|
||||||
self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total
|
|
||||||
self.undersegmented = undersegmented
|
|
||||||
self.oversegmented = oversegmented
|
|
||||||
self.under_perc = len(undersegmented) / gold_total if gold_total and undersegmented else 0.0
|
|
||||||
self.over_perc = len(oversegmented) / gold_total if gold_total and oversegmented else 0.0
|
|
||||||
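# Editor's illustration (not part of the original script): with gold_total=10,
# system_total=8 and correct=6, the fields above come out as
# precision = 6/8 = 0.75, recall = 6/10 = 0.60 and f1 = 2*6/(8+10) ~= 0.667.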
class AlignmentWord:
|
|
||||||
def __init__(self, gold_word, system_word):
|
|
||||||
self.gold_word = gold_word
|
|
||||||
self.system_word = system_word
|
|
||||||
self.gold_parent = None
|
|
||||||
self.system_parent_gold_aligned = None
|
|
||||||
class Alignment:
|
|
||||||
def __init__(self, gold_words, system_words):
|
|
||||||
self.gold_words = gold_words
|
|
||||||
self.system_words = system_words
|
|
||||||
self.matched_words = []
|
|
||||||
self.matched_words_map = {}
|
|
||||||
def append_aligned_words(self, gold_word, system_word):
|
|
||||||
self.matched_words.append(AlignmentWord(gold_word, system_word))
|
|
||||||
self.matched_words_map[system_word] = gold_word
|
|
||||||
def fill_parents(self):
|
|
||||||
# We represent root parents in both gold and system data by '0'.
|
|
||||||
# For gold data, we represent non-root parent by corresponding gold word.
|
|
||||||
# For system data, we represent non-root parent by either gold word aligned
|
|
||||||
# to parent system nodes, or by None if no gold words is aligned to the parent.
|
|
||||||
for words in self.matched_words:
|
|
||||||
words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0
|
|
||||||
words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \
|
|
||||||
if words.system_word.parent is not None else 0
|
|
||||||
|
|
||||||
def lower(text):
|
|
||||||
if sys.version_info < (3, 0) and isinstance(text, str):
|
|
||||||
return text.decode("utf-8").lower()
|
|
||||||
return text.lower()
|
|
||||||
|
|
||||||
def spans_score(gold_spans, system_spans):
|
|
||||||
correct, gi, si = 0, 0, 0
|
|
||||||
undersegmented = []
|
|
||||||
oversegmented = []
|
|
||||||
combo = 0
|
|
||||||
previous_end_si_earlier = False
|
|
||||||
previous_end_gi_earlier = False
|
|
||||||
while gi < len(gold_spans) and si < len(system_spans):
|
|
||||||
previous_si = system_spans[si-1] if si > 0 else None
|
|
||||||
previous_gi = gold_spans[gi-1] if gi > 0 else None
|
|
||||||
if system_spans[si].start < gold_spans[gi].start:
|
|
||||||
# avoid counting the same mistake twice
|
|
||||||
if not previous_end_si_earlier:
|
|
||||||
combo += 1
|
|
||||||
oversegmented.append(str(previous_gi).strip())
|
|
||||||
si += 1
|
|
||||||
elif gold_spans[gi].start < system_spans[si].start:
|
|
||||||
# avoid counting the same mistake twice
|
|
||||||
if not previous_end_gi_earlier:
|
|
||||||
combo += 1
|
|
||||||
undersegmented.append(str(previous_si).strip())
|
|
||||||
gi += 1
|
|
||||||
else:
|
|
||||||
correct += gold_spans[gi].end == system_spans[si].end
|
|
||||||
if gold_spans[gi].end < system_spans[si].end:
|
|
||||||
undersegmented.append(str(system_spans[si]).strip())
|
|
||||||
previous_end_gi_earlier = True
|
|
||||||
previous_end_si_earlier = False
|
|
||||||
elif gold_spans[gi].end > system_spans[si].end:
|
|
||||||
oversegmented.append(str(gold_spans[gi]).strip())
|
|
||||||
previous_end_si_earlier = True
|
|
||||||
previous_end_gi_earlier = False
|
|
||||||
else:
|
|
||||||
previous_end_gi_earlier = False
|
|
||||||
previous_end_si_earlier = False
|
|
||||||
si += 1
|
|
||||||
gi += 1
|
|
||||||
|
|
||||||
return Score(len(gold_spans), len(system_spans), correct, None, undersegmented, oversegmented)
|
|
||||||
|
|
||||||
def alignment_score(alignment, key_fn, weight_fn=lambda w: 1):
|
|
||||||
gold, system, aligned, correct = 0, 0, 0, 0
|
|
||||||
|
|
||||||
for word in alignment.gold_words:
|
|
||||||
gold += weight_fn(word)
|
|
||||||
|
|
||||||
for word in alignment.system_words:
|
|
||||||
system += weight_fn(word)
|
|
||||||
|
|
||||||
for words in alignment.matched_words:
|
|
||||||
aligned += weight_fn(words.gold_word)
|
|
||||||
|
|
||||||
if key_fn is None:
|
|
||||||
# Return score for whole aligned words
|
|
||||||
return Score(gold, system, aligned)
|
|
||||||
|
|
||||||
for words in alignment.matched_words:
|
|
||||||
if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned):
|
|
||||||
correct += weight_fn(words.gold_word)
|
|
||||||
|
|
||||||
return Score(gold, system, correct, aligned)
|
|
||||||
|
|
||||||
def beyond_end(words, i, multiword_span_end):
|
|
||||||
if i >= len(words):
|
|
||||||
return True
|
|
||||||
if words[i].is_multiword:
|
|
||||||
return words[i].span.start >= multiword_span_end
|
|
||||||
return words[i].span.end > multiword_span_end
|
|
||||||
|
|
||||||
def extend_end(word, multiword_span_end):
|
|
||||||
if word.is_multiword and word.span.end > multiword_span_end:
|
|
||||||
return word.span.end
|
|
||||||
return multiword_span_end
|
|
||||||
|
|
||||||
def find_multiword_span(gold_words, system_words, gi, si):
|
|
||||||
# We know gold_words[gi].is_multiword or system_words[si].is_multiword.
|
|
||||||
# Find the start of the multiword span (gs, ss), so the multiword span is minimal.
|
|
||||||
# Initialize multiword_span_end characters index.
|
|
||||||
if gold_words[gi].is_multiword:
|
|
||||||
multiword_span_end = gold_words[gi].span.end
|
|
||||||
if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start:
|
|
||||||
si += 1
|
|
||||||
else: # if system_words[si].is_multiword
|
|
||||||
multiword_span_end = system_words[si].span.end
|
|
||||||
if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start:
|
|
||||||
gi += 1
|
|
||||||
gs, ss = gi, si
|
|
||||||
|
|
||||||
# Find the end of the multiword span
|
|
||||||
# (so both gi and si are pointing to the word following the multiword span end).
|
|
||||||
while not beyond_end(gold_words, gi, multiword_span_end) or \
|
|
||||||
not beyond_end(system_words, si, multiword_span_end):
|
|
||||||
if gi < len(gold_words) and (si >= len(system_words) or
|
|
||||||
gold_words[gi].span.start <= system_words[si].span.start):
|
|
||||||
multiword_span_end = extend_end(gold_words[gi], multiword_span_end)
|
|
||||||
gi += 1
|
|
||||||
else:
|
|
||||||
multiword_span_end = extend_end(system_words[si], multiword_span_end)
|
|
||||||
si += 1
|
|
||||||
return gs, ss, gi, si
|
|
||||||
|
|
||||||
def compute_lcs(gold_words, system_words, gi, si, gs, ss):
|
|
||||||
lcs = [[0] * (si - ss) for i in range(gi - gs)]
|
|
||||||
for g in reversed(range(gi - gs)):
|
|
||||||
for s in reversed(range(si - ss)):
|
|
||||||
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
|
||||||
lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0)
|
|
||||||
lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0)
|
|
||||||
lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0)
|
|
||||||
return lcs
|
|
||||||
|
|
||||||
def align_words(gold_words, system_words):
|
|
||||||
alignment = Alignment(gold_words, system_words)
|
|
||||||
|
|
||||||
gi, si = 0, 0
|
|
||||||
while gi < len(gold_words) and si < len(system_words):
|
|
||||||
if gold_words[gi].is_multiword or system_words[si].is_multiword:
|
|
||||||
# A: Multi-word tokens => align via LCS within the whole "multiword span".
|
|
||||||
gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si)
|
|
||||||
|
|
||||||
if si > ss and gi > gs:
|
|
||||||
lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss)
|
|
||||||
|
|
||||||
# Store aligned words
|
|
||||||
s, g = 0, 0
|
|
||||||
while g < gi - gs and s < si - ss:
|
|
||||||
if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]):
|
|
||||||
alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s])
|
|
||||||
g += 1
|
|
||||||
s += 1
|
|
||||||
elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0):
|
|
||||||
g += 1
|
|
||||||
else:
|
|
||||||
s += 1
|
|
||||||
else:
|
|
||||||
# B: No multi-word token => align according to spans.
|
|
||||||
if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end):
|
|
||||||
alignment.append_aligned_words(gold_words[gi], system_words[si])
|
|
||||||
gi += 1
|
|
||||||
si += 1
|
|
||||||
elif gold_words[gi].span.start <= system_words[si].span.start:
|
|
||||||
gi += 1
|
|
||||||
else:
|
|
||||||
si += 1
|
|
||||||
|
|
||||||
alignment.fill_parents()
|
|
||||||
|
|
||||||
return alignment
|
|
||||||
|
|
||||||
# Check that underlying character sequences do match
|
|
||||||
if gold_ud.characters != system_ud.characters:
|
|
||||||
index = 0
|
|
||||||
while gold_ud.characters[index] == system_ud.characters[index]:
|
|
||||||
index += 1
|
|
||||||
|
|
||||||
raise UDError(
|
|
||||||
"The concatenation of tokens in gold file and in system file differ!\n" +
|
|
||||||
"First 20 differing characters in gold file: '{}' and system file: '{}'".format(
|
|
||||||
"".join(gold_ud.characters[index:index + 20]),
|
|
||||||
"".join(system_ud.characters[index:index + 20])
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
# Align words
|
|
||||||
alignment = align_words(gold_ud.words, system_ud.words)
|
|
||||||
|
|
||||||
# Compute the F1-scores
|
|
||||||
if check_parse:
|
|
||||||
result = {
|
|
||||||
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
|
||||||
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
|
||||||
"Words": alignment_score(alignment, None),
|
|
||||||
"UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]),
|
|
||||||
"XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]),
|
|
||||||
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
|
||||||
"AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])),
|
|
||||||
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
|
||||||
"UAS": alignment_score(alignment, lambda w, parent: parent),
|
|
||||||
"LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])),
|
|
||||||
}
|
|
||||||
else:
|
|
||||||
result = {
|
|
||||||
"Tokens": spans_score(gold_ud.tokens, system_ud.tokens),
|
|
||||||
"Sentences": spans_score(gold_ud.sentences, system_ud.sentences),
|
|
||||||
"Words": alignment_score(alignment, None),
|
|
||||||
"Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]),
|
|
||||||
"Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]),
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# Add WeightedLAS if weights are given
|
|
||||||
if deprel_weights is not None:
|
|
||||||
def weighted_las(word):
|
|
||||||
return deprel_weights.get(word.columns[DEPREL], 1.0)
|
|
||||||
result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las)
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
def load_deprel_weights(weights_file):
|
|
||||||
if weights_file is None:
|
|
||||||
return None
|
|
||||||
|
|
||||||
deprel_weights = {}
|
|
||||||
for line in weights_file:
|
|
||||||
# Ignore comments and empty lines
|
|
||||||
if line.startswith("#") or not line.strip():
|
|
||||||
continue
|
|
||||||
|
|
||||||
columns = line.rstrip("\r\n").split()
|
|
||||||
if len(columns) != 2:
|
|
||||||
raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line))
|
|
||||||
|
|
||||||
deprel_weights[columns[0]] = float(columns[1])
|
|
||||||
|
|
||||||
return deprel_weights
|
|
||||||
|
|
||||||
def load_conllu_file(path):
|
|
||||||
_file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
|
|
||||||
return load_conllu(_file)
|
|
||||||
|
|
||||||
def evaluate_wrapper(args):
|
|
||||||
# Load CoNLL-U files
|
|
||||||
gold_ud = load_conllu_file(args.gold_file)
|
|
||||||
system_ud = load_conllu_file(args.system_file)
|
|
||||||
|
|
||||||
# Load weights if requested
|
|
||||||
deprel_weights = load_deprel_weights(args.weights)
|
|
||||||
|
|
||||||
return evaluate(gold_ud, system_ud, deprel_weights)
|
|
||||||
|
|
||||||
def main():
|
|
||||||
# Parse arguments
|
|
||||||
parser = argparse.ArgumentParser()
|
|
||||||
parser.add_argument("gold_file", type=str,
|
|
||||||
help="Name of the CoNLL-U file with the gold data.")
|
|
||||||
parser.add_argument("system_file", type=str,
|
|
||||||
help="Name of the CoNLL-U file with the predicted data.")
|
|
||||||
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None,
|
|
||||||
metavar="deprel_weights_file",
|
|
||||||
help="Compute WeightedLAS using given weights for Universal Dependency Relations.")
|
|
||||||
parser.add_argument("--verbose", "-v", default=0, action="count",
|
|
||||||
help="Print all metrics.")
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Use verbose if weights are supplied
|
|
||||||
if args.weights is not None and not args.verbose:
|
|
||||||
args.verbose = 1
|
|
||||||
|
|
||||||
# Evaluate
|
|
||||||
evaluation = evaluate_wrapper(args)
|
|
||||||
|
|
||||||
# Print the evaluation
|
|
||||||
if not args.verbose:
|
|
||||||
print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
|
|
||||||
else:
|
|
||||||
metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"]
|
|
||||||
if args.weights is not None:
|
|
||||||
metrics.append("WeightedLAS")
|
|
||||||
|
|
||||||
print("Metrics | Precision | Recall | F1 Score | AligndAcc")
|
|
||||||
print("-----------+-----------+-----------+-----------+-----------")
|
|
||||||
for metric in metrics:
|
|
||||||
print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
|
|
||||||
metric,
|
|
||||||
100 * evaluation[metric].precision,
|
|
||||||
100 * evaluation[metric].recall,
|
|
||||||
100 * evaluation[metric].f1,
|
|
||||||
"{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
|
|
||||||
))
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
|
|
||||||
# Tests, which can be executed with `python -m unittest conll17_ud_eval`.
|
|
||||||
class TestAlignment(unittest.TestCase):
|
|
||||||
@staticmethod
|
|
||||||
def _load_words(words):
|
|
||||||
"""Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors."""
|
|
||||||
lines, num_words = [], 0
|
|
||||||
for w in words:
|
|
||||||
parts = w.split(" ")
|
|
||||||
if len(parts) == 1:
|
|
||||||
num_words += 1
|
|
||||||
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1)))
|
|
||||||
else:
|
|
||||||
lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0]))
|
|
||||||
for part in parts[1:]:
|
|
||||||
num_words += 1
|
|
||||||
lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1)))
|
|
||||||
return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"])))
|
|
||||||
|
|
||||||
def _test_exception(self, gold, system):
|
|
||||||
self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system))
|
|
||||||
|
|
||||||
def _test_ok(self, gold, system, correct):
|
|
||||||
metrics = evaluate(self._load_words(gold), self._load_words(system))
|
|
||||||
gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold))
|
|
||||||
system_words = sum((max(1, len(word.split(" ")) - 1) for word in system))
|
|
||||||
self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1),
|
|
||||||
(correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words)))
|
|
||||||
|
|
||||||
def test_exception(self):
|
|
||||||
self._test_exception(["a"], ["b"])
|
|
||||||
|
|
||||||
def test_equal(self):
|
|
||||||
self._test_ok(["a"], ["a"], 1)
|
|
||||||
self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3)
|
|
||||||
|
|
||||||
def test_equal_with_multiword(self):
|
|
||||||
self._test_ok(["abc a b c"], ["a", "b", "c"], 3)
|
|
||||||
self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4)
|
|
||||||
self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4)
|
|
||||||
self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5)
|
|
||||||
|
|
||||||
def test_alignment(self):
|
|
||||||
self._test_ok(["abcd"], ["a", "b", "c", "d"], 0)
|
|
||||||
self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1)
|
|
||||||
self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2)
|
|
||||||
self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2)
|
|
||||||
self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4)
|
|
||||||
self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2)
|
|
||||||
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1)
|
|
|
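A hedged sketch of the module API described in the file's header; the CoNLL-U file names are assumptions, and the import assumes the file is importable as conll17_ud_eval.

import io
import conll17_ud_eval as ud_eval

with io.open("gold.conllu", "r", encoding="utf-8") as gold_file:
    gold_ud = ud_eval.load_conllu(gold_file)
with io.open("system.conllu", "r", encoding="utf-8") as system_file:
    system_ud = ud_eval.load_conllu(system_file)

scores = ud_eval.evaluate(gold_ud, system_ud)
print("LAS F1: {:.2f}".format(100 * scores["LAS"].f1))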
@ -1,293 +0,0 @@
|
||||||
import spacy
|
|
||||||
import time
|
|
||||||
import re
|
|
||||||
import plac
|
|
||||||
import operator
|
|
||||||
import datetime
|
|
||||||
from pathlib import Path
|
|
||||||
import xml.etree.ElementTree as ET
|
|
||||||
|
|
||||||
import conll17_ud_eval
|
|
||||||
from ud_train import write_conllu
|
|
||||||
from spacy.lang.lex_attrs import word_shape
|
|
||||||
from spacy.util import get_lang_class
|
|
||||||
|
|
||||||
# All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb')
|
|
||||||
ALL_LANGUAGES = ("af, ar, bg, bn, ca, cs, da, de, el, en, es, et, fa, fi, fr,"
|
|
||||||
"ga, he, hi, hr, hu, id, is, it, ja, kn, ko, lt, lv, mr, no,"
|
|
||||||
"nl, pl, pt, ro, ru, si, sk, sl, sq, sr, sv, ta, te, th, tl,"
|
|
||||||
"tr, tt, uk, ur, vi, zh")
|
|
||||||
|
|
||||||
# Non-parsing tasks that will be evaluated (works for default models)
|
|
||||||
EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats']
|
|
||||||
|
|
||||||
# Tasks that will be evaluated if check_parse=True (does not work for default models)
|
|
||||||
EVAL_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats', 'UPOS', 'XPOS', 'AllTags', 'UAS', 'LAS']
|
|
||||||
|
|
||||||
# Minimum frequency an error should have to be printed
|
|
||||||
PRINT_FREQ = 20
|
|
||||||
|
|
||||||
# Maximum number of errors printed per category
|
|
||||||
PRINT_TOTAL = 10
|
|
||||||
|
|
||||||
space_re = re.compile("\s+")
|
|
||||||
|
|
||||||
|
|
||||||
def load_model(modelname, add_sentencizer=False):
|
|
||||||
""" Load a specific spaCy model """
|
|
||||||
loading_start = time.time()
|
|
||||||
nlp = spacy.load(modelname)
|
|
||||||
if add_sentencizer:
|
|
||||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
|
||||||
loading_end = time.time()
|
|
||||||
loading_time = loading_end - loading_start
|
|
||||||
if add_sentencizer:
|
|
||||||
return nlp, loading_time, modelname + '_sentencizer'
|
|
||||||
return nlp, loading_time, modelname
|
|
||||||
|
|
||||||
|
|
||||||
def load_default_model_sentencizer(lang):
|
|
||||||
""" Load a generic spaCy model and add the sentencizer for sentence tokenization"""
|
|
||||||
loading_start = time.time()
|
|
||||||
lang_class = get_lang_class(lang)
|
|
||||||
nlp = lang_class()
|
|
||||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
|
||||||
loading_end = time.time()
|
|
||||||
loading_time = loading_end - loading_start
|
|
||||||
return nlp, loading_time, lang + "_default_" + 'sentencizer'
|
|
||||||
|
|
||||||
|
|
||||||
def split_text(text):
|
|
||||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
|
||||||
|
|
||||||
|
|
||||||
def get_freq_tuples(my_list, print_total_threshold):
|
|
||||||
""" Turn a list of errors into frequency-sorted tuples thresholded by a certain total number """
|
|
||||||
d = {}
|
|
||||||
for token in my_list:
|
|
||||||
d.setdefault(token, 0)
|
|
||||||
d[token] += 1
|
|
||||||
return sorted(d.items(), key=operator.itemgetter(1), reverse=True)[:print_total_threshold]
|
|
||||||
|
|
||||||
|
|
||||||
def _contains_blinded_text(stats_xml):
|
|
||||||
""" Heuristic to determine whether the treebank has blinded texts or not """
|
|
||||||
tree = ET.parse(stats_xml)
|
|
||||||
root = tree.getroot()
|
|
||||||
total_tokens = int(root.find('size/total/tokens').text)
|
|
||||||
unique_forms = int(root.find('forms').get('unique'))
|
|
||||||
|
|
||||||
# assume the corpus is largely blinded when there are less than 1% unique tokens
|
|
||||||
return (unique_forms / total_tokens) < 0.01
|
|
||||||
|
|
||||||
|
|
||||||
def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language):
|
|
||||||
"""" Fetch the txt files for all treebanks for a given set of languages """
|
|
||||||
all_treebanks = dict()
|
|
||||||
treebank_size = dict()
|
|
||||||
for l in languages:
|
|
||||||
all_treebanks[l] = []
|
|
||||||
treebank_size[l] = 0
|
|
||||||
|
|
||||||
for treebank_dir in ud_dir.iterdir():
|
|
||||||
if treebank_dir.is_dir():
|
|
||||||
for txt_path in treebank_dir.iterdir():
|
|
||||||
if txt_path.name.endswith('-ud-' + corpus + '.txt'):
|
|
||||||
file_lang = txt_path.name.split('_')[0]
|
|
||||||
if file_lang in languages:
|
|
||||||
gold_path = treebank_dir / txt_path.name.replace('.txt', '.conllu')
|
|
||||||
stats_xml = treebank_dir / "stats.xml"
|
|
||||||
# ignore treebanks where the texts are not publicly available
|
|
||||||
if not _contains_blinded_text(stats_xml):
|
|
||||||
if not best_per_language:
|
|
||||||
all_treebanks[file_lang].append(txt_path)
|
|
||||||
# check the tokens in the gold annotation to keep only the biggest treebank per language
|
|
||||||
else:
|
|
||||||
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
|
||||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
|
||||||
gold_tokens = len(gold_ud.tokens)
|
|
||||||
if treebank_size[file_lang] < gold_tokens:
|
|
||||||
all_treebanks[file_lang] = [txt_path]
|
|
||||||
treebank_size[file_lang] = gold_tokens
|
|
||||||
|
|
||||||
return all_treebanks
|
|
||||||
|
|
||||||
|
|
||||||
def run_single_eval(nlp, loading_time, print_name, text_path, gold_ud, tmp_output_path, out_file, print_header,
|
|
||||||
check_parse, print_freq_tasks):
|
|
||||||
"""" Run an evaluation of a model nlp on a certain specified treebank """
|
|
||||||
with text_path.open(mode='r', encoding='utf-8') as f:
|
|
||||||
flat_text = f.read()
|
|
||||||
|
|
||||||
# STEP 1: tokenize text
|
|
||||||
tokenization_start = time.time()
|
|
||||||
texts = split_text(flat_text)
|
|
||||||
docs = list(nlp.pipe(texts))
|
|
||||||
tokenization_end = time.time()
|
|
||||||
tokenization_time = tokenization_end - tokenization_start
|
|
||||||
|
|
||||||
# STEP 2: record stats and timings
|
|
||||||
tokens_per_s = int(len(gold_ud.tokens) / tokenization_time)
|
|
||||||
|
|
||||||
print_header_1 = ['date', 'text_path', 'gold_tokens', 'model', 'loading_time', 'tokenization_time', 'tokens_per_s']
|
|
||||||
print_string_1 = [str(datetime.date.today()), text_path.name, len(gold_ud.tokens),
|
|
||||||
print_name, "%.2f" % loading_time, "%.2f" % tokenization_time, tokens_per_s]
|
|
||||||
|
|
||||||
# STEP 3: evaluate predicted tokens and features
|
|
||||||
with tmp_output_path.open(mode="w", encoding="utf8") as tmp_out_file:
|
|
||||||
write_conllu(docs, tmp_out_file)
|
|
||||||
with tmp_output_path.open(mode="r", encoding="utf8") as sys_file:
|
|
||||||
sys_ud = conll17_ud_eval.load_conllu(sys_file, check_parse=check_parse)
|
|
||||||
tmp_output_path.unlink()
|
|
||||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud, check_parse=check_parse)
|
|
||||||
|
|
||||||
# STEP 4: format the scoring results
|
|
||||||
eval_headers = EVAL_PARSE
|
|
||||||
if not check_parse:
|
|
||||||
eval_headers = EVAL_NO_PARSE
|
|
||||||
|
|
||||||
for score_name in eval_headers:
|
|
||||||
score = scores[score_name]
|
|
||||||
print_string_1.extend(["%.2f" % score.precision,
|
|
||||||
"%.2f" % score.recall,
|
|
||||||
"%.2f" % score.f1])
|
|
||||||
print_string_1.append("-" if score.aligned_accuracy is None else "%.2f" % score.aligned_accuracy)
|
|
||||||
print_string_1.append("-" if score.undersegmented is None else "%.4f" % score.under_perc)
|
|
||||||
print_string_1.append("-" if score.oversegmented is None else "%.4f" % score.over_perc)
|
|
||||||
|
|
||||||
print_header_1.extend([score_name + '_p', score_name + '_r', score_name + '_F', score_name + '_acc',
|
|
||||||
score_name + '_under', score_name + '_over'])
|
|
||||||
|
|
||||||
if score_name in print_freq_tasks:
|
|
||||||
print_header_1.extend([score_name + '_word_under_ex', score_name + '_shape_under_ex',
|
|
||||||
score_name + '_word_over_ex', score_name + '_shape_over_ex'])
|
|
||||||
|
|
||||||
d_under_words = get_freq_tuples(score.undersegmented, PRINT_TOTAL)
|
|
||||||
d_under_shapes = get_freq_tuples([word_shape(x) for x in score.undersegmented], PRINT_TOTAL)
|
|
||||||
d_over_words = get_freq_tuples(score.oversegmented, PRINT_TOTAL)
|
|
||||||
d_over_shapes = get_freq_tuples([word_shape(x) for x in score.oversegmented], PRINT_TOTAL)
|
|
||||||
|
|
||||||
# saving to CSV with ';' as the separator, so mask any ';' in the example output
|
|
||||||
print_string_1.append(
|
|
||||||
str({k: v for k, v in d_under_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
|
||||||
print_string_1.append(
|
|
||||||
str({k: v for k, v in d_under_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
|
||||||
print_string_1.append(
|
|
||||||
str({k: v for k, v in d_over_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
|
||||||
print_string_1.append(
|
|
||||||
str({k: v for k, v in d_over_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*"))
|
|
||||||
|
|
||||||
# STEP 5: print the formatted results to CSV
|
|
||||||
if print_header:
|
|
||||||
out_file.write(';'.join(map(str, print_header_1)) + '\n')
|
|
||||||
out_file.write(';'.join(map(str, print_string_1)) + '\n')
|
|
||||||
|
|
||||||
|
|
||||||
def run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks):
|
|
||||||
"""" Run an evaluation for each language with its specified models and treebanks """
|
|
||||||
print_header = True
|
|
||||||
|
|
||||||
for tb_lang, treebank_list in treebanks.items():
|
|
||||||
print()
|
|
||||||
print("Language", tb_lang)
|
|
||||||
for text_path in treebank_list:
|
|
||||||
print(" Evaluating on", text_path)
|
|
||||||
|
|
||||||
gold_path = text_path.parent / (text_path.stem + '.conllu')
|
|
||||||
print(" Gold data from ", gold_path)
|
|
||||||
|
|
||||||
# nested try blocks to ensure the code can continue with the next iteration after a failure
|
|
||||||
try:
|
|
||||||
with gold_path.open(mode='r', encoding='utf-8') as gold_file:
|
|
||||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
|
||||||
|
|
||||||
for nlp, nlp_loading_time, nlp_name in models[tb_lang]:
|
|
||||||
try:
|
|
||||||
print(" Benchmarking", nlp_name)
|
|
||||||
tmp_output_path = text_path.parent / str('tmp_' + nlp_name + '.conllu')
|
|
||||||
run_single_eval(nlp, nlp_loading_time, nlp_name, text_path, gold_ud, tmp_output_path, out_file,
|
|
||||||
print_header, check_parse, print_freq_tasks)
|
|
||||||
print_header = False
|
|
||||||
except Exception as e:
|
|
||||||
print(" Ran into trouble: ", str(e))
|
|
||||||
except Exception as e:
|
|
||||||
print(" Ran into trouble: ", str(e))
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
out_path=("Path to output CSV file", "positional", None, Path),
|
|
||||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
|
||||||
check_parse=("Set flag to evaluate parsing performance", "flag", "p", bool),
|
|
||||||
langs=("Enumeration of languages to evaluate (default: all)", "option", "l", str),
|
|
||||||
exclude_trained_models=("Set flag to exclude trained models", "flag", "t", bool),
|
|
||||||
exclude_multi=("Set flag to exclude the multi-language model as default baseline", "flag", "m", bool),
|
|
||||||
hide_freq=("Set flag to avoid printing out more detailed high-freq tokenization errors", "flag", "f", bool),
|
|
||||||
corpus=("Whether to run on train, dev or test", "option", "c", str),
|
|
||||||
best_per_language=("Set flag to only keep the largest treebank for each language", "flag", "b", bool)
|
|
||||||
)
|
|
||||||
def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_trained_models=False, exclude_multi=False,
|
|
||||||
hide_freq=False, corpus='train', best_per_language=False):
|
|
||||||
""""
|
|
||||||
Assemble all treebanks and models to run evaluations with.
|
|
||||||
When setting check_parse to True, the default models will not be evaluated as they don't have parsing functionality
|
|
||||||
"""
|
|
||||||
languages = [lang.strip() for lang in langs.split(",")]
|
|
||||||
|
|
||||||
print_freq_tasks = []
|
|
||||||
if not hide_freq:
|
|
||||||
print_freq_tasks = ['Tokens']
|
|
||||||
|
|
||||||
# fetch all relevant treebanks from the directory
|
|
||||||
treebanks = fetch_all_treebanks(ud_dir, languages, corpus, best_per_language)
|
|
||||||
|
|
||||||
print()
|
|
||||||
print("Loading all relevant models for", languages)
|
|
||||||
models = dict()
|
|
||||||
|
|
||||||
# multi-lang model
|
|
||||||
multi = None
|
|
||||||
if not exclude_multi and not check_parse:
|
|
||||||
multi = load_model('xx_ent_wiki_sm', add_sentencizer=True)
|
|
||||||
|
|
||||||
# initialize all models with the multi-lang model
|
|
||||||
for lang in languages:
|
|
||||||
models[lang] = [multi] if multi else []
|
|
||||||
# add default models if we don't want to evaluate parsing info
|
|
||||||
if not check_parse:
|
|
||||||
# Norwegian is 'nb' in spaCy but 'no' in the UD corpora
|
|
||||||
if lang == 'no':
|
|
||||||
models['no'].append(load_default_model_sentencizer('nb'))
|
|
||||||
else:
|
|
||||||
models[lang].append(load_default_model_sentencizer(lang))
|
|
||||||
|
|
||||||
# language-specific trained models
|
|
||||||
if not exclude_trained_models:
|
|
||||||
if 'de' in models:
|
|
||||||
models['de'].append(load_model('de_core_news_sm'))
|
|
||||||
models['de'].append(load_model('de_core_news_md'))
|
|
||||||
if 'el' in models:
|
|
||||||
models['el'].append(load_model('el_core_news_sm'))
|
|
||||||
models['el'].append(load_model('el_core_news_md'))
|
|
||||||
if 'en' in models:
|
|
||||||
models['en'].append(load_model('en_core_web_sm'))
|
|
||||||
models['en'].append(load_model('en_core_web_md'))
|
|
||||||
models['en'].append(load_model('en_core_web_lg'))
|
|
||||||
if 'es' in models:
|
|
||||||
models['es'].append(load_model('es_core_news_sm'))
|
|
||||||
models['es'].append(load_model('es_core_news_md'))
|
|
||||||
if 'fr' in models:
|
|
||||||
models['fr'].append(load_model('fr_core_news_sm'))
|
|
||||||
models['fr'].append(load_model('fr_core_news_md'))
|
|
||||||
if 'it' in models:
|
|
||||||
models['it'].append(load_model('it_core_news_sm'))
|
|
||||||
if 'nl' in models:
|
|
||||||
models['nl'].append(load_model('nl_core_news_sm'))
|
|
||||||
if 'pt' in models:
|
|
||||||
models['pt'].append(load_model('pt_core_news_sm'))
|
|
||||||
|
|
||||||
with out_path.open(mode='w', encoding='utf-8') as out_file:
|
|
||||||
run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
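A hedged sketch of driving the benchmark above programmatically rather than through plac; the UD directory and CSV name are assumptions, and it presumes the relevant spaCy models are installed.

from pathlib import Path

main(
    out_path=Path("ud_benchmark_results.csv"),  # hypothetical output CSV
    ud_dir=Path("/data/ud-treebanks-v2.2"),     # hypothetical UD corpus root
    check_parse=False,
    langs="en,de,fr",
    corpus="dev",
    best_per_language=True,
)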
@ -1,335 +0,0 @@
|
||||||
# flake8: noqa
|
|
||||||
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
|
||||||
.conllu format for development data, allowing the official scorer to be used.
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import plac
|
|
||||||
from pathlib import Path
|
|
||||||
import re
|
|
||||||
import sys
|
|
||||||
import srsly
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
import spacy.util
|
|
||||||
from spacy.tokens import Token, Doc
|
|
||||||
from spacy.gold import GoldParse
|
|
||||||
from spacy.util import compounding, minibatch_by_words
|
|
||||||
from spacy.syntax.nonproj import projectivize
|
|
||||||
from spacy.matcher import Matcher
|
|
||||||
|
|
||||||
# from spacy.morphology import Fused_begin, Fused_inside
|
|
||||||
from spacy import displacy
|
|
||||||
from collections import defaultdict, Counter
|
|
||||||
from timeit import default_timer as timer
|
|
||||||
|
|
||||||
Fused_begin = None
|
|
||||||
Fused_inside = None
|
|
||||||
|
|
||||||
import itertools
|
|
||||||
import random
|
|
||||||
import numpy.random
|
|
||||||
|
|
||||||
from . import conll17_ud_eval
|
|
||||||
|
|
||||||
from spacy import lang
|
|
||||||
from spacy.lang import zh
|
|
||||||
from spacy.lang import ja
|
|
||||||
from spacy.lang import ru
|
|
||||||
|
|
||||||
|
|
||||||
################
|
|
||||||
# Data reading #
|
|
||||||
################
|
|
||||||
|
|
||||||
space_re = re.compile(r"\s+")
|
|
||||||
|
|
||||||
|
|
||||||
def split_text(text):
|
|
||||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
|
||||||
|
|
||||||
|
|
||||||
##############
|
|
||||||
# Evaluation #
|
|
||||||
##############
|
|
||||||
|
|
||||||
|
|
||||||
def read_conllu(file_):
|
|
||||||
docs = []
|
|
||||||
sent = []
|
|
||||||
doc = []
|
|
||||||
for line in file_:
|
|
||||||
if line.startswith("# newdoc"):
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
doc = []
|
|
||||||
elif line.startswith("#"):
|
|
||||||
continue
|
|
||||||
elif not line.strip():
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
sent = []
|
|
||||||
else:
|
|
||||||
sent.append(list(line.strip().split("\t")))
|
|
||||||
if len(sent[-1]) != 10:
|
|
||||||
print(repr(line))
|
|
||||||
raise ValueError
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
return docs
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
|
||||||
if text_loc.parts[-1].endswith(".conllu"):
|
|
||||||
docs = []
|
|
||||||
with text_loc.open(encoding="utf8") as file_:
|
|
||||||
for conllu_doc in read_conllu(file_):
|
|
||||||
for conllu_sent in conllu_doc:
|
|
||||||
words = [line[1] for line in conllu_sent]
|
|
||||||
docs.append(Doc(nlp.vocab, words=words))
|
|
||||||
for name, component in nlp.pipeline:
|
|
||||||
docs = list(component.pipe(docs))
|
|
||||||
else:
|
|
||||||
with text_loc.open("r", encoding="utf8") as text_file:
|
|
||||||
texts = split_text(text_file.read())
|
|
||||||
docs = list(nlp.pipe(texts))
|
|
||||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
|
||||||
write_conllu(docs, out_file)
|
|
||||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
|
||||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
|
||||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
|
||||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
|
||||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
|
||||||
return docs, scores
|
|
||||||
|
|
||||||
|
|
||||||
def write_conllu(docs, file_):
|
|
||||||
merger = Matcher(docs[0].vocab)
|
|
||||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
|
||||||
for i, doc in enumerate(docs):
|
|
||||||
matches = []
|
|
||||||
if doc.is_parsed:
|
|
||||||
matches = merger(doc)
|
|
||||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
|
||||||
with doc.retokenize() as retokenizer:
|
|
||||||
for span in spans:
|
|
||||||
retokenizer.merge(span)
|
|
||||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
|
||||||
for j, sent in enumerate(doc.sents):
|
|
||||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
|
||||||
file_.write("# text = {text}\n".format(text=sent.text))
|
|
||||||
for k, token in enumerate(sent):
|
|
||||||
file_.write(_get_token_conllu(token, k, len(sent)) + "\n")
|
|
||||||
file_.write("\n")
|
|
||||||
for word in sent:
|
|
||||||
if word.head.i == word.i and word.dep_ == "ROOT":
|
|
||||||
break
|
|
||||||
else:
|
|
||||||
print("Rootless sentence!")
|
|
||||||
print(sent)
|
|
||||||
print(i)
|
|
||||||
for w in sent:
|
|
||||||
print(w.i, w.text, w.head.text, w.head.i, w.dep_)
|
|
||||||
raise ValueError
|
|
||||||
|
|
||||||
|
|
||||||
def _get_token_conllu(token, k, sent_len):
|
|
||||||
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
|
||||||
n = 1
|
|
||||||
text = [token.text]
|
|
||||||
while token.nbor(n).check_morph(Fused_inside):
|
|
||||||
text.append(token.nbor(n).text)
|
|
||||||
n += 1
|
|
||||||
id_ = "%d-%d" % (k + 1, (k + n))
|
|
||||||
fields = [id_, "".join(text)] + ["_"] * 8
|
|
||||||
lines = ["\t".join(fields)]
|
|
||||||
else:
|
|
||||||
lines = []
|
|
||||||
if token.head.i == token.i:
|
|
||||||
head = 0
|
|
||||||
else:
|
|
||||||
head = k + (token.head.i - token.i) + 1
|
|
||||||
fields = [
|
|
||||||
str(k + 1),
|
|
||||||
token.text,
|
|
||||||
token.lemma_,
|
|
||||||
token.pos_,
|
|
||||||
token.tag_,
|
|
||||||
"_",
|
|
||||||
str(head),
|
|
||||||
token.dep_.lower(),
|
|
||||||
"_",
|
|
||||||
"_",
|
|
||||||
]
|
|
||||||
if token.check_morph(Fused_begin) and (k + 1 < sent_len):
|
|
||||||
if k == 0:
|
|
||||||
fields[1] = token.norm_[0].upper() + token.norm_[1:]
|
|
||||||
else:
|
|
||||||
fields[1] = token.norm_
|
|
||||||
elif token.check_morph(Fused_inside):
|
|
||||||
fields[1] = token.norm_
|
|
||||||
elif token._.split_start is not None:
|
|
||||||
split_start = token._.split_start
|
|
||||||
split_end = token._.split_end
|
|
||||||
split_len = (split_end.i - split_start.i) + 1
|
|
||||||
n_in_split = token.i - split_start.i
|
|
||||||
subtokens = guess_fused_orths(split_start.text, [""] * split_len)
|
|
||||||
fields[1] = subtokens[n_in_split]
|
|
||||||
|
|
||||||
lines.append("\t".join(fields))
|
|
||||||
return "\n".join(lines)
|
|
||||||
|
|
||||||
|
|
||||||
def guess_fused_orths(word, ud_forms):
|
|
||||||
"""The UD data 'fused tokens' don't necessarily expand to keys that match
|
|
||||||
the form. We need orths that exactly match the string. Here we make a best
|
|
||||||
effort to divide up the word."""
|
|
||||||
if word == "".join(ud_forms):
|
|
||||||
# Happy case: we get a perfect split, with each letter accounted for.
|
|
||||||
return ud_forms
|
|
||||||
elif len(word) == sum(len(subtoken) for subtoken in ud_forms):
|
|
||||||
# Unideal, but at least lengths match.
|
|
||||||
output = []
|
|
||||||
remain = word
|
|
||||||
for subtoken in ud_forms:
|
|
||||||
assert len(subtoken) >= 1
|
|
||||||
output.append(remain[: len(subtoken)])
|
|
||||||
remain = remain[len(subtoken) :]
|
|
||||||
assert len(remain) == 0, (word, ud_forms, remain)
|
|
||||||
return output
|
|
||||||
else:
|
|
||||||
# Let's say word is 6 long, and there are three subtokens. The orths
|
|
||||||
# *must* equal the original string. Arbitrarily, split [4, 1, 1]
|
|
||||||
first = word[: len(word) - (len(ud_forms) - 1)]
|
|
||||||
output = [first]
|
|
||||||
remain = word[len(first) :]
|
|
||||||
for i in range(1, len(ud_forms)):
|
|
||||||
assert remain
|
|
||||||
output.append(remain[:1])
|
|
||||||
remain = remain[1:]
|
|
||||||
assert len(remain) == 0, (word, output, remain)
|
|
||||||
return output
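# Worked examples (comments added for clarity; the inputs below are hypothetical):
#   guess_fused_orths("dela", ["de", "la"])      -> ["de", "la"]        (exact join)
#   guess_fused_orths("Dela", ["de", "la"])      -> ["De", "la"]        (lengths match)
#   guess_fused_orths("abcdef", ["x", "y", "z"]) -> ["abcd", "e", "f"]  (fallback split)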
|
|
||||||
|
|
||||||
|
|
||||||
def print_results(name, ud_scores):
|
|
||||||
fields = {}
|
|
||||||
if ud_scores is not None:
|
|
||||||
fields.update(
|
|
||||||
{
|
|
||||||
"words": ud_scores["Words"].f1 * 100,
|
|
||||||
"sents": ud_scores["Sentences"].f1 * 100,
|
|
||||||
"tags": ud_scores["XPOS"].f1 * 100,
|
|
||||||
"uas": ud_scores["UAS"].f1 * 100,
|
|
||||||
"las": ud_scores["LAS"].f1 * 100,
|
|
||||||
}
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
fields.update({"words": 0.0, "sents": 0.0, "tags": 0.0, "uas": 0.0, "las": 0.0})
|
|
||||||
tpl = "\t".join(
|
|
||||||
(name, "{las:.1f}", "{uas:.1f}", "{tags:.1f}", "{sents:.1f}", "{words:.1f}")
|
|
||||||
)
|
|
||||||
print(tpl.format(**fields))
|
|
||||||
return fields
|
|
||||||
|
|
||||||
|
|
||||||
def get_token_split_start(token):
|
|
||||||
if token.text == "":
|
|
||||||
assert token.i != 0
|
|
||||||
i = -1
|
|
||||||
while token.nbor(i).text == "":
|
|
||||||
i -= 1
|
|
||||||
return token.nbor(i)
|
|
||||||
elif (token.i + 1) < len(token.doc) and token.nbor(1).text == "":
|
|
||||||
return token
|
|
||||||
else:
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def get_token_split_end(token):
|
|
||||||
if (token.i + 1) == len(token.doc):
|
|
||||||
return token if token.text == "" else None
|
|
||||||
elif token.text != "" and token.nbor(1).text != "":
|
|
||||||
return None
|
|
||||||
i = 1
|
|
||||||
while (token.i + i) < len(token.doc) and token.nbor(i).text == "":
|
|
||||||
i += 1
|
|
||||||
return token.nbor(i - 1)
|
|
||||||
|
|
||||||
|
|
||||||
##################
|
|
||||||
# Initialization #
|
|
||||||
##################
|
|
||||||
|
|
||||||
|
|
||||||
def load_nlp(experiments_dir, corpus):
|
|
||||||
nlp = spacy.load(experiments_dir / corpus / "best-model")
|
|
||||||
return nlp
|
|
||||||
|
|
||||||
|
|
||||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
|
||||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
|
||||||
return nlp
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
test_data_dir=(
|
|
||||||
"Path to Universal Dependencies test data",
|
|
||||||
"positional",
|
|
||||||
None,
|
|
||||||
Path,
|
|
||||||
),
|
|
||||||
experiment_dir=("Parent directory with output model", "positional", None, Path),
|
|
||||||
corpus=(
|
|
||||||
"UD corpus to evaluate, e.g. UD_English, UD_Spanish, etc",
|
|
||||||
"positional",
|
|
||||||
None,
|
|
||||||
str,
|
|
||||||
),
|
|
||||||
)
|
|
||||||
def main(test_data_dir, experiment_dir, corpus):
|
|
||||||
Token.set_extension("split_start", getter=get_token_split_start)
|
|
||||||
Token.set_extension("split_end", getter=get_token_split_end)
|
|
||||||
Token.set_extension("begins_fused", default=False)
|
|
||||||
Token.set_extension("inside_fused", default=False)
|
|
||||||
lang.zh.Chinese.Defaults.use_jieba = False
|
|
||||||
lang.ja.Japanese.Defaults.use_janome = False
|
|
||||||
lang.ru.Russian.Defaults.use_pymorphy2 = False
|
|
||||||
|
|
||||||
nlp = load_nlp(experiment_dir, corpus)
|
|
||||||
|
|
||||||
treebank_code = nlp.meta["treebank"]
|
|
||||||
for section in ("test", "dev"):
|
|
||||||
if section == "dev":
|
|
||||||
section_dir = "conll17-ud-development-2017-03-19"
|
|
||||||
else:
|
|
||||||
section_dir = "conll17-ud-test-2017-05-09"
|
|
||||||
text_path = test_data_dir / "input" / section_dir / (treebank_code + ".txt")
|
|
||||||
udpipe_path = (
|
|
||||||
test_data_dir / "input" / section_dir / (treebank_code + "-udpipe.conllu")
|
|
||||||
)
|
|
||||||
gold_path = test_data_dir / "gold" / section_dir / (treebank_code + ".conllu")
|
|
||||||
|
|
||||||
header = [section, "LAS", "UAS", "TAG", "SENT", "WORD"]
|
|
||||||
print("\t".join(header))
|
|
||||||
inputs = {"gold": gold_path, "udp": udpipe_path, "raw": text_path}
|
|
||||||
for input_type in ("udp", "raw"):
|
|
||||||
input_path = inputs[input_type]
|
|
||||||
output_path = (
|
|
||||||
experiment_dir / corpus / "{section}.conllu".format(section=section)
|
|
||||||
)
|
|
||||||
|
|
||||||
parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path)
|
|
||||||
|
|
||||||
accuracy = print_results(input_type, test_scores)
|
|
||||||
acc_path = (
|
|
||||||
experiment_dir
|
|
||||||
/ corpus
|
|
||||||
/ "{section}-accuracy.json".format(section=section)
|
|
||||||
)
|
|
||||||
srsly.write_json(acc_path, accuracy)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,570 +0,0 @@
|
||||||
# flake8: noqa
|
|
||||||
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
|
||||||
.conllu format for development data, allowing the official scorer to be used.
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import plac
|
|
||||||
from pathlib import Path
|
|
||||||
import re
|
|
||||||
import json
|
|
||||||
import tqdm
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
import spacy.util
|
|
||||||
from bin.ud import conll17_ud_eval
|
|
||||||
from spacy.tokens import Token, Doc
|
|
||||||
from spacy.gold import GoldParse
|
|
||||||
from spacy.util import compounding, minibatch, minibatch_by_words
|
|
||||||
from spacy.syntax.nonproj import projectivize
|
|
||||||
from spacy.matcher import Matcher
|
|
||||||
from spacy import displacy
|
|
||||||
from collections import defaultdict
|
|
||||||
|
|
||||||
import random
|
|
||||||
|
|
||||||
from spacy import lang
|
|
||||||
from spacy.lang import zh
|
|
||||||
from spacy.lang import ja
|
|
||||||
|
|
||||||
try:
|
|
||||||
import torch
|
|
||||||
except ImportError:
|
|
||||||
torch = None
|
|
||||||
|
|
||||||
|
|
||||||
################
|
|
||||||
# Data reading #
|
|
||||||
################
|
|
||||||
|
|
||||||
space_re = re.compile(r"\s+")
|
|
||||||
|
|
||||||
|
|
||||||
def split_text(text):
|
|
||||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
|
||||||
|
|
||||||
|
|
||||||
def read_data(
|
|
||||||
nlp,
|
|
||||||
conllu_file,
|
|
||||||
text_file,
|
|
||||||
raw_text=True,
|
|
||||||
oracle_segments=False,
|
|
||||||
max_doc_length=None,
|
|
||||||
limit=None,
|
|
||||||
):
|
|
||||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
|
||||||
include Doc objects created using nlp.make_doc and then aligned against
|
|
||||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
|
||||||
created from the gold-standard segments. At least one must be True."""
|
|
||||||
if not raw_text and not oracle_segments:
|
|
||||||
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
|
||||||
paragraphs = split_text(text_file.read())
|
|
||||||
conllu = read_conllu(conllu_file)
|
|
||||||
# sd is spacy doc; cd is conllu doc
|
|
||||||
# cs is conllu sent, ct is conllu token
|
|
||||||
docs = []
|
|
||||||
golds = []
|
|
||||||
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
|
||||||
sent_annots = []
|
|
||||||
for cs in cd:
|
|
||||||
sent = defaultdict(list)
|
|
||||||
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
|
||||||
if "." in id_:
|
|
||||||
continue
|
|
||||||
if "-" in id_:
|
|
||||||
continue
|
|
||||||
id_ = int(id_) - 1
|
|
||||||
head = int(head) - 1 if head != "0" else id_
|
|
||||||
sent["words"].append(word)
|
|
||||||
sent["tags"].append(tag)
|
|
||||||
sent["morphology"].append(_parse_morph_string(morph))
|
|
||||||
sent["morphology"][-1].add("POS_%s" % pos)
|
|
||||||
sent["heads"].append(head)
|
|
||||||
sent["deps"].append("ROOT" if dep == "root" else dep)
|
|
||||||
sent["spaces"].append(space_after == "_")
|
|
||||||
sent["entities"] = ["-"] * len(sent["words"])
|
|
||||||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
|
||||||
if oracle_segments:
|
|
||||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
|
||||||
golds.append(GoldParse(docs[-1], **sent))
|
|
||||||
assert golds[-1].morphology is not None
|
|
||||||
|
|
||||||
sent_annots.append(sent)
|
|
||||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
|
||||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
|
||||||
assert gold.morphology is not None
|
|
||||||
sent_annots = []
|
|
||||||
docs.append(doc)
|
|
||||||
golds.append(gold)
|
|
||||||
if limit and len(docs) >= limit:
|
|
||||||
return docs, golds
|
|
||||||
|
|
||||||
if raw_text and sent_annots:
|
|
||||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
|
||||||
docs.append(doc)
|
|
||||||
golds.append(gold)
|
|
||||||
if limit and len(docs) >= limit:
|
|
||||||
return docs, golds
|
|
||||||
return docs, golds
|
|
||||||
|
|
||||||
def _parse_morph_string(morph_string):
|
|
||||||
if morph_string == '_':
|
|
||||||
return set()
|
|
||||||
output = []
|
|
||||||
replacements = {'1': 'one', '2': 'two', '3': 'three'}
|
|
||||||
for feature in morph_string.split('|'):
|
|
||||||
key, value = feature.split('=')
|
|
||||||
value = replacements.get(value, value)
|
|
||||||
value = value.split(',')[0]
|
|
||||||
output.append('%s_%s' % (key, value.lower()))
|
|
||||||
return set(output)
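# Examples (comments added for clarity; not in the original script):
#   _parse_morph_string("Case=Nom|Number=Sing") -> {"Case_nom", "Number_sing"}
#   _parse_morph_string("Person=1")             -> {"Person_one"}  (digits are spelled out)
#   _parse_morph_string("_")                    -> set()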
|
|
||||||
|
|
||||||
def read_conllu(file_):
|
|
||||||
docs = []
|
|
||||||
sent = []
|
|
||||||
doc = []
|
|
||||||
for line in file_:
|
|
||||||
if line.startswith("# newdoc"):
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
doc = []
|
|
||||||
elif line.startswith("#"):
|
|
||||||
continue
|
|
||||||
elif not line.strip():
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
sent = []
|
|
||||||
else:
|
|
||||||
sent.append(list(line.strip().split("\t")))
|
|
||||||
if len(sent[-1]) != 10:
|
|
||||||
print(repr(line))
|
|
||||||
raise ValueError
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
return docs
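# Shape of the return value (comment added for clarity; not in the original script):
# a list of documents, each document a list of sentences, each sentence a list of
# 10-field token rows copied verbatim from the .conllu file. A hypothetical row:
#   docs[0][0][0] == ["1", "They", "they", "PRON", "PRP", "Case=Nom", "2", "nsubj", "_", "_"]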
|
|
||||||
|
|
||||||
|
|
||||||
def _make_gold(nlp, text, sent_annots, drop_deps=0.0):
|
|
||||||
# Flatten the conll annotations, and adjust the head indices
|
|
||||||
flat = defaultdict(list)
|
|
||||||
sent_starts = []
|
|
||||||
for sent in sent_annots:
|
|
||||||
flat["heads"].extend(len(flat["words"])+head for head in sent["heads"])
|
|
||||||
for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]:
|
|
||||||
flat[field].extend(sent[field])
|
|
||||||
sent_starts.append(True)
|
|
||||||
sent_starts.extend([False] * (len(sent["words"]) - 1))
|
|
||||||
# Construct text if necessary
|
|
||||||
assert len(flat["words"]) == len(flat["spaces"])
|
|
||||||
if text is None:
|
|
||||||
text = "".join(
|
|
||||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
|
||||||
)
|
|
||||||
doc = nlp.make_doc(text)
|
|
||||||
flat.pop("spaces")
|
|
||||||
gold = GoldParse(doc, **flat)
|
|
||||||
gold.sent_starts = sent_starts
|
|
||||||
for i in range(len(gold.heads)):
|
|
||||||
if random.random() < drop_deps:
|
|
||||||
gold.heads[i] = None
|
|
||||||
gold.labels[i] = None
|
|
||||||
|
|
||||||
return doc, gold
|
|
||||||
|
|
||||||
|
|
||||||
#############################
|
|
||||||
# Data transforms for spaCy #
|
|
||||||
#############################
|
|
||||||
|
|
||||||
|
|
||||||
def golds_to_gold_tuples(docs, golds):
|
|
||||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
|
||||||
GoldParse objects."""
|
|
||||||
tuples = []
|
|
||||||
for doc, gold in zip(docs, golds):
|
|
||||||
text = doc.text
|
|
||||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
|
||||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
|
||||||
tuples.append((text, sents))
|
|
||||||
return tuples
|
|
||||||
|
|
||||||
|
|
||||||
##############
|
|
||||||
# Evaluation #
|
|
||||||
##############
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
|
||||||
if text_loc.parts[-1].endswith(".conllu"):
|
|
||||||
docs = []
|
|
||||||
with text_loc.open(encoding="utf8") as file_:
|
|
||||||
for conllu_doc in read_conllu(file_):
|
|
||||||
for conllu_sent in conllu_doc:
|
|
||||||
words = [line[1] for line in conllu_sent]
|
|
||||||
docs.append(Doc(nlp.vocab, words=words))
|
|
||||||
for name, component in nlp.pipeline:
|
|
||||||
docs = list(component.pipe(docs))
|
|
||||||
else:
|
|
||||||
with text_loc.open("r", encoding="utf8") as text_file:
|
|
||||||
texts = split_text(text_file.read())
|
|
||||||
docs = list(nlp.pipe(texts))
|
|
||||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
|
||||||
write_conllu(docs, out_file)
|
|
||||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
|
||||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
|
||||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
|
||||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
|
||||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
|
||||||
return docs, scores
|
|
||||||
|
|
||||||
|
|
||||||
def write_conllu(docs, file_):
|
|
||||||
if not Token.has_extension("get_conllu_lines"):
|
|
||||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
|
||||||
if not Token.has_extension("begins_fused"):
|
|
||||||
Token.set_extension("begins_fused", default=False)
|
|
||||||
if not Token.has_extension("inside_fused"):
|
|
||||||
Token.set_extension("inside_fused", default=False)
|
|
||||||
|
|
||||||
merger = Matcher(docs[0].vocab)
|
|
||||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
|
||||||
for i, doc in enumerate(docs):
|
|
||||||
matches = []
|
|
||||||
if doc.is_parsed:
|
|
||||||
matches = merger(doc)
|
|
||||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
|
||||||
seen_tokens = set()
|
|
||||||
with doc.retokenize() as retokenizer:
|
|
||||||
for span in spans:
|
|
||||||
span_tokens = set(range(span.start, span.end))
|
|
||||||
if not span_tokens.intersection(seen_tokens):
|
|
||||||
retokenizer.merge(span)
|
|
||||||
seen_tokens.update(span_tokens)
|
|
||||||
|
|
||||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
|
||||||
for j, sent in enumerate(doc.sents):
|
|
||||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
|
||||||
file_.write("# text = {text}\n".format(text=sent.text))
|
|
||||||
for k, token in enumerate(sent):
|
|
||||||
if token.head.i > sent[-1].i or token.head.i < sent[0].i:
|
|
||||||
for word in doc[sent[0].i - 10 : sent[0].i]:
|
|
||||||
print(word.i, word.head.i, word.text, word.dep_)
|
|
||||||
for word in sent:
|
|
||||||
print(word.i, word.head.i, word.text, word.dep_)
|
|
||||||
for word in doc[sent[-1].i : sent[-1].i + 10]:
|
|
||||||
print(word.i, word.head.i, word.text, word.dep_)
|
|
||||||
raise ValueError(
|
|
||||||
"Invalid parse: head outside sentence (%s)" % token.text
|
|
||||||
)
|
|
||||||
file_.write(token._.get_conllu_lines(k) + "\n")
|
|
||||||
file_.write("\n")
|
|
||||||
|
|
||||||
|
|
||||||
def print_progress(itn, losses, ud_scores):
|
|
||||||
fields = {
|
|
||||||
"dep_loss": losses.get("parser", 0.0),
|
|
||||||
"morph_loss": losses.get("morphologizer", 0.0),
|
|
||||||
"tag_loss": losses.get("tagger", 0.0),
|
|
||||||
"words": ud_scores["Words"].f1 * 100,
|
|
||||||
"sents": ud_scores["Sentences"].f1 * 100,
|
|
||||||
"tags": ud_scores["XPOS"].f1 * 100,
|
|
||||||
"uas": ud_scores["UAS"].f1 * 100,
|
|
||||||
"las": ud_scores["LAS"].f1 * 100,
|
|
||||||
"morph": ud_scores["Feats"].f1 * 100,
|
|
||||||
}
|
|
||||||
header = ["Epoch", "P.Loss", "M.Loss", "LAS", "UAS", "TAG", "MORPH", "SENT", "WORD"]
|
|
||||||
if itn == 0:
|
|
||||||
print("\t".join(header))
|
|
||||||
tpl = "\t".join((
|
|
||||||
"{:d}",
|
|
||||||
"{dep_loss:.1f}",
|
|
||||||
"{morph_loss:.1f}",
|
|
||||||
"{las:.1f}",
|
|
||||||
"{uas:.1f}",
|
|
||||||
"{tags:.1f}",
|
|
||||||
"{morph:.1f}",
|
|
||||||
"{sents:.1f}",
|
|
||||||
"{words:.1f}",
|
|
||||||
))
|
|
||||||
print(tpl.format(itn, **fields))
|
|
||||||
|
|
||||||
|
|
||||||
# def get_sent_conllu(sent, sent_id):
|
|
||||||
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
|
|
||||||
|
|
||||||
|
|
||||||
def get_token_conllu(token, i):
|
|
||||||
if token._.begins_fused:
|
|
||||||
n = 1
|
|
||||||
while token.nbor(n)._.inside_fused:
|
|
||||||
n += 1
|
|
||||||
id_ = "%d-%d" % (i, i + n)
|
|
||||||
lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
|
|
||||||
else:
|
|
||||||
lines = []
|
|
||||||
if token.head.i == token.i:
|
|
||||||
head = 0
|
|
||||||
else:
|
|
||||||
head = i + (token.head.i - token.i) + 1
|
|
||||||
features = list(token.morph)
|
|
||||||
feat_str = []
|
|
||||||
replacements = {"one": "1", "two": "2", "three": "3"}
|
|
||||||
for feat in features:
|
|
||||||
if not feat.startswith("begin") and not feat.startswith("end"):
|
|
||||||
key, value = feat.split("_", 1)
|
|
||||||
value = replacements.get(value, value)
|
|
||||||
feat_str.append("%s=%s" % (key, value.title()))
|
|
||||||
if not feat_str:
|
|
||||||
feat_str = "_"
|
|
||||||
else:
|
|
||||||
feat_str = "|".join(feat_str)
|
|
||||||
fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, feat_str,
|
|
||||||
str(head), token.dep_.lower(), "_", "_"]
|
|
||||||
lines.append("\t".join(fields))
|
|
||||||
return "\n".join(lines)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
##################
|
|
||||||
# Initialization #
|
|
||||||
##################
|
|
||||||
|
|
||||||
|
|
||||||
def load_nlp(corpus, config, vectors=None):
|
|
||||||
lang = corpus.split("_")[0]
|
|
||||||
nlp = spacy.blank(lang)
|
|
||||||
if config.vectors:
|
|
||||||
if not vectors:
|
|
||||||
raise ValueError(
|
|
||||||
"config asks for vectors, but no vectors "
|
|
||||||
"directory set on command line (use -v)"
|
|
||||||
)
|
|
||||||
if (Path(vectors) / corpus).exists():
|
|
||||||
nlp.vocab.from_disk(Path(vectors) / corpus / "vocab")
|
|
||||||
nlp.meta["treebank"] = corpus
|
|
||||||
return nlp
|
|
||||||
|
|
||||||
|
|
||||||
def initialize_pipeline(nlp, docs, golds, config, device):
|
|
||||||
nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False}))
|
|
||||||
nlp.add_pipe(nlp.create_pipe("morphologizer"))
|
|
||||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
|
||||||
if config.multitask_tag:
|
|
||||||
nlp.parser.add_multitask_objective("tag")
|
|
||||||
if config.multitask_sent:
|
|
||||||
nlp.parser.add_multitask_objective("sent_start")
|
|
||||||
for gold in golds:
|
|
||||||
for tag in gold.tags:
|
|
||||||
if tag is not None:
|
|
||||||
nlp.tagger.add_label(tag)
|
|
||||||
if torch is not None and device != -1:
|
|
||||||
torch.set_default_tensor_type("torch.cuda.FloatTensor")
|
|
||||||
optimizer = nlp.begin_training(
|
|
||||||
lambda: golds_to_gold_tuples(docs, golds),
|
|
||||||
device=device,
|
|
||||||
subword_features=config.subword_features,
|
|
||||||
conv_depth=config.conv_depth,
|
|
||||||
bilstm_depth=config.bilstm_depth,
|
|
||||||
)
|
|
||||||
if config.pretrained_tok2vec:
|
|
||||||
_load_pretrained_tok2vec(nlp, config.pretrained_tok2vec)
|
|
||||||
return optimizer
|
|
||||||
|
|
||||||
|
|
||||||
def _load_pretrained_tok2vec(nlp, loc):
|
|
||||||
"""Load pretrained weights for the 'token-to-vector' part of the component
|
|
||||||
models, which is typically a CNN. See 'spacy pretrain'. Experimental.
|
|
||||||
"""
|
|
||||||
with Path(loc).open("rb") as file_:
|
|
||||||
weights_data = file_.read()
|
|
||||||
loaded = []
|
|
||||||
for name, component in nlp.pipeline:
|
|
||||||
if hasattr(component, "model") and hasattr(component.model, "tok2vec"):
|
|
||||||
component.model.tok2vec.from_bytes(weights_data)
|
|
||||||
loaded.append(name)
|
|
||||||
return loaded
|
|
||||||
|
|
||||||
|
|
||||||
########################
|
|
||||||
# Command line helpers #
|
|
||||||
########################
|
|
||||||
|
|
||||||
|
|
||||||
class Config(object):
|
|
||||||
def __init__(
|
|
||||||
self,
|
|
||||||
vectors=None,
|
|
||||||
max_doc_length=10,
|
|
||||||
multitask_tag=False,
|
|
||||||
multitask_sent=False,
|
|
||||||
multitask_dep=False,
|
|
||||||
multitask_vectors=None,
|
|
||||||
bilstm_depth=0,
|
|
||||||
nr_epoch=30,
|
|
||||||
min_batch_size=100,
|
|
||||||
max_batch_size=1000,
|
|
||||||
batch_by_words=True,
|
|
||||||
dropout=0.2,
|
|
||||||
conv_depth=4,
|
|
||||||
subword_features=True,
|
|
||||||
vectors_dir=None,
|
|
||||||
pretrained_tok2vec=None,
|
|
||||||
):
|
|
||||||
if vectors_dir is not None:
|
|
||||||
if vectors is None:
|
|
||||||
vectors = True
|
|
||||||
if multitask_vectors is None:
|
|
||||||
multitask_vectors = True
|
|
||||||
for key, value in locals().items():
|
|
||||||
setattr(self, key, value)
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def load(cls, loc, vectors_dir=None):
|
|
||||||
with Path(loc).open("r", encoding="utf8") as file_:
|
|
||||||
cfg = json.load(file_)
|
|
||||||
if vectors_dir is not None:
|
|
||||||
cfg["vectors_dir"] = vectors_dir
|
|
||||||
return cls(**cfg)
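# Example config file (hypothetical; the keys mirror the constructor arguments above):
# {
#     "max_doc_length": 10,
#     "multitask_tag": false,
#     "multitask_sent": true,
#     "nr_epoch": 30,
#     "batch_by_words": true,
#     "dropout": 0.2
# }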
|
|
||||||
|
|
||||||
|
|
||||||
class Dataset(object):
|
|
||||||
def __init__(self, path, section):
|
|
||||||
self.path = path
|
|
||||||
self.section = section
|
|
||||||
self.conllu = None
|
|
||||||
self.text = None
|
|
||||||
for file_path in self.path.iterdir():
|
|
||||||
name = file_path.parts[-1]
|
|
||||||
if section in name and name.endswith("conllu"):
|
|
||||||
self.conllu = file_path
|
|
||||||
elif section in name and name.endswith("txt"):
|
|
||||||
self.text = file_path
|
|
||||||
if self.conllu is None:
|
|
||||||
msg = "Could not find .txt file in {path} for {section}"
|
|
||||||
raise IOError(msg.format(section=section, path=path))
|
|
||||||
if self.text is None:
|
|
||||||
msg = "Could not find .txt file in {path} for {section}"
|
|
||||||
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
|
|
||||||
|
|
||||||
|
|
||||||
class TreebankPaths(object):
|
|
||||||
def __init__(self, ud_path, treebank, **cfg):
|
|
||||||
self.train = Dataset(ud_path / treebank, "train")
|
|
||||||
self.dev = Dataset(ud_path / treebank, "dev")
|
|
||||||
self.lang = self.train.lang
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
|
||||||
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
|
||||||
corpus=(
|
|
||||||
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
|
|
||||||
"positional",
|
|
||||||
None,
|
|
||||||
str,
|
|
||||||
),
|
|
||||||
config=("Path to json formatted config file", "option", "C", Path),
|
|
||||||
limit=("Size limit", "option", "n", int),
|
|
||||||
gpu_device=("Use GPU", "option", "g", int),
|
|
||||||
use_oracle_segments=("Use oracle segments", "flag", "G", int),
|
|
||||||
vectors_dir=(
|
|
||||||
"Path to directory with pretrained vectors, named e.g. en/",
|
|
||||||
"option",
|
|
||||||
"v",
|
|
||||||
Path,
|
|
||||||
),
|
|
||||||
)
|
|
||||||
def main(
|
|
||||||
ud_dir,
|
|
||||||
parses_dir,
|
|
||||||
corpus,
|
|
||||||
config=None,
|
|
||||||
limit=0,
|
|
||||||
gpu_device=-1,
|
|
||||||
vectors_dir=None,
|
|
||||||
use_oracle_segments=False,
|
|
||||||
):
|
|
||||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
|
||||||
Token.set_extension("begins_fused", default=False)
|
|
||||||
Token.set_extension("inside_fused", default=False)
|
|
||||||
|
|
||||||
spacy.util.fix_random_seed()
|
|
||||||
lang.zh.Chinese.Defaults.use_jieba = False
|
|
||||||
lang.ja.Japanese.Defaults.use_janome = False
|
|
||||||
|
|
||||||
if config is not None:
|
|
||||||
config = Config.load(config, vectors_dir=vectors_dir)
|
|
||||||
else:
|
|
||||||
config = Config(vectors_dir=vectors_dir)
|
|
||||||
paths = TreebankPaths(ud_dir, corpus)
|
|
||||||
if not (parses_dir / corpus).exists():
|
|
||||||
(parses_dir / corpus).mkdir()
|
|
||||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
|
||||||
nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
|
|
||||||
|
|
||||||
docs, golds = read_data(
|
|
||||||
nlp,
|
|
||||||
paths.train.conllu.open(encoding="utf8"),
|
|
||||||
paths.train.text.open(encoding="utf8"),
|
|
||||||
max_doc_length=config.max_doc_length,
|
|
||||||
limit=limit,
|
|
||||||
)
|
|
||||||
|
|
||||||
optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device)
|
|
||||||
|
|
||||||
batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001)
|
|
||||||
beam_prob = compounding(0.2, 0.8, 1.001)
|
|
||||||
for i in range(config.nr_epoch):
|
|
||||||
docs, golds = read_data(
|
|
||||||
nlp,
|
|
||||||
paths.train.conllu.open(encoding="utf8"),
|
|
||||||
paths.train.text.open(encoding="utf8"),
|
|
||||||
max_doc_length=config.max_doc_length,
|
|
||||||
limit=limit,
|
|
||||||
oracle_segments=use_oracle_segments,
|
|
||||||
raw_text=not use_oracle_segments,
|
|
||||||
)
|
|
||||||
Xs = list(zip(docs, golds))
|
|
||||||
random.shuffle(Xs)
|
|
||||||
if config.batch_by_words:
|
|
||||||
batches = minibatch_by_words(Xs, size=batch_sizes)
|
|
||||||
else:
|
|
||||||
batches = minibatch(Xs, size=batch_sizes)
|
|
||||||
losses = {}
|
|
||||||
n_train_words = sum(len(doc) for doc in docs)
|
|
||||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
|
||||||
for batch in batches:
|
|
||||||
batch_docs, batch_gold = zip(*batch)
|
|
||||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
|
||||||
nlp.parser.cfg["beam_update_prob"] = next(beam_prob)
|
|
||||||
nlp.update(
|
|
||||||
batch_docs,
|
|
||||||
batch_gold,
|
|
||||||
sgd=optimizer,
|
|
||||||
drop=config.dropout,
|
|
||||||
losses=losses,
|
|
||||||
)
|
|
||||||
|
|
||||||
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
|
||||||
with nlp.use_params(optimizer.averages):
|
|
||||||
if use_oracle_segments:
|
|
||||||
parsed_docs, scores = evaluate(nlp, paths.dev.conllu,
|
|
||||||
paths.dev.conllu, out_path)
|
|
||||||
else:
|
|
||||||
parsed_docs, scores = evaluate(nlp, paths.dev.text,
|
|
||||||
paths.dev.conllu, out_path)
|
|
||||||
print_progress(i, losses, scores)
|
|
||||||
|
|
||||||
|
|
||||||
def _render_parses(i, to_render):
|
|
||||||
to_render[0].user_data["title"] = "Batch %d" % i
|
|
||||||
with Path("/tmp/parses.html").open("w", encoding="utf8") as file_:
|
|
||||||
html = displacy.render(to_render[:5], style="dep", page=True)
|
|
||||||
file_.write(html)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,19 +0,0 @@
|
||||||
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
|
|
||||||
|
|
||||||
# spaCy examples
|
|
||||||
|
|
||||||
The examples are Python scripts with well-behaved command line interfaces. For
|
|
||||||
more detailed usage guides, see the [documentation](https://spacy.io/usage/).
|
|
||||||
|
|
||||||
To see the available arguments, you can use the `--help` or `-h` flag:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
$ python examples/training/train_ner.py --help
|
|
||||||
```
|
|
||||||
|
|
||||||
While we try to keep the examples up to date, they are not currently exercised
|
|
||||||
by the test suite, as some of them require significant data downloads or take
|
|
||||||
time to train. If you find that an example is no longer running,
|
|
||||||
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
|
|
||||||
nothing worse than trying to figure out what you're doing wrong, and it turns
|
|
||||||
out your code was never the problem.
|
|
|
@ -1,267 +0,0 @@
|
||||||
"""
|
|
||||||
This example shows how to use an LSTM sentiment classification model trained
|
|
||||||
using Keras in spaCy. spaCy splits the document into sentences, and each
|
|
||||||
sentence is classified using the LSTM. The scores for the sentences are then
|
|
||||||
aggregated to give the document score. This kind of hierarchical model is quite
|
|
||||||
difficult in "pure" Keras or Tensorflow, but it's very effective. The Keras
|
|
||||||
example on this dataset performs quite poorly, because it cuts off the documents
|
|
||||||
so that they're a fixed size. This hurts review accuracy a lot, because people
|
|
||||||
often summarise their rating in the final sentence.
|
|
||||||
|
|
||||||
Prerequisites:
|
|
||||||
spacy download en_vectors_web_lg
|
|
||||||
pip install keras==2.0.9
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
"""
|
|
||||||
|
|
||||||
import plac
|
|
||||||
import random
|
|
||||||
import pathlib
|
|
||||||
import cytoolz
|
|
||||||
import numpy
|
|
||||||
from keras.models import Sequential, model_from_json
|
|
||||||
from keras.layers import LSTM, Dense, Embedding, Bidirectional
|
|
||||||
from keras.layers import TimeDistributed
|
|
||||||
from keras.optimizers import Adam
|
|
||||||
import thinc.extra.datasets
|
|
||||||
from spacy.compat import pickle
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
|
|
||||||
class SentimentAnalyser(object):
|
|
||||||
@classmethod
|
|
||||||
def load(cls, path, nlp, max_length=100):
|
|
||||||
with (path / "config.json").open() as file_:
|
|
||||||
model = model_from_json(file_.read())
|
|
||||||
with (path / "model").open("rb") as file_:
|
|
||||||
lstm_weights = pickle.load(file_)
|
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
|
||||||
model.set_weights([embeddings] + lstm_weights)
|
|
||||||
return cls(model, max_length=max_length)
|
|
||||||
|
|
||||||
def __init__(self, model, max_length=100):
|
|
||||||
self._model = model
|
|
||||||
self.max_length = max_length
|
|
||||||
|
|
||||||
def __call__(self, doc):
|
|
||||||
X = get_features([doc], self.max_length)
|
|
||||||
y = self._model.predict(X)
|
|
||||||
self.set_sentiment(doc, y)
|
|
||||||
|
|
||||||
def pipe(self, docs, batch_size=1000):
|
|
||||||
for minibatch in cytoolz.partition_all(batch_size, docs):
|
|
||||||
minibatch = list(minibatch)
|
|
||||||
sentences = []
|
|
||||||
for doc in minibatch:
|
|
||||||
sentences.extend(doc.sents)
|
|
||||||
Xs = get_features(sentences, self.max_length)
|
|
||||||
ys = self._model.predict(Xs)
|
|
||||||
for sent, label in zip(sentences, ys):
|
|
||||||
sent.doc.sentiment += label - 0.5
|
|
||||||
for doc in minibatch:
|
|
||||||
yield doc
|
|
||||||
|
|
||||||
def set_sentiment(self, doc, y):
|
|
||||||
doc.sentiment = float(y[0])
|
|
||||||
# Sentiment has a native slot for a single float.
|
|
||||||
# For arbitrary data storage, there's:
|
|
||||||
# doc.user_data['my_data'] = y
|
|
||||||
|
|
||||||
|
|
||||||
def get_labelled_sentences(docs, doc_labels):
|
|
||||||
labels = []
|
|
||||||
sentences = []
|
|
||||||
for doc, y in zip(docs, doc_labels):
|
|
||||||
for sent in doc.sents:
|
|
||||||
sentences.append(sent)
|
|
||||||
labels.append(y)
|
|
||||||
return sentences, numpy.asarray(labels, dtype="int32")
|
|
||||||
|
|
||||||
|
|
||||||
def get_features(docs, max_length):
|
|
||||||
docs = list(docs)
|
|
||||||
Xs = numpy.zeros((len(docs), max_length), dtype="int32")
|
|
||||||
for i, doc in enumerate(docs):
|
|
||||||
j = 0
|
|
||||||
for token in doc:
|
|
||||||
vector_id = token.vocab.vectors.find(key=token.orth)
|
|
||||||
if vector_id >= 0:
|
|
||||||
Xs[i, j] = vector_id
|
|
||||||
else:
|
|
||||||
Xs[i, j] = 0
|
|
||||||
j += 1
|
|
||||||
if j >= max_length:
|
|
||||||
break
|
|
||||||
return Xs
|
|
||||||
|
|
||||||
|
|
||||||
def train(
|
|
||||||
train_texts,
|
|
||||||
train_labels,
|
|
||||||
dev_texts,
|
|
||||||
dev_labels,
|
|
||||||
lstm_shape,
|
|
||||||
lstm_settings,
|
|
||||||
lstm_optimizer,
|
|
||||||
batch_size=100,
|
|
||||||
nb_epoch=5,
|
|
||||||
by_sentence=True,
|
|
||||||
):
|
|
||||||
|
|
||||||
print("Loading spaCy")
|
|
||||||
nlp = spacy.load("en_vectors_web_lg")
|
|
||||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
|
||||||
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
|
|
||||||
|
|
||||||
print("Parsing texts...")
|
|
||||||
train_docs = list(nlp.pipe(train_texts))
|
|
||||||
dev_docs = list(nlp.pipe(dev_texts))
|
|
||||||
if by_sentence:
|
|
||||||
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
|
|
||||||
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
|
|
||||||
|
|
||||||
train_X = get_features(train_docs, lstm_shape["max_length"])
|
|
||||||
dev_X = get_features(dev_docs, lstm_shape["max_length"])
|
|
||||||
model.fit(
|
|
||||||
train_X,
|
|
||||||
train_labels,
|
|
||||||
validation_data=(dev_X, dev_labels),
|
|
||||||
epochs=nb_epoch,
|
|
||||||
batch_size=batch_size,
|
|
||||||
)
|
|
||||||
return model
|
|
||||||
|
|
||||||
|
|
||||||
def compile_lstm(embeddings, shape, settings):
|
|
||||||
model = Sequential()
|
|
||||||
model.add(
|
|
||||||
Embedding(
|
|
||||||
embeddings.shape[0],
|
|
||||||
embeddings.shape[1],
|
|
||||||
input_length=shape["max_length"],
|
|
||||||
trainable=False,
|
|
||||||
weights=[embeddings],
|
|
||||||
mask_zero=True,
|
|
||||||
)
|
|
||||||
)
|
|
||||||
model.add(TimeDistributed(Dense(shape["nr_hidden"], use_bias=False)))
|
|
||||||
model.add(
|
|
||||||
Bidirectional(
|
|
||||||
LSTM(
|
|
||||||
shape["nr_hidden"],
|
|
||||||
recurrent_dropout=settings["dropout"],
|
|
||||||
dropout=settings["dropout"],
|
|
||||||
)
|
|
||||||
)
|
|
||||||
)
|
|
||||||
model.add(Dense(shape["nr_class"], activation="sigmoid"))
|
|
||||||
model.compile(
|
|
||||||
optimizer=Adam(lr=settings["lr"]),
|
|
||||||
loss="binary_crossentropy",
|
|
||||||
metrics=["accuracy"],
|
|
||||||
)
|
|
||||||
return model
|
|
||||||
|
|
||||||
|
|
||||||
def get_embeddings(vocab):
|
|
||||||
return vocab.vectors.data
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(model_dir, texts, labels, max_length=100):
|
|
||||||
nlp = spacy.load("en_vectors_web_lg")
|
|
||||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
|
||||||
nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length))
|
|
||||||
|
|
||||||
correct = 0
|
|
||||||
i = 0
|
|
||||||
for doc in nlp.pipe(texts, batch_size=1000):
|
|
||||||
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
|
||||||
i += 1
|
|
||||||
return float(correct) / i
|
|
||||||
|
|
||||||
|
|
||||||
def read_data(data_dir, limit=0):
|
|
||||||
examples = []
|
|
||||||
for subdir, label in (("pos", 1), ("neg", 0)):
|
|
||||||
for filename in (data_dir / subdir).iterdir():
|
|
||||||
with filename.open() as file_:
|
|
||||||
text = file_.read()
|
|
||||||
examples.append((text, label))
|
|
||||||
random.shuffle(examples)
|
|
||||||
if limit >= 1:
|
|
||||||
examples = examples[:limit]
|
|
||||||
return zip(*examples) # Unzips into two lists
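# Expected layout (comment added for clarity; not in the original script): data_dir
# must contain "pos/" and "neg/" subdirectories, each holding one plain-text review
# per file, e.g. (file names here are only illustrative):
#   aclImdb/train/pos/0_9.txt
#   aclImdb/train/neg/0_3.txt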
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
train_dir=("Location of training file or directory"),
|
|
||||||
dev_dir=("Location of development file or directory"),
|
|
||||||
model_dir=("Location of output model directory",),
|
|
||||||
is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
|
|
||||||
nr_hidden=("Number of hidden units", "option", "H", int),
|
|
||||||
max_length=("Maximum sentence length", "option", "L", int),
|
|
||||||
dropout=("Dropout", "option", "d", float),
|
|
||||||
learn_rate=("Learn rate", "option", "e", float),
|
|
||||||
nb_epoch=("Number of training epochs", "option", "i", int),
|
|
||||||
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
|
|
||||||
nr_examples=("Limit to N examples", "option", "n", int),
|
|
||||||
)
|
|
||||||
def main(
|
|
||||||
model_dir=None,
|
|
||||||
train_dir=None,
|
|
||||||
dev_dir=None,
|
|
||||||
is_runtime=False,
|
|
||||||
nr_hidden=64,
|
|
||||||
max_length=100, # Shape
|
|
||||||
dropout=0.5,
|
|
||||||
learn_rate=0.001, # General NN config
|
|
||||||
nb_epoch=5,
|
|
||||||
batch_size=256,
|
|
||||||
nr_examples=-1,
|
|
||||||
): # Training params
|
|
||||||
if model_dir is not None:
|
|
||||||
model_dir = pathlib.Path(model_dir)
|
|
||||||
if train_dir is None or dev_dir is None:
|
|
||||||
imdb_data = thinc.extra.datasets.imdb()
|
|
||||||
if is_runtime:
|
|
||||||
if dev_dir is None:
|
|
||||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
|
||||||
else:
|
|
||||||
dev_texts, dev_labels = read_data(dev_dir)
|
|
||||||
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
|
|
||||||
print(acc)
|
|
||||||
else:
|
|
||||||
if train_dir is None:
|
|
||||||
train_texts, train_labels = zip(*imdb_data[0])
|
|
||||||
else:
|
|
||||||
print("Read data")
|
|
||||||
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
|
|
||||||
if dev_dir is None:
|
|
||||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
|
||||||
else:
|
|
||||||
dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
|
|
||||||
train_labels = numpy.asarray(train_labels, dtype="int32")
|
|
||||||
dev_labels = numpy.asarray(dev_labels, dtype="int32")
|
|
||||||
lstm = train(
|
|
||||||
train_texts,
|
|
||||||
train_labels,
|
|
||||||
dev_texts,
|
|
||||||
dev_labels,
|
|
||||||
{"nr_hidden": nr_hidden, "max_length": max_length, "nr_class": 1},
|
|
||||||
{"dropout": dropout, "lr": learn_rate},
|
|
||||||
{},
|
|
||||||
nb_epoch=nb_epoch,
|
|
||||||
batch_size=batch_size,
|
|
||||||
)
|
|
||||||
weights = lstm.get_weights()
|
|
||||||
if model_dir is not None:
|
|
||||||
with (model_dir / "model").open("wb") as file_:
|
|
||||||
pickle.dump(weights[1:], file_)
|
|
||||||
with (model_dir / "config.json").open("w") as file_:
|
|
||||||
file_.write(lstm.to_json())
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,82 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""A simple example of extracting relations between phrases and entities using
|
|
||||||
spaCy's named entity recognizer and the dependency parse. Here, we extract
|
|
||||||
money and currency values (entities labelled as MONEY) and then check the
|
|
||||||
dependency tree to find the noun phrase they are referring to – for example:
|
|
||||||
$9.4 million --> Net income.
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.2.1
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
|
|
||||||
import plac
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
|
|
||||||
TEXTS = [
|
|
||||||
"Net income was $9.4 million compared to the prior year of $2.7 million.",
|
|
||||||
"Revenue exceeded twelve billion dollars, with a loss of $1b.",
|
|
||||||
]
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
model=("Model to load (needs parser and NER)", "positional", None, str)
|
|
||||||
)
|
|
||||||
def main(model="en_core_web_sm"):
|
|
||||||
nlp = spacy.load(model)
|
|
||||||
print("Loaded model '%s'" % model)
|
|
||||||
print("Processing %d texts" % len(TEXTS))
|
|
||||||
|
|
||||||
for text in TEXTS:
|
|
||||||
doc = nlp(text)
|
|
||||||
relations = extract_currency_relations(doc)
|
|
||||||
for r1, r2 in relations:
|
|
||||||
print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text))
|
|
||||||
|
|
||||||
|
|
||||||
def filter_spans(spans):
|
|
||||||
# Filter a sequence of spans so they don't contain overlaps
|
|
||||||
# For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
|
|
||||||
get_sort_key = lambda span: (span.end - span.start, -span.start)
|
|
||||||
sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
|
|
||||||
result = []
|
|
||||||
seen_tokens = set()
|
|
||||||
for span in sorted_spans:
|
|
||||||
# Check for end - 1 here because boundaries are inclusive
|
|
||||||
if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
|
|
||||||
result.append(span)
|
|
||||||
seen_tokens.update(range(span.start, span.end))
|
|
||||||
result = sorted(result, key=lambda span: span.start)
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def extract_currency_relations(doc):
|
|
||||||
# Merge entities and noun chunks into one token
|
|
||||||
spans = list(doc.ents) + list(doc.noun_chunks)
|
|
||||||
spans = filter_spans(spans)
|
|
||||||
with doc.retokenize() as retokenizer:
|
|
||||||
for span in spans:
|
|
||||||
retokenizer.merge(span)
|
|
||||||
|
|
||||||
relations = []
|
|
||||||
for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
|
|
||||||
if money.dep_ in ("attr", "dobj"):
|
|
||||||
subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
|
|
||||||
if subject:
|
|
||||||
subject = subject[0]
|
|
||||||
relations.append((subject, money))
|
|
||||||
elif money.dep_ == "pobj" and money.head.dep_ == "prep":
|
|
||||||
relations.append((money.head.head, money))
|
|
||||||
return relations
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
||||||
|
|
||||||
# Expected output:
|
|
||||||
# Net income MONEY $9.4 million
|
|
||||||
# the prior year MONEY $2.7 million
|
|
||||||
# Revenue MONEY twelve billion dollars
|
|
||||||
# a loss MONEY 1b
|
|
|
@ -1,67 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""This example shows how to navigate the parse tree including subtrees
|
|
||||||
attached to a word.
|
|
||||||
|
|
||||||
Based on issue #252:
|
|
||||||
"In the documents and tutorials the main thing I haven't found is
|
|
||||||
examples on how to break sentences down into small sub thoughts/chunks. The
|
|
||||||
noun_chunks is handy, but having examples on using the token.head to find small
|
|
||||||
(near-complete) sentence chunks would be neat. Lets take the example sentence:
|
|
||||||
"displaCy uses CSS and JavaScript to show you how computers understand language"
|
|
||||||
|
|
||||||
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
|
|
||||||
[displaCy] uses CSS and Javascript [to + show]
|
|
||||||
show you how computers understand [language]
|
|
||||||
|
|
||||||
I'm assuming that we can use the token.head to build these groups."
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
|
|
||||||
import plac
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(model=("Model to load", "positional", None, str))
|
|
||||||
def main(model="en_core_web_sm"):
|
|
||||||
nlp = spacy.load(model)
|
|
||||||
print("Loaded model '%s'" % model)
|
|
||||||
|
|
||||||
doc = nlp(
|
|
||||||
"displaCy uses CSS and JavaScript to show you how computers "
|
|
||||||
"understand language"
|
|
||||||
)
|
|
||||||
|
|
||||||
# The easiest way is to find the head of the subtree you want, and then use
|
|
||||||
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
|
|
||||||
# is the one that does what you're asking for most directly:
|
|
||||||
for word in doc:
|
|
||||||
if word.dep_ in ("xcomp", "ccomp"):
|
|
||||||
print("".join(w.text_with_ws for w in word.subtree))
|
|
||||||
|
|
||||||
# It'd probably be better for `word.subtree` to return a `Span` object
|
|
||||||
# instead of a generator over the tokens. If you want the `Span` you can
|
|
||||||
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
|
|
||||||
# object is nice because you can easily get a vector, merge it, etc.
|
|
||||||
for word in doc:
|
|
||||||
if word.dep_ in ("xcomp", "ccomp"):
|
|
||||||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
|
||||||
print(subtree_span.text, "|", subtree_span.root.text)
|
|
||||||
|
|
||||||
# You might also want to select a head, and then select a start and end
|
|
||||||
# position by walking along its children. You could then take the
|
|
||||||
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
|
|
||||||
# a span.
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
||||||
|
|
||||||
# Expected output:
|
|
||||||
# to show you how computers understand language
|
|
||||||
# how computers understand language
|
|
||||||
# to show you how computers understand language | show
|
|
||||||
# how computers understand language | understand
|
|
|
@ -1,112 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Match a large set of multi-word expressions in O(1) time.
|
|
||||||
|
|
||||||
The idea is to associate each word in the vocabulary with a tag, noting whether
|
|
||||||
they begin, end, or are inside at least one pattern. An additional tag is used
|
|
||||||
for single-word patterns. Complete patterns are also stored in a hash set.
|
|
||||||
When we process a document, we look up the words in the vocabulary, to
|
|
||||||
associate the words with the tags. We then search for tag-sequences that
|
|
||||||
correspond to valid candidates. Finally, we look up the candidates in the hash
|
|
||||||
set.
|
|
||||||
|
|
||||||
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
|
|
||||||
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
|
|
||||||
the I tag, and Obama and Clinton with the L tag.
|
|
||||||
|
|
||||||
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
|
|
||||||
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
|
|
||||||
candidate is in the phrase dictionary, so only one is returned as a match.
|
|
||||||
|
|
||||||
The algorithm is O(n) at run-time for document of length n because we're only
|
|
||||||
ever matching over the tag patterns. So no matter how many phrases we're
|
|
||||||
looking for, our pattern set stays very small (exact size depends on the
|
|
||||||
maximum length we're looking for, as the query language currently has no
|
|
||||||
quantifiers).
|
|
||||||
|
|
||||||
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
|
|
||||||
formatted in jsonl as a sequence of entries like this:
|
|
||||||
|
|
||||||
{"text":"Anchorage"}
|
|
||||||
{"text":"Angola"}
|
|
||||||
{"text":"Ann Arbor"}
|
|
||||||
{"text":"Annapolis"}
|
|
||||||
{"text":"Appalachia"}
|
|
||||||
{"text":"Argentina"}
|
|
||||||
|
|
||||||
Reddit comments corpus:
|
|
||||||
* https://files.pushshift.io/reddit/
|
|
||||||
* https://archive.org/details/2015_reddit_comments_corpus
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
"""
|
|
||||||
from __future__ import print_function, unicode_literals, division
|
|
||||||
|
|
||||||
from bz2 import BZ2File
|
|
||||||
import time
|
|
||||||
import plac
|
|
||||||
import json
|
|
||||||
|
|
||||||
from spacy.matcher import PhraseMatcher
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
patterns_loc=("Path to gazetteer", "positional", None, str),
|
|
||||||
text_loc=("Path to Reddit corpus file", "positional", None, str),
|
|
||||||
n=("Number of texts to read", "option", "n", int),
|
|
||||||
lang=("Language class to initialise", "option", "l", str),
|
|
||||||
)
|
|
||||||
def main(patterns_loc, text_loc, n=10000, lang="en"):
|
|
||||||
nlp = spacy.blank(lang)
|
|
||||||
nlp.vocab.lex_attr_getters = {}
|
|
||||||
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
|
|
||||||
count = 0
|
|
||||||
t1 = time.time()
|
|
||||||
for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)):
|
|
||||||
count += 1
|
|
||||||
t2 = time.time()
|
|
||||||
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
|
|
||||||
|
|
||||||
|
|
||||||
def read_gazetteer(tokenizer, loc, n=-1):
|
|
||||||
for i, line in enumerate(open(loc)):
|
|
||||||
data = json.loads(line.strip())
|
|
||||||
phrase = tokenizer(data["text"])
|
|
||||||
for w in phrase:
|
|
||||||
_ = tokenizer.vocab[w.text]
|
|
||||||
if len(phrase) >= 2:
|
|
||||||
yield phrase
|
|
||||||
|
|
||||||
|
|
||||||
def read_text(bz2_loc, n=10000):
|
|
||||||
with BZ2File(bz2_loc) as file_:
|
|
||||||
for i, line in enumerate(file_):
|
|
||||||
data = json.loads(line)
|
|
||||||
yield data["body"]
|
|
||||||
if i >= n:
|
|
||||||
break
|
|
||||||
|
|
||||||
|
|
||||||
def get_matches(tokenizer, phrases, texts):
|
|
||||||
matcher = PhraseMatcher(tokenizer.vocab)
|
|
||||||
matcher.add("Phrase", None, *phrases)
|
|
||||||
for text in texts:
|
|
||||||
doc = tokenizer(text)
|
|
||||||
for w in doc:
|
|
||||||
_ = doc.vocab[w.text]
|
|
||||||
matches = matcher(doc)
|
|
||||||
for ent_id, start, end in matches:
|
|
||||||
yield (ent_id, doc[start:end].text)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
if False:
|
|
||||||
import cProfile
|
|
||||||
import pstats
|
|
||||||
|
|
||||||
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
|
||||||
s = pstats.Stats("Profile.prof")
|
|
||||||
s.strip_dirs().sort_stats("time").print_stats()
|
|
||||||
else:
|
|
||||||
plac.call(main)
|
|
|
@ -1,114 +0,0 @@
|
||||||
<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
|
|
||||||
|
|
||||||
# A decomposable attention model for Natural Language Inference
|
|
||||||
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
|
|
||||||
**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)**
|
|
||||||
|
|
||||||
This directory contains an implementation of the entailment prediction model described
|
|
||||||
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
|
|
||||||
for its competitive performance with very few parameters.
|
|
||||||
|
|
||||||
The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io).
|
|
||||||
Keras is used to build and train the network. spaCy is used to load
|
|
||||||
the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the
|
|
||||||
feature extraction, and help you apply the model at run-time. The following
|
|
||||||
demo code shows how the entailment model can be used at runtime, once the
|
|
||||||
hook is installed to customise the `.similarity()` method of spaCy's `Doc`
|
|
||||||
and `Span` objects:
|
|
||||||
|
|
||||||
```python
|
|
||||||
def demo(shape):
|
|
||||||
nlp = spacy.load('en_vectors_web_lg')
|
|
||||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0]))
|
|
||||||
|
|
||||||
doc1 = nlp(u'The king of France is bald.')
|
|
||||||
doc2 = nlp(u'France has no king.')
|
|
||||||
|
|
||||||
print("Sentence 1:", doc1)
|
|
||||||
print("Sentence 2:", doc2)
|
|
||||||
|
|
||||||
entailment_type, confidence = doc1.similarity(doc2)
|
|
||||||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
|
||||||
```
|
|
||||||
|
|
||||||
Which gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that
|
|
||||||
the system has definite opinions about Bertrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)!
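Under the hood, the hook in `spacy_hook.py` works by overriding the document's similarity method. The snippet below is only a minimal sketch of that mechanism, not the actual `KerasSimilarityShim`: the `SimilarityShim` class and the `pairwise_score` stand-in scorer are illustrative assumptions, and the real shim returns an (entailment type, confidence) pair rather than a single number.

```python
import spacy


def pairwise_score(doc1, doc2):
    # Stand-in scorer (assumption): dot product of the averaged word vectors.
    return doc1.vector.dot(doc2.vector)


class SimilarityShim(object):
    """Minimal stand-in for KerasSimilarityShim: installs a custom scorer as
    the similarity hook on each Doc that passes through the pipeline."""

    def __init__(self, scorer):
        self.scorer = scorer

    def __call__(self, doc):
        # Override Doc.similarity and Span.similarity for this document
        doc.user_hooks["similarity"] = self.scorer
        doc.user_span_hooks["similarity"] = self.scorer
        return doc


nlp = spacy.load("en_vectors_web_lg")
nlp.add_pipe(SimilarityShim(pairwise_score), last=True)
doc1 = nlp("The king of France is bald.")
doc2 = nlp("France has no king.")
print(doc1.similarity(doc2))
```

The important part is that `doc.user_hooks["similarity"]` lets any callable replace the default average-of-vectors scoring once the component has run, which is exactly the slot the entailment model plugs into.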
|
|
||||||
|
|
||||||
I'm working on a blog post to explain Parikh et al.'s model in more detail.
|
|
||||||
A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation.
|
|
||||||
I think it is a very interesting example of the attention mechanism, which
|
|
||||||
I didn't understand very well before working through this paper. There are
|
|
||||||
lots of ways to extend the model.
|
|
||||||
|
|
||||||
## What's where
|
|
||||||
|
|
||||||
| File | Description |
|
|
||||||
| --- | --- |
|
|
||||||
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
|
|
||||||
| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. |
|
|
||||||
| `keras_decomposable_attention.py` | Defines the neural network model. |
|
|
||||||
|
|
||||||
## Setting up
|
|
||||||
|
|
||||||
First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy
|
|
||||||
English models (about 1GB of data):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install keras
|
|
||||||
pip install spacy
|
|
||||||
python -m spacy download en_vectors_web_lg
|
|
||||||
```
|
|
||||||
|
|
||||||
You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano.
|
|
||||||
This will depend on your set up, so you're mostly on your own for this step. If you're using AWS, try the
|
|
||||||
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
|
|
||||||
|
|
||||||
Once you've installed the dependencies, you can run a small preliminary test of
|
|
||||||
the Keras model:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
py.test keras_parikh_entailment/keras_decomposable_attention.py
|
|
||||||
```
|
|
||||||
|
|
||||||
This compiles the model and fits it with some dummy data. You should see that
|
|
||||||
both tests passed.
|
|
||||||
|
|
||||||
Finally, download the [Stanford Natural Language Inference corpus](http://nlp.stanford.edu/projects/snli/).
|
|
||||||
|
|
||||||
## Running the example
|
|
||||||
|
|
||||||
You can run the `keras_parikh_entailment/` directory as a script, which executes the file
|
|
||||||
[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments
|
|
||||||
the usage is shown. Running it with `-h` explains the command line arguments.
|
|
||||||
|
|
||||||
The first thing you'll want to do is train the model:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python keras_parikh_entailment/ train -t <path to SNLI train JSON> -s <path to SNLI dev JSON>
|
|
||||||
```
|
|
||||||
|
|
||||||
Training takes about 300 epochs for full accuracy, and I haven't rerun the full
|
|
||||||
experiment since refactoring things to publish this example — please let me
|
|
||||||
know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs.
|
|
||||||
|
|
||||||
The other two modes demonstrate run-time usage. I never like relying on the accuracy printed
|
|
||||||
by `.fit()` methods. I never really feel confident until I've run a new process that loads
|
|
||||||
the model and starts making predictions, without access to the gold labels. I've therefore
|
|
||||||
included an `evaluate` mode.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python keras_parikh_entailment/ evaluate -s <path to SNLI train JSON>
|
|
||||||
```
|
|
||||||
|
|
||||||
Finally, there's also a little demo, which mostly exists to show
|
|
||||||
you how run-time usage will eventually look.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python keras_parikh_entailment/ demo
|
|
||||||
```
|
|
||||||
|
|
||||||
## Getting updates
|
|
||||||
|
|
||||||
We should have the blog post explaining the model ready before the end of the week. To get
|
|
||||||
notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal)
|
|
||||||
or subscribe to our [mailing list](http://eepurl.com/ckUpQ5).
|
|
|
@ -1,207 +0,0 @@
|
||||||
import numpy as np
|
|
||||||
import json
|
|
||||||
from keras.utils import to_categorical
|
|
||||||
import plac
|
|
||||||
import sys
|
|
||||||
|
|
||||||
from keras_decomposable_attention import build_model
|
|
||||||
from spacy_hook import get_embeddings, KerasSimilarityShim
|
|
||||||
|
|
||||||
try:
|
|
||||||
import cPickle as pickle
|
|
||||||
except ImportError:
|
|
||||||
import pickle
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
# workaround for keras/tensorflow bug
|
|
||||||
# see https://github.com/tensorflow/tensorflow/issues/3388
|
|
||||||
import os
|
|
||||||
import importlib
|
|
||||||
from keras import backend as K
|
|
||||||
|
|
||||||
|
|
||||||
def set_keras_backend(backend):
|
|
||||||
if K.backend() != backend:
|
|
||||||
os.environ["KERAS_BACKEND"] = backend
|
|
||||||
importlib.reload(K)
|
|
||||||
assert K.backend() == backend
|
|
||||||
if backend == "tensorflow":
|
|
||||||
K.get_session().close()
|
|
||||||
cfg = K.tf.ConfigProto()
|
|
||||||
cfg.gpu_options.allow_growth = True
|
|
||||||
K.set_session(K.tf.Session(config=cfg))
|
|
||||||
K.clear_session()
|
|
||||||
|
|
||||||
|
|
||||||
set_keras_backend("tensorflow")
|
|
||||||
|
|
||||||
|
|
||||||
def train(train_loc, dev_loc, shape, settings):
|
|
||||||
train_texts1, train_texts2, train_labels = read_snli(train_loc)
|
|
||||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
|
||||||
|
|
||||||
print("Loading spaCy")
|
|
||||||
nlp = spacy.load("en_vectors_web_lg")
|
|
||||||
assert nlp.path is not None
|
|
||||||
print("Processing texts...")
|
|
||||||
train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0])
|
|
||||||
dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0])
|
|
||||||
|
|
||||||
print("Compiling network")
|
|
||||||
model = build_model(get_embeddings(nlp.vocab), shape, settings)
|
|
||||||
|
|
||||||
print(settings)
|
|
||||||
model.fit(
|
|
||||||
train_X,
|
|
||||||
train_labels,
|
|
||||||
validation_data=(dev_X, dev_labels),
|
|
||||||
epochs=settings["nr_epoch"],
|
|
||||||
batch_size=settings["batch_size"],
|
|
||||||
)
|
|
||||||
if not (nlp.path / "similarity").exists():
|
|
||||||
(nlp.path / "similarity").mkdir()
|
|
||||||
print("Saving to", nlp.path / "similarity")
|
|
||||||
weights = model.get_weights()
|
|
||||||
# remove the embedding matrix. We can reconstruct it.
|
|
||||||
del weights[1]
|
|
||||||
with (nlp.path / "similarity" / "model").open("wb") as file_:
|
|
||||||
pickle.dump(weights, file_)
|
|
||||||
with (nlp.path / "similarity" / "config.json").open("w") as file_:
|
|
||||||
file_.write(model.to_json())
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(dev_loc, shape):
|
|
||||||
dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc)
|
|
||||||
nlp = spacy.load("en_vectors_web_lg")
|
|
||||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
|
||||||
total = 0.0
|
|
||||||
correct = 0.0
|
|
||||||
for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels):
|
|
||||||
doc1 = nlp(text1)
|
|
||||||
doc2 = nlp(text2)
|
|
||||||
sim, _ = doc1.similarity(doc2)
|
|
||||||
if sim == KerasSimilarityShim.entailment_types[label.argmax()]:
|
|
||||||
correct += 1
|
|
||||||
total += 1
|
|
||||||
return correct, total
|
|
||||||
|
|
||||||
|
|
||||||
def demo(shape):
|
|
||||||
nlp = spacy.load("en_vectors_web_lg")
|
|
||||||
nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0]))
|
|
||||||
|
|
||||||
doc1 = nlp("The king of France is bald.")
|
|
||||||
doc2 = nlp("France has no king.")
|
|
||||||
|
|
||||||
print("Sentence 1:", doc1)
|
|
||||||
print("Sentence 2:", doc2)
|
|
||||||
|
|
||||||
entailment_type, confidence = doc1.similarity(doc2)
|
|
||||||
print("Entailment type:", entailment_type, "(Confidence:", confidence, ")")
|
|
||||||
|
|
||||||
|
|
||||||
LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2}
|
|
||||||
|
|
||||||
|
|
||||||
def read_snli(path):
|
|
||||||
texts1 = []
|
|
||||||
texts2 = []
|
|
||||||
labels = []
|
|
||||||
with open(path, "r") as file_:
|
|
||||||
for line in file_:
|
|
||||||
eg = json.loads(line)
|
|
||||||
label = eg["gold_label"]
|
|
||||||
if label == "-": # per Parikh, ignore - SNLI entries
|
|
||||||
continue
|
|
||||||
texts1.append(eg["sentence1"])
|
|
||||||
texts2.append(eg["sentence2"])
|
|
||||||
labels.append(LABELS[label])
|
|
||||||
return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32"))
|
|
||||||
|
|
||||||
|
|
||||||
def create_dataset(nlp, texts, hypotheses, num_unk, max_length):
|
|
||||||
sents = texts + hypotheses
|
|
||||||
sents_as_ids = []
|
|
||||||
for sent in sents:
|
|
||||||
doc = nlp(sent)
|
|
||||||
word_ids = []
|
|
||||||
for i, token in enumerate(doc):
|
|
||||||
# skip odd spaces from tokenizer
|
|
||||||
if token.has_vector and token.vector_norm == 0:
|
|
||||||
continue
|
|
||||||
|
|
||||||
if i > max_length:
|
|
||||||
break
|
|
||||||
|
|
||||||
if token.has_vector:
|
|
||||||
word_ids.append(token.rank + num_unk + 1)
|
|
||||||
else:
|
|
||||||
# if we don't have a vector, pick an OOV entry
|
|
||||||
word_ids.append(token.rank % num_unk + 1)
|
|
||||||
|
|
||||||
# there must be a simpler way of generating padded arrays from lists...
|
|
||||||
word_id_vec = np.zeros((max_length), dtype="int")
|
|
||||||
clipped_len = min(max_length, len(word_ids))
|
|
||||||
word_id_vec[:clipped_len] = word_ids[:clipped_len]
|
|
||||||
sents_as_ids.append(word_id_vec)
|
|
||||||
|
|
||||||
return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])]
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]),
|
|
||||||
train_loc=("Path to training data", "option", "t", str),
|
|
||||||
dev_loc=("Path to development or test data", "option", "s", str),
|
|
||||||
max_length=("Length to truncate sentences", "option", "L", int),
|
|
||||||
nr_hidden=("Number of hidden units", "option", "H", int),
|
|
||||||
dropout=("Dropout level", "option", "d", float),
|
|
||||||
learn_rate=("Learning rate", "option", "r", float),
|
|
||||||
batch_size=("Batch size for neural network training", "option", "b", int),
|
|
||||||
nr_epoch=("Number of training epochs", "option", "e", int),
|
|
||||||
entail_dir=(
|
|
||||||
"Direction of entailment",
|
|
||||||
"option",
|
|
||||||
"D",
|
|
||||||
str,
|
|
||||||
["both", "left", "right"],
|
|
||||||
),
|
|
||||||
)
|
|
||||||
def main(
|
|
||||||
mode,
|
|
||||||
train_loc,
|
|
||||||
dev_loc,
|
|
||||||
max_length=50,
|
|
||||||
nr_hidden=200,
|
|
||||||
dropout=0.2,
|
|
||||||
learn_rate=0.001,
|
|
||||||
batch_size=1024,
|
|
||||||
nr_epoch=10,
|
|
||||||
entail_dir="both",
|
|
||||||
):
|
|
||||||
shape = (max_length, nr_hidden, 3)
|
|
||||||
settings = {
|
|
||||||
"lr": learn_rate,
|
|
||||||
"dropout": dropout,
|
|
||||||
"batch_size": batch_size,
|
|
||||||
"nr_epoch": nr_epoch,
|
|
||||||
"entail_dir": entail_dir,
|
|
||||||
}
|
|
||||||
|
|
||||||
if mode == "train":
|
|
||||||
        if train_loc is None or dev_loc is None:
|
|
||||||
print("Train mode requires paths to training and development data sets.")
|
|
||||||
sys.exit(1)
|
|
||||||
train(train_loc, dev_loc, shape, settings)
|
|
||||||
elif mode == "evaluate":
|
|
||||||
        if dev_loc is None:
|
|
||||||
print("Evaluate mode requires paths to test data set.")
|
|
||||||
sys.exit(1)
|
|
||||||
correct, total = evaluate(dev_loc, shape)
|
|
||||||
print(correct, "/", total, correct / total)
|
|
||||||
else:
|
|
||||||
demo(shape)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,152 +0,0 @@
|
||||||
# Semantic entailment/similarity with decomposable attention (using spaCy and Keras)
|
|
||||||
# Practical state-of-the-art textual entailment with spaCy and Keras
|
|
||||||
|
|
||||||
import numpy as np
|
|
||||||
from keras import layers, Model, models, optimizers
|
|
||||||
from keras import backend as K
|
|
||||||
|
|
||||||
|
|
||||||
def build_model(vectors, shape, settings):
|
|
||||||
max_length, nr_hidden, nr_class = shape
|
|
||||||
|
|
||||||
input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1")
|
|
||||||
input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2")
|
|
||||||
|
|
||||||
# embeddings (projected)
|
|
||||||
embed = create_embedding(vectors, max_length, nr_hidden)
|
|
||||||
|
|
||||||
a = embed(input1)
|
|
||||||
b = embed(input2)
|
|
||||||
|
|
||||||
# step 1: attend
|
|
||||||
F = create_feedforward(nr_hidden)
|
|
||||||
att_weights = layers.dot([F(a), F(b)], axes=-1)
|
|
||||||
|
|
||||||
G = create_feedforward(nr_hidden)
|
|
||||||
|
|
||||||
if settings["entail_dir"] == "both":
|
|
||||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
|
||||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
|
||||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
|
||||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
|
||||||
|
|
||||||
# step 2: compare
|
|
||||||
comp1 = layers.concatenate([a, beta])
|
|
||||||
comp2 = layers.concatenate([b, alpha])
|
|
||||||
v1 = layers.TimeDistributed(G)(comp1)
|
|
||||||
v2 = layers.TimeDistributed(G)(comp2)
|
|
||||||
|
|
||||||
# step 3: aggregate
|
|
||||||
v1_sum = layers.Lambda(sum_word)(v1)
|
|
||||||
v2_sum = layers.Lambda(sum_word)(v2)
|
|
||||||
concat = layers.concatenate([v1_sum, v2_sum])
|
|
||||||
|
|
||||||
elif settings["entail_dir"] == "left":
|
|
||||||
norm_weights_a = layers.Lambda(normalizer(1))(att_weights)
|
|
||||||
alpha = layers.dot([norm_weights_a, a], axes=1)
|
|
||||||
comp2 = layers.concatenate([b, alpha])
|
|
||||||
v2 = layers.TimeDistributed(G)(comp2)
|
|
||||||
v2_sum = layers.Lambda(sum_word)(v2)
|
|
||||||
concat = v2_sum
|
|
||||||
|
|
||||||
else:
|
|
||||||
norm_weights_b = layers.Lambda(normalizer(2))(att_weights)
|
|
||||||
beta = layers.dot([norm_weights_b, b], axes=1)
|
|
||||||
comp1 = layers.concatenate([a, beta])
|
|
||||||
v1 = layers.TimeDistributed(G)(comp1)
|
|
||||||
v1_sum = layers.Lambda(sum_word)(v1)
|
|
||||||
concat = v1_sum
|
|
||||||
|
|
||||||
H = create_feedforward(nr_hidden)
|
|
||||||
out = H(concat)
|
|
||||||
out = layers.Dense(nr_class, activation="softmax")(out)
|
|
||||||
|
|
||||||
model = Model([input1, input2], out)
|
|
||||||
|
|
||||||
model.compile(
|
|
||||||
optimizer=optimizers.Adam(lr=settings["lr"]),
|
|
||||||
loss="categorical_crossentropy",
|
|
||||||
metrics=["accuracy"],
|
|
||||||
)
|
|
||||||
|
|
||||||
return model
|
|
||||||
|
|
||||||
|
|
||||||
def create_embedding(vectors, max_length, projected_dim):
|
|
||||||
return models.Sequential(
|
|
||||||
[
|
|
||||||
layers.Embedding(
|
|
||||||
vectors.shape[0],
|
|
||||||
vectors.shape[1],
|
|
||||||
input_length=max_length,
|
|
||||||
weights=[vectors],
|
|
||||||
trainable=False,
|
|
||||||
),
|
|
||||||
layers.TimeDistributed(
|
|
||||||
layers.Dense(projected_dim, activation=None, use_bias=False)
|
|
||||||
),
|
|
||||||
]
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2):
|
|
||||||
return models.Sequential(
|
|
||||||
[
|
|
||||||
layers.Dense(num_units, activation=activation),
|
|
||||||
layers.Dropout(dropout_rate),
|
|
||||||
layers.Dense(num_units, activation=activation),
|
|
||||||
layers.Dropout(dropout_rate),
|
|
||||||
]
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def normalizer(axis):
|
|
||||||
def _normalize(att_weights):
|
|
||||||
exp_weights = K.exp(att_weights)
|
|
||||||
sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)
|
|
||||||
return exp_weights / sum_weights
|
|
||||||
|
|
||||||
return _normalize
|
|
||||||
|
|
||||||
|
|
||||||
def sum_word(x):
|
|
||||||
return K.sum(x, axis=1)
|
|
||||||
|
|
||||||
|
|
||||||
def test_build_model():
|
|
||||||
vectors = np.ndarray((100, 8), dtype="float32")
|
|
||||||
shape = (10, 16, 3)
|
|
||||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
|
||||||
model = build_model(vectors, shape, settings)
|
|
||||||
|
|
||||||
|
|
||||||
def test_fit_model():
|
|
||||||
def _generate_X(nr_example, length, nr_vector):
|
|
||||||
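        # np.ndarray allocates uninitialized memory; the two masks below force
        # the arbitrary values into the range [0, nr_vector) for a quick smoke test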
X1 = np.ndarray((nr_example, length), dtype="int32")
|
|
||||||
X1 *= X1 < nr_vector
|
|
||||||
X1 *= 0 <= X1
|
|
||||||
X2 = np.ndarray((nr_example, length), dtype="int32")
|
|
||||||
X2 *= X2 < nr_vector
|
|
||||||
X2 *= 0 <= X2
|
|
||||||
return [X1, X2]
|
|
||||||
|
|
||||||
def _generate_Y(nr_example, nr_class):
|
|
||||||
ys = np.zeros((nr_example, nr_class), dtype="int32")
|
|
||||||
for i in range(nr_example):
|
|
||||||
ys[i, i % nr_class] = 1
|
|
||||||
return ys
|
|
||||||
|
|
||||||
vectors = np.ndarray((100, 8), dtype="float32")
|
|
||||||
shape = (10, 16, 3)
|
|
||||||
settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"}
|
|
||||||
model = build_model(vectors, shape, settings)
|
|
||||||
|
|
||||||
train_X = _generate_X(20, shape[0], vectors.shape[0])
|
|
||||||
train_Y = _generate_Y(20, shape[2])
|
|
||||||
dev_X = _generate_X(15, shape[0], vectors.shape[0])
|
|
||||||
dev_Y = _generate_Y(15, shape[2])
|
|
||||||
|
|
||||||
model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4)
|
|
||||||
|
|
||||||
|
|
||||||
__all__ = ["build_model"]
|
|
|
@ -1,77 +0,0 @@
|
||||||
import numpy as np
|
|
||||||
from keras.models import model_from_json
|
|
||||||
|
|
||||||
try:
|
|
||||||
import cPickle as pickle
|
|
||||||
except ImportError:
|
|
||||||
import pickle
|
|
||||||
|
|
||||||
|
|
||||||
class KerasSimilarityShim(object):
|
|
||||||
entailment_types = ["entailment", "contradiction", "neutral"]
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def load(cls, path, nlp, max_length=100, get_features=None):
|
|
||||||
|
|
||||||
if get_features is None:
|
|
||||||
get_features = get_word_ids
|
|
||||||
|
|
||||||
with (path / "config.json").open() as file_:
|
|
||||||
model = model_from_json(file_.read())
|
|
||||||
with (path / "model").open("rb") as file_:
|
|
||||||
weights = pickle.load(file_)
|
|
||||||
|
|
||||||
embeddings = get_embeddings(nlp.vocab)
|
|
||||||
weights.insert(1, embeddings)
|
|
||||||
model.set_weights(weights)
|
|
||||||
|
|
||||||
return cls(model, get_features=get_features, max_length=max_length)
|
|
||||||
|
|
||||||
def __init__(self, model, get_features=None, max_length=100):
|
|
||||||
self.model = model
|
|
||||||
self.get_features = get_features
|
|
||||||
self.max_length = max_length
|
|
||||||
|
|
||||||
def __call__(self, doc):
|
|
||||||
doc.user_hooks["similarity"] = self.predict
|
|
||||||
doc.user_span_hooks["similarity"] = self.predict
|
|
||||||
|
|
||||||
return doc
|
|
||||||
|
|
||||||
def predict(self, doc1, doc2):
|
|
||||||
x1 = self.get_features([doc1], max_length=self.max_length)
|
|
||||||
x2 = self.get_features([doc2], max_length=self.max_length)
|
|
||||||
scores = self.model.predict([x1, x2])
|
|
||||||
|
|
||||||
return self.entailment_types[scores.argmax()], scores.max()
|
|
||||||
|
|
||||||
|
|
||||||
def get_embeddings(vocab, nr_unk=100):
|
|
||||||
# the extra +1 is for a zero vector representing sentence-final padding
|
|
||||||
num_vectors = max(lex.rank for lex in vocab) + 2
|
|
||||||
|
|
||||||
# create random vectors for OOV tokens
|
|
||||||
oov = np.random.normal(size=(nr_unk, vocab.vectors_length))
|
|
||||||
oov = oov / oov.sum(axis=1, keepdims=True)
|
|
||||||
|
|
||||||
vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype="float32")
|
|
||||||
vectors[1 : (nr_unk + 1),] = oov
|
|
||||||
for lex in vocab:
|
|
||||||
if lex.has_vector and lex.vector_norm > 0:
|
|
||||||
vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm
|
|
||||||
|
|
||||||
return vectors
|
|
||||||
|
|
||||||
|
|
||||||
def get_word_ids(docs, max_length=100, nr_unk=100):
|
|
||||||
Xs = np.zeros((len(docs), max_length), dtype="int32")
|
|
||||||
|
|
||||||
for i, doc in enumerate(docs):
|
|
||||||
for j, token in enumerate(doc):
|
|
||||||
if j == max_length:
|
|
||||||
break
|
|
||||||
if token.has_vector:
|
|
||||||
Xs[i, j] = token.rank + nr_unk + 1
|
|
||||||
else:
|
|
||||||
Xs[i, j] = token.rank % nr_unk + 1
|
|
||||||
return Xs
|
|
|
@ -1,45 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
"""
|
|
||||||
Example of loading previously parsed text using spaCy's DocBin class. The example
|
|
||||||
performs an entity count to show that the annotations are available.
|
|
||||||
For more details, see https://spacy.io/usage/saving-loading#docs
|
|
||||||
Installation:
|
|
||||||
python -m spacy download en_core_web_lg
|
|
||||||
Usage:
|
|
||||||
python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
from spacy.tokens import DocBin
|
|
||||||
from timeit import default_timer as timer
|
|
||||||
from collections import Counter
|
|
||||||
|
|
||||||
EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy"
|
|
||||||
|
|
||||||
|
|
||||||
def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH):
|
|
||||||
nlp = spacy.load(model)
|
|
||||||
print("Reading data from {}".format(docbin_path))
|
|
||||||
with open(docbin_path, "rb") as file_:
|
|
||||||
bytes_data = file_.read()
|
|
||||||
nr_word = 0
|
|
||||||
start_time = timer()
|
|
||||||
entities = Counter()
|
|
||||||
docbin = DocBin().from_bytes(bytes_data)
|
|
||||||
for doc in docbin.get_docs(nlp.vocab):
|
|
||||||
nr_word += len(doc)
|
|
||||||
entities.update((e.label_, e.text) for e in doc.ents)
|
|
||||||
end_time = timer()
|
|
||||||
msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)"
|
|
||||||
wps = nr_word / (end_time - start_time)
|
|
||||||
print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps))
|
|
||||||
print("Most common entities:")
|
|
||||||
for (label, entity), freq in entities.most_common(30):
|
|
||||||
print(freq, entity, label)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
import plac
|
|
||||||
|
|
||||||
plac.call(main)
|
|
|
@ -1,955 +0,0 @@
|
||||||
{
|
|
||||||
"cells": [
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"# Natural language inference using spaCy and Keras"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Introduction"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Constructing the dataset"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 1,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"import spacy\n",
|
|
||||||
"import numpy as np"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"We only need the GloVe vectors from spaCy, not a full NLP pipeline."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 2,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"nlp = spacy.load('en_vectors_web_lg')"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 3,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stderr",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
|
|
||||||
" from ._conv import register_converters as _register_converters\n",
|
|
||||||
"Using TensorFlow backend.\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"import json\n",
|
|
||||||
"from keras.utils import to_categorical\n",
|
|
||||||
"\n",
|
|
||||||
"LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n",
|
|
||||||
"def read_snli(path):\n",
|
|
||||||
" texts1 = []\n",
|
|
||||||
" texts2 = []\n",
|
|
||||||
" labels = []\n",
|
|
||||||
" with open(path, 'r') as file_:\n",
|
|
||||||
" for line in file_:\n",
|
|
||||||
" eg = json.loads(line)\n",
|
|
||||||
" label = eg['gold_label']\n",
|
|
||||||
" if label == '-': # per Parikh, ignore - SNLI entries\n",
|
|
||||||
" continue\n",
|
|
||||||
" texts1.append(eg['sentence1'])\n",
|
|
||||||
" texts2.append(eg['sentence2'])\n",
|
|
||||||
" labels.append(LABELS[label])\n",
|
|
||||||
" return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 8,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 9,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n",
|
|
||||||
" sents = texts + hypotheses\n",
|
|
||||||
" \n",
|
|
||||||
" # the extra +1 is for a zero vector represting NULL for padding\n",
|
|
||||||
" num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n",
|
|
||||||
" \n",
|
|
||||||
" # create random vectors for OOV tokens\n",
|
|
||||||
" oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n",
|
|
||||||
" oov = oov / oov.sum(axis=1, keepdims=True)\n",
|
|
||||||
" \n",
|
|
||||||
" vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n",
|
|
||||||
" vectors[num_vectors:, ] = oov\n",
|
|
||||||
" for lex in nlp.vocab:\n",
|
|
||||||
" if lex.has_vector and lex.vector_norm > 0:\n",
|
|
||||||
" vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n",
|
|
||||||
" \n",
|
|
||||||
" sents_as_ids = []\n",
|
|
||||||
" for sent in sents:\n",
|
|
||||||
" doc = nlp(sent)\n",
|
|
||||||
" word_ids = []\n",
|
|
||||||
" \n",
|
|
||||||
" for i, token in enumerate(doc):\n",
|
|
||||||
" # skip odd spaces from tokenizer\n",
|
|
||||||
" if token.has_vector and token.vector_norm == 0:\n",
|
|
||||||
" continue\n",
|
|
||||||
" \n",
|
|
||||||
" if i > max_length:\n",
|
|
||||||
" break\n",
|
|
||||||
" \n",
|
|
||||||
" if token.has_vector:\n",
|
|
||||||
" word_ids.append(token.rank + 1)\n",
|
|
||||||
" else:\n",
|
|
||||||
" # if we don't have a vector, pick an OOV entry\n",
|
|
||||||
" word_ids.append(token.rank % num_oov + num_vectors) \n",
|
|
||||||
" \n",
|
|
||||||
" # there must be a simpler way of generating padded arrays from lists...\n",
|
|
||||||
" word_id_vec = np.zeros((max_length), dtype='int')\n",
|
|
||||||
" clipped_len = min(max_length, len(word_ids))\n",
|
|
||||||
" word_id_vec[:clipped_len] = word_ids[:clipped_len]\n",
|
|
||||||
" sents_as_ids.append(word_id_vec)\n",
|
|
||||||
" \n",
|
|
||||||
" \n",
|
|
||||||
" return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 10,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 11,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 12,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"_, text_vectors_test, hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n",
|
|
||||||
"\n",
|
|
||||||
"OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"Note that we will clip sentences to 50 words maximum."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 13,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"from keras import layers, Model, models\n",
|
|
||||||
"from keras import backend as K"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Building the model"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 14,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def create_embedding(vectors, max_length, projected_dim):\n",
|
|
||||||
" return models.Sequential([\n",
|
|
||||||
" layers.Embedding(\n",
|
|
||||||
" vectors.shape[0],\n",
|
|
||||||
" vectors.shape[1],\n",
|
|
||||||
" input_length=max_length,\n",
|
|
||||||
" weights=[vectors],\n",
|
|
||||||
" trainable=False),\n",
|
|
||||||
" \n",
|
|
||||||
" layers.TimeDistributed(\n",
|
|
||||||
" layers.Dense(projected_dim,\n",
|
|
||||||
" activation=None,\n",
|
|
||||||
" use_bias=False))\n",
|
|
||||||
" ])"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 15,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n",
|
|
||||||
" return models.Sequential([\n",
|
|
||||||
" layers.Dense(num_units, activation=activation),\n",
|
|
||||||
" layers.Dropout(dropout_rate),\n",
|
|
||||||
" layers.Dense(num_units, activation=activation),\n",
|
|
||||||
" layers.Dropout(dropout_rate)\n",
|
|
||||||
" ])"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The basic idea of the (Parikh et al, 2016) model is to:\n",
|
|
||||||
"\n",
|
|
||||||
"1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n",
|
|
||||||
"2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n",
|
|
||||||
"3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n",
|
|
||||||
"4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n",
|
|
||||||
"\n",
|
|
||||||
"Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"We need a couple of little functions for Lambda layers to normalize and aggregate weights:"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 16,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def normalizer(axis):\n",
|
|
||||||
" def _normalize(att_weights):\n",
|
|
||||||
" exp_weights = K.exp(att_weights)\n",
|
|
||||||
" sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n",
|
|
||||||
" return exp_weights/sum_weights\n",
|
|
||||||
" return _normalize\n",
|
|
||||||
"\n",
|
|
||||||
"def sum_word(x):\n",
|
|
||||||
" return K.sum(x, axis=1)\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 17,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [],
|
|
||||||
"source": [
|
|
||||||
"def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n",
|
|
||||||
" input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n",
|
|
||||||
" input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n",
|
|
||||||
" \n",
|
|
||||||
" # embeddings (projected)\n",
|
|
||||||
" embed = create_embedding(vectors, max_length, projected_dim)\n",
|
|
||||||
" \n",
|
|
||||||
" a = embed(input1)\n",
|
|
||||||
" b = embed(input2)\n",
|
|
||||||
" \n",
|
|
||||||
" # step 1: attend\n",
|
|
||||||
" F = create_feedforward(num_hidden)\n",
|
|
||||||
" att_weights = layers.dot([F(a), F(b)], axes=-1)\n",
|
|
||||||
" \n",
|
|
||||||
" G = create_feedforward(num_hidden)\n",
|
|
||||||
" \n",
|
|
||||||
" if entail_dir == 'both':\n",
|
|
||||||
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
|
|
||||||
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
|
|
||||||
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
|
|
||||||
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
|
|
||||||
"\n",
|
|
||||||
" # step 2: compare\n",
|
|
||||||
" comp1 = layers.concatenate([a, beta])\n",
|
|
||||||
" comp2 = layers.concatenate([b, alpha])\n",
|
|
||||||
" v1 = layers.TimeDistributed(G)(comp1)\n",
|
|
||||||
" v2 = layers.TimeDistributed(G)(comp2)\n",
|
|
||||||
"\n",
|
|
||||||
" # step 3: aggregate\n",
|
|
||||||
" v1_sum = layers.Lambda(sum_word)(v1)\n",
|
|
||||||
" v2_sum = layers.Lambda(sum_word)(v2)\n",
|
|
||||||
" concat = layers.concatenate([v1_sum, v2_sum])\n",
|
|
||||||
" elif entail_dir == 'left':\n",
|
|
||||||
" norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n",
|
|
||||||
" alpha = layers.dot([norm_weights_a, a], axes=1)\n",
|
|
||||||
" comp2 = layers.concatenate([b, alpha])\n",
|
|
||||||
" v2 = layers.TimeDistributed(G)(comp2)\n",
|
|
||||||
" v2_sum = layers.Lambda(sum_word)(v2)\n",
|
|
||||||
" concat = v2_sum\n",
|
|
||||||
" else:\n",
|
|
||||||
" norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n",
|
|
||||||
" beta = layers.dot([norm_weights_b, b], axes=1)\n",
|
|
||||||
" comp1 = layers.concatenate([a, beta])\n",
|
|
||||||
" v1 = layers.TimeDistributed(G)(comp1)\n",
|
|
||||||
" v1_sum = layers.Lambda(sum_word)(v1)\n",
|
|
||||||
" concat = v1_sum\n",
|
|
||||||
" \n",
|
|
||||||
" H = create_feedforward(num_hidden)\n",
|
|
||||||
" out = H(concat)\n",
|
|
||||||
" out = layers.Dense(num_classes, activation='softmax')(out)\n",
|
|
||||||
" \n",
|
|
||||||
" model = Model([input1, input2], out)\n",
|
|
||||||
" \n",
|
|
||||||
" model.compile(optimizer='adam',\n",
|
|
||||||
" loss='categorical_crossentropy',\n",
|
|
||||||
" metrics=['accuracy'])\n",
|
|
||||||
" return model\n",
|
|
||||||
" \n",
|
|
||||||
" \n",
|
|
||||||
" "
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 18,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"Layer (type) Output Shape Param # Connected to \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"words1 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"words2 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
|
||||||
" words2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n",
|
|
||||||
" sequential_1[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n",
|
|
||||||
" sequential_2[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n",
|
|
||||||
" sequential_1[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n",
|
|
||||||
" sequential_1[1][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n",
|
|
||||||
" dot_3[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n",
|
|
||||||
" dot_2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n",
|
|
||||||
" lambda_4[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dense_8 (Dense) (None, 3) 603 sequential_4[1][0] \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"Total params: 321,703,403\n",
|
|
||||||
"Trainable params: 381,803\n",
|
|
||||||
"Non-trainable params: 321,321,600\n",
|
|
||||||
"__________________________________________________________________________________________________\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"K.clear_session()\n",
|
|
||||||
"m = build_model(sem_vectors, 50, 200, 3, 200)\n",
|
|
||||||
"m.summary()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Training the model"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 19,
|
|
||||||
"metadata": {
|
|
||||||
"scrolled": true
|
|
||||||
},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"Train on 549367 samples, validate on 9824 samples\n",
|
|
||||||
"Epoch 1/50\n",
|
|
||||||
"549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n",
|
|
||||||
"Epoch 2/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n",
|
|
||||||
"Epoch 3/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n",
|
|
||||||
"Epoch 4/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n",
|
|
||||||
"Epoch 5/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n",
|
|
||||||
"Epoch 6/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n",
|
|
||||||
"Epoch 7/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n",
|
|
||||||
"Epoch 8/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n",
|
|
||||||
"Epoch 9/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n",
|
|
||||||
"Epoch 10/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n",
|
|
||||||
"Epoch 11/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n",
|
|
||||||
"Epoch 12/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n",
|
|
||||||
"Epoch 13/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n",
|
|
||||||
"Epoch 14/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n",
|
|
||||||
"Epoch 15/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - val_loss: 0.3938 - val_acc: 0.8515\n",
|
|
||||||
"Epoch 16/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n",
|
|
||||||
"Epoch 17/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n",
|
|
||||||
"Epoch 18/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n",
|
|
||||||
"Epoch 19/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n",
|
|
||||||
"Epoch 20/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n",
|
|
||||||
"Epoch 21/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n",
|
|
||||||
"Epoch 22/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n",
|
|
||||||
"Epoch 23/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n",
|
|
||||||
"Epoch 24/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n",
|
|
||||||
"Epoch 25/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n",
|
|
||||||
"Epoch 26/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n",
|
|
||||||
"Epoch 27/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n",
|
|
||||||
"Epoch 28/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n",
|
|
||||||
"Epoch 29/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n",
|
|
||||||
"Epoch 30/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n",
|
|
||||||
"Epoch 31/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n",
|
|
||||||
"Epoch 32/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n",
|
|
||||||
"Epoch 33/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n",
|
|
||||||
"Epoch 34/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n",
|
|
||||||
"Epoch 35/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n",
|
|
||||||
"Epoch 36/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n",
|
|
||||||
"Epoch 37/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n",
|
|
||||||
"Epoch 38/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n",
|
|
||||||
"Epoch 39/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n",
|
|
||||||
"Epoch 40/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n",
|
|
||||||
"Epoch 41/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n",
|
|
||||||
"Epoch 42/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n",
|
|
||||||
"Epoch 43/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n",
|
|
||||||
"Epoch 44/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n",
|
|
||||||
"Epoch 45/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n",
|
|
||||||
"Epoch 46/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n",
|
|
||||||
"Epoch 47/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n",
|
|
||||||
"Epoch 48/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n",
|
|
||||||
"Epoch 49/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n",
|
|
||||||
"Epoch 50/50\n",
|
|
||||||
"549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"data": {
|
|
||||||
"text/plain": [
|
|
||||||
"<keras.callbacks.History at 0x7f5c9f49c438>"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"execution_count": 19,
|
|
||||||
"metadata": {},
|
|
||||||
"output_type": "execute_result"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted by differences in `max_length` (here set at 50), in the training regime, and that here we use Keras' built-in validation splitting rather than the SNLI test set."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"## Experiment: the asymmetric model"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association between the hypothesis to the text is all that is needed for classifying the entailment.\n",
|
|
||||||
"\n",
|
|
||||||
"The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the paramater count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 20,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"Layer (type) Output Shape Param # Connected to \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"words2 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"words1 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
|
||||||
" words2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n",
|
|
||||||
" sequential_5[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n",
|
|
||||||
" sequential_6[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n",
|
|
||||||
" sequential_5[1][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n",
|
|
||||||
" dot_5[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"Total params: 321,663,403\n",
|
|
||||||
"Trainable params: 341,803\n",
|
|
||||||
"Non-trainable params: 321,321,600\n",
|
|
||||||
"__________________________________________________________________________________________________\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n",
|
|
||||||
"m1.summary()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 21,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"Train on 549367 samples, validate on 9824 samples\n",
|
|
||||||
"Epoch 1/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n",
|
|
||||||
"Epoch 2/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n",
|
|
||||||
"Epoch 3/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n",
|
|
||||||
"Epoch 4/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n",
|
|
||||||
"Epoch 5/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n",
|
|
||||||
"Epoch 6/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n",
|
|
||||||
"Epoch 7/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n",
|
|
||||||
"Epoch 8/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n",
|
|
||||||
"Epoch 9/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n",
|
|
||||||
"Epoch 10/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n",
|
|
||||||
"Epoch 11/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n",
|
|
||||||
"Epoch 12/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n",
|
|
||||||
"Epoch 13/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n",
|
|
||||||
"Epoch 14/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n",
|
|
||||||
"Epoch 15/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n",
|
|
||||||
"Epoch 16/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n",
|
|
||||||
"Epoch 17/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n",
|
|
||||||
"Epoch 18/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n",
|
|
||||||
"Epoch 19/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n",
|
|
||||||
"Epoch 20/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n",
|
|
||||||
"Epoch 21/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n",
|
|
||||||
"Epoch 22/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n",
|
|
||||||
"Epoch 23/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n",
|
|
||||||
"Epoch 24/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n",
|
|
||||||
"Epoch 25/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n",
|
|
||||||
"Epoch 26/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n",
|
|
||||||
"Epoch 27/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n",
|
|
||||||
"Epoch 28/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n",
|
|
||||||
"Epoch 29/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n",
|
|
||||||
"Epoch 30/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n",
|
|
||||||
"Epoch 31/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n",
|
|
||||||
"Epoch 32/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n",
|
|
||||||
"Epoch 33/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n",
|
|
||||||
"Epoch 34/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n",
|
|
||||||
"Epoch 35/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n",
|
|
||||||
"Epoch 36/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n",
|
|
||||||
"Epoch 37/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n",
|
|
||||||
"Epoch 38/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n",
|
|
||||||
"Epoch 39/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n",
|
|
||||||
"Epoch 40/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n",
|
|
||||||
"Epoch 41/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n",
|
|
||||||
"Epoch 42/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n",
|
|
||||||
"Epoch 43/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n",
|
|
||||||
"Epoch 44/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n",
|
|
||||||
"Epoch 45/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n",
|
|
||||||
"Epoch 46/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n",
|
|
||||||
"Epoch 47/50\n",
|
|
||||||
"549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n",
|
|
||||||
"Epoch 48/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n",
|
|
||||||
"Epoch 49/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n",
|
|
||||||
"Epoch 50/50\n",
|
|
||||||
"549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"data": {
|
|
||||||
"text/plain": [
|
|
||||||
"<keras.callbacks.History at 0x7f5ca1bf3e48>"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"execution_count": 21,
|
|
||||||
"metadata": {},
|
|
||||||
"output_type": "execute_result"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"This model performs the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from 64 down to 48 microseconds per step. \n",
|
|
||||||
"\n",
|
|
||||||
"Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n",
|
|
||||||
"\n",
|
|
||||||
"We'll just use 10 epochs for expediency."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 96,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"Layer (type) Output Shape Param # Connected to \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"words1 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"words2 (InputLayer) (None, 50) 0 \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n",
|
|
||||||
" words2[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n",
|
|
||||||
" sequential_13[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n",
|
|
||||||
" sequential_14[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n",
|
|
||||||
" sequential_13[2][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n",
|
|
||||||
" dot_9[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n",
|
|
||||||
"__________________________________________________________________________________________________\n",
|
|
||||||
"dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n",
|
|
||||||
"==================================================================================================\n",
|
|
||||||
"Total params: 321,663,403\n",
|
|
||||||
"Trainable params: 341,803\n",
|
|
||||||
"Non-trainable params: 321,321,600\n",
|
|
||||||
"__________________________________________________________________________________________________\n"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n",
|
|
||||||
"m2.summary()"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"cell_type": "code",
|
|
||||||
"execution_count": 97,
|
|
||||||
"metadata": {},
|
|
||||||
"outputs": [
|
|
||||||
{
|
|
||||||
"name": "stdout",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"Train on 455226 samples, validate on 113807 samples\n",
|
|
||||||
"Epoch 1/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n",
|
|
||||||
"Epoch 2/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n",
|
|
||||||
"Epoch 3/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n",
|
|
||||||
"Epoch 4/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n",
|
|
||||||
"Epoch 5/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n",
|
|
||||||
"Epoch 6/10\n",
|
|
||||||
"455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n",
|
|
||||||
"Epoch 7/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n",
|
|
||||||
"Epoch 8/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n",
|
|
||||||
"Epoch 9/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n",
|
|
||||||
"Epoch 10/10\n",
|
|
||||||
"455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"data": {
|
|
||||||
"text/plain": [
|
|
||||||
"<keras.callbacks.History at 0x7fa6850cf080>"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"execution_count": 97,
|
|
||||||
"metadata": {},
|
|
||||||
"output_type": "execute_result"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"source": [
|
|
||||||
"m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)"
|
|
||||||
]
|
|
||||||
},
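  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal additional check (a sketch, using the test vectors prepared earlier in the notebook): score the asymmetric model with Keras' `evaluate`, which returns the loss followed by the accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: score the asymmetric model on the held-out test split.\n",
    "# text_vectors_test, hypothesis_vectors_test and labels_test are defined earlier in this notebook.\n",
    "loss, acc = m2.evaluate([text_vectors_test, hypothesis_vectors_test], labels_test, batch_size=1024)\n",
    "print(\"Asymmetric model test loss: {:.4f}, test accuracy: {:.4f}\".format(loss, acc))"
   ]
  },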
|
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n",
|
|
||||||
"\n",
|
|
||||||
"It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!"
|
|
||||||
]
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"metadata": {
|
|
||||||
"kernelspec": {
|
|
||||||
"display_name": "Python 3",
|
|
||||||
"language": "python",
|
|
||||||
"name": "python3"
|
|
||||||
},
|
|
||||||
"language_info": {
|
|
||||||
"codemirror_mode": {
|
|
||||||
"name": "ipython",
|
|
||||||
"version": 3
|
|
||||||
},
|
|
||||||
"file_extension": ".py",
|
|
||||||
"mimetype": "text/x-python",
|
|
||||||
"name": "python",
|
|
||||||
"nbconvert_exporter": "python",
|
|
||||||
"pygments_lexer": "ipython3",
|
|
||||||
"version": "3.5.2"
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"nbformat": 4,
|
|
||||||
"nbformat_minor": 2
|
|
||||||
}
|
|
|
@ -1,78 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf-8
|
|
||||||
"""This example contains several snippets of methods that can be set via custom
|
|
||||||
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
|
|
||||||
they're "bound" to the object and are partially applied – i.e. the object
|
|
||||||
they're called on is passed in as the first argument.
|
|
||||||
|
|
||||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
|
|
||||||
import plac
|
|
||||||
from spacy.lang.en import English
|
|
||||||
from spacy.tokens import Doc, Span
|
|
||||||
from spacy import displacy
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
output_dir=("Output directory for saved HTML", "positional", None, Path)
|
|
||||||
)
|
|
||||||
def main(output_dir=None):
|
|
||||||
nlp = English() # start off with blank English class
|
|
||||||
|
|
||||||
Doc.set_extension("overlap", method=overlap_tokens)
|
|
||||||
doc1 = nlp("Peach emoji is where it has always been.")
|
|
||||||
doc2 = nlp("Peach is the superior emoji.")
|
|
||||||
print("Text 1:", doc1.text)
|
|
||||||
print("Text 2:", doc2.text)
|
|
||||||
print("Overlapping tokens:", doc1._.overlap(doc2))
|
|
||||||
|
|
||||||
Doc.set_extension("to_html", method=to_html)
|
|
||||||
doc = nlp("This is a sentence about Apple.")
|
|
||||||
# add entity manually for demo purposes, to make it work without a model
|
|
||||||
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings["ORG"])]
|
|
||||||
print("Text:", doc.text)
|
|
||||||
doc._.to_html(output=output_dir, style="ent")
|
|
||||||
|
|
||||||
|
|
||||||
def to_html(doc, output="/tmp", style="dep"):
|
|
||||||
"""Doc method extension for saving the current state as a displaCy
|
|
||||||
visualization.
|
|
||||||
"""
|
|
||||||
# generate filename from first six non-punct tokens
|
|
||||||
file_name = "-".join([w.text for w in doc[:6] if not w.is_punct]) + ".html"
|
|
||||||
html = displacy.render(doc, style=style, page=True) # render markup
|
|
||||||
if output is not None:
|
|
||||||
output_path = Path(output)
|
|
||||||
if not output_path.exists():
|
|
||||||
output_path.mkdir()
|
|
||||||
output_file = Path(output) / file_name
|
|
||||||
output_file.open("w", encoding="utf-8").write(html) # save to file
|
|
||||||
print("Saved HTML to {}".format(output_file))
|
|
||||||
else:
|
|
||||||
print(html)
|
|
||||||
|
|
||||||
|
|
||||||
def overlap_tokens(doc, other_doc):
|
|
||||||
"""Get the tokens from the original Doc that are also in the comparison Doc.
|
|
||||||
"""
|
|
||||||
overlap = []
|
|
||||||
other_tokens = [token.text for token in other_doc]
|
|
||||||
for token in doc:
|
|
||||||
if token.text in other_tokens:
|
|
||||||
overlap.append(token)
|
|
||||||
return overlap
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
||||||
|
|
||||||
# Expected output:
|
|
||||||
# Text 1: Peach emoji is where it has always been.
|
|
||||||
# Text 2: Peach is the superior emoji.
|
|
||||||
# Overlapping tokens: [Peach, emoji, is, .]
|
|
|
@ -1,130 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Example of a spaCy v2.0 pipeline component that requests all countries via
|
|
||||||
the REST Countries API, merges country names into one token, assigns entity
|
|
||||||
labels and sets attributes on country tokens, e.g. the capital and lat/lng
|
|
||||||
coordinates. Can be extended with more details from the API.
|
|
||||||
|
|
||||||
* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
|
|
||||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
Prerequisites: pip install requests
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
|
|
||||||
import requests
|
|
||||||
import plac
|
|
||||||
from spacy.lang.en import English
|
|
||||||
from spacy.matcher import PhraseMatcher
|
|
||||||
from spacy.tokens import Doc, Span, Token
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
# For simplicity, we start off with only the blank English Language class
|
|
||||||
# and no model or pre-defined pipeline loaded.
|
|
||||||
nlp = English()
|
|
||||||
rest_countries = RESTCountriesComponent(nlp) # initialise component
|
|
||||||
nlp.add_pipe(rest_countries) # add it to the pipeline
|
|
||||||
doc = nlp("Some text about Colombia and the Czech Republic")
|
|
||||||
print("Pipeline", nlp.pipe_names) # pipeline contains component name
|
|
||||||
print("Doc has countries", doc._.has_country) # Doc contains countries
|
|
||||||
for token in doc:
|
|
||||||
if token._.is_country:
|
|
||||||
print(
|
|
||||||
token.text,
|
|
||||||
token._.country_capital,
|
|
||||||
token._.country_latlng,
|
|
||||||
token._.country_flag,
|
|
||||||
) # country data
|
|
||||||
print("Entities", [(e.text, e.label_) for e in doc.ents]) # entities
|
|
||||||
|
|
||||||
|
|
||||||
class RESTCountriesComponent(object):
|
|
||||||
"""spaCy v2.0 pipeline component that requests all countries via
|
|
||||||
the REST Countries API, merges country names into one token, assigns entity
|
|
||||||
labels and sets attributes on country tokens.
|
|
||||||
"""
|
|
||||||
|
|
||||||
name = "rest_countries" # component name, will show up in the pipeline
|
|
||||||
|
|
||||||
def __init__(self, nlp, label="GPE"):
|
|
||||||
"""Initialise the pipeline component. The shared nlp instance is used
|
|
||||||
to initialise the matcher with the shared vocab, get the label ID and
|
|
||||||
generate Doc objects as phrase match patterns.
|
|
||||||
"""
|
|
||||||
# Make request once on initialisation and store the data
|
|
||||||
r = requests.get("https://restcountries.eu/rest/v2/all")
|
|
||||||
r.raise_for_status() # make sure requests raises an error if it fails
|
|
||||||
countries = r.json()
|
|
||||||
|
|
||||||
# Convert API response to dict keyed by country name for easy lookup
|
|
||||||
# This could also be extended using the alternative and foreign language
|
|
||||||
# names provided by the API
|
|
||||||
self.countries = {c["name"]: c for c in countries}
|
|
||||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
|
||||||
|
|
||||||
# Set up the PhraseMatcher with Doc patterns for each country name
|
|
||||||
patterns = [nlp(c) for c in self.countries.keys()]
|
|
||||||
self.matcher = PhraseMatcher(nlp.vocab)
|
|
||||||
self.matcher.add("COUNTRIES", None, *patterns)
|
|
||||||
|
|
||||||
# Register attribute on the Token. We'll be overwriting this based on
|
|
||||||
# the matches, so we're only setting a default value, not a getter.
|
|
||||||
# If no default value is set, it defaults to None.
|
|
||||||
Token.set_extension("is_country", default=False)
|
|
||||||
Token.set_extension("country_capital", default=False)
|
|
||||||
Token.set_extension("country_latlng", default=False)
|
|
||||||
Token.set_extension("country_flag", default=False)
|
|
||||||
|
|
||||||
# Register attributes on Doc and Span via a getter that checks if one of
|
|
||||||
# the contained tokens is set to is_country == True.
|
|
||||||
Doc.set_extension("has_country", getter=self.has_country)
|
|
||||||
Span.set_extension("has_country", getter=self.has_country)
|
|
||||||
|
|
||||||
def __call__(self, doc):
|
|
||||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
|
||||||
are found. Return the Doc, so it can be processed by the next component
|
|
||||||
in the pipeline, if available.
|
|
||||||
"""
|
|
||||||
matches = self.matcher(doc)
|
|
||||||
spans = [] # keep the spans for later so we can merge them afterwards
|
|
||||||
for _, start, end in matches:
|
|
||||||
# Generate Span representing the entity & set label
|
|
||||||
entity = Span(doc, start, end, label=self.label)
|
|
||||||
spans.append(entity)
|
|
||||||
# Set custom attribute on each token of the entity
|
|
||||||
# Can be extended with other data returned by the API, like
|
|
||||||
# currencies, country code, flag, calling code etc.
|
|
||||||
for token in entity:
|
|
||||||
token._.set("is_country", True)
|
|
||||||
token._.set("country_capital", self.countries[entity.text]["capital"])
|
|
||||||
token._.set("country_latlng", self.countries[entity.text]["latlng"])
|
|
||||||
token._.set("country_flag", self.countries[entity.text]["flag"])
|
|
||||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
|
||||||
doc.ents = list(doc.ents) + [entity]
|
|
||||||
for span in spans:
|
|
||||||
# Iterate over all spans and merge them into one token. This is done
|
|
||||||
# after setting the entities – otherwise, it would cause mismatched
|
|
||||||
# indices!
|
|
||||||
span.merge()
|
|
||||||
return doc # don't forget to return the Doc!
|
|
||||||
|
|
||||||
def has_country(self, tokens):
|
|
||||||
"""Getter for Doc and Span attributes. Returns True if one of the tokens
|
|
||||||
is a country. Since the getter is only called when we access the
|
|
||||||
attribute, we can refer to the Token's 'is_country' attribute here,
|
|
||||||
which is already set in the processing step."""
|
|
||||||
return any([t._.get("is_country") for t in tokens])
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
||||||
|
|
||||||
# Expected output:
|
|
||||||
# Pipeline ['rest_countries']
|
|
||||||
# Doc has countries True
|
|
||||||
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
|
|
||||||
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
|
|
||||||
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]
|
|
|
@ -1,115 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
|
||||||
based on list of single or multiple-word company names. Companies are
|
|
||||||
labelled as ORG and their spans are merged into one token. Additionally,
|
|
||||||
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
|
|
||||||
respectively.
|
|
||||||
|
|
||||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
|
|
||||||
import plac
|
|
||||||
from spacy.lang.en import English
|
|
||||||
from spacy.matcher import PhraseMatcher
|
|
||||||
from spacy.tokens import Doc, Span, Token
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
text=("Text to process", "positional", None, str),
|
|
||||||
companies=("Names of technology companies", "positional", None, str),
|
|
||||||
)
|
|
||||||
def main(text="Alphabet Inc. is the company behind Google.", *companies):
|
|
||||||
# For simplicity, we start off with only the blank English Language class
|
|
||||||
# and no model or pre-defined pipeline loaded.
|
|
||||||
nlp = English()
|
|
||||||
if not companies: # set default companies if none are set via args
|
|
||||||
companies = ["Alphabet Inc.", "Google", "Netflix", "Apple"] # etc.
|
|
||||||
component = TechCompanyRecognizer(nlp, companies) # initialise component
|
|
||||||
nlp.add_pipe(component, last=True) # add last to the pipeline
|
|
||||||
|
|
||||||
doc = nlp(text)
|
|
||||||
print("Pipeline", nlp.pipe_names) # pipeline contains component name
|
|
||||||
print("Tokens", [t.text for t in doc]) # company names from the list are merged
|
|
||||||
print("Doc has_tech_org", doc._.has_tech_org) # Doc contains tech orgs
|
|
||||||
print("Token 0 is_tech_org", doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
|
|
||||||
print("Token 1 is_tech_org", doc[1]._.is_tech_org) # "is" is not
|
|
||||||
print("Entities", [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
|
|
||||||
|
|
||||||
|
|
||||||
class TechCompanyRecognizer(object):
|
|
||||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
|
||||||
based on list of single or multiple-word company names. Companies are
|
|
||||||
labelled as ORG and their spans are merged into one token. Additionally,
|
|
||||||
._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token
|
|
||||||
respectively."""
|
|
||||||
|
|
||||||
name = "tech_companies" # component name, will show up in the pipeline
|
|
||||||
|
|
||||||
def __init__(self, nlp, companies=tuple(), label="ORG"):
|
|
||||||
"""Initialise the pipeline component. The shared nlp instance is used
|
|
||||||
to initialise the matcher with the shared vocab, get the label ID and
|
|
||||||
generate Doc objects as phrase match patterns.
|
|
||||||
"""
|
|
||||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
|
||||||
|
|
||||||
# Set up the PhraseMatcher – it can now take Doc objects as patterns,
|
|
||||||
# so even if the list of companies is long, it's very efficient
|
|
||||||
patterns = [nlp(org) for org in companies]
|
|
||||||
self.matcher = PhraseMatcher(nlp.vocab)
|
|
||||||
self.matcher.add("TECH_ORGS", None, *patterns)
|
|
||||||
|
|
||||||
# Register attribute on the Token. We'll be overwriting this based on
|
|
||||||
# the matches, so we're only setting a default value, not a getter.
|
|
||||||
Token.set_extension("is_tech_org", default=False)
|
|
||||||
|
|
||||||
# Register attributes on Doc and Span via a getter that checks if one of
|
|
||||||
# the contained tokens is set to is_tech_org == True.
|
|
||||||
Doc.set_extension("has_tech_org", getter=self.has_tech_org)
|
|
||||||
Span.set_extension("has_tech_org", getter=self.has_tech_org)
|
|
||||||
|
|
||||||
def __call__(self, doc):
|
|
||||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
|
||||||
are found. Return the Doc, so it can be processed by the next component
|
|
||||||
in the pipeline, if available.
|
|
||||||
"""
|
|
||||||
matches = self.matcher(doc)
|
|
||||||
spans = [] # keep the spans for later so we can merge them afterwards
|
|
||||||
for _, start, end in matches:
|
|
||||||
# Generate Span representing the entity & set label
|
|
||||||
entity = Span(doc, start, end, label=self.label)
|
|
||||||
spans.append(entity)
|
|
||||||
# Set custom attribute on each token of the entity
|
|
||||||
for token in entity:
|
|
||||||
token._.set("is_tech_org", True)
|
|
||||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
|
||||||
doc.ents = list(doc.ents) + [entity]
|
|
||||||
for span in spans:
|
|
||||||
# Iterate over all spans and merge them into one token. This is done
|
|
||||||
# after setting the entities – otherwise, it would cause mismatched
|
|
||||||
# indices!
|
|
||||||
span.merge()
|
|
||||||
return doc # don't forget to return the Doc!
|
|
||||||
|
|
||||||
def has_tech_org(self, tokens):
|
|
||||||
"""Getter for Doc and Span attributes. Returns True if one of the tokens
|
|
||||||
is a tech org. Since the getter is only called when we access the
|
|
||||||
attribute, we can refer to the Token's 'is_tech_org' attribute here,
|
|
||||||
which is already set in the processing step."""
|
|
||||||
return any([t._.get("is_tech_org") for t in tokens])
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
||||||
|
|
||||||
# Expected output:
|
|
||||||
# Pipeline ['tech_companies']
|
|
||||||
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
|
|
||||||
# Doc has_tech_org True
|
|
||||||
# Token 0 is_tech_org True
|
|
||||||
# Token 1 is_tech_org False
|
|
||||||
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]
|
|
|
@ -1,61 +0,0 @@
|
||||||
"""Example of adding a pipeline component to prohibit sentence boundaries
|
|
||||||
before certain tokens.
|
|
||||||
|
|
||||||
What we do is write to the token.is_sent_start attribute, which
|
|
||||||
takes values in {True, False, None}. The default value None allows the parser
|
|
||||||
to predict sentence segments. The value False prohibits the parser from inserting
|
|
||||||
a sentence boundary before that token. Note that fixing the sentence segmentation
|
|
||||||
should also improve the parse quality.
|
|
||||||
|
|
||||||
The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627
|
|
||||||
Other versions of the model may not make the original mistake, so the specific
|
|
||||||
example might not be apt for future versions.
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
"""
|
|
||||||
import plac
|
|
||||||
import spacy
|
|
||||||
|
|
||||||
|
|
||||||
def prevent_sentence_boundaries(doc):
|
|
||||||
for token in doc:
|
|
||||||
if not can_be_sentence_start(token):
|
|
||||||
token.is_sent_start = False
|
|
||||||
return doc
|
|
||||||
|
|
||||||
|
|
||||||
def can_be_sentence_start(token):
|
|
||||||
if token.i == 0:
|
|
||||||
return True
|
|
||||||
# We're not checking for is_title here to ignore arbitrary titlecased
|
|
||||||
# tokens within sentences
|
|
||||||
# elif token.is_title:
|
|
||||||
# return True
|
|
||||||
elif token.nbor(-1).is_punct:
|
|
||||||
return True
|
|
||||||
elif token.nbor(-1).is_space:
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
text=("The raw text to process", "positional", None, str),
|
|
||||||
spacy_model=("spaCy model to use (with a parser)", "option", "m", str),
|
|
||||||
)
|
|
||||||
def main(text="Been here And I'm loving it.", spacy_model="en_core_web_lg"):
|
|
||||||
print("Using spaCy model '{}'".format(spacy_model))
|
|
||||||
print("Processing text '{}'".format(text))
|
|
||||||
nlp = spacy.load(spacy_model)
|
|
||||||
doc = nlp(text)
|
|
||||||
sentences = [sent.text.strip() for sent in doc.sents]
|
|
||||||
print("Before:", sentences)
|
|
||||||
nlp.add_pipe(prevent_sentence_boundaries, before="parser")
|
|
||||||
doc = nlp(text)
|
|
||||||
sentences = [sent.text.strip() for sent in doc.sents]
|
|
||||||
print("After:", sentences)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,37 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Demonstrate adding a rule-based component that forces some tokens to not
|
|
||||||
be entities, before the NER tagger is applied. This is used to hotfix the issue
|
|
||||||
in https://github.com/explosion/spaCy/issues/2870, present as of spaCy v2.0.16.
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
from spacy.attrs import ENT_IOB
|
|
||||||
|
|
||||||
|
|
||||||
def fix_space_tags(doc):
|
|
||||||
ent_iobs = doc.to_array([ENT_IOB])
|
|
||||||
for i, token in enumerate(doc):
|
|
||||||
if token.is_space:
|
|
||||||
# Sets 'O' tag (0 is None, so I is 1, O is 2)
|
|
||||||
ent_iobs[i] = 2
|
|
||||||
doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1)))
|
|
||||||
return doc
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
nlp = spacy.load("en_core_web_sm")
|
|
||||||
text = "This is some crazy test where I dont need an Apple Watch to make things bug"
|
|
||||||
doc = nlp(text)
|
|
||||||
print("Before", doc.ents)
|
|
||||||
nlp.add_pipe(fix_space_tags, name="fix-ner", before="ner")
|
|
||||||
doc = nlp(text)
|
|
||||||
print("After", doc.ents)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
|
@ -1,84 +0,0 @@
|
||||||
#!/usr/bin/env python
|
|
||||||
# coding: utf8
|
|
||||||
"""Example of multi-processing with Joblib. Here, we're exporting
|
|
||||||
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
|
|
||||||
each "sentence" on a newline, and spaces between tokens. Data is loaded from
|
|
||||||
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
|
|
||||||
built-in dataset loader.
|
|
||||||
|
|
||||||
Compatible with: spaCy v2.0.0+
|
|
||||||
Last tested with: v2.1.0
|
|
||||||
Prerequisites: pip install joblib
|
|
||||||
"""
|
|
||||||
from __future__ import print_function, unicode_literals
|
|
||||||
|
|
||||||
from pathlib import Path
|
|
||||||
from joblib import Parallel, delayed
|
|
||||||
from functools import partial
|
|
||||||
import thinc.extra.datasets
|
|
||||||
import plac
|
|
||||||
import spacy
|
|
||||||
from spacy.util import minibatch
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
output_dir=("Output directory", "positional", None, Path),
|
|
||||||
model=("Model name (needs tagger)", "positional", None, str),
|
|
||||||
n_jobs=("Number of workers", "option", "n", int),
|
|
||||||
batch_size=("Batch-size for each process", "option", "b", int),
|
|
||||||
limit=("Limit of entries from the dataset", "option", "l", int),
|
|
||||||
)
|
|
||||||
def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10000):
|
|
||||||
nlp = spacy.load(model) # load spaCy model
|
|
||||||
print("Loaded model '%s'" % model)
|
|
||||||
if not output_dir.exists():
|
|
||||||
output_dir.mkdir()
|
|
||||||
    # load and pre-process the IMDB dataset
|
|
||||||
print("Loading IMDB data...")
|
|
||||||
data, _ = thinc.extra.datasets.imdb()
|
|
||||||
texts, _ = zip(*data[-limit:])
|
|
||||||
print("Processing texts...")
|
|
||||||
partitions = minibatch(texts, size=batch_size)
|
|
||||||
executor = Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes")
|
|
||||||
do = delayed(partial(transform_texts, nlp))
|
|
||||||
tasks = (do(i, batch, output_dir) for i, batch in enumerate(partitions))
|
|
||||||
executor(tasks)
|
|
||||||
|
|
||||||
|
|
||||||
def transform_texts(nlp, batch_id, texts, output_dir):
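    """Process one batch of texts with nlp.pipe and write the result to
    '<batch_id>.txt' in output_dir: one text per line, tokens rendered as
    'word|TAG' and separated by spaces. Skips the batch if its output file
    already exists."""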
|
|
||||||
print(nlp.pipe_names)
|
|
||||||
out_path = Path(output_dir) / ("%d.txt" % batch_id)
|
|
||||||
if out_path.exists(): # return None in case same batch is called again
|
|
||||||
return None
|
|
||||||
print("Processing batch", batch_id)
|
|
||||||
with out_path.open("w", encoding="utf8") as f:
|
|
||||||
for doc in nlp.pipe(texts):
|
|
||||||
f.write(" ".join(represent_word(w) for w in doc if not w.is_space))
|
|
||||||
f.write("\n")
|
|
||||||
print("Saved {} texts to {}.txt".format(len(texts), batch_id))
|
|
||||||
|
|
||||||
|
|
||||||
def represent_word(word):
|
|
||||||
text = word.text
|
|
||||||
# True-case, i.e. try to normalize sentence-initial capitals.
|
|
||||||
# Only do this if the lower-cased form is more probable.
|
|
||||||
if (
|
|
||||||
text.istitle()
|
|
||||||
and is_sent_begin(word)
|
|
||||||
and word.prob < word.doc.vocab[text.lower()].prob
|
|
||||||
):
|
|
||||||
text = text.lower()
|
|
||||||
return text + "|" + word.tag_
|
|
||||||
|
|
||||||
|
|
||||||
def is_sent_begin(word):
|
|
||||||
if word.i == 0:
|
|
||||||
return True
|
|
||||||
elif word.i >= 2 and word.nbor(-1).text in (".", "!", "?", "..."):
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,153 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
"""
|
|
||||||
Example of a Streamlit app for an interactive spaCy model visualizer. You can
|
|
||||||
either download the script, or point streamlit run to the raw URL of this
|
|
||||||
file. For more details, see https://streamlit.io.
|
|
||||||
|
|
||||||
Installation:
|
|
||||||
pip install streamlit
|
|
||||||
python -m spacy download en_core_web_sm
|
|
||||||
python -m spacy download en_core_web_md
|
|
||||||
python -m spacy download de_core_news_sm
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
streamlit run streamlit_spacy.py
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import streamlit as st
|
|
||||||
import spacy
|
|
||||||
from spacy import displacy
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
|
|
||||||
SPACY_MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "de_core_news_sm"]
|
|
||||||
DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook."
|
|
||||||
HTML_WRAPPER = """<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>"""
|
|
||||||
|
|
||||||
|
|
||||||
@st.cache(allow_output_mutation=True)
|
|
||||||
def load_model(name):
|
|
||||||
return spacy.load(name)
|
|
||||||
|
|
||||||
|
|
||||||
@st.cache(allow_output_mutation=True)
|
|
||||||
def process_text(model_name, text):
|
|
||||||
nlp = load_model(model_name)
|
|
||||||
return nlp(text)
|
|
||||||
|
|
||||||
|
|
||||||
st.sidebar.title("Interactive spaCy visualizer")
|
|
||||||
st.sidebar.markdown(
|
|
||||||
"""
|
|
||||||
Process text with [spaCy](https://spacy.io) models and visualize named entities,
|
|
||||||
dependencies and more. Uses spaCy's built-in
|
|
||||||
[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood.
|
|
||||||
"""
|
|
||||||
)
|
|
||||||
|
|
||||||
spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES)
|
|
||||||
model_load_state = st.info(f"Loading model '{spacy_model}'...")
|
|
||||||
nlp = load_model(spacy_model)
|
|
||||||
model_load_state.empty()
|
|
||||||
|
|
||||||
text = st.text_area("Text to analyze", DEFAULT_TEXT)
|
|
||||||
doc = process_text(spacy_model, text)
|
|
||||||
|
|
||||||
if "parser" in nlp.pipe_names:
|
|
||||||
st.header("Dependency Parse & Part-of-speech tags")
|
|
||||||
st.sidebar.header("Dependency Parse")
|
|
||||||
split_sents = st.sidebar.checkbox("Split sentences", value=True)
|
|
||||||
collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True)
|
|
||||||
collapse_phrases = st.sidebar.checkbox("Collapse phrases")
|
|
||||||
compact = st.sidebar.checkbox("Compact mode")
|
|
||||||
options = {
|
|
||||||
"collapse_punct": collapse_punct,
|
|
||||||
"collapse_phrases": collapse_phrases,
|
|
||||||
"compact": compact,
|
|
||||||
}
|
|
||||||
docs = [span.as_doc() for span in doc.sents] if split_sents else [doc]
|
|
||||||
for sent in docs:
|
|
||||||
html = displacy.render(sent, options=options)
|
|
||||||
# Double newlines seem to mess with the rendering
|
|
||||||
html = html.replace("\n\n", "\n")
|
|
||||||
if split_sents and len(docs) > 1:
|
|
||||||
st.markdown(f"> {sent.text}")
|
|
||||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
|
||||||
|
|
||||||
if "ner" in nlp.pipe_names:
|
|
||||||
st.header("Named Entities")
|
|
||||||
st.sidebar.header("Named Entities")
|
|
||||||
label_set = nlp.get_pipe("ner").labels
|
|
||||||
labels = st.sidebar.multiselect(
|
|
||||||
"Entity labels", options=label_set, default=list(label_set)
|
|
||||||
)
|
|
||||||
html = displacy.render(doc, style="ent", options={"ents": labels})
|
|
||||||
# Newlines seem to mess with the rendering
|
|
||||||
html = html.replace("\n", " ")
|
|
||||||
st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True)
|
|
||||||
attrs = ["text", "label_", "start", "end", "start_char", "end_char"]
|
|
||||||
if "entity_linker" in nlp.pipe_names:
|
|
||||||
attrs.append("kb_id_")
|
|
||||||
data = [
|
|
||||||
[str(getattr(ent, attr)) for attr in attrs]
|
|
||||||
for ent in doc.ents
|
|
||||||
if ent.label_ in labels
|
|
||||||
]
|
|
||||||
df = pd.DataFrame(data, columns=attrs)
|
|
||||||
st.dataframe(df)
|
|
||||||
|
|
||||||
|
|
||||||
if "textcat" in nlp.pipe_names:
|
|
||||||
st.header("Text Classification")
|
|
||||||
st.markdown(f"> {text}")
|
|
||||||
df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score"))
|
|
||||||
st.dataframe(df)
|
|
||||||
|
|
||||||
|
|
||||||
vector_size = nlp.meta.get("vectors", {}).get("width", 0)
|
|
||||||
if vector_size:
|
|
||||||
st.header("Vectors & Similarity")
|
|
||||||
st.code(nlp.meta["vectors"])
|
|
||||||
text1 = st.text_input("Text or word 1", "apple")
|
|
||||||
text2 = st.text_input("Text or word 2", "orange")
|
|
||||||
doc1 = process_text(spacy_model, text1)
|
|
||||||
doc2 = process_text(spacy_model, text2)
|
|
||||||
similarity = doc1.similarity(doc2)
|
|
||||||
if similarity > 0.5:
|
|
||||||
st.success(similarity)
|
|
||||||
else:
|
|
||||||
st.error(similarity)
|
|
||||||
|
|
||||||
st.header("Token attributes")
|
|
||||||
|
|
||||||
if st.button("Show token attributes"):
|
|
||||||
attrs = [
|
|
||||||
"idx",
|
|
||||||
"text",
|
|
||||||
"lemma_",
|
|
||||||
"pos_",
|
|
||||||
"tag_",
|
|
||||||
"dep_",
|
|
||||||
"head",
|
|
||||||
"ent_type_",
|
|
||||||
"ent_iob_",
|
|
||||||
"shape_",
|
|
||||||
"is_alpha",
|
|
||||||
"is_ascii",
|
|
||||||
"is_digit",
|
|
||||||
"is_punct",
|
|
||||||
"like_num",
|
|
||||||
]
|
|
||||||
data = [[str(getattr(token, attr)) for attr in attrs] for token in doc]
|
|
||||||
df = pd.DataFrame(data, columns=attrs)
|
|
||||||
st.dataframe(df)
|
|
||||||
|
|
||||||
|
|
||||||
st.header("JSON Doc")
|
|
||||||
if st.button("Show JSON Doc"):
|
|
||||||
st.json(doc.to_json())
|
|
||||||
|
|
||||||
st.header("JSON model meta")
|
|
||||||
if st.button("Show JSON model meta"):
|
|
||||||
st.json(nlp.meta)
|
|
|
@ -1 +0,0 @@
|
||||||
{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0}
|
|
|
@ -1,434 +0,0 @@
|
||||||
"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes
|
|
||||||
.conllu format for development data, allowing the official scorer to be used.
|
|
||||||
"""
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
import plac
|
|
||||||
import attr
|
|
||||||
from pathlib import Path
|
|
||||||
import re
|
|
||||||
import json
|
|
||||||
import tqdm
|
|
||||||
|
|
||||||
import spacy
|
|
||||||
import spacy.util
|
|
||||||
from spacy.tokens import Token, Doc
|
|
||||||
from spacy.gold import GoldParse
|
|
||||||
from spacy.syntax.nonproj import projectivize
|
|
||||||
from collections import defaultdict
|
|
||||||
from spacy.matcher import Matcher
|
|
||||||
|
|
||||||
import itertools
|
|
||||||
import random
|
|
||||||
import numpy.random
|
|
||||||
|
|
||||||
from bin.ud import conll17_ud_eval
|
|
||||||
|
|
||||||
import spacy.lang.zh
|
|
||||||
import spacy.lang.ja
|
|
||||||
|
|
||||||
spacy.lang.zh.Chinese.Defaults.use_jieba = False
|
|
||||||
spacy.lang.ja.Japanese.Defaults.use_janome = False
|
|
||||||
|
|
||||||
random.seed(0)
|
|
||||||
numpy.random.seed(0)
|
|
||||||
|
|
||||||
|
|
||||||
def minibatch_by_words(items, size=5000):
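    """Shuffle the (doc, gold) pairs and yield batches that each contain
    roughly `size` words. `size` may be an int or an iterable of per-batch
    word budgets."""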
|
|
||||||
random.shuffle(items)
|
|
||||||
if isinstance(size, int):
|
|
||||||
size_ = itertools.repeat(size)
|
|
||||||
else:
|
|
||||||
size_ = size
|
|
||||||
items = iter(items)
|
|
||||||
while True:
|
|
||||||
batch_size = next(size_)
|
|
||||||
batch = []
|
|
||||||
while batch_size >= 0:
|
|
||||||
try:
|
|
||||||
doc, gold = next(items)
|
|
||||||
except StopIteration:
|
|
||||||
if batch:
|
|
||||||
yield batch
|
|
||||||
return
|
|
||||||
batch_size -= len(doc)
|
|
||||||
batch.append((doc, gold))
|
|
||||||
if batch:
|
|
||||||
yield batch
|
|
||||||
else:
|
|
||||||
break
|
|
||||||
|
|
||||||
|
|
||||||
################
|
|
||||||
# Data reading #
|
|
||||||
################
|
|
||||||
|
|
||||||
space_re = re.compile(r"\s+")
|
|
||||||
|
|
||||||
|
|
||||||
def split_text(text):
|
|
||||||
return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")]
|
|
||||||
|
|
||||||
|
|
||||||
def read_data(
|
|
||||||
nlp,
|
|
||||||
conllu_file,
|
|
||||||
text_file,
|
|
||||||
raw_text=True,
|
|
||||||
oracle_segments=False,
|
|
||||||
max_doc_length=None,
|
|
||||||
limit=None,
|
|
||||||
):
|
|
||||||
"""Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True,
|
|
||||||
include Doc objects created using nlp.make_doc and then aligned against
|
|
||||||
the gold-standard sequences. If oracle_segments=True, include Doc objects
|
|
||||||
created from the gold-standard segments. At least one must be True."""
|
|
||||||
if not raw_text and not oracle_segments:
|
|
||||||
raise ValueError("At least one of raw_text or oracle_segments must be True")
|
|
||||||
paragraphs = split_text(text_file.read())
|
|
||||||
conllu = read_conllu(conllu_file)
|
|
||||||
# sd is spacy doc; cd is conllu doc
|
|
||||||
# cs is conllu sent, ct is conllu token
|
|
||||||
docs = []
|
|
||||||
golds = []
|
|
||||||
for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)):
|
|
||||||
sent_annots = []
|
|
||||||
for cs in cd:
|
|
||||||
sent = defaultdict(list)
|
|
||||||
for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs:
|
|
||||||
if "." in id_:
|
|
||||||
continue
|
|
||||||
if "-" in id_:
|
|
||||||
continue
|
|
||||||
id_ = int(id_) - 1
|
|
||||||
head = int(head) - 1 if head != "0" else id_
|
|
||||||
sent["words"].append(word)
|
|
||||||
sent["tags"].append(tag)
|
|
||||||
sent["heads"].append(head)
|
|
||||||
sent["deps"].append("ROOT" if dep == "root" else dep)
|
|
||||||
sent["spaces"].append(space_after == "_")
|
|
||||||
sent["entities"] = ["-"] * len(sent["words"])
|
|
||||||
sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"])
|
|
||||||
if oracle_segments:
|
|
||||||
docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"]))
|
|
||||||
golds.append(GoldParse(docs[-1], **sent))
|
|
||||||
|
|
||||||
sent_annots.append(sent)
|
|
||||||
if raw_text and max_doc_length and len(sent_annots) >= max_doc_length:
|
|
||||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
|
||||||
sent_annots = []
|
|
||||||
docs.append(doc)
|
|
||||||
golds.append(gold)
|
|
||||||
if limit and len(docs) >= limit:
|
|
||||||
return docs, golds
|
|
||||||
|
|
||||||
if raw_text and sent_annots:
|
|
||||||
doc, gold = _make_gold(nlp, None, sent_annots)
|
|
||||||
docs.append(doc)
|
|
||||||
golds.append(gold)
|
|
||||||
if limit and len(docs) >= limit:
|
|
||||||
return docs, golds
|
|
||||||
return docs, golds
|
|
||||||
|
|
||||||
|
|
||||||
def read_conllu(file_):
|
|
||||||
docs = []
|
|
||||||
sent = []
|
|
||||||
doc = []
|
|
||||||
for line in file_:
|
|
||||||
if line.startswith("# newdoc"):
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
doc = []
|
|
||||||
elif line.startswith("#"):
|
|
||||||
continue
|
|
||||||
elif not line.strip():
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
sent = []
|
|
||||||
else:
|
|
||||||
sent.append(list(line.strip().split("\t")))
|
|
||||||
if len(sent[-1]) != 10:
|
|
||||||
print(repr(line))
|
|
||||||
raise ValueError
|
|
||||||
if sent:
|
|
||||||
doc.append(sent)
|
|
||||||
if doc:
|
|
||||||
docs.append(doc)
|
|
||||||
return docs
|
|
||||||
|
|
||||||
|
|
||||||
def _make_gold(nlp, text, sent_annots):
|
|
||||||
# Flatten the conll annotations, and adjust the head indices
|
|
||||||
flat = defaultdict(list)
|
|
||||||
for sent in sent_annots:
|
|
||||||
flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"])
|
|
||||||
for field in ["words", "tags", "deps", "entities", "spaces"]:
|
|
||||||
flat[field].extend(sent[field])
|
|
||||||
# Construct text if necessary
|
|
||||||
assert len(flat["words"]) == len(flat["spaces"])
|
|
||||||
if text is None:
|
|
||||||
text = "".join(
|
|
||||||
word + " " * space for word, space in zip(flat["words"], flat["spaces"])
|
|
||||||
)
|
|
||||||
doc = nlp.make_doc(text)
|
|
||||||
flat.pop("spaces")
|
|
||||||
gold = GoldParse(doc, **flat)
|
|
||||||
return doc, gold
|
|
||||||
|
|
||||||
|
|
||||||
#############################
|
|
||||||
# Data transforms for spaCy #
|
|
||||||
#############################
|
|
||||||
|
|
||||||
|
|
||||||
def golds_to_gold_tuples(docs, golds):
|
|
||||||
"""Get out the annoying 'tuples' format used by begin_training, given the
|
|
||||||
GoldParse objects."""
|
|
||||||
tuples = []
|
|
||||||
for doc, gold in zip(docs, golds):
|
|
||||||
text = doc.text
|
|
||||||
ids, words, tags, heads, labels, iob = zip(*gold.orig_annot)
|
|
||||||
sents = [((ids, words, tags, heads, labels, iob), [])]
|
|
||||||
tuples.append((text, sents))
|
|
||||||
return tuples
|
|
||||||
|
|
||||||
|
|
||||||
##############
|
|
||||||
# Evaluation #
|
|
||||||
##############
|
|
||||||
|
|
||||||
|
|
||||||
def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):
|
|
||||||
with text_loc.open("r", encoding="utf8") as text_file:
|
|
||||||
texts = split_text(text_file.read())
|
|
||||||
docs = list(nlp.pipe(texts))
|
|
||||||
with sys_loc.open("w", encoding="utf8") as out_file:
|
|
||||||
write_conllu(docs, out_file)
|
|
||||||
with gold_loc.open("r", encoding="utf8") as gold_file:
|
|
||||||
gold_ud = conll17_ud_eval.load_conllu(gold_file)
|
|
||||||
with sys_loc.open("r", encoding="utf8") as sys_file:
|
|
||||||
sys_ud = conll17_ud_eval.load_conllu(sys_file)
|
|
||||||
scores = conll17_ud_eval.evaluate(gold_ud, sys_ud)
|
|
||||||
return scores
|
|
||||||
|
|
||||||
|
|
||||||
def write_conllu(docs, file_):
|
|
||||||
merger = Matcher(docs[0].vocab)
|
|
||||||
merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
|
|
||||||
for i, doc in enumerate(docs):
|
|
||||||
matches = merger(doc)
|
|
||||||
spans = [doc[start : end + 1] for _, start, end in matches]
|
|
||||||
offsets = [(span.start_char, span.end_char) for span in spans]
|
|
||||||
for start_char, end_char in offsets:
|
|
||||||
doc.merge(start_char, end_char)
|
|
||||||
file_.write("# newdoc id = {i}\n".format(i=i))
|
|
||||||
for j, sent in enumerate(doc.sents):
|
|
||||||
file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j))
|
|
||||||
file_.write("# text = {text}\n".format(text=sent.text))
|
|
||||||
for k, token in enumerate(sent):
|
|
||||||
file_.write(token._.get_conllu_lines(k) + "\n")
|
|
||||||
file_.write("\n")
|
|
||||||
|
|
||||||
|
|
||||||
def print_progress(itn, losses, ud_scores):
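    """Print one tab-separated row of training metrics (parser/tagger losses
    plus UD evaluation F1 scores); the header row is printed for the first
    epoch only."""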
|
|
||||||
fields = {
|
|
||||||
"dep_loss": losses.get("parser", 0.0),
|
|
||||||
"tag_loss": losses.get("tagger", 0.0),
|
|
||||||
"words": ud_scores["Words"].f1 * 100,
|
|
||||||
"sents": ud_scores["Sentences"].f1 * 100,
|
|
||||||
"tags": ud_scores["XPOS"].f1 * 100,
|
|
||||||
"uas": ud_scores["UAS"].f1 * 100,
|
|
||||||
"las": ud_scores["LAS"].f1 * 100,
|
|
||||||
}
|
|
||||||
header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"]
|
|
||||||
if itn == 0:
|
|
||||||
print("\t".join(header))
|
|
||||||
tpl = "\t".join(
|
|
||||||
(
|
|
||||||
"{:d}",
|
|
||||||
"{dep_loss:.1f}",
|
|
||||||
"{las:.1f}",
|
|
||||||
"{uas:.1f}",
|
|
||||||
"{tags:.1f}",
|
|
||||||
"{sents:.1f}",
|
|
||||||
"{words:.1f}",
|
|
||||||
)
|
|
||||||
)
|
|
||||||
print(tpl.format(itn, **fields))
|
|
||||||
|
|
||||||
|
|
||||||
# def get_sent_conllu(sent, sent_id):
|
|
||||||
# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)]
|
|
||||||
|
|
||||||
|
|
||||||
def get_token_conllu(token, i):
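    """Render a token as CoNLL-U line(s). A token that begins a fused span
    also gets a preceding multi-word token range line (ID-ID)."""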
|
|
||||||
if token._.begins_fused:
|
|
||||||
n = 1
|
|
||||||
while token.nbor(n)._.inside_fused:
|
|
||||||
n += 1
|
|
||||||
id_ = "%d-%d" % (i, i + n)
|
|
||||||
lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"]
|
|
||||||
else:
|
|
||||||
lines = []
|
|
||||||
if token.head.i == token.i:
|
|
||||||
head = 0
|
|
||||||
else:
|
|
||||||
head = i + (token.head.i - token.i) + 1
|
|
||||||
fields = [
|
|
||||||
str(i + 1),
|
|
||||||
token.text,
|
|
||||||
token.lemma_,
|
|
||||||
token.pos_,
|
|
||||||
token.tag_,
|
|
||||||
"_",
|
|
||||||
str(head),
|
|
||||||
token.dep_.lower(),
|
|
||||||
"_",
|
|
||||||
"_",
|
|
||||||
]
|
|
||||||
lines.append("\t".join(fields))
|
|
||||||
return "\n".join(lines)
|
|
||||||
|
|
||||||
|
|
||||||
##################
|
|
||||||
# Initialization #
|
|
||||||
##################
|
|
||||||
|
|
||||||
|
|
||||||
def load_nlp(corpus, config):
|
|
||||||
lang = corpus.split("_")[0]
|
|
||||||
nlp = spacy.blank(lang)
|
|
||||||
if config.vectors:
|
|
||||||
nlp.vocab.from_disk(config.vectors / "vocab")
|
|
||||||
return nlp
|
|
||||||
|
|
||||||
|
|
||||||
def initialize_pipeline(nlp, docs, golds, config):
|
|
||||||
nlp.add_pipe(nlp.create_pipe("parser"))
|
|
||||||
if config.multitask_tag:
|
|
||||||
nlp.parser.add_multitask_objective("tag")
|
|
||||||
if config.multitask_sent:
|
|
||||||
nlp.parser.add_multitask_objective("sent_start")
|
|
||||||
nlp.parser.moves.add_action(2, "subtok")
|
|
||||||
nlp.add_pipe(nlp.create_pipe("tagger"))
|
|
||||||
for gold in golds:
|
|
||||||
for tag in gold.tags:
|
|
||||||
if tag is not None:
|
|
||||||
nlp.tagger.add_label(tag)
|
|
||||||
# Replace labels that didn't make the frequency cutoff
|
|
||||||
actions = set(nlp.parser.labels)
|
|
||||||
label_set = set([act.split("-")[1] for act in actions if "-" in act])
|
|
||||||
for gold in golds:
|
|
||||||
for i, label in enumerate(gold.labels):
|
|
||||||
if label is not None and label not in label_set:
|
|
||||||
gold.labels[i] = label.split("||")[0]
|
|
||||||
return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds))
|
|
||||||
|
|
||||||
|
|
||||||
########################
|
|
||||||
# Command line helpers #
|
|
||||||
########################
|
|
||||||
|
|
||||||
|
|
||||||
@attr.s
|
|
||||||
class Config(object):
|
|
||||||
vectors = attr.ib(default=None)
|
|
||||||
max_doc_length = attr.ib(default=10)
|
|
||||||
multitask_tag = attr.ib(default=True)
|
|
||||||
multitask_sent = attr.ib(default=True)
|
|
||||||
nr_epoch = attr.ib(default=30)
|
|
||||||
batch_size = attr.ib(default=1000)
|
|
||||||
dropout = attr.ib(default=0.2)
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def load(cls, loc):
|
|
||||||
with Path(loc).open("r", encoding="utf8") as file_:
|
|
||||||
cfg = json.load(file_)
|
|
||||||
return cls(**cfg)
|
|
||||||
|
|
||||||
|
|
||||||
class Dataset(object):
|
|
||||||
def __init__(self, path, section):
|
|
||||||
self.path = path
|
|
||||||
self.section = section
|
|
||||||
self.conllu = None
|
|
||||||
self.text = None
|
|
||||||
for file_path in self.path.iterdir():
|
|
||||||
name = file_path.parts[-1]
|
|
||||||
if section in name and name.endswith("conllu"):
|
|
||||||
self.conllu = file_path
|
|
||||||
elif section in name and name.endswith("txt"):
|
|
||||||
self.text = file_path
|
|
||||||
        if self.conllu is None:
            msg = "Could not find .conllu file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
        if self.text is None:
            msg = "Could not find .txt file in {path} for {section}"
            raise IOError(msg.format(section=section, path=path))
|
|
||||||
self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0]
|
|
||||||
|
|
||||||
|
|
||||||
class TreebankPaths(object):
|
|
||||||
def __init__(self, ud_path, treebank, **cfg):
|
|
||||||
self.train = Dataset(ud_path / treebank, "train")
|
|
||||||
self.dev = Dataset(ud_path / treebank, "dev")
|
|
||||||
self.lang = self.train.lang
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
|
||||||
ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path),
|
|
||||||
parses_dir=("Directory to write the development parses", "positional", None, Path),
|
|
||||||
config=("Path to json formatted config file", "positional", None, Config.load),
|
|
||||||
corpus=(
|
|
||||||
"UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora",
|
|
||||||
"positional",
|
|
||||||
None,
|
|
||||||
str,
|
|
||||||
),
|
|
||||||
limit=("Size limit", "option", "n", int),
|
|
||||||
)
|
|
||||||
def main(ud_dir, parses_dir, config, corpus, limit=0):
|
|
||||||
Token.set_extension("get_conllu_lines", method=get_token_conllu)
|
|
||||||
Token.set_extension("begins_fused", default=False)
|
|
||||||
Token.set_extension("inside_fused", default=False)
|
|
||||||
|
|
||||||
paths = TreebankPaths(ud_dir, corpus)
|
|
||||||
if not (parses_dir / corpus).exists():
|
|
||||||
(parses_dir / corpus).mkdir()
|
|
||||||
print("Train and evaluate", corpus, "using lang", paths.lang)
|
|
||||||
nlp = load_nlp(paths.lang, config)
|
|
||||||
|
|
||||||
docs, golds = read_data(
|
|
||||||
nlp,
|
|
||||||
paths.train.conllu.open(encoding="utf8"),
|
|
||||||
paths.train.text.open(encoding="utf8"),
|
|
||||||
max_doc_length=config.max_doc_length,
|
|
||||||
limit=limit,
|
|
||||||
)
|
|
||||||
|
|
||||||
optimizer = initialize_pipeline(nlp, docs, golds, config)
|
|
||||||
|
|
||||||
for i in range(config.nr_epoch):
|
|
||||||
docs = [nlp.make_doc(doc.text) for doc in docs]
|
|
||||||
batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size)
|
|
||||||
losses = {}
|
|
||||||
n_train_words = sum(len(doc) for doc in docs)
|
|
||||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
|
||||||
for batch in batches:
|
|
||||||
batch_docs, batch_gold = zip(*batch)
|
|
||||||
pbar.update(sum(len(doc) for doc in batch_docs))
|
|
||||||
nlp.update(
|
|
||||||
batch_docs,
|
|
||||||
batch_gold,
|
|
||||||
sgd=optimizer,
|
|
||||||
drop=config.dropout,
|
|
||||||
losses=losses,
|
|
||||||
)
|
|
||||||
|
|
||||||
out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i)
|
|
||||||
with nlp.use_params(optimizer.averages):
|
|
||||||
scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path)
|
|
||||||
print_progress(i, losses, scores)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
plac.call(main)
|
|
|
@ -1,114 +0,0 @@
#!/usr/bin/env python
# coding: utf8

"""Example of defining a knowledge base in spaCy,
which is needed to implement entity linking functionality.

For more details, see the documentation:
* Knowledge base: https://spacy.io/api/kb
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking

Compatible with: spaCy v2.2.4
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
from pathlib import Path

from spacy.vocab import Vocab
import spacy
from spacy.kb import KnowledgeBase


# Q2146908 (Russ Cochran): American golfer
# Q7381115 (Russ Cochran): publisher
ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)}


@plac.annotations(
    model=("Model name, should have pretrained word embeddings", "positional", None, str),
    output_dir=("Optional output directory", "option", "o", Path),
)
def main(model=None, output_dir=None):
    """Load the model and create the KB with pre-defined entity encodings.
    If an output_dir is provided, the KB will be stored there in a file 'kb'.
    The updated vocab will also be written to a directory in the output_dir."""

    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)

    # check the length of the nlp vectors
    if "vectors" not in nlp.meta or not nlp.vocab.vectors.size:
        raise ValueError(
            "The `nlp` object should have access to pretrained word vectors, "
            " cf. https://spacy.io/usage/models#languages."
        )

    # You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality.
    # For simplicity, we'll just use the original vector dimension here instead.
    vectors_dim = nlp.vocab.vectors.shape[1]
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim)

    # set up the data
    entity_ids = []
    descr_embeddings = []
    freqs = []
    for key, value in ENTITIES.items():
        desc, freq = value
        entity_ids.append(key)
        descr_embeddings.append(nlp(desc).vector)
        freqs.append(freq)

    # set the entities, can also be done by calling `kb.add_entity` for each entity
    kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=descr_embeddings)

    # adding aliases, the entities need to be defined in the KB beforehand
    kb.add_alias(
        alias="Russ Cochran",
        entities=["Q2146908", "Q7381115"],
        probabilities=[0.24, 0.7],  # the sum of these probabilities should not exceed 1
    )

    # test the trained model
    print()
    _print_kb(kb)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        kb_path = str(output_dir / "kb")
        kb.dump(kb_path)
        print()
        print("Saved KB to", kb_path)

        vocab_path = output_dir / "vocab"
        kb.vocab.to_disk(vocab_path)
        print("Saved vocab to", vocab_path)

        print()

        # test the saved model
        # always reload a knowledge base with the same vocab instance!
        print("Loading vocab from", vocab_path)
        print("Loading KB from", kb_path)
        vocab2 = Vocab().from_disk(vocab_path)
        kb2 = KnowledgeBase(vocab=vocab2)
        kb2.load_bulk(kb_path)
        print()
        _print_kb(kb2)


def _print_kb(kb):
    print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings())
    print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings())


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# 2 kb entities: ['Q2146908', 'Q7381115']
# 1 kb aliases: ['Russ Cochran']
@ -1,89 +0,0 @@
"""This example shows how to add a multi-task objective that is trained
alongside the entity recognizer. This is an alternative to adding features
to the model.

The multi-task idea is to train an auxiliary model to predict some attribute,
with weights shared between the auxiliary model and the main model. In this
example, we're predicting the position of the word in the document.

The model that predicts the position of the word encourages the convolutional
layers to include the position information in their representation. The
information is then available to the main model, as a feature.

The overall idea is that we might know something about what sort of features
we'd like the CNN to extract. The multi-task objectives can encourage the
extraction of this type of feature. The multi-task objective is only used
during training. We discard the auxiliary model before run-time.

The specific example here is not necessarily a good idea --- but it shows
how an arbitrary objective function for some word can be used.

Developed and tested for spaCy 2.0.6. Updated for v2.2.2
"""
import random
import plac
import spacy
import os.path
from spacy.tokens import Doc
from spacy.gold import read_json_file, GoldParse

random.seed(0)

PWD = os.path.dirname(__file__)

TRAIN_DATA = list(read_json_file(
    os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json")))


def get_position_label(i, words, tags, heads, labels, ents):
    """Return labels indicating the position of the word in the document."""
    if len(words) < 20:
        return "short-doc"
    elif i == 0:
        return "first-word"
    elif i < 10:
        return "early-word"
    elif i < 20:
        return "mid-word"
    elif i == len(words) - 1:
        return "last-word"
    else:
        return "late-word"


def main(n_iter=10):
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    ner.add_multitask_objective(get_position_label)
    nlp.add_pipe(ner)
    print(nlp.pipeline)

    print("Create data", len(TRAIN_DATA))
    optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA)
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annot_brackets in TRAIN_DATA:
            for annotations, _ in annot_brackets:
                doc = Doc(nlp.vocab, words=annotations[1])
                gold = GoldParse.from_annot_tuples(doc, annotations)
                nlp.update(
                    [doc],  # batch of texts
                    [gold],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses,
                )
        print(losses.get("nn_labeller", 0.0), losses["ner"])

    # test the trained model
    for text, _ in TRAIN_DATA:
        if text is not None:
            doc = nlp(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


if __name__ == "__main__":
    plac.call(main)
@ -1,217 +0,0 @@
"""This script is experimental.

Try pre-training the CNN component of the text categorizer using a cheap
language modelling-like objective. Specifically, we load pretrained vectors
(from something like word2vec, GloVe, FastText etc), and use the CNN to
predict the tokens' pretrained vectors. This isn't as easy as it sounds:
we're not merely doing compression here, because heavy dropout is applied,
including over the input words. This means the model must often (50% of the time)
use the context in order to predict the word.

To evaluate the technique, we're pre-training with the 50k texts from the IMDB
corpus, and then training with only 100 labels. Note that it's a bit dirty to
pre-train with the development data, but also not *so* terrible: we're not using
the development labels, after all --- only the unlabelled text.
"""
import plac
import tqdm
import random
import spacy
import thinc.extra.datasets
from spacy.util import minibatch, use_gpu, compounding
from spacy._ml import Tok2Vec
from spacy.pipeline import TextCategorizer
import numpy


def load_texts(limit=0):
    train, dev = thinc.extra.datasets.imdb()
    train_texts, train_labels = zip(*train)
    dev_texts, dev_labels = zip(*dev)
    train_texts = list(train_texts)
    dev_texts = list(dev_texts)
    random.shuffle(train_texts)
    random.shuffle(dev_texts)
    if limit >= 1:
        return train_texts[:limit]
    else:
        return list(train_texts) + list(dev_texts)


def load_textcat_data(limit=0):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, eval_data = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    eval_texts, eval_labels = zip(*eval_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels]
    return (texts, cats), (eval_texts, eval_cats)


def prefer_gpu():
    used = spacy.util.use_gpu(0)
    if used is None:
        return False
    else:
        import cupy.random

        cupy.random.seed(0)
        return True


def build_textcat_model(tok2vec, nr_class, width):
    from thinc.v2v import Model, Softmax, Maxout
    from thinc.api import flatten_add_lengths, chain
    from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool
    from thinc.misc import Residual, LayerNorm
    from spacy._ml import logistic, zero_init

    with Model.define_operators({">>": chain}):
        model = (
            tok2vec
            >> flatten_add_lengths
            >> Pooling(mean_pool)
            >> Softmax(nr_class, width)
        )
    model.tok2vec = tok2vec
    return model


def block_gradients(model):
    from thinc.api import wrap

    def forward(X, drop=0.0):
        Y, _ = model.begin_update(X, drop=drop)
        return Y, None

    return wrap(forward, model)


def create_pipeline(width, embed_size, vectors_model):
    print("Load vectors")
    nlp = spacy.load(vectors_model)
    print("Start training")
    textcat = TextCategorizer(
        nlp.vocab,
        labels=["POSITIVE", "NEGATIVE"],
        model=build_textcat_model(
            Tok2Vec(width=width, embed_size=embed_size), 2, width
        ),
    )

    nlp.add_pipe(textcat)
    return nlp


def train_tensorizer(nlp, texts, dropout, n_iter):
    tensorizer = nlp.create_pipe("tensorizer")
    nlp.add_pipe(tensorizer)
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        losses = {}
        for i, batch in enumerate(minibatch(tqdm.tqdm(texts))):
            docs = [nlp.make_doc(text) for text in batch]
            tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout)
        print(losses)
    return optimizer


def train_textcat(nlp, n_texts, n_iter=10):
    textcat = nlp.get_pipe("textcat")
    tok2vec_weights = textcat.model.tok2vec.to_bytes()
    (train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts)
    print(
        "Using {} examples ({} training, {} evaluation)".format(
            n_texts, len(train_texts), len(dev_texts)
        )
    )
    train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        textcat.model.tok2vec.from_bytes(tok2vec_weights)
        print("Training the model...")
        print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
        for i in range(n_iter):
            losses = {"textcat": 0.0}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(tqdm.tqdm(train_data), size=2)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                    losses["textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )


def evaluate_textcat(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8
    fp = 1e-8
    tn = 1e-8
    fn = 1e-8
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


@plac.annotations(
    width=("Width of CNN layers", "positional", None, int),
    embed_size=("Embedding rows", "positional", None, int),
    pretrain_iters=("Number of iterations to pretrain", "option", "pn", int),
    train_iters=("Number of iterations to train", "option", "tn", int),
    train_examples=("Number of labelled examples", "option", "eg", int),
    vectors_model=("Name or path to vectors model to learn from"),
)
def main(
    width,
    embed_size,
    vectors_model,
    pretrain_iters=30,
    train_iters=30,
    train_examples=1000,
):
    random.seed(0)
    numpy.random.seed(0)
    use_gpu = prefer_gpu()
    print("Using GPU?", use_gpu)

    nlp = create_pipeline(width, embed_size, vectors_model)
    print("Load data")
    texts = load_texts(limit=0)
    print("Train tensorizer")
    optimizer = train_tensorizer(nlp, texts, dropout=0.2, n_iter=pretrain_iters)
    print("Train textcat")
    train_textcat(nlp, train_examples, n_iter=train_iters)


if __name__ == "__main__":
    plac.call(main)
@ -1,97 +0,0 @@
"""Prevent catastrophic forgetting with rehearsal updates."""
import plac
import random
import warnings
import srsly
import spacy
from spacy.gold import GoldParse
from spacy.util import minibatch, compounding


LABEL = "ANIMAL"
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, "ANIMAL")]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, "ANIMAL")]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, "ANIMAL")]},
    ),
    ("horses?", {"entities": [(0, 6, "ANIMAL")]}),
]


def read_raw_data(nlp, jsonl_loc):
    for json_obj in srsly.read_jsonl(jsonl_loc):
        if json_obj["text"].strip():
            doc = nlp.make_doc(json_obj["text"])
            yield doc


def read_gold_data(nlp, gold_loc):
    docs = []
    golds = []
    for json_obj in srsly.read_jsonl(gold_loc):
        doc = nlp.make_doc(json_obj["text"])
        ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]]
        gold = GoldParse(doc, entities=ents)
        docs.append(doc)
        golds.append(gold)
    return list(zip(docs, golds))


def main(model_name, unlabelled_loc):
    n_iter = 10
    dropout = 0.2
    batch_size = 4
    nlp = spacy.load(model_name)
    nlp.get_pipe("ner").add_label(LABEL)
    raw_docs = list(read_raw_data(nlp, unlabelled_loc))
    optimizer = nlp.resume_training()
    # Avoid use of Adam when resuming training. I don't understand this well
    # yet, but I'm getting weird results from Adam. Try commenting out the
    # nlp.update(), and using Adam -- you'll find the models drift apart.
    # I guess Adam is losing precision, introducing gradient noise?
    optimizer.alpha = 0.1
    optimizer.b1 = 0.0
    optimizer.b2 = 0.0

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    sizes = compounding(1.0, 4.0, 1.001)
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            random.shuffle(raw_docs)
            losses = {}
            r_losses = {}
            # batch up the examples using spaCy's minibatch
            raw_batches = minibatch(raw_docs, size=4)
            for batch in minibatch(TRAIN_DATA, size=sizes):
                docs, golds = zip(*batch)
                nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses)
                raw_batch = list(next(raw_batches))
                nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses)
            print("Losses", losses)
            print("R. Losses", r_losses)
    print(nlp.get_pipe("ner").model.unseen_classes)
    test_text = "Do you like horses?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
@ -1,177 +0,0 @@
#!/usr/bin/env python
# coding: utf8

"""Example of training spaCy's entity linker, starting off with a predefined
knowledge base and corresponding vocab, and a blank English model.

For more details, see the documentation:
* Training: https://spacy.io/usage/training
* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking

Compatible with: spaCy v2.2.4
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path

from spacy.vocab import Vocab

import spacy
from spacy.kb import KnowledgeBase
from spacy.pipeline import EntityRuler
from spacy.util import minibatch, compounding


def sample_train_data():
    train_data = []

    # Q2146908 (Russ Cochran): American golfer
    # Q7381115 (Russ Cochran): publisher

    text_1 = "Russ Cochran his reprints include EC Comics."
    dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_1, {"links": dict_1}))

    text_2 = "Russ Cochran has been publishing comic art."
    dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}
    train_data.append((text_2, {"links": dict_2}))

    text_3 = "Russ Cochran captured his first major title with his son as caddie."
    dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_3, {"links": dict_3}))

    text_4 = "Russ Cochran was a member of University of Kentucky's golf team."
    dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}
    train_data.append((text_4, {"links": dict_4}))

    return train_data


# training data
TRAIN_DATA = sample_train_data()


@plac.annotations(
    kb_path=("Path to the knowledge base", "positional", None, Path),
    vocab_path=("Path to the vocab for the kb", "positional", None, Path),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(kb_path, vocab_path, output_dir=None, n_iter=50):
    """Create a blank model with the specified vocab, set up the pipeline and train the entity linker.
    The `vocab` should be the one used during creation of the KB."""
    # create blank English model with correct vocab
    nlp = spacy.blank("en")
    nlp.vocab.from_disk(vocab_path)
    nlp.vocab.vectors.name = "spacy_pretrained_vectors"
    print("Created blank 'en' model with vocab from '%s'" % vocab_path)

    # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy.
    nlp.add_pipe(nlp.create_pipe('sentencizer'))

    # Add a custom component to recognize "Russ Cochran" as an entity for the example training data.
    # Note that in a realistic application, an actual NER algorithm should be used instead.
    ruler = EntityRuler(nlp)
    patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    # Create the Entity Linker component and add it to the pipeline.
    if "entity_linker" not in nlp.pipe_names:
        # use only the predicted EL score and not the prior probability (for demo purposes)
        cfg = {"incl_prior": False}
        entity_linker = nlp.create_pipe("entity_linker", cfg)
        kb = KnowledgeBase(vocab=nlp.vocab)
        kb.load_bulk(kb_path)
        print("Loaded Knowledge Base from '%s'" % kb_path)
        entity_linker.set_kb(kb)
        nlp.add_pipe(entity_linker, last=True)

    # Convert the texts to docs to make sure we have doc.ents set for the training examples.
    # Also ensure that the annotated examples correspond to known identifiers in the knowledge base.
    kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings()
    TRAIN_DOCS = []
    for text, annotation in TRAIN_DATA:
        with nlp.disable_pipes("entity_linker"):
            doc = nlp(text)
        annotation_clean = annotation
        for offset, kb_id_dict in annotation["links"].items():
            new_dict = {}
            for kb_id, value in kb_id_dict.items():
                if kb_id in kb_ids:
                    new_dict[kb_id] = value
                else:
                    print(
                        "Removed", kb_id, "from training because it is not in the KB."
                    )
            annotation_clean["links"][offset] = new_dict
        TRAIN_DOCS.append((doc, annotation_clean))

    # get names of other pipes to disable them during training
    pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train entity linker
        # reset and initialize the weights randomly
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DOCS)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                    sgd=optimizer,
                )
            print(itn, "Losses", losses)

    # test the trained model
    _apply_model(nlp)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print()
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        _apply_model(nlp2)


def _apply_model(nlp):
    for text, annotation in TRAIN_DATA:
        # apply the entity linker which will now make predictions for the 'Russ Cochran' entities
        doc = nlp(text)
        print()
        print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc])


if __name__ == "__main__":
    plac.call(main)

# Expected output (can be shuffled):

# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ("his", '', ''), ('reprints', '', ''), ('include', '', ''), ('The', '', ''), ('Complete', '', ''), ('EC', '', ''), ('Library', '', ''), ('.', '', '')]

# Entities[('Russ Cochran', 'PERSON', 'Q7381115')]
# Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('has', '', ''), ('been', '', ''), ('publishing', '', ''), ('comic', '', ''), ('art', '', ''), ('.', '', '')]

# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('captured', '', ''), ('his', '', ''), ('first', '', ''), ('major', '', ''), ('title', '', ''), ('with', '', ''), ('his', '', ''), ('son', '', ''), ('as', '', ''), ('caddie', '', ''), ('.', '', '')]

# Entities[('Russ Cochran', 'PERSON', 'Q2146908')]
# Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('was', '', ''), ('a', '', ''), ('member', '', ''), ('of', '', ''), ('University', '', ''), ('of', '', ''), ('Kentucky', '', ''), ("'s", '', ''), ('golf', '', ''), ('team', '', ''), ('.', '', '')]
@ -1,195 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics

spaCy's parser component can be trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.

"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data: texts, heads and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
    (
        "find a cafe with great wifi",
        {
            "heads": [0, 2, 0, 5, 5, 2],  # index of token head
            "deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
        },
    ),
    (
        "find a hotel near the beach",
        {
            "heads": [0, 2, 0, 5, 5, 2],
            "deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
        },
    ),
    (
        "find me the closest gym that's open late",
        {
            "heads": [0, 0, 4, 4, 0, 6, 4, 6, 6],
            "deps": [
                "ROOT",
                "-",
                "-",
                "QUALITY",
                "PLACE",
                "-",
                "-",
                "ATTRIBUTE",
                "TIME",
            ],
        },
    ),
    (
        "show me the cheapest store that sells flowers",
        {
            "heads": [0, 0, 4, 4, 0, 4, 4, 4],  # attach "flowers" to store!
            "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"],
        },
    ),
    (
        "find a nice restaurant in london",
        {
            "heads": [0, 3, 3, 0, 3, 3],
            "deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
        },
    ),
    (
        "show me the coolest hostel in berlin",
        {
            "heads": [0, 0, 4, 4, 0, 4, 4],
            "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"],
        },
    ),
    (
        "find a good italian restaurant near work",
        {
            "heads": [0, 4, 4, 4, 0, 4, 5],
            "deps": [
                "ROOT",
                "-",
                "QUALITY",
                "ATTRIBUTE",
                "PLACE",
                "ATTRIBUTE",
                "LOCATION",
            ],
        },
    ),
]


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=15):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # We'll use the built-in dependency parser class, but we want to create a
    # fresh instance – just in case.
    if "parser" in nlp.pipe_names:
        nlp.remove_pipe("parser")
    parser = nlp.create_pipe("parser")
    nlp.add_pipe(parser, first=True)

    for text, annotations in TRAIN_DATA:
        for dep in annotations.get("deps", []):
            parser.add_label(dep)

    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_model(nlp)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        test_model(nlp2)


def test_model(nlp):
    texts = [
        "find a hotel with good wifi",
        "find me the cheapest gym near work",
        "show me the best hotel in berlin",
    ]
    docs = nlp.pipe(texts)
    for doc in docs:
        print(doc.text)
        print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find'),
# ('near', 'ATTRIBUTE', 'gym'),
# ('work', 'LOCATION', 'near')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]
@ -1,117 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.

For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities

Compatible with: spaCy v2.0.0+
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# Entities [('Shaka Khan', 'PERSON')]
# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
# ('Khan', 'PERSON', 1), ('?', '', 2)]
# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
@ -1,144 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training an additional entity type

This script shows how to add a new entity type to an existing pretrained NER
model. To keep the example short and simple, only four sentences are provided
as examples. In practice, you'll need many more — a few hundred would be a
good start. You will also likely need to mix in examples of other entity
types, which might be obtained by running the entity recognizer over unlabelled
sentences, and adding their annotations to the training set.

The actual training is performed by looping over the examples, and calling
`nlp.entity.update()`. The `update()` method steps through the words of the
input. At each word, it makes a prediction. It then consults the annotations
provided on the GoldParse instance, to see whether it was right. If it was
wrong, it adjusts its weights so that the correct action will score higher
next time.

After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.

For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities

Compatible with: spaCy v2.1.0+
Last tested with: v2.2.4
"""
from __future__ import unicode_literals, print_function

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# new entity label
LABEL = "ANIMAL"

# training data
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "Do you like horses?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
@ -1,111 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy dependency parser, starting off with an existing
model or a blank model. For more details, see the documentation:
* Training: https://spacy.io/usage/training
* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = [
    (
        "They trade mortgage-backed securities.",
        {
            "heads": [1, 1, 4, 4, 5, 1, 1],
            "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
        },
    ),
    (
        "I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },
    ),
]


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=15):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "parser" not in nlp.pipe_names:
        parser = nlp.create_pipe("parser")
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe("parser")

    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get("deps", []):
            parser.add_label(dep)

    # get names of other pipes to disable them during training
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])


if __name__ == "__main__":
    plac.call(main)

# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]
@ -1,101 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""
A simple example for training a part-of-speech tagger with a custom tag map.
To allow us to update the tag map with our custom one, this example starts off
with a blank Language class and modifies its defaults. For more details, see
the documentation:
* Training: https://spacy.io/usage/training
* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging

Compatible with: spaCy v2.0.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
# See here for the Universal Tag Set:
# http://universaldependencies.github.io/docs/u/pos/index.html
# You may also specify morphological features for your tags, from the universal
# scheme.
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}

# Usually you'll read this in, of course. Data formats vary. Ensure your
# strings are unicode and that the number of tags assigned matches spaCy's
# tokenization. If not, you can always add a 'words' key to the annotations
# that specifies the gold-standard tokenization, e.g.:
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]


@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(lang="en", output_dir=None, n_iter=25):
    """Create a new model, set up the pipeline and train the tagger. In order to
    train the tagger with a custom tag map, we're creating a new Language
    instance with a custom vocab.
    """
    nlp = spacy.blank(lang)
    # add the tagger to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    tagger = nlp.create_pipe("tagger")
    # Add the tags. This needs to be done before you start training.
    for tag, values in TAG_MAP.items():
        tagger.add_label(tag, values)
    nlp.add_pipe(tagger)

    optimizer = nlp.begin_training()
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print("Losses", losses)

    # test the trained model
    test_text = "I like blue eggs"
    doc = nlp(test_text)
    print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])


if __name__ == "__main__":
    plac.call(main)

# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]
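Not part of the diff: the comment in the deleted example insists that the number of tags matches spaCy's tokenization. A minimal sanity check, assuming only a blank "en" pipeline and the TRAIN_DATA shown above:

import spacy

nlp = spacy.blank("en")
TRAIN_DATA = [
    ("I like green eggs", {"tags": ["N", "V", "J", "N"]}),
    ("Eat blue ham", {"tags": ["V", "J", "N"]}),
]
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)  # tokenize only, no pipeline components
    assert len(doc) == len(annotations["tags"]), (text, len(doc))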
@ -1,160 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
see the documentation:
* Training: https://spacy.io/usage/training

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets

import spacy
from spacy.util import minibatch, compounding


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_texts=("Number of texts to train from", "option", "t", int),
    n_iter=("Number of training iterations", "option", "n", int),
    init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path),
)
def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None):
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()

    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
        )
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe("textcat")

    # add label to text classifier
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")

    # load the IMDB dataset
    print("Loading IMDB data...")
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data()
    train_texts = train_texts[:n_texts]
    train_cats = train_cats[:n_texts]
    print(
        "Using {} examples ({} training, {} evaluation)".format(
            n_texts, len(train_texts), len(dev_texts)
        )
    )
    train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # get names of other pipes to disable them during training
    pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        if init_tok2vec is not None:
            with init_tok2vec.open("rb") as file_:
                textcat.model.tok2vec.from_bytes(file_.read())
        print("Training the model...")
        print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                    losses["textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )

    # test the trained model
    test_text = "This movie sucked"
    doc = nlp(test_text)
    print(test_text, doc.cats)

    if output_dir is not None:
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        print(test_text, doc2.cats)


def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])


def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "NEGATIVE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}


if __name__ == "__main__":
    plac.call(main)
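Not part of the diff: a toy illustration of the {"cats": ...} annotation format that load_data() builds from the IMDB labels (1 maps to POSITIVE, 0 to NEGATIVE); the texts below are made up, not IMDB data.

labels = [1, 0, 1]
texts = ["good", "bad", "great"]  # placeholder texts
cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
train_data = list(zip(texts, [{"cats": c} for c in cats]))
print(train_data[0])  # ('good', {'cats': {'POSITIVE': True, 'NEGATIVE': False}})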
@ -1,49 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using fastText
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import plac
import numpy

import spacy
from spacy.language import Language


@plac.annotations(
    vectors_loc=("Path to .vec file", "positional", None, str),
    lang=(
        "Optional language ID. If not set, blank Language() will be used.",
        "positional",
        None,
        str,
    ),
)
def main(vectors_loc, lang=None):
    if lang is None:
        nlp = Language()
    else:
        # create empty language class – this is required if you're planning to
        # save the model to disk and load it back later (models always need a
        # "lang" setting). Use 'xx' for blank multi-language class.
        nlp = spacy.blank(lang)
    with open(vectors_loc, "rb") as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        for line in file_:
            line = line.rstrip().decode("utf8")
            pieces = line.rsplit(" ", int(nr_dim))
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
            nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
    # test the vectors and similarity
    text = "class colspan"
    doc = nlp(text)
    print(text, doc[0].similarity(doc[1]))


if __name__ == "__main__":
    plac.call(main)
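Not part of the diff: the .vec format the loader above parses, sketched on a fake two-row file (a header "n_rows n_dims", then one "word v1 ... vN" line per vector).

import numpy

lines = [b"2 3\n", b"hello 0.1 0.2 0.3\n", b"world 0.4 0.5 0.6\n"]  # fake .vec content
nr_row, nr_dim = lines[0].split()
for line in lines[1:]:
    pieces = line.rstrip().decode("utf8").rsplit(" ", int(nr_dim))
    word = pieces[0]
    vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
    print(word, vector.shape)  # e.g. hello (3,)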
@ -1,105 +0,0 @@
#!/usr/bin/env python
# coding: utf8
"""Visualize spaCy word vectors in Tensorboard.

Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
"""
from __future__ import unicode_literals

from os import path

import tqdm
import math
import numpy
import plac
import spacy
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins.projector import (
    visualize_embeddings,
    ProjectorConfig,
)


@plac.annotations(
    vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str),
    out_loc=(
        "Path to output folder for tensorboard session data",
        "positional",
        None,
        str,
    ),
    name=(
        "Human readable name for tsv file and vectors tensor",
        "positional",
        None,
        str,
    ),
)
def main(vectors_loc, out_loc, name="spaCy_vectors"):
    meta_file = "{}.tsv".format(name)
    out_meta_file = path.join(out_loc, meta_file)

    print("Loading spaCy vectors model: {}".format(vectors_loc))
    model = spacy.load(vectors_loc)
    print("Finding lexemes with vectors attached: {}".format(vectors_loc))
    strings_stream = tqdm.tqdm(
        model.vocab.strings, total=len(model.vocab.strings), leave=False
    )
    queries = [w for w in strings_stream if model.vocab.has_vector(w)]
    vector_count = len(queries)

    print(
        "Building Tensorboard Projector metadata for ({}) vectors: {}".format(
            vector_count, out_meta_file
        )
    )

    # Store vector data in a tensorflow variable
    tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1]))

    # Write a tab-separated file that contains information about the vectors for visualization
    #
    # Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata
    with open(out_meta_file, "wb") as file_metadata:
        # Define columns in the first row
        file_metadata.write("Text\tFrequency\n".encode("utf-8"))
        # Write out a row for each vector that we add to the tensorflow variable we created
        vec_index = 0
        for text in tqdm.tqdm(queries, total=len(queries), leave=False):
            # https://github.com/tensorflow/tensorflow/issues/9094
            text = "<Space>" if text.lstrip() == "" else text
            lex = model.vocab[text]

            # Store vector data and metadata
            tf_vectors_variable[vec_index] = model.vocab.get_vector(text)
            file_metadata.write(
                "{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode(
                    "utf-8"
                )
            )
            vec_index += 1

    print("Running Tensorflow Session...")
    sess = tf.InteractiveSession()
    tf.Variable(tf_vectors_variable, trainable=False, name=name)
    tf.global_variables_initializer().run()
    saver = tf.train.Saver()
    writer = tf.summary.FileWriter(out_loc, sess.graph)

    # Link the embeddings into the config
    config = ProjectorConfig()
    embed = config.embeddings.add()
    embed.tensor_name = name
    embed.metadata_path = meta_file

    # Tell the projector about the configured embeddings and metadata file
    visualize_embeddings(writer, config)

    # Save session and print run command to the output
    print("Saving Tensorboard Session...")
    saver.save(sess, path.join(out_loc, "{}.ckpt".format(name)))
    print("Done. Run `tensorboard --logdir={0}` to view in Tensorboard".format(out_loc))


if __name__ == "__main__":
    plac.call(main)
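Not part of the diff: the metadata TSV the Tensorboard script writes, sketched with made-up words and log-probabilities (the output file name here is arbitrary).

import math

vector_count = 2
rows = [("hello", -10.5), ("world", -12.0)]  # (text, lex.prob) pairs, invented
with open("spaCy_vectors.tsv", "wb") as file_metadata:
    file_metadata.write("Text\tFrequency\n".encode("utf-8"))
    for text, prob in rows:
        row = "{}\t{}\n".format(text, math.exp(prob) * vector_count)
        file_metadata.write(row.encode("utf-8"))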
@ -1,20 +1,21 @@
from pathlib import Path
import plac
import spacy
-from spacy.gold import docs_to_json
+from spacy.training import docs_to_json
import srsly
import sys


@plac.annotations(
    model=("Model name. Defaults to 'en'.", "option", "m", str),
    input_file=("Input file (jsonl)", "positional", None, Path),
    output_dir=("Output directory", "positional", None, Path),
    n_texts=("Number of texts to convert", "option", "t", int),
)
-def convert(model='en', input_file=None, output_dir=None, n_texts=0):
+def convert(model="en", input_file=None, output_dir=None, n_texts=0):
    # Load model with tokenizer + sentencizer only
    nlp = spacy.load(model)
-    nlp.disable_pipes(*nlp.pipe_names)
+    nlp.select_pipes(disable=nlp.pipe_names)
    sentencizer = nlp.create_pipe("sentencizer")
    nlp.add_pipe(sentencizer, first=True)

@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0):

    srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])


if __name__ == "__main__":
    plac.call(convert)
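Not part of the diff, but the two renames the hunks above make (spacy.gold becomes spacy.training, disable_pipes becomes select_pipes) can be tried in isolation. A minimal sketch, assuming a spaCy version that already ships select_pipes (v3+):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# old spelling: nlp.disable_pipes(*nlp.pipe_names)
nlp.select_pipes(disable=nlp.pipe_names)
print(nlp.pipe_names)  # [] -- every component is now disabled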
154
fabfile.py
vendored
@ -1,154 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals, print_function

import contextlib
from pathlib import Path
from fabric.api import local, lcd, env, settings, prefix
from os import path, environ
import shutil
import sys


PWD = path.dirname(__file__)
ENV = environ["VENV_DIR"] if "VENV_DIR" in environ else ".env"
VENV_DIR = Path(PWD) / ENV


@contextlib.contextmanager
def virtualenv(name, create=False, python="/usr/bin/python3.6"):
    python = Path(python).resolve()
    env_path = VENV_DIR
    if create:
        if env_path.exists():
            shutil.rmtree(str(env_path))
        local("{python} -m venv {env_path}".format(python=python, env_path=VENV_DIR))

    def wrapped_local(cmd, env_vars=[], capture=False, direct=False):
        return local(
            "source {}/bin/activate && {}".format(env_path, cmd),
            shell="/bin/bash",
            capture=False,
        )

    yield wrapped_local


def env(lang="python3.6"):
    if VENV_DIR.exists():
        local("rm -rf {env}".format(env=VENV_DIR))
    if lang.startswith("python3"):
        local("{lang} -m venv {env}".format(lang=lang, env=VENV_DIR))
    else:
        local("{lang} -m pip install virtualenv --no-cache-dir".format(lang=lang))
        local(
            "{lang} -m virtualenv {env} --no-cache-dir".format(lang=lang, env=VENV_DIR)
        )
    with virtualenv(VENV_DIR) as venv_local:
        print(venv_local("python --version", capture=True))
        venv_local("pip install --upgrade setuptools --no-cache-dir")
        venv_local("pip install pytest --no-cache-dir")
        venv_local("pip install wheel --no-cache-dir")
        venv_local("pip install -r requirements.txt --no-cache-dir")
        venv_local("pip install pex --no-cache-dir")


def install():
    with virtualenv(VENV_DIR) as venv_local:
        venv_local("pip install dist/*.tar.gz")


def make():
    with lcd(path.dirname(__file__)):
        local(
            "export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace",
            shell="/bin/bash",
        )


def sdist():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("python -m pip install -U setuptools srsly")
            venv_local("python setup.py sdist")


def wheel():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("python setup.py bdist_wheel")


def pex():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            sha = local("git rev-parse --short HEAD", capture=True)
            venv_local(
                "pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True
            )


def clean():
    with lcd(path.dirname(__file__)):
        local("rm -f dist/*.whl")
        local("rm -f dist/*.pex")
        with virtualenv(VENV_DIR) as venv_local:
            venv_local("python setup.py clean --all")


def test():
    with virtualenv(VENV_DIR) as venv_local:
        with lcd(path.dirname(__file__)):
            venv_local("pytest -x spacy/tests")


def train():
    args = environ.get("SPACY_TRAIN_ARGS", "")
    with virtualenv(VENV_DIR) as venv_local:
        venv_local("spacy train {args}".format(args=args))


def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=""):
    is_not_clean = local("git status --porcelain", capture=True)
    if is_not_clean:
        print("Repository is not clean")
        print(is_not_clean)
        sys.exit(1)
    git_sha = local("git rev-parse --short HEAD", capture=True)
    config_checksum = local("sha256sum {config}".format(config=config), capture=True)
    experiment_dir = Path(experiment_dir) / "{}--{}".format(
        config_checksum[:6], git_sha
    )
    if not experiment_dir.exists():
        experiment_dir.mkdir()
    test_data_dir = Path(treebank_dir) / "ud-test-v2.0-conll2017"
    assert test_data_dir.exists()
    assert test_data_dir.is_dir()
    if corpus:
        corpora = [corpus]
    else:
        corpora = ["UD_English", "UD_Chinese", "UD_Japanese", "UD_Vietnamese"]

    local(
        "cp {config} {experiment_dir}/config.json".format(
            config=config, experiment_dir=experiment_dir
        )
    )
    with virtualenv(VENV_DIR) as venv_local:
        for corpus in corpora:
            venv_local(
                "spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}".format(
                    treebank_dir=treebank_dir,
                    experiment_dir=experiment_dir,
                    config=config,
                    corpus=corpus,
                    vectors_dir=vectors_dir,
                )
            )
            venv_local(
                "spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}".format(
                    test_data_dir=test_data_dir,
                    experiment_dir=experiment_dir,
                    config=config,
                    corpus=corpus,
                )
            )
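Not part of the diff: the fabfile's virtualenv() helper wraps fabric's local() so each command runs inside the project venv. The same pattern with only the standard library, assuming a ".env" venv already exists, looks roughly like this:

import contextlib
import subprocess

@contextlib.contextmanager
def virtualenv(env_path=".env"):
    def venv_local(cmd):
        # prepend the activate step, mirroring the fabfile's wrapped_local helper
        full_cmd = "source {}/bin/activate && {}".format(env_path, cmd)
        return subprocess.run(full_cmd, shell=True, executable="/bin/bash", check=True)
    yield venv_local

with virtualenv(".env") as venv_local:
    venv_local("python --version")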
@ -1,259 +0,0 @@
|
||||||
// ISO C9x compliant stdint.h for Microsoft Visual Studio
|
|
||||||
// Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
|
|
||||||
//
|
|
||||||
// Copyright (c) 2006-2013 Alexander Chemeris
|
|
||||||
//
|
|
||||||
// Redistribution and use in source and binary forms, with or without
|
|
||||||
// modification, are permitted provided that the following conditions are met:
|
|
||||||
//
|
|
||||||
// 1. Redistributions of source code must retain the above copyright notice,
|
|
||||||
// this list of conditions and the following disclaimer.
|
|
||||||
//
|
|
||||||
// 2. Redistributions in binary form must reproduce the above copyright
|
|
||||||
// notice, this list of conditions and the following disclaimer in the
|
|
||||||
// documentation and/or other materials provided with the distribution.
|
|
||||||
//
|
|
||||||
// 3. Neither the name of the product nor the names of its contributors may
|
|
||||||
// be used to endorse or promote products derived from this software
|
|
||||||
// without specific prior written permission.
|
|
||||||
//
|
|
||||||
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
|
|
||||||
// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
|
||||||
// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
|
|
||||||
// EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
|
||||||
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
|
||||||
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
|
|
||||||
// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
|
|
||||||
// WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
|
|
||||||
// OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
|
|
||||||
// ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
||||||
//
|
|
||||||
///////////////////////////////////////////////////////////////////////////////
|
|
||||||
|
|
||||||
#ifndef _MSC_VER // [
|
|
||||||
#error "Use this header only with Microsoft Visual C++ compilers!"
|
|
||||||
#endif // _MSC_VER ]
|
|
||||||
|
|
||||||
#ifndef _MSC_STDINT_H_ // [
|
|
||||||
#define _MSC_STDINT_H_
|
|
||||||
|
|
||||||
#if _MSC_VER > 1000
|
|
||||||
#pragma once
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if _MSC_VER >= 1600 // [
|
|
||||||
#include <stdint.h>
|
|
||||||
#else // ] _MSC_VER >= 1600 [
|
|
||||||
|
|
||||||
#include <limits.h>
|
|
||||||
|
|
||||||
// For Visual Studio 6 in C++ mode and for many Visual Studio versions when
|
|
||||||
// compiling for ARM we should wrap <wchar.h> include with 'extern "C++" {}'
|
|
||||||
// or compiler give many errors like this:
|
|
||||||
// error C2733: second C linkage of overloaded function 'wmemchr' not allowed
|
|
||||||
#ifdef __cplusplus
|
|
||||||
extern "C" {
|
|
||||||
#endif
|
|
||||||
# include <wchar.h>
|
|
||||||
#ifdef __cplusplus
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
|
|
||||||
// Define _W64 macros to mark types changing their size, like intptr_t.
|
|
||||||
#ifndef _W64
|
|
||||||
# if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300
|
|
||||||
# define _W64 __w64
|
|
||||||
# else
|
|
||||||
# define _W64
|
|
||||||
# endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
// 7.18.1 Integer types
|
|
||||||
|
|
||||||
// 7.18.1.1 Exact-width integer types
|
|
||||||
|
|
||||||
// Visual Studio 6 and Embedded Visual C++ 4 doesn't
|
|
||||||
// realize that, e.g. char has the same size as __int8
|
|
||||||
// so we give up on __intX for them.
|
|
||||||
#if (_MSC_VER < 1300)
|
|
||||||
typedef signed char int8_t;
|
|
||||||
typedef signed short int16_t;
|
|
||||||
typedef signed int int32_t;
|
|
||||||
typedef unsigned char uint8_t;
|
|
||||||
typedef unsigned short uint16_t;
|
|
||||||
typedef unsigned int uint32_t;
|
|
||||||
#else
|
|
||||||
typedef signed __int8 int8_t;
|
|
||||||
typedef signed __int16 int16_t;
|
|
||||||
typedef signed __int32 int32_t;
|
|
||||||
typedef unsigned __int8 uint8_t;
|
|
||||||
typedef unsigned __int16 uint16_t;
|
|
||||||
typedef unsigned __int32 uint32_t;
|
|
||||||
#endif
|
|
||||||
typedef signed __int64 int64_t;
|
|
||||||
typedef unsigned __int64 uint64_t;
|
|
||||||
|
|
||||||
|
|
||||||
// 7.18.1.2 Minimum-width integer types
|
|
||||||
typedef int8_t int_least8_t;
|
|
||||||
typedef int16_t int_least16_t;
|
|
||||||
typedef int32_t int_least32_t;
|
|
||||||
typedef int64_t int_least64_t;
|
|
||||||
typedef uint8_t uint_least8_t;
|
|
||||||
typedef uint16_t uint_least16_t;
|
|
||||||
typedef uint32_t uint_least32_t;
|
|
||||||
typedef uint64_t uint_least64_t;
|
|
||||||
|
|
||||||
// 7.18.1.3 Fastest minimum-width integer types
|
|
||||||
typedef int8_t int_fast8_t;
|
|
||||||
typedef int16_t int_fast16_t;
|
|
||||||
typedef int32_t int_fast32_t;
|
|
||||||
typedef int64_t int_fast64_t;
|
|
||||||
typedef uint8_t uint_fast8_t;
|
|
||||||
typedef uint16_t uint_fast16_t;
|
|
||||||
typedef uint32_t uint_fast32_t;
|
|
||||||
typedef uint64_t uint_fast64_t;
|
|
||||||
|
|
||||||
// 7.18.1.4 Integer types capable of holding object pointers
|
|
||||||
#ifdef _WIN64 // [
|
|
||||||
typedef signed __int64 intptr_t;
|
|
||||||
typedef unsigned __int64 uintptr_t;
|
|
||||||
#else // _WIN64 ][
|
|
||||||
typedef _W64 signed int intptr_t;
|
|
||||||
typedef _W64 unsigned int uintptr_t;
|
|
||||||
#endif // _WIN64 ]
|
|
||||||
|
|
||||||
// 7.18.1.5 Greatest-width integer types
|
|
||||||
typedef int64_t intmax_t;
|
|
||||||
typedef uint64_t uintmax_t;
|
|
||||||
|
|
||||||
|
|
||||||
// 7.18.2 Limits of specified-width integer types
|
|
||||||
|
|
||||||
#if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259
|
|
||||||
|
|
||||||
// 7.18.2.1 Limits of exact-width integer types
|
|
||||||
#define INT8_MIN ((int8_t)_I8_MIN)
|
|
||||||
#define INT8_MAX _I8_MAX
|
|
||||||
#define INT16_MIN ((int16_t)_I16_MIN)
|
|
||||||
#define INT16_MAX _I16_MAX
|
|
||||||
#define INT32_MIN ((int32_t)_I32_MIN)
|
|
||||||
#define INT32_MAX _I32_MAX
|
|
||||||
#define INT64_MIN ((int64_t)_I64_MIN)
|
|
||||||
#define INT64_MAX _I64_MAX
|
|
||||||
#define UINT8_MAX _UI8_MAX
|
|
||||||
#define UINT16_MAX _UI16_MAX
|
|
||||||
#define UINT32_MAX _UI32_MAX
|
|
||||||
#define UINT64_MAX _UI64_MAX
|
|
||||||
|
|
||||||
// 7.18.2.2 Limits of minimum-width integer types
|
|
||||||
#define INT_LEAST8_MIN INT8_MIN
|
|
||||||
#define INT_LEAST8_MAX INT8_MAX
|
|
||||||
#define INT_LEAST16_MIN INT16_MIN
|
|
||||||
#define INT_LEAST16_MAX INT16_MAX
|
|
||||||
#define INT_LEAST32_MIN INT32_MIN
|
|
||||||
#define INT_LEAST32_MAX INT32_MAX
|
|
||||||
#define INT_LEAST64_MIN INT64_MIN
|
|
||||||
#define INT_LEAST64_MAX INT64_MAX
|
|
||||||
#define UINT_LEAST8_MAX UINT8_MAX
|
|
||||||
#define UINT_LEAST16_MAX UINT16_MAX
|
|
||||||
#define UINT_LEAST32_MAX UINT32_MAX
|
|
||||||
#define UINT_LEAST64_MAX UINT64_MAX
|
|
||||||
|
|
||||||
// 7.18.2.3 Limits of fastest minimum-width integer types
|
|
||||||
#define INT_FAST8_MIN INT8_MIN
|
|
||||||
#define INT_FAST8_MAX INT8_MAX
|
|
||||||
#define INT_FAST16_MIN INT16_MIN
|
|
||||||
#define INT_FAST16_MAX INT16_MAX
|
|
||||||
#define INT_FAST32_MIN INT32_MIN
|
|
||||||
#define INT_FAST32_MAX INT32_MAX
|
|
||||||
#define INT_FAST64_MIN INT64_MIN
|
|
||||||
#define INT_FAST64_MAX INT64_MAX
|
|
||||||
#define UINT_FAST8_MAX UINT8_MAX
|
|
||||||
#define UINT_FAST16_MAX UINT16_MAX
|
|
||||||
#define UINT_FAST32_MAX UINT32_MAX
|
|
||||||
#define UINT_FAST64_MAX UINT64_MAX
|
|
||||||
|
|
||||||
// 7.18.2.4 Limits of integer types capable of holding object pointers
|
|
||||||
#ifdef _WIN64 // [
|
|
||||||
# define INTPTR_MIN INT64_MIN
|
|
||||||
# define INTPTR_MAX INT64_MAX
|
|
||||||
# define UINTPTR_MAX UINT64_MAX
|
|
||||||
#else // _WIN64 ][
|
|
||||||
# define INTPTR_MIN INT32_MIN
|
|
||||||
# define INTPTR_MAX INT32_MAX
|
|
||||||
# define UINTPTR_MAX UINT32_MAX
|
|
||||||
#endif // _WIN64 ]
|
|
||||||
|
|
||||||
// 7.18.2.5 Limits of greatest-width integer types
|
|
||||||
#define INTMAX_MIN INT64_MIN
|
|
||||||
#define INTMAX_MAX INT64_MAX
|
|
||||||
#define UINTMAX_MAX UINT64_MAX
|
|
||||||
|
|
||||||
// 7.18.3 Limits of other integer types
|
|
||||||
|
|
||||||
#ifdef _WIN64 // [
|
|
||||||
# define PTRDIFF_MIN _I64_MIN
|
|
||||||
# define PTRDIFF_MAX _I64_MAX
|
|
||||||
#else // _WIN64 ][
|
|
||||||
# define PTRDIFF_MIN _I32_MIN
|
|
||||||
# define PTRDIFF_MAX _I32_MAX
|
|
||||||
#endif // _WIN64 ]
|
|
||||||
|
|
||||||
#define SIG_ATOMIC_MIN INT_MIN
|
|
||||||
#define SIG_ATOMIC_MAX INT_MAX
|
|
||||||
|
|
||||||
#ifndef SIZE_MAX // [
|
|
||||||
# ifdef _WIN64 // [
|
|
||||||
# define SIZE_MAX _UI64_MAX
|
|
||||||
# else // _WIN64 ][
|
|
||||||
# define SIZE_MAX _UI32_MAX
|
|
||||||
# endif // _WIN64 ]
|
|
||||||
#endif // SIZE_MAX ]
|
|
||||||
|
|
||||||
// WCHAR_MIN and WCHAR_MAX are also defined in <wchar.h>
|
|
||||||
#ifndef WCHAR_MIN // [
|
|
||||||
# define WCHAR_MIN 0
|
|
||||||
#endif // WCHAR_MIN ]
|
|
||||||
#ifndef WCHAR_MAX // [
|
|
||||||
# define WCHAR_MAX _UI16_MAX
|
|
||||||
#endif // WCHAR_MAX ]
|
|
||||||
|
|
||||||
#define WINT_MIN 0
|
|
||||||
#define WINT_MAX _UI16_MAX
|
|
||||||
|
|
||||||
#endif // __STDC_LIMIT_MACROS ]
|
|
||||||
|
|
||||||
|
|
||||||
// 7.18.4 Limits of other integer types
|
|
||||||
|
|
||||||
#if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260
|
|
||||||
|
|
||||||
// 7.18.4.1 Macros for minimum-width integer constants
|
|
||||||
|
|
||||||
#define INT8_C(val) val##i8
|
|
||||||
#define INT16_C(val) val##i16
|
|
||||||
#define INT32_C(val) val##i32
|
|
||||||
#define INT64_C(val) val##i64
|
|
||||||
|
|
||||||
#define UINT8_C(val) val##ui8
|
|
||||||
#define UINT16_C(val) val##ui16
|
|
||||||
#define UINT32_C(val) val##ui32
|
|
||||||
#define UINT64_C(val) val##ui64
|
|
||||||
|
|
||||||
// 7.18.4.2 Macros for greatest-width integer constants
|
|
||||||
// These #ifndef's are needed to prevent collisions with <boost/cstdint.hpp>.
|
|
||||||
// Check out Issue 9 for the details.
|
|
||||||
#ifndef INTMAX_C // [
|
|
||||||
# define INTMAX_C INT64_C
|
|
||||||
#endif // INTMAX_C ]
|
|
||||||
#ifndef UINTMAX_C // [
|
|
||||||
# define UINTMAX_C UINT64_C
|
|
||||||
#endif // UINTMAX_C ]
|
|
||||||
|
|
||||||
#endif // __STDC_CONSTANT_MACROS ]
|
|
||||||
|
|
||||||
#endif // _MSC_VER >= 1600 ]
|
|
||||||
|
|
||||||
#endif // _MSC_STDINT_H_ ]
|
|
|
@ -1,22 +0,0 @@
//-----------------------------------------------------------------------------
// MurmurHash2 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.

#ifndef _MURMURHASH2_H_
#define _MURMURHASH2_H_

#include <stdint.h>

//-----------------------------------------------------------------------------

uint32_t MurmurHash2 ( const void * key, int len, uint32_t seed );
uint64_t MurmurHash64A ( const void * key, int len, uint64_t seed );
uint64_t MurmurHash64B ( const void * key, int len, uint64_t seed );
uint32_t MurmurHash2A ( const void * key, int len, uint32_t seed );
uint32_t MurmurHashNeutral2 ( const void * key, int len, uint32_t seed );
uint32_t MurmurHashAligned2 ( const void * key, int len, uint32_t seed );

//-----------------------------------------------------------------------------

#endif // _MURMURHASH2_H_
@ -1,28 +0,0 @@
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.

#ifndef _MURMURHASH3_H_
#define _MURMURHASH3_H_

#include <stdint.h>

//-----------------------------------------------------------------------------
#ifdef __cplusplus
extern "C" {
#endif


void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out );

void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );

void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );

#ifdef __cplusplus
}
#endif

//-----------------------------------------------------------------------------

#endif // _MURMURHASH3_H_
File diff suppressed because it is too large
|
@ -1,323 +0,0 @@
|
||||||
|
|
||||||
#ifdef _UMATHMODULE
|
|
||||||
|
|
||||||
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
|
|
||||||
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
|
||||||
#else
|
|
||||||
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifdef NPY_ENABLE_SEPARATE_COMPILATION
|
|
||||||
extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
|
||||||
#else
|
|
||||||
NPY_NO_EXPORT PyTypeObject PyUFunc_Type;
|
|
||||||
#endif
|
|
||||||
|
|
||||||
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndData \
|
|
||||||
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_RegisterLoopForType \
|
|
||||||
(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_GenericFunction \
|
|
||||||
(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_f_f_As_d_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_d_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_f_f \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_g_g \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_F_F_As_D_D \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_F_F \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_D_D \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_G_G \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_O_O \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_ff_f_As_dd_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_ff_f \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_dd_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_gg_g \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_FF_F_As_DD_D \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_DD_D \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_FF_F \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_GG_G \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_OO_O \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_O_O_method \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_OO_O_method \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_On_Om \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_GetPyValues \
|
|
||||||
(char *, int *, int *, PyObject **);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_checkfperr \
|
|
||||||
(int, PyObject *, int *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_clearfperr \
|
|
||||||
(void);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_getfperr \
|
|
||||||
(void);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_handlefperr \
|
|
||||||
(int, PyObject *, int, int *);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_ReplaceLoopBySignature \
|
|
||||||
(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *);
|
|
||||||
NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndDataAndSignature \
|
|
||||||
(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_SetUsesArraysAsData \
|
|
||||||
(void **, size_t);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_e_e \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_e_e_As_f_f \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_e_e_As_d_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_ee_e \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_ee_e_As_ff_f \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT void PyUFunc_ee_e_As_dd_d \
|
|
||||||
(char **, npy_intp *, npy_intp *, void *);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_DefaultTypeResolver \
|
|
||||||
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **);
|
|
||||||
NPY_NO_EXPORT int PyUFunc_ValidateCasting \
|
|
||||||
(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **);
|
|
||||||
|
|
||||||
#else
|
|
||||||
|
|
||||||
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
|
|
||||||
#define PyUFunc_API PY_UFUNC_UNIQUE_SYMBOL
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if defined(NO_IMPORT) || defined(NO_IMPORT_UFUNC)
|
|
||||||
extern void **PyUFunc_API;
|
|
||||||
#else
|
|
||||||
#if defined(PY_UFUNC_UNIQUE_SYMBOL)
|
|
||||||
void **PyUFunc_API;
|
|
||||||
#else
|
|
||||||
static void **PyUFunc_API=NULL;
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#define PyUFunc_Type (*(PyTypeObject *)PyUFunc_API[0])
|
|
||||||
#define PyUFunc_FromFuncAndData \
|
|
||||||
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int)) \
|
|
||||||
PyUFunc_API[1])
|
|
||||||
#define PyUFunc_RegisterLoopForType \
|
|
||||||
(*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *)) \
|
|
||||||
PyUFunc_API[2])
|
|
||||||
#define PyUFunc_GenericFunction \
|
|
||||||
(*(int (*)(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **)) \
|
|
||||||
PyUFunc_API[3])
|
|
||||||
#define PyUFunc_f_f_As_d_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[4])
|
|
||||||
#define PyUFunc_d_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[5])
|
|
||||||
#define PyUFunc_f_f \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[6])
|
|
||||||
#define PyUFunc_g_g \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[7])
|
|
||||||
#define PyUFunc_F_F_As_D_D \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[8])
|
|
||||||
#define PyUFunc_F_F \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[9])
|
|
||||||
#define PyUFunc_D_D \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[10])
|
|
||||||
#define PyUFunc_G_G \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[11])
|
|
||||||
#define PyUFunc_O_O \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[12])
|
|
||||||
#define PyUFunc_ff_f_As_dd_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[13])
|
|
||||||
#define PyUFunc_ff_f \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[14])
|
|
||||||
#define PyUFunc_dd_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[15])
|
|
||||||
#define PyUFunc_gg_g \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[16])
|
|
||||||
#define PyUFunc_FF_F_As_DD_D \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[17])
|
|
||||||
#define PyUFunc_DD_D \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[18])
|
|
||||||
#define PyUFunc_FF_F \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[19])
|
|
||||||
#define PyUFunc_GG_G \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[20])
|
|
||||||
#define PyUFunc_OO_O \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[21])
|
|
||||||
#define PyUFunc_O_O_method \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[22])
|
|
||||||
#define PyUFunc_OO_O_method \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[23])
|
|
||||||
#define PyUFunc_On_Om \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[24])
|
|
||||||
#define PyUFunc_GetPyValues \
|
|
||||||
(*(int (*)(char *, int *, int *, PyObject **)) \
|
|
||||||
PyUFunc_API[25])
|
|
||||||
#define PyUFunc_checkfperr \
|
|
||||||
(*(int (*)(int, PyObject *, int *)) \
|
|
||||||
PyUFunc_API[26])
|
|
||||||
#define PyUFunc_clearfperr \
|
|
||||||
(*(void (*)(void)) \
|
|
||||||
PyUFunc_API[27])
|
|
||||||
#define PyUFunc_getfperr \
|
|
||||||
(*(int (*)(void)) \
|
|
||||||
PyUFunc_API[28])
|
|
||||||
#define PyUFunc_handlefperr \
|
|
||||||
(*(int (*)(int, PyObject *, int, int *)) \
|
|
||||||
PyUFunc_API[29])
|
|
||||||
#define PyUFunc_ReplaceLoopBySignature \
|
|
||||||
(*(int (*)(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *)) \
|
|
||||||
PyUFunc_API[30])
|
|
||||||
#define PyUFunc_FromFuncAndDataAndSignature \
|
|
||||||
(*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *)) \
|
|
||||||
PyUFunc_API[31])
|
|
||||||
#define PyUFunc_SetUsesArraysAsData \
|
|
||||||
(*(int (*)(void **, size_t)) \
|
|
||||||
PyUFunc_API[32])
|
|
||||||
#define PyUFunc_e_e \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[33])
|
|
||||||
#define PyUFunc_e_e_As_f_f \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[34])
|
|
||||||
#define PyUFunc_e_e_As_d_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[35])
|
|
||||||
#define PyUFunc_ee_e \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[36])
|
|
||||||
#define PyUFunc_ee_e_As_ff_f \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[37])
|
|
||||||
#define PyUFunc_ee_e_As_dd_d \
|
|
||||||
(*(void (*)(char **, npy_intp *, npy_intp *, void *)) \
|
|
||||||
PyUFunc_API[38])
|
|
||||||
#define PyUFunc_DefaultTypeResolver \
|
|
||||||
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **)) \
|
|
||||||
PyUFunc_API[39])
|
|
||||||
#define PyUFunc_ValidateCasting \
|
|
||||||
(*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **)) \
|
|
||||||
PyUFunc_API[40])
|
|
||||||
|
|
||||||
static int
|
|
||||||
_import_umath(void)
|
|
||||||
{
|
|
||||||
PyObject *numpy = PyImport_ImportModule("numpy.core.umath");
|
|
||||||
PyObject *c_api = NULL;
|
|
||||||
|
|
||||||
if (numpy == NULL) {
|
|
||||||
PyErr_SetString(PyExc_ImportError, "numpy.core.umath failed to import");
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
c_api = PyObject_GetAttrString(numpy, "_UFUNC_API");
|
|
||||||
Py_DECREF(numpy);
|
|
||||||
if (c_api == NULL) {
|
|
||||||
PyErr_SetString(PyExc_AttributeError, "_UFUNC_API not found");
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
|
|
||||||
#if PY_VERSION_HEX >= 0x03000000
|
|
||||||
if (!PyCapsule_CheckExact(c_api)) {
|
|
||||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCapsule object");
|
|
||||||
Py_DECREF(c_api);
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
PyUFunc_API = (void **)PyCapsule_GetPointer(c_api, NULL);
|
|
||||||
#else
|
|
||||||
if (!PyCObject_Check(c_api)) {
|
|
||||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCObject object");
|
|
||||||
Py_DECREF(c_api);
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
PyUFunc_API = (void **)PyCObject_AsVoidPtr(c_api);
|
|
||||||
#endif
|
|
||||||
Py_DECREF(c_api);
|
|
||||||
if (PyUFunc_API == NULL) {
|
|
||||||
PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is NULL pointer");
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
#if PY_VERSION_HEX >= 0x03000000
|
|
||||||
#define NUMPY_IMPORT_UMATH_RETVAL NULL
|
|
||||||
#else
|
|
||||||
#define NUMPY_IMPORT_UMATH_RETVAL
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#define import_umath() \
|
|
||||||
do {\
|
|
||||||
UFUNC_NOFPE\
|
|
||||||
if (_import_umath() < 0) {\
|
|
||||||
PyErr_Print();\
|
|
||||||
PyErr_SetString(PyExc_ImportError,\
|
|
||||||
"numpy.core.umath failed to import");\
|
|
||||||
return NUMPY_IMPORT_UMATH_RETVAL;\
|
|
||||||
}\
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
#define import_umath1(ret) \
|
|
||||||
do {\
|
|
||||||
UFUNC_NOFPE\
|
|
||||||
if (_import_umath() < 0) {\
|
|
||||||
PyErr_Print();\
|
|
||||||
PyErr_SetString(PyExc_ImportError,\
|
|
||||||
"numpy.core.umath failed to import");\
|
|
||||||
return ret;\
|
|
||||||
}\
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
#define import_umath2(ret, msg) \
|
|
||||||
do {\
|
|
||||||
UFUNC_NOFPE\
|
|
||||||
if (_import_umath() < 0) {\
|
|
||||||
PyErr_Print();\
|
|
||||||
PyErr_SetString(PyExc_ImportError, msg);\
|
|
||||||
return ret;\
|
|
||||||
}\
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
#define import_ufunc() \
|
|
||||||
do {\
|
|
||||||
UFUNC_NOFPE\
|
|
||||||
if (_import_umath() < 0) {\
|
|
||||||
PyErr_Print();\
|
|
||||||
PyErr_SetString(PyExc_ImportError,\
|
|
||||||
"numpy.core.umath failed to import");\
|
|
||||||
}\
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,90 +0,0 @@
|
||||||
#ifndef _NPY_INCLUDE_NEIGHBORHOOD_IMP
|
|
||||||
#error You should not include this header directly
|
|
||||||
#endif
|
|
||||||
/*
|
|
||||||
* Private API (here for inline)
|
|
||||||
*/
|
|
||||||
static NPY_INLINE int
|
|
||||||
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Update to next item of the iterator
|
|
||||||
*
|
|
||||||
* Note: this simply increment the coordinates vector, last dimension
|
|
||||||
* incremented first , i.e, for dimension 3
|
|
||||||
* ...
|
|
||||||
* -1, -1, -1
|
|
||||||
* -1, -1, 0
|
|
||||||
* -1, -1, 1
|
|
||||||
* ....
|
|
||||||
* -1, 0, -1
|
|
||||||
* -1, 0, 0
|
|
||||||
* ....
|
|
||||||
* 0, -1, -1
|
|
||||||
* 0, -1, 0
|
|
||||||
* ....
|
|
||||||
*/
|
|
||||||
#define _UPDATE_COORD_ITER(c) \
|
|
||||||
wb = iter->coordinates[c] < iter->bounds[c][1]; \
|
|
||||||
if (wb) { \
|
|
||||||
iter->coordinates[c] += 1; \
|
|
||||||
return 0; \
|
|
||||||
} \
|
|
||||||
else { \
|
|
||||||
iter->coordinates[c] = iter->bounds[c][0]; \
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE int
|
|
||||||
_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter)
|
|
||||||
{
|
|
||||||
npy_intp i, wb;
|
|
||||||
|
|
||||||
for (i = iter->nd - 1; i >= 0; --i) {
|
|
||||||
_UPDATE_COORD_ITER(i)
|
|
||||||
}
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Version optimized for 2d arrays, manual loop unrolling
|
|
||||||
*/
|
|
||||||
static NPY_INLINE int
|
|
||||||
_PyArrayNeighborhoodIter_IncrCoord2D(PyArrayNeighborhoodIterObject* iter)
|
|
||||||
{
|
|
||||||
npy_intp wb;
|
|
||||||
|
|
||||||
_UPDATE_COORD_ITER(1)
|
|
||||||
_UPDATE_COORD_ITER(0)
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
#undef _UPDATE_COORD_ITER
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Advance to the next neighbour
|
|
||||||
*/
|
|
||||||
static NPY_INLINE int
|
|
||||||
PyArrayNeighborhoodIter_Next(PyArrayNeighborhoodIterObject* iter)
|
|
||||||
{
|
|
||||||
_PyArrayNeighborhoodIter_IncrCoord (iter);
|
|
||||||
iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Reset functions
|
|
||||||
*/
|
|
||||||
static NPY_INLINE int
|
|
||||||
PyArrayNeighborhoodIter_Reset(PyArrayNeighborhoodIterObject* iter)
|
|
||||||
{
|
|
||||||
npy_intp i;
|
|
||||||
|
|
||||||
for (i = 0; i < iter->nd; ++i) {
|
|
||||||
iter->coordinates[i] = iter->bounds[i][0];
|
|
||||||
}
|
|
||||||
iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates);
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
|
@ -1,29 +0,0 @@
|
||||||
#define NPY_SIZEOF_SHORT SIZEOF_SHORT
|
|
||||||
#define NPY_SIZEOF_INT SIZEOF_INT
|
|
||||||
#define NPY_SIZEOF_LONG SIZEOF_LONG
|
|
||||||
#define NPY_SIZEOF_FLOAT 4
|
|
||||||
#define NPY_SIZEOF_COMPLEX_FLOAT 8
|
|
||||||
#define NPY_SIZEOF_DOUBLE 8
|
|
||||||
#define NPY_SIZEOF_COMPLEX_DOUBLE 16
|
|
||||||
#define NPY_SIZEOF_LONGDOUBLE 16
|
|
||||||
#define NPY_SIZEOF_COMPLEX_LONGDOUBLE 32
|
|
||||||
#define NPY_SIZEOF_PY_INTPTR_T 8
|
|
||||||
#define NPY_SIZEOF_PY_LONG_LONG 8
|
|
||||||
#define NPY_SIZEOF_LONGLONG 8
|
|
||||||
#define NPY_NO_SMP 0
|
|
||||||
#define NPY_HAVE_DECL_ISNAN
|
|
||||||
#define NPY_HAVE_DECL_ISINF
|
|
||||||
#define NPY_HAVE_DECL_ISFINITE
|
|
||||||
#define NPY_HAVE_DECL_SIGNBIT
|
|
||||||
#define NPY_USE_C99_COMPLEX 1
|
|
||||||
#define NPY_HAVE_COMPLEX_DOUBLE 1
|
|
||||||
#define NPY_HAVE_COMPLEX_FLOAT 1
|
|
||||||
#define NPY_HAVE_COMPLEX_LONG_DOUBLE 1
|
|
||||||
#define NPY_USE_C99_FORMATS 1
|
|
||||||
#define NPY_VISIBILITY_HIDDEN __attribute__((visibility("hidden")))
|
|
||||||
#define NPY_ABI_VERSION 0x01000009
|
|
||||||
#define NPY_API_VERSION 0x00000007
|
|
||||||
|
|
||||||
#ifndef __STDC_FORMAT_MACROS
|
|
||||||
#define __STDC_FORMAT_MACROS 1
|
|
||||||
#endif
|
|
|
@ -1,22 +0,0 @@
|
||||||
|
|
||||||
/* This expects the following variables to be defined (besides
|
|
||||||
the usual ones from pyconfig.h
|
|
||||||
|
|
||||||
SIZEOF_LONG_DOUBLE -- sizeof(long double) or sizeof(double) if no
|
|
||||||
long double is present on platform.
|
|
||||||
CHAR_BIT -- number of bits in a char (usually 8)
|
|
||||||
(should be in limits.h)
|
|
||||||
|
|
||||||
*/
|
|
||||||
|
|
||||||
#ifndef Py_ARRAYOBJECT_H
|
|
||||||
#define Py_ARRAYOBJECT_H
|
|
||||||
|
|
||||||
#include "ndarrayobject.h"
|
|
||||||
#include "npy_interrupt.h"
|
|
||||||
|
|
||||||
#ifdef NPY_NO_PREFIX
|
|
||||||
#include "noprefix.h"
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,175 +0,0 @@
|
||||||
#ifndef _NPY_ARRAYSCALARS_H_
|
|
||||||
#define _NPY_ARRAYSCALARS_H_
|
|
||||||
|
|
||||||
#ifndef _MULTIARRAYMODULE
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_bool obval;
|
|
||||||
} PyBoolScalarObject;
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
signed char obval;
|
|
||||||
} PyByteScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
short obval;
|
|
||||||
} PyShortScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
int obval;
|
|
||||||
} PyIntScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
long obval;
|
|
||||||
} PyLongScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_longlong obval;
|
|
||||||
} PyLongLongScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
unsigned char obval;
|
|
||||||
} PyUByteScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
unsigned short obval;
|
|
||||||
} PyUShortScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
unsigned int obval;
|
|
||||||
} PyUIntScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
unsigned long obval;
|
|
||||||
} PyULongScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_ulonglong obval;
|
|
||||||
} PyULongLongScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_half obval;
|
|
||||||
} PyHalfScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
float obval;
|
|
||||||
} PyFloatScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
double obval;
|
|
||||||
} PyDoubleScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_longdouble obval;
|
|
||||||
} PyLongDoubleScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_cfloat obval;
|
|
||||||
} PyCFloatScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_cdouble obval;
|
|
||||||
} PyCDoubleScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_clongdouble obval;
|
|
||||||
} PyCLongDoubleScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
PyObject * obval;
|
|
||||||
} PyObjectScalarObject;
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_datetime obval;
|
|
||||||
PyArray_DatetimeMetaData obmeta;
|
|
||||||
} PyDatetimeScalarObject;
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
npy_timedelta obval;
|
|
||||||
PyArray_DatetimeMetaData obmeta;
|
|
||||||
} PyTimedeltaScalarObject;
|
|
||||||
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_HEAD
|
|
||||||
char obval;
|
|
||||||
} PyScalarObject;
|
|
||||||
|
|
||||||
#define PyStringScalarObject PyStringObject
|
|
||||||
#define PyUnicodeScalarObject PyUnicodeObject
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyObject_VAR_HEAD
|
|
||||||
char *obval;
|
|
||||||
PyArray_Descr *descr;
|
|
||||||
int flags;
|
|
||||||
PyObject *base;
|
|
||||||
} PyVoidScalarObject;
|
|
||||||
|
|
||||||
/* Macros
|
|
||||||
Py<Cls><bitsize>ScalarObject
|
|
||||||
Py<Cls><bitsize>ArrType_Type
|
|
||||||
are defined in ndarrayobject.h
|
|
||||||
*/
|
|
||||||
|
|
||||||
#define PyArrayScalar_False ((PyObject *)(&(_PyArrayScalar_BoolValues[0])))
|
|
||||||
#define PyArrayScalar_True ((PyObject *)(&(_PyArrayScalar_BoolValues[1])))
|
|
||||||
#define PyArrayScalar_FromLong(i) \
|
|
||||||
((PyObject *)(&(_PyArrayScalar_BoolValues[((i)!=0)])))
|
|
||||||
#define PyArrayScalar_RETURN_BOOL_FROM_LONG(i) \
|
|
||||||
return Py_INCREF(PyArrayScalar_FromLong(i)), \
|
|
||||||
PyArrayScalar_FromLong(i)
|
|
||||||
#define PyArrayScalar_RETURN_FALSE \
|
|
||||||
return Py_INCREF(PyArrayScalar_False), \
|
|
||||||
PyArrayScalar_False
|
|
||||||
#define PyArrayScalar_RETURN_TRUE \
|
|
||||||
return Py_INCREF(PyArrayScalar_True), \
|
|
||||||
PyArrayScalar_True
|
|
||||||
|
|
||||||
#define PyArrayScalar_New(cls) \
|
|
||||||
Py##cls##ArrType_Type.tp_alloc(&Py##cls##ArrType_Type, 0)
|
|
||||||
#define PyArrayScalar_VAL(obj, cls) \
|
|
||||||
((Py##cls##ScalarObject *)obj)->obval
|
|
||||||
#define PyArrayScalar_ASSIGN(obj, cls, val) \
|
|
||||||
PyArrayScalar_VAL(obj, cls) = val
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,69 +0,0 @@
|
||||||
#ifndef __NPY_HALFFLOAT_H__
|
|
||||||
#define __NPY_HALFFLOAT_H__
|
|
||||||
|
|
||||||
#include <Python.h>
|
|
||||||
#include <numpy/npy_math.h>
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
extern "C" {
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Half-precision routines
|
|
||||||
*/
|
|
||||||
|
|
||||||
/* Conversions */
|
|
||||||
float npy_half_to_float(npy_half h);
|
|
||||||
double npy_half_to_double(npy_half h);
|
|
||||||
npy_half npy_float_to_half(float f);
|
|
||||||
npy_half npy_double_to_half(double d);
|
|
||||||
/* Comparisons */
|
|
||||||
int npy_half_eq(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_ne(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_le(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_lt(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_ge(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_gt(npy_half h1, npy_half h2);
|
|
||||||
/* faster *_nonan variants for when you know h1 and h2 are not NaN */
|
|
||||||
int npy_half_eq_nonan(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_lt_nonan(npy_half h1, npy_half h2);
|
|
||||||
int npy_half_le_nonan(npy_half h1, npy_half h2);
|
|
||||||
/* Miscellaneous functions */
|
|
||||||
int npy_half_iszero(npy_half h);
|
|
||||||
int npy_half_isnan(npy_half h);
|
|
||||||
int npy_half_isinf(npy_half h);
|
|
||||||
int npy_half_isfinite(npy_half h);
|
|
||||||
int npy_half_signbit(npy_half h);
|
|
||||||
npy_half npy_half_copysign(npy_half x, npy_half y);
|
|
||||||
npy_half npy_half_spacing(npy_half h);
|
|
||||||
npy_half npy_half_nextafter(npy_half x, npy_half y);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Half-precision constants
|
|
||||||
*/
|
|
||||||
|
|
||||||
#define NPY_HALF_ZERO (0x0000u)
|
|
||||||
#define NPY_HALF_PZERO (0x0000u)
|
|
||||||
#define NPY_HALF_NZERO (0x8000u)
|
|
||||||
#define NPY_HALF_ONE (0x3c00u)
|
|
||||||
#define NPY_HALF_NEGONE (0xbc00u)
|
|
||||||
#define NPY_HALF_PINF (0x7c00u)
|
|
||||||
#define NPY_HALF_NINF (0xfc00u)
|
|
||||||
#define NPY_HALF_NAN (0x7e00u)
|
|
||||||
|
|
||||||
#define NPY_MAX_HALF (0x7bffu)
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Bit-level conversions
|
|
||||||
*/
|
|
||||||
|
|
||||||
npy_uint16 npy_floatbits_to_halfbits(npy_uint32 f);
|
|
||||||
npy_uint16 npy_doublebits_to_halfbits(npy_uint64 d);
|
|
||||||
npy_uint32 npy_halfbits_to_floatbits(npy_uint16 h);
|
|
||||||
npy_uint64 npy_halfbits_to_doublebits(npy_uint16 h);
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
File diff suppressed because it is too large
Load Diff
|
@ -1,244 +0,0 @@
|
||||||
/*
|
|
||||||
* DON'T INCLUDE THIS DIRECTLY.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#ifndef NPY_NDARRAYOBJECT_H
|
|
||||||
#define NPY_NDARRAYOBJECT_H
|
|
||||||
#ifdef __cplusplus
|
|
||||||
#define CONFUSE_EMACS {
|
|
||||||
#define CONFUSE_EMACS2 }
|
|
||||||
extern "C" CONFUSE_EMACS
|
|
||||||
#undef CONFUSE_EMACS
|
|
||||||
#undef CONFUSE_EMACS2
|
|
||||||
/* ... otherwise a semi-smart identer (like emacs) tries to indent
|
|
||||||
everything when you're typing */
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#include "ndarraytypes.h"
|
|
||||||
|
|
||||||
/* Includes the "function" C-API -- these are all stored in a
|
|
||||||
list of pointers --- one for each file
|
|
||||||
The two lists are concatenated into one in multiarray.
|
|
||||||
|
|
||||||
They are available as import_array()
|
|
||||||
*/
|
|
||||||
|
|
||||||
#include "__multiarray_api.h"
|
|
||||||
|
|
||||||
|
|
||||||
/* C-API that requries previous API to be defined */
|
|
||||||
|
|
||||||
#define PyArray_DescrCheck(op) (((PyObject*)(op))->ob_type==&PyArrayDescr_Type)
|
|
||||||
|
|
||||||
#define PyArray_Check(op) PyObject_TypeCheck(op, &PyArray_Type)
|
|
||||||
#define PyArray_CheckExact(op) (((PyObject*)(op))->ob_type == &PyArray_Type)
|
|
||||||
|
|
||||||
#define PyArray_HasArrayInterfaceType(op, type, context, out) \
|
|
||||||
((((out)=PyArray_FromStructInterface(op)) != Py_NotImplemented) || \
|
|
||||||
(((out)=PyArray_FromInterface(op)) != Py_NotImplemented) || \
|
|
||||||
(((out)=PyArray_FromArrayAttr(op, type, context)) != \
|
|
||||||
Py_NotImplemented))
|
|
||||||
|
|
||||||
#define PyArray_HasArrayInterface(op, out) \
|
|
||||||
PyArray_HasArrayInterfaceType(op, NULL, NULL, out)
|
|
||||||
|
|
||||||
#define PyArray_IsZeroDim(op) (PyArray_Check(op) && \
|
|
||||||
(PyArray_NDIM((PyArrayObject *)op) == 0))
|
|
||||||
|
|
||||||
#define PyArray_IsScalar(obj, cls) \
|
|
||||||
(PyObject_TypeCheck(obj, &Py##cls##ArrType_Type))
|
|
||||||
|
|
||||||
#define PyArray_CheckScalar(m) (PyArray_IsScalar(m, Generic) || \
|
|
||||||
PyArray_IsZeroDim(m))
|
|
||||||
|
|
||||||
#define PyArray_IsPythonNumber(obj) \
|
|
||||||
(PyInt_Check(obj) || PyFloat_Check(obj) || PyComplex_Check(obj) || \
|
|
||||||
PyLong_Check(obj) || PyBool_Check(obj))
|
|
||||||
|
|
||||||
#define PyArray_IsPythonScalar(obj) \
|
|
||||||
(PyArray_IsPythonNumber(obj) || PyString_Check(obj) || \
|
|
||||||
PyUnicode_Check(obj))
|
|
||||||
|
|
||||||
#define PyArray_IsAnyScalar(obj) \
|
|
||||||
(PyArray_IsScalar(obj, Generic) || PyArray_IsPythonScalar(obj))
|
|
||||||
|
|
||||||
#define PyArray_CheckAnyScalar(obj) (PyArray_IsPythonScalar(obj) || \
|
|
||||||
PyArray_CheckScalar(obj))
|
|
||||||
|
|
||||||
#define PyArray_IsIntegerScalar(obj) (PyInt_Check(obj) \
|
|
||||||
|| PyLong_Check(obj) \
|
|
||||||
|| PyArray_IsScalar((obj), Integer))
|
|
||||||
|
|
||||||
|
|
||||||
#define PyArray_GETCONTIGUOUS(m) (PyArray_ISCONTIGUOUS(m) ? \
|
|
||||||
Py_INCREF(m), (m) : \
|
|
||||||
(PyArrayObject *)(PyArray_Copy(m)))
|
|
||||||
|
|
||||||
#define PyArray_SAMESHAPE(a1,a2) ((PyArray_NDIM(a1) == PyArray_NDIM(a2)) && \
|
|
||||||
PyArray_CompareLists(PyArray_DIMS(a1), \
|
|
||||||
PyArray_DIMS(a2), \
|
|
||||||
PyArray_NDIM(a1)))
|
|
||||||
|
|
||||||
#define PyArray_SIZE(m) PyArray_MultiplyList(PyArray_DIMS(m), PyArray_NDIM(m))
|
|
||||||
#define PyArray_NBYTES(m) (PyArray_ITEMSIZE(m) * PyArray_SIZE(m))
|
|
||||||
#define PyArray_FROM_O(m) PyArray_FromAny(m, NULL, 0, 0, 0, NULL)
|
|
||||||
|
|
||||||
#define PyArray_FROM_OF(m,flags) PyArray_CheckFromAny(m, NULL, 0, 0, flags, \
|
|
||||||
NULL)
|
|
||||||
|
|
||||||
#define PyArray_FROM_OT(m,type) PyArray_FromAny(m, \
|
|
||||||
PyArray_DescrFromType(type), 0, 0, 0, NULL);
|
|
||||||
|
|
||||||
#define PyArray_FROM_OTF(m, type, flags) \
|
|
||||||
PyArray_FromAny(m, PyArray_DescrFromType(type), 0, 0, \
|
|
||||||
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
|
|
||||||
((flags) | NPY_ARRAY_DEFAULT) : (flags)), NULL)
|
|
||||||
|
|
||||||
#define PyArray_FROMANY(m, type, min, max, flags) \
|
|
||||||
PyArray_FromAny(m, PyArray_DescrFromType(type), min, max, \
|
|
||||||
(((flags) & NPY_ARRAY_ENSURECOPY) ? \
|
|
||||||
(flags) | NPY_ARRAY_DEFAULT : (flags)), NULL)
|
|
||||||
|
|
||||||
#define PyArray_ZEROS(m, dims, type, is_f_order) \
|
|
||||||
PyArray_Zeros(m, dims, PyArray_DescrFromType(type), is_f_order)
|
|
||||||
|
|
||||||
#define PyArray_EMPTY(m, dims, type, is_f_order) \
|
|
||||||
PyArray_Empty(m, dims, PyArray_DescrFromType(type), is_f_order)
|
|
||||||
|
|
||||||
#define PyArray_FILLWBYTE(obj, val) memset(PyArray_DATA(obj), val, \
|
|
||||||
PyArray_NBYTES(obj))
|
|
||||||
|
|
||||||
#define PyArray_REFCOUNT(obj) (((PyObject *)(obj))->ob_refcnt)
|
|
||||||
#define NPY_REFCOUNT PyArray_REFCOUNT
|
|
||||||
#define NPY_MAX_ELSIZE (2 * NPY_SIZEOF_LONGDOUBLE)
|
|
||||||
|
|
||||||
#define PyArray_ContiguousFromAny(op, type, min_depth, max_depth) \
|
|
||||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
|
||||||
max_depth, NPY_ARRAY_DEFAULT, NULL)
|
|
||||||
|
|
||||||
#define PyArray_EquivArrTypes(a1, a2) \
|
|
||||||
PyArray_EquivTypes(PyArray_DESCR(a1), PyArray_DESCR(a2))
|
|
||||||
|
|
||||||
#define PyArray_EquivByteorders(b1, b2) \
|
|
||||||
(((b1) == (b2)) || (PyArray_ISNBO(b1) == PyArray_ISNBO(b2)))
|
|
||||||
|
|
||||||
#define PyArray_SimpleNew(nd, dims, typenum) \
|
|
||||||
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
|
|
||||||
|
|
||||||
#define PyArray_SimpleNewFromData(nd, dims, typenum, data) \
|
|
||||||
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, \
|
|
||||||
data, 0, NPY_ARRAY_CARRAY, NULL)
|
|
||||||
|
|
||||||
#define PyArray_SimpleNewFromDescr(nd, dims, descr) \
|
|
||||||
PyArray_NewFromDescr(&PyArray_Type, descr, nd, dims, \
|
|
||||||
NULL, NULL, 0, NULL)
|
|
||||||
|
|
||||||
#define PyArray_ToScalar(data, arr) \
|
|
||||||
PyArray_Scalar(data, PyArray_DESCR(arr), (PyObject *)arr)
|
|
||||||
|
|
||||||
|
|
||||||
/* These might be faster without the dereferencing of obj
|
|
||||||
going on inside -- of course an optimizing compiler should
|
|
||||||
inline the constants inside a for loop making it a moot point
|
|
||||||
*/
|
|
||||||
|
|
||||||
#define PyArray_GETPTR1(obj, i) ((void *)(PyArray_BYTES(obj) + \
|
|
||||||
(i)*PyArray_STRIDES(obj)[0]))
|
|
||||||
|
|
||||||
#define PyArray_GETPTR2(obj, i, j) ((void *)(PyArray_BYTES(obj) + \
|
|
||||||
(i)*PyArray_STRIDES(obj)[0] + \
|
|
||||||
(j)*PyArray_STRIDES(obj)[1]))
|
|
||||||
|
|
||||||
#define PyArray_GETPTR3(obj, i, j, k) ((void *)(PyArray_BYTES(obj) + \
|
|
||||||
(i)*PyArray_STRIDES(obj)[0] + \
|
|
||||||
(j)*PyArray_STRIDES(obj)[1] + \
|
|
||||||
(k)*PyArray_STRIDES(obj)[2]))
|
|
||||||
|
|
||||||
#define PyArray_GETPTR4(obj, i, j, k, l) ((void *)(PyArray_BYTES(obj) + \
|
|
||||||
(i)*PyArray_STRIDES(obj)[0] + \
|
|
||||||
(j)*PyArray_STRIDES(obj)[1] + \
|
|
||||||
(k)*PyArray_STRIDES(obj)[2] + \
|
|
||||||
(l)*PyArray_STRIDES(obj)[3]))
|
|
||||||
|
|
||||||
static NPY_INLINE void
|
|
||||||
PyArray_XDECREF_ERR(PyArrayObject *arr)
|
|
||||||
{
|
|
||||||
if (arr != NULL) {
|
|
||||||
if (PyArray_FLAGS(arr) & NPY_ARRAY_UPDATEIFCOPY) {
|
|
||||||
PyArrayObject *base = (PyArrayObject *)PyArray_BASE(arr);
|
|
||||||
PyArray_ENABLEFLAGS(base, NPY_ARRAY_WRITEABLE);
|
|
||||||
PyArray_CLEARFLAGS(arr, NPY_ARRAY_UPDATEIFCOPY);
|
|
||||||
}
|
|
||||||
Py_DECREF(arr);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
#define PyArray_DESCR_REPLACE(descr) do { \
|
|
||||||
PyArray_Descr *_new_; \
|
|
||||||
_new_ = PyArray_DescrNew(descr); \
|
|
||||||
Py_XDECREF(descr); \
|
|
||||||
descr = _new_; \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
/* Copy should always return contiguous array */
|
|
||||||
#define PyArray_Copy(obj) PyArray_NewCopy(obj, NPY_CORDER)
|
|
||||||
|
|
||||||
#define PyArray_FromObject(op, type, min_depth, max_depth) \
|
|
||||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
|
||||||
max_depth, NPY_ARRAY_BEHAVED | \
|
|
||||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
|
||||||
|
|
||||||
#define PyArray_ContiguousFromObject(op, type, min_depth, max_depth) \
|
|
||||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
|
||||||
max_depth, NPY_ARRAY_DEFAULT | \
|
|
||||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
|
||||||
|
|
||||||
#define PyArray_CopyFromObject(op, type, min_depth, max_depth) \
|
|
||||||
PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \
|
|
||||||
max_depth, NPY_ARRAY_ENSURECOPY | \
|
|
||||||
NPY_ARRAY_DEFAULT | \
|
|
||||||
NPY_ARRAY_ENSUREARRAY, NULL)
|
|
||||||
|
|
||||||
#define PyArray_Cast(mp, type_num) \
|
|
||||||
PyArray_CastToType(mp, PyArray_DescrFromType(type_num), 0)
|
|
||||||
|
|
||||||
#define PyArray_Take(ap, items, axis) \
|
|
||||||
PyArray_TakeFrom(ap, items, axis, NULL, NPY_RAISE)
|
|
||||||
|
|
||||||
#define PyArray_Put(ap, items, values) \
|
|
||||||
PyArray_PutTo(ap, items, values, NPY_RAISE)
|
|
||||||
|
|
||||||
/* Compatibility with old Numeric stuff -- don't use in new code */
|
|
||||||
|
|
||||||
#define PyArray_FromDimsAndData(nd, d, type, data) \
|
|
||||||
PyArray_FromDimsAndDataAndDescr(nd, d, PyArray_DescrFromType(type), \
|
|
||||||
data)
|
|
||||||
|
|
||||||
|
|
||||||
/*
|
|
||||||
Check to see if this key in the dictionary is the "title"
|
|
||||||
entry of the tuple (i.e. a duplicate dictionary entry in the fields
|
|
||||||
dict.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#define NPY_TITLE_KEY(key, value) ((PyTuple_GET_SIZE((value))==3) && \
|
|
||||||
(PyTuple_GET_ITEM((value), 2) == (key)))
|
|
||||||
|
|
||||||
|
|
||||||
/* Define python version independent deprecation macro */
|
|
||||||
|
|
||||||
#if PY_VERSION_HEX >= 0x02050000
|
|
||||||
#define DEPRECATE(msg) PyErr_WarnEx(PyExc_DeprecationWarning,msg,1)
|
|
||||||
#define DEPRECATE_FUTUREWARNING(msg) PyErr_WarnEx(PyExc_FutureWarning,msg,1)
|
|
||||||
#else
|
|
||||||
#define DEPRECATE(msg) PyErr_Warn(PyExc_DeprecationWarning,msg)
|
|
||||||
#define DEPRECATE_FUTUREWARNING(msg) PyErr_Warn(PyExc_FutureWarning,msg)
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
#endif /* NPY_NDARRAYOBJECT_H */
|
|
File diff suppressed because it is too large
Load Diff
|
@ -1,209 +0,0 @@
|
||||||
#ifndef NPY_NOPREFIX_H
|
|
||||||
#define NPY_NOPREFIX_H
|
|
||||||
|
|
||||||
/*
|
|
||||||
* You can directly include noprefix.h as a backward
|
|
||||||
* compatibility measure
|
|
||||||
*/
|
|
||||||
#ifndef NPY_NO_PREFIX
|
|
||||||
#include "ndarrayobject.h"
|
|
||||||
#include "npy_interrupt.h"
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#define SIGSETJMP NPY_SIGSETJMP
|
|
||||||
#define SIGLONGJMP NPY_SIGLONGJMP
|
|
||||||
#define SIGJMP_BUF NPY_SIGJMP_BUF
|
|
||||||
|
|
||||||
#define MAX_DIMS NPY_MAXDIMS
|
|
||||||
|
|
||||||
#define longlong npy_longlong
|
|
||||||
#define ulonglong npy_ulonglong
|
|
||||||
#define Bool npy_bool
|
|
||||||
#define longdouble npy_longdouble
|
|
||||||
#define byte npy_byte
|
|
||||||
|
|
||||||
#ifndef _BSD_SOURCE
|
|
||||||
#define ushort npy_ushort
|
|
||||||
#define uint npy_uint
|
|
||||||
#define ulong npy_ulong
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#define ubyte npy_ubyte
|
|
||||||
#define ushort npy_ushort
|
|
||||||
#define uint npy_uint
|
|
||||||
#define ulong npy_ulong
|
|
||||||
#define cfloat npy_cfloat
|
|
||||||
#define cdouble npy_cdouble
|
|
||||||
#define clongdouble npy_clongdouble
|
|
||||||
#define Int8 npy_int8
|
|
||||||
#define UInt8 npy_uint8
|
|
||||||
#define Int16 npy_int16
|
|
||||||
#define UInt16 npy_uint16
|
|
||||||
#define Int32 npy_int32
|
|
||||||
#define UInt32 npy_uint32
|
|
||||||
#define Int64 npy_int64
|
|
||||||
#define UInt64 npy_uint64
|
|
||||||
#define Int128 npy_int128
|
|
||||||
#define UInt128 npy_uint128
|
|
||||||
#define Int256 npy_int256
|
|
||||||
#define UInt256 npy_uint256
|
|
||||||
#define Float16 npy_float16
|
|
||||||
#define Complex32 npy_complex32
|
|
||||||
#define Float32 npy_float32
|
|
||||||
#define Complex64 npy_complex64
|
|
||||||
#define Float64 npy_float64
|
|
||||||
#define Complex128 npy_complex128
|
|
||||||
#define Float80 npy_float80
|
|
||||||
#define Complex160 npy_complex160
|
|
||||||
#define Float96 npy_float96
|
|
||||||
#define Complex192 npy_complex192
|
|
||||||
#define Float128 npy_float128
|
|
||||||
#define Complex256 npy_complex256
|
|
||||||
#define intp npy_intp
|
|
||||||
#define uintp npy_uintp
|
|
||||||
#define datetime npy_datetime
|
|
||||||
#define timedelta npy_timedelta
|
|
||||||
|
|
||||||
#define SIZEOF_INTP NPY_SIZEOF_INTP
|
|
||||||
#define SIZEOF_UINTP NPY_SIZEOF_UINTP
|
|
||||||
#define SIZEOF_DATETIME NPY_SIZEOF_DATETIME
|
|
||||||
#define SIZEOF_TIMEDELTA NPY_SIZEOF_TIMEDELTA
|
|
||||||
|
|
||||||
#define LONGLONG_FMT NPY_LONGLONG_FMT
|
|
||||||
#define ULONGLONG_FMT NPY_ULONGLONG_FMT
|
|
||||||
#define LONGLONG_SUFFIX NPY_LONGLONG_SUFFIX
|
|
||||||
#define ULONGLONG_SUFFIX NPY_ULONGLONG_SUFFIX
|
|
||||||
|
|
||||||
#define MAX_INT8 127
|
|
||||||
#define MIN_INT8 -128
|
|
||||||
#define MAX_UINT8 255
|
|
||||||
#define MAX_INT16 32767
|
|
||||||
#define MIN_INT16 -32768
|
|
||||||
#define MAX_UINT16 65535
|
|
||||||
#define MAX_INT32 2147483647
|
|
||||||
#define MIN_INT32 (-MAX_INT32 - 1)
|
|
||||||
#define MAX_UINT32 4294967295U
|
|
||||||
#define MAX_INT64 LONGLONG_SUFFIX(9223372036854775807)
|
|
||||||
#define MIN_INT64 (-MAX_INT64 - LONGLONG_SUFFIX(1))
|
|
||||||
#define MAX_UINT64 ULONGLONG_SUFFIX(18446744073709551615)
|
|
||||||
#define MAX_INT128 LONGLONG_SUFFIX(85070591730234615865843651857942052864)
|
|
||||||
#define MIN_INT128 (-MAX_INT128 - LONGLONG_SUFFIX(1))
|
|
||||||
#define MAX_UINT128 ULONGLONG_SUFFIX(170141183460469231731687303715884105728)
|
|
||||||
#define MAX_INT256 LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
|
|
||||||
#define MIN_INT256 (-MAX_INT256 - LONGLONG_SUFFIX(1))
|
|
||||||
#define MAX_UINT256 ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
|
|
||||||
|
|
||||||
#define MAX_BYTE NPY_MAX_BYTE
|
|
||||||
#define MIN_BYTE NPY_MIN_BYTE
|
|
||||||
#define MAX_UBYTE NPY_MAX_UBYTE
|
|
||||||
#define MAX_SHORT NPY_MAX_SHORT
|
|
||||||
#define MIN_SHORT NPY_MIN_SHORT
|
|
||||||
#define MAX_USHORT NPY_MAX_USHORT
|
|
||||||
#define MAX_INT NPY_MAX_INT
|
|
||||||
#define MIN_INT NPY_MIN_INT
|
|
||||||
#define MAX_UINT NPY_MAX_UINT
|
|
||||||
#define MAX_LONG NPY_MAX_LONG
|
|
||||||
#define MIN_LONG NPY_MIN_LONG
|
|
||||||
#define MAX_ULONG NPY_MAX_ULONG
|
|
||||||
#define MAX_LONGLONG NPY_MAX_LONGLONG
|
|
||||||
#define MIN_LONGLONG NPY_MIN_LONGLONG
|
|
||||||
#define MAX_ULONGLONG NPY_MAX_ULONGLONG
|
|
||||||
#define MIN_DATETIME NPY_MIN_DATETIME
|
|
||||||
#define MAX_DATETIME NPY_MAX_DATETIME
|
|
||||||
#define MIN_TIMEDELTA NPY_MIN_TIMEDELTA
|
|
||||||
#define MAX_TIMEDELTA NPY_MAX_TIMEDELTA
|
|
||||||
|
|
||||||
#define SIZEOF_LONGDOUBLE NPY_SIZEOF_LONGDOUBLE
|
|
||||||
#define SIZEOF_LONGLONG NPY_SIZEOF_LONGLONG
|
|
||||||
#define SIZEOF_HALF NPY_SIZEOF_HALF
|
|
||||||
#define BITSOF_BOOL NPY_BITSOF_BOOL
|
|
||||||
#define BITSOF_CHAR NPY_BITSOF_CHAR
|
|
||||||
#define BITSOF_SHORT NPY_BITSOF_SHORT
|
|
||||||
#define BITSOF_INT NPY_BITSOF_INT
|
|
||||||
#define BITSOF_LONG NPY_BITSOF_LONG
|
|
||||||
#define BITSOF_LONGLONG NPY_BITSOF_LONGLONG
|
|
||||||
#define BITSOF_HALF NPY_BITSOF_HALF
|
|
||||||
#define BITSOF_FLOAT NPY_BITSOF_FLOAT
|
|
||||||
#define BITSOF_DOUBLE NPY_BITSOF_DOUBLE
|
|
||||||
#define BITSOF_LONGDOUBLE NPY_BITSOF_LONGDOUBLE
|
|
||||||
#define BITSOF_DATETIME NPY_BITSOF_DATETIME
|
|
||||||
#define BITSOF_TIMEDELTA NPY_BITSOF_TIMEDELTA
|
|
||||||
|
|
||||||
#define _pya_malloc PyArray_malloc
|
|
||||||
#define _pya_free PyArray_free
|
|
||||||
#define _pya_realloc PyArray_realloc
|
|
||||||
|
|
||||||
#define BEGIN_THREADS_DEF NPY_BEGIN_THREADS_DEF
|
|
||||||
#define BEGIN_THREADS NPY_BEGIN_THREADS
|
|
||||||
#define END_THREADS NPY_END_THREADS
|
|
||||||
#define ALLOW_C_API_DEF NPY_ALLOW_C_API_DEF
|
|
||||||
#define ALLOW_C_API NPY_ALLOW_C_API
|
|
||||||
#define DISABLE_C_API NPY_DISABLE_C_API
|
|
||||||
|
|
||||||
#define PY_FAIL NPY_FAIL
|
|
||||||
#define PY_SUCCEED NPY_SUCCEED
|
|
||||||
|
|
||||||
#ifndef TRUE
|
|
||||||
#define TRUE NPY_TRUE
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifndef FALSE
|
|
||||||
#define FALSE NPY_FALSE
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#define LONGDOUBLE_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
|
|
||||||
#define CONTIGUOUS NPY_CONTIGUOUS
|
|
||||||
#define C_CONTIGUOUS NPY_C_CONTIGUOUS
|
|
||||||
#define FORTRAN NPY_FORTRAN
|
|
||||||
#define F_CONTIGUOUS NPY_F_CONTIGUOUS
|
|
||||||
#define OWNDATA NPY_OWNDATA
|
|
||||||
#define FORCECAST NPY_FORCECAST
|
|
||||||
#define ENSURECOPY NPY_ENSURECOPY
|
|
||||||
#define ENSUREARRAY NPY_ENSUREARRAY
|
|
||||||
#define ELEMENTSTRIDES NPY_ELEMENTSTRIDES
|
|
||||||
#define ALIGNED NPY_ALIGNED
|
|
||||||
#define NOTSWAPPED NPY_NOTSWAPPED
|
|
||||||
#define WRITEABLE NPY_WRITEABLE
|
|
||||||
#define UPDATEIFCOPY NPY_UPDATEIFCOPY
|
|
||||||
#define ARR_HAS_DESCR NPY_ARR_HAS_DESCR
|
|
||||||
#define BEHAVED NPY_BEHAVED
|
|
||||||
#define BEHAVED_NS NPY_BEHAVED_NS
|
|
||||||
#define CARRAY NPY_CARRAY
|
|
||||||
#define CARRAY_RO NPY_CARRAY_RO
|
|
||||||
#define FARRAY NPY_FARRAY
|
|
||||||
#define FARRAY_RO NPY_FARRAY_RO
|
|
||||||
#define DEFAULT NPY_DEFAULT
|
|
||||||
#define IN_ARRAY NPY_IN_ARRAY
|
|
||||||
#define OUT_ARRAY NPY_OUT_ARRAY
|
|
||||||
#define INOUT_ARRAY NPY_INOUT_ARRAY
|
|
||||||
#define IN_FARRAY NPY_IN_FARRAY
|
|
||||||
#define OUT_FARRAY NPY_OUT_FARRAY
|
|
||||||
#define INOUT_FARRAY NPY_INOUT_FARRAY
|
|
||||||
#define UPDATE_ALL NPY_UPDATE_ALL
|
|
||||||
|
|
||||||
#define OWN_DATA NPY_OWNDATA
|
|
||||||
#define BEHAVED_FLAGS NPY_BEHAVED
|
|
||||||
#define BEHAVED_FLAGS_NS NPY_BEHAVED_NS
|
|
||||||
#define CARRAY_FLAGS_RO NPY_CARRAY_RO
|
|
||||||
#define CARRAY_FLAGS NPY_CARRAY
|
|
||||||
#define FARRAY_FLAGS NPY_FARRAY
|
|
||||||
#define FARRAY_FLAGS_RO NPY_FARRAY_RO
|
|
||||||
#define DEFAULT_FLAGS NPY_DEFAULT
|
|
||||||
#define UPDATE_ALL_FLAGS NPY_UPDATE_ALL_FLAGS
|
|
||||||
|
|
||||||
#ifndef MIN
|
|
||||||
#define MIN PyArray_MIN
|
|
||||||
#endif
|
|
||||||
#ifndef MAX
|
|
||||||
#define MAX PyArray_MAX
|
|
||||||
#endif
|
|
||||||
#define MAX_INTP NPY_MAX_INTP
|
|
||||||
#define MIN_INTP NPY_MIN_INTP
|
|
||||||
#define MAX_UINTP NPY_MAX_UINTP
|
|
||||||
#define INTP_FMT NPY_INTP_FMT
|
|
||||||
|
|
||||||
#define REFCOUNT PyArray_REFCOUNT
|
|
||||||
#define MAX_ELSIZE NPY_MAX_ELSIZE
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,417 +0,0 @@
|
||||||
/*
|
|
||||||
* This is a convenience header file providing compatibility utilities
|
|
||||||
* for supporting Python 2 and Python 3 in the same code base.
|
|
||||||
*
|
|
||||||
* If you want to use this for your own projects, it's recommended to make a
|
|
||||||
* copy of it. Although the stuff below is unlikely to change, we don't provide
|
|
||||||
* strong backwards compatibility guarantees at the moment.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#ifndef _NPY_3KCOMPAT_H_
|
|
||||||
#define _NPY_3KCOMPAT_H_
|
|
||||||
|
|
||||||
#include <Python.h>
|
|
||||||
#include <stdio.h>
|
|
||||||
|
|
||||||
#if PY_VERSION_HEX >= 0x03000000
|
|
||||||
#ifndef NPY_PY3K
|
|
||||||
#define NPY_PY3K 1
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#include "numpy/npy_common.h"
|
|
||||||
#include "numpy/ndarrayobject.h"
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
extern "C" {
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* PyInt -> PyLong
|
|
||||||
*/
|
|
||||||
|
|
||||||
#if defined(NPY_PY3K)
|
|
||||||
/* Return True only if the long fits in a C long */
|
|
||||||
static NPY_INLINE int PyInt_Check(PyObject *op) {
|
|
||||||
int overflow = 0;
|
|
||||||
if (!PyLong_Check(op)) {
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
PyLong_AsLongAndOverflow(op, &overflow);
|
|
||||||
return (overflow == 0);
|
|
||||||
}
|
|
||||||
|
|
||||||
#define PyInt_FromLong PyLong_FromLong
|
|
||||||
#define PyInt_AsLong PyLong_AsLong
|
|
||||||
#define PyInt_AS_LONG PyLong_AsLong
|
|
||||||
#define PyInt_AsSsize_t PyLong_AsSsize_t
|
|
||||||
|
|
||||||
/* NOTE:
|
|
||||||
*
|
|
||||||
* Since the PyLong type is very different from the fixed-range PyInt,
|
|
||||||
* we don't define PyInt_Type -> PyLong_Type.
|
|
||||||
*/
|
|
||||||
#endif /* NPY_PY3K */
|
|
||||||
|
|
||||||
/*
|
|
||||||
* PyString -> PyBytes
|
|
||||||
*/
|
|
||||||
|
|
||||||
#if defined(NPY_PY3K)
|
|
||||||
|
|
||||||
#define PyString_Type PyBytes_Type
|
|
||||||
#define PyString_Check PyBytes_Check
|
|
||||||
#define PyStringObject PyBytesObject
|
|
||||||
#define PyString_FromString PyBytes_FromString
|
|
||||||
#define PyString_FromStringAndSize PyBytes_FromStringAndSize
|
|
||||||
#define PyString_AS_STRING PyBytes_AS_STRING
|
|
||||||
#define PyString_AsStringAndSize PyBytes_AsStringAndSize
|
|
||||||
#define PyString_FromFormat PyBytes_FromFormat
|
|
||||||
#define PyString_Concat PyBytes_Concat
|
|
||||||
#define PyString_ConcatAndDel PyBytes_ConcatAndDel
|
|
||||||
#define PyString_AsString PyBytes_AsString
|
|
||||||
#define PyString_GET_SIZE PyBytes_GET_SIZE
|
|
||||||
#define PyString_Size PyBytes_Size
|
|
||||||
|
|
||||||
#define PyUString_Type PyUnicode_Type
|
|
||||||
#define PyUString_Check PyUnicode_Check
|
|
||||||
#define PyUStringObject PyUnicodeObject
|
|
||||||
#define PyUString_FromString PyUnicode_FromString
|
|
||||||
#define PyUString_FromStringAndSize PyUnicode_FromStringAndSize
|
|
||||||
#define PyUString_FromFormat PyUnicode_FromFormat
|
|
||||||
#define PyUString_Concat PyUnicode_Concat2
|
|
||||||
#define PyUString_ConcatAndDel PyUnicode_ConcatAndDel
|
|
||||||
#define PyUString_GET_SIZE PyUnicode_GET_SIZE
|
|
||||||
#define PyUString_Size PyUnicode_Size
|
|
||||||
#define PyUString_InternFromString PyUnicode_InternFromString
|
|
||||||
#define PyUString_Format PyUnicode_Format
|
|
||||||
|
|
||||||
#else
|
|
||||||
|
|
||||||
#define PyBytes_Type PyString_Type
|
|
||||||
#define PyBytes_Check PyString_Check
|
|
||||||
#define PyBytesObject PyStringObject
|
|
||||||
#define PyBytes_FromString PyString_FromString
|
|
||||||
#define PyBytes_FromStringAndSize PyString_FromStringAndSize
|
|
||||||
#define PyBytes_AS_STRING PyString_AS_STRING
|
|
||||||
#define PyBytes_AsStringAndSize PyString_AsStringAndSize
|
|
||||||
#define PyBytes_FromFormat PyString_FromFormat
|
|
||||||
#define PyBytes_Concat PyString_Concat
|
|
||||||
#define PyBytes_ConcatAndDel PyString_ConcatAndDel
|
|
||||||
#define PyBytes_AsString PyString_AsString
|
|
||||||
#define PyBytes_GET_SIZE PyString_GET_SIZE
|
|
||||||
#define PyBytes_Size PyString_Size
|
|
||||||
|
|
||||||
#define PyUString_Type PyString_Type
|
|
||||||
#define PyUString_Check PyString_Check
|
|
||||||
#define PyUStringObject PyStringObject
|
|
||||||
#define PyUString_FromString PyString_FromString
|
|
||||||
#define PyUString_FromStringAndSize PyString_FromStringAndSize
|
|
||||||
#define PyUString_FromFormat PyString_FromFormat
|
|
||||||
#define PyUString_Concat PyString_Concat
|
|
||||||
#define PyUString_ConcatAndDel PyString_ConcatAndDel
|
|
||||||
#define PyUString_GET_SIZE PyString_GET_SIZE
|
|
||||||
#define PyUString_Size PyString_Size
|
|
||||||
#define PyUString_InternFromString PyString_InternFromString
|
|
||||||
#define PyUString_Format PyString_Format
|
|
||||||
|
|
||||||
#endif /* NPY_PY3K */
|
|
||||||
|
|
||||||
|
|
||||||
static NPY_INLINE void
|
|
||||||
PyUnicode_ConcatAndDel(PyObject **left, PyObject *right)
|
|
||||||
{
|
|
||||||
PyObject *newobj;
|
|
||||||
newobj = PyUnicode_Concat(*left, right);
|
|
||||||
Py_DECREF(*left);
|
|
||||||
Py_DECREF(right);
|
|
||||||
*left = newobj;
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void
|
|
||||||
PyUnicode_Concat2(PyObject **left, PyObject *right)
|
|
||||||
{
|
|
||||||
PyObject *newobj;
|
|
||||||
newobj = PyUnicode_Concat(*left, right);
|
|
||||||
Py_DECREF(*left);
|
|
||||||
*left = newobj;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* PyFile_* compatibility
|
|
||||||
*/
|
|
||||||
#if defined(NPY_PY3K)
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Get a FILE* handle to the file represented by the Python object
|
|
||||||
*/
|
|
||||||
static NPY_INLINE FILE*
|
|
||||||
npy_PyFile_Dup(PyObject *file, char *mode)
|
|
||||||
{
|
|
||||||
int fd, fd2;
|
|
||||||
PyObject *ret, *os;
|
|
||||||
Py_ssize_t pos;
|
|
||||||
FILE *handle;
|
|
||||||
/* Flush first to ensure things end up in the file in the correct order */
|
|
||||||
ret = PyObject_CallMethod(file, "flush", "");
|
|
||||||
if (ret == NULL) {
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
Py_DECREF(ret);
|
|
||||||
fd = PyObject_AsFileDescriptor(file);
|
|
||||||
if (fd == -1) {
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
os = PyImport_ImportModule("os");
|
|
||||||
if (os == NULL) {
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
ret = PyObject_CallMethod(os, "dup", "i", fd);
|
|
||||||
Py_DECREF(os);
|
|
||||||
if (ret == NULL) {
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
fd2 = PyNumber_AsSsize_t(ret, NULL);
|
|
||||||
Py_DECREF(ret);
|
|
||||||
#ifdef _WIN32
|
|
||||||
handle = _fdopen(fd2, mode);
|
|
||||||
#else
|
|
||||||
handle = fdopen(fd2, mode);
|
|
||||||
#endif
|
|
||||||
if (handle == NULL) {
|
|
||||||
PyErr_SetString(PyExc_IOError,
|
|
||||||
"Getting a FILE* from a Python file object failed");
|
|
||||||
}
|
|
||||||
ret = PyObject_CallMethod(file, "tell", "");
|
|
||||||
if (ret == NULL) {
|
|
||||||
fclose(handle);
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
pos = PyNumber_AsSsize_t(ret, PyExc_OverflowError);
|
|
||||||
Py_DECREF(ret);
|
|
||||||
if (PyErr_Occurred()) {
|
|
||||||
fclose(handle);
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
npy_fseek(handle, pos, SEEK_SET);
|
|
||||||
return handle;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Close the dup-ed file handle, and seek the Python one to the current position
|
|
||||||
*/
|
|
||||||
static NPY_INLINE int
|
|
||||||
npy_PyFile_DupClose(PyObject *file, FILE* handle)
|
|
||||||
{
|
|
||||||
PyObject *ret;
|
|
||||||
Py_ssize_t position;
|
|
||||||
position = npy_ftell(handle);
|
|
||||||
fclose(handle);
|
|
||||||
|
|
||||||
ret = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", position, 0);
|
|
||||||
if (ret == NULL) {
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
Py_DECREF(ret);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE int
|
|
||||||
npy_PyFile_Check(PyObject *file)
|
|
||||||
{
|
|
||||||
int fd;
|
|
||||||
fd = PyObject_AsFileDescriptor(file);
|
|
||||||
if (fd == -1) {
|
|
||||||
PyErr_Clear();
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
#else
|
|
||||||
|
|
||||||
#define npy_PyFile_Dup(file, mode) PyFile_AsFile(file)
|
|
||||||
#define npy_PyFile_DupClose(file, handle) (0)
|
|
||||||
#define npy_PyFile_Check PyFile_Check
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
static NPY_INLINE PyObject*
|
|
||||||
npy_PyFile_OpenFile(PyObject *filename, const char *mode)
|
|
||||||
{
|
|
||||||
PyObject *open;
|
|
||||||
open = PyDict_GetItemString(PyEval_GetBuiltins(), "open");
|
|
||||||
if (open == NULL) {
|
|
||||||
return NULL;
|
|
||||||
}
|
|
||||||
return PyObject_CallFunction(open, "Os", filename, mode);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE int
|
|
||||||
npy_PyFile_CloseFile(PyObject *file)
|
|
||||||
{
|
|
||||||
PyObject *ret;
|
|
||||||
|
|
||||||
ret = PyObject_CallMethod(file, "close", NULL);
|
|
||||||
if (ret == NULL) {
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
Py_DECREF(ret);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* PyObject_Cmp
|
|
||||||
*/
|
|
||||||
#if defined(NPY_PY3K)
|
|
||||||
static NPY_INLINE int
|
|
||||||
PyObject_Cmp(PyObject *i1, PyObject *i2, int *cmp)
|
|
||||||
{
|
|
||||||
int v;
|
|
||||||
v = PyObject_RichCompareBool(i1, i2, Py_LT);
|
|
||||||
if (v == 0) {
|
|
||||||
*cmp = -1;
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
else if (v == -1) {
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
|
|
||||||
v = PyObject_RichCompareBool(i1, i2, Py_GT);
|
|
||||||
if (v == 0) {
|
|
||||||
*cmp = 1;
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
else if (v == -1) {
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
|
|
||||||
v = PyObject_RichCompareBool(i1, i2, Py_EQ);
|
|
||||||
if (v == 0) {
|
|
||||||
*cmp = 0;
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
*cmp = 0;
|
|
||||||
return -1;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* PyCObject functions adapted to PyCapsules.
|
|
||||||
*
|
|
||||||
* The main job here is to get rid of the improved error handling
|
|
||||||
* of PyCapsules. It's a shame...
|
|
||||||
*/
|
|
||||||
#if PY_VERSION_HEX >= 0x03000000
|
|
||||||
|
|
||||||
static NPY_INLINE PyObject *
|
|
||||||
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(PyObject *))
|
|
||||||
{
|
|
||||||
PyObject *ret = PyCapsule_New(ptr, NULL, dtor);
|
|
||||||
if (ret == NULL) {
|
|
||||||
PyErr_Clear();
|
|
||||||
}
|
|
||||||
return ret;
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE PyObject *
|
|
||||||
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context, void (*dtor)(PyObject *))
|
|
||||||
{
|
|
||||||
PyObject *ret = NpyCapsule_FromVoidPtr(ptr, dtor);
|
|
||||||
if (ret != NULL && PyCapsule_SetContext(ret, context) != 0) {
|
|
||||||
PyErr_Clear();
|
|
||||||
Py_DECREF(ret);
|
|
||||||
ret = NULL;
|
|
||||||
}
|
|
||||||
return ret;
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void *
|
|
||||||
NpyCapsule_AsVoidPtr(PyObject *obj)
|
|
||||||
{
|
|
||||||
void *ret = PyCapsule_GetPointer(obj, NULL);
|
|
||||||
if (ret == NULL) {
|
|
||||||
PyErr_Clear();
|
|
||||||
}
|
|
||||||
return ret;
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void *
|
|
||||||
NpyCapsule_GetDesc(PyObject *obj)
|
|
||||||
{
|
|
||||||
return PyCapsule_GetContext(obj);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE int
|
|
||||||
NpyCapsule_Check(PyObject *ptr)
|
|
||||||
{
|
|
||||||
return PyCapsule_CheckExact(ptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void
|
|
||||||
simple_capsule_dtor(PyObject *cap)
|
|
||||||
{
|
|
||||||
PyArray_free(PyCapsule_GetPointer(cap, NULL));
|
|
||||||
}
|
|
||||||
|
|
||||||
#else
|
|
||||||
|
|
||||||
static NPY_INLINE PyObject *
|
|
||||||
NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(void *))
|
|
||||||
{
|
|
||||||
return PyCObject_FromVoidPtr(ptr, dtor);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE PyObject *
|
|
||||||
NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context,
|
|
||||||
void (*dtor)(void *, void *))
|
|
||||||
{
|
|
||||||
return PyCObject_FromVoidPtrAndDesc(ptr, context, dtor);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void *
|
|
||||||
NpyCapsule_AsVoidPtr(PyObject *ptr)
|
|
||||||
{
|
|
||||||
return PyCObject_AsVoidPtr(ptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void *
|
|
||||||
NpyCapsule_GetDesc(PyObject *obj)
|
|
||||||
{
|
|
||||||
return PyCObject_GetDesc(obj);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE int
|
|
||||||
NpyCapsule_Check(PyObject *ptr)
|
|
||||||
{
|
|
||||||
return PyCObject_Check(ptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE void
|
|
||||||
simple_capsule_dtor(void *ptr)
|
|
||||||
{
|
|
||||||
PyArray_free(ptr);
|
|
||||||
}
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Hash value compatibility.
|
|
||||||
* As of Python 3.2 hash values are of type Py_hash_t.
|
|
||||||
* Previous versions use C long.
|
|
||||||
*/
|
|
||||||
#if PY_VERSION_HEX < 0x03020000
|
|
||||||
typedef long npy_hash_t;
|
|
||||||
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_LONG
|
|
||||||
#else
|
|
||||||
typedef Py_hash_t npy_hash_t;
|
|
||||||
#define NPY_SIZEOF_HASH_T NPY_SIZEOF_INTP
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif /* _NPY_3KCOMPAT_H_ */
|
|
|
@ -1,930 +0,0 @@
|
||||||
#ifndef _NPY_COMMON_H_
|
|
||||||
#define _NPY_COMMON_H_
|
|
||||||
|
|
||||||
/* numpconfig.h is auto-generated */
|
|
||||||
#include "numpyconfig.h"
|
|
||||||
|
|
||||||
#if defined(_MSC_VER)
|
|
||||||
#define NPY_INLINE __inline
|
|
||||||
#elif defined(__GNUC__)
|
|
||||||
#if defined(__STRICT_ANSI__)
|
|
||||||
#define NPY_INLINE __inline__
|
|
||||||
#else
|
|
||||||
#define NPY_INLINE inline
|
|
||||||
#endif
|
|
||||||
#else
|
|
||||||
#define NPY_INLINE
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* Enable 64 bit file position support on win-amd64. Ticket #1660 */
|
|
||||||
#if defined(_MSC_VER) && defined(_WIN64) && (_MSC_VER > 1400)
|
|
||||||
#define npy_fseek _fseeki64
|
|
||||||
#define npy_ftell _ftelli64
|
|
||||||
#else
|
|
||||||
#define npy_fseek fseek
|
|
||||||
#define npy_ftell ftell
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* enums for detected endianness */
|
|
||||||
enum {
|
|
||||||
NPY_CPU_UNKNOWN_ENDIAN,
|
|
||||||
NPY_CPU_LITTLE,
|
|
||||||
NPY_CPU_BIG
|
|
||||||
};
|
|
||||||
|
|
||||||
/*
|
|
||||||
* This is to typedef npy_intp to the appropriate pointer size for
|
|
||||||
* this platform. Py_intptr_t, Py_uintptr_t are defined in pyport.h.
|
|
||||||
*/
|
|
||||||
typedef Py_intptr_t npy_intp;
|
|
||||||
typedef Py_uintptr_t npy_uintp;
|
|
||||||
#define NPY_SIZEOF_CHAR 1
|
|
||||||
#define NPY_SIZEOF_BYTE 1
|
|
||||||
#define NPY_SIZEOF_INTP NPY_SIZEOF_PY_INTPTR_T
|
|
||||||
#define NPY_SIZEOF_UINTP NPY_SIZEOF_PY_INTPTR_T
|
|
||||||
#define NPY_SIZEOF_CFLOAT NPY_SIZEOF_COMPLEX_FLOAT
|
|
||||||
#define NPY_SIZEOF_CDOUBLE NPY_SIZEOF_COMPLEX_DOUBLE
|
|
||||||
#define NPY_SIZEOF_CLONGDOUBLE NPY_SIZEOF_COMPLEX_LONGDOUBLE
|
|
||||||
|
|
||||||
#ifdef constchar
|
|
||||||
#undef constchar
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if (PY_VERSION_HEX < 0x02050000)
|
|
||||||
#ifndef PY_SSIZE_T_MIN
|
|
||||||
typedef int Py_ssize_t;
|
|
||||||
#define PY_SSIZE_T_MAX INT_MAX
|
|
||||||
#define PY_SSIZE_T_MIN INT_MIN
|
|
||||||
#endif
|
|
||||||
#define NPY_SSIZE_T_PYFMT "i"
|
|
||||||
#define constchar const char
|
|
||||||
#else
|
|
||||||
#define NPY_SSIZE_T_PYFMT "n"
|
|
||||||
#define constchar char
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* NPY_INTP_FMT Note:
|
|
||||||
* Unlike the other NPY_*_FMT macros which are used with
|
|
||||||
* PyOS_snprintf, NPY_INTP_FMT is used with PyErr_Format and
|
|
||||||
* PyString_Format. These functions use different formatting
|
|
||||||
* codes which are portably specified according to the Python
|
|
||||||
* documentation. See ticket #1795.
|
|
||||||
*
|
|
||||||
* On Windows x64, the LONGLONG formatter should be used, but
|
|
||||||
* in Python 2.6 the %lld formatter is not supported. In this
|
|
||||||
* case we work around the problem by using the %zd formatter.
|
|
||||||
*/
|
|
||||||
#if NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_INT
|
|
||||||
#define NPY_INTP NPY_INT
|
|
||||||
#define NPY_UINTP NPY_UINT
|
|
||||||
#define PyIntpArrType_Type PyIntArrType_Type
|
|
||||||
#define PyUIntpArrType_Type PyUIntArrType_Type
|
|
||||||
#define NPY_MAX_INTP NPY_MAX_INT
|
|
||||||
#define NPY_MIN_INTP NPY_MIN_INT
|
|
||||||
#define NPY_MAX_UINTP NPY_MAX_UINT
|
|
||||||
#define NPY_INTP_FMT "d"
|
|
||||||
#elif NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONG
|
|
||||||
#define NPY_INTP NPY_LONG
|
|
||||||
#define NPY_UINTP NPY_ULONG
|
|
||||||
#define PyIntpArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUIntpArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_MAX_INTP NPY_MAX_LONG
|
|
||||||
#define NPY_MIN_INTP NPY_MIN_LONG
|
|
||||||
#define NPY_MAX_UINTP NPY_MAX_ULONG
|
|
||||||
#define NPY_INTP_FMT "ld"
|
|
||||||
#elif defined(PY_LONG_LONG) && (NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONGLONG)
|
|
||||||
#define NPY_INTP NPY_LONGLONG
|
|
||||||
#define NPY_UINTP NPY_ULONGLONG
|
|
||||||
#define PyIntpArrType_Type PyLongLongArrType_Type
|
|
||||||
#define PyUIntpArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_MAX_INTP NPY_MAX_LONGLONG
|
|
||||||
#define NPY_MIN_INTP NPY_MIN_LONGLONG
|
|
||||||
#define NPY_MAX_UINTP NPY_MAX_ULONGLONG
|
|
||||||
#if (PY_VERSION_HEX >= 0x02070000)
|
|
||||||
#define NPY_INTP_FMT "lld"
|
|
||||||
#else
|
|
||||||
#define NPY_INTP_FMT "zd"
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* We can only use C99 formats for npy_int_p if it is the same as
|
|
||||||
* intp_t, hence the condition on HAVE_UNITPTR_T
|
|
||||||
*/
|
|
||||||
#if (NPY_USE_C99_FORMATS) == 1 \
|
|
||||||
&& (defined HAVE_UINTPTR_T) \
|
|
||||||
&& (defined HAVE_INTTYPES_H)
|
|
||||||
#include <inttypes.h>
|
|
||||||
#undef NPY_INTP_FMT
|
|
||||||
#define NPY_INTP_FMT PRIdPTR
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Some platforms don't define bool, long long, or long double.
|
|
||||||
* Handle that here.
|
|
||||||
*/
|
|
||||||
#define NPY_BYTE_FMT "hhd"
|
|
||||||
#define NPY_UBYTE_FMT "hhu"
|
|
||||||
#define NPY_SHORT_FMT "hd"
|
|
||||||
#define NPY_USHORT_FMT "hu"
|
|
||||||
#define NPY_INT_FMT "d"
|
|
||||||
#define NPY_UINT_FMT "u"
|
|
||||||
#define NPY_LONG_FMT "ld"
|
|
||||||
#define NPY_ULONG_FMT "lu"
|
|
||||||
#define NPY_HALF_FMT "g"
|
|
||||||
#define NPY_FLOAT_FMT "g"
|
|
||||||
#define NPY_DOUBLE_FMT "g"
|
|
||||||
|
|
||||||
|
|
||||||
#ifdef PY_LONG_LONG
|
|
||||||
typedef PY_LONG_LONG npy_longlong;
|
|
||||||
typedef unsigned PY_LONG_LONG npy_ulonglong;
|
|
||||||
# ifdef _MSC_VER
|
|
||||||
# define NPY_LONGLONG_FMT "I64d"
|
|
||||||
# define NPY_ULONGLONG_FMT "I64u"
|
|
||||||
# elif defined(__APPLE__) || defined(__FreeBSD__)
|
|
||||||
/* "%Ld" only parses 4 bytes -- "L" is floating modifier on MacOS X/BSD */
|
|
||||||
# define NPY_LONGLONG_FMT "lld"
|
|
||||||
# define NPY_ULONGLONG_FMT "llu"
|
|
||||||
/*
|
|
||||||
another possible variant -- *quad_t works on *BSD, but is deprecated:
|
|
||||||
#define LONGLONG_FMT "qd"
|
|
||||||
#define ULONGLONG_FMT "qu"
|
|
||||||
*/
|
|
||||||
# else
|
|
||||||
# define NPY_LONGLONG_FMT "Ld"
|
|
||||||
# define NPY_ULONGLONG_FMT "Lu"
|
|
||||||
# endif
|
|
||||||
# ifdef _MSC_VER
|
|
||||||
# define NPY_LONGLONG_SUFFIX(x) (x##i64)
|
|
||||||
# define NPY_ULONGLONG_SUFFIX(x) (x##Ui64)
|
|
||||||
# else
|
|
||||||
# define NPY_LONGLONG_SUFFIX(x) (x##LL)
|
|
||||||
# define NPY_ULONGLONG_SUFFIX(x) (x##ULL)
|
|
||||||
# endif
|
|
||||||
#else
|
|
||||||
typedef long npy_longlong;
|
|
||||||
typedef unsigned long npy_ulonglong;
|
|
||||||
# define NPY_LONGLONG_SUFFIX(x) (x##L)
|
|
||||||
# define NPY_ULONGLONG_SUFFIX(x) (x##UL)
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
typedef unsigned char npy_bool;
|
|
||||||
#define NPY_FALSE 0
|
|
||||||
#define NPY_TRUE 1
|
|
||||||
|
|
||||||
|
|
||||||
#if NPY_SIZEOF_LONGDOUBLE == NPY_SIZEOF_DOUBLE
|
|
||||||
typedef double npy_longdouble;
|
|
||||||
#define NPY_LONGDOUBLE_FMT "g"
|
|
||||||
#else
|
|
||||||
typedef long double npy_longdouble;
|
|
||||||
#define NPY_LONGDOUBLE_FMT "Lg"
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifndef Py_USING_UNICODE
|
|
||||||
#error Must use Python with unicode enabled.
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
typedef signed char npy_byte;
|
|
||||||
typedef unsigned char npy_ubyte;
|
|
||||||
typedef unsigned short npy_ushort;
|
|
||||||
typedef unsigned int npy_uint;
|
|
||||||
typedef unsigned long npy_ulong;
|
|
||||||
|
|
||||||
/* These are for completeness */
|
|
||||||
typedef char npy_char;
|
|
||||||
typedef short npy_short;
|
|
||||||
typedef int npy_int;
|
|
||||||
typedef long npy_long;
|
|
||||||
typedef float npy_float;
|
|
||||||
typedef double npy_double;
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Disabling C99 complex usage: a lot of C code in numpy/scipy rely on being
|
|
||||||
* able to do .real/.imag. Will have to convert code first.
|
|
||||||
*/
|
|
||||||
#if 0
|
|
||||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_DOUBLE)
|
|
||||||
typedef complex npy_cdouble;
|
|
||||||
#else
|
|
||||||
typedef struct { double real, imag; } npy_cdouble;
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_FLOAT)
|
|
||||||
typedef complex float npy_cfloat;
|
|
||||||
#else
|
|
||||||
typedef struct { float real, imag; } npy_cfloat;
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_LONG_DOUBLE)
|
|
||||||
typedef complex long double npy_clongdouble;
|
|
||||||
#else
|
|
||||||
typedef struct {npy_longdouble real, imag;} npy_clongdouble;
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
#if NPY_SIZEOF_COMPLEX_DOUBLE != 2 * NPY_SIZEOF_DOUBLE
|
|
||||||
#error npy_cdouble definition is not compatible with C99 complex definition ! \
|
|
||||||
Please contact Numpy maintainers and give detailed information about your \
|
|
||||||
compiler and platform
|
|
||||||
#endif
|
|
||||||
typedef struct { double real, imag; } npy_cdouble;
|
|
||||||
|
|
||||||
#if NPY_SIZEOF_COMPLEX_FLOAT != 2 * NPY_SIZEOF_FLOAT
|
|
||||||
#error npy_cfloat definition is not compatible with C99 complex definition ! \
|
|
||||||
Please contact Numpy maintainers and give detailed information about your \
|
|
||||||
compiler and platform
|
|
||||||
#endif
|
|
||||||
typedef struct { float real, imag; } npy_cfloat;
|
|
||||||
|
|
||||||
#if NPY_SIZEOF_COMPLEX_LONGDOUBLE != 2 * NPY_SIZEOF_LONGDOUBLE
|
|
||||||
#error npy_clongdouble definition is not compatible with C99 complex definition ! \
|
|
||||||
Please contact Numpy maintainers and give detailed information about your \
|
|
||||||
compiler and platform
|
|
||||||
#endif
|
|
||||||
typedef struct { npy_longdouble real, imag; } npy_clongdouble;
|
|
||||||
|
|
||||||
/*
|
|
||||||
* numarray-style bit-width typedefs
|
|
||||||
*/
|
|
||||||
#define NPY_MAX_INT8 127
|
|
||||||
#define NPY_MIN_INT8 -128
|
|
||||||
#define NPY_MAX_UINT8 255
|
|
||||||
#define NPY_MAX_INT16 32767
|
|
||||||
#define NPY_MIN_INT16 -32768
|
|
||||||
#define NPY_MAX_UINT16 65535
|
|
||||||
#define NPY_MAX_INT32 2147483647
|
|
||||||
#define NPY_MIN_INT32 (-NPY_MAX_INT32 - 1)
|
|
||||||
#define NPY_MAX_UINT32 4294967295U
|
|
||||||
#define NPY_MAX_INT64 NPY_LONGLONG_SUFFIX(9223372036854775807)
|
|
||||||
#define NPY_MIN_INT64 (-NPY_MAX_INT64 - NPY_LONGLONG_SUFFIX(1))
|
|
||||||
#define NPY_MAX_UINT64 NPY_ULONGLONG_SUFFIX(18446744073709551615)
|
|
||||||
#define NPY_MAX_INT128 NPY_LONGLONG_SUFFIX(85070591730234615865843651857942052864)
|
|
||||||
#define NPY_MIN_INT128 (-NPY_MAX_INT128 - NPY_LONGLONG_SUFFIX(1))
|
|
||||||
#define NPY_MAX_UINT128 NPY_ULONGLONG_SUFFIX(170141183460469231731687303715884105728)
|
|
||||||
#define NPY_MAX_INT256 NPY_LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967)
|
|
||||||
#define NPY_MIN_INT256 (-NPY_MAX_INT256 - NPY_LONGLONG_SUFFIX(1))
|
|
||||||
#define NPY_MAX_UINT256 NPY_ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935)
|
|
||||||
#define NPY_MIN_DATETIME NPY_MIN_INT64
|
|
||||||
#define NPY_MAX_DATETIME NPY_MAX_INT64
|
|
||||||
#define NPY_MIN_TIMEDELTA NPY_MIN_INT64
|
|
||||||
#define NPY_MAX_TIMEDELTA NPY_MAX_INT64
|
|
||||||
|
|
||||||
/* Need to find the number of bits for each type and
|
|
||||||
make definitions accordingly.
|
|
||||||
|
|
||||||
C states that sizeof(char) == 1 by definition
|
|
||||||
|
|
||||||
So, just using the sizeof keyword won't help.
|
|
||||||
|
|
||||||
It also looks like Python itself uses sizeof(char) quite a
|
|
||||||
bit, which by definition should be 1 all the time.
|
|
||||||
|
|
||||||
Idea: Make Use of CHAR_BIT which should tell us how many
|
|
||||||
BITS per CHARACTER
|
|
||||||
*/
|
|
||||||
|
|
||||||
/* Include platform definitions -- These are in the C89/90 standard */
|
|
||||||
#include <limits.h>
|
|
||||||
#define NPY_MAX_BYTE SCHAR_MAX
|
|
||||||
#define NPY_MIN_BYTE SCHAR_MIN
|
|
||||||
#define NPY_MAX_UBYTE UCHAR_MAX
|
|
||||||
#define NPY_MAX_SHORT SHRT_MAX
|
|
||||||
#define NPY_MIN_SHORT SHRT_MIN
|
|
||||||
#define NPY_MAX_USHORT USHRT_MAX
|
|
||||||
#define NPY_MAX_INT INT_MAX
|
|
||||||
#ifndef INT_MIN
|
|
||||||
#define INT_MIN (-INT_MAX - 1)
|
|
||||||
#endif
|
|
||||||
#define NPY_MIN_INT INT_MIN
|
|
||||||
#define NPY_MAX_UINT UINT_MAX
|
|
||||||
#define NPY_MAX_LONG LONG_MAX
|
|
||||||
#define NPY_MIN_LONG LONG_MIN
|
|
||||||
#define NPY_MAX_ULONG ULONG_MAX
|
|
||||||
|
|
||||||
#define NPY_SIZEOF_HALF 2
|
|
||||||
#define NPY_SIZEOF_DATETIME 8
|
|
||||||
#define NPY_SIZEOF_TIMEDELTA 8
|
|
||||||
|
|
||||||
#define NPY_BITSOF_BOOL (sizeof(npy_bool) * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_CHAR CHAR_BIT
|
|
||||||
#define NPY_BITSOF_BYTE (NPY_SIZEOF_BYTE * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_SHORT (NPY_SIZEOF_SHORT * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_INT (NPY_SIZEOF_INT * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_LONG (NPY_SIZEOF_LONG * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_LONGLONG (NPY_SIZEOF_LONGLONG * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_INTP (NPY_SIZEOF_INTP * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_HALF (NPY_SIZEOF_HALF * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_FLOAT (NPY_SIZEOF_FLOAT * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_DOUBLE (NPY_SIZEOF_DOUBLE * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_LONGDOUBLE (NPY_SIZEOF_LONGDOUBLE * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_CFLOAT (NPY_SIZEOF_CFLOAT * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_CDOUBLE (NPY_SIZEOF_CDOUBLE * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_CLONGDOUBLE (NPY_SIZEOF_CLONGDOUBLE * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_DATETIME (NPY_SIZEOF_DATETIME * CHAR_BIT)
|
|
||||||
#define NPY_BITSOF_TIMEDELTA (NPY_SIZEOF_TIMEDELTA * CHAR_BIT)
|
|
||||||
|
|
||||||
#if NPY_BITSOF_LONG == 8
|
|
||||||
#define NPY_INT8 NPY_LONG
|
|
||||||
#define NPY_UINT8 NPY_ULONG
|
|
||||||
typedef long npy_int8;
|
|
||||||
typedef unsigned long npy_uint8;
|
|
||||||
#define PyInt8ScalarObject PyLongScalarObject
|
|
||||||
#define PyInt8ArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUInt8ScalarObject PyULongScalarObject
|
|
||||||
#define PyUInt8ArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_INT8_FMT NPY_LONG_FMT
|
|
||||||
#define NPY_UINT8_FMT NPY_ULONG_FMT
|
|
||||||
#elif NPY_BITSOF_LONG == 16
|
|
||||||
#define NPY_INT16 NPY_LONG
|
|
||||||
#define NPY_UINT16 NPY_ULONG
|
|
||||||
typedef long npy_int16;
|
|
||||||
typedef unsigned long npy_uint16;
|
|
||||||
#define PyInt16ScalarObject PyLongScalarObject
|
|
||||||
#define PyInt16ArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUInt16ScalarObject PyULongScalarObject
|
|
||||||
#define PyUInt16ArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_INT16_FMT NPY_LONG_FMT
|
|
||||||
#define NPY_UINT16_FMT NPY_ULONG_FMT
|
|
||||||
#elif NPY_BITSOF_LONG == 32
|
|
||||||
#define NPY_INT32 NPY_LONG
|
|
||||||
#define NPY_UINT32 NPY_ULONG
|
|
||||||
typedef long npy_int32;
|
|
||||||
typedef unsigned long npy_uint32;
|
|
||||||
typedef unsigned long npy_ucs4;
|
|
||||||
#define PyInt32ScalarObject PyLongScalarObject
|
|
||||||
#define PyInt32ArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUInt32ScalarObject PyULongScalarObject
|
|
||||||
#define PyUInt32ArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_INT32_FMT NPY_LONG_FMT
|
|
||||||
#define NPY_UINT32_FMT NPY_ULONG_FMT
|
|
||||||
#elif NPY_BITSOF_LONG == 64
|
|
||||||
#define NPY_INT64 NPY_LONG
|
|
||||||
#define NPY_UINT64 NPY_ULONG
|
|
||||||
typedef long npy_int64;
|
|
||||||
typedef unsigned long npy_uint64;
|
|
||||||
#define PyInt64ScalarObject PyLongScalarObject
|
|
||||||
#define PyInt64ArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUInt64ScalarObject PyULongScalarObject
|
|
||||||
#define PyUInt64ArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_INT64_FMT NPY_LONG_FMT
|
|
||||||
#define NPY_UINT64_FMT NPY_ULONG_FMT
|
|
||||||
#define MyPyLong_FromInt64 PyLong_FromLong
|
|
||||||
#define MyPyLong_AsInt64 PyLong_AsLong
|
|
||||||
#elif NPY_BITSOF_LONG == 128
|
|
||||||
#define NPY_INT128 NPY_LONG
|
|
||||||
#define NPY_UINT128 NPY_ULONG
|
|
||||||
typedef long npy_int128;
|
|
||||||
typedef unsigned long npy_uint128;
|
|
||||||
#define PyInt128ScalarObject PyLongScalarObject
|
|
||||||
#define PyInt128ArrType_Type PyLongArrType_Type
|
|
||||||
#define PyUInt128ScalarObject PyULongScalarObject
|
|
||||||
#define PyUInt128ArrType_Type PyULongArrType_Type
|
|
||||||
#define NPY_INT128_FMT NPY_LONG_FMT
|
|
||||||
#define NPY_UINT128_FMT NPY_ULONG_FMT
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if NPY_BITSOF_LONGLONG == 8
|
|
||||||
# ifndef NPY_INT8
|
|
||||||
# define NPY_INT8 NPY_LONGLONG
|
|
||||||
# define NPY_UINT8 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int8;
|
|
||||||
typedef npy_ulonglong npy_uint8;
|
|
||||||
# define PyInt8ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt8ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt8ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt8ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT8_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT8_FMT NPY_ULONGLONG_FMT
|
|
||||||
# endif
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT8
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT8
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT8
|
|
||||||
#elif NPY_BITSOF_LONGLONG == 16
|
|
||||||
# ifndef NPY_INT16
|
|
||||||
# define NPY_INT16 NPY_LONGLONG
|
|
||||||
# define NPY_UINT16 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int16;
|
|
||||||
typedef npy_ulonglong npy_uint16;
|
|
||||||
# define PyInt16ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt16ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt16ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt16ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT16_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT16_FMT NPY_ULONGLONG_FMT
|
|
||||||
# endif
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT16
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT16
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT16
|
|
||||||
#elif NPY_BITSOF_LONGLONG == 32
|
|
||||||
# ifndef NPY_INT32
|
|
||||||
# define NPY_INT32 NPY_LONGLONG
|
|
||||||
# define NPY_UINT32 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int32;
|
|
||||||
typedef npy_ulonglong npy_uint32;
|
|
||||||
typedef npy_ulonglong npy_ucs4;
|
|
||||||
# define PyInt32ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt32ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt32ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt32ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT32_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT32_FMT NPY_ULONGLONG_FMT
|
|
||||||
# endif
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT32
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT32
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT32
|
|
||||||
#elif NPY_BITSOF_LONGLONG == 64
|
|
||||||
# ifndef NPY_INT64
|
|
||||||
# define NPY_INT64 NPY_LONGLONG
|
|
||||||
# define NPY_UINT64 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int64;
|
|
||||||
typedef npy_ulonglong npy_uint64;
|
|
||||||
# define PyInt64ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt64ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt64ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt64ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT64_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT64_FMT NPY_ULONGLONG_FMT
|
|
||||||
# define MyPyLong_FromInt64 PyLong_FromLongLong
|
|
||||||
# define MyPyLong_AsInt64 PyLong_AsLongLong
|
|
||||||
# endif
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT64
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT64
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT64
|
|
||||||
#elif NPY_BITSOF_LONGLONG == 128
|
|
||||||
# ifndef NPY_INT128
|
|
||||||
# define NPY_INT128 NPY_LONGLONG
|
|
||||||
# define NPY_UINT128 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int128;
|
|
||||||
typedef npy_ulonglong npy_uint128;
|
|
||||||
# define PyInt128ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt128ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt128ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt128ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT128_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT128_FMT NPY_ULONGLONG_FMT
|
|
||||||
# endif
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT128
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT128
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT128
|
|
||||||
#elif NPY_BITSOF_LONGLONG == 256
|
|
||||||
# define NPY_INT256 NPY_LONGLONG
|
|
||||||
# define NPY_UINT256 NPY_ULONGLONG
|
|
||||||
typedef npy_longlong npy_int256;
|
|
||||||
typedef npy_ulonglong npy_uint256;
|
|
||||||
# define PyInt256ScalarObject PyLongLongScalarObject
|
|
||||||
# define PyInt256ArrType_Type PyLongLongArrType_Type
|
|
||||||
# define PyUInt256ScalarObject PyULongLongScalarObject
|
|
||||||
# define PyUInt256ArrType_Type PyULongLongArrType_Type
|
|
||||||
#define NPY_INT256_FMT NPY_LONGLONG_FMT
|
|
||||||
#define NPY_UINT256_FMT NPY_ULONGLONG_FMT
|
|
||||||
# define NPY_MAX_LONGLONG NPY_MAX_INT256
|
|
||||||
# define NPY_MIN_LONGLONG NPY_MIN_INT256
|
|
||||||
# define NPY_MAX_ULONGLONG NPY_MAX_UINT256
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if NPY_BITSOF_INT == 8
|
|
||||||
#ifndef NPY_INT8
|
|
||||||
#define NPY_INT8 NPY_INT
|
|
||||||
#define NPY_UINT8 NPY_UINT
|
|
||||||
typedef int npy_int8;
|
|
||||||
typedef unsigned int npy_uint8;
|
|
||||||
# define PyInt8ScalarObject PyIntScalarObject
|
|
||||||
# define PyInt8ArrType_Type PyIntArrType_Type
|
|
||||||
# define PyUInt8ScalarObject PyUIntScalarObject
|
|
||||||
# define PyUInt8ArrType_Type PyUIntArrType_Type
|
|
||||||
#define NPY_INT8_FMT NPY_INT_FMT
|
|
||||||
#define NPY_UINT8_FMT NPY_UINT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_INT == 16
|
|
||||||
#ifndef NPY_INT16
|
|
||||||
#define NPY_INT16 NPY_INT
|
|
||||||
#define NPY_UINT16 NPY_UINT
|
|
||||||
typedef int npy_int16;
|
|
||||||
typedef unsigned int npy_uint16;
|
|
||||||
# define PyInt16ScalarObject PyIntScalarObject
|
|
||||||
# define PyInt16ArrType_Type PyIntArrType_Type
|
|
||||||
# define PyUInt16ScalarObject PyIntUScalarObject
|
|
||||||
# define PyUInt16ArrType_Type PyIntUArrType_Type
|
|
||||||
#define NPY_INT16_FMT NPY_INT_FMT
|
|
||||||
#define NPY_UINT16_FMT NPY_UINT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_INT == 32
|
|
||||||
#ifndef NPY_INT32
|
|
||||||
#define NPY_INT32 NPY_INT
|
|
||||||
#define NPY_UINT32 NPY_UINT
|
|
||||||
typedef int npy_int32;
|
|
||||||
typedef unsigned int npy_uint32;
|
|
||||||
typedef unsigned int npy_ucs4;
|
|
||||||
# define PyInt32ScalarObject PyIntScalarObject
|
|
||||||
# define PyInt32ArrType_Type PyIntArrType_Type
|
|
||||||
# define PyUInt32ScalarObject PyUIntScalarObject
|
|
||||||
# define PyUInt32ArrType_Type PyUIntArrType_Type
|
|
||||||
#define NPY_INT32_FMT NPY_INT_FMT
|
|
||||||
#define NPY_UINT32_FMT NPY_UINT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_INT == 64
|
|
||||||
#ifndef NPY_INT64
|
|
||||||
#define NPY_INT64 NPY_INT
|
|
||||||
#define NPY_UINT64 NPY_UINT
|
|
||||||
typedef int npy_int64;
|
|
||||||
typedef unsigned int npy_uint64;
|
|
||||||
# define PyInt64ScalarObject PyIntScalarObject
|
|
||||||
# define PyInt64ArrType_Type PyIntArrType_Type
|
|
||||||
# define PyUInt64ScalarObject PyUIntScalarObject
|
|
||||||
# define PyUInt64ArrType_Type PyUIntArrType_Type
|
|
||||||
#define NPY_INT64_FMT NPY_INT_FMT
|
|
||||||
#define NPY_UINT64_FMT NPY_UINT_FMT
|
|
||||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
|
||||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_INT == 128
|
|
||||||
#ifndef NPY_INT128
|
|
||||||
#define NPY_INT128 NPY_INT
|
|
||||||
#define NPY_UINT128 NPY_UINT
|
|
||||||
typedef int npy_int128;
|
|
||||||
typedef unsigned int npy_uint128;
|
|
||||||
# define PyInt128ScalarObject PyIntScalarObject
|
|
||||||
# define PyInt128ArrType_Type PyIntArrType_Type
|
|
||||||
# define PyUInt128ScalarObject PyUIntScalarObject
|
|
||||||
# define PyUInt128ArrType_Type PyUIntArrType_Type
|
|
||||||
#define NPY_INT128_FMT NPY_INT_FMT
|
|
||||||
#define NPY_UINT128_FMT NPY_UINT_FMT
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if NPY_BITSOF_SHORT == 8
|
|
||||||
#ifndef NPY_INT8
|
|
||||||
#define NPY_INT8 NPY_SHORT
|
|
||||||
#define NPY_UINT8 NPY_USHORT
|
|
||||||
typedef short npy_int8;
|
|
||||||
typedef unsigned short npy_uint8;
|
|
||||||
# define PyInt8ScalarObject PyShortScalarObject
|
|
||||||
# define PyInt8ArrType_Type PyShortArrType_Type
|
|
||||||
# define PyUInt8ScalarObject PyUShortScalarObject
|
|
||||||
# define PyUInt8ArrType_Type PyUShortArrType_Type
|
|
||||||
#define NPY_INT8_FMT NPY_SHORT_FMT
|
|
||||||
#define NPY_UINT8_FMT NPY_USHORT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_SHORT == 16
|
|
||||||
#ifndef NPY_INT16
|
|
||||||
#define NPY_INT16 NPY_SHORT
|
|
||||||
#define NPY_UINT16 NPY_USHORT
|
|
||||||
typedef short npy_int16;
|
|
||||||
typedef unsigned short npy_uint16;
|
|
||||||
# define PyInt16ScalarObject PyShortScalarObject
|
|
||||||
# define PyInt16ArrType_Type PyShortArrType_Type
|
|
||||||
# define PyUInt16ScalarObject PyUShortScalarObject
|
|
||||||
# define PyUInt16ArrType_Type PyUShortArrType_Type
|
|
||||||
#define NPY_INT16_FMT NPY_SHORT_FMT
|
|
||||||
#define NPY_UINT16_FMT NPY_USHORT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_SHORT == 32
|
|
||||||
#ifndef NPY_INT32
|
|
||||||
#define NPY_INT32 NPY_SHORT
|
|
||||||
#define NPY_UINT32 NPY_USHORT
|
|
||||||
typedef short npy_int32;
|
|
||||||
typedef unsigned short npy_uint32;
|
|
||||||
typedef unsigned short npy_ucs4;
|
|
||||||
# define PyInt32ScalarObject PyShortScalarObject
|
|
||||||
# define PyInt32ArrType_Type PyShortArrType_Type
|
|
||||||
# define PyUInt32ScalarObject PyUShortScalarObject
|
|
||||||
# define PyUInt32ArrType_Type PyUShortArrType_Type
|
|
||||||
#define NPY_INT32_FMT NPY_SHORT_FMT
|
|
||||||
#define NPY_UINT32_FMT NPY_USHORT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_SHORT == 64
|
|
||||||
#ifndef NPY_INT64
|
|
||||||
#define NPY_INT64 NPY_SHORT
|
|
||||||
#define NPY_UINT64 NPY_USHORT
|
|
||||||
typedef short npy_int64;
|
|
||||||
typedef unsigned short npy_uint64;
|
|
||||||
# define PyInt64ScalarObject PyShortScalarObject
|
|
||||||
# define PyInt64ArrType_Type PyShortArrType_Type
|
|
||||||
# define PyUInt64ScalarObject PyUShortScalarObject
|
|
||||||
# define PyUInt64ArrType_Type PyUShortArrType_Type
|
|
||||||
#define NPY_INT64_FMT NPY_SHORT_FMT
|
|
||||||
#define NPY_UINT64_FMT NPY_USHORT_FMT
|
|
||||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
|
||||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_SHORT == 128
|
|
||||||
#ifndef NPY_INT128
|
|
||||||
#define NPY_INT128 NPY_SHORT
|
|
||||||
#define NPY_UINT128 NPY_USHORT
|
|
||||||
typedef short npy_int128;
|
|
||||||
typedef unsigned short npy_uint128;
|
|
||||||
# define PyInt128ScalarObject PyShortScalarObject
|
|
||||||
# define PyInt128ArrType_Type PyShortArrType_Type
|
|
||||||
# define PyUInt128ScalarObject PyUShortScalarObject
|
|
||||||
# define PyUInt128ArrType_Type PyUShortArrType_Type
|
|
||||||
#define NPY_INT128_FMT NPY_SHORT_FMT
|
|
||||||
#define NPY_UINT128_FMT NPY_USHORT_FMT
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
#if NPY_BITSOF_CHAR == 8
|
|
||||||
#ifndef NPY_INT8
|
|
||||||
#define NPY_INT8 NPY_BYTE
|
|
||||||
#define NPY_UINT8 NPY_UBYTE
|
|
||||||
typedef signed char npy_int8;
|
|
||||||
typedef unsigned char npy_uint8;
|
|
||||||
# define PyInt8ScalarObject PyByteScalarObject
|
|
||||||
# define PyInt8ArrType_Type PyByteArrType_Type
|
|
||||||
# define PyUInt8ScalarObject PyUByteScalarObject
|
|
||||||
# define PyUInt8ArrType_Type PyUByteArrType_Type
|
|
||||||
#define NPY_INT8_FMT NPY_BYTE_FMT
|
|
||||||
#define NPY_UINT8_FMT NPY_UBYTE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_CHAR == 16
|
|
||||||
#ifndef NPY_INT16
|
|
||||||
#define NPY_INT16 NPY_BYTE
|
|
||||||
#define NPY_UINT16 NPY_UBYTE
|
|
||||||
typedef signed char npy_int16;
|
|
||||||
typedef unsigned char npy_uint16;
|
|
||||||
# define PyInt16ScalarObject PyByteScalarObject
|
|
||||||
# define PyInt16ArrType_Type PyByteArrType_Type
|
|
||||||
# define PyUInt16ScalarObject PyUByteScalarObject
|
|
||||||
# define PyUInt16ArrType_Type PyUByteArrType_Type
|
|
||||||
#define NPY_INT16_FMT NPY_BYTE_FMT
|
|
||||||
#define NPY_UINT16_FMT NPY_UBYTE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_CHAR == 32
|
|
||||||
#ifndef NPY_INT32
|
|
||||||
#define NPY_INT32 NPY_BYTE
|
|
||||||
#define NPY_UINT32 NPY_UBYTE
|
|
||||||
typedef signed char npy_int32;
|
|
||||||
typedef unsigned char npy_uint32;
|
|
||||||
typedef unsigned char npy_ucs4;
|
|
||||||
# define PyInt32ScalarObject PyByteScalarObject
|
|
||||||
# define PyInt32ArrType_Type PyByteArrType_Type
|
|
||||||
# define PyUInt32ScalarObject PyUByteScalarObject
|
|
||||||
# define PyUInt32ArrType_Type PyUByteArrType_Type
|
|
||||||
#define NPY_INT32_FMT NPY_BYTE_FMT
|
|
||||||
#define NPY_UINT32_FMT NPY_UBYTE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_CHAR == 64
|
|
||||||
#ifndef NPY_INT64
|
|
||||||
#define NPY_INT64 NPY_BYTE
|
|
||||||
#define NPY_UINT64 NPY_UBYTE
|
|
||||||
typedef signed char npy_int64;
|
|
||||||
typedef unsigned char npy_uint64;
|
|
||||||
# define PyInt64ScalarObject PyByteScalarObject
|
|
||||||
# define PyInt64ArrType_Type PyByteArrType_Type
|
|
||||||
# define PyUInt64ScalarObject PyUByteScalarObject
|
|
||||||
# define PyUInt64ArrType_Type PyUByteArrType_Type
|
|
||||||
#define NPY_INT64_FMT NPY_BYTE_FMT
|
|
||||||
#define NPY_UINT64_FMT NPY_UBYTE_FMT
|
|
||||||
# define MyPyLong_FromInt64 PyLong_FromLong
|
|
||||||
# define MyPyLong_AsInt64 PyLong_AsLong
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_CHAR == 128
|
|
||||||
#ifndef NPY_INT128
|
|
||||||
#define NPY_INT128 NPY_BYTE
|
|
||||||
#define NPY_UINT128 NPY_UBYTE
|
|
||||||
typedef signed char npy_int128;
|
|
||||||
typedef unsigned char npy_uint128;
|
|
||||||
# define PyInt128ScalarObject PyByteScalarObject
|
|
||||||
# define PyInt128ArrType_Type PyByteArrType_Type
|
|
||||||
# define PyUInt128ScalarObject PyUByteScalarObject
|
|
||||||
# define PyUInt128ArrType_Type PyUByteArrType_Type
|
|
||||||
#define NPY_INT128_FMT NPY_BYTE_FMT
|
|
||||||
#define NPY_UINT128_FMT NPY_UBYTE_FMT
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#if NPY_BITSOF_DOUBLE == 32
|
|
||||||
#ifndef NPY_FLOAT32
|
|
||||||
#define NPY_FLOAT32 NPY_DOUBLE
|
|
||||||
#define NPY_COMPLEX64 NPY_CDOUBLE
|
|
||||||
typedef double npy_float32;
|
|
||||||
typedef npy_cdouble npy_complex64;
|
|
||||||
# define PyFloat32ScalarObject PyDoubleScalarObject
|
|
||||||
# define PyComplex64ScalarObject PyCDoubleScalarObject
|
|
||||||
# define PyFloat32ArrType_Type PyDoubleArrType_Type
|
|
||||||
# define PyComplex64ArrType_Type PyCDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT32_FMT NPY_DOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX64_FMT NPY_CDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_DOUBLE == 64
|
|
||||||
#ifndef NPY_FLOAT64
|
|
||||||
#define NPY_FLOAT64 NPY_DOUBLE
|
|
||||||
#define NPY_COMPLEX128 NPY_CDOUBLE
|
|
||||||
typedef double npy_float64;
|
|
||||||
typedef npy_cdouble npy_complex128;
|
|
||||||
# define PyFloat64ScalarObject PyDoubleScalarObject
|
|
||||||
# define PyComplex128ScalarObject PyCDoubleScalarObject
|
|
||||||
# define PyFloat64ArrType_Type PyDoubleArrType_Type
|
|
||||||
# define PyComplex128ArrType_Type PyCDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT64_FMT NPY_DOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX128_FMT NPY_CDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_DOUBLE == 80
|
|
||||||
#ifndef NPY_FLOAT80
|
|
||||||
#define NPY_FLOAT80 NPY_DOUBLE
|
|
||||||
#define NPY_COMPLEX160 NPY_CDOUBLE
|
|
||||||
typedef double npy_float80;
|
|
||||||
typedef npy_cdouble npy_complex160;
|
|
||||||
# define PyFloat80ScalarObject PyDoubleScalarObject
|
|
||||||
# define PyComplex160ScalarObject PyCDoubleScalarObject
|
|
||||||
# define PyFloat80ArrType_Type PyDoubleArrType_Type
|
|
||||||
# define PyComplex160ArrType_Type PyCDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT80_FMT NPY_DOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX160_FMT NPY_CDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_DOUBLE == 96
|
|
||||||
#ifndef NPY_FLOAT96
|
|
||||||
#define NPY_FLOAT96 NPY_DOUBLE
|
|
||||||
#define NPY_COMPLEX192 NPY_CDOUBLE
|
|
||||||
typedef double npy_float96;
|
|
||||||
typedef npy_cdouble npy_complex192;
|
|
||||||
# define PyFloat96ScalarObject PyDoubleScalarObject
|
|
||||||
# define PyComplex192ScalarObject PyCDoubleScalarObject
|
|
||||||
# define PyFloat96ArrType_Type PyDoubleArrType_Type
|
|
||||||
# define PyComplex192ArrType_Type PyCDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT96_FMT NPY_DOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX192_FMT NPY_CDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_DOUBLE == 128
|
|
||||||
#ifndef NPY_FLOAT128
|
|
||||||
#define NPY_FLOAT128 NPY_DOUBLE
|
|
||||||
#define NPY_COMPLEX256 NPY_CDOUBLE
|
|
||||||
typedef double npy_float128;
|
|
||||||
typedef npy_cdouble npy_complex256;
|
|
||||||
# define PyFloat128ScalarObject PyDoubleScalarObject
|
|
||||||
# define PyComplex256ScalarObject PyCDoubleScalarObject
|
|
||||||
# define PyFloat128ArrType_Type PyDoubleArrType_Type
|
|
||||||
# define PyComplex256ArrType_Type PyCDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT128_FMT NPY_DOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX256_FMT NPY_CDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#if NPY_BITSOF_FLOAT == 32
|
|
||||||
#ifndef NPY_FLOAT32
|
|
||||||
#define NPY_FLOAT32 NPY_FLOAT
|
|
||||||
#define NPY_COMPLEX64 NPY_CFLOAT
|
|
||||||
typedef float npy_float32;
|
|
||||||
typedef npy_cfloat npy_complex64;
|
|
||||||
# define PyFloat32ScalarObject PyFloatScalarObject
|
|
||||||
# define PyComplex64ScalarObject PyCFloatScalarObject
|
|
||||||
# define PyFloat32ArrType_Type PyFloatArrType_Type
|
|
||||||
# define PyComplex64ArrType_Type PyCFloatArrType_Type
|
|
||||||
#define NPY_FLOAT32_FMT NPY_FLOAT_FMT
|
|
||||||
#define NPY_COMPLEX64_FMT NPY_CFLOAT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_FLOAT == 64
|
|
||||||
#ifndef NPY_FLOAT64
|
|
||||||
#define NPY_FLOAT64 NPY_FLOAT
|
|
||||||
#define NPY_COMPLEX128 NPY_CFLOAT
|
|
||||||
typedef float npy_float64;
|
|
||||||
typedef npy_cfloat npy_complex128;
|
|
||||||
# define PyFloat64ScalarObject PyFloatScalarObject
|
|
||||||
# define PyComplex128ScalarObject PyCFloatScalarObject
|
|
||||||
# define PyFloat64ArrType_Type PyFloatArrType_Type
|
|
||||||
# define PyComplex128ArrType_Type PyCFloatArrType_Type
|
|
||||||
#define NPY_FLOAT64_FMT NPY_FLOAT_FMT
|
|
||||||
#define NPY_COMPLEX128_FMT NPY_CFLOAT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_FLOAT == 80
|
|
||||||
#ifndef NPY_FLOAT80
|
|
||||||
#define NPY_FLOAT80 NPY_FLOAT
|
|
||||||
#define NPY_COMPLEX160 NPY_CFLOAT
|
|
||||||
typedef float npy_float80;
|
|
||||||
typedef npy_cfloat npy_complex160;
|
|
||||||
# define PyFloat80ScalarObject PyFloatScalarObject
|
|
||||||
# define PyComplex160ScalarObject PyCFloatScalarObject
|
|
||||||
# define PyFloat80ArrType_Type PyFloatArrType_Type
|
|
||||||
# define PyComplex160ArrType_Type PyCFloatArrType_Type
|
|
||||||
#define NPY_FLOAT80_FMT NPY_FLOAT_FMT
|
|
||||||
#define NPY_COMPLEX160_FMT NPY_CFLOAT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_FLOAT == 96
|
|
||||||
#ifndef NPY_FLOAT96
|
|
||||||
#define NPY_FLOAT96 NPY_FLOAT
|
|
||||||
#define NPY_COMPLEX192 NPY_CFLOAT
|
|
||||||
typedef float npy_float96;
|
|
||||||
typedef npy_cfloat npy_complex192;
|
|
||||||
# define PyFloat96ScalarObject PyFloatScalarObject
|
|
||||||
# define PyComplex192ScalarObject PyCFloatScalarObject
|
|
||||||
# define PyFloat96ArrType_Type PyFloatArrType_Type
|
|
||||||
# define PyComplex192ArrType_Type PyCFloatArrType_Type
|
|
||||||
#define NPY_FLOAT96_FMT NPY_FLOAT_FMT
|
|
||||||
#define NPY_COMPLEX192_FMT NPY_CFLOAT_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_FLOAT == 128
|
|
||||||
#ifndef NPY_FLOAT128
|
|
||||||
#define NPY_FLOAT128 NPY_FLOAT
|
|
||||||
#define NPY_COMPLEX256 NPY_CFLOAT
|
|
||||||
typedef float npy_float128;
|
|
||||||
typedef npy_cfloat npy_complex256;
|
|
||||||
# define PyFloat128ScalarObject PyFloatScalarObject
|
|
||||||
# define PyComplex256ScalarObject PyCFloatScalarObject
|
|
||||||
# define PyFloat128ArrType_Type PyFloatArrType_Type
|
|
||||||
# define PyComplex256ArrType_Type PyCFloatArrType_Type
|
|
||||||
#define NPY_FLOAT128_FMT NPY_FLOAT_FMT
|
|
||||||
#define NPY_COMPLEX256_FMT NPY_CFLOAT_FMT
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* half/float16 isn't a floating-point type in C */
|
|
||||||
#define NPY_FLOAT16 NPY_HALF
|
|
||||||
typedef npy_uint16 npy_half;
|
|
||||||
typedef npy_half npy_float16;
|
|
||||||
|
|
||||||
#if NPY_BITSOF_LONGDOUBLE == 32
|
|
||||||
#ifndef NPY_FLOAT32
|
|
||||||
#define NPY_FLOAT32 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX64 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float32;
|
|
||||||
typedef npy_clongdouble npy_complex64;
|
|
||||||
# define PyFloat32ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex64ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat32ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex64ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT32_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX64_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_LONGDOUBLE == 64
|
|
||||||
#ifndef NPY_FLOAT64
|
|
||||||
#define NPY_FLOAT64 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX128 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float64;
|
|
||||||
typedef npy_clongdouble npy_complex128;
|
|
||||||
# define PyFloat64ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex128ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat64ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex128ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT64_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX128_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_LONGDOUBLE == 80
|
|
||||||
#ifndef NPY_FLOAT80
|
|
||||||
#define NPY_FLOAT80 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX160 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float80;
|
|
||||||
typedef npy_clongdouble npy_complex160;
|
|
||||||
# define PyFloat80ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex160ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat80ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex160ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT80_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX160_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_LONGDOUBLE == 96
|
|
||||||
#ifndef NPY_FLOAT96
|
|
||||||
#define NPY_FLOAT96 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX192 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float96;
|
|
||||||
typedef npy_clongdouble npy_complex192;
|
|
||||||
# define PyFloat96ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex192ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat96ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex192ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT96_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX192_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_LONGDOUBLE == 128
|
|
||||||
#ifndef NPY_FLOAT128
|
|
||||||
#define NPY_FLOAT128 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX256 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float128;
|
|
||||||
typedef npy_clongdouble npy_complex256;
|
|
||||||
# define PyFloat128ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex256ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat128ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex256ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT128_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX256_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
#elif NPY_BITSOF_LONGDOUBLE == 256
|
|
||||||
#define NPY_FLOAT256 NPY_LONGDOUBLE
|
|
||||||
#define NPY_COMPLEX512 NPY_CLONGDOUBLE
|
|
||||||
typedef npy_longdouble npy_float256;
|
|
||||||
typedef npy_clongdouble npy_complex512;
|
|
||||||
# define PyFloat256ScalarObject PyLongDoubleScalarObject
|
|
||||||
# define PyComplex512ScalarObject PyCLongDoubleScalarObject
|
|
||||||
# define PyFloat256ArrType_Type PyLongDoubleArrType_Type
|
|
||||||
# define PyComplex512ArrType_Type PyCLongDoubleArrType_Type
|
|
||||||
#define NPY_FLOAT256_FMT NPY_LONGDOUBLE_FMT
|
|
||||||
#define NPY_COMPLEX512_FMT NPY_CLONGDOUBLE_FMT
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* datetime typedefs */
|
|
||||||
typedef npy_int64 npy_timedelta;
|
|
||||||
typedef npy_int64 npy_datetime;
|
|
||||||
#define NPY_DATETIME_FMT NPY_INT64_FMT
|
|
||||||
#define NPY_TIMEDELTA_FMT NPY_INT64_FMT
|
|
||||||
|
|
||||||
/* End of typedefs for numarray style bit-width names */
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
|
@ -1,109 +0,0 @@
|
||||||
/*
|
|
||||||
* This set (target) cpu specific macros:
|
|
||||||
* - Possible values:
|
|
||||||
* NPY_CPU_X86
|
|
||||||
* NPY_CPU_AMD64
|
|
||||||
* NPY_CPU_PPC
|
|
||||||
* NPY_CPU_PPC64
|
|
||||||
* NPY_CPU_SPARC
|
|
||||||
* NPY_CPU_S390
|
|
||||||
* NPY_CPU_IA64
|
|
||||||
* NPY_CPU_HPPA
|
|
||||||
* NPY_CPU_ALPHA
|
|
||||||
* NPY_CPU_ARMEL
|
|
||||||
* NPY_CPU_ARMEB
|
|
||||||
* NPY_CPU_SH_LE
|
|
||||||
* NPY_CPU_SH_BE
|
|
||||||
*/
|
|
||||||
#ifndef _NPY_CPUARCH_H_
|
|
||||||
#define _NPY_CPUARCH_H_
|
|
||||||
|
|
||||||
#include "numpyconfig.h"
|
|
||||||
|
|
||||||
#if defined( __i386__ ) || defined(i386) || defined(_M_IX86)
|
|
||||||
/*
|
|
||||||
* __i386__ is defined by gcc and Intel compiler on Linux,
|
|
||||||
* _M_IX86 by VS compiler,
|
|
||||||
* i386 by Sun compilers on opensolaris at least
|
|
||||||
*/
|
|
||||||
#define NPY_CPU_X86
|
|
||||||
#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64)
|
|
||||||
/*
|
|
||||||
* both __x86_64__ and __amd64__ are defined by gcc
|
|
||||||
* __x86_64 defined by sun compiler on opensolaris at least
|
|
||||||
* _M_AMD64 defined by MS compiler
|
|
||||||
*/
|
|
||||||
#define NPY_CPU_AMD64
|
|
||||||
#elif defined(__ppc__) || defined(__powerpc__) || defined(_ARCH_PPC)
|
|
||||||
/*
|
|
||||||
* __ppc__ is defined by gcc, I remember having seen __powerpc__ once,
|
|
||||||
* but can't find it ATM
|
|
||||||
* _ARCH_PPC is used by at least gcc on AIX
|
|
||||||
*/
|
|
||||||
#define NPY_CPU_PPC
|
|
||||||
#elif defined(__ppc64__)
|
|
||||||
#define NPY_CPU_PPC64
|
|
||||||
#elif defined(__sparc__) || defined(__sparc)
|
|
||||||
/* __sparc__ is defined by gcc and Forte (e.g. Sun) compilers */
|
|
||||||
#define NPY_CPU_SPARC
|
|
||||||
#elif defined(__s390__)
|
|
||||||
#define NPY_CPU_S390
|
|
||||||
#elif defined(__ia64)
|
|
||||||
#define NPY_CPU_IA64
|
|
||||||
#elif defined(__hppa)
|
|
||||||
#define NPY_CPU_HPPA
|
|
||||||
#elif defined(__alpha__)
|
|
||||||
#define NPY_CPU_ALPHA
|
|
||||||
#elif defined(__arm__) && defined(__ARMEL__)
|
|
||||||
#define NPY_CPU_ARMEL
|
|
||||||
#elif defined(__arm__) && defined(__ARMEB__)
|
|
||||||
#define NPY_CPU_ARMEB
|
|
||||||
#elif defined(__sh__) && defined(__LITTLE_ENDIAN__)
|
|
||||||
#define NPY_CPU_SH_LE
|
|
||||||
#elif defined(__sh__) && defined(__BIG_ENDIAN__)
|
|
||||||
#define NPY_CPU_SH_BE
|
|
||||||
#elif defined(__MIPSEL__)
|
|
||||||
#define NPY_CPU_MIPSEL
|
|
||||||
#elif defined(__MIPSEB__)
|
|
||||||
#define NPY_CPU_MIPSEB
|
|
||||||
#elif defined(__aarch64__)
|
|
||||||
#define NPY_CPU_AARCH64
|
|
||||||
#else
|
|
||||||
#error Unknown CPU, please report this to numpy maintainers with \
|
|
||||||
information about your platform (OS, CPU and compiler)
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
This "white-lists" the architectures that we know don't require
|
|
||||||
pointer alignment. We white-list, since the memcpy version will
|
|
||||||
work everywhere, whereas assignment will only work where pointer
|
|
||||||
dereferencing doesn't require alignment.
|
|
||||||
|
|
||||||
TODO: There may be more architectures we can white list.
|
|
||||||
*/
|
|
||||||
#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64)
|
|
||||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) (*((PyObject **)(dst)) = *((PyObject **)(src)))
|
|
||||||
#else
|
|
||||||
#if NPY_SIZEOF_PY_INTPTR_T == 4
|
|
||||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
|
|
||||||
((char*)(dst))[0] = ((char*)(src))[0]; \
|
|
||||||
((char*)(dst))[1] = ((char*)(src))[1]; \
|
|
||||||
((char*)(dst))[2] = ((char*)(src))[2]; \
|
|
||||||
((char*)(dst))[3] = ((char*)(src))[3];
|
|
||||||
#elif NPY_SIZEOF_PY_INTPTR_T == 8
|
|
||||||
#define NPY_COPY_PYOBJECT_PTR(dst, src) \
|
|
||||||
((char*)(dst))[0] = ((char*)(src))[0]; \
|
|
||||||
((char*)(dst))[1] = ((char*)(src))[1]; \
|
|
||||||
((char*)(dst))[2] = ((char*)(src))[2]; \
|
|
||||||
((char*)(dst))[3] = ((char*)(src))[3]; \
|
|
||||||
((char*)(dst))[4] = ((char*)(src))[4]; \
|
|
||||||
((char*)(dst))[5] = ((char*)(src))[5]; \
|
|
||||||
((char*)(dst))[6] = ((char*)(src))[6]; \
|
|
||||||
((char*)(dst))[7] = ((char*)(src))[7];
|
|
||||||
#else
|
|
||||||
#error Unknown architecture, please report this to numpy maintainers with \
|
|
||||||
information about your platform (OS, CPU and compiler)
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,129 +0,0 @@
|
||||||
#ifndef _NPY_DEPRECATED_API_H
|
|
||||||
#define _NPY_DEPRECATED_API_H
|
|
||||||
|
|
||||||
#if defined(_WIN32)
|
|
||||||
#define _WARN___STR2__(x) #x
|
|
||||||
#define _WARN___STR1__(x) _WARN___STR2__(x)
|
|
||||||
#define _WARN___LOC__ __FILE__ "(" _WARN___STR1__(__LINE__) ") : Warning Msg: "
|
|
||||||
#pragma message(_WARN___LOC__"Using deprecated NumPy API, disable it by " \
|
|
||||||
"#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
|
|
||||||
#elif defined(__GNUC__)
|
|
||||||
#warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
|
|
||||||
#endif
|
|
||||||
/* TODO: How to do this warning message for other compilers? */
|
|
||||||
|
|
||||||
/*
|
|
||||||
* This header exists to collect all dangerous/deprecated NumPy API.
|
|
||||||
*
|
|
||||||
* This is an attempt to remove bad API, the proliferation of macros,
|
|
||||||
* and namespace pollution currently produced by the NumPy headers.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#if defined(NPY_NO_DEPRECATED_API)
|
|
||||||
#error Should never include npy_deprecated_api directly.
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/* These array flags are deprecated as of NumPy 1.7 */
|
|
||||||
#define NPY_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
|
|
||||||
#define NPY_FORTRAN NPY_ARRAY_F_CONTIGUOUS
|
|
||||||
|
|
||||||
/*
|
|
||||||
* The consistent NPY_ARRAY_* names which don't pollute the NPY_*
|
|
||||||
* namespace were added in NumPy 1.7.
|
|
||||||
*
|
|
||||||
* These versions of the carray flags are deprecated, but
|
|
||||||
* probably should only be removed after two releases instead of one.
|
|
||||||
*/
|
|
||||||
#define NPY_C_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS
|
|
||||||
#define NPY_F_CONTIGUOUS NPY_ARRAY_F_CONTIGUOUS
|
|
||||||
#define NPY_OWNDATA NPY_ARRAY_OWNDATA
|
|
||||||
#define NPY_FORCECAST NPY_ARRAY_FORCECAST
|
|
||||||
#define NPY_ENSURECOPY NPY_ARRAY_ENSURECOPY
|
|
||||||
#define NPY_ENSUREARRAY NPY_ARRAY_ENSUREARRAY
|
|
||||||
#define NPY_ELEMENTSTRIDES NPY_ARRAY_ELEMENTSTRIDES
|
|
||||||
#define NPY_ALIGNED NPY_ARRAY_ALIGNED
|
|
||||||
#define NPY_NOTSWAPPED NPY_ARRAY_NOTSWAPPED
|
|
||||||
#define NPY_WRITEABLE NPY_ARRAY_WRITEABLE
|
|
||||||
#define NPY_UPDATEIFCOPY NPY_ARRAY_UPDATEIFCOPY
|
|
||||||
#define NPY_BEHAVED NPY_ARRAY_BEHAVED
|
|
||||||
#define NPY_BEHAVED_NS NPY_ARRAY_BEHAVED_NS
|
|
||||||
#define NPY_CARRAY NPY_ARRAY_CARRAY
|
|
||||||
#define NPY_CARRAY_RO NPY_ARRAY_CARRAY_RO
|
|
||||||
#define NPY_FARRAY NPY_ARRAY_FARRAY
|
|
||||||
#define NPY_FARRAY_RO NPY_ARRAY_FARRAY_RO
|
|
||||||
#define NPY_DEFAULT NPY_ARRAY_DEFAULT
|
|
||||||
#define NPY_IN_ARRAY NPY_ARRAY_IN_ARRAY
|
|
||||||
#define NPY_OUT_ARRAY NPY_ARRAY_OUT_ARRAY
|
|
||||||
#define NPY_INOUT_ARRAY NPY_ARRAY_INOUT_ARRAY
|
|
||||||
#define NPY_IN_FARRAY NPY_ARRAY_IN_FARRAY
|
|
||||||
#define NPY_OUT_FARRAY NPY_ARRAY_OUT_FARRAY
|
|
||||||
#define NPY_INOUT_FARRAY NPY_ARRAY_INOUT_FARRAY
|
|
||||||
#define NPY_UPDATE_ALL NPY_ARRAY_UPDATE_ALL
|
|
||||||
|
|
||||||
/* This way of accessing the default type is deprecated as of NumPy 1.7 */
|
|
||||||
#define PyArray_DEFAULT NPY_DEFAULT_TYPE
|
|
||||||
|
|
||||||
/* These DATETIME bits aren't used internally */
|
|
||||||
#if PY_VERSION_HEX >= 0x03000000
|
|
||||||
#define PyDataType_GetDatetimeMetaData(descr) \
|
|
||||||
((descr->metadata == NULL) ? NULL : \
|
|
||||||
((PyArray_DatetimeMetaData *)(PyCapsule_GetPointer( \
|
|
||||||
PyDict_GetItemString( \
|
|
||||||
descr->metadata, NPY_METADATA_DTSTR), NULL))))
|
|
||||||
#else
|
|
||||||
#define PyDataType_GetDatetimeMetaData(descr) \
|
|
||||||
((descr->metadata == NULL) ? NULL : \
|
|
||||||
((PyArray_DatetimeMetaData *)(PyCObject_AsVoidPtr( \
|
|
||||||
PyDict_GetItemString(descr->metadata, NPY_METADATA_DTSTR)))))
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Deprecated as of NumPy 1.7, this kind of shortcut doesn't
|
|
||||||
* belong in the public API.
|
|
||||||
*/
|
|
||||||
#define NPY_AO PyArrayObject
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Deprecated as of NumPy 1.7, an all-lowercase macro doesn't
|
|
||||||
* belong in the public API.
|
|
||||||
*/
|
|
||||||
#define fortran fortran_
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Deprecated as of NumPy 1.7, as it is a namespace-polluting
|
|
||||||
* macro.
|
|
||||||
*/
|
|
||||||
#define FORTRAN_IF PyArray_FORTRAN_IF
|
|
||||||
|
|
||||||
/* Deprecated as of NumPy 1.7, datetime64 uses c_metadata instead */
|
|
||||||
#define NPY_METADATA_DTSTR "__timeunit__"
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Deprecated as of NumPy 1.7.
|
|
||||||
* The reasoning:
|
|
||||||
* - These are for datetime, but there's no datetime "namespace".
|
|
||||||
* - They just turn NPY_STR_<x> into "<x>", which is just
|
|
||||||
* making something simple be indirected.
|
|
||||||
*/
|
|
||||||
#define NPY_STR_Y "Y"
|
|
||||||
#define NPY_STR_M "M"
|
|
||||||
#define NPY_STR_W "W"
|
|
||||||
#define NPY_STR_D "D"
|
|
||||||
#define NPY_STR_h "h"
|
|
||||||
#define NPY_STR_m "m"
|
|
||||||
#define NPY_STR_s "s"
|
|
||||||
#define NPY_STR_ms "ms"
|
|
||||||
#define NPY_STR_us "us"
|
|
||||||
#define NPY_STR_ns "ns"
|
|
||||||
#define NPY_STR_ps "ps"
|
|
||||||
#define NPY_STR_fs "fs"
|
|
||||||
#define NPY_STR_as "as"
|
|
||||||
|
|
||||||
/*
|
|
||||||
* The macros in old_defines.h are Deprecated as of NumPy 1.7 and will be
|
|
||||||
* removed in the next major release.
|
|
||||||
*/
|
|
||||||
#include "old_defines.h"
|
|
||||||
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,46 +0,0 @@
|
||||||
#ifndef _NPY_ENDIAN_H_
|
|
||||||
#define _NPY_ENDIAN_H_
|
|
||||||
|
|
||||||
/*
|
|
||||||
* NPY_BYTE_ORDER is set to the same value as BYTE_ORDER set by glibc in
|
|
||||||
* endian.h
|
|
||||||
*/
|
|
||||||
|
|
||||||
#ifdef NPY_HAVE_ENDIAN_H
|
|
||||||
/* Use endian.h if available */
|
|
||||||
#include <endian.h>
|
|
||||||
|
|
||||||
#define NPY_BYTE_ORDER __BYTE_ORDER
|
|
||||||
#define NPY_LITTLE_ENDIAN __LITTLE_ENDIAN
|
|
||||||
#define NPY_BIG_ENDIAN __BIG_ENDIAN
|
|
||||||
#else
|
|
||||||
/* Set endianness info using target CPU */
|
|
||||||
#include "npy_cpu.h"
|
|
||||||
|
|
||||||
#define NPY_LITTLE_ENDIAN 1234
|
|
||||||
#define NPY_BIG_ENDIAN 4321
|
|
||||||
|
|
||||||
#if defined(NPY_CPU_X86) \
|
|
||||||
|| defined(NPY_CPU_AMD64) \
|
|
||||||
|| defined(NPY_CPU_IA64) \
|
|
||||||
|| defined(NPY_CPU_ALPHA) \
|
|
||||||
|| defined(NPY_CPU_ARMEL) \
|
|
||||||
|| defined(NPY_CPU_AARCH64) \
|
|
||||||
|| defined(NPY_CPU_SH_LE) \
|
|
||||||
|| defined(NPY_CPU_MIPSEL)
|
|
||||||
#define NPY_BYTE_ORDER NPY_LITTLE_ENDIAN
|
|
||||||
#elif defined(NPY_CPU_PPC) \
|
|
||||||
|| defined(NPY_CPU_SPARC) \
|
|
||||||
|| defined(NPY_CPU_S390) \
|
|
||||||
|| defined(NPY_CPU_HPPA) \
|
|
||||||
|| defined(NPY_CPU_PPC64) \
|
|
||||||
|| defined(NPY_CPU_ARMEB) \
|
|
||||||
|| defined(NPY_CPU_SH_BE) \
|
|
||||||
|| defined(NPY_CPU_MIPSEB)
|
|
||||||
#define NPY_BYTE_ORDER NPY_BIG_ENDIAN
|
|
||||||
#else
|
|
||||||
#error Unknown CPU: can not set endianness
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,117 +0,0 @@
|
||||||
|
|
||||||
/* Signal handling:
|
|
||||||
|
|
||||||
This header file defines macros that allow your code to handle
|
|
||||||
interrupts received during processing. Interrupts that
|
|
||||||
could reasonably be handled:
|
|
||||||
|
|
||||||
SIGINT, SIGABRT, SIGALRM, SIGSEGV
|
|
||||||
|
|
||||||
****Warning***************
|
|
||||||
|
|
||||||
Do not allow code that creates temporary memory or increases reference
|
|
||||||
counts of Python objects to be interrupted unless you handle it
|
|
||||||
differently.
|
|
||||||
|
|
||||||
**************************
|
|
||||||
|
|
||||||
The mechanism for handling interrupts is conceptually simple:
|
|
||||||
|
|
||||||
- replace the signal handler with our own home-grown version
|
|
||||||
and store the old one.
|
|
||||||
- run the code to be interrupted -- if an interrupt occurs
|
|
||||||
the handler should basically just cause a return to the
|
|
||||||
calling function for finish work.
|
|
||||||
- restore the old signal handler
|
|
||||||
|
|
||||||
Of course, every code that allows interrupts must account for
|
|
||||||
returning via the interrupt and handle clean-up correctly. But,
|
|
||||||
even still, the simple paradigm is complicated by at least three
|
|
||||||
factors.
|
|
||||||
|
|
||||||
1) platform portability (i.e. Microsoft says not to use longjmp
|
|
||||||
to return from signal handling. They have a __try and __except
|
|
||||||
extension to C instead but what about mingw?).
|
|
||||||
|
|
||||||
2) how to handle threads: apparently whether signals are delivered to
|
|
||||||
every thread of the process or the "invoking" thread is platform
|
|
||||||
dependent. --- we don't handle threads for now.
|
|
||||||
|
|
||||||
3) do we need to worry about re-entrance. For now, assume the
|
|
||||||
code will not call-back into itself.
|
|
||||||
|
|
||||||
Ideas:
|
|
||||||
|
|
||||||
1) Start by implementing an approach that works on platforms that
|
|
||||||
can use setjmp and longjmp functionality and does nothing
|
|
||||||
on other platforms.
|
|
||||||
|
|
||||||
2) Ignore threads --- i.e. do not mix interrupt handling and threads
|
|
||||||
|
|
||||||
3) Add a default signal_handler function to the C-API but have the rest
|
|
||||||
use macros.
|
|
||||||
|
|
||||||
|
|
||||||
Simple Interface:
|
|
||||||
|
|
||||||
|
|
||||||
In your C-extension: around a block of code you want to be interruptable
|
|
||||||
with a SIGINT
|
|
||||||
|
|
||||||
NPY_SIGINT_ON
|
|
||||||
[code]
|
|
||||||
NPY_SIGINT_OFF
|
|
||||||
|
|
||||||
In order for this to work correctly, the
|
|
||||||
[code] block must not allocate any memory or alter the reference count of any
|
|
||||||
Python objects. In other words [code] must be interruptible so that continuation
|
|
||||||
after NPY_SIGINT_OFF will only be "missing some computations"
|
|
||||||
|
|
||||||
Interrupt handling does not work well with threads.
|
|
||||||
|
|
||||||
*/
|
|
||||||
|
|
||||||
/* Add signal handling macros
|
|
||||||
Make the global variable and signal handler part of the C-API
|
|
||||||
*/
|
|
||||||
|
|
||||||
#ifndef NPY_INTERRUPT_H
|
|
||||||
#define NPY_INTERRUPT_H
|
|
||||||
|
|
||||||
#ifndef NPY_NO_SIGNAL
|
|
||||||
|
|
||||||
#include <setjmp.h>
|
|
||||||
#include <signal.h>
|
|
||||||
|
|
||||||
#ifndef sigsetjmp
|
|
||||||
|
|
||||||
#define NPY_SIGSETJMP(arg1, arg2) setjmp(arg1)
|
|
||||||
#define NPY_SIGLONGJMP(arg1, arg2) longjmp(arg1, arg2)
|
|
||||||
#define NPY_SIGJMP_BUF jmp_buf
|
|
||||||
|
|
||||||
#else
|
|
||||||
|
|
||||||
#define NPY_SIGSETJMP(arg1, arg2) sigsetjmp(arg1, arg2)
|
|
||||||
#define NPY_SIGLONGJMP(arg1, arg2) siglongjmp(arg1, arg2)
|
|
||||||
#define NPY_SIGJMP_BUF sigjmp_buf
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
# define NPY_SIGINT_ON { \
|
|
||||||
PyOS_sighandler_t _npy_sig_save; \
|
|
||||||
_npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler); \
|
|
||||||
if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()), \
|
|
||||||
1) == 0) { \
|
|
||||||
|
|
||||||
# define NPY_SIGINT_OFF } \
|
|
||||||
PyOS_setsig(SIGINT, _npy_sig_save); \
|
|
||||||
}
|
|
||||||
|
|
||||||
#else /* NPY_NO_SIGNAL */
|
|
||||||
|
|
||||||
#define NPY_SIGINT_ON
|
|
||||||
#define NPY_SIGINT_OFF
|
|
||||||
|
|
||||||
#endif /* HAVE_SIGSETJMP */
|
|
||||||
|
|
||||||
#endif /* NPY_INTERRUPT_H */
|
|
|
@ -1,438 +0,0 @@
|
||||||
#ifndef __NPY_MATH_C99_H_
|
|
||||||
#define __NPY_MATH_C99_H_
|
|
||||||
|
|
||||||
#include <math.h>
|
|
||||||
#ifdef __SUNPRO_CC
|
|
||||||
#include <sunmath.h>
|
|
||||||
#endif
|
|
||||||
#include <numpy/npy_common.h>
|
|
||||||
|
|
||||||
/*
|
|
||||||
* NAN and INFINITY like macros (same behavior as glibc for NAN, same as C99
|
|
||||||
* for INFINITY)
|
|
||||||
*
|
|
||||||
* XXX: I should test whether INFINITY and NAN are available on the platform
|
|
||||||
*/
|
|
||||||
NPY_INLINE static float __npy_inff(void)
|
|
||||||
{
|
|
||||||
const union { npy_uint32 __i; float __f;} __bint = {0x7f800000UL};
|
|
||||||
return __bint.__f;
|
|
||||||
}
|
|
||||||
|
|
||||||
NPY_INLINE static float __npy_nanf(void)
|
|
||||||
{
|
|
||||||
const union { npy_uint32 __i; float __f;} __bint = {0x7fc00000UL};
|
|
||||||
return __bint.__f;
|
|
||||||
}
|
|
||||||
|
|
||||||
NPY_INLINE static float __npy_pzerof(void)
|
|
||||||
{
|
|
||||||
const union { npy_uint32 __i; float __f;} __bint = {0x00000000UL};
|
|
||||||
return __bint.__f;
|
|
||||||
}
|
|
||||||
|
|
||||||
NPY_INLINE static float __npy_nzerof(void)
|
|
||||||
{
|
|
||||||
const union { npy_uint32 __i; float __f;} __bint = {0x80000000UL};
|
|
||||||
return __bint.__f;
|
|
||||||
}
|
|
||||||
|
|
||||||
#define NPY_INFINITYF __npy_inff()
|
|
||||||
#define NPY_NANF __npy_nanf()
|
|
||||||
#define NPY_PZEROF __npy_pzerof()
|
|
||||||
#define NPY_NZEROF __npy_nzerof()
|
|
||||||
|
|
||||||
#define NPY_INFINITY ((npy_double)NPY_INFINITYF)
|
|
||||||
#define NPY_NAN ((npy_double)NPY_NANF)
|
|
||||||
#define NPY_PZERO ((npy_double)NPY_PZEROF)
|
|
||||||
#define NPY_NZERO ((npy_double)NPY_NZEROF)
|
|
||||||
|
|
||||||
#define NPY_INFINITYL ((npy_longdouble)NPY_INFINITYF)
|
|
||||||
#define NPY_NANL ((npy_longdouble)NPY_NANF)
|
|
||||||
#define NPY_PZEROL ((npy_longdouble)NPY_PZEROF)
|
|
||||||
#define NPY_NZEROL ((npy_longdouble)NPY_NZEROF)
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Useful constants
|
|
||||||
*/
|
|
||||||
#define NPY_E 2.718281828459045235360287471352662498 /* e */
|
|
||||||
#define NPY_LOG2E 1.442695040888963407359924681001892137 /* log_2 e */
|
|
||||||
#define NPY_LOG10E 0.434294481903251827651128918916605082 /* log_10 e */
|
|
||||||
#define NPY_LOGE2 0.693147180559945309417232121458176568 /* log_e 2 */
|
|
||||||
#define NPY_LOGE10 2.302585092994045684017991454684364208 /* log_e 10 */
|
|
||||||
#define NPY_PI 3.141592653589793238462643383279502884 /* pi */
|
|
||||||
#define NPY_PI_2 1.570796326794896619231321691639751442 /* pi/2 */
|
|
||||||
#define NPY_PI_4 0.785398163397448309615660845819875721 /* pi/4 */
|
|
||||||
#define NPY_1_PI 0.318309886183790671537767526745028724 /* 1/pi */
|
|
||||||
#define NPY_2_PI 0.636619772367581343075535053490057448 /* 2/pi */
|
|
||||||
#define NPY_EULER 0.577215664901532860606512090082402431 /* Euler constant */
|
|
||||||
#define NPY_SQRT2 1.414213562373095048801688724209698079 /* sqrt(2) */
|
|
||||||
#define NPY_SQRT1_2 0.707106781186547524400844362104849039 /* 1/sqrt(2) */
|
|
||||||
|
|
||||||
#define NPY_Ef 2.718281828459045235360287471352662498F /* e */
|
|
||||||
#define NPY_LOG2Ef 1.442695040888963407359924681001892137F /* log_2 e */
|
|
||||||
#define NPY_LOG10Ef 0.434294481903251827651128918916605082F /* log_10 e */
|
|
||||||
#define NPY_LOGE2f 0.693147180559945309417232121458176568F /* log_e 2 */
|
|
||||||
#define NPY_LOGE10f 2.302585092994045684017991454684364208F /* log_e 10 */
|
|
||||||
#define NPY_PIf 3.141592653589793238462643383279502884F /* pi */
|
|
||||||
#define NPY_PI_2f 1.570796326794896619231321691639751442F /* pi/2 */
|
|
||||||
#define NPY_PI_4f 0.785398163397448309615660845819875721F /* pi/4 */
|
|
||||||
#define NPY_1_PIf 0.318309886183790671537767526745028724F /* 1/pi */
|
|
||||||
#define NPY_2_PIf 0.636619772367581343075535053490057448F /* 2/pi */
|
|
||||||
#define NPY_EULERf 0.577215664901532860606512090082402431F /* Euler constan*/
|
|
||||||
#define NPY_SQRT2f 1.414213562373095048801688724209698079F /* sqrt(2) */
|
|
||||||
#define NPY_SQRT1_2f 0.707106781186547524400844362104849039F /* 1/sqrt(2) */
|
|
||||||
|
|
||||||
#define NPY_El 2.718281828459045235360287471352662498L /* e */
|
|
||||||
#define NPY_LOG2El 1.442695040888963407359924681001892137L /* log_2 e */
|
|
||||||
#define NPY_LOG10El 0.434294481903251827651128918916605082L /* log_10 e */
|
|
||||||
#define NPY_LOGE2l 0.693147180559945309417232121458176568L /* log_e 2 */
|
|
||||||
#define NPY_LOGE10l 2.302585092994045684017991454684364208L /* log_e 10 */
|
|
||||||
#define NPY_PIl 3.141592653589793238462643383279502884L /* pi */
|
|
||||||
#define NPY_PI_2l 1.570796326794896619231321691639751442L /* pi/2 */
|
|
||||||
#define NPY_PI_4l 0.785398163397448309615660845819875721L /* pi/4 */
|
|
||||||
#define NPY_1_PIl 0.318309886183790671537767526745028724L /* 1/pi */
|
|
||||||
#define NPY_2_PIl 0.636619772367581343075535053490057448L /* 2/pi */
|
|
||||||
#define NPY_EULERl 0.577215664901532860606512090082402431L /* Euler constan*/
|
|
||||||
#define NPY_SQRT2l 1.414213562373095048801688724209698079L /* sqrt(2) */
|
|
||||||
#define NPY_SQRT1_2l 0.707106781186547524400844362104849039L /* 1/sqrt(2) */
|
|
||||||
|
|
||||||
/*
|
|
||||||
* C99 double math funcs
|
|
||||||
*/
|
|
||||||
double npy_sin(double x);
|
|
||||||
double npy_cos(double x);
|
|
||||||
double npy_tan(double x);
|
|
||||||
double npy_sinh(double x);
|
|
||||||
double npy_cosh(double x);
|
|
||||||
double npy_tanh(double x);
|
|
||||||
|
|
||||||
double npy_asin(double x);
|
|
||||||
double npy_acos(double x);
|
|
||||||
double npy_atan(double x);
|
|
||||||
double npy_aexp(double x);
|
|
||||||
double npy_alog(double x);
|
|
||||||
double npy_asqrt(double x);
|
|
||||||
double npy_afabs(double x);
|
|
||||||
|
|
||||||
double npy_log(double x);
|
|
||||||
double npy_log10(double x);
|
|
||||||
double npy_exp(double x);
|
|
||||||
double npy_sqrt(double x);
|
|
||||||
|
|
||||||
double npy_fabs(double x);
|
|
||||||
double npy_ceil(double x);
|
|
||||||
double npy_fmod(double x, double y);
|
|
||||||
double npy_floor(double x);
|
|
||||||
|
|
||||||
double npy_expm1(double x);
|
|
||||||
double npy_log1p(double x);
|
|
||||||
double npy_hypot(double x, double y);
|
|
||||||
double npy_acosh(double x);
|
|
||||||
double npy_asinh(double xx);
|
|
||||||
double npy_atanh(double x);
|
|
||||||
double npy_rint(double x);
|
|
||||||
double npy_trunc(double x);
|
|
||||||
double npy_exp2(double x);
|
|
||||||
double npy_log2(double x);
|
|
||||||
|
|
||||||
double npy_atan2(double x, double y);
|
|
||||||
double npy_pow(double x, double y);
|
|
||||||
double npy_modf(double x, double* y);
|
|
||||||
|
|
||||||
double npy_copysign(double x, double y);
|
|
||||||
double npy_nextafter(double x, double y);
|
|
||||||
double npy_spacing(double x);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* IEEE 754 fpu handling. Those are guaranteed to be macros
|
|
||||||
*/
|
|
||||||
#ifndef NPY_HAVE_DECL_ISNAN
|
|
||||||
#define npy_isnan(x) ((x) != (x))
|
|
||||||
#else
|
|
||||||
#ifdef _MSC_VER
|
|
||||||
#define npy_isnan(x) _isnan((x))
|
|
||||||
#else
|
|
||||||
#define npy_isnan(x) isnan((x))
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifndef NPY_HAVE_DECL_ISFINITE
|
|
||||||
#ifdef _MSC_VER
|
|
||||||
#define npy_isfinite(x) _finite((x))
|
|
||||||
#else
|
|
||||||
#define npy_isfinite(x) !npy_isnan((x) + (-x))
|
|
||||||
#endif
|
|
||||||
#else
|
|
||||||
#define npy_isfinite(x) isfinite((x))
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifndef NPY_HAVE_DECL_ISINF
|
|
||||||
#define npy_isinf(x) (!npy_isfinite(x) && !npy_isnan(x))
|
|
||||||
#else
|
|
||||||
#ifdef _MSC_VER
|
|
||||||
#define npy_isinf(x) (!_finite((x)) && !_isnan((x)))
|
|
||||||
#else
|
|
||||||
#define npy_isinf(x) isinf((x))
|
|
||||||
#endif
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifndef NPY_HAVE_DECL_SIGNBIT
|
|
||||||
int _npy_signbit_f(float x);
|
|
||||||
int _npy_signbit_d(double x);
|
|
||||||
int _npy_signbit_ld(long double x);
|
|
||||||
#define npy_signbit(x) \
|
|
||||||
(sizeof (x) == sizeof (long double) ? _npy_signbit_ld (x) \
|
|
||||||
: sizeof (x) == sizeof (double) ? _npy_signbit_d (x) \
|
|
||||||
: _npy_signbit_f (x))
|
|
||||||
#else
|
|
||||||
#define npy_signbit(x) signbit((x))
|
|
||||||
#endif
|
|
||||||
|
|
||||||
/*
|
|
||||||
* float C99 math functions
|
|
||||||
*/
|
|
||||||
|
|
||||||
float npy_sinf(float x);
|
|
||||||
float npy_cosf(float x);
|
|
||||||
float npy_tanf(float x);
|
|
||||||
float npy_sinhf(float x);
|
|
||||||
float npy_coshf(float x);
|
|
||||||
float npy_tanhf(float x);
|
|
||||||
float npy_fabsf(float x);
|
|
||||||
float npy_floorf(float x);
|
|
||||||
float npy_ceilf(float x);
|
|
||||||
float npy_rintf(float x);
|
|
||||||
float npy_truncf(float x);
|
|
||||||
float npy_sqrtf(float x);
|
|
||||||
float npy_log10f(float x);
|
|
||||||
float npy_logf(float x);
|
|
||||||
float npy_expf(float x);
|
|
||||||
float npy_expm1f(float x);
|
|
||||||
float npy_asinf(float x);
|
|
||||||
float npy_acosf(float x);
|
|
||||||
float npy_atanf(float x);
|
|
||||||
float npy_asinhf(float x);
|
|
||||||
float npy_acoshf(float x);
|
|
||||||
float npy_atanhf(float x);
|
|
||||||
float npy_log1pf(float x);
|
|
||||||
float npy_exp2f(float x);
|
|
||||||
float npy_log2f(float x);
|
|
||||||
|
|
||||||
float npy_atan2f(float x, float y);
|
|
||||||
float npy_hypotf(float x, float y);
|
|
||||||
float npy_powf(float x, float y);
|
|
||||||
float npy_fmodf(float x, float y);
|
|
||||||
|
|
||||||
float npy_modff(float x, float* y);
|
|
||||||
|
|
||||||
float npy_copysignf(float x, float y);
|
|
||||||
float npy_nextafterf(float x, float y);
|
|
||||||
float npy_spacingf(float x);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* float C99 math functions
|
|
||||||
*/
|
|
||||||
|
|
||||||
npy_longdouble npy_sinl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_cosl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_tanl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_sinhl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_coshl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_tanhl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_fabsl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_floorl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_ceill(npy_longdouble x);
|
|
||||||
npy_longdouble npy_rintl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_truncl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_sqrtl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_log10l(npy_longdouble x);
|
|
||||||
npy_longdouble npy_logl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_expl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_expm1l(npy_longdouble x);
|
|
||||||
npy_longdouble npy_asinl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_acosl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_atanl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_asinhl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_acoshl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_atanhl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_log1pl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_exp2l(npy_longdouble x);
|
|
||||||
npy_longdouble npy_log2l(npy_longdouble x);
|
|
||||||
|
|
||||||
npy_longdouble npy_atan2l(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_hypotl(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_powl(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_fmodl(npy_longdouble x, npy_longdouble y);
|
|
||||||
|
|
||||||
npy_longdouble npy_modfl(npy_longdouble x, npy_longdouble* y);
|
|
||||||
|
|
||||||
npy_longdouble npy_copysignl(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_nextafterl(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_spacingl(npy_longdouble x);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Non standard functions
|
|
||||||
*/
|
|
||||||
double npy_deg2rad(double x);
|
|
||||||
double npy_rad2deg(double x);
|
|
||||||
double npy_logaddexp(double x, double y);
|
|
||||||
double npy_logaddexp2(double x, double y);
|
|
||||||
|
|
||||||
float npy_deg2radf(float x);
|
|
||||||
float npy_rad2degf(float x);
|
|
||||||
float npy_logaddexpf(float x, float y);
|
|
||||||
float npy_logaddexp2f(float x, float y);
|
|
||||||
|
|
||||||
npy_longdouble npy_deg2radl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_rad2degl(npy_longdouble x);
|
|
||||||
npy_longdouble npy_logaddexpl(npy_longdouble x, npy_longdouble y);
|
|
||||||
npy_longdouble npy_logaddexp2l(npy_longdouble x, npy_longdouble y);
|
|
||||||
|
|
||||||
#define npy_degrees npy_rad2deg
|
|
||||||
#define npy_degreesf npy_rad2degf
|
|
||||||
#define npy_degreesl npy_rad2degl
|
|
||||||
|
|
||||||
#define npy_radians npy_deg2rad
|
|
||||||
#define npy_radiansf npy_deg2radf
|
|
||||||
#define npy_radiansl npy_deg2radl
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Complex declarations
|
|
||||||
*/
|
|
||||||
|
|
||||||
/*
|
|
||||||
* C99 specifies that complex numbers have the same representation as
|
|
||||||
* an array of two elements, where the first element is the real part
|
|
||||||
* and the second element is the imaginary part.
|
|
||||||
*/
|
|
||||||
#define __NPY_CPACK_IMP(x, y, type, ctype) \
|
|
||||||
union { \
|
|
||||||
ctype z; \
|
|
||||||
type a[2]; \
|
|
||||||
} z1;; \
|
|
||||||
\
|
|
||||||
z1.a[0] = (x); \
|
|
||||||
z1.a[1] = (y); \
|
|
||||||
\
|
|
||||||
return z1.z;
|
|
||||||
|
|
||||||
static NPY_INLINE npy_cdouble npy_cpack(double x, double y)
|
|
||||||
{
|
|
||||||
__NPY_CPACK_IMP(x, y, double, npy_cdouble);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE npy_cfloat npy_cpackf(float x, float y)
|
|
||||||
{
|
|
||||||
__NPY_CPACK_IMP(x, y, float, npy_cfloat);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE npy_clongdouble npy_cpackl(npy_longdouble x, npy_longdouble y)
|
|
||||||
{
|
|
||||||
__NPY_CPACK_IMP(x, y, npy_longdouble, npy_clongdouble);
|
|
||||||
}
|
|
||||||
#undef __NPY_CPACK_IMP
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Same remark as above, but in the other direction: extract first/second
|
|
||||||
* member of complex number, assuming a C99-compatible representation
|
|
||||||
*
|
|
||||||
* Those are defineds as static inline, and such as a reasonable compiler would
|
|
||||||
* most likely compile this to one or two instructions (on CISC at least)
|
|
||||||
*/
|
|
||||||
#define __NPY_CEXTRACT_IMP(z, index, type, ctype) \
|
|
||||||
union { \
|
|
||||||
ctype z; \
|
|
||||||
type a[2]; \
|
|
||||||
} __z_repr; \
|
|
||||||
__z_repr.z = z; \
|
|
||||||
\
|
|
||||||
return __z_repr.a[index];
|
|
||||||
|
|
||||||
static NPY_INLINE double npy_creal(npy_cdouble z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 0, double, npy_cdouble);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE double npy_cimag(npy_cdouble z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 1, double, npy_cdouble);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE float npy_crealf(npy_cfloat z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 0, float, npy_cfloat);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE float npy_cimagf(npy_cfloat z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 1, float, npy_cfloat);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE npy_longdouble npy_creall(npy_clongdouble z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 0, npy_longdouble, npy_clongdouble);
|
|
||||||
}
|
|
||||||
|
|
||||||
static NPY_INLINE npy_longdouble npy_cimagl(npy_clongdouble z)
|
|
||||||
{
|
|
||||||
__NPY_CEXTRACT_IMP(z, 1, npy_longdouble, npy_clongdouble);
|
|
||||||
}
|
|
||||||
#undef __NPY_CEXTRACT_IMP
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Double precision complex functions
|
|
||||||
*/
|
|
||||||
double npy_cabs(npy_cdouble z);
|
|
||||||
double npy_carg(npy_cdouble z);
|
|
||||||
|
|
||||||
npy_cdouble npy_cexp(npy_cdouble z);
|
|
||||||
npy_cdouble npy_clog(npy_cdouble z);
|
|
||||||
npy_cdouble npy_cpow(npy_cdouble x, npy_cdouble y);
|
|
||||||
|
|
||||||
npy_cdouble npy_csqrt(npy_cdouble z);
|
|
||||||
|
|
||||||
npy_cdouble npy_ccos(npy_cdouble z);
|
|
||||||
npy_cdouble npy_csin(npy_cdouble z);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Single precision complex functions
|
|
||||||
*/
|
|
||||||
float npy_cabsf(npy_cfloat z);
|
|
||||||
float npy_cargf(npy_cfloat z);
|
|
||||||
|
|
||||||
npy_cfloat npy_cexpf(npy_cfloat z);
|
|
||||||
npy_cfloat npy_clogf(npy_cfloat z);
|
|
||||||
npy_cfloat npy_cpowf(npy_cfloat x, npy_cfloat y);
|
|
||||||
|
|
||||||
npy_cfloat npy_csqrtf(npy_cfloat z);
|
|
||||||
|
|
||||||
npy_cfloat npy_ccosf(npy_cfloat z);
|
|
||||||
npy_cfloat npy_csinf(npy_cfloat z);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Extended precision complex functions
|
|
||||||
*/
|
|
||||||
npy_longdouble npy_cabsl(npy_clongdouble z);
|
|
||||||
npy_longdouble npy_cargl(npy_clongdouble z);
|
|
||||||
|
|
||||||
npy_clongdouble npy_cexpl(npy_clongdouble z);
|
|
||||||
npy_clongdouble npy_clogl(npy_clongdouble z);
|
|
||||||
npy_clongdouble npy_cpowl(npy_clongdouble x, npy_clongdouble y);
|
|
||||||
|
|
||||||
npy_clongdouble npy_csqrtl(npy_clongdouble z);
|
|
||||||
|
|
||||||
npy_clongdouble npy_ccosl(npy_clongdouble z);
|
|
||||||
npy_clongdouble npy_csinl(npy_clongdouble z);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Functions that set the floating point error
|
|
||||||
* status word.
|
|
||||||
*/
|
|
||||||
|
|
||||||
void npy_set_floatstatus_divbyzero(void);
|
|
||||||
void npy_set_floatstatus_overflow(void);
|
|
||||||
void npy_set_floatstatus_underflow(void);
|
|
||||||
void npy_set_floatstatus_invalid(void);
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,19 +0,0 @@
|
||||||
/*
|
|
||||||
* This include file is provided for inclusion in Cython *.pyd files where
|
|
||||||
* one would like to define the NPY_NO_DEPRECATED_API macro. It can be
|
|
||||||
* included by
|
|
||||||
*
|
|
||||||
* cdef extern from "npy_no_deprecated_api.h": pass
|
|
||||||
*
|
|
||||||
*/
|
|
||||||
#ifndef NPY_NO_DEPRECATED_API
|
|
||||||
|
|
||||||
/* put this check here since there may be multiple includes in C extensions. */
|
|
||||||
#if defined(NDARRAYTYPES_H) || defined(_NPY_DEPRECATED_API_H) || \
|
|
||||||
defined(OLD_DEFINES_H)
|
|
||||||
#error "npy_no_deprecated_api.h" must be first among numpy includes.
|
|
||||||
#else
|
|
||||||
#define NPY_NO_DEPRECATED_API NPY_API_VERSION
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,30 +0,0 @@
|
||||||
#ifndef _NPY_OS_H_
|
|
||||||
#define _NPY_OS_H_
|
|
||||||
|
|
||||||
#if defined(linux) || defined(__linux) || defined(__linux__)
|
|
||||||
#define NPY_OS_LINUX
|
|
||||||
#elif defined(__FreeBSD__) || defined(__NetBSD__) || \
|
|
||||||
defined(__OpenBSD__) || defined(__DragonFly__)
|
|
||||||
#define NPY_OS_BSD
|
|
||||||
#ifdef __FreeBSD__
|
|
||||||
#define NPY_OS_FREEBSD
|
|
||||||
#elif defined(__NetBSD__)
|
|
||||||
#define NPY_OS_NETBSD
|
|
||||||
#elif defined(__OpenBSD__)
|
|
||||||
#define NPY_OS_OPENBSD
|
|
||||||
#elif defined(__DragonFly__)
|
|
||||||
#define NPY_OS_DRAGONFLY
|
|
||||||
#endif
|
|
||||||
#elif defined(sun) || defined(__sun)
|
|
||||||
#define NPY_OS_SOLARIS
|
|
||||||
#elif defined(__CYGWIN__)
|
|
||||||
#define NPY_OS_CYGWIN
|
|
||||||
#elif defined(_WIN32) || defined(__WIN32__) || defined(WIN32)
|
|
||||||
#define NPY_OS_WIN32
|
|
||||||
#elif defined(__APPLE__)
|
|
||||||
#define NPY_OS_DARWIN
|
|
||||||
#else
|
|
||||||
#define NPY_OS_UNKNOWN
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#endif
|
|
|
@ -1,33 +0,0 @@
#ifndef _NPY_NUMPYCONFIG_H_
#define _NPY_NUMPYCONFIG_H_

#include "_numpyconfig.h"

/*
 * On Mac OS X, because there is only one configuration stage for all the archs
 * in universal builds, any macro which depends on the arch needs to be
 * hardcoded
 */
#ifdef __APPLE__
    #undef NPY_SIZEOF_LONG
    #undef NPY_SIZEOF_PY_INTPTR_T

    #ifdef __LP64__
        #define NPY_SIZEOF_LONG 8
        #define NPY_SIZEOF_PY_INTPTR_T 8
    #else
        #define NPY_SIZEOF_LONG 4
        #define NPY_SIZEOF_PY_INTPTR_T 4
    #endif
#endif

/**
 * To help with the NPY_NO_DEPRECATED_API macro, we include API version
 * numbers for specific versions of NumPy. To exclude all API that was
 * deprecated as of 1.7, add the following before #including any NumPy
 * headers:
 *   #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
 */
#define NPY_1_7_API_VERSION 0x00000007

#endif
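A small sketch of the usage the comment above describes: defining NPY_NO_DEPRECATED_API before any NumPy header is included in a C extension source file. The file and comments are illustrative; the macro expands lazily, so referencing NPY_1_7_API_VERSION before numpyconfig.h is included is fine.

/* example_module.c -- illustrative file name */
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION  /* must appear before any NumPy include */

#include <Python.h>
#include <numpy/arrayobject.h>

/* From here on, any use of API deprecated as of NumPy 1.7 fails at
 * compile time instead of silently relying on the old names. */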
@ -1,187 +0,0 @@
/* This header is deprecated as of NumPy 1.7 */
#ifndef OLD_DEFINES_H
#define OLD_DEFINES_H

#if defined(NPY_NO_DEPRECATED_API) && NPY_NO_DEPRECATED_API >= NPY_1_7_API_VERSION
#error The header "old_defines.h" is deprecated as of NumPy 1.7.
#endif

#define NDARRAY_VERSION NPY_VERSION

#define PyArray_MIN_BUFSIZE NPY_MIN_BUFSIZE
#define PyArray_MAX_BUFSIZE NPY_MAX_BUFSIZE
#define PyArray_BUFSIZE NPY_BUFSIZE

#define PyArray_PRIORITY NPY_PRIORITY
#define PyArray_SUBTYPE_PRIORITY NPY_PRIORITY
#define PyArray_NUM_FLOATTYPE NPY_NUM_FLOATTYPE

#define NPY_MAX PyArray_MAX
#define NPY_MIN PyArray_MIN

#define PyArray_TYPES NPY_TYPES
#define PyArray_BOOL NPY_BOOL
#define PyArray_BYTE NPY_BYTE
#define PyArray_UBYTE NPY_UBYTE
#define PyArray_SHORT NPY_SHORT
#define PyArray_USHORT NPY_USHORT
#define PyArray_INT NPY_INT
#define PyArray_UINT NPY_UINT
#define PyArray_LONG NPY_LONG
#define PyArray_ULONG NPY_ULONG
#define PyArray_LONGLONG NPY_LONGLONG
#define PyArray_ULONGLONG NPY_ULONGLONG
#define PyArray_HALF NPY_HALF
#define PyArray_FLOAT NPY_FLOAT
#define PyArray_DOUBLE NPY_DOUBLE
#define PyArray_LONGDOUBLE NPY_LONGDOUBLE
#define PyArray_CFLOAT NPY_CFLOAT
#define PyArray_CDOUBLE NPY_CDOUBLE
#define PyArray_CLONGDOUBLE NPY_CLONGDOUBLE
#define PyArray_OBJECT NPY_OBJECT
#define PyArray_STRING NPY_STRING
#define PyArray_UNICODE NPY_UNICODE
#define PyArray_VOID NPY_VOID
#define PyArray_DATETIME NPY_DATETIME
#define PyArray_TIMEDELTA NPY_TIMEDELTA
#define PyArray_NTYPES NPY_NTYPES
#define PyArray_NOTYPE NPY_NOTYPE
#define PyArray_CHAR NPY_CHAR
#define PyArray_USERDEF NPY_USERDEF
#define PyArray_NUMUSERTYPES NPY_NUMUSERTYPES

#define PyArray_INTP NPY_INTP
#define PyArray_UINTP NPY_UINTP

#define PyArray_INT8 NPY_INT8
#define PyArray_UINT8 NPY_UINT8
#define PyArray_INT16 NPY_INT16
#define PyArray_UINT16 NPY_UINT16
#define PyArray_INT32 NPY_INT32
#define PyArray_UINT32 NPY_UINT32

#ifdef NPY_INT64
#define PyArray_INT64 NPY_INT64
#define PyArray_UINT64 NPY_UINT64
#endif

#ifdef NPY_INT128
#define PyArray_INT128 NPY_INT128
#define PyArray_UINT128 NPY_UINT128
#endif

#ifdef NPY_FLOAT16
#define PyArray_FLOAT16 NPY_FLOAT16
#define PyArray_COMPLEX32 NPY_COMPLEX32
#endif

#ifdef NPY_FLOAT80
#define PyArray_FLOAT80 NPY_FLOAT80
#define PyArray_COMPLEX160 NPY_COMPLEX160
#endif

#ifdef NPY_FLOAT96
#define PyArray_FLOAT96 NPY_FLOAT96
#define PyArray_COMPLEX192 NPY_COMPLEX192
#endif

#ifdef NPY_FLOAT128
#define PyArray_FLOAT128 NPY_FLOAT128
#define PyArray_COMPLEX256 NPY_COMPLEX256
#endif

#define PyArray_FLOAT32 NPY_FLOAT32
#define PyArray_COMPLEX64 NPY_COMPLEX64
#define PyArray_FLOAT64 NPY_FLOAT64
#define PyArray_COMPLEX128 NPY_COMPLEX128


#define PyArray_TYPECHAR NPY_TYPECHAR
#define PyArray_BOOLLTR NPY_BOOLLTR
#define PyArray_BYTELTR NPY_BYTELTR
#define PyArray_UBYTELTR NPY_UBYTELTR
#define PyArray_SHORTLTR NPY_SHORTLTR
#define PyArray_USHORTLTR NPY_USHORTLTR
#define PyArray_INTLTR NPY_INTLTR
#define PyArray_UINTLTR NPY_UINTLTR
#define PyArray_LONGLTR NPY_LONGLTR
#define PyArray_ULONGLTR NPY_ULONGLTR
#define PyArray_LONGLONGLTR NPY_LONGLONGLTR
#define PyArray_ULONGLONGLTR NPY_ULONGLONGLTR
#define PyArray_HALFLTR NPY_HALFLTR
#define PyArray_FLOATLTR NPY_FLOATLTR
#define PyArray_DOUBLELTR NPY_DOUBLELTR
#define PyArray_LONGDOUBLELTR NPY_LONGDOUBLELTR
#define PyArray_CFLOATLTR NPY_CFLOATLTR
#define PyArray_CDOUBLELTR NPY_CDOUBLELTR
#define PyArray_CLONGDOUBLELTR NPY_CLONGDOUBLELTR
#define PyArray_OBJECTLTR NPY_OBJECTLTR
#define PyArray_STRINGLTR NPY_STRINGLTR
#define PyArray_STRINGLTR2 NPY_STRINGLTR2
#define PyArray_UNICODELTR NPY_UNICODELTR
#define PyArray_VOIDLTR NPY_VOIDLTR
#define PyArray_DATETIMELTR NPY_DATETIMELTR
#define PyArray_TIMEDELTALTR NPY_TIMEDELTALTR
#define PyArray_CHARLTR NPY_CHARLTR
#define PyArray_INTPLTR NPY_INTPLTR
#define PyArray_UINTPLTR NPY_UINTPLTR
#define PyArray_GENBOOLLTR NPY_GENBOOLLTR
#define PyArray_SIGNEDLTR NPY_SIGNEDLTR
#define PyArray_UNSIGNEDLTR NPY_UNSIGNEDLTR
#define PyArray_FLOATINGLTR NPY_FLOATINGLTR
#define PyArray_COMPLEXLTR NPY_COMPLEXLTR

#define PyArray_QUICKSORT NPY_QUICKSORT
#define PyArray_HEAPSORT NPY_HEAPSORT
#define PyArray_MERGESORT NPY_MERGESORT
#define PyArray_SORTKIND NPY_SORTKIND
#define PyArray_NSORTS NPY_NSORTS

#define PyArray_NOSCALAR NPY_NOSCALAR
#define PyArray_BOOL_SCALAR NPY_BOOL_SCALAR
#define PyArray_INTPOS_SCALAR NPY_INTPOS_SCALAR
#define PyArray_INTNEG_SCALAR NPY_INTNEG_SCALAR
#define PyArray_FLOAT_SCALAR NPY_FLOAT_SCALAR
#define PyArray_COMPLEX_SCALAR NPY_COMPLEX_SCALAR
#define PyArray_OBJECT_SCALAR NPY_OBJECT_SCALAR
#define PyArray_SCALARKIND NPY_SCALARKIND
#define PyArray_NSCALARKINDS NPY_NSCALARKINDS

#define PyArray_ANYORDER NPY_ANYORDER
#define PyArray_CORDER NPY_CORDER
#define PyArray_FORTRANORDER NPY_FORTRANORDER
#define PyArray_ORDER NPY_ORDER

#define PyDescr_ISBOOL PyDataType_ISBOOL
#define PyDescr_ISUNSIGNED PyDataType_ISUNSIGNED
#define PyDescr_ISSIGNED PyDataType_ISSIGNED
#define PyDescr_ISINTEGER PyDataType_ISINTEGER
#define PyDescr_ISFLOAT PyDataType_ISFLOAT
#define PyDescr_ISNUMBER PyDataType_ISNUMBER
#define PyDescr_ISSTRING PyDataType_ISSTRING
#define PyDescr_ISCOMPLEX PyDataType_ISCOMPLEX
#define PyDescr_ISPYTHON PyDataType_ISPYTHON
#define PyDescr_ISFLEXIBLE PyDataType_ISFLEXIBLE
#define PyDescr_ISUSERDEF PyDataType_ISUSERDEF
#define PyDescr_ISEXTENDED PyDataType_ISEXTENDED
#define PyDescr_ISOBJECT PyDataType_ISOBJECT
#define PyDescr_HASFIELDS PyDataType_HASFIELDS

#define PyArray_LITTLE NPY_LITTLE
#define PyArray_BIG NPY_BIG
#define PyArray_NATIVE NPY_NATIVE
#define PyArray_SWAP NPY_SWAP
#define PyArray_IGNORE NPY_IGNORE

#define PyArray_NATBYTE NPY_NATBYTE
#define PyArray_OPPBYTE NPY_OPPBYTE

#define PyArray_MAX_ELSIZE NPY_MAX_ELSIZE

#define PyArray_USE_PYMEM NPY_USE_PYMEM

#define PyArray_RemoveLargest PyArray_RemoveSmallest

#define PyArray_UCS4 npy_ucs4

#endif
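A brief sketch of what the alias table above means for extension code that migrates away from the deprecated spellings. The helper name is_double_typenum is invented for the example.

#include <Python.h>
#include <numpy/arrayobject.h>

/* Illustrative only: old code would have compared against PyArray_DOUBLE;
 * NPY_DOUBLE is the non-deprecated name that old_defines.h maps it to. */
static int is_double_typenum(int type_num)
{
    return type_num == NPY_DOUBLE;
}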
@ -1,23 +0,0 @@
#include "arrayobject.h"

#ifndef REFCOUNT
#  define REFCOUNT NPY_REFCOUNT
#  define MAX_ELSIZE 16
#endif

#define PyArray_UNSIGNED_TYPES
#define PyArray_SBYTE NPY_BYTE
#define PyArray_CopyArray PyArray_CopyInto
#define _PyArray_multiply_list PyArray_MultiplyIntList
#define PyArray_ISSPACESAVER(m) NPY_FALSE
#define PyScalarArray_Check PyArray_CheckScalar

#define CONTIGUOUS NPY_CONTIGUOUS
#define OWN_DIMENSIONS 0
#define OWN_STRIDES 0
#define OWN_DATA NPY_OWNDATA
#define SAVESPACE 0
#define SAVESPACEBIT 0

#undef import_array
#define import_array() { if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); } }
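For context, a minimal sketch of where an import_array() macro like the one redefined above is typically invoked: once, in a C extension's module init function, before any other NumPy C API call. The module name example_ext and the bare-bones module definition are placeholders for the example.

#include <Python.h>
#include <numpy/arrayobject.h>

/* Minimal module definition; names are placeholders for the example. */
static struct PyModuleDef example_def = {
    PyModuleDef_HEAD_INIT, "example_ext", NULL, -1, NULL
};

PyMODINIT_FUNC
PyInit_example_ext(void)
{
    import_array();   /* initialise the NumPy C API before any array calls */
    return PyModule_Create(&example_def);
}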