Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-13 02:36:32 +03:00)

Merge pull request #1502 from explosion/develop

💫 Merge spaCy 2 into master 🎉

This commit is contained in commit 75d2a05c29.

@@ -1 +1,55 @@
environment:

  matrix:

    # For Python versions available on Appveyor, see
    # http://www.appveyor.com/docs/installed-software#python
    # The list here is complete (excluding Python 2.6, which
    # isn't covered by this document) at the time of writing.

    - PYTHON: "C:\\Python27"
    #- PYTHON: "C:\\Python33"
    #- PYTHON: "C:\\Python34"
    #- PYTHON: "C:\\Python35"
    #- PYTHON: "C:\\Python27-x64"
    #- PYTHON: "C:\\Python33-x64"
    #- DISTUTILS_USE_SDK: "1"
    #- PYTHON: "C:\\Python34-x64"
    #- DISTUTILS_USE_SDK: "1"
    #- PYTHON: "C:\\Python35-x64"
    - PYTHON: "C:\\Python36-x64"

install:
  # We need wheel installed to build wheels
  - "%PYTHON%\\python.exe -m pip install wheel"
  - "%PYTHON%\\python.exe -m pip install cython"
  - "%PYTHON%\\python.exe -m pip install -r requirements.txt"
  - "%PYTHON%\\python.exe -m pip install -e ."

build: off

test_script:
  # Put your test command here.
  # If you don't need to build C extensions on 64-bit Python 3.3 or 3.4,
  # you can remove "build.cmd" from the front of the command, as it's
  # only needed to support those cases.
  # Note that you must use the environment variable %PYTHON% to refer to
  # the interpreter you're using - Appveyor does not do anything special
  # to put the Python version you want to use on PATH.
  - "%PYTHON%\\python.exe -m pytest spacy/"

after_test:
  # This step builds your wheels.
  # Again, you only need build.cmd if you're building C extensions for
  # 64-bit Python 3.3/3.4. And you need to use %PYTHON% to get the correct
  # interpreter
  - "%PYTHON%\\python.exe setup.py bdist_wheel"

artifacts:
  # bdist_wheel puts your built wheel in the dist directory
  - path: dist\*

#on_success:
#  You can use this step to upload your artifacts to a public website.
#  See Appveyor's documentation for more details. Or you can simply
#  access your wheels from the Appveyor "artifacts" tab for your build.

@@ -3,3 +3,9 @@ steps:
    command: "fab env clean make test sdist"
    label: ":dizzy: :python:"
    artifact_paths: "dist/*.tar.gz"
  - wait
  - trigger: "spacy-sdist-against-models"
    label: ":dizzy: :hammer:"
    build:
      env:
        SPACY_VERSION: "{$SPACY_VERSION}"

18  .github/CONTRIBUTOR_AGREEMENT.md  (vendored)

@@ -87,7 +87,7 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

@@ -96,11 +96,11 @@ mark both statements:

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | |
| GitHub username | |
| Website (optional) | |
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Abhinav Sharma |
| Company name (if applicable) | Fourtek I.T. Solutions Pvt. Ltd. |
| Title or role (if applicable) | Machine Learning Engineer |
| Date | 3 Novermber 2017 |
| GitHub username | abhi18av |
| Website (optional) | https://abhi18av.github.io/ |

4  .github/PULL_REQUEST_TEMPLATE.md  (vendored)

@@ -7,7 +7,7 @@ ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of changes
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

@@ -16,4 +16,4 @@ or new feature, or a change to the documentation? -->
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required details.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

@@ -78,7 +78,7 @@ took place before the date you sign these terms.
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable

@@ -87,20 +87,20 @@ U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

* [ ] I am signing on behalf of myself as an individual and no other person
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.

* [x] I am signing on behalf of my employer or a legal entity and I have the
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jeffrey Gerard |
| Company name (if applicable) | cephalo-ai / wellio |
| Title or role (if applicable) | Senior Data Scientist|
| Date | 11/02/2017 |
| GitHub username | IamJeffG |
| Name | Jim O'Regan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-06-24 |
| GitHub username | jimregan |
| Website (optional) | |
2  .github/contributors/ramananbalakrishnan.md  (vendored)

@@ -101,6 +101,6 @@ mark both statements:
| Name | Ramanan Balakrishnan |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2017-10-18 |
| Date | 2017-10-19 |
| GitHub username | ramananbalakrishnan |
| Website (optional) | |
13  .gitignore  (vendored)

@@ -1,14 +1,12 @@
# spaCy
spacy/data/
corpora/
models/
/models/
keys/

# Website
website/www/
website/_deploy.sh
website/package.json
website/announcement.jade
website/.gitignore

# Cython / C extensions

@@ -29,10 +27,8 @@ Profile.prof
.python-version
__pycache__/
*.py[cod]
.env*/
.env/
.env2/
.env3/
.env*
.~env/
.venv
venv/

@@ -42,7 +38,6 @@ venv/

# Distribution / packaging
env/
bin/
build/
develop-eggs/
dist/

@@ -102,7 +97,3 @@ Desktop.ini

# Other
*.tgz

# JetBrains PyCharm
.idea/

@@ -14,8 +14,7 @@ os:
env:
  - VIA=compile LC_ALL=en_US.ascii
  - VIA=compile

  # - VIA=sdist
  #- VIA=pypi_nightly

install:
  - "./travis.sh"

@@ -23,7 +22,7 @@ install:
script:
  - "pip install pytest pytest-timeout"
  - if [[ "${VIA}" == "compile" ]]; then python -m pytest --tb=native spacy; fi
  - if [[ "${VIA}" == "pypi" ]]; then python -m pytest --tb=native `python -c "import os.path; import spacy; print(os.path.abspath(ospath.dirname(spacy.__file__)))"`; fi
  - if [[ "${VIA}" == "pypi_nightly" ]]; then python -m pytest --tb=native --models --en `python -c "import os.path; import spacy; print(os.path.abspath(os.path.dirname(spacy.__file__)))"`; fi
  - if [[ "${VIA}" == "sdist" ]]; then python -m pytest --tb=native `python -c "import os.path; import spacy; print(os.path.abspath(os.path.dirname(spacy.__file__)))"`; fi

notifications:

393  CONTRIBUTING.md

@@ -2,7 +2,10 @@
# Contribute to spaCy

Following the v1.0 release, it's time to welcome more contributors into the spaCy project and code base 🎉 This page will give you a quick overview of how things are organised and most importantly, how to get involved.
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.

## Table of contents
1. [Issues and bug reports](#issues-and-bug-reports)

@@ -10,27 +13,68 @@ Following the v1.0 release, it's time to welcome more contributors into the spaC
3. [Code conventions](#code-conventions)
4. [Adding tests](#adding-tests)
5. [Updating the website](#updating-the-website)
6. [Submitting a tutorial](#submitting-a-tutorial)
7. [Submitting a project to the showcase](#submitting-a-project-to-the-showcase)
8. [Code of conduct](#code-of-conduct)
6. [Publishing extensions and plugins](#publishing-spacy-extensions-and-plugins)
7. [Code of conduct](#code-of-conduct)

## Issues and bug reports

First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplosion) to see if the issue has already been reported. If so, it's often better to just leave a comment on an existing issue, rather than creating a new one.
First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplosion)
to see if the issue has already been reported. If so, it's often better to just
leave a comment on an existing issue, rather than creating a new one. Old issues
also often include helpful tips and solutions to common problems. You should
also check the [troubleshooting guide](https://spacy.io/usage/#troubleshooting)
to see if your problem is already listed there.

If you're looking for help with your code, consider posting a question on [StackOverflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you tag it `spacy` and `python`, more people will see it and hopefully be able to help.
If you're looking for help with your code, consider posting a question on
[StackOverflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,
so that more people can benefit from it.

When opening an issue, use a descriptive title and include your environment (operating system, Python version, spaCy version). Our [issue template](https://github.com/explosion/spaCy/issues/new) helps you remember the most important details to include. If you've discovered a bug, you can also submit a [regression test](#fixing-bugs) straight away. When you're opening an issue to report the bug, simply refer to your pull request in the issue body.
### Submitting issues

### Tips
When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:

* **Getting info about your spaCy installation and environment**: If you're using spaCy v1.7+, you can use the command line interface to print details and even format them as Markdown to copy-paste into GitHub issues: `python -m spacy info --markdown`.
* **Describing your issue:** Try to provide as many details as possible. What
exactly goes wrong? *How* is it failing? Is there an error?
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
remember to include the code you ran and if possible, extract only the relevant
parts and don't just dump your entire script. This will make it easier for us to
reproduce the error.

* **Sharing long blocks of code or logs**: If you need to include long code, logs or tracebacks, you can wrap them in `<details>` and `</details>`. This [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) so it only becomes visible on click, making the issue easier to read and follow.
* **Getting info about your spaCy installation and environment:** If you're
using spaCy v1.7+, you can use the command line interface to print details and
even format them as Markdown to copy-paste into GitHub issues:
`python -m spacy info --markdown`.

* **Checking the model compatibility:** If you're having problems with a
[statistical model](https://spacy.io/models), it may be because the
model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
this on the command line by running `spacy validate`.

* **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
you can run from within your script or a Jupyter notebook. For some issues, it's
helpful to **include a screenshot** of the visualization. You can simply drag and
drop the image into GitHub's editor and it will be uploaded and included.

* **Sharing long blocks of code or logs:** If you need to include long code,
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
so it only becomes visible on click, making the issue easier to read and follow.

### Issue labels

To distinguish issues that are opened by us, the maintainers, we usually add a 💫 to the title. We also use the following system to tag our issues:
To distinguish issues that are opened by us, the maintainers, we usually add a
💫 to the title. We also use the following system to tag our issues and pull
requests:

| Issue label | Description |
| --- | --- |
@@ -40,55 +84,143 @@ To distinguish issues that are opened by us, the maintainers, we usually add a
| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
| [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
| [`training`](https://github.com/explosion/spaCy/labels/training) | Issues related to training and updating models |
| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress |
| [`wip`](https://github.com/explosion/spaCy/labels/wip) | Work in progress, mostly used for pull requests. |
| [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |
| [`meta`](https://github.com/explosion/spaCy/labels/meta) | Meta topics, e.g. repo organisation and issue management |
| [`help wanted`](https://github.com/explosion/spaCy/labels/help%20wanted), [`help wanted (easy)`](https://github.com/explosion/spaCy/labels/help%20wanted%20%28easy%29) | Requests for contributions |

## Contributing to the code base

You don't have to be an NLP expert or Python pro to contribute, and we're happy to help you get started. If you're new to spaCy, a good place to start is the [`help wanted (easy)`](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22) label, which we use to tag bugs and feature requests that are easy and self-contained. If you've decided to take on one of these problems and you're making good progress, don't forget to add a quick comment to the issue. You can also use the issue to ask questions, or share your work in progress.
You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[spaCy 101 guide](https://spacy.io/usage/spacy-101) and the
[`help wanted (easy)`](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22)
label, which we use to tag bugs and feature requests that are easy and
self-contained. If you've decided to take on one of these problems and you're
making good progress, don't forget to add a quick comment to the issue. You can
also use the issue to ask questions, or share your work in progress.

### What belongs in spaCy?

Every library has a different inclusion philosophy — a policy of what should be shipped in the core library, and what could be provided in other packages. Our philosophy is to prefer a smaller core library. We generally ask the following questions:
Every library has a different inclusion philosophy — a policy of what should be
shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:

* **What would this feature look like if implemented in a separate package?** Some features would be very difficult to implement externally. For instance, anything that requires a change to the `Token` class really needs to be implemented within spaCy, because there's no convenient way to make spaCy return custom `Token` objects. In contrast, a library of word alignment functions could easily live as a separate package that depended on spaCy — there's little difference between writing `import word_aligner` and `import spacy.word_aligner`.
* **What would this feature look like if implemented in a separate package?**
Some features would be very difficult to implement externally – for example,
changes to spaCy's built-in methods. In contrast, a library of word
alignment functions could easily live as a separate package that depended on
spaCy — there's little difference between writing `import word_aligner` and
`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
and add your own attributes, properties and methods to the `Doc`, `Token` and
`Span`. If you're looking to implement a new spaCy feature, starting with a
custom component package is usually the best strategy (see the sketch after
this list). You won't have to worry about spaCy's internals and you can test
your module in an isolated environment. And if it works well, we can always
integrate it into the core library later.

* **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?** Python has a very rich ecosystem. Libraries like Sci-Kit Learn, Scipy, Gensim, Keras etc. do lots of useful things — but we don't want to have them as dependencies. If the feature requires functionality in one of these libraries, it's probably better to break it out into a different package.
* **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
TensorFlow/Keras do lots of useful things — but we don't want to have them as
dependencies. If the feature requires functionality in one of these libraries,
it's probably better to break it out into a different package.

* **Is the feature orthogonal to the current spaCy functionality, or overlapping?** spaCy strongly prefers to avoid having 6 different ways of doing the same thing. As better techniques are developed, we prefer to drop support for "the old way". However, it's rare that one approach *entirely* dominates another. It's very common that there's still a use-case for the "obsolete" approach. For instance, [WordNet](https://wordnet.princeton.edu/) is still very useful — but word vectors are better for most use-cases, and the two approaches to lexical semantics do a lot of the same things. spaCy therefore only supports word vectors, and support for WordNet is currently left for other packages.
* **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
As better techniques are developed, we prefer to drop support for "the old way".
However, it's rare that one approach *entirely* dominates another. It's very
common that there's still a use-case for the "obsolete" approach. For instance,
[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
vectors are better for most use-cases, and the two approaches to lexical
semantics do a lot of the same things. spaCy therefore only supports word
vectors, and support for WordNet is currently left for other packages.

* **Do you need the feature to get basic things done?** We do want spaCy to be at least somewhat self-contained. If we keep needing some feature in our recipes, that does provide some argument for bringing it "in house".
* **Do you need the feature to get basic things done?** We do want spaCy to be
at least somewhat self-contained. If we keep needing some feature in our
recipes, that does provide some argument for bringing it "in house".
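To make the custom-component route above concrete, here is a minimal sketch, assuming spaCy v2.0+ is installed. The component name and the `is_greeting` attribute are invented for illustration and are not part of spaCy's API:

```python
import spacy
from spacy.tokens import Doc

# Register a (hypothetical) custom attribute on the Doc.
Doc.set_extension("is_greeting", default=False)


def greeting_component(doc):
    # A pipeline component is just a callable that receives and returns a Doc.
    doc._.is_greeting = doc.text.lower().startswith("hello")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe(greeting_component, name="greeting_detector", last=True)

doc = nlp("Hello world, this is a test.")
print(doc._.is_greeting)  # True
```

Because a component like this only touches `Doc._` attributes, it can live in its own package and be tested without touching spaCy's internals.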
### Developer resources
### Getting started

To make changes to spaCy's code base, you need to clone the GitHub repository
and build spaCy from source. You'll need to make sure that you have a
development environment consisting of a Python distribution including header
files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/),
[virtualenv](https://virtualenv.pypa.io/en/stable/) and
[git](https://git-scm.com) installed. The compiler is usually the trickiest part.

```
python -m pip install -U pip venv
git clone https://github.com/explosion/spaCy
cd spaCy

venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```

If you've made changes to `.pyx` files, you need to recompile spaCy before you
can test your changes by re-running `python setup.py build_ext --inplace`.
Changes to `.py` files will be effective immediately.

📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**

The [spaCy developer resources](https://github.com/explosion/spacy-dev-resources) repo contains useful scripts, tools and templates for developing spaCy, adding new languages and training new models. If you've written a script that might help others, feel free to contribute it to that repository.

### Contributor agreement

If you've made a substantial contribution to spaCy, you should fill in the [spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that your contribution can be used across the project. If you agree to be bound by the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md) and include it with your pull request, or submit it separately to [`.github/contributors/`](/.github/contributors). The name of the file should be your GitHub username, with the extension `.md`. For example, the user
If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
### Fixing bugs

When fixing a bug, first create an [issue](https://github.com/explosion/spaCy/issues) if one does not already exist. The description text can be very short – we don't want to make this too bureaucratic.
When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short – we don't want to make this too
bureaucratic.

Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug you're fixing, and make sure the test fails. Next, add and commit your test file referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Next, add and commit your test file
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.
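As a rough illustration of that convention, a regression test file might look like the following minimal sketch. The issue number and the behaviour being tested are invented for the example; it uses `spacy.lang.en.English` so that no statistical model needs to be loaded:

```python
# spacy/tests/regression/test_issue1234.py (hypothetical issue number)
from spacy.lang.en import English


def test_issue1234():
    # Build a blank English pipeline so no statistical model is required.
    nlp = English()
    doc = nlp("Hello, world!")
    # Assert the behaviour reported in the (made-up) issue; this should fail
    # before the fix and pass afterwards.
    assert [t.text for t in doc] == ["Hello", ",", "world", "!"]
```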
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Code conventions

Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/). Regular line length is **80 characters**, with some tolerance for lines up to 90 characters if the alternative would be worse — for instance, if your list comprehension comes to 82 characters, it's better not to split it over two lines.
Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
Regular line length is **80 characters**, with some tolerance for lines up to
90 characters if the alternative would be worse — for instance, if your list
comprehension comes to 82 characters, it's better not to split it over two lines.
You can also use a linter like [`flake8`](https://pypi.python.org/pypi/flake8)
or [`frosted`](https://pypi.python.org/pypi/frosted) – just keep in mind that
it won't work very well for `.pyx` files and will complain about Cython syntax
like `<int*>` or `cimport`.

### Python conventions

All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or platform compatibility should only live in [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, for example `unicode_`. If you need to access the user's version or platform information, for example to show more specific error messages, you can use the `is_config()` helper function.
All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()`
helper function.

```python
from .compat import unicode_, json_dumps, is_config

@@ -99,21 +231,56 @@ if is_config(windows=True, python2=True):
    print("You are using Python 2 on Windows.")
```

Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects.
Code that interacts with the file-system should accept objects that follow the
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
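A minimal sketch of that convention — the function names here are invented for illustration and are not part of spaCy's API:

```python
from io import BytesIO
from pathlib import Path


def read_lexemes(path):
    # User-facing function: accept a str or any pathlib.Path-like object.
    if isinstance(path, str):
        path = Path(path)
    with path.open("rb") as file_:
        return lexemes_from_file(file_)


def lexemes_from_file(file_):
    # Deserialization works on a file-like object, so it also accepts
    # in-memory buffers, e.g. lexemes_from_file(BytesIO(b"...")).
    return file_.read()
```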
At the time of writing (v1.7), spaCy's serialization and deserialization functions are inconsistent about accepting paths vs accepting file-like objects. The correct answer is "file-like objects" — that's what we want going forward, as it makes the library io-agnostic. Working on buffers makes the code more general, easier to test, and compatible with Python 3's asynchronous IO.
Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion**
— it's seen as a mechanism of last resort. You should discuss plans to extend
the class hierarchy before implementing.

Although spaCy uses a lot of classes, inheritance is viewed with some suspicion — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing.

We have a number of conventions around variable naming that are still being documented, and aren't 100% strict. A general policy is that instances of the class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`, `Vocab` `vocab` and `Language` `nlp`. You should avoid naming variables that are of other types these names. For instance, don't name a text string `doc` — you should usually call this `text`. Two general code style preferences further help with naming. First, lean away from introducing temporary variables, as these clutter your namespace. This is one reason why comprehension expressions are often preferred. Second, keep your functions shortish, so that they can work in a smaller scope. Of course, this is a question of trade-offs.
We have a number of conventions around variable naming that are still being
documented, and aren't 100% strict. A general policy is that instances of the
class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`,
`Vocab` `vocab` and `Language` `nlp`. You should avoid naming variables that are
of other types these names. For instance, don't name a text string `doc` — you
should usually call this `text`. Two general code style preferences further help
with naming. First, **lean away from introducing temporary variables**, as these
clutter your namespace. This is one reason why comprehension expressions are
often preferred. Second, **keep your functions shortish**, so that they can work
in a smaller scope. Of course, this is a question of trade-offs.
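A trivial sketch of the naming convention, assuming spaCy is installed:

```python
import spacy

nlp = spacy.blank("en")            # Language instances are called `nlp`
text = "Plain strings are text."   # plain strings are `text`, never `doc`
doc = nlp(text)                    # Doc objects are called `doc`
token = doc[0]                     # Token objects are called `token`
vocab = nlp.vocab                  # the Vocab is called `vocab`
lex = vocab[token.text]            # Lexeme objects are called `lex`
```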
### Cython conventions

spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef` classes. Memory is managed through the `cymem.cymem.Pool` class, which allows you to allocate memory which will be freed when the `Pool` object is garbage collected. This means you usually don't have to worry about freeing memory. You just have to decide which Python object owns the memory, and make it own the `Pool`. When that object goes out of scope, the memory will be freed. You do have to take care that no pointers outlive the object that owns them — but this is generally quite easy.
spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef`
classes. Memory is managed through the `cymem.cymem.Pool` class, which allows
you to allocate memory which will be freed when the `Pool` object is garbage
collected. This means you usually don't have to worry about freeing memory. You
just have to decide which Python object owns the memory, and make it own the
`Pool`. When that object goes out of scope, the memory will be freed. You do
have to take care that no pointers outlive the object that owns them — but this
is generally quite easy.

All Cython modules should have the `# cython: infer_types=True` compiler directive at the top of the file. This makes the code much cleaner, as it avoids the need for many type declarations. If possible, you should prefer to declare your functions `nogil`, even if you don't especially care about multi-threading. The reason is that `nogil` functions help the Cython compiler reason about your code quite a lot — you're telling the compiler that no Python dynamics are possible. This lets many errors be raised, and ensures your function will run at C speed.
All Cython modules should have the `# cython: infer_types=True` compiler
directive at the top of the file. This makes the code much cleaner, as it avoids
the need for many type declarations. If possible, you should prefer to declare
your functions `nogil`, even if you don't especially care about multi-threading.
The reason is that `nogil` functions help the Cython compiler reason about your
code quite a lot — you're telling the compiler that no Python dynamics are
possible. This lets many errors be raised, and ensures your function will run
at C speed.

Cython gives you many choices of sequences: you could have a Python list, a numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred, because they are fastest, have the most explicit semantics, and let the compiler check your code more strictly. C++ vectors are also great — but you should only use them internally in functions. It's less friendly to accept a vector as an argument, because that asks the user to do much more work.
Cython gives you many choices of sequences: you could have a Python list, a
numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred,
because they are fastest, have the most explicit semantics, and let the compiler
check your code more strictly. C++ vectors are also great — but you should only
use them internally in functions. It's less friendly to accept a vector as an
argument, because that asks the user to do much more work.

Here's how to get a pointer from a numpy array, memory view or vector:

@@ -124,9 +291,14 @@ cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_ve
```
    pointer3 = &memory_view[0]
```

Both C arrays and C++ vectors reassure the compiler that no Python operations are possible on your variable. This is a big advantage: it lets the Cython compiler raise many more errors for you.
Both C arrays and C++ vectors reassure the compiler that no Python operations
are possible on your variable. This is a big advantage: it lets the Cython
compiler raise many more errors for you.

When getting a pointer from a numpy array or memoryview, take care that the data is actually stored in C-contiguous order — otherwise you'll get a pointer to nonsense. The type-declarations in the code above should generate runtime errors if buffers with incorrect memory layouts are passed in.
When getting a pointer from a numpy array or memoryview, take care that the data
is actually stored in C-contiguous order — otherwise you'll get a pointer to
nonsense. The type-declarations in the code above should generate runtime errors
if buffers with incorrect memory layouts are passed in.

To iterate over the array, the following style is preferred:

@@ -138,13 +310,40 @@ cdef int c_total(const int* int_array, int length) nogil:
```
    return total
```

If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above — the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.
If this is confusing, consider that the compiler couldn't deal with
`for item in int_array:` — there's no length attached to a raw pointer, so how
could we figure out where to stop? The length is provided in the slice notation
as a solution to this. Note that we don't have to declare the type of `item` in
the code above — the compiler can easily infer it. This gives us tidy code that
looks quite like Python, but is exactly as fast as C — because we've made sure
the compilation to C is trivial.

Your functions cannot be declared `nogil` if they need to create Python objects or call Python functions. This is perfectly okay — you shouldn't torture your code just to get `nogil` functions. However, if your function isn't `nogil`, you should compile your module with `cython -a --cplus my_module.pyx` and open the resulting `my_module.html` file in a browser. This will let you see how Cython is compiling your code. Calls into the Python run-time will be in bright yellow. This lets you easily see whether Cython is able to correctly type your code, or whether there are unexpected problems.
Your functions cannot be declared `nogil` if they need to create Python objects
or call Python functions. This is perfectly okay — you shouldn't torture your
code just to get `nogil` functions. However, if your function isn't `nogil`, you
should compile your module with `cython -a --cplus my_module.pyx` and open the
resulting `my_module.html` file in a browser. This will let you see how Cython
is compiling your code. Calls into the Python run-time will be in bright yellow.
This lets you easily see whether Cython is able to correctly type your code, or
whether there are unexpected problems.

Finally, if you're new to Cython, you should expect to find the first steps a bit frustrating. It's a very large language, since it's essentially a superset of Python and C++, with additional complexity and syntax from numpy. The [documentation](http://docs.cython.org/en/latest/) isn't great, and there are many "traps for new players". Help is available on [Gitter](https://gitter.im/explosion/spaCy).

Working in Cython is very rewarding once you're over the initial learning curve. As with C and C++, the first way you write something in Cython will often be the performance-optimal approach. In contrast, Python optimisation generally requires a lot of experimentation. Is it faster to have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`? Does this numpy operation create a copy? There's no way to guess the answers to these questions, and you'll usually be dissatisfied with your results — so there's no way to know when to stop this process. In the worst case, you'll make a mess that invites the next reader to try their luck too. This is like one of those [volcanic gas-traps](http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract), where the rescuers keep passing out from low oxygen, causing another rescuer to follow — only to succumb themselves. In short, just say no to optimizing your Python. If it's not fast enough the first time, just switch to Cython.
Finally, if you're new to Cython, you should expect to find the first steps a
bit frustrating. It's a very large language, since it's essentially a superset
of Python and C++, with additional complexity and syntax from numpy. The
[documentation](http://docs.cython.org/en/latest/) isn't great, and there are
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
there's no way to know when to stop this process. In the worst case, you'll make
a mess that invites the next reader to try their luck too. This is like one of
those [volcanic gas-traps](http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract),
where the rescuers keep passing out from low oxygen, causing another rescuer to
follow — only to succumb themselves. In short, just say no to optimizing your
Python. If it's not fast enough the first time, just switch to Cython.

### Resources to get you started
@@ -156,18 +355,34 @@ Working in Cython is very rewarding once you're over the initial learning curve.

## Adding tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html). Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).
Tests for spaCy modules and classes live in their own directories of the same
name. For example, tests for the `Tokenizer` can be found in
[`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run,
all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and concise and only test for one behaviour at a time. Try to `parametrize` test cases wherever possible, use our pre-defined fixtures for spaCy components and avoid unnecessary imports.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`. Tests that require the model to be loaded should be marked with `@pytest.mark.models`. Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use the `get_doc()` utility function to construct it manually.
Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `get_doc()` utility function to construct it manually.
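A minimal sketch of those conventions; it assumes the shared `en_tokenizer` fixture from spaCy's test suite, and the test names and expected counts are illustrative only:

```python
import pytest


@pytest.mark.parametrize("text,length", [("Hello, world!", 4), ("don't", 2)])
def test_en_tokenizer_splits_punctuation(en_tokenizer, text, length):
    # Parametrized cases keep each test focused on one behaviour.
    tokens = en_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.slow
def test_something_expensive():
    # Long-running tests are marked so they can be deselected by default.
    pass
```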
📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Updating the website

Our [website and docs](https://spacy.io) are implemented in [Jade/Pug](https://www.jade-lang.org), and built or served by [Harp](https://harpjs.com). Jade/Pug is an extensible templating language with a readable syntax, that compiles to HTML. Here's how to view the site locally:
Our [website and docs](https://spacy.io) are implemented in
[Jade/Pug](https://www.jade-lang.org), and built or served by
[Harp](https://harpjs.com). Jade/Pug is an extensible templating language with a
readable syntax, that compiles to HTML. Here's how to view the site locally:

```bash
sudo npm install --global harp

@@ -176,9 +391,25 @@ cd spaCy/website
harp server
```

The docs can always use another example or more detail, and they should always be up to date and not misleading. To quickly find the correct file to edit, simply click on the "Suggest edits" button at the bottom of a page.
The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
simply click on the "Suggest edits" button at the bottom of a page. To keep
long pages maintainable, and allow including content in several places without
doubling it, sections often consist of partials. Partials and partial directories
are prefixed by an underscore `_` so they're not compiled with the site. For
example:

To make it easy to add content components, we use a [collection of custom mixins](_includes/_mixins.jade), like `+table`, `+list` or `+code`.
```pug
+section("tokenization")
    +h(2, "tokenization") Tokenization
    include _spacy-101/_tokenization
```

So if you're looking to edit the content of the tokenization section, you can
find it in `_spacy-101/_tokenization.jade`. To make it easy to add content
components, we use a [collection of custom mixins](_includes/_mixins.jade),
like `+table`, `+list` or `+code`. For an overview of the available mixins and
components, see the [styleguide](https://spacy.io/styleguide).

📖 **For more info and troubleshooting guides, check out the [website README](website).**

@@ -186,62 +417,40 @@ To make it easy to add content components, we use a [collection of custom mixins

* [Guide to static websites with Harp and Jade](https://ines.io/blog/the-ultimate-guide-static-websites-harp-jade) (ines.io)
* [Building a website with modular markup components (mixins)](https://explosion.ai/blog/modular-markup) (explosion.ai)
* [spacy.io Styleguide](https://spacy.io/styleguide) (spacy.io)
* [Jade/Pug documentation](https://pugjs.org) (pugjs.org)
* [Harp documentation](https://harpjs.com/) (harpjs.com)
## Submitting a tutorial
## Publishing spaCy extensions and plugins

Did you write a [tutorial](https://spacy.io/docs/usage/tutorials) to help others use spaCy, or did you come across one that should be added to our directory? You can submit it by making a pull request to [`website/docs/usage/_data.json`](website/docs/usage/_data.json):
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

```json
{
    "tutorials": {
        "deep_dives": {
            "Deep Learning with custom pipelines and Keras": {
                "url": "https://explosion.ai/blog/spacy-deep-learning-keras",
                "author": "Matthew Honnibal",
                "tags": [ "keras", "sentiment" ]
            }
        }
    }
}
```

* An extension or plugin should add substantial functionality, be
**well-documented** and **open-source**. It should be available for users to download
and install as a Python package – for example via [PyPi](http://pypi.python.org).

### A few tips
* Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
that users can **add to their processing pipeline** using `nlp.add_pipe()`.

* A suitable tutorial should provide additional content and practical examples that are not covered as such in the docs.
* Make sure to choose the right category – `first_steps`, `deep_dives` (tutorials that take a deeper look at specific features) or `code` (programs and scripts on GitHub etc.).
* Don't go overboard with the tags. Take inspiration from the existing ones and only add tags for features (`"sentiment"`, `"pos"`) or integrations (`"jupyter"`, `"keras"`).
* Double-check the JSON markup and/or use a linter. A wrong or missing comma will (unfortunately) break the site rendering.
* When publishing your extension on GitHub, **tag it** with the topics
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.

## Submitting a project to the showcase
* Once your extension is published, you can open an issue on the
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
[resources directory](https://spacy.io/usage/resources#extensions) on the
website.

Have you built a library, visualizer, demo or product with spaCy, or did you come across one that should be featured in our [showcase](https://spacy.io/docs/usage/showcase)? You can submit it by making a pull request to [`website/docs/usage/_data.json`](website/docs/usage/_data.json):

```json
{
    "showcase": {
        "visualizations": {
            "displaCy": {
                "url": "https://demos.explosion.ai/displacy",
                "author": "Ines Montani",
                "description": "An open-source NLP visualiser for the modern web",
                "image": "displacy.jpg"
            }
        }
    }
}
```

### A few tips

* A suitable third-party library should add substantial functionality, be well-documented and open-source. If it's just a code snippet or script, consider submitting it to the `code` category of the tutorials section instead.
* A suitable demo should be hosted and accessible online. Open-source code is always a plus.
* For visualizations and products, add an image that clearly shows how it looks – screenshots are ideal.
* The image should be resized to 300x188px, optimised using a tool like [ImageOptim](https://imageoptim.com/mac) and added to [`website/assets/img/showcase`](website/assets/img/showcase).
* Double-check the JSON markup and/or use a linter. A wrong or missing comma will (unfortunately) break the site rendering.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

## Code of conduct

spaCy adheres to the [Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/). By participating, you are expected to uphold this code.
spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.
@@ -7,7 +7,8 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Alexis Eidelman, [@AlexisEidelman](https://github.com/AlexisEidelman)
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
* Aniruddha Adhikary, [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
* Anto Binish Kaspar, [@binishkaspar](https://github.com/binishkaspar)
* Ben Eyal, [@beneyal](https://github.com/beneyal)
* Bhargav Srinivasa, [@bhargavvader](https://github.com/bhargavvader)
* Bruno P. Kinoshita, [@kinow](https://github.com/kinow)

@@ -47,6 +48,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Orion Montoya, [@mdcclv](https://github.com/mdcclv)
* Paul O'Leary McCann, [@polm](https://github.com/polm)
* Pokey Rule, [@pokey](https://github.com/pokey)
* Ramanan Balakrishnan, [@ramananbalakrishnan](https://github.com/ramananbalakrishnan)
* Raphaël Bournhonesque, [@raphael0202](https://github.com/raphael0202)
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
* Roman Inflianskas, [@rominf](https://github.com/rominf)

@@ -1,3 +1,4 @@
recursive-include include *.h
include LICENSE
include README.rst
include bin/spacy
293  README.rst

@@ -1,25 +1,20 @@
spaCy: Industrial-strength NLP
******************************

spaCy is a library for advanced natural language processing in Python and
Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day one to be used in real products. spaCy currently supports
English, German, French and Spanish, as well as tokenization for Italian,
Portuguese, Dutch, Swedish, Finnish, Norwegian, Hungarian, Bengali, Hebrew, Thai,
Chinese and Japanese. It's commercial open-source software, released under the
MIT license.
spaCy is a library for advanced Natural Language Processing in Python and Cython.
It's built on the very latest research, and was designed from day one to be
used in real products. spaCy comes with
`pre-trained statistical models <https://spacy.io/models>`_ and word
vectors, and currently supports tokenization for **20+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.

⭐️ **Test spaCy v2.0.0 and the new models!** `Check out the new features here. <https://alpha.spacy.io/usage/v2>`_

💫 **Version 1.10 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
💫 **Version 2.0 out now!** `Check out the new features here. <https://spacy.io/usage/v2>`_

.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
    :target: https://travis-ci.org/explosion/spaCy
    :alt: Travis Build Status

.. image:: https://img.shields.io/appveyor/ci/explosion/spacy/master.svg?style=flat-square
    :target: https://ci.appveyor.com/project/explosion/spacy
    :alt: Appveyor Build Status
    :alt: Build Status

.. image:: https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square
    :target: https://github.com/explosion/spaCy/releases

@@ -44,69 +39,72 @@ MIT license.
📖 Documentation
================

=================== ===
`Usage Workflows`_  How to use spaCy and its features.
`API Reference`_    The detailed reference for spaCy's API.
`Troubleshooting`_  Common problems and solutions for beginners.
`Tutorials`_        End-to-end examples, with code you can modify and run.
`Showcase & Demos`_ Demos, libraries and products from the spaCy community.
`Contribute`_       How to contribute to the spaCy project and code base.
=================== ===
=================== ===
`spaCy 101`_        New to spaCy? Here's everything you need to know!
`Usage Guides`_     How to use spaCy and its features.
`New in v2.0`_      New features, backwards incompatibilities and migration guide.
`API Reference`_    The detailed reference for spaCy's API.
`Models`_           Download statistical language models for spaCy.
`Resources`_        Libraries, extensions, demos, books and courses.
`Changelog`_        Changes and version history.
`Contribute`_       How to contribute to the spaCy project and code base.
=================== ===

.. _Usage Workflows: https://spacy.io/docs/usage/
.. _API Reference: https://spacy.io/docs/api/
.. _Troubleshooting: https://spacy.io/docs/usage/troubleshooting
.. _Tutorials: https://spacy.io/docs/usage/tutorials
.. _Showcase & Demos: https://spacy.io/docs/usage/showcase
|
||||
.. _spaCy 101: https://spacy.io/usage/spacy-101
|
||||
.. _New in v2.0: https://spacy.io/usage/v2#migrating
|
||||
.. _Usage Guides: https://spacy.io/usage/
|
||||
.. _API Reference: https://spacy.io/api/
|
||||
.. _Models: https://spacy.io/models
|
||||
.. _Resources: https://spacy.io/usage/resources
|
||||
.. _Changelog: https://spacy.io/usage/#changelog
|
||||
.. _Contribute: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
|
||||
|
||||
💬 Where to ask questions
|
||||
==========================
|
||||
|
||||
Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.
|
||||
The spaCy project is maintained by `@honnibal <https://github.com/honnibal>`_
|
||||
and `@ines <https://github.com/ines>`_. Please understand that we won't be able
|
||||
to provide individual support via email. We also believe that help is much more
|
||||
valuable if it's shared publicly, so that more people can benefit from it.
|
||||
|
||||
====================== ===
|
||||
**Bug reports** `GitHub issue tracker`_
|
||||
**Usage questions** `StackOverflow`_, `Gitter chat`_, `Reddit user group`_
|
||||
**General discussion** `Gitter chat`_, `Reddit user group`_
|
||||
**Bug Reports** `GitHub Issue Tracker`_
|
||||
**Usage Questions** `StackOverflow`_, `Gitter Chat`_, `Reddit User Group`_
|
||||
**General Discussion** `Gitter Chat`_, `Reddit User Group`_
|
||||
====================== ===
|
||||
|
||||
.. _GitHub issue tracker: https://github.com/explosion/spaCy/issues
|
||||
.. _GitHub Issue Tracker: https://github.com/explosion/spaCy/issues
|
||||
.. _StackOverflow: http://stackoverflow.com/questions/tagged/spacy
|
||||
.. _Gitter chat: https://gitter.im/explosion/spaCy
|
||||
.. _Reddit user group: https://www.reddit.com/r/spacynlp
|
||||
.. _Gitter Chat: https://gitter.im/explosion/spaCy
|
||||
.. _Reddit User Group: https://www.reddit.com/r/spacynlp
|
||||
|
||||
Features
|
||||
========
|
||||
|
||||
* Non-destructive **tokenization**
|
||||
* Syntax-driven sentence segmentation
|
||||
* Pre-trained **word vectors**
|
||||
* Part-of-speech tagging
|
||||
* **Fastest syntactic parser** in the world
|
||||
* **Named entity** recognition
|
||||
* Labelled dependency parsing
|
||||
* Convenient string-to-int mapping
|
||||
* Export to numpy data arrays
|
||||
* GIL-free **multi-threading**
|
||||
* Efficient binary serialization
|
||||
* Non-destructive **tokenization**
|
||||
* Support for **20+ languages**
|
||||
* Pre-trained `statistical models <https://spacy.io/models>`_ and word vectors
|
||||
* Easy **deep learning** integration
|
||||
* Statistical models for **English**, **German**, **French** and **Spanish**
|
||||
* Part-of-speech tagging
|
||||
* Labelled dependency parsing
|
||||
* Syntax-driven sentence segmentation
|
||||
* Built in **visualizers** for syntax and NER
|
||||
* Convenient string-to-hash mapping
|
||||
* Export to numpy data arrays
|
||||
* Efficient binary serialization
|
||||
* Easy **model packaging** and deployment
|
||||
* State-of-the-art speed
|
||||
* Robust, rigorously evaluated accuracy
|
||||
|
||||
See `facts, figures and benchmarks <https://spacy.io/docs/api/>`_.
|
||||
📖 **For more details, see the** `facts, figures and benchmarks <https://spacy.io/usage/facts-figures>`_.
|
||||
|
||||
Top Performance
|
||||
---------------
|
||||
Install spaCy
|
||||
=============
|
||||
|
||||
* Fastest in the world: <50ms per document. No faster system has ever been
|
||||
announced.
|
||||
* Accuracy within 1% of the current state-of-the-art on all tasks performed
|
||||
(parsing, named entity recognition, part-of-speech tagging). The only more
|
||||
accurate systems are an order of magnitude slower or more.
|
||||
|
||||
Supports
|
||||
--------
|
||||
For detailed installation instructions, see
|
||||
the `documentation <https://spacy.io/usage>`_.
|
||||
|
||||
==================== ===
|
||||
**Operating system** macOS / OS X, Linux, Windows (Cygwin, MinGW, Visual Studio)
|
||||
|
@ -117,12 +115,6 @@ Supports
|
|||
.. _pip: https://pypi.python.org/pypi/spacy
|
||||
.. _conda: https://anaconda.org/conda-forge/spacy
|
||||
|
||||
Install spaCy
|
||||
=============
|
||||
|
||||
Installation requires a working build environment. See notes on Ubuntu,
|
||||
macOS/OS X and Windows for details.
|
||||
|
||||
pip
|
||||
---
|
||||
|
||||
|
@ -130,14 +122,14 @@ Using pip, spaCy releases are currently only available as source packages.
|
|||
|
||||
.. code:: bash
|
||||
|
||||
pip install -U spacy
|
||||
pip install spacy
|
||||
|
||||
When using pip it is generally recommended to install packages in a ``virtualenv``
|
||||
to avoid modifying system state:
|
||||
When using pip it is generally recommended to install packages in a virtual
|
||||
environment to avoid modifying system state:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
virtualenv .env
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
pip install spacy
|
||||
|
||||
|
@ -156,24 +148,40 @@ For the feedstock including the build recipe and configuration,
|
|||
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
|
||||
Improvements and pull requests to the recipe and setup are always appreciated.
|
||||
|
||||
Updating spaCy
|
||||
--------------
|
||||
|
||||
Some updates to spaCy may require downloading new statistical models. If you're
|
||||
running spaCy v2.0 or higher, you can use the ``validate`` command to check if
|
||||
your installed models are compatible and if not, print details on how to update
|
||||
them:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
pip install -U spacy
|
||||
spacy validate
|
||||
|
||||
If you've trained your own models, keep in mind that your training and runtime
|
||||
inputs must match. After updating spaCy, we recommend **retraining your models**
|
||||
with the new version.
|
||||
|
||||
📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the**
|
||||
`migration guide <https://spacy.io/usage/v2#migrating>`_.
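If you want to see at a glance which versions are involved, the running spaCy
version and the version requirement stored in a model's meta can be printed
from Python. This is only a convenience sketch: it assumes the ``en`` shortcut
link points at an installed model and that the model's meta includes a
``spacy_version`` entry.

.. code:: python

    # Sketch: print the installed spaCy version next to the version
    # requirement recorded in the loaded model's meta.json.
    import spacy

    nlp = spacy.load('en')
    print('spaCy version:', spacy.__version__)
    print('model requires:', nlp.meta.get('spacy_version'))  # e.g. ">=2.0.0"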
|
||||
|
||||
Download models
|
||||
===============
|
||||
|
||||
As of v1.7.0, models for spaCy can be installed as **Python packages**.
|
||||
This means that they're a component of your application, just like any
|
||||
other module. They're versioned and can be defined as a dependency in your
|
||||
``requirements.txt``. Models can be installed from a download URL or
|
||||
a local directory, manually or via pip. Their data can be located anywhere on
|
||||
your file system. To make a model available to spaCy, all you need to do is
|
||||
create a "shortcut link", an internal alias that tells spaCy where to find the
|
||||
data files for a specific model name.
|
||||
other module. Models can be installed using spaCy's ``download`` command,
|
||||
or manually by pointing pip to a path or URL.
|
||||
|
||||
======================= ===
|
||||
`spaCy Models`_ Available models, latest releases and direct download.
|
||||
`Available Models`_ Detailed model descriptions, accuracy figures and benchmarks.
|
||||
`Models Documentation`_ Detailed usage instructions.
|
||||
======================= ===
|
||||
|
||||
.. _spaCy Models: https://github.com/explosion/spacy-models/releases/
|
||||
.. _Available Models: https://spacy.io/models
|
||||
.. _Models Documentation: https://spacy.io/docs/usage/models
|
||||
|
||||
.. code:: bash
|
||||
|
@ -182,17 +190,10 @@ data files for a specific model name.
|
|||
python -m spacy download en
|
||||
|
||||
# download best-matching version of specific model for your spaCy installation
|
||||
python -m spacy download en_core_web_md
|
||||
python -m spacy download en_core_web_lg
|
||||
|
||||
# pip install .tar.gz archive from path or URL
|
||||
pip install /Users/you/en_core_web_md-1.2.0.tar.gz
|
||||
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz
|
||||
|
||||
# set up shortcut link to load installed package as "en_default"
|
||||
python -m spacy link en_core_web_md en_default
|
||||
|
||||
# set up shortcut link to load local model as "my_amazing_model"
|
||||
python -m spacy link /Users/you/data my_amazing_model
|
||||
pip install /Users/you/en_core_web_sm-2.0.0.tar.gz
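Since a model is just a regular Python package, standard packaging tools can
tell you which version is installed. A minimal sketch, assuming
``en_core_web_sm`` has been installed via one of the commands above:

.. code:: python

    # Sketch: model packages behave like any other pip package, so their
    # installed version can be read with pkg_resources (ships with setuptools).
    import pkg_resources

    dist = pkg_resources.get_distribution('en_core_web_sm')
    print(dist.project_name, dist.version)  # e.g. "en-core-web-sm 2.0.0"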
|
||||
|
||||
Loading and using models
|
||||
------------------------
|
||||
|
@ -202,28 +203,28 @@ To load a model, use ``spacy.load()`` with the model's shortcut link:
|
|||
.. code:: python
|
||||
|
||||
import spacy
|
||||
nlp = spacy.load('en_default')
|
||||
nlp = spacy.load('en')
|
||||
doc = nlp(u'This is a sentence.')
|
||||
|
||||
If you've installed a model via pip, you can also ``import`` it directly and
|
||||
then call its ``load()`` method with no arguments. This should also work for
|
||||
older models in previous versions of spaCy.
|
||||
then call its ``load()`` method:
|
||||
|
||||
.. code:: python
|
||||
|
||||
import spacy
|
||||
import en_core_web_md
|
||||
import en_core_web_sm
|
||||
|
||||
nlp = en_core_web_md.load()
|
||||
nlp = en_core_web_sm.load()
|
||||
doc = nlp(u'This is a sentence.')
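Whichever way the model was loaded, the returned ``Doc`` can be inspected
directly. A minimal sketch, assuming the ``en`` shortcut link points at a
model with a tagger, parser and entity recognizer:

.. code:: python

    # Sketch: look at the annotations on a processed Doc.
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    for token in doc:
        print(token.text, token.pos_, token.dep_)  # word, coarse POS tag, dependency label
    for ent in doc.ents:
        print(ent.text, ent.label_)                # e.g. "Apple ORG", "$1 billion MONEY"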
|
||||
|
||||
📖 **For more info and examples, check out the** `models documentation <https://spacy.io/docs/usage/models>`_.
|
||||
📖 **For more info and examples, check out the**
|
||||
`models documentation <https://spacy.io/docs/usage/models>`_.
|
||||
|
||||
Support for older versions
|
||||
--------------------------
|
||||
|
||||
If you're using an older version (v1.6.0 or below), you can still download and
|
||||
install the old models from within spaCy using ``python -m spacy.en.download all``
|
||||
If you're using an older version (``v1.6.0`` or below), you can still download
|
||||
and install the old models from within spaCy using ``python -m spacy.en.download all``
|
||||
or ``python -m spacy.de.download all``. The ``.tar.gz`` archives are also
|
||||
`attached to the v1.6.0 release <https://github.com/explosion/spaCy/tree/v1.6.0>`_.
|
||||
To download and install the models manually, unpack the archive, drop the
|
||||
|
@ -236,7 +237,7 @@ Compile from source
|
|||
The other way to install spaCy is to clone its
|
||||
`GitHub repository <https://github.com/explosion/spaCy>`_ and build it from
|
||||
source. That is the common way if you want to make changes to the code base.
|
||||
You'll need to make sure that you have a development enviroment consisting of a
|
||||
You'll need to make sure that you have a development environment consisting of a
|
||||
Python distribution including header files, a compiler,
|
||||
`pip <https://pip.pypa.io/en/latest/installing/>`__, `virtualenv <https://virtualenv.pypa.io/>`_
|
||||
and `git <https://git-scm.com>`_ installed. The compiler part is the trickiest.
|
||||
|
@ -246,36 +247,36 @@ details.
|
|||
.. code:: bash
|
||||
|
||||
# make sure you are using recent pip/virtualenv versions
|
||||
python -m pip install -U pip virtualenv
|
||||
python -m pip install -U pip
|
||||
git clone https://github.com/explosion/spaCy
|
||||
cd spaCy
|
||||
|
||||
virtualenv .env
|
||||
python -m venv .env
|
||||
source .env/bin/activate
|
||||
export PYTHONPATH=`pwd`
|
||||
pip install -r requirements.txt
|
||||
pip install -e .
|
||||
python setup.py build_ext --inplace
|
||||
|
||||
Compared to a regular install via pip, `requirements.txt <requirements.txt>`_
|
||||
additionally installs developer dependencies such as Cython.
|
||||
Compared to a regular install via pip, `requirements.txt <requirements.txt>`_
|
||||
additionally installs developer dependencies such as Cython. For more details
|
||||
and instructions, see the documentation on
|
||||
`compiling spaCy from source <https://spacy.io/usage/#source>`_ and the
|
||||
`quickstart widget <https://spacy.io/usage/#section-quickstart>`_ to get
|
||||
the right commands for your platform and Python version.
|
||||
|
||||
Instead of the above verbose commands, you can also use the following
|
||||
`Fabric <http://www.fabfile.org/>`_ commands:
|
||||
`Fabric <http://www.fabfile.org/>`_ commands. All commands assume that your
|
||||
virtual environment is located in a directory ``.env``. If you're using a
|
||||
different directory, you can change it via the environment variable ``VENV_DIR``,
|
||||
for example ``VENV_DIR=".custom-env" fab clean make``.
|
||||
|
||||
============= ===
|
||||
``fab env`` Create ``virtualenv`` and delete previous one, if it exists.
|
||||
``fab env`` Create virtual environment and delete previous one, if it exists.
|
||||
``fab make`` Compile the source.
|
||||
``fab clean`` Remove compiled objects, including the generated C++.
|
||||
``fab test`` Run basic tests, aborting after first failure.
|
||||
============= ===
|
||||
|
||||
All commands assume that your ``virtualenv`` is located in a directory ``.env``.
|
||||
If you're using a different directory, you can change it via the environment
|
||||
variable ``VENV_DIR``, for example:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
VENV_DIR=".custom-env" fab clean make
|
||||
|
||||
Ubuntu
|
||||
------
|
||||
|
||||
|
@ -317,80 +318,4 @@ and ``--model`` are optional and enable additional tests:
|
|||
|
||||
# make sure you are using a recent pytest version
|
||||
python -m pip install -U pytest
|
||||
|
||||
python -m pytest <spacy-directory> --vectors --models --slow
|
||||
|
||||
🛠 Changelog
|
||||
============
|
||||
|
||||
=========== ============== ===========
|
||||
Version Date Description
|
||||
=========== ============== ===========
|
||||
`v1.10.0`_ ``2017-11-07`` Alpha support for Thai & Russian, plus improvements and bug fixes
|
||||
`v1.9.0`_ ``2017-07-22`` Spanish model, alpha support for Norwegian & Japanese, and bug fixes
|
||||
`v1.8.2`_ ``2017-04-26`` French model and small improvements
|
||||
`v1.8.1`_ ``2017-04-23`` Saving, loading and training bug fixes
|
||||
`v1.8.0`_ ``2017-04-16`` Better NER training, saving and loading
|
||||
`v1.7.5`_ ``2017-04-07`` Bug fixes and new CLI commands
|
||||
`v1.7.3`_ ``2017-03-26`` Alpha support for Hebrew, new CLI commands and bug fixes
|
||||
`v1.7.2`_ ``2017-03-20`` Small fixes to beam parser and model linking
|
||||
`v1.7.1`_ ``2017-03-19`` Fix data download for system installation
|
||||
`v1.7.0`_ ``2017-03-18`` New 50 MB model, CLI, better downloads and lots of bug fixes
|
||||
`v1.6.0`_ ``2017-01-16`` Improvements to tokenizer and tests
|
||||
`v1.5.0`_ ``2016-12-27`` Alpha support for Swedish and Hungarian
|
||||
`v1.4.0`_ ``2016-12-18`` Improved language data and alpha Dutch support
|
||||
`v1.3.0`_ ``2016-12-03`` Improve API consistency
|
||||
`v1.2.0`_ ``2016-11-04`` Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese
|
||||
`v1.1.0`_ ``2016-10-23`` Bug fixes and adjustments
|
||||
`v1.0.0`_ ``2016-10-18`` Support for deep learning workflows and entity-aware rule matcher
|
||||
`v0.101.0`_ ``2016-05-10`` Fixed German model
|
||||
`v0.100.7`_ ``2016-05-05`` German support
|
||||
`v0.100.6`_ ``2016-03-08`` Add support for GloVe vectors
|
||||
`v0.100.5`_ ``2016-02-07`` Fix incorrect use of header file
|
||||
`v0.100.4`_ ``2016-02-07`` Fix OSX problem introduced in 0.100.3
|
||||
`v0.100.3`_ ``2016-02-06`` Multi-threading, faster loading and bugfixes
|
||||
`v0.100.2`_ ``2016-01-21`` Fix data version lock
|
||||
`v0.100.1`_ ``2016-01-21`` Fix install for OSX
|
||||
`v0.100`_ ``2016-01-19`` Revise setup.py, better model downloads, bug fixes
|
||||
`v0.99`_ ``2015-11-08`` Improve span merging, internal refactoring
|
||||
`v0.98`_ ``2015-11-03`` Smaller package, bug fixes
|
||||
`v0.97`_ ``2015-10-23`` Load the StringStore from a json list, instead of a text file
|
||||
`v0.96`_ ``2015-10-19`` Hotfix to .merge method
|
||||
`v0.95`_ ``2015-10-18`` Bug fixes
|
||||
`v0.94`_ ``2015-10-09`` Fix memory and parse errors
|
||||
`v0.93`_ ``2015-09-22`` Bug fixes to word vectors
|
||||
=========== ============== ===========
|
||||
|
||||
.. _v1.10.0: https://github.com/explosion/spaCy/releases/tag/v1.10.0
|
||||
.. _v1.9.0: https://github.com/explosion/spaCy/releases/tag/v1.9.0
|
||||
.. _v1.8.2: https://github.com/explosion/spaCy/releases/tag/v1.8.2
|
||||
.. _v1.8.1: https://github.com/explosion/spaCy/releases/tag/v1.8.1
|
||||
.. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
|
||||
.. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5
|
||||
.. _v1.7.3: https://github.com/explosion/spaCy/releases/tag/v1.7.3
|
||||
.. _v1.7.2: https://github.com/explosion/spaCy/releases/tag/v1.7.2
|
||||
.. _v1.7.1: https://github.com/explosion/spaCy/releases/tag/v1.7.1
|
||||
.. _v1.7.0: https://github.com/explosion/spaCy/releases/tag/v1.7.0
|
||||
.. _v1.6.0: https://github.com/explosion/spaCy/releases/tag/v1.6.0
|
||||
.. _v1.5.0: https://github.com/explosion/spaCy/releases/tag/v1.5.0
|
||||
.. _v1.4.0: https://github.com/explosion/spaCy/releases/tag/v1.4.0
|
||||
.. _v1.3.0: https://github.com/explosion/spaCy/releases/tag/v1.3.0
|
||||
.. _v1.2.0: https://github.com/explosion/spaCy/releases/tag/v1.2.0
|
||||
.. _v1.1.0: https://github.com/explosion/spaCy/releases/tag/v1.1.0
|
||||
.. _v1.0.0: https://github.com/explosion/spaCy/releases/tag/v1.0.0
|
||||
.. _v0.101.0: https://github.com/explosion/spaCy/releases/tag/0.101.0
|
||||
.. _v0.100.7: https://github.com/explosion/spaCy/releases/tag/0.100.7
|
||||
.. _v0.100.6: https://github.com/explosion/spaCy/releases/tag/0.100.6
|
||||
.. _v0.100.5: https://github.com/explosion/spaCy/releases/tag/0.100.5
|
||||
.. _v0.100.4: https://github.com/explosion/spaCy/releases/tag/0.100.4
|
||||
.. _v0.100.3: https://github.com/explosion/spaCy/releases/tag/0.100.3
|
||||
.. _v0.100.2: https://github.com/explosion/spaCy/releases/tag/0.100.2
|
||||
.. _v0.100.1: https://github.com/explosion/spaCy/releases/tag/0.100.1
|
||||
.. _v0.100: https://github.com/explosion/spaCy/releases/tag/0.100
|
||||
.. _v0.99: https://github.com/explosion/spaCy/releases/tag/0.99
|
||||
.. _v0.98: https://github.com/explosion/spaCy/releases/tag/0.98
|
||||
.. _v0.97: https://github.com/explosion/spaCy/releases/tag/0.97
|
||||
.. _v0.96: https://github.com/explosion/spaCy/releases/tag/0.96
|
||||
.. _v0.95: https://github.com/explosion/spaCy/releases/tag/0.95
|
||||
.. _v0.94: https://github.com/explosion/spaCy/releases/tag/0.94
|
||||
.. _v0.93: https://github.com/explosion/spaCy/releases/tag/0.93
|
||||
python -m pytest <spacy-directory>
|
|
@ -1,93 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import joblib
|
||||
from os import path
|
||||
import os
|
||||
import bz2
|
||||
import ujson
|
||||
from preshed.counter import PreshCounter
|
||||
from joblib import Parallel, delayed
|
||||
import io
|
||||
|
||||
from spacy.en import English
|
||||
from spacy.strings import StringStore
|
||||
from spacy.attrs import ORTH
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
|
||||
def iter_comments(loc):
|
||||
with bz2.BZ2File(loc) as file_:
|
||||
for line in file_:
|
||||
yield ujson.loads(line)
|
||||
|
||||
|
||||
def count_freqs(input_loc, output_loc):
|
||||
print(output_loc)
|
||||
vocab = English.default_vocab(get_lex_attr=None)
|
||||
tokenizer = Tokenizer.from_dir(vocab,
|
||||
path.join(English.default_data_dir(), 'tokenizer'))
|
||||
|
||||
counts = PreshCounter()
|
||||
for json_comment in iter_comments(input_loc):
|
||||
doc = tokenizer(json_comment['body'])
|
||||
doc.count_by(ORTH, counts=counts)
|
||||
|
||||
with io.open(output_loc, 'w', 'utf8') as file_:
|
||||
for orth, freq in counts:
|
||||
string = tokenizer.vocab.strings[orth]
|
||||
if not string.isspace():
|
||||
file_.write('%d\t%s\n' % (freq, string))
|
||||
|
||||
|
||||
def parallelize(func, iterator, n_jobs):
|
||||
Parallel(n_jobs=n_jobs)(delayed(func)(*item) for item in iterator)
|
||||
|
||||
|
||||
def merge_counts(locs, out_loc):
|
||||
string_map = StringStore()
|
||||
counts = PreshCounter()
|
||||
for loc in locs:
|
||||
with io.open(loc, 'r', encoding='utf8') as file_:
|
||||
for line in file_:
|
||||
freq, word = line.strip().split('\t', 1)
|
||||
orth = string_map[word]
|
||||
counts.inc(orth, int(freq))
|
||||
with io.open(out_loc, 'w', encoding='utf8') as file_:
|
||||
for orth, count in counts:
|
||||
string = string_map[orth]
|
||||
file_.write('%d\t%s\n' % (count, string))
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
input_loc=("Location of input file list"),
|
||||
freqs_dir=("Directory for frequency files"),
|
||||
output_loc=("Location for output file"),
|
||||
n_jobs=("Number of workers", "option", "n", int),
|
||||
skip_existing=("Skip inputs where an output file exists", "flag", "s", bool),
|
||||
)
|
||||
def main(input_loc, freqs_dir, output_loc, n_jobs=2, skip_existing=False):
|
||||
tasks = []
|
||||
outputs = []
|
||||
for input_path in open(input_loc):
|
||||
input_path = input_path.strip()
|
||||
if not input_path:
|
||||
continue
|
||||
filename = input_path.split('/')[-1]
|
||||
output_path = path.join(freqs_dir, filename.replace('bz2', 'freq'))
|
||||
outputs.append(output_path)
|
||||
if not path.exists(output_path) or not skip_existing:
|
||||
tasks.append((input_path, output_path))
|
||||
|
||||
if tasks:
|
||||
parallelize(count_freqs, tasks, n_jobs)
|
||||
|
||||
print("Merge")
|
||||
merge_counts(outputs, output_loc)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,89 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from xml.etree import cElementTree as ElementTree
|
||||
import json
|
||||
import re
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
from os import path
|
||||
|
||||
|
||||
escaped_tokens = {
|
||||
'-LRB-': '(',
|
||||
'-RRB-': ')',
|
||||
'-LSB-': '[',
|
||||
'-RSB-': ']',
|
||||
'-LCB-': '{',
|
||||
'-RCB-': '}',
|
||||
}
|
||||
|
||||
def read_parses(parse_loc):
|
||||
offset = 0
|
||||
doc = []
|
||||
for parse in open(str(parse_loc) + '.dep').read().strip().split('\n\n'):
|
||||
parse = _adjust_token_ids(parse, offset)
|
||||
offset += len(parse.split('\n'))
|
||||
doc.append(parse)
|
||||
return doc
|
||||
|
||||
def _adjust_token_ids(parse, offset):
|
||||
output = []
|
||||
for line in parse.split('\n'):
|
||||
pieces = line.split()
|
||||
pieces[0] = str(int(pieces[0]) + offset)
|
||||
pieces[5] = str(int(pieces[5]) + offset) if pieces[5] != '0' else '0'
|
||||
output.append('\t'.join(pieces))
|
||||
return '\n'.join(output)
|
||||
|
||||
|
||||
def _fmt_doc(filename, paras):
|
||||
return {'id': filename, 'paragraphs': [_fmt_para(*para) for para in paras]}
|
||||
|
||||
|
||||
def _fmt_para(raw, sents):
|
||||
return {'raw': raw, 'sentences': [_fmt_sent(sent) for sent in sents]}
|
||||
|
||||
|
||||
def _fmt_sent(sent):
|
||||
return {
|
||||
'tokens': [_fmt_token(*t.split()) for t in sent.strip().split('\n')],
|
||||
'brackets': []}
|
||||
|
||||
|
||||
def _fmt_token(id_, word, hyph, pos, ner, head, dep, blank1, blank2, blank3):
|
||||
head = int(head) - 1
|
||||
id_ = int(id_) - 1
|
||||
head = (head - id_) if head != -1 else 0
|
||||
return {'id': id_, 'orth': word, 'tag': pos, 'dep': dep, 'head': head}
|
||||
|
||||
|
||||
tags_re = re.compile(r'<[\w\?/][^>]+>')
|
||||
def main(out_dir, ewtb_dir='/usr/local/data/eng_web_tbk'):
|
||||
ewtb_dir = Path(ewtb_dir)
|
||||
out_dir = Path(out_dir)
|
||||
if not out_dir.exists():
|
||||
out_dir.mkdir()
|
||||
for genre_dir in ewtb_dir.joinpath('data').iterdir():
|
||||
#if 'answers' in str(genre_dir): continue
|
||||
parse_dir = genre_dir.joinpath('penntree')
|
||||
docs = []
|
||||
for source_loc in genre_dir.joinpath('source').joinpath('source_original').iterdir():
|
||||
filename = source_loc.parts[-1].replace('.sgm.sgm', '')
|
||||
filename = filename.replace('.xml', '')
|
||||
filename = filename.replace('.txt', '')
|
||||
parse_loc = parse_dir.joinpath(filename + '.xml.tree')
|
||||
parses = read_parses(parse_loc)
|
||||
source = source_loc.open().read().strip()
|
||||
if 'answers' in str(genre_dir):
|
||||
source = tags_re.sub('', source).strip()
|
||||
docs.append(_fmt_doc(filename, [[source, parses]]))
|
||||
|
||||
out_loc = out_dir.joinpath(genre_dir.parts[-1] + '.json')
|
||||
with open(str(out_loc), 'w') as out_file:
|
||||
out_file.write(json.dumps(docs, indent=4))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,32 +0,0 @@
|
|||
import io
|
||||
import plac
|
||||
|
||||
from spacy.en import English
|
||||
|
||||
|
||||
def main(text_loc):
|
||||
with io.open(text_loc, 'r', encoding='utf8') as file_:
|
||||
text = file_.read()
|
||||
NLU = English()
|
||||
for paragraph in text.split('\n\n'):
|
||||
tokens = NLU(paragraph)
|
||||
|
||||
ent_starts = {}
|
||||
ent_ends = {}
|
||||
for span in tokens.ents:
|
||||
ent_starts[span.start] = span.label_
|
||||
ent_ends[span.end] = span.label_
|
||||
|
||||
output = []
|
||||
for token in tokens:
|
||||
if token.i in ent_starts:
|
||||
output.append('<%s>' % ent_starts[token.i])
|
||||
output.append(token.orth_)
|
||||
if (token.i+1) in ent_ends:
|
||||
output.append('</%s>' % ent_ends[token.i+1])
|
||||
output.append('\n\n')
|
||||
print ' '.join(output)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,157 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import division
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os
|
||||
from os import path
|
||||
import shutil
|
||||
import io
|
||||
import random
|
||||
import time
|
||||
import gzip
|
||||
|
||||
import plac
|
||||
import cProfile
|
||||
import pstats
|
||||
|
||||
import spacy.util
|
||||
from spacy.en import English
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
from spacy.syntax.util import Config
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
from spacy.syntax.parser import Parser
|
||||
from spacy.scorer import Scorer
|
||||
from spacy.tagger import Tagger
|
||||
|
||||
# Last updated for spaCy v0.97
|
||||
|
||||
|
||||
def read_conll(file_):
|
||||
"""Read a standard CoNLL/MALT-style format"""
|
||||
sents = []
|
||||
for sent_str in file_.read().strip().split('\n\n'):
|
||||
ids = []
|
||||
words = []
|
||||
heads = []
|
||||
labels = []
|
||||
tags = []
|
||||
for i, line in enumerate(sent_str.split('\n')):
|
||||
word, pos_string, head_idx, label = _parse_line(line)
|
||||
words.append(word)
|
||||
if head_idx < 0:
|
||||
head_idx = i
|
||||
ids.append(i)
|
||||
heads.append(head_idx)
|
||||
labels.append(label)
|
||||
tags.append(pos_string)
|
||||
text = ' '.join(words)
|
||||
annot = (ids, words, tags, heads, labels, ['O'] * len(ids))
|
||||
sents.append((None, [(annot, [])]))
|
||||
return sents
|
||||
|
||||
|
||||
def _parse_line(line):
|
||||
pieces = line.split()
|
||||
if len(pieces) == 4:
|
||||
word, pos, head_idx, label = pieces
|
||||
head_idx = int(head_idx)
|
||||
elif len(pieces) == 15:
|
||||
id_ = int(pieces[0].split('_')[-1])
|
||||
word = pieces[1]
|
||||
pos = pieces[4]
|
||||
head_idx = int(pieces[8])-1
|
||||
label = pieces[10]
|
||||
else:
|
||||
id_ = int(pieces[0].split('_')[-1])
|
||||
word = pieces[1]
|
||||
pos = pieces[4]
|
||||
head_idx = int(pieces[6])-1
|
||||
label = pieces[7]
|
||||
if head_idx == 0:
|
||||
label = 'ROOT'
|
||||
return word, pos, head_idx, label
|
||||
|
||||
|
||||
def score_model(scorer, nlp, raw_text, annot_tuples, verbose=False):
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.parser(tokens)
|
||||
gold = GoldParse(tokens, annot_tuples, make_projective=False)
|
||||
scorer.score(tokens, gold, verbose=verbose, punct_labels=('--', 'p', 'punct'))
|
||||
|
||||
|
||||
def train(Language, gold_tuples, model_dir, n_iter=15, feat_set=u'basic', seed=0,
|
||||
gold_preproc=False, force_gold=False):
|
||||
dep_model_dir = path.join(model_dir, 'deps')
|
||||
pos_model_dir = path.join(model_dir, 'pos')
|
||||
if path.exists(dep_model_dir):
|
||||
shutil.rmtree(dep_model_dir)
|
||||
if path.exists(pos_model_dir):
|
||||
shutil.rmtree(pos_model_dir)
|
||||
os.mkdir(dep_model_dir)
|
||||
os.mkdir(pos_model_dir)
|
||||
|
||||
Config.write(dep_model_dir, 'config', features=feat_set, seed=seed,
|
||||
labels=ArcEager.get_labels(gold_tuples))
|
||||
|
||||
nlp = Language(data_dir=model_dir, tagger=False, parser=False, entity=False)
|
||||
nlp.tagger = Tagger.blank(nlp.vocab, Tagger.default_templates())
|
||||
nlp.parser = Parser.from_dir(dep_model_dir, nlp.vocab.strings, ArcEager)
|
||||
|
||||
print("Itn.\tP.Loss\tUAS\tNER F.\tTag %\tToken %")
|
||||
for itn in range(n_iter):
|
||||
scorer = Scorer()
|
||||
loss = 0
|
||||
for _, sents in gold_tuples:
|
||||
for annot_tuples, _ in sents:
|
||||
if len(annot_tuples[1]) == 1:
|
||||
continue
|
||||
|
||||
score_model(scorer, nlp, None, annot_tuples, verbose=False)
|
||||
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
gold = GoldParse(tokens, annot_tuples, make_projective=True)
|
||||
if not gold.is_projective:
|
||||
raise Exception(
|
||||
"Non-projective sentence in training, after we should "
|
||||
"have enforced projectivity: %s" % annot_tuples
|
||||
)
|
||||
|
||||
loss += nlp.parser.train(tokens, gold)
|
||||
nlp.tagger.train(tokens, gold.tags)
|
||||
random.shuffle(gold_tuples)
|
||||
print('%d:\t%d\t%.3f\t%.3f\t%.3f' % (itn, loss, scorer.uas,
|
||||
scorer.tags_acc, scorer.token_acc))
|
||||
print('end training')
|
||||
nlp.end_training(model_dir)
|
||||
print('done')
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
train_loc=("Location of CoNLL 09 formatted training file"),
|
||||
dev_loc=("Location of CoNLL 09 formatted development file"),
|
||||
model_dir=("Location of output model directory"),
|
||||
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
|
||||
n_iter=("Number of training iterations", "option", "i", int),
|
||||
)
|
||||
def main(train_loc, dev_loc, model_dir, n_iter=15, eval_only=False):
|
||||
with io.open(train_loc, 'r', encoding='utf8') as file_:
|
||||
train_sents = read_conll(file_)
|
||||
if not eval_only:
|
||||
train(English, train_sents, model_dir, n_iter=n_iter)
|
||||
nlp = English(data_dir=model_dir)
|
||||
dev_sents = read_conll(io.open(dev_loc, 'r', encoding='utf8'))
|
||||
scorer = Scorer()
|
||||
for _, sents in dev_sents:
|
||||
for annot_tuples, _ in sents:
|
||||
score_model(scorer, nlp, None, annot_tuples)
|
||||
print('TOK', 100-scorer.token_acc)
|
||||
print('POS', scorer.tags_acc)
|
||||
print('UAS', scorer.uas)
|
||||
print('LAS', scorer.las)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,187 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import division
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
from os import path
|
||||
import shutil
|
||||
import io
|
||||
import random
|
||||
|
||||
import plac
|
||||
import re
|
||||
|
||||
import spacy.util
|
||||
|
||||
from spacy.syntax.util import Config
|
||||
from spacy.gold import read_json_file
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.gold import merge_sents
|
||||
|
||||
from spacy.scorer import Scorer
|
||||
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
from spacy.syntax.ner import BiluoPushDown
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.syntax.parser import Parser
|
||||
from spacy.syntax.nonproj import PseudoProjectivity
|
||||
|
||||
|
||||
def _corrupt(c, noise_level):
|
||||
if random.random() >= noise_level:
|
||||
return c
|
||||
elif c == ' ':
|
||||
return '\n'
|
||||
elif c == '\n':
|
||||
return ' '
|
||||
elif c in ['.', "'", "!", "?"]:
|
||||
return ''
|
||||
else:
|
||||
return c.lower()
|
||||
|
||||
|
||||
def add_noise(orig, noise_level):
|
||||
if random.random() >= noise_level:
|
||||
return orig
|
||||
elif type(orig) == list:
|
||||
corrupted = [_corrupt(word, noise_level) for word in orig]
|
||||
corrupted = [w for w in corrupted if w]
|
||||
return corrupted
|
||||
else:
|
||||
return ''.join(_corrupt(c, noise_level) for c in orig)
|
||||
|
||||
|
||||
def score_model(scorer, nlp, raw_text, annot_tuples, verbose=False):
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
else:
|
||||
tokens = nlp.tokenizer(raw_text)
|
||||
nlp.tagger(tokens)
|
||||
nlp.entity(tokens)
|
||||
nlp.parser(tokens)
|
||||
gold = GoldParse(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold, verbose=verbose)
|
||||
|
||||
|
||||
def train(Language, train_data, dev_data, model_dir, tagger_cfg, parser_cfg, entity_cfg,
|
||||
n_iter=15, seed=0, gold_preproc=False, n_sents=0, corruption_level=0):
|
||||
print("Itn.\tN weight\tN feats\tUAS\tNER F.\tTag %\tToken %")
|
||||
format_str = '{:d}\t{:d}\t{:d}\t{uas:.3f}\t{ents_f:.3f}\t{tags_acc:.3f}\t{token_acc:.3f}'
|
||||
with Language.train(model_dir, train_data,
|
||||
tagger_cfg, parser_cfg, entity_cfg) as trainer:
|
||||
loss = 0
|
||||
for itn, epoch in enumerate(trainer.epochs(n_iter, gold_preproc=gold_preproc,
|
||||
augment_data=None)):
|
||||
for doc, gold in epoch:
|
||||
trainer.update(doc, gold)
|
||||
dev_scores = trainer.evaluate(dev_data, gold_preproc=gold_preproc)
|
||||
print(format_str.format(itn, trainer.nlp.parser.model.nr_weight,
|
||||
trainer.nlp.parser.model.nr_active_feat, **dev_scores.scores))
|
||||
|
||||
|
||||
def evaluate(Language, gold_tuples, model_dir, gold_preproc=False, verbose=False,
|
||||
beam_width=None, cand_preproc=None):
|
||||
print("Load parser", model_dir)
|
||||
nlp = Language(path=model_dir)
|
||||
if nlp.lang == 'de':
|
||||
nlp.vocab.morphology.lemmatizer = lambda string,pos: set([string])
|
||||
if beam_width is not None:
|
||||
nlp.parser.cfg.beam_width = beam_width
|
||||
scorer = Scorer()
|
||||
for raw_text, sents in gold_tuples:
|
||||
if gold_preproc:
|
||||
raw_text = None
|
||||
else:
|
||||
sents = merge_sents(sents)
|
||||
for annot_tuples, brackets in sents:
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.parser(tokens)
|
||||
nlp.entity(tokens)
|
||||
else:
|
||||
tokens = nlp(raw_text)
|
||||
gold = GoldParse.from_annot_tuples(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold, verbose=verbose)
|
||||
return scorer
|
||||
|
||||
|
||||
def write_parses(Language, dev_loc, model_dir, out_loc):
|
||||
nlp = Language(data_dir=model_dir)
|
||||
gold_tuples = read_json_file(dev_loc)
|
||||
scorer = Scorer()
|
||||
out_file = io.open(out_loc, 'w', 'utf8')
|
||||
for raw_text, sents in gold_tuples:
|
||||
sents = _merge_sents(sents)
|
||||
for annot_tuples, brackets in sents:
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.entity(tokens)
|
||||
nlp.parser(tokens)
|
||||
else:
|
||||
tokens = nlp(raw_text)
|
||||
#gold = GoldParse(tokens, annot_tuples)
|
||||
#scorer.score(tokens, gold, verbose=False)
|
||||
for sent in tokens.sents:
|
||||
for t in sent:
|
||||
if not t.is_space:
|
||||
out_file.write(
|
||||
'%d\t%s\t%s\t%s\t%s\n' % (t.i, t.orth_, t.tag_, t.head.orth_, t.dep_)
|
||||
)
|
||||
out_file.write('\n')
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
language=("The language to train", "positional", None, str, ['en','de', 'zh']),
|
||||
train_loc=("Location of training file or directory"),
|
||||
dev_loc=("Location of development file or directory"),
|
||||
model_dir=("Location of output model directory",),
|
||||
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
|
||||
corruption_level=("Amount of noise to add to training data", "option", "c", float),
|
||||
gold_preproc=("Use gold-standard sentence boundaries in training?", "flag", "g", bool),
|
||||
out_loc=("Out location", "option", "o", str),
|
||||
n_sents=("Number of training sentences", "option", "n", int),
|
||||
n_iter=("Number of training iterations", "option", "i", int),
|
||||
verbose=("Verbose error reporting", "flag", "v", bool),
|
||||
debug=("Debug mode", "flag", "d", bool),
|
||||
pseudoprojective=("Use pseudo-projective parsing", "flag", "p", bool),
|
||||
L1=("L1 regularization penalty", "option", "L", float),
|
||||
)
|
||||
def main(language, train_loc, dev_loc, model_dir, n_sents=0, n_iter=15, out_loc="", verbose=False,
|
||||
debug=False, corruption_level=0.0, gold_preproc=False, eval_only=False, pseudoprojective=False,
|
||||
L1=1e-6):
|
||||
parser_cfg = dict(locals())
|
||||
tagger_cfg = dict(locals())
|
||||
entity_cfg = dict(locals())
|
||||
|
||||
lang = spacy.util.get_lang_class(language)
|
||||
|
||||
parser_cfg['features'] = lang.Defaults.parser_features
|
||||
entity_cfg['features'] = lang.Defaults.entity_features
|
||||
|
||||
if not eval_only:
|
||||
gold_train = list(read_json_file(train_loc))
|
||||
gold_dev = list(read_json_file(dev_loc))
|
||||
if n_sents > 0:
|
||||
gold_train = gold_train[:n_sents]
|
||||
train(lang, gold_train, gold_dev, model_dir, tagger_cfg, parser_cfg, entity_cfg,
|
||||
n_sents=n_sents, gold_preproc=gold_preproc, corruption_level=corruption_level,
|
||||
n_iter=n_iter)
|
||||
if out_loc:
|
||||
write_parses(lang, dev_loc, model_dir, out_loc)
|
||||
scorer = evaluate(lang, list(read_json_file(dev_loc)),
|
||||
model_dir, gold_preproc=gold_preproc, verbose=verbose)
|
||||
print('TOK', scorer.token_acc)
|
||||
print('POS', scorer.tags_acc)
|
||||
print('UAS', scorer.uas)
|
||||
print('LAS', scorer.las)
|
||||
|
||||
print('NER P', scorer.ents_p)
|
||||
print('NER R', scorer.ents_r)
|
||||
print('NER F', scorer.ents_f)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,139 +0,0 @@
|
|||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import json
|
||||
import random
|
||||
import pathlib
|
||||
|
||||
from spacy.tokens import Doc
|
||||
from spacy.syntax.nonproj import PseudoProjectivity
|
||||
from spacy.language import Language
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.pipeline import DependencyParser, BeamDependencyParser
|
||||
from spacy.syntax.parser import get_templates
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
from spacy.scorer import Scorer
|
||||
from spacy.language_data.tag_map import TAG_MAP as DEFAULT_TAG_MAP
|
||||
import spacy.attrs
|
||||
import io
|
||||
|
||||
|
||||
def read_conllx(loc, n=0):
|
||||
with io.open(loc, 'r', encoding='utf8') as file_:
|
||||
text = file_.read()
|
||||
i = 0
|
||||
for sent in text.strip().split('\n\n'):
|
||||
lines = sent.strip().split('\n')
|
||||
if lines:
|
||||
while lines[0].startswith('#'):
|
||||
lines.pop(0)
|
||||
tokens = []
|
||||
for line in lines:
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, \
|
||||
_2 = line.split('\t')
|
||||
if '-' in id_ or '.' in id_:
|
||||
continue
|
||||
try:
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head != '0' else id_
|
||||
dep = 'ROOT' if dep == 'root' else dep
|
||||
tokens.append((id_, word, tag, head, dep, 'O'))
|
||||
except:
|
||||
print(line)
|
||||
raise
|
||||
tuples = [list(t) for t in zip(*tokens)]
|
||||
yield (None, [[tuples, []]])
|
||||
i += 1
|
||||
if n >= 1 and i >= n:
|
||||
break
|
||||
|
||||
|
||||
def score_model(vocab, tagger, parser, gold_docs, verbose=False):
|
||||
scorer = Scorer()
|
||||
for _, gold_doc in gold_docs:
|
||||
for (ids, words, tags, heads, deps, entities), _ in gold_doc:
|
||||
doc = Doc(vocab, words=words)
|
||||
tagger(doc)
|
||||
parser(doc)
|
||||
PseudoProjectivity.deprojectivize(doc)
|
||||
gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
|
||||
scorer.score(doc, gold, verbose=verbose)
|
||||
return scorer
|
||||
|
||||
|
||||
def main(lang_name, train_loc, dev_loc, model_dir, clusters_loc=None):
|
||||
LangClass = spacy.util.get_lang_class(lang_name)
|
||||
train_sents = list(read_conllx(train_loc))
|
||||
train_sents = PseudoProjectivity.preprocess_training_data(train_sents)
|
||||
|
||||
actions = ArcEager.get_actions(gold_parses=train_sents)
|
||||
features = get_templates('basic')
|
||||
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if not model_dir.exists():
|
||||
model_dir.mkdir()
|
||||
if not (model_dir / 'deps').exists():
|
||||
(model_dir / 'deps').mkdir()
|
||||
if not (model_dir / 'pos').exists():
|
||||
(model_dir / 'pos').mkdir()
|
||||
with (model_dir / 'deps' / 'config.json').open('wb') as file_:
|
||||
file_.write(
|
||||
json.dumps(
|
||||
{'pseudoprojective': True, 'labels': actions, 'features': features}).encode('utf8'))
|
||||
|
||||
vocab = LangClass.Defaults.create_vocab()
|
||||
if not (model_dir / 'vocab').exists():
|
||||
(model_dir / 'vocab').mkdir()
|
||||
else:
|
||||
if (model_dir / 'vocab' / 'strings.json').exists():
|
||||
with (model_dir / 'vocab' / 'strings.json').open() as file_:
|
||||
vocab.strings.load(file_)
|
||||
if (model_dir / 'vocab' / 'lexemes.bin').exists():
|
||||
vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
|
||||
|
||||
if clusters_loc is not None:
|
||||
clusters_loc = pathlib.Path(clusters_loc)
|
||||
with clusters_loc.open() as file_:
|
||||
for line in file_:
|
||||
try:
|
||||
cluster, word, freq = line.split()
|
||||
except ValueError:
|
||||
continue
|
||||
lex = vocab[word]
|
||||
lex.cluster = int(cluster[::-1], 2)
|
||||
# Populate vocab
|
||||
for _, doc_sents in train_sents:
|
||||
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
|
||||
for word in words:
|
||||
_ = vocab[word]
|
||||
for dep in deps:
|
||||
_ = vocab[dep]
|
||||
for tag in tags:
|
||||
_ = vocab[tag]
|
||||
if vocab.morphology.tag_map:
|
||||
for tag in tags:
|
||||
assert tag in vocab.morphology.tag_map, repr(tag)
|
||||
tagger = Tagger(vocab)
|
||||
parser = DependencyParser(vocab, actions=actions, features=features, L1=0.0)
|
||||
|
||||
for itn in range(30):
|
||||
loss = 0.
|
||||
for _, doc_sents in train_sents:
|
||||
for (ids, words, tags, heads, deps, ner), _ in doc_sents:
|
||||
doc = Doc(vocab, words=words)
|
||||
gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
|
||||
tagger(doc)
|
||||
loss += parser.update(doc, gold, itn=itn)
|
||||
doc = Doc(vocab, words=words)
|
||||
tagger.update(doc, gold)
|
||||
random.shuffle(train_sents)
|
||||
scorer = score_model(vocab, tagger, parser, read_conllx(dev_loc))
|
||||
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, loss, scorer.uas, scorer.tags_acc))
|
||||
nlp = LangClass(vocab=vocab, tagger=tagger, parser=parser)
|
||||
nlp.end_training(model_dir)
|
||||
scorer = score_model(vocab, tagger, parser, read_conllx(dev_loc))
|
||||
print('%d:\t%.3f\t%.3f\t%.3f' % (itn, scorer.uas, scorer.las, scorer.tags_acc))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,194 +0,0 @@
|
|||
"""Convert OntoNotes into a json format.
|
||||
|
||||
doc: {
|
||||
id: string,
|
||||
paragraphs: [{
|
||||
raw: string,
|
||||
sents: [int],
|
||||
tokens: [{
|
||||
start: int,
|
||||
tag: string,
|
||||
head: int,
|
||||
dep: string}],
|
||||
ner: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string}],
|
||||
brackets: [{
|
||||
start: int,
|
||||
end: int,
|
||||
label: string}]}]}
|
||||
|
||||
Consumes output of spacy/munge/align_raw.py
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import json
|
||||
from os import path
|
||||
import os
|
||||
import re
|
||||
import io
|
||||
from collections import defaultdict
|
||||
|
||||
from spacy.munge import read_ptb
|
||||
from spacy.munge import read_conll
|
||||
from spacy.munge import read_ner
|
||||
|
||||
|
||||
def _iter_raw_files(raw_loc):
|
||||
files = json.load(open(raw_loc))
|
||||
for f in files:
|
||||
yield f
|
||||
|
||||
|
||||
def format_doc(file_id, raw_paras, ptb_text, dep_text, ner_text):
|
||||
ptb_sents = read_ptb.split(ptb_text)
|
||||
dep_sents = read_conll.split(dep_text)
|
||||
if len(ptb_sents) != len(dep_sents):
|
||||
return None
|
||||
if ner_text is not None:
|
||||
ner_sents = read_ner.split(ner_text)
|
||||
else:
|
||||
ner_sents = [None] * len(ptb_sents)
|
||||
|
||||
i = 0
|
||||
doc = {'id': file_id}
|
||||
if raw_paras is None:
|
||||
doc['paragraphs'] = [format_para(None, ptb_sents, dep_sents, ner_sents)]
|
||||
#for ptb_sent, dep_sent, ner_sent in zip(ptb_sents, dep_sents, ner_sents):
|
||||
# doc['paragraphs'].append(format_para(None, [ptb_sent], [dep_sent], [ner_sent]))
|
||||
else:
|
||||
doc['paragraphs'] = []
|
||||
for raw_sents in raw_paras:
|
||||
para = format_para(
|
||||
' '.join(raw_sents).replace('<SEP>', ''),
|
||||
ptb_sents[i:i+len(raw_sents)],
|
||||
dep_sents[i:i+len(raw_sents)],
|
||||
ner_sents[i:i+len(raw_sents)])
|
||||
if para['sentences']:
|
||||
doc['paragraphs'].append(para)
|
||||
i += len(raw_sents)
|
||||
return doc
|
||||
|
||||
|
||||
def format_para(raw_text, ptb_sents, dep_sents, ner_sents):
|
||||
para = {'raw': raw_text, 'sentences': []}
|
||||
offset = 0
|
||||
assert len(ptb_sents) == len(dep_sents) == len(ner_sents)
|
||||
for ptb_text, dep_text, ner_text in zip(ptb_sents, dep_sents, ner_sents):
|
||||
_, deps = read_conll.parse(dep_text, strip_bad_periods=True)
|
||||
if deps and 'VERB' in [t['tag'] for t in deps]:
|
||||
continue
|
||||
if ner_text is not None:
|
||||
_, ner = read_ner.parse(ner_text, strip_bad_periods=True)
|
||||
else:
|
||||
ner = ['-' for _ in deps]
|
||||
_, brackets = read_ptb.parse(ptb_text, strip_bad_periods=True)
|
||||
# Necessary because the ClearNLP converter deletes EDITED words.
|
||||
if len(ner) != len(deps):
|
||||
ner = ['-' for _ in deps]
|
||||
para['sentences'].append(format_sentence(deps, ner, brackets))
|
||||
return para
|
||||
|
||||
|
||||
def format_sentence(deps, ner, brackets):
|
||||
sent = {'tokens': [], 'brackets': []}
|
||||
for token_id, (token, token_ent) in enumerate(zip(deps, ner)):
|
||||
sent['tokens'].append(format_token(token_id, token, token_ent))
|
||||
|
||||
for label, start, end in brackets:
|
||||
if start != end:
|
||||
sent['brackets'].append({
|
||||
'label': label,
|
||||
'first': start,
|
||||
'last': (end-1)})
|
||||
return sent
|
||||
|
||||
|
||||
def format_token(token_id, token, ner):
|
||||
assert token_id == token['id']
|
||||
head = (token['head'] - token_id) if token['head'] != -1 else 0
|
||||
return {
|
||||
'id': token_id,
|
||||
'orth': token['word'],
|
||||
'tag': token['tag'],
|
||||
'head': head,
|
||||
'dep': token['dep'],
|
||||
'ner': ner}
|
||||
|
||||
|
||||
def read_file(*pieces):
|
||||
loc = path.join(*pieces)
|
||||
if not path.exists(loc):
|
||||
return None
|
||||
else:
|
||||
return io.open(loc, 'r', encoding='utf8').read().strip()
|
||||
|
||||
|
||||
def get_file_names(section_dir, subsection):
|
||||
filenames = []
|
||||
for fn in os.listdir(path.join(section_dir, subsection)):
|
||||
filenames.append(fn.rsplit('.', 1)[0])
|
||||
return list(sorted(set(filenames)))
|
||||
|
||||
|
||||
def read_wsj_with_source(onto_dir, raw_dir):
|
||||
# Now do WSJ, with source alignment
|
||||
onto_dir = path.join(onto_dir, 'data', 'english', 'annotations', 'nw', 'wsj')
|
||||
docs = {}
|
||||
for i in range(25):
|
||||
section = str(i) if i >= 10 else ('0' + str(i))
|
||||
raw_loc = path.join(raw_dir, 'wsj%s.json' % section)
|
||||
for j, (filename, raw_paras) in enumerate(_iter_raw_files(raw_loc)):
|
||||
if section == '00':
|
||||
j += 1
|
||||
if section == '04' and filename == '55':
|
||||
continue
|
||||
ptb = read_file(onto_dir, section, '%s.parse' % filename)
|
||||
dep = read_file(onto_dir, section, '%s.parse.dep' % filename)
|
||||
ner = read_file(onto_dir, section, '%s.name' % filename)
|
||||
if ptb is not None and dep is not None:
|
||||
docs[filename] = format_doc(filename, raw_paras, ptb, dep, ner)
|
||||
return docs
|
||||
|
||||
|
||||
def get_doc(onto_dir, file_path, wsj_docs):
|
||||
filename = file_path.rsplit('/', 1)[1]
|
||||
if filename in wsj_docs:
|
||||
return wsj_docs[filename]
|
||||
else:
|
||||
ptb = read_file(onto_dir, file_path + '.parse')
|
||||
dep = read_file(onto_dir, file_path + '.parse.dep')
|
||||
ner = read_file(onto_dir, file_path + '.name')
|
||||
if ptb is not None and dep is not None:
|
||||
return format_doc(filename, None, ptb, dep, ner)
|
||||
else:
|
||||
return None
|
||||
|
||||
|
||||
def read_ids(loc):
|
||||
return open(loc).read().strip().split('\n')
|
||||
|
||||
|
||||
def main(onto_dir, raw_dir, out_dir):
|
||||
wsj_docs = read_wsj_with_source(onto_dir, raw_dir)
|
||||
|
||||
for partition in ('train', 'test', 'development'):
|
||||
ids = read_ids(path.join(onto_dir, '%s.id' % partition))
|
||||
docs_by_genre = defaultdict(list)
|
||||
for file_path in ids:
|
||||
doc = get_doc(onto_dir, file_path, wsj_docs)
|
||||
if doc is not None:
|
||||
genre = file_path.split('/')[3]
|
||||
docs_by_genre[genre].append(doc)
|
||||
part_dir = path.join(out_dir, partition)
|
||||
if not path.exists(part_dir):
|
||||
os.mkdir(part_dir)
|
||||
for genre, docs in sorted(docs_by_genre.items()):
|
||||
out_loc = path.join(part_dir, genre + '.json')
|
||||
with open(out_loc, 'w') as file_:
|
||||
json.dump(docs, file_, indent=4)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,13 +0,0 @@
|
|||
"""Read a vector file, and prepare it as binary data, for easy consumption"""
|
||||
|
||||
import plac
|
||||
|
||||
from spacy.vocab import write_binary_vectors
|
||||
|
||||
|
||||
def main(in_loc, out_loc):
|
||||
write_binary_vectors(in_loc, out_loc)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,175 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import division
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import print_function
|
||||
|
||||
import os
|
||||
from os import path
|
||||
import shutil
|
||||
import codecs
|
||||
import random
|
||||
|
||||
import plac
|
||||
import re
|
||||
|
||||
import spacy.util
|
||||
from spacy.en import English
|
||||
|
||||
from spacy.tagger import Tagger
|
||||
|
||||
from spacy.syntax.util import Config
|
||||
from spacy.gold import read_json_file
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
from spacy.scorer import Scorer
|
||||
|
||||
|
||||
def score_model(scorer, nlp, raw_text, annot_tuples):
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
else:
|
||||
tokens = nlp.tokenizer(raw_text)
|
||||
nlp.tagger(tokens)
|
||||
gold = GoldParse(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold)
|
||||
|
||||
|
||||
def _merge_sents(sents):
|
||||
m_deps = [[], [], [], [], [], []]
|
||||
m_brackets = []
|
||||
i = 0
|
||||
for (ids, words, tags, heads, labels, ner), brackets in sents:
|
||||
m_deps[0].extend(id_ + i for id_ in ids)
|
||||
m_deps[1].extend(words)
|
||||
m_deps[2].extend(tags)
|
||||
m_deps[3].extend(head + i for head in heads)
|
||||
m_deps[4].extend(labels)
|
||||
m_deps[5].extend(ner)
|
||||
m_brackets.extend((b['first'] + i, b['last'] + i, b['label']) for b in brackets)
|
||||
i += len(ids)
|
||||
return [(m_deps, m_brackets)]
|
||||
|
||||
|
||||
def train(Language, gold_tuples, model_dir, n_iter=15, feat_set=u'basic',
|
||||
seed=0, gold_preproc=False, n_sents=0, corruption_level=0,
|
||||
beam_width=1, verbose=False,
|
||||
use_orig_arc_eager=False):
|
||||
if n_sents > 0:
|
||||
gold_tuples = gold_tuples[:n_sents]
|
||||
|
||||
templates = Tagger.default_templates()
|
||||
nlp = Language(data_dir=model_dir, tagger=False)
|
||||
nlp.tagger = Tagger.blank(nlp.vocab, templates)
|
||||
|
||||
print("Itn.\tP.Loss\tUAS\tNER F.\tTag %\tToken %")
|
||||
for itn in range(n_iter):
|
||||
scorer = Scorer()
|
||||
loss = 0
|
||||
for raw_text, sents in gold_tuples:
|
||||
if gold_preproc:
|
||||
raw_text = None
|
||||
else:
|
||||
sents = _merge_sents(sents)
|
||||
for annot_tuples, ctnt in sents:
|
||||
words = annot_tuples[1]
|
||||
gold_tags = annot_tuples[2]
|
||||
score_model(scorer, nlp, raw_text, annot_tuples)
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(words)
|
||||
else:
|
||||
tokens = nlp.tokenizer(raw_text)
|
||||
loss += nlp.tagger.train(tokens, gold_tags)
|
||||
random.shuffle(gold_tuples)
|
||||
print('%d:\t%d\t%.3f\t%.3f\t%.3f\t%.3f' % (itn, loss, scorer.uas, scorer.ents_f,
|
||||
scorer.tags_acc,
|
||||
scorer.token_acc))
|
||||
nlp.end_training(model_dir)
|
||||
|
||||
def evaluate(Language, gold_tuples, model_dir, gold_preproc=False, verbose=False,
|
||||
beam_width=None):
|
||||
nlp = Language(data_dir=model_dir)
|
||||
if beam_width is not None:
|
||||
nlp.parser.cfg.beam_width = beam_width
|
||||
scorer = Scorer()
|
||||
for raw_text, sents in gold_tuples:
|
||||
if gold_preproc:
|
||||
raw_text = None
|
||||
else:
|
||||
sents = _merge_sents(sents)
|
||||
for annot_tuples, brackets in sents:
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.entity(tokens)
|
||||
nlp.parser(tokens)
|
||||
else:
|
||||
tokens = nlp(raw_text, merge_mwes=False)
|
||||
gold = GoldParse(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold, verbose=verbose)
|
||||
return scorer
|
||||
|
||||
|
||||
def write_parses(Language, dev_loc, model_dir, out_loc, beam_width=None):
|
||||
nlp = Language(data_dir=model_dir)
|
||||
if beam_width is not None:
|
||||
nlp.parser.cfg.beam_width = beam_width
|
||||
gold_tuples = read_json_file(dev_loc)
|
||||
scorer = Scorer()
|
||||
out_file = codecs.open(out_loc, 'w', 'utf8')
|
||||
for raw_text, sents in gold_tuples:
|
||||
sents = _merge_sents(sents)
|
||||
for annot_tuples, brackets in sents:
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.entity(tokens)
|
||||
nlp.parser(tokens)
|
||||
else:
|
||||
tokens = nlp(raw_text, merge_mwes=False)
|
||||
gold = GoldParse(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold, verbose=False)
|
||||
for t in tokens:
|
||||
out_file.write(
|
||||
'%s\t%s\t%s\t%s\n' % (t.orth_, t.tag_, t.head.orth_, t.dep_)
|
||||
)
|
||||
return scorer
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
train_loc=("Location of training file or directory"),
|
||||
dev_loc=("Location of development file or directory"),
|
||||
model_dir=("Location of output model directory",),
|
||||
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
|
||||
corruption_level=("Amount of noise to add to training data", "option", "c", float),
|
||||
gold_preproc=("Use gold-standard sentence boundaries in training?", "flag", "g", bool),
|
||||
out_loc=("Out location", "option", "o", str),
|
||||
n_sents=("Number of training sentences", "option", "n", int),
|
||||
n_iter=("Number of training iterations", "option", "i", int),
|
||||
verbose=("Verbose error reporting", "flag", "v", bool),
|
||||
debug=("Debug mode", "flag", "d", bool),
|
||||
)
|
||||
def main(train_loc, dev_loc, model_dir, n_sents=0, n_iter=15, out_loc="", verbose=False,
|
||||
debug=False, corruption_level=0.0, gold_preproc=False, eval_only=False):
|
||||
if not eval_only:
|
||||
gold_train = list(read_json_file(train_loc))
|
||||
train(English, gold_train, model_dir,
|
||||
feat_set='basic' if not debug else 'debug',
|
||||
gold_preproc=gold_preproc, n_sents=n_sents,
|
||||
corruption_level=corruption_level, n_iter=n_iter,
|
||||
verbose=verbose)
|
||||
#if out_loc:
|
||||
# write_parses(English, dev_loc, model_dir, out_loc, beam_width=beam_width)
|
||||
scorer = evaluate(English, list(read_json_file(dev_loc)),
|
||||
model_dir, gold_preproc=gold_preproc, verbose=verbose)
|
||||
print('TOK', scorer.token_acc)
|
||||
print('POS', scorer.tags_acc)
|
||||
print('UAS', scorer.uas)
|
||||
print('LAS', scorer.las)
|
||||
|
||||
print('NER P', scorer.ents_p)
|
||||
print('NER R', scorer.ents_r)
|
||||
print('NER F', scorer.ents_f)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,160 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
from __future__ import division
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import os
|
||||
from os import path
|
||||
import shutil
|
||||
import io
|
||||
import random
|
||||
import time
|
||||
import gzip
|
||||
import ujson
|
||||
|
||||
import plac
|
||||
import cProfile
|
||||
import pstats
|
||||
|
||||
import spacy.util
|
||||
from spacy.de import German
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.scorer import PRFScore
|
||||
|
||||
from spacy.tagger import P2_orth, P2_cluster, P2_shape, P2_prefix, P2_suffix, P2_pos, P2_lemma, P2_flags
|
||||
from spacy.tagger import P1_orth, P1_cluster, P1_shape, P1_prefix, P1_suffix, P1_pos, P1_lemma, P1_flags
|
||||
from spacy.tagger import W_orth, W_cluster, W_shape, W_prefix, W_suffix, W_pos, W_lemma, W_flags
|
||||
from spacy.tagger import N1_orth, N1_cluster, N1_shape, N1_prefix, N1_suffix, N1_pos, N1_lemma, N1_flags
|
||||
from spacy.tagger import N2_orth, N2_cluster, N2_shape, N2_prefix, N2_suffix, N2_pos, N2_lemma, N2_flags, N_CONTEXT_FIELDS
|
||||
|
||||
|
||||
def default_templates():
|
||||
return spacy.tagger.Tagger.default_templates()
|
||||
|
||||
def default_templates_without_clusters():
|
||||
return (
|
||||
(W_orth,),
|
||||
(P1_lemma, P1_pos),
|
||||
(P2_lemma, P2_pos),
|
||||
(N1_orth,),
|
||||
(N2_orth,),
|
||||
|
||||
(W_suffix,),
|
||||
(W_prefix,),
|
||||
|
||||
(P1_pos,),
|
||||
(P2_pos,),
|
||||
(P1_pos, P2_pos),
|
||||
(P1_pos, W_orth),
|
||||
(P1_suffix,),
|
||||
(N1_suffix,),
|
||||
|
||||
(W_shape,),
|
||||
|
||||
(W_flags,),
|
||||
(N1_flags,),
|
||||
(N2_flags,),
|
||||
(P1_flags,),
|
||||
(P2_flags,),
|
||||
)
|
||||
|
||||
|
||||
def make_tagger(vocab, templates):
|
||||
model = spacy.tagger.TaggerModel(templates)
|
||||
return spacy.tagger.Tagger(vocab,model)
|
||||
|
||||
|
||||
def read_conll(file_):
|
||||
def sentences():
|
||||
words, tags = [], []
|
||||
for line in file_:
|
||||
line = line.strip()
|
||||
if line:
|
||||
word, tag = line.split('\t')[1::3][:2] # get column 1 and 4 (CoNLL09)
|
||||
words.append(word)
|
||||
tags.append(tag)
|
||||
elif words:
|
||||
yield words, tags
|
||||
words, tags = [], []
|
||||
if words:
|
||||
yield words, tags
|
||||
return [ s for s in sentences() ]
|
||||
|
||||
|
||||
def score_model(score, nlp, words, gold_tags):
|
||||
tokens = nlp.tokenizer.tokens_from_list(words)
|
||||
assert(len(tokens) == len(gold_tags))
|
||||
nlp.tagger(tokens)
|
||||
|
||||
for token, gold_tag in zip(tokens,gold_tags):
|
||||
score.score_set(set([token.tag_]),set([gold_tag]))
|
||||
|
||||
|
||||
def train(Language, train_sents, dev_sents, model_dir, n_iter=15, seed=21):
|
||||
# make shuffling deterministic
|
||||
random.seed(seed)
|
||||
|
||||
# set up directory for model
|
||||
pos_model_dir = path.join(model_dir, 'pos')
|
||||
if path.exists(pos_model_dir):
|
||||
shutil.rmtree(pos_model_dir)
|
||||
os.mkdir(pos_model_dir)
|
||||
|
||||
nlp = Language(data_dir=model_dir, tagger=False, parser=False, entity=False)
|
||||
nlp.tagger = make_tagger(nlp.vocab,default_templates())
|
||||
|
||||
print("Itn.\ttrain acc %\tdev acc %")
|
||||
for itn in range(n_iter):
|
||||
# train on train set
|
||||
#train_acc = PRFScore()
|
||||
correct, total = 0., 0.
|
||||
for words, gold_tags in train_sents:
|
||||
tokens = nlp.tokenizer.tokens_from_list(words)
|
||||
correct += nlp.tagger.train(tokens, gold_tags)
|
||||
total += len(words)
|
||||
train_acc = correct/total
|
||||
|
||||
# test on dev set
|
||||
dev_acc = PRFScore()
|
||||
for words, gold_tags in dev_sents:
|
||||
score_model(dev_acc, nlp, words, gold_tags)
|
||||
|
||||
random.shuffle(train_sents)
|
||||
print('%d:\t%6.2f\t%6.2f' % (itn, 100*train_acc, 100*dev_acc.precision))
|
||||
|
||||
|
||||
print('end training')
|
||||
nlp.end_training(model_dir)
|
||||
print('done')
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
train_loc=("Location of CoNLL 09 formatted training file"),
|
||||
dev_loc=("Location of CoNLL 09 formatted development file"),
|
||||
model_dir=("Location of output model directory"),
|
||||
eval_only=("Skip training, and only evaluate", "flag", "e", bool),
|
||||
n_iter=("Number of training iterations", "option", "i", int),
|
||||
)
|
||||
def main(train_loc, dev_loc, model_dir, eval_only=False, n_iter=15):
|
||||
# training
|
||||
if not eval_only:
|
||||
with io.open(train_loc, 'r', encoding='utf8') as trainfile_, \
|
||||
io.open(dev_loc, 'r', encoding='utf8') as devfile_:
|
||||
train_sents = read_conll(trainfile_)
|
||||
dev_sents = read_conll(devfile_)
|
||||
train(German, train_sents, dev_sents, model_dir, n_iter=n_iter)
|
||||
|
||||
# testing
|
||||
with io.open(dev_loc, 'r', encoding='utf8') as file_:
|
||||
dev_sents = read_conll(file_)
|
||||
nlp = German(data_dir=model_dir)
|
||||
|
||||
dev_acc = PRFScore()
|
||||
for words, gold_tags in dev_sents:
|
||||
score_model(dev_acc, nlp, words, gold_tags)
|
||||
|
||||
print('POS: %6.2f %%' % (100*dev_acc.precision))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -2,20 +2,18 @@
|
|||
|
||||
# spaCy examples
|
||||
|
||||
The examples are Python scripts with well-behaved command line interfaces. For a full list of spaCy tutorials and code snippets, see the [documentation](https://spacy.io/docs/usage/tutorials).
|
||||
The examples are Python scripts with well-behaved command line interfaces. For
|
||||
more detailed usage guides, see the [documentation](https://spacy.io/usage/).
|
||||
|
||||
## How to run an example
|
||||
|
||||
For example, to run the [`nn_text_class.py`](nn_text_class.py) script, do:
|
||||
To see the available arguments, you can use the `--help` or `-h` flag:
|
||||
|
||||
```bash
|
||||
$ python examples/nn_text_class.py
|
||||
usage: nn_text_class.py [-h] [-d 3] [-H 300] [-i 5] [-w 40000] [-b 24]
|
||||
[-r 0.3] [-p 1e-05] [-e 0.005]
|
||||
data_dir
|
||||
nn_text_class.py: error: too few arguments
|
||||
$ python examples/training/train_ner.py --help
|
||||
```
|
||||
|
||||
You can print detailed help with the `-h` argument.
|
||||
|
||||
While we try to keep the examples up to date, they are not currently exercised by the test suite, as some of them require significant data downloads or take time to train. If you find that an example is no longer running, [please tell us](https://github.com/explosion/spaCy/issues)! We know there's nothing worse than trying to figure out what you're doing wrong, and it turns out your code was never the problem.
|
||||
While we try to keep the examples up to date, they are not currently exercised
|
||||
by the test suite, as some of them require significant data downloads or take
|
||||
time to train. If you find that an example is no longer running,
|
||||
[please tell us](https://github.com/explosion/spaCy/issues)! We know there's
|
||||
nothing worse than trying to figure out what you're doing wrong, and it turns
|
||||
out your code was never the problem.
|
||||
|
|
|
@ -1,37 +0,0 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from math import sqrt
|
||||
from numpy import dot
|
||||
from numpy.linalg import norm
|
||||
|
||||
|
||||
def handle_tweet(spacy, tweet_data, query):
|
||||
text = tweet_data.get('text', u'')
|
||||
# Twython returns either bytes or unicode, depending on tweet.
|
||||
# ಠ_ಠ #APIshaming
|
||||
try:
|
||||
match_tweet(spacy, text, query)
|
||||
except TypeError:
|
||||
match_tweet(spacy, text.decode('utf8'), query)
|
||||
|
||||
|
||||
def match_tweet(spacy, text, query):
|
||||
def get_vector(word):
|
||||
return spacy.vocab[word].repvec
|
||||
|
||||
tweet = spacy(text)
|
||||
tweet = [w.repvec for w in tweet if w.is_alpha and w.lower_ != query]
|
||||
if tweet:
|
||||
accept = map(get_vector, 'child classroom teach'.split())
|
||||
reject = map(get_vector, 'mouth hands giveaway'.split())
|
||||
|
||||
y = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in accept)
|
||||
n = sum(max(cos(w1, w2), 0) for w1 in tweet for w2 in reject)
|
||||
|
||||
if (y / (y + n)) >= 0.5 or True:
|
||||
print(text)
|
||||
|
||||
|
||||
def cos(v1, v2):
|
||||
return dot(v1, v2) / (norm(v1) * norm(v2))
|
|
@ -1,16 +1,24 @@
|
|||
import plac
|
||||
import collections
|
||||
import random
|
||||
"""
|
||||
This example shows how to use an LSTM sentiment classification model trained using Keras in spaCy. spaCy splits the document into sentences, and each sentence is classified using the LSTM. The scores for the sentences are then aggregated to give the document score. This kind of hierarchical model is quite difficult in "pure" Keras or TensorFlow, but it's very effective. The Keras example on this dataset performs quite poorly, because it cuts off the documents so that they're a fixed size. This hurts review accuracy a lot, because people often summarise their rating in the final sentence.
|
||||
|
||||
Prerequisites:
|
||||
spacy download en_vectors_web_lg
|
||||
pip install keras==2.0.9
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
|
||||
import plac
|
||||
import random
|
||||
import pathlib
|
||||
import cytoolz
|
||||
import numpy
|
||||
from keras.models import Sequential, model_from_json
|
||||
from keras.layers import LSTM, Dense, Embedding, Dropout, Bidirectional
|
||||
from keras.layers import LSTM, Dense, Embedding, Bidirectional
|
||||
from keras.layers import TimeDistributed
|
||||
from keras.optimizers import Adam
|
||||
import cPickle as pickle
|
||||
|
||||
import thinc.extra.datasets
|
||||
from spacy.compat import pickle
|
||||
import spacy
|
||||
|
||||
|
||||
|
@ -70,28 +78,32 @@ def get_features(docs, max_length):
|
|||
for i, doc in enumerate(docs):
|
||||
j = 0
|
||||
for token in doc:
|
||||
if token.has_vector and not token.is_punct and not token.is_space:
|
||||
Xs[i, j] = token.rank + 1
|
||||
j += 1
|
||||
if j >= max_length:
|
||||
break
|
||||
vector_id = token.vocab.vectors.find(key=token.orth)
|
||||
if vector_id >= 0:
|
||||
Xs[i, j] = vector_id
|
||||
else:
|
||||
Xs[i, j] = 0
|
||||
j += 1
|
||||
if j >= max_length:
|
||||
break
|
||||
return Xs
|
||||
|
||||
|
||||
def train(train_texts, train_labels, dev_texts, dev_labels,
|
||||
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100, nb_epoch=5,
|
||||
by_sentence=True):
|
||||
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
|
||||
nb_epoch=5, by_sentence=True):
|
||||
print("Loading spaCy")
|
||||
nlp = spacy.load('en', entity=False)
|
||||
nlp = spacy.load('en_vectors_web_lg')
|
||||
nlp.add_pipe(nlp.create_pipe('sentencizer'))
|
||||
embeddings = get_embeddings(nlp.vocab)
|
||||
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
|
||||
print("Parsing texts...")
|
||||
train_docs = list(nlp.pipe(train_texts, batch_size=5000, n_threads=3))
|
||||
dev_docs = list(nlp.pipe(dev_texts, batch_size=5000, n_threads=3))
|
||||
train_docs = list(nlp.pipe(train_texts))
|
||||
dev_docs = list(nlp.pipe(dev_texts))
|
||||
if by_sentence:
|
||||
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
|
||||
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
|
||||
|
||||
|
||||
train_X = get_features(train_docs, lstm_shape['max_length'])
|
||||
dev_X = get_features(dev_docs, lstm_shape['max_length'])
|
||||
model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
|
||||
|
@ -111,9 +123,10 @@ def compile_lstm(embeddings, shape, settings):
|
|||
mask_zero=True
|
||||
)
|
||||
)
|
||||
model.add(TimeDistributed(Dense(shape['nr_hidden'], bias=False)))
|
||||
model.add(Bidirectional(LSTM(shape['nr_hidden'], dropout_U=settings['dropout'],
|
||||
dropout_W=settings['dropout'])))
|
||||
model.add(TimeDistributed(Dense(shape['nr_hidden'], use_bias=False)))
|
||||
model.add(Bidirectional(LSTM(shape['nr_hidden'],
|
||||
recurrent_dropout=settings['dropout'],
|
||||
dropout=settings['dropout'])))
|
||||
model.add(Dense(shape['nr_class'], activation='sigmoid'))
|
||||
model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy',
|
||||
metrics=['accuracy'])
|
||||
|
@ -121,12 +134,7 @@ def compile_lstm(embeddings, shape, settings):
|
|||
|
||||
|
||||
def get_embeddings(vocab):
|
||||
max_rank = max(lex.rank+1 for lex in vocab if lex.has_vector)
|
||||
vectors = numpy.ndarray((max_rank+1, vocab.vectors_length), dtype='float32')
|
||||
for lex in vocab:
|
||||
if lex.has_vector:
|
||||
vectors[lex.rank + 1] = lex.vector
|
||||
return vectors
|
||||
return vocab.vectors.data
|
||||
|
||||
|
||||
def evaluate(model_dir, texts, labels, max_length=100):
|
||||
|
@ -136,12 +144,12 @@ def evaluate(model_dir, texts, labels, max_length=100):
|
|||
'''
|
||||
return [nlp.tagger, nlp.parser, SentimentAnalyser.load(model_dir, nlp,
|
||||
max_length=max_length)]
|
||||
|
||||
|
||||
nlp = spacy.load('en')
|
||||
nlp.pipeline = create_pipeline(nlp)
|
||||
|
||||
correct = 0
|
||||
i = 0
|
||||
i = 0
|
||||
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
|
||||
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
|
||||
i += 1
|
||||
|
@ -174,22 +182,32 @@ def read_data(data_dir, limit=0):
|
|||
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
|
||||
nr_examples=("Limit to N examples", "option", "n", int)
|
||||
)
|
||||
def main(model_dir, train_dir, dev_dir,
|
||||
def main(model_dir=None, train_dir=None, dev_dir=None,
|
||||
is_runtime=False,
|
||||
nr_hidden=64, max_length=100, # Shape
|
||||
dropout=0.5, learn_rate=0.001, # General NN config
|
||||
nb_epoch=5, batch_size=100, nr_examples=-1): # Training params
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
train_dir = pathlib.Path(train_dir)
|
||||
dev_dir = pathlib.Path(dev_dir)
|
||||
if model_dir is not None:
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if train_dir is None or dev_dir is None:
|
||||
imdb_data = thinc.extra.datasets.imdb()
|
||||
if is_runtime:
|
||||
dev_texts, dev_labels = read_data(dev_dir)
|
||||
if dev_dir is None:
|
||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||
else:
|
||||
dev_texts, dev_labels = read_data(dev_dir)
|
||||
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
|
||||
print(acc)
|
||||
else:
|
||||
print("Read data")
|
||||
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
|
||||
dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
|
||||
if train_dir is None:
|
||||
train_texts, train_labels = zip(*imdb_data[0])
|
||||
else:
|
||||
print("Read data")
|
||||
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
|
||||
if dev_dir is None:
|
||||
dev_texts, dev_labels = zip(*imdb_data[1])
|
||||
else:
|
||||
dev_texts, dev_labels = read_data(dev_dir, imdb_data, limit=nr_examples)
|
||||
train_labels = numpy.asarray(train_labels, dtype='int32')
|
||||
dev_labels = numpy.asarray(dev_labels, dtype='int32')
|
||||
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
|
||||
|
@ -198,10 +216,11 @@ def main(model_dir, train_dir, dev_dir,
|
|||
{},
|
||||
nb_epoch=nb_epoch, batch_size=batch_size)
|
||||
weights = lstm.get_weights()
|
||||
with (model_dir / 'model').open('wb') as file_:
|
||||
pickle.dump(weights[1:], file_)
|
||||
with (model_dir / 'config.json').open('wb') as file_:
|
||||
file_.write(lstm.to_json())
|
||||
if model_dir is not None:
|
||||
with (model_dir / 'model').open('wb') as file_:
|
||||
pickle.dump(weights[1:], file_)
|
||||
with (model_dir / 'config.json').open('wb') as file_:
|
||||
file_.write(lstm.to_json())
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
|
|
@ -1,33 +0,0 @@
|
|||
'''
|
||||
Match a dependency pattern. See https://github.com/explosion/spaCy/pull/1120
|
||||
|
||||
We start by creating a DependencyTree for the Doc. This class models the document
|
||||
dependency tree. Then we compile the query into a Pattern using the PatternParser.
|
||||
The syntax is quite simple:
|
||||
|
||||
we define a node named 'fox' that must match a token in the dep tree
whose orth_ is 'fox'. An anonymous token whose lemma is 'quick' must have fox
as parent, with a dep_ matching the regex am.*. Another anonymous token whose
orth_ matches the regex brown|yellow has fox as parent, with whatever dep_.
|
||||
DependencyTree.match returns a list of PatternMatch. Notice that we can assign
|
||||
names to anonymous or defined nodes ([word:fox]=f). We can get the Token mapped
|
||||
to the fox node using match['f'].
|
||||
'''
|
||||
import spacy
|
||||
from spacy.pattern import PatternParser, DependencyTree
|
||||
|
||||
nlp = spacy.load('en')
|
||||
doc = nlp("The quick brown fox jumped over the lazy dog.")
|
||||
tree = DependencyTree(doc)
|
||||
|
||||
query = """fox [word:fox]=f
|
||||
[lemma:quick]=q >/am.*/ fox
|
||||
[word:/brown|yellow/] > fox"""
|
||||
|
||||
pattern = PatternParser.parse(query)
|
||||
matches = tree.match(pattern)
|
||||
|
||||
assert len(matches) == 1
|
||||
match = matches[0]
|
||||
|
||||
assert match['f'] == doc[3]
|
|
@ -1,59 +0,0 @@
|
|||
"""Issue #252
|
||||
|
||||
Question:
|
||||
|
||||
In the documents and tutorials the main thing I haven't found is examples on how to break sentences down into small sub thoughts/chunks. The noun_chunks is handy, but having examples on using the token.head to find small (near-complete) sentence chunks would be neat.
|
||||
|
||||
Lets take the example sentence on https://displacy.spacy.io/displacy/index.html
|
||||
|
||||
displaCy uses CSS and JavaScript to show you how computers understand language
|
||||
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
|
||||
|
||||
[displaCy] uses CSS and Javascript [to + show]
|
||||
&
|
||||
show you how computers understand [language]
|
||||
I'm assuming that we can use the token.head to build these groups. In one of your examples you had the following function.
|
||||
|
||||
def dependency_labels_to_root(token):
|
||||
'''Walk up the syntactic tree, collecting the arc labels.'''
|
||||
dep_labels = []
|
||||
while token.head is not token:
|
||||
dep_labels.append(token.dep)
|
||||
token = token.head
|
||||
return dep_labels
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals
|
||||
|
||||
# Answer:
|
||||
# The easiest way is to find the head of the subtree you want, and then use the
|
||||
# `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` is the
|
||||
# one that does what you're asking for most directly:
|
||||
|
||||
from spacy.en import English
|
||||
nlp = English()
|
||||
|
||||
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
|
||||
for word in doc:
|
||||
if word.dep_ in ('xcomp', 'ccomp'):
|
||||
print(''.join(w.text_with_ws for w in word.subtree))
|
||||
|
||||
# It'd probably be better for `word.subtree` to return a `Span` object instead
|
||||
# of a generator over the tokens. If you want the `Span` you can get it via the
|
||||
# `.right_edge` and `.left_edge` properties. The `Span` object is nice because
|
||||
# you can easily get a vector, merge it, etc.
|
||||
|
||||
doc = nlp(u'displaCy uses CSS and JavaScript to show you how computers understand language')
|
||||
for word in doc:
|
||||
if word.dep_ in ('xcomp', 'ccomp'):
|
||||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
||||
print(subtree_span.text, '|', subtree_span.root.text)
|
||||
print(subtree_span.similarity(doc))
|
||||
print(subtree_span.similarity(subtree_span.root))
|
||||
|
||||
|
||||
# You might also want to select a head, and then select a start and end position by
|
||||
# walking along its children. You could then take the `.left_edge` and `.right_edge`
|
||||
# of those tokens, and use it to calculate a span.
|
||||
|
||||
|
||||
|
|
@ -1,59 +0,0 @@
|
|||
import plac
|
||||
|
||||
from spacy.en import English
|
||||
from spacy.parts_of_speech import NOUN
|
||||
from spacy.parts_of_speech import ADP as PREP
|
||||
|
||||
|
||||
def _span_to_tuple(span):
|
||||
start = span[0].idx
|
||||
end = span[-1].idx + len(span[-1])
|
||||
tag = span.root.tag_
|
||||
text = span.text
|
||||
label = span.label_
|
||||
return (start, end, tag, text, label)
|
||||
|
||||
def merge_spans(spans, doc):
|
||||
# This is a bit awkward atm. What we're doing here is merging the entities,
|
||||
# so that each only takes up a single token. But an entity is a Span, and
|
||||
# each Span is a view into the doc. When we merge a span, we invalidate
|
||||
# the other spans. This will get fixed --- but for now the solution
|
||||
# is to gather the information first, before merging.
|
||||
tuples = [_span_to_tuple(span) for span in spans]
|
||||
for span_tuple in tuples:
|
||||
doc.merge(*span_tuple)
|
||||
|
||||
|
||||
def extract_currency_relations(doc):
|
||||
merge_spans(doc.ents, doc)
|
||||
merge_spans(doc.noun_chunks, doc)
|
||||
|
||||
relations = []
|
||||
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
|
||||
if money.dep_ in ('attr', 'dobj'):
|
||||
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
|
||||
if subject:
|
||||
subject = subject[0]
|
||||
relations.append((subject, money))
|
||||
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
|
||||
relations.append((money.head.head, money))
|
||||
|
||||
return relations
|
||||
|
||||
|
||||
def main():
|
||||
nlp = English()
|
||||
texts = [
|
||||
u'Net income was $9.4 million compared to the prior year of $2.7 million.',
|
||||
u'Revenue exceeded twelve billion dollars, with a loss of $1b.',
|
||||
]
|
||||
|
||||
for text in texts:
|
||||
doc = nlp(text)
|
||||
relations = extract_currency_relations(doc)
|
||||
for r1, r2 in relations:
|
||||
print(r1.text, r2.ent_type_, r2.text)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
61
examples/information_extraction/entity_relations.py
Normal file
|
@ -0,0 +1,61 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""A simple example of extracting relations between phrases and entities using
|
||||
spaCy's named entity recognizer and the dependency parse. Here, we extract
|
||||
money and currency values (entities labelled as MONEY) and then check the
|
||||
dependency tree to find the noun phrase they are referring to – for example:
|
||||
$9.4 million --> Net income.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
TEXTS = [
|
||||
'Net income was $9.4 million compared to the prior year of $2.7 million.',
|
||||
'Revenue exceeded twelve billion dollars, with a loss of $1b.',
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model to load (needs parser and NER)", "positional", None, str))
|
||||
def main(model='en_core_web_sm'):
|
||||
nlp = spacy.load(model)
|
||||
print("Loaded model '%s'" % model)
|
||||
print("Processing %d texts" % len(TEXTS))
|
||||
|
||||
for text in TEXTS:
|
||||
doc = nlp(text)
|
||||
relations = extract_currency_relations(doc)
|
||||
for r1, r2 in relations:
|
||||
print('{:<10}\t{}\t{}'.format(r1.text, r2.ent_type_, r2.text))
|
||||
|
||||
|
||||
def extract_currency_relations(doc):
|
||||
# merge entities and noun chunks into one token
|
||||
for span in [*list(doc.ents), *list(doc.noun_chunks)]:
|
||||
span.merge()
|
||||
|
||||
relations = []
|
||||
for money in filter(lambda w: w.ent_type_ == 'MONEY', doc):
|
||||
if money.dep_ in ('attr', 'dobj'):
|
||||
subject = [w for w in money.head.lefts if w.dep_ == 'nsubj']
|
||||
if subject:
|
||||
subject = subject[0]
|
||||
relations.append((subject, money))
|
||||
elif money.dep_ == 'pobj' and money.head.dep_ == 'prep':
|
||||
relations.append((money.head.head, money))
|
||||
return relations
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Net income MONEY $9.4 million
|
||||
# the prior year MONEY $2.7 million
|
||||
# Revenue MONEY twelve billion dollars
|
||||
# a loss MONEY 1b
|
64
examples/information_extraction/parse_subtrees.py
Normal file
|
@ -0,0 +1,64 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""This example shows how to navigate the parse tree including subtrees
|
||||
attached to a word.
|
||||
|
||||
Based on issue #252:
|
||||
"In the documents and tutorials the main thing I haven't found is
|
||||
examples on how to break sentences down into small sub thoughts/chunks. The
|
||||
noun_chunks is handy, but having examples on using the token.head to find small
|
||||
(near-complete) sentence chunks would be neat. Lets take the example sentence:
|
||||
"displaCy uses CSS and JavaScript to show you how computers understand language"
|
||||
|
||||
This sentence has two main parts (XCOMP & CCOMP) according to the breakdown:
|
||||
[displaCy] uses CSS and Javascript [to + show]
|
||||
show you how computers understand [language]
|
||||
|
||||
I'm assuming that we can use the token.head to build these groups."
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model to load", "positional", None, str))
|
||||
def main(model='en_core_web_sm'):
|
||||
nlp = spacy.load(model)
|
||||
print("Loaded model '%s'" % model)
|
||||
|
||||
doc = nlp("displaCy uses CSS and JavaScript to show you how computers "
|
||||
"understand language")
|
||||
|
||||
# The easiest way is to find the head of the subtree you want, and then use
|
||||
# the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree`
|
||||
# is the one that does what you're asking for most directly:
|
||||
for word in doc:
|
||||
if word.dep_ in ('xcomp', 'ccomp'):
|
||||
print(''.join(w.text_with_ws for w in word.subtree))
|
||||
|
||||
# It'd probably be better for `word.subtree` to return a `Span` object
|
||||
# instead of a generator over the tokens. If you want the `Span` you can
|
||||
# get it via the `.right_edge` and `.left_edge` properties. The `Span`
|
||||
# object is nice because you can easily get a vector, merge it, etc.
|
||||
for word in doc:
|
||||
if word.dep_ in ('xcomp', 'ccomp'):
|
||||
subtree_span = doc[word.left_edge.i : word.right_edge.i + 1]
|
||||
print(subtree_span.text, '|', subtree_span.root.text)
|
||||
|
||||
# You might also want to select a head, and then select a start and end
|
||||
# position by walking along its children. You could then take the
|
||||
# `.left_edge` and `.right_edge` of those tokens, and use it to calculate
|
||||
# a span.
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# to show you how computers understand language
|
||||
# how computers understand language
|
||||
# to show you how computers understand language | show
|
||||
# how computers understand language | understand
|
107
examples/information_extraction/phrase_matcher.py
Normal file
|
@ -0,0 +1,107 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Match a large set of multi-word expressions in O(1) time.
|
||||
|
||||
The idea is to associate each word in the vocabulary with a tag, noting whether
|
||||
they begin, end, or are inside at least one pattern. An additional tag is used
|
||||
for single-word patterns. Complete patterns are also stored in a hash set.
|
||||
When we process a document, we look up the words in the vocabulary, to
|
||||
associate the words with the tags. We then search for tag-sequences that
|
||||
correspond to valid candidates. Finally, we look up the candidates in the hash
|
||||
set.
|
||||
|
||||
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary
|
||||
Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with
|
||||
the I tag, and Obama and Clinton with the L tag.
|
||||
|
||||
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
|
||||
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second
|
||||
candidate is in the phrase dictionary, so only one is returned as a match.
|
||||
|
||||
The algorithm is O(n) at run-time for document of length n because we're only
|
||||
ever matching over the tag patterns. So no matter how many phrases we're
|
||||
looking for, our pattern set stays very small (exact size depends on the
|
||||
maximum length we're looking for, as the query language currently has no
|
||||
quantifiers).
|
||||
|
||||
The example expects a .bz2 file from the Reddit corpus, and a patterns file,
|
||||
formatted in jsonl as a sequence of entries like this:
|
||||
|
||||
{"text":"Anchorage"}
|
||||
{"text":"Angola"}
|
||||
{"text":"Ann Arbor"}
|
||||
{"text":"Annapolis"}
|
||||
{"text":"Appalachia"}
|
||||
{"text":"Argentina"}
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
|
||||
from bz2 import BZ2File
|
||||
import time
|
||||
import plac
|
||||
import ujson
|
||||
|
||||
from spacy.matcher import PhraseMatcher
|
||||
import spacy
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
patterns_loc=("Path to gazetteer", "positional", None, str),
|
||||
text_loc=("Path to Reddit corpus file", "positional", None, str),
|
||||
n=("Number of texts to read", "option", "n", int),
|
||||
lang=("Language class to initialise", "option", "l", str))
|
||||
def main(patterns_loc, text_loc, n=10000, lang='en'):
|
||||
nlp = spacy.blank('en')
|
||||
nlp.vocab.lex_attr_getters = {}
|
||||
phrases = read_gazetteer(nlp.tokenizer, patterns_loc)
|
||||
count = 0
|
||||
t1 = time.time()
|
||||
for ent_id, text in get_matches(nlp.tokenizer, phrases,
|
||||
read_text(text_loc, n=n)):
|
||||
count += 1
|
||||
t2 = time.time()
|
||||
print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count))
|
||||
|
||||
|
||||
def read_gazetteer(tokenizer, loc, n=-1):
|
||||
for i, line in enumerate(open(loc)):
|
||||
data = ujson.loads(line.strip())
|
||||
phrase = tokenizer(data['text'])
|
||||
for w in phrase:
|
||||
_ = tokenizer.vocab[w.text]
|
||||
if len(phrase) >= 2:
|
||||
yield phrase
|
||||
|
||||
|
||||
def read_text(bz2_loc, n=10000):
|
||||
with BZ2File(bz2_loc) as file_:
|
||||
for i, line in enumerate(file_):
|
||||
data = ujson.loads(line)
|
||||
yield data['body']
|
||||
if i >= n:
|
||||
break
|
||||
|
||||
|
||||
def get_matches(tokenizer, phrases, texts, max_length=6):
|
||||
matcher = PhraseMatcher(tokenizer.vocab, max_length=max_length)
|
||||
matcher.add('Phrase', None, *phrases)
|
||||
for text in texts:
|
||||
doc = tokenizer(text)
|
||||
for w in doc:
|
||||
_ = doc.vocab[w.text]
|
||||
matches = matcher(doc)
|
||||
for ent_id, start, end in matches:
|
||||
yield (ent_id, doc[start:end].text)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if False:
|
||||
import cProfile
|
||||
import pstats
|
||||
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
||||
s = pstats.Stats("Profile.prof")
|
||||
s.strip_dirs().sort_stats("time").print_stats()
|
||||
else:
|
||||
plac.call(main)
|
|
@ -1,5 +0,0 @@
|
|||
An example of inventory counting using the spaCy NLP library. It is meant to show how to instantiate spaCy's English class, and to allow reusability by reloading the main module.
|
||||
|
||||
In the future, a better implementation of this library would apply machine learning to each query and learn what to classify as the quantitative statement (55 kgs OF) versus the actual item being counted (i.e. how likely a prepositional object is to be the item of count if x, y, z qualifications appear in the statement).
|
||||
|
||||
|
|
@ -1,35 +0,0 @@
|
|||
class Inventory:
|
||||
"""
|
||||
Inventory class - a struct{} like feature to house inventory counts
|
||||
across modules.
|
||||
"""
|
||||
originalQuery = None
|
||||
item = ""
|
||||
unit = ""
|
||||
amount = ""
|
||||
|
||||
def __init__(self, statement):
|
||||
"""
|
||||
Constructor - only takes in the original query/statement
|
||||
:return: new Inventory object
|
||||
"""
|
||||
|
||||
self.originalQuery = statement
|
||||
pass
|
||||
|
||||
def __str__(self):
|
||||
return str(self.amount) + ' ' + str(self.unit) + ' ' + str(self.item)
|
||||
|
||||
def printInfo(self):
|
||||
print '-------------Inventory Count------------'
|
||||
print "Original Query: " + str(self.originalQuery)
|
||||
print 'Amount: ' + str(self.amount)
|
||||
print 'Unit: ' + str(self.unit)
|
||||
print 'Item: ' + str(self.item)
|
||||
print '----------------------------------------'
|
||||
|
||||
def isValid(self):
|
||||
if not self.item or not self.unit or not self.amount or not self.originalQuery:
|
||||
return False
|
||||
else:
|
||||
return True
|
|
@ -1,92 +0,0 @@
|
|||
from inventory import Inventory
|
||||
|
||||
|
||||
def runTest(nlp):
|
||||
testset = []
|
||||
testset += [nlp(u'6 lobster cakes')]
|
||||
testset += [nlp(u'6 avacados')]
|
||||
testset += [nlp(u'fifty five carrots')]
|
||||
testset += [nlp(u'i have 55 carrots')]
|
||||
testset += [nlp(u'i got me some 9 cabbages')]
|
||||
testset += [nlp(u'i got 65 kgs of carrots')]
|
||||
|
||||
result = []
|
||||
for doc in testset:
|
||||
c = decodeInventoryEntry_level1(doc)
|
||||
if not c.isValid():
|
||||
c = decodeInventoryEntry_level2(doc)
|
||||
result.append(c)
|
||||
|
||||
for i in result:
|
||||
i.printInfo()
|
||||
|
||||
|
||||
def decodeInventoryEntry_level1(document):
|
||||
"""
|
||||
Decodes a basic entry such as: '6 lobster cakes' or '6 cakes'
|
||||
@param document : NLP Doc object
|
||||
:return: Status if decoded correctly (true, false), and Inventory object
|
||||
"""
|
||||
count = Inventory(str(document))
|
||||
for token in document:
|
||||
if token.pos_ in (u'NOUN', u'NNS', u'NN'):
|
||||
item = str(token)
|
||||
|
||||
for child in token.children:
|
||||
if child.dep_ == u'compound' or child.dep_ == u'ad':
|
||||
item = str(child) + str(item)
|
||||
elif child.dep_ == u'nummod':
|
||||
count.amount = str(child).strip()
|
||||
for numerical_child in child.children:
|
||||
# this isn't arithmetic; the number parts are concatenated as strings
|
||||
count.amount = str(numerical_child) + str(count.amount).strip()
|
||||
else:
|
||||
print "WARNING: unknown child: " + str(child) + ':'+str(child.dep_)
|
||||
|
||||
count.item = item
|
||||
count.unit = item
|
||||
|
||||
return count
|
||||
|
||||
|
||||
def decodeInventoryEntry_level2(document):
|
||||
"""
|
||||
Entry level 2, a more complicated parsing scheme that covers examples such as
|
||||
'i have 80 boxes of freshly baked pies'
|
||||
|
||||
@param document : NLP Doc object
:return: Status if decoded correctly (true, false), and Inventory object
|
||||
"""
|
||||
|
||||
count = Inventory(str(document))
|
||||
|
||||
for token in document:
|
||||
# Look for a preposition object that is a noun (this is the item we are counting).
|
||||
# If found, look at its dependency: its head should be a preposition (one that is
# not indicative of inventory location), and the head of that preposition gives the unit.
|
||||
|
||||
if token.dep_ in (u'pobj', u'meta') and token.pos_ in (u'NOUN', u'NNS', u'NN'):
|
||||
item = ''
|
||||
|
||||
# Go through all the token's children, these are possible adjectives and other add-ons
|
||||
# this deals with cases such as 'hollow rounded waffle pancakes"
|
||||
for i in token.children:
|
||||
item += ' ' + str(i)
|
||||
|
||||
item += ' ' + str(token)
|
||||
count.item = item
|
||||
|
||||
# Get the head of the item:
|
||||
if token.head.dep_ != u'prep':
|
||||
# Break out of the loop, this is a confusing entry
|
||||
break
|
||||
else:
|
||||
amountUnit = token.head.head
|
||||
count.unit = str(amountUnit)
|
||||
|
||||
for inner in amountUnit.children:
|
||||
if inner.pos_ == u'NUM':
|
||||
count.amount += str(inner)
|
||||
return count
|
||||
|
||||
|
|
@ -1,30 +0,0 @@
|
|||
import inventoryCount as mainModule
|
||||
import os
|
||||
from spacy.en import English
|
||||
|
||||
if __name__ == '__main__':
|
||||
"""
|
||||
Main module for this example - loads the English main NLP class,
|
||||
and keeps it in RAM while waiting for the user to re-run it. Allows the
|
||||
developer to re-edit their module under testing without having
|
||||
to wait as long to load the English class
|
||||
"""
|
||||
|
||||
# Set the NLP object here for the parameters you want to see,
|
||||
# or just leave it blank and get all the opts
|
||||
print "Loading English module... this will take a while."
|
||||
nlp = English()
|
||||
print "Done loading English module."
|
||||
while True:
|
||||
try:
|
||||
reload(mainModule)
|
||||
mainModule.runTest(nlp)
|
||||
raw_input('================ To reload main module, press Enter ================')
|
||||
|
||||
|
||||
except Exception, e:
|
||||
print "Unexpected error: " + str(e)
|
||||
continue
|
||||
|
||||
|
||||
|
|
@ -3,16 +3,21 @@
|
|||
# A decomposable attention model for Natural Language Inference
|
||||
**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)**
|
||||
|
||||
> ⚠️ **IMPORTANT NOTE:** This example is currently only compatible with spaCy
|
||||
> v1.x. We're working on porting the example over to Keras v2.x and spaCy v2.x.
|
||||
> See [#1445](https://github.com/explosion/spaCy/issues/1445) for details –
|
||||
> contributions welcome!
|
||||
|
||||
This directory contains an implementation of the entailment prediction model described
|
||||
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
|
||||
by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable
|
||||
for its competitive performance with very few parameters.
|
||||
|
||||
The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io).
|
||||
Keras is used to build and train the network. spaCy is used to load
|
||||
the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the
|
||||
feature extraction, and help you apply the model at run-time. The following
|
||||
demo code shows how the entailment model can be used at runtime, once the
|
||||
hook is installed to customise the `.similarity()` method of spaCy's `Doc`
|
||||
The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io).
|
||||
Keras is used to build and train the network. spaCy is used to load
|
||||
the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the
|
||||
feature extraction, and help you apply the model at run-time. The following
|
||||
demo code shows how the entailment model can be used at runtime, once the
|
||||
hook is installed to customise the `.similarity()` method of spaCy's `Doc`
|
||||
and `Span` objects:
|
||||
|
||||
```python
|
||||
|
@ -37,13 +42,13 @@ lots of ways to extend the model.
|
|||
|
||||
| File | Description |
|
||||
| --- | --- |
|
||||
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
|
||||
| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. |
|
||||
| `spacy_hook.py` | Provides a class `SimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)` (see the sketch after this table). |
|
||||
| `keras_decomposable_attention.py` | Defines the neural network model. |
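
To illustrate what that hook does, here is a minimal sketch using spaCy v2's `Doc.user_hooks` mechanism. Treat it as an assumption rather than the example's own code: the example itself targets spaCy v1.x, and both the `en_core_web_sm` model name and the `entailment_score` stand-in are placeholders for the trained Keras model.

```python
import spacy


def entailment_score(doc1, doc2):
    # Placeholder for the trained entailment model: any callable that takes
    # two Docs and returns a float can be plugged in here. We fall back to a
    # dot product of the document vectors so the sketch runs on its own.
    return float(doc1.vector.dot(doc2.vector))


nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'The cat sat on the mat.')
doc2 = nlp(u'A cat is sitting on a mat.')

# Instead of the default average-of-vectors algorithm, .similarity() now
# returns whatever entailment_score(doc1, doc2) computes.
doc1.user_hooks['similarity'] = entailment_score
print(doc1.similarity(doc2))
```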
|
||||
|
||||
## Setting up
|
||||
|
||||
First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy
|
||||
First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy
|
||||
English models (about 1GB of data):
|
||||
|
||||
```bash
|
||||
|
@ -52,12 +57,12 @@ pip install spacy
|
|||
python -m spacy.en.download
|
||||
```
|
||||
|
||||
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
|
||||
the 1.2.2 release (and not via `pip install keras`). For more info on this, see
|
||||
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
|
||||
the 1.2.2 release (and not via `pip install keras`). For more info on this, see
|
||||
[#727](https://github.com/explosion/spaCy/issues/727).
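
Pinning the exact version is one way to do that, assuming the 1.2.2 release is available on PyPI (otherwise, install it from the corresponding GitHub release):

```bash
pip install keras==1.2.2
```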
|
||||
|
||||
You'll also want to get Keras working on your GPU. This will depend on your
|
||||
set up, so you're mostly on your own for this step. If you're using AWS, try the
|
||||
set up, so you're mostly on your own for this step. If you're using AWS, try the
|
||||
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
|
||||
|
||||
Once you've installed the dependencies, you can run a small preliminary test of
|
||||
|
@ -94,5 +99,5 @@ you how run-time usage will eventually look.
|
|||
## Getting updates
|
||||
|
||||
We should have the blog post explaining the model ready before the end of the week. To get
|
||||
notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal),
|
||||
notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal),
|
||||
or subscribe to our [mailing list](http://eepurl.com/ckUpQ5).
|
||||
|
|
|
@ -1,161 +0,0 @@
|
|||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import spacy.en
|
||||
import spacy.matcher
|
||||
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
|
||||
|
||||
import plac
|
||||
|
||||
|
||||
def main():
|
||||
nlp = spacy.en.English()
|
||||
example = u"I prefer Siri to Google Now. I'll google now to find out how the google now service works."
|
||||
before = nlp(example)
|
||||
print("Before")
|
||||
for ent in before.ents:
|
||||
print(ent.text, ent.label_, [w.tag_ for w in ent])
|
||||
# Output:
|
||||
# Google ORG [u'NNP']
|
||||
# google ORG [u'VB']
|
||||
# google ORG [u'NNP']
|
||||
nlp.matcher.add(
|
||||
"GoogleNow", # Entity ID: Not really used at the moment.
|
||||
"PRODUCT", # Entity type: should be one of the types in the NER data
|
||||
{"wiki_en": "Google_Now"}, # Arbitrary attributes. Currently unused.
|
||||
[ # List of patterns that can be Surface Forms of the entity
|
||||
|
||||
# This Surface Form matches "Google Now", verbatim
|
||||
[ # Each Surface Form is a list of Token Specifiers.
|
||||
{ # This Token Specifier matches tokens whose orth field is "Google"
|
||||
ORTH: "Google"
|
||||
},
|
||||
{ # This Token Specifier matches tokens whose orth field is "Now"
|
||||
ORTH: "Now"
|
||||
}
|
||||
],
|
||||
[ # This Surface Form matches "google now", verbatim, and requires
|
||||
# "google" to have the NNP tag. This helps prevent the pattern from
|
||||
# matching cases like "I will google now to look up the time"
|
||||
{
|
||||
ORTH: "google",
|
||||
TAG: "NNP"
|
||||
},
|
||||
{
|
||||
ORTH: "now"
|
||||
}
|
||||
]
|
||||
]
|
||||
)
|
||||
after = nlp(example)
|
||||
print("After")
|
||||
for ent in after.ents:
|
||||
print(ent.text, ent.label_, [w.tag_ for w in ent])
|
||||
# Output
|
||||
# Google Now PRODUCT [u'NNP', u'RB']
|
||||
# google ORG [u'VB']
|
||||
# google now PRODUCT [u'NNP', u'RB']
|
||||
#
|
||||
# You can customize attribute values in the lexicon, and then refer to the
|
||||
# new attributes in your Token Specifiers.
|
||||
# This is particularly good for word-set membership.
|
||||
#
|
||||
australian_capitals = ['Brisbane', 'Sydney', 'Canberra', 'Melbourne', 'Hobart',
|
||||
'Darwin', 'Adelaide', 'Perth']
|
||||
# Internally, the tokenizer immediately maps each token to a pointer to a
|
||||
# LexemeC struct. These structs hold various features, e.g. the integer IDs
|
||||
# of the normalized string forms.
|
||||
# For our purposes, the key attribute is a 64-bit integer, used as a bit field.
|
||||
# spaCy currently only uses 12 of the bits for its built-in features, so
|
||||
# the others are available for use. It's best to use the higher bits, as
|
||||
# future versions of spaCy may add more flags. For instance, we might add
|
||||
# a built-in IS_MONTH flag, taking up FLAG13. So, we bind our user-field to
|
||||
# FLAG63 here.
|
||||
is_australian_capital = FLAG63
|
||||
# Now we need to set the flag value. It's False on all tokens by default,
|
||||
# so we just need to set it to True for the tokens we want.
|
||||
# Here we iterate over the strings, and set it on only the literal matches.
|
||||
for string in australian_capitals:
|
||||
lexeme = nlp.vocab[string]
|
||||
lexeme.set_flag(is_australian_capital, True)
|
||||
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
|
||||
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
|
||||
# If we want case-insensitive matching, we have to be a little bit more
|
||||
# round-about, as there's no case-insensitive index to the vocabulary. So
|
||||
# we have to iterate over the vocabulary.
|
||||
# We'll be looking up attribute IDs in this set a lot, so it's good to pre-build it
|
||||
target_ids = {nlp.vocab.strings[s.lower()] for s in australian_capitals}
|
||||
for lexeme in nlp.vocab:
|
||||
if lexeme.lower in target_ids:
|
||||
lexeme.set_flag(is_australian_capital, True)
|
||||
print('Sydney', nlp.vocab[u'Sydney'].check_flag(is_australian_capital))
|
||||
print('sydney', nlp.vocab[u'sydney'].check_flag(is_australian_capital))
|
||||
print('SYDNEY', nlp.vocab[u'SYDNEY'].check_flag(is_australian_capital))
|
||||
# Output
|
||||
# Sydney True
|
||||
# sydney False
|
||||
# Sydney True
|
||||
# sydney True
|
||||
# SYDNEY True
|
||||
#
|
||||
# The key thing to note here is that we're setting these attributes once,
|
||||
# over the vocabulary --- and then reusing them at run-time. This means the
|
||||
# amortized complexity of anything we do this way is going to be O(1). You
|
||||
# can match over expressions that need to have sets with tens of thousands
|
||||
# of values, e.g. "all the street names in Germany", and you'll still have
|
||||
# O(1) complexity. Most regular expression algorithms don't scale well to
|
||||
# this sort of problem.
|
||||
#
|
||||
# Now, let's use this in a pattern
|
||||
nlp.matcher.add("AuCitySportsTeam", "ORG", {},
|
||||
[
|
||||
[
|
||||
{LOWER: "the"},
|
||||
{is_australian_capital: True},
|
||||
{TAG: "NNS"}
|
||||
],
|
||||
[
|
||||
{LOWER: "the"},
|
||||
{is_australian_capital: True},
|
||||
{TAG: "NNPS"}
|
||||
],
|
||||
[
|
||||
{LOWER: "the"},
|
||||
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
|
||||
{is_australian_capital: True},
|
||||
{TAG: "NNS"}
|
||||
],
|
||||
[
|
||||
{LOWER: "the"},
|
||||
{IS_ALPHA: True}, # Allow a word in between, e.g. The Western Sydney
|
||||
{is_australian_capital: True},
|
||||
{TAG: "NNPS"}
|
||||
]
|
||||
])
|
||||
doc = nlp(u'The pattern should match the Brisbane Broncos and the South Darwin Spiders, but not the Colorado Boulders')
|
||||
for ent in doc.ents:
|
||||
print(ent.text, ent.label_)
|
||||
# Output
|
||||
# the Brisbane Broncos ORG
|
||||
# the South Darwin Spiders ORG
|
||||
|
||||
|
||||
# Output
|
||||
# Before
|
||||
# Google ORG [u'NNP']
|
||||
# google ORG [u'VB']
|
||||
# google ORG [u'NNP']
|
||||
# After
|
||||
# Google Now PRODUCT [u'NNP', u'RB']
|
||||
# google ORG [u'VB']
|
||||
# google now PRODUCT [u'NNP', u'RB']
|
||||
# Sydney True
|
||||
# sydney False
|
||||
# Sydney True
|
||||
# sydney True
|
||||
# SYDNEY True
|
||||
# the Brisbane Broncos ORG
|
||||
# the South Darwin Spiders ORG
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
|
@ -1,98 +0,0 @@
|
|||
"""Match a large set of multi-word expressions in O(1) time.
|
||||
|
||||
The idea is to associate each word in the vocabulary with a tag, noting whether
|
||||
they begin, end, or are inside at least one pattern. An additional tag is used
|
||||
for single-word patterns. Complete patterns are also stored in a hash set.
|
||||
|
||||
When we process a document, we look up the words in the vocabulary, to associate
|
||||
the words with the tags. We then search for tag-sequences that correspond to
|
||||
valid candidates. Finally, we look up the candidates in the hash set.
|
||||
|
||||
For instance, to search for the phrases "Barack Hussein Obama" and "Hilary Clinton", we
|
||||
would associate "Barack" and "Hilary" with the B tag, Hussein with the I tag,
|
||||
and Obama and Clinton with the L tag.
|
||||
|
||||
The document "Barack Clinton and Hilary Clinton" would have the tag sequence
|
||||
[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second candidate
|
||||
is in the phrase dictionary, so only one is returned as a match.
|
||||
|
||||
The algorithm is O(n) at run-time for document of length n because we're only ever
|
||||
matching over the tag patterns. So no matter how many phrases we're looking for,
|
||||
our pattern set stays very small (exact size depends on the maximum length we're
|
||||
looking for, as the query language currently has no quantifiers)
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
from ast import literal_eval
|
||||
from bz2 import BZ2File
|
||||
import time
|
||||
import math
|
||||
import codecs
|
||||
|
||||
import plac
|
||||
|
||||
from preshed.maps import PreshMap
|
||||
from preshed.counter import PreshCounter
|
||||
from spacy.strings import hash_string
|
||||
from spacy.en import English
|
||||
from spacy.matcher import PhraseMatcher
|
||||
|
||||
|
||||
def read_gazetteer(tokenizer, loc, n=-1):
|
||||
for i, line in enumerate(open(loc)):
|
||||
phrase = literal_eval('u' + line.strip())
|
||||
if ' (' in phrase and phrase.endswith(')'):
|
||||
phrase = phrase.split(' (', 1)[0]
|
||||
if i >= n:
|
||||
break
|
||||
phrase = tokenizer(phrase)
|
||||
if all((t.is_lower and t.prob >= -10) for t in phrase):
|
||||
continue
|
||||
if len(phrase) >= 2:
|
||||
yield phrase
|
||||
|
||||
|
||||
def read_text(bz2_loc):
|
||||
with BZ2File(bz2_loc) as file_:
|
||||
for line in file_:
|
||||
yield line.decode('utf8')
|
||||
|
||||
|
||||
def get_matches(tokenizer, phrases, texts, max_length=6):
|
||||
matcher = PhraseMatcher(tokenizer.vocab, phrases, max_length=max_length)
|
||||
print("Match")
|
||||
for text in texts:
|
||||
doc = tokenizer(text)
|
||||
matches = matcher(doc)
|
||||
for mwe in doc.ents:
|
||||
yield mwe
|
||||
|
||||
|
||||
def main(patterns_loc, text_loc, counts_loc, n=10000000):
|
||||
nlp = English(parser=False, tagger=False, entity=False)
|
||||
print("Make matcher")
|
||||
phrases = read_gazetteer(nlp.tokenizer, patterns_loc, n=n)
|
||||
counts = PreshCounter()
|
||||
t1 = time.time()
|
||||
for mwe in get_matches(nlp.tokenizer, phrases, read_text(text_loc)):
|
||||
counts.inc(hash_string(mwe.text), 1)
|
||||
t2 = time.time()
|
||||
print("10m tokens in %d s" % (t2 - t1))
|
||||
|
||||
with codecs.open(counts_loc, 'w', 'utf8') as file_:
|
||||
for phrase in read_gazetteer(nlp.tokenizer, patterns_loc, n=n):
|
||||
text = phrase.string
|
||||
key = hash_string(text)
|
||||
count = counts[key]
|
||||
if count != 0:
|
||||
file_.write('%d\t%s\n' % (count, text))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if False:
|
||||
import cProfile
|
||||
import pstats
|
||||
cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof")
|
||||
s = pstats.Stats("Profile.prof")
|
||||
s.strip_dirs().sort_stats("time").print_stats()
|
||||
else:
|
||||
plac.call(main)
|
|
@ -1,281 +0,0 @@
|
|||
"""This script expects something like a binary sentiment data set, such as
|
||||
that available here: `http://www.cs.cornell.edu/people/pabo/movie-review-data/`
|
||||
|
||||
It expects a directory structure like: `data_dir/train/{pos|neg}`
|
||||
and `data_dir/test/{pos|neg}`. Put (say) 90% of the files in the former
|
||||
and the remainder in the latter.
|
||||
"""
|
||||
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import print_function
|
||||
from __future__ import division
|
||||
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
import numpy
|
||||
import plac
|
||||
|
||||
import spacy.en
|
||||
|
||||
|
||||
def read_data(nlp, data_dir):
|
||||
for subdir, label in (('pos', 1), ('neg', 0)):
|
||||
for filename in (data_dir / subdir).iterdir():
|
||||
text = filename.open().read()
|
||||
doc = nlp(text)
|
||||
if len(doc) >= 1:
|
||||
yield doc, label
|
||||
|
||||
|
||||
def partition(examples, split_size):
|
||||
examples = list(examples)
|
||||
numpy.random.shuffle(examples)
|
||||
n_docs = len(examples)
|
||||
split = int(n_docs * split_size)
|
||||
return examples[:split], examples[split:]
|
||||
|
||||
|
||||
def minibatch(data, bs=24):
|
||||
for i in range(0, len(data), bs):
|
||||
yield data[i:i+bs]
|
||||
|
||||
|
||||
class Extractor(object):
|
||||
def __init__(self, nlp, vector_length, dropout=0.3):
|
||||
self.nlp = nlp
|
||||
self.dropout = dropout
|
||||
self.vector = numpy.zeros((vector_length, ))
|
||||
|
||||
def doc2bow(self, doc, dropout=None):
|
||||
if dropout is None:
|
||||
dropout = self.dropout
|
||||
bow = defaultdict(int)
|
||||
all_words = defaultdict(int)
|
||||
for word in doc:
|
||||
if numpy.random.random() >= dropout and not word.is_punct:
|
||||
bow[word.lower] += 1
|
||||
all_words[word.lower] += 1
|
||||
if sum(bow.values()) >= 1:
|
||||
return bow
|
||||
else:
|
||||
return all_words
|
||||
|
||||
def bow2vec(self, bow, E):
|
||||
self.vector.fill(0)
|
||||
n = 0
|
||||
for orth_id, freq in bow.items():
|
||||
self.vector += self.nlp.vocab[self.nlp.vocab.strings[orth_id]].vector * freq
|
||||
# Apply the fine-tuning we've learned
|
||||
if orth_id < E.shape[0]:
|
||||
self.vector += E[orth_id] * freq
|
||||
n += freq
|
||||
return self.vector / n
|
||||
|
||||
|
||||
class NeuralNetwork(object):
|
||||
def __init__(self, depth, width, n_classes, n_vocab, extracter, optimizer):
|
||||
self.depth = depth
|
||||
self.width = width
|
||||
self.n_classes = n_classes
|
||||
self.weights = Params.random(depth, width, width, n_classes, n_vocab)
|
||||
self.doc2bow = extracter.doc2bow
|
||||
self.bow2vec = extracter.bow2vec
|
||||
self.optimizer = optimizer
|
||||
self._gradient = Params.zero(depth, width, width, n_classes, n_vocab)
|
||||
self._activity = numpy.zeros((depth, width))
|
||||
|
||||
def train(self, batch):
|
||||
activity = self._activity
|
||||
gradient = self._gradient
|
||||
activity.fill(0)
|
||||
gradient.data.fill(0)
|
||||
loss = 0
|
||||
word_freqs = defaultdict(int)
|
||||
for doc, label in batch:
|
||||
word_ids = self.doc2bow(doc)
|
||||
vector = self.bow2vec(word_ids, self.weights.E)
|
||||
self.forward(activity, vector)
|
||||
loss += self.backprop(vector, gradient, activity, word_ids, label)
|
||||
for w, freq in word_ids.items():
|
||||
word_freqs[w] += freq
|
||||
self.optimizer(self.weights, gradient, len(batch), word_freqs)
|
||||
return loss
|
||||
|
||||
def predict(self, doc):
|
||||
actv = self._activity
|
||||
actv.fill(0)
|
||||
W = self.weights.W
|
||||
b = self.weights.b
|
||||
E = self.weights.E
|
||||
|
||||
vector = self.bow2vec(self.doc2bow(doc, dropout=0.0), E)
|
||||
self.forward(actv, vector)
|
||||
return numpy.argmax(softmax(actv[-1], W[-1], b[-1]))
|
||||
|
||||
def forward(self, actv, in_):
|
||||
actv.fill(0)
|
||||
W = self.weights.W; b = self.weights.b
|
||||
actv[0] = relu(in_, W[0], b[0])
|
||||
for i in range(1, self.depth):
|
||||
actv[i] = relu(actv[i-1], W[i], b[i])
|
||||
|
||||
def backprop(self, input_vector, gradient, activity, ids, label):
|
||||
W = self.weights.W
|
||||
b = self.weights.b
|
||||
|
||||
target = numpy.zeros(self.n_classes)
|
||||
target[label] = 1.0
|
||||
pred = softmax(activity[-1], W[-1], b[-1])
|
||||
delta = pred - target
|
||||
|
||||
for i in range(self.depth, 0, -1):
|
||||
gradient.b[i] += delta
|
||||
gradient.W[i] += numpy.outer(delta, activity[i-1])
|
||||
delta = d_relu(activity[i-1]) * W[i].T.dot(delta)
|
||||
|
||||
gradient.b[0] += delta
|
||||
gradient.W[0] += numpy.outer(delta, input_vector)
|
||||
tuning = W[0].T.dot(delta).reshape((self.width,)) / len(ids)
|
||||
for w, freq in ids.items():
|
||||
if w < gradient.E.shape[0]:
|
||||
gradient.E[w] += tuning * freq
|
||||
return -sum(target * numpy.log(pred))
|
||||
|
||||
|
||||
def softmax(actvn, W, b):
|
||||
w = W.dot(actvn) + b
|
||||
ew = numpy.exp(w - max(w))
|
||||
return (ew / sum(ew)).ravel()
|
||||
|
||||
|
||||
def relu(actvn, W, b):
|
||||
x = W.dot(actvn) + b
|
||||
return x * (x > 0)
|
||||
|
||||
|
||||
def d_relu(x):
|
||||
return x > 0
|
||||
|
||||
|
||||
class Adagrad(object):
|
||||
def __init__(self, lr, rho):
|
||||
self.eps = 1e-3
|
||||
# initial learning rate
|
||||
self.learning_rate = lr
|
||||
self.rho = rho
|
||||
# stores sum of squared gradients
|
||||
#self.h = numpy.zeros(self.dim)
|
||||
#self._curr_rate = numpy.zeros(self.h.shape)
|
||||
self.h = None
|
||||
self._curr_rate = None
|
||||
|
||||
def __call__(self, weights, gradient, batch_size, word_freqs):
|
||||
if self.h is None:
|
||||
self.h = numpy.zeros(gradient.data.shape)
|
||||
self._curr_rate = numpy.zeros(gradient.data.shape)
|
||||
self.L2_penalty(gradient, weights, word_freqs)
|
||||
update = self.rescale(gradient.data / batch_size)
|
||||
weights.data -= update
|
||||
|
||||
def rescale(self, gradient):
|
||||
if self.h is None:
|
||||
self.h = numpy.zeros(gradient.data.shape)
|
||||
self._curr_rate = numpy.zeros(gradient.data.shape)
|
||||
self._curr_rate.fill(0)
|
||||
self.h += gradient ** 2
|
||||
self._curr_rate = self.learning_rate / (numpy.sqrt(self.h) + self.eps)
|
||||
return self._curr_rate * gradient
|
||||
|
||||
def L2_penalty(self, gradient, weights, word_freqs):
|
||||
# L2 Regularization
|
||||
for i in range(len(weights.W)):
|
||||
gradient.W[i] += weights.W[i] * self.rho
|
||||
gradient.b[i] += weights.b[i] * self.rho
|
||||
for w, freq in word_freqs.items():
|
||||
if w < gradient.E.shape[0]:
|
||||
gradient.E[w] += weights.E[w] * self.rho
|
||||
|
||||
|
||||
class Params(object):
|
||||
@classmethod
|
||||
def zero(cls, depth, n_embed, n_hidden, n_labels, n_vocab):
|
||||
return cls(depth, n_embed, n_hidden, n_labels, n_vocab, lambda x: numpy.zeros((x,)))
|
||||
|
||||
@classmethod
|
||||
def random(cls, depth, nE, nH, nL, nV):
|
||||
return cls(depth, nE, nH, nL, nV, lambda x: (numpy.random.rand(x) * 2 - 1) * 0.08)
|
||||
|
||||
def __init__(self, depth, n_embed, n_hidden, n_labels, n_vocab, initializer):
|
||||
nE = n_embed; nH = n_hidden; nL = n_labels; nV = n_vocab
|
||||
n_weights = sum([
|
||||
(nE * nH) + nH,
|
||||
(nH * nH + nH) * depth,
|
||||
(nH * nL) + nL,
|
||||
(nV * nE)
|
||||
])
|
||||
self.data = initializer(n_weights)
|
||||
self.W = []
|
||||
self.b = []
|
||||
i = self._add_layer(0, nE, nH)
|
||||
for _ in range(1, depth):
|
||||
i = self._add_layer(i, nH, nH)
|
||||
i = self._add_layer(i, nL, nH)
|
||||
self.E = self.data[i : i + (nV * nE)].reshape((nV, nE))
|
||||
self.E.fill(0)
|
||||
|
||||
def _add_layer(self, start, x, y):
|
||||
end = start + (x * y)
|
||||
self.W.append(self.data[start : end].reshape((x, y)))
|
||||
self.b.append(self.data[end : end + x].reshape((x, )))
|
||||
return end + x
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
data_dir=("Data directory", "positional", None, Path),
|
||||
n_iter=("Number of iterations (epochs)", "option", "i", int),
|
||||
width=("Size of hidden layers", "option", "H", int),
|
||||
depth=("Depth", "option", "d", int),
|
||||
dropout=("Drop-out rate", "option", "r", float),
|
||||
rho=("Regularization penalty", "option", "p", float),
|
||||
eta=("Learning rate", "option", "e", float),
|
||||
batch_size=("Batch size", "option", "b", int),
|
||||
vocab_size=("Number of words to fine-tune", "option", "w", int),
|
||||
)
|
||||
def main(data_dir, depth=3, width=300, n_iter=5, vocab_size=40000,
|
||||
batch_size=24, dropout=0.3, rho=1e-5, eta=0.005):
|
||||
n_classes = 2
|
||||
print("Loading")
|
||||
nlp = spacy.en.English(parser=False)
|
||||
train_data, dev_data = partition(read_data(nlp, data_dir / 'train'), 0.8)
|
||||
print("Begin training")
|
||||
extracter = Extractor(nlp, width, dropout=dropout)
|
||||
optimizer = Adagrad(eta, rho)
|
||||
model = NeuralNetwork(depth, width, n_classes, vocab_size, extracter, optimizer)
|
||||
prev_best = 0
|
||||
best_weights = None
|
||||
for epoch in range(n_iter):
|
||||
numpy.random.shuffle(train_data)
|
||||
train_loss = 0.0
|
||||
for batch in minibatch(train_data, bs=batch_size):
|
||||
train_loss += model.train(batch)
|
||||
n_correct = sum(model.predict(x) == y for x, y in dev_data)
|
||||
print(epoch, train_loss, n_correct / len(dev_data))
|
||||
if n_correct >= prev_best:
|
||||
best_weights = model.weights.data.copy()
|
||||
prev_best = n_correct
|
||||
|
||||
model.weights.data = best_weights
|
||||
print("Evaluating")
|
||||
eval_data = list(read_data(nlp, data_dir / 'test'))
|
||||
n_correct = sum(model.predict(x) == y for x, y in eval_data)
|
||||
print(n_correct / len(eval_data))
|
||||
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
#import cProfile
|
||||
#import pstats
|
||||
#cProfile.runctx("main(Path('data/aclImdb'))", globals(), locals(), "Profile.prof")
|
||||
#s = pstats.Stats("Profile.prof")
|
||||
#s.strip_dirs().sort_stats("time").print_stats(100)
|
||||
plac.call(main)
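A quick numpy-only sketch of the exp-normalise step used in softmax() above: subtracting the maximum before exponentiating avoids overflow without changing the result (stable_softmax is just an illustrative standalone name):

import numpy

def stable_softmax(w):
    ew = numpy.exp(w - max(w))
    return ew / sum(ew)

print(stable_softmax(numpy.array([1.0, 2.0, 3.0])))           # ~[0.09, 0.245, 0.665]
print(stable_softmax(numpy.array([1001.0, 1002.0, 1003.0])))  # same values, no overflow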
|
|
@ -1,74 +0,0 @@
|
|||
from __future__ import print_function, unicode_literals, division
|
||||
import io
|
||||
import bz2
|
||||
import logging
|
||||
from toolz import partition
|
||||
from os import path, makedirs
|
||||
import re
|
||||
|
||||
import spacy.en
|
||||
from spacy.tokens import Doc
|
||||
|
||||
from joblib import Parallel, delayed
|
||||
import plac
|
||||
import ujson
|
||||
|
||||
|
||||
def parallelize(func, iterator, n_jobs, extra, backend='multiprocessing'):
|
||||
extra = tuple(extra)
|
||||
return Parallel(n_jobs=n_jobs, backend=backend)(delayed(func)(*(item + extra))
|
||||
for item in iterator)
|
||||
|
||||
|
||||
def iter_comments(loc):
|
||||
with bz2.BZ2File(loc) as file_:
|
||||
for i, line in enumerate(file_):
|
||||
yield ujson.loads(line)['body']
|
||||
|
||||
|
||||
pre_format_re = re.compile(r'^[\`\*\~]')
|
||||
post_format_re = re.compile(r'[\`\*\~]$')
|
||||
url_re = re.compile(r'\[([^]]+)\]\(%%URL\)')
|
||||
link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
|
||||
def strip_meta(text):
|
||||
text = link_re.sub(r'\1', text)
|
||||
text = text.replace('&gt;', '>').replace('&lt;', '<')
|
||||
text = pre_format_re.sub('', text)
|
||||
text = post_format_re.sub('', text)
|
||||
return text.strip()
|
||||
|
||||
|
||||
def save_parses(batch_id, input_, out_dir, n_threads, batch_size):
|
||||
out_loc = path.join(out_dir, '%d.bin' % batch_id)
|
||||
if path.exists(out_loc):
|
||||
return None
|
||||
print('Batch', batch_id)
|
||||
nlp = spacy.en.English()
|
||||
nlp.matcher = None
|
||||
with open(out_loc, 'wb') as file_:
|
||||
texts = (strip_meta(text) for text in input_)
|
||||
texts = (text for text in texts if text.strip())
|
||||
for doc in nlp.pipe(texts, batch_size=batch_size, n_threads=n_threads):
|
||||
file_.write(doc.to_bytes())
|
||||
|
||||
@plac.annotations(
|
||||
in_loc=("Location of input file"),
|
||||
out_dir=("Location of input file"),
|
||||
n_process=("Number of processes", "option", "p", int),
|
||||
n_thread=("Number of threads per process", "option", "t", int),
|
||||
batch_size=("Number of texts to accumulate in a buffer", "option", "b", int)
|
||||
)
|
||||
def main(in_loc, out_dir, n_process=1, n_thread=4, batch_size=100):
|
||||
if not path.exists(out_dir):
|
||||
makedirs(out_dir)  # create the output directory if it doesn't exist
|
||||
if n_process >= 2:
|
||||
texts = partition(200000, iter_comments(in_loc))
|
||||
parallelize(save_parses, enumerate(texts), n_process, [out_dir, n_thread, batch_size],
|
||||
backend='multiprocessing')
|
||||
else:
|
||||
save_parses(0, iter_comments(in_loc), out_dir, n_thread, batch_size)
|
||||
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
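A quick standalone check of the markdown-link regex used in strip_meta() above; it keeps the link text and drops the URL:

import re

link_re = re.compile(r'\[([^]]+)\]\(https?://[^\)]+\)')
print(link_re.sub(r'\1', 'see [the docs](https://spacy.io) for details'))
# -> see the docs for details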
|
76
examples/pipeline/custom_attr_methods.py
Normal file
|
@ -0,0 +1,76 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
"""This example contains several snippets of methods that can be set via custom
|
||||
Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like
|
||||
they're "bound" to the object and are partially applied – i.e. the object
|
||||
they're called on is passed in as the first argument.
|
||||
|
||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc, Span
|
||||
from spacy import displacy
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
output_dir=("Output directory for saved HTML", "positional", None, Path))
|
||||
def main(output_dir=None):
|
||||
nlp = English() # start off with blank English class
|
||||
|
||||
Doc.set_extension('overlap', method=overlap_tokens)
|
||||
doc1 = nlp(u"Peach emoji is where it has always been.")
|
||||
doc2 = nlp(u"Peach is the superior emoji.")
|
||||
print("Text 1:", doc1.text)
|
||||
print("Text 2:", doc2.text)
|
||||
print("Overlapping tokens:", doc1._.overlap(doc2))
|
||||
|
||||
Doc.set_extension('to_html', method=to_html)
|
||||
doc = nlp(u"This is a sentence about Apple.")
|
||||
# add entity manually for demo purposes, to make it work without a model
|
||||
doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings['ORG'])]
|
||||
print("Text:", doc.text)
|
||||
doc._.to_html(output=output_dir, style='ent')
|
||||
|
||||
|
||||
def to_html(doc, output='/tmp', style='dep'):
|
||||
"""Doc method extension for saving the current state as a displaCy
|
||||
visualization.
|
||||
"""
|
||||
# generate filename from first six non-punct tokens
|
||||
file_name = '-'.join([w.text for w in doc[:6] if not w.is_punct]) + '.html'
|
||||
html = displacy.render(doc, style=style, page=True) # render markup
|
||||
if output is not None:
|
||||
output_path = Path(output)
|
||||
if not output_path.exists():
|
||||
output_path.mkdir()
|
||||
output_file = Path(output) / file_name
|
||||
output_file.open('w', encoding='utf-8').write(html) # save to file
|
||||
print('Saved HTML to {}'.format(output_file))
|
||||
else:
|
||||
print(html)
|
||||
|
||||
|
||||
def overlap_tokens(doc, other_doc):
|
||||
"""Get the tokens from the original Doc that are also in the comparison Doc.
|
||||
"""
|
||||
overlap = []
|
||||
other_tokens = [token.text for token in other_doc]
|
||||
for token in doc:
|
||||
if token.text in other_tokens:
|
||||
overlap.append(token)
|
||||
return overlap
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Text 1: Peach emoji is where it has always been.
|
||||
# Text 2: Peach is the superior emoji.
|
||||
# Overlapping tokens: [Peach, emoji, is, .]
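Besides method extensions like 'overlap' and 'to_html' above, Doc.set_extension also accepts getters, which are computed lazily on attribute access. A minimal sketch, assuming spaCy v2.x; 'token_count' is only an illustrative attribute name:

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
Doc.set_extension('token_count', getter=lambda doc: len(doc))
doc = nlp(u"Peach emoji is where it has always been.")
print(doc._.token_count)  # 9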
|
124
examples/pipeline/custom_component_countries_api.py
Normal file
|
@ -0,0 +1,124 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of a spaCy v2.0 pipeline component that requests all countries via
|
||||
the REST Countries API, merges country names into one token, assigns entity
|
||||
labels and sets attributes on country tokens, e.g. the capital and lat/lng
|
||||
coordinates. Can be extended with more details from the API.
|
||||
|
||||
* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0)
|
||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import requests
|
||||
import plac
|
||||
from spacy.lang.en import English
|
||||
from spacy.matcher import PhraseMatcher
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
|
||||
|
||||
def main():
|
||||
# For simplicity, we start off with only the blank English Language class
|
||||
# and no model or pre-defined pipeline loaded.
|
||||
nlp = English()
|
||||
rest_countries = RESTCountriesComponent(nlp) # initialise component
|
||||
nlp.add_pipe(rest_countries) # add it to the pipeline
|
||||
doc = nlp(u"Some text about Colombia and the Czech Republic")
|
||||
print('Pipeline', nlp.pipe_names) # pipeline contains component name
|
||||
print('Doc has countries', doc._.has_country) # Doc contains countries
|
||||
for token in doc:
|
||||
if token._.is_country:
|
||||
print(token.text, token._.country_capital, token._.country_latlng,
|
||||
token._.country_flag) # country data
|
||||
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
|
||||
|
||||
|
||||
class RESTCountriesComponent(object):
|
||||
"""spaCy v2.0 pipeline component that requests all countries via
|
||||
the REST Countries API, merges country names into one token, assigns entity
|
||||
labels and sets attributes on country tokens.
|
||||
"""
|
||||
name = 'rest_countries' # component name, will show up in the pipeline
|
||||
|
||||
def __init__(self, nlp, label='GPE'):
|
||||
"""Initialise the pipeline component. The shared nlp instance is used
|
||||
to initialise the matcher with the shared vocab, get the label ID and
|
||||
generate Doc objects as phrase match patterns.
|
||||
"""
|
||||
# Make request once on initialisation and store the data
|
||||
r = requests.get('https://restcountries.eu/rest/v2/all')
|
||||
r.raise_for_status() # make sure requests raises an error if it fails
|
||||
countries = r.json()
|
||||
|
||||
# Convert API response to dict keyed by country name for easy lookup
|
||||
# This could also be extended using the alternative and foreign language
|
||||
# names provided by the API
|
||||
self.countries = {c['name']: c for c in countries}
|
||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
||||
|
||||
# Set up the PhraseMatcher with Doc patterns for each country name
|
||||
patterns = [nlp(c) for c in self.countries.keys()]
|
||||
self.matcher = PhraseMatcher(nlp.vocab)
|
||||
self.matcher.add('COUNTRIES', None, *patterns)
|
||||
|
||||
# Register attribute on the Token. We'll be overwriting this based on
|
||||
# the matches, so we're only setting a default value, not a getter.
|
||||
# If no default value is set, it defaults to None.
|
||||
Token.set_extension('is_country', default=False)
|
||||
Token.set_extension('country_capital')
|
||||
Token.set_extension('country_latlng')
|
||||
Token.set_extension('country_flag')
|
||||
|
||||
# Register attributes on Doc and Span via a getter that checks if one of
|
||||
# the contained tokens is set to is_country == True.
|
||||
Doc.set_extension('has_country', getter=self.has_country)
|
||||
Span.set_extension('has_country', getter=self.has_country)
|
||||
|
||||
|
||||
def __call__(self, doc):
|
||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
||||
are found. Return the Doc, so it can be processed by the next component
|
||||
in the pipeline, if available.
|
||||
"""
|
||||
matches = self.matcher(doc)
|
||||
spans = [] # keep the spans for later so we can merge them afterwards
|
||||
for _, start, end in matches:
|
||||
# Generate Span representing the entity & set label
|
||||
entity = Span(doc, start, end, label=self.label)
|
||||
spans.append(entity)
|
||||
# Set custom attribute on each token of the entity
|
||||
# Can be extended with other data returned by the API, like
|
||||
# currencies, country code, flag, calling code etc.
|
||||
for token in entity:
|
||||
token._.set('is_country', True)
|
||||
token._.set('country_capital', self.countries[entity.text]['capital'])
|
||||
token._.set('country_latlng', self.countries[entity.text]['latlng'])
|
||||
token._.set('country_flag', self.countries[entity.text]['flag'])
|
||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
||||
doc.ents = list(doc.ents) + [entity]
|
||||
for span in spans:
|
||||
# Iterate over all spans and merge them into one token. This is done
|
||||
# after setting the entities – otherwise, it would cause mismatched
|
||||
# indices!
|
||||
span.merge()
|
||||
return doc # don't forget to return the Doc!
|
||||
|
||||
def has_country(self, tokens):
|
||||
"""Getter for Doc and Span attributes. Returns True if one of the tokens
|
||||
is a country. Since the getter is only called when we access the
|
||||
attribute, we can refer to the Token's 'is_country' attribute here,
|
||||
which is already set in the processing step."""
|
||||
return any([t._.get('is_country') for t in tokens])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Pipeline ['rest_countries']
|
||||
# Doc has countries True
|
||||
# Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg
|
||||
# Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg
|
||||
# Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')]
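The component above combines matching, merging and extensions. At its core, though, a spaCy v2.0 pipeline component is just a callable that receives a Doc and returns it. A minimal sketch, with print_length as an illustrative stand-in:

from spacy.lang.en import English

def print_length(doc):
    print('Doc length:', len(doc))
    return doc  # a component must return the Doc

nlp = English()
nlp.add_pipe(print_length, name='print_length', last=True)
nlp(u"Some text about Colombia and the Czech Republic")
print(nlp.pipe_names)  # ['print_length']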
|
112
examples/pipeline/custom_component_entities.py
Normal file
|
@ -0,0 +1,112 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
||||
based on a list of single or multiple-word company names. Companies are
|
||||
labelled as ORG and their spans are merged into one token. Additionally,
|
||||
._.has_tech_org and ._.is_tech_org are set on the Doc/Span and Token
|
||||
respectively.
|
||||
|
||||
* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
from spacy.lang.en import English
|
||||
from spacy.matcher import PhraseMatcher
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
text=("Text to process", "positional", None, str),
|
||||
companies=("Names of technology companies", "positional", None, str))
|
||||
def main(text="Alphabet Inc. is the company behind Google.", *companies):
|
||||
# For simplicity, we start off with only the blank English Language class
|
||||
# and no model or pre-defined pipeline loaded.
|
||||
nlp = English()
|
||||
if not companies: # set default companies if none are set via args
|
||||
companies = ['Alphabet Inc.', 'Google', 'Netflix', 'Apple'] # etc.
|
||||
component = TechCompanyRecognizer(nlp, companies) # initialise component
|
||||
nlp.add_pipe(component, last=True) # add last to the pipeline
|
||||
|
||||
doc = nlp(text)
|
||||
print('Pipeline', nlp.pipe_names) # pipeline contains component name
|
||||
print('Tokens', [t.text for t in doc]) # company names from the list are merged
|
||||
print('Doc has_tech_org', doc._.has_tech_org) # Doc contains tech orgs
|
||||
print('Token 0 is_tech_org', doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org
|
||||
print('Token 1 is_tech_org', doc[1]._.is_tech_org) # "is" is not
|
||||
print('Entities', [(e.text, e.label_) for e in doc.ents]) # all orgs are entities
|
||||
|
||||
|
||||
class TechCompanyRecognizer(object):
|
||||
"""Example of a spaCy v2.0 pipeline component that sets entity annotations
|
||||
based on a list of single or multiple-word company names. Companies are
|
||||
labelled as ORG and their spans are merged into one token. Additionally,
|
||||
._.has_tech_org and ._.is_tech_org are set on the Doc/Span and Token
|
||||
respectively."""
|
||||
name = 'tech_companies' # component name, will show up in the pipeline
|
||||
|
||||
def __init__(self, nlp, companies=tuple(), label='ORG'):
|
||||
"""Initialise the pipeline component. The shared nlp instance is used
|
||||
to initialise the matcher with the shared vocab, get the label ID and
|
||||
generate Doc objects as phrase match patterns.
|
||||
"""
|
||||
self.label = nlp.vocab.strings[label] # get entity label ID
|
||||
|
||||
# Set up the PhraseMatcher – it can now take Doc objects as patterns,
|
||||
# so even if the list of companies is long, it's very efficient
|
||||
patterns = [nlp(org) for org in companies]
|
||||
self.matcher = PhraseMatcher(nlp.vocab)
|
||||
self.matcher.add('TECH_ORGS', None, *patterns)
|
||||
|
||||
# Register attribute on the Token. We'll be overwriting this based on
|
||||
# the matches, so we're only setting a default value, not a getter.
|
||||
Token.set_extension('is_tech_org', default=False)
|
||||
|
||||
# Register attributes on Doc and Span via a getter that checks if one of
|
||||
# the contained tokens is set to is_tech_org == True.
|
||||
Doc.set_extension('has_tech_org', getter=self.has_tech_org)
|
||||
Span.set_extension('has_tech_org', getter=self.has_tech_org)
|
||||
|
||||
def __call__(self, doc):
|
||||
"""Apply the pipeline component on a Doc object and modify it if matches
|
||||
are found. Return the Doc, so it can be processed by the next component
|
||||
in the pipeline, if available.
|
||||
"""
|
||||
matches = self.matcher(doc)
|
||||
spans = [] # keep the spans for later so we can merge them afterwards
|
||||
for _, start, end in matches:
|
||||
# Generate Span representing the entity & set label
|
||||
entity = Span(doc, start, end, label=self.label)
|
||||
spans.append(entity)
|
||||
# Set custom attribute on each token of the entity
|
||||
for token in entity:
|
||||
token._.set('is_tech_org', True)
|
||||
# Overwrite doc.ents and add entity – be careful not to replace!
|
||||
doc.ents = list(doc.ents) + [entity]
|
||||
for span in spans:
|
||||
# Iterate over all spans and merge them into one token. This is done
|
||||
# after setting the entities – otherwise, it would cause mismatched
|
||||
# indices!
|
||||
span.merge()
|
||||
return doc # don't forget to return the Doc!
|
||||
|
||||
def has_tech_org(self, tokens):
|
||||
"""Getter for Doc and Span attributes. Returns True if one of the tokens
|
||||
is a tech org. Since the getter is only called when we access the
|
||||
attribute, we can refer to the Token's 'is_tech_org' attribute here,
|
||||
which is already set in the processing step."""
|
||||
return any([t._.get('is_tech_org') for t in tokens])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Pipeline ['tech_companies']
|
||||
# Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.']
|
||||
# Doc has_tech_org True
|
||||
# Token 0 is_tech_org True
|
||||
# Token 1 is_tech_org False
|
||||
# Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')]
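Both this component and the countries example rely on the PhraseMatcher, which takes Doc objects as patterns and returns (match_id, start, end) tuples. A minimal standalone sketch, assuming spaCy v2.x:

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add('TECH_ORGS', None, nlp(u"Google"), nlp(u"Alphabet Inc."))
doc = nlp(u"Alphabet Inc. is the company behind Google.")
print([doc[start:end].text for match_id, start, end in matcher(doc)])
# -> ['Alphabet Inc.', 'Google']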
|
78
examples/pipeline/multi_processing.py
Normal file
|
@ -0,0 +1,78 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of multi-processing with Joblib. Here, we're exporting
|
||||
part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with
|
||||
each "sentence" on a newline, and spaces between tokens. Data is loaded from
|
||||
the IMDB movie reviews dataset and will be loaded automatically via Thinc's
|
||||
built-in dataset loader.
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals
|
||||
from toolz import partition_all
|
||||
from pathlib import Path
|
||||
from joblib import Parallel, delayed
|
||||
import thinc.extra.datasets
|
||||
import plac
|
||||
import spacy
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
output_dir=("Output directory", "positional", None, Path),
|
||||
model=("Model name (needs tagger)", "positional", None, str),
|
||||
n_jobs=("Number of workers", "option", "n", int),
|
||||
batch_size=("Batch-size for each process", "option", "b", int),
|
||||
limit=("Limit of entries from the dataset", "option", "l", int))
|
||||
def main(output_dir, model='en_core_web_sm', n_jobs=4, batch_size=1000,
|
||||
limit=10000):
|
||||
nlp = spacy.load(model) # load spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
# load and pre-process the IMDB dataset
|
||||
print("Loading IMDB data...")
|
||||
data, _ = thinc.extra.datasets.imdb()
|
||||
texts, _ = zip(*data[-limit:])
|
||||
print("Processing texts...")
|
||||
partitions = partition_all(batch_size, texts)
|
||||
executor = Parallel(n_jobs=n_jobs)
|
||||
do = delayed(transform_texts)
|
||||
tasks = (do(nlp, i, batch, output_dir)
|
||||
for i, batch in enumerate(partitions))
|
||||
executor(tasks)
|
||||
|
||||
|
||||
def transform_texts(nlp, batch_id, texts, output_dir):
|
||||
print(nlp.pipe_names)
|
||||
out_path = Path(output_dir) / ('%d.txt' % batch_id)
|
||||
if out_path.exists(): # return None in case same batch is called again
|
||||
return None
|
||||
print('Processing batch', batch_id)
|
||||
with out_path.open('w', encoding='utf8') as f:
|
||||
for doc in nlp.pipe(texts):
|
||||
f.write(' '.join(represent_word(w) for w in doc if not w.is_space))
|
||||
f.write('\n')
|
||||
print('Saved {} texts to {}.txt'.format(len(texts), batch_id))
|
||||
|
||||
|
||||
def represent_word(word):
|
||||
text = word.text
|
||||
# True-case, i.e. try to normalize sentence-initial capitals.
|
||||
# Only do this if the lower-cased form is more probable.
|
||||
if text.istitle() and is_sent_begin(word) \
|
||||
and word.prob < word.doc.vocab[text.lower()].prob:
|
||||
text = text.lower()
|
||||
return text + '|' + word.tag_
|
||||
|
||||
|
||||
def is_sent_begin(word):
|
||||
if word.i == 0:
|
||||
return True
|
||||
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
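The batching pattern above, partition the stream and hand each batch to a joblib worker, also works independently of spaCy. A small self-contained sketch; work() is just a stand-in task:

from joblib import Parallel, delayed
from toolz import partition_all

def work(batch_id, batch):
    return batch_id, sum(batch)

batches = partition_all(3, range(10))  # tuples of up to 3 items
tasks = (delayed(work)(i, list(b)) for i, b in enumerate(batches))
print(Parallel(n_jobs=2)(tasks))
# -> [(0, 3), (1, 12), (2, 21), (3, 9)]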
|
|
@ -1,90 +0,0 @@
|
|||
"""
|
||||
Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
|
||||
text, with each "sentence" on a newline, and spaces between tokens. Supports
|
||||
multi-processing.
|
||||
"""
|
||||
from __future__ import print_function, unicode_literals, division
|
||||
import io
|
||||
import bz2
|
||||
import logging
|
||||
from toolz import partition
|
||||
from os import path, makedirs
|
||||
|
||||
import spacy.en
|
||||
|
||||
from joblib import Parallel, delayed
|
||||
import plac
|
||||
import ujson
|
||||
|
||||
|
||||
def parallelize(func, iterator, n_jobs, extra):
|
||||
extra = tuple(extra)
|
||||
return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
|
||||
|
||||
|
||||
def iter_texts_from_json_bz2(loc):
|
||||
"""
|
||||
Iterator of unicode strings, one per document (here, a comment).
|
||||
|
||||
Expects a path to a BZ2 file, which should be newline-delimited JSON. The
|
||||
document text should be in a string field titled 'body'.
|
||||
|
||||
This is the data format of the Reddit comments corpus.
|
||||
"""
|
||||
with bz2.BZ2File(loc) as file_:
|
||||
for i, line in enumerate(file_):
|
||||
yield ujson.loads(line)['body']
|
||||
|
||||
|
||||
def transform_texts(batch_id, input_, out_dir):
|
||||
out_loc = path.join(out_dir, '%d.txt' % batch_id)
|
||||
if path.exists(out_loc):
|
||||
return None
|
||||
print('Batch', batch_id)
|
||||
nlp = spacy.en.English(parser=False, entity=False)
|
||||
with io.open(out_loc, 'w', encoding='utf8') as file_:
|
||||
for text in input_:
|
||||
doc = nlp(text)
|
||||
file_.write(' '.join(represent_word(w) for w in doc if not w.is_space))
|
||||
file_.write('\n')
|
||||
|
||||
|
||||
def represent_word(word):
|
||||
text = word.text
|
||||
# True-case, i.e. try to normalize sentence-initial capitals.
|
||||
# Only do this if the lower-cased form is more probable.
|
||||
if text.istitle() \
|
||||
and is_sent_begin(word) \
|
||||
and word.prob < word.doc.vocab[text.lower()].prob:
|
||||
text = text.lower()
|
||||
return text + '|' + word.tag_
|
||||
|
||||
|
||||
def is_sent_begin(word):
|
||||
# It'd be nice to have some heuristics like these in the library, for these
|
||||
# times where we don't care so much about accuracy of SBD, and we don't want
|
||||
# to parse
|
||||
if word.i == 0:
|
||||
return True
|
||||
elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
in_loc=("Location of input file"),
|
||||
out_dir=("Location of input file"),
|
||||
n_workers=("Number of workers", "option", "n", int),
|
||||
batch_size=("Batch-size for each process", "option", "b", int)
|
||||
)
|
||||
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
|
||||
if not path.exists(out_dir):
|
||||
makedirs(out_dir)  # create the output directory if it doesn't exist
|
||||
texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
|
||||
parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
|
@ -1,22 +0,0 @@
|
|||
# Load NER
|
||||
from __future__ import unicode_literals
|
||||
import spacy
|
||||
import pathlib
|
||||
from spacy.pipeline import EntityRecognizer
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
def load_model(model_dir):
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
|
||||
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
|
||||
nlp.vocab.strings.load(file_)
|
||||
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
|
||||
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
|
||||
return (nlp, ner)
|
||||
|
||||
(nlp, ner) = load_model('ner')
|
||||
doc = nlp.make_doc('Who is Shaka Khan?')
|
||||
nlp.tagger(doc)
|
||||
ner(doc)
|
||||
for word in doc:
|
||||
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
|
148
examples/training/train_intent_parser.py
Normal file
|
@ -0,0 +1,148 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
"""Using the parser to recognise your own semantics
|
||||
|
||||
spaCy's parser component can be trained to predict any type of tree
|
||||
structure over your input text. You can also predict trees over whole documents
|
||||
or chat logs, with connections between the sentence-roots used to annotate
|
||||
discourse structure. In this example, we'll build a message parser for a common
|
||||
"chat intent": finding local businesses. Our message semantics will have the
|
||||
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
|
||||
|
||||
"show me the best hotel in berlin"
|
||||
('show', 'ROOT', 'show')
|
||||
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
|
||||
('hotel', 'PLACE', 'show') --> show PLACE hotel
|
||||
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
import spacy
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
# training data: texts, heads and dependency labels
|
||||
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
|
||||
TRAIN_DATA = [
|
||||
("find a cafe with great wifi", {
|
||||
'heads': [0, 2, 0, 5, 5, 2], # index of token head
|
||||
'deps': ['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
|
||||
}),
|
||||
("find a hotel near the beach", {
|
||||
'heads': [0, 2, 0, 5, 5, 2],
|
||||
'deps': ['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
|
||||
}),
|
||||
("find me the closest gym that's open late", {
|
||||
'heads': [0, 0, 4, 4, 0, 6, 4, 6, 6],
|
||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
|
||||
}),
|
||||
("show me the cheapest store that sells flowers", {
|
||||
'heads': [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
|
||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
|
||||
}),
|
||||
("find a nice restaurant in london", {
|
||||
'heads': [0, 3, 3, 0, 3, 3],
|
||||
'deps': ['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
|
||||
}),
|
||||
("show me the coolest hostel in berlin", {
|
||||
'heads': [0, 0, 4, 4, 0, 4, 4],
|
||||
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
|
||||
}),
|
||||
("find a good italian restaurant near work", {
|
||||
'heads': [0, 4, 4, 4, 0, 4, 5],
|
||||
'deps': ['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
|
||||
})
|
||||
]
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(model=None, output_dir=None, n_iter=5):
|
||||
"""Load the model, set up the pipeline and train the parser."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank('en') # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# We'll use the built-in dependency parser
|
||||
# class, but we want to create a fresh instance, and give it a different
|
||||
# name.
|
||||
if 'parser' in nlp.pipe_names:
|
||||
nlp.remove_pipe('parser')
|
||||
parser = nlp.create_pipe('parser')
|
||||
nlp.add_pipe(parser, name='intent-parser', first=True)
|
||||
|
||||
for text, annotations in TRAIN_DATA:
|
||||
for dep in annotations.get('deps', []):
|
||||
parser.add_label(dep)
|
||||
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'intent-parser']
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annotations in TRAIN_DATA:
|
||||
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
|
||||
print(losses)
|
||||
|
||||
# test the trained model
|
||||
test_model(nlp)
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
test_model(nlp2)
|
||||
|
||||
|
||||
def test_model(nlp):
|
||||
texts = ["find a hotel with good wifi",
|
||||
"find me the cheapest gym near work",
|
||||
"show me the best hotel in berlin"]
|
||||
docs = nlp.pipe(texts)
|
||||
for doc in docs:
|
||||
print(doc.text)
|
||||
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# find a hotel with good wifi
|
||||
# [
|
||||
# ('find', 'ROOT', 'find'),
|
||||
# ('hotel', 'PLACE', 'find'),
|
||||
# ('good', 'QUALITY', 'wifi'),
|
||||
# ('wifi', 'ATTRIBUTE', 'hotel')
|
||||
# ]
|
||||
# find me the cheapest gym near work
|
||||
# [
|
||||
# ('find', 'ROOT', 'find'),
|
||||
# ('cheapest', 'QUALITY', 'gym'),
|
||||
# ('gym', 'PLACE', 'find')
|
||||
# ('work', 'LOCATION', 'near')
|
||||
# ]
|
||||
# show me the best hotel in berlin
|
||||
# [
|
||||
# ('show', 'ROOT', 'show'),
|
||||
# ('best', 'QUALITY', 'hotel'),
|
||||
# ('hotel', 'PLACE', 'show'),
|
||||
# ('berlin', 'LOCATION', 'hotel')
|
||||
# ]
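To see how the 'heads' and 'deps' lists in TRAIN_DATA line up with token positions, it helps to print them next to the tokenised text. A minimal sketch, assuming spaCy v2.x and the first training example above:

from spacy.lang.en import English

nlp = English()
doc = nlp(u"find a cafe with great wifi")
heads = [0, 2, 0, 5, 5, 2]
deps = ['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
for i, token in enumerate(doc):
    print(token.text, '->', doc[heads[i]].text, deps[i])
# e.g. cafe -> find PLACE, and wifi -> cafe ATTRIBUTE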
|
|
@ -1,98 +1,106 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of training spaCy's named entity recognizer, starting off with an
|
||||
existing model or a blank model.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* NER: https://spacy.io/usage/linguistic-features#named-entities
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
import json
|
||||
import pathlib
|
||||
|
||||
import plac
|
||||
import random
|
||||
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.pipeline import EntityRecognizer
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tagger import Tagger
|
||||
|
||||
|
||||
try:
|
||||
unicode
|
||||
except:
|
||||
unicode = str
|
||||
|
||||
|
||||
def train_ner(nlp, train_data, entity_types):
|
||||
# Add new words to vocab.
|
||||
for raw_text, _ in train_data:
|
||||
doc = nlp.make_doc(raw_text)
|
||||
for word in doc:
|
||||
_ = nlp.vocab[word.orth]
|
||||
|
||||
# Train NER.
|
||||
ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
|
||||
for itn in range(5):
|
||||
random.shuffle(train_data)
|
||||
for raw_text, entity_offsets in train_data:
|
||||
doc = nlp.make_doc(raw_text)
|
||||
gold = GoldParse(doc, entities=entity_offsets)
|
||||
ner.update(doc, gold)
|
||||
return ner
|
||||
|
||||
def save_model(ner, model_dir):
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if not model_dir.exists():
|
||||
model_dir.mkdir()
|
||||
assert model_dir.is_dir()
|
||||
|
||||
with (model_dir / 'config.json').open('wb') as file_:
|
||||
data = json.dumps(ner.cfg)
|
||||
if isinstance(data, unicode):
|
||||
data = data.encode('utf8')
|
||||
file_.write(data)
|
||||
ner.model.dump(str(model_dir / 'model'))
|
||||
if not (model_dir / 'vocab').exists():
|
||||
(model_dir / 'vocab').mkdir()
|
||||
ner.vocab.dump(str(model_dir / 'vocab' / 'lexemes.bin'))
|
||||
with (model_dir / 'vocab' / 'strings.json').open('w', encoding='utf8') as file_:
|
||||
ner.vocab.strings.dump(file_)
|
||||
# training data
|
||||
TRAIN_DATA = [
|
||||
('Who is Shaka Khan?', {
|
||||
'entities': [(7, 17, 'PERSON')]
|
||||
}),
|
||||
('I like London and Berlin.', {
|
||||
'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
|
||||
})
|
||||
]
|
||||
|
||||
|
||||
def main(model_dir=None):
|
||||
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(model=None, output_dir=None, n_iter=100):
|
||||
"""Load the model, set up the pipeline and train the entity recognizer."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank('en') # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# v1.1.2 onwards
|
||||
if nlp.tagger is None:
|
||||
print('---- WARNING ----')
|
||||
print('Data directory not found')
|
||||
print('please run: `python -m spacy.en.download --force all` for better performance')
|
||||
print('Using feature templates for tagging')
|
||||
print('-----------------')
|
||||
nlp.tagger = Tagger(nlp.vocab, features=Tagger.feature_templates)
|
||||
# create the built-in pipeline components and add them to the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if 'ner' not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe('ner')
|
||||
nlp.add_pipe(ner, last=True)
|
||||
# otherwise, get it so we can add labels
|
||||
else:
|
||||
ner = nlp.get_pipe('ner')
|
||||
|
||||
train_data = [
|
||||
(
|
||||
'Who is Shaka Khan?',
|
||||
[(len('Who is '), len('Who is Shaka Khan'), 'PERSON')]
|
||||
),
|
||||
(
|
||||
'I like London and Berlin.',
|
||||
[(len('I like '), len('I like London'), 'LOC'),
|
||||
(len('I like London and '), len('I like London and Berlin'), 'LOC')]
|
||||
)
|
||||
]
|
||||
ner = train_ner(nlp, train_data, ['PERSON', 'LOC'])
|
||||
# add labels
|
||||
for _, annotations in TRAIN_DATA:
|
||||
for ent in annotations.get('entities'):
|
||||
ner.add_label(ent[2])
|
||||
|
||||
doc = nlp.make_doc('Who is Shaka Khan?')
|
||||
nlp.tagger(doc)
|
||||
ner(doc)
|
||||
for word in doc:
|
||||
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
|
||||
|
||||
if model_dir is not None:
|
||||
save_model(ner, model_dir)
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annotations in TRAIN_DATA:
|
||||
nlp.update(
|
||||
[text], # batch of texts
|
||||
[annotations], # batch of annotations
|
||||
drop=0.5, # dropout - make it harder to memorise data
|
||||
sgd=optimizer, # callable to update weights
|
||||
losses=losses)
|
||||
print(losses)
|
||||
|
||||
# test the trained model
|
||||
for text, _ in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
for text, _ in TRAIN_DATA:
|
||||
doc = nlp2(text)
|
||||
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
|
||||
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main('ner')
|
||||
# Who "" 2
|
||||
# is "" 2
|
||||
# Shaka "" PERSON 3
|
||||
# Khan "" PERSON 1
|
||||
# ? "" 2
|
||||
plac.call(main)
|
||||
|
||||
# Expected output:
|
||||
# Entities [('Shaka Khan', 'PERSON')]
|
||||
# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
|
||||
# ('Khan', 'PERSON', 1), ('?', '', 2)]
|
||||
# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
|
||||
# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
|
||||
# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
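The loop above updates on one example at a time. For larger datasets the same nlp.update() call can be fed small batches instead; a sketch of that variation, assuming spacy.util.minibatch is available in the installed spaCy v2.x (the batch size of 8 is arbitrary):

import random
from spacy.util import minibatch

def train_batched(nlp, train_data, optimizer, n_iter=10):
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.5, losses=losses)
        print(losses)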
|
||||
|
|
|
@ -1,244 +0,0 @@
|
|||
#!/usr/bin/env python
|
||||
'''Example of training a named entity recognition system from scratch using spaCy
|
||||
|
||||
This example is written to be self-contained and reasonably transparent.
|
||||
To achieve that, it duplicates some of spaCy's internal functionality.
|
||||
|
||||
Specifically, in this example, we don't use spaCy's built-in Language class to
|
||||
wire together the Vocab, Tokenizer and EntityRecognizer. Instead, we write
|
||||
our own simple Pipeline class, so that it's easier to see how the pieces
|
||||
interact.
|
||||
|
||||
Input data:
|
||||
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/data/GermEval2014_complete_data.zip
|
||||
|
||||
Developed for: spaCy 1.7.1
|
||||
Last tested for: spaCy 1.7.1
|
||||
'''
|
||||
from __future__ import unicode_literals, print_function
|
||||
import plac
|
||||
from pathlib import Path
|
||||
import random
|
||||
import json
|
||||
|
||||
import spacy.orth as orth_funcs
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.pipeline import BeamEntityRecognizer
|
||||
from spacy.pipeline import EntityRecognizer
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import *
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.gold import _iob_to_biluo as iob_to_biluo
|
||||
from spacy.scorer import Scorer
|
||||
|
||||
try:
|
||||
unicode
|
||||
except NameError:
|
||||
unicode = str
|
||||
|
||||
|
||||
def init_vocab():
|
||||
return Vocab(
|
||||
lex_attr_getters={
|
||||
LOWER: lambda string: string.lower(),
|
||||
SHAPE: orth_funcs.word_shape,
|
||||
PREFIX: lambda string: string[0],
|
||||
SUFFIX: lambda string: string[-3:],
|
||||
CLUSTER: lambda string: 0,
|
||||
IS_ALPHA: orth_funcs.is_alpha,
|
||||
IS_ASCII: orth_funcs.is_ascii,
|
||||
IS_DIGIT: lambda string: string.isdigit(),
|
||||
IS_LOWER: orth_funcs.is_lower,
|
||||
IS_PUNCT: orth_funcs.is_punct,
|
||||
IS_SPACE: lambda string: string.isspace(),
|
||||
IS_TITLE: orth_funcs.is_title,
|
||||
IS_UPPER: orth_funcs.is_upper,
|
||||
IS_STOP: lambda string: False,
|
||||
IS_OOV: lambda string: True
|
||||
})
|
||||
|
||||
|
||||
def save_vocab(vocab, path):
|
||||
path = Path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
elif not path.is_dir():
|
||||
raise IOError("Can't save vocab to %s\nNot a directory" % path)
|
||||
with (path / 'strings.json').open('w') as file_:
|
||||
vocab.strings.dump(file_)
|
||||
vocab.dump((path / 'lexemes.bin').as_posix())
|
||||
|
||||
|
||||
def load_vocab(path):
|
||||
path = Path(path)
|
||||
if not path.exists():
|
||||
raise IOError("Cannot load vocab from %s\nDoes not exist" % path)
|
||||
if not path.is_dir():
|
||||
raise IOError("Cannot load vocab from %s\nNot a directory" % path)
|
||||
return Vocab.load(path)
|
||||
|
||||
|
||||
def init_ner_model(vocab, features=None):
|
||||
if features is None:
|
||||
features = tuple(EntityRecognizer.feature_templates)
|
||||
return EntityRecognizer(vocab, features=features)
|
||||
|
||||
|
||||
def save_ner_model(model, path):
|
||||
path = Path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
if not path.is_dir():
|
||||
raise IOError("Can't save model to %s\nNot a directory" % path)
|
||||
model.model.dump((path / 'model').as_posix())
|
||||
with (path / 'config.json').open('w') as file_:
|
||||
data = json.dumps(model.cfg)
|
||||
if not isinstance(data, unicode):
|
||||
data = data.decode('utf8')
|
||||
file_.write(data)
|
||||
|
||||
|
||||
def load_ner_model(vocab, path):
|
||||
return EntityRecognizer.load(path, vocab)
|
||||
|
||||
|
||||
class Pipeline(object):
|
||||
@classmethod
|
||||
def load(cls, path):
|
||||
path = Path(path)
|
||||
if not path.exists():
|
||||
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
|
||||
if not path.is_dir():
|
||||
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
|
||||
vocab = load_vocab(path)
|
||||
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
||||
ner_model = load_ner_model(vocab, path / 'ner')
|
||||
return cls(vocab, tokenizer, ner_model)
|
||||
|
||||
def __init__(self, vocab=None, tokenizer=None, entity=None):
|
||||
if vocab is None:
|
||||
vocab = init_vocab()
|
||||
if tokenizer is None:
|
||||
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
||||
if entity is None:
|
||||
entity = init_ner_model(vocab)
|
||||
self.vocab = vocab
|
||||
self.tokenizer = tokenizer
|
||||
self.entity = entity
|
||||
self.pipeline = [self.entity]
|
||||
|
||||
def __call__(self, input_):
|
||||
doc = self.make_doc(input_)
|
||||
for process in self.pipeline:
|
||||
process(doc)
|
||||
return doc
|
||||
|
||||
def make_doc(self, input_):
|
||||
if isinstance(input_, bytes):
|
||||
input_ = input_.decode('utf8')
|
||||
if isinstance(input_, unicode):
|
||||
return self.tokenizer(input_)
|
||||
else:
|
||||
return Doc(self.vocab, words=input_)
|
||||
|
||||
def make_gold(self, input_, annotations):
|
||||
doc = self.make_doc(input_)
|
||||
gold = GoldParse(doc, entities=annotations)
|
||||
return gold
|
||||
|
||||
def update(self, input_, annot):
|
||||
doc = self.make_doc(input_)
|
||||
gold = self.make_gold(input_, annot)
|
||||
for ner in gold.ner:
|
||||
if ner not in (None, '-', 'O'):
|
||||
action, label = ner.split('-', 1)
|
||||
self.entity.add_label(label)
|
||||
return self.entity.update(doc, gold)
|
||||
|
||||
def evaluate(self, examples):
|
||||
scorer = Scorer()
|
||||
for input_, annot in examples:
|
||||
gold = self.make_gold(input_, annot)
|
||||
doc = self(input_)
|
||||
scorer.score(doc, gold)
|
||||
return scorer.scores
|
||||
|
||||
def average_weights(self):
|
||||
self.entity.model.end_training()
|
||||
|
||||
def save(self, path):
|
||||
path = Path(path)
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
elif not path.is_dir():
|
||||
raise IOError("Can't save pipeline to %s\nNot a directory" % path)
|
||||
save_vocab(self.vocab, path / 'vocab')
|
||||
save_ner_model(self.entity, path / 'ner')
|
||||
|
||||
|
||||
def train(nlp, train_examples, dev_examples, nr_epoch=5):
|
||||
next_epoch = train_examples
|
||||
print("Iter", "Loss", "P", "R", "F")
|
||||
for i in range(nr_epoch):
|
||||
this_epoch = next_epoch
|
||||
next_epoch = []
|
||||
loss = 0
|
||||
for input_, annot in this_epoch:
|
||||
loss += nlp.update(input_, annot)
|
||||
if (i+1) < nr_epoch:
|
||||
next_epoch.append((input_, annot))
|
||||
random.shuffle(next_epoch)
|
||||
scores = nlp.evaluate(dev_examples)
|
||||
report_scores(i, loss, scores)
|
||||
nlp.average_weights()
|
||||
scores = nlp.evaluate(dev_examples)
|
||||
report_scores(i+1, loss, scores)
|
||||
|
||||
|
||||
def report_scores(i, loss, scores):
|
||||
precision = '%.2f' % scores['ents_p']
|
||||
recall = '%.2f' % scores['ents_r']
|
||||
f_measure = '%.2f' % scores['ents_f']
|
||||
print('%d %s %s %s' % (int(loss), precision, recall, f_measure))
|
||||
|
||||
|
||||
def read_examples(path):
|
||||
path = Path(path)
|
||||
with path.open() as file_:
|
||||
sents = file_.read().strip().split('\n\n')
|
||||
for sent in sents:
|
||||
if not sent.strip():
|
||||
continue
|
||||
tokens = sent.split('\n')
|
||||
while tokens and tokens[0].startswith('#'):
|
||||
tokens.pop(0)
|
||||
words = []
|
||||
iob = []
|
||||
for token in tokens:
|
||||
if token.strip():
|
||||
pieces = token.split()
|
||||
words.append(pieces[1])
|
||||
iob.append(pieces[2])
|
||||
yield words, iob_to_biluo(iob)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model_dir=("Path to save the model", "positional", None, Path),
|
||||
train_loc=("Path to your training data", "positional", None, Path),
|
||||
dev_loc=("Path to your development data", "positional", None, Path),
|
||||
)
|
||||
def main(model_dir=Path('/home/matt/repos/spaCy/spacy/data/de-1.0.0'),
|
||||
train_loc=None, dev_loc=None, nr_epoch=30):
|
||||
|
||||
train_examples = read_examples(train_loc)
|
||||
dev_examples = read_examples(dev_loc)
|
||||
nlp = Pipeline.load(model_dir)
|
||||
|
||||
train(nlp, train_examples, list(dev_examples), nr_epoch)
|
||||
|
||||
nlp.save(model_dir)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
@ -1,7 +1,6 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""
|
||||
Example of training an additional entity type
|
||||
"""Example of training an additional entity type
|
||||
|
||||
This script shows how to add a new entity type to an existing pre-trained NER
|
||||
model. To keep the example short and simple, only four sentences are provided
|
||||
|
@ -21,111 +20,114 @@ After training your model, you can save it to a directory. We recommend
|
|||
wrapping models as Python packages, for ease of deployment.
|
||||
|
||||
For more details, see the documentation:
|
||||
* Training the Named Entity Recognizer: https://spacy.io/docs/usage/train-ner
|
||||
* Saving and loading models: https://spacy.io/docs/usage/saving-loading
|
||||
* Training: https://spacy.io/usage/training
|
||||
* NER: https://spacy.io/usage/linguistic-features#named-entities
|
||||
|
||||
Developed for: spaCy 1.9.0
|
||||
Last tested for: spaCy 1.9.0
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import random
|
||||
|
||||
import spacy
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tagger import Tagger
|
||||
|
||||
|
||||
def train_ner(nlp, train_data, output_dir):
|
||||
# Add new words to vocab
|
||||
for raw_text, _ in train_data:
|
||||
doc = nlp.make_doc(raw_text)
|
||||
for word in doc:
|
||||
_ = nlp.vocab[word.orth]
|
||||
random.seed(0)
|
||||
# You may need to change the learning rate. It's generally difficult to
|
||||
# guess what rate you should set, especially when you have limited data.
|
||||
nlp.entity.model.learn_rate = 0.001
|
||||
for itn in range(1000):
|
||||
random.shuffle(train_data)
|
||||
loss = 0.
|
||||
for raw_text, entity_offsets in train_data:
|
||||
doc = nlp.make_doc(raw_text)
|
||||
gold = GoldParse(doc, entities=entity_offsets)
|
||||
# By default, the GoldParse class assumes that the entities
|
||||
# described by offset are complete, and all other words should
|
||||
# have the tag 'O'. You can tell it to make no assumptions
|
||||
# about the tag of a word by giving it the tag '-'.
|
||||
# However, this allows a trivial solution to the current
|
||||
# learning problem: if words are either 'any tag' or 'ANIMAL',
|
||||
# the model can learn that all words can be tagged 'ANIMAL'.
|
||||
#for i in range(len(gold.ner)):
|
||||
#if not gold.ner[i].endswith('ANIMAL'):
|
||||
# gold.ner[i] = '-'
|
||||
nlp.tagger(doc)
|
||||
# As of 1.9, spaCy's parser now lets you supply a dropout probability
|
||||
# This might help the model generalize better from only a few
|
||||
# examples.
|
||||
loss += nlp.entity.update(doc, gold, drop=0.9)
|
||||
if loss == 0:
|
||||
break
|
||||
# This step averages the model's weights. This may or may not be good for
|
||||
# your situation --- it's empirical.
|
||||
nlp.end_training()
|
||||
if output_dir:
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.save_to_directory(output_dir)
|
||||
# new entity label
|
||||
LABEL = 'ANIMAL'
|
||||
|
||||
# training data
|
||||
# Note: If you're using an existing model, make sure to mix in examples of
|
||||
# other entity types that spaCy correctly recognized before. Otherwise, your
|
||||
# model might learn the new type, but "forget" what it previously knew.
|
||||
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
|
||||
TRAIN_DATA = [
|
||||
("Horses are too tall and they pretend to care about your feelings", {
|
||||
'entities': [(0, 6, 'ANIMAL')]
|
||||
}),
|
||||
|
||||
("Do they bite?", {
|
||||
'entities': []
|
||||
}),
|
||||
|
||||
("horses are too tall and they pretend to care about your feelings", {
|
||||
'entities': [(0, 6, 'ANIMAL')]
|
||||
}),
|
||||
|
||||
("horses pretend to care about your feelings", {
|
||||
'entities': [(0, 6, 'ANIMAL')]
|
||||
}),
|
||||
|
||||
("they pretend to care about your feelings, those horses", {
|
||||
'entities': [(48, 54, 'ANIMAL')]
|
||||
}),
|
||||
|
||||
("horses?", {
|
||||
'entities': [(0, 6, 'ANIMAL')]
|
||||
})
|
||||
]
|
||||
|
||||
|
||||
def main(model_name, output_directory=None):
|
||||
print("Loading initial model", model_name)
|
||||
nlp = spacy.load(model_name)
|
||||
if output_directory is not None:
|
||||
output_directory = Path(output_directory)
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
new_model_name=("New model name for model meta.", "option", "nm", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(model=None, new_model_name='animal', output_dir=None, n_iter=50):
|
||||
"""Set up the pipeline and entity recognizer, and train the new entity."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank('en') # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
train_data = [
|
||||
(
|
||||
"Horses are too tall and they pretend to care about your feelings",
|
||||
[(0, 6, 'ANIMAL')],
|
||||
),
|
||||
(
|
||||
"horses are too tall and they pretend to care about your feelings",
|
||||
[(0, 6, 'ANIMAL')]
|
||||
),
|
||||
(
|
||||
"horses pretend to care about your feelings",
|
||||
[(0, 6, 'ANIMAL')]
|
||||
),
|
||||
(
|
||||
"they pretend to care about your feelings, those horses",
|
||||
[(48, 54, 'ANIMAL')]
|
||||
),
|
||||
(
|
||||
"horses?",
|
||||
[(0, 6, 'ANIMAL')]
|
||||
)
|
||||
# Add entity recognizer to model if it's not in the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if 'ner' not in nlp.pipe_names:
|
||||
ner = nlp.create_pipe('ner')
|
||||
nlp.add_pipe(ner)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
ner = nlp.get_pipe('ner')
|
||||
|
||||
]
|
||||
nlp.entity.add_label('ANIMAL')
|
||||
train_ner(nlp, train_data, output_directory)
|
||||
ner.add_label(LABEL) # add new entity label to entity recognizer
|
||||
|
||||
# Test that the entity is recognized
|
||||
doc = nlp('Do you like horses?')
|
||||
print("Ents in 'Do you like horses?':")
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
|
||||
with nlp.disable_pipes(*other_pipes): # only train NER
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annotations in TRAIN_DATA:
|
||||
nlp.update([text], [annotations], sgd=optimizer, drop=0.35,
|
||||
losses=losses)
|
||||
print(losses)
|
||||
|
||||
# test the trained model
|
||||
test_text = 'Do you like horses?'
|
||||
doc = nlp(test_text)
|
||||
print("Entities in '%s'" % test_text)
|
||||
for ent in doc.ents:
|
||||
print(ent.label_, ent.text)
|
||||
if output_directory:
|
||||
print("Loading from", output_directory)
|
||||
nlp2 = spacy.load('en', path=output_directory)
|
||||
nlp2.entity.add_label('ANIMAL')
|
||||
doc2 = nlp2('Do you like horses?')
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.meta['name'] = new_model_name # rename model
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
for ent in doc2.ents:
|
||||
print(ent.label_, ent.text)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
import plac
|
||||
plac.call(main)
|
||||
|
|
|
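The (start, end, label) tuples in the entity annotations above are character offsets into the raw text, not token indices. A minimal, hedged sketch of checking that such offsets line up with a tokenization (assumes a spaCy v2.x install; doc.char_span returns None when the offsets don't match token boundaries):

import spacy

nlp = spacy.blank('en')
text = "horses pretend to care about your feelings"
start, end, label = 0, 6, 'ANIMAL'
doc = nlp.make_doc(text)
span = doc.char_span(start, end, label=label)
if span is None:
    print("(%d, %d) does not align with token boundaries" % (start, end))
else:
    print(span.text, span.label_)  # horses ANIMAL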
@ -1,75 +1,98 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Example of training spaCy dependency parser, starting off with an existing
|
||||
model or a blank model. For more details, see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
import json
|
||||
import pathlib
|
||||
|
||||
import plac
|
||||
import random
|
||||
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
from spacy.pipeline import DependencyParser
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
def train_parser(nlp, train_data, left_labels, right_labels):
|
||||
parser = DependencyParser(
|
||||
nlp.vocab,
|
||||
left_labels=left_labels,
|
||||
right_labels=right_labels)
|
||||
for itn in range(1000):
|
||||
random.shuffle(train_data)
|
||||
loss = 0
|
||||
for words, heads, deps in train_data:
|
||||
doc = Doc(nlp.vocab, words=words)
|
||||
gold = GoldParse(doc, heads=heads, deps=deps)
|
||||
loss += parser.update(doc, gold)
|
||||
parser.model.end_training()
|
||||
return parser
|
||||
# training data
|
||||
TRAIN_DATA = [
|
||||
("They trade mortgage-backed securities.", {
|
||||
'heads': [1, 1, 4, 4, 5, 1, 1],
|
||||
'deps': ['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
|
||||
}),
|
||||
("I like London and Berlin", {
|
||||
'heads': [1, 1, 1, 2, 2, 1],
|
||||
'deps': ['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
|
||||
})
|
||||
]
|
||||
|
||||
|
||||
def main(model_dir=None):
|
||||
if model_dir is not None:
|
||||
model_dir = pathlib.Path(model_dir)
|
||||
if not model_dir.exists():
|
||||
model_dir.mkdir()
|
||||
assert model_dir.is_dir()
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(model=None, output_dir=None, n_iter=10):
|
||||
"""Load the model, set up the pipeline and train the parser."""
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank('en') # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
nlp = spacy.load('en', tagger=False, parser=False, entity=False, add_vectors=False)
|
||||
# add the parser to the pipeline if it doesn't exist
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if 'parser' not in nlp.pipe_names:
|
||||
parser = nlp.create_pipe('parser')
|
||||
nlp.add_pipe(parser, first=True)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
parser = nlp.get_pipe('parser')
|
||||
|
||||
train_data = [
|
||||
(
|
||||
['They', 'trade', 'mortgage', '-', 'backed', 'securities', '.'],
|
||||
[1, 1, 4, 4, 5, 1, 1],
|
||||
['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
|
||||
),
|
||||
(
|
||||
['I', 'like', 'London', 'and', 'Berlin', '.'],
|
||||
[1, 1, 1, 2, 2, 1],
|
||||
['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
|
||||
)
|
||||
]
|
||||
left_labels = set()
|
||||
right_labels = set()
|
||||
for _, heads, deps in train_data:
|
||||
for i, (head, dep) in enumerate(zip(heads, deps)):
|
||||
if i < head:
|
||||
left_labels.add(dep)
|
||||
elif i > head:
|
||||
right_labels.add(dep)
|
||||
parser = train_parser(nlp, train_data, sorted(left_labels), sorted(right_labels))
|
||||
# add labels to the parser
|
||||
for _, annotations in TRAIN_DATA:
|
||||
for dep in annotations.get('deps', []):
|
||||
parser.add_label(dep)
|
||||
|
||||
doc = Doc(nlp.vocab, words=['I', 'like', 'securities', '.'])
|
||||
parser(doc)
|
||||
for word in doc:
|
||||
print(word.text, word.dep_, word.head.text)
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
|
||||
with nlp.disable_pipes(*other_pipes): # only train parser
|
||||
optimizer = nlp.begin_training()
|
||||
for itn in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annotations in TRAIN_DATA:
|
||||
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
|
||||
print(losses)
|
||||
|
||||
if model_dir is not None:
|
||||
with (model_dir / 'config.json').open('w') as file_:
|
||||
json.dump(parser.cfg, file_)
|
||||
parser.model.dump(str(model_dir / 'model'))
|
||||
# test the trained model
|
||||
test_text = "I like securities."
|
||||
doc = nlp(test_text)
|
||||
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc = nlp2(test_text)
|
||||
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
# I nsubj like
|
||||
# like ROOT like
|
||||
# securities dobj like
|
||||
# . cc securities
|
||||
plac.call(main)
|
||||
|
||||
# expected result:
|
||||
# [
|
||||
# ('I', 'nsubj', 'like'),
|
||||
# ('like', 'ROOT', 'like'),
|
||||
# ('securities', 'dobj', 'like'),
|
||||
# ('.', 'punct', 'like')
|
||||
# ]
|
||||
|
|
|
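The 'heads' lists in both versions of this example give the absolute index of each token's head, with the ROOT pointing at itself. A small hedged helper, purely for illustration, that checks an annotation is internally consistent before training:

def check_heads(words, heads):
    # every head must be a valid token index
    assert len(words) == len(heads)
    for i, head in enumerate(heads):
        assert 0 <= head < len(words), "token %d has out-of-range head %d" % (i, head)
    # exactly one token should point at itself: the ROOT
    roots = [i for i, head in enumerate(heads) if head == i]
    assert len(roots) == 1, "expected one root, found %r" % roots

check_heads(['I', 'like', 'London', 'and', 'Berlin', '.'],
            [1, 1, 1, 2, 2, 1])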
@ -1,18 +1,22 @@
|
|||
"""A quick example for training a part-of-speech tagger, without worrying
|
||||
about the tokenization, or other language-specific customizations."""
|
||||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""
|
||||
A simple example for training a part-of-speech tagger with a custom tag map.
|
||||
To allow us to update the tag map with our custom one, this example starts off
|
||||
with a blank Language class and modifies its defaults. For more details, see
|
||||
the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging
|
||||
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import print_function
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
import random
|
||||
from pathlib import Path
|
||||
import spacy
|
||||
|
||||
|
||||
# You need to define a mapping from your data's part-of-speech tag names to the
|
||||
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
|
||||
|
@ -21,61 +25,72 @@ import random
|
|||
# You may also specify morphological features for your tags, from the universal
|
||||
# scheme.
|
||||
TAG_MAP = {
|
||||
'N': {"pos": "NOUN"},
|
||||
'V': {"pos": "VERB"},
|
||||
'J': {"pos": "ADJ"}
|
||||
'N': {'pos': 'NOUN'},
|
||||
'V': {'pos': 'VERB'},
|
||||
'J': {'pos': 'ADJ'}
|
||||
}
|
||||
|
||||
# Usually you'll read this in, of course. Data formats vary.
|
||||
# Ensure your strings are unicode.
|
||||
DATA = [
|
||||
(
|
||||
["I", "like", "green", "eggs"],
|
||||
["N", "V", "J", "N"]
|
||||
),
|
||||
(
|
||||
["Eat", "blue", "ham"],
|
||||
["V", "J", "N"]
|
||||
)
|
||||
TRAIN_DATA = [
|
||||
("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
|
||||
("Eat blue ham", {'tags': ['V', 'J', 'N']})
|
||||
]
|
||||
|
||||
|
||||
def ensure_dir(path):
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
@plac.annotations(
|
||||
lang=("ISO Code of language to use", "option", "l", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(lang='en', output_dir=None, n_iter=25):
|
||||
"""Create a new model, set up the pipeline and train the tagger. In order to
|
||||
train the tagger with a custom tag map, we're creating a new Language
|
||||
instance with a custom vocab.
|
||||
"""
|
||||
nlp = spacy.blank(lang)
|
||||
# add the tagger to the pipeline
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
tagger = nlp.create_pipe('tagger')
|
||||
# Add the tags. This needs to be done before you start training.
|
||||
for tag, values in TAG_MAP.items():
|
||||
tagger.add_label(tag, values)
|
||||
nlp.add_pipe(tagger)
|
||||
|
||||
optimizer = nlp.begin_training()
|
||||
for i in range(n_iter):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
losses = {}
|
||||
for text, annotations in TRAIN_DATA:
|
||||
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
|
||||
print(losses)
|
||||
|
||||
def main(output_dir=None):
|
||||
# test the trained model
|
||||
test_text = "I like blue eggs"
|
||||
doc = nlp(test_text)
|
||||
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
|
||||
|
||||
# save model to output directory
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
ensure_dir(output_dir)
|
||||
ensure_dir(output_dir / "pos")
|
||||
ensure_dir(output_dir / "vocab")
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
vocab = Vocab(tag_map=TAG_MAP)
|
||||
# The default_templates argument is where features are specified. See
|
||||
# spacy/tagger.pyx for the defaults.
|
||||
tagger = Tagger(vocab)
|
||||
for i in range(25):
|
||||
for words, tags in DATA:
|
||||
doc = Doc(vocab, words=words)
|
||||
gold = GoldParse(doc, tags=tags)
|
||||
tagger.update(doc, gold)
|
||||
random.shuffle(DATA)
|
||||
tagger.model.end_training()
|
||||
doc = Doc(vocab, orths_and_spaces=zip(["I", "like", "blue", "eggs"], [True] * 4))
|
||||
tagger(doc)
|
||||
for word in doc:
|
||||
print(word.text, word.tag_, word.pos_)
|
||||
if output_dir is not None:
|
||||
tagger.model.dump(str(output_dir / 'pos' / 'model'))
|
||||
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
|
||||
tagger.vocab.strings.dump(file_)
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc = nlp2(test_text)
|
||||
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
||||
# I V VERB
|
||||
# like V VERB
|
||||
# blue N NOUN
|
||||
# eggs N NOUN
|
||||
|
||||
# Expected output:
|
||||
# [
|
||||
# ('I', 'N', 'NOUN'),
|
||||
# ('like', 'V', 'VERB'),
|
||||
# ('blue', 'J', 'ADJ'),
|
||||
# ('eggs', 'N', 'NOUN')
|
||||
# ]
|
||||
|
|
|
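Each list of tags in TRAIN_DATA has to line up one-to-one with the tokens of its text; for these punctuation-free examples the blank 'en' tokenizer simply splits on whitespace. A hedged sanity check along those lines:

TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]

for text, annotations in TRAIN_DATA:
    n_tokens = len(text.split())          # whitespace split matches the tokenization here
    n_tags = len(annotations['tags'])
    assert n_tokens == n_tags, (text, n_tokens, n_tags)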
@ -1,164 +0,0 @@
|
|||
'''
|
||||
This example shows training of the POS tagger without the Language class,
|
||||
showing the APIs of the atomic components.
|
||||
|
||||
This example was adapted from the gist here:
|
||||
|
||||
https://gist.github.com/kamac/a7bc139f62488839a8118214a4d932f2
|
||||
|
||||
Issue discussing the gist:
|
||||
|
||||
https://github.com/explosion/spaCy/issues/1179
|
||||
|
||||
The example was written for spaCy 1.8.2.
|
||||
'''
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import print_function
|
||||
|
||||
import plac
|
||||
import codecs
|
||||
import spacy.symbols as symbols
|
||||
import spacy
|
||||
from pathlib import Path
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tagger import Tagger
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.language import Language
|
||||
from spacy import orth
|
||||
from spacy import attrs
|
||||
|
||||
import random
|
||||
|
||||
TAG_MAP = {
|
||||
'ADJ': {symbols.POS: symbols.ADJ},
|
||||
'ADP': {symbols.POS: symbols.ADP},
|
||||
'PUNCT': {symbols.POS: symbols.PUNCT},
|
||||
'ADV': {symbols.POS: symbols.ADV},
|
||||
'AUX': {symbols.POS: symbols.AUX},
|
||||
'SYM': {symbols.POS: symbols.SYM},
|
||||
'INTJ': {symbols.POS: symbols.INTJ},
|
||||
'CCONJ': {symbols.POS: symbols.CCONJ},
|
||||
'X': {symbols.POS: symbols.X},
|
||||
'NOUN': {symbols.POS: symbols.NOUN},
|
||||
'DET': {symbols.POS: symbols.DET},
|
||||
'PROPN': {symbols.POS: symbols.PROPN},
|
||||
'NUM': {symbols.POS: symbols.NUM},
|
||||
'VERB': {symbols.POS: symbols.VERB},
|
||||
'PART': {symbols.POS: symbols.PART},
|
||||
'PRON': {symbols.POS: symbols.PRON},
|
||||
'SCONJ': {symbols.POS: symbols.SCONJ},
|
||||
}
|
||||
|
||||
LEX_ATTR_GETTERS = {
|
||||
attrs.LOWER: lambda string: string.lower(),
|
||||
attrs.NORM: lambda string: string,
|
||||
attrs.SHAPE: orth.word_shape,
|
||||
attrs.PREFIX: lambda string: string[0],
|
||||
attrs.SUFFIX: lambda string: string[-3:],
|
||||
attrs.CLUSTER: lambda string: 0,
|
||||
attrs.IS_ALPHA: orth.is_alpha,
|
||||
attrs.IS_ASCII: orth.is_ascii,
|
||||
attrs.IS_DIGIT: lambda string: string.isdigit(),
|
||||
attrs.IS_LOWER: orth.is_lower,
|
||||
attrs.IS_PUNCT: orth.is_punct,
|
||||
attrs.IS_SPACE: lambda string: string.isspace(),
|
||||
attrs.IS_TITLE: orth.is_title,
|
||||
attrs.IS_UPPER: orth.is_upper,
|
||||
attrs.IS_BRACKET: orth.is_bracket,
|
||||
attrs.IS_QUOTE: orth.is_quote,
|
||||
attrs.IS_LEFT_PUNCT: orth.is_left_punct,
|
||||
attrs.IS_RIGHT_PUNCT: orth.is_right_punct,
|
||||
attrs.LIKE_URL: orth.like_url,
|
||||
attrs.LIKE_NUM: orth.like_number,
|
||||
attrs.LIKE_EMAIL: orth.like_email,
|
||||
attrs.IS_STOP: lambda string: False,
|
||||
attrs.IS_OOV: lambda string: True
|
||||
}
|
||||
|
||||
|
||||
def read_ud_data(path):
|
||||
data = []
|
||||
last_number = -1
|
||||
sentence_words = []
|
||||
sentence_tags = []
|
||||
with codecs.open(path, encoding="utf-8") as f:
|
||||
while True:
|
||||
line = f.readline()
|
||||
if not line:
|
||||
break
|
||||
|
||||
if line[0].isdigit():
|
||||
d = line.split()
|
||||
if not "-" in d[0]:
|
||||
number = int(line[0])
|
||||
if number < last_number:
|
||||
data.append((sentence_words, sentence_tags),)
|
||||
sentence_words = []
|
||||
sentence_tags = []
|
||||
sentence_words.append(d[2])
|
||||
sentence_tags.append(d[3])
|
||||
last_number = number
|
||||
if len(sentence_words) > 0:
|
||||
data.append((sentence_words, sentence_tags,))
|
||||
return data
|
||||
|
||||
def ensure_dir(path):
|
||||
if not path.exists():
|
||||
path.mkdir()
|
||||
|
||||
|
||||
def main(train_loc, dev_loc, output_dir=None):
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
ensure_dir(output_dir)
|
||||
ensure_dir(output_dir / "pos")
|
||||
ensure_dir(output_dir / "vocab")
|
||||
|
||||
train_data = read_ud_data(train_loc)
|
||||
vocab = Vocab(tag_map=TAG_MAP, lex_attr_getters=LEX_ATTR_GETTERS)
|
||||
# Populate vocab
|
||||
for words, _ in train_data:
|
||||
for word in words:
|
||||
_ = vocab[word]
|
||||
|
||||
model = spacy.tagger.TaggerModel(spacy.tagger.Tagger.feature_templates)
|
||||
tagger = Tagger(vocab, model)
|
||||
print(tagger.tag_names)
|
||||
for i in range(30):
|
||||
print("training model (iteration " + str(i) + ")...")
|
||||
score = 0.
|
||||
num_samples = 0.
|
||||
for words, tags in train_data:
|
||||
doc = Doc(vocab, words=words)
|
||||
gold = GoldParse(doc, tags=tags)
|
||||
cost = tagger.update(doc, gold)
|
||||
for i, word in enumerate(doc):
|
||||
num_samples += 1
|
||||
if word.tag_ == tags[i]:
|
||||
score += 1
|
||||
print('Train acc', score/num_samples)
|
||||
random.shuffle(train_data)
|
||||
tagger.model.end_training()
|
||||
|
||||
score = 0.0
|
||||
test_data = read_ud_data(dev_loc)
|
||||
num_samples = 0
|
||||
for words, tags in test_data:
|
||||
doc = Doc(vocab, words)
|
||||
tagger(doc)
|
||||
for i, word in enumerate(doc):
|
||||
num_samples += 1
|
||||
if word.tag_ == tags[i]:
|
||||
score += 1
|
||||
print("score: " + str(score / num_samples * 100.0))
|
||||
|
||||
if output_dir is not None:
|
||||
tagger.model.dump(str(output_dir / 'pos' / 'model'))
|
||||
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
|
||||
tagger.vocab.strings.dump(file_)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
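For context on the read_ud_data() function in the script above: each numbered line of a CoNLL-U file is a run of tab-separated fields, and the script reads the third and fourth fields of every such line as the word and its tag. A hedged one-line illustration with a made-up sample line:

# columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
line = "1\tHorses\thorse\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
if line[0].isdigit():
    d = line.split()
    print(d[2], d[3])  # the fields read as the word and the tag: horse NOUN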
133
examples/training/train_textcat.py
Normal file
133
examples/training/train_textcat.py
Normal file
|
@ -0,0 +1,133 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Train a multi-label convolutional neural network text classifier on the
|
||||
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
|
||||
automatically via Thinc's built-in dataset loader. The model is added to
|
||||
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
|
||||
see the documentation:
|
||||
* Training: https://spacy.io/usage/training
|
||||
* Text classification: https://spacy.io/usage/text-classification
|
||||
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals, print_function
|
||||
import plac
|
||||
import random
|
||||
from pathlib import Path
|
||||
import thinc.extra.datasets
|
||||
|
||||
import spacy
|
||||
from spacy.util import minibatch, compounding
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
|
||||
output_dir=("Optional output directory", "option", "o", Path),
|
||||
n_texts=("Number of texts to train from", "option", "t", int),
|
||||
n_iter=("Number of training iterations", "option", "n", int))
|
||||
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
|
||||
if model is not None:
|
||||
nlp = spacy.load(model) # load existing spaCy model
|
||||
print("Loaded model '%s'" % model)
|
||||
else:
|
||||
nlp = spacy.blank('en') # create blank Language class
|
||||
print("Created blank 'en' model")
|
||||
|
||||
# add the text classifier to the pipeline if it doesn't exist
|
||||
# nlp.create_pipe works for built-ins that are registered with spaCy
|
||||
if 'textcat' not in nlp.pipe_names:
|
||||
textcat = nlp.create_pipe('textcat')
|
||||
nlp.add_pipe(textcat, last=True)
|
||||
# otherwise, get it, so we can add labels to it
|
||||
else:
|
||||
textcat = nlp.get_pipe('textcat')
|
||||
|
||||
# add label to text classifier
|
||||
textcat.add_label('POSITIVE')
|
||||
|
||||
# load the IMDB dataset
|
||||
print("Loading IMDB data...")
|
||||
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
|
||||
print("Using %d training examples" % n_texts)
|
||||
train_data = list(zip(train_texts,
|
||||
[{'cats': cats} for cats in train_cats]))
|
||||
|
||||
# get names of other pipes to disable them during training
|
||||
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
|
||||
with nlp.disable_pipes(*other_pipes): # only train textcat
|
||||
optimizer = nlp.begin_training()
|
||||
print("Training the model...")
|
||||
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
|
||||
for i in range(n_iter):
|
||||
losses = {}
|
||||
# batch up the examples using spaCy's minibatch
|
||||
batches = minibatch(train_data, size=compounding(4., 32., 1.001))
|
||||
for batch in batches:
|
||||
texts, annotations = zip(*batch)
|
||||
nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
|
||||
losses=losses)
|
||||
with textcat.model.use_params(optimizer.averages):
|
||||
# evaluate on the dev data split off in load_data()
|
||||
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
|
||||
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}' # print a simple table
|
||||
.format(losses['textcat'], scores['textcat_p'],
|
||||
scores['textcat_r'], scores['textcat_f']))
|
||||
|
||||
# test the trained model
|
||||
test_text = "This movie sucked"
|
||||
doc = nlp(test_text)
|
||||
print(test_text, doc.cats)
|
||||
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
nlp.to_disk(output_dir)
|
||||
print("Saved model to", output_dir)
|
||||
|
||||
# test the saved model
|
||||
print("Loading from", output_dir)
|
||||
nlp2 = spacy.load(output_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
print(test_text, doc2.cats)
|
||||
|
||||
|
||||
def load_data(limit=0, split=0.8):
|
||||
"""Load data from the IMDB dataset."""
|
||||
# Partition off part of the train data for evaluation
|
||||
train_data, _ = thinc.extra.datasets.imdb()
|
||||
random.shuffle(train_data)
|
||||
train_data = train_data[-limit:]
|
||||
texts, labels = zip(*train_data)
|
||||
cats = [{'POSITIVE': bool(y)} for y in labels]
|
||||
split = int(len(train_data) * split)
|
||||
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
|
||||
|
||||
|
||||
def evaluate(tokenizer, textcat, texts, cats):
|
||||
docs = (tokenizer(text) for text in texts)
|
||||
tp = 1e-8 # True positives
|
||||
fp = 1e-8 # False positives
|
||||
fn = 1e-8 # False negatives
|
||||
tn = 1e-8 # True negatives
|
||||
for i, doc in enumerate(textcat.pipe(docs)):
|
||||
gold = cats[i]
|
||||
for label, score in doc.cats.items():
|
||||
if label not in gold:
|
||||
continue
|
||||
if score >= 0.5 and gold[label] >= 0.5:
|
||||
tp += 1.
|
||||
elif score >= 0.5 and gold[label] < 0.5:
|
||||
fp += 1.
|
||||
elif score < 0.5 and gold[label] < 0.5:
|
||||
tn += 1
|
||||
elif score < 0.5 and gold[label] >= 0.5:
|
||||
fn += 1
|
||||
precision = tp / (tp + fp)
|
||||
recall = tp / (tp + fn)
|
||||
f_score = 2 * (precision * recall) / (precision + recall)
|
||||
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
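The evaluate() helper thresholds each predicted score at 0.5 and derives precision, recall and F-score from the resulting counts; the 1e-8 starting values only guard against division by zero. A tiny worked example of the same arithmetic, with made-up counts:

tp, fp, fn = 8.0, 2.0, 4.0
precision = tp / (tp + fp)                                 # 0.8
recall = tp / (tp + fn)                                    # ~0.667
f_score = 2 * (precision * recall) / (precision + recall)  # ~0.727
print(precision, recall, f_score)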
641
examples/training/training-data.json
Normal file
641
examples/training/training-data.json
Normal file
|
@ -0,0 +1,641 @@
|
|||
[
|
||||
{
|
||||
"id": "wsj_0200",
|
||||
"paragraphs": [
|
||||
{
|
||||
"raw": "In an Oct. 19 review of \"The Misanthrope\" at Chicago's Goodman Theatre (\"Revitalized Classics Take the Stage in Windy City,\" Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Haag. Ms. Haag plays Elianti.",
|
||||
"sentences": [
|
||||
{
|
||||
"tokens": [
|
||||
{
|
||||
"head": 44,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "In",
|
||||
"ner": "O",
|
||||
"id": 0
|
||||
},
|
||||
{
|
||||
"head": 3,
|
||||
"dep": "det",
|
||||
"tag": "DT",
|
||||
"orth": "an",
|
||||
"ner": "O",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"head": 2,
|
||||
"dep": "nmod",
|
||||
"tag": "NNP",
|
||||
"orth": "Oct.",
|
||||
"ner": "B-DATE",
|
||||
"id": 2
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "nummod",
|
||||
"tag": "CD",
|
||||
"orth": "19",
|
||||
"ner": "L-DATE",
|
||||
"id": 3
|
||||
},
|
||||
{
|
||||
"head": -4,
|
||||
"dep": "pobj",
|
||||
"tag": "NN",
|
||||
"orth": "review",
|
||||
"ner": "O",
|
||||
"id": 4
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "of",
|
||||
"ner": "O",
|
||||
"id": 5
|
||||
},
|
||||
{
|
||||
"head": 2,
|
||||
"dep": "punct",
|
||||
"tag": "``",
|
||||
"orth": "``",
|
||||
"ner": "O",
|
||||
"id": 6
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "det",
|
||||
"tag": "DT",
|
||||
"orth": "The",
|
||||
"ner": "B-WORK_OF_ART",
|
||||
"id": 7
|
||||
},
|
||||
{
|
||||
"head": -3,
|
||||
"dep": "pobj",
|
||||
"tag": "NN",
|
||||
"orth": "Misanthrope",
|
||||
"ner": "L-WORK_OF_ART",
|
||||
"id": 8
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "punct",
|
||||
"tag": "''",
|
||||
"orth": "''",
|
||||
"ner": "O",
|
||||
"id": 9
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "at",
|
||||
"ner": "O",
|
||||
"id": 10
|
||||
},
|
||||
{
|
||||
"head": 3,
|
||||
"dep": "poss",
|
||||
"tag": "NNP",
|
||||
"orth": "Chicago",
|
||||
"ner": "U-GPE",
|
||||
"id": 11
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "case",
|
||||
"tag": "POS",
|
||||
"orth": "'s",
|
||||
"ner": "O",
|
||||
"id": 12
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "compound",
|
||||
"tag": "NNP",
|
||||
"orth": "Goodman",
|
||||
"ner": "B-FAC",
|
||||
"id": 13
|
||||
},
|
||||
{
|
||||
"head": -4,
|
||||
"dep": "pobj",
|
||||
"tag": "NNP",
|
||||
"orth": "Theatre",
|
||||
"ner": "L-FAC",
|
||||
"id": 14
|
||||
},
|
||||
{
|
||||
"head": 4,
|
||||
"dep": "punct",
|
||||
"tag": "-LRB-",
|
||||
"orth": "(",
|
||||
"ner": "O",
|
||||
"id": 15
|
||||
},
|
||||
{
|
||||
"head": 3,
|
||||
"dep": "punct",
|
||||
"tag": "``",
|
||||
"orth": "``",
|
||||
"ner": "O",
|
||||
"id": 16
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "amod",
|
||||
"tag": "VBN",
|
||||
"orth": "Revitalized",
|
||||
"ner": "B-WORK_OF_ART",
|
||||
"id": 17
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "nsubj",
|
||||
"tag": "NNS",
|
||||
"orth": "Classics",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 18
|
||||
},
|
||||
{
|
||||
"head": -15,
|
||||
"dep": "appos",
|
||||
"tag": "VBP",
|
||||
"orth": "Take",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 19
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "det",
|
||||
"tag": "DT",
|
||||
"orth": "the",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 20
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "dobj",
|
||||
"tag": "NN",
|
||||
"orth": "Stage",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 21
|
||||
},
|
||||
{
|
||||
"head": -3,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "in",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 22
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "compound",
|
||||
"tag": "NNP",
|
||||
"orth": "Windy",
|
||||
"ner": "I-WORK_OF_ART",
|
||||
"id": 23
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "pobj",
|
||||
"tag": "NNP",
|
||||
"orth": "City",
|
||||
"ner": "L-WORK_OF_ART",
|
||||
"id": 24
|
||||
},
|
||||
{
|
||||
"head": -6,
|
||||
"dep": "punct",
|
||||
"tag": ",",
|
||||
"orth": ",",
|
||||
"ner": "O",
|
||||
"id": 25
|
||||
},
|
||||
{
|
||||
"head": -7,
|
||||
"dep": "punct",
|
||||
"tag": "''",
|
||||
"orth": "''",
|
||||
"ner": "O",
|
||||
"id": 26
|
||||
},
|
||||
{
|
||||
"head": -8,
|
||||
"dep": "npadvmod",
|
||||
"tag": "NN",
|
||||
"orth": "Leisure",
|
||||
"ner": "B-ORG",
|
||||
"id": 27
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "cc",
|
||||
"tag": "CC",
|
||||
"orth": "&",
|
||||
"ner": "I-ORG",
|
||||
"id": 28
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "conj",
|
||||
"tag": "NNS",
|
||||
"orth": "Arts",
|
||||
"ner": "L-ORG",
|
||||
"id": 29
|
||||
},
|
||||
{
|
||||
"head": -11,
|
||||
"dep": "punct",
|
||||
"tag": "-RRB-",
|
||||
"orth": ")",
|
||||
"ner": "O",
|
||||
"id": 30
|
||||
},
|
||||
{
|
||||
"head": 13,
|
||||
"dep": "punct",
|
||||
"tag": ",",
|
||||
"orth": ",",
|
||||
"ner": "O",
|
||||
"id": 31
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "det",
|
||||
"tag": "DT",
|
||||
"orth": "the",
|
||||
"ner": "O",
|
||||
"id": 32
|
||||
},
|
||||
{
|
||||
"head": 11,
|
||||
"dep": "nsubjpass",
|
||||
"tag": "NN",
|
||||
"orth": "role",
|
||||
"ner": "O",
|
||||
"id": 33
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "of",
|
||||
"ner": "O",
|
||||
"id": 34
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "pobj",
|
||||
"tag": "NNP",
|
||||
"orth": "Celimene",
|
||||
"ner": "U-PERSON",
|
||||
"id": 35
|
||||
},
|
||||
{
|
||||
"head": -3,
|
||||
"dep": "punct",
|
||||
"tag": ",",
|
||||
"orth": ",",
|
||||
"ner": "O",
|
||||
"id": 36
|
||||
},
|
||||
{
|
||||
"head": -4,
|
||||
"dep": "acl",
|
||||
"tag": "VBN",
|
||||
"orth": "played",
|
||||
"ner": "O",
|
||||
"id": 37
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "agent",
|
||||
"tag": "IN",
|
||||
"orth": "by",
|
||||
"ner": "O",
|
||||
"id": 38
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "compound",
|
||||
"tag": "NNP",
|
||||
"orth": "Kim",
|
||||
"ner": "B-PERSON",
|
||||
"id": 39
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "pobj",
|
||||
"tag": "NNP",
|
||||
"orth": "Cattrall",
|
||||
"ner": "L-PERSON",
|
||||
"id": 40
|
||||
},
|
||||
{
|
||||
"head": -8,
|
||||
"dep": "punct",
|
||||
"tag": ",",
|
||||
"orth": ",",
|
||||
"ner": "O",
|
||||
"id": 41
|
||||
},
|
||||
{
|
||||
"head": 2,
|
||||
"dep": "auxpass",
|
||||
"tag": "VBD",
|
||||
"orth": "was",
|
||||
"ner": "O",
|
||||
"id": 42
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "advmod",
|
||||
"tag": "RB",
|
||||
"orth": "mistakenly",
|
||||
"ner": "O",
|
||||
"id": 43
|
||||
},
|
||||
{
|
||||
"head": 0,
|
||||
"dep": "root",
|
||||
"tag": "VBN",
|
||||
"orth": "attributed",
|
||||
"ner": "O",
|
||||
"id": 44
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "prep",
|
||||
"tag": "IN",
|
||||
"orth": "to",
|
||||
"ner": "O",
|
||||
"id": 45
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "compound",
|
||||
"tag": "NNP",
|
||||
"orth": "Christina",
|
||||
"ner": "B-PERSON",
|
||||
"id": 46
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "pobj",
|
||||
"tag": "NNP",
|
||||
"orth": "Haag",
|
||||
"ner": "L-PERSON",
|
||||
"id": 47
|
||||
},
|
||||
{
|
||||
"head": -4,
|
||||
"dep": "punct",
|
||||
"tag": ".",
|
||||
"orth": ".",
|
||||
"ner": "O",
|
||||
"id": 48
|
||||
}
|
||||
],
|
||||
"brackets": [
|
||||
{
|
||||
"first": 2,
|
||||
"last": 3,
|
||||
"label": "NML"
|
||||
},
|
||||
{
|
||||
"first": 1,
|
||||
"last": 4,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 7,
|
||||
"last": 8,
|
||||
"label": "NP-TTL"
|
||||
},
|
||||
{
|
||||
"first": 11,
|
||||
"last": 12,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 11,
|
||||
"last": 14,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 10,
|
||||
"last": 14,
|
||||
"label": "PP-LOC"
|
||||
},
|
||||
{
|
||||
"first": 6,
|
||||
"last": 14,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 5,
|
||||
"last": 14,
|
||||
"label": "PP"
|
||||
},
|
||||
{
|
||||
"first": 1,
|
||||
"last": 14,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 17,
|
||||
"last": 18,
|
||||
"label": "NP-SBJ"
|
||||
},
|
||||
{
|
||||
"first": 20,
|
||||
"last": 21,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 23,
|
||||
"last": 24,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 22,
|
||||
"last": 24,
|
||||
"label": "PP-LOC"
|
||||
},
|
||||
{
|
||||
"first": 19,
|
||||
"last": 24,
|
||||
"label": "VP"
|
||||
},
|
||||
{
|
||||
"first": 17,
|
||||
"last": 24,
|
||||
"label": "S-HLN"
|
||||
},
|
||||
{
|
||||
"first": 27,
|
||||
"last": 29,
|
||||
"label": "NP-TMP"
|
||||
},
|
||||
{
|
||||
"first": 15,
|
||||
"last": 30,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 1,
|
||||
"last": 30,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 0,
|
||||
"last": 30,
|
||||
"label": "PP-LOC"
|
||||
},
|
||||
{
|
||||
"first": 32,
|
||||
"last": 33,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 35,
|
||||
"last": 35,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 34,
|
||||
"last": 35,
|
||||
"label": "PP"
|
||||
},
|
||||
{
|
||||
"first": 32,
|
||||
"last": 35,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 39,
|
||||
"last": 40,
|
||||
"label": "NP-LGS"
|
||||
},
|
||||
{
|
||||
"first": 38,
|
||||
"last": 40,
|
||||
"label": "PP"
|
||||
},
|
||||
{
|
||||
"first": 37,
|
||||
"last": 40,
|
||||
"label": "VP"
|
||||
},
|
||||
{
|
||||
"first": 32,
|
||||
"last": 41,
|
||||
"label": "NP-SBJ-2"
|
||||
},
|
||||
{
|
||||
"first": 43,
|
||||
"last": 43,
|
||||
"label": "ADVP-MNR"
|
||||
},
|
||||
{
|
||||
"first": 46,
|
||||
"last": 47,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 45,
|
||||
"last": 47,
|
||||
"label": "PP-CLR"
|
||||
},
|
||||
{
|
||||
"first": 44,
|
||||
"last": 47,
|
||||
"label": "VP"
|
||||
},
|
||||
{
|
||||
"first": 42,
|
||||
"last": 47,
|
||||
"label": "VP"
|
||||
},
|
||||
{
|
||||
"first": 0,
|
||||
"last": 48,
|
||||
"label": "S"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"tokens": [
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "compound",
|
||||
"tag": "NNP",
|
||||
"orth": "Ms.",
|
||||
"ner": "O",
|
||||
"id": 0
|
||||
},
|
||||
{
|
||||
"head": 1,
|
||||
"dep": "nsubj",
|
||||
"tag": "NNP",
|
||||
"orth": "Haag",
|
||||
"ner": "U-PERSON",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"head": 0,
|
||||
"dep": "root",
|
||||
"tag": "VBZ",
|
||||
"orth": "plays",
|
||||
"ner": "O",
|
||||
"id": 2
|
||||
},
|
||||
{
|
||||
"head": -1,
|
||||
"dep": "dobj",
|
||||
"tag": "NNP",
|
||||
"orth": "Elianti",
|
||||
"ner": "U-PERSON",
|
||||
"id": 3
|
||||
},
|
||||
{
|
||||
"head": -2,
|
||||
"dep": "punct",
|
||||
"tag": ".",
|
||||
"orth": ".",
|
||||
"ner": "O",
|
||||
"id": 4
|
||||
}
|
||||
],
|
||||
"brackets": [
|
||||
{
|
||||
"first": 0,
|
||||
"last": 1,
|
||||
"label": "NP-SBJ"
|
||||
},
|
||||
{
|
||||
"first": 3,
|
||||
"last": 3,
|
||||
"label": "NP"
|
||||
},
|
||||
{
|
||||
"first": 2,
|
||||
"last": 3,
|
||||
"label": "VP"
|
||||
},
|
||||
{
|
||||
"first": 0,
|
||||
"last": 4,
|
||||
"label": "S"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
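In this JSON training format the "head" value of a token is a relative offset to its head (how many tokens forward or back), not an absolute index, and the root points at itself with 0. A hedged sketch of resolving the offsets to absolute indices, assuming the file is saved as training-data.json:

import json

with open('training-data.json') as file_:
    corpus = json.load(file_)

tokens = corpus[0]['paragraphs'][0]['sentences'][0]['tokens']
for token in tokens[:5]:
    head = token['id'] + token['head']  # absolute index of the head token
    print(token['orth'], token['dep'], tokens[head]['orth'])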
21
examples/training/vocab-data.jsonl
Normal file
21
examples/training/vocab-data.jsonl
Normal file
|
@ -0,0 +1,21 @@
|
|||
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
|
||||
{"orth": ".", "id": 1, "lower": ".", "norm": ".", "shape": ".", "prefix": ".", "suffix": ".", "length": 1, "cluster": "8", "prob": -3.0678977966308594, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": ",", "id": 2, "lower": ",", "norm": ",", "shape": ",", "prefix": ",", "suffix": ",", "length": 1, "cluster": "4", "prob": -3.4549596309661865, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "the", "id": 3, "lower": "the", "norm": "the", "shape": "xxx", "prefix": "t", "suffix": "the", "length": 3, "cluster": "11", "prob": -3.528766632080078, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "I", "id": 4, "lower": "i", "norm": "I", "shape": "X", "prefix": "I", "suffix": "I", "length": 1, "cluster": "346", "prob": -3.791565179824829, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": false, "is_title": true, "is_upper": true, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "to", "id": 5, "lower": "to", "norm": "to", "shape": "xx", "prefix": "t", "suffix": "to", "length": 2, "cluster": "12", "prob": -3.8560216426849365, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "a", "id": 6, "lower": "a", "norm": "a", "shape": "x", "prefix": "a", "suffix": "a", "length": 1, "cluster": "19", "prob": -3.92978835105896, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "and", "id": 7, "lower": "and", "norm": "and", "shape": "xxx", "prefix": "a", "suffix": "and", "length": 3, "cluster": "20", "prob": -4.113108158111572, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "of", "id": 8, "lower": "of", "norm": "of", "shape": "xx", "prefix": "o", "suffix": "of", "length": 2, "cluster": "28", "prob": -4.27587366104126, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "you", "id": 9, "lower": "you", "norm": "you", "shape": "xxx", "prefix": "y", "suffix": "you", "length": 3, "cluster": "602", "prob": -4.373791217803955, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "it", "id": 10, "lower": "it", "norm": "it", "shape": "xx", "prefix": "i", "suffix": "it", "length": 2, "cluster": "474", "prob": -4.388050079345703, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "is", "id": 11, "lower": "is", "norm": "is", "shape": "xx", "prefix": "i", "suffix": "is", "length": 2, "cluster": "762", "prob": -4.457748889923096, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "that", "id": 12, "lower": "that", "norm": "that", "shape": "xxxx", "prefix": "t", "suffix": "hat", "length": 4, "cluster": "84", "prob": -4.464504718780518, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "\n\n", "id": 0, "lower": "\n\n", "norm": "\n\n", "shape": "\n\n", "prefix": "\n", "suffix": "\n\n", "length": 2, "cluster": "0", "prob": -4.606560707092285, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "in", "id": 13, "lower": "in", "norm": "in", "shape": "xx", "prefix": "i", "suffix": "in", "length": 2, "cluster": "60", "prob": -4.619071960449219, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "'s", "id": 14, "lower": "'s", "norm": "'s", "shape": "'x", "prefix": "'", "suffix": "'s", "length": 2, "cluster": "52", "prob": -4.830559253692627, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "n't", "id": 15, "lower": "n't", "norm": "n't", "shape": "x'x", "prefix": "n", "suffix": "n't", "length": 3, "cluster": "74", "prob": -4.859938621520996, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "for", "id": 16, "lower": "for", "norm": "for", "shape": "xxx", "prefix": "f", "suffix": "for", "length": 3, "cluster": "508", "prob": -4.8801093101501465, "is_alpha": true, "is_ascii": true, "is_digit": false, "is_lower": true, "is_punct": false, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": "\"", "id": 17, "lower": "\"", "norm": "\"", "shape": "\"", "prefix": "\"", "suffix": "\"", "length": 1, "cluster": "0", "prob": -5.02677583694458, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": true, "is_left_punct": true, "is_right_punct": true}
|
||||
{"orth": "?", "id": 18, "lower": "?", "norm": "?", "shape": "?", "prefix": "?", "suffix": "?", "length": 1, "cluster": "0", "prob": -5.05924654006958, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": true, "is_space": false, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
||||
{"orth": " ", "id": 0, "lower": " ", "norm": " ", "shape": " ", "prefix": " ", "suffix": " ", "length": 1, "cluster": "0", "prob": -5.129165172576904, "is_alpha": false, "is_ascii": true, "is_digit": false, "is_lower": false, "is_punct": false, "is_space": true, "is_title": false, "is_upper": false, "like_url": false, "like_num": false, "like_email": false, "is_stop": false, "is_oov": false, "is_quote": false, "is_left_punct": false, "is_right_punct": false}
|
|
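Every line of this JSONL file is a self-contained JSON object: the first line holds the language settings, and each following line describes one lexeme with its attributes. A hedged sketch of reading it back, assuming the file is saved as vocab-data.jsonl:

import json

with open('vocab-data.jsonl') as file_:
    records = [json.loads(line) for line in file_ if line.strip()]

settings, lexemes = records[0], records[1:]
print(settings['lang'], settings['settings']['oov_prob'])
for lex in lexemes[:3]:
    print(lex['orth'], lex['prob'], lex['is_punct'])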
@ -1,36 +0,0 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
import plac
|
||||
import codecs
|
||||
import pathlib
|
||||
import random
|
||||
|
||||
import twython
|
||||
import spacy.en
|
||||
|
||||
import _handler
|
||||
|
||||
|
||||
class Connection(twython.TwythonStreamer):
|
||||
def __init__(self, keys_dir, nlp, query):
|
||||
keys_dir = pathlib.Path(keys_dir)
|
||||
read = lambda fn: (keys_dir / (fn + '.txt')).open().read().strip()
|
||||
api_key = map(read, ['key', 'secret', 'token', 'token_secret'])
|
||||
twython.TwythonStreamer.__init__(self, *api_key)
|
||||
self.nlp = nlp
|
||||
self.query = query
|
||||
|
||||
def on_success(self, data):
|
||||
_handler.handle_tweet(self.nlp, data, self.query)
|
||||
if random.random() >= 0.1:
|
||||
reload(_handler)
|
||||
|
||||
|
||||
def main(keys_dir, term):
|
||||
nlp = spacy.en.English()
|
||||
twitter = Connection(keys_dir, nlp, term)
|
||||
twitter.statuses.filter(track=term, language='en')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
35
examples/vectors_fast_text.py
Normal file
35
examples/vectors_fast_text.py
Normal file
|
@ -0,0 +1,35 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
"""Load vectors for a language trained using fastText
|
||||
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
|
||||
Compatible with: spaCy v2.0.0+
|
||||
"""
|
||||
from __future__ import unicode_literals
|
||||
import plac
|
||||
import numpy
|
||||
|
||||
from spacy.language import Language
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
vectors_loc=("Path to vectors", "positional", None, str))
|
||||
def main(vectors_loc):
|
||||
nlp = Language() # start off with a blank Language class
|
||||
with open(vectors_loc, 'rb') as file_:
|
||||
header = file_.readline()
|
||||
nr_row, nr_dim = header.split()
|
||||
nlp.vocab.clear_vectors(int(nr_dim))
|
||||
for line in file_:
|
||||
line = line.decode('utf8')
|
||||
pieces = line.split()
|
||||
word = pieces[0]
|
||||
vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
|
||||
nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
|
||||
# test the vectors and similarity
|
||||
text = 'class colspan'
|
||||
doc = nlp(text)
|
||||
print(text, doc[0].similarity(doc[1]))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
plac.call(main)
|
|
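Once loaded this way, the vectors live on the shared vocab, so any Doc built from it can compare tokens. A hedged, self-contained sketch of the same calls with tiny made-up vectors instead of a real fastText file:

import numpy
from spacy.language import Language

nlp = Language()                      # blank Language, as in the script above
nlp.vocab.clear_vectors(3)            # a toy 3-dimensional vector space
nlp.vocab.set_vector('class', numpy.asarray([1., 0., 0.], dtype='f'))
nlp.vocab.set_vector('colspan', numpy.asarray([1., 0.1, 0.], dtype='f'))
doc = nlp('class colspan')
print(doc[0].similarity(doc[1]))      # close to 1.0 for these made-up vectors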
@ -1,17 +1,19 @@
|
|||
cython<0.24
|
||||
cython>=0.24,<0.27.0
|
||||
pathlib
|
||||
numpy>=1.7
|
||||
cymem>=1.30,<1.32
|
||||
preshed>=1.0.0,<2.0.0
|
||||
thinc>=6.5.0,<6.6.0
|
||||
murmurhash>=0.26,<0.27
|
||||
thinc>=6.10.0,<6.11.0
|
||||
murmurhash>=0.28,<0.29
|
||||
plac<1.0.0,>=0.9.6
|
||||
six
|
||||
html5lib==1.0b8
|
||||
ujson>=1.35
|
||||
dill>=0.2,<0.3
|
||||
requests>=2.11.0,<3.0.0
|
||||
regex>=2017.4.1,<2017.12.1
|
||||
requests>=2.13.0,<3.0.0
|
||||
regex==2017.4.5
|
||||
ftfy>=4.4.2,<5.0.0
|
||||
pytest>=3.0.6,<4.0.0
|
||||
pip>=9.0.0,<10.0.0
|
||||
mock>=2.0.0,<3.0.0
|
||||
msgpack-python
|
||||
msgpack-numpy
|
||||
html5lib==1.0b8
|
||||
|
|
32
setup.py
32
setup.py
|
@ -24,37 +24,31 @@ MOD_NAMES = [
|
|||
'spacy.vocab',
|
||||
'spacy.attrs',
|
||||
'spacy.morphology',
|
||||
'spacy.tagger',
|
||||
'spacy.pipeline',
|
||||
'spacy.syntax.stateclass',
|
||||
'spacy.syntax._state',
|
||||
'spacy.syntax._beam_utils',
|
||||
'spacy.tokenizer',
|
||||
'spacy.syntax.parser',
|
||||
'spacy.syntax.beam_parser',
|
||||
'spacy.syntax.nn_parser',
|
||||
'spacy.syntax.nonproj',
|
||||
'spacy.syntax.transition_system',
|
||||
'spacy.syntax.arc_eager',
|
||||
'spacy.syntax._parse_features',
|
||||
'spacy.gold',
|
||||
'spacy.orth',
|
||||
'spacy.tokens.doc',
|
||||
'spacy.tokens.span',
|
||||
'spacy.tokens.token',
|
||||
'spacy.serialize.packer',
|
||||
'spacy.serialize.huffman',
|
||||
'spacy.serialize.bits',
|
||||
'spacy.cfile',
|
||||
'spacy.matcher',
|
||||
'spacy.syntax.ner',
|
||||
'spacy.symbols',
|
||||
'spacy.syntax.iterators']
|
||||
# TODO: This is missing a lot of modules. Does it matter?
|
||||
'spacy.vectors',
|
||||
]
|
||||
|
||||
|
||||
COMPILE_OPTIONS = {
|
||||
'msvc': ['/Ox', '/EHsc'],
|
||||
'mingw32' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'],
|
||||
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function']
|
||||
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function',
|
||||
'-march=native']
|
||||
}
|
||||
|
||||
|
||||
|
@ -67,7 +61,7 @@ LINK_OPTIONS = {
|
|||
|
||||
# I don't understand this very well yet. See Issue #267
|
||||
# Fingers crossed!
|
||||
USE_OPENMP_DEFAULT = '1' if sys.platform != 'darwin' else None
|
||||
USE_OPENMP_DEFAULT = '0' if sys.platform != 'darwin' else None
|
||||
if os.environ.get('USE_OPENMP', USE_OPENMP_DEFAULT) == '1':
|
||||
if sys.platform == 'darwin':
|
||||
COMPILE_OPTIONS['other'].append('-fopenmp')
|
||||
|
@ -190,21 +184,23 @@ def setup_package():
|
|||
url=about['__uri__'],
|
||||
license=about['__license__'],
|
||||
ext_modules=ext_modules,
|
||||
scripts=['bin/spacy'],
|
||||
install_requires=[
|
||||
'numpy>=1.7',
|
||||
'murmurhash>=0.26,<0.27',
|
||||
'murmurhash>=0.28,<0.29',
|
||||
'cymem>=1.30,<1.32',
|
||||
'preshed>=1.0.0,<2.0.0',
|
||||
'thinc>=6.5.0,<6.6.0',
|
||||
'thinc>=6.10.0,<6.11.0',
|
||||
'plac<1.0.0,>=0.9.6',
|
||||
'pip>=9.0.0,<10.0.0',
|
||||
'six',
|
||||
'pathlib',
|
||||
'ujson>=1.35',
|
||||
'dill>=0.2,<0.3',
|
||||
'requests>=2.13.0,<3.0.0',
|
||||
'regex>=2017.4.1,<2017.12.1',
|
||||
'ftfy>=4.4.2,<5.0.0'],
|
||||
'regex==2017.4.5',
|
||||
'ftfy>=4.4.2,<5.0.0',
|
||||
'msgpack-python',
|
||||
'msgpack-numpy'],
|
||||
classifiers=[
|
||||
'Development Status :: 5 - Production/Stable',
|
||||
'Environment :: Console',
|
||||
|
|
|
@ -1,43 +1,28 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from . import util
|
||||
from .deprecated import resolve_model_name
|
||||
from .cli.info import info
|
||||
from .cli.info import info as cli_info
|
||||
from .glossary import explain
|
||||
from .about import __version__
|
||||
|
||||
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he, nb, ja, th, ru
|
||||
|
||||
|
||||
_languages = (en.English, de.German, es.Spanish, pt.Portuguese, fr.French,
|
||||
it.Italian, hu.Hungarian, zh.Chinese, nl.Dutch, sv.Swedish,
|
||||
fi.Finnish, bn.Bengali, he.Hebrew, nb.Norwegian, ja.Japanese,
|
||||
th.Thai, ru.Russian)
|
||||
|
||||
|
||||
for _lang in _languages:
|
||||
util.set_lang_class(_lang.lang, _lang)
|
||||
from . import util
|
||||
|
||||
|
||||
def load(name, **overrides):
|
||||
if overrides.get('path') in (None, False, True):
|
||||
data_path = util.get_data_path()
|
||||
model_name = resolve_model_name(name)
|
||||
model_path = data_path / model_name
|
||||
if not model_path.exists():
|
||||
lang_name = util.get_lang_class(name).lang
|
||||
model_path = None
|
||||
util.print_msg(
|
||||
"Only loading the '{}' tokenizer.".format(lang_name),
|
||||
title="Warning: no model found for '{}'".format(name))
|
||||
else:
|
||||
model_path = util.ensure_path(overrides['path'])
|
||||
data_path = model_path.parent
|
||||
model_name = ''
|
||||
meta = util.parse_package_meta(data_path, model_name, require=False)
|
||||
lang = meta['lang'] if meta and 'lang' in meta else name
|
||||
cls = util.get_lang_class(lang)
|
||||
overrides['meta'] = meta
|
||||
overrides['path'] = model_path
|
||||
return cls(**overrides)
|
||||
depr_path = overrides.get('path')
|
||||
if depr_path not in (True, False, None):
|
||||
util.deprecated(
|
||||
"As of spaCy v2.0, the keyword argument `path=` is deprecated. "
|
||||
"You can now call spacy.load with the path as its first argument, "
|
||||
"and the model's meta.json will be used to determine the language "
|
||||
"to load. For example:\nnlp = spacy.load('{}')".format(depr_path),
|
||||
'error')
|
||||
return util.load_model(name, **overrides)
|
||||
|
||||
|
||||
def blank(name, **kwargs):
|
||||
LangClass = util.get_lang_class(name)
|
||||
return LangClass(**kwargs)
|
||||
|
||||
|
||||
def info(model=None, markdown=False):
|
||||
return cli_info(None, model, markdown)
|
||||
|
|
|
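The upshot of this rewrite is that spacy.load() now takes a shortcut link, package name or path directly as its first argument, and the old path= keyword raises a deprecation error instead of being honoured. A hedged sketch of the v2-style calls (the local directory is hypothetical):

import spacy

nlp = spacy.load('en')              # shortcut link or installed package name
nlp = spacy.load('/path/to/model')  # hypothetical local model directory
nlp = spacy.blank('en')             # blank Language class with no model data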
@ -1,133 +1,35 @@
|
|||
# coding: utf8
|
||||
from __future__ import print_function
|
||||
# NB! This breaks in plac on Python 2!!
|
||||
#from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from spacy.cli import download as cli_download
|
||||
from spacy.cli import link as cli_link
|
||||
from spacy.cli import info as cli_info
|
||||
from spacy.cli import package as cli_package
|
||||
from spacy.cli import train as cli_train
|
||||
from spacy.cli import model as cli_model
|
||||
from spacy.cli import convert as cli_convert
|
||||
|
||||
|
||||
class CLI(object):
|
||||
"""
|
||||
Command-line interface for spaCy
|
||||
"""
|
||||
commands = ('download', 'link', 'info', 'package', 'train', 'model', 'convert')
|
||||
|
||||
@plac.annotations(
|
||||
model=("model to download (shortcut or model name)", "positional", None, str),
|
||||
direct=("force direct download. Needs model name with version and won't "
|
||||
"perform compatibility check", "flag", "d", bool)
|
||||
)
|
||||
def download(self, model=None, direct=False):
|
||||
"""
|
||||
Download compatible model from default download path using pip. Model
|
||||
can be shortcut, model name or, if --direct flag is set, full model name
|
||||
with version.
|
||||
"""
|
||||
cli_download(model, direct)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
origin=("package name or local path to model", "positional", None, str),
|
||||
link_name=("name of shortuct link to create", "positional", None, str),
|
||||
force=("force overwriting of existing link", "flag", "f", bool)
|
||||
)
|
||||
def link(self, origin, link_name, force=False):
|
||||
"""
|
||||
Create a symlink for models within the spacy/data directory. Accepts
|
||||
either the name of a pip package, or the local path to the model data
|
||||
directory. Linking models allows loading them via spacy.load(link_name).
|
||||
"""
|
||||
cli_link(origin, link_name, force)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("optional: shortcut link of model", "positional", None, str),
|
||||
markdown=("generate Markdown for GitHub issues", "flag", "md", str)
|
||||
)
|
||||
def info(self, model=None, markdown=False):
|
||||
"""
|
||||
Print info about spaCy installation. If a model shortcut link is
|
||||
specified as an argument, print model information. Flag --markdown
|
||||
prints details in Markdown for easy copy-pasting to GitHub issues.
|
||||
"""
|
||||
cli_info(model, markdown)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
input_dir=("directory with model data", "positional", None, str),
|
||||
output_dir=("output parent directory", "positional", None, str),
|
||||
meta=("path to meta.json", "option", "m", str),
|
||||
force=("force overwriting of existing folder in output directory", "flag", "f", bool)
|
||||
)
|
||||
def package(self, input_dir, output_dir, meta=None, force=False):
|
||||
"""
|
||||
Generate Python package for model data, including meta and required
|
||||
installation files. A new directory will be created in the specified
|
||||
output directory, and model data will be copied over.
|
||||
"""
|
||||
cli_package(input_dir, output_dir, meta, force)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
output_dir=("output directory to store model in", "positional", None, str),
|
||||
train_data=("location of JSON-formatted training data", "positional", None, str),
|
||||
dev_data=("location of JSON-formatted development data (optional)", "positional", None, str),
|
||||
n_iter=("number of iterations", "option", "n", int),
|
||||
parser_L1=("L1 regularization penalty for parser", "option", "L", float),
|
||||
no_tagger=("Don't train tagger", "flag", "T", bool),
|
||||
no_parser=("Don't train parser", "flag", "P", bool),
|
||||
no_ner=("Don't train NER", "flag", "N", bool)
|
||||
)
|
||||
def train(self, lang, output_dir, train_data, dev_data=None, n_iter=15,
|
||||
parser_L1=0.0, no_tagger=False, no_parser=False, no_ner=False):
|
||||
"""
|
||||
Train a model. Expects data in spaCy's JSON format.
|
||||
"""
|
||||
cli_train(lang, output_dir, train_data, dev_data, n_iter, not no_tagger,
|
||||
not no_parser, not no_ner, parser_L1)
|
||||
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
model_dir=("output directory to store model in", "positional", None, str),
|
||||
freqs_data=("tab-separated frequencies file", "positional", None, str),
|
||||
clusters_data=("Brown clusters file", "positional", None, str),
|
||||
vectors_data=("word vectors file", "positional", None, str)
|
||||
)
|
||||
def model(self, lang, model_dir, freqs_data, clusters_data=None, vectors_data=None):
|
||||
"""
|
||||
Initialize a new model and its data directory.
|
||||
"""
|
||||
cli_model(lang, model_dir, freqs_data, clusters_data, vectors_data)
|
||||
|
||||
@plac.annotations(
|
||||
input_file=("input file", "positional", None, str),
|
||||
output_dir=("output directory for converted file", "positional", None, str),
|
||||
n_sents=("Number of sentences per doc", "option", "n", float),
|
||||
morphology=("Enable appending morphology to tags", "flag", "m", bool)
|
||||
)
|
||||
def convert(self, input_file, output_dir, n_sents=10, morphology=False):
|
||||
"""
|
||||
Convert files into JSON format for use with train command and other
|
||||
experiment management functions.
|
||||
"""
|
||||
cli_convert(input_file, output_dir, n_sents, morphology)
|
||||
|
||||
|
||||
def __missing__(self, name):
|
||||
print("\n Command %r does not exist."
|
||||
"\n Use the --help flag for a list of available commands.\n" % name)
|
||||
|
||||
# from __future__ import unicode_literals
|
||||
|
||||
if __name__ == '__main__':
|
||||
import plac
|
||||
import sys
|
||||
sys.argv[0] = 'spacy'
|
||||
plac.Interpreter.call(CLI)
|
||||
from spacy.cli import download, link, info, package, train, convert
|
||||
from spacy.cli import vocab, profile, evaluate, validate
|
||||
from spacy.util import prints
|
||||
|
||||
commands = {
|
||||
'download': download,
|
||||
'link': link,
|
||||
'info': info,
|
||||
'train': train,
|
||||
'evaluate': evaluate,
|
||||
'convert': convert,
|
||||
'package': package,
|
||||
'vocab': vocab,
|
||||
'profile': profile,
|
||||
'validate': validate
|
||||
}
|
||||
if len(sys.argv) == 1:
|
||||
prints(', '.join(commands), title="Available commands", exits=1)
|
||||
command = sys.argv.pop(1)
|
||||
sys.argv[0] = 'spacy %s' % command
|
||||
if command in commands:
|
||||
plac.call(commands[command])
|
||||
else:
|
||||
prints(
|
||||
"Available: %s" % ', '.join(commands),
|
||||
title="Unknown command: %s" % command,
|
||||
exits=1)
|
||||
|
|
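The new __main__.py above replaces the plac.Interpreter-based CLI class with a plain dictionary dispatch: the first argument names a command, and plac.call() maps the remaining arguments onto the chosen function's annotated signature. A self-contained sketch of the same pattern, with a made-up 'greet' command standing in for spaCy's real commands:

import sys
import plac

@plac.annotations(
    name=("person to greet", "positional", None, str),
    loud=("shout the greeting", "flag", "l", bool))
def greet(name, loud=False):
    """Print a greeting."""
    msg = "Hello, %s!" % name
    print(msg.upper() if loud else msg)

commands = {'greet': greet}

if __name__ == '__main__':
    if len(sys.argv) == 1 or sys.argv[1] not in commands:
        print("Available commands: %s" % ', '.join(commands))
        sys.exit(1)
    command = sys.argv.pop(1)
    sys.argv[0] = 'mytool %s' % command
    plac.call(commands[command])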
562
spacy/_ml.py
Normal file
|
@ -0,0 +1,562 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu
|
||||
from thinc.i2v import HashEmbed, StaticVectors
|
||||
from thinc.t2t import ExtractWindow, ParametricAttention
|
||||
from thinc.t2v import Pooling, sum_pool
|
||||
from thinc.misc import Residual
|
||||
from thinc.misc import LayerNorm as LN
|
||||
from thinc.api import add, layerize, chain, clone, concatenate, with_flatten
|
||||
from thinc.api import FeatureExtracter, with_getitem, flatten_add_lengths
|
||||
from thinc.api import uniqued, wrap, noop
|
||||
from thinc.linear.linear import LinearModel
|
||||
from thinc.neural.ops import NumpyOps, CupyOps
|
||||
from thinc.neural.util import get_array_module, copy_array
|
||||
from thinc.neural._lsuv import svd_orthonormal
|
||||
from thinc.neural.optimizers import Adam
|
||||
|
||||
from thinc import describe
|
||||
from thinc.describe import Dimension, Synapses, Biases, Gradient
|
||||
from thinc.neural._classes.affine import _set_dimensions_if_needed
|
||||
import thinc.extra.load_nlp
|
||||
|
||||
from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE
|
||||
from . import util
|
||||
|
||||
|
||||
VECTORS_KEY = 'spacy_pretrained_vectors'
|
||||
|
||||
|
||||
def cosine(vec1, vec2):
|
||||
xp = get_array_module(vec1)
|
||||
norm1 = xp.linalg.norm(vec1)
|
||||
norm2 = xp.linalg.norm(vec2)
|
||||
if norm1 == 0. or norm2 == 0.:
|
||||
return 0
|
||||
else:
|
||||
return vec1.dot(vec2) / (norm1 * norm2)
|
||||
|
||||
|
||||
def create_default_optimizer(ops, **cfg):
|
||||
learn_rate = util.env_opt('learn_rate', 0.001)
|
||||
beta1 = util.env_opt('optimizer_B1', 0.9)
|
||||
beta2 = util.env_opt('optimizer_B2', 0.999)
|
||||
eps = util.env_opt('optimizer_eps', 1e-08)
|
||||
L2 = util.env_opt('L2_penalty', 1e-6)
|
||||
max_grad_norm = util.env_opt('grad_norm_clip', 1.)
|
||||
optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1,
|
||||
beta2=beta2, eps=eps)
|
||||
optimizer.max_grad_norm = max_grad_norm
|
||||
optimizer.device = ops.device
|
||||
return optimizer
|
||||
|
||||
@layerize
|
||||
def _flatten_add_lengths(seqs, pad=0, drop=0.):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=pad)
|
||||
|
||||
X = ops.flatten(seqs, pad=pad)
|
||||
return (X, lengths), finish_update
|
||||
|
||||
|
||||
@layerize
|
||||
def _logistic(X, drop=0.):
|
||||
xp = get_array_module(X)
|
||||
if not isinstance(X, xp.ndarray):
|
||||
X = xp.asarray(X)
|
||||
# Clip to range (-10, 10)
|
||||
X = xp.minimum(X, 10., X)
|
||||
X = xp.maximum(X, -10., X)
|
||||
Y = 1. / (1. + xp.exp(-X))
|
||||
|
||||
def logistic_bwd(dY, sgd=None):
|
||||
dX = dY * (Y * (1-Y))
|
||||
return dX
|
||||
|
||||
return Y, logistic_bwd
|
||||
|
||||
|
||||
def _zero_init(model):
|
||||
def _zero_init_impl(self, X, y):
|
||||
self.W.fill(0)
|
||||
model.on_data_hooks.append(_zero_init_impl)
|
||||
if model.W is not None:
|
||||
model.W.fill(0.)
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def _preprocess_doc(docs, drop=0.):
|
||||
keys = [doc.to_array([LOWER]) for doc in docs]
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([arr.shape[0] for arr in keys])
|
||||
keys = ops.xp.concatenate(keys)
|
||||
vals = ops.allocate(keys.shape[0]) + 1
|
||||
return (keys, vals, lengths), None
|
||||
|
||||
|
||||
@describe.on_data(_set_dimensions_if_needed,
|
||||
lambda model, X, y: model.init_weights(model))
|
||||
@describe.attributes(
|
||||
nI=Dimension("Input size"),
|
||||
nF=Dimension("Number of features"),
|
||||
nO=Dimension("Output size"),
|
||||
nP=Dimension("Maxout pieces"),
|
||||
W=Synapses("Weights matrix",
|
||||
lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)),
|
||||
b=Biases("Bias vector",
|
||||
lambda obj: (obj.nO, obj.nP)),
|
||||
pad=Synapses("Pad",
|
||||
lambda obj: (1, obj.nF, obj.nO, obj.nP),
|
||||
lambda M, ops: ops.normal_init(M, 1.)),
|
||||
d_W=Gradient("W"),
|
||||
d_pad=Gradient("pad"),
|
||||
d_b=Gradient("b"))
|
||||
class PrecomputableAffine(Model):
|
||||
def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs):
|
||||
Model.__init__(self, **kwargs)
|
||||
self.nO = nO
|
||||
self.nP = nP
|
||||
self.nI = nI
|
||||
self.nF = nF
|
||||
|
||||
def begin_update(self, X, drop=0.):
|
||||
Yf = self.ops.xp.dot(X,
|
||||
self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
|
||||
Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP))
|
||||
Yf = self._add_padding(Yf)
|
||||
|
||||
def backward(dY_ids, sgd=None):
|
||||
dY, ids = dY_ids
|
||||
dY, ids = self._backprop_padding(dY, ids)
|
||||
Xf = X[ids]
|
||||
Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI))
|
||||
|
||||
self.d_b += dY.sum(axis=0)
|
||||
dY = dY.reshape((dY.shape[0], self.nO*self.nP))
|
||||
|
||||
Wopfi = self.W.transpose((1, 2, 0, 3))
|
||||
Wopfi = self.ops.xp.ascontiguousarray(Wopfi)
|
||||
Wopfi = Wopfi.reshape((self.nO*self.nP, self.nF * self.nI))
|
||||
dXf = self.ops.dot(dY.reshape((dY.shape[0], self.nO*self.nP)), Wopfi)
|
||||
|
||||
# Reuse the buffer
|
||||
dWopfi = Wopfi; dWopfi.fill(0.)
|
||||
self.ops.xp.dot(dY.T, Xf, out=dWopfi)
|
||||
dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI))
|
||||
# (o, p, f, i) --> (f, o, p, i)
|
||||
self.d_W += dWopfi.transpose((2, 0, 1, 3))
|
||||
|
||||
if sgd is not None:
|
||||
sgd(self._mem.weights, self._mem.gradient, key=self.id)
|
||||
return dXf.reshape((dXf.shape[0], self.nF, self.nI))
|
||||
return Yf, backward
|
||||
|
||||
def _add_padding(self, Yf):
|
||||
Yf_padded = self.ops.xp.vstack((self.pad, Yf))
|
||||
return Yf_padded
|
||||
|
||||
def _backprop_padding(self, dY, ids):
|
||||
# (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0
|
||||
mask = ids < 0.
|
||||
mask = mask.sum(axis=1)
|
||||
d_pad = dY * mask.reshape((ids.shape[0], 1, 1))
|
||||
self.d_pad += d_pad.sum(axis=0)
|
||||
return dY, ids
|
||||
|
||||
@staticmethod
|
||||
def init_weights(model):
|
||||
'''This is like the 'layer sequential unit variance', but instead
|
||||
of taking the actual inputs, we randomly generate whitened data.
|
||||
|
||||
Why's this all so complicated? We have a huge number of inputs,
|
||||
and the maxout unit makes guessing the dynamics tricky. Instead
|
||||
we set the maxout weights to values that empirically result in
|
||||
whitened outputs given whitened inputs.
|
||||
'''
|
||||
if (model.W**2).sum() != 0.:
|
||||
return
|
||||
ops = model.ops
|
||||
xp = ops.xp
|
||||
ops.normal_init(model.W, model.nF * model.nI, inplace=True)
|
||||
|
||||
ids = ops.allocate((5000, model.nF), dtype='f')
|
||||
ids += xp.random.uniform(0, 1000, ids.shape)
|
||||
ids = ops.asarray(ids, dtype='i')
|
||||
tokvecs = ops.allocate((5000, model.nI), dtype='f')
|
||||
tokvecs += xp.random.normal(loc=0., scale=1.,
|
||||
size=tokvecs.size).reshape(tokvecs.shape)
|
||||
|
||||
def predict(ids, tokvecs):
|
||||
# nS ids. nW tokvecs
|
||||
hiddens = model(tokvecs) # (nW, f, o, p)
|
||||
# need nS vectors
|
||||
vectors = model.ops.allocate((ids.shape[0], model.nO, model.nP))
|
||||
for i, feats in enumerate(ids):
|
||||
for j, id_ in enumerate(feats):
|
||||
vectors[i] += hiddens[id_, j]
|
||||
vectors += model.b
|
||||
if model.nP >= 2:
|
||||
return model.ops.maxout(vectors)[0]
|
||||
else:
|
||||
return vectors * (vectors >= 0)
|
||||
|
||||
tol_var = 0.01
|
||||
tol_mean = 0.01
|
||||
t_max = 10
|
||||
t_i = 0
|
||||
for t_i in range(t_max):
|
||||
acts1 = predict(ids, tokvecs)
|
||||
var = model.ops.xp.var(acts1)
|
||||
mean = model.ops.xp.mean(acts1)
|
||||
if abs(var - 1.0) >= tol_var:
|
||||
model.W /= model.ops.xp.sqrt(var)
|
||||
elif abs(mean) >= tol_mean:
|
||||
model.b -= mean
|
||||
else:
|
||||
break
|
||||
|
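The init_weights routine above implements the idea described in its docstring: instead of running LSUV on real batches, it feeds randomly generated whitened data through the layer and rescales the weights (and shifts the bias) until the outputs are approximately whitened too. A numpy-only sketch of the same procedure for an ordinary affine layer (illustrative only, not the PrecomputableAffine code above):

import numpy

def lsuv_like_init(W, b, n_in, tol=0.01, max_iter=10):
    X = numpy.random.normal(0., 1., size=(5000, n_in))  # fake whitened inputs
    for _ in range(max_iter):
        Y = X.dot(W.T) + b
        var, mean = Y.var(), Y.mean()
        if abs(var - 1.0) >= tol:
            W /= numpy.sqrt(var)   # rescale weights towards unit output variance
        elif abs(mean) >= tol:
            b -= mean              # then shift the bias towards zero mean
        else:
            break
    return W, b

W, b = lsuv_like_init(numpy.random.normal(0., 1., (64, 128)), numpy.zeros(64), n_in=128)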
||||
|
||||
def link_vectors_to_models(vocab):
|
||||
vectors = vocab.vectors
|
||||
ops = Model.ops
|
||||
for word in vocab:
|
||||
if word.orth in vectors.key2row:
|
||||
word.rank = vectors.key2row[word.orth]
|
||||
else:
|
||||
word.rank = 0
|
||||
data = ops.asarray(vectors.data)
|
||||
# Set an entry here, so that vectors are accessed by StaticVectors
|
||||
# (unideal, I know)
|
||||
thinc.extra.load_nlp.VECTORS[(ops.device, VECTORS_KEY)] = data
|
||||
|
||||
|
||||
def Tok2Vec(width, embed_size, **kwargs):
|
||||
pretrained_dims = kwargs.get('pretrained_dims', 0)
|
||||
cnn_maxout_pieces = kwargs.get('cnn_maxout_pieces', 2)
|
||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
with Model.define_operators({'>>': chain, '|': concatenate, '**': clone,
|
||||
'+': add, '*': reapply}):
|
||||
norm = HashEmbed(width, embed_size, column=cols.index(NORM),
|
||||
name='embed_norm')
|
||||
prefix = HashEmbed(width, embed_size//2, column=cols.index(PREFIX),
|
||||
name='embed_prefix')
|
||||
suffix = HashEmbed(width, embed_size//2, column=cols.index(SUFFIX),
|
||||
name='embed_suffix')
|
||||
shape = HashEmbed(width, embed_size//2, column=cols.index(SHAPE),
|
||||
name='embed_shape')
|
||||
if pretrained_dims is not None and pretrained_dims >= 1:
|
||||
glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
|
||||
|
||||
embed = uniqued(
|
||||
(glove | norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width*5, pieces=3)), column=5)
|
||||
else:
|
||||
embed = uniqued(
|
||||
(norm | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width*4, pieces=3)), column=5)
|
||||
|
||||
convolution = Residual(
|
||||
ExtractWindow(nW=1)
|
||||
>> LN(Maxout(width, width*3, pieces=cnn_maxout_pieces))
|
||||
)
|
||||
|
||||
tok2vec = (
|
||||
FeatureExtracter(cols)
|
||||
>> with_flatten(
|
||||
embed
|
||||
>> convolution ** 4, pad=4
|
||||
)
|
||||
)
|
||||
# Work around thinc API limitations :(. TODO: Revise in Thinc 7
|
||||
tok2vec.nO = width
|
||||
tok2vec.embed = embed
|
||||
return tok2vec
|
||||
|
||||
|
||||
def reapply(layer, n_times):
|
||||
def reapply_fwd(X, drop=0.):
|
||||
backprops = []
|
||||
for i in range(n_times):
|
||||
Y, backprop = layer.begin_update(X, drop=drop)
|
||||
X = Y
|
||||
backprops.append(backprop)
|
||||
|
||||
def reapply_bwd(dY, sgd=None):
|
||||
dX = None
|
||||
for backprop in reversed(backprops):
|
||||
dY = backprop(dY, sgd=sgd)
|
||||
if dX is None:
|
||||
dX = dY
|
||||
else:
|
||||
dX += dY
|
||||
return dX
|
||||
|
||||
return Y, reapply_bwd
|
||||
return wrap(reapply_fwd, layer)
|
||||
|
||||
|
||||
def asarray(ops, dtype):
|
||||
def forward(X, drop=0.):
|
||||
return ops.asarray(X, dtype=dtype), None
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
def _divide_array(X, size):
|
||||
parts = []
|
||||
index = 0
|
||||
while index < len(X):
|
||||
parts.append(X[index:index + size])
|
||||
index += size
|
||||
return parts
|
||||
|
||||
|
||||
def get_col(idx):
|
||||
assert idx >= 0, idx
|
||||
|
||||
def forward(X, drop=0.):
|
||||
assert idx >= 0, idx
|
||||
if isinstance(X, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype)
|
||||
|
||||
def backward(y, sgd=None):
|
||||
assert idx >= 0, idx
|
||||
dX = ops.allocate(X.shape)
|
||||
dX[:, idx] += y
|
||||
return dX
|
||||
|
||||
return output, backward
|
||||
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
def doc2feats(cols=None):
|
||||
if cols is None:
|
||||
cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH]
|
||||
|
||||
def forward(docs, drop=0.):
|
||||
feats = []
|
||||
for doc in docs:
|
||||
feats.append(doc.to_array(cols))
|
||||
return feats, None
|
||||
|
||||
model = layerize(forward)
|
||||
model.cols = cols
|
||||
return model
|
||||
|
||||
|
||||
def print_shape(prefix):
|
||||
def forward(X, drop=0.):
|
||||
return X, lambda dX, **kwargs: dX
|
||||
return layerize(forward)
|
||||
|
||||
|
||||
@layerize
|
||||
def get_token_vectors(tokens_attrs_vectors, drop=0.):
|
||||
tokens, attrs, vectors = tokens_attrs_vectors
|
||||
|
||||
def backward(d_output, sgd=None):
|
||||
return (tokens, d_output)
|
||||
|
||||
return vectors, backward
|
||||
|
||||
|
||||
@layerize
|
||||
def logistic(X, drop=0.):
|
||||
xp = get_array_module(X)
|
||||
if not isinstance(X, xp.ndarray):
|
||||
X = xp.asarray(X)
|
||||
# Clip to range (-10, 10)
|
||||
X = xp.minimum(X, 10., X)
|
||||
X = xp.maximum(X, -10., X)
|
||||
Y = 1. / (1. + xp.exp(-X))
|
||||
|
||||
def logistic_bwd(dY, sgd=None):
|
||||
dX = dY * (Y * (1-Y))
|
||||
return dX
|
||||
|
||||
return Y, logistic_bwd
|
||||
|
||||
|
||||
def zero_init(model):
|
||||
def _zero_init_impl(self, X, y):
|
||||
self.W.fill(0)
|
||||
model.on_data_hooks.append(_zero_init_impl)
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def preprocess_doc(docs, drop=0.):
|
||||
keys = [doc.to_array([LOWER]) for doc in docs]
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([arr.shape[0] for arr in keys])
|
||||
keys = ops.xp.concatenate(keys)
|
||||
vals = ops.allocate(keys.shape[0]) + 1
|
||||
return (keys, vals, lengths), None
|
||||
|
||||
|
||||
def getitem(i):
|
||||
def getitem_fwd(X, drop=0.):
|
||||
return X[i], None
|
||||
return layerize(getitem_fwd)
|
||||
|
||||
|
||||
def build_tagger_model(nr_class, **cfg):
|
||||
embed_size = util.env_opt('embed_size', 7000)
|
||||
if 'token_vector_width' in cfg:
|
||||
token_vector_width = cfg['token_vector_width']
|
||||
else:
|
||||
token_vector_width = util.env_opt('token_vector_width', 128)
|
||||
pretrained_dims = cfg.get('pretrained_dims', 0)
|
||||
with Model.define_operators({'>>': chain, '+': add}):
|
||||
if 'tok2vec' in cfg:
|
||||
tok2vec = cfg['tok2vec']
|
||||
else:
|
||||
tok2vec = Tok2Vec(token_vector_width, embed_size,
|
||||
pretrained_dims=pretrained_dims)
|
||||
softmax = with_flatten(Softmax(nr_class, token_vector_width))
|
||||
model = (
|
||||
tok2vec
|
||||
>> softmax
|
||||
)
|
||||
model.nI = None
|
||||
model.tok2vec = tok2vec
|
||||
model.softmax = softmax
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def SpacyVectors(docs, drop=0.):
|
||||
batch = []
|
||||
for doc in docs:
|
||||
indices = numpy.zeros((len(doc),), dtype='i')
|
||||
for i, word in enumerate(doc):
|
||||
if word.orth in doc.vocab.vectors.key2row:
|
||||
indices[i] = doc.vocab.vectors.key2row[word.orth]
|
||||
else:
|
||||
indices[i] = 0
|
||||
vectors = doc.vocab.vectors.data[indices]
|
||||
batch.append(vectors)
|
||||
return batch, None
|
||||
|
||||
|
||||
def build_text_classifier(nr_class, width=64, **cfg):
|
||||
nr_vector = cfg.get('nr_vector', 5000)
|
||||
pretrained_dims = cfg.get('pretrained_dims', 0)
|
||||
with Model.define_operators({'>>': chain, '+': add, '|': concatenate,
|
||||
'**': clone}):
|
||||
if cfg.get('low_data') and pretrained_dims:
|
||||
model = (
|
||||
SpacyVectors
|
||||
>> flatten_add_lengths
|
||||
>> with_getitem(0, Affine(width, pretrained_dims))
|
||||
>> ParametricAttention(width)
|
||||
>> Pooling(sum_pool)
|
||||
>> Residual(ReLu(width, width)) ** 2
|
||||
>> zero_init(Affine(nr_class, width, drop_factor=0.0))
|
||||
>> logistic
|
||||
)
|
||||
return model
|
||||
|
||||
lower = HashEmbed(width, nr_vector, column=1)
|
||||
prefix = HashEmbed(width//2, nr_vector, column=2)
|
||||
suffix = HashEmbed(width//2, nr_vector, column=3)
|
||||
shape = HashEmbed(width//2, nr_vector, column=4)
|
||||
|
||||
trained_vectors = (
|
||||
FeatureExtracter([ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID])
|
||||
>> with_flatten(
|
||||
uniqued(
|
||||
(lower | prefix | suffix | shape)
|
||||
>> LN(Maxout(width, width+(width//2)*3)),
|
||||
column=0
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
if pretrained_dims:
|
||||
static_vectors = (
|
||||
SpacyVectors
|
||||
>> with_flatten(Affine(width, pretrained_dims))
|
||||
)
|
||||
# TODO Make concatenate support lists
|
||||
vectors = concatenate_lists(trained_vectors, static_vectors)
|
||||
vectors_width = width*2
|
||||
else:
|
||||
vectors = trained_vectors
|
||||
vectors_width = width
|
||||
static_vectors = None
|
||||
cnn_model = (
|
||||
vectors
|
||||
>> with_flatten(
|
||||
LN(Maxout(width, vectors_width))
|
||||
>> Residual(
|
||||
(ExtractWindow(nW=1) >> LN(Maxout(width, width*3)))
|
||||
) ** 2, pad=2
|
||||
)
|
||||
>> flatten_add_lengths
|
||||
>> ParametricAttention(width)
|
||||
>> Pooling(sum_pool)
|
||||
>> Residual(zero_init(Maxout(width, width)))
|
||||
>> zero_init(Affine(nr_class, width, drop_factor=0.0))
|
||||
)
|
||||
|
||||
linear_model = (
|
||||
_preprocess_doc
|
||||
>> LinearModel(nr_class, drop_factor=0.)
|
||||
)
|
||||
|
||||
model = (
|
||||
(linear_model | cnn_model)
|
||||
>> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
|
||||
>> logistic
|
||||
)
|
||||
model.nO = nr_class
|
||||
model.lsuv = False
|
||||
return model
|
||||
|
||||
|
||||
@layerize
|
||||
def flatten(seqs, drop=0.):
|
||||
ops = Model.ops
|
||||
lengths = ops.asarray([len(seq) for seq in seqs], dtype='i')
|
||||
|
||||
def finish_update(d_X, sgd=None):
|
||||
return ops.unflatten(d_X, lengths, pad=0)
|
||||
|
||||
X = ops.flatten(seqs, pad=0)
|
||||
return X, finish_update
|
||||
|
||||
|
||||
def concatenate_lists(*layers, **kwargs): # pragma: no cover
|
||||
"""Compose two or more models `f`, `g`, etc, such that their outputs are
|
||||
concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))`
|
||||
"""
|
||||
if not layers:
|
||||
return noop()
|
||||
drop_factor = kwargs.get('drop_factor', 1.0)
|
||||
ops = layers[0].ops
|
||||
layers = [chain(layer, flatten) for layer in layers]
|
||||
concat = concatenate(*layers)
|
||||
|
||||
def concatenate_lists_fwd(Xs, drop=0.):
|
||||
drop *= drop_factor
|
||||
lengths = ops.asarray([len(X) for X in Xs], dtype='i')
|
||||
flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop)
|
||||
ys = ops.unflatten(flat_y, lengths)
|
||||
|
||||
def concatenate_lists_bwd(d_ys, sgd=None):
|
||||
return bp_flat_y(ops.flatten(d_ys), sgd=sgd)
|
||||
|
||||
return ys, concatenate_lists_bwd
|
||||
|
||||
model = wrap(concatenate_lists_fwd, concat)
|
||||
return model
|
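concatenate_lists composes models over ragged batches, but the underlying contract is the one stated in its docstring: concatenate(f, g)(x) computes hstack(f(x), g(x)). A plain numpy illustration with toy functions standing in for thinc models:

import numpy

f = lambda x: x * 2.          # pretend model, output shape (n, 4)
g = lambda x: x + 1.          # pretend model, output shape (n, 4)

def concatenated(x):
    return numpy.hstack((f(x), g(x)))

x = numpy.ones((3, 4))
print(concatenated(x).shape)  # (3, 8): the two outputs joined column-wise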
|
@ -3,14 +3,15 @@
|
|||
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
||||
|
||||
__title__ = 'spacy'
|
||||
__version__ = '1.10.0'
|
||||
__version__ = '2.0.0'
|
||||
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
|
||||
__uri__ = 'https://spacy.io'
|
||||
__author__ = 'Matthew Honnibal'
|
||||
__email__ = 'matt@explosion.ai'
|
||||
__author__ = 'Explosion AI'
|
||||
__email__ = 'contact@explosion.ai'
|
||||
__license__ = 'MIT'
|
||||
__release__ = False
|
||||
|
||||
__docs_models__ = 'https://spacy.io/docs/usage'
|
||||
__docs_models__ = 'https://spacy.io/usage/models'
|
||||
__download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
|
||||
__compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
|
||||
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts.json'
|
||||
__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json'
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
# Reserve 64 values for flag features
|
||||
cpdef enum attr_id_t:
|
||||
cdef enum attr_id_t:
|
||||
NULL_ATTR
|
||||
IS_ALPHA
|
||||
IS_ASCII
|
||||
|
@ -83,6 +83,7 @@ cpdef enum attr_id_t:
|
|||
ENT_IOB
|
||||
ENT_TYPE
|
||||
HEAD
|
||||
SENT_START
|
||||
SPACY
|
||||
PROB
|
||||
|
||||
|
|
|
@ -85,6 +85,7 @@ IDS = {
|
|||
"ENT_IOB": ENT_IOB,
|
||||
"ENT_TYPE": ENT_TYPE,
|
||||
"HEAD": HEAD,
|
||||
"SENT_START": SENT_START,
|
||||
"SPACY": SPACY,
|
||||
"PROB": PROB,
|
||||
"LANG": LANG,
|
||||
|
@ -93,23 +94,19 @@ IDS = {
|
|||
|
||||
# ATTR IDs, in order of the symbol
|
||||
NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
|
||||
locals().update(IDS)
|
||||
|
||||
|
||||
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
||||
"""
|
||||
Normalize a dictionary of attributes, converting them to ints.
|
||||
|
||||
Arguments:
|
||||
stringy_attrs (dict):
|
||||
Dictionary keyed by attribute string names. Values can be ints or strings.
|
||||
|
||||
strings_map (StringStore):
|
||||
Defaults to None. If provided, encodes string values into ints.
|
||||
|
||||
Returns:
|
||||
inty_attrs (dict):
|
||||
Attributes dictionary with keys and optionally values converted to
|
||||
ints.
|
||||
stringy_attrs (dict): Dictionary keyed by attribute string names. Values
|
||||
can be ints or strings.
|
||||
strings_map (StringStore): Defaults to None. If provided, encodes string
|
||||
values into ints.
|
||||
RETURNS (dict): Attributes dictionary with keys and optionally values
|
||||
converted to ints.
|
||||
"""
|
||||
inty_attrs = {}
|
||||
if _do_deprecated:
|
||||
|
@ -149,6 +146,9 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
|||
else:
|
||||
int_key = IDS[name.upper()]
|
||||
if strings_map is not None and isinstance(value, basestring):
|
||||
value = strings_map[value]
|
||||
if hasattr(strings_map, 'add'):
|
||||
value = strings_map.add(value)
|
||||
else:
|
||||
value = strings_map[value]
|
||||
inty_attrs[int_key] = value
|
||||
return inty_attrs
|
||||
|
|
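The rewritten docstring above describes intify_attrs: attribute names become integer IDs and, if a StringStore is passed, string values are interned as integers too. A hedged usage sketch (assumes spaCy v2 is installed; the Vocab is only used here for its string store):

from spacy.attrs import intify_attrs, LEMMA
from spacy.vocab import Vocab

vocab = Vocab()
attrs = intify_attrs({'LEMMA': 'dog', 'IS_ALPHA': True}, strings_map=vocab.strings)
print(LEMMA in attrs)   # True: keys are now integer attribute IDs
# the value 'dog' has been replaced by its integer ID in vocab.strings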
|
@ -1,24 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from ..language import Language
|
||||
from ..attrs import LANG
|
||||
|
||||
from .language_data import *
|
||||
|
||||
|
||||
class Bengali(Language):
|
||||
lang = 'bn'
|
||||
|
||||
class Defaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'bn'
|
||||
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
lemma_rules = LEMMA_RULES
|
||||
|
||||
prefixes = tuple(TOKENIZER_PREFIXES)
|
||||
suffixes = tuple(TOKENIZER_SUFFIXES)
|
||||
infixes = tuple(TOKENIZER_INFIXES)
|
|
@ -1,27 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.language_data import strings_to_exc, update_exc
|
||||
from .punctuation import *
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP as TAG_MAP_BN
|
||||
from .morph_rules import MORPH_RULES
|
||||
from .lemma_rules import LEMMA_RULES
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS as TOKENIZER_EXCEPTIONS_BN
|
||||
from .. import language_data as base
|
||||
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
TAG_MAP = base.TAG_MAP
|
||||
TAG_MAP.update(TAG_MAP_BN)
|
||||
|
||||
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
|
||||
TOKENIZER_EXCEPTIONS.update(TOKENIZER_EXCEPTIONS_BN)
|
||||
|
||||
TOKENIZER_PREFIXES = TOKENIZER_PREFIXES
|
||||
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES
|
||||
TOKENIZER_INFIXES = TOKENIZER_INFIXES
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TAG_MAP", "MORPH_RULES", "LEMMA_RULES",
|
||||
"TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
|
|
@ -1,45 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..language_data.punctuation import ALPHA_LOWER, LIST_ELLIPSES, QUOTES, ALPHA_UPPER, LIST_QUOTES, UNITS, \
|
||||
CURRENCY, LIST_PUNCT, ALPHA, _QUOTES
|
||||
|
||||
CURRENCY_SYMBOLS = r"\$ ¢ £ € ¥ ฿ ৳"
|
||||
|
||||
_PUNCT = '। ॥'
|
||||
|
||||
LIST_PUNCT.extend(_PUNCT.strip().split())
|
||||
|
||||
TOKENIZER_PREFIXES = (
|
||||
[r'\+'] +
|
||||
LIST_PUNCT +
|
||||
LIST_ELLIPSES +
|
||||
LIST_QUOTES
|
||||
)
|
||||
|
||||
TOKENIZER_SUFFIXES = (
|
||||
LIST_PUNCT +
|
||||
LIST_ELLIPSES +
|
||||
LIST_QUOTES +
|
||||
[
|
||||
r'(?<=[0-9])\+',
|
||||
r'(?<=°[FfCcKk])\.',
|
||||
r'(?<=[0-9])(?:{c})'.format(c=CURRENCY),
|
||||
r'(?<=[0-9])(?:{u})'.format(u=UNITS),
|
||||
r'(?<=[{al}{p}{c}(?:{q})])\.'.format(al=ALPHA_LOWER, p=r'%²\-\)\]\+', q=QUOTES, c=CURRENCY_SYMBOLS),
|
||||
r'(?<=[{al})])-e'.format(al=ALPHA_LOWER)
|
||||
]
|
||||
)
|
||||
|
||||
TOKENIZER_INFIXES = (
|
||||
LIST_ELLIPSES +
|
||||
[
|
||||
r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
|
||||
r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
|
||||
r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
|
||||
r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
|
||||
r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=_QUOTES.replace("'", "").strip().replace(" ", "")),
|
||||
]
|
||||
)
|
||||
__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
|
|
@ -1,47 +0,0 @@
|
|||
# coding=utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {}
|
||||
|
||||
ABBREVIATIONS = {
|
||||
"ডঃ": [
|
||||
{ORTH: "ডঃ", LEMMA: "ডক্টর"},
|
||||
],
|
||||
"ডাঃ": [
|
||||
{ORTH: "ডাঃ", LEMMA: "ডাক্তার"},
|
||||
],
|
||||
"ড.": [
|
||||
{ORTH: "ড.", LEMMA: "ডক্টর"},
|
||||
],
|
||||
"ডা.": [
|
||||
{ORTH: "ডা.", LEMMA: "ডাক্তার"},
|
||||
],
|
||||
"মোঃ": [
|
||||
{ORTH: "মোঃ", LEMMA: "মোহাম্মদ"},
|
||||
],
|
||||
"মো.": [
|
||||
{ORTH: "মো.", LEMMA: "মোহাম্মদ"},
|
||||
],
|
||||
"সে.": [
|
||||
{ORTH: "সে.", LEMMA: "সেলসিয়াস"},
|
||||
],
|
||||
"কি.মি.": [
|
||||
{ORTH: "কি.মি.", LEMMA: "কিলোমিটার"},
|
||||
],
|
||||
"কি.মি": [
|
||||
{ORTH: "কি.মি", LEMMA: "কিলোমিটার"},
|
||||
],
|
||||
"সে.মি.": [
|
||||
{ORTH: "সে.মি.", LEMMA: "সেন্টিমিটার"},
|
||||
],
|
||||
"সে.মি": [
|
||||
{ORTH: "সে.মি", LEMMA: "সেন্টিমিটার"},
|
||||
],
|
||||
"মি.লি.": [
|
||||
{ORTH: "মি.লি.", LEMMA: "মিলিলিটার"},
|
||||
]
|
||||
}
|
||||
|
||||
TOKENIZER_EXCEPTIONS.update(ABBREVIATIONS)
|
|
@ -1,26 +0,0 @@
|
|||
from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
|
||||
from cymem.cymem cimport Pool
|
||||
|
||||
cdef class CFile:
|
||||
cdef FILE* fp
|
||||
cdef bint is_open
|
||||
cdef Pool mem
|
||||
cdef int size # For compatibility with subclass
|
||||
cdef int _capacity # For compatibility with subclass
|
||||
|
||||
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
|
||||
|
||||
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
|
||||
|
||||
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
|
||||
|
||||
|
||||
|
||||
cdef class StringCFile(CFile):
|
||||
cdef unsigned char* data
|
||||
|
||||
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1
|
||||
|
||||
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1
|
||||
|
||||
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *
|
|
@ -1,91 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from libc.stdio cimport fopen, fclose, fread, fwrite
|
||||
from libc.string cimport memcpy
|
||||
|
||||
|
||||
cdef class CFile:
|
||||
def __init__(self, loc, mode, on_open_error=None):
|
||||
if isinstance(mode, unicode):
|
||||
mode_str = mode.encode('ascii')
|
||||
else:
|
||||
mode_str = mode
|
||||
if hasattr(loc, 'as_posix'):
|
||||
loc = loc.as_posix()
|
||||
self.mem = Pool()
|
||||
cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc
|
||||
self.fp = fopen(<char*>bytes_loc, mode_str)
|
||||
if self.fp == NULL:
|
||||
if on_open_error is not None:
|
||||
on_open_error()
|
||||
else:
|
||||
raise IOError("Could not open binary file %s" % bytes_loc)
|
||||
self.is_open = True
|
||||
|
||||
def __dealloc__(self):
|
||||
if self.is_open:
|
||||
fclose(self.fp)
|
||||
|
||||
def close(self):
|
||||
fclose(self.fp)
|
||||
self.is_open = False
|
||||
|
||||
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
|
||||
st = fread(dest, elem_size, number, self.fp)
|
||||
if st != number:
|
||||
raise IOError
|
||||
|
||||
cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:
|
||||
st = fwrite(src, elem_size, number, self.fp)
|
||||
if st != number:
|
||||
raise IOError
|
||||
|
||||
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
|
||||
cdef void* dest = mem.alloc(number, elem_size)
|
||||
self.read_into(dest, number, elem_size)
|
||||
return dest
|
||||
|
||||
def write_unicode(self, unicode value):
|
||||
cdef bytes py_bytes = value.encode('utf8')
|
||||
cdef char* chars = <char*>py_bytes
|
||||
self.write(sizeof(char), len(py_bytes), chars)
|
||||
|
||||
|
||||
cdef class StringCFile:
|
||||
def __init__(self, mode, bytes data=b'', on_open_error=None):
|
||||
self.mem = Pool()
|
||||
self.is_open = 'w' in mode
|
||||
self._capacity = max(len(data), 8)
|
||||
self.size = len(data)
|
||||
self.data = <unsigned char*>self.mem.alloc(1, self._capacity)
|
||||
for i in range(len(data)):
|
||||
self.data[i] = data[i]
|
||||
|
||||
def close(self):
|
||||
self.is_open = False
|
||||
|
||||
def string_data(self):
|
||||
return (self.data-self.size)[:self.size]
|
||||
|
||||
cdef int read_into(self, void* dest, size_t number, size_t elem_size) except -1:
|
||||
memcpy(dest, self.data, elem_size * number)
|
||||
self.data += elem_size * number
|
||||
|
||||
cdef int write_from(self, void* src, size_t elem_size, size_t number) except -1:
|
||||
write_size = number * elem_size
|
||||
if (self.size + write_size) >= self._capacity:
|
||||
self._capacity = (self.size + write_size) * 2
|
||||
self.data = <unsigned char*>self.mem.realloc(self.data, self._capacity)
|
||||
memcpy(&self.data[self.size], src, elem_size * number)
|
||||
self.size += write_size
|
||||
|
||||
cdef void* alloc_read(self, Pool mem, size_t number, size_t elem_size) except *:
|
||||
cdef void* dest = mem.alloc(number, elem_size)
|
||||
self.read_into(dest, number, elem_size)
|
||||
return dest
|
||||
|
||||
def write_unicode(self, unicode value):
|
||||
cdef bytes py_bytes = value.encode('utf8')
|
||||
cdef char* chars = <char*>py_bytes
|
||||
self.write(sizeof(char), len(py_bytes), chars)
|
|
@ -2,6 +2,9 @@ from .download import download
|
|||
from .info import info
|
||||
from .link import link
|
||||
from .package import package
|
||||
from .train import train, train_config
|
||||
from .model import model
|
||||
from .profile import profile
|
||||
from .train import train
|
||||
from .evaluate import evaluate
|
||||
from .convert import convert
|
||||
from .vocab import make_vocab as vocab
|
||||
from .validate import validate
|
||||
|
|
|
@ -1,35 +1,46 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
from pathlib import Path
|
||||
|
||||
from .converters import conllu2json
|
||||
from .. import util
|
||||
|
||||
|
||||
# Converters are matched by file extension. To add a converter, add a new entry
|
||||
# to this dict with the file extension mapped to the converter function imported
|
||||
# from /converters.
|
||||
from .converters import conllu2json, iob2json, conll_ner2json
|
||||
from ..util import prints
|
||||
|
||||
# Converters are matched by file extension. To add a converter, add a new
|
||||
# entry to this dict with the file extension mapped to the converter function
|
||||
# imported from /converters.
|
||||
CONVERTERS = {
|
||||
'.conllu': conllu2json
|
||||
'conllu': conllu2json,
|
||||
'conll': conllu2json,
|
||||
'ner': conll_ner2json,
|
||||
'iob': iob2json,
|
||||
}
|
||||
|
||||
|
||||
def convert(input_file, output_dir, *args):
|
||||
@plac.annotations(
|
||||
input_file=("input file", "positional", None, str),
|
||||
output_dir=("output directory for converted file", "positional", None, str),
|
||||
n_sents=("Number of sentences per doc", "option", "n", int),
|
||||
converter=("Name of converter (auto, iob, conllu or ner)", "option", "c", str),
|
||||
morphology=("Enable appending morphology to tags", "flag", "m", bool))
|
||||
def convert(cmd, input_file, output_dir, n_sents=1, morphology=False,
|
||||
converter='auto'):
|
||||
"""
|
||||
Convert files into JSON format for use with train command and other
|
||||
experiment management functions.
|
||||
"""
|
||||
input_path = Path(input_file)
|
||||
output_path = Path(output_dir)
|
||||
check_dirs(input_path, output_path)
|
||||
file_ext = input_path.suffix
|
||||
if file_ext in CONVERTERS:
|
||||
CONVERTERS[file_ext](input_path, output_path, *args)
|
||||
else:
|
||||
util.sys_exit("Can't find converter for {}".format(input_path.parts[-1]),
|
||||
title="Unknown format")
|
||||
|
||||
|
||||
def check_dirs(input_file, output_path):
|
||||
if not input_file.exists():
|
||||
util.sys_exit(input_file.as_posix(), title="Input file not found")
|
||||
if not input_path.exists():
|
||||
prints(input_path, title="Input file not found", exits=1)
|
||||
if not output_path.exists():
|
||||
util.sys_exit(output_path.as_posix(), title="Output directory not found")
|
||||
prints(output_path, title="Output directory not found", exits=1)
|
||||
if converter == 'auto':
|
||||
converter = input_path.suffix[1:]
|
||||
if converter not in CONVERTERS:
|
||||
prints("Can't find converter for %s" % converter,
|
||||
title="Unknown format", exits=1)
|
||||
func = CONVERTERS[converter]
|
||||
func(input_path, output_path,
|
||||
n_sents=n_sents, use_morphology=morphology)
|
||||
|
|
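As the comment in the rewritten convert.py notes, converters are looked up by name (defaulting to the input file's extension), so new formats can be supported by adding an entry to CONVERTERS. A hedged sketch; the 'tsv' converter below is purely illustrative and not part of spaCy:

from spacy.cli.convert import convert, CONVERTERS

def tsv2json(input_path, output_path, n_sents=10, use_morphology=False):
    # read input_path, write spaCy's JSON training format into output_path
    pass

CONVERTERS['tsv'] = tsv2json
# in the same process, convert(None, 'data.tsv', 'out', converter='tsv') would now dispatch to it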
|
@ -1 +1,3 @@
|
|||
from .conllu2json import conllu2json
|
||||
from .iob2json import iob2json
|
||||
from .conll_ner2json import conll_ner2json
|
||||
|
|
51
spacy/cli/converters/conll_ner2json.py
Normal file
|
@ -0,0 +1,51 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...compat import json_dumps, path2str
|
||||
from ...util import prints
|
||||
from ...gold import iob_to_biluo
|
||||
|
||||
|
||||
def conll_ner2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||
"""
|
||||
Convert files in the CoNLL-2003 NER format into JSON format for use with
|
||||
train cli.
|
||||
"""
|
||||
docs = read_conll_ner(input_path)
|
||||
|
||||
output_filename = input_path.parts[-1].replace(".conll", "") + ".json"
|
||||
output_file = output_path / output_filename
|
||||
with output_file.open('w', encoding='utf-8') as f:
|
||||
f.write(json_dumps(docs))
|
||||
prints("Created %d documents" % len(docs),
|
||||
title="Generated output file %s" % path2str(output_file))
|
||||
|
||||
|
||||
def read_conll_ner(input_path):
|
||||
text = input_path.open('r', encoding='utf-8').read()
|
||||
i = 0
|
||||
delimit_docs = '-DOCSTART- -X- O O'
|
||||
output_docs = []
|
||||
for doc in text.strip().split(delimit_docs):
|
||||
doc = doc.strip()
|
||||
if not doc:
|
||||
continue
|
||||
output_doc = []
|
||||
for sent in doc.split('\n\n'):
|
||||
sent = sent.strip()
|
||||
if not sent:
|
||||
continue
|
||||
lines = [line.strip() for line in sent.split('\n') if line.strip()]
|
||||
words, tags, chunks, iob_ents = zip(*[line.split() for line in lines])
|
||||
biluo_ents = iob_to_biluo(iob_ents)
|
||||
output_doc.append({'tokens': [
|
||||
{'orth': w, 'tag': tag, 'ner': ent} for (w, tag, ent) in
|
||||
zip(words, tags, biluo_ents)
|
||||
]})
|
||||
output_docs.append({
|
||||
'id': len(output_docs),
|
||||
'paragraphs': [{'sentences': output_doc}]
|
||||
})
|
||||
output_doc = []
|
||||
return output_docs
|
|
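conll_ner2json expects the CoNLL-2003 layout: one token per line with four whitespace-separated columns (word, POS tag, chunk tag, IOB entity tag), blank lines between sentences and a '-DOCSTART- -X- O O' line between documents. A hedged, self-contained example with made-up data (assumes spaCy v2 is installed):

from pathlib import Path
from spacy.cli.converters import conll_ner2json

sample = (
    "-DOCSTART- -X- O O\n"
    "\n"
    "U.N. NNP I-NP I-ORG\n"
    "official NN I-NP O\n"
    "Ekeus NNP I-NP I-PER\n"
    "heads VBZ I-VP O\n"
    "for IN I-PP O\n"
    "Baghdad NNP I-NP I-LOC\n"
    ". . O O\n"
)
Path('sample.conll').write_text(sample, encoding='utf8')
conll_ner2json(Path('sample.conll'), Path('.'))   # writes sample.json into the output directory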
@ -1,9 +1,8 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import json
|
||||
from ...compat import json_dumps
|
||||
from ... import util
|
||||
from ...compat import json_dumps, path2str
|
||||
from ...util import prints
|
||||
|
||||
|
||||
def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||
|
@ -28,12 +27,13 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
|||
docs.append(doc)
|
||||
sentences = []
|
||||
|
||||
output_filename = input_path.parts[-1].replace(".conll", ".json")
|
||||
output_filename = input_path.parts[-1].replace(".conllu", ".json")
|
||||
output_file = output_path / output_filename
|
||||
with output_file.open('w', encoding='utf-8') as f:
|
||||
f.write(json_dumps(docs))
|
||||
util.print_msg("Created {} documents".format(len(docs)),
|
||||
title="Generated output file {}".format(output_file))
|
||||
prints("Created %d documents" % len(docs),
|
||||
title="Generated output file %s" % path2str(output_file))
|
||||
|
||||
|
||||
def read_conllx(input_path, use_morphology=False, n=0):
|
||||
|
@ -47,15 +47,16 @@ def read_conllx(input_path, use_morphology=False, n=0):
|
|||
tokens = []
|
||||
for line in lines:
|
||||
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, \
|
||||
_2 = line.split('\t')
|
||||
parts = line.split('\t')
|
||||
id_, word, lemma, pos, tag, morph, head, dep, _1, _2 = parts
|
||||
if '-' in id_ or '.' in id_:
|
||||
continue
|
||||
try:
|
||||
id_ = int(id_) - 1
|
||||
head = (int(head) - 1) if head != '0' else id_
|
||||
dep = 'ROOT' if dep == 'root' else dep
|
||||
tag = pos+'__'+morph if use_morphology else pos
|
||||
tag = pos if tag == '_' else tag
|
||||
tag = tag+'__'+morph if use_morphology else tag
|
||||
tokens.append((id_, word, tag, head, dep, 'O'))
|
||||
except:
|
||||
print(line)
|
||||
|
@ -73,10 +74,10 @@ def generate_sentence(sent):
|
|||
tokens = []
|
||||
for i, id in enumerate(id_):
|
||||
token = {}
|
||||
token["orth"] = word[id]
|
||||
token["tag"] = tag[id]
|
||||
token["head"] = head[id] - i
|
||||
token["dep"] = dep[id]
|
||||
token["orth"] = word[i]
|
||||
token["tag"] = tag[i]
|
||||
token["head"] = head[i] - id
|
||||
token["dep"] = dep[i]
|
||||
tokens.append(token)
|
||||
sentence["tokens"] = tokens
|
||||
return sentence
|
||||
|
|
56
spacy/cli/converters/iob2json.py
Normal file
|
@ -0,0 +1,56 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from cytoolz import partition_all, concat
|
||||
|
||||
from ...compat import json_dumps, path2str
|
||||
from ...util import prints
|
||||
from ...gold import iob_to_biluo
|
||||
|
||||
|
||||
def iob2json(input_path, output_path, n_sents=10, *a, **k):
|
||||
"""
|
||||
Convert IOB files into JSON format for use with train cli.
|
||||
"""
|
||||
with input_path.open('r', encoding='utf8') as file_:
|
||||
sentences = read_iob(file_)
|
||||
docs = merge_sentences(sentences, n_sents)
|
||||
output_filename = input_path.parts[-1].replace(".iob", ".json")
|
||||
output_file = output_path / output_filename
|
||||
with output_file.open('w', encoding='utf-8') as f:
|
||||
f.write(json_dumps(docs))
|
||||
prints("Created %d documents" % len(docs),
|
||||
title="Generated output file %s" % path2str(output_file))
|
||||
|
||||
|
||||
def read_iob(raw_sents):
|
||||
sentences = []
|
||||
for line in raw_sents:
|
||||
if not line.strip():
|
||||
continue
|
||||
tokens = [t.split('|') for t in line.split()]
|
||||
if len(tokens[0]) == 3:
|
||||
words, pos, iob = zip(*tokens)
|
||||
else:
|
||||
words, iob = zip(*tokens)
|
||||
pos = ['-'] * len(words)
|
||||
biluo = iob_to_biluo(iob)
|
||||
sentences.append([
|
||||
{'orth': w, 'tag': p, 'ner': ent}
|
||||
for (w, p, ent) in zip(words, pos, biluo)
|
||||
])
|
||||
sentences = [{'tokens': sent} for sent in sentences]
|
||||
paragraphs = [{'sentences': [sent]} for sent in sentences]
|
||||
docs = [{'id': 0, 'paragraphs': [para]} for para in paragraphs]
|
||||
return docs
|
||||
|
||||
def merge_sentences(docs, n_sents):
|
||||
counter = 0
|
||||
merged = []
|
||||
for group in partition_all(n_sents, docs):
|
||||
group = list(group)
|
||||
first = group.pop(0)
|
||||
to_extend = first['paragraphs'][0]['sentences']
|
||||
for sent in group[1:]:
|
||||
to_extend.extend(sent['paragraphs'][0]['sentences'])
|
||||
merged.append(first)
|
||||
return merged
|
|
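read_iob above parses one sentence per line, with space-separated tokens of the form word|POS|IOB-tag (or word|IOB-tag when no POS column is present), and converts the IOB entity tags to BILUO. A small hedged example with a made-up sentence:

from spacy.cli.converters.iob2json import read_iob

line = "I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O Berlin|NNP|B-GPE .|.|O"
docs = read_iob([line])
token = docs[0]['paragraphs'][0]['sentences'][0]['tokens'][2]
print(token)   # roughly {'orth': 'London', 'tag': 'NNP', 'ner': 'U-GPE'}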
@ -1,83 +1,87 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import requests
|
||||
import os
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
from .link import link_package
|
||||
from .link import link
|
||||
from ..util import prints, get_package_path
|
||||
from .. import about
|
||||
from .. import util
|
||||
|
||||
|
||||
def download(model=None, direct=False):
|
||||
check_error_depr(model)
|
||||
|
||||
@plac.annotations(
|
||||
model=("model to download, shortcut or name)", "positional", None, str),
|
||||
direct=("force direct download. Needs model name with version and won't "
|
||||
"perform compatibility check", "flag", "d", bool))
|
||||
def download(cmd, model, direct=False):
|
||||
"""
|
||||
Download compatible model from default download path using pip. Model
|
||||
can be shortcut, model name or, if --direct flag is set, full model name
|
||||
with version.
|
||||
"""
|
||||
if direct:
|
||||
download_model('{m}/{m}.tar.gz'.format(m=model))
|
||||
dl = download_model('{m}/{m}.tar.gz'.format(m=model))
|
||||
else:
|
||||
model_name = check_shortcut(model)
|
||||
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
|
||||
model_name = shortcuts.get(model, model)
|
||||
compatibility = get_compatibility()
|
||||
version = get_version(model_name, compatibility)
|
||||
download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, v=version))
|
||||
link_package(model_name, model, force=True)
|
||||
dl = download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name,
|
||||
v=version))
|
||||
if dl == 0:
|
||||
try:
|
||||
# Get package path here because link uses
|
||||
# pip.get_installed_distributions() to check if model is a
|
||||
# package, which fails if model was just installed via
|
||||
# subprocess
|
||||
package_path = get_package_path(model_name)
|
||||
link(None, model_name, model, force=True,
|
||||
model_path=package_path)
|
||||
except:
|
||||
# Dirty, but since spacy.download and the auto-linking is
|
||||
# mostly a convenience wrapper, it's best to show a success
|
||||
# message and loading instructions, even if linking fails.
|
||||
prints(
|
||||
"Creating a shortcut link for 'en' didn't work (maybe "
|
||||
"you don't have admin permissions?), but you can still "
|
||||
"load the model via its full package name:",
|
||||
"nlp = spacy.load('%s')" % model_name,
|
||||
title="Download successful")
|
||||
|
||||
|
||||
def get_json(url, desc):
|
||||
r = requests.get(url)
|
||||
if r.status_code != 200:
|
||||
util.sys_exit(
|
||||
"Couldn't fetch {d}. Please find the right model for your spaCy "
|
||||
"installation (v{v}), and download it manually:".format(d=desc, v=about.__version__),
|
||||
"python -m spacy.download [full model name + version] --direct",
|
||||
title="Server error ({c})".format(c=r.status_code))
|
||||
msg = ("Couldn't fetch %s. Please find a model for your spaCy "
|
||||
"installation (v%s), and download it manually.")
|
||||
prints(msg % (desc, about.__version__), about.__docs_models__,
|
||||
title="Server error (%d)" % r.status_code, exits=1)
|
||||
return r.json()
|
||||
|
||||
|
||||
def check_shortcut(model):
|
||||
shortcuts = get_json(about.__shortcuts__, "available shortcuts")
|
||||
return shortcuts.get(model, model)
|
||||
|
||||
|
||||
def get_compatibility():
|
||||
version = about.__version__
|
||||
comp_table = get_json(about.__compatibility__, "compatibility table")
|
||||
comp = comp_table['spacy']
|
||||
if version not in comp:
|
||||
util.sys_exit(
|
||||
"No compatible models found for v{v} of spaCy.".format(v=version),
|
||||
title="Compatibility error")
|
||||
prints("No compatible models found for v%s of spaCy." % version,
|
||||
title="Compatibility error", exits=1)
|
||||
return comp[version]
|
||||
|
||||
|
||||
def get_version(model, comp):
|
||||
if model not in comp:
|
||||
util.sys_exit(
|
||||
"No compatible model found for "
|
||||
"'{m}' (spaCy v{v}).".format(m=model, v=about.__version__),
|
||||
title="Compatibility error")
|
||||
version = about.__version__
|
||||
msg = "No compatible model found for '%s' (spaCy v%s)."
|
||||
prints(msg % (model, version), title="Compatibility error", exits=1)
|
||||
return comp[model][0]
|
||||
|
||||
|
||||
def download_model(filename):
|
||||
util.print_msg("Downloading {f}".format(f=filename))
|
||||
download_url = about.__download_url__ + '/' + filename
|
||||
subprocess.call([sys.executable, '-m',
|
||||
'pip', 'install', '--no-cache-dir', download_url],
|
||||
env=os.environ.copy())
|
||||
|
||||
|
||||
def check_error_depr(model):
|
||||
if not model:
|
||||
util.sys_exit(
|
||||
"python -m spacy.download [name or shortcut]",
|
||||
title="Missing model name or shortcut")
|
||||
|
||||
if model == 'all':
|
||||
util.sys_exit(
|
||||
"As of v1.7.0, the download all command is deprecated. Please "
|
||||
"download the models individually via spacy.download [model name] "
|
||||
"or pip install. For more info on this, see the documentation: "
|
||||
"{d}".format(d=about.__docs_models__),
|
||||
title="Deprecated command")
|
||||
return subprocess.call(
|
||||
[sys.executable, '-m', 'pip', 'install', '--no-cache-dir',
|
||||
download_url], env=os.environ.copy())
|
||||
|
|
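The download command above chains three lookups before shelling out to pip: the shortcuts JSON maps a shortcut like 'en' to a package name, the compatibility table maps the installed spaCy version to compatible model versions, and the final URL is assembled from about.__download_url__. A hedged sketch of that chain (it performs real HTTP requests and assumes the installed spaCy version appears in the compatibility table; 'en' is only an example shortcut):

import requests
from spacy import about

shortcuts = requests.get(about.__shortcuts__).json()
model_name = shortcuts.get('en', 'en')
comp = requests.get(about.__compatibility__).json()['spacy']
version = comp[about.__version__][model_name][0]        # newest compatible model version
url = '{u}/{m}-{v}/{m}-{v}.tar.gz'.format(u=about.__download_url__, m=model_name, v=version)
print(url)   # the archive that pip install --no-cache-dir is pointed at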
113
spacy/cli/evaluate.py
Normal file
|
@ -0,0 +1,113 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, division, print_function
|
||||
|
||||
import plac
|
||||
from timeit import default_timer as timer
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from ..gold import GoldCorpus
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
from .. import displacy
|
||||
|
||||
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
||||
@plac.annotations(
|
||||
model=("model name or path", "positional", None, str),
|
||||
data_path=("location of JSON-formatted evaluation data", "positional",
|
||||
None, str),
|
||||
gold_preproc=("use gold preprocessing", "flag", "G", bool),
|
||||
gpu_id=("use GPU", "option", "g", int),
|
||||
displacy_path=("directory to output rendered parses as HTML", "option",
|
||||
"dp", str),
|
||||
displacy_limit=("limit of parses to render as HTML", "option", "dl", int))
|
||||
def evaluate(cmd, model, data_path, gpu_id=-1, gold_preproc=False,
|
||||
displacy_path=None, displacy_limit=25):
|
||||
"""
|
||||
Evaluate a model. To render a sample of parses in a HTML file, set an
|
||||
output directory as the displacy_path argument.
|
||||
"""
|
||||
if gpu_id >= 0:
|
||||
util.use_gpu(gpu_id)
|
||||
util.set_env_log(False)
|
||||
data_path = util.ensure_path(data_path)
|
||||
displacy_path = util.ensure_path(displacy_path)
|
||||
if not data_path.exists():
|
||||
prints(data_path, title="Evaluation data not found", exits=1)
|
||||
if displacy_path and not displacy_path.exists():
|
||||
prints(displacy_path, title="Visualization output directory not found",
|
||||
exits=1)
|
||||
corpus = GoldCorpus(data_path, data_path)
|
||||
nlp = util.load_model(model)
|
||||
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc))
|
||||
begin = timer()
|
||||
scorer = nlp.evaluate(dev_docs, verbose=False)
|
||||
end = timer()
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
print_results(scorer, time=end - begin, words=nwords,
|
||||
wps=nwords / (end - begin))
|
||||
if displacy_path:
|
||||
docs, golds = zip(*dev_docs)
|
||||
render_deps = 'parser' in nlp.meta.get('pipeline', [])
|
||||
render_ents = 'ner' in nlp.meta.get('pipeline', [])
|
||||
render_parses(docs, displacy_path, model_name=model,
|
||||
limit=displacy_limit, deps=render_deps, ents=render_ents)
|
||||
msg = "Generated %s parses as HTML" % displacy_limit
|
||||
prints(displacy_path, title=msg)
|
||||
|
||||
|
||||
def render_parses(docs, output_path, model_name='', limit=250, deps=True,
|
||||
ents=True):
|
||||
docs[0].user_data['title'] = model_name
|
||||
if ents:
|
||||
with (output_path / 'entities.html').open('w') as file_:
|
||||
html = displacy.render(docs[:limit], style='ent', page=True)
|
||||
file_.write(html)
|
||||
if deps:
|
||||
with (output_path / 'parses.html').open('w') as file_:
|
||||
html = displacy.render(docs[:limit], style='dep', page=True,
|
||||
options={'compact': True})
|
||||
file_.write(html)
|
||||
|
||||
|
||||
def print_progress(itn, losses, dev_scores, wps=0.0):
|
||||
scores = {}
|
||||
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
|
||||
'ents_p', 'ents_r', 'ents_f', 'wps']:
|
||||
scores[col] = 0.0
|
||||
scores['dep_loss'] = losses.get('parser', 0.0)
|
||||
scores['ner_loss'] = losses.get('ner', 0.0)
|
||||
scores['tag_loss'] = losses.get('tagger', 0.0)
|
||||
scores.update(dev_scores)
|
||||
scores['wps'] = wps
|
||||
tpl = '\t'.join((
|
||||
'{:d}',
|
||||
'{dep_loss:.3f}',
|
||||
'{ner_loss:.3f}',
|
||||
'{uas:.3f}',
|
||||
'{ents_p:.3f}',
|
||||
'{ents_r:.3f}',
|
||||
'{ents_f:.3f}',
|
||||
'{tags_acc:.3f}',
|
||||
'{token_acc:.3f}',
|
||||
'{wps:.1f}'))
|
||||
print(tpl.format(itn, **scores))
|
||||
|
||||
|
||||
def print_results(scorer, time, words, wps):
|
||||
results = {
|
||||
'Time': '%.2f s' % time,
|
||||
'Words': words,
|
||||
'Words/s': '%.0f' % wps,
|
||||
'TOK': '%.2f' % scorer.token_acc,
|
||||
'POS': '%.2f' % scorer.tags_acc,
|
||||
'UAS': '%.2f' % scorer.uas,
|
||||
'LAS': '%.2f' % scorer.las,
|
||||
'NER P': '%.2f' % scorer.ents_p,
|
||||
'NER R': '%.2f' % scorer.ents_r,
|
||||
'NER F': '%.2f' % scorer.ents_f}
|
||||
util.print_table(results, title="Results")
|
|
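The evaluate command wraps Language.evaluate and the Scorer; the same numbers can be obtained programmatically. A hedged sketch (the model name and the dev.json path are placeholders; the data must be in spaCy's JSON training format):

import spacy
from spacy.gold import GoldCorpus

nlp = spacy.load('en_core_web_sm')
corpus = GoldCorpus('dev.json', 'dev.json')
dev_docs = list(corpus.dev_docs(nlp, gold_preproc=False))
scorer = nlp.evaluate(dev_docs, verbose=False)
print(scorer.uas, scorer.las, scorer.ents_f, scorer.tags_acc)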
@ -1,52 +1,62 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import platform
|
||||
from pathlib import Path
|
||||
|
||||
from ..compat import unicode_
|
||||
from ..compat import path2str
|
||||
from .. import about
|
||||
from .. import util
|
||||
|
||||
|
||||
def info(model=None, markdown=False):
|
||||
@plac.annotations(
|
||||
model=("optional: shortcut link of model", "positional", None, str),
|
||||
markdown=("generate Markdown for GitHub issues", "flag", "md", str))
|
||||
def info(cmd, model=None, markdown=False):
|
||||
"""Print info about spaCy installation. If a model shortcut link is
|
||||
specified as an argument, print model information. Flag --markdown
|
||||
prints details in Markdown for easy copy-pasting to GitHub issues.
|
||||
"""
|
||||
if model:
|
||||
data = util.parse_package_meta(util.get_data_path(), model, require=True)
|
||||
model_path = Path(__file__).parent / util.get_data_path() / model
|
||||
if model_path.resolve() != model_path:
|
||||
data['link'] = unicode_(model_path)
|
||||
data['source'] = unicode_(model_path.resolve())
|
||||
if util.is_package(model):
|
||||
model_path = util.get_package_path(model)
|
||||
else:
|
||||
data['source'] = unicode_(model_path)
|
||||
print_info(data, "model " + model, markdown)
|
||||
model_path = util.get_data_path() / model
|
||||
meta_path = model_path / 'meta.json'
|
||||
if not meta_path.is_file():
|
||||
util.prints(meta_path, title="Can't find model meta.json", exits=1)
|
||||
meta = util.read_json(meta_path)
|
||||
if model_path.resolve() != model_path:
|
||||
meta['link'] = path2str(model_path)
|
||||
meta['source'] = path2str(model_path.resolve())
|
||||
else:
|
||||
meta['source'] = path2str(model_path)
|
||||
print_info(meta, 'model %s' % model, markdown)
|
||||
else:
|
||||
data = get_spacy_data()
|
||||
print_info(data, "spaCy", markdown)
|
||||
data = {'spaCy version': about.__version__,
|
||||
'Location': path2str(Path(__file__).parent.parent),
|
||||
'Platform': platform.platform(),
|
||||
'Python version': platform.python_version(),
|
||||
'Models': list_models()}
|
||||
print_info(data, 'spaCy', markdown)
|
||||
|
||||
|
||||
def print_info(data, title, markdown):
|
||||
title = "Info about {title}".format(title=title)
|
||||
title = 'Info about %s' % title
|
||||
if markdown:
|
||||
util.print_markdown(data, title=title)
|
||||
else:
|
||||
util.print_table(data, title=title)
|
||||
|
||||
|
||||
def get_spacy_data():
|
||||
return {
|
||||
'spaCy version': about.__version__,
|
||||
'Location': unicode_(Path(__file__).parent.parent),
|
||||
'Platform': platform.platform(),
|
||||
'Python version': platform.python_version(),
|
||||
'Installed models': ', '.join(list_models())
|
||||
}
|
||||
|
||||
|
||||
def list_models():
|
||||
# exclude common cache directories – this means models called "cache" etc.
|
||||
# won't show up in list, but it seems worth it
|
||||
exclude = ['cache', 'pycache', '__pycache__']
|
||||
def exclude_dir(dir_name):
|
||||
# exclude common cache directories and hidden directories
|
||||
exclude = ['cache', 'pycache', '__pycache__']
|
||||
return dir_name in exclude or dir_name.startswith('.')
|
||||
data_path = util.get_data_path()
|
||||
if data_path:
|
||||
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
||||
return [m for m in models if m not in exclude]
|
||||
return ', '.join([m for m in models if not exclude_dir(m)])
|
||||
return '-'
|
||||
|
|
|
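For reference, the reworked info command above is dispatched through plac, so the first positional argument is the subcommand name itself. A minimal sketch of calling it directly from Python (the "en" shortcut link is only an example and may not be installed):

    # hedged sketch -- relies on the signature def info(cmd, model=None, markdown=False) shown above
    from spacy.cli import info

    info('info')                              # print general spaCy info as a table
    info('info', model='en', markdown=True)   # print a linked model's meta, formatted as Markdown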
@ -1,78 +1,56 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pip
|
||||
import plac
|
||||
from pathlib import Path
|
||||
import importlib
|
||||
from ..compat import unicode_, symlink_to
|
||||
|
||||
from ..compat import symlink_to, path2str
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
|
||||
|
||||
def link(origin, link_name, force=False):
|
||||
if is_package(origin):
|
||||
link_package(origin, link_name, force)
|
||||
@plac.annotations(
|
||||
origin=("package name or local path to model", "positional", None, str),
|
||||
link_name=("name of shortuct link to create", "positional", None, str),
|
||||
force=("force overwriting of existing link", "flag", "f", bool))
|
||||
def link(cmd, origin, link_name, force=False, model_path=None):
|
||||
"""
|
||||
Create a symlink for models within the spacy/data directory. Accepts
|
||||
either the name of a pip package, or the local path to the model data
|
||||
directory. Linking models allows loading them via spacy.load(link_name).
|
||||
"""
|
||||
if util.is_package(origin):
|
||||
model_path = util.get_package_path(origin)
|
||||
else:
|
||||
symlink(origin, link_name, force)
|
||||
|
||||
|
||||
def link_package(package_name, link_name, force=False):
|
||||
# Here we're importing the module just to find it. This is worryingly
|
||||
# indirect, but it's otherwise very difficult to find the package.
|
||||
# Python's installation and import rules are very complicated.
|
||||
pkg = importlib.import_module(package_name)
|
||||
package_path = Path(pkg.__file__).parent.parent
|
||||
meta = get_meta(package_path, package_name)
|
||||
model_name = package_name + '-' + meta['version']
|
||||
model_path = package_path / package_name / model_name
|
||||
symlink(model_path, link_name, force)
|
||||
|
||||
|
||||
def symlink(model_path, link_name, force):
|
||||
model_path = Path(model_path)
|
||||
model_path = Path(origin) if model_path is None else Path(model_path)
|
||||
if not model_path.exists():
|
||||
util.sys_exit(
|
||||
"The data should be located in {p}".format(p=model_path),
|
||||
title="Can't locate model data")
|
||||
|
||||
prints("The data should be located in %s" % path2str(model_path),
|
||||
title="Can't locate model data", exits=1)
|
||||
data_path = util.get_data_path()
|
||||
if not data_path or not data_path.exists():
|
||||
spacy_loc = Path(__file__).parent.parent
|
||||
prints("Make sure a directory `/data` exists within your spaCy "
|
||||
"installation and try again. The data directory should be "
|
||||
"located here:", path2str(spacy_loc), exits=1,
|
||||
title="Can't find the spaCy data path to create model symlink")
|
||||
link_path = util.get_data_path() / link_name
|
||||
|
||||
if link_path.exists() and not force:
|
||||
util.sys_exit(
|
||||
"To overwrite an existing link, use the --force flag.",
|
||||
title="Link {l} already exists".format(l=link_name))
|
||||
prints("To overwrite an existing link, use the --force flag.",
|
||||
title="Link %s already exists" % link_name, exits=1)
|
||||
elif link_path.exists():
|
||||
link_path.unlink()
|
||||
|
||||
try:
|
||||
symlink_to(link_path, model_path)
|
||||
except:
|
||||
# This is quite dirty, but just making sure other errors are caught so
|
||||
# users at least see a proper message.
|
||||
util.print_msg(
|
||||
"Creating a symlink in spacy/data failed. Make sure you have the "
|
||||
"required permissions and try re-running the command as admin, or "
|
||||
"use a virtualenv to install spaCy in a user directory, instead of "
|
||||
"doing a system installation.",
|
||||
"You can still import the model as a Python package and call its "
|
||||
"load() method, or create the symlink manually:",
|
||||
"{a} --> {b}".format(a=unicode_(model_path), b=unicode_(link_path)),
|
||||
title="Error: Couldn't link model to '{l}'".format(l=link_name))
|
||||
# This is quite dirty, but just making sure other errors are caught.
|
||||
prints("Creating a symlink in spacy/data failed. Make sure you have "
|
||||
"the required permissions and try re-running the command as "
|
||||
"admin, or use a virtualenv. You can still import the model as "
|
||||
"a module and call its load() method, or create the symlink "
|
||||
"manually.",
|
||||
"%s --> %s" % (path2str(model_path), path2str(link_path)),
|
||||
title="Error: Couldn't link model to '%s'" % link_name)
|
||||
raise
|
||||
|
||||
util.print_msg(
|
||||
"{a} --> {b}".format(a=model_path.as_posix(), b=link_path.as_posix()),
|
||||
"You can now load the model via spacy.load('{l}').".format(l=link_name),
|
||||
title="Linking successful")
|
||||
|
||||
|
||||
def get_meta(package_path, package):
|
||||
meta = util.parse_package_meta(package_path, package)
|
||||
return meta
|
||||
|
||||
|
||||
def is_package(origin):
|
||||
packages = pip.get_installed_distributions()
|
||||
for package in packages:
|
||||
if package.project_name.replace('-', '_') == origin:
|
||||
return True
|
||||
return False
|
||||
prints("%s --> %s" % (path2str(model_path), path2str(link_path)),
|
||||
"You can now load the model via spacy.load('%s')" % link_name,
|
||||
title="Linking successful")
|
||||
|
|
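The link command above resolves a package or local path and then creates a shortcut symlink inside spacy/data. A hedged sketch of the equivalent manual steps for a local model directory (both paths and the shortcut name are illustrative):

    # hedged sketch -- mirrors what link()/symlink_to do for a local model directory
    from pathlib import Path
    from spacy import util
    from spacy.compat import symlink_to

    model_path = Path('/path/to/en_core_web_sm/en_core_web_sm-2.0.0')  # example location
    link_path = util.get_data_path() / 'en_example'                    # example shortcut name
    symlink_to(link_path, model_path)  # may require admin rights on Windows, as noted above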
|
@ -1,127 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import gzip
|
||||
import math
|
||||
from ast import literal_eval
|
||||
from pathlib import Path
|
||||
from preshed.counter import PreshCounter
|
||||
|
||||
from ..vocab import write_binary_vectors
|
||||
from ..compat import fix_text
|
||||
from .. import util
|
||||
|
||||
|
||||
def model(lang, model_dir, freqs_data, clusters_data, vectors_data):
|
||||
model_path = Path(model_dir)
|
||||
freqs_path = Path(freqs_data)
|
||||
clusters_path = Path(clusters_data) if clusters_data else None
|
||||
vectors_path = Path(vectors_data) if vectors_data else None
|
||||
|
||||
check_dirs(freqs_path, clusters_path, vectors_path)
|
||||
vocab = util.get_lang_class(lang).Defaults.create_vocab()
|
||||
probs, oov_prob = read_probs(freqs_path)
|
||||
clusters = read_clusters(clusters_path) if clusters_path else {}
|
||||
populate_vocab(vocab, clusters, probs, oov_prob)
|
||||
create_model(model_path, vectors_path, vocab, oov_prob)
|
||||
|
||||
|
||||
def create_model(model_path, vectors_path, vocab, oov_prob):
|
||||
vocab_path = model_path / 'vocab'
|
||||
lexemes_path = vocab_path / 'lexemes.bin'
|
||||
strings_path = vocab_path / 'strings.json'
|
||||
oov_path = vocab_path / 'oov_prob'
|
||||
|
||||
if not model_path.exists():
|
||||
model_path.mkdir()
|
||||
if not vocab_path.exists():
|
||||
vocab_path.mkdir()
|
||||
vocab.dump(lexemes_path.as_posix())
|
||||
with strings_path.open('w') as f:
|
||||
vocab.strings.dump(f)
|
||||
with oov_path.open('w') as f:
|
||||
f.write('%f' % oov_prob)
|
||||
if vectors_path:
|
||||
vectors_dest = vocab_path / 'vec.bin'
|
||||
write_binary_vectors(vectors_path.as_posix(), vectors_dest.as_posix())
|
||||
|
||||
|
||||
def read_probs(freqs_path, max_length=100, min_doc_freq=5, min_freq=200):
|
||||
counts = PreshCounter()
|
||||
total = 0
|
||||
freqs_file = check_unzip(freqs_path)
|
||||
for i, line in enumerate(freqs_file):
|
||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
||||
freq = int(freq)
|
||||
counts.inc(i+1, freq)
|
||||
total += freq
|
||||
counts.smooth()
|
||||
log_total = math.log(total)
|
||||
freqs_file = check_unzip(freqs_path)
|
||||
probs = {}
|
||||
for line in freqs_file:
|
||||
freq, doc_freq, key = line.rstrip().split('\t', 2)
|
||||
doc_freq = int(doc_freq)
|
||||
freq = int(freq)
|
||||
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
|
||||
word = literal_eval(key)
|
||||
smooth_count = counts.smoother(int(freq))
|
||||
probs[word] = math.log(smooth_count) - log_total
|
||||
oov_prob = math.log(counts.smoother(0)) - log_total
|
||||
return probs, oov_prob
|
||||
|
||||
|
||||
def read_clusters(clusters_path):
|
||||
clusters = {}
|
||||
with clusters_path.open() as f:
|
||||
for line in f:
|
||||
try:
|
||||
cluster, word, freq = line.split()
|
||||
word = fix_text(word)
|
||||
except ValueError:
|
||||
continue
|
||||
# If the clusterer has only seen the word a few times, its
|
||||
# cluster is unreliable.
|
||||
if int(freq) >= 3:
|
||||
clusters[word] = cluster
|
||||
else:
|
||||
clusters[word] = '0'
|
||||
# Expand clusters with re-casing
|
||||
for word, cluster in list(clusters.items()):
|
||||
if word.lower() not in clusters:
|
||||
clusters[word.lower()] = cluster
|
||||
if word.title() not in clusters:
|
||||
clusters[word.title()] = cluster
|
||||
if word.upper() not in clusters:
|
||||
clusters[word.upper()] = cluster
|
||||
return clusters
|
||||
|
||||
|
||||
def populate_vocab(vocab, clusters, probs, oov_prob):
|
||||
for word, prob in reversed(sorted(list(probs.items()), key=lambda item: item[1])):
|
||||
lexeme = vocab[word]
|
||||
lexeme.prob = prob
|
||||
lexeme.is_oov = False
|
||||
# Decode as a little-endian string, so that we can do & 15 to get
|
||||
# the first 4 bits. See _parse_features.pyx
|
||||
if word in clusters:
|
||||
lexeme.cluster = int(clusters[word][::-1], 2)
|
||||
else:
|
||||
lexeme.cluster = 0
|
||||
|
||||
|
||||
def check_unzip(file_path):
|
||||
file_path_str = file_path.as_posix()
|
||||
if file_path_str.endswith('gz'):
|
||||
return gzip.open(file_path_str)
|
||||
else:
|
||||
return file_path.open()
|
||||
|
||||
|
||||
def check_dirs(freqs_data, clusters_data, vectors_data):
|
||||
if not freqs_data.is_file():
|
||||
util.sys_exit(freqs_data.as_posix(), title="No frequencies file found")
|
||||
if clusters_data and not clusters_data.is_file():
|
||||
util.sys_exit(clusters_data.as_posix(), title="No Brown clusters file found")
|
||||
if vectors_data and not vectors_data.is_file():
|
||||
util.sys_exit(vectors_data.as_posix(), title="No word vectors file found")
|
|
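The populate_vocab helper in the removed model command stores each Brown cluster as its bit-string reversed, so the first steps of the cluster path land in the lowest bits of the integer and can be read back with a mask. A small worked example (the cluster string is made up):

    # hedged example of the little-endian cluster encoding described in the comment above
    cluster = '1011001'                 # hypothetical Brown cluster path
    encoded = int(cluster[::-1], 2)     # reverse, then parse as binary -> 77
    first_four_bits = encoded & 15      # 13 == 0b1101, i.e. the first four path bits '1011' reversed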
@ -1,68 +1,76 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import plac
|
||||
import shutil
|
||||
import requests
|
||||
from pathlib import Path
|
||||
|
||||
from ..compat import unicode_, json_dumps
|
||||
from ..compat import path2str, json_dumps
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
from .. import about
|
||||
|
||||
|
||||
def package(input_dir, output_dir, meta_path, force):
|
||||
input_path = Path(input_dir)
|
||||
output_path = Path(output_dir)
|
||||
@plac.annotations(
|
||||
input_dir=("directory with model data", "positional", None, str),
|
||||
output_dir=("output parent directory", "positional", None, str),
|
||||
meta_path=("path to meta.json", "option", "m", str),
|
||||
create_meta=("create meta.json, even if one exists in directory – if "
|
||||
"existing meta is found, entries are shown as defaults in "
|
||||
"the command line prompt", "flag", "c", bool),
|
||||
force=("force overwriting of existing model directory in output directory",
|
||||
"flag", "f", bool))
|
||||
def package(cmd, input_dir, output_dir, meta_path=None, create_meta=False,
|
||||
force=False):
|
||||
"""
|
||||
Generate Python package for model data, including meta and required
|
||||
installation files. A new directory will be created in the specified
|
||||
output directory, and model data will be copied over.
|
||||
"""
|
||||
input_path = util.ensure_path(input_dir)
|
||||
output_path = util.ensure_path(output_dir)
|
||||
meta_path = util.ensure_path(meta_path)
|
||||
check_dirs(input_path, output_path, meta_path)
|
||||
|
||||
template_setup = get_template('setup.py')
|
||||
template_manifest = get_template('MANIFEST.in')
|
||||
template_init = get_template('en_model_name/__init__.py')
|
||||
if not input_path or not input_path.exists():
|
||||
prints(input_path, title="Model directory not found", exits=1)
|
||||
if not output_path or not output_path.exists():
|
||||
prints(output_path, title="Output directory not found", exits=1)
|
||||
if meta_path and not meta_path.exists():
|
||||
prints(meta_path, title="meta.json not found", exits=1)
|
||||
|
||||
meta_path = meta_path or input_path / 'meta.json'
|
||||
if meta_path.is_file():
|
||||
util.print_msg(unicode_(meta_path), title="Reading meta.json from file")
|
||||
meta = util.read_json(meta_path)
|
||||
else:
|
||||
meta = generate_meta()
|
||||
|
||||
validate_meta(meta, ['lang', 'name', 'version'])
|
||||
if not create_meta: # only print this if user doesn't want to overwrite
|
||||
prints(meta_path, title="Loaded meta.json from file")
|
||||
else:
|
||||
meta = generate_meta(input_dir, meta)
|
||||
meta = validate_meta(meta, ['lang', 'name', 'version'])
|
||||
model_name = meta['lang'] + '_' + meta['name']
|
||||
model_name_v = model_name + '-' + meta['version']
|
||||
main_path = output_path / model_name_v
|
||||
package_path = main_path / model_name
|
||||
|
||||
create_dirs(package_path, force)
|
||||
shutil.copytree(unicode_(input_path), unicode_(package_path / model_name_v))
|
||||
shutil.copytree(path2str(input_path),
|
||||
path2str(package_path / model_name_v))
|
||||
create_file(main_path / 'meta.json', json_dumps(meta))
|
||||
create_file(main_path / 'setup.py', template_setup)
|
||||
create_file(main_path / 'MANIFEST.in', template_manifest)
|
||||
create_file(package_path / '__init__.py', template_init)
|
||||
|
||||
util.print_msg(
|
||||
unicode_(main_path),
|
||||
"To build the package, run `python setup.py sdist` in that directory.",
|
||||
title="Successfully created package {p}".format(p=model_name_v))
|
||||
|
||||
|
||||
def check_dirs(input_path, output_path, meta_path):
|
||||
if not input_path.exists():
|
||||
util.sys_exit(unicode_(input_path.as_posix()), title="Model directory not found")
|
||||
if not output_path.exists():
|
||||
util.sys_exit(unicode_(output_path), title="Output directory not found")
|
||||
if meta_path and not meta_path.exists():
|
||||
util.sys_exit(unicode_(meta_path), title="meta.json not found")
|
||||
create_file(main_path / 'setup.py', TEMPLATE_SETUP)
|
||||
create_file(main_path / 'MANIFEST.in', TEMPLATE_MANIFEST)
|
||||
create_file(package_path / '__init__.py', TEMPLATE_INIT)
|
||||
prints(main_path, "To build the package, run `python setup.py sdist` in "
|
||||
"this directory.",
|
||||
title="Successfully created package '%s'" % model_name_v)
|
||||
|
||||
|
||||
def create_dirs(package_path, force):
|
||||
if package_path.exists():
|
||||
if force:
|
||||
shutil.rmtree(unicode_(package_path))
|
||||
shutil.rmtree(path2str(package_path))
|
||||
else:
|
||||
util.sys_exit(unicode_(package_path),
|
||||
"Please delete the directory and try again, or use the --force "
|
||||
"flag to overwrite existing directories.",
|
||||
title="Package directory already exists")
|
||||
prints(package_path, "Please delete the directory and try again, "
|
||||
"or use the --force flag to overwrite existing "
|
||||
"directories.", title="Package directory already exists",
|
||||
exits=1)
|
||||
Path.mkdir(package_path, parents=True)
|
||||
|
||||
|
||||
|
@ -71,39 +79,125 @@ def create_file(file_path, contents):
|
|||
file_path.open('w', encoding='utf-8').write(contents)
|
||||
|
||||
|
||||
def generate_meta():
|
||||
settings = [('lang', 'Model language', 'en'),
|
||||
('name', 'Model name', 'model'),
|
||||
('version', 'Model version', '0.0.0'),
|
||||
('spacy_version', 'Required spaCy version', '>=1.7.0,<2.0.0'),
|
||||
('description', 'Model description', False),
|
||||
('author', 'Author', False),
|
||||
('email', 'Author email', False),
|
||||
('url', 'Author website', False),
|
||||
('license', 'License', 'CC BY-NC 3.0')]
|
||||
|
||||
util.print_msg("Enter the package settings for your model.", title="Generating meta.json")
|
||||
|
||||
meta = {}
|
||||
def generate_meta(model_path, existing_meta):
|
||||
meta = existing_meta or {}
|
||||
settings = [('lang', 'Model language', meta.get('lang', 'en')),
|
||||
('name', 'Model name', meta.get('name', 'model')),
|
||||
('version', 'Model version', meta.get('version', '0.0.0')),
|
||||
('spacy_version', 'Required spaCy version',
|
||||
'>=%s,<3.0.0' % about.__version__),
|
||||
('description', 'Model description',
|
||||
meta.get('description', False)),
|
||||
('author', 'Author', meta.get('author', False)),
|
||||
('email', 'Author email', meta.get('email', False)),
|
||||
('url', 'Author website', meta.get('url', False)),
|
||||
('license', 'License', meta.get('license', 'CC BY-SA 3.0'))]
|
||||
nlp = util.load_model_from_path(Path(model_path))
|
||||
meta['pipeline'] = nlp.pipe_names
|
||||
meta['vectors'] = {'width': nlp.vocab.vectors_length,
|
||||
'vectors': len(nlp.vocab.vectors),
|
||||
'keys': nlp.vocab.vectors.n_keys}
|
||||
prints("Enter the package settings for your model. The following "
|
||||
"information will be read from your model data: pipeline, vectors.",
|
||||
title="Generating meta.json")
|
||||
for setting, desc, default in settings:
|
||||
response = util.get_raw_input(desc, default)
|
||||
meta[setting] = default if response == '' and default else response
|
||||
if about.__title__ != 'spacy':
|
||||
meta['parent_package'] = about.__title__
|
||||
return meta
|
||||
|
||||
|
||||
def validate_meta(meta, keys):
|
||||
for key in keys:
|
||||
if key not in meta or meta[key] == '':
|
||||
util.sys_exit(
|
||||
"This setting is required to build your package.",
|
||||
title='No "{k}" setting found in meta.json'.format(k=key))
|
||||
prints("This setting is required to build your package.",
|
||||
title='No "%s" setting found in meta.json' % key, exits=1)
|
||||
return meta
|
||||
|
||||
|
||||
def get_template(filepath):
|
||||
url = 'https://raw.githubusercontent.com/explosion/spacy-dev-resources/master/templates/model/'
|
||||
r = requests.get(url + filepath)
|
||||
if r.status_code != 200:
|
||||
util.sys_exit(
|
||||
"Couldn't fetch template files from GitHub.",
|
||||
title="Server error ({c})".format(c=r.status_code))
|
||||
return r.text
|
||||
TEMPLATE_SETUP = """
|
||||
#!/usr/bin/env python
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import io
|
||||
import json
|
||||
from os import path, walk
|
||||
from shutil import copy
|
||||
from setuptools import setup
|
||||
|
||||
|
||||
def load_meta(fp):
|
||||
with io.open(fp, encoding='utf8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def list_files(data_dir):
|
||||
output = []
|
||||
for root, _, filenames in walk(data_dir):
|
||||
for filename in filenames:
|
||||
if not filename.startswith('.'):
|
||||
output.append(path.join(root, filename))
|
||||
output = [path.relpath(p, path.dirname(data_dir)) for p in output]
|
||||
output.append('meta.json')
|
||||
return output
|
||||
|
||||
|
||||
def list_requirements(meta):
|
||||
parent_package = meta.get('parent_package', 'spacy')
|
||||
requirements = [parent_package + meta['spacy_version']]
|
||||
if 'setup_requires' in meta:
|
||||
requirements += meta['setup_requires']
|
||||
return requirements
|
||||
|
||||
|
||||
def setup_package():
|
||||
root = path.abspath(path.dirname(__file__))
|
||||
meta_path = path.join(root, 'meta.json')
|
||||
meta = load_meta(meta_path)
|
||||
model_name = str(meta['lang'] + '_' + meta['name'])
|
||||
model_dir = path.join(model_name, model_name + '-' + meta['version'])
|
||||
|
||||
copy(meta_path, path.join(model_name))
|
||||
copy(meta_path, model_dir)
|
||||
|
||||
setup(
|
||||
name=model_name,
|
||||
description=meta['description'],
|
||||
author=meta['author'],
|
||||
author_email=meta['email'],
|
||||
url=meta['url'],
|
||||
version=meta['version'],
|
||||
license=meta['license'],
|
||||
packages=[model_name],
|
||||
package_data={model_name: list_files(model_dir)},
|
||||
install_requires=list_requirements(meta),
|
||||
zip_safe=False,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
setup_package()
|
||||
""".strip()
|
||||
|
||||
|
||||
TEMPLATE_MANIFEST = """
|
||||
include meta.json
|
||||
""".strip()
|
||||
|
||||
|
||||
TEMPLATE_INIT = """
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from pathlib import Path
|
||||
from spacy.util import load_model_from_init_py, get_model_meta
|
||||
|
||||
|
||||
__version__ = get_model_meta(Path(__file__).parent)['version']
|
||||
|
||||
|
||||
def load(**overrides):
|
||||
return load_model_from_init_py(__file__, **overrides)
|
||||
""".strip()
|
||||
|
|
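The package command reads an existing meta.json or prompts for one, validates that lang, name and version are present, and then copies the model data into a <lang>_<name>-<version> directory together with the setup.py, MANIFEST.in and __init__.py templates. A hedged example of a minimal meta dict that would pass validate_meta (all values are illustrative):

    # hedged example -- a minimal meta that satisfies validate_meta(meta, ['lang', 'name', 'version'])
    meta = {
        'lang': 'en',
        'name': 'example_model',            # hypothetical model name
        'version': '0.0.1',
        'spacy_version': '>=2.0.0,<3.0.0',
        'description': 'Example model packaged for testing',
        'license': 'CC BY-SA 3.0',
    }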
45
spacy/cli/profile.py
Normal file
|
@ -0,0 +1,45 @@
# coding: utf8
from __future__ import unicode_literals, division, print_function

import plac
from pathlib import Path
import ujson
import cProfile
import pstats

import spacy
import sys
import tqdm
import cytoolz


def read_inputs(loc):
    if loc is None:
        file_ = sys.stdin
        file_ = (line.encode('utf8') for line in file_)
    else:
        file_ = Path(loc).open()
    for line in file_:
        data = ujson.loads(line)
        text = data['text']
        yield text


@plac.annotations(
    lang=("model/language", "positional", None, str),
    inputs=("Location of input file", "positional", None, read_inputs))
def profile(cmd, lang, inputs=None):
    """
    Profile a spaCy pipeline, to find out which functions take the most time.
    """
    nlp = spacy.load(lang)
    texts = list(cytoolz.take(10000, inputs))
    cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(),
                    "Profile.prof")
    s = pstats.Stats("Profile.prof")
    s.strip_dirs().sort_stats("time").print_stats()


def parse_texts(nlp, texts):
    for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=128):
        pass
|
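The new profile command streams JSONL records with a "text" field from stdin or a file, parses up to 10,000 of them, and prints a cProfile report sorted by time spent per function. A hedged sketch of driving it directly (the input file name and the "en" model are examples only):

    # hedged sketch -- uses the profile() and read_inputs() helpers defined above
    from spacy.cli.profile import profile, read_inputs

    profile('profile', 'en', inputs=read_inputs('texts.jsonl'))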
@ -1,105 +1,213 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, division, print_function
|
||||
|
||||
import json
|
||||
from collections import defaultdict
|
||||
import plac
|
||||
from pathlib import Path
|
||||
import dill
|
||||
import tqdm
|
||||
from thinc.neural._classes.model import Model
|
||||
from timeit import default_timer as timer
|
||||
import random
|
||||
import numpy.random
|
||||
|
||||
from ..util import ensure_path
|
||||
from ..scorer import Scorer
|
||||
from ..gold import GoldParse, merge_sents
|
||||
from ..gold import read_json_file as read_gold_json
|
||||
from ..gold import GoldCorpus, minibatch
|
||||
from ..util import prints
|
||||
from .. import util
|
||||
from .. import about
|
||||
from .. import displacy
|
||||
from ..compat import json_dumps
|
||||
|
||||
random.seed(0)
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
||||
def train(language, output_dir, train_data, dev_data, n_iter, tagger, parser, ner,
|
||||
parser_L1):
|
||||
output_path = ensure_path(output_dir)
|
||||
train_path = ensure_path(train_data)
|
||||
dev_path = ensure_path(dev_data)
|
||||
check_dirs(output_path, train_path, dev_path)
|
||||
|
||||
lang = util.get_lang_class(language)
|
||||
parser_cfg = {
|
||||
'pseudoprojective': True,
|
||||
'L1': parser_L1,
|
||||
'n_iter': n_iter,
|
||||
'lang': language,
|
||||
'features': lang.Defaults.parser_features}
|
||||
entity_cfg = {
|
||||
'n_iter': n_iter,
|
||||
'lang': language,
|
||||
'features': lang.Defaults.entity_features}
|
||||
tagger_cfg = {
|
||||
'n_iter': n_iter,
|
||||
'lang': language,
|
||||
'features': lang.Defaults.tagger_features}
|
||||
gold_train = list(read_gold_json(train_path))
|
||||
gold_dev = list(read_gold_json(dev_path)) if dev_path else None
|
||||
|
||||
train_model(lang, gold_train, gold_dev, output_path, tagger_cfg, parser_cfg,
|
||||
entity_cfg, n_iter)
|
||||
if gold_dev:
|
||||
scorer = evaluate(lang, gold_dev, output_path)
|
||||
print_results(scorer)
|
||||
|
||||
|
||||
def train_config(config):
|
||||
config_path = ensure_path(config)
|
||||
if not config_path.is_file():
|
||||
util.sys_exit(config_path.as_posix(), title="Config file not found")
|
||||
config = json.load(config_path)
|
||||
for setting in []:
|
||||
if setting not in config.keys():
|
||||
util.sys_exit("{s} not found in config file.".format(s=setting),
|
||||
title="Missing setting")
|
||||
|
||||
|
||||
def train_model(Language, train_data, dev_data, output_path, tagger_cfg, parser_cfg,
|
||||
entity_cfg, n_iter):
|
||||
print("Itn.\tN weight\tN feats\tUAS\tNER F.\tTag %\tToken %")
|
||||
|
||||
with Language.train(output_path, train_data,
|
||||
pos=tagger_cfg, deps=parser_cfg, ner=entity_cfg) as trainer:
|
||||
for itn, epoch in enumerate(trainer.epochs(n_iter, augment_data=None)):
|
||||
for doc, gold in epoch:
|
||||
trainer.update(doc, gold)
|
||||
dev_scores = trainer.evaluate(dev_data).scores if dev_data else defaultdict(float)
|
||||
print_progress(itn, trainer.nlp.parser.model.nr_weight,
|
||||
trainer.nlp.parser.model.nr_active_feat,
|
||||
**dev_scores)
|
||||
|
||||
|
||||
def evaluate(Language, gold_tuples, output_path):
|
||||
print("Load parser", output_path)
|
||||
nlp = Language(path=output_path)
|
||||
scorer = Scorer()
|
||||
for raw_text, sents in gold_tuples:
|
||||
sents = merge_sents(sents)
|
||||
for annot_tuples, brackets in sents:
|
||||
if raw_text is None:
|
||||
tokens = nlp.tokenizer.tokens_from_list(annot_tuples[1])
|
||||
nlp.tagger(tokens)
|
||||
nlp.parser(tokens)
|
||||
nlp.entity(tokens)
|
||||
else:
|
||||
tokens = nlp(raw_text)
|
||||
gold = GoldParse.from_annot_tuples(tokens, annot_tuples)
|
||||
scorer.score(tokens, gold)
|
||||
return scorer
|
||||
|
||||
|
||||
def check_dirs(output_path, train_path, dev_path):
|
||||
@plac.annotations(
|
||||
lang=("model language", "positional", None, str),
|
||||
output_dir=("output directory to store model in", "positional", None, str),
|
||||
train_data=("location of JSON-formatted training data", "positional",
|
||||
None, str),
|
||||
dev_data=("location of JSON-formatted development data (optional)",
|
||||
"positional", None, str),
|
||||
n_iter=("number of iterations", "option", "n", int),
|
||||
n_sents=("number of sentences", "option", "ns", int),
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
vectors=("Model to load vectors from", "option", "v"),
|
||||
no_tagger=("Don't train tagger", "flag", "T", bool),
|
||||
no_parser=("Don't train parser", "flag", "P", bool),
|
||||
no_entities=("Don't train NER", "flag", "N", bool),
|
||||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||
version=("Model version", "option", "V", str),
|
||||
meta_path=("Optional path to meta.json. All relevant properties will be "
|
||||
"overwritten.", "option", "m", Path))
|
||||
def train(cmd, lang, output_dir, train_data, dev_data, n_iter=30, n_sents=0,
|
||||
use_gpu=-1, vectors=None, no_tagger=False,
|
||||
no_parser=False, no_entities=False, gold_preproc=False,
|
||||
version="0.0.0", meta_path=None):
|
||||
"""
|
||||
Train a model. Expects data in spaCy's JSON format.
|
||||
"""
|
||||
util.set_env_log(True)
|
||||
n_sents = n_sents or None
|
||||
output_path = util.ensure_path(output_dir)
|
||||
train_path = util.ensure_path(train_data)
|
||||
dev_path = util.ensure_path(dev_data)
|
||||
meta_path = util.ensure_path(meta_path)
|
||||
if not output_path.exists():
|
||||
util.sys_exit(output_path.as_posix(), title="Output directory not found")
|
||||
output_path.mkdir()
|
||||
if not train_path.exists():
|
||||
util.sys_exit(train_path.as_posix(), title="Training data not found")
|
||||
prints(train_path, title="Training data not found", exits=1)
|
||||
if dev_path and not dev_path.exists():
|
||||
util.sys_exit(dev_path.as_posix(), title="Development data not found")
|
||||
prints(dev_path, title="Development data not found", exits=1)
|
||||
if meta_path is not None and not meta_path.exists():
|
||||
prints(meta_path, title="meta.json not found", exits=1)
|
||||
meta = util.read_json(meta_path) if meta_path else {}
|
||||
if not isinstance(meta, dict):
|
||||
prints("Expected dict but got: {}".format(type(meta)),
|
||||
title="Not a valid meta.json format", exits=1)
|
||||
meta.setdefault('lang', lang)
|
||||
meta.setdefault('name', 'unnamed')
|
||||
|
||||
pipeline = ['tagger', 'parser', 'ner']
|
||||
if no_tagger and 'tagger' in pipeline:
|
||||
pipeline.remove('tagger')
|
||||
if no_parser and 'parser' in pipeline:
|
||||
pipeline.remove('parser')
|
||||
if no_entities and 'ner' in pipeline:
|
||||
pipeline.remove('ner')
|
||||
|
||||
# Take dropout and batch size as generators of values -- dropout
|
||||
# starts high and decays sharply, to force the optimizer to explore.
|
||||
# Batch size starts at 1 and grows, so that we make updates quickly
|
||||
# at the beginning of training.
|
||||
dropout_rates = util.decaying(util.env_opt('dropout_from', 0.2),
|
||||
util.env_opt('dropout_to', 0.2),
|
||||
util.env_opt('dropout_decay', 0.0))
|
||||
batch_sizes = util.compounding(util.env_opt('batch_from', 1),
|
||||
util.env_opt('batch_to', 16),
|
||||
util.env_opt('batch_compound', 1.001))
|
||||
max_doc_len = util.env_opt('max_doc_len', 5000)
|
||||
corpus = GoldCorpus(train_path, dev_path, limit=n_sents)
|
||||
n_train_words = corpus.count_train()
|
||||
|
||||
lang_class = util.get_lang_class(lang)
|
||||
nlp = lang_class()
|
||||
meta['pipeline'] = pipeline
|
||||
nlp.meta.update(meta)
|
||||
if vectors:
|
||||
util.load_model(vectors, vocab=nlp.vocab)
|
||||
for name in pipeline:
|
||||
nlp.add_pipe(nlp.create_pipe(name), name=name)
|
||||
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
|
||||
nlp._optimizer = None
|
||||
|
||||
print("Itn.\tP.Loss\tN.Loss\tUAS\tNER P.\tNER R.\tNER F.\tTag %\tToken %")
|
||||
try:
|
||||
train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
|
||||
gold_preproc=gold_preproc, max_length=0)
|
||||
train_docs = list(train_docs)
|
||||
for i in range(n_iter):
|
||||
with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
|
||||
losses = {}
|
||||
for batch in minibatch(train_docs, size=batch_sizes):
|
||||
batch = [(d, g) for (d, g) in batch if len(d) < max_doc_len]
|
||||
if not batch:
|
||||
continue
|
||||
docs, golds = zip(*batch)
|
||||
nlp.update(docs, golds, sgd=optimizer,
|
||||
drop=next(dropout_rates), losses=losses)
|
||||
pbar.update(sum(len(doc) for doc in docs))
|
||||
|
||||
with nlp.use_params(optimizer.averages):
|
||||
util.set_env_log(False)
|
||||
epoch_model_path = output_path / ('model%d' % i)
|
||||
nlp.to_disk(epoch_model_path)
|
||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||
dev_docs = list(corpus.dev_docs(
|
||||
nlp_loaded,
|
||||
gold_preproc=gold_preproc))
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs)
|
||||
end_time = timer()
|
||||
if use_gpu < 0:
|
||||
gpu_wps = None
|
||||
cpu_wps = nwords/(end_time-start_time)
|
||||
else:
|
||||
gpu_wps = nwords/(end_time-start_time)
|
||||
with Model.use_device('cpu'):
|
||||
nlp_loaded = util.load_model_from_path(epoch_model_path)
|
||||
dev_docs = list(corpus.dev_docs(
|
||||
nlp_loaded, gold_preproc=gold_preproc))
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs)
|
||||
end_time = timer()
|
||||
cpu_wps = nwords/(end_time-start_time)
|
||||
acc_loc = (output_path / ('model%d' % i) / 'accuracy.json')
|
||||
with acc_loc.open('w') as file_:
|
||||
file_.write(json_dumps(scorer.scores))
|
||||
meta_loc = output_path / ('model%d' % i) / 'meta.json'
|
||||
meta['accuracy'] = scorer.scores
|
||||
meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
|
||||
'gpu': gpu_wps}
|
||||
meta['vectors'] = {'width': nlp.vocab.vectors_length,
|
||||
'vectors': len(nlp.vocab.vectors),
|
||||
'keys': nlp.vocab.vectors.n_keys}
|
||||
meta['lang'] = nlp.lang
|
||||
meta['pipeline'] = pipeline
|
||||
meta['spacy_version'] = '>=%s' % about.__version__
|
||||
meta.setdefault('name', 'model%d' % i)
|
||||
meta.setdefault('version', version)
|
||||
|
||||
with meta_loc.open('w') as file_:
|
||||
file_.write(json_dumps(meta))
|
||||
util.set_env_log(True)
|
||||
print_progress(i, losses, scorer.scores, cpu_wps=cpu_wps,
|
||||
gpu_wps=gpu_wps)
|
||||
finally:
|
||||
print("Saving model...")
|
||||
try:
|
||||
with (output_path / 'model-final.pickle').open('wb') as file_:
|
||||
with nlp.use_params(optimizer.averages):
|
||||
dill.dump(nlp, file_, -1)
|
||||
except:
|
||||
print("Error saving model")
|
||||
|
||||
|
||||
def print_progress(itn, nr_weight, nr_active_feat, **scores):
|
||||
tpl = '{:d}\t{:d}\t{:d}\t{uas:.3f}\t{ents_f:.3f}\t{tags_acc:.3f}\t{token_acc:.3f}'
|
||||
print(tpl.format(itn, nr_weight, nr_active_feat, **scores))
|
||||
def _render_parses(i, to_render):
|
||||
to_render[0].user_data['title'] = "Batch %d" % i
|
||||
with Path('/tmp/entities.html').open('w') as file_:
|
||||
html = displacy.render(to_render[:5], style='ent', page=True)
|
||||
file_.write(html)
|
||||
with Path('/tmp/parses.html').open('w') as file_:
|
||||
html = displacy.render(to_render[:5], style='dep', page=True)
|
||||
file_.write(html)
|
||||
|
||||
|
||||
def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
|
||||
scores = {}
|
||||
for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
|
||||
'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps']:
|
||||
scores[col] = 0.0
|
||||
scores['dep_loss'] = losses.get('parser', 0.0)
|
||||
scores['ner_loss'] = losses.get('ner', 0.0)
|
||||
scores['tag_loss'] = losses.get('tagger', 0.0)
|
||||
scores.update(dev_scores)
|
||||
scores['cpu_wps'] = cpu_wps
|
||||
scores['gpu_wps'] = gpu_wps or 0.0
|
||||
tpl = '\t'.join((
|
||||
'{:d}',
|
||||
'{dep_loss:.3f}',
|
||||
'{ner_loss:.3f}',
|
||||
'{uas:.3f}',
|
||||
'{ents_p:.3f}',
|
||||
'{ents_r:.3f}',
|
||||
'{ents_f:.3f}',
|
||||
'{tags_acc:.3f}',
|
||||
'{token_acc:.3f}',
|
||||
'{cpu_wps:.1f}',
|
||||
'{gpu_wps:.1f}',
|
||||
))
|
||||
print(tpl.format(itn, **scores))
|
||||
|
||||
|
||||
def print_results(scorer):
|
||||
|
|
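The training loop above draws its hyper-parameters from generators: the dropout rate decays from dropout_from towards dropout_to, and the batch size compounds from batch_from up to batch_to, so early updates are small and frequent. A hedged sketch of how such schedules can be written and consumed with next() (a simplified stand-in, not necessarily the exact util.decaying/util.compounding implementation):

    # hedged sketch of decaying/compounding schedules, consumed with next() as in the loop above
    def decaying(start, stop, decay):
        value = start
        while True:
            yield max(value, stop)
            value -= decay

    def compounding(start, stop, compound):
        value = start
        while True:
            yield min(value, stop)
            value *= compound

    dropout_rates = decaying(0.2, 0.2, 0.0)
    batch_sizes = compounding(1., 16., 1.001)
    next(dropout_rates), next(batch_sizes)   # -> (0.2, 1.0)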
126
spacy/cli/validate.py
Normal file
|
@ -0,0 +1,126 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
import requests
|
||||
import pkg_resources
|
||||
from pathlib import Path
|
||||
|
||||
from ..compat import path2str, locale_escape
|
||||
from ..util import prints, get_data_path, read_json
|
||||
from .. import about
|
||||
|
||||
|
||||
def validate(cmd):
|
||||
"""Validate that the currently installed version of spaCy is compatible
|
||||
with the installed models. Should be run after `pip install -U spacy`.
|
||||
"""
|
||||
r = requests.get(about.__compatibility__)
|
||||
if r.status_code != 200:
|
||||
prints("Couldn't fetch compatibility table.",
|
||||
title="Server error (%d)" % r.status_code, exits=1)
|
||||
compat = r.json()['spacy']
|
||||
all_models = set()
|
||||
for spacy_v, models in dict(compat).items():
|
||||
all_models.update(models.keys())
|
||||
for model, model_vs in models.items():
|
||||
compat[spacy_v][model] = [reformat_version(v) for v in model_vs]
|
||||
|
||||
current_compat = compat[about.__version__]
|
||||
model_links = get_model_links(current_compat)
|
||||
model_pkgs = get_model_pkgs(current_compat, all_models)
|
||||
incompat_links = {l for l, d in model_links.items() if not d['compat']}
|
||||
incompat_models = {d['name'] for _, d in model_pkgs.items()
|
||||
if not d['compat']}
|
||||
incompat_models.update([d['name'] for _, d in model_links.items()
|
||||
if not d['compat']])
|
||||
na_models = [m for m in incompat_models if m not in current_compat]
|
||||
update_models = [m for m in incompat_models if m in current_compat]
|
||||
|
||||
prints(path2str(Path(__file__).parent.parent),
|
||||
title="Installed models (spaCy v{})".format(about.__version__))
|
||||
if model_links or model_pkgs:
|
||||
print(get_row('TYPE', 'NAME', 'MODEL', 'VERSION', ''))
|
||||
for name, data in model_pkgs.items():
|
||||
print(get_model_row(current_compat, name, data, 'package'))
|
||||
for name, data in model_links.items():
|
||||
print(get_model_row(current_compat, name, data, 'link'))
|
||||
else:
|
||||
prints("No models found in your current environment.", exits=0)
|
||||
|
||||
if update_models:
|
||||
cmd = ' python -m spacy download {}'
|
||||
print("\n Use the following commands to update the model packages:")
|
||||
print('\n'.join([cmd.format(pkg) for pkg in update_models]))
|
||||
|
||||
if na_models:
|
||||
prints("The following models are not available for spaCy v{}: {}"
|
||||
.format(about.__version__, ', '.join(na_models)))
|
||||
|
||||
if incompat_links:
|
||||
prints("You may also want to overwrite the incompatible links using "
|
||||
"the `spacy link` command with `--force`, or remove them from "
|
||||
"the data directory. Data path: {}"
|
||||
.format(path2str(get_data_path())))
|
||||
|
||||
|
||||
def get_model_links(compat):
|
||||
links = {}
|
||||
data_path = get_data_path()
|
||||
if data_path:
|
||||
models = [p for p in data_path.iterdir() if is_model_path(p)]
|
||||
for model in models:
|
||||
meta_path = Path(model) / 'meta.json'
|
||||
if not meta_path.exists():
|
||||
continue
|
||||
meta = read_json(meta_path)
|
||||
link = model.parts[-1]
|
||||
name = meta['lang'] + '_' + meta['name']
|
||||
links[link] = {'name': name, 'version': meta['version'],
|
||||
'compat': is_compat(compat, name, meta['version'])}
|
||||
return links
|
||||
|
||||
|
||||
def get_model_pkgs(compat, all_models):
|
||||
pkgs = {}
|
||||
for pkg_name, pkg_data in pkg_resources.working_set.by_key.items():
|
||||
package = pkg_name.replace('-', '_')
|
||||
if package in all_models:
|
||||
version = pkg_data.version
|
||||
pkgs[pkg_name] = {'name': package, 'version': version,
|
||||
'compat': is_compat(compat, package, version)}
|
||||
return pkgs
|
||||
|
||||
|
||||
def get_model_row(compat, name, data, type='package'):
|
||||
tpl_red = '\x1b[38;5;1m{}\x1b[0m'
|
||||
tpl_green = '\x1b[38;5;2m{}\x1b[0m'
|
||||
if data['compat']:
|
||||
comp = tpl_green.format(locale_escape('✔', errors='ignore'))
|
||||
version = tpl_green.format(data['version'])
|
||||
else:
|
||||
comp = '--> {}'.format(compat.get(data['name'], ['n/a'])[0])
|
||||
version = tpl_red.format(data['version'])
|
||||
return get_row(type, name, data['name'], version, comp)
|
||||
|
||||
|
||||
def get_row(*args):
|
||||
tpl_row = ' {:<10}' + (' {:<20}' * 4)
|
||||
return tpl_row.format(*args)
|
||||
|
||||
|
||||
def is_model_path(model_path):
|
||||
exclude = ['cache', 'pycache', '__pycache__']
|
||||
name = model_path.parts[-1]
|
||||
return (model_path.is_dir() and name not in exclude
|
||||
and not name.startswith('.'))
|
||||
|
||||
|
||||
def is_compat(compat, name, version):
|
||||
return name in compat and version in compat[name]
|
||||
|
||||
|
||||
def reformat_version(version):
|
||||
"""Hack to reformat old versions ending on '-alpha' to match pip format."""
|
||||
if version.endswith('-alpha'):
|
||||
return version.replace('-alpha', 'a0')
|
||||
return version.replace('-alpha', 'a')
|
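validate() above downloads a JSON compatibility table keyed by spaCy version, then checks every installed model package and shortcut link against the entry for the running version. A hedged sketch of the table shape the code assumes, using the is_compat helper defined above (all version numbers are illustrative):

    # hedged sketch -- the structure is inferred from how compat is indexed in validate() above
    compat = {'2.0.0': {'en_core_web_sm': ['2.0.0'], 'de_core_news_sm': ['2.0.0']}}
    current_compat = compat['2.0.0']

    def is_compat(compat, name, version):
        return name in compat and version in compat[name]

    is_compat(current_compat, 'en_core_web_sm', '2.0.0')   # True
    is_compat(current_compat, 'en_core_web_sm', '1.2.0')   # False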
60
spacy/cli/vocab.py
Normal file
|
@ -0,0 +1,60 @@
# coding: utf8
from __future__ import unicode_literals

import plac
import json
import spacy
import numpy
from pathlib import Path

from ..vectors import Vectors
from ..util import prints, ensure_path


@plac.annotations(
    lang=("model language", "positional", None, str),
    output_dir=("model output directory", "positional", None, Path),
    lexemes_loc=("location of JSONL-formatted lexical data", "positional",
                 None, Path),
    vectors_loc=("optional: location of vectors data, as numpy .npz",
                 "positional", None, str),
    prune_vectors=("optional: number of vectors to prune to.",
                   "option", "V", int)
)
def make_vocab(cmd, lang, output_dir, lexemes_loc,
               vectors_loc=None, prune_vectors=-1):
    """Compile a vocabulary from a lexicon jsonl file and word vectors."""
    if not lexemes_loc.exists():
        prints(lexemes_loc, title="Can't find lexical data", exits=1)
    vectors_loc = ensure_path(vectors_loc)
    nlp = spacy.blank(lang)
    for word in nlp.vocab:
        word.rank = 0
    lex_added = 0
    with lexemes_loc.open() as file_:
        for line in file_:
            if line.strip():
                attrs = json.loads(line)
                if 'settings' in attrs:
                    nlp.vocab.cfg.update(attrs['settings'])
                else:
                    lex = nlp.vocab[attrs['orth']]
                    lex.set_attrs(**attrs)
                    assert lex.rank == attrs['id']
                    lex_added += 1
    if vectors_loc is not None:
        vector_data = numpy.load(vectors_loc.open('rb'))
        nlp.vocab.vectors = Vectors(data=vector_data)
        for word in nlp.vocab:
            if word.rank:
                nlp.vocab.vectors.add(word.orth, row=word.rank)

    if prune_vectors >= 1:
        remap = nlp.vocab.prune_vectors(prune_vectors)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    vec_added = len(nlp.vocab.vectors)
    prints("{} entries, {} vectors".format(lex_added, vec_added), output_dir,
           title="Successfully compiled vocab and vectors, and saved model")
    return nlp
|
|
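make_vocab above expects one JSON object per line: either a {"settings": ...} record that updates vocab.cfg, or a lexeme record whose "id" is checked against the rank assigned on load. A hedged sketch of writing such a file (the attribute values are invented and only meant to show the shape):

    # hedged sketch -- writes a tiny lexemes.jsonl in the line-per-record shape read by make_vocab
    import json

    records = [
        {'settings': {'oov_prob': -20.0}},   # hypothetical settings record
        {'orth': 'hello', 'id': 0},
        {'orth': 'world', 'id': 1},
    ]
    with open('lexemes.jsonl', 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')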
@ -5,6 +5,10 @@ import six
|
|||
import ftfy
|
||||
import sys
|
||||
import ujson
|
||||
import itertools
|
||||
import locale
|
||||
|
||||
from thinc.neural.util import copy_array
|
||||
|
||||
try:
|
||||
import cPickle as pickle
|
||||
|
@ -17,9 +21,27 @@ except ImportError:
|
|||
import copyreg as copy_reg
|
||||
|
||||
try:
|
||||
import Queue as queue
|
||||
from cupy.cuda.stream import Stream as CudaStream
|
||||
except ImportError:
|
||||
import queue
|
||||
CudaStream = None
|
||||
|
||||
try:
|
||||
import cupy
|
||||
except ImportError:
|
||||
cupy = None
|
||||
|
||||
try:
|
||||
from thinc.neural.optimizers import Optimizer
|
||||
except ImportError:
|
||||
from thinc.neural.optimizers import Adam as Optimizer
|
||||
|
||||
pickle = pickle
|
||||
copy_reg = copy_reg
|
||||
CudaStream = CudaStream
|
||||
cupy = cupy
|
||||
fix_text = ftfy.fix_text
|
||||
copy_array = copy_array
|
||||
izip = getattr(itertools, 'izip', zip)
|
||||
|
||||
is_python2 = six.PY2
|
||||
is_python3 = six.PY3
|
||||
|
@ -27,37 +49,81 @@ is_windows = sys.platform.startswith('win')
|
|||
is_linux = sys.platform.startswith('linux')
|
||||
is_osx = sys.platform == 'darwin'
|
||||
|
||||
fix_text = ftfy.fix_text
|
||||
|
||||
|
||||
if is_python2:
|
||||
import imp
|
||||
bytes_ = str
|
||||
unicode_ = unicode
|
||||
basestring_ = basestring
|
||||
input_ = raw_input
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2).decode('utf8')
|
||||
intern = intern
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False).decode('utf8')
|
||||
path2str = lambda path: str(path).decode('utf8')
|
||||
|
||||
elif is_python3:
|
||||
import importlib.util
|
||||
bytes_ = bytes
|
||||
unicode_ = str
|
||||
basestring_ = str
|
||||
input_ = input
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2)
|
||||
intern = sys.intern
|
||||
json_dumps = lambda data: ujson.dumps(data, indent=2, escape_forward_slashes=False)
|
||||
path2str = lambda path: str(path)
|
||||
|
||||
|
||||
def b_to_str(b_str):
|
||||
if is_python2:
|
||||
return b_str
|
||||
# important: if no encoding is set, string becomes "b'...'"
|
||||
return str(b_str, encoding='utf8')
|
||||
|
||||
|
||||
def getattr_(obj, name, *default):
|
||||
if is_python3 and isinstance(name, bytes):
|
||||
name = name.decode('utf8')
|
||||
return getattr(obj, name, *default)
|
||||
|
||||
|
||||
def symlink_to(orig, dest):
|
||||
if is_python2 and is_windows:
|
||||
import subprocess
|
||||
subprocess.call(['mklink', '/d', unicode(orig), unicode(dest)], shell=True)
|
||||
subprocess.call(['mklink', '/d', path2str(orig), path2str(dest)], shell=True)
|
||||
else:
|
||||
orig.symlink_to(dest)
|
||||
|
||||
|
||||
def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
|
||||
return ((python2 == None or python2 == is_python2) and
|
||||
(python3 == None or python3 == is_python3) and
|
||||
(windows == None or windows == is_windows) and
|
||||
(linux == None or linux == is_linux) and
|
||||
(osx == None or osx == is_osx))
|
||||
return ((python2 is None or python2 == is_python2) and
|
||||
(python3 is None or python3 == is_python3) and
|
||||
(windows is None or windows == is_windows) and
|
||||
(linux is None or linux == is_linux) and
|
||||
(osx is None or osx == is_osx))
|
||||
|
||||
|
||||
def normalize_string_keys(old):
|
||||
"""Given a dictionary, make sure keys are unicode strings, not bytes."""
|
||||
new = {}
|
||||
for key, value in old.items():
|
||||
if isinstance(key, bytes_):
|
||||
new[key.decode('utf8')] = value
|
||||
else:
|
||||
new[key] = value
|
||||
return new
|
||||
|
||||
|
||||
def import_file(name, loc):
|
||||
loc = str(loc)
|
||||
if is_python2:
|
||||
return imp.load_source(name, loc)
|
||||
else:
|
||||
spec = importlib.util.spec_from_file_location(name, str(loc))
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
spec.loader.exec_module(module)
|
||||
return module
|
||||
|
||||
|
||||
def locale_escape(string, errors='replace'):
|
||||
'''
|
||||
Mangle non-supported characters, for savages with ascii terminals.
|
||||
'''
|
||||
encoding = locale.getpreferredencoding()
|
||||
string = string.encode(encoding, errors).decode('utf8')
|
||||
return string
|
||||
|
|
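The compat module above centralises the Python 2/3 and platform switches so the rest of the code base can stay branch-free. A hedged sketch of typical call sites for the helpers defined in this hunk:

    # hedged sketch -- uses is_config, path2str and json_dumps as defined above
    from pathlib import Path
    from spacy.compat import is_config, path2str, json_dumps

    if is_config(python2=True, windows=True):
        print('symlinks need mklink on this setup')
    print(path2str(Path('/tmp') / 'model'))   # always a unicode string
    print(json_dumps({'lang': 'en'}))         # ujson, indent=2, forward slashes not escaped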
|
@ -1,22 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals, print_function
|
||||
|
||||
from os import path
|
||||
|
||||
from ..language import Language
|
||||
from ..attrs import LANG
|
||||
|
||||
from .language_data import *
|
||||
|
||||
|
||||
class German(Language):
|
||||
lang = 'de'
|
||||
|
||||
class Defaults(Language.Defaults):
|
||||
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters[LANG] = lambda text: 'de'
|
||||
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
|
@ -1,5 +0,0 @@
|
|||
from ..deprecated import ModelDownload as download
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
download.de()
|
|
@ -1,22 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .. import language_data as base
|
||||
from ..language_data import update_exc, strings_to_exc
|
||||
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
|
||||
|
||||
|
||||
TAG_MAP = dict(TAG_MAP)
|
||||
STOP_WORDS = set(STOP_WORDS)
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
|
||||
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
|
||||
|
||||
|
||||
__all__ = ["TOKENIZER_EXCEPTIONS", "TAG_MAP", "STOP_WORDS"]
|
|
@ -1,602 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..symbols import *
|
||||
from ..language_data import PRON_LEMMA, DET_LEMMA
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = {
|
||||
"\\n": [
|
||||
{ORTH: "\\n", LEMMA: "<nl>", TAG: "SP"}
|
||||
],
|
||||
|
||||
"\\t": [
|
||||
{ORTH: "\\t", LEMMA: "<tab>", TAG: "SP"}
|
||||
],
|
||||
|
||||
"'S": [
|
||||
{ORTH: "'S", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"'n": [
|
||||
{ORTH: "'n", LEMMA: DET_LEMMA, NORM: "ein"}
|
||||
],
|
||||
|
||||
"'ne": [
|
||||
{ORTH: "'ne", LEMMA: DET_LEMMA, NORM: "eine"}
|
||||
],
|
||||
|
||||
"'nen": [
|
||||
{ORTH: "'nen", LEMMA: DET_LEMMA, NORM: "einen"}
|
||||
],
|
||||
|
||||
"'nem": [
|
||||
{ORTH: "'nem", LEMMA: DET_LEMMA, NORM: "einem"}
|
||||
],
|
||||
|
||||
"'s": [
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"Abb.": [
|
||||
{ORTH: "Abb.", LEMMA: "Abbildung"}
|
||||
],
|
||||
|
||||
"Abk.": [
|
||||
{ORTH: "Abk.", LEMMA: "Abkürzung"}
|
||||
],
|
||||
|
||||
"Abt.": [
|
||||
{ORTH: "Abt.", LEMMA: "Abteilung"}
|
||||
],
|
||||
|
||||
"Apr.": [
|
||||
{ORTH: "Apr.", LEMMA: "April"}
|
||||
],
|
||||
|
||||
"Aug.": [
|
||||
{ORTH: "Aug.", LEMMA: "August"}
|
||||
],
|
||||
|
||||
"Bd.": [
|
||||
{ORTH: "Bd.", LEMMA: "Band"}
|
||||
],
|
||||
|
||||
"Betr.": [
|
||||
{ORTH: "Betr.", LEMMA: "Betreff"}
|
||||
],
|
||||
|
||||
"Bf.": [
|
||||
{ORTH: "Bf.", LEMMA: "Bahnhof"}
|
||||
],
|
||||
|
||||
"Bhf.": [
|
||||
{ORTH: "Bhf.", LEMMA: "Bahnhof"}
|
||||
],
|
||||
|
||||
"Bsp.": [
|
||||
{ORTH: "Bsp.", LEMMA: "Beispiel"}
|
||||
],
|
||||
|
||||
"Dez.": [
|
||||
{ORTH: "Dez.", LEMMA: "Dezember"}
|
||||
],
|
||||
|
||||
"Di.": [
|
||||
{ORTH: "Di.", LEMMA: "Dienstag"}
|
||||
],
|
||||
|
||||
"Do.": [
|
||||
{ORTH: "Do.", LEMMA: "Donnerstag"}
|
||||
],
|
||||
|
||||
"Fa.": [
|
||||
{ORTH: "Fa.", LEMMA: "Firma"}
|
||||
],
|
||||
|
||||
"Fam.": [
|
||||
{ORTH: "Fam.", LEMMA: "Familie"}
|
||||
],
|
||||
|
||||
"Feb.": [
|
||||
{ORTH: "Feb.", LEMMA: "Februar"}
|
||||
],
|
||||
|
||||
"Fr.": [
|
||||
{ORTH: "Fr.", LEMMA: "Frau"}
|
||||
],
|
||||
|
||||
"Frl.": [
|
||||
{ORTH: "Frl.", LEMMA: "Fräulein"}
|
||||
],
|
||||
|
||||
"Hbf.": [
|
||||
{ORTH: "Hbf.", LEMMA: "Hauptbahnhof"}
|
||||
],
|
||||
|
||||
"Hr.": [
|
||||
{ORTH: "Hr.", LEMMA: "Herr"}
|
||||
],
|
||||
|
||||
"Hrn.": [
|
||||
{ORTH: "Hrn.", LEMMA: "Herr"}
|
||||
],
|
||||
|
||||
"Jan.": [
|
||||
{ORTH: "Jan.", LEMMA: "Januar"}
|
||||
],
|
||||
|
||||
"Jh.": [
|
||||
{ORTH: "Jh.", LEMMA: "Jahrhundert"}
|
||||
],
|
||||
|
||||
"Jhd.": [
|
||||
{ORTH: "Jhd.", LEMMA: "Jahrhundert"}
|
||||
],
|
||||
|
||||
"Jul.": [
|
||||
{ORTH: "Jul.", LEMMA: "Juli"}
|
||||
],
|
||||
|
||||
"Jun.": [
|
||||
{ORTH: "Jun.", LEMMA: "Juni"}
|
||||
],
|
||||
|
||||
"Mi.": [
|
||||
{ORTH: "Mi.", LEMMA: "Mittwoch"}
|
||||
],
|
||||
|
||||
"Mio.": [
|
||||
{ORTH: "Mio.", LEMMA: "Million"}
|
||||
],
|
||||
|
||||
"Mo.": [
|
||||
{ORTH: "Mo.", LEMMA: "Montag"}
|
||||
],
|
||||
|
||||
"Mrd.": [
|
||||
{ORTH: "Mrd.", LEMMA: "Milliarde"}
|
||||
],
|
||||
|
||||
"Mrz.": [
|
||||
{ORTH: "Mrz.", LEMMA: "März"}
|
||||
],
|
||||
|
||||
"MwSt.": [
|
||||
{ORTH: "MwSt.", LEMMA: "Mehrwertsteuer"}
|
||||
],
|
||||
|
||||
"Mär.": [
|
||||
{ORTH: "Mär.", LEMMA: "März"}
|
||||
],
|
||||
|
||||
"Nov.": [
|
||||
{ORTH: "Nov.", LEMMA: "November"}
|
||||
],
|
||||
|
||||
"Nr.": [
|
||||
{ORTH: "Nr.", LEMMA: "Nummer"}
|
||||
],
|
||||
|
||||
"Okt.": [
|
||||
{ORTH: "Okt.", LEMMA: "Oktober"}
|
||||
],
|
||||
|
||||
"Orig.": [
|
||||
{ORTH: "Orig.", LEMMA: "Original"}
|
||||
],
|
||||
|
||||
"Pkt.": [
|
||||
{ORTH: "Pkt.", LEMMA: "Punkt"}
|
||||
],
|
||||
|
||||
"Prof.": [
|
||||
{ORTH: "Prof.", LEMMA: "Professor"}
|
||||
],
|
||||
|
||||
"Red.": [
|
||||
{ORTH: "Red.", LEMMA: "Redaktion"}
|
||||
],
|
||||
|
||||
"S'": [
|
||||
{ORTH: "S'", LEMMA: PRON_LEMMA, TAG: "PPER"}
|
||||
],
|
||||
|
||||
"Sa.": [
|
||||
{ORTH: "Sa.", LEMMA: "Samstag"}
|
||||
],
|
||||
|
||||
"Sep.": [
|
||||
{ORTH: "Sep.", LEMMA: "September"}
|
||||
],
|
||||
|
||||
"Sept.": [
|
||||
{ORTH: "Sept.", LEMMA: "September"}
|
||||
],
|
||||
|
||||
"So.": [
|
||||
{ORTH: "So.", LEMMA: "Sonntag"}
|
||||
],
|
||||
|
||||
"Std.": [
|
||||
{ORTH: "Std.", LEMMA: "Stunde"}
|
||||
],
|
||||
|
||||
"Str.": [
|
||||
{ORTH: "Str.", LEMMA: "Straße"}
|
||||
],
|
||||
|
||||
"Tel.": [
|
||||
{ORTH: "Tel.", LEMMA: "Telefon"}
|
||||
],
|
||||
|
||||
"Tsd.": [
|
||||
{ORTH: "Tsd.", LEMMA: "Tausend"}
|
||||
],
|
||||
|
||||
"Univ.": [
|
||||
{ORTH: "Univ.", LEMMA: "Universität"}
|
||||
],
|
||||
|
||||
"abzgl.": [
|
||||
{ORTH: "abzgl.", LEMMA: "abzüglich"}
|
||||
],
|
||||
|
||||
"allg.": [
|
||||
{ORTH: "allg.", LEMMA: "allgemein"}
|
||||
],
|
||||
|
||||
"auf'm": [
|
||||
{ORTH: "auf", LEMMA: "auf"},
|
||||
{ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem" }
|
||||
],
|
||||
|
||||
"bspw.": [
|
||||
{ORTH: "bspw.", LEMMA: "beispielsweise"}
|
||||
],
|
||||
|
||||
"bzgl.": [
|
||||
{ORTH: "bzgl.", LEMMA: "bezüglich"}
|
||||
],
|
||||
|
||||
"bzw.": [
|
||||
{ORTH: "bzw.", LEMMA: "beziehungsweise"}
|
||||
],
|
||||
|
||||
"d.h.": [
|
||||
{ORTH: "d.h.", LEMMA: "das heißt"}
|
||||
],
|
||||
|
||||
"dgl.": [
|
||||
{ORTH: "dgl.", LEMMA: "dergleichen"}
|
||||
],
|
||||
|
||||
"du's": [
|
||||
{ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"},
|
||||
{ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
|
||||
],
|
||||
|
||||
"ebd.": [
|
||||
{ORTH: "ebd.", LEMMA: "ebenda"}
|
||||
],
|
||||
|
||||
"eigtl.": [
|
||||
{ORTH: "eigtl.", LEMMA: "eigentlich"}
    ],

    "engl.": [
        {ORTH: "engl.", LEMMA: "englisch"}
    ],

    "er's": [
        {ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
    ],

    "evtl.": [
        {ORTH: "evtl.", LEMMA: "eventuell"}
    ],

    "frz.": [
        {ORTH: "frz.", LEMMA: "französisch"}
    ],

    "gegr.": [
        {ORTH: "gegr.", LEMMA: "gegründet"}
    ],

    "ggf.": [
        {ORTH: "ggf.", LEMMA: "gegebenenfalls"}
    ],

    "ggfs.": [
        {ORTH: "ggfs.", LEMMA: "gegebenenfalls"}
    ],

    "ggü.": [
        {ORTH: "ggü.", LEMMA: "gegenüber"}
    ],

    "hinter'm": [
        {ORTH: "hinter", LEMMA: "hinter"},
        {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
    ],

    "i.O.": [
        {ORTH: "i.O.", LEMMA: "in Ordnung"}
    ],

    "i.d.R.": [
        {ORTH: "i.d.R.", LEMMA: "in der Regel"}
    ],

    "ich's": [
        {ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
    ],

    "ihr's": [
        {ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
    ],

    "incl.": [
        {ORTH: "incl.", LEMMA: "inklusive"}
    ],

    "inkl.": [
        {ORTH: "inkl.", LEMMA: "inklusive"}
    ],

    "insb.": [
        {ORTH: "insb.", LEMMA: "insbesondere"}
    ],

    "kath.": [
        {ORTH: "kath.", LEMMA: "katholisch"}
    ],

    "lt.": [
        {ORTH: "lt.", LEMMA: "laut"}
    ],

    "max.": [
        {ORTH: "max.", LEMMA: "maximal"}
    ],

    "min.": [
        {ORTH: "min.", LEMMA: "minimal"}
    ],

    "mind.": [
        {ORTH: "mind.", LEMMA: "mindestens"}
    ],

    "mtl.": [
        {ORTH: "mtl.", LEMMA: "monatlich"}
    ],

    "n.Chr.": [
        {ORTH: "n.Chr.", LEMMA: "nach Christus"}
    ],

    "orig.": [
        {ORTH: "orig.", LEMMA: "original"}
    ],

    "röm.": [
        {ORTH: "röm.", LEMMA: "römisch"}
    ],

    "s'": [
        {ORTH: "s'", LEMMA: PRON_LEMMA, TAG: "PPER"}
    ],

    "s.o.": [
        {ORTH: "s.o.", LEMMA: "siehe oben"}
    ],

    "sie's": [
        {ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
    ],

    "sog.": [
        {ORTH: "sog.", LEMMA: "so genannt"}
    ],

    "stellv.": [
        {ORTH: "stellv.", LEMMA: "stellvertretend"}
    ],

    "tägl.": [
        {ORTH: "tägl.", LEMMA: "täglich"}
    ],

    "u.U.": [
        {ORTH: "u.U.", LEMMA: "unter Umständen"}
    ],

    "u.s.w.": [
        {ORTH: "u.s.w.", LEMMA: "und so weiter"}
    ],

    "u.v.m.": [
        {ORTH: "u.v.m.", LEMMA: "und vieles mehr"}
    ],

    "unter'm": [
        {ORTH: "unter", LEMMA: "unter"},
        {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
    ],

    "usf.": [
        {ORTH: "usf.", LEMMA: "und so fort"}
    ],

    "usw.": [
        {ORTH: "usw.", LEMMA: "und so weiter"}
    ],

    "uvm.": [
        {ORTH: "uvm.", LEMMA: "und vieles mehr"}
    ],

    "v.Chr.": [
        {ORTH: "v.Chr.", LEMMA: "vor Christus"}
    ],

    "v.a.": [
        {ORTH: "v.a.", LEMMA: "vor allem"}
    ],

    "v.l.n.r.": [
        {ORTH: "v.l.n.r.", LEMMA: "von links nach rechts"}
    ],

    "vgl.": [
        {ORTH: "vgl.", LEMMA: "vergleiche"}
    ],

    "vllt.": [
        {ORTH: "vllt.", LEMMA: "vielleicht"}
    ],

    "vlt.": [
        {ORTH: "vlt.", LEMMA: "vielleicht"}
    ],

    "vor'm": [
        {ORTH: "vor", LEMMA: "vor"},
        {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
    ],

    "wir's": [
        {ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"},
        {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}
    ],

    "z.B.": [
        {ORTH: "z.B.", LEMMA: "zum Beispiel"}
    ],

    "z.Bsp.": [
        {ORTH: "z.Bsp.", LEMMA: "zum Beispiel"}
    ],

    "z.T.": [
        {ORTH: "z.T.", LEMMA: "zum Teil"}
    ],

    "z.Z.": [
        {ORTH: "z.Z.", LEMMA: "zur Zeit"}
    ],

    "z.Zt.": [
        {ORTH: "z.Zt.", LEMMA: "zur Zeit"}
    ],

    "z.b.": [
        {ORTH: "z.b.", LEMMA: "zum Beispiel"}
    ],

    "zzgl.": [
        {ORTH: "zzgl.", LEMMA: "zuzüglich"}
    ],

    "österr.": [
        {ORTH: "österr.", LEMMA: "österreichisch"}
    ],

    "über'm": [
        {ORTH: "über", LEMMA: "über"},
        {ORTH: "'m", LEMMA: DET_LEMMA, NORM: "dem"}
    ]
}


ORTH_ONLY = [
    "A.C.",
    "a.D.",
    "A.D.",
    "A.G.",
    "a.M.",
    "a.Z.",
    "Abs.",
    "adv.",
    "al.",
    "B.A.",
    "B.Sc.",
    "betr.",
    "biol.",
    "Biol.",
    "ca.",
    "Chr.",
    "Cie.",
    "co.",
    "Co.",
    "D.C.",
    "Dipl.-Ing.",
    "Dipl.",
    "Dr.",
    "e.g.",
    "e.V.",
    "ehem.",
    "entspr.",
    "erm.",
    "etc.",
    "ev.",
    "G.m.b.H.",
    "geb.",
    "Gebr.",
    "gem.",
    "h.c.",
    "Hg.",
    "hrsg.",
    "Hrsg.",
    "i.A.",
    "i.e.",
    "i.G.",
    "i.Tr.",
    "i.V.",
    "Ing.",
    "jr.",
    "Jr.",
    "jun.",
    "jur.",
    "K.O.",
    "L.A.",
    "lat.",
    "M.A.",
    "m.E.",
    "m.M.",
    "M.Sc.",
    "Mr.",
    "N.Y.",
    "N.Y.C.",
    "nat.",
    "ö.",
    "o.a.",
    "o.ä.",
    "o.g.",
    "o.k.",
    "O.K.",
    "p.a.",
    "p.s.",
    "P.S.",
    "pers.",
    "phil.",
    "q.e.d.",
    "R.I.P.",
    "rer.",
    "sen.",
    "St.",
    "std.",
    "u.a.",
    "U.S.",
    "U.S.A.",
    "U.S.S.",
    "Vol.",
    "vs.",
    "wiss."
]
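For context only (not part of the diff): the `ORTH_ONLY` strings carry no lemma or tag, so a loader would typically expand each one into the same special-case format used in the dictionary above. A minimal sketch under that assumption; the helper name `expand_orth_only` is hypothetical, not spaCy's actual loader:

def expand_orth_only(orth_only, exceptions):
    # Each bare abbreviation becomes a single-token special case, keeping
    # the trailing period attached to the token (same ORTH key format as above).
    for orth in orth_only:
        exceptions.setdefault(orth, [{ORTH: orth}])
    return exceptions

# e.g. expand_orth_only(ORTH_ONLY, TOKENIZER_EXCEPTIONS)["Dr."] == [{ORTH: "Dr."}]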
@@ -1,159 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

from pathlib import Path

from . import about
from . import util
from .cli import download
from .cli import link


def read_lang_data(package):
    tokenization = package.load_json(('tokenizer', 'specials.json'))
    with package.open(('tokenizer', 'prefix.txt'), default=None) as file_:
        prefix = read_prefix(file_) if file_ is not None else None
    with package.open(('tokenizer', 'suffix.txt'), default=None) as file_:
        suffix = read_suffix(file_) if file_ is not None else None
    with package.open(('tokenizer', 'infix.txt'), default=None) as file_:
        infix = read_infix(file_) if file_ is not None else None
    return tokenization, prefix, suffix, infix


def align_tokens(ref, indices):  # Deprecated, surely?
    start = 0
    queue = list(indices)
    for token in ref:
        end = start + len(token)
        emit = []
        while queue and queue[0][1] <= end:
            emit.append(queue.pop(0))
        yield token, emit
        start = end
    assert not queue


def detokenize(token_rules, words):  # Deprecated?
    """
    To align with treebanks, return a list of "chunks", where a chunk is a
    sequence of tokens that are separated by whitespace in actual strings. Each
    chunk should be a tuple of token indices, e.g.

    >>> detokenize(["ca<SEP>n't", '<SEP>!'], ["I", "ca", "n't", "!"])
    [(0,), (1, 2, 3)]
    """
    string = ' '.join(words)
    for subtoks in token_rules:
        # Algorithmically this is dumb, but writing a little list-based match
        # machine? Ain't nobody got time for that.
        string = string.replace(subtoks.replace('<SEP>', ' '), subtoks)
    positions = []
    i = 0
    for chunk in string.split():
        subtoks = chunk.split('<SEP>')
        positions.append(tuple(range(i, i+len(subtoks))))
        i += len(subtoks)
    return positions


def match_best_version(target_name, target_version, path):
    path = util.ensure_path(path)
    if path is None or not path.exists():
        return None
    matches = []
    for data_name in path.iterdir():
        name, version = split_data_name(data_name.parts[-1])
        if name == target_name:
            matches.append((tuple(float(v) for v in version.split('.')), data_name))
    if matches:
        return Path(max(matches)[1])
    else:
        return None


def split_data_name(name):
    return name.split('-', 1) if '-' in name else (name, '')


def fix_glove_vectors_loading(overrides):
    """
    Special-case hack for loading the GloVe vectors, to support deprecated
    <1.0 stuff. Phase this out once the data is fixed.
    """
    if 'data_dir' in overrides and 'path' not in overrides:
        raise ValueError("The argument 'data_dir' has been renamed to 'path'")
    if overrides.get('path') is False:
        return overrides
    if overrides.get('path') in (None, True):
        data_path = util.get_data_path()
    else:
        path = util.ensure_path(overrides['path'])
        data_path = path.parent
    vec_path = None
    if 'add_vectors' not in overrides:
        if 'vectors' in overrides:
            vec_path = match_best_version(overrides['vectors'], None, data_path)
            if vec_path is None:
                return overrides
        else:
            vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
        if vec_path is not None:
            vec_path = vec_path / 'vocab' / 'vec.bin'
    if vec_path is not None:
        overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
    return overrides


def resolve_model_name(name):
    """
    If spaCy is loaded with 'de', check if symlink already exists. If
    not, user may have upgraded from older version and have old models installed.
    Check if old model directory exists and if so, return that instead and create
    shortcut link. If English model is found and no shortcut exists, raise error
    and tell user to install new model.
    """
    if name == 'en' or name == 'de':
        versions = ['1.0.0', '1.1.0']
        data_path = Path(util.get_data_path())
        model_path = data_path / name
        v_model_paths = [data_path / Path(name + '-' + v) for v in versions]

        if not model_path.exists():  # no shortcut found
            for v_path in v_model_paths:
                if v_path.exists():  # versioned model directory found
                    if name == 'de':
                        link(v_path, name)
                        return name
                    else:
                        raise ValueError(
                            "Found English model at {p}. This model is not "
                            "compatible with the current version. See "
                            "https://spacy.io/docs/usage/models to download the "
                            "new model.".format(p=v_path))
    return name


class ModelDownload():
    """
    Replace download modules within en and de with deprecation warning and
    download default language model (using shortcut). Use classmethods to allow
    importing ModelDownload as download and calling download.en() etc.
    """

    @classmethod
    def load(self, lang):
        util.print_msg(
            "The spacy.{l}.download command is now deprecated. Please use "
            "python -m spacy download [model name or shortcut] instead. For more "
            "info and available models, see the documentation: {d}. "
            "Downloading default '{l}' model now...".format(d=about.__docs_models__, l=lang),
            title="Warning: deprecated command")
        download(lang)

    @classmethod
    def en(cls, *args, **kwargs):
        cls.load('en')

    @classmethod
    def de(cls, *args, **kwargs):
        cls.load('de')
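The deleted `ModelDownload` shim points users at the CLI-based workflow named in its own deprecation message. A minimal sketch of the replacement usage, assuming an installed English model:

# Shell: download a model package and create the shortcut link
#   python -m spacy download en
#
# Python: load the linked model instead of spacy.en.English / spacy.en.download
import spacy

nlp = spacy.load('en')
doc = nlp(u'This replaces the deprecated per-language download modules.')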
122 spacy/displacy/__init__.py Normal file
@@ -0,0 +1,122 @@
# coding: utf8
from __future__ import unicode_literals

from .render import DependencyRenderer, EntityRenderer
from ..tokens import Doc
from ..compat import b_to_str
from ..util import prints, is_in_jupyter


_html = {}
IS_JUPYTER = is_in_jupyter()


def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
           options={}, manual=False):
    """Render displaCy visualisation.

    docs (list or Doc): Document(s) to visualise.
    style (unicode): Visualisation style, 'dep' or 'ent'.
    page (bool): Render markup as full HTML page.
    minify (bool): Minify HTML markup.
    jupyter (bool): Experimental, use Jupyter's `display()` to output markup.
    options (dict): Visualiser-specific options, e.g. colors.
    manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
    RETURNS (unicode): Rendered HTML markup.
    """
    factories = {'dep': (DependencyRenderer, parse_deps),
                 'ent': (EntityRenderer, parse_ents)}
    if style not in factories:
        raise ValueError("Unknown style: %s" % style)
    if isinstance(docs, Doc) or isinstance(docs, dict):
        docs = [docs]
    renderer, converter = factories[style]
    renderer = renderer(options=options)
    parsed = [converter(doc, options) for doc in docs] if not manual else docs
    _html['parsed'] = renderer.render(parsed, page=page, minify=minify).strip()
    html = _html['parsed']
    if jupyter:  # return HTML rendered by IPython display()
        from IPython.core.display import display, HTML
        return display(HTML(html))
    return html


def serve(docs, style='dep', page=True, minify=False, options={}, manual=False,
          port=5000):
    """Serve displaCy visualisation.

    docs (list or Doc): Document(s) to visualise.
    style (unicode): Visualisation style, 'dep' or 'ent'.
    page (bool): Render markup as full HTML page.
    minify (bool): Minify HTML markup.
    options (dict): Visualiser-specific options, e.g. colors.
    manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts.
    port (int): Port to serve visualisation.
    """
    from wsgiref import simple_server
    render(docs, style=style, page=page, minify=minify, options=options,
           manual=manual)
    httpd = simple_server.make_server('0.0.0.0', port, app)
    prints("Using the '%s' visualizer" % style,
           title="Serving on port %d..." % port)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        prints("Shutting down server on port %d." % port)
    finally:
        httpd.server_close()


def app(environ, start_response):
    # headers and status need to be bytes in Python 2, see #1227
    headers = [(b_to_str(b'Content-type'),
                b_to_str(b'text/html; charset=utf-8'))]
    start_response(b_to_str(b'200 OK'), headers)
    res = _html['parsed'].encode(encoding='utf-8')
    return [res]


def parse_deps(orig_doc, options={}):
    """Generate dependency parse in {'words': [], 'arcs': []} format.

    doc (Doc): Document to parse.
    RETURNS (dict): Generated dependency parse keyed by words and arcs.
    """
    doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes())
    if options.get('collapse_punct', True):
        spans = []
        for word in doc[:-1]:
            if word.is_punct or not word.nbor(1).is_punct:
                continue
            start = word.i
            end = word.i + 1
            while end < len(doc) and doc[end].is_punct:
                end += 1
            span = doc[start:end]
            spans.append((span.start_char, span.end_char, word.tag_,
                          word.lemma_, word.ent_type_))
        for span_props in spans:
            doc.merge(*span_props)
    words = [{'text': w.text, 'tag': w.tag_} for w in doc]
    arcs = []
    for word in doc:
        if word.i < word.head.i:
            arcs.append({'start': word.i, 'end': word.head.i,
                         'label': word.dep_, 'dir': 'left'})
        elif word.i > word.head.i:
            arcs.append({'start': word.head.i, 'end': word.i,
                         'label': word.dep_, 'dir': 'right'})
    return {'words': words, 'arcs': arcs}


def parse_ents(doc, options={}):
    """Generate named entities in [{start: i, end: i, label: 'label'}] format.

    doc (Doc): Document to parse.
    RETURNS (dict): Generated entities keyed by text (original text) and ents.
    """
    ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
            for ent in doc.ents]
    title = (doc.user_data.get('title', None)
             if hasattr(doc, 'user_data') else None)
    return {'text': doc.text, 'ents': ents, 'title': title}
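A minimal usage sketch for the new module, assuming an installed English model (the example texts are illustrative only):

import spacy
from spacy import displacy

nlp = spacy.load('en')
doc = nlp(u'This is a sentence about Google.')

# Return dependency-parse SVG markup as a string
svg = displacy.render(doc, style='dep')

# Render entities from pre-parsed data instead of a Doc, using manual=True
ents = displacy.parse_ents(doc)
html = displacy.render(ents, style='ent', manual=True, page=True)

# Or serve the visualisation on http://localhost:5000
# displacy.serve(doc, style='dep')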
227 spacy/displacy/render.py Normal file
@@ -0,0 +1,227 @@
# coding: utf8
from __future__ import unicode_literals

from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENTS, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html


class DependencyRenderer(object):
    """Render dependency parses as SVGs."""
    style = 'dep'

    def __init__(self, options={}):
        """Initialise dependency renderer.

        options (dict): Visualiser-specific options (compact, word_spacing,
            arrow_spacing, arrow_width, arrow_stroke, distance, offset_x,
            color, bg, font)
        """
        self.compact = options.get('compact', False)
        self.word_spacing = options.get('word_spacing', 45)
        self.arrow_spacing = options.get('arrow_spacing',
                                         12 if self.compact else 20)
        self.arrow_width = options.get('arrow_width',
                                       6 if self.compact else 10)
        self.arrow_stroke = options.get('arrow_stroke', 2)
        self.distance = options.get('distance', 150 if self.compact else 175)
        self.offset_x = options.get('offset_x', 50)
        self.color = options.get('color', '#000000')
        self.bg = options.get('bg', '#ffffff')
        self.font = options.get('font', 'Arial')

    def render(self, parsed, page=False, minify=False):
        """Render complete markup.

        parsed (list): Dependency parses to render.
        page (bool): Render parses wrapped as full HTML page.
        minify (bool): Minify HTML markup.
        RETURNS (unicode): Rendered SVG or HTML markup.
        """
        rendered = [self.render_svg(i, p['words'], p['arcs'])
                    for i, p in enumerate(parsed)]
        if page:
            content = ''.join([TPL_FIGURE.format(content=svg)
                               for svg in rendered])
            markup = TPL_PAGE.format(content=content)
        else:
            markup = ''.join(rendered)
        if minify:
            return minify_html(markup)
        return markup

    def render_svg(self, render_id, words, arcs):
        """Render SVG.

        render_id (int): Unique ID, typically index of document.
        words (list): Individual words and their tags.
        arcs (list): Individual arcs and their start, end, direction and label.
        RETURNS (unicode): Rendered SVG markup.
        """
        self.levels = self.get_levels(arcs)
        self.highest_level = len(self.levels)
        self.offset_y = self.distance/2*self.highest_level+self.arrow_stroke
        self.width = self.offset_x+len(words)*self.distance
        self.height = self.offset_y+3*self.word_spacing
        self.id = render_id
        words = [self.render_word(w['text'], w['tag'], i)
                 for i, w in enumerate(words)]
        arcs = [self.render_arrow(a['label'], a['start'],
                                  a['end'], a['dir'], i)
                for i, a in enumerate(arcs)]
        content = ''.join(words) + ''.join(arcs)
        return TPL_DEP_SVG.format(id=self.id, width=self.width,
                                  height=self.height, color=self.color,
                                  bg=self.bg, font=self.font, content=content)

    def render_word(self, text, tag, i):
        """Render individual word.

        text (unicode): Word text.
        tag (unicode): Part-of-speech tag.
        i (int): Unique ID, typically word index.
        RETURNS (unicode): Rendered SVG markup.
        """
        y = self.offset_y+self.word_spacing
        x = self.offset_x+i*self.distance
        return TPL_DEP_WORDS.format(text=text, tag=tag, x=x, y=y)

    def render_arrow(self, label, start, end, direction, i):
        """Render individual arrow.

        label (unicode): Dependency label.
        start (int): Index of start word.
        end (int): Index of end word.
        direction (unicode): Arrow direction, 'left' or 'right'.
        i (int): Unique ID, typically arrow index.
        RETURNS (unicode): Rendered SVG markup.
        """
        level = self.levels.index(end-start)+1
        x_start = self.offset_x+start*self.distance+self.arrow_spacing
        y = self.offset_y
        x_end = (self.offset_x+(end-start)*self.distance+start*self.distance
                 - self.arrow_spacing*(self.highest_level-level)/4)
        y_curve = self.offset_y-level*self.distance/2
        if self.compact:
            y_curve = self.offset_y-level*self.distance/6
        if y_curve == 0 and len(self.levels) > 5:
            y_curve = -self.distance
        arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
        arc = self.get_arc(x_start, y, y_curve, x_end)
        return TPL_DEP_ARCS.format(id=self.id, i=i, stroke=self.arrow_stroke,
                                   head=arrowhead, label=label, arc=arc)

    def get_arc(self, x_start, y, y_curve, x_end):
        """Render individual arc.

        x_start (int): X-coordinate of arrow start point.
        y (int): Y-coordinate of arrow start and end point.
        y_curve (int): Y-coordinate of Cubic Bézier y_curve point.
        x_end (int): X-coordinate of arrow end point.
        RETURNS (unicode): Definition of the arc path ('d' attribute).
        """
        template = "M{x},{y} C{x},{c} {e},{c} {e},{y}"
        if self.compact:
            template = "M{x},{y} {x},{c} {e},{c} {e},{y}"
        return template.format(x=x_start, y=y, c=y_curve, e=x_end)

    def get_arrowhead(self, direction, x, y, end):
        """Render individual arrow head.

        direction (unicode): Arrow direction, 'left' or 'right'.
        x (int): X-coordinate of arrow start point.
        y (int): Y-coordinate of arrow start and end point.
        end (int): X-coordinate of arrow end point.
        RETURNS (unicode): Definition of the arrow head path ('d' attribute).
        """
        if direction == 'left':
            pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
        else:
            pos1, pos2, pos3 = (end, end+self.arrow_width-2,
                                end-self.arrow_width+2)
        arrowhead = (pos1, y+2, pos2, y-self.arrow_width, pos3,
                     y-self.arrow_width)
        return "M{},{} L{},{} {},{}".format(*arrowhead)

    def get_levels(self, arcs):
        """Calculate available arc height "levels".
        Used to calculate arrow heights dynamically and without wasting space.

        arcs (list): Individual arcs and their start, end, direction and label.
        RETURNS (list): Arc levels sorted from lowest to highest.
        """
        levels = set(map(lambda arc: arc['end'] - arc['start'], arcs))
        return sorted(list(levels))


class EntityRenderer(object):
    """Render named entities as HTML."""
    style = 'ent'

    def __init__(self, options={}):
        """Initialise entity renderer.

        options (dict): Visualiser-specific options (colors, ents)
        """
        colors = {'ORG': '#7aecec', 'PRODUCT': '#bfeeb7', 'GPE': '#feca74',
                  'LOC': '#ff9561', 'PERSON': '#aa9cfc', 'NORP': '#c887fb',
                  'FACILITY': '#9cc9cc', 'EVENT': '#ffeb80', 'LAW': '#ff8197',
                  'LANGUAGE': '#ff8197', 'WORK_OF_ART': '#f0d0ff',
                  'DATE': '#bfe1d9', 'TIME': '#bfe1d9', 'MONEY': '#e4e7d2',
                  'QUANTITY': '#e4e7d2', 'ORDINAL': '#e4e7d2',
                  'CARDINAL': '#e4e7d2', 'PERCENT': '#e4e7d2'}
        colors.update(options.get('colors', {}))
        self.default_color = '#ddd'
        self.colors = colors
        self.ents = options.get('ents', None)

    def render(self, parsed, page=False, minify=False):
        """Render complete markup.

        parsed (list): Dependency parses to render.
        page (bool): Render parses wrapped as full HTML page.
        minify (bool): Minify HTML markup.
        RETURNS (unicode): Rendered HTML markup.
        """
        rendered = [self.render_ents(p['text'], p['ents'],
                                     p.get('title', None)) for p in parsed]
        if page:
            docs = ''.join([TPL_FIGURE.format(content=doc)
                            for doc in rendered])
            markup = TPL_PAGE.format(content=docs)
        else:
            markup = ''.join(rendered)
        if minify:
            return minify_html(markup)
        return markup

    def render_ents(self, text, spans, title):
        """Render entities in text.

        text (unicode): Original text.
        spans (list): Individual entity spans and their start, end and label.
        title (unicode or None): Document title set in Doc.user_data['title'].
        """
        markup = ''
        offset = 0
        for span in spans:
            label = span['label']
            start = span['start']
            end = span['end']
            entity = text[start:end]
            fragments = text[offset:start].split('\n')
            for i, fragment in enumerate(fragments):
                markup += fragment
                if len(fragments) > 1 and i != len(fragments)-1:
                    markup += '</br>'
            if self.ents is None or label.upper() in self.ents:
                color = self.colors.get(label.upper(), self.default_color)
                markup += TPL_ENT.format(label=label, text=entity, bg=color)
            else:
                markup += entity
            offset = end
        markup += text[offset:]
        markup = TPL_ENTS.format(content=markup, colors=self.colors)
        if title:
            markup = TPL_TITLE.format(title=title) + markup
        return markup
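A small sketch of driving the renderer directly with the manual `{'words': [...], 'arcs': [...]}` format the code above consumes; the sentence, tags and labels are illustrative:

from spacy.displacy.render import DependencyRenderer

parse = {
    'words': [{'text': 'This', 'tag': 'DT'},
              {'text': 'works', 'tag': 'VBZ'}],
    'arcs': [{'start': 0, 'end': 1, 'label': 'nsubj', 'dir': 'left'}],
}
renderer = DependencyRenderer(options={'compact': True})
svg = renderer.render([parse], page=False)  # one <svg> element per parse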
63 spacy/displacy/templates.py Normal file
@@ -0,0 +1,63 @@
# coding: utf8
from __future__ import unicode_literals


# setting explicit height and max-width: none on the SVG is required for
# Jupyter to render it properly in a cell

TPL_DEP_SVG = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="{id}" class="displacy" width="{width}" height="{height}" style="max-width: none; height: {height}px; color: {color}; background: {bg}; font-family: {font}">{content}</svg>
"""


TPL_DEP_WORDS = """
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="{y}">
    <tspan class="displacy-word" fill="currentColor" x="{x}">{text}</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="{x}">{tag}</tspan>
</text>
"""


TPL_DEP_ARCS = """
<g class="displacy-arrow">
    <path class="displacy-arc" id="arrow-{id}-{i}" stroke-width="{stroke}px" d="{arc}" fill="none" stroke="currentColor"/>
    <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
        <textPath xlink:href="#arrow-{id}-{i}" class="displacy-label" startOffset="50%" fill="currentColor" text-anchor="middle">{label}</textPath>
    </text>
    <path class="displacy-arrowhead" d="{head}" fill="currentColor"/>
</g>
"""


TPL_FIGURE = """
<figure style="margin-bottom: 6rem">{content}</figure>
"""

TPL_TITLE = """
<h2 style="margin: 0">{title}</h2>
"""


TPL_ENTS = """
<div class="entities" style="line-height: 2.5">{content}</div>
"""


TPL_ENT = """
<mark class="entity" style="background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone">
    {text}
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">{label}</span>
</mark>
"""


TPL_PAGE = """
<!DOCTYPE html>
<html>
    <head>
        <title>displaCy</title>
    </head>

    <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem;">{content}</body>
</html>
"""
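For illustration only: the templates are plain `str.format` strings, so a single token renders like this (the coordinate values are arbitrary):

from spacy.displacy.templates import TPL_DEP_WORDS

# Fills the {text}, {tag}, {x} and {y} placeholders of the word template
print(TPL_DEP_WORDS.format(text='works', tag='VBZ', x=225, y=152.5))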
@@ -1,34 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

from ..language import Language
from ..lemmatizer import Lemmatizer
from ..vocab import Vocab
from ..tokenizer import Tokenizer
from ..attrs import LANG
from ..deprecated import fix_glove_vectors_loading

from .language_data import *


class English(Language):
    lang = 'en'

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'en'

        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        tag_map = TAG_MAP
        stop_words = STOP_WORDS

        morph_rules = dict(MORPH_RULES)
        lemma_rules = dict(LEMMA_RULES)
        lemma_index = dict(LEMMA_INDEX)
        lemma_exc = dict(LEMMA_EXC)


    def __init__(self, **overrides):
        # Special-case hack for loading the GloVe vectors, to support <1.0
        overrides = fix_glove_vectors_loading(overrides)
        Language.__init__(self, **overrides)
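With the spaCy 2 layout this PR introduces, the language class no longer lives in `spacy/en/__init__.py`; a minimal sketch of the equivalent usage under that assumption:

from spacy.lang.en import English

nlp = English()  # blank pipeline; trained models are loaded via spacy.load('en')
doc = nlp(u'Tokenization works without a trained model.')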
@@ -1,5 +0,0 @@
from ..deprecated import ModelDownload as download


if __name__ == '__main__':
    download.en()
Some files were not shown because too many files have changed in this diff.