mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Merge branch 'master' of https://github.com/explosion/spaCy
commit
2ba5141113
69
README.rst
|
@ -7,7 +7,7 @@ the very latest research, but it isn't researchware. It was designed from day 1
|
||||||
to be used in real products. It's commercial open-source software, released under
|
to be used in real products. It's commercial open-source software, released under
|
||||||
the MIT license.
|
the MIT license.
|
||||||
|
|
||||||
💫 **Version 1.0 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
|
💫 **Version 1.1 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
|
||||||
|
|
||||||
.. image:: http://i.imgur.com/wFvLZyJ.png
|
.. image:: http://i.imgur.com/wFvLZyJ.png
|
||||||
:target: https://travis-ci.org/explosion/spaCy
|
:target: https://travis-ci.org/explosion/spaCy
|
||||||
|
@ -78,20 +78,11 @@ Install spaCy
|
||||||
=============
|
=============
|
||||||
|
|
||||||
spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux, OS X
|
spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux, OS X
|
||||||
and Windows. Source and binary packages are available via
|
and Windows. Source packages are available via
|
||||||
`pip <https://pypi.python.org/pypi/spacy>`_ and `conda <https://anaconda.org/spacy/spacy>`_.
|
`pip <https://pypi.python.org/pypi/spacy>`_. Please make sure that
|
||||||
If there are no binary packages for your platform available please make sure that
|
|
||||||
you have a working build environment set up. See notes on Ubuntu, OS X and Windows
|
you have a working build environment set up. See notes on Ubuntu, OS X and Windows
|
||||||
for details.
|
for details.
|
||||||
|
|
||||||
conda
|
|
||||||
-----
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
conda config --add channels spacy # only needed once
|
|
||||||
conda install spacy
|
|
||||||
|
|
||||||
pip
|
pip
|
||||||
---
|
---
|
||||||
|
|
||||||
|
@ -100,12 +91,6 @@ avoid modifying system state:
|
||||||
|
|
||||||
.. code:: bash
|
.. code:: bash
|
||||||
|
|
||||||
# make sure you are using a recent pip/virtualenv version
|
|
||||||
python -m pip install -U pip virtualenv
|
|
||||||
|
|
||||||
virtualenv .env
|
|
||||||
source .env/bin/activate
|
|
||||||
|
|
||||||
pip install spacy
|
pip install spacy
|
||||||
|
|
||||||
Python packaging is awkward at the best of times, and it's particularly tricky with
|
Python packaging is awkward at the best of times, and it's particularly tricky with
|
||||||
|
@ -120,17 +105,10 @@ English and German, named ``en`` and ``de``, are available.
|
||||||
|
|
||||||
.. code:: bash
|
.. code:: bash
|
||||||
|
|
||||||
python -m spacy.en.download
|
python -m spacy.en.download all
|
||||||
python -m spacy.de.download
|
python -m spacy.de.download all
|
||||||
sputnik --name spacy en_glove_cc_300_1m_vectors # For better word vectors
|
|
||||||
|
|
||||||
Then check whether the model was successfully installed:
|
The download command fetches about 1 GB of data which it installs
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
python -c "import spacy; spacy.load('en'); print('OK')"
|
|
||||||
|
|
||||||
The download command fetches and installs about 500 MB of data which it installs
|
|
||||||
within the ``spacy`` package directory.
|
within the ``spacy`` package directory.
|
||||||
|
|
||||||
Upgrading spaCy
|
Upgrading spaCy
|
||||||
|
@ -138,13 +116,6 @@ Upgrading spaCy
|
||||||
|
|
||||||
To upgrade spaCy to the latest release:
|
To upgrade spaCy to the latest release:
|
||||||
|
|
||||||
conda
|
|
||||||
-----
|
|
||||||
|
|
||||||
.. code:: bash
|
|
||||||
|
|
||||||
conda update spacy
|
|
||||||
|
|
||||||
pip
|
pip
|
||||||
---
|
---
|
||||||
|
|
||||||
|
@ -183,7 +154,7 @@ system. See notes on Ubuntu, OS X and Windows for details.
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
pip install -e .
|
pip install -e .
|
||||||
|
|
||||||
Compared to regular install via pip and conda `requirements.txt <requirements.txt>`_
|
Compared to regular install via pip `requirements.txt <requirements.txt>`_
|
||||||
additionally installs developer dependencies such as cython.
|
additionally installs developer dependencies such as cython.
|
||||||
|
|
||||||
Ubuntu
|
Ubuntu
|
||||||
|
@ -208,6 +179,11 @@ Install a version of Visual Studio Express or higher that matches the version
|
||||||
that was used to compile your Python interpreter. For official distributions
|
that was used to compile your Python interpreter. For official distributions
|
||||||
these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
|
these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
|
||||||
|
|
||||||
|
If you don't want to install the entire Visual Studio, you can install a
|
||||||
|
stand-alone compiler. Make sure that you install the correct version for
|
||||||
|
your version of Python. See https://wiki.python.org/moin/WindowsCompilers for
|
||||||
|
links to download these.
|
||||||
|
|
||||||
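The Windows note above pairs each official Python distribution with the Visual Studio release used to compile it. As a rough illustration only, the sketch below looks up that pairing for the running interpreter. The helper name ``required_visual_studio`` and the table ``VS_FOR_PYTHON`` are hypothetical and not part of spaCy. The table contains only the three pairs listed in the README text.

```python
import sys

# Pairs copied from the README text above; other Python versions are
# deliberately left out because the README does not list them.
VS_FOR_PYTHON = {
    (2, 7): "VS 2008",
    (3, 4): "VS 2010",
    (3, 5): "VS 2015",
}


def required_visual_studio(version=None):
    """Return the Visual Studio release matching a Python version.

    ``version`` is a tuple such as (3, 5) or (3, 5, 2); it defaults to the
    running interpreter. Returns None when the version is not in the table.
    """
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return VS_FOR_PYTHON.get(major_minor)


if __name__ == "__main__":
    match = required_visual_studio()
    if match is None:
        print("No entry for this Python version; see the WindowsCompilers wiki page.")
    else:
        print("This interpreter was most likely built with " + match)
```

A lookup that returns ``None`` means the version is not covered by the README, in which case the wiki page linked above is the place to check.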
Run tests
|
Run tests
|
||||||
=========
|
=========
|
||||||
|
|
||||||
|
@ -242,8 +218,25 @@ For the detailed documentation, check out the `spaCy website <https://spacy.io/d
|
||||||
Changelog
|
Changelog
|
||||||
=========
|
=========
|
||||||
|
|
||||||
2016-10-18 `v1.0 <https://github.com/explosion/spaCy/releases/>`_: *Support for deep learning workflows and entity-aware rule matcher*
|
2016-10-23 `v1.1.0 <https://github.com/explosion/spaCy/releases>`_: *Bug fixes and adjustments*
|
||||||
----------------------------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
* Rename new ``pipeline`` keyword argument of ``spacy.load()`` to ``create_pipeline``.
|
||||||
|
* Rename new ``vectors`` keyword argument of ``spacy.load()`` to ``add_vectors``.
|
||||||
|
|
||||||
|
**🔴 Bug fixes**
|
||||||
|
|
||||||
|
* Fix issue `#544 <https://github.com/explosion/spaCy/issues/544>`_: Add ``vocab.resize_vectors()`` method, to support changing to vectors of different dimensionality.
|
||||||
|
* Fix issue `#536 <https://github.com/explosion/spaCy/issues/536>`_: Default probability was incorrect for OOV words.
|
||||||
|
* Fix issue `#539 <https://github.com/explosion/spaCy/issues/539>`_: Unspecified encoding when opening some JSON files.
|
||||||
|
* Fix issue `#541 <https://github.com/explosion/spaCy/issues/541>`_: GloVe vectors were being loaded incorrectly.
|
||||||
|
* Fix issue `#522 <https://github.com/explosion/spaCy/issues/522>`_: Similarities and vector norms were calculated incorrectly.
|
||||||
|
* Fix issue `#461 <https://github.com/explosion/spaCy/issues/461>`_: ``ent_iob`` attribute was incorrect after setting entities via ``doc.ents``.
|
||||||
|
* Fix issue `#459 <https://github.com/explosion/spaCy/issues/459>`_: Deserialiser failed on empty doc.
|
||||||
|
* Fix issue `#514 <https://github.com/explosion/spaCy/issues/514>`_: Serialization failed after adding a new entity label.
|
||||||
|
|
||||||
|
2016-10-18 `v1.0.0 <https://github.com/explosion/spaCy/releases/tag/v1.0.0>`_: *Support for deep learning workflows and entity-aware rule matcher*
|
||||||
|
--------------------------------------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
**✨ Major features and improvements**
|
**✨ Major features and improvements**
|
||||||
|
|
||||||
|
|
|
@ -1,95 +0,0 @@
|
||||||
Syllogism Contributor Agreement
|
|
||||||
===============================
|
|
||||||
|
|
||||||
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
|
|
||||||
Agreement. The SCA applies to any contribution that you make to any product or
|
|
||||||
project managed by us (the “project”), and sets out the intellectual property
|
|
||||||
rights you grant to us in the contributed materials. The term “us” shall mean
|
|
||||||
Syllogism Co. The term "you" shall mean the person or entity identified below.
|
|
||||||
If you agree to be bound by these terms, fill in the information requested below
|
|
||||||
and include the filled-in version with your first pull-request, under the file
|
|
||||||
contrbutors/. The name of the file should be your GitHub username, with the
|
|
||||||
extension .md. For example, the user example_user would create the file
|
|
||||||
spaCy/contributors/example_user.md .
|
|
||||||
|
|
||||||
Read this agreement carefully before signing. These terms and conditions
|
|
||||||
constitute a binding legal agreement.
|
|
||||||
|
|
||||||
1. The term 'contribution' or ‘contributed materials’ means any source code,
|
|
||||||
object code, patch, tool, sample, graphic, specification, manual, documentation,
|
|
||||||
or any other material posted or submitted by you to the project.
|
|
||||||
|
|
||||||
2. With respect to any worldwide copyrights, or copyright applications and registrations,
|
|
||||||
in your contribution:
|
|
||||||
* you hereby assign to us joint ownership, and to the extent that such assignment
|
|
||||||
is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual,
|
|
||||||
irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license
|
|
||||||
to exercise all rights under those copyrights. This includes, at our option, the
|
|
||||||
right to sublicense these same rights to third parties through multiple levels of
|
|
||||||
sublicensees or other licensing arrangements;
|
|
||||||
|
|
||||||
* you agree that each of us can do all things in relation to your contribution
|
|
||||||
as if each of us were the sole owners, and if one of us makes a derivative work
|
|
||||||
of your contribution, the one who makes the derivative work (or has it made) will
|
|
||||||
be the sole owner of that derivative work;
|
|
||||||
|
|
||||||
* you agree that you will not assert any moral rights in your contribution against
|
|
||||||
us, our licensees or transferees;
|
|
||||||
|
|
||||||
* you agree that we may register a copyright in your contribution and exercise
|
|
||||||
all ownership rights associated with it; and
|
|
||||||
|
|
||||||
* you agree that neither of us has any duty to consult with, obtain the consent
|
|
||||||
of, pay or render an accounting to the other for any use or distribution of your
|
|
||||||
contribution.
|
|
||||||
|
|
||||||
3. With respect to any patents you own, or that you can license without payment
|
|
||||||
to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive,
|
|
||||||
worldwide, no-charge, royalty-free license to:
|
|
||||||
|
|
||||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer your
|
|
||||||
contribution in whole or in part, alone or in combination with
|
|
||||||
or included in any product, work or materials arising out of the project to
|
|
||||||
which your contribution was submitted, and
|
|
||||||
|
|
||||||
* at our option, to sublicense these same rights to third parties through multiple
|
|
||||||
levels of sublicensees or other licensing arrangements.
|
|
||||||
|
|
||||||
4. Except as set out above, you keep all right, title, and interest in your
|
|
||||||
contribution. The rights that you grant to us under these terms are effective on
|
|
||||||
the date you first submitted a contribution to us, even if your submission took
|
|
||||||
place before the date you sign these terms.
|
|
||||||
|
|
||||||
5. You covenant, represent, warrant and agree that:
|
|
||||||
|
|
||||||
* Each contribution that you submit is and shall be an original work of authorship
|
|
||||||
and you can legally grant the rights set out in this SCA;
|
|
||||||
|
|
||||||
* to the best of your knowledge, each contribution will not violate any third
|
|
||||||
party's copyrights, trademarks, patents, or other intellectual property rights; and
|
|
||||||
|
|
||||||
* each contribution shall be in compliance with U.S. export control laws and other
|
|
||||||
applicable export and import laws. You agree to notify us if you become aware of
|
|
||||||
any circumstance which would make any of the foregoing representations inaccurate
|
|
||||||
in any respect. Syllogism Co. may publicly disclose your participation in the project,
|
|
||||||
including the fact that you have signed the SCA.
|
|
||||||
|
|
||||||
6. This SCA is governed by the laws of the State of California and applicable U.S.
|
|
||||||
Federal law. Any choice of law rules will not apply.
|
|
||||||
|
|
||||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
|
||||||
mark both statements:
|
|
||||||
|
|
||||||
_x__ I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect my contributions.
|
|
||||||
|
|
||||||
____ I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
|
|
||||||
|
|
||||||
| Field | Entry |
|
|
||||||
|------------------------------- | -------------------- |
|
|
||||||
| Name | J Nicolas Schrading |
|
|
||||||
| Company's name (if applicable) | |
|
|
||||||
| Title or Role (if applicable) | |
|
|
||||||
| Date | 2015-08-24 |
|
|
||||||
| GitHub username | NSchrading |
|
|
||||||
| Website (optional) | nicschrading.com |
|
|
||||||
|
|
|
@ -1,95 +0,0 @@
|
||||||
Syllogism Contributor Agreement
|
|
||||||
===============================
|
|
||||||
|
|
||||||
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
|
|
||||||
Agreement. The SCA applies to any contribution that you make to any product or
|
|
||||||
project managed by us (the “project”), and sets out the intellectual property
|
|
||||||
rights you grant to us in the contributed materials. The term “us” shall mean
|
|
||||||
Syllogism Co. The term "you" shall mean the person or entity identified below.
|
|
||||||
If you agree to be bound by these terms, fill in the information requested below
|
|
||||||
and include the filled-in version with your first pull-request, under the file
|
|
||||||
contrbutors/. The name of the file should be your GitHub username, with the
|
|
||||||
extension .md. For example, the user example_user would create the file
|
|
||||||
spaCy/contributors/example_user.md .
|
|
||||||
|
|
||||||
Read this agreement carefully before signing. These terms and conditions
|
|
||||||
constitute a binding legal agreement.
|
|
||||||
|
|
||||||
1. The term 'contribution' or ‘contributed materials’ means any source code,
|
|
||||||
object code, patch, tool, sample, graphic, specification, manual, documentation,
|
|
||||||
or any other material posted or submitted by you to the project.
|
|
||||||
|
|
||||||
2. With respect to any worldwide copyrights, or copyright applications and registrations,
|
|
||||||
in your contribution:
|
|
||||||
* you hereby assign to us joint ownership, and to the extent that such assignment
|
|
||||||
is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual,
|
|
||||||
irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license
|
|
||||||
to exercise all rights under those copyrights. This includes, at our option, the
|
|
||||||
right to sublicense these same rights to third parties through multiple levels of
|
|
||||||
sublicensees or other licensing arrangements;
|
|
||||||
|
|
||||||
* you agree that each of us can do all things in relation to your contribution
|
|
||||||
as if each of us were the sole owners, and if one of us makes a derivative work
|
|
||||||
of your contribution, the one who makes the derivative work (or has it made) will
|
|
||||||
be the sole owner of that derivative work;
|
|
||||||
|
|
||||||
* you agree that you will not assert any moral rights in your contribution against
|
|
||||||
us, our licensees or transferees;
|
|
||||||
|
|
||||||
* you agree that we may register a copyright in your contribution and exercise
|
|
||||||
all ownership rights associated with it; and
|
|
||||||
|
|
||||||
* you agree that neither of us has any duty to consult with, obtain the consent
|
|
||||||
of, pay or render an accounting to the other for any use or distribution of your
|
|
||||||
contribution.
|
|
||||||
|
|
||||||
3. With respect to any patents you own, or that you can license without payment
|
|
||||||
to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive,
|
|
||||||
worldwide, no-charge, royalty-free license to:
|
|
||||||
|
|
||||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer your
|
|
||||||
contribution in whole or in part, alone or in combination with
|
|
||||||
or included in any product, work or materials arising out of the project to
|
|
||||||
which your contribution was submitted, and
|
|
||||||
|
|
||||||
* at our option, to sublicense these same rights to third parties through multiple
|
|
||||||
levels of sublicensees or other licensing arrangements.
|
|
||||||
|
|
||||||
4. Except as set out above, you keep all right, title, and interest in your
|
|
||||||
contribution. The rights that you grant to us under these terms are effective on
|
|
||||||
the date you first submitted a contribution to us, even if your submission took
|
|
||||||
place before the date you sign these terms.
|
|
||||||
|
|
||||||
5. You covenant, represent, warrant and agree that:
|
|
||||||
|
|
||||||
* Each contribution that you submit is and shall be an original work of authorship
|
|
||||||
and you can legally grant the rights set out in this SCA;
|
|
||||||
|
|
||||||
* to the best of your knowledge, each contribution will not violate any third
|
|
||||||
party's copyrights, trademarks, patents, or other intellectual property rights; and
|
|
||||||
|
|
||||||
* each contribution shall be in compliance with U.S. export control laws and other
|
|
||||||
applicable export and import laws. You agree to notify us if you become aware of
|
|
||||||
any circumstance which would make any of the foregoing representations inaccurate
|
|
||||||
in any respect. Syllogism Co. may publicly disclose your participation in the project,
|
|
||||||
including the fact that you have signed the SCA.
|
|
||||||
|
|
||||||
6. This SCA is governed by the laws of the State of California and applicable U.S.
|
|
||||||
Federal law. Any choice of law rules will not apply.
|
|
||||||
|
|
||||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
|
||||||
mark both statements:
|
|
||||||
|
|
||||||
x I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect my contributions.
|
|
||||||
|
|
||||||
____ I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
|
|
||||||
|
|
||||||
| Field | Entry |
|
|
||||||
|------------------------------- | -------------------- |
|
|
||||||
| Name | Chris DuBois |
|
|
||||||
| Company's name (if applicable) | |
|
|
||||||
| Title or Role (if applicable) | |
|
|
||||||
| Date | 2015.10.07 |
|
|
||||||
| GitHub username | chrisdubois |
|
|
||||||
| Website (optional) | |
|
|
||||||
|
|
|
@ -1,13 +0,0 @@
|
||||||
Signing the Contributors License Agreement
|
|
||||||
==========================================
|
|
||||||
|
|
||||||
SpaCy is a commercial open-source project, owned by Syllogism Co. We require that contributors to SpaCy sign our Contributors License Agreement, which is based on the Oracle Contributor Agreement.
|
|
||||||
|
|
||||||
The CLA must be signed on your first pull request. To do this, simply fill in the file cla_template.md, and include the filed in form in your first pull request.
|
|
||||||
|
|
||||||
$ git clone https://github.com/honnibal/spaCy
|
|
||||||
$ cp spaCy/contributors/cla_template.md spaCy/contributors/<your GitHub username>.md
|
|
||||||
<Now fill in the file spaCy/contributors/<your GitHub username>.md>
|
|
||||||
$ git add -A spaCy/contributors/<your GitHub username>.md
|
|
||||||
|
|
||||||
Now finish your pull request, and you're done.
|
|
|
@ -1,95 +0,0 @@
|
||||||
Syllogism Contributor Agreement
|
|
||||||
===============================
|
|
||||||
|
|
||||||
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
|
|
||||||
Agreement. The SCA applies to any contribution that you make to any product or
|
|
||||||
project managed by us (the “project”), and sets out the intellectual property
|
|
||||||
rights you grant to us in the contributed materials. The term “us” shall mean
|
|
||||||
Syllogism Co. The term "you" shall mean the person or entity identified below.
|
|
||||||
If you agree to be bound by these terms, fill in the information requested below
|
|
||||||
and include the filled-in version with your first pull-request, under the file
|
|
||||||
contrbutors/. The name of the file should be your GitHub username, with the
|
|
||||||
extension .md. For example, the user example_user would create the file
|
|
||||||
spaCy/contributors/example_user.md .
|
|
||||||
|
|
||||||
Read this agreement carefully before signing. These terms and conditions
|
|
||||||
constitute a binding legal agreement.
|
|
||||||
|
|
||||||
1. The term 'contribution' or ‘contributed materials’ means any source code,
|
|
||||||
object code, patch, tool, sample, graphic, specification, manual, documentation,
|
|
||||||
or any other material posted or submitted by you to the project.
|
|
||||||
|
|
||||||
2. With respect to any worldwide copyrights, or copyright applications and registrations,
|
|
||||||
in your contribution:
|
|
||||||
* you hereby assign to us joint ownership, and to the extent that such assignment
|
|
||||||
is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual,
|
|
||||||
irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license
|
|
||||||
to exercise all rights under those copyrights. This includes, at our option, the
|
|
||||||
right to sublicense these same rights to third parties through multiple levels of
|
|
||||||
sublicensees or other licensing arrangements;
|
|
||||||
|
|
||||||
* you agree that each of us can do all things in relation to your contribution
|
|
||||||
as if each of us were the sole owners, and if one of us makes a derivative work
|
|
||||||
of your contribution, the one who makes the derivative work (or has it made) will
|
|
||||||
be the sole owner of that derivative work;
|
|
||||||
|
|
||||||
* you agree that you will not assert any moral rights in your contribution against
|
|
||||||
us, our licensees or transferees;
|
|
||||||
|
|
||||||
* you agree that we may register a copyright in your contribution and exercise
|
|
||||||
all ownership rights associated with it; and
|
|
||||||
|
|
||||||
* you agree that neither of us has any duty to consult with, obtain the consent
|
|
||||||
of, pay or render an accounting to the other for any use or distribution of your
|
|
||||||
contribution.
|
|
||||||
|
|
||||||
3. With respect to any patents you own, or that you can license without payment
|
|
||||||
to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive,
|
|
||||||
worldwide, no-charge, royalty-free license to:
|
|
||||||
|
|
||||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer your
|
|
||||||
contribution in whole or in part, alone or in combination with
|
|
||||||
or included in any product, work or materials arising out of the project to
|
|
||||||
which your contribution was submitted, and
|
|
||||||
|
|
||||||
* at our option, to sublicense these same rights to third parties through multiple
|
|
||||||
levels of sublicensees or other licensing arrangements.
|
|
||||||
|
|
||||||
4. Except as set out above, you keep all right, title, and interest in your
|
|
||||||
contribution. The rights that you grant to us under these terms are effective on
|
|
||||||
the date you first submitted a contribution to us, even if your submission took
|
|
||||||
place before the date you sign these terms.
|
|
||||||
|
|
||||||
5. You covenant, represent, warrant and agree that:
|
|
||||||
|
|
||||||
* Each contribution that you submit is and shall be an original work of authorship
|
|
||||||
and you can legally grant the rights set out in this SCA;
|
|
||||||
|
|
||||||
* to the best of your knowledge, each contribution will not violate any third
|
|
||||||
party's copyrights, trademarks, patents, or other intellectual property rights; and
|
|
||||||
|
|
||||||
* each contribution shall be in compliance with U.S. export control laws and other
|
|
||||||
applicable export and import laws. You agree to notify us if you become aware of
|
|
||||||
any circumstance which would make any of the foregoing representations inaccurate
|
|
||||||
in any respect. Syllogism Co. may publicly disclose your participation in the project,
|
|
||||||
including the fact that you have signed the SCA.
|
|
||||||
|
|
||||||
6. This SCA is governed by the laws of the State of California and applicable U.S.
|
|
||||||
Federal law. Any choice of law rules will not apply.
|
|
||||||
|
|
||||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
|
||||||
mark both statements:
|
|
||||||
|
|
||||||
____ I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect my contributions.
|
|
||||||
|
|
||||||
____ I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
|
|
||||||
|
|
||||||
| Field | Entry |
|
|
||||||
|------------------------------- | -------------------- |
|
|
||||||
| Name | |
|
|
||||||
| Company's name (if applicable) | |
|
|
||||||
| Title or Role (if applicable) | |
|
|
||||||
| Date | |
|
|
||||||
| GitHub username | |
|
|
||||||
| Website (optional) | |
|
|
||||||
|
|
|
@ -1,95 +0,0 @@
|
||||||
Syllogism Contributor Agreement
|
|
||||||
===============================
|
|
||||||
|
|
||||||
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
|
|
||||||
Agreement. The SCA applies to any contribution that you make to any product or
|
|
||||||
project managed by us (the “project”), and sets out the intellectual property
|
|
||||||
rights you grant to us in the contributed materials. The term “us” shall mean
|
|
||||||
Syllogism Co. The term "you" shall mean the person or entity identified below.
|
|
||||||
If you agree to be bound by these terms, fill in the information requested below
|
|
||||||
and include the filled-in version with your first pull-request, under the file
|
|
||||||
contrbutors/. The name of the file should be your GitHub username, with the
|
|
||||||
extension .md. For example, the user example_user would create the file
|
|
||||||
spaCy/contributors/example_user.md .
|
|
||||||
|
|
||||||
Read this agreement carefully before signing. These terms and conditions
|
|
||||||
constitute a binding legal agreement.
|
|
||||||
|
|
||||||
1. The term 'contribution' or ‘contributed materials’ means any source code,
|
|
||||||
object code, patch, tool, sample, graphic, specification, manual, documentation,
|
|
||||||
or any other material posted or submitted by you to the project.
|
|
||||||
|
|
||||||
2. With respect to any worldwide copyrights, or copyright applications and registrations,
|
|
||||||
in your contribution:
|
|
||||||
* you hereby assign to us joint ownership, and to the extent that such assignment
|
|
||||||
is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual,
|
|
||||||
irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license
|
|
||||||
to exercise all rights under those copyrights. This includes, at our option, the
|
|
||||||
right to sublicense these same rights to third parties through multiple levels of
|
|
||||||
sublicensees or other licensing arrangements;
|
|
||||||
|
|
||||||
* you agree that each of us can do all things in relation to your contribution
|
|
||||||
as if each of us were the sole owners, and if one of us makes a derivative work
|
|
||||||
of your contribution, the one who makes the derivative work (or has it made) will
|
|
||||||
be the sole owner of that derivative work;
|
|
||||||
|
|
||||||
* you agree that you will not assert any moral rights in your contribution against
|
|
||||||
us, our licensees or transferees;
|
|
||||||
|
|
||||||
* you agree that we may register a copyright in your contribution and exercise
|
|
||||||
all ownership rights associated with it; and
|
|
||||||
|
|
||||||
* you agree that neither of us has any duty to consult with, obtain the consent
|
|
||||||
of, pay or render an accounting to the other for any use or distribution of your
|
|
||||||
contribution.
|
|
||||||
|
|
||||||
3. With respect to any patents you own, or that you can license without payment
|
|
||||||
to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive,
|
|
||||||
worldwide, no-charge, royalty-free license to:
|
|
||||||
|
|
||||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer your
|
|
||||||
contribution in whole or in part, alone or in combination with
|
|
||||||
or included in any product, work or materials arising out of the project to
|
|
||||||
which your contribution was submitted, and
|
|
||||||
|
|
||||||
* at our option, to sublicense these same rights to third parties through multiple
|
|
||||||
levels of sublicensees or other licensing arrangements.
|
|
||||||
|
|
||||||
4. Except as set out above, you keep all right, title, and interest in your
|
|
||||||
contribution. The rights that you grant to us under these terms are effective on
|
|
||||||
the date you first submitted a contribution to us, even if your submission took
|
|
||||||
place before the date you sign these terms.
|
|
||||||
|
|
||||||
5. You covenant, represent, warrant and agree that:
|
|
||||||
|
|
||||||
* Each contribution that you submit is and shall be an original work of authorship
|
|
||||||
and you can legally grant the rights set out in this SCA;
|
|
||||||
|
|
||||||
* to the best of your knowledge, each contribution will not violate any third
|
|
||||||
party's copyrights, trademarks, patents, or other intellectual property rights; and
|
|
||||||
|
|
||||||
* each contribution shall be in compliance with U.S. export control laws and other
|
|
||||||
applicable export and import laws. You agree to notify us if you become aware of
|
|
||||||
any circumstance which would make any of the foregoing representations inaccurate
|
|
||||||
in any respect. Syllogism Co. may publicly disclose your participation in the project,
|
|
||||||
including the fact that you have signed the SCA.
|
|
||||||
|
|
||||||
6. This SCA is governed by the laws of the State of California and applicable U.S.
|
|
||||||
Federal law. Any choice of law rules will not apply.
|
|
||||||
|
|
||||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
|
||||||
mark both statements:
|
|
||||||
|
|
||||||
x___ I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect my contributions.
|
|
||||||
|
|
||||||
____ I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
|
|
||||||
|
|
||||||
| Field | Entry |
|
|
||||||
|------------------------------- | -------------------- |
|
|
||||||
| Name | Jordan Suchow |
|
|
||||||
| Company's name (if applicable) | |
|
|
||||||
| Title or Role (if applicable) | |
|
|
||||||
| Date | 2015-04-19 |
|
|
||||||
| GitHub username | suchow |
|
|
||||||
| Website (optional) | http://suchow.io |
|
|
||||||
|
|
|
@ -1,95 +0,0 @@
|
||||||
Syllogism Contributor Agreement
|
|
||||||
===============================
|
|
||||||
|
|
||||||
This Syllogism Contributor Agreement (“SCA”) is based on the Oracle Contributor
|
|
||||||
Agreement. The SCA applies to any contribution that you make to any product or
|
|
||||||
project managed by us (the “project”), and sets out the intellectual property
|
|
||||||
rights you grant to us in the contributed materials. The term “us” shall mean
|
|
||||||
Syllogism Co. The term "you" shall mean the person or entity identified below.
|
|
||||||
If you agree to be bound by these terms, fill in the information requested below
|
|
||||||
and include the filled-in version with your first pull-request, under the file
|
|
||||||
contributors/. The name of the file should be your GitHub username, with the
|
|
||||||
extension .md. For example, the user example_user would create the file
|
|
||||||
spaCy/contributors/example_user.md .
|
|
||||||
|
|
||||||
Read this agreement carefully before signing. These terms and conditions
|
|
||||||
constitute a binding legal agreement.
|
|
||||||
|
|
||||||
1. The term 'contribution' or ‘contributed materials’ means any source code,
|
|
||||||
object code, patch, tool, sample, graphic, specification, manual, documentation,
|
|
||||||
or any other material posted or submitted by you to the project.
|
|
||||||
|
|
||||||
2. With respect to any worldwide copyrights, or copyright applications and registrations,
|
|
||||||
in your contribution:
|
|
||||||
* you hereby assign to us joint ownership, and to the extent that such assignment
|
|
||||||
is or becomes invalid, ineffective or unenforceable, you hereby grant to us a perpetual,
|
|
||||||
irrevocable, non-exclusive, worldwide, no-charge, royalty-free, unrestricted license
|
|
||||||
to exercise all rights under those copyrights. This includes, at our option, the
|
|
||||||
right to sublicense these same rights to third parties through multiple levels of
|
|
||||||
sublicensees or other licensing arrangements;
|
|
||||||
|
|
||||||
* you agree that each of us can do all things in relation to your contribution
|
|
||||||
as if each of us were the sole owners, and if one of us makes a derivative work
|
|
||||||
of your contribution, the one who makes the derivative work (or has it made) will
|
|
||||||
be the sole owner of that derivative work;
|
|
||||||
|
|
||||||
* you agree that you will not assert any moral rights in your contribution against
|
|
||||||
us, our licensees or transferees;
|
|
||||||
|
|
||||||
* you agree that we may register a copyright in your contribution and exercise
|
|
||||||
all ownership rights associated with it; and
|
|
||||||
|
|
||||||
* you agree that neither of us has any duty to consult with, obtain the consent
|
|
||||||
of, pay or render an accounting to the other for any use or distribution of your
|
|
||||||
contribution.
|
|
||||||
|
|
||||||
3. With respect to any patents you own, or that you can license without payment
|
|
||||||
to any third party, you hereby grant to us a perpetual, irrevocable, non-exclusive,
|
|
||||||
worldwide, no-charge, royalty-free license to:
|
|
||||||
|
|
||||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer your
|
|
||||||
contribution in whole or in part, alone or in combination with
|
|
||||||
or included in any product, work or materials arising out of the project to
|
|
||||||
which your contribution was submitted, and
|
|
||||||
|
|
||||||
* at our option, to sublicense these same rights to third parties through multiple
|
|
||||||
levels of sublicensees or other licensing arrangements.
|
|
||||||
|
|
||||||
4. Except as set out above, you keep all right, title, and interest in your
|
|
||||||
contribution. The rights that you grant to us under these terms are effective on
|
|
||||||
the date you first submitted a contribution to us, even if your submission took
|
|
||||||
place before the date you sign these terms.
|
|
||||||
|
|
||||||
5. You covenant, represent, warrant and agree that:
|
|
||||||
|
|
||||||
* Each contribution that you submit is and shall be an original work of authorship
|
|
||||||
and you can legally grant the rights set out in this SCA;
|
|
||||||
|
|
||||||
* to the best of your knowledge, each contribution will not violate any third
|
|
||||||
party's copyrights, trademarks, patents, or other intellectual property rights; and
|
|
||||||
|
|
||||||
* each contribution shall be in compliance with U.S. export control laws and other
|
|
||||||
applicable export and import laws. You agree to notify us if you become aware of
|
|
||||||
any circumstance which would make any of the foregoing representations inaccurate
|
|
||||||
in any respect. Syllogism Co. may publicly disclose your participation in the project,
|
|
||||||
including the fact that you have signed the SCA.
|
|
||||||
|
|
||||||
6. This SCA is governed by the laws of the State of California and applicable U.S.
|
|
||||||
Federal law. Any choice of law rules will not apply.
|
|
||||||
|
|
||||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
|
||||||
mark both statements:
|
|
||||||
|
|
||||||
_x__ I am signing on behalf of myself as an individual and no other person or entity, including my employer, has or will have rights with respect to my contributions.
|
|
||||||
|
|
||||||
____ I am signing on behalf of my employer or a legal entity and I have the actual authority to contractually bind that entity.
|
|
||||||
|
|
||||||
| Field | Entry |
|
|
||||||
|------------------------------- | -------------------- |
|
|
||||||
| Name | Vsevolod Solovyov |
|
|
||||||
| Company's name (if applicable) | |
|
|
||||||
| Title or Role (if applicable) | |
|
|
||||||
| Date | 2015-08-24 |
|
|
||||||
| GitHub username | vsolovyov |
|
|
||||||
| Website (optional) | |
|
|
||||||
|
|
20
lang_data/en/LICENSE
Normal file
|
@ -0,0 +1,20 @@
|
||||||
|
WordNet Release 3.0 This software and database is being provided to you, the
|
||||||
|
LICENSEE, by Princeton University under the following license. By obtaining,
|
||||||
|
using and/or copying this software and database, you agree that you have read,
|
||||||
|
understood, and will comply with these terms and conditions.: Permission to
|
||||||
|
use, copy, modify and distribute this software and database and its
|
||||||
|
documentation for any purpose and without fee or royalty is hereby granted,
|
||||||
|
provided that you agree to comply with the following copyright notice and
|
||||||
|
statements, including the disclaimer, and that the same appear on ALL copies of
|
||||||
|
the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton
|
||||||
|
University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS"
|
||||||
|
AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
|
||||||
|
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO
|
||||||
|
REPRESENTATIONS OR WARRANTIES OF MERCHANT- ABILITY OR FITNESS FOR ANY
|
||||||
|
PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR
|
||||||
|
DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS
|
||||||
|
OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used
|
||||||
|
in advertising or publicity pertaining to distribution of the software and/or
|
||||||
|
database. Title to copyright in this software, database and any associated
|
||||||
|
documentation shall at all times remain with Princeton University and LICENSEE
|
||||||
|
agrees to preserve same.
|
14
package.json
|
@ -1,14 +0,0 @@
|
||||||
{
|
|
||||||
"name": "en",
|
|
||||||
"version": "1.1.0",
|
|
||||||
"description": "english test model",
|
|
||||||
"license": "public domain",
|
|
||||||
"include": [
|
|
||||||
["deps", "*"],
|
|
||||||
["ner", "*"],
|
|
||||||
["pos", "*"],
|
|
||||||
["tokenizer", "*"],
|
|
||||||
["vocab", "*"],
|
|
||||||
["wordnet", "*"]
|
|
||||||
]
|
|
||||||
}
|
|
3
setup.py
|
@ -202,7 +202,8 @@ def setup_package():
|
||||||
'six',
|
'six',
|
||||||
'cloudpickle',
|
'cloudpickle',
|
||||||
'pathlib',
|
'pathlib',
|
||||||
'sputnik>=0.9.2,<0.10.0'],
|
'sputnik>=0.9.2,<0.10.0',
|
||||||
|
'ujson>=1.35'],
|
||||||
classifiers=[
|
classifiers=[
|
||||||
'Development Status :: 5 - Production/Stable',
|
'Development Status :: 5 - Production/Stable',
|
||||||
'Environment :: Console',
|
'Environment :: Console',
|
||||||
|
|
|
@ -4,7 +4,7 @@
|
||||||
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
|
||||||
|
|
||||||
__title__ = 'spacy'
|
__title__ = 'spacy'
|
||||||
__version__ = '1.0.5'
|
__version__ = '1.1.2'
|
||||||
__summary__ = 'Industrial-strength NLP'
|
__summary__ = 'Industrial-strength NLP'
|
||||||
__uri__ = 'https://spacy.io'
|
__uri__ = 'https://spacy.io'
|
||||||
__author__ = 'Matthew Honnibal'
|
__author__ = 'Matthew Honnibal'
|
||||||
|
|
|
@ -7,6 +7,7 @@ from sputnik.package_list import (PackageNotFoundException,
|
||||||
CompatiblePackageNotFoundException)
|
CompatiblePackageNotFoundException)
|
||||||
|
|
||||||
from . import about
|
from . import about
|
||||||
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
def download(lang, force=False, fail_on_exist=True):
|
def download(lang, force=False, fail_on_exist=True):
|
||||||
|
@ -34,4 +35,5 @@ def download(lang, force=False, fail_on_exist=True):
|
||||||
"spacy.%s.download --force'." % lang, file=sys.stderr)
|
"spacy.%s.download --force'." % lang, file=sys.stderr)
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
print("Model successfully installed.", file=sys.stderr)
|
data_path = util.get_data_path()
|
||||||
|
print("Model successfully installed to %s" % data_path, file=sys.stderr)
|
||||||
|
|
|
@ -1,17 +1,23 @@
|
||||||
import plac
|
import plac
|
||||||
|
import sputnik
|
||||||
|
|
||||||
from ..download import download
|
from ..download import download
|
||||||
|
from .. import about
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
force=("Force overwrite", "flag", "f", bool),
|
force=("Force overwrite", "flag", "f", bool),
|
||||||
)
|
)
|
||||||
def main(data_size='all', force=False):
|
def main(data_size='all', force=False):
|
||||||
|
if force:
|
||||||
|
sputnik.purge(about.__title__, about.__version__)
|
||||||
|
|
||||||
if data_size in ('all', 'parser'):
|
if data_size in ('all', 'parser'):
|
||||||
print("Downloading parsing model")
|
print("Downloading parsing model")
|
||||||
download('en', force)
|
download('en', False)
|
||||||
if data_size in ('all', 'glove'):
|
if data_size in ('all', 'glove'):
|
||||||
print("Downloading GloVe vectors")
|
print("Downloading GloVe vectors")
|
||||||
download('en_glove_cc_300_1m_vectors', force)
|
download('en_glove_cc_300_1m_vectors', False)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
|
|
|
@ -74,9 +74,9 @@ class BaseDefaults(object):
|
||||||
def create_tagger(cls, nlp=None):
|
def create_tagger(cls, nlp=None):
|
||||||
if nlp is None:
|
if nlp is None:
|
||||||
return Tagger(cls.create_vocab(), features=cls.tagger_features)
|
return Tagger(cls.create_vocab(), features=cls.tagger_features)
|
||||||
elif nlp.path is None:
|
elif nlp.path is False:
|
||||||
return Tagger(nlp.vocab, features=cls.tagger_features)
|
return Tagger(nlp.vocab, features=cls.tagger_features)
|
||||||
elif not (nlp.path / 'pos').exists():
|
elif nlp.path is None or not (nlp.path / 'pos').exists():
|
||||||
return None
|
return None
|
||||||
else:
|
else:
|
||||||
return Tagger.load(nlp.path / 'pos', nlp.vocab)
|
return Tagger.load(nlp.path / 'pos', nlp.vocab)
|
||||||
|
@ -85,9 +85,9 @@ class BaseDefaults(object):
|
||||||
def create_parser(cls, nlp=None):
|
def create_parser(cls, nlp=None):
|
||||||
if nlp is None:
|
if nlp is None:
|
||||||
return DependencyParser(cls.create_vocab(), features=cls.parser_features)
|
return DependencyParser(cls.create_vocab(), features=cls.parser_features)
|
||||||
elif nlp.path is None:
|
elif nlp.path is False:
|
||||||
return DependencyParser(nlp.vocab, features=cls.parser_features)
|
return DependencyParser(nlp.vocab, features=cls.parser_features)
|
||||||
elif not (nlp.path / 'deps').exists():
|
elif nlp.path is None or not (nlp.path / 'deps').exists():
|
||||||
return None
|
return None
|
||||||
else:
|
else:
|
||||||
return DependencyParser.load(nlp.path / 'deps', nlp.vocab)
|
return DependencyParser.load(nlp.path / 'deps', nlp.vocab)
|
||||||
|
@ -96,9 +96,9 @@ class BaseDefaults(object):
|
||||||
def create_entity(cls, nlp=None):
|
def create_entity(cls, nlp=None):
|
||||||
if nlp is None:
|
if nlp is None:
|
||||||
return EntityRecognizer(cls.create_vocab(), features=cls.entity_features)
|
return EntityRecognizer(cls.create_vocab(), features=cls.entity_features)
|
||||||
elif nlp.path is None:
|
elif nlp.path is False:
|
||||||
return EntityRecognizer(nlp.vocab, features=cls.entity_features)
|
return EntityRecognizer(nlp.vocab, features=cls.entity_features)
|
||||||
elif not (nlp.path / 'ner').exists():
|
elif nlp.path is None or not (nlp.path / 'ner').exists():
|
||||||
return None
|
return None
|
||||||
else:
|
else:
|
||||||
return EntityRecognizer.load(nlp.path / 'ner', nlp.vocab)
|
return EntityRecognizer.load(nlp.path / 'ner', nlp.vocab)
|
||||||
|
@ -107,9 +107,9 @@ class BaseDefaults(object):
|
||||||
def create_matcher(cls, nlp=None):
|
def create_matcher(cls, nlp=None):
|
||||||
if nlp is None:
|
if nlp is None:
|
||||||
return Matcher(cls.create_vocab())
|
return Matcher(cls.create_vocab())
|
||||||
elif nlp.path is None:
|
elif nlp.path is False:
|
||||||
return Matcher(nlp.vocab)
|
return Matcher(nlp.vocab)
|
||||||
elif not (nlp.path / 'vocab').exists():
|
elif nlp.path is None or not (nlp.path / 'vocab').exists():
|
||||||
return None
|
return None
|
||||||
else:
|
else:
|
||||||
return Matcher.load(nlp.path / 'vocab', nlp.vocab)
|
return Matcher.load(nlp.path / 'vocab', nlp.vocab)
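The four `create_*` hunks above all make the same change: `nlp.path is False` now means "build a blank component on the shared vocab", while `nlp.path is None` is treated like a missing model directory and yields no component. A minimal sketch of that dispatch, assuming a hypothetical `resolve_component` helper that only reports which branch would be taken (it omits the `nlp is None` branch and does not construct any real spaCy object):

```python
import tempfile
from pathlib import Path


def resolve_component(path, subdir):
    """Report which branch a create_* factory would take.

    Hypothetical helper for illustration only: it returns a label
    instead of building a Tagger, parser, or entity recognizer.
    """
    if path is False:
        # Data explicitly disabled: build a blank component.
        return "blank"
    if path is None or not (Path(path) / subdir).exists():
        # No path given, or the model subdirectory is missing.
        return "none"
    # The subdirectory exists, so the component would be loaded from it.
    return "load"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pos").mkdir()
    results = [
        resolve_component(False, "pos"),
        resolve_component(None, "pos"),
        resolve_component(tmp, "pos"),
        resolve_component(tmp, "ner"),
    ]
```

Checking `path is None` before the `/` operator matters: without it, `None / 'pos'` would raise a `TypeError` rather than fall through to the "no component" branch.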
|
||||||
|
@ -274,13 +274,13 @@ class Language(object):
|
||||||
if 'make_doc' in overrides:
|
if 'make_doc' in overrides:
|
||||||
self.make_doc = overrides['make_doc']
|
self.make_doc = overrides['make_doc']
|
||||||
elif 'create_make_doc' in overrides:
|
elif 'create_make_doc' in overrides:
|
||||||
self.make_doc = overrides['create_make_doc']
|
self.make_doc = overrides['create_make_doc'](self)
|
||||||
else:
|
else:
|
||||||
self.make_doc = lambda text: self.tokenizer(text)
|
self.make_doc = lambda text: self.tokenizer(text)
|
||||||
if 'pipeline' in overrides:
|
if 'pipeline' in overrides:
|
||||||
self.pipeline = overrides['pipeline']
|
self.pipeline = overrides['pipeline']
|
||||||
elif 'create_pipeline' in overrides:
|
elif 'create_pipeline' in overrides:
|
||||||
self.pipeline = overrides['create_pipeline']
|
self.pipeline = overrides['create_pipeline'](self)
|
||||||
else:
|
else:
|
||||||
self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]
|
self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]
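The hunk above changes the `create_make_doc` and `create_pipeline` overrides from being stored as-is to being called with the `Language` instance, so they act as factories. A minimal sketch of the difference, assuming a toy `ToyLanguage` class and a hypothetical `create_lowercase_make_doc` factory (neither is part of spaCy):

```python
class ToyLanguage(object):
    """Toy stand-in for Language; only the override handling is modelled."""

    def __init__(self, **overrides):
        # Whitespace tokenizer used purely for illustration.
        self.tokenizer = lambda text: text.split()
        if 'make_doc' in overrides:
            # A ready-made callable is used directly.
            self.make_doc = overrides['make_doc']
        elif 'create_make_doc' in overrides:
            # A factory is called with the instance and returns the callable.
            self.make_doc = overrides['create_make_doc'](self)
        else:
            self.make_doc = lambda text: self.tokenizer(text)


def create_lowercase_make_doc(nlp):
    """Hypothetical factory: builds a make_doc that lowercases its input."""
    return lambda text: nlp.tokenizer(text.lower())


nlp = ToyLanguage(create_make_doc=create_lowercase_make_doc)
tokens = nlp.make_doc("Hello World")
```

Before the change, `make_doc` would have been the factory itself, so calling it with a text string would have passed the string where the instance was expected.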
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,5 @@
|
||||||
# cython: embedsignature=True
|
# cython: embedsignature=True
|
||||||
|
from libc.math cimport sqrt
|
||||||
from cpython.ref cimport Py_INCREF
|
from cpython.ref cimport Py_INCREF
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
from murmurhash.mrmr cimport hash64
|
from murmurhash.mrmr cimport hash64
|
||||||
|
@ -115,8 +116,11 @@ cdef class Lexeme:
|
||||||
def __set__(self, vector):
|
def __set__(self, vector):
|
||||||
assert len(vector) == self.vocab.vectors_length
|
assert len(vector) == self.vocab.vectors_length
|
||||||
cdef float value
|
cdef float value
|
||||||
|
cdef double norm = 0.0
|
||||||
for i, value in enumerate(vector):
|
for i, value in enumerate(vector):
|
||||||
self.c.vector[i] = value
|
self.c.vector[i] = value
|
||||||
|
norm += value * value
|
||||||
|
self.c.l2_norm = sqrt(norm)
|
||||||
|
|
||||||
property rank:
|
property rank:
|
||||||
def __get__(self):
|
def __get__(self):
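The setter hunk above accumulates the squared components while copying the vector and stores the square root as the lexeme's L2 norm. A pure-Python sketch of the same bookkeeping, assuming a hypothetical `set_vector` helper and a plain dict in place of the C-level struct:

```python
import math


def set_vector(lexeme, vector, expected_length):
    """Copy `vector` into `lexeme` and cache its L2 norm.

    `lexeme` is a plain dict standing in for the C struct; the function
    name and dict keys are illustrative, not spaCy's API.
    """
    if len(vector) != expected_length:
        raise ValueError("vector has the wrong length")
    norm = 0.0
    values = []
    for value in vector:
        values.append(float(value))
        norm += value * value  # accumulate the sum of squares
    lexeme["vector"] = values
    lexeme["l2_norm"] = math.sqrt(norm)
    return lexeme


lexeme = set_vector({}, [3.0, 4.0], expected_length=2)
```

Caching the norm at assignment time keeps it consistent with the stored vector, which is what the updated `test_vectors` assertion checks with `numpy.isclose`.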
|
||||||
|
|
|
@ -7,6 +7,7 @@ from .tagger import Tagger
|
||||||
# TODO: The disorganization here is pretty embarrassing. At least it's only
|
# TODO: The disorganization here is pretty embarrassing. At least it's only
|
||||||
# internals.
|
# internals.
|
||||||
from .syntax.parser import get_templates as get_feature_templates
|
from .syntax.parser import get_templates as get_feature_templates
|
||||||
|
from .attrs import DEP, ENT_TYPE
|
||||||
|
|
||||||
|
|
||||||
cdef class EntityRecognizer(Parser):
|
cdef class EntityRecognizer(Parser):
|
||||||
|
@ -14,11 +15,33 @@ cdef class EntityRecognizer(Parser):
|
||||||
|
|
||||||
feature_templates = get_feature_templates('ner')
|
feature_templates = get_feature_templates('ner')
|
||||||
|
|
||||||
|
def add_label(self, label):
|
||||||
|
for action in self.moves.action_types:
|
||||||
|
self.moves.add_action(action, label)
|
||||||
|
if isinstance(label, basestring):
|
||||||
|
label = self.vocab.strings[label]
|
||||||
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
|
if attr == ENT_TYPE and label not in freqs:
|
||||||
|
freqs.append([label, 1])
|
||||||
|
# Super hacky :(
|
||||||
|
self.vocab._serializer = None
|
||||||
|
|
||||||
|
|
||||||
cdef class DependencyParser(Parser):
|
cdef class DependencyParser(Parser):
|
||||||
TransitionSystem = ArcEager
|
TransitionSystem = ArcEager
|
||||||
|
|
||||||
feature_templates = get_feature_templates('basic')
|
feature_templates = get_feature_templates('basic')
|
||||||
|
|
||||||
|
def add_label(self, label):
|
||||||
|
for action in self.moves.action_types:
|
||||||
|
self.moves.add_action(action, label)
|
||||||
|
if isinstance(label, basestring):
|
||||||
|
label = self.vocab.strings[label]
|
||||||
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
|
if attr == DEP and label not in freqs:
|
||||||
|
freqs.append([label, 1])
|
||||||
|
# Super hacky :(
|
||||||
|
self.vocab._serializer = None
|
||||||
|
|
||||||
|
|
||||||
__all__ = [Tagger, DependencyParser, EntityRecognizer]
|
__all__ = [Tagger, DependencyParser, EntityRecognizer]
|
||||||
|
|
|
@ -46,9 +46,11 @@ cdef class HuffmanCodec:
|
||||||
item.first = item1.first + item2.first
|
item.first = item1.first + item2.first
|
||||||
item.second = self.nodes.size()-1
|
item.second = self.nodes.size()-1
|
||||||
queue.push(item)
|
queue.push(item)
|
||||||
|
# Careful of empty freqs dicts
|
||||||
|
cdef Code path
|
||||||
|
if queue.size() >= 1:
|
||||||
item = queue.top()
|
item = queue.top()
|
||||||
self.root = self.nodes[item.second]
|
self.root = self.nodes[item.second]
|
||||||
cdef Code path
|
|
||||||
path.bits = 0
|
path.bits = 0
|
||||||
path.length = 0
|
path.length = 0
|
||||||
assign_codes(self.nodes, self.codes, item.second, path)
|
assign_codes(self.nodes, self.codes, item.second, path)
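The hunk above guards the code-assignment step so that an empty frequency table no longer reads from an empty queue. A self-contained sketch of the same guard in plain Python, assuming a hypothetical `huffman_codes` function built on `heapq` (this is not the Cython implementation, and it assumes symbols are strings rather than tuples):

```python
import heapq
import itertools


def huffman_codes(freqs):
    """Return {symbol: bitstring}; an empty `freqs` yields an empty dict."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(freq, next(counter), symbol) for symbol, freq in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        # Careful of empty freqs dicts: there is no root to assign codes from.
        return {}
    if len(heap) == 1:
        return {heap[0][2]: '0'}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Internal nodes are (left, right) tuples; leaves are symbols.
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    codes = {}
    stack = [(heap[0][2], '')]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + '0'))
            stack.append((node[1], prefix + '1'))
        else:
            codes[node] = prefix
    return codes


empty = huffman_codes({})
codes = huffman_codes({'a': 5, 'b': 1, 'c': 1})
```

The most frequent symbol receives the shortest code, and the empty table returns an empty mapping, mirroring the new `test_empty` case.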
|
||||||
|
|
|
@ -100,6 +100,8 @@ cdef class Packer:
|
||||||
self.attrs = tuple(attrs)
|
self.attrs = tuple(attrs)
|
||||||
|
|
||||||
def pack(self, Doc doc):
|
def pack(self, Doc doc):
|
||||||
|
if len(doc) == 0:
|
||||||
|
return b''
|
||||||
bits = self._orth_encode(doc)
|
bits = self._orth_encode(doc)
|
||||||
if bits is None:
|
if bits is None:
|
||||||
bits = self._char_encode(doc)
|
bits = self._char_encode(doc)
|
||||||
|
@ -116,6 +118,8 @@ cdef class Packer:
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def unpack_into(self, byte_string, Doc doc):
|
def unpack_into(self, byte_string, Doc doc):
|
||||||
|
if byte_string == b'':
|
||||||
|
return None
|
||||||
bits = BitArray(byte_string)
|
bits = BitArray(byte_string)
|
||||||
bits.seek(0)
|
bits.seek(0)
|
||||||
cdef int32_t length = bits.read32()
|
cdef int32_t length = bits.read32()
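The two `Packer` hunks above make an empty document serialize to `b''` and make unpacking `b''` a no-op, so the round trip works without reading a length header that is not there. A toy sketch of that convention, assuming a made-up byte format (a 4-byte token count followed by space-joined UTF-8 text) that has nothing to do with spaCy's real bit-level encoding:

```python
import struct


def pack(tokens):
    """Serialize a list of tokens; an empty list becomes the empty string."""
    if len(tokens) == 0:
        return b''
    payload = u' '.join(tokens).encode('utf8')
    return struct.pack('<i', len(tokens)) + payload


def unpack(byte_string):
    """Inverse of pack(); the empty byte string maps back to no tokens."""
    if byte_string == b'':
        return []
    (length,) = struct.unpack('<i', byte_string[:4])
    tokens = byte_string[4:].decode('utf8').split(u' ')
    if len(tokens) != length:
        raise ValueError("corrupt payload")
    return tokens


round_trip_empty = unpack(pack([]))
round_trip_words = unpack(pack([u'some', u'words']))
```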
|
||||||
|
|
|
@ -92,6 +92,7 @@ cdef class Parser:
|
||||||
def __init__(self, Vocab vocab, TransitionSystem=None, ParserModel model=None, **cfg):
|
def __init__(self, Vocab vocab, TransitionSystem=None, ParserModel model=None, **cfg):
|
||||||
if TransitionSystem is None:
|
if TransitionSystem is None:
|
||||||
TransitionSystem = self.TransitionSystem
|
TransitionSystem = self.TransitionSystem
|
||||||
|
self.vocab = vocab
|
||||||
actions = TransitionSystem.get_actions(**cfg)
|
actions = TransitionSystem.get_actions(**cfg)
|
||||||
self.moves = TransitionSystem(vocab.strings, actions)
|
self.moves = TransitionSystem(vocab.strings, actions)
|
||||||
# TODO: Remove this when we no longer need to support old-style models
|
# TODO: Remove this when we no longer need to support old-style models
|
||||||
|
@ -226,10 +227,12 @@ cdef class Parser:
|
||||||
stepwise.transition(transition)
|
stepwise.transition(transition)
|
||||||
|
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
|
# Doesn't set label into serializer -- subclasses override it to do that.
|
||||||
for action in self.moves.action_types:
|
for action in self.moves.action_types:
|
||||||
self.moves.add_action(action, label)
|
self.moves.add_action(action, label)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
cdef class StepwiseState:
|
cdef class StepwiseState:
|
||||||
cdef readonly StateClass stcls
|
cdef readonly StateClass stcls
|
||||||
cdef readonly Example eg
|
cdef readonly Example eg
|
||||||
|
|
|
@ -109,7 +109,7 @@ cdef class Tagger:
|
||||||
# support old data.
|
# support old data.
|
||||||
path = path if not isinstance(path, basestring) else pathlib.Path(path)
|
path = path if not isinstance(path, basestring) else pathlib.Path(path)
|
||||||
if (path / 'templates.json').exists():
|
if (path / 'templates.json').exists():
|
||||||
with (path / 'templates.json').open() as file_:
|
with (path / 'templates.json').open('r', encoding='utf8') as file_:
|
||||||
templates = json.load(file_)
|
templates = json.load(file_)
|
||||||
elif require:
|
elif require:
|
||||||
raise IOError(
|
raise IOError(
|
||||||
|
|
59
spacy/tests/matcher/test_entity_id.py
Normal file
|
@ -0,0 +1,59 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import spacy
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokens.doc import Doc
|
||||||
|
from spacy.attrs import *
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def en_vocab():
|
||||||
|
return spacy.get_lang_class('en').Defaults.create_vocab()
|
||||||
|
|
||||||
|
|
||||||
|
def test_init_matcher(en_vocab):
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
assert matcher.n_patterns == 0
|
||||||
|
assert matcher(Doc(en_vocab, words=[u'Some', u'words'])) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_empty_entity(en_vocab):
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
matcher.add_entity('TestEntity')
|
||||||
|
assert matcher.n_patterns == 0
|
||||||
|
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_entity_attrs(en_vocab):
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
matcher.add_entity('TestEntity')
|
||||||
|
entity = matcher.get_entity('TestEntity')
|
||||||
|
assert entity == {}
|
||||||
|
matcher.add_entity('TestEntity2', attrs={'Hello': 'World'})
|
||||||
|
entity = matcher.get_entity('TestEntity2')
|
||||||
|
assert entity == {'Hello': 'World'}
|
||||||
|
assert matcher.get_entity('TestEntity') == {}
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_entity_via_match(en_vocab):
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
matcher.add_entity('TestEntity', attrs={u'Hello': u'World'})
|
||||||
|
assert matcher.n_patterns == 0
|
||||||
|
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
|
||||||
|
matcher.add_pattern(u'TestEntity', [{ORTH: u'Test'}, {ORTH: u'Entity'}])
|
||||||
|
assert matcher.n_patterns == 1
|
||||||
|
matches = matcher(Doc(en_vocab, words=[u'Test', u'Entity']))
|
||||||
|
assert len(matches) == 1
|
||||||
|
assert len(matches[0]) == 4
|
||||||
|
ent_id, label, start, end = matches[0]
|
||||||
|
assert ent_id == matcher.vocab.strings[u'TestEntity']
|
||||||
|
assert label == 0
|
||||||
|
assert start == 0
|
||||||
|
assert end == 2
|
||||||
|
attrs = matcher.get_entity(ent_id)
|
||||||
|
assert attrs == {u'Hello': u'World'}
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -46,9 +46,10 @@ def test_overlap_issue242():
|
||||||
if os.environ.get('SPACY_DATA'):
|
if os.environ.get('SPACY_DATA'):
|
||||||
data_dir = os.environ.get('SPACY_DATA')
|
data_dir = os.environ.get('SPACY_DATA')
|
||||||
else:
|
else:
|
||||||
data_dir = False
|
data_dir = None
|
||||||
|
|
||||||
nlp = spacy.en.English(path=data_dir, tagger=False, parser=False, entity=False)
|
nlp = spacy.en.English(path=data_dir, tagger=False, parser=False, entity=False)
|
||||||
|
nlp.matcher = Matcher(nlp.vocab)
|
||||||
|
|
||||||
nlp.matcher.add('FOOD', 'FOOD', {}, patterns)
|
nlp.matcher.add('FOOD', 'FOOD', {}, patterns)
|
||||||
|
|
||||||
|
|
|
@ -50,6 +50,11 @@ def test1():
|
||||||
assert codec.strings == [c for i, c in py_codes]
|
assert codec.strings == [c for i, c in py_codes]
|
||||||
|
|
||||||
|
|
||||||
|
def test_empty():
|
||||||
|
codec = HuffmanCodec({})
|
||||||
|
assert codec.strings == []
|
||||||
|
|
||||||
|
|
||||||
def test_round_trip():
|
def test_round_trip():
|
||||||
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
|
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
|
||||||
'lazy': 1, 'dog': 2, '.': 9}
|
'lazy': 1, 'dog': 2, '.': 9}
|
||||||
|
|
|
@ -2,6 +2,9 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc
|
||||||
|
import spacy.en
|
||||||
|
from spacy.serialize.packer import Packer
|
||||||
|
|
||||||
|
|
||||||
def equal(doc1, doc2):
|
def equal(doc1, doc2):
|
||||||
# tokens
|
# tokens
|
||||||
|
@ -84,3 +87,41 @@ def test_serialize_tokens_tags_parse_ner(EN):
|
||||||
|
|
||||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||||
equal(doc1, doc2)
|
equal(doc1, doc2)
|
||||||
|
|
||||||
|
|
||||||
|
def test_serialize_empty_doc():
|
||||||
|
vocab = spacy.en.English.Defaults.create_vocab()
|
||||||
|
doc = Doc(vocab)
|
||||||
|
packer = Packer(vocab, {})
|
||||||
|
b = packer.pack(doc)
|
||||||
|
assert b == b''
|
||||||
|
loaded = Doc(vocab).from_bytes(b)
|
||||||
|
assert len(loaded) == 0
|
||||||
|
|
||||||
|
|
||||||
|
def test_serialize_after_adding_entity():
|
||||||
|
# Re issue #514
|
||||||
|
vocab = spacy.en.English.Defaults.create_vocab()
|
||||||
|
entity_recognizer = spacy.en.English.Defaults.create_entity()
|
||||||
|
|
||||||
|
doc = Doc(vocab, words=u'This is a sentence about pasta .'.split())
|
||||||
|
entity_recognizer.add_label('Food')
|
||||||
|
entity_recognizer(doc)
|
||||||
|
|
||||||
|
label_id = vocab.strings[u'Food']
|
||||||
|
doc.ents = [(label_id, 5,6)]
|
||||||
|
|
||||||
|
assert [(ent.label_, ent.text) for ent in doc.ents] == [(u'Food', u'pasta')]
|
||||||
|
|
||||||
|
byte_string = doc.to_bytes()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.models
|
||||||
|
def test_serialize_after_adding_entity(EN):
|
||||||
|
EN.entity.add_label(u'Food')
|
||||||
|
doc = EN(u'This is a sentence about pasta.')
|
||||||
|
label_id = EN.vocab.strings[u'Food']
|
||||||
|
doc.ents = [(label_id, 5,6)]
|
||||||
|
byte_string = doc.to_bytes()
|
||||||
|
doc2 = Doc(EN.vocab).from_bytes(byte_string)
|
||||||
|
assert [(ent.label_, ent.text) for ent in doc2.ents] == [(u'Food', u'pasta')]
|
||||||
|
|
42
spacy/tests/tokens/test_add_entities.py
Normal file
|
@ -0,0 +1,42 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import spacy
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokens.doc import Doc
|
||||||
|
from spacy.attrs import *
|
||||||
|
from spacy.pipeline import EntityRecognizer
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="module")
|
||||||
|
def en_vocab():
|
||||||
|
return spacy.get_lang_class('en').Defaults.create_vocab()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope="module")
|
||||||
|
def entity_recognizer(en_vocab):
|
||||||
|
return EntityRecognizer(en_vocab, features=[(2,), (3,)])
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def animal(en_vocab):
|
||||||
|
return en_vocab.strings[u"ANIMAL"]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def doc(en_vocab, entity_recognizer):
|
||||||
|
doc = Doc(en_vocab, words=[u"this", u"is", u"a", u"lion"])
|
||||||
|
entity_recognizer(doc)
|
||||||
|
return doc
|
||||||
|
|
||||||
|
|
||||||
|
def test_set_ents_iob(doc):
|
||||||
|
assert len(list(doc.ents)) == 0
|
||||||
|
tags = [w.ent_iob_ for w in doc]
|
||||||
|
assert tags == (['O'] * len(doc))
|
||||||
|
doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
|
||||||
|
tags = [w.ent_iob_ for w in doc]
|
||||||
|
assert tags == ['O', 'O', 'O', 'B']
|
||||||
|
doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
|
||||||
|
tags = [w.ent_iob_ for w in doc]
|
||||||
|
assert tags == ['B', 'I', 'O', 'O']
|
|
@ -5,6 +5,7 @@ from spacy.attrs import IS_SPACE, IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM
|
||||||
from spacy.attrs import IS_STOP
|
from spacy.attrs import IS_STOP
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
import numpy
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models
|
@pytest.mark.models
|
||||||
|
@ -67,7 +68,9 @@ def test_vectors(EN):
|
||||||
assert apples.similarity(oranges) > apples.similarity(oov)
|
assert apples.similarity(oranges) > apples.similarity(oov)
|
||||||
assert apples.similarity(oranges) == oranges.similarity(apples)
|
assert apples.similarity(oranges) == oranges.similarity(apples)
|
||||||
assert sum(apples.vector) != sum(oranges.vector)
|
assert sum(apples.vector) != sum(oranges.vector)
|
||||||
assert apples.vector_norm != oranges.vector_norm
|
assert numpy.isclose(
|
||||||
|
apples.vector_norm,
|
||||||
|
numpy.sqrt(numpy.dot(apples.vector, apples.vector)))
|
||||||
|
|
||||||
@pytest.mark.models
|
@pytest.mark.models
|
||||||
def test_ancestors(EN):
|
def test_ancestors(EN):
|
||||||
|
|
|
@ -108,7 +108,7 @@ def test_set_ents(EN):
|
||||||
assert len(tokens.ents) == 0
|
assert len(tokens.ents) == 0
|
||||||
tokens.ents = [(EN.vocab.strings['PRODUCT'], 2, 4)]
|
tokens.ents = [(EN.vocab.strings['PRODUCT'], 2, 4)]
|
||||||
assert len(list(tokens.ents)) == 1
|
assert len(list(tokens.ents)) == 1
|
||||||
assert [t.ent_iob for t in tokens] == [0, 0, 3, 1, 0, 0, 0, 0]
|
assert [t.ent_iob for t in tokens] == [2, 2, 3, 1, 2, 2, 2, 2]
|
||||||
ent = tokens.ents[0]
|
ent = tokens.ents[0]
|
||||||
assert ent.label_ == 'PRODUCT'
|
assert ent.label_ == 'PRODUCT'
|
||||||
assert ent.start == 2
|
assert ent.start == 2
|
||||||
|
|
96
spacy/tests/vectors/test_similarity.py
Normal file
|
@ -0,0 +1,96 @@
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
import spacy
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.tokens.doc import Doc
|
||||||
|
import numpy
|
||||||
|
import numpy.linalg
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def get_vector(letters):
|
||||||
|
return numpy.asarray([ord(letter) for letter in letters], dtype='float32')
|
||||||
|
|
||||||
|
|
||||||
|
def get_cosine(vec1, vec2):
|
||||||
|
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope='module')
|
||||||
|
def en_vocab():
|
||||||
|
vocab = spacy.get_lang_class('en').Defaults.create_vocab()
|
||||||
|
vocab.resize_vectors(2)
|
||||||
|
apple_ = vocab[u'apple']
|
||||||
|
orange_ = vocab[u'orange']
|
||||||
|
apple_.vector = get_vector('ap')
|
||||||
|
orange_.vector = get_vector('or')
|
||||||
|
return vocab
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def appleL(en_vocab):
|
||||||
|
return en_vocab['apple']
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def orangeL(en_vocab):
|
||||||
|
return en_vocab['orange']
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(scope='module')
|
||||||
|
def apple_orange(en_vocab):
|
||||||
|
return Doc(en_vocab, words=[u'apple', u'orange'])
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def appleT(apple_orange):
|
||||||
|
return apple_orange[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def orangeT(apple_orange):
|
||||||
|
return apple_orange[1]
|
||||||
|
|
||||||
|
|
||||||
|
def test_LL_sim(appleL, orangeL):
|
||||||
|
assert appleL.has_vector
|
||||||
|
assert orangeL.has_vector
|
||||||
|
assert appleL.vector_norm != 0
|
||||||
|
assert orangeL.vector_norm != 0
|
||||||
|
assert appleL.vector[0] != orangeL.vector[0] and appleL.vector[1] != orangeL.vector[1]
|
||||||
|
assert numpy.isclose(
|
||||||
|
appleL.similarity(orangeL),
|
||||||
|
get_cosine(get_vector('ap'), get_vector('or')))
|
||||||
|
assert numpy.isclose(
|
||||||
|
orangeL.similarity(appleL),
|
||||||
|
appleL.similarity(orangeL))
|
||||||
|
|
||||||
|
|
||||||
|
def test_TT_sim(appleT, orangeT):
|
||||||
|
assert appleT.has_vector
|
||||||
|
assert orangeT.has_vector
|
||||||
|
assert appleT.vector_norm != 0
|
||||||
|
assert orangeT.vector_norm != 0
|
||||||
|
assert appleT.vector[0] != orangeT.vector[0] and appleT.vector[1] != orangeT.vector[1]
|
||||||
|
assert numpy.isclose(
|
||||||
|
appleT.similarity(orangeT),
|
||||||
|
get_cosine(get_vector('ap'), get_vector('or')))
|
||||||
|
assert numpy.isclose(
|
||||||
|
orangeT.similarity(appleT),
|
||||||
|
appleT.similarity(orangeT))
|
||||||
|
|
||||||
|
|
||||||
|
def test_TD_sim(apple_orange, appleT):
|
||||||
|
assert apple_orange.similarity(appleT) == appleT.similarity(apple_orange)
|
||||||
|
|
||||||
|
def test_DS_sim(apple_orange, appleT):
|
||||||
|
span = apple_orange[:2]
|
||||||
|
assert apple_orange.similarity(span) == 1.0
|
||||||
|
assert span.similarity(apple_orange) == 1.0
|
||||||
|
|
||||||
|
|
||||||
|
def test_TS_sim(apple_orange, appleT):
|
||||||
|
span = apple_orange[:2]
|
||||||
|
assert span.similarity(appleT) == appleT.similarity(span)
|
||||||
|
|
||||||
|
|
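The reference value used by `test_LL_sim` and `test_TT_sim` can be reproduced without spaCy. The sketch below copies the two helpers from the new test file and only assumes NumPy is installed; the variable names `apple`, `orange` and `sim` are illustrative.

```python
import numpy
import numpy.linalg


def get_vector(letters):
    # Same helper as in the test file: one float per character code.
    return numpy.asarray([ord(letter) for letter in letters], dtype='float32')


def get_cosine(vec1, vec2):
    # Cosine similarity: dot product divided by the product of the L2 norms.
    return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))


apple = get_vector('ap')    # character codes 97 and 112
orange = get_vector('or')   # character codes 111 and 114
sim = get_cosine(apple, orange)
```

Because cosine similarity is symmetric, `get_cosine(orange, apple)` gives the same value, which is the property the tests check on `Lexeme` and `Token` objects.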
12
spacy/tests/vocab/test_add_vectors.py
Normal file
|
@ -0,0 +1,12 @@
|
||||||
|
import numpy
|
||||||
|
|
||||||
|
import spacy.en
|
||||||
|
|
||||||
|
|
||||||
|
def test_add_vector():
|
||||||
|
vocab = spacy.en.English.Defaults.create_vocab()
|
||||||
|
vocab.resize_vectors(10)
|
||||||
|
lex = vocab[u'Hello']
|
||||||
|
lex.vector = numpy.ndarray((10,), dtype='float32')
|
||||||
|
lex = vocab[u'Hello']
|
||||||
|
assert lex.vector.shape == (10,)
|
|
@ -43,7 +43,7 @@ cdef class Tokenizer:
|
||||||
path = pathlib.Path(path)
|
path = pathlib.Path(path)
|
||||||
|
|
||||||
if rules is None:
|
if rules is None:
|
||||||
with (path / 'tokenizer' / 'specials.json').open() as file_:
|
with (path / 'tokenizer' / 'specials.json').open('r', encoding='utf8') as file_:
|
||||||
rules = json.load(file_)
|
rules = json.load(file_)
|
||||||
if prefix_search in (None, True):
|
if prefix_search in (None, True):
|
||||||
with (path / 'tokenizer' / 'prefix.txt').open() as file_:
|
with (path / 'tokenizer' / 'prefix.txt').open() as file_:
|
||||||
|
|
|
@ -1,12 +1,12 @@
|
||||||
cimport cython
|
cimport cython
|
||||||
from libc.string cimport memcpy, memset
|
from libc.string cimport memcpy, memset
|
||||||
from libc.stdint cimport uint32_t
|
from libc.stdint cimport uint32_t
|
||||||
|
from libc.math cimport sqrt
|
||||||
|
|
||||||
import numpy
|
import numpy
|
||||||
import numpy.linalg
|
import numpy.linalg
|
||||||
import struct
|
import struct
|
||||||
cimport numpy as np
|
cimport numpy as np
|
||||||
import math
|
|
||||||
import six
|
import six
|
||||||
import warnings
|
import warnings
|
||||||
|
|
||||||
|
@ -251,11 +251,12 @@ cdef class Doc:
|
||||||
if 'vector_norm' in self.user_hooks:
|
if 'vector_norm' in self.user_hooks:
|
||||||
return self.user_hooks['vector_norm'](self)
|
return self.user_hooks['vector_norm'](self)
|
||||||
cdef float value
|
cdef float value
|
||||||
|
cdef double norm = 0
|
||||||
if self._vector_norm is None:
|
if self._vector_norm is None:
|
||||||
self._vector_norm = 1e-20
|
norm = 0.0
|
||||||
for value in self.vector:
|
for value in self.vector:
|
||||||
self._vector_norm += value * value
|
norm += value * value
|
||||||
self._vector_norm = math.sqrt(self._vector_norm)
|
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
||||||
return self._vector_norm
|
return self._vector_norm
|
||||||
|
|
||||||
def __set__(self, value):
|
def __set__(self, value):
|
||||||
|
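The `vector_norm` hunk above replaces an accumulator seeded with `1e-20` by a separate accumulator starting at zero. A rough pure-Python comparison of the two behaviours is sketched below; `norm_old` and `norm_new` are hypothetical names, and the Cython type declarations are omitted.

```python
import math


def norm_old(vector):
    # Behaviour before the change: the sum starts at 1e-20, so an all-zero
    # vector ends up with a tiny non-zero norm.
    total = 1e-20
    for value in vector:
        total += value * value
    return math.sqrt(total)


def norm_new(vector):
    # Behaviour after the change: accumulate from zero and return 0 for an
    # all-zero vector.
    total = 0.0
    for value in vector:
        total += value * value
    return math.sqrt(total) if total != 0 else 0
```

For a vector such as `[3.0, 4.0]` both versions agree to within floating-point error; they differ only in how a zero vector is reported.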
@ -322,7 +323,7 @@ cdef class Doc:
|
||||||
cdef int i
|
cdef int i
|
||||||
for i in range(self.length):
|
for i in range(self.length):
|
||||||
self.c[i].ent_type = 0
|
self.c[i].ent_type = 0
|
||||||
self.c[i].ent_iob = 0
|
self.c[i].ent_iob = 2 # Means O, not missing!
|
||||||
cdef attr_t ent_type
|
cdef attr_t ent_type
|
||||||
cdef int start, end
|
cdef int start, end
|
||||||
for ent_info in ents:
|
for ent_info in ents:
|
||||||
|
|
|
@ -3,7 +3,7 @@ from collections import defaultdict
|
||||||
import numpy
|
import numpy
|
||||||
import numpy.linalg
|
import numpy.linalg
|
||||||
cimport numpy as np
|
cimport numpy as np
|
||||||
import math
|
from libc.math cimport sqrt
|
||||||
import six
|
import six
|
||||||
|
|
||||||
from ..structs cimport TokenC, LexemeC
|
from ..structs cimport TokenC, LexemeC
|
||||||
|
@ -136,11 +136,12 @@ cdef class Span:
|
||||||
if 'vector_norm' in self.doc.user_span_hooks:
|
if 'vector_norm' in self.doc.user_span_hooks:
|
||||||
return self.doc.user_span_hooks['vector_norm'](self)
|
return self.doc.user_span_hooks['vector_norm'](self)
|
||||||
cdef float value
|
cdef float value
|
||||||
|
cdef double norm = 0
|
||||||
if self._vector_norm is None:
|
if self._vector_norm is None:
|
||||||
self._vector_norm = 1e-20
|
norm = 0
|
||||||
for value in self.vector:
|
for value in self.vector:
|
||||||
self._vector_norm += value * value
|
norm += value * value
|
||||||
self._vector_norm = math.sqrt(self._vector_norm)
|
self._vector_norm = sqrt(norm) if norm != 0 else 0
|
||||||
return self._vector_norm
|
return self._vector_norm
|
||||||
|
|
||||||
property text:
|
property text:
|
||||||
|
|
|
@ -4,12 +4,13 @@ from libc.stdio cimport fopen, fclose, fread, fwrite, FILE
|
||||||
from libc.string cimport memset
|
from libc.string cimport memset
|
||||||
from libc.stdint cimport int32_t
|
from libc.stdint cimport int32_t
|
||||||
from libc.stdint cimport uint64_t
|
from libc.stdint cimport uint64_t
|
||||||
|
from libc.math cimport sqrt
|
||||||
|
|
||||||
import bz2
|
import bz2
|
||||||
from os import path
|
from os import path
|
||||||
import io
|
import io
|
||||||
import math
|
import math
|
||||||
import json
|
import ujson as json
|
||||||
import tempfile
|
import tempfile
|
||||||
|
|
||||||
from .lexeme cimport EMPTY_LEXEME
|
from .lexeme cimport EMPTY_LEXEME
|
||||||
|
@ -112,9 +113,9 @@ cdef class Vocab:
|
||||||
self._serializer = None
|
self._serializer = None
|
||||||
|
|
||||||
property serializer:
|
property serializer:
|
||||||
|
# Having the serializer live here is super messy :(
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if self._serializer is None:
|
if self._serializer is None:
|
||||||
freqs = []
|
|
||||||
self._serializer = Packer(self, self.serializer_freqs)
|
self._serializer = Packer(self, self.serializer_freqs)
|
||||||
return self._serializer
|
return self._serializer
|
||||||
|
|
||||||
|
@ -129,6 +130,20 @@ cdef class Vocab:
|
||||||
"""The current number of lexemes stored."""
|
"""The current number of lexemes stored."""
|
||||||
return self.length
|
return self.length
|
||||||
|
|
||||||
|
def resize_vectors(self, int new_size):
|
||||||
|
'''
|
||||||
|
Set vectors_length to a new size, and allocate more memory for the Lexeme
|
||||||
|
vectors if necessary. The memory will be zeroed.
|
||||||
|
'''
|
||||||
|
cdef hash_t key
|
||||||
|
cdef size_t addr
|
||||||
|
if new_size > self.vectors_length:
|
||||||
|
for key, addr in self._by_hash.items():
|
||||||
|
lex = <LexemeC*>addr
|
||||||
|
lex.vector = <float*>self.mem.realloc(lex.vector,
|
||||||
|
new_size * sizeof(lex.vector[0]))
|
||||||
|
self.vectors_length = new_size
|
||||||
|
|
||||||
def add_flag(self, flag_getter, int flag_id=-1):
|
def add_flag(self, flag_getter, int flag_id=-1):
|
||||||
'''Set a new boolean flag to words in the vocabulary. The flag_getter
|
'''Set a new boolean flag to words in the vocabulary. The flag_getter
|
||||||
function will be called over the words currently in the vocab, and then
|
function will be called over the words currently in the vocab, and then
|
||||||
|
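The intent of `resize_vectors` (grow each lexeme's vector to a new length, with the added memory zeroed) can be illustrated with NumPy arrays instead of raw C buffers. This is a sketch under assumptions: `resize_vector` is a hypothetical stand-in, and it assumes existing values are kept when the buffer grows, which in the real code depends on the memory pool's `realloc`.

```python
import numpy


def resize_vector(vector, new_size):
    """Hypothetical stand-in for the per-lexeme realloc: grow and zero-pad."""
    if new_size <= vector.shape[0]:
        # Like the diff, only allocate when the requested size is larger.
        return vector
    grown = numpy.zeros((new_size,), dtype=vector.dtype)
    grown[:vector.shape[0]] = vector
    return grown


vec = numpy.asarray([1.0, 2.0], dtype='float32')
bigger = resize_vector(vec, 5)
```

In the test added by this commit, `vocab.resize_vectors(10)` is called before assigning a 10-dimensional vector to a lexeme, so the buffer is large enough when the assignment happens.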
@ -372,6 +387,7 @@ cdef class Vocab:
|
||||||
cdef LexemeC* lexeme
|
cdef LexemeC* lexeme
|
||||||
cdef attr_t orth
|
cdef attr_t orth
|
||||||
cdef int32_t vec_len = -1
|
cdef int32_t vec_len = -1
|
||||||
|
cdef double norm = 0.0
|
||||||
for line_num, line in enumerate(file_):
|
for line_num, line in enumerate(file_):
|
||||||
pieces = line.split()
|
pieces = line.split()
|
||||||
word_str = pieces.pop(0)
|
word_str = pieces.pop(0)
|
||||||
|
@ -383,9 +399,12 @@ cdef class Vocab:
|
||||||
orth = self.strings[word_str]
|
orth = self.strings[word_str]
|
||||||
lexeme = <LexemeC*><void*>self.get_by_orth(self.mem, orth)
|
lexeme = <LexemeC*><void*>self.get_by_orth(self.mem, orth)
|
||||||
lexeme.vector = <float*>self.mem.alloc(vec_len, sizeof(float))
|
lexeme.vector = <float*>self.mem.alloc(vec_len, sizeof(float))
|
||||||
|
|
||||||
for i, val_str in enumerate(pieces):
|
for i, val_str in enumerate(pieces):
|
||||||
lexeme.vector[i] = float(val_str)
|
lexeme.vector[i] = float(val_str)
|
||||||
|
norm = 0.0
|
||||||
|
for i in range(vec_len):
|
||||||
|
norm += lexeme.vector[i] * lexeme.vector[i]
|
||||||
|
lexeme.l2_norm = sqrt(norm)
|
||||||
self.vectors_length = vec_len
|
self.vectors_length = vec_len
|
||||||
return vec_len
|
return vec_len
|
||||||
|
|
||||||
|
@ -424,14 +443,16 @@ cdef class Vocab:
|
||||||
line_num += 1
|
line_num += 1
|
||||||
cdef LexemeC* lex
|
cdef LexemeC* lex
|
||||||
cdef size_t lex_addr
|
cdef size_t lex_addr
|
||||||
|
cdef double norm = 0.0
|
||||||
cdef int i
|
cdef int i
|
||||||
for orth, lex_addr in self._by_orth.items():
|
for orth, lex_addr in self._by_orth.items():
|
||||||
lex = <LexemeC*>lex_addr
|
lex = <LexemeC*>lex_addr
|
||||||
if lex.lower < vectors.size():
|
if lex.lower < vectors.size():
|
||||||
lex.vector = vectors[lex.lower]
|
lex.vector = vectors[lex.lower]
|
||||||
|
norm = 0.0
|
||||||
for i in range(vec_len):
|
for i in range(vec_len):
|
||||||
lex.l2_norm += (lex.vector[i] * lex.vector[i])
|
norm += lex.vector[i] * lex.vector[i]
|
||||||
lex.l2_norm = math.sqrt(lex.l2_norm)
|
lex.l2_norm = sqrt(norm)
|
||||||
else:
|
else:
|
||||||
lex.vector = EMPTY_VEC
|
lex.vector = EMPTY_VEC
|
||||||
self.vectors_length = vec_len
|
self.vectors_length = vec_len
|
||||||
|
|
|
@ -22,8 +22,8 @@
|
||||||
"DEFAULT_SYNTAX" : "python",
|
"DEFAULT_SYNTAX" : "python",
|
||||||
"ANALYTICS": "UA-58931649-1",
|
"ANALYTICS": "UA-58931649-1",
|
||||||
|
|
||||||
"SPACY_VERSION": "0.101.0",
|
"SPACY_VERSION": "1.0",
|
||||||
"SPACY_STARS": "2300",
|
"SPACY_STARS": "2500",
|
||||||
"GITHUB": { "user": "explosion", "repo": "spacy" }
|
"GITHUB": { "user": "explosion", "repo": "spacy" }
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
@ -14,7 +14,7 @@ mixin a(url, trusted)
|
||||||
block - section content (block and inline elements)
|
block - section content (block and inline elements)
|
||||||
|
|
||||||
mixin section(id)
|
mixin section(id)
|
||||||
section.o-block(id=(id) ? 'section-' + id : '')&attributes(attributes)
|
section.o-section(id=(id) ? 'section-' + id : '')&attributes(attributes)
|
||||||
block
|
block
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -45,6 +45,9 @@
|
||||||
.o-block-small
|
.o-block-small
|
||||||
margin-bottom: 2rem
|
margin-bottom: 2rem
|
||||||
|
|
||||||
|
.o-section
|
||||||
|
margin-bottom: 12.5rem
|
||||||
|
|
||||||
.o-responsive
|
.o-responsive
|
||||||
overflow: auto
|
overflow: auto
|
||||||
width: 100%
|
width: 100%
|
||||||
|
|
|
@ -1,150 +1,134 @@
|
||||||
//- ----------------------------------
|
//- ----------------------------------
|
||||||
//- 💫 DOCS > API > ENGLISH
|
//- 💫 DOCS > API > LANGUAGE
|
||||||
//- ----------------------------------
|
//- ----------------------------------
|
||||||
|
|
||||||
+section("english")
|
+section("language")
|
||||||
+h(2, "english", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/language.py")
|
+h(2, "language", "https://github.com/" + SOCIAL.github + "/spaCy/blob/master/spacy/language.py")
|
||||||
| #[+tag class] English(Language)
|
| #[+tag class] Language
|
||||||
|
|
||||||
p.
|
p.
|
||||||
The English analysis pipeline. Usually you"ll load this once per process,
|
A pipeline that transforms text strings into annotated spaCy Doc objects. Usually you'll load the Language pipeline once and pass the instance around your program.
|
||||||
and pass the instance around your program.
|
|
||||||
|
|
||||||
+code("python", "Overview").
|
+code("python", "Overview").
|
||||||
class Language:
|
class Language:
|
||||||
lang = None
|
Defaults = BaseDefaults
|
||||||
def __init__(self, data_dir=None, tokenizer=None, tagger=None, parser=None, entity=None, matcher=None):
|
|
||||||
return self
|
|
||||||
|
|
||||||
def __call__(self, text, tag=True, parse=True, entity=True):
|
def __init__(self, path=True, **overrides):
|
||||||
return Doc()
|
self.vocab = Vocab()
|
||||||
|
self.tokenizer = Tokenizer()
|
||||||
|
self.tagger = Tagger()
|
||||||
|
self.parser = DependencyParser()
|
||||||
|
self.entity = EntityRecognizer()
|
||||||
|
self.make_doc = lambda text: Doc()
|
||||||
|
self.pipeline = [self.tagger, self.parser, self.entity]
|
||||||
|
|
||||||
def pipe(self, texts_iterator, batch_size=1000, n_threads=2):
|
def __call__(self, text, **toggle):
|
||||||
yield Doc()
|
doc = self.make_doc(text)
|
||||||
|
for process in self.pipeline:
|
||||||
|
if toggle.get(process.name, True):
|
||||||
|
process(doc)
|
||||||
|
return doc
|
||||||
|
|
||||||
def end_training(self, data_dir=None):
|
def pipe(self, texts_iterator, batch_size=1000, n_threads=2, **toggle):
|
||||||
|
docs = (self.make_doc(text) for text in texts_iterator)
|
||||||
|
for process in self.pipeline:
|
||||||
|
if toggle.get(process.name, True):
|
||||||
|
docs = process.pipe(docs, batch_size=batch_size, n_threads=n_threads)
|
||||||
|
for doc in docs:
|
||||||
|
yield doc
|
||||||
|
|
||||||
|
def end_training(self, path=None):
|
||||||
return None
|
return None
|
||||||
|
|
||||||
class English(Language):
|
class English(Language):
|
||||||
lang = "en"
|
class Defaults(BaseDefaults):
|
||||||
|
pass
|
||||||
|
|
||||||
class German(Language):
|
class German(Language):
|
||||||
lang = "de"
|
class Defaults(BaseDefaults):
|
||||||
|
pass
|
||||||
|
|
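The `__call__` logic in the overview above (build a doc, then run each pipeline step unless it is switched off by keyword) can be tried out without loading any models. The sketch below is self-contained and uses hypothetical stand-ins: `Component` and `MiniLanguage` are not spaCy classes, the "doc" is a plain list, and the component names `tagger`, `parser` and `entity` are assumptions chosen for illustration.

```python
class Component(object):
    """Hypothetical pipeline step: records its name on the doc."""

    def __init__(self, name):
        self.name = name

    def __call__(self, doc):
        doc.append(self.name)


class MiniLanguage(object):
    """Toy version of the overview: make_doc plus a list of callables."""

    def __init__(self, pipeline):
        self.make_doc = lambda text: []  # stand-in for the tokenizer
        self.pipeline = pipeline

    def __call__(self, text, **toggle):
        doc = self.make_doc(text)
        for process in self.pipeline:
            # A step runs unless its name is passed as a false keyword.
            if toggle.get(process.name, True):
                process(doc)
        return doc


nlp = MiniLanguage([Component('tagger'), Component('parser'), Component('entity')])
full = nlp(u'Some text.')
no_parser = nlp(u'Some text.', parser=False)
```

Here `full` lists all three step names in order, while `no_parser` skips the step named `parser`. Which keyword disables a real component depends on the name each spaCy component exposes, which this sketch does not try to reproduce.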
||||||
+section("english-init")
|
+section("english-init")
|
||||||
+h(3, "english-init")
|
+h(3, "english-init")
|
||||||
| #[+tag method] English.__init__
|
| #[+tag method] Language.__init__
|
||||||
|
|
||||||
p
|
p
|
||||||
| Load the pipeline. Each component can be passed
|
| Load the pipeline. You can disable components by passing None as a value,
|
||||||
| as an argument, or left as #[code None], in which case it will be loaded
|
| e.g. pass parser=None, vectors=None to save memory if you're not using
|
||||||
| from a classmethod, named e.g. #[code default_vocab()].
|
| those components. You can also pass an object as the value.
|
||||||
|
| Pass a function create_pipeline to use a custom pipeline --- see
|
||||||
|
| the custom pipeline tutorial.
|
||||||
|
|
||||||
+aside("Efficiency").
|
+aside("Efficiency").
|
||||||
Loading takes 10-20 seconds, and the instance consumes 2 to 3
|
Loading takes 10-20 seconds, and the instance consumes 2 to 3
|
||||||
gigabytes of memory. Intended use is for one instance to be
|
gigabytes of memory. Intended use is for one instance to be
|
||||||
created for each language per process, but you can create more
|
created for each language per process, but you can create more
|
||||||
if you"re doing something unusual. You may wish to make the
|
if you're doing something unusual. You may wish to make the
|
||||||
instance a global variable or "singleton".
|
instance a global variable or "singleton".
|
||||||
|
|
||||||
+table(["Example", "Description"])
|
+table(["Example", "Description"])
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python nlp = English()]
|
+cell #[code nlp = English()]
|
||||||
+cell Load everything, from default package
|
+cell Load everything, from default path.
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python nlp = English(data_dir='my_data')]
|
+cell #[code nlp = English(path='my_data')]
|
||||||
+cell Load everything, from specified dir
|
+cell Load everything, from specified path
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python nlp = English(parser=False)]
|
+cell #[code nlp = English(path=path_obj)]
|
||||||
+cell Load everything except the parser.
|
+cell Load everything, from an object that follows the #[code pathlib.Path] protocol.
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python nlp = English(parser=False, tagger=False)]
|
+cell #[code nlp = English(parser=False, vectors=False)]
|
||||||
+cell Load everything except the parser and tagger.
|
+cell Load everything except the parser and the word vectors.
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python nlp = English(parser=MyParser())]
|
+cell #[code nlp = English(parser=my_parser)]
|
||||||
+cell Supply your own parser
|
+cell Load everything, and use a custom parser.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code nlp = English(create_pipeline=my_pipeline)]
|
||||||
|
+cell Load everything, and use a custom pipeline.
|
||||||
|
|
||||||
+code("python", "Definition").
|
+code("python", "Definition").
|
||||||
def __init__(self, data_dir=None, tokenizer=None, tagger=None, parser=None, entity=None, matcher=None):
|
def __init__(self, path=True, **overrides):
|
||||||
return self
|
D = self.Defaults
|
||||||
|
self.vocab = Vocab(path=path, parent=self, **D.vocab) \
|
||||||
|
if 'vocab' not in overrides \
|
||||||
|
else overrides['vocab']
|
||||||
|
self.tokenizer = Tokenizer(self.vocab, path=path, **D.tokenizer) \
|
||||||
|
if 'tokenizer' not in overrides \
|
||||||
|
else overrides['tokenizer']
|
||||||
|
self.tagger = Tagger(self.vocab, path=path, **D.tagger) \
|
||||||
|
if 'tagger' not in overrides \
|
||||||
|
else overrides['tagger']
|
||||||
|
self.parser = DependencyParser(self.vocab, path=path, **D.parser) \
|
||||||
|
if 'parser' not in overrides \
|
||||||
|
else overrides['parser']
|
||||||
|
self.entity = EntityRecognizer(self.vocab, path=path, **D.entity) \
|
||||||
|
if 'entity' not in overrides \
|
||||||
|
else overrides['entity']
|
||||||
|
self.matcher = Matcher(self.vocab, path=path, **D.matcher) \
|
||||||
|
if 'matcher' not in overrides \
|
||||||
|
else overrides['matcher']
|
||||||
|
|
||||||
+table(["Arg", "Type", "Description"])
|
if 'make_doc' in overrides:
|
||||||
+row
|
self.make_doc = overrides['make_doc']
|
||||||
+cell data_dir
|
elif 'create_make_doc' in overrides:
|
||||||
+cell str
|
self.make_doc = overrides['create_make_doc'](self)
|
||||||
+cell.
|
else:
|
||||||
The data directory. If None, value is obtained via the
|
self.make_doc = lambda text: self.tokenizer(text)
|
||||||
#[code default_data_dir()] method.
|
if 'pipeline' in overrides:
|
||||||
|
self.pipeline = overrides['pipeline']
|
||||||
|
elif 'create_pipeline' in overrides:
|
||||||
|
self.pipeline = overrides['create_pipeline'](self)
|
||||||
|
else:
|
||||||
|
self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]
|
||||||
|
|
||||||
+row
|
+section("language-call")
|
||||||
+cell vocab
|
+h(3, "language-call")
|
||||||
+cell #[code Vocab]
|
| #[+tag method] Language.__call__
|
||||||
+cell.
|
|
||||||
The vocab object, which should be an instance of class
|
|
||||||
#[code spacy.vocab.Vocab]. If #[code None], the object is
|
|
||||||
obtained from the #[code default_vocab()] class method. The
|
|
||||||
vocab object manages all of the language specific rules and
|
|
||||||
definitions, maintains the cache of lexical types, and manages
|
|
||||||
the word vectors. Because the vocab owns this important data,
|
|
||||||
most objects hold a reference to the vocab.
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell tokenizer
|
|
||||||
+cell #[code Tokenizer]
|
|
||||||
+cell.
|
|
||||||
The tokenizer, which should be a callable that accepts a
|
|
||||||
unicode string, and returns a #[code Doc] object. If set to
|
|
||||||
#[code None], the default tokenizer is constructed from the
|
|
||||||
#[code default_tokenizer()] method.
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell tagger
|
|
||||||
+cell #[code Tagger]
|
|
||||||
+cell.
|
|
||||||
The part-of-speech tagger, which should be a callable that
|
|
||||||
accepts a #[code Doc] object, and sets the part-of-speech
|
|
||||||
tags in-place. If set to None, the default tagger is constructed
|
|
||||||
from the #[code default_tagger()] method.
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell parser
|
|
||||||
+cell #[code Parser]
|
|
||||||
+cell.
|
|
||||||
The dependency parser, which should be a callable that accepts
|
|
||||||
a #[code Doc] object, and sets the sentence boundaries,
|
|
||||||
syntactic heads and dependency labels in-place.
|
|
||||||
If set to #[code None], the default parser is
|
|
||||||
constructed from the #[code default_parser()] method. To disable
|
|
||||||
the parser and prevent it from being loaded, pass #[code parser=False].
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell entity
|
|
||||||
+cell #[code Parser]
|
|
||||||
+cell.
|
|
||||||
The named entity recognizer, which should be a callable that
|
|
||||||
accepts a #[code Doc] object, and sets the named entity annotations
|
|
||||||
in-place. If set to None, the default entity recognizer is
|
|
||||||
constructed from the #[code default_entity()] method. To disable
|
|
||||||
the entity recognizer and prevent it from being loaded, pass
|
|
||||||
#[code entity=False].
|
|
||||||
|
|
||||||
+row
|
|
||||||
+cell matcher
|
|
||||||
+cell #[code Matcher]
|
|
||||||
+cell.
|
|
||||||
The pattern matcher, which should be a callable that accepts
|
|
||||||
a #[code Doc] object, and sets named entity annotations in-place
|
|
||||||
using token-based rules. If set
|
|
||||||
to None, the default matcher is constructed from the
|
|
||||||
#[code default_matcher()] method.
|
|
||||||
|
|
||||||
+section("english-call")
|
|
||||||
+h(3, "english-call")
|
|
||||||
| #[+tag method] English.__call__
|
|
||||||
|
|
||||||
p
|
p
|
||||||
| The main entry point to spaCy. Takes raw unicode text, and returns
|
| The main entry point to spaCy. Takes raw unicode text, and returns
|
||||||
|
@ -152,30 +136,30 @@
|
||||||
| and #[code Span] objects.
|
| and #[code Span] objects.
|
||||||
|
|
||||||
+aside("Efficiency").
|
+aside("Efficiency").
|
||||||
spaCy"s algorithms are all linear-time, so you can supply
|
spaCy's algorithms are all linear-time, so you can supply
|
||||||
documents of arbitrary length, e.g. whole novels.
|
documents of arbitrary length, e.g. whole novels.
|
||||||
|
|
||||||
+table(["Example", "Description"], "code")
|
+table(["Example", "Description"], "code")
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(u'Some text.')]
|
+cell #[code doc = nlp(u'Some text.')]
|
||||||
+cell Apply the full pipeline.
|
+cell Apply the full pipeline.
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(u'Some text.', parse=False)]
|
+cell #[code doc = nlp(u'Some text.', parse=False)]
|
||||||
+cell Applies tagger and entity, not parser
|
+cell Applies tagger and entity, not parser
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(u'Some text.', entity=False)]
|
+cell #[code doc = nlp(u'Some text.', entity=False)]
|
||||||
+cell Applies tagger and parser, not entity.
|
+cell Applies tagger and parser, not entity.
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(u'Some text.', tag=False)]
|
+cell #[code doc = nlp(u'Some text.', tag=False)]
|
||||||
+cell Does not apply tagger, entity or parser
|
+cell Does not apply tagger, entity or parser
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(u'')]
|
+cell #[code doc = nlp(u'')]
|
||||||
+cell Zero-length tokens, not an error
|
+cell Zero-length tokens, not an error
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(b'Some text')]
|
+cell #[code doc = nlp(b'Some text')]
|
||||||
+cell Error: need unicode
|
+cell Error: need unicode
|
||||||
+row
|
+row
|
||||||
+cell #[code.lang-python doc = nlp(b'Some text'.decode('utf8'))]
|
+cell #[code doc = nlp(b'Some text'.decode('utf8'))]
|
||||||
+cell Decode bytes into unicode first.
|
+cell Decode bytes into unicode first.
|
||||||
|
|
||||||
+code("python", "Definition").
|
+code("python", "Definition").
|
|
@ -8,7 +8,7 @@
|
||||||
["Usage Examples", "#examples", "examples"]
|
["Usage Examples", "#examples", "examples"]
|
||||||
],
|
],
|
||||||
"API": [
|
"API": [
|
||||||
["English", "#english", "english"],
|
["Language", "#language", "language"],
|
||||||
["Doc", "#doc", "doc"],
|
["Doc", "#doc", "doc"],
|
||||||
["Token", "#token", "token"],
|
["Token", "#token", "token"],
|
||||||
["Span", "#span", "span"],
|
["Span", "#span", "span"],
|
||||||
|
|
|
@ -12,78 +12,35 @@
|
||||||
|
|
||||||
p.
|
p.
|
||||||
spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux,
|
spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux,
|
||||||
OS X and Windows. Source and binary packages are available via
|
OS X and Windows. The latest spaCy releases are currently only available as source packages over #[+a("https://pypi.python.org/pypi/spacy") pip]. Installation requires a working build environment. See notes on #[a(href="/docs#install-source-ubuntu") Ubuntu],
|
||||||
#[+a("https://pypi.python.org/pypi/spacy") pip] and
|
|
||||||
#[+a("https://anaconda.org/spacy/spacy") conda]. If there are
|
|
||||||
no binary packages for your platform available please make sure that you have
|
|
||||||
a working build enviroment set up. See
|
|
||||||
notes on #[a(href="/docs#install-source-ubuntu") Ubuntu],
|
|
||||||
#[a(href="/docs#install-source-osx") OS X] and
|
#[a(href="/docs#install-source-osx") OS X] and
|
||||||
#[a(href="/docs#install-source-windows") Windows] for details.
|
#[a(href="/docs#install-source-windows") Windows] for details.
|
||||||
|
|
||||||
+code("bash", "conda").
|
|
||||||
conda config --add channels spacy # only needed once
|
|
||||||
conda install spacy
|
|
||||||
|
|
||||||
p.
|
|
||||||
When using pip it is generally recommended to install packages in a
|
|
||||||
#[+a("https://virtualenv.readthedocs.org/en/latest/") virtualenv]
|
|
||||||
to avoid modifying system state:
|
|
||||||
|
|
||||||
+code("bash", "pip").
|
|
||||||
# make sure you are using a recent pip/virtualenv version
|
|
||||||
python -m pip install -U pip virtualenv
|
|
||||||
|
|
||||||
virtualenv .env
|
|
||||||
source .env/bin/activate
|
|
||||||
|
|
||||||
pip install spacy
|
|
||||||
|
|
||||||
p.
|
|
||||||
Python packaging is awkward at the best of times, and it's particularly
|
|
||||||
tricky with C extensions, built via Cython, requiring large data files.
|
|
||||||
So, please report issues as you encounter them.
|
|
||||||
|
|
||||||
+section("install-model")
|
|
||||||
+h(3, "install-model")
|
|
||||||
| Install model
|
|
||||||
|
|
||||||
p.
|
|
||||||
After installation you need to download a language model.
|
|
||||||
Currently only models for English and German, named #[code en] and #[code de], are available. Please get in touch with us if you need support for a particular language.
|
|
||||||
|
|
||||||
+code("bash").
|
|
||||||
sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0
|
|
||||||
|
|
||||||
p.
|
|
||||||
Then check whether the model was successfully installed:
|
|
||||||
|
|
||||||
+code("bash").
|
|
||||||
python -c "import spacy; spacy.load('en'); print('OK')"
|
|
||||||
|
|
||||||
p.
|
|
||||||
The download command fetches and installs about 500 MB of data which it installs
|
|
||||||
within the #[code spacy] package directory.
|
|
||||||
|
|
||||||
+section("install-upgrade")
|
|
||||||
+h(3, "install-upgrade")
|
|
||||||
| Upgrading spaCy
|
|
||||||
|
|
||||||
p.
|
|
||||||
To upgrade spaCy to the latest release:
|
|
||||||
|
|
||||||
+code("bash", "conda").
|
|
||||||
conda update spacy
|
|
||||||
|
|
||||||
+code("bash", "pip").
|
+code("bash", "pip").
|
||||||
pip install -U spacy
|
pip install -U spacy
|
||||||
|
|
||||||
p.
|
p.
|
||||||
Sometimes new releases require a new language model. Then you will have to upgrade to
|
After installation you need to download a language model. Models for English (#[code en]) and German (#[code de]) are available.
|
||||||
a new model, too. You can also force re-downloading and installing a new language model:
|
|
||||||
|
|
||||||
+code("bash").
|
+code("bash").
|
||||||
|
# English:
|
||||||
|
# - Install tagger, parser, NER and GloVe vectors:
|
||||||
|
python -m spacy.en.download all
|
||||||
|
# - OR install English tagger, parser and NER
|
||||||
|
python -m spacy.en.download parser
|
||||||
|
# - OR install English GloVe vectors
|
||||||
|
python -m spacy.en.download glove
|
||||||
|
# German:
|
||||||
|
# - Install German tagger, parser, NER and word vectors
|
||||||
|
python -m spacy.de.download all
|
||||||
|
# Upgrade/overwrite existing data
|
||||||
python -m spacy.en.download --force
|
python -m spacy.en.download --force
|
||||||
|
# Check whether the model was successfully installed
|
||||||
|
python -c "import spacy; spacy.load('en'); print('OK')"
|
||||||
|
|
||||||
|
p.
|
||||||
|
The download command fetches about 1 GB of data and installs it
|
||||||
|
within the #[code spacy] package directory.
|
||||||
|
|
||||||
+section("install-source")
|
+section("install-source")
|
||||||
+h(3, "install-source")
|
+h(3, "install-source")
|
||||||
|
@ -144,18 +101,6 @@
|
||||||
used to compile your Python interpreter. For official distributions
|
used to compile your Python interpreter. For official distributions
|
||||||
these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
|
these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
|
||||||
|
|
||||||
+section("install-obsolete-python")
|
|
||||||
+h(3, "install-obsolete-python")
|
|
||||||
| Workaround for obsolete system Python
|
|
||||||
|
|
||||||
p.
|
|
||||||
If you're stuck using a system with an old version of Python, and you
|
|
||||||
don't have root access, we've prepared a bootstrap script to help you
|
|
||||||
compile a local Python install. Run:
|
|
||||||
|
|
||||||
+code("bash").
|
|
||||||
curl https://raw.githubusercontent.com/spacy-io/gist/master/bootstrap_python_env.sh | bash && source .env/bin/activate
|
|
||||||
|
|
||||||
+section("run-tests")
|
+section("run-tests")
|
||||||
+h(3, "run-tests")
|
+h(3, "run-tests")
|
||||||
| Run tests
|
| Run tests
|
||||||
|
|
|
@ -13,7 +13,7 @@ include _quickstart-examples
|
||||||
|
|
||||||
+h(2, "api") API
|
+h(2, "api") API
|
||||||
|
|
||||||
include _api-english
|
include _api-language
|
||||||
include _api-doc
|
include _api-doc
|
||||||
include _api-token
|
include _api-token
|
||||||
include _api-span
|
include _api-span
|
||||||
|
|
|
@ -4,11 +4,23 @@ p.u-text-large spaCy features a rule-matching engine that operates over tokens.
|
||||||
|
|
||||||
+code("python", "Matcher Example").
|
+code("python", "Matcher Example").
|
||||||
from spacy.matcher import Matcher
|
from spacy.matcher import Matcher
|
||||||
from spacy.attributes import *
|
from spacy.attrs import *
|
||||||
import spacy
|
import spacy
|
||||||
|
|
||||||
nlp = spacy.load('en', parser=False, entity=False)
|
nlp = spacy.load('en', parser=False, entity=False)
|
||||||
|
|
||||||
|
def merge_phrases(matcher, doc, i, matches):
|
||||||
|
'''
|
||||||
|
Merge a phrase. We have to be careful here because we'll change the token indices.
|
||||||
|
To avoid problems, merge all the phrases once we're called on the last match.
|
||||||
|
'''
|
||||||
|
if i != len(matches)-1:
|
||||||
|
return None
|
||||||
|
# Get Span objects
|
||||||
|
spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
|
||||||
|
for ent_id, label, span in spans:
|
||||||
|
span.merge(label=label, tag='NNP' if label else span.root.tag_)
|
||||||
|
|
||||||
matcher = Matcher(nlp.vocab)
|
matcher = Matcher(nlp.vocab)
|
||||||
|
|
||||||
matcher.add_entity(
|
matcher.add_entity(
|
||||||
|
@ -17,6 +29,7 @@ p.u-text-large spaCy features a rule-matching engine that operates over tokens.
|
||||||
acceptor=None, # Accept or modify the match
|
acceptor=None, # Accept or modify the match
|
||||||
on_match=merge_phrases # Callback to act on the matches
|
on_match=merge_phrases # Callback to act on the matches
|
||||||
)
|
)
|
||||||
|
|
||||||
matcher.add_pattern(
|
matcher.add_pattern(
|
||||||
"GoogleNow", # Entity ID -- Created if doesn't exist.
|
"GoogleNow", # Entity ID -- Created if doesn't exist.
|
||||||
[ # The pattern is a list of *Token Specifiers*.
|
[ # The pattern is a list of *Token Specifiers*.
|
||||||
|
@ -32,7 +45,7 @@ p.u-text-large spaCy features a rule-matching engine that operates over tokens.
|
||||||
doc = nlp(u"I prefer Siri to Google Now.")
|
doc = nlp(u"I prefer Siri to Google Now.")
|
||||||
matches = matcher(doc)
|
matches = matcher(doc)
|
||||||
for ent_id, label, start, end in matches:
|
for ent_id, label, start, end in matches:
|
||||||
print(nlp.strings[ent_id], nlp.strings[label], doc[start : end].text)
|
print(nlp.vocab.strings[ent_id], nlp.vocab.strings[label], doc[start : end].text)
|
||||||
entity = matcher.get_entity(ent_id)
|
entity = matcher.get_entity(ent_id)
|
||||||
print(entity)
|
print(entity)
|
||||||
|
|
||||||
|
|
|
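The `merge_phrases` callback in the Matcher example waits for the last match before merging, because merging changes the token indices. The sketch below illustrates that index-shifting concern on a plain Python list of strings rather than on a spaCy `Doc`. `merge_spans` is a hypothetical helper, not part of spaCy, and it uses a different technique from the callback: instead of collecting `Span` objects up front, it merges from right to left so that earlier indices stay valid.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of tokens into a single token.

    Hypothetical helper for illustration only; it works on a plain list of
    strings, not on a spaCy Doc. Spans are half-open and must not overlap.
    """
    merged = list(tokens)
    # Process spans from right to left so that merging one span does not
    # shift the indices of the spans that are still waiting to be merged.
    for start, end in sorted(spans, reverse=True):
        merged[start:end] = [" ".join(merged[start:end])]
    return merged


tokens = ["I", "prefer", "Siri", "to", "Google", "Now", "."]
result = merge_spans(tokens, [(4, 6)])
print(result)
```

Merging left to right with the original indices would need every later span to be shifted after each merge, which is the problem the callback's docstring warns about.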
@ -27,12 +27,13 @@ p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/examples/training/train_t
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from spacy.pipeline import EntityRecognizer
|
from spacy.pipeline import EntityRecognizer
|
||||||
from spacy.tokens import Doc
|
from spacy.tokens import Doc
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
|
||||||
vocab = Vocab()
|
vocab = Vocab()
|
||||||
entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])
|
entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])
|
||||||
|
|
||||||
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
|
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
|
||||||
entity.update(doc, ['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])
|
entity.update(doc, GoldParse(doc, entities=['O', 'O', 'B-PERSON', 'L-PERSON', 'O']))
|
||||||
|
|
||||||
entity.model.end_training()
|
entity.model.end_training()
|
||||||
|
|
||||||
|
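The entity annotations passed to `GoldParse` in the hunk above are per-token tags such as `B-PERSON` and `L-PERSON`. Assuming they follow the BILUO scheme (Begin, In, Last, Unit, Out), the following sketch shows how such a tag sequence maps to labelled token spans. `biluo_to_spans` is a hypothetical helper written for this illustration and does not call spaCy.

```python
def biluo_to_spans(tags):
    """Convert BILUO tags into (start, end, label) spans with half-open ends.

    Hypothetical helper for illustration; assumes a well-formed tag sequence.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            # Outside any entity: forget any unfinished span.
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # Single-token entity.
            spans.append((i, i + 1, label))
        elif prefix == "B":
            # First token of a multi-token entity.
            start = i
        elif prefix == "L" and start is not None:
            # Last token of a multi-token entity: close the span.
            spans.append((start, i + 1, label))
            start = None
        # "I" tags sit inside an open span and need no action.
    return spans


words = ["Who", "is", "Shaka", "Khan", "?"]
tags = ["O", "O", "B-PERSON", "L-PERSON", "O"]
for start, end, label in biluo_to_spans(tags):
    print(label, words[start:end])
```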
@ -49,8 +50,7 @@ p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/examples/training/train_n
|
||||||
parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])
|
parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])
|
||||||
|
|
||||||
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
|
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
|
||||||
parser.update(doc, [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'),
|
parser.update(doc, GoldParse(doc, heads=[1, 1, 3, 1, 1,], deps=['nsubj', 'ROOT', 'compound', 'dobj', 'punct']))
|
||||||
(1, 'punct')])
|
|
||||||
|
|
||||||
parser.model.end_training()
|
parser.model.end_training()
|
||||||
|
|
||||||
|
|
|
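The updated `parser.update` call supplies the dependency annotation as two parallel lists, `heads` and `deps`, instead of a list of `(head, label)` pairs. As a rough sketch of how to read them, the snippet below pairs each word with its head word. It assumes that a token whose head index equals its own index is the root, which matches the `'ROOT'` entry in the example. `describe_arcs` is a hypothetical helper and does not call spaCy.

```python
def describe_arcs(words, heads, deps):
    """Return (word, dependency label, head word) triples.

    Hypothetical helper for illustration. The head word is None for a token
    that is its own head, which is assumed here to mark the root.
    """
    arcs = []
    for i, (head, dep) in enumerate(zip(heads, deps)):
        head_word = None if head == i else words[head]
        arcs.append((words[i], dep, head_word))
    return arcs


words = ["Who", "is", "Shaka", "Khan", "?"]
heads = [1, 1, 3, 1, 1]
deps = ["nsubj", "ROOT", "compound", "dobj", "punct"]
for word, dep, head_word in describe_arcs(words, heads, deps):
    print(word, dep, head_word)
```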
@ -28,6 +28,15 @@ main.o-main
|
||||||
+a("https://www.reddit.com/r/" + SOCIAL.reddit) #[+icon("reddit")] #[strong User Group] on Reddit
|
+a("https://www.reddit.com/r/" + SOCIAL.reddit) #[+icon("reddit")] #[strong User Group] on Reddit
|
||||||
|
|
||||||
+grid.u-border-bottom
|
+grid.u-border-bottom
|
||||||
|
+grid-col("half").u-padding
|
||||||
|
+label Release update
|
||||||
|
+h(2)
|
||||||
|
+a("https://github.com/" + SOCIAL.github + "/spaCy/releases") spaCy v1.0 out now!
|
||||||
|
|
||||||
|
p.u-text-medium I'm excited — and more than a little nervous! — to finally make the #[+a("https://github.com/" + SOCIAL.github + "/spaCy/releases") 1.0 release of spaCy]. By far my favourite part of the release is the new support for custom pipelines. Default support for GloVe vectors is also nice. The trickiest change was a significant rewrite of the Matcher class, to support entity IDs and attributes. I've added #[a(href="/docs/#tutorials") tutorials] for the new features, and some training examples.#[br]#[br]
|
||||||
|
|
||||||
|
+button("https://explosion.ai/blog/spacy-deep-learning-keras", true, "primary") Read the blog post
|
||||||
|
|
||||||
+grid-col("half").u-padding
|
+grid-col("half").u-padding
|
||||||
+label Are you using spaCy?
|
+label Are you using spaCy?
|
||||||
+h(2)
|
+h(2)
|
||||||
|
@ -42,14 +51,6 @@ main.o-main
|
||||||
|
|
||||||
#[+button("https://survey.spacy.io", true, "primary") Take the survey]
|
#[+button("https://survey.spacy.io", true, "primary") Take the survey]
|
||||||
|
|
||||||
+grid-col("half").u-padding
|
|
||||||
+label The blog posts have moved
|
|
||||||
+h(2) Check out the new blog
|
|
||||||
|
|
||||||
p.u-text-medium We've updated the site to make it more focussed on the library itself. This will help us stay organised when we expand the tutorials section — by far the clearest message we've gotten from the survey so far. The blog posts have been moved to the new site for our consulting services, #[+a("https://explosion.ai", true) Explosion AI]. We've also updated our demos, and have open-sourced the services behind them. There are lots more releases to come. #[br]#[br]
|
|
||||||
|
|
||||||
+button("https://explosion.ai/blog", true, "primary") Go to the new blogs
|
|
||||||
|
|
||||||
+grid
|
+grid
|
||||||
+grid-col("half").u-padding
|
+grid-col("half").u-padding
|
||||||
+h(2) Built for Production
|
+h(2) Built for Production
|
||||||
|
|
|
@ -1,16 +0,0 @@
|
||||||
spaCy uses data from Princeton's WordNet project, which is free for commercial use.
|
|
||||||
|
|
||||||
The data is installed alongside spaCy, in spacy/en/data/wordnet.
|
|
||||||
|
|
||||||
WordNet is licensed as follows.
|
|
||||||
|
|
||||||
Commercial Use
|
|
||||||
|
|
||||||
WordNet® is unencumbered, and may be used in commercial applications in accordance with the following license agreement. An attorney representing the commercial interest should review this WordNet license with respect to the intended use.
|
|
||||||
|
|
||||||
WordNet License
|
|
||||||
|
|
||||||
This license is available as the file LICENSE in any downloaded version of WordNet.
|
|
||||||
WordNet 3.0 license: (Download)
|
|
||||||
|
|
||||||
WordNet Release 3.0 This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.: Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT- ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.
|
|