Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix

Dan Rapp 2017-03-09 11:42:14 -07:00
commit b9307dfcd7
7 changed files with 301 additions and 141 deletions


@ -10,7 +10,7 @@ software, released under the MIT license.
💫 **Version 1.6 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_ 💫 **Version 1.6 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
.. image:: https://img.shields.io/travis/explosion/spaCy.svg?style=flat-square .. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
:target: https://travis-ci.org/explosion/spaCy :target: https://travis-ci.org/explosion/spaCy
:alt: Build Status :alt: Build Status
@ -18,14 +18,14 @@ software, released under the MIT license.
:target: https://github.com/explosion/spaCy/releases :target: https://github.com/explosion/spaCy/releases
:alt: Current Release Version :alt: Current Release Version
.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
:target: https://anaconda.org/conda-forge/spacy
:alt: conda Version
.. image:: https://img.shields.io/pypi/v/spacy.svg?style=flat-square .. image:: https://img.shields.io/pypi/v/spacy.svg?style=flat-square
:target: https://pypi.python.org/pypi/spacy :target: https://pypi.python.org/pypi/spacy
:alt: pypi Version :alt: pypi Version
.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
:target: https://anaconda.org/conda-forge/spacy
:alt: conda Version
.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square .. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square
:target: https://gitter.im/explosion/spaCy :target: https://gitter.im/explosion/spaCy
:alt: spaCy on Gitter :alt: spaCy on Gitter
@ -104,30 +104,35 @@ Supports
Install spaCy Install spaCy
============= =============
spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux, OS X spaCy is compatible with **64-bit CPython 2.6+/3.3+** and runs on **Unix/Linux**,
and Windows. Source packages are available via **macOS/OS X** and **Windows**. The latest spaCy releases are available over
`pip <https://pypi.python.org/pypi/spacy>`_. Please make sure that `pip <https://pypi.python.org/pypi/spacy>`_ (source packages only) and
you have a working build enviroment set up. See notes on Ubuntu, macOS/OS X and Windows `conda <https://anaconda.org/conda-forge/spacy>`_. Installation requires a working
for details. build environment. See notes on Ubuntu, macOS/OS X and Windows for details.
pip pip
--- ---
When using pip it is generally recommended to install packages in a virtualenv to Using pip, spaCy releases are currently only available as source packages.
avoid modifying system state:
.. code:: bash .. code:: bash
pip install spacy pip install -U spacy
Python packaging is awkward at the best of times, and it's particularly tricky with When using pip it is generally recommended to install packages in a ``virtualenv``
C extensions, built via Cython, requiring large data files. So, please report issues to avoid modifying system state:
as you encounter them.
.. code:: bash
virtualenv .env
source .env/bin/activate
pip install spacy
conda conda
----- -----
If you're using conda, you can install spaCy via ``conda-forge``: Thanks to our great community, we've finally re-added conda support. You can now
install spaCy via ``conda-forge``:
.. code:: bash .. code:: bash
@ -136,14 +141,13 @@ If you're using conda, you can install spaCy via ``conda-forge``:
For the feedstock including the build recipe and configuration, For the feedstock including the build recipe and configuration,
check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_. check out `this repository <https://github.com/conda-forge/spacy-feedstock>`_.
Thanks to our great community, we've finally re-added conda support — improvements Improvements and pull requests to the recipe and setup are always appreciated.
and pull requests to the recipe and setup are always appreciated.
Install model Download models
============= ===============
After installation you need to download a language model. Currently only models for After installation you need to download a language model. Models for English
English and German, named ``en`` and ``de``, are available. (``en``) and German (``de``) are available.
.. code:: bash .. code:: bash
@ -153,51 +157,90 @@ English and German, named ``en`` and ``de``, are available.
The download command fetches about 1 GB of data which it installs The download command fetches about 1 GB of data which it installs
within the ``spacy`` package directory. within the ``spacy`` package directory.
Upgrading spaCy Sometimes new releases require a new language model. Then you will have to
=============== upgrade to a new model, too. You can also force re-downloading and installing a
new language model:
To upgrade spaCy to the latest release:
pip
---
.. code:: bash
pip install -U spacy
Sometimes new releases require a new language model. Then you will have to upgrade to
a new model, too. You can also force re-downloading and installing a new language model:
.. code:: bash .. code:: bash
python -m spacy.en.download --force python -m spacy.en.download --force
Download model to custom location
---------------------------------
You can specify where ``spacy.en.download`` and ``spacy.de.download`` download
the language model to using the ``--data-path`` or ``-d`` argument:
.. code:: bash
python -m spacy.en.download all --data-path /some/dir
If you choose to download to a custom location, you will need to tell spaCy where to load the model
from in order to use it. You can do this either by calling ``spacy.util.set_data_path()`` before
calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.English`` or
``spacy.de.German`` constructors.
Download models manually
------------------------
As of v1.6, the models and word vectors are also available as direct downloads
from GitHub, attached to the `releases <https://github.com/explosion/spacy/releases>`_
as ``.tar.gz`` archives.
To install the models manually, first find the default data path. You can use
``spacy.util.get_data_path()`` to find the directory where spaCy will look for
its models, or change the default data path with ``spacy.util.set_data_path()``.
Then simply unpack the archive and place the contained folder in that directory.
You can now load the models via ``spacy.load()``.
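The manual installation steps above can be sketched in Python. This is a minimal sketch, not spaCy's own installer: ``install_model_archive`` and ``build_demo_archive`` are hypothetical helpers, the archive layout (a single top-level folder) is an assumption, and the data directory is passed in explicitly. In practice it would be the value returned by ``spacy.util.get_data_path()``.

```python
import os
import tarfile
import tempfile


def install_model_archive(archive_path, data_dir):
    """Unpack a .tar.gz archive into data_dir (hypothetical helper).

    Returns the sorted names of the top-level entries that were unpacked.
    """
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
    with tarfile.open(archive_path, 'r:gz') as archive:
        names = archive.getnames()
        # Refuse entries that would land outside data_dir.
        for name in names:
            if os.path.isabs(name) or '..' in name.split('/'):
                raise ValueError('unsafe path in archive: %s' % name)
        archive.extractall(data_dir)
    return sorted(set(name.split('/')[0] for name in names))


def build_demo_archive(work_dir, model_name='en-demo'):
    """Create a tiny stand-in archive (not a real spaCy model)."""
    model_dir = os.path.join(work_dir, model_name)
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, 'meta.txt'), 'w') as handle:
        handle.write('placeholder')
    archive_path = os.path.join(work_dir, model_name + '.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as archive:
        archive.add(model_dir, arcname=model_name)
    return archive_path
```

With a real release archive, ``archive_path`` would point at the downloaded ``.tar.gz`` file, and the unpacked folder would then be picked up by ``spacy.load()``.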
Compile from source Compile from source
=================== ===================
The other way to install spaCy is to clone its GitHub repository and build it from The other way to install spaCy is to clone its
`GitHub repository <https://github.com/explosion/spaCy>`_ and build it from
source. That is the common way if you want to make changes to the code base. source. That is the common way if you want to make changes to the code base.
You'll need to make sure that you have a development environment consisting of a You'll need to make sure that you have a development environment consisting of a
Python distribution including header files, a compiler, pip, virtualenv and git Python distribution including header files, a compiler,
installed. The compiler part is the trickiest. How to do that depends on your `pip <https://pip.pypa.io/en/latest/installing/>`__, `virtualenv <https://virtualenv.pypa.io/>`_
system. See notes on Ubuntu, OS X and Windows for details. and `git <https://git-scm.com>`_ installed. The compiler part is the trickiest.
How to do that depends on your system. See notes on Ubuntu, OS X and Windows for
details.
.. code:: bash .. code:: bash
# make sure you are using recent pip/virtualenv versions # make sure you are using recent pip/virtualenv versions
python -m pip install -U pip virtualenv python -m pip install -U pip virtualenv
git clone https://github.com/explosion/spaCy
# find git install instructions at https://git-scm.com/downloads
git clone https://github.com/explosion/spaCy.git
cd spaCy cd spaCy
virtualenv .env && source .env/bin/activate
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt pip install -r requirements.txt
pip install -e . pip install -e .
Compared to regular install via pip `requirements.txt <requirements.txt>`_ Compared to regular install via pip `requirements.txt <requirements.txt>`_
additionally installs developer dependencies such as cython. additionally installs developer dependencies such as Cython.
Instead of the above verbose commands, you can also use the following
`Fabric <http://www.fabfile.org/>`_ commands:
+---------------+--------------------------------------------------------------+
| ``fab env`` | Create ``virtualenv`` and delete previous one, if it exists. |
+---------------+--------------------------------------------------------------+
| ``fab make`` | Compile the source. |
+---------------+--------------------------------------------------------------+
| ``fab clean`` | Remove compiled objects, including the generated C++. |
+---------------+--------------------------------------------------------------+
| ``fab test`` | Run basic tests, aborting after first failure. |
+---------------+--------------------------------------------------------------+
All commands assume that your ``virtualenv`` is located in a directory ``.env``.
If you're using a different directory, you can change it via the environment
variable ``VENV_DIR``, for example:
.. code:: bash
VENV_DIR=".custom-env" fab clean make
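An environment-variable override with a fallback, as described for ``VENV_DIR``, is typically resolved as in the sketch below. This illustrates the pattern only: ``resolve_venv_dir`` is a hypothetical name, and the project's actual fabfile may implement the lookup differently.

```python
import os


def resolve_venv_dir(environ=None, default='.env'):
    """Return the virtualenv directory, preferring the VENV_DIR variable."""
    if environ is None:
        environ = os.environ
    # An unset or empty VENV_DIR falls back to the default directory.
    return environ.get('VENV_DIR') or default
```

Passing the environment as an argument keeps the helper easy to check without modifying the real process environment.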
Ubuntu Ubuntu
------ ------
@ -226,8 +269,8 @@ VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
Run tests Run tests
========= =========
spaCy comes with an extensive test suite. First, find out where spaCy is spaCy comes with an `extensive test suite <spacy/tests>`_. First, find out where
installed: spaCy is installed:
.. code:: bash .. code:: bash
@ -243,22 +286,6 @@ and ``--model`` are optional and enable additional tests:
python -m pytest <spacy-directory> --vectors --model --slow python -m pytest <spacy-directory> --vectors --model --slow
Download model to custom location
=================================
You can specify where ``spacy.en.download`` and ``spacy.de.download`` download the language model
to using the ``--data-path`` or ``-d`` argument:
.. code:: bash
python -m spacy.en.download all --data-path /some/dir
If you choose to download to a custom location, you will need to tell spaCy where to load the model
from in order to use it. You can do this either by calling ``spacy.util.set_data_path()`` before
calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.English`` or
``spacy.de.German`` constructors.
Changelog Changelog
========= =========


@ -0,0 +1 @@
# coding: utf-8


@ -0,0 +1,40 @@
# encoding: utf8
from __future__ import unicode_literals
import pytest
TESTCASES = []
PUNCTUATION_TESTS = [
(u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']),
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
]
ABBREVIATIONS = [
(u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u'।'])
]
TESTCASES.extend(PUNCTUATION_TESTS)
TESTCASES.extend(ABBREVIATIONS)
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
    tokens = bn_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
def test_tokenizer_handles_long_text(bn_tokenizer):
    text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
অভি ি রগণ ি ি িি গবষণ রকল কর, \
মধ রয বট ি ি ি আরিিি ইনি \
এসকল রকল কর যম ি যথ পরি ইজড হওয সমভব \
আর গবষণ িরক ি অনকখি! \
কন হও, গবষক ি লপ - নর উথ ইউনিিি রতি ি রয \
নর উথ অসরণ কমিউনিি দর আমনরণ"""
    tokens = bn_tokenizer(text)
    assert len(tokens) == 84
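
To make the expected token splits in the punctuation cases concrete without a spaCy install, the following sketch mimics them with a regular expression. It is not the Bengali tokenizer exercised by these tests: ``toy_tokenize`` and its punctuation set are assumptions for illustration only, and it does not handle abbreviations such as ``সে.``.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re

# Assumed punctuation set for this sketch: a few ASCII marks plus the
# Bengali danda (U+0964). Each mark becomes its own token.
_TOKEN_RE = re.compile('[!?,;:\u0964]|[^\\s!?,;:\u0964]+')


def toy_tokenize(text):
    """Split text on whitespace and emit punctuation marks separately."""
    return _TOKEN_RE.findall(text)
```

Because whitespace is never matched, the result contains no space tokens, which corresponds to the ``not token.is_space`` filter in the parametrized test.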


@ -11,6 +11,7 @@ from ..nl import Dutch
from ..sv import Swedish from ..sv import Swedish
from ..hu import Hungarian from ..hu import Hungarian
from ..fi import Finnish from ..fi import Finnish
from ..bn import Bengali
from ..tokens import Doc from ..tokens import Doc
from ..strings import StringStore from ..strings import StringStore
from ..lemmatizer import Lemmatizer from ..lemmatizer import Lemmatizer
@ -24,7 +25,7 @@ import pytest
LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch, LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
Swedish, Hungarian, Finnish] Swedish, Hungarian, Finnish, Bengali]
@pytest.fixture(params=LANGUAGES) @pytest.fixture(params=LANGUAGES)
@ -73,6 +74,11 @@ def sv_tokenizer():
return Swedish.Defaults.create_tokenizer() return Swedish.Defaults.create_tokenizer()
@pytest.fixture
def bn_tokenizer():
    return Bengali.Defaults.create_tokenizer()
@pytest.fixture @pytest.fixture
def stringstore(): def stringstore():
return StringStore() return StringStore()


@ -31,7 +31,7 @@ def test_tokenizer_handles_punct(tokenizer):
def test_tokenizer_handles_digits(tokenizer): def test_tokenizer_handles_digits(tokenizer):
exceptions = ["hu"] exceptions = ["hu", "bn"]
text = "Lorem ipsum: 1984." text = "Lorem ipsum: 1984."
tokens = tokenizer(text) tokens = tokenizer(text)


@ -138,7 +138,9 @@ p
+code. +code.
import spacy import spacy
import random
from spacy.gold import GoldParse from spacy.gold import GoldParse
from spacy.language import EntityRecognizer
train_data = [ train_data = [
('Who is Chaka Khan?', [(7, 17, 'PERSON')]), ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),


@ -5,16 +5,46 @@ include ../../_includes/_mixins
p p
| spaCy is compatible with #[strong 64-bit CPython 2.6+&#8725;3.3+] and | spaCy is compatible with #[strong 64-bit CPython 2.6+&#8725;3.3+] and
| runs on #[strong Unix/Linux], #[strong macOS/OS X] and | runs on #[strong Unix/Linux], #[strong macOS/OS X] and
| #[strong Windows]. The latest spaCy releases are currently only | #[strong Windows]. The latest spaCy releases are
| available as source packages over | available over #[+a("https://pypi.python.org/pypi/spacy") pip] (source
| #[+a("https://pypi.python.org/pypi/spacy") pip]. Installation requires a | packages only) and #[+a("https://anaconda.org/conda-forge/spacy") conda].
| working build environment. See notes on | Installation requires a working build environment. See notes on
| #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") macOS/OS X] | #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") macOS/OS X]
| and #[a(href="#source-windows") Windows] for details. | and #[a(href="#source-windows") Windows] for details.
+h(2, "pip") pip
p Using pip, spaCy releases are currently only available as source packages.
+code(false, "bash"). +code(false, "bash").
pip install -U spacy pip install -U spacy
p
| When using pip it is generally recommended to install packages in a
| #[code virtualenv] to avoid modifying system state:
+code(false, "bash").
virtualenv .env
source .env/bin/activate
pip install spacy
+h(2, "conda") conda
p
| Thanks to our great community, we've finally re-added conda support. You
| can now install spaCy via #[code conda-forge]:
+code(false, "bash").
conda config --add channels conda-forge
conda install spacy
p
| For the feedstock including the build recipe and configuration, check out
| #[+a("https://github.com/conda-forge/spacy-feedstock") this repository].
| Improvements and pull requests to the recipe and setup are always appreciated.
+h(2, "models") Download models
p p
| After installation you need to download a language model. Models for | After installation you need to download a language model. Models for
| English (#[code en]) and German (#[code de]) are available. | English (#[code en]) and German (#[code de]) are available.
@ -36,18 +66,49 @@ p
# Check whether the model was successfully installed # Check whether the model was successfully installed
python -c "import spacy; spacy.load('en'); print('OK')" python -c "import spacy; spacy.load('en'); print('OK')"
p The download command fetches about 1 GB of data which it installs within the #[code spacy] package directory. p
| The download command fetches about 1 GB of data which it
| installs within the #[code spacy] package directory.
+h(3, "custom-location") Download model to custom location
p
| You can specify where #[code spacy.en.download] and
| #[code spacy.de.download] download the language model to using the
| #[code --data-path] or #[code -d] argument:
+code(false, "bash").
python -m spacy.en.download all --data-path /some/dir
p
| If you choose to download to a custom location, you will need to tell
| spaCy where to load the model from in order to use it. You can do this
| either by calling #[code spacy.util.set_data_path()] before calling
| #[code spacy.load()], or by passing a #[code path] argument to the
| #[code spacy.en.English] or #[code spacy.de.German] constructors.
+h(3, "models-manual") Download models manually
p
| As of v1.6, the models and word vectors are also available as direct
| downloads from GitHub, attached to the #[+a(gh("spaCy") + "/releases") releases] as #[code .tar.gz] archives.
p
| To install the models manually, first find the default data path. You can
| use #[code spacy.util.get_data_path()] to find the directory where spaCy
| will look for its models, or change the default data path with
| #[code spacy.util.set_data_path()]. Then simply unpack the archive and
| place the contained folder in that directory. You can now load the models
| via #[code spacy.load()].
+h(2, "source") Compile from source +h(2, "source") Compile from source
p p
| The other way to install spaCy is to clone its | The other way to install spaCy is to clone its
| #[+a(gh("spaCy")) GitHub repository] and build it from source. That is | #[+a(gh("spaCy")) GitHub repository] and build it from source. That is
| the common way if you want to make changes to the code base. | the common way if you want to make changes to the code base. You'll need to
| make sure that you have a development environment consisting of a Python
p | distribution including header files, a compiler,
| You'll need to make sure that you have a development environment
| consisting of a Python distribution including header files, a compiler,
| #[+a("https://pip.pypa.io/en/latest/installing/") pip], | #[+a("https://pip.pypa.io/en/latest/installing/") pip],
| #[+a("https://virtualenv.pypa.io/") virtualenv] and | #[+a("https://virtualenv.pypa.io/") virtualenv] and
| #[+a("https://git-scm.com") git] installed. The compiler part is the | #[+a("https://git-scm.com") git] installed. The compiler part is the
@ -55,6 +116,50 @@ p
| #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") OS X] and | #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") OS X] and
| #[a(href="#source-windows") Windows] for details. | #[a(href="#source-windows") Windows] for details.
+code(false, "bash").
# make sure you are using recent pip/virtualenv versions
python -m pip install -U pip virtualenv
git clone #{gh("spaCy")}
cd spaCy
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt
pip install -e .
p
| Compared to regular install via pip, #[+a(gh("spaCy", "requirements.txt")) requirements.txt]
| additionally installs developer dependencies such as Cython.
p
| Instead of the above verbose commands, you can also use the following
| #[+a("http://www.fabfile.org/") Fabric] commands:
+table(["Command", "Description"])
+row
+cell #[code fab env]
+cell Create #[code virtualenv] and delete previous one, if it exists.
+row
+cell #[code fab make]
+cell Compile the source.
+row
+cell #[code fab clean]
+cell Remove compiled objects, including the generated C++.
+row
+cell #[code fab test]
+cell Run basic tests, aborting after first failure.
p
| All commands assume that your #[code virtualenv] is located in a
| directory #[code .env]. If you're using a different directory, you can
| change it via the environment variable #[code VENV_DIR], for example:
+code(false, "bash").
VENV_DIR=".custom-env" fab clean make
+h(3, "source-ubuntu") Ubuntu +h(3, "source-ubuntu") Ubuntu
p Install system-level dependencies via #[code apt-get]: p Install system-level dependencies via #[code apt-get]:
@ -67,12 +172,8 @@ p Install system-level dependencies via #[code apt-get]:
p p
| Install a recent version of #[+a("https://developer.apple.com/xcode/") XCode], | Install a recent version of #[+a("https://developer.apple.com/xcode/") XCode],
| including the so-called "Command Line Tools". macOS and OS X ship with | including the so-called "Command Line Tools". macOS and OS X ship with
| Python and git preinstalled. | Python and git preinstalled. To compile spaCy with multi-threading support
| on macOS / OS X, #[+a("https://github.com/explosion/spaCy/issues/267") see here].
p
| To compile spaCy with multi-threading support on macOS / OS X,
| #[+a("https://github.com/explosion/spaCy/issues/267") see here].
+h(3, "source-windows") Windows +h(3, "source-windows") Windows
@ -98,8 +199,8 @@ p
+h(2, "tests") Run tests +h(2, "tests") Run tests
p p
| spaCy comes with an extensive test suite. First, find out where spaCy is | spaCy comes with an #[+a(gh("spacy", "spacy/tests")) extensive test suite].
| installed: | First, find out where spaCy is installed:
+code(false, "bash"). +code(false, "bash").
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
@ -114,20 +215,3 @@ p
python -m pip install -U pytest python -m pip install -U pytest
python -m pytest &lt;spacy-directory&gt; --vectors --model --slow python -m pytest &lt;spacy-directory&gt; --vectors --model --slow
+h(2, "custom-location") Download model to custom location
p
| You can specify where #[code spacy.en.download] and
| #[code spacy.de.download] download the language model to using the
| #[code --data-path] or #[code -d] argument:
+code(false, "bash").
python -m spacy.en.download all --data-path /some/dir
p
| If you choose to download to a custom location, you will need to tell
| spaCy where to load the model from in order to use it. You can do this
| either by calling #[code spacy.util.set_data_path()] before calling
| #[code spacy.load()], or by passing a #[code path] argument to the
| #[code spacy.en.English] or #[code spacy.de.German] constructors.