diff --git a/README.rst b/README.rst
index 017a456f8..b6bee922b 100644
--- a/README.rst
+++ b/README.rst
@@ -1,35 +1,35 @@
 spaCy: Industrial-strength NLP
 ******************************
 
-spaCy is a library for advanced natural language processing in Python and
-Cython. spaCy is built on the very latest research, but it isn't researchware.
-It was designed from day one to be used in real products. spaCy currently supports
-English and German, as well as tokenization for Chinese, Spanish, Italian, French,
+spaCy is a library for advanced natural language processing in Python and
+Cython. spaCy is built on the very latest research, but it isn't researchware.
+It was designed from day one to be used in real products. spaCy currently supports
+English and German, as well as tokenization for Chinese, Spanish, Italian, French,
 Portuguese, Dutch, Swedish, Finnish, Hungarian and Bengali. It's commercial open-source
 software, released under the MIT license.
 
 💫 **Version 1.6 out now!** `Read the release notes here. `_
 
-.. image:: https://img.shields.io/travis/explosion/spaCy.svg?style=flat-square
+.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
     :target: https://travis-ci.org/explosion/spaCy
     :alt: Build Status
-
+
 .. image:: https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square
-    :target: https://github.com/explosion/spaCy/releases
+    :target: https://github.com/explosion/spaCy/releases
     :alt: Current Release Version
-
-.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
-    :target: https://anaconda.org/conda-forge/spacy
-    :alt: conda Version
-
+
 .. image:: https://img.shields.io/pypi/v/spacy.svg?style=flat-square
     :target: https://pypi.python.org/pypi/spacy
     :alt: pypi Version
 
+.. image:: https://anaconda.org/conda-forge/spacy/badges/version.svg
+    :target: https://anaconda.org/conda-forge/spacy
+    :alt: conda Version
+
 .. 
image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg?style=flat-square :target: https://gitter.im/explosion/spaCy :alt: spaCy on Gitter - + .. image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow :target: https://twitter.com/spacy_io :alt: spaCy on Twitter @@ -55,7 +55,7 @@ software, released under the MIT license. +---------------------------+------------------------------------------------------------------------------------------------------------+ | **Bug reports**     | `GitHub Issue tracker `_                                     | +---------------------------+------------------------------------------------------------------------------------------------------------+ -| **Usage questions**   | `StackOverflow `_, `Reddit usergroup                     | +| **Usage questions**   | `StackOverflow `_, `Reddit usergroup                     | | | `_, `Gitter chat `_ | +---------------------------+------------------------------------------------------------------------------------------------------------+ | **General discussion** | `Reddit usergroup `_, | @@ -104,100 +104,143 @@ Supports Install spaCy ============= -spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux, OS X -and Windows. Source packages are available via -`pip `_. Please make sure that -you have a working build enviroment set up. See notes on Ubuntu, macOS/OS X and Windows -for details. +spaCy is compatible with **64-bit CPython 2.6+/3.3+** and runs on **Unix/Linux**, +**macOS/OS X** and **Windows**. The latest spaCy releases are available over +`pip `_ (source packages only) and +`conda `_. Installation requires a working +build environment. See notes on Ubuntu, macOS/OS X and Windows for details. pip --- -When using pip it is generally recommended to install packages in a virtualenv to -avoid modifying system state: +Using pip, spaCy releases are currently only available as source packages. .. 
code:: bash - pip install spacy + pip install -U spacy -Python packaging is awkward at the best of times, and it's particularly tricky with -C extensions, built via Cython, requiring large data files. So, please report issues -as you encounter them. +When using pip it is generally recommended to install packages in a ``virtualenv`` +to avoid modifying system state: + +.. code:: bash + + virtualenv .env + source .env/bin/activate + pip install spacy conda ----- -If you're using conda, you can install spaCy via ``conda-forge``: +Thanks to our great community, we've finally re-added conda support. You can now +install spaCy via ``conda-forge``: .. code:: bash   conda config --add channels conda-forge   conda install spacy - + For the feedstock including the build recipe and configuration, check out `this repository `_. -Thanks to our great community, we've finally re-added conda support — improvements -and pull requests to the recipe and setup are always appreciated. +Improvements and pull requests to the recipe and setup are always appreciated. -Install model -============= +Download models +=============== -After installation you need to download a language model. Currently only models for -English and German, named ``en`` and ``de``, are available. +After installation you need to download a language model. Models for English +(``en``) and German (``de``) are available. .. code:: bash python -m spacy.en.download all python -m spacy.de.download all -The download command fetches about 1 GB of data which it installs +The download command fetches about 1 GB of data which it installs within the ``spacy`` package directory. -Upgrading spaCy -=============== - -To upgrade spaCy to the latest release: - -pip ---- - -.. code:: bash - - pip install -U spacy - -Sometimes new releases require a new language model. Then you will have to upgrade to -a new model, too. 
You can also force re-downloading and installing a new language model: +Sometimes new releases require a new language model. Then you will have to +upgrade to a new model, too. You can also force re-downloading and installing a +new language model: .. code:: bash python -m spacy.en.download --force +Download model to custom location +--------------------------------- + +You can specify where ``spacy.en.download`` and ``spacy.de.download`` download +the language model to using the ``--data-path`` or ``-d`` argument: + +.. code:: bash + + python -m spacy.en.download all --data-path /some/dir + +If you choose to download to a custom location, you will need to tell spaCy where to load the model +from in order to use it. You can do this either by calling ``spacy.util.set_data_path()`` before +calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.English`` or +``spacy.de.German`` constructors. + +Download models manually +------------------------ + +As of v1.6, the models and word vectors are also available as direct downloads +from GitHub, attached to the `releases `_ +as ``.tar.gz`` archives. + +To install the models manually, first find the default data path. You can use +``spacy.util.get_data_path()`` to find the directory where spaCy will look for +its models, or change the default data path with ``spacy.util.set_data_path()``. +Then simply unpack the archive and place the contained folder in that directory. +You can now load the models via ``spacy.load()``. + Compile from source =================== -The other way to install spaCy is to clone its GitHub repository and build it from +The other way to install spaCy is to clone its +`GitHub repository `_ and build it from source. That is the common way if you want to make changes to the code base. - -You'll need to make sure that you have a development enviroment consisting of a -Python distribution including header files, a compiler, pip, virtualenv and git -installed. 
The compiler part is the trickiest. How to do that depends on your -system. See notes on Ubuntu, OS X and Windows for details. +You'll need to make sure that you have a development enviroment consisting of a +Python distribution including header files, a compiler, +`pip `__, `virtualenv `_ +and `git `_ installed. The compiler part is the trickiest. +How to do that depends on your system. See notes on Ubuntu, OS X and Windows for +details. .. code:: bash # make sure you are using recent pip/virtualenv versions python -m pip install -U pip virtualenv - - # find git install instructions at https://git-scm.com/downloads - git clone https://github.com/explosion/spaCy.git - + git clone https://github.com/explosion/spaCy cd spaCy - virtualenv .env && source .env/bin/activate + + virtualenv .env + source .env/bin/activate pip install -r requirements.txt pip install -e . - -Compared to regular install via pip `requirements.txt `_ -additionally installs developer dependencies such as cython. + +Compared to regular install via pip `requirements.txt `_ +additionally installs developer dependencies such as Cython. + +Instead of the above verbose commands, you can also use the following +`Fabric `_ commands: + ++---------------+--------------------------------------------------------------+ +| ``fab env`` | Create ``virtualenv`` and delete previous one, if it exists. | ++---------------+--------------------------------------------------------------+ +| ``fab make`` | Compile the source. | ++---------------+--------------------------------------------------------------+ +| ``fab clean`` | Remove compiled objects, including the generated C++. | ++---------------+--------------------------------------------------------------+ +| ``fab test`` | Run basic tests, aborting after first failure. | ++---------------+--------------------------------------------------------------+ + +All commands assume that your ``virtualenv`` is located in a directory ``.env``. 
+If you're using a different directory, you can change it via the environment +variable ``VENV_DIR``, for example: + +.. code:: bash + + VENV_DIR=".custom-env" fab clean make Ubuntu ------ @@ -211,54 +254,38 @@ Install system-level dependencies via ``apt-get``: macOS / OS X ------------ -Install a recent version of `XCode `_, -including the so-called "Command Line Tools". macOS and OS X ship with Python +Install a recent version of `XCode `_, +including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. Windows ------- Install a version of `Visual Studio Express `_ -or higher that matches the version that was used to compile your Python -interpreter. For official distributions these are VS 2008 (Python 2.7), +or higher that matches the version that was used to compile your Python +interpreter. For official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5). Run tests ========= -spaCy comes with an extensive test suite. First, find out where spaCy is -installed: +spaCy comes with an `extensive test suite `_. First, find out where +spaCy is installed: .. code:: bash - + python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" -Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow`` +Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow`` and ``--model`` are optional and enable additional tests: .. code:: bash - + # make sure you are using recent pytest version python -m pip install -U pytest python -m pytest --vectors --model --slow -Download model to custom location -================================= - -You can specify where ``spacy.en.download`` and ``spacy.de.download`` download the language model -to using the ``--data-path`` or ``-d`` argument: - -.. 
code:: bash - - python -m spacy.en.download all --data-path /some/dir - - -If you choose to download to a custom location, you will need to tell spaCy where to load the model -from in order to use it. You can do this either by calling ``spacy.util.set_data_path()`` before -calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.English`` or -``spacy.de.German`` constructors. - Changelog ========= @@ -473,10 +500,10 @@ Thanks to `@daylen `_, `@RahulKulhari `_: *German!* ------------------------------------------------------------------------------------------- -spaCy finally supports another language, in addition to English. We're lucky -to have Wolfgang Seeker on the team, and the new German model is just the -beginning. Now that there are multiple languages, you should consider loading -spaCy via the ``load()`` function. This function also makes it easier to load extra +spaCy finally supports another language, in addition to English. We're lucky +to have Wolfgang Seeker on the team, and the new German model is just the +beginning. Now that there are multiple languages, you should consider loading +spaCy via the ``load()`` function. This function also makes it easier to load extra word vector data for English: .. code:: python @@ -484,25 +511,25 @@ word vector data for English: import spacy en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors') de_nlp = spacy.load('de') - -To support use of the load function, there are also two new helper functions: -``spacy.get_lang_class`` and ``spacy.set_lang_class``. Once the German model is + +To support use of the load function, there are also two new helper functions: +``spacy.get_lang_class`` and ``spacy.set_lang_class``. Once the German model is loaded, you can use it just like the English model: .. code:: python doc = nlp(u'''Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu dem du mit deinem Wissen beitragen kannst. 
Seit Mai 2001 sind 1.936.257 Artikel in deutscher Sprache entstanden.''') - + for sent in doc.sents: print(sent.root.text, sent.root.n_lefts, sent.root.n_rights) - + # (u'ist', 1, 2) # (u'sind', 1, 3) - -The German model provides tokenization, POS tagging, sentence boundary detection, -syntactic dependency parsing, recognition of organisation, location and person -entities, and word vector representations trained on a mix of open subtitles and -Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis, + +The German model provides tokenization, POS tagging, sentence boundary detection, +syntactic dependency parsing, recognition of organisation, location and person +entities, and word vector representations trained on a mix of open subtitles and +Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis, and it doesn't yet recognise numeric entities such as numbers and dates. **Bugfixes** @@ -518,7 +545,7 @@ and it doesn't yet recognise numeric entities such as numbers and dates. 2016-03-08 `v0.100.6 `_: *Add support for GloVe vectors* ----------------------------------------------------------------------------------------------------------------- -This release offers improved support for replacing the word vectors used by spaCy. +This release offers improved support for replacing the word vectors used by spaCy. To install Stanford's GloVe vectors, trained on the Common Crawl, just run: .. code:: bash @@ -527,8 +554,8 @@ To install Stanford's GloVe vectors, trained on the Common Crawl, just run: To reduce memory usage and loading time, we've trimmed the vocabulary down to 1m entries. -This release also integrates all the code necessary for German parsing. A German model -will be released shortly. To assist in multi-lingual processing, we've added a ``load()`` +This release also integrates all the code necessary for German parsing. A German model +will be released shortly. 
To assist in multi-lingual processing, we've added a ``load()``
 function. To load the English model with the GloVe vectors:
 
 .. code:: python
diff --git a/spacy/tests/bn/__init__.py b/spacy/tests/bn/__init__.py
new file mode 100644
index 000000000..57d631c3f
--- /dev/null
+++ b/spacy/tests/bn/__init__.py
@@ -0,0 +1 @@
+# coding: utf-8
diff --git a/spacy/tests/bn/test_tokenizer.py b/spacy/tests/bn/test_tokenizer.py
new file mode 100644
index 000000000..08b9a00df
--- /dev/null
+++ b/spacy/tests/bn/test_tokenizer.py
@@ -0,0 +1,40 @@
+# encoding: utf8
+from __future__ import unicode_literals
+
+import pytest
+
+TESTCASES = []
+
+PUNCTUATION_TESTS = [
+    (u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']),
+    (u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
+    (u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
+    (u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
+]
+
+ABBREVIATIONS = [
+    (u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u'।'])
+]
+
+TESTCASES.extend(PUNCTUATION_TESTS)
+TESTCASES.extend(ABBREVIATIONS)
+
+
+@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
+def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
+    tokens = bn_tokenizer(text)
+    token_list = [token.text for token in tokens if not token.is_space]
+    assert expected_tokens == token_list
+
+
+def test_tokenizer_handles_long_text(bn_tokenizer):
+    text = 
u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
+অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
+যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
+এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ঠ পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
+আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি!
\
+কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
+নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
+
+    tokens = bn_tokenizer(text)
+    assert len(tokens) == 84
diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py
index b6dcb905a..7c6dcda1b 100644
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -11,6 +11,7 @@ from ..nl import Dutch
 from ..sv import Swedish
 from ..hu import Hungarian
 from ..fi import Finnish
+from ..bn import Bengali
 from ..tokens import Doc
 from ..strings import StringStore
 from ..lemmatizer import Lemmatizer
@@ -24,7 +25,7 @@ import pytest
 
 
 LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
-             Swedish, Hungarian, Finnish]
+             Swedish, Hungarian, Finnish, Bengali]
 
 
 @pytest.fixture(params=LANGUAGES)
@@ -73,6 +74,11 @@ def sv_tokenizer():
     return Swedish.Defaults.create_tokenizer()
 
 
+@pytest.fixture
+def bn_tokenizer():
+    return Bengali.Defaults.create_tokenizer()
+
+
 @pytest.fixture
 def stringstore():
     return StringStore()
diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py
index 822834f58..349458bce 100644
--- a/spacy/tests/tokenizer/test_tokenizer.py
+++ b/spacy/tests/tokenizer/test_tokenizer.py
@@ -31,7 +31,7 @@ def test_tokenizer_handles_punct(tokenizer):
 
 
 def test_tokenizer_handles_digits(tokenizer):
-    exceptions = ["hu"]
+    exceptions = ["hu", "bn"]
     text = "Lorem ipsum: 1984."
tokens = tokenizer(text) diff --git a/website/docs/usage/entity-recognition.jade b/website/docs/usage/entity-recognition.jade index 9fc7dddd9..210b04337 100644 --- a/website/docs/usage/entity-recognition.jade +++ b/website/docs/usage/entity-recognition.jade @@ -138,7 +138,9 @@ p +code. import spacy + import random from spacy.gold import GoldParse + from spacy.language import EntityRecognizer train_data = [ ('Who is Chaka Khan?', [(7, 17, 'PERSON')]), diff --git a/website/docs/usage/index.jade b/website/docs/usage/index.jade index 1c142592d..479635e4b 100644 --- a/website/docs/usage/index.jade +++ b/website/docs/usage/index.jade @@ -5,16 +5,46 @@ include ../../_includes/_mixins p | spaCy is compatible with #[strong 64-bit CPython 2.6+∕3.3+] and | runs on #[strong Unix/Linux], #[strong macOS/OS X] and - | #[strong Windows]. The latest spaCy releases are currently only - | available as source packages over - | #[+a("https://pypi.python.org/pypi/spacy") pip]. Installation requires a - | working build environment. See notes on + | #[strong Windows]. The latest spaCy releases are + | available over #[+a("https://pypi.python.org/pypi/spacy") pip] (source + | packages only) and #[+a("https://anaconda.org/conda-forge/spacy") conda]. + | Installation requires a working build environment. See notes on | #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") macOS/OS X] | and #[a(href="#source-windows") Windows] for details. ++h(2, "pip") pip + +p Using pip, spaCy releases are currently only available as source packages. + +code(false, "bash"). pip install -U spacy +p + | When using pip it is generally recommended to install packages in a + | #[code virtualenv] to avoid modifying system state: + ++code(false, "bash"). + virtualenv .env + source .env/bin/activate + pip install spacy + ++h(2, "conda") conda + +p + | Thanks to our great community, we've finally re-added conda support. You + | can now install spaCy via #[code conda-forge]: + ++code(false, "bash"). 
+ conda config --add channels conda-forge + conda install spacy + +p + | For the feedstock including the build recipe and configuration, check out + | #[+a("https://github.com/conda-forge/spacy-feedstock") this repository]. + | Improvements and pull requests to the recipe and setup are always appreciated. + ++h(2, "models") Download models + p | After installation you need to download a language model. Models for | English (#[code en]) and German (#[code de]) are available. @@ -36,18 +66,49 @@ p # Check whether the model was successfully installed python -c "import spacy; spacy.load('en'); print('OK')" -p The download command fetches about 1 GB of data which it installs within the #[code spacy] package directory. +p + | The download command fetches about 1 GB of data which it + | installs within the #[code spacy] package directory. + ++h(3, "custom-location") Download model to custom location + +p + | You can specify where #[code spacy.en.download] and + | #[code spacy.de.download] download the language model to using the + | #[code --data-path] or #[code -d] argument: + ++code(false, "bash"). + python -m spacy.en.download all --data-path /some/dir + +p + | If you choose to download to a custom location, you will need to tell + | spaCy where to load the model from in order to use it. You can do this + | either by calling #[code spacy.util.set_data_path()] before calling + | #[code spacy.load()], or by passing a #[code path] argument to the + | #[code spacy.en.English] or #[code spacy.de.German] constructors. + ++h(3, "models-manual") Download models manually + +p + | As of v1.6, the models and word vectors are also available as direct + | downloads from GitHub, attached to the #[+a(gh("spaCy") + "/releases") releases] as #[code .tar.gz] archives. + +p + | To install the models manually, first find the default data path. 
You can + | use #[code spacy.util.get_data_path()] to find the directory where spaCy + | will look for its models, or change the default data path with + | #[code spacy.util.set_data_path()]. Then simply unpack the archive and + | place the contained folder in that directory. You can now load the models + | via #[code spacy.load()]. +h(2, "source") Compile from source p | The other way to install spaCy is to clone its | #[+a(gh("spaCy")) GitHub repository] and build it from source. That is - | the common way if you want to make changes to the code base. - -p - | You'll need to make sure that you have a development enviroment - | consisting of a Python distribution including header files, a compiler, + | the common way if you want to make changes to the code base. You'll need to + | make sure that you have a development enviroment consisting of a Python + | distribution including header files, a compiler, | #[+a("https://pip.pypa.io/en/latest/installing/") pip], | #[+a("https://virtualenv.pypa.io/") virtualenv] and | #[+a("https://git-scm.com") git] installed. The compiler part is the @@ -55,6 +116,50 @@ p | #[a(href="#source-ubuntu") Ubuntu], #[a(href="#source-osx") OS X] and | #[a(href="#source-windows") Windows] for details. ++code(false, "bash"). + # make sure you are using recent pip/virtualenv versions + python -m pip install -U pip virtualenv + git clone #{gh("spaCy")} + cd spaCy + + virtualenv .env + source .env/bin/activate + pip install -r requirements.txt + pip install -e . + +p + | Compared to regular install via pip, #[+a(gh("spaCy", "requirements.txt")) requirements.txt] + | additionally installs developer dependencies such as Cython. + +p + | Instead of the above verbose commands, you can also use the following + | #[+a("http://www.fabfile.org/") Fabric] commands: + ++table(["Command", "Description"]) + +row + +cell #[code fab env] + +cell Create #[code virtualenv] and delete previous one, if it exists. 
+ + +row + +cell #[code fab make] + +cell Compile the source. + + +row + +cell #[code fab clean] + +cell Remove compiled objects, including the generated C++. + + +row + +cell #[code fab test] + +cell Run basic tests, aborting after first failure. + +p + | All commands assume that your #[code virtualenv] is located in a + | directory #[code .env]. If you're using a different directory, you can + | change it via the environment variable #[code VENV_DIR], for example: + ++code(false, "bash"). + VENV_DIR=".custom-env" fab clean make + +h(3, "source-ubuntu") Ubuntu p Install system-level dependencies via #[code apt-get]: @@ -67,12 +172,8 @@ p Install system-level dependencies via #[code apt-get]: p | Install a recent version of #[+a("https://developer.apple.com/xcode/") XCode], | including the so-called "Command Line Tools". macOS and OS X ship with - | Python and git preinstalled. - -p - | To compile spaCy with multi-threading support on macOS / OS X, - | #[+a("https://github.com/explosion/spaCy/issues/267") see here]. - + | Python and git preinstalled. To compile spaCy with multi-threading support + | on macOS / OS X, #[+a("https://github.com/explosion/spaCy/issues/267") see here]. +h(3, "source-windows") Windows @@ -98,8 +199,8 @@ p +h(2, "tests") Run tests p - | spaCy comes with an extensive test suite. First, find out where spaCy is - | installed: + | spaCy comes with an #[+a(gh("spacy", "spacy/tests")) extensive test suite]. + | First, find out where spaCy is installed: +code(false, "bash"). python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" @@ -114,20 +215,3 @@ p python -m pip install -U pytest python -m pytest <spacy-directory> --vectors --model --slow - -+h(2, "custom-location") Download model to custom location - -p - | You can specify where #[code spacy.en.download] and - | #[code spacy.de.download] download the language model to using the - | #[code --data-path] or #[code -d] argument: - -+code(false, "bash"). 
- python -m spacy.en.download all --data-path /some/dir - -p - | If you choose to download to a custom location, you will need to tell - | spaCy where to load the model from in order to use it. You can do this - | either by calling #[code spacy.util.set_data_path()] before calling - | #[code spacy.load()], or by passing a #[code path] argument to the - | #[code spacy.en.English] or #[code spacy.de.German] constructors.
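The "Download model to custom location" text in this diff says that a model downloaded with ``--data-path`` must be made known to spaCy by calling ``spacy.util.set_data_path()`` before ``spacy.load()``. A minimal sketch of that sequence follows, assuming the spaCy 1.x API named in the documentation text. The helpers ``check_data_dir`` and ``load_from_custom_path`` are hypothetical names introduced here for illustration and are not part of spaCy.

```python
import os


def check_data_dir(data_dir):
    """Return the absolute path of data_dir, or raise if it is not a directory.

    Hypothetical helper for illustration; not part of spaCy.
    """
    path = os.path.abspath(os.path.expanduser(data_dir))
    if not os.path.isdir(path):
        raise ValueError("model directory does not exist: {0}".format(path))
    return path


def load_from_custom_path(data_dir, lang="en"):
    """Point spaCy at data_dir, then load the model for lang.

    Assumes the spaCy 1.x functions named in the documentation text:
    spacy.util.set_data_path() and spacy.load().
    """
    # Imported lazily so the path check above can be used without spaCy.
    import spacy
    import spacy.util

    path = check_data_dir(data_dir)
    # The data path has to be set before spacy.load() is called.
    spacy.util.set_data_path(path)
    return spacy.load(lang)
```

After running ``python -m spacy.en.download all --data-path /some/dir``, calling ``load_from_custom_path('/some/dir')`` would be one way to load the English model. The documentation text also mentions passing a ``path`` argument to the ``spacy.en.English`` or ``spacy.de.German`` constructors as an alternative; that variant is not sketched here.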