From d6cc4d3dfe0fce697788681b2640aaa202d3fa1f Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Fri, 30 Sep 2016 14:17:23 +0200
Subject: [PATCH] Update README.rst

---
 README.rst | 361 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 315 insertions(+), 46 deletions(-)

diff --git a/README.rst b/README.rst
index bd4186fd1..7eda162b7 100644
--- a/README.rst
+++ b/README.rst
@@ -1,77 +1,235 @@

spaCy: Industrial-strength NLP
******************************

.. image:: http://i.imgur.com/jYM3iOy.png
    :target: https://spacy.io

.. image:: https://travis-ci.org/spacy-io/spaCy.svg?branch=master
    :target: https://travis-ci.org/spacy-io/spaCy

spaCy is a library for advanced natural language processing in Python and
Cython. `See here <https://spacy.io>`_ for documentation and details. spaCy is
built on the very latest research, but it isn't researchware. It was designed
from day 1 to be used in real products. It's commercial open-source software,
released under the MIT license.

Features
========

* Labelled dependency parsing (91.8% accuracy on OntoNotes 5)
* Named entity recognition (82.6% accuracy on OntoNotes 5)
* Part-of-speech tagging (97.1% accuracy on OntoNotes 5)
* Easy-to-use word vectors
* All strings mapped to integer IDs
* Export to numpy data arrays
* Alignment maintained to original string, ensuring easy mark-up calculation
* Range of easy-to-use orthographic features
* No pre-processing required. spaCy takes raw text as input, warts and newlines and all.

Top Performance
===============

* Fastest in the world: <50ms per document. No faster system has ever been announced.
* Accuracy within 1% of the current state of the art on all tasks performed (parsing, named entity recognition, part-of-speech tagging). The only more accurate systems are an order of magnitude slower or more.

Supports
========

* CPython 2.6, 2.7, 3.3, 3.4, 3.5 (only 64 bit)
* OSX
* Linux
* Windows (Cygwin, MinGW, Visual Studio)

Install spaCy
=============

spaCy is compatible with 64-bit CPython 2.6+/3.3+ and runs on Unix/Linux, OS X
and Windows. Source and binary packages are available via `pip `_ and
`conda `_. If there are no binary packages available for your platform, please
make sure that you have a working build environment set up. See the notes on
Ubuntu, OS X and Windows below for details.

**conda**

.. code:: bash

    conda config --add channels spacy  # only needed once
    conda install spacy

**pip**

When using pip it is generally recommended to install packages in a virtualenv
to avoid modifying system state:

.. code:: bash

    # make sure you are using a recent pip/virtualenv version
    python -m pip install -U pip virtualenv

    virtualenv .env
    source .env/bin/activate

    pip install spacy

Python packaging is awkward at the best of times, and it's particularly tricky
with C extensions, built via Cython, requiring large data files. So, please
report issues as you encounter them.
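As a quick sanity check after installation, and to make the feature list above
concrete, here is a minimal usage sketch. It assumes the English model from the
next section has already been downloaded; the example sentence is made up, and
the attribute names follow the 0.x API:

.. code:: python

    # minimal sketch: tokenization, tagging, parsing, NER and word vectors
    from __future__ import print_function
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'Apple is looking at buying a U.K. startup for $1 billion.')

    for token in doc:
        # token text, part-of-speech tag, dependency label and integer ID
        print(token.orth_, token.pos_, token.dep_, token.orth)

    for ent in doc.ents:
        # named entities recognised in the raw text
        print(ent.label_, ent.text)

    # word vectors are exposed as numpy arrays
    print(doc[0].vector.shape)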
Install model
=============

After installation you need to download a language model. Currently only
models for English and German, named ``en`` and ``de``, are available.

.. code:: bash

    python -m spacy.en.download
    python -m spacy.de.download
    sputnik --name spacy en_glove_cc_300_1m_vectors  # For better word vectors

Then check whether the model was successfully installed:

.. code:: bash

    python -c "import spacy; spacy.load('en'); print('OK')"

The download command fetches about 500 MB of data and installs it within the
``spacy`` package directory.

Upgrading spaCy
===============

To upgrade spaCy to the latest release:

**conda**

.. code:: bash

    conda update spacy

**pip**

.. code:: bash

    pip install -U spacy

Sometimes new releases require a new language model. In that case you will
have to upgrade the model, too. You can also force re-downloading and
installing a new language model:

.. code:: bash

    python -m spacy.en.download --force

Compile from source
===================

The other way to install spaCy is to clone its GitHub repository and build it
from source. This is the common approach if you want to make changes to the
code base.

You'll need to make sure that you have a development environment consisting of
a Python distribution including header files, a compiler, pip, virtualenv and
git installed. The compiler part is the trickiest; how to set it up depends on
your system. See the notes on Ubuntu, OS X and Windows below for details.

.. code:: bash

    # make sure you are using recent pip/virtualenv versions
    python -m pip install -U pip virtualenv

    # find git install instructions at https://git-scm.com/downloads
    git clone https://github.com/spacy-io/spaCy.git

    cd spaCy
    virtualenv .env && source .env/bin/activate
    pip install -r requirements.txt
    pip install -e .

Compared to a regular install via pip or conda, `requirements.txt `_
additionally installs developer dependencies such as Cython.

**Ubuntu**

Install system-level dependencies via ``apt-get``:

.. code:: bash

    sudo apt-get install build-essential python-dev git

**OS X**

Install a recent version of XCode, including the so-called "Command Line
Tools". OS X ships with Python and git preinstalled.

**Windows**

Install a version of Visual Studio Express or higher that matches the version
that was used to compile your Python interpreter. For official distributions
these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).

**Workaround for obsolete system Python**

If you're stuck using a system with an old version of Python, and you don't
have root access, we've prepared a bootstrap script to help you compile a
local Python install. Run:

.. code:: bash

    curl https://raw.githubusercontent.com/spacy-io/gist/master/bootstrap_python_env.sh | bash && source .env/bin/activate

**Run tests**

spaCy comes with an extensive test suite. First, find out where spaCy is
installed:

.. code:: bash

    python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
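If you are working in a Unix shell, you can optionally capture that directory
in a variable for the next step. This is just a convenience sketch, not part
of the official instructions:

.. code:: bash

    # optional: store the spaCy install location for use below
    SPACY_DIR="$(python -c 'import os, spacy; print(os.path.dirname(spacy.__file__))')"
    echo "$SPACY_DIR"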
Then run ``pytest`` on that directory. The flags ``--vectors``, ``--slow``
and ``--model`` are optional and enable additional tests:

.. code:: bash

    # make sure you are using a recent pytest version
    python -m pip install -U pytest

    python -m pytest <spacy-directory> --vectors --model --slow

API Documentation and Usage Examples
====================================

For the detailed documentation, check out the `spaCy website <https://spacy.io>`_.

* `Usage Examples `_
* `API `_
* `Annotation Specification `_
* `Tutorials `_


Changelog
=========

2016-05-10 `v0.101.0 <../../releases/tag/0.101.0>`_: *Fixed German model*
-------------------------------------------------------------------------

* Fixed bug that prevented German parses from being deprojectivised.
* Bug fixes to sentence boundary detection.
* Add rich comparison methods to the Lexeme class.
* Add missing ``Doc.has_vector`` and ``Span.has_vector`` properties.
* Add missing ``Span.sent`` property.

2016-05-05 `v0.100.7 <../../releases/tag/0.100.7>`_: *German!*
--------------------------------------------------------------

spaCy finally supports another language, in addition to English. We're lucky
to have Wolfgang Seeker on the team, and the new German model is just the
beginning. Now that there are multiple languages, you should consider loading
spaCy via the ``load()`` function. This function also makes it easier to load
extra word vector data for English:

.. code:: python

    import spacy

    en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    de_nlp = spacy.load('de')

To support use of the load function, there are also two new helper functions:
``spacy.get_lang_class`` and ``spacy.set_lang_class``. Once the German model is
loaded, you can use it just like the English model:

.. code:: python

    # (u'ist', 1, 2)
    # (u'sind', 1, 3)

The German model provides tokenization, POS tagging, sentence boundary
detection, syntactic dependency parsing, recognition of organisation, location
and person entities, and word vector representations trained on a mix of open
subtitles and Wikipedia data. It doesn't yet provide lemmatisation or
morphological analysis, and it doesn't yet recognise numeric entities such as
numbers and dates.
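As a rough illustration of these capabilities, here is a short sketch. It
assumes the ``de`` model has been downloaded; the example sentence is
illustrative only, and attribute names follow the 0.x API:

.. code:: python

    # sketch of the German pipeline: sentences, entities, tags and parse
    from __future__ import print_function
    import spacy

    de_nlp = spacy.load('de')
    doc = de_nlp(u'Angela Merkel besuchte gestern die Siemens AG in München.')

    for sent in doc.sents:
        # sentence boundary detection
        print(sent.text)

    for ent in doc.ents:
        # organisation, location and person entities
        print(ent.label_, ent.text)

    for token in doc:
        # POS tags and syntactic dependencies
        print(token.orth_, token.pos_, token.dep_, token.head.orth_)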
**Bugfixes**

* spaCy < 0.100.7 had a bug in the semantics of the ``Token.__str__`` and ``Token.__unicode__`` built-ins: they included a trailing space.
* Improve handling of "infixed" hyphens. Previously the tokenizer struggled with multiple hyphens, such as "well-to-do".
* Improve handling of periods after mixed-case tokens.
* Improve lemmatization for English special-case tokens.
* Fix bug that allowed spaces to be treated as heads in the syntactic parse.
* Fix bug that led to inconsistent sentence boundaries before and after serialisation.
* Fix bug from deserialising untagged documents.

2016-03-08 `v0.100.6 <../../releases/tag/0.100.6>`_: *Add support for GloVe vectors*
------------------------------------------------------------------------------------

This release offers improved support for replacing the word vectors used by
spaCy. To install Stanford's GloVe vectors, trained on the Common Crawl, just
run:

.. code:: bash

    sputnik --name spacy install en_glove_cc_300_1m_vectors

To reduce memory usage and loading time, we've trimmed the vocabulary down to
1m entries.

This release also integrates all the code necessary for German parsing. A
German model will be released shortly. To assist in multi-lingual processing,
we've added a ``load()`` function. To load the English model with the GloVe
vectors:

.. code:: python

    spacy.load('en', vectors='en_glove_cc_300_1m_vectors')

2016-02-07 `v0.100.5 <../../releases/tag/0.100.5>`_
---------------------------------------------------

Fix incorrect use of header file, caused by a problem with thinc.

2016-02-07 `v0.100.4 <../../releases/tag/0.100.4>`_: *Fix OSX problem introduced in 0.100.3*
--------------------------------------------------------------------------------------------

Small correction to ``right_edge`` calculation.

2016-02-06 `v0.100.3 <../../releases/tag/0.100.3>`_
---------------------------------------------------

Support multi-threading, via the ``.pipe`` method. spaCy now releases the GIL
around the parser and entity recognizer, so systems that support OpenMP should
be able to do shared memory parallelism at close to full efficiency.

We've also greatly reduced loading time, and fixed a number of bugs.
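For illustration, multi-threaded processing with ``.pipe`` looks roughly like
this. This is a sketch only; the keyword arguments shown (``batch_size``,
``n_threads``) are assumptions, so check the documentation for your version:

.. code:: python

    # rough sketch of shared-memory parallelism via .pipe
    from __future__ import print_function
    import spacy

    nlp = spacy.load('en')
    texts = [u'One document.', u'Another document.', u'Lots of documents.']

    # the GIL is released inside the parser and entity recognizer,
    # so worker threads can run in parallel (keyword arguments assumed)
    for doc in nlp.pipe(texts, batch_size=50, n_threads=4):
        print(len(doc), doc[0].pos_)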
2016-01-21 `v0.100.2 <../../releases/tag/0.100.2>`_
---------------------------------------------------

Fix data version lock that affected v0.100.1.

2016-01-21 `v0.100.1 <../../releases/tag/0.100.1>`_: *Fix install for OSX*
--------------------------------------------------------------------------

v0.100 included header files built on Linux that caused installation to fail
on OSX. This should now be corrected. We also update the default data
distribution, to include a small fix to the tokenizer.

2016-01-19 `v0.100 <../../releases/tag/0.100>`_: *Revise setup.py, better model downloads, bug fixes*
-----------------------------------------------------------------------------------------------------

* Redo setup.py, and remove the ugly ``headers_workaround`` hack. Should result in fewer install problems.
* Update data downloading and installation functionality, by migrating to the Sputnik data-package manager. This will allow us to offer finer-grained control of data installation in future.
* Fix bug when using custom entity types in ``Matcher``. This should work by default when using the ``English.__call__`` method of running the pipeline. If invoking ``Parser.__call__`` directly to do NER, you should call the ``Parser.add_label()`` method to register your entity type.
* Fix head-finding rules in ``Span``.
* Fix problem that caused ``doc.merge()`` to sometimes hang.
* Fix problems in handling of whitespace.

2015-11-08 `v0.99 <../../releases/tag/0.99>`_: *Improve span merging, internal refactoring*
--------------------------------------------------------------------------------------------

* Merging multi-word tokens into one, via the ``doc.merge()`` and ``span.merge()`` methods, no longer invalidates existing ``Span`` objects. This makes it much easier to merge multiple spans, e.g. to merge all named entities, or all base noun phrases. Thanks to @andreasgrv for help on this patch.
* Lots of internal refactoring, especially around the machine learning module, thinc. The thinc API has now been improved, and the ``spacy._ml`` wrapper module is no longer necessary.
* The lemmatizer now lower-cases non-noun, non-verb and non-adjective words.
* A new attribute, ``.rank``, is added to Token and Lexeme objects, giving the frequency rank of the word.

2015-11-03 `v0.98 <../../releases/tag/0.98>`_: *Smaller package, bug fixes*
---------------------------------------------------------------------------

* Remove binary data from the PyPI package.
* Delete archive after downloading data.
* Use updated cymem, preshed and thinc packages.
* Fix information loss in deserialize.
* Fix ``__str__`` methods for Python 2.

2015-10-23 `v0.97 <../../releases/tag/0.97>`_: *Load the StringStore from a json list, instead of a text file*
----------------------------------------------------------------------------------------------------------------

* Fix bugs in download.py.
* Require ``--force`` to over-write the data directory in download.py.
* Fix bugs in ``Matcher`` and ``doc.merge()``.

2015-10-19 `v0.96 <../../releases/tag/0.96>`_: *Hotfix to .merge method*
------------------------------------------------------------------------

* Fix bug that caused text to be lost after ``.merge``.
* Fix bug in ``Matcher`` when matched entities overlapped.

2015-10-18 `v0.95 <../../releases/tag/0.95>`_: *Bugfixes*
---------------------------------------------------------

* Reform encoding of symbols.
* Fix bugs in ``Matcher``.
* Fix bugs in ``Span``.
* Add tokenizer rule to fix numeric range tokenization.
* Add specific string-length cap in Tokenizer.
* Fix ``token.conjuncts``.

2015-10-09 `v0.94 <../../releases/tag/0.94>`_
---------------------------------------------

* Fix memory error that caused crashes on 32-bit platforms.
* Fix parse errors caused by smart quotes and em-dashes.

2015-09-22 `v0.93 <../../releases/tag/0.93>`_
---------------------------------------------

Bug fixes to word vectors.