spaCy/website/src/jade/home/_installation.jade
2015-10-27 10:31:19 +01:00

213 lines
8.1 KiB
Plaintext

mixin Option(name, open)
details(open=open)
summary
h4= name
block
+Option("Updating your installation")
| To update your installation:
pre.language-bash
code
$ pip install --upgrade spacy
$ python -m spacy.en.download --force all
p Most updates ship a new model, so you will usually have to redownload the data.
+Option("conda", true)
pre.language-bash: code
| $ conda config --add channels spacy
| $ conda install spacy
| $ python -m spacy.en.download all
+Option("pip and virtualenv", true)
p With Python 2.7 or Python 3, using Linux or OSX, ensure that you have the following packages installed:
pre.language-bash: code
| build-essential python-dev
p Then run:
pre.language-bash: code
| $ pip install spacy
| $ python -m spacy.en.download all
p
| The download command fetches and installs about 400mb of data, for
| the parser model and word vectors, which it installs within the spacy.en
| package directory.
+Option("Workaround for obsolete system Python", false)
p
| If you're stuck using a server with an old version of Python, and you
| don't have root access, I've prepared a bootstrap script to help you
| compile a local Python install. Run:
pre.language-bash: code
| $ curl https://raw.githubusercontent.com/honnibal/spaCy/master/bootstrap_python_env.sh | bash && source .env/bin/activate
+Option("Compile from source", false)
p
| The other way to install the package is to clone the github repository,
| and build it from source. This installs an additional dependency,
| Cython. If you're using Python 2, I also recommend installing fabric
| and fabtools – this is how I build the project.
|
| Ensure that you have the following packages installed:
pre.language-bash: code
| build-essential python-dev git python-virtualenv
pre.language-bash: code
| $ git clone https://github.com/honnibal/spaCy.git
| $ cd spaCy
| $ virtualenv .env && source .env/bin/activate
| $ export PYTHONPATH=`pwd`
| $ pip install -r requirements.txt
| $ python setup.py build_ext --inplace
| $ python -m spacy.en.download
| $ pip install pytest
| $ py.test tests/
p
| Python packaging is awkward at the best of times, and it's particularly tricky
| with C extensions, built via Cython, requiring large data files. So,
| please report issues as you encounter them.
+Option("pypy (Unsupported)")
| If PyPy support is a priority for you, please get in touch. We could likely
| fix the remaining issues, if necessary. However, the library is likely to
| be much slower on PyPy, as it's written in Cython, which produces code tuned
| for the performance of CPython.
+Option("Windows (Unsupported)")
| Unfortunately we don't currently support Windows.
h4 What's New?
details
summary
h4 2015-10-24 v0.97: Reduce load time, bug fixes
ul
li Load the StringStore from a json list, instead of a text file. Accept a file-like object in the API instead of a path, for better flexibility.
li * Load from file, rather than path, in StringStore
li Fix bugs in download.py
li Require #[code --force] to over-write the data directory in download.py
li Fix bugs in #[code Matcher] and #[code doc.merge()]
details
summary
h4 2015-09-21 v0.93: Bug fixes to word vectors. Rename .repvec to .vector. Rename .string attribute.
ul
li Bug fixes for word vectors.
li The attribute to access word vectors was formerly named #[code token.repvec]. This has been renamed #[code token.vector]. The #[code .repvec] name is now deprecated. It will remain available until the next version.
li Add #[code .vector] attributes to #[code Doc] and #[code Span] objects, which gives the average of their word vectors.
li Add a #[code .similarity] method to #[code Token], #[code Doc], #[code Span] and #[code Lexeme] objects, that performs cosine similarity over word vectors.
li The attribute #[code .string], which gave a whitespace-padded string representation, is now renamed #[code .text_with_ws]. A new #[code .text] attribute has been added. The #[code .string] attribute is now deprecated. It will remain available until the next version.
details
summary
h4 2015-09-15 v0.90: Refactor to allow multi-lingual, better customization.
ul
li Move code out of #[code spacy.en] module, to allow new languages to be added more easily.
li New #[code Matcher] class, to support token-based expressions, for custom named entity rules.
li Users can now write to #[code Lexeme] objects, to set their own flags and string attributes. Values set on the vocabulary are inherited by tokens. This is especially effective in combination with the rule logic.
details
summary
h4 2015-07-29 v0.89: Fix Spans, efficiency
ul
li Fix regression in parse times on very long texts. Recent versions were calculating parse features in a way that was polynomial in input length.
li Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser does not assign entities to whitespace.
li Rename #[code Span.head] to #[code Span.root], fix its documentation, and make it more efficient.
details
summary
h4 2015-07-08 v0.88: Refactoring release.
ul
li If you have the data for v0.87, you don't need to redownload the data for this release.
li You can now set #[code tag=False], #[code parse=False] or #[code entity=False] when creating the pipleine, to disable some of the models. See the documentation for details.
li Models no longer lazy-loaded.
li Warning emitted when parse=True or entity=True but model not loaded.
li Rename the tokens.Tokens class to tokens.Doc. An alias has been made to assist backwards compatibility, but you should update your code to refer to the new class name.
li Various bits of internal refactoring
details
summary
h4 2015-07-01 v0.87: Memory use
ul
li Changed weights data structure. Memory use should be reduced 30-40%. Fixed speed regressions introduced in the last few versions.
li Models should now be slightly more robust to noise in the input text, as I'm now training on data with a small amount of noise added, e.g. I randomly corrupt capitalization, swap spaces for newlines, etc. This is bringing a small benefit on out-of-domain data. I think this strategy could yield better results with a better noise-generation function. If you think you have a good way to make clean text resemble the kind of noisy input you're seeing in your domain, get in touch.
details
summary
h4 2015-06-24 v0.86: Parser accuracy
p Parser now more accurate, using novel non-monotonic transition system that's currently under review.
details
summary
h4 2015-05-12 v0.85: More diverse training data
ul
li Parser produces richer dependency labels following the `ClearNLP scheme`_
li Training data now includes text from a variety of genres.
li Parser now uses more memory and the data is slightly larger, due to the additional labels. Impact on efficiency is minimal: entire process still takes <10ms per document.
p Most users should see a substantial increase in accuracy from the new model.
details
summary
h4 2015-05-12 v0.84: Bug fixes
ul
li Bug fixes for parsing
li Bug fixes for named entity recognition
details
summary
h4 2015-04-13 v0.80
p Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.
ul
li Better sentence boundary detection, drawn from the syntactic structure.
li Lots of bug fixes.
details
summary
h4 2015-03-05: v0.70
ul
li Improved parse navigation API
li Bug fixes to labelled parsing
details
summary
h4 2015-01-30: v0.4
ul
li Train statistical models on running text running text
details
summary
h4 2015-01-25: v0.33
ul
li Alpha release