* Update docs

Matthew Honnibal 2014-12-30 21:20:34 +11:00
parent bb0b00f819
commit cdc1a27104
9 changed files with 6 additions and 188 deletions


@ -23,6 +23,10 @@ under spacy.en.defs.
.. automodule:: spacy.en.pos
    :members:

.. automodule:: spacy.en.attrs
    :members:
    :undoc-members:

The Tokens Class
----------------


@ -1,8 +0,0 @@
API
===

.. toctree::
    :maxdepth: 2

    tokenizers/index.rst
    lexicon.rst


@ -1,6 +0,0 @@
spacy.word.Lexeme
=================

.. autoclass:: spacy.word.Lexeme
    :members:


@ -1,94 +0,0 @@
spacy.en.EN
============

.. automodule:: spacy.en

Tokenizer API
-------------

.. automethod:: spacy.en.EN.tokenize
    :noindex:

.. automethod:: spacy.en.EN.lookup
    :noindex:

Lexeme Features Flag IDs
------------------------

A number of boolean features are computed for English Lexemes. To access a feature,
pass its ID to the :py:meth:`spacy.word.Lexeme.check_flag` function.
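
As a rough illustration of the flag-ID pattern described above, the sketch below stores boolean features as bits of a single integer and tests them by ID. It is only an assumption about how such a scheme could work: ``SketchLexeme``, its constructor, and the numeric flag values are hypothetical stand-ins, not the actual ``spacy.word.Lexeme`` implementation.

```python
# Hypothetical flag IDs; the real values are defined by spaCy and may differ.
IS_ALPHA, IS_DIGIT, IS_UPPER = range(3)


class SketchLexeme:
    """Illustrative stand-in for spacy.word.Lexeme (not the real class)."""

    def __init__(self, string, flag_ids=()):
        self.string = string
        self.flags = 0
        # Set one bit per feature that is true for this word.
        for flag_id in flag_ids:
            self.flags |= 1 << flag_id

    def check_flag(self, flag_id):
        # A feature is true when its bit is set in the packed integer.
        return bool(self.flags & (1 << flag_id))


lexeme = SketchLexeme("NASA", flag_ids=[IS_ALPHA, IS_UPPER])
print(lexeme.check_flag(IS_UPPER), lexeme.check_flag(IS_DIGIT))
```

Packing the flags into one integer keeps each lexeme small and makes a feature lookup a single bitwise operation, which is one plausible reason to expose features by ID rather than by attribute.
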
Orthographic Features
---------------------

These features describe the `orthographic` (lettering) type of the word. The
function used to compute the value is listed along with the flag.

.. data:: IS_ALPHA

    :py:func:`spacy.orth.is_alpha`

.. data:: IS_DIGIT

    :py:func:`spacy.orth.is_digit`

.. data:: IS_UPPER

    :py:func:`spacy.orth.is_upper`

.. data:: IS_PUNCT

    :py:func:`spacy.orth.is_punct`

.. data:: IS_SPACE

    :py:func:`spacy.orth.is_space`

.. data:: IS_ASCII

    :py:func:`spacy.orth.is_ascii`

.. data:: IS_TITLE

    :py:func:`spacy.orth.is_title`

.. data:: IS_LOWER

    :py:func:`spacy.orth.is_lower`
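
The pairing of each flag with a predicate function can be sketched with Python's built-in string methods. The mapping below is an assumption made for illustration: the ``spacy.orth`` functions may be implemented differently, and ``ORTH_PREDICATES``, ``orth_features``, and the two helper functions are hypothetical names.

```python
import string


def is_punct(word):
    # Assumption: every character must be ASCII punctuation.
    return bool(word) and all(ch in string.punctuation for ch in word)


def is_ascii(word):
    # True when every character falls in the 7-bit ASCII range.
    return all(ord(ch) < 128 for ch in word)


# Hypothetical mapping from flag name to the predicate that computes it.
ORTH_PREDICATES = {
    "IS_ALPHA": str.isalpha,
    "IS_DIGIT": str.isdigit,
    "IS_UPPER": str.isupper,
    "IS_PUNCT": is_punct,
    "IS_SPACE": str.isspace,
    "IS_ASCII": is_ascii,
    "IS_TITLE": str.istitle,
    "IS_LOWER": str.islower,
}


def orth_features(word):
    """Return the names of all orthographic flags that hold for ``word``."""
    return {name for name, predicate in ORTH_PREDICATES.items() if predicate(word)}


print(sorted(orth_features("Hello")))
```
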
Distributional Orthographic Features
------------------------------------

These features describe how often the lower-cased form of the word appears
in various case-styles in a large sample of English text. See :py:func:`spacy.orth.oft_case`.
.. data:: OFT_UPPER
.. data:: OFT_LOWER
.. data:: OFT_TITLE
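
One way to picture these distributional features is to count, for each lower-cased form, how often it appears in each case style, then flag a style once it accounts for a large enough share. This is a minimal sketch under stated assumptions: the function names and the ``threshold`` value of 0.5 are hypothetical, and nothing here is taken from the real ``spacy.orth.oft_case``.

```python
from collections import Counter, defaultdict


def case_style(token):
    """Classify a token as 'title', 'upper', 'lower', or 'other'."""
    if token.istitle():
        return "title"
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    return "other"


def case_counts(tokens):
    """Count case styles per lower-cased word form."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.lower()][case_style(token)] += 1
    return counts


def often_in_style(counts, word, style, threshold=0.5):
    """True if ``style`` makes up at least ``threshold`` of the word's occurrences."""
    styles = counts.get(word.lower())
    if not styles:
        return False
    return styles[style] / sum(styles.values()) >= threshold


sample = ["Paris", "Paris", "paris", "the", "The", "the", "NASA"]
counts = case_counts(sample)
print(often_in_style(counts, "paris", "title"))
```

Because the counts are keyed on the lower-cased form, the feature describes the word type rather than any single occurrence, so ``paris`` and ``Paris`` share one entry.
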
Tag Dictionary Features
-----------------------

These features describe whether the word commonly occurs with a given
part-of-speech tag in a large text corpus. The corpus was tagged with a
part-of-speech tagger designed to reduce the tag-dictionary bias of its
training corpus. See :py:func:`spacy.orth.can_tag`.

.. data:: CAN_PUNCT
.. data:: CAN_CONJ
.. data:: CAN_NUM
.. data:: CAN_DET
.. data:: CAN_ADP
.. data:: CAN_ADJ
.. data:: CAN_ADV
.. data:: CAN_VERB
.. data:: CAN_NOUN
.. data:: CAN_PDT
.. data:: CAN_POS
.. data:: CAN_PRON
.. data:: CAN_PRT


@ -1,8 +0,0 @@
Tokenizers
===================================

Each module listed here implements a different tokenization scheme, usually
intended for a specific language.

.. toctree::

    en.rst


@ -1,13 +0,0 @@
How
===
Tutorial
--------
Installation
------------
API
---


@ -79,9 +79,11 @@ you'll find NLTK etc much more expensive, because what you save on license
cost, you'll lose many times over in lost productivity. $5000 does not buy you
much developer time.

.. toctree::
    :hidden:
    :maxdepth: 3

    features.rst
    license_stories.rst
    api.rst


@ -1,31 +0,0 @@
What
====
Overview
--------
Feature List
------------
License (for the code)
----------------------
+------------------+------+
| Non-commercial | $0 |
+------------------+------+
| Trial commercial | $0 |
+------------------+------+
| Full commercial | $500 |
+------------------+------+
spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.

Pricing (for the data)
----------------------


@ -1,28 +0,0 @@
Why
===
Benchmarks
----------
Efficiency
----------
+--------+-------+--------------+--------------+
| System | Time | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK | 6m4s | 89,000 | 1.00 |
+--------+-------+--------------+--------------+
| spaCy | 9.5s | 3,093,000 | 38.30 |
+--------+-------+--------------+--------------+
Accuracy
--------
The comparison was run on 30 million words from the English Gigaword corpus, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
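
A words-per-second figure of this kind can be measured with a small timing harness that also uses whitespace splitting as the baseline. The sketch below is an assumption about how such a measurement might be set up, not the script used for the numbers reported here; ``words_per_second`` is a hypothetical helper, and the sample text is a placeholder rather than Gigaword data.

```python
import time


def words_per_second(tokenize, text):
    """Time one call to ``tokenize`` and return tokens produced per second."""
    start = time.perf_counter()
    n_words = len(tokenize(text))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    return n_words / elapsed if elapsed > 0 else float("inf")


# Placeholder text; a real benchmark would load the corpus from disk.
text = "The quick brown fox jumps over the lazy dog . " * 100000
baseline = words_per_second(str.split, text)
print("str.split baseline: {:,.0f} words/second".format(baseline))
```

Running the same harness over each tokenizer on the same text gives rates that can be divided to obtain a relative speed factor.
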
Pros and Cons
-------------