Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-25 17:36:30 +03:00

* Update docs

Parent: bb0b00f819
Commit: cdc1a27104

@ -23,6 +23,10 @@ under spacy.en.defs.
.. automodule:: spacy.en.pos
    :members:

.. automodule:: spacy.en.attrs
    :members:
    :undoc-members:

The Tokens Class
----------------

@ -1,8 +0,0 @@
API
===

.. toctree::
    :maxdepth: 2

    tokenizers/index.rst
    lexicon.rst

@ -1,6 +0,0 @@
spacy.word.Lexeme
=================


.. autoclass:: spacy.word.Lexeme
    :members:

@ -1,94 +0,0 @@
spacy.en.EN
============

.. automodule:: spacy.en

Tokenizer API
-------------

.. automethod:: spacy.en.EN.tokenize
    :noindex:

.. automethod:: spacy.en.EN.lookup
    :noindex:

Lexeme Features Flag IDs
------------------------

A number of boolean features are computed for English Lexemes. To access a feature,
pass its ID to the :py:meth:`spacy.word.Lexeme.check_flag` method.
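
As a rough illustration of the flag-ID idea, boolean features can be packed into
a single integer and read back by ID. This is only a sketch and not the library's
implementation: the ``Flag`` values, ``PREDICATES``, ``compute_flags`` and the
standalone ``check_flag`` function are hypothetical names invented here, and the
real flag IDs may differ.

```python
from enum import IntEnum


class Flag(IntEnum):
    """Hypothetical flag IDs for this sketch; the library's real values may differ."""
    IS_ALPHA = 0
    IS_DIGIT = 1
    IS_UPPER = 2
    IS_TITLE = 3


# Each flag ID is paired with the string predicate that computes it.
PREDICATES = {
    Flag.IS_ALPHA: str.isalpha,
    Flag.IS_DIGIT: str.isdigit,
    Flag.IS_UPPER: str.isupper,
    Flag.IS_TITLE: str.istitle,
}


def compute_flags(string):
    """Pack the boolean features of ``string`` into one integer, one bit per flag ID."""
    bits = 0
    for flag_id, predicate in PREDICATES.items():
        if predicate(string):
            bits |= 1 << flag_id
    return bits


def check_flag(bits, flag_id):
    """Return True if the bit for ``flag_id`` is set in ``bits``."""
    return bool(bits & (1 << flag_id))


flags = compute_flags("Hello")
print(check_flag(flags, Flag.IS_ALPHA))
print(check_flag(flags, Flag.IS_UPPER))
```

Computing the features once and storing them as bits keeps each later check down
to a shift and a mask, which is one plausible reason to expose features by ID.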

Orthographic Features
---------------------

These features describe the `orthographic` (lettering) type of the word. The
function used to compute the value is listed along with the flag.

.. data:: IS_ALPHA

    :py:func:`spacy.orth.is_alpha`

.. data:: IS_DIGIT

    :py:func:`spacy.orth.is_digit`

.. data:: IS_UPPER

    :py:func:`spacy.orth.is_upper`

.. data:: IS_PUNCT

    :py:func:`spacy.orth.is_punct`

.. data:: IS_SPACE

    :py:func:`spacy.orth.is_space`

.. data:: IS_ASCII

    :py:func:`spacy.orth.is_ascii`

.. data:: IS_TITLE

    :py:func:`spacy.orth.is_title`

.. data:: IS_LOWER

    :py:func:`spacy.orth.is_lower`

Distributional Orthographic Features
------------------------------------

These features describe how often the lower-cased form of the word appears
in various case-styles in a large sample of English text. See :py:func:`spacy.orth.oft_case`.

.. data:: OFT_UPPER
.. data:: OFT_LOWER
.. data:: OFT_TITLE
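
One way such distributional case features could be derived is sketched below.
It is an assumption-laden illustration, not the behaviour of
``spacy.orth.oft_case``: the helper names (``case_style``, ``case_distribution``,
``oft``) and the 0.5 threshold are made up for the example.

```python
from collections import Counter, defaultdict


def case_style(token):
    """Classify a token as 'upper', 'title', 'lower' or 'other'."""
    # Upper is tested first because a single capital letter is both upper and title.
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    if token.islower():
        return "lower"
    return "other"


def case_distribution(tokens):
    """Count the case styles observed for each lower-cased word form."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.lower()][case_style(token)] += 1
    return counts


def oft(counts, word, style, threshold=0.5):
    """True if ``word`` appears in ``style`` at least ``threshold`` of the time.

    The threshold is a hypothetical choice for this sketch.
    """
    seen = counts.get(word.lower())
    if not seen:
        return False
    return seen[style] / sum(seen.values()) >= threshold


sample = ["Paris", "Paris", "paris", "the", "The", "the", "the", "NASA", "NASA"]
counts = case_distribution(sample)
print(oft(counts, "paris", "title"))
print(oft(counts, "the", "title"))
```

Keying the counts on the lower-cased form is what lets one lookup answer the
question for every casing of the same word.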


Tag Dictionary Features
-----------------------

These features describe whether the word commonly occurs with a given
part-of-speech in a large text corpus. The tags are assigned by a part-of-speech
tagger designed to reduce the tag-dictionary bias of its training corpus. See
:py:func:`spacy.orth.can_tag`.

.. data:: CAN_PUNCT
.. data:: CAN_CONJ
.. data:: CAN_NUM
.. data:: CAN_DET
.. data:: CAN_ADP
.. data:: CAN_ADJ
.. data:: CAN_ADV
.. data:: CAN_VERB
.. data:: CAN_NOUN
.. data:: CAN_PDT
.. data:: CAN_POS
.. data:: CAN_PRON
.. data:: CAN_PRT
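
A tag-dictionary feature of this kind could be computed roughly as follows. This
is a minimal sketch under stated assumptions, not what ``spacy.orth.can_tag``
does: ``build_tag_dict``, the standalone ``can_tag`` and the ``min_share`` cutoff
of 0.1 are hypothetical, and the input is assumed to be already-tagged
``(word, tag)`` pairs.

```python
from collections import Counter, defaultdict


def build_tag_dict(tagged_tokens):
    """Count how often each lower-cased word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return counts


def can_tag(counts, word, tag, min_share=0.1):
    """True if ``tag`` accounts for at least ``min_share`` of the word's occurrences.

    ``min_share`` is a hypothetical cutoff chosen for this sketch.
    """
    seen = counts.get(word.lower())
    if not seen:
        return False
    return seen[tag] / sum(seen.values()) >= min_share


tagged = [("run", "VERB")] * 8 + [("run", "NOUN")] * 2 + [("the", "DET")] * 5
counts = build_tag_dict(tagged)
print(can_tag(counts, "run", "NOUN"))
print(can_tag(counts, "the", "NOUN"))
```

Using a share of occurrences rather than a raw count keeps a handful of tagging
errors on a very frequent word from switching a feature on.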
@ -1,8 +0,0 @@
Tokenizers
===================================

Each module listed here implements a different tokenization scheme, usually
intended for a specific language.

.. toctree::
    en.rst

@ -1,13 +0,0 @@
How
===

Tutorial
--------

Installation
------------

API
---

@ -79,9 +79,11 @@ you'll find NLTK etc much more expensive, because what you save on license
cost, you'll lose many times over in lost productivity. $5000 does not buy you
much developer time.


.. toctree::
    :hidden:
    :maxdepth: 3

    features.rst
    license_stories.rst
    api.rst

@ -1,31 +0,0 @@
What
====

Overview
--------

Feature List
------------

License (for the code)
----------------------

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under preparation.

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.

Pricing (for the data)
----------------------

@ -1,28 +0,0 @@
Why
===

Benchmarks
----------

Efficiency
----------

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+


Accuracy
--------

The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
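
The speed factor in the table can be reproduced from the two timings, as the
short calculation below shows. The word count used here is the rounded "30
million" from the paragraph above, which is an assumption: with that figure the
throughput lands near the tabulated values rather than exactly on them, so the
benchmark presumably used a somewhat different exact count.

```python
# Timings taken from the Efficiency table.
nltk_seconds = 6 * 60 + 4     # "6m4s"
spacy_seconds = 9.5           # "9.5s"

# Assumed word count: the rounded "30 million words" from the text.
approx_words = 30_000_000

speed_factor = nltk_seconds / spacy_seconds
nltk_wps = approx_words / nltk_seconds
spacy_wps = approx_words / spacy_seconds

print(f"speed factor: {speed_factor:.2f}")
print(f"NLTK:  {nltk_wps:,.0f} words/second")
print(f"spaCy: {spacy_wps:,.0f} words/second")
```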

Pros and Cons
-------------
