mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 02:36:32 +03:00
* Update docs
This commit is contained in:
parent
bb0b00f819
commit
cdc1a27104
|
@ -23,6 +23,10 @@ under spacy.en.defs.
|
||||||
.. automodule:: spacy.en.pos
|
.. automodule:: spacy.en.pos
|
||||||
:members:
|
:members:
|
||||||
|
|
||||||
|
.. automodule:: spacy.en.attrs
|
||||||
|
:members:
|
||||||
|
:undoc-members:
|
||||||
|
|
||||||
The Tokens Class
|
The Tokens Class
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
|
|
|
@ -1,8 +0,0 @@
|
||||||
API
|
|
||||||
===
|
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 2
|
|
||||||
|
|
||||||
tokenizers/index.rst
|
|
||||||
lexicon.rst
|
|
|
@ -1,6 +0,0 @@
|
||||||
spacy.word.Lexeme
|
|
||||||
=================
|
|
||||||
|
|
||||||
|
|
||||||
.. autoclass:: spacy.word.Lexeme
|
|
||||||
:members:
|
|
|
@ -1,94 +0,0 @@
|
||||||
spacy.en.EN
|
|
||||||
============
|
|
||||||
|
|
||||||
.. automodule:: spacy.en
|
|
||||||
|
|
||||||
Tokenizer API
|
|
||||||
-------------
|
|
||||||
|
|
||||||
.. automethod:: spacy.en.EN.tokenize
|
|
||||||
:noindex:
|
|
||||||
|
|
||||||
.. automethod:: spacy.en.EN.lookup
|
|
||||||
:noindex:
|
|
||||||
|
|
||||||
Lexeme Features Flag IDs
|
|
||||||
------------------------
|
|
||||||
|
|
||||||
A number of boolean features are computed for English Lexemes. To access a feature,
|
|
||||||
pass its ID to the :py:meth:`spacy.word.Lexeme.check_flag` method.
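
As an illustration of the flag lookup described above, here is a minimal sketch of how boolean features can be packed into one integer and queried by ID. It is not spaCy's implementation: the flag values, the ``PREDICATES`` table, and the ``compute_flags`` helper are hypothetical, and the standalone ``check_flag`` function only imitates the idea behind the method of the same name.

```python
# Hypothetical flag IDs for illustration only; the real values are defined
# in the spaCy source and may differ.
IS_ALPHA, IS_DIGIT, IS_UPPER, IS_LOWER, IS_TITLE = range(5)

# Assumption: each flag is computed by a plain string predicate, standing in
# for the spacy.orth.is_* helpers.
PREDICATES = {
    IS_ALPHA: str.isalpha,
    IS_DIGIT: str.isdigit,
    IS_UPPER: str.isupper,
    IS_LOWER: str.islower,
    IS_TITLE: str.istitle,
}


def compute_flags(string):
    """Pack every boolean feature of `string` into a single integer."""
    flags = 0
    for flag_id, predicate in PREDICATES.items():
        if predicate(string):
            flags |= 1 << flag_id
    return flags


def check_flag(flags, flag_id):
    """Return True if the bit for `flag_id` is set in `flags`."""
    return bool(flags & (1 << flag_id))


flags = compute_flags("Hello")
print(check_flag(flags, IS_TITLE))  # True
print(check_flag(flags, IS_UPPER))  # False
```

Storing the features as bits keeps the per-word cost to one integer, which is one plausible reason to expose them through IDs rather than separate attributes.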
|
|
||||||
|
|
||||||
Orthographic Features
|
|
||||||
---------------------
|
|
||||||
|
|
||||||
These features describe the *orthographic* (lettering) type of the word. The
|
|
||||||
function used to compute the value is listed along with the flag.
|
|
||||||
|
|
||||||
.. data:: IS_ALPHA
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_alpha`
|
|
||||||
|
|
||||||
.. data:: IS_DIGIT
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_digit`
|
|
||||||
|
|
||||||
.. data:: IS_UPPER
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_upper`
|
|
||||||
|
|
||||||
.. data:: IS_PUNCT
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_punct`
|
|
||||||
|
|
||||||
.. data:: IS_SPACE
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_space`
|
|
||||||
|
|
||||||
.. data:: IS_ASCII
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_ascii`
|
|
||||||
|
|
||||||
.. data:: IS_TITLE
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_title`
|
|
||||||
|
|
||||||
.. data:: IS_LOWER
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_lower`
|
|
||||||
|
|
||||||
.. data:: IS_UPPER
|
|
||||||
|
|
||||||
:py:func:`spacy.orth.is_upper`
|
|
||||||
|
|
||||||
Distributional Orthographic Features
|
|
||||||
------------------------------------
|
|
||||||
|
|
||||||
These features describe how often the lower-cased form of the word appears
|
|
||||||
in various case-styles in a large sample of English text. See :py:func:`spacy.orth.oft_case`.
|
|
||||||
|
|
||||||
.. data:: OFT_UPPER
|
|
||||||
.. data:: OFT_LOWER
|
|
||||||
.. data:: OFT_TITLE
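
The distributional features above could be derived roughly as follows. This is a sketch under stated assumptions, not the behaviour of ``spacy.orth.oft_case``: the helper names, the ``"lower"``/``"upper"``/``"title"`` labels, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def case_style(word):
    """Classify a token as 'lower', 'upper', 'title' or 'other'."""
    if word.islower():
        return "lower"
    if word.isupper():
        return "upper"
    if word.istitle():
        return "title"
    return "other"


def case_stats(tokens):
    """Count, for each lower-cased form, how often each case style occurs."""
    stats = defaultdict(Counter)
    for token in tokens:
        stats[token.lower()][case_style(token)] += 1
    return stats


def oft_case(stats, word, style, threshold=0.5):
    """True if `word` appears in `style` at least `threshold` of the time.

    The threshold is an assumed value chosen for this example.
    """
    counts = stats.get(word.lower(), Counter())
    total = sum(counts.values())
    return total > 0 and counts[style] / total >= threshold


tokens = ["Apple", "apple", "Apple", "NASA", "nasa", "NASA", "NASA"]
stats = case_stats(tokens)
print(oft_case(stats, "apple", "title"))  # True: 2 of 3 occurrences
print(oft_case(stats, "nasa", "upper"))   # True: 3 of 4 occurrences
```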
|
|
||||||
|
|
||||||
|
|
||||||
Tag Dictionary Features
|
|
||||||
-----------------------
|
|
||||||
|
|
||||||
These features describe whether the word commonly occurs with a given
|
|
||||||
part-of-speech, in a large text corpus, using a part-of-speech tagger designed
|
|
||||||
to reduce the tag-dictionary bias of its training corpus. See
|
|
||||||
:py:func:`spacy.orth.can_tag`.
|
|
||||||
|
|
||||||
.. data:: CAN_PUNCT
|
|
||||||
.. data:: CAN_CONJ
|
|
||||||
.. data:: CAN_NUM
|
|
||||||
.. data:: CAN_DET
|
|
||||||
.. data:: CAN_ADP
|
|
||||||
.. data:: CAN_ADJ
|
|
||||||
.. data:: CAN_ADV
|
|
||||||
.. data:: CAN_VERB
|
|
||||||
.. data:: CAN_NOUN
|
|
||||||
.. data:: CAN_PDT
|
|
||||||
.. data:: CAN_POS
|
|
||||||
.. data:: CAN_PRON
|
|
||||||
.. data:: CAN_PRT
|
|
|
@ -1,8 +0,0 @@
|
||||||
Tokenizers
|
|
||||||
===================================
|
|
||||||
|
|
||||||
Each module listed here implements a different tokenization scheme, usually
|
|
||||||
intended for a specific language.
|
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
en.rst
|
|
|
@ -1,13 +0,0 @@
|
||||||
How
|
|
||||||
===
|
|
||||||
|
|
||||||
Tutorial
|
|
||||||
--------
|
|
||||||
|
|
||||||
Installation
|
|
||||||
------------
|
|
||||||
|
|
||||||
API
|
|
||||||
---
|
|
||||||
|
|
||||||
|
|
|
@ -79,9 +79,11 @@ you'll find NLTK etc much more expensive, because what you save on license
|
||||||
cost, you'll lose many times over in lost productivity. $5000 does not buy you
|
cost, you'll lose many times over in lost productivity. $5000 does not buy you
|
||||||
much developer time.
|
much developer time.
|
||||||
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:hidden:
|
:hidden:
|
||||||
:maxdepth: 3
|
:maxdepth: 3
|
||||||
|
|
||||||
features.rst
|
features.rst
|
||||||
license_stories.rst
|
license_stories.rst
|
||||||
|
api.rst
|
||||||
|
|
|
@ -1,31 +0,0 @@
|
||||||
What
|
|
||||||
====
|
|
||||||
|
|
||||||
Overview
|
|
||||||
--------
|
|
||||||
|
|
||||||
Feature List
|
|
||||||
------------
|
|
||||||
|
|
||||||
License (for the code)
|
|
||||||
----------------------
|
|
||||||
|
|
||||||
+------------------+------+
|
|
||||||
| Non-commercial | $0 |
|
|
||||||
+------------------+------+
|
|
||||||
| Trial commercial | $0 |
|
|
||||||
+------------------+------+
|
|
||||||
| Full commercial | $500 |
|
|
||||||
+------------------+------+
|
|
||||||
|
|
||||||
spaCy is non-free software. Its source is published, but the copyright is
|
|
||||||
retained by the author (Matthew Honnibal). Licenses are currently under preparation.
|
|
||||||
|
|
||||||
There is currently a gap between the output of academic NLP researchers and
|
|
||||||
the needs of small software companies. I left academia to try to correct this.
|
|
||||||
My idea is that non-commercial and trial commercial use should "feel" just like
|
|
||||||
free software. But, if you do use the code in a commercial product, a small
|
|
||||||
fixed license fee will apply, in order to fund development.
|
|
||||||
|
|
||||||
Pricing (for the data)
|
|
||||||
----------------------
|
|
|
@ -1,28 +0,0 @@
|
||||||
Why
|
|
||||||
===
|
|
||||||
|
|
||||||
Benchmarks
|
|
||||||
----------
|
|
||||||
|
|
||||||
Efficiency
|
|
||||||
----------
|
|
||||||
|
|
||||||
+--------+-------+--------------+--------------+
|
|
||||||
| System | Time | Words/second | Speed Factor |
|
|
||||||
+--------+-------+--------------+--------------+
|
|
||||||
| NLTK | 6m4s | 89,000 | 1.00 |
|
|
||||||
+--------+-------+--------------+--------------+
|
|
||||||
| spaCy | 9.5s | 3,093,000 | 38.30 |
|
|
||||||
+--------+-------+--------------+--------------+
|
|
||||||
|
|
||||||
|
|
||||||
Accuracy
|
|
||||||
--------
|
|
||||||
|
|
||||||
The comparison refers to 30 million words from the English Gigaword, on
|
|
||||||
a MacBook Air. For context, calling string.split() on the data completes in
|
|
||||||
about 5s.
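
A comparison of this kind can be timed with a small harness like the one below. It is only a sketch: ``time_tokenizer`` and ``speed_factor`` are hypothetical helpers, the sample text is synthetic rather than Gigaword, and ``str.split`` stands in for the baseline mentioned above. The last line recomputes the speed factor from the two times quoted in the Efficiency table (6m4s and 9.5s).

```python
import time


def time_tokenizer(tokenize, text):
    """Run `tokenize` once over `text`; return (token count, seconds)."""
    start = time.perf_counter()
    tokens = tokenize(text)
    elapsed = time.perf_counter() - start
    return len(tokens), elapsed


def speed_factor(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Synthetic sample text; a real benchmark would read the corpus from disk.
sample = "the quick brown fox jumps over the lazy dog " * 100_000
n_tokens, seconds = time_tokenizer(str.split, sample)
print(n_tokens, "tokens in", round(seconds, 3), "seconds")

# Times taken from the Efficiency table: NLTK 6m4s, spaCy 9.5s.
print(round(speed_factor(6 * 60 + 4, 9.5), 1))  # 38.3
```

A single run is noisy; repeating the measurement and taking the best time would give a steadier figure.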
|
|
||||||
|
|
||||||
Pros and Cons
|
|
||||||
-------------
|
|
||||||
|
|