.. Commit e28b224b80 (parent e8dbac8a0c) of https://github.com/explosion/spaCy.git: Improve index docs
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================

spaCy is a library for industrial-strength text processing in Python and Cython.
It is commercial open source software, with a dual (AGPL or commercial)
license.

If you're a small company doing NLP, spaCy might seem like a minor miracle.
It's by far the fastest NLP software available.  The full processing pipeline
completes in 7ms, including state-of-the-art part-of-speech tagging and
dependency parsing.  All strings are mapped to integer IDs, tokens are linked
to word vectors and other lexical resources, and a range of useful features
are pre-calculated and cached.
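The string-to-integer mapping mentioned above can be sketched in a few lines
of plain Python.  This is an illustrative toy, not spaCy's actual
implementation; the class and method names are invented for the example:

```python
class StringStore:
    """Toy string interner: each distinct string gets a stable integer ID."""

    def __init__(self):
        self._ids = {}       # string -> ID
        self._strings = []   # ID -> string

    def id_of(self, string):
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def string_of(self, i):
        return self._strings[i]


store = StringStore()
ids = [store.id_of(w) for w in "an example sentence with an example".split()]
# Repeated words share one ID, so downstream code can work on cheap integers
# and only convert back to strings for display.
```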

If none of that made any sense to you, here's the gist of it.  Computers don't
understand text.  This is unfortunate, because that's what the web almost
entirely consists of.  We want to recommend people text based on other text
they liked.  We want to shorten text to display it on a mobile screen.  We
want to aggregate it, link it, filter it, categorise it, generate it and
correct it.

spaCy provides a set of utility functions that help programmers build such
products.  It's an NLP engine, analogous to the 3d engines commonly licensed
for game development.


Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers.  You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **mark adverbs in red**.  We'll use one of the examples he finds
particularly egregious:

    >>> import spacy.en
    >>> from spacy.enums import ADVERB
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=True)
    >>> output = ''
    >>> for tok in tokens:
    ...     # Token.string preserves whitespace, making it easy to
    ...     # reconstruct the original string.
    ...     output += tok.string.upper() if tok.is_pos(ADVERB) else tok.string
    >>> print(output)
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’


Easy enough --- but the problem is that we've also highlighted "back", when
probably we only wanted to highlight "abjectly".  This is undoubtedly an
adverb, but it's not the sort of adverb King is talking about.  This is a
persistent problem when dealing with linguistic categories: the prototypical
examples, the ones which spring to your mind, are often not the most common
cases.

There are lots of ways we might refine our logic, depending on just what words
we want to flag.  The simplest way to filter out adverbs like "back" and "not"
is by word frequency: these words are much more common than the manner adverbs
the style guides are worried about.

The ``prob`` attribute of a Lexeme or Token object gives a log probability
estimate of the word, based on smoothed counts from a 3bn word corpus:

    >>> nlp.vocab[u'back'].prob
    -7.403977394104004
    >>> nlp.vocab[u'not'].prob
    -5.407193660736084
    >>> nlp.vocab[u'quietly'].prob
    -11.07155704498291
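As a rough illustration of what "smoothed counts" means here, an add-alpha
estimate looks like the sketch below.  This is the general technique, not
spaCy's actual estimator, and the counts are invented for the example:

```python
import math

def smoothed_logprob(count, total, vocab_size, alpha=1.0):
    """Add-alpha smoothed log probability: words unseen in the corpus still
    get a small, finite probability instead of log(0)."""
    return math.log((count + alpha) / (total + alpha * vocab_size))

# Invented counts from a hypothetical 3bn-word corpus:
common = smoothed_logprob(150_000_000, 3_000_000_000, 500_000)
rare = smoothed_logprob(50, 3_000_000_000, 500_000)
unseen = smoothed_logprob(0, 3_000_000_000, 500_000)
```

Frequent words get values close to zero, rare words get large negative
values, and unseen words stay finite thanks to the smoothing term.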

So we can easily exclude the N most frequent words in English from our adverb
marker.  Let's try N=1000 for now:

    >>> import spacy.en
    >>> from spacy.enums import ADVERB
    >>> nlp = spacy.en.English()
    >>> # Find the log probability of the Nth most frequent word by sorting
    >>> # ascending, so the most frequent words come last.
    >>> probs = sorted(lex.prob for lex in nlp.vocab)
    >>> is_adverb = lambda tok: tok.is_pos(ADVERB) and tok.prob < probs[-1000]
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=True)
    >>> print(''.join(tok.string.upper() if is_adverb(tok) else tok.string))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

There are lots of ways to refine the logic, depending on just what words we
want to flag.  Let's define this narrowly, and only flag adverbs applied to
verbs of communication or cognition:

    >>> from spacy.enums import VERB, WN_V_COMMUNICATION, WN_V_COGNITION
    >>> def is_say_verb(tok):
    ...     return tok.is_pos(VERB) and (tok.check_flag(WN_V_COMMUNICATION) or
    ...                                  tok.check_flag(WN_V_COGNITION))
    >>> print(''.join(tok.string.upper() if is_adverb(tok) and is_say_verb(tok.head)
    ...               else tok.string))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’

The two flags refer to the 45 top-level categories in the WordNet ontology.
spaCy stores membership in these categories as a bit set, because words can
have multiple senses.  We only need one 64-bit flag variable per word in the
vocabulary, so this useful data requires only 2.4mb of memory (8 bytes for
each of roughly 300,000 vocabulary entries).
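The bit-set idea itself fits in a few lines of plain Python.  The flag
positions and the ``FLAGS`` table below are illustrative, not spaCy's actual
layout:

```python
# Each WordNet category gets one bit position in a single integer per word.
WN_V_COMMUNICATION = 1 << 0
WN_V_COGNITION = 1 << 1
WN_V_MOTION = 1 << 2

# A word with several senses simply has several bits set.
FLAGS = {
    "pleaded": WN_V_COMMUNICATION,
    "think": WN_V_COGNITION | WN_V_COMMUNICATION,
}

def check_flag(word, flag):
    """Membership test is a single bitwise AND against the word's flags."""
    return bool(FLAGS.get(word, 0) & flag)
```

One 64-bit integer per vocabulary entry is enough for all 45 categories, which
is where the small memory footprint comes from.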

spaCy packs all sorts of other goodies into its lexicon.  Words are mapped to
one of these rich lexical types immediately, during tokenization --- and
spaCy's tokenizer is *fast*.

Efficiency
----------

.. table:: Efficiency comparison. See `Benchmarks`_ for details.

  +--------------+---------------------------+--------------------------------+
  |              | Absolute (ms per doc)     | Relative (to spaCy)            |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | spaCy        | 0.2ms    | 1ms    | 7ms   | 1x       | 1x      | 1x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 7x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 121x      |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
  +--------------+----------+--------+-------+----------+---------+-----------+
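Per-document timings like these can be gathered with a simple harness along
the following lines.  This is a sketch, with ``str.split`` standing in for a
real pipeline call:

```python
import time

def ms_per_doc(process, docs):
    """Average wall-clock milliseconds per document for a processing callable."""
    start = time.perf_counter()
    for doc in docs:
        process(doc)
    return (time.perf_counter() - start) * 1000.0 / len(docs)

# Stand-in workload: whitespace "tokenization" over repeated documents.
docs = ["‘Give it back,’ he pleaded abjectly, ‘it’s mine.’"] * 1000
elapsed = ms_per_doc(str.split, docs)  # swap in a real pipeline to benchmark it
```

Amortizing over many documents, as in the table, averages out per-call
overhead such as warm-up and cache effects.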


Efficiency is a major concern for NLP applications.  It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large.  This is a bad position to be in.  If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually
be brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for truly web-scale processing.  And the tokenizer doesn't just give you a
list of strings.  A spaCy token is a pointer to a Lexeme struct, from which
you can access a wide range of pre-computed features.
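The "token as pointer to a shared Lexeme record" layout can be sketched like
this.  The field names are invented for illustration; the real struct lives in
Cython:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    """One record per vocabulary entry, computed once and then shared by
    every token that points at it."""
    orth_id: int
    lower: str
    is_alpha: bool

LEXICON = {}

def get_lexeme(string):
    """All tokens for the same string share one pre-computed record."""
    if string not in LEXICON:
        LEXICON[string] = Lexeme(len(LEXICON), string.lower(), string.isalpha())
    return LEXICON[string]

tokens = [get_lexeme(w) for w in "Give it back , he pleaded".split()]
```

Because the features are computed once per vocabulary entry rather than once
per token, looking them up at tokenization time is essentially free.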

.. I wrote spaCy because I think existing commercial NLP engines are crap.
   Alchemy API are a typical example.  Check out this part of their terms of
   service:
   publish or perform any benchmark or performance tests or analysis relating to
   the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that? You're not allowed to evaluate how well their system works,
   unless you're granted a special exception.  Their system must be pretty
   terrible to motivate such an embarrassing restriction.
   They must know this makes them look bad, but they apparently believe allowing
   you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy.  It's open source, and I am happy to
   clarify any detail of the algorithms I've implemented.
   It's evaluated against the current best published systems, following the
   standard methodologies.  These evaluations show that it performs extremely
   well.

Accuracy
--------

.. table:: Accuracy comparison, on the standard benchmark data from the Wall Street Journal. See `Benchmarks`_ for details.

  +--------------+----------+------------+
  | System       | POS acc. | Parse acc. |
  +--------------+----------+------------+
  | spaCy        | 97.2     | 92.4       |
  +--------------+----------+------------+
  | CoreNLP      | 96.9     | 92.2       |
  +--------------+----------+------------+
  | ZPar         | 97.3     | 92.9       |
  +--------------+----------+------------+
  | NLTK         | 94.3     | n/a        |
  +--------------+----------+------------+
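For reference, "POS acc." here is per-token tagging accuracy, which can be
computed as below; parse accuracy is the analogous score over predicted
dependency heads.  The toy tag sequences are invented:

```python
def per_token_accuracy(gold, predicted):
    """Percentage of positions where the predicted label matches the gold one."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

# Three of four invented tags match the gold standard:
acc = per_token_accuracy(["DT", "NN", "VBD", "RB"], ["DT", "NN", "VBD", "JJ"])
```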
.. toctree::
    :maxdepth: 3

    license.rst
    quickstart.rst
    features.rst
    api.rst