mirror of https://github.com/explosion/spaCy.git (synced 2025-10-26 05:31:15 +03:00)

	* More index.rst fiddling
parent 9f3f07cab6
commit 69e3a07fa1

				|  | @ -8,60 +8,35 @@ spaCy: Industrial-strength NLP | ||||||
| ================================ | ================================ | ||||||
| 
 | 
 | ||||||
| spaCy is a library for industrial-strength text processing in Python and Cython. | spaCy is a library for industrial-strength text processing in Python and Cython. | ||||||
| It features extremely efficient, up-to-date algorithms, and a rethink of how those | Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of | ||||||
| algorithms should be accessed. | state-of-the-art components, a nice API, and no clutter. | ||||||
| 
 | 
 | ||||||
| A typical text-processing API looks something like this: | spaCy is particularly good for feature extraction, because it pre-loads lexical | ||||||
|  | resources, maps strings to integer IDs, and supports output of numpy arrays: | ||||||
| 
 | 
 | ||||||
|     >>> import nltk |     >>> from spacy.en import English | ||||||
|     >>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.''')) |     >>> from spacy.en import attrs | ||||||
|     [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')] |     >>> nlp = English() | ||||||
|  |     >>> tokens = nlp(u'An example sentence', pos_tag=True, parse=True) | ||||||
|  |     >>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER)) | ||||||
| 
 | 
 | ||||||
| This API often leaves you with a lot of busy-work.  If you're doing some machine | spaCy also makes it easy to add in-line mark up. Let's say you want to mark all | ||||||
| learning or information extraction, all the strings have to be mapped to integers, | adverbs in red: | ||||||
| and you have to save and load the mapping at training and runtime.  If you want |  | ||||||
| to display mark-up based on the annotation, you have to realign the tokens to your |  | ||||||
| original string. |  | ||||||
| 
 | 
 | ||||||
| I've been writing NLP systems for almost ten years now, so I've done these |     >>> from spacy.defs import ADVERB | ||||||
| things dozens of times.  When designing spaCy, I thought carefully about how to |     >>> color = lambda t: u'\033[91m' % t if t.pos == ADVERB else u'%s' | ||||||
| make the right thing easy.   |     >>> print u''.join(color(t) + unicode(t) for t in tokens) | ||||||
| 
 | 
 | ||||||
| We begin by initializing a global vocabulary store: | Tokens.__iter__ produces a sequence of Token objects.  The Token.__unicode__ | ||||||
|  | method --- invoked by unicode(t) --- pads each token with any whitespace that | ||||||
|  | followed it.  So, u''.join(unicode(t) for t in tokens) is guaranteed to restore | ||||||
|  | the original string. | ||||||
| 
 | 
 | ||||||
|     >>> from spacy.en import EN | spaCy is also very efficient --- much more efficient than any other language | ||||||
|     >>> EN.load() | processing tools available.  The table below compares the time to tokenize, POS | ||||||
|  | tag and parse 100m words of text; it also shows accuracy on the standard | ||||||
|  | evaluation, from the Wall Street Journal: | ||||||
| 
 | 
 | ||||||
| The vocabulary reads in a data file with all sorts of pre-computed lexical |  | ||||||
| features.  You can load anything you like here, but by default I give you: |  | ||||||
| 
 |  | ||||||
| * String IDs for the word's string, its prefix, suffix and "shape"; |  | ||||||
| * Length (in unicode code-points) |  | ||||||
| * A cluster ID, representing distributional similarity; |  | ||||||
| * A cluster ID, representing its typical POS tag distribution; |  | ||||||
| * Good-turing smoothed unigram probability; |  | ||||||
| * 64 boolean features, for assorted orthographic and distributional features. |  | ||||||
| 
 |  | ||||||
| With so many features pre-computed, you usually don't have to do any string |  | ||||||
| processing at all.  You give spaCy your string, and tell it to give you either |  | ||||||
| a numpy array, or a counts dictionary: |  | ||||||
| 
 |  | ||||||
|     >>> from spacy.en import feature_names as fn |  | ||||||
|     >>> tokens = EN.tokenize(u'''Some string of language.''') |  | ||||||
|     >>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER)) |  | ||||||
|     ... |  | ||||||
|     >>> tokens.count_by(fn.WORD) |  | ||||||
| 
 |  | ||||||
| If you do need strings, you can simply iterate over the Tokens object: |  | ||||||
| 
 |  | ||||||
|     >>> for token in tokens: |  | ||||||
|     ...    |  | ||||||
| 
 |  | ||||||
| I mostly use this for debugging and testing. |  | ||||||
| 
 |  | ||||||
| spaCy returns these rich Tokens objects much faster than most other tokenizers |  | ||||||
| can give you a list of strings --- in fact, spaCy's POS tagger is *4 times |  | ||||||
| faster* than CoreNLP's tokenizer: |  | ||||||
| 
 | 
 | ||||||
| +----------+----------+---------------+----------+ | +----------+----------+---------------+----------+ | ||||||
| | System   | Tokenize | POS Tag       |          | | | System   | Tokenize | POS Tag       |          | | ||||||
|  | @ -75,8 +50,16 @@ faster* than CoreNLP's tokenizer: | ||||||
| | ZPar     |          | ~1,500s       |          | | | ZPar     |          | ~1,500s       |          | | ||||||
| +----------+----------+---------------+----------+ | +----------+----------+---------------+----------+ | ||||||
| 
 | 
 | ||||||
|  | spaCy completes its whole pipeline faster than some of the other libraries can | ||||||
|  | tokenize the text.  Its POS tag accuracy is as good as any system available. | ||||||
|  | For parsing, I chose an algorithm that sacrificed some accuracy, in favour of | ||||||
|  | efficiency. | ||||||
| 
 | 
 | ||||||
| 
 | I wrote spaCy so that startups and other small companies could take advantage | ||||||
|  | of the enormous progress being made by NLP academics.  Academia is competitive, | ||||||
|  | and what you're competing to do is write papers --- so it's very hard to write | ||||||
|  | software useful to non-academics. Seeing this gap, I resigned from my post-doc, | ||||||
|  | and wrote spaCy. | ||||||
| 
 | 
 | ||||||
| .. toctree:: | .. toctree:: | ||||||
|     :hidden: |     :hidden: | ||||||
|  |  | ||||||
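
The new text in this diff says that Token.__unicode__ pads each token with the whitespace that followed it, so joining the tokens restores the original string. The sketch below illustrates that idea only; it is not spaCy's implementation. The `Token` class and `tokenize` function are hypothetical stand-ins written for Python 3 (so `__str__` plays the role of `__unicode__`), and the regex tokenizer is a deliberately crude assumption that expects input without leading whitespace.

```python
import re


class Token:
    """Hypothetical stand-in: a token that remembers its trailing whitespace."""

    def __init__(self, text, whitespace):
        self.text = text              # the token itself, e.g. "sentence"
        self.whitespace = whitespace  # whatever whitespace followed it, possibly ""

    def __str__(self):
        # Pad the token with the whitespace that followed it in the source.
        return self.text + self.whitespace


def tokenize(string):
    # Crude assumption: a token is a run of word characters or a single
    # punctuation mark; the second group captures the whitespace after it.
    # Leading whitespace in `string` is not captured by this sketch.
    pattern = r"(\w+|[^\w\s])(\s*)"
    return [Token(m.group(1), m.group(2)) for m in re.finditer(pattern, string)]


text = "An example  sentence, with odd   spacing."
tokens = tokenize(text)

# Joining the padded tokens reproduces the input exactly.
restored = "".join(str(t) for t in tokens)
```

Because every character of the input ends up in either a token or its trailing whitespace, in-line mark-up can be wrapped around `t.text` without having to realign tokens to the original string.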