mirror of https://github.com/explosion/spaCy.git, synced 2025-11-04 09:57:26 +03:00

Lexeme Features
===============

A lexeme is an entry in the lexicon (the vocabulary) for a word, punctuation
symbol, whitespace unit, etc.  Lexemes come with a lot of pre-computed
information that helps you write good feature functions.  Features are
integer-valued where possible: spaCy stores each string once and refers to it
by a consecutive ID number, which you can use to look up the string value if
necessary.
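
The string-to-ID mapping described above can be pictured with a small sketch.
This is only an illustration of the idea of consecutive integer IDs;
``ToyStringTable`` and its methods are hypothetical names and are not spaCy's
actual API.

```python
class ToyStringTable:
    """Hypothetical string <-> integer ID table (not spaCy's real class)."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def intern(self, string):
        """Return the ID for `string`, assigning the next free ID if it is new."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Recover the string value for a previously assigned ID."""
        return self._strings[string_id]


table = ToyStringTable()
ids = [table.intern(word) for word in ["the", "cat", "the"]]
```

A feature function can then work on the integers in ``ids`` and call
``table.lookup`` only when the string value is actually needed.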

String features
---------------

+---------+-------------------------------------------------------------------+
| SIC     | The word as it appeared in the sentence, unaltered.               |
+---------+-------------------------------------------------------------------+
| NORM    | For frequent words, case normalization is applied.                |
|         | Otherwise, back off to SHAPE.                                     |
+---------+-------------------------------------------------------------------+
| SHAPE   | Remap the characters of the word as follows:                      |
|         |                                                                   |
|         | a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*|
|         |                                                                   |
|         | Trim runs of the same shape character longer than 3 to 3, e.g.    |
|         |                                                                   |
|         | apples --> xxx, Apples --> Xxxx, app9LES@ --> xxxdXXX*            |
+---------+-------------------------------------------------------------------+
| ASCIIED | Use unidecode.unidecode(sic) to approximate the word using only   |
|         | ASCII characters.                                                 |
+---------+-------------------------------------------------------------------+
| PREFIX  | sic_unicode_string[:1]                                            |
+---------+-------------------------------------------------------------------+
| SUFFIX  | sic_unicode_string[-3:]                                           |
+---------+-------------------------------------------------------------------+
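
The SHAPE, PREFIX and SUFFIX rows can be sketched in plain Python.  This is a
minimal sketch that follows the rules as written in the table, not spaCy's own
implementation; the function names and the ``max_run`` parameter are
assumptions made for illustration.

```python
# Punctuation characters that the SHAPE rule maps to themselves.
KEPT_PUNCT = set(",.;:\"'?!$-")


def char_shape(ch):
    """Map one character to its shape symbol, following the table's rule."""
    if "a" <= ch <= "z":
        return "x"
    if "A" <= ch <= "Z":
        return "X"
    if "0" <= ch <= "9":
        return "d"
    if ch in KEPT_PUNCT:
        return ch
    return "*"


def word_shape(word, max_run=3):
    """Remap every character, keeping at most `max_run` of the same symbol in a row."""
    shape = []
    previous = None
    run = 0
    for ch in word:
        symbol = char_shape(ch)
        run = run + 1 if symbol == previous else 1
        previous = symbol
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


def prefix(word):
    """First character of the word (empty string for an empty word)."""
    return word[:1]


def suffix(word):
    """Last three characters of the word (the whole word if it is shorter)."""
    return word[-3:]
```

Under these rules ``word_shape("apples")`` gives ``xxx`` and
``word_shape("Apples")`` gives ``Xxxx``; the digit in ``app9LES@`` maps to
``d`` like any other digit.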

Integer features
----------------

+--------------+--------------------------------------------------------------+
| LENGTH       | Length of the string, in Unicode characters                  |
+--------------+--------------------------------------------------------------+
| CLUSTER      | Brown cluster                                                |
+--------------+--------------------------------------------------------------+
| POS_TYPE     | K-means cluster of word's tag affinities                     |
+--------------+--------------------------------------------------------------+
| SENSE_TYPE   | K-means cluster of word's sense affinities                   |
+--------------+--------------------------------------------------------------+
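
As a small illustration of LENGTH, the sketch below contrasts the length of a
Unicode string with the length of its UTF-8 encoding.  It assumes Python 3 and
assumes that "in unicode" means counting code points rather than bytes; the
variable names are for illustration only.

```python
word = "na\u00efve"  # "naive" spelled with U+00EF (i with diaeresis)

# Length as a Unicode string: one unit per code point.
length = len(word)

# Length of the UTF-8 byte encoding: the non-ASCII character takes two bytes.
byte_length = len(word.encode("utf-8"))
```

Here ``length`` is 5 while ``byte_length`` is 6, so the two only agree for
pure-ASCII words.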

Boolean features
----------------

+-------------+--------------------------------------------------------------+
| IS_ALPHA    | The result of sic.isalpha()                                  |
+-------------+--------------------------------------------------------------+
| IS_ASCII    | Check whether all the word's characters are ASCII characters |
+-------------+--------------------------------------------------------------+
| IS_DIGIT    | The result of sic.isdigit()                                  |
+-------------+--------------------------------------------------------------+
| IS_LOWER    | The result of sic.islower()                                  |
+-------------+--------------------------------------------------------------+
| IS_PUNCT    | Check whether all characters are in the class TODO           |
+-------------+--------------------------------------------------------------+
| IS_SPACE    | The result of sic.isspace()                                  |
+-------------+--------------------------------------------------------------+
| IS_TITLE    | The result of sic.istitle()                                  |
+-------------+--------------------------------------------------------------+
| IS_UPPER    | The result of sic.isupper()                                  |
+-------------+--------------------------------------------------------------+
| LIKE_URL    | Check whether the string looks like it could be a URL.  Aims |
|             | for a low false negative rate.                               |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric    |
|             | entity, e.g. 10,000, 10th or .10.  Aims for a low false      |
|             | negative rate.                                               |
+-------------+--------------------------------------------------------------+
| IN_LIST     | Facility for loading arbitrary run-time word lists?          |
+-------------+--------------------------------------------------------------+
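
Most of the flags above map directly onto Python string methods, as the sketch
below shows.  The ``like_number`` and ``like_url`` heuristics are assumptions
made up for this illustration and are not spaCy's actual rules; IS_PUNCT and
IN_LIST are left out because the table does not specify them yet.

```python
import re

# Hypothetical pattern: optional sign, digits with commas, optional decimal
# part, optional ordinal suffix.  Covers forms such as 10,000, 10th and .10
_NUMBER_RE = re.compile(r"^[+-]?(\d[\d,]*)?(\.\d+)?(st|nd|rd|th)?$")

# Hypothetical hints used to guess that a token is a URL.
_URL_PREFIXES = ("http://", "https://", "www.")
_COMMON_TLDS = (".com", ".org", ".net", ".edu", ".gov")


def is_ascii(word):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in word)


def like_number(word):
    """Loose check: needs at least one digit and must fit the pattern above."""
    has_digit = any(ch.isdigit() for ch in word)
    return has_digit and _NUMBER_RE.match(word) is not None


def like_url(word):
    """Loose check: a known prefix, or a host part ending in a common TLD."""
    lower = word.lower()
    if lower.startswith(_URL_PREFIXES):
        return True
    host = lower.split("/", 1)[0]
    return host.endswith(_COMMON_TLDS)


def boolean_features(sic):
    """Compute the flags for the unaltered word `sic`."""
    return {
        "IS_ALPHA": sic.isalpha(),
        "IS_ASCII": is_ascii(sic),
        "IS_DIGIT": sic.isdigit(),
        "IS_LOWER": sic.islower(),
        "IS_SPACE": sic.isspace(),
        "IS_TITLE": sic.istitle(),
        "IS_UPPER": sic.isupper(),
        "LIKE_URL": like_url(sic),
        "LIKE_NUMBER": like_number(sic),
    }
```

Both heuristics are deliberately permissive, in keeping with the stated aim of
a low false negative rate: a token that merely resembles a number or a URL is
accepted.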