mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 01:48:04 +03:00)
			
		
		
		
	Add note on languages with non-latin characters (see #996)
parent 3a9710f356
commit 2bfec1a4f8
@@ -98,6 +98,17 @@ p
     |  so that Python functions can be used to help you generalise and combine
     |  the data as you require.
 
++infobox("For languages with non-latin characters")
+    |  In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
+    |  needs to know the language's character set. If the language you're adding
+    |  uses non-latin characters, you might need to add the required character
+    |  classes to the global
+    |  #[+src(gh("spacy", "spacy/language_data/punctuation.py")) punctuation.py].
+    |  spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
+    |  to keep this simple and readable. If the language requires very specific
+    |  punctuation rules, you should consider overwriting the default regular
+    |  expressions with your own in the language's #[code Defaults].
+
 +h(3, "stop-words") Stop words
 
 p
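The note in this commit says the tokenizer can only split suffixes, prefixes and infixes if its patterns know the language's character set. The sketch below illustrates that idea only. It is not spaCy's punctuation.py and does not use spaCy's API. It uses the standard-library `re` module with explicit Unicode ranges in place of the `regex` library the note mentions, and every name in it (`LATIN_ALPHA`, `DEVANAGARI_ALPHA`, `compile_suffix_regex`, `compile_infix_regex`, `split_suffix`) is hypothetical. Devanagari is picked purely as an example of a non-Latin script.

```python
import re

# Hypothetical character classes, written as regex range fragments.
# These ranges are rough approximations chosen for illustration.
LATIN_ALPHA = "A-Za-z\u00C0-\u024F"
# Devanagari block without the danda (U+0964) and double danda (U+0965),
# which behave like sentence punctuation rather than letters.
DEVANAGARI_ALPHA = "\u0900-\u0963\u0966-\u097F"

# Punctuation to split off the end of a token, extended with the dandas.
SUFFIXES = [".", ",", "!", "?", "\u0964", "\u0965"]


def compile_suffix_regex(entries):
    """Build one pattern that matches any of the entries at the end of a string."""
    return re.compile("|".join("(?:{})$".format(re.escape(e)) for e in entries))


def compile_infix_regex(alpha):
    """Match a hyphen only when it sits between two characters of the given class."""
    return re.compile(r"(?<=[{a}])-(?=[{a}])".format(a=alpha))


def split_suffix(token, suffix_re):
    """Split a trailing suffix off the token, if one matches and text remains before it."""
    match = suffix_re.search(token)
    if match and match.start() > 0:
        return [token[:match.start()], token[match.start():]]
    return [token]


SUFFIX_RE = compile_suffix_regex(SUFFIXES)
LATIN_ONLY_INFIX_RE = compile_infix_regex(LATIN_ALPHA)
EXTENDED_INFIX_RE = compile_infix_regex(LATIN_ALPHA + DEVANAGARI_ALPHA)

# A Devanagari word followed by a danda, and two words joined by a hyphen.
word_with_danda = "\u0928\u092e\u0938\u094d\u0924\u0947\u0964"
hyphenated = "\u0930\u093e\u092e-\u0936\u094d\u092f\u093e\u092e"

print(split_suffix(word_with_danda, SUFFIX_RE))
print(LATIN_ONLY_INFIX_RE.split(hyphenated))
print(EXTENDED_INFIX_RE.split(hyphenated))
```

With the Latin-only class the hyphenated word stays in one piece, because the lookarounds never see a character they recognise as a letter. Once the Devanagari range is added to the class, the same infix pattern splits it in two.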