mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 09:26:27 +03:00
Add note on languages with non-latin characters (see #996)
parent 3a9710f356
commit 2bfec1a4f8
@@ -98,6 +98,17 @@ p
     | so that Python functions can be used to help you generalise and combine
     | the data as you require.
 
++infobox("For languages with non-latin characters")
+    | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
+    | needs to know the language's character set. If the language you're adding
+    | uses non-latin characters, you might need to add the required character
+    | classes to the global
+    | #[+src(gh("spacy", "spacy/language_data/punctuation.py")) punctuation.py].
+    | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
+    | to keep this simple and readable. If the language requires very specific
+    | punctuation rules, you should consider overwriting the default regular
+    | expressions with your own in the language's #[code Defaults].
+
 +h(3, "stop-words") Stop words
 
 p
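The infobox added in this commit says the tokenizer can only split suffixes for characters its character classes cover. The sketch below illustrates that point. It is not spaCy code: it uses the standard-library `re` module rather than `regex`, and every name in it (`LATIN_LOWER`, `CYRILLIC_LOWER`, `build_suffix_pattern`, `split_suffix`) is hypothetical and chosen only for this example. The real definitions in `punctuation.py` may look quite different.

```python
import re

# Hypothetical character classes, for illustration only.
LATIN_LOWER = "a-z"
CYRILLIC_LOWER = "\u0430-\u044f\u0451"  # Cyrillic lowercase letters, plus U+0451
ALPHA_LOWER = LATIN_LOWER + CYRILLIC_LOWER


def build_suffix_pattern(alpha_lower):
    """Match a trailing period only if it follows a letter from the given class."""
    return re.compile(r"(?<=[{a}])\.$".format(a=alpha_lower))


def split_suffix(token, pattern):
    """Split the matched suffix off the token, or return the token unchanged."""
    match = pattern.search(token)
    if match is None:
        return [token]
    return [token[:match.start()], token[match.start():]]


latin_only = build_suffix_pattern(LATIN_LOWER)
extended = build_suffix_pattern(ALPHA_LOWER)

cyrillic_token = "\u043c\u0438\u0440."  # a Cyrillic word followed by a period

latin_result = split_suffix("world.", latin_only)
unsplit_result = split_suffix(cyrillic_token, latin_only)
split_result = split_suffix(cyrillic_token, extended)
```

With the Latin-only class, the Cyrillic token comes back in one piece because the letter before the period is outside `a-z`. Once the class is extended, the period is split off as its own token, which is the situation the infobox describes for languages with non-latin characters.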