//- spaCy/website/src/jade/tutorials/multilingual.jade, 2015-09-24

include ./meta.jade
include ../header.jade
+WritePost(Meta)
h3 Overview
p Each language requires its own set of definition files for the tokenizer, morphological analyzer and lemmatizer. To add a new language:
ol
li Adapt the punctuation rules if necessary. Punctuation rules are defined in separate #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/prefix.txt") prefix], #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/suffix.txt") suffix] and #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/infix.txt") infix] files. Most languages will not require many changes to these files.
li Specify tokenizer special cases. A lookup table is used to handle contractions and other fused tokens. The table is specified in a JSON file, with each entry keyed by the surface string. Its value is a list of token specifications, giving the form, lemma, part-of-speech tag and morphological features of each sub-token.
li Write lemmatization rules, and a list of exceptions. The English lemmatization rules can be seen #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/lemma_rules.json") here], and the lemmatizer that applies them #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/lemmatizer.py") here].
li Write a #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json") tag_map.json] file, which maps the language-specific treebank tag scheme to the #[a(href="http://universaldependencies.github.io/docs/") universal part-of-speech scheme], with additional morphological features.
li Write a #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json") morphs.json] file, which lists special cases for the morphological analyzer and lemmatizer. This file is keyed by part-of-speech tag, and then by orthographic form, so that words whose morphological features depend on their part-of-speech tag can receive the correct analysis.
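p To make the lookup-table idea from step 2 concrete, here is a sketch in Python of what a special-case entry might look like. The field names ("F" for form, "L" for lemma) mirror the English specials file, but treat the exact schema as an assumption rather than a specification:

```python
# Hypothetical special-case entry, mirroring the JSON lookup table
# described above. The key is the surface string; the value is a list
# of token specifications, one per sub-token.
special_cases = {
    "don't": [
        {"F": "do",  "L": "do",  "pos": "VB"},
        {"F": "n't", "L": "not", "pos": "RB"},
    ],
}

def tokenize_special(string):
    """Return the token specifications for a fused token, or None."""
    return special_cases.get(string)

print(tokenize_special("don't"))
```

Because the table is keyed by the exact string, a new language only needs to enumerate its contractions and fused forms; no code changes are required.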
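p The lemmatization rules in step 3 are essentially suffix-rewrite rules, consulted after an exception list. A minimal sketch, assuming rules of the form [old_suffix, new_suffix] as in the English lemma_rules.json (the toy rules and words below are illustrative, not the full English data):

```python
# Toy suffix-rewrite rules and exceptions, keyed by part of speech.
LEMMA_RULES = {
    "noun": [["ies", "y"], ["es", "e"], ["s", ""]],
    "verb": [["ing", ""], ["ed", ""], ["s", ""]],
}
EXCEPTIONS = {"noun": {"children": "child"}}

def lemmatize(string, pos):
    # Exceptions take priority over the general rewrite rules.
    if string in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][string]
    # Apply the first rule whose suffix matches.
    for old, new in LEMMA_RULES.get(pos, []):
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    return string

print(lemmatize("ponies", "noun"))    # prints "pony"
print(lemmatize("children", "noun"))  # prints "child"
```

The real lemmatizer also checks candidates against a word index, but the rule format is the part a new language needs to supply.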
h3 Tokenization algorithm
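p The tokenizer first checks the special-case table, then repeatedly strips prefix and suffix punctuation off each whitespace-delimited chunk using the rules from the files above. The real implementation also splits on infixes and caches results; the sketch below uses toy rules in place of the compiled rule files:

```python
import re

# Toy stand-ins for the prefix/suffix rules and the special-case table.
PREFIXES = re.compile(r'^[\("\']')
SUFFIXES = re.compile(r'[\)"\'\.,!?]$')
SPECIAL = {"don't": ["do", "n't"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens

def split_chunk(chunk):
    prefixes, suffixes = [], []
    while chunk:
        # Special cases are checked on every pass, so "don't!" still
        # matches once the "!" has been stripped.
        if chunk in SPECIAL:
            return prefixes + SPECIAL[chunk] + suffixes
        m = PREFIXES.search(chunk)
        if m:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = SUFFIXES.search(chunk)
        if m:
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
            continue
        break
    return prefixes + ([chunk] if chunk else []) + suffixes

print(tokenize('"I don\'t!"'))
```

This design is why most languages only need data files, not code: the loop stays the same, and the prefix, suffix and special-case definitions vary per language.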
h3 Producing the Brown clusters
p See #[a(href="https://github.com/percyliang/brown-cluster") here].
h3 Producing word frequencies
p See #[a(href="https://github.com/honnibal/spaCy/blob/master/bin/get_freqs.py") here].
h3 Train tagger, dependency parser and named entity recognizer
p These require annotated data, which we typically must license.