//- spaCy/website/src/jade/tutorials/multilingual.jade, 2015-09-24

include ./meta.jade
include ../header.jade
+WritePost(Meta)
h3 Overview
p Each language requires its own set of definition files for the tokenizer, morphological analyzer and lemmatizer. To add a new language:
ol
li Adapt the punctuation rules if necessary. Punctuation rules are defined in separate #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/prefix.txt") prefix], #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/suffix.txt") suffix] and #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/infix.txt") infix] files. Most languages will not require many changes to these files.
li Specify tokenizer special cases. A lookup table is used to handle contractions and other fused tokens. The table is specified in a JSON file, with each entry keyed by the surface string. Its value is a list of token specifications, giving the form, lemma, part-of-speech tag and morphological features of each sub-token.
li Write lemmatization rules, and a list of exceptions. The English lemmatization rules can be seen #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/lemma_rules.json") here], and the lemmatizer that applies them #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/lemmatizer.py") here].
li Write a #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/tag_map.json") tag_map.json] file, which maps the language-specific treebank tag scheme to the #[a(href="http://universaldependencies.github.io/docs/") universal part-of-speech scheme], with additional morphological features.
li Write a #[a(href="https://github.com/honnibal/spaCy/blob/master/lang_data/en/morphs.json") morphs.json] file, which lists special cases for the morphological analyzer and lemmatizer. This file is keyed by part-of-speech tag, and then by orthographic form, so that words whose morphological features depend on their part-of-speech tag can receive the correct analysis.
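p To make the lookup-table idea from step 2 concrete, here is a sketch in Python of what a special-case entry might look like. The field names ("F" for form, "L" for lemma) mirror the English specials file, but treat the exact schema as an assumption rather than a specification:

```python
# Hypothetical special-case entry, mirroring the JSON lookup table
# described above. The key is the surface string; the value is a list
# of token specifications, one per sub-token.
special_cases = {
    "don't": [
        {"F": "do",  "L": "do",  "pos": "VB"},
        {"F": "n't", "L": "not", "pos": "RB"},
    ],
}

def tokenize_special(string):
    """Return the token specifications for a fused token, or None."""
    return special_cases.get(string)

print(tokenize_special("don't"))
```

Because the table is keyed by the exact string, a new language only needs to enumerate its contractions and fused forms; no code changes are required.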
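p The lemmatization rules in step 3 are essentially suffix-rewrite rules, consulted after an exception list. A minimal sketch, assuming rules of the form [old_suffix, new_suffix] as in the English lemma_rules.json (the toy rules and words below are illustrative, not the full English data):

```python
# Toy suffix-rewrite rules and exceptions, keyed by part of speech.
LEMMA_RULES = {
    "noun": [["ies", "y"], ["es", "e"], ["s", ""]],
    "verb": [["ing", ""], ["ed", ""], ["s", ""]],
}
EXCEPTIONS = {"noun": {"children": "child"}}

def lemmatize(string, pos):
    # Exceptions take priority over the general rewrite rules.
    if string in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][string]
    # Apply the first rule whose suffix matches.
    for old, new in LEMMA_RULES.get(pos, []):
        if string.endswith(old):
            return string[: len(string) - len(old)] + new
    return string

print(lemmatize("ponies", "noun"))    # prints "pony"
print(lemmatize("children", "noun"))  # prints "child"
```

The real lemmatizer also checks candidates against a word index, but the rule format is the part a new language needs to supply.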
h3 Tokenization algorithm
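p The tokenizer first checks the special-case table, then repeatedly strips prefix and suffix punctuation off each whitespace-delimited chunk using the rules from the files above. The real implementation also splits on infixes and caches results; the sketch below uses toy rules in place of the compiled rule files:

```python
import re

# Toy stand-ins for the prefix/suffix rules and the special-case table.
PREFIXES = re.compile(r'^[\("\']')
SUFFIXES = re.compile(r'[\)"\'\.,!?]$')
SPECIAL = {"don't": ["do", "n't"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens

def split_chunk(chunk):
    prefixes, suffixes = [], []
    while chunk:
        # Special cases are checked on every pass, so "don't!" still
        # matches once the "!" has been stripped.
        if chunk in SPECIAL:
            return prefixes + SPECIAL[chunk] + suffixes
        m = PREFIXES.search(chunk)
        if m:
            prefixes.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = SUFFIXES.search(chunk)
        if m:
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
            continue
        break
    return prefixes + ([chunk] if chunk else []) + suffixes

print(tokenize('"I don\'t!"'))
```

This design is why most languages only need data files, not code: the loop stays the same, and the prefix, suffix and special-case definitions vary per language.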
h3 Producing the Brown clusters
p See #[a(href="https://github.com/percyliang/brown-cluster") here].
h3 Producing word frequencies
p See #[a(href="https://github.com/honnibal/spaCy/blob/master/bin/get_freqs.py") here].
h3 Train tagger, dependency parser and named entity recognizer
p These require annotated data, which we typically must license.