spaCy/spacy/lang/el/get_pos_from_wiktionary.py

def get_pos_from_wiktionary():
    import re
    from gensim.corpora.wikicorpus import extract_pages

    regex = re.compile(r"==={{(\w+)\|el}}===")
    regex2 = re.compile(r"==={{(\w+ \w+)\|el}}===")

    # Collect words from the Greek Wiktionary dump, keeping only the parts
    # of speech listed below. Section headers in the dump look like:
    # ==={{κύριο όνομα|el}}===
    expected_parts = [
        "μετοχή",
        "ρήμα",
        "επίθετο",
        "επίρρημα",
        "ουσιαστικό",
        "κύριο όνομα",
        "άρθρο",
    ]

    wiktionary_file_path = (
        "/data/gsoc2018-spacy/spacy/lang/el/res/elwiktionary-latest-pages-articles.xml"
    )

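    # Map each Greek part-of-speech label to the stem of its output file name
    # (and, upper-cased, to the name of the set written to that file).
    proper_names_dict = {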
        "ουσιαστικό": "nouns",
        "επίθετο": "adjectives",
        "άρθρο": "dets",
        "επίρρημα": "adverbs",
        "κύριο όνομα": "proper_names",
        "μετοχή": "participles",
        "ρήμα": "verbs",
    }
    # One list of page titles per expected part of speech.
    expected_parts_dict = {part: [] for part in expected_parts}

    for title, text, pageid in extract_pages(wiktionary_file_path):
        if text.startswith("#REDIRECT"):
            continue
        title = title.lower()
        all_regex = regex.findall(text)
        all_regex.extend(regex2.findall(text))
        for a in all_regex:
            if a in expected_parts:
                expected_parts_dict[a].append(title)

    for i in expected_parts_dict:
        with open("_{0}.py".format(proper_names_dict[i]), "w", encoding="utf8") as f:
            f.write("from __future__ import unicode_literals\n")
            f.write('{} = set("""\n'.format(proper_names_dict[i].upper()))
            words = sorted(expected_parts_dict[i])
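            line = ""
            to_write = []
            for word in words:
                if len(line + " " + word) > 79:
                    # Line is full: flush it and start the next line with this word.
                    to_write.append(line)
                    line = " " + word
                else:
                    line = line + " " + word
            # Flush the final, partially filled line.
            if line:
                to_write.append(line)
            f.write("\n".join(to_write))
            f.write('\n""".split())')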