spaCy/spacy/en/__init__.py

# encoding: utf8
from __future__ import unicode_literals, print_function

from os import path

from ..util import match_best_version
from ..util import get_data_path
from ..language import Language
from ..lemmatizer import Lemmatizer
from ..vocab import Vocab
from ..tokenizer import Tokenizer
from ..attrs import LANG

from .language_data import *


class English(Language):
    lang = 'en'

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'en'

        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        tag_map = TAG_MAP
        stop_words = STOP_WORDS
        lemma_rules = LEMMA_RULES

    def __init__(self, **overrides):
        # Special-case hack: set up loading of the GloVe vectors to support
        # deprecated pre-1.0 behaviour. Phase this out once the data is fixed.
        overrides = _fix_deprecated_glove_vectors_loading(overrides)
        Language.__init__(self, **overrides)


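def _fix_deprecated_glove_vectors_loading(overrides):
    """Add an 'add_vectors' loader for the GloVe vectors to `overrides`.

    Returns `overrides` unchanged if path=False, if 'add_vectors' is already
    set, or if the default GloVe package is not found. Raises IOError if an
    explicitly requested 'vectors' package cannot be found.
    """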
    if 'data_dir' in overrides and 'path' not in overrides:
        raise ValueError("The argument 'data_dir' has been renamed to 'path'")
    if overrides.get('path') is False:
        return overrides
    if overrides.get('path') in (None, True):
        data_path = get_data_path()
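    else:
        # 'path' is expected to be path-like with a .parent attribute (e.g. a
        # pathlib.Path); data packs are looked up in its parent directory.
        # The local name avoids shadowing the `path` imported from os.
        model_path = overrides['path']
        data_path = model_path.parent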
    vec_path = None
    if 'add_vectors' not in overrides:
        if 'vectors' in overrides:
            vec_path = match_best_version(overrides['vectors'], None, data_path)
            if vec_path is None:
                raise IOError(
                    'Could not load data pack %s from %s' % (overrides['vectors'], data_path))
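        else:
            # No package requested: try the default GloVe package. It is
            # optional, so a missing package is not an error here.
            vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)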
        if vec_path is not None:
            vec_path = vec_path / 'vocab' / 'vec.bin'
    if vec_path is not None:
        overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
    return overrides
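
The branch structure of `_fix_deprecated_glove_vectors_loading` can be tried in isolation with a minimal sketch like the one below. It is not part of spaCy: `resolve_vectors_path` and `find_package_by_name` are hypothetical names, and `find_package_by_name` is a simplified stand-in for `match_best_version` that only checks for a directory with the exact package name (no version matching is attempted). The sketch also assumes that wrapping `overrides['path']` in `pathlib.Path` is acceptable, so plain strings work too.

```python
from pathlib import Path

DEFAULT_VECTORS = 'en_glove_cc_300_1m_vectors'


def find_package_by_name(name, data_path):
    # Hypothetical stand-in for match_best_version: returns the package
    # directory if it exists under data_path, otherwise None.
    candidate = Path(data_path) / name
    return candidate if candidate.is_dir() else None


def resolve_vectors_path(overrides, default_data_path, find_package):
    """Return the location of vec.bin, or None if nothing should be loaded."""
    # path=False means "load no data at all".
    if overrides.get('path') is False:
        return None
    # No explicit path: use the default data directory. Otherwise look for
    # data packs next to the model directory.
    if overrides.get('path') in (None, True):
        data_path = Path(default_data_path)
    else:
        data_path = Path(overrides['path']).parent
    # A caller-supplied loader always wins.
    if 'add_vectors' in overrides:
        return None
    if 'vectors' in overrides:
        # An explicitly requested package must exist.
        package = find_package(overrides['vectors'], data_path)
        if package is None:
            raise IOError('Could not load data pack %s from %s'
                          % (overrides['vectors'], data_path))
    else:
        # The default package is optional.
        package = find_package(DEFAULT_VECTORS, data_path)
    if package is None:
        return None
    return package / 'vocab' / 'vec.bin'
```

Keeping the lookup function as a parameter makes the decision logic easy to exercise against a temporary directory without any installed data packages.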