spaCy/spacy/en/__init__.py

# coding: utf8
from __future__ import unicode_literals, print_function

from os import path
from pathlib import Path

from ..util import match_best_version
from ..util import get_data_path
from ..language import Language
from ..lemmatizer import Lemmatizer
from ..vocab import Vocab
from ..tokenizer import Tokenizer
from ..attrs import LANG

from .language_data import *

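# Python 3 has no `basestring`; alias it to `str` so the isinstance check
# below works on both Python 2 and 3.
try:
    basestring
except NameError:
    basestring = str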


class English(Language):
    lang = 'en'

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'en'

        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        tag_map = TAG_MAP
        stop_words = STOP_WORDS


    def __init__(self, **overrides):
        # Special-case hack: load the GloVe vectors here to support the
        # deprecated pre-1.0 behaviour. Phase this out once the data is fixed.
        overrides = _fix_deprecated_glove_vectors_loading(overrides)
        Language.__init__(self, **overrides)


def _fix_deprecated_glove_vectors_loading(overrides):
    """Add an 'add_vectors' override that loads the GloVe vectors, if found.

    Returns `overrides` unchanged when vector loading is disabled
    (path=False), when the caller already supplied 'add_vectors', or when
    match_best_version finds no vectors package.
    """
    if 'data_dir' in overrides and 'path' not in overrides:
        raise ValueError("The argument 'data_dir' has been renamed to 'path'")
    if overrides.get('path') is False:
        return overrides
    if overrides.get('path') in (None, True):
        data_path = get_data_path()
    else:
        path = overrides['path']
        if isinstance(path, basestring):
            path = Path(path)
        # Search the directory that contains the given model path.
        data_path = path.parent
    vec_path = None
    if 'add_vectors' not in overrides:
        if 'vectors' in overrides:
            vec_path = match_best_version(overrides['vectors'], None, data_path)
            if vec_path is None:
                return overrides
        else:
            # No vectors named: fall back to the default GloVe package.
            vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
        if vec_path is not None:
            vec_path = vec_path / 'vocab' / 'vec.bin'
    if vec_path is not None:
        overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
    return overrides
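
The `path` handling in `_fix_deprecated_glove_vectors_loading` can be tried in isolation. The following is a minimal sketch, not part of spaCy: `resolve_data_path` and its `default_data_path` parameter are hypothetical names, with `default_data_path` standing in for whatever `get_data_path()` would return.

```python
from pathlib import Path


def resolve_data_path(overrides, default_data_path):
    """Mirror the 'path' branches of the function above, without spaCy.

    Hypothetical helper: `default_data_path` stands in for get_data_path().
    Returns None when path is False, i.e. when vector loading is skipped.
    """
    path = overrides.get('path')
    if path is False:
        return None
    if path in (None, True):
        return Path(default_data_path)
    # An explicit model path resolves to the directory that contains it.
    return Path(path).parent


if __name__ == '__main__':
    default = Path('spacy') / 'data'
    for overrides in ({}, {'path': 'models/en-1.1.0'}, {'path': False}):
        print(overrides, '->', resolve_data_path(overrides, default))
```

Because an explicit `path` is reduced to its parent, the vectors package is searched for alongside the model directory rather than inside it.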