spaCy/spacy/lang/de/__init__.py

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
from .punctuation import TOKENIZER_INFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc


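class GermanDefaults(Language.Defaults):
    # Copy the shared getters so the German override does not mutate the base class.
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "de"
    # Merge the language-independent exceptions with the German-specific ones.
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)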
    prefixes = TOKENIZER_PREFIXES
    suffixes = TOKENIZER_SUFFIXES
    infixes = TOKENIZER_INFIXES
    stop_words = STOP_WORDS
    syntax_iterators = SYNTAX_ITERATORS
    # Interchangeable spellings for punctuation tagged "$(" (ellipses, dashes).
    single_orth_variants = [
        {"tags": ["$("], "variants": ["…", "..."]},
        {"tags": ["$("], "variants": ["-", "—", "–", "--", "---", "——"]},
    ]
    # Interchangeable opening/closing quote pairs, also tagged "$(".
    paired_orth_variants = [
        {
            "tags": ["$("],
            "variants": [("'", "'"), (",", "'"), ("‚", "‘"), ("›", "‹"), ("‹", "›")],
        },
        {
            "tags": ["$("],
            "variants": [("``", "''"), ('"', '"'), ("„", "“"), ("»", "«"), ("«", "»")],
        },
    ]


class German(Language):
    lang = "de"
    Defaults = GermanDefaults


__all__ = ["German"]
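
The defaults class in this module relies on two small patterns: it copies the inherited `lex_attr_getters` dict before overriding an entry, and it builds its tokenizer exceptions by merging a base table with a language-specific one. The sketch below illustrates both with plain Python only. Every name in it (`BaseDefaults`, `GermanLikeDefaults`, `merge_exceptions`, the sample tables) is a hypothetical stand-in, not spaCy's implementation; the real `update_exc` helper may do more than a plain merge, such as validating entries.

```python
# Hypothetical stand-ins for illustration only; none of this is spaCy code.
LANG = "lang"

BASE_EXCEPTIONS = {":)": [{"ORTH": ":)"}]}
GERMAN_EXCEPTIONS = {"z.B.": [{"ORTH": "z.B."}]}


def merge_exceptions(base, *additions):
    """Return a new dict; later tables win when keys clash."""
    merged = dict(base)  # shallow copy, so `base` itself is left untouched
    for table in additions:
        merged.update(table)
    return merged


class BaseDefaults:
    lex_attr_getters = {LANG: lambda text: "xx"}
    tokenizer_exceptions = BASE_EXCEPTIONS


class GermanLikeDefaults(BaseDefaults):
    # Copy first, so the assignment below does not leak into BaseDefaults.
    lex_attr_getters = dict(BaseDefaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "de"
    tokenizer_exceptions = merge_exceptions(BASE_EXCEPTIONS, GERMAN_EXCEPTIONS)


print(GermanLikeDefaults.lex_attr_getters[LANG]("Haus"))  # de
print(BaseDefaults.lex_attr_getters[LANG]("Haus"))        # xx
print(sorted(GermanLikeDefaults.tokenizer_exceptions))    # [':)', 'z.B.']
```

Without the `dict(...)` copy, the subclass would share one dict object with its parent, and the override would change the language code reported by every class that inherits the same getters.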