spaCy/spacy/ml/models/textcat.py

from thinc.api import Model, chain, reduce_mean, Linear, list2ragged, Logistic
from thinc.api import SparseLinear, Softmax

from ...attrs import ORTH
from ...util import registry
from ..extract_ngrams import extract_ngrams


@registry.architectures.register("spacy.TextCatCNN.v1")
def build_simple_cnn_text_classifier(tok2vec, exclusive_classes, nO=None):
    """
    Build a simple CNN text classifier, given a token-to-vector model as inputs.
    If exclusive_classes=True, a softmax non-linearity is applied, so that the
    outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
    is applied instead, so that outputs are in the range [0, 1].
    """
    with Model.define_operators({">>": chain}):
        if exclusive_classes:
            output_layer = Softmax(nO=nO, nI=tok2vec.get_dim("nO"))
            model = tok2vec >> list2ragged() >> reduce_mean() >> output_layer
            model.set_ref("output_layer", output_layer)
        else:
            # TODO: experiment with init_w=zero_init
            linear_layer = Linear(nO=nO, nI=tok2vec.get_dim("nO"))
            model = (
                tok2vec >> list2ragged() >> reduce_mean() >> linear_layer >> Logistic()
            )
            model.set_ref("output_layer", linear_layer)
    model.set_ref("tok2vec", tok2vec)
    model.set_dim("nO", nO)
    return model


@registry.architectures.register("spacy.TextCatBOW.v1")
def build_bow_text_classifier(exclusive_classes, ngram_size, no_output_layer, nO=None):
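    """
    Build a bag-of-words text classifier: n-gram features extracted from the
    ORTH attribute are fed to a sparse linear layer. Unless no_output_layer
    is True, an output non-linearity is appended: softmax if
    exclusive_classes=True, logistic otherwise.
    """
    # Note: original defaults were ngram_size=1 and no_output_layer=False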
    with Model.define_operators({">>": chain}):
        model = extract_ngrams(ngram_size, attr=ORTH) >> SparseLinear(nO)
        model.to_cpu()
        if not no_output_layer:
            output_layer = Softmax(nO) if exclusive_classes else Logistic(nO)
            output_layer.to_cpu()
            model = model >> output_layer
            model.set_ref("output_layer", output_layer)
    return model