spaCy/spacy/tests/regression/test_issue3830.py

from spacy.pipeline.pipes import DependencyParser
from spacy.vocab import Vocab

from spacy.pipeline.defaults import default_parser


def test_issue3830_no_subtok():
    """Test that the parser doesn't have the subtok label if learn_tokens=False."""
    config = {
        "learn_tokens": False,
        "min_action_freq": 30,
        "beam_width": 1,
        "beam_update_prob": 1.0,
    }
    parser = DependencyParser(Vocab(), default_parser(), **config)
    parser.add_label("nsubj")
    # The label must be absent both before and after training is set up.
    assert "subtok" not in parser.labels
    parser.begin_training(lambda: [])
    assert "subtok" not in parser.labels


def test_issue3830_with_subtok():
    """Test that the parser has the subtok label if learn_tokens=True."""
    config = {
        "learn_tokens": True,
        "min_action_freq": 30,
        "beam_width": 1,
        "beam_update_prob": 1.0,
    }
    parser = DependencyParser(Vocab(), default_parser(), **config)
    parser.add_label("nsubj")
    # The label is absent until training is set up, then present.
    assert "subtok" not in parser.labels
    parser.begin_training(lambda: [])
    assert "subtok" in parser.labels

Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188), 2019-08-23 18:54:00 +03:00
Applies to: the `DependencyParser` and `Vocab` imports, both test function definitions and docstrings, and the `add_label`, `begin_training`, and assert lines.
* Prevent subtok label if not learning tokens
  The parser introduces the subtok label to mark tokens that should be merged during post-processing. Previously this happened even if we did not have the --learn-tokens flag set. This patch passes the config through to the parser, to prevent the problem.
* Make merge_subtokens a parser post-process if learn_subtokens
* Fix train script
* Add test for 3830: subtok problem
* Fix handling of non-subtok in parser training

default models defined in component decorator (#5452), 2020-05-19 17:20:03 +03:00
Applies to: `from spacy.pipeline.defaults import default_parser`
* move defaults to pipeline and use in component decorator
* black formatting
* relative import

Default settings to configurations (#4995), 2020-02-27 20:42:27 +03:00
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring Model init's
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add .cfg to manifest fix example configs with additional arguments
* replace Morphologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build

Tidy up and auto-format, 2020-06-20 15:15:04 +03:00
Applies to: the `config = {...}` dictionaries in both tests.

train is from-config by default (#5575), 2020-06-12 03:02:07 +03:00
Applies to: `parser = DependencyParser(Vocab(), default_parser(), **config)` in both tests.
* verbose and tag_map options
* adding init_tok2vec option and only changing the tok2vec that is specified
* adding omit_extra_lookups and verifying textcat config
* wip
* pretrain bugfix
* add replace and resume options
* train_textcat fix
* raw text functionality
* improve UX when KeyError or when input data can't be parsed
* avoid unnecessary access to goldparse in TextCat pipe
* save performance information in nlp.meta
* add noise_level to config
* move nn_parser's defaults to config file
* multitask in config - doesn't work yet
* scorer offering both F and AUC options, need to be specified in config
* add textcat verification code from old train script
* small fixes to config files
* clean up
* set default config for ner/parser to allow create_pipe to work as before
* two more test fixes
* small fixes
* cleanup
* fix NER pickling + additional unit test
* create_pipe as before
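
The behaviour these two tests pin down can be illustrated without spaCy. The sketch below is only a toy model of the gating described in the #4188 commit message: `ToyParser` and `labels_after_training` are hypothetical names invented for illustration, not spaCy's implementation, and the sketch assumes only that the "subtok" label is registered during training setup when `learn_tokens` is enabled.

```python
class ToyParser:
    """Hypothetical stand-in for a parser; NOT spaCy's DependencyParser."""

    def __init__(self, learn_tokens=False):
        self.learn_tokens = learn_tokens
        self._labels = []

    @property
    def labels(self):
        # Expose labels as an immutable tuple, in insertion order.
        return tuple(self._labels)

    def add_label(self, label):
        # Ignore labels that are already registered.
        if label not in self._labels:
            self._labels.append(label)

    def begin_training(self):
        # Assumption: "subtok" is registered only when the parser is
        # configured to learn token merging.
        if self.learn_tokens:
            self.add_label("subtok")


def labels_after_training(learn_tokens):
    """Mirror the shape of the regression tests against the toy parser."""
    parser = ToyParser(learn_tokens=learn_tokens)
    parser.add_label("nsubj")
    parser.begin_training()
    return parser.labels
```

Calling `labels_after_training(False)` and `labels_after_training(True)` corresponds to the `no_subtok` and `with_subtok` tests respectively: the flag alone decides whether "subtok" appears after training setup.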