spaCy/spacy/gold/converters/iob2docs.py

from wasabi import Printer

from ...gold import iob_to_biluo, tags_to_entities
from ...tokens import Doc, Span
from ...vocab import Vocab
from .util import merge_sentences
from .conll_ner2docs import n_sents_info


def iob2docs(input_data, n_sents=10, no_print=False, *args, **kwargs):
    """
    Convert IOB files with one sentence per line and tags separated with '|'
    into Doc objects so they can be saved. IOB and IOB2 are accepted.

    Sample formats:

    I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O
    I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O
    """
    msg = Printer(no_print=no_print)
    # Assumption: a fresh, empty Vocab is enough to build minimal Doc objects.
    vocab = Vocab()
    docs = read_iob(input_data.split("\n"), vocab)
    if n_sents > 0:
        n_sents_info(msg, n_sents)
        docs = merge_sentences(docs, n_sents)
    return docs


def read_iob(raw_sents, vocab):
    """Build one Doc per non-empty line of sentence-per-line IOB/IOB2 data."""
    docs = []
    for line in raw_sents:
        if not line.strip():
            continue
        # Each whitespace-separated token is word|iob or word|tag|iob.
        tokens = [t.split("|") for t in line.split()]
        if len(tokens[0]) == 3:
            words, tags, iob = zip(*tokens)
        elif len(tokens[0]) == 2:
            words, iob = zip(*tokens)
            # No POS column: use "-" as a placeholder tag.
            tags = ["-"] * len(words)
        else:
            raise ValueError(
                "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. See https://spacy.io/api/cli#convert"
            )
        doc = Doc(vocab, words=words)
        for i, tag in enumerate(tags):
            doc[i].tag_ = tag
        biluo = iob_to_biluo(iob)
        entities = tags_to_entities(biluo)
        # tags_to_entities returns inclusive end indices; Span expects an
        # exclusive end, hence e + 1.
        doc.ents = [Span(doc, start=s, end=e + 1, label=L) for (L, s, e) in entities]
        docs.append(doc)
    return docs
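
The sentence-per-line format described in the docstring can be illustrated without spaCy. The following is a minimal standalone sketch, not the converter's actual implementation: `parse_iob_line` and `iob_to_spans` are hypothetical helper names introduced here, and the span logic is a simplified stand-in for `iob_to_biluo` plus `tags_to_entities`. It assumes an `I-` tag with nothing to continue opens a new entity (plain IOB) and returns spans with an exclusive end index.

```python
def parse_iob_line(line):
    """Split one sentence-per-line IOB string into words, POS tags and IOB tags."""
    tokens = [t.split("|") for t in line.split()]
    if not tokens:
        return [], [], []
    n_cols = len(tokens[0])
    # Every token must have the same number of fields: word|iob or word|tag|iob.
    if n_cols not in (2, 3) or any(len(t) != n_cols for t in tokens):
        raise ValueError(f"Expected 2 or 3 '|'-separated fields per token: {line!r}")
    if n_cols == 3:
        words, tags, iob = (list(col) for col in zip(*tokens))
    else:
        words, iob = (list(col) for col in zip(*tokens))
        tags = ["-"] * len(words)  # placeholder when there is no POS column
    return words, tags, iob


def iob_to_spans(iob):
    """Return (label, start, end) spans with an exclusive end index.

    Handles both IOB (I- may open an entity) and IOB2 (B- opens every entity).
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(iob):
        prefix, _, tag_label = tag.partition("-")
        # Close the open entity on O, on B-, or when the label changes.
        if start is not None and (prefix != "I" or tag_label != label):
            spans.append((label, start, i))
            start, label = None, None
        # Open a new entity on B-, or on an I- that has nothing to continue.
        if prefix in ("B", "I") and start is None:
            start, label = i, tag_label
    if start is not None:
        spans.append((label, start, len(iob)))
    return spans


line = "I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O"
words, tags, iob = parse_iob_line(line)
for label, start, end in iob_to_spans(iob):
    print(label, " ".join(words[start:end]))
# prints:
# GPE London
# GPE New York City
```

Both the IOB and IOB2 variants of the sample sentence yield the same two spans, which is why the converter can accept either without a format flag.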