spaCy/spacy/cli/converters/conllu2json.py
adrianeboyd 392c4880d9 Restructure Example with merged sents as default (#4632)
* Switch to train_dataset() function in train CLI

* Fixes for pipe() methods in pipeline components

* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`

* Update Parser.pipe() for Examples

* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s

* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)

* Fixes to Example implementation in spacy.gold

* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`

* Head of 0 are not treated as unset

* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)

* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example

* Don't clobber `examples` variable in `iter_gold_docs()`

* Add/modify gold tests for handling projectivity

* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)

* Add test for projective train vs. nonprojective dev versions of the
same `Doc`

* Handle ignore_misaligned as arg rather than attr

Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.

Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).

* Remove unused attrs from gold.pxd

Remove `ignore_misaligned` and `make_projective` from `gold.pxd`

* Restructure Example with merged sents as default

An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.

* Input/output a single `Example.token_annotation`

* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries

* Replace `Example.merge_sents()` with `Example.split_sents()`

* Modify components to use a single `Example.token_annotation`

  * Pipeline components
  * conllu2json converter

* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather then appending/extending.

* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`

* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available

* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation

* Update tests for API changes and `merge_sents()` vs. `split_sents()`

* Refer to Example.goldparse in iter_gold_docs()

Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError

* Fix make_orth_variants()

Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.

* Add basic test for make_orth_variants()

* Replace try/except with conditionals

* Replace default morph value with set
2019-11-25 16:03:28 +01:00

146 lines
4.7 KiB
Python

# coding: utf8
from __future__ import unicode_literals
import re
from spacy.gold import Example
from ...gold import iob_to_biluo
def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_):
"""
Convert conllu files into JSON format for use with train cli.
use_morphology parameter enables appending morphology to tags, which is
useful for languages such as Spanish, where UD tags are not so rich.
Extract NER tags if available and convert them so that they follow
BILUO and the Wikipedia scheme
"""
# by @dvsrepo, via #11 explosion/spacy-dev-resources
# by @katarkor
docs = []
sentences = []
conll_data = read_conllx(input_data, use_morphology=use_morphology)
checked_for_ner = False
has_ner_tags = False
for i, example in enumerate(conll_data):
if not checked_for_ner:
has_ner_tags = is_ner(example.token_annotation.entities[0])
checked_for_ner = True
sentences.append(generate_sentence(example.token_annotation, has_ner_tags))
# Real-sized documents could be extracted using the comments on the
# conllu document
if len(sentences) % n_sents == 0:
doc = create_doc(sentences, i)
docs.append(doc)
sentences = []
return docs
def is_ner(tag):
"""
Check the 10th column of the first token to determine if the file contains
NER tags
"""
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
if tag_match:
return True
elif tag == "O":
return True
else:
return False
def read_conllx(input_data, use_morphology=False, n=0):
""" Yield example data points, one for each sentence """
i = 0
for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n")
if lines:
while lines[0].startswith("#"):
lines.pop(0)
ids, words, tags, heads, deps, ents = [], [], [], [], [], []
for line in lines:
parts = line.split("\t")
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
if "-" in id_ or "." in id_:
continue
try:
id_ = int(id_) - 1
head = (int(head) - 1) if head != "0" else id_
dep = "ROOT" if dep == "root" else dep
tag = pos if tag == "_" else tag
tag = tag + "__" + morph if use_morphology else tag
iob = iob if iob else "O"
ids.append(id_)
words.append(word)
tags.append(tag)
heads.append(head)
deps.append(dep)
ents.append(iob)
except: # noqa: E722
print(line)
raise
example = Example(doc=None)
example.set_token_annotation(ids=ids, words=words, tags=tags,
heads=heads, deps=deps, entities=ents)
yield example
i += 1
if 1 <= n <= i:
break
def simplify_tags(iob):
"""
Simplify tags obtained from the dataset in order to follow Wikipedia
scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while
'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to
'MISC'.
"""
new_iob = []
for tag in iob:
tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag)
if tag_match:
prefix = tag_match.group(1)
suffix = tag_match.group(2)
if suffix == "GPE_LOC":
suffix = "LOC"
elif suffix == "GPE_ORG":
suffix = "ORG"
elif suffix != "PER" and suffix != "LOC" and suffix != "ORG":
suffix = "MISC"
tag = prefix + "-" + suffix
new_iob.append(tag)
return new_iob
def generate_sentence(token_annotation, has_ner_tags):
sentence = {}
tokens = []
if has_ner_tags:
iob = simplify_tags(token_annotation.entities)
biluo = iob_to_biluo(iob)
for i, id in enumerate(token_annotation.ids):
token = {}
token["id"] = id
token["orth"] = token_annotation.words[i]
token["tag"] = token_annotation.tags[i]
token["head"] = token_annotation.heads[i] - id
token["dep"] = token_annotation.deps[i]
if has_ner_tags:
token["ner"] = biluo[i]
tokens.append(token)
sentence["tokens"] = tokens
return sentence
def create_doc(sentences, id):
doc = {}
paragraph = {}
doc["id"] = id
doc["paragraphs"] = []
paragraph["sentences"] = sentences
doc["paragraphs"].append(paragraph)
return doc