spaCy/examples/training/ner_example_data/ner-token-per-line-conll2003.iob
adrianeboyd 82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
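
The column conventions described in the commit message (first column is the token, final column is the IOB tag, optional second column is the POS tag, blank lines separate sentences, `-DOCSTART-` lines separate documents) can be illustrated with a small standalone reader. This is a minimal sketch and not spaCy's `conll_ner2json` implementation; the function name `read_token_per_line` and the returned structure are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (token text, POS tag or None, IOB tag)
Token = Tuple[str, Optional[str], str]


def read_token_per_line(text: str) -> List[List[List[Token]]]:
    """Parse token-per-line NER data into documents -> sentences -> tokens.

    Hypothetical helper, written only to illustrate the format.
    """
    docs: List[List[List[Token]]] = []
    sents: List[List[Token]] = []
    sent: List[Token] = []

    for line in text.splitlines():
        cols = line.split()  # whitespace-separated columns
        if not cols:
            # A blank line closes the current sentence.
            if sent:
                sents.append(sent)
                sent = []
            continue
        if cols[0] == "-DOCSTART-":
            # A document delimiter closes the current sentence and document.
            if sent:
                sents.append(sent)
                sent = []
            if sents:
                docs.append(sents)
                sents = []
            continue
        if len(cols) < 2:
            raise ValueError(f"expected at least a token and a tag: {line!r}")
        # With three or more columns, the second one is taken as the POS tag.
        pos = cols[1] if len(cols) > 2 else None
        sent.append((cols[0], pos, cols[-1]))

    # Flush whatever is left at the end of the input.
    if sent:
        sents.append(sent)
    if sents:
        docs.append(sents)
    return docs
```

Calling `read_token_per_line(open(path, encoding="utf8").read())` on a file in this layout would return one list per document, each holding one list of `(token, pos, tag)` tuples per sentence.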

71 lines
900 B
Plaintext

-DOCSTART- -X- O O
When WRB _ O
Sebastian NNP _ B-PERSON
Thrun NNP _ I-PERSON
started VBD _ O
working VBG _ O
on IN _ O
self NN _ O
- HYPH _ O
driving VBG _ O
cars NNS _ O
at IN _ O
Google NNP _ B-ORG
in IN _ O
2007 CD _ B-DATE
, , _ O
few JJ _ O
people NNS _ O
outside RB _ O
of IN _ O
the DT _ O
company NN _ O
took VBD _ O
him PRP _ O
seriously RB _ O
. . _ O
“ '' _ O
I PRP _ O
can MD _ O
tell VB _ O
you PRP _ O
very RB _ O
senior JJ _ O
CEOs NNS _ O
of IN _ O
major JJ _ O
American JJ _ B-NORP
car NN _ O
companies NNS _ O
would MD _ O
shake VB _ O
my PRP$ _ O
hand NN _ O
and CC _ O
turn VB _ O
away RB _ O
because IN _ O
I PRP _ O
was VBD _ O
nt RB _ O
worth JJ _ O
talking VBG _ O
to IN _ O
, , _ O
” '' _ O
said VBD _ O
Thrun NNP _ B-PERSON
, , _ O
in IN _ O
an DT _ O
interview NN _ O
with IN _ O
Recode NNP _ B-ORG
earlier RBR _ B-DATE
this DT _ I-DATE
week NN _ I-DATE
. . _ O
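
The IOB tags in the final column mark entity boundaries: `B-` opens an entity, `I-` continues it, and `O` is outside any entity. As a rough sketch of how such a tag sequence maps to entity spans, the hypothetical helper below (not part of spaCy) groups tags into `(start, end, label)` triples with an exclusive end index. It assumes, leniently, that an `I-` tag with no matching open entity starts a new one.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end, label) spans, end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None

    for i, tag in enumerate(tags):
        # An I- tag with the same label as the open entity extends it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- (or a stray I-, by the lenient assumption) opens a new entity.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]

    # Close an entity that runs to the end of the sequence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans
```

For the tags of the final clause above (`said Thrun , in an interview with Recode earlier this week .`), this would yield one `PERSON` span, one `ORG` span, and a three-token `DATE` span.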