mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 20:28:20 +03:00
82159b5c19
* Updates/bugfixes for NER/IOB converters
* Converter formats `ner` and `iob` use autodetect to choose a converter if possible
* `iob2json` is reverted to handle sentence-per-line data like `word1|pos1|ent1 word2|pos2|ent2`
* Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped
* `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns
* Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag
* As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents
* Add option for segmenting sentences (new flag `-s`)
* Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model)
* Can group sentences into documents with `n_sents` as long as sentence segmentation is available
* Only applies automatic segmentation when there are no existing delimiters in the data
* Provide info about settings applied during conversion, with warnings and suggestions if settings conflict or might not be optimal
* Add tests for common formats
* Add '(default)' back to docs for -c auto
* Add document count back to output
* Revert changes to converter output message
* Use explicit tabs in convert CLI test data
* Adjust/add messages for n_sents=1 default
* Add sample NER data to training examples
* Update README
* Add links in docs to example NER data
* Define msg within converters
71 lines
900 B
Plaintext
-DOCSTART- -X- O O

When WRB _ O
Sebastian NNP _ B-PERSON
Thrun NNP _ I-PERSON
started VBD _ O
working VBG _ O
on IN _ O
self NN _ O
- HYPH _ O
driving VBG _ O
cars NNS _ O
at IN _ O
Google NNP _ B-ORG
in IN _ O
2007 CD _ B-DATE
, , _ O
few JJ _ O
people NNS _ O
outside RB _ O
of IN _ O
the DT _ O
company NN _ O
took VBD _ O
him PRP _ O
seriously RB _ O
. . _ O

“ '' _ O
I PRP _ O
can MD _ O
tell VB _ O
you PRP _ O
very RB _ O
senior JJ _ O
CEOs NNS _ O
of IN _ O
major JJ _ O
American JJ _ B-NORP
car NN _ O
companies NNS _ O
would MD _ O
shake VB _ O
my PRP$ _ O
hand NN _ O
and CC _ O
turn VB _ O
away RB _ O
because IN _ O
I PRP _ O
was VBD _ O
n’t RB _ O
worth JJ _ O
talking VBG _ O
to IN _ O
, , _ O
” '' _ O
said VBD _ O
Thrun NNP _ B-PERSON
, , _ O
in IN _ O
an DT _ O
interview NN _ O
with IN _ O
Recode NNP _ B-ORG
earlier RBR _ B-DATE
this DT _ I-DATE
week NN _ I-DATE
. . _ O
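The data above follows the column layout described in the commit message: the first column is the token, the last column is the IOB tag, the second column (when there are more than two) is the POS tag, blank lines separate sentences, and a `-DOCSTART-` line separates documents. As a rough illustration of how such a file could be read, here is a minimal Python sketch. The function name `read_conll_ner` and the nested-list return shape are assumptions made for this example; this is not spaCy's `conll_ner2json` converter and does not reproduce its output format.

```python
from typing import List, Tuple

# (token text, POS tag or "" if absent, IOB entity tag)
Token = Tuple[str, str, str]


def read_conll_ner(text: str) -> List[List[List[Token]]]:
    """Hypothetical helper: parse column-formatted NER data into a
    list of documents, each a list of sentences, each a list of tokens."""
    docs: List[List[List[Token]]] = []
    sents: List[List[Token]] = []
    sent: List[Token] = []

    def close_sentence() -> None:
        nonlocal sent
        if sent:
            sents.append(sent)
            sent = []

    def close_document() -> None:
        nonlocal sents
        close_sentence()
        if sents:
            docs.append(sents)
            sents = []

    for line in text.splitlines():
        cols = line.split()  # columns are separated by any whitespace
        if not cols:
            close_sentence()  # blank line ends the current sentence
        elif cols[0] == "-DOCSTART-":
            close_document()  # document delimiter, not a token
        else:
            # Second column is treated as POS only if there are 3+ columns.
            pos = cols[1] if len(cols) > 2 else ""
            sent.append((cols[0], pos, cols[-1]))

    close_document()  # flush whatever is left at end of input
    return docs


sample = """-DOCSTART- -X- O O

When WRB _ O
Sebastian NNP _ B-PERSON
Thrun NNP _ I-PERSON
started VBD _ O

said VBD _ O
Thrun NNP _ B-PERSON
"""

docs = read_conll_ner(sample)
```

With the short `sample` excerpt, `docs` holds one document with two sentences; each token keeps its POS tag and IOB label, so `docs[0][0][1]` is the tuple for `Sebastian`.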