mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-15 20:16:23 +03:00
44 lines
1.8 KiB
Plaintext
44 lines
1.8 KiB
Plaintext
//- 💫 DOCS > API > ANNOTATION > BILUO
|
|
|
|
+table(["Tag", "Description"])
|
|
+row
|
|
+cell #[code #[span.u-color-theme B] EGIN]
|
|
+cell The first token of a multi-token entity.
|
|
|
|
+row
|
|
+cell #[code #[span.u-color-theme I] N]
|
|
+cell An inner token of a multi-token entity.
|
|
|
|
+row
|
|
+cell #[code #[span.u-color-theme L] AST]
|
|
+cell The final token of a multi-token entity.
|
|
|
|
+row
|
|
+cell #[code #[span.u-color-theme U] NIT]
|
|
+cell A single-token entity.
|
|
|
|
+row
|
|
+cell #[code #[span.u-color-theme O] UT]
|
|
+cell A non-entity token.
|
|
|
|
+aside("Why BILUO, not IOB?")
|
|
| There are several coding schemes for encoding entity annotations as
|
|
| token tags. These coding schemes are equally expressive, but not
|
|
| necessarily equally learnable.
|
|
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
|
|
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
|
|
| scheme was more difficult to learn than the #[strong BILUO] scheme that
|
|
| we use, which explicitly marks boundary tokens.
|
|
|
|
p
|
|
| spaCy translates the character offsets into this scheme, in order to
|
|
| decide the cost of each action given the current state of the entity
|
|
| recogniser. The costs are then used to calculate the gradient of the
|
|
| loss, to train the model. The exact algorithm is a pastiche of
|
|
| well-known methods, and is not currently described in any single
|
|
| publication. The model is a greedy transition-based parser guided by a
|
|
| linear model whose weights are learned using the averaged perceptron
|
|
| loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
|
|
| imitation learning strategy. The transition system is equivalent to the
|
|
| BILOU tagging scheme.
|