Add BILUO scheme to annotation docs

2025-07-15 18:52:29 +03:00 · 2017-05-21 13:53:34 +02:00 · 2017-05-21 13:53:34 +02:00 · 465a1dd710
commit 465a1dd710
parent 99b631617d
1 changed file with 38 additions and 0 deletions
--- a/website/docs/api/annotation.jade
+++ b/website/docs/api/annotation.jade
@@ -71,6 +71,44 @@ include _annotation/_dep-labels

 include _annotation/_named-entities

++h(3, "biluo") BILUO Scheme
+
+p
+    |  spaCy translates the character offsets of annotated entities into the
+    |  BILUO scheme in order to decide the cost of each action given the
+    |  current state of the entity recognizer. The costs are then used to
+    |  calculate the gradient of the loss, which is used to train the model.
+
++aside("Why BILUO, not IOB?")
+    |  There are several coding schemes for encoding entity annotations as
+    |  token tags. These coding schemes are equally expressive, but not
+    |  necessarily equally learnable.
+    |  #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
+    |  showed that the minimal #[strong Begin], #[strong In], #[strong Out]
+    |  (IOB) scheme was more difficult to learn than the #[strong BILUO]
+    |  scheme that we use, which explicitly marks boundary tokens.
+
++table([ "Tag", "Description" ])
+    +row
+        +cell #[code #[span.u-color-theme B] EGIN]
+        +cell The first token of a multi-token entity.
+
+    +row
+        +cell #[code #[span.u-color-theme I] N]
+        +cell An inner token of a multi-token entity.
+
+    +row
+        +cell #[code #[span.u-color-theme L] AST]
+        +cell The final token of a multi-token entity.
+
+    +row
+        +cell #[code #[span.u-color-theme U] NIT]
+        +cell A single-token entity.
+
+    +row
+        +cell #[code #[span.u-color-theme O] UT]
+        +cell A non-entity token.
+
 +h(2, "json-input") JSON input format for training

 p
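
Note on the BILUO section added in this hunk: the translation from character offsets to BILUO tags can be illustrated with a small standalone sketch. This is not spaCy's implementation. The function names `offsets_to_biluo` and `whitespace_spans`, the whitespace tokenization, and the `TAG-LABEL` string format are all assumptions made for illustration. Tokens that only partly overlap an entity are left as `O` here, which is a simplification.

```python
def whitespace_spans(text):
    """Return (start, end) character spans of whitespace-separated tokens.

    Hypothetical helper: a real tokenizer would split differently.
    """
    spans = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end))
        pos = end
    return spans


def offsets_to_biluo(token_spans, entities):
    """Map (start, end, label) character offsets to one BILUO tag per token.

    Assumption: only tokens lying entirely inside an entity span are tagged.
    """
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        covered = [
            i for i, (start, end) in enumerate(token_spans)
            if start >= ent_start and end <= ent_end
        ]
        if not covered:
            continue
        if len(covered) == 1:
            # Single-token entity.
            tags[covered[0]] = "U-" + label
        else:
            # First, inner, and final tokens of a multi-token entity.
            tags[covered[0]] = "B-" + label
            for i in covered[1:-1]:
                tags[i] = "I-" + label
            tags[covered[-1]] = "L-" + label
    return tags


text = "Alex lives in New York City"
entities = [(0, 4, "PERSON"), (14, 27, "GPE")]
tags = offsets_to_biluo(whitespace_spans(text), entities)
print(tags)
```

In this example the single-token entity receives a `U` tag, while the three-token entity is marked with `B`, `I`, and `L`, so both of its boundaries are explicit in the tag sequence.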