spaCy/goldparse.md at 9696cf16c1bde4bb104dba957586841741b940c8

mirror of https://github.com/explosion/spaCy.git synced 2025-07-12 09:12:21 +03:00

2019-02-17 19:31:19 +01:00


title	teaser	tag	source
GoldParse	A collection for training annotations	class	spacy/gold.pyx

GoldParse.__init__

Create a GoldParse.

Name	Type	Description
`doc`	`Doc`	The document the annotations refer to.
`words`	iterable	A sequence of unicode word strings.
`tags`	iterable	A sequence of strings, representing tag annotations.
`heads`	iterable	A sequence of integers, representing syntactic head offsets.
`deps`	iterable	A sequence of strings, representing the syntactic relation types.
`entities`	iterable	A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions.
RETURNS	`GoldParse`	The newly constructed object.

GoldParse.__len__

Get the number of gold-standard tokens.

Name	Type	Description
RETURNS	int	The number of gold-standard tokens.

GoldParse.is_projective

Whether the provided syntactic annotations form a projective dependency tree.

Name	Type	Description
RETURNS	bool	Whether the annotations form a projective tree.
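
To make "projective" concrete, here is a minimal pure-Python sketch of a projectivity check. It is not spaCy's implementation. The function name `sketch_is_projective` is hypothetical, and the sketch assumes that `heads[i]` is the absolute index of token `i`'s head and that the root token points to itself. The root is attached to a virtual node placed before the first token, so that a tree is projective exactly when no two arcs cross.

```python
def sketch_is_projective(heads):
    """Return True if the dependency arcs described by `heads` do not cross.

    Assumption (not taken from the spaCy docs): heads[i] is the absolute
    index of token i's head, and the root token is its own head.
    """
    arcs = []
    for i, head in enumerate(heads):
        # Shift tokens to positions 1..n; the root attaches to virtual node 0.
        head_pos = 0 if head == i else head + 1
        arcs.append((min(i + 1, head_pos), max(i + 1, head_pos)))
    for a_start, a_end in arcs:
        for b_start, b_end in arcs:
            # Two arcs cross if one starts strictly inside the other
            # and ends strictly outside it.
            if a_start < b_start < a_end < b_end:
                return False
    return True


# "I like London .": every other token attaches to "like" (index 1).
print(sketch_is_projective([1, 1, 1, 1]))  # True
# The arc between tokens 0 and 2 crosses the arc between tokens 1 and 3.
print(sketch_is_projective([2, 3, 2, 2]))  # False
```

The quadratic loop keeps the sketch short; it is meant to illustrate the definition, not to be efficient.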

Attributes

Name	Type	Description
`tags`	list	The part-of-speech tag annotations.
`heads`	list	The syntactic head annotations.
`labels`	list	The syntactic relation-type annotations.
`ents`	list	The named entity annotations.
`cand_to_gold`	list	The alignment from candidate tokenization to gold tokenization.
`gold_to_cand`	list	The alignment from gold tokenization to candidate tokenization.
`cats` (new in v2)	list	Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document.
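
The two alignment attributes can be pictured with a small pure-Python sketch. It is an illustration only, not spaCy's alignment algorithm. The helper names `char_spans` and `sketch_align` are hypothetical, and the sketch assumes that both tokenizations cover the same text once whitespace is ignored, that only tokens with identical character spans count as aligned, and that `None` marks a token with no counterpart.

```python
def char_spans(words):
    """Character span of each word in the text with whitespace removed."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word)
    return spans


def sketch_align(cand_words, gold_words):
    """Map token indices between two tokenizations of the same text.

    Assumption: a token is aligned only if the other tokenization has a
    token with exactly the same character span; otherwise it maps to None.
    """
    cand_spans = char_spans(cand_words)
    gold_spans = char_spans(gold_words)
    gold_index = {span: i for i, span in enumerate(gold_spans)}
    cand_index = {span: i for i, span in enumerate(cand_spans)}
    cand_to_gold = [gold_index.get(span) for span in cand_spans]
    gold_to_cand = [cand_index.get(span) for span in gold_spans]
    return cand_to_gold, gold_to_cand


cand = ["I", "like", "London", "."]
gold = ["I", "like", "London."]
cand_to_gold, gold_to_cand = sketch_align(cand, gold)
print(cand_to_gold)  # [0, 1, None, None]
print(gold_to_cand)  # [0, 1, None]
```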

Utilities

gold.biluo_tags_from_offsets

Encode labelled spans into per-token tags, using the BILUO scheme (Begin/In/Last/Unit/Out).

Returns a list of unicode strings, describing the tags. Each tag string is either `""`, `"O"` or `"{action}-{label}"`, where `action` is one of `"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets don't align with the tokenization in the `Doc` object. The training algorithm will view these as missing values.

`O` denotes a non-entity token. `B` denotes the beginning of a multi-token entity, `I` the inside of an entity of three or more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a single-token entity.

Example

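import spacy
from spacy.gold import biluo_tags_from_offsets

# Assumption: a blank English pipeline, which only tokenizes, is sufficient here
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
entities = [(7, 13, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]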

Name	Type	Description
`doc`	`Doc`	The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.
`entities`	iterable	A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string.
RETURNS	list	Unicode strings, describing the BILUO tags.
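
The encoding itself can be sketched without spaCy. The following is a simplified illustration under stated assumptions, not the library's implementation. The function name `sketch_biluo_tags` is hypothetical, and `token_spans` (a list of `(start_char, end_char)` pairs) stands in for the token boundaries that a `Doc` would provide.

```python
def sketch_biluo_tags(token_spans, entities):
    """Encode (start_char, end_char, label) spans as one BILUO tag per token.

    Assumption: token_spans lists the (start_char, end_char) boundaries of
    each token, in order, in place of a real Doc object.
    """
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None:
            # The span does not line up with token boundaries: mark every
            # token it overlaps as a missing value.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and start_char < tok_end:
                    tags[i] = "-"
        elif first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


# Token boundaries for "I like London." -> I, like, London, .
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(sketch_biluo_tags(london, [(7, 13, "LOC")]))  # ['O', 'O', 'U-LOC', 'O']
print(sketch_biluo_tags(london, [(7, 12, "LOC")]))  # ['O', 'O', '-', 'O']

# Token boundaries for "I like New York." -> I, like, New, York, .
new_york = [(0, 1), (2, 6), (7, 10), (11, 15), (15, 16)]
print(sketch_biluo_tags(new_york, [(7, 15, "GPE")]))
# ['O', 'O', 'B-GPE', 'L-GPE', 'O']
```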

gold.offsets_from_biluo_tags

Convert per-token tags following the BILUO scheme into entity offsets.

Example

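import spacy
from spacy.gold import offsets_from_biluo_tags

# Assumption: a blank English pipeline, which only tokenizes, is sufficient here
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = offsets_from_biluo_tags(doc, tags)
assert entities == [(7, 13, "LOC")]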

Name	Type	Description
`doc`	`Doc`	The document that the BILUO tags refer to.
`tags`	iterable	A sequence of BILUO tags with each tag describing one token. Each tag string is either `""`, `"O"` or `"{action}-{label}"`, where `action` is one of `"B"`, `"I"`, `"L"`, `"U"`.
RETURNS	list	A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string.
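
The reverse direction can be sketched in the same spirit. This is again a simplified illustration, not spaCy's code. The function name `sketch_offsets` is hypothetical, `token_spans` stands in for the token boundaries of a `Doc`, and the sketch assumes a well-formed tag sequence without validating it.

```python
def sketch_offsets(token_spans, tags):
    """Decode one BILUO tag per token into (start_char, end_char, label).

    Assumptions: token_spans lists (start_char, end_char) per token, and
    the tag sequence is well formed (every B is later closed by an L).
    """
    entities = []
    start_char = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag in ("", "O", "-"):
            continue  # non-entity token or missing value
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start_char = tok_start
        elif action == "L" and start_char is not None:
            entities.append((start_char, tok_end, label))
            start_char = None
        # "I" tokens lie inside an open entity and need no action.
    return entities


# Token boundaries for "I like London." -> I, like, London, .
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(sketch_offsets(london, ["O", "O", "U-LOC", "O"]))  # [(7, 13, 'LOC')]

# Token boundaries for "I like New York." -> I, like, New, York, .
new_york = [(0, 1), (2, 6), (7, 10), (11, 15), (15, 16)]
print(sketch_offsets(new_york, ["O", "O", "B-GPE", "L-GPE", "O"]))
# [(7, 15, 'GPE')]
```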

gold.spans_from_biluo_tags

Convert per-token tags following the BILUO scheme into `Span` objects. This can be used to create entity spans from token-based tags, e.g. to overwrite `doc.ents`.

Example

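import spacy
from spacy.gold import spans_from_biluo_tags

# Assumption: a blank English pipeline, which only tokenizes, is sufficient here
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = spans_from_biluo_tags(doc, tags)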

Name	Type	Description
`doc`	`Doc`	The document that the BILUO tags refer to.
`tags`	iterable	A sequence of BILUO tags with each tag describing one token. Each tag string is either `""`, `"O"` or `"{action}-{label}"`, where `action` is one of `"B"`, `"I"`, `"L"`, `"U"`.
RETURNS	list	A sequence of `Span` objects with added entity labels.
