## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations: standard elements can be overridden with [React](http://reactjs.org/) components, and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs.

Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). The new setup will also allow a bunch of other cool stuff, including front-end tests, service workers, offline support and search.

This PR also includes various new docs pages and content.

Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
C-language objects that let you group variables together
## TokenC

Cython data container for the Token object.

Example:

    token = &doc.c[3]
    token_ptr = &doc.c[3]

| Name | Type | Description |
| --- | --- | --- |
| `lex` | `const LexemeC*` | A pointer to the lexeme for the token. |
| `morph` | `uint64_t` | An ID allowing lookup of morphological attributes. |
| `pos` | `univ_pos_t` | Coarse-grained part-of-speech tag. |
| `spacy` | `bint` | A binary value indicating whether the token has trailing whitespace. |
| `tag` | `attr_t` | Fine-grained part-of-speech tag. |
| `idx` | `int` | The character offset of the token within the parent document. |
| `lemma` | `attr_t` | Base form of the token, with no inflectional suffixes. |
| `sense` | `attr_t` | Space for storing a word sense ID, currently unused. |
| `head` | `int` | Offset of the syntactic parent relative to the token. |
| `dep` | `attr_t` | Syntactic dependency relation. |
| `l_kids` | `uint32_t` | Number of left children. |
| `r_kids` | `uint32_t` | Number of right children. |
| `l_edge` | `uint32_t` | Offset of the leftmost token of this token's syntactic descendants. |
| `r_edge` | `uint32_t` | Offset of the rightmost token of this token's syntactic descendants. |
| `sent_start` | `int` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, `0`, is interpreted as no sentence break. Sentence boundary detectors will usually set `0` for all tokens except tokens that follow a sentence boundary. |
| `ent_iob` | `int` | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `O` and `3` indicates `B`. |
| `ent_type` | `attr_t` | Named entity type. |
| `ent_id` | `attr_t` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |

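
To make the integer encodings above a little more concrete, here is a minimal pure-Python sketch. It is not spaCy code: `ToyToken`, `head_index`, `SENT_START` and `ENT_IOB` are hypothetical names used only for illustration, and the sketch assumes that a root token stores a `head` offset of `0` (i.e. it points at itself).

```python
from dataclasses import dataclass

# Code tables copied from the field descriptions above.
SENT_START = {0: None, -1: False, 1: True}   # 0 = missing value
ENT_IOB = {0: None, 1: "I", 2: "O", 3: "B"}  # 0 = missing value


@dataclass
class ToyToken:
    """Hypothetical stand-in for a handful of TokenC fields."""
    text: str
    head: int        # offset of the syntactic parent, relative to this token
    sent_start: int  # ternary flag, see SENT_START
    ent_iob: int     # IOB code, see ENT_IOB


def head_index(tokens, i):
    """Turn the relative head offset of token i into an absolute index."""
    return i + tokens[i].head


tokens = [
    ToyToken("Apple", head=1, sent_start=1, ent_iob=3),      # parent is the next token
    ToyToken("shines", head=0, sent_start=-1, ent_iob=2),    # assumed root: offset 0
    ToyToken("brightly", head=-1, sent_start=-1, ent_iob=2), # parent is the previous token
]

for i, tok in enumerate(tokens):
    parent = tokens[head_index(tokens, i)].text
    print(tok.text, "->", parent, ENT_IOB[tok.ent_iob], SENT_START[tok.sent_start])
```

Because `head` is relative, the same value means different parents at different positions, which is why the absolute index always has to be computed from the token's own position.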
### Token.get_struct_attr

Get the value of an attribute from the TokenC struct by attribute ID.
The index of the token in the array or -1 if not found.
### set_children_from_heads

Set attributes that allow lookup of syntactic children on a TokenC* array.
This function must be called after making changes to the TokenC.head
attribute, in order to make the parse tree navigation consistent.
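
As a rough illustration of what "making the parse tree navigation consistent" involves, the following pure-Python sketch derives the child counts and edges from a list of relative head offsets. It is not the Cython implementation: the function name `children_from_heads` and its list-based interface are hypothetical, roots are assumed to have a head offset of `0`, the heads are assumed to form a valid tree, and the edges are returned as absolute positions in the array (one possible reading of "offset", given that the fields are unsigned).

```python
def children_from_heads(heads):
    """Derive l_kids, r_kids, l_edge and r_edge from relative head offsets.

    heads[i] is the offset from token i to its syntactic parent
    (assumed to be 0 for a root). Edges are absolute array positions.
    """
    n = len(heads)
    l_kids = [0] * n
    r_kids = [0] * n
    l_edge = list(range(n))  # every token starts as its own leftmost descendant
    r_edge = list(range(n))  # ... and its own rightmost descendant

    for i, offset in enumerate(heads):
        parent = i + offset
        if parent > i:
            l_kids[parent] += 1  # token i sits to the left of its parent
        elif parent < i:
            r_kids[parent] += 1  # token i sits to the right of its parent

        # Walk up the ancestor chain and widen each ancestor's span to cover i.
        node = i
        while heads[node] != 0:
            node += heads[node]
            l_edge[node] = min(l_edge[node], i)
            r_edge[node] = max(r_edge[node], i)

    return l_kids, r_kids, l_edge, r_edge


# "She ate the pizza": She -> ate, ate = root, the -> pizza, pizza -> ate
heads = [1, 0, 1, -2]
l_kids, r_kids, l_edge, r_edge = children_from_heads(heads)
```

The derived values depend entirely on `head`, so changing a single head without recomputing them would leave the child counts and edges describing a tree that no longer exists.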
## LexemeC

Struct holding information about a lexical type. LexemeC structs are usually
owned by the Vocab, and accessed through a read-only pointer on the TokenC
struct.

Example:

    lex = doc.c[3].lex

| Name | Type | Description |
| --- | --- | --- |
| `flags` | `flags_t` | Bit-field for binary lexical flag values. |
| `id` | `attr_t` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
| `length` | `attr_t` | Number of unicode characters in the lexeme. |
| `orth` | `attr_t` | ID of the verbatim text content. |
| `lower` | `attr_t` | ID of the lowercase form of the lexeme. |
| `norm` | `attr_t` | ID of the lexeme's norm, i.e. a normalized form of the text. |
| `shape` | `attr_t` | Transform of the lexeme's string, to show orthographic features. |
| `prefix` | `attr_t` | Length-N substring from the start of the lexeme. Defaults to `N=1`. |
| `suffix` | `attr_t` | Length-N substring from the end of the lexeme. Defaults to `N=3`. |
| `cluster` | `attr_t` | Brown cluster ID. |
| `prob` | `float` | Smoothed log probability estimate of the lexeme's type. |
| `sentiment` | `float` | A scalar value indicating positivity or negativity. |

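
The bit-field and substring attributes can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names and the flag positions `IS_ALPHA = 0` and `IS_TITLE = 1` are hypothetical, and the helpers operate on strings directly, whereas the struct stores `attr_t` IDs for `prefix` and `suffix`.

```python
# Hypothetical bit positions, chosen only for this example.
IS_ALPHA = 0
IS_TITLE = 1


def set_flag(flags, flag_id, value):
    """Return a copy of the bit-field with one flag switched on or off."""
    mask = 1 << flag_id
    return flags | mask if value else flags & ~mask


def check_flag(flags, flag_id):
    """Read a single binary flag out of the bit-field."""
    return bool(flags & (1 << flag_id))


def prefix(text, n=1):
    """Length-n substring from the start of the text (default n=1)."""
    return text[:n]


def suffix(text, n=3):
    """Length-n substring from the end of the text (default n=3)."""
    return text[-n:]


flags = set_flag(0, IS_TITLE, True)
print(check_flag(flags, IS_TITLE), check_flag(flags, IS_ALPHA))
print(prefix("Apple"), suffix("Apple"))
```

Packing the binary values into one integer keeps the struct compact: each extra flag costs one bit rather than an additional field.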
### Lexeme.get_struct_attr

Get the value of an attribute from the LexemeC struct by attribute ID.